<a href="https://colab.research.google.com/github/ZanButt/RedditSentimentAnalysis/blob/main/analyze_subreddit_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# Reddit and HuggingFace Starter Kit

In this live coding session, we leverage the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com), and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by [transformer](https://arxiv.org/pdf/1706.03762.pdf).

## Objectives

At the end of the session, you will:

- know how to work with APIs
- feel more comfortable navigating through documentation, even inspecting the source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* the work into your own remote repo
- After completing all three tasks, make sure to push the notebook containing all code blocks and output cells to your remote repo
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, unless you are already an expert in [`requests`](https://docs.python-requests.org/en/latest/).

#### 0.  Get updates from `FourthBrain/MLE-7`

Under your forked local repo, **fetch** and download new updates from repo `FourthBrain/MLE-7` locally so you can start development. 
    
<details>
<summary>If you haven't added `FourthBrain/MLE-7` as a remote repo, click here for instructions:</summary>   
Suppose you have forked repo `FourthBrain/MLE-7` to `yourhandle/MLE-7`, cloned it locally, and are now in directory `MLE-7`. By default, you will see one remote named `origin` pointing to your repo:  
    
```
$git remote -v 
origin  git@github.com:yourhandle/MLE-7.git (fetch)
origin  git@github.com:yourhandle/MLE-7.git (push)
```

Think of fetch = read and push = write. 

Now add `FourthBrain/MLE-7` as a remote repo

```
$git remote add fourthbrain git@github.com:FourthBrain/MLE-7.git
$git remote -v
fourthbrain	git@github.com:FourthBrain/MLE-7.git (fetch)
fourthbrain	git@github.com:FourthBrain/MLE-7.git (push)
origin git@github.com:yourhandle/MLE-7.git (fetch)
origin git@github.com:yourhandle/MLE-7.git (push)
```

Then, before each session starts, run `git fetch fourthbrain` to get updates (why not `git pull`?).

Check out [Working with Remotes](https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes) for more explanation.
</details>

#### 1. Install packages

In [None]:
pip install -U transformers praw torch numpy pandas

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting praw
  Downloading praw-7.6.0-py3-none-any.whl (188 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.3.2-py3-none-any.whl (54 kB)

```
conda activate {your_virtual_environment_name}
pip install -U transformers praw torch numpy pandas
```

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to [this Medium post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, and log in.
- To access the API, we need to create an app. The site has changed slightly since the post was written: navigate to `preferences` > `apps`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down. 
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect url** as shown below. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)); for our purpose, you can enter any URL, e.g., www.google.com.
    <div>
    <img src="./img/reddit_create_app.png" width="500"/>
    </div>   
- Jot down `client_id` (upper left corner) and `client_secret` 
    <div>
    <img src="./img/reddit_secret_tokens.png" width="300"/>
    </div>

- Create `secrets.py` in the same directory as this notebook and fill in the `client_id` and `client_secret` obtained in the last step. We will need to import those constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "{client_id}"
    REDDIT_API_CLIENT_SECRET = "{client_secret}"
    REDDIT_API_USER_AGENT = "BotBot"  # can be any string
    ```
- Add `secrets.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [None]:
import praw
import secrets

# Create a Reddit object which allows us to interact with the Reddit API

# Note: these fields initially held real credentials; they were removed because they are sensitive
reddit = praw.Reddit(
    client_id="",
    client_secret="",
    user_agent="",
    check_for_async=False,
)
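
Rather than pasting credentials into the notebook, the constants from `secrets.py` could be collected into keyword arguments for `praw.Reddit`. The sketch below is one possible way to do that, not the notebook's own method: the helper name `reddit_kwargs_from` is hypothetical, and it assumes the module defines the three `REDDIT_API_*` constants described in step 2. A `SimpleNamespace` with placeholder values stands in for the real module.

```python
from types import SimpleNamespace

# Constants that step 2 asks you to define in secrets.py, mapped to the
# praw.Reddit keyword argument each one feeds.
REQUIRED = {
    "REDDIT_API_CLIENT_ID": "client_id",
    "REDDIT_API_CLIENT_SECRET": "client_secret",
    "REDDIT_API_USER_AGENT": "user_agent",
}


def reddit_kwargs_from(config) -> dict:
    """Hypothetical helper: turn a config object into praw.Reddit kwargs."""
    kwargs = {}
    for constant, keyword in REQUIRED.items():
        value = getattr(config, constant, "")
        if not value:
            # Fail early with a readable message instead of an API error later
            raise ValueError(f"{constant} is missing or empty")
        kwargs[keyword] = value
    return kwargs


# Stand-in for `import secrets`; these values are placeholders, not real tokens
example_config = SimpleNamespace(
    REDDIT_API_CLIENT_ID="my-client-id",
    REDDIT_API_CLIENT_SECRET="my-client-secret",
    REDDIT_API_USER_AGENT="BotBot",
)
kwargs = reddit_kwargs_from(example_config)
print(sorted(kwargs))
```

With the real module, the result would be used as `praw.Reddit(**reddit_kwargs_from(secrets), check_for_async=False)`. Keep in mind that Python's standard library also ships a module named `secrets`, so `import secrets` only picks up your file when the notebook's directory comes first on the import path.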

In [None]:
print(reddit) 

<praw.reddit.Reddit object at 0x7f278553c810>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect it. The expected outputs shown are from `r/machinelearning` unless otherwise specified.

In [None]:
sred = reddit.subreddit("machinelearning")

What is the display name of the subreddit?

In [None]:
sred.display_name

'machinelearning'

<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name?

In [None]:

sred.title

'Machine Learning'

<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [None]:

print(sred.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) if necessary. Verify that the titles match what you see on Reddit.

In [None]:
# try running this line; what do you see? press q once you are done
?sred.top


In [None]:

for submission in reddit.subreddit("machinelearning").top(limit=10):
    print(submission.title)

[Project] From books to presentations in 10s with AR + ML
[D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
[R] First Order Motion Model applied to animate paintings
[N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
[D] This AI reveals how much time politicians stare at their phone at work
[D] Types of Machine Learning Papers
[D] The machine learning community has a toxicity problem
[Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
[P] Using oil portraits and First Order Model to bring the paintings back to life
[D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [None]:

for submission in reddit.subreddit("machinelearning").top(limit=10,time_filter="week"):
    print(submission.title)


[News] New Google tech - Geospatial API uses computer vision and machine learning to turn 15 years of street view imagery into a 3d canvas for augmented reality developers
[N] Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit: Ian Goodfellow, a former director of machine learning at Apple, is joining DeepMind.
[R] Symphony Generation with Permutation Invariant Language Model
[D] Research Director at Deepmind says all we need now is scaling
[P] I was tired of screenshotting plots in Jupyter to share my results. Wanted something better, information rich. So I built a new %%share magic that freezes a cell, captures its code, output & data and returns a URL for sharing.
[P] I made an open-source demo of OpenAI's CLIP model running completely in the browser - no server involved. Compute embeddings for (and search within) a local directory of images, or search 200k popular images from Reddit (as shown in this video). Link to demo and Github repo in comments.
[N] Intr

<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>
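
The all-time and weekly listings differ only in the `time_filter` argument passed to `top`. A small helper can make that explicit. This is a sketch under assumptions: `top_titles` is a hypothetical name, and the function only relies on the subreddit object exposing `top(limit=..., time_filter=...)` as used in the cells above. The set of accepted values follows the PRAW documentation for `time_filter`.

```python
from typing import List

# Values the PRAW documentation lists for time_filter
VALID_TIME_FILTERS = {"all", "day", "hour", "month", "week", "year"}


def top_titles(subreddit, limit: int = 10, time_filter: str = "all") -> List[str]:
    """Return the titles of the top submissions for the given period."""
    if time_filter not in VALID_TIME_FILTERS:
        raise ValueError(f"unknown time_filter: {time_filter!r}")
    return [s.title for s in subreddit.top(limit=limit, time_filter=time_filter)]
```

For example, `top_titles(sred, time_filter="week")` would return the same titles that the weekly loop prints, as a list you can inspect or compare.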

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (refer to the [Obtain Comment Instances section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html).

The purpose is to:
1. understand what the code is doing 
2. get into the habit of commenting your code whenever it is not self-explanatory (others will thank you, YOU will thank you later 😊) 

In [None]:
#%%time
from praw.models import MoreComments

# Initialize an empty list to hold the comment texts
top_comments = []

# Loop over the top 10 submissions of all time in the subreddit
for submission in sred.top(limit=10):
    # Iterate over the top-level comments of the submission
    for top_level_comment in submission.comments:
        # Skip MoreComments placeholders ("load more comments" stubs), which have no body
        if isinstance(top_level_comment, MoreComments):
            continue
        # Store the text of the comment
        top_comments.append(top_level_comment.body)
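
The loop above skips `MoreComments` placeholders one at a time. The PRAW documentation also describes `submission.comments.replace_more(limit=0)`, which removes those placeholders up front without extra requests. The sketch below takes that route as an alternative, not as the tutorial's method; `top_level_bodies` is a hypothetical helper name, and it only assumes that each submission exposes `comments.replace_more` and that every remaining comment has a `body`.

```python
from typing import Iterable, List


def top_level_bodies(submissions: Iterable) -> List[str]:
    """Collect the text of every top-level comment in the given submissions."""
    bodies = []
    for submission in submissions:
        # limit=0 drops all MoreComments placeholders instead of expanding them
        submission.comments.replace_more(limit=0)
        # Iterating the comment forest yields top-level comments only
        for comment in submission.comments:
            bodies.append(comment.body)
    return bodies
```

Called as `top_level_bodies(sred.top(limit=10))`, this should yield the same kind of list as `top_comments`, with no `isinstance` check needed inside the loop.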

#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [None]:

len(top_comments)

693

In [None]:
import random

[random.choice(top_comments) for i in range(3)]

['Is that Alan Turing?!!!???',
 'Without sound I feel like I am in the Hogwarts.',
 'Should be combined with an reward system for not using the phone during the work haha']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [None]:


# Print the titles of this week's top 10 posts
for submission in reddit.subreddit("TSLA").top(limit=10, time_filter="week"):
    print(submission.title)

# Initialize an empty list to hold the comment texts
top_comments_tsla = []
# Loop over the top 10 submissions of the past year
for submission in reddit.subreddit("TSLA").top(limit=10, time_filter="year"):
    # Iterate over the top-level comments of the submission
    for top_level_comment in submission.comments:
        # Skip MoreComments placeholders, which have no body
        if isinstance(top_level_comment, MoreComments):
            continue
        # Store the text of the comment
        top_comments_tsla.append(top_level_comment.body)


Does NASDAQ drive a Tesla? Prepare for liftoff. VROOM!
Musk puts on hold $44-billion deal for Twitter
Elon Musk Attacks S&P Over Exxon Outscoring Tesla On ESG: What He’s Saying
Bulls Hold. Chickens sell.
Tesla has the top 3 electric cars in the US, and it's
Tesla is now only taking Cybertruck reservations in North America
‘Ridiculous,’ ‘Wacktivism:’ Cathie Wood, Elon Musk React To Tesla’s Removal From S&P 500 ESG Index
Elon Musk Wants to Refinance His Twitter Bid so Its Less Risky
Woes to TSLA stock politically
Musk is fighting the whole system


In [None]:
len(top_comments_tsla)  # Expected: 174 for r/TSLA

173

In [None]:
[random.choice(top_comments_tsla) for i in range(3)]

['100%',
 'Maybe we’re all a little crazy…but I’m holding and hoping that it’s a great call today. It will be brutal if there are surprises in Q1 or forecast. Hopefully Shanghai won’t be a reason for continued selling. Investors are in a bear mood over NFLX!',
 'Also, #buy the dip and drive your cost per share down. #HODL.']

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [None]:
from transformers import pipeline

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [None]:
sentiment_model = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


Downloading:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [None]:
comment = random.choice(top_comments_tsla)

In [None]:
comment

'Play options and use the gainz to buy more shares. Congrats tho'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.
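
One way to make the random pick repeatable is to draw from a seeded generator. The following is a minimal sketch, assuming a hypothetical helper `pick_comments`; the seed value is arbitrary. It uses a private `random.Random` instance so the global random state is left alone, and `sample` so the picked comments are distinct list entries.

```python
import random
from typing import List, Sequence


def pick_comments(comments: Sequence[str], k: int = 1, seed: int = 0) -> List[str]:
    """Pick k distinct comments; the same seed always gives the same picks."""
    rng = random.Random(seed)  # private generator, independent of random.seed()
    return rng.sample(list(comments), k)
```

For instance, `comment = pick_comments(top_comments_tsla, k=1, seed=42)[0]` would return the same comment on every run as long as the list itself does not change.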

#### 4. Make Inference!

In [None]:
sentiment = sentiment_model(comment)

What is the type of the output `sentiment`?

```
YOUR ANSWER HERE
```

In [None]:
print(f'The comment: {comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: Play options and use the gainz to buy more shares. Congrats tho
Predicted Label is POSITIVE and the score is 0.800


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989
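
A pipeline can also be called with a list of strings, in which case it returns one result dictionary per input. The sketch below builds on that to tally the predicted labels for many comments at once. It is illustrative only: `label_counts` is a hypothetical helper, and `classifier` can be any callable that maps a list of texts to a list of dictionaries with a `label` key, which is the shape printed above.

```python
from collections import Counter
from typing import Callable, Dict, List


def label_counts(
    classifier: Callable[[List[str]], List[Dict]], comments: List[str]
) -> Counter:
    """Count how often each label is predicted across the comments."""
    if not comments:
        # Nothing to classify; avoid calling the model with an empty batch
        return Counter()
    results = classifier(comments)
    return Counter(result["label"] for result in results)
```

In the notebook this could be called as `label_counts(sentiment_model, top_comments_tsla)` to see how the comments split between `POSITIVE` and `NEGATIVE`.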

### Task IV: Put All Together

Let's put all the pieces together and create a simple script that:

- gets the subreddit
- gets comments from the top posts of the given subreddit
- runs sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes the code into a new Python script and saves it as `top_tlsa_comment_sentiment.py` in the same directory as the notebook. 

In [None]:
%%writefile top_tlsa_comment_sentiment.py

import secrets
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name: str) -> Subreddit:
    """Get subreddit object from display name

    Args:
        display_name (str): Display name of the subreddit, e.g. "TSLA"

    Returns:
        Subreddit: PRAW subreddit object
    """
    # again, the credential values (client_id, client_secret, user_agent) were removed from these fields
    reddit = Reddit(
        client_id="",
        client_secret="",
        user_agent="",
        check_for_async=False,
    )

    subreddit = reddit.subreddit(display_name)
    return subreddit

def get_comments(subreddit: Subreddit, limit: int = 3) -> List[str]:
    """Get top-level comments from the top posts of a subreddit

    Args:
        subreddit (Subreddit): Subreddit to read from
        limit (int, optional): Number of top posts to read. Defaults to 3.

    Returns:
        List[str]: List of comments
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            # Skip MoreComments placeholders, which have no body
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment: str) -> Dict:
    """Run sentiment analysis on comment using default distilbert model

    Args:
        comment (str): Text of the comment to classify

    Returns:
        Dict: Sentiment analysis result with `label` and `score` keys
    """
    sentiment_model = pipeline("sentiment-analysis")
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit("TSLA")
    comments = get_comments(subreddit)
    comment = random.choice(comments)
    sentiment = run_sentiment_analysis(comment)

    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [None]:
!python top_tlsa_comment_sentiment.py

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
2022-05-20 01:30:51.542029: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
The comment: You're a fucking legend
Predicted Label is POSITIVE and the score is 0.998


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>
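
To choose the subreddit when running the script from the command line, instead of hard-coding `"TSLA"`, the script could parse its arguments with `argparse`. This is a sketch only: the flag names `--subreddit` and `--limit` are assumptions and are not part of the script above; the defaults mirror the values the script currently uses.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Sentiment of a random top-level comment from a subreddit"
    )
    parser.add_argument(
        "--subreddit", default="TSLA", help="display name of the subreddit"
    )
    parser.add_argument(
        "--limit", type=int, default=3, help="number of top posts to read"
    )
    return parser.parse_args(argv)
```

Inside the `__main__` block, `args = parse_args()` could then feed `get_subreddit(args.subreddit)` and `get_comments(subreddit, limit=args.limit)`, so that a call such as `python top_tlsa_comment_sentiment.py --subreddit machinelearning --limit 5` analyzes a different subreddit.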