<a href="https://colab.research.google.com/github/cullena20/RedditSentiment/blob/main/RedditSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reddit Sentiment Analysis!

In [1]:
from IPython import display  # control displaying of printed output in loops
from pprint import pprint  # pretty print json and lists
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
!pip install praw
import praw

In [2]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

## Exploring the Reddit API using PRAW

Access the Reddit API. This allows you to easily access data from Reddit. To do this, create an app at https://www.reddit.com/prefs/apps/ to get a client ID and client secret.

In [3]:
reddit = praw.Reddit(client_id='<your client_id>',
                     client_secret='<your client_secret>',
                     user_agent='<your user_agent>',
                     username='<your user_name>')
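
To keep credentials out of a shared notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `reddit_credentials_from_env` are hypothetical, and it assumes a read-only client that needs just the client ID, client secret, and user agent.

```python
import os

# Hypothetical variable names; any scheme works as long as it matches your environment.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def reddit_credentials_from_env(env=None):
    """Collect keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    # Report every missing variable at once instead of failing on the first one
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for kwarg, name in ENV_KEYS.items()}


# Usage (requires praw and real credentials):
# reddit = praw.Reddit(**reddit_credentials_from_env())
```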

In [4]:
# subreddit1 = reddit.subreddits.search_by_name('datascience', exact=True)  returns a list of search results
subreddit = reddit.subreddit('datascience')
print("Display Name:")
print(subreddit.display_name) 
print()
print("Title:")
print(subreddit.title)   
print()
print("Description")      
print(subreddit.description) 

Display Name:
datascience

Title:
Data Science

Description



In [5]:
posts = set()  # use a set to clear any duplicates
for post in subreddit.new(limit=None):
  posts.add(post)
  display.clear_output()  # only one output that changes
  print(len(posts))
posts = list(posts)  # easier to work with lists

741


In [6]:
post = posts[2]
print(post.title)
print(post.author)
print(post.score)
print(post.id)
print(post.url)

Are there any people who started off with data science with a non-computer science background after they started working but still managed to make a decent career in it?
None
280
k5y56t
https://www.reddit.com/r/datascience/comments/k5y56t/are_there_any_people_who_started_off_with_data/


Using the Reddit API, we can also explore comments. Maybe we can make a model that looks at comments as well as titles?

In [None]:
# this creates a list of comments from the post we already defined
comments = list(post.comments)
# pprint(vars(comments[1]))  # gives us variables for comment
print('Post Title:', post.title)
print()
print('Comment: ', comments[1].body)
print()
print('Comment Author: ', comments[1].author)
print('Score: ', comments[1].score)  # would be nice to have model weigh this too
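
Long threads can contain "load more comments" placeholders, which have no `body`. A minimal sketch of one way to collect only real comment text follows; it relies on PRAW's `replace_more` and `list` methods on `post.comments`, while the helper name `collect_comment_bodies` is made up for this example.

```python
def collect_comment_bodies(submission, more_limit=0):
    """Return the text of every loaded comment on a PRAW submission.

    replace_more(limit=0) removes the "load more comments" placeholders
    without extra API requests; list() flattens the nested reply tree.
    """
    submission.comments.replace_more(limit=more_limit)
    # Guard with hasattr in case any placeholder objects remain in the tree
    return [c.body for c in submission.comments.list() if hasattr(c, "body")]


# Usage with the post defined above:
# bodies = collect_comment_bodies(post)
```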

## Basic Sentiment Analysis Using Pretrained Models


For now, we will explore various pretrained models that detect negative and positive sentiment. Alternatively, we could train our own model using a dataset and scikit-learn. However, these pretrained models actually perform pretty well.

The main model that we are using is VADER from nltk. VADER is a lexicon- and rule-based model designed specifically for social media text. A detailed paper describing the model can be found at https://www.researchgate.net/publication/275828927_VADER_A_Parsimonious_Rule-based_Model_for_Sentiment_Analysis_of_Social_Media_Text.

Code for other models is commented out because we are not using them. You can uncomment to explore them though.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()
# from textblob import TextBlob
# !pip install flair
# import flair
# flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

In [None]:
sentence = "This food was great but the service was only okay"
print("NLTK VADER")
print(sia.polarity_scores(sentence))
# print()
# print("Text Blob:")
# print(TextBlob(sentence).sentiment)
# print()
# print("Flair:")
# s = flair.data.Sentence(sentence)
# flair_sentiment.predict(s)
# total_sentiment = s.labels
# print(total_sentiment)

In [None]:
results = list()

for post in posts:
    pol_score = sia.polarity_scores(post.title)
    pol_score['headline'] = post.title
    results.append(pol_score)

pprint(results[:3], width=100)  # pretty print

Now we will store the data as a pandas dataframe. We will create a new column, 'label', that stores whether the headline is positive (1), neutral (0), or negative (-1). We have used 0.2 and -0.2 as our thresholds, but these can be changed, which will give different results.

In [None]:
df = pd.DataFrame.from_records(results)
sorted_df = df.sort_values(by='compound')
df['label'] = 0  # creates label column
df.loc[df['compound'] > 0.2, 'label'] = 1  # if compound score is greater than 0.2 we label it as positive
df.loc[df['compound'] < -0.2, 'label'] = -1  # if compound score is less than -0.2 we label it as negative
df.sample(n=10, axis='rows')  # shows 10 random rows from the dataframe
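
To experiment with other thresholds, the labeling rule can be pulled into a small function. This is a sketch of the same rule as the cell above; the name `label_from_compound` and its `threshold` parameter are introduced here for illustration.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a VADER compound score to 1 (positive), -1 (negative), or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0


# Scores exactly at the threshold stay neutral because the comparisons are strict.
examples = [0.65, 0.2, 0.0, -0.2, -0.41]
print([label_from_compound(score) for score in examples])  # [1, 0, 0, 0, -1]
```

With the dataframe above, `df['label'] = df['compound'].apply(label_from_compound, threshold=0.3)` would relabel everything with a stricter cutoff.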

Now that we have our results, we can save them in a CSV file and summarize the labels as named percentages.
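
The save step might look like the sketch below. It uses a small stand-in dataframe with made-up values and an in-memory buffer so that it runs on its own; in the notebook you would call `to_csv` on `df` with a file path of your choice (for example the hypothetical name `reddit_sentiment.csv`).

```python
import io

import pandas as pd

# Stand-in rows shaped like the notebook's df (values are made up for illustration)
demo_df = pd.DataFrame({
    "headline": ["example title", "another title"],
    "compound": [0.5, -0.3],
    "label": [1, -1],
})

buffer = io.StringIO()               # replace with a file path to write to disk
demo_df.to_csv(buffer, index=False)  # index=False leaves out the row numbers

# Read the data back to confirm that the round trip keeps the columns
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```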

In [25]:
# Build the named dictionary directly instead of renaming keys while looping over the dict
label_names = {-1: 'Negative', 0: 'Neutral', 1: 'Positive'}
shares = df['label'].value_counts(normalize=True) * 100
# .get falls back to 0.0 if a label never occurs in the data
percentages = {name: float(shares.get(key, 0.0)) for key, name in label_names.items()}
percentages

{'Negative': 9.581646423751687,
 'Neutral': 62.61808367071525,
 'Positive': 27.800269905533064}

## Exploring Our Results

We can explore the most positive and most negative headlines using the code below.

In [None]:
sorted_df = df.sort_values(by='compound')
print('Five Most Positive Titles:')
for headline in list(sorted_df.tail(5)['headline']):
  print(headline)
print()
print('Five Most Negative Titles:')
for headline in list(sorted_df.head(5)['headline']):
  print(headline)


This code prints the first five positive and the first five negative results in the dataframe, regardless of how strongly positive or negative they are.

In [None]:
positive_results = df[df['label'] == 1]
negative_results = df[df['label'] == -1]
print("Positive Results:")
pprint(list(positive_results['headline'])[:5]) 
print()
print("Negative Results:")
pprint(list(negative_results['headline'])[:5]) 

Now we can determine the overall sentiment of a subreddit by creating percentages of positive, neutral, and negative headlines.

In [16]:
percentages = df.label.value_counts(normalize=True) * 100
print("Count:")
print(df.label.value_counts())
print()
print("Percentages:")
print(percentages)

Count:
 0    464
 1    206
-1     71
Name: label, dtype: int64

Percentages:
 0    62.618084
 1    27.800270
-1     9.581646
Name: label, dtype: float64


The code below lets us visualize these results.

In [None]:
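label_names = {-1: 'Negative', 0: 'Neutral', 1: 'Positive'}
named = percentages.rename(index=label_names)  # readable tick labels instead of -1/0/1
ax = sns.barplot(x=named.index, y=named.values, order=['Negative', 'Neutral', 'Positive'])
ax.set_ylabel('Percentage of titles')  # assigning to plt.xlabel would overwrite the function
plt.show()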

## Future Model Improvement

We classified a subreddit's sentiment solely based on the titles of its newest posts. We could also look at more data, such as comments and upvotes. Using this same idea, we may also be able to make a more practical model, such as a fake news or hate speech detector. This would involve different models and would probably require training data.
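
As one possible way to bring upvotes in, each title or comment could be weighted by its score when averaging compound values. The sketch below is an assumption about how that might work, not something the notebook implements; `weighted_mean_compound` is a hypothetical helper.

```python
def weighted_mean_compound(items):
    """Average compound scores, weighting each by its Reddit score.

    `items` is an iterable of (compound, score) pairs. Scores below 1 are
    clamped to 1 so that downvoted or unvoted items still count a little.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for compound, score in items:
        weight = max(score, 1)
        weighted_sum += compound * weight
        total_weight += weight
    # An empty input has no sentiment to report, so fall back to neutral
    return weighted_sum / total_weight if total_weight else 0.0


# A strongly upvoted positive item outweighs a barely upvoted negative one
print(weighted_mean_compound([(1.0, 3), (-1.0, 1)]))  # 0.5
```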

## Now What?

In this notebook, we have explored the Reddit API and different sentiment analysis models. We have also visualized our results. Using these ideas and the code as a foundation, we can turn this into something more accessible. For example, we could make a website where a user types in a subreddit and gets a sentiment analysis back. We could also make a CLI to analyze subreddits from the command line.
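
A command-line version could start from something like the sketch below. The argument names and defaults are assumptions made for illustration, and the analysis itself is left as a placeholder where the notebook's fetch-and-score steps would go.

```python
import argparse


def build_parser():
    """Describe the command-line options for a hypothetical subreddit analyzer."""
    parser = argparse.ArgumentParser(
        description="Label the sentiment of a subreddit's post titles."
    )
    parser.add_argument("subreddit", help="subreddit name without the r/ prefix")
    parser.add_argument("--limit", type=int, default=100,
                        help="number of new posts to fetch")
    parser.add_argument("--threshold", type=float, default=0.2,
                        help="compound cutoff for positive/negative labels")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: fetch posts with PRAW and score the titles with VADER here
    print(f"Would analyze r/{args.subreddit} "
          f"(limit={args.limit}, threshold={args.threshold})")
    return args


# Passing the arguments as a list lets this run inside a notebook as well
main(["datascience", "--limit", "50"])
```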