# Data Science for Social Justice Workshop: Sentiment Analysis – Project Notebook

Use this notebook to carry out the analyses from the workshop notebook on your own subreddit data.

## Import the Dataset

On a subreddit like AITA, the way the original poster (OP) expresses sentiment toward the parties involved influences how commenters interpret the situation and ultimately vote on it. Expressions of sentiment also reflect the norms of the community under study.

Let's take a look at some example submissions. First, we import the dataset:

In [None]:
import matplotlib.pyplot as plt
import os
import pickle
import pandas as pd

%matplotlib inline

In [None]:
df = pd.read_csv('../../data/YOUR_FILE.csv')

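Take a look at a few example submissions. How would you describe the sentiment of each? If your dataset has fewer than 751 rows, choose smaller indices in the cells below.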

In [None]:
print(df['selftext'].iloc[501])

In [None]:
print(df['selftext'].iloc[200])

In [None]:
print(df['selftext'].iloc[750])

# Using VADER to Conduct Sentiment Analysis

In [None]:
# Import the VADER SentimentIntensityAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Create analyzer object
analyzer = SentimentIntensityAnalyzer()

In [None]:
print(analyzer.polarity_scores(df['selftext'].iloc[501]))
print(analyzer.polarity_scores(df['selftext'].iloc[200]))
print(analyzer.polarity_scores(df['selftext'].iloc[750]))
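
Each call to `polarity_scores` returns a dictionary with `neg`, `neu`, `pos`, and `compound` entries. To turn the `compound` value into a coarse label, a common convention (described in the VADER documentation) is to use cutoffs of 0.05 and -0.05. The sketch below assumes that convention; the function name `label_compound` and its `threshold` parameter are illustrative and not part of VADER.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label.

    The 0.05 cutoff is a convention, not a fixed rule; adjust it for your data.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Hypothetical compound scores, for illustration only
for value in [0.72, -0.31, 0.01]:
    print(value, label_compound(value))
```

On your own data, you could call it as `label_compound(analyzer.polarity_scores(text)['compound'])`.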

# Characterizing Sentiment at Scale

In [None]:
# Get rid of [removed] posts
df_filtered = df[df['selftext'] != '[removed]'].copy()
# Get rid of NA posts
df_filtered = df_filtered[~df_filtered['selftext'].isna()]

In [None]:
# Create list to store scores
compound_scores = []

In [None]:
# This may take a few minutes to run

# Iterate through the selftext of each post
for post in df_filtered['selftext']:
    # Calculate sentiment
    sentiment = analyzer.polarity_scores(post)
    # Store each score
    compound_scores.append(sentiment['compound'])

In [None]:
# Store the compound scores in the dataframe
df_filtered['sentiment'] = compound_scores

Let's take a look at the distribution of sentiment. What do you observe?

In [None]:
df_filtered['sentiment'].hist(grid=False)
plt.xlabel('Compound Sentiment Score', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
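
If your histogram piles up near -1 and 1, part of the explanation may lie in how the compound score is computed. As far as we understand the VADER implementation, it sums the valence of the words in a text and squashes that sum into the range [-1, 1] by dividing it by the square root of (sum² + alpha), with alpha set to 15 by default. Long posts tend to accumulate large sums, so their compound scores saturate. The sketch below reproduces only that squashing step under this assumption; `normalize_valence` is a hypothetical name, and the inputs are made up.

```python
import math


def normalize_valence(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# Hypothetical valence sums: a short comment versus progressively longer posts
for total in [1, 4, 40]:
    print(total, round(normalize_valence(total), 3))
```

Because of this saturation, two long posts with quite different tones can both end up with scores close to 1 or -1, which is worth keeping in mind when you interpret the distribution.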

## Does Post Score Correlate with Sentiment?

In [None]:
from scipy.stats import binned_statistic
import numpy as np

def plot_score_vs_sentiment(sentiment, score, n_bins=9):
    """Plots the average score within ranges of sentiment values.
    
    Parameters
    ----------
    sentiment : pd.Series
        The sentiment column from your dataframe.
    score : pd.Series
        The score column from your dataframe.
    n_bins : int
        The number of bins to plot.
    """
    # Calculate binned sentiment values; n_bins bins require n_bins + 1 edges
    bin_means, bin_edges, binnumber = binned_statistic(sentiment,
                                                       score,
                                                       statistic='mean',
                                                       bins=np.linspace(-1, 1, n_bins + 1))
    # Calculate bin width of bar plot
    binwidth = np.ediff1d(bin_edges)[0]
    fig, ax = plt.subplots(1, 1, figsize=(8, 6))
    ax.bar(x=bin_edges[:-1] + binwidth / 2, height=bin_means, width=binwidth)
    ax.set_xlim([-1, 1])
    return fig, ax

In [None]:
plot_score_vs_sentiment(df_filtered['sentiment'], df_filtered['score'], n_bins=10)
plt.xlim([-1.05, 1.05])
# These y-limits are hard-coded; adjust or remove them to fit the score range in your data
plt.ylim([5000, 6700])
plt.xlabel('Compound Sentiment', fontsize=15)
plt.ylabel('Average Post Score', fontsize=15)

In the workshop notebook, we observed a pattern: posts with the most negative and most positive sentiment generally had higher scores. Do you see the same pattern in your subreddit?
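
A bar chart of bin averages hides how many posts fall into each bin, and a mean computed from a handful of posts (or skewed by one very popular post) can be misleading. Before reading much into the pattern, it may help to check the counts alongside the means. The sketch below is a plain-Python illustration with equal-width bins on [-1, 1]; `binned_mean_and_count` is a hypothetical helper, not part of SciPy, and the input values are made up.

```python
def binned_mean_and_count(sentiment, score, n_bins=10):
    """Return (means, counts) of score within equal-width sentiment bins on [-1, 1].

    A bin that contains no posts gets a mean of None.
    """
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for s, value in zip(sentiment, score):
        # Map s in [-1, 1] to a bin index; s == 1 falls into the last bin
        index = min(int((s + 1) / 2 * n_bins), n_bins - 1)
        totals[index] += value
        counts[index] += 1
    means = [t / c if c else None for t, c in zip(totals, counts)]
    return means, counts


# Hypothetical values, for illustration only
means, counts = binned_mean_and_count([-0.9, -0.8, 0.1, 0.95],
                                      [10, 30, 5, 100],
                                      n_bins=4)
print(means)
print(counts)
```

With your own data, you could pass `df_filtered['sentiment']` and `df_filtered['score']`, and treat bins with very few posts with caution.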

# 🎬 Demo: Sentiment Analysis with spaCy and TextBlob

In [None]:
# Perform these installs first
%pip install textblob
%pip install spacytextblob
!python -m textblob.download_corpora

In [None]:
import numpy as np
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [None]:
# Create NLP object
nlp = spacy.load('en_core_web_sm')
# Important: we have to add textblob to our spaCy pipeline
nlp.add_pipe('spacytextblob')

In [None]:
# Apply the spaCy pipeline to each post
# This command will take a while to run if your dataset is big
docs = list(nlp.pipe(df_filtered['selftext']))

TextBlob stores its sentiment estimate in the `polarity` attribute, which ranges from -1 (most negative) to 1 (most positive). It also provides a `subjectivity` attribute, which ranges from 0 to 1 and estimates how subjective the post is (values closer to 1 indicate higher subjectivity).

In [None]:
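# Store the polarities in a list
# Note: depending on your spacytextblob version, these attributes may instead be
# exposed as doc._.blob.polarity and doc._.blob.subjectivity
polarities = []
for doc in docs:
    polarities.append(doc._.polarity)
df_filtered['polarities'] = polarities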

In [None]:
# Store the subjectivities in a list
subjectivities = []
for doc in docs:
    subjectivities.append(doc._.subjectivity)
df_filtered['subjectivity'] = subjectivities
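
With both VADER compound scores and TextBlob polarities stored, one simple check is how often the two tools agree on the direction of sentiment. The sketch below computes the fraction of posts where both scores have the same sign; `sign` and `sign_agreement` are hypothetical helpers, and the scores are made up for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 0


def sign_agreement(scores_a, scores_b):
    """Fraction of paired scores that share the same sign."""
    pairs = list(zip(scores_a, scores_b))
    if not pairs:
        raise ValueError('No scores to compare.')
    matches = sum(sign(a) == sign(b) for a, b in pairs)
    return matches / len(pairs)


# Hypothetical scores, for illustration only
vader_scores = [0.5, -0.2, 0.0, 0.9]
textblob_scores = [0.1, 0.3, 0.0, 0.4]
print(sign_agreement(vader_scores, textblob_scores))
```

On your own data, this could be called as `sign_agreement(df_filtered['sentiment'], df_filtered['polarities'])`. A low agreement rate does not tell you which tool is right, so it is worth reading some of the posts where they disagree.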