# Sentiment analysis

This notebook describes the sentiment analysis steps. In the first part, we compute a sentiment score for each quote. In the second, we provide some descriptive statistics of the final dataset.

## Setup

In [1]:
# Built-in
import os

# Third parties
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

In [2]:
# Initialization needed for some modules

# tqdm for pandas
tqdm.pandas()

# NLTK configuration
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\olivi\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
# Configuration
DATA_PATH = "data"
PKL_PATH = os.path.join(DATA_PATH, "pkl")
CSV_PATH = os.path.join(DATA_PATH, "csv")
RESOURCES_PATH = os.path.join(DATA_PATH, "resources")

In [4]:
# Utility functions

def get_sentiment(row: pd.Series) -> pd.Series:
    """
    Compute the VADER polarity scores of the row's quotation and
    store them (as a dict) in the 'NLTK_score' column.
    """
    row['NLTK_score'] = sia.polarity_scores(row['quotation'])
    return row

## 1. Compute sentiment score

Since the mentions have already been extracted for every year, we simply load each yearly file (2015 to 2020) and concatenate them into a single dataframe, which is small enough to handle at once (around 100k quotes).

In [24]:
df_lst = []

mentions = [os.path.join(CSV_PATH, f"20{i:02d}_mentions.csv") for i in range(15, 21)]  

for mention in mentions:
    df_mention = pd.read_csv(mention)
    df_lst.append(df_mention)

# Concatenate every year together
df = pd.concat(df_lst) 
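One detail of the concatenation is worth illustrating: `pd.concat` keeps each file's own row labels by default, so index values repeat across years. The sketch below uses two tiny in-memory CSVs (made-up stand-ins for the yearly files, not the real data) to show the difference with `ignore_index=True`, which this notebook does not use.

```python
import io

import pandas as pd

# Two made-up CSVs standing in for the yearly mention files
csv_2015 = io.StringIO("quotation,speaker\nfirst quote,a\nsecond quote,b\n")
csv_2016 = io.StringIO("quotation,speaker\nthird quote,c\n")
frames = [pd.read_csv(f) for f in (csv_2015, csv_2016)]

# Default behaviour: each frame keeps its own 0..n-1 labels
stacked = pd.concat(frames)

# Optional alternative: renumber the rows from 0 after concatenation
renumbered = pd.concat(frames, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the row-wise processing that follows, but they matter for any later label-based lookup such as `df.loc[0]`, which would return one row per year.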

In [26]:
# Compute the sentiment score
df = df.progress_apply(get_sentiment, axis=1)

100%|██████████| 105929/105929 [01:35<00:00, 1111.21it/s]


Since the computed score is a dictionary (with the keys `neg`, `neu`, `pos` and `compound`), we expand each key of that column into its own column of the dataset.

In [28]:
# Split in columns to get values 
df = pd.concat([df, df['NLTK_score'].progress_apply(pd.Series)], axis=1)

100%|██████████| 105929/105929 [00:23<00:00, 4595.34it/s]
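As a possible alternative to `apply(pd.Series)`, the dictionaries can be passed to the `pd.DataFrame` constructor in one go, which is usually faster on large frames. The sketch below is only an illustration on toy data: `fake_scores` is a hypothetical stand-in for `sia.polarity_scores` that returns the same four keys, so the example runs without downloading the VADER lexicon.

```python
import pandas as pd


def fake_scores(text: str) -> dict:
    """Hypothetical stand-in for sia.polarity_scores (same keys, toy values)."""
    pos = 1.0 if "good" in text else 0.0
    return {"neg": 0.0, "neu": 1.0 - pos, "pos": pos, "compound": pos}


toy = pd.DataFrame({"quotation": ["a good day", "a day"]})

# Score the text column directly instead of applying over whole rows
toy["NLTK_score"] = toy["quotation"].apply(fake_scores)

# Build the score columns from the list of dicts, keeping the original index
scores = pd.DataFrame(toy["NLTK_score"].tolist(), index=toy.index)
toy = pd.concat([toy, scores], axis=1)

print(toy.columns.tolist())
```

The resulting columns are the same as with the `apply(pd.Series)` approach; only the construction differs.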


In [29]:
# Sanity check
df.sample(2)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,subset,...,honorificSuffix,fullName,position,stateName,parties,NLTK_score,neg,neu,pos,compound
8830,2017-03-08-013119,as secret as donald trump's tax returns.,lloyd doggett,['Q363817'],2017-03-08 11:30:10,28.0,"[['Lloyd Doggett', '0.8057'], ['None', '0.1471...",['http://gantdaily.com/2017/03/08/house-begins...,E,True,...,II,,Representative,TX,Democrat,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
10565,2015-11-06-032532,i am disappointed the president today rejected...,kelly ayotte,['Q22354'],2015-11-06 07:25:30,1.0,"[['Kelly Ayotte', '0.6593'], ['None', '0.2289'...",['http://www.wmur.com/politics/hassan-supports...,E,True,...,,kelly ayotte,Senator,NH,Republican,"{'neg': 0.068, 'neu': 0.698, 'pos': 0.234, 'co...",0.068,0.698,0.234,0.9382
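To read the `compound` values in the sample above, a common convention (suggested by the VADER authors, and not applied anywhere in this notebook) is to treat scores of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. The helper `label_compound` below is a hypothetical name introduced for illustration.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse label (conventional cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound values of the two sampled quotes shown above
for value in (0.0, 0.9382):
    print(value, label_compound(value))
```

Under this convention the first sampled quote is neutral and the second is positive.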


In [30]:
# Save the final dataframe both in csv and pickle
df.to_pickle(os.path.join(PKL_PATH, "final_subset.pkl"))
df.to_csv(os.path.join(CSV_PATH, "final_subset.csv"))

Now that we have our final subset, we can conduct our exploratory data analysis on it.

## 2. Sentiment analysis

We will now perform some preliminary analysis, keeping in mind that we want to analyze the evolution of the sentiment scores across time. In the first section, we will present some basic descriptive statistics about the data we are working with.
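One way the evolution across time could be examined is by averaging the `compound` score per month and per party. The sketch below is a minimal illustration on made-up rows; it assumes the `date`, `parties` and `compound` columns seen in the sample output, and it is not necessarily the aggregation used later in the project.

```python
import pandas as pd

# Made-up rows mimicking the relevant columns of the final dataset
toy = pd.DataFrame({
    "date": ["2015-01-03 10:00:00", "2015-01-20 08:30:00",
             "2015-02-11 12:00:00", "2015-02-15 09:00:00"],
    "parties": ["Democrat", "Republican", "Democrat", "Democrat"],
    "compound": [0.5, -0.2, 0.1, 0.3],
})

# Parse the timestamps and bucket them by calendar month
toy["date"] = pd.to_datetime(toy["date"])
toy["month"] = toy["date"].dt.to_period("M")

# Mean compound score per month, with one column per party
monthly = toy.groupby(["month", "parties"])["compound"].mean().unstack("parties")
print(monthly)
```

Months in which a party has no quote show up as `NaN`, which makes gaps in the data visible rather than silently filling them.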

In [None]:
# To avoid running the above cells, we load the dataframe directly
# either from csv or pickle
df = pd.read_pickle(os.path.join(PKL_PATH, "final_subset.pkl"))

### Descriptive statistics

### Analysis

In [17]:
# Split the df by party
df_rep = df[df["parties"] == "Republican"]
df_dem = df[df["parties"] == "Democrat"]

In [19]:
print(f"{len(df_rep)=}")
print(f"{len(df_dem)=}")

len(df_rep)=56257
len(df_dem)=49672
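As a quick worked check on these counts, the short sketch below recomputes the total and the share of each party from the two numbers printed above.

```python
# Counts printed by the cell above
n_rep = 56257
n_dem = 49672

total = n_rep + n_dem
share_rep = n_rep / total
share_dem = n_dem / total

print(f"{total=}")
print(f"Republican share: {share_rep:.1%}")
print(f"Democrat share:   {share_dem:.1%}")
```

The total matches the 105929 rows reported by the progress bars, so every quote belongs to exactly one of the two parties, with roughly 53.1% Republican and 46.9% Democrat quotes.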
