# Sentiment analysis

This notebook documents our sentiment analysis pipeline. In the first part, we compute a sentiment score for each quote. In the second, we provide descriptive statistics of the final dataset.

## Setup

In [2]:
# Built-in
import json
import bz2
import os
import time
import csv

# Third parties
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

In [3]:
# Initialization needed for some modules

# tqdm for pandas
tqdm.pandas()

# NLTK configuration
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\olivi\AppData\Roaming\nltk_data...


In [4]:
# Configuration
DATA_PATH = "data"
PKL_PATH = os.path.join(DATA_PATH, "pkl")
CSV_PATH = os.path.join(DATA_PATH, "csv")
RESOURCES_PATH = os.path.join(DATA_PATH, "resources")

In [5]:
# Utility functions

def get_sentiment(row: pd.Series) -> pd.Series:
    """
    Compute the VADER polarity scores of the row's quotation and
    store them (as a dict) in the row's 'NLTK score' entry.
    """

    row['NLTK score'] = sia.polarity_scores(row['quotation'])
    return row

## 1. Compute sentiment score

The mentions have already been extracted into one CSV file per year (2015 to 2020). Since the combined data is small (around 100k quotes), we load every year and concatenate them into a single dataframe.

In [7]:
df_lst = []

mentions = [os.path.join(CSV_PATH, f"20{i:02d}_mentions.csv") for i in range(15, 21)]  

for mention in mentions:
    df_mention = pd.read_csv(mention)
    df_lst.append(df_mention)

# Concatenate every year together
df = pd.concat(df_lst) 
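
The loading step above can also be written so that the saved index does not come back as an `Unnamed: 0` column and so that row labels stay unique across years. The following is only a sketch on two tiny in-memory stand-ins for the yearly files; the rows and the names `csv_2015`, `csv_2016` and `toy_df` are made up for illustration, and we assume the real files were written with their index, as `DataFrame.to_csv` does by default.

```python
import io

import pandas as pd

# Two tiny in-memory stand-ins for the yearly CSV files (made-up rows).
# The leading empty header cell mimics an index saved by DataFrame.to_csv.
csv_2015 = io.StringIO(",quoteID,quotation\n0,a,first quote\n1,b,second quote\n")
csv_2016 = io.StringIO(",quoteID,quotation\n0,c,third quote\n")

# index_col=0 reads the saved index back as the index instead of
# creating an "Unnamed: 0" column.
frames = [pd.read_csv(f, index_col=0) for f in (csv_2015, csv_2016)]

# ignore_index=True renumbers the rows 0..n-1, so labels are unique across years.
toy_df = pd.concat(frames, ignore_index=True)

print(list(toy_df.columns))  # ['quoteID', 'quotation']
print(list(toy_df.index))    # [0, 1, 2]
```

A unique index matters later on, because index-aligned operations such as `join` can duplicate rows when labels repeat.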

In [9]:
# Compute the sentiment score
df = df.progress_apply(get_sentiment, axis=1)

100%|██████████| 105929/105929 [00:47<00:00, 2225.40it/s]
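
Row-wise `apply` with `axis=1` builds a Series for every row, which is usually slower than applying a function to a single column. A possible alternative is sketched below. To keep it runnable without the VADER lexicon, it uses `toy_scorer`, a hypothetical stand-in for `sia.polarity_scores` that is not a sentiment model; in the notebook one would presumably pass `sia.polarity_scores` instead (and `progress_apply` to keep the progress bar).

```python
import pandas as pd


def toy_scorer(text: str) -> dict:
    """Hypothetical stand-in for sia.polarity_scores (NOT a sentiment model)."""
    return {"length": len(text)}


toy_df = pd.DataFrame({"quotation": ["good day", "bad"]})

# Applying on the column passes each string directly to the scorer,
# instead of building a full row Series for every call.
toy_df["NLTK score"] = toy_df["quotation"].apply(toy_scorer)

print(toy_df["NLTK score"].tolist())  # [{'length': 8}, {'length': 3}]
```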


The computed score is a dictionary with four keys (`neg`, `neu`, `pos` and `compound`), so we expand it into one column per key.

In [10]:
# Expand the score dictionary into one column per key
df = pd.concat([df, df['NLTK score'].progress_apply(pd.Series)], axis=1)

100%|██████████| 105929/105929 [00:23<00:00, 4533.58it/s]
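
As a sketch of another way to expand the dictionary column, the scores can be turned into a dataframe in one call and joined back. The example reuses the two score dictionaries visible in the sanity check of this notebook; the quotation strings and the name `toy_df` are placeholders. It assumes the dataframe index is unique, since `join` aligns on the index and would duplicate rows otherwise.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "quotation": ["first quote", "second quote"],  # placeholder text
        "NLTK score": [
            {"neg": 0.0, "neu": 0.847, "pos": 0.153, "compound": 0.8042},
            {"neg": 0.169, "neu": 0.69, "pos": 0.141, "compound": -0.7184},
        ],
    }
)

# Build all score columns at once from the list of dicts,
# reusing the original index so the rows stay aligned.
scores = pd.DataFrame(toy_df["NLTK score"].tolist(), index=toy_df.index)

# Drop score columns left over from a previous run before joining,
# so re-executing the cell does not duplicate them.
toy_df = toy_df.drop(columns=scores.columns, errors="ignore").join(scores)

print(list(toy_df.columns))
# ['quotation', 'NLTK score', 'neg', 'neu', 'pos', 'compound']
```

Dropping the existing score columns first makes the cell safe to re-run, which avoids duplicated columns such as `neg.1` or `compound.1`.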


In [16]:
# Sanity check
df.sample(2)

Unnamed: 0.1,Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,...,parties,NLTK score,neg,neu,pos,compound,neg.1,neu.1,pos.1,compound.1
7983,199559,2015-02-25-111879,whether you are talking to a member of the hou...,john mchugh,"['Q6247887', 'Q6247891']",2015-02-25 17:00:45,2.0,"[['John McHugh', '0.6448'], ['None', '0.3552']]",['http://www.nationaldefensemagazine.org/_layo...,E,...,Republican,"{'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'comp...",0.0,0.847,0.153,0.8042,0.0,0.847,0.153,0.8042
13718,85344,2017-08-23-163100,we're closely following the terrible events un...,president donald trump,['Q22686'],2017-08-23 08:58:44,2.0,"[['President Donald Trump', '0.6922'], ['None'...",['http://www.politifact.com/truth-o-meter/arti...,E,...,Republican,"{'neg': 0.169, 'neu': 0.69, 'pos': 0.141, 'com...",0.169,0.69,0.141,-0.7184,0.169,0.69,0.141,-0.7184
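
The `compound` column is a normalized score between -1 and 1. A common convention, suggested by the VADER authors, is to call a text positive when `compound >= 0.05`, negative when `compound <= -0.05`, and neutral otherwise. The helper below is a minimal sketch of that rule; `label_compound` and its `threshold` parameter are our own names and are not used elsewhere in this notebook.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The two compound scores from the sample rows above
print(label_compound(0.8042))   # positive
print(label_compound(-0.7184))  # negative
```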


In [15]:
# Save the final dataframe as both pickle and CSV
df.to_pickle(os.path.join(PKL_PATH, "final_subset.pkl"))
df.to_csv(os.path.join(CSV_PATH, "final_subset.csv"))

Now that we have our final subset, we can conduct our exploratory data analysis on it.

## 2. EDA

We start by splitting the dataset by party and comparing the number of quotes in each group.

In [17]:
# Split the df by party
df_rep = df[df["parties"] == "Republican"]
df_dem = df[df["parties"] == "Democrat"]

In [19]:
print(f"{len(df_rep)=}")
print(f"{len(df_dem)=}")

len(df_rep)=56257
len(df_dem)=49672
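
The same counts, together with a first per-party summary of the sentiment, can be obtained without splitting the dataframe by hand. This is only a sketch of one possible next step: the four rows below are made up, and `toy_df`, `counts` and `mean_compound` are illustrative names.

```python
import pandas as pd

# Made-up rows that only mimic the `parties` and `compound` columns
toy_df = pd.DataFrame(
    {
        "parties": ["Republican", "Democrat", "Republican", "Democrat"],
        "compound": [0.5, -0.5, 0.25, 0.25],
    }
)

# Number of quotes per party
counts = toy_df["parties"].value_counts()

# Mean compound score per party
mean_compound = toy_df.groupby("parties")["compound"].mean()

print(counts["Republican"], counts["Democrat"])  # 2 2
print(mean_compound["Republican"])               # 0.375
print(mean_compound["Democrat"])                 # -0.125
```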
