![alternative text](../images/alatheia.png)

# Sentiment Analysis POC

* This notebook is a scratch pad for a sentiment analysis proof of concept (POC).
* The idea is to try out a few pre-trained sentiment analysis models and see which one works best for our use case.

## Installations

In [22]:
# # ## installing required libraries
# ! pip install beautifulsoup4
# ! pip install pandas
# ! pip install numpy
# ! pip install plotly
# ! pip install nbformat
# ! pip install ipykernel
# ! pip install matplotlib
# ! pip install wordcloud
# ! pip install gensim
# ! pip install pyLDAvis
# ! pip install nltk
# ! pip install -U pip setuptools wheel
# ! pip install -U spacy
# ! python -m spacy download en_core_web_trf 
# ! python -m spacy download en_core_web_md
# ! pip install joblib
# ! pip install tqdm
# ! pip install transformers
! pip install torch






## Importing Data

In [23]:
## let's load the libraries we need
import pandas as pd
# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)

import re
import string
from bs4 import BeautifulSoup

import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
from pprint import pprint

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as io

# loading library
import pickle

from joblib import dump, load

from tqdm.auto import tqdm
import torch

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Reading Data

In [24]:
## reading manually scraped data
data = pd.read_csv('../data/scrapped_fox_data_clean.csv')
print(data.shape)

(3972, 12)


## Sentiment Analysis Using VADER
(Valence Aware Dictionary and sEntiment Reasoner)

##### Notes
* Uses a bag-of-words approach.
* Assigns a positive, negative, or neutral valence to each word in the sentence, then combines those values into an overall score that tells us whether the sentence is positive, negative, or neutral.
* Mostly ignores relationships between words; only a few hand-written rules (negation, intensifiers) look at neighboring words :(
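To make the bag-of-words idea in the notes above concrete, here is a minimal sketch of a lexicon-based score. The lexicon and its valence values are made up for illustration (VADER ships its own, much larger lexicon), and the function name is hypothetical. The normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` is, to my understanding, how VADER squashes the summed valence into the `compound` range of -1 to 1; the real analyzer also applies rules that this sketch leaves out.

```python
import math

# Hypothetical toy lexicon with invented valence values (not VADER's real entries).
TOY_LEXICON = {"great": 3.1, "wins": 2.7, "attack": -2.1, "threat": -2.4}


def toy_compound(sentence: str, alpha: float = 15.0) -> float:
    """Sum per-word valences and squash the total into the range (-1, 1)."""
    words = [w.strip(".,!?:'\"").lower() for w in sentence.split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    # Normalization assumed to match VADER's compound score: x / sqrt(x^2 + alpha).
    return total / math.sqrt(total * total + alpha)


print(toy_compound("Great candidates!"))
print(toy_compound("a threat to democracy"))
```

Because each word is looked up independently, word order has no effect on the result, which is the limitation the last bullet points at.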

In [25]:
from nltk.sentiment import SentimentIntensityAnalyzer


In [26]:
example = data["text"][0]
example

'Former governor and first term Democratic Sen. Maggie Hassan of New Hampshire and Republican challenger Don Bolduc took aim at each other over inflation, abortion, national security, the border crisis, election denialism, and many more issues in their third and final debate in their crucial battleground state race that’s among a handful across the country that will likely determine if the GOP wins back the Senate majority. But ahead of the verbal crossfire on the debate stage, Bolduc – a former Army general who served ten tours of duty in the war in Afghanistan – was allegedly assaulted as he arrived at the debate site at Saint Anselm College’s New Hampshire Institute of Politics on Wednesday evening. According to the Bolduc campaign, a bystander standing in the crowd outside the debate site took a swing at the former general as he arrived. The campaign says Bolduc was slightly grazed but not injured.&nbsp; Rick Wiley of the Bolduc campaign tells Fox News the person who threw the punc

In [27]:
# nltk.download()

In [28]:
sia = SentimentIntensityAnalyzer()

In [29]:
sia.polarity_scores(example)

{'neg': 0.105, 'neu': 0.812, 'pos': 0.083, 'compound': -0.9869}

In [30]:
## scoring the title of the first article
print(data["title"][0])
sia.polarity_scores(data["title"][0])

Hassan and Bolduc trade fire in final showdown after GOP nominee comes under attack arriving at debate


{'neg': 0.268, 'neu': 0.732, 'pos': 0.0, 'compound': -0.6705}

In [31]:
## let's rename the index column to id
data.rename(columns={"index":"id"}, inplace=True)

In [32]:
results = {}
for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
    # score every title with VADER, keyed by the row index
    results[idx] = sia.polarity_scores(row["title"])

100%|██████████| 3972/3972 [00:00<00:00, 5449.09it/s]


In [33]:
results_df = pd.DataFrame.from_dict(results, orient='index')

In [34]:
combined_df = pd.concat([data, results_df], axis=1)
combined_df.head()

| | title | description | url | last_published_date | authors | text | published_day | published_month | num_authors | author | word_count | line_count | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hassan and Bolduc trade fire in final showdown... | A bystander took a swing at Republican Senate ... | https://www.foxnews.com/politics/hassan-bolduc... | 2022-11-02 22:47:00-04:00 | [{'name': 'Paul Steinhauser'}] | Former governor and first term Democratic Sen.... | 2 | 11 | 1 | Paul_Steinhauser | 1271 | 62 | 0.268 | 0.732 | 0.0 | -0.6705 |
| 1 | Biden suggests voting for Republicans is a thr... | President Biden said the only way to repudiate... | https://www.foxnews.com/politics/biden-speech | 2022-11-02 22:15:46-04:00 | [{'name': 'Haris Alic'}] | President Biden urged Democrats on Wednesday t... | 2 | 11 | 1 | Haris_Alic | 478 | 22 | 0.298 | 0.702 | 0.0 | -0.5267 |
| 2 | NYC's Naked Cowboy makes endorsement for gov w... | New York City's Naked Cowboy endorsed Lee Zeld... | https://www.foxnews.com/politics/nyc-naked-cow... | 2022-11-02 21:58:25-04:00 | [{'name': 'Adam Sabes'}] | The famous Naked Cowboy in New York City's Tim... | 2 | 11 | 1 | Adam_Sabes | 205 | 18 | 0.0 | 0.757 | 0.243 | 0.5423 |
| 3 | Wisconsin courts shoot down liberal groups' at... | A Wisconsin appeals court and a circuit judge ... | https://www.foxnews.com/politics/wisconsin-cou... | 2022-11-02 21:44:40-04:00 | [{'name': 'Bradford Betz'}] | Liberal groups in Wisconsin seeking to change ... | 2 | 11 | 1 | Bradford_Betz | 381 | 20 | 0.29 | 0.71 | 0.0 | -0.5423 |
| 4 | Texas gubernatorial candidate Beto O'Rourke jo... | Texas gubernatorial nominee Beto O’Rourke is t... | https://www.foxnews.com/politics/texas-guberna... | 2022-11-02 20:38:30-04:00 | [{'name': 'Bradford Betz'}] | Texas gubernatorial nominee Beto O’Rourke is a... | 2 | 11 | 1 | Bradford_Betz | 267 | 15 | 0.0 | 1.0 | 0.0 | 0.0 |


In [35]:
combined_df.loc[combined_df["pos"].idxmax(), "title"]

"Former President Trump celebrates 'ALL' endorsement wins in primary: 'Great candidates!'"

In [36]:
combined_df.describe()

| | published_day | published_month | num_authors | word_count | line_count | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|
| count | 3972.0 | 3972.0 | 3972.0 | 3972.0 | 3972.0 | 3972.0 | 3972.0 | 3972.0 | 3972.0 |
| mean | 16.787513 | 8.453424 | 1.119084 | 610.804884 | 31.652064 | 0.109221 | 0.812165 | 0.078617 | -0.065043 |
| std | 8.934083 | 1.335676 | 0.447909 | 340.7851 | 19.232276 | 0.136235 | 0.158318 | 0.110806 | 0.394567 |
| min | 1.0 | 6.0 | 0.0 | 31.0 | 3.0 | 0.0 | 0.163 | 0.0 | -0.9661 |
| 25% | 9.0 | 7.0 | 1.0 | 399.0 | 21.0 | 0.0 | 0.701 | 0.0 | -0.3612 |
| 50% | 18.0 | 8.0 | 1.0 | 534.0 | 28.0 | 0.0 | 0.823 | 0.0 | 0.0 |
| 75% | 25.0 | 10.0 | 1.0 | 734.0 | 37.0 | 0.194 | 1.0 | 0.155 | 0.1779 |
| max | 31.0 | 11.0 | 5.0 | 9672.0 | 647.0 | 0.837 | 1.0 | 0.668 | 0.936 |
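The `compound` column is the value usually turned into a single label. A minimal sketch, assuming the commonly quoted VADER convention of treating scores at or above 0.05 as positive and at or below -0.05 as negative; the helper name and the default threshold are illustrative, not something this notebook has tuned.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the first three titles scored above.
for score in (-0.6705, -0.5267, 0.5423):
    print(score, label_from_compound(score))
```

On the combined frame this could be applied as `combined_df["compound"].apply(label_from_compound)`.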


## RoBERTa Model

In [37]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [42]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from scipy.special import softmax

In [1]:
# nltk.download()

In [43]:
## pulling a specific model pretrained on sentiment analysis
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Trying Modeling Long Text

In [46]:
text = data["text"][1]
tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")

In [47]:
print(len(tokens['input_ids'][0]))
tokens

648


{'input_ids': tensor([[ 6517, 15478,  2966,  1574,    15,   307,     7,   311,    62,    23,
             5,  4583,   148,   220,   186,    18, 12076,  1727,    50,    22,
         25487,     5,  2933,  1572,    14, 32351,    13,   476,   113,     7,
          6638,   409,    23,   470,  4593,     4,   947,   282, 39596,   131,
         15478,  2966,    10,  2180,     9,  2732,    23,    10,  1557,   496,
          1674,   515,  1025,  1332,  5088,    11,   663,     6,   211,     4,
           347,     4,    45,     7,   185,     5,   729,    13,  4159,     4,
            20,   394,  3811,    14,  1574,    56,     7,   311,    62,    23,
             5,  4583,   142,  1858,   115,    45,    28, 29168,    19,   559,
           476,     4,   359,   282, 39596,   131,    22,   170,   214,  2114,
            10, 17032,  1151,    60,    26, 15478,     4,    22,   170,   531,
            19,    65,  8642,     6, 16681,  2236,  1994,     6,    25,    10,
           247,     8,   224,    89,  

In [51]:
# define target chunksize
chunksize = 512

# split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))


# loop through each chunk
for i in range(len(input_id_chunks)):
    # add CLS and SEP tokens to input IDs
    # NOTE: 101 and 102 are BERT's [CLS]/[SEP] IDs; for this RoBERTa checkpoint the
    # matching IDs are tokenizer.cls_token_id and tokenizer.sep_token_id
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    # extend the attention mask to cover the two special tokens
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])
    # get required padding length
    pad_len = chunksize - input_id_chunks[i].shape[0]
    # check if tensor length satisfies required chunk size
    if pad_len > 0:
        # if padding length is more than 0, we must add padding
        # NOTE: 0 is BERT's [PAD] ID (RoBERTa's is tokenizer.pad_token_id), and
        # torch.Tensor(...) builds a float tensor, so the padded chunk becomes float
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.Tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.Tensor([0] * pad_len)
        ])

# check length of each tensor
for chunk in input_id_chunks:
    print(len(chunk))
# print the final chunk to check where the 101, 102, and 0 (PAD) tokens were placed
# chunk

512
512


In [52]:
chunk

tensor([1.0100e+02, 9.0000e+00, 5.0000e+00, 1.1720e+03, 4.9600e+02, 1.6740e+03,
        4.0000e+00, 2.2000e+01, 5.7710e+03, 1.8580e+03, 1.0910e+03, 2.0610e+03,
        1.5000e+01, 5.0000e+00, 7.4300e+02, 1.4000e+01, 9.4800e+02, 1.4400e+02,
        7.0000e+00, 1.9830e+03, 6.0000e+00, 1.5478e+04, 8.0000e+00, 1.5740e+03,
        3.2000e+01, 2.3420e+03, 8.4590e+03, 1.1000e+01, 5.0000e+00, 5.0700e+02,
        3.6000e+02, 1.4200e+02, 5.1000e+01, 3.3000e+01, 6.8500e+02, 2.8420e+03,
        1.9000e+01, 5.0000e+00, 1.3790e+03, 9.0000e+00, 1.2320e+03, 3.3060e+03,
        7.0000e+00, 1.2000e+02, 3.0000e+01, 7.2000e+01, 1.4370e+03, 9.9300e+02,
        1.8580e+03, 5.8000e+01, 6.7000e+01, 2.1190e+03, 7.0000e+00, 2.9508e+04,
        1.5478e+04, 1.8000e+01, 1.4500e+03, 6.0000e+00, 5.8400e+02, 5.0000e+00,
        3.9400e+02, 2.1000e+01, 6.4750e+03, 7.0000e+00, 2.5010e+03, 5.0000e+00,
        7.1690e+03, 5.1080e+03, 9.0000e+00, 1.5740e+03, 1.4900e+02, 2.4900e+03,
        1.2000e+01, 2.7669e+04, 2.9610e+

In [53]:

input_ids = torch.stack(input_id_chunks)
attention_mask = torch.stack(mask_chunks)

input_dict = {
    'input_ids': input_ids.long(),
    'attention_mask': attention_mask.int()
}
input_dict

{'input_ids': tensor([[  101,  6517, 15478,  ...,  3428,  7760,   102],
         [  101,     9,     5,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)}

In [54]:

outputs = model(**input_dict)
probs = torch.nn.functional.softmax(outputs[0], dim=-1)
probs = probs.mean(dim=0)
probs

tensor([0.3930, 0.5565, 0.0505], grad_fn=<MeanBackward1>)
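The chunking cell above hard-codes 101, 102 and 0, which are the `[CLS]`, `[SEP]` and `[PAD]` IDs of BERT vocabularies. RoBERTa tokenizers use different IDs (typically `<s>` = 0, `</s>` = 2, `<pad>` = 1), so the probabilities above were computed on chunks whose special tokens the model may not recognize. Below is a minimal sketch of the same chunking with the special IDs passed in as parameters. It uses plain Python lists so it runs without a model; the function name is hypothetical, and in the notebook the three IDs would be read from `tokenizer.cls_token_id`, `tokenizer.sep_token_id` and `tokenizer.pad_token_id`.

```python
from typing import List, Tuple


def chunk_with_special_tokens(
    token_ids: List[int],
    chunk_size: int,
    cls_id: int,
    sep_id: int,
    pad_id: int,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Split token IDs into fixed-size chunks, each wrapped in CLS/SEP and padded."""
    body = chunk_size - 2  # leave room for the two special tokens
    input_ids, attention_masks = [], []
    for start in range(0, len(token_ids), body):
        piece = [cls_id] + token_ids[start:start + body] + [sep_id]
        pad_len = chunk_size - len(piece)
        # real tokens get mask 1, padding gets mask 0
        attention_masks.append([1] * len(piece) + [0] * pad_len)
        input_ids.append(piece + [pad_id] * pad_len)
    return input_ids, attention_masks


# Tiny example: 7 made-up token IDs, chunks of 6, RoBERTa-style special IDs.
ids, masks = chunk_with_special_tokens(list(range(10, 17)), 6, cls_id=0, sep_id=2, pad_id=1)
print(ids)
print(masks)
```

Both returned lists are rectangular, so `torch.tensor(ids)` and `torch.tensor(masks)` yield integer tensors that can be passed to the model as `input_ids` and `attention_mask` without the later `.long()` cast.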

#### End Trying Modeling Long Text

In [30]:
print(data["title"][1])
sia.polarity_scores(data["title"][1])

Biden suggests voting for Republicans is a threat to democracy


{'neg': 0.298, 'neu': 0.702, 'pos': 0.0, 'compound': -0.5267}

In [31]:
## running on roberta model
def get_roberta_sentiment(text):
    encoded_text = tokenizer(text, return_tensors="pt")
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    ## scores are in order of negative, neutral and positive
    scores_dict = {"roberta_neg":scores[0], "roberta_neu":scores[1], "roberta_pos":scores[2]}
    return scores_dict


scores = get_roberta_sentiment(data["title"][1])
scores

{'roberta_neg': 0.66241485,
 'roberta_neu': 0.3247515,
 'roberta_pos': 0.012833751}
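To compare the two models row by row it can help to reduce the three RoBERTa probabilities to one label. A minimal sketch, assuming the `roberta_neg` / `roberta_neu` / `roberta_pos` key names returned by `get_roberta_sentiment`; the helper name is hypothetical.

```python
def top_roberta_label(scores: dict) -> str:
    """Return 'neg', 'neu' or 'pos' for the highest-probability class."""
    best_key = max(scores, key=scores.get)
    # strip the 'roberta_' prefix to leave only the class name
    return best_key.split("_", 1)[1]


# Scores of the second title, copied from the output above.
example_scores = {
    "roberta_neg": 0.66241485,
    "roberta_neu": 0.3247515,
    "roberta_pos": 0.012833751,
}
print(top_roberta_label(example_scores))
```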

In [32]:
results = {}
for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
    try:
        text = row["title"]
        vader_results = sia.polarity_scores(text)
        # prefix VADER keys so they don't clash with the RoBERTa columns
        vader_results_rename = {f"vader_{k}": v for k, v in vader_results.items()}
        roberta_results = get_roberta_sentiment(text)
        both = {**vader_results_rename, **roberta_results}
        results[idx] = both
    except RuntimeError:
        print(f"Broke for id {idx}")


100%|██████████| 3972/3972 [03:06<00:00, 21.30it/s]


In [33]:
results_df = pd.DataFrame.from_dict(results, orient='index')
results_df

| | vader_neg | vader_neu | vader_pos | vader_compound | roberta_neg | roberta_neu | roberta_pos |
|---|---|---|---|---|---|---|---|
| 0 | 0.268 | 0.732 | 0.000 | -0.6705 | 0.214919 | 0.765167 | 0.019915 |
| 1 | 0.298 | 0.702 | 0.000 | -0.5267 | 0.662415 | 0.324751 | 0.012834 |
| 2 | 0.000 | 0.757 | 0.243 | 0.5423 | 0.005948 | 0.538470 | 0.455582 |
| 3 | 0.290 | 0.710 | 0.000 | -0.5423 | 0.346270 | 0.637059 | 0.016671 |
| 4 | 0.000 | 1.000 | 0.000 | 0.0000 | 0.056946 | 0.885563 | 0.057491 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3967 | 0.152 | 0.691 | 0.157 | 0.0258 | 0.224884 | 0.703206 | 0.071910 |
| 3968 | 0.000 | 1.000 | 0.000 | 0.0000 | 0.570033 | 0.403358 | 0.026609 |
| 3969 | 0.080 | 0.747 | 0.172 | 0.3818 | 0.502582 | 0.481971 | 0.015446 |
| 3970 | 0.173 | 0.827 | 0.000 | -0.3182 | 0.048049 | 0.920026 | 0.031925 |


In [34]:
final_df = pd.concat([data, results_df], axis=1)


In [35]:
final_df.loc[final_df["roberta_pos"].idxmax(), "title"]

"Crist praises Biden, says president is 'phenomenal' and he 'can't wait' to have his support in Florida"

In [107]:
final_df.loc[final_df["vader_pos"].idxmax(), "title"]

"Former President Trump celebrates 'ALL' endorsement wins in primary: 'Great candidates!'"

So the `RoBERTa` model throws an error when the input text is too long. I wonder if I can break the article down into individual sentences, run each one through the model, and then average the per-sentence scores to get a sentiment for the whole article. Would that make sense?
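One thing to weigh when averaging is that a plain mean gives a five-word sentence the same influence as a fifty-word one. A minimal sketch of a weighted mean, assuming each sentence's word count is used as its weight; the function name and the choice of weight are illustrative, not something this notebook has validated.

```python
from typing import Dict, List


def weighted_mean_scores(
    sentence_scores: List[Dict[str, float]],
    weights: List[float],
) -> Dict[str, float]:
    """Average per-sentence score dicts, weighting each sentence by `weights`."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    keys = sentence_scores[0].keys()
    return {
        key: sum(s[key] * w for s, w in zip(sentence_scores, weights)) / total
        for key in keys
    }


# Two made-up sentences: a short neutral one and a longer negative one.
scores = [
    {"roberta_neg": 0.2, "roberta_neu": 0.6, "roberta_pos": 0.2},
    {"roberta_neg": 0.8, "roberta_neu": 0.1, "roberta_pos": 0.1},
]
print(weighted_mean_scores(scores, weights=[1, 3]))
```

Passing equal weights reproduces the plain mean, so both variants can be compared on the same article.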

In [39]:
from nltk.tokenize import sent_tokenize

In [49]:
test_text = data.sample(1)["text"].values[0]

# get_roberta_sentiment(test_text)

In [51]:
sentences = sent_tokenize(test_text)

In [53]:
sent_sentiments = [get_roberta_sentiment(sent) for sent in sentences]
    

In [56]:
sent_sentiments_df = pd.DataFrame(sent_sentiments)
sent_sentiments_df.mean()

roberta_neg    0.099001
roberta_neu    0.771131
roberta_pos    0.129868
dtype: float32

In [57]:
import numpy as np

In [58]:
text_results = {}
for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
    try:
        text = row["text"]
        vader_results = sia.polarity_scores(text)
        # prefix VADER keys so they don't clash with the RoBERTa columns
        vader_results_rename = {f"vader_{k}": v for k, v in vader_results.items()}
        # BUG: this tokenizes the sampled `test_text`, not this row's `text`,
        # so every row ends up with the same RoBERTa scores
        sentences = sent_tokenize(test_text)
        sent_sentiments = [get_roberta_sentiment(sent) for sent in sentences]
        neg = [sent["roberta_neg"] for sent in sent_sentiments]
        pos = [sent["roberta_pos"] for sent in sent_sentiments]
        neu = [sent["roberta_neu"] for sent in sent_sentiments]
        roberta_results = {"roberta_neg":np.mean(neg), "roberta_pos":np.mean(pos), "roberta_neu":np.mean(neu)}
        both = {**vader_results_rename, **roberta_results}
        # NOTE: this writes into `results`, so `text_results` stays empty
        results[idx] = both
    except RuntimeError:
        print(f"Broke for id {idx}")

100%|██████████| 3972/3972 [1:39:18<00:00,  1.50s/it]


In [63]:
sent_sentiments
neg = [sent["roberta_neg"] for sent in sent_sentiments]
pos = [sent["roberta_pos"] for sent in sent_sentiments]
neu = [sent["roberta_neu"] for sent in sent_sentiments]

In [69]:
roberta_results

{'roberta_neg': 0.09900135,
 'roberta_pos': 0.1298681,
 'roberta_neu': 0.77113056}

In [70]:
np.mean(neu)

0.77113056

In [73]:
pos

[0.18134831,
 0.063188896,
 0.023074752,
 0.03626485,
 0.12284745,
 0.05646329,
 0.02577409,
 0.10676716,
 0.042654786,
 0.10904879,
 0.012940776,
 0.121790655,
 0.09689276,
 0.8645746,
 0.42098108,
 0.02996848,
 0.026538102,
 0.05098097,
 0.021859746,
 0.18340261]


In [67]:
results_df = pd.DataFrame.from_dict(results, orient='index')


| | vader_neg | vader_neu | vader_pos | vader_compound | roberta_neg | roberta_pos | roberta_neu |
|---|---|---|---|---|---|---|---|
| 0 | 0.105 | 0.812 | 0.083 | -0.9869 | 0.099001 | 0.129868 | 0.771131 |
| 1 | 0.088 | 0.865 | 0.046 | -0.9520 | 0.099001 | 0.129868 | 0.771131 |
| 2 | 0.087 | 0.849 | 0.064 | -0.7897 | 0.099001 | 0.129868 | 0.771131 |
| 3 | 0.075 | 0.899 | 0.026 | -0.9544 | 0.099001 | 0.129868 | 0.771131 |
| 4 | 0.068 | 0.853 | 0.078 | -0.0152 | 0.099001 | 0.129868 | 0.771131 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3967 | 0.047 | 0.848 | 0.105 | 0.9983 | 0.099001 | 0.129868 | 0.771131 |
| 3968 | 0.113 | 0.842 | 0.044 | -0.9981 | 0.099001 | 0.129868 | 0.771131 |
| 3969 | 0.024 | 0.800 | 0.176 | 0.9989 | 0.099001 | 0.129868 | 0.771131 |
| 3970 | 0.058 | 0.899 | 0.044 | -0.8732 | 0.099001 | 0.129868 | 0.771131 |


In [68]:
results_df.query("roberta_neu == 0.77113056")

| | vader_neg | vader_neu | vader_pos | vader_compound | roberta_neg | roberta_pos | roberta_neu |
|---|---|---|---|---|---|---|---|
| 0 | 0.105 | 0.812 | 0.083 | -0.9869 | 0.099001 | 0.129868 | 0.771131 |
| 1 | 0.088 | 0.865 | 0.046 | -0.9520 | 0.099001 | 0.129868 | 0.771131 |
| 2 | 0.087 | 0.849 | 0.064 | -0.7897 | 0.099001 | 0.129868 | 0.771131 |
| 3 | 0.075 | 0.899 | 0.026 | -0.9544 | 0.099001 | 0.129868 | 0.771131 |
| 4 | 0.068 | 0.853 | 0.078 | -0.0152 | 0.099001 | 0.129868 | 0.771131 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3967 | 0.047 | 0.848 | 0.105 | 0.9983 | 0.099001 | 0.129868 | 0.771131 |
| 3968 | 0.113 | 0.842 | 0.044 | -0.9981 | 0.099001 | 0.129868 | 0.771131 |
| 3969 | 0.024 | 0.800 | 0.176 | 0.9989 | 0.099001 | 0.129868 | 0.771131 |
| 3970 | 0.058 | 0.899 | 0.044 | -0.8732 | 0.099001 | 0.129868 | 0.771131 |
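Every row above carries the same RoBERTa scores because the loop in `In [58]` calls `sent_tokenize(test_text)`, the one sampled article, instead of tokenizing the current row's `text`; it also stores its output in `results` rather than `text_results`. Below is a minimal sketch of a per-row version. To keep it runnable without the model, the sentence splitter and sentence scorer are passed in as callables; in the notebook these would be `sent_tokenize` and `get_roberta_sentiment`. The function names here are hypothetical.

```python
from statistics import fmean
from typing import Callable, Dict, List


def average_sentence_scores(
    text: str,
    split_sentences: Callable[[str], List[str]],
    score_sentence: Callable[[str], Dict[str, float]],
) -> Dict[str, float]:
    """Score each sentence of `text` and return the mean of every score key."""
    sentences = split_sentences(text)
    if not sentences:
        return {}
    per_sentence = [score_sentence(sentence) for sentence in sentences]
    return {key: fmean(scores[key] for scores in per_sentence) for key in per_sentence[0]}


def score_articles(texts, split_sentences, score_sentence) -> Dict[object, Dict[str, float]]:
    """Apply `average_sentence_scores` to each article, keyed by its index."""
    text_results = {}
    for idx, text in texts.items():  # works for a dict or a pandas Series
        text_results[idx] = average_sentence_scores(text, split_sentences, score_sentence)
    return text_results


# Stand-in splitter and scorer so the sketch runs without NLTK or the model.
def fake_split(text: str) -> List[str]:
    return [s for s in text.split(". ") if s]


def fake_score(sentence: str) -> Dict[str, float]:
    return {"roberta_pos": float(len(sentence))}


print(score_articles({0: "aa. bbbb", 1: "c"}, fake_split, fake_score))
```

In the notebook the call would look like `score_articles(data["text"], sent_tokenize, get_roberta_sentiment)`, and different articles should then produce different scores.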
