![alternative text](../images/alatheia.png)

# Sentiment Analysis Of News Articles

* The purpose of this notebook is to run sentiment analysis on news articles and update the dataset with the results. 
* We've decided to use a pre-trained model for this because, 
    * The time constraint for this project, blocks us from training a custom sentiment analysis model. 
    * Would like to take this opportunity to learn how to use pre-trained models and transfer learning. 
* We are going to use a pretrained model, which was trained for sentiment analysis on twitter.
* All the research and prototyping on how to use this model was done in a spearate notebook `poc_sentiment_analysis`

## Installations

In [1]:
# # ## installing required libraries
# ! pip install beautifulsoup4
# ! pip install pandas
# ! pip install numpy
# ! pip install plotly
# ! pip install nbformat
# ! pip install ipykernel
# ! pip install matplotlip
# ! pip install wordcloud
# ! pip install gensim
# ! pip install pyLDAvis
# ! pip install nltk
# ! pip install -U pip setuptools wheel
# ! pip install -U spacy
# ! python -m spacy download en_core_web_trf 
# ! python -m spacy download en_core_web_md
# ! pip install joblib
# ! pip install tqdm
! pip install transformers
! pip install torch




[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip






[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports

In [60]:
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from scipy.special import softmax
from nltk.sentiment import SentimentIntensityAnalyzer

import pandas as pd
import torch


In [61]:
# nltk.download()

In [100]:
## input file name
input_file = "../data/apify_text_topics_v2.csv"

## Utility Functions

In [63]:
## helper function to process the long text and break it into smaller chunks
def process_long_text(text, tokenizer):
    tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")
    # define target chunksize
    chunksize = 512

    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))


    # loop through each chunk
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]), mask_chunks[i], torch.tensor([1])
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len)
            ])
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    input_dict = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.int()
    }
    return input_dict


## helper function to predict the setiment for the given text
def predict_sentiment_using_roberta(text, model, tokenizer):
    input_dict = process_long_text(text, tokenizer)
    outputs = model(**input_dict)
    probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    probs = probs.mean(dim=0)
    probs = probs.detach().numpy()
    return probs


## helper function to predict the setiment using vader
def perform_sentiment_analysis_using_vader(text, model):
    sentiment = model.polarity_scores(text)
    return sentiment

## Performing Sentiment Analysis Using Roberta

### Initialize Model

In [64]:
sia = SentimentIntensityAnalyzer()

### Reading Data

In [101]:
data = pd.read_csv(input_file)
data.head()

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,...,clean_text,topic_number_1,topic_number_2,topic_number_3,topic_probability_1,topic_probability_2,topic_probability_3,topic_words_1,topic_words_2,topic_words_3
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,...,president biden told reporters monday followin...,28,5.0,4.0,0.481921,0.302021,0.059743,"china,chinese,taiwan,united_state,threat,defen...","biden,president,white_house,administration,joe...","organization,letter,position,school_board,empl..."
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that Rep. Ken C...,14,11,2022,...,the associated press projects that rep ken cal...,45,43.0,46.0,0.397262,0.201822,0.176006,"democrats,republican,republicans,democrat,demo...","flight,town,fire,bar,visitor,hall,beer,trip,ro...","voter,poll,vote,ballot,georgia,republican,voti..."
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,...,house speaker nancy pelosi’s spokesman forcefu...,45,48.0,51.0,0.229805,0.204087,0.188578,"democrats,republican,republicans,democrat,demo...","attack,pelosi,violence,depape,speaker,wing,thr...","senate,vote,democrats,congress,legislation,rep..."
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,...,the fox news decision desk can project that de...,49,36.0,27.0,0.552282,0.129222,0.062635,"candidate,race,senate,republican,campaign,demo...","trump,president,republican,capitol,donald_trum...","school,education,teacher,theory,district,kid,s..."
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,...,four college students were killed around or in...,53,29.0,3.0,0.326747,0.253755,0.078065,"police,officer,victim,suspect,murder,incident,...","crime,los_angele,death,safe,city,resident,viol...","kid,mother,daughter,friend,mom,girl,husband,ba..."


### Sample Testing

In [89]:
sample_row = data.sample(1)
sample_text = sample_row['text'].values[0]
sample_clean_text = sample_row['clean_text'].values[0]

In [90]:
sample_text

'The Republican super  aligned with Senate Minority Leader Mitch McConnell said Friday that it would cancel the millions of dollars it had reserved to spend on  ads in New Hampshire\'s Senate race.\n\nThe Senate Leadership Fund said it would slash $5.6 million from the state, which was one of the party\'s top targets to flip.\n\nIn a statement to Fox News, Senate Leadership Fund president Steven Law said that "as the\xa0cycle comes to a close, we are shifting resources to where they can be most effective to achieve our ultimate goal: winning the majority."\n\nThe move comes just two weeks after the National Republican Senatorial Committee – the Senate ’s re-election arm – scrapped its own television ad reservations in New Hampshire.\n\n      $23        \n\nAt the beginning of September, the fund reserved $23 million to run ads in the battleground state.\n\nHowever, that was before Republican and former Army Gen. Don Bolduc won the Sept. 13 primary.\n\nBolduc is resolute that he would n

In [91]:
sample_clean_text

'the republican super aligned with senate minority leader mitch mcconnell said friday that it would cancel the millions of dollars it had reserved to spend on ads in new hampshires senate racethe senate leadership fund said it would slash million from the state which was one of the partys top targets to flipin a statement to fox news senate leadership fund president steven law said that as the cycle comes to a close we are shifting resources to where they can be most effective to achieve our ultimate goal winning the majoritythe move comes just two weeks after the national republican senatorial committee – the senate ’s reelection arm – scrapped its own television ad reservations in new hampshire at the beginning of september the fund reserved million to run ads in the battleground statehowever that was before republican and former army gen don bolduc won the sept primarybolduc is resolute that he would not support mcconnell as the leadergeneral bolduc has defied the naysayers from the

In [92]:
sentiments = perform_sentiment_analysis_using_vader(sample_text, sia)
sentiments

{'neg': 0.048, 'neu': 0.873, 'pos': 0.079, 'compound': 0.8456}

In [93]:
sentiments = perform_sentiment_analysis_using_vader(sample_clean_text, sia)
sentiments

{'neg': 0.051, 'neu': 0.866, 'pos': 0.083, 'compound': 0.8456}

###### Notes
* So the news article is been tagged as overall `negative`, it could be because
    * `Negative` sentiment score is more than positive sentiment score. 
    * The article is about funeral, we'll need to do some more sample tests. 
* But other than the funeral article, most of the scores seem to align with eye ball intuition. 
* So there is some difference between sentiment analysis results between `raw` and `clean` text
* The numbers are not too significantly different, and since vader works in a similar bag of words approach, I think clean text should work as expected. But in our case we have also removed punchtuations so its safer to go with `raw text`

In [86]:
## running vader on the entire dataset
def map_vader_sentiment_analysis(row):
    text = row['text']
    sentiments = perform_sentiment_analysis_using_vader(text, sia)
    return pd.Series(sentiments)
    

In [87]:
sentiment_df = data.apply(map_vader_sentiment_analysis, axis=1)

In [88]:
sentiment_df

Unnamed: 0,neg,neu,pos,compound
0,0.028,0.870,0.102,0.9756
1,0.015,0.910,0.075,0.8176
2,0.016,0.889,0.095,0.9680
3,0.066,0.819,0.115,0.8934
4,0.131,0.824,0.044,-0.9939
...,...,...,...,...
7755,0.253,0.680,0.067,-0.9989
7756,0.096,0.846,0.058,-0.9408
7757,0.158,0.788,0.054,-0.9971
7758,0.058,0.842,0.100,0.9877


##### Notes
* Yay! We have sentiments for all the records, lets combine the datasets and write to a file for `Exploratory Data Analytics`

In [94]:
## combining the dataframes
vader_sentiment_analysis = pd.concat([data, sentiment_df], axis=1)

In [95]:
## saving the dataframe
vader_sentiment_analysis

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,...,topic_probability_1,topic_probability_2,topic_probability_3,topic_words_1,topic_words_2,topic_words_3,neg,neu,pos,compound
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,...,0.481921,0.302021,0.059743,"china,chinese,taiwan,united_state,threat,defen...","biden,president,white_house,administration,joe...","organization,letter,position,school_board,empl...",0.028,0.870,0.102,0.9756
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that Rep. Ken C...,14,11,2022,...,0.397262,0.201822,0.176006,"democrats,republican,republicans,democrat,demo...","flight,town,fire,bar,visitor,hall,beer,trip,ro...","voter,poll,vote,ballot,georgia,republican,voti...",0.015,0.910,0.075,0.8176
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,...,0.229805,0.204087,0.188578,"democrats,republican,republicans,democrat,demo...","attack,pelosi,violence,depape,speaker,wing,thr...","senate,vote,democrats,congress,legislation,rep...",0.016,0.889,0.095,0.9680
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,...,0.552282,0.129222,0.062635,"candidate,race,senate,republican,campaign,demo...","trump,president,republican,capitol,donald_trum...","school,education,teacher,theory,district,kid,s...",0.066,0.819,0.115,0.8934
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,...,0.326747,0.253755,0.078065,"police,officer,victim,suspect,murder,incident,...","crime,los_angele,death,safe,city,resident,viol...","kid,mother,daughter,friend,mom,girl,husband,ba...",0.131,0.824,0.044,-0.9939
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7755,https://www.foxnews.com/us/buffalo-mass-shooti...,Adam Sabes,2022-05-15 00:00:00+00:00,Erie County DA says domestic terrorism charges...,Buffalo mass shooting: Erie County DA says dom...,Erie District Attorney John Flynn said during ...,Erie County District Attorney John Flynn indic...,15,5,2022,...,0.467040,0.224623,0.141534,"police,officer,victim,suspect,murder,incident,...","investigation,attorney,evidence,judge,committe...","attack,pelosi,violence,depape,speaker,wing,thr...",0.253,0.680,0.067,-0.9989
7756,https://www.foxnews.com/sports/wimbledon-fan-t...,Paulina Dedaj,2022-08-24 00:00:00+00:00,Wimbledon fan takes legal action after Nick Ky...,Wimbledon fan takes legal action after Nick Ky...,Anna Palus has filed a defamation suit against...,Nick Kyrgios has found himself in another lega...,24,8,2022,...,0.243563,0.220295,0.145837,"trial,charge,count,allegation,spacey,rape,wein...","investigation,attorney,evidence,judge,committe...","transgender,mental_health,age,gender,adult,sea...",0.096,0.846,0.058,-0.9408
7757,https://www.foxnews.com/politics/house-republi...,Chris Pandolfo,2022-11-04 00:00:00+00:00,"House Republicans release 1,000-page report al...","House Republicans release 1,000-page report al...",House Judiciary Committee Republicans have rel...,House Republicans released a new report on Fri...,4,11,2022,...,0.396616,0.167171,0.159419,"investigation,attorney,evidence,judge,committe...","organization,letter,position,school_board,empl...","america,democracy,virginia,conservative,americ...",0.158,0.788,0.054,-0.9971
7758,https://www.foxnews.com/entertainment/whoopi-g...,Ashley Hume,2022-10-19 00:00:00+00:00,Whoopi Goldberg and 'RHOA' alum Claudia Jordan...,Whoopi Goldberg and 'RHOA' alum Claudia Jordan...,"Meghan Markle's claims of feeling ""objectified...",Whoopi Goldberg and ‘Real Housewives of Atlant...,19,10,2022,...,0.187952,0.138135,0.128420,"host,audience,word,guy,idea,friend,viewer,joke...","movie,actor,actress,role,character,hollywood,c...","queen,harry,duke,royal,king,duchess,sussex,pri...",0.058,0.842,0.100,0.9877


In [97]:
## before saving the dataframe lets rename the columns
vader_sentiment_analysis.rename(columns={"neg":"negative", "neu":"neutral", "pos": "positive"}, inplace=True)

In [98]:
vader_sentiment_analysis.head()

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,...,topic_probability_1,topic_probability_2,topic_probability_3,topic_words_1,topic_words_2,topic_words_3,negative,neutral,positive,compound
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,...,0.481921,0.302021,0.059743,"china,chinese,taiwan,united_state,threat,defen...","biden,president,white_house,administration,joe...","organization,letter,position,school_board,empl...",0.028,0.87,0.102,0.9756
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that Rep. Ken C...,14,11,2022,...,0.397262,0.201822,0.176006,"democrats,republican,republicans,democrat,demo...","flight,town,fire,bar,visitor,hall,beer,trip,ro...","voter,poll,vote,ballot,georgia,republican,voti...",0.015,0.91,0.075,0.8176
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,...,0.229805,0.204087,0.188578,"democrats,republican,republicans,democrat,demo...","attack,pelosi,violence,depape,speaker,wing,thr...","senate,vote,democrats,congress,legislation,rep...",0.016,0.889,0.095,0.968
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,...,0.552282,0.129222,0.062635,"candidate,race,senate,republican,campaign,demo...","trump,president,republican,capitol,donald_trum...","school,education,teacher,theory,district,kid,s...",0.066,0.819,0.115,0.8934
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,...,0.326747,0.253755,0.078065,"police,officer,victim,suspect,murder,incident,...","crime,los_angele,death,safe,city,resident,viol...","kid,mother,daughter,friend,mom,girl,husband,ba...",0.131,0.824,0.044,-0.9939


In [99]:
## writing the dataframe to csv
vader_sentiment_analysis.to_csv("../data/apify_text_topics_sentiment_analysis_vader_v1.csv", index=False)

## Performing Sentiment Analysis Using Roberta

### Initialize Model & Tokenizer

In [43]:
## pulling a specific model pretrained on sentiment analysis
task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
# MODEL = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Reading Data 

In [102]:
data = pd.read_csv(input_file)
data.head()

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,...,clean_text,topic_number_1,topic_number_2,topic_number_3,topic_probability_1,topic_probability_2,topic_probability_3,topic_words_1,topic_words_2,topic_words_3
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,...,president biden told reporters monday followin...,28,5.0,4.0,0.481921,0.302021,0.059743,"china,chinese,taiwan,united_state,threat,defen...","biden,president,white_house,administration,joe...","organization,letter,position,school_board,empl..."
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that Rep. Ken C...,14,11,2022,...,the associated press projects that rep ken cal...,45,43.0,46.0,0.397262,0.201822,0.176006,"democrats,republican,republicans,democrat,demo...","flight,town,fire,bar,visitor,hall,beer,trip,ro...","voter,poll,vote,ballot,georgia,republican,voti..."
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,...,house speaker nancy pelosi’s spokesman forcefu...,45,48.0,51.0,0.229805,0.204087,0.188578,"democrats,republican,republicans,democrat,demo...","attack,pelosi,violence,depape,speaker,wing,thr...","senate,vote,democrats,congress,legislation,rep..."
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,...,the fox news decision desk can project that de...,49,36.0,27.0,0.552282,0.129222,0.062635,"candidate,race,senate,republican,campaign,demo...","trump,president,republican,capitol,donald_trum...","school,education,teacher,theory,district,kid,s..."
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,...,four college students were killed around or in...,53,29.0,3.0,0.326747,0.253755,0.078065,"police,officer,victim,suspect,murder,incident,...","crime,los_angele,death,safe,city,resident,viol...","kid,mother,daughter,friend,mom,girl,husband,ba..."


### Sample Testing

In [37]:
sample_row = data.sample(1)
sample_text = sample_row['text'].values[0]
sample_clean_text = sample_row['clean_text'].values[0]

In [38]:
sample_clean_text

'the city of el paso has launched a new migrant data dashboard that gives a glimpse into the extraordinary numbers the border city is seeing as part of the ongoing migrant crisis racking the southwest borderthe migrant situational awareness dashboard available on the city’s website offers a breakdown of the massive numbers being encountered and released into the communityaccording to the data the numbers went from less than releases per week into the community during the summer to over last week the number of migrants in custody has also jumped from fewer than a few weeks ago to over the data show that there are over being released into the community each day on average while the city is providing over meals a day to hungry migrantsfor months el paso has been sounding the alarm about the migrant surge it is facing images have shown migrants including many from venezuela camped out on the streets as shelters and facilities have been overwhelmedmeanwhile the city which has a democratic m

In [39]:
sample_text

'The city of El Paso has launched a new migrant data dashboard that gives a glimpse into the extraordinary numbers the border city is seeing as part of the ongoing migrant crisis racking the southwest border.\n\nThe Migrant Situational Awareness Dashboard, available on the city’s website, offers a breakdown of the massive numbers being encountered and released into the community.\n\nAccording to the data, the numbers went from less than 1,700 releases per week into the community during the summer to over 6,800 last week. The number of migrants in  custody has also jumped from fewer than 3,000 a few weeks ago to over 4,500.\n\nThe data show that there are over 1,000 being released into the community each day on average, while the city is providing over 900 meals a day to hungry migrants.\n\nFor months, El Paso has been sounding the alarm about the migrant surge it is facing. Images have shown migrants, including many from Venezuela, camped out on the streets as shelters and facilities h

In [30]:
import re
## lets remove the all caps place holder text
clean_text = re.sub(r'\b[A-Z]+\b', '', sample_text)

In [44]:
sentiments = predict_sentiment_using_roberta(sample_text, model, tokenizer)
sentiments

SequenceClassifierOutput(loss=None, logits=tensor([[-0.7919,  1.6844, -1.0441],
        [-1.1643,  1.6180, -0.6690]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


array([0.06316391, 0.86477184, 0.07206428], dtype=float32)

In [45]:
sentiments = predict_sentiment_using_roberta(sample_clean_text, model, tokenizer)
sentiments

SequenceClassifierOutput(loss=None, logits=tensor([[-1.1675,  1.6963, -0.7131],
        [-2.3141,  1.4862,  0.6135]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


array([0.03263433, 0.78313947, 0.18422621], dtype=float32)

###### Notes
* So it seems removing placeholder caps lock does have an effect on sentiment analysis.
* Also clean text has some effect in prediction, but am not sure if the prediction improves in accuracy or not. 
* For now we'll proceed with raw text as it also contains punctuations and expressions

In [51]:
## lets run the sentiment analysis on the entire dataset
def map_sentiment_analysis(row):
    text = sample_row['text'].values[0]
    sentiments = predict_sentiment_using_roberta(text, model, tokenizer)
    return pd.Series({
        "negative": sentiments[0],
        "neutral": sentiments[1],
        "positive": sentiments[2]
    })

In [56]:
sentiments = data.apply(map_sentiment_analysis, axis=1)

In [59]:
sentiments.describe()

Unnamed: 0,negative,neutral,positive
count,7760.0,7760.0,7760.0
mean,0.063164,0.864772,0.072064
std,0.0,0.0,0.0
min,0.063164,0.864772,0.072064
25%,0.063164,0.864772,0.072064
50%,0.063164,0.864772,0.072064
75%,0.063164,0.864772,0.072064
max,0.063164,0.864772,0.072064


###### Notes
* For some reason, all the values I got in sentiment are some. 
* I think there is some bug in the logic to find break down the long test and finding average. 
* For now we'll skip this model and focus on `Vader` analysis. 