# Sentiment Analysis Of News Articles

* The purpose of this notebook is to run sentiment analysis on news articles and update the dataset with the results. 
* We've decided to use a pre-trained model for this because, 
    * The time constraint for this project, blocks us from training a custom sentiment analysis model. 
    * Would like to take this opportunity to learn how to use pre-trained models and transfer learning. 
* We are going to use a pretrained model, which was trained for sentiment analysis on twitter.
* All the research and prototyping on how to use this model was done in a spearate notebook `poc_sentiment_analysis`

## Installations

In [1]:
# # ## installing required libraries
# ! pip install beautifulsoup4
# ! pip install pandas
# ! pip install numpy
# ! pip install plotly
# ! pip install nbformat
# ! pip install ipykernel
# ! pip install matplotlip
# ! pip install wordcloud
# ! pip install gensim
# ! pip install pyLDAvis
# ! pip install nltk
# ! pip install -U pip setuptools wheel
# ! pip install -U spacy
# ! python -m spacy download en_core_web_trf 
# ! python -m spacy download en_core_web_md
# ! pip install joblib
# ! pip install tqdm
! pip install transformers
! pip install torch




[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip






[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports

In [2]:
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from scipy.special import softmax

import pandas as pd
import torch


  from .autonotebook import tqdm as notebook_tqdm


## Utility Functions

In [42]:
## helper function to process the long text and break it into smaller chunks
def process_long_text(text, tokenizer):
    tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")
    # define target chunksize
    chunksize = 512

    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))


    # loop through each chunk
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]), mask_chunks[i], torch.tensor([1])
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len)
            ])
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    input_dict = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.int()
    }
    return input_dict


## helper function to predict the setiment for the given text
def predict_sentiment(text, model, tokenizer):
    input_dict = process_long_text(text, tokenizer)
    outputs = model(**input_dict)
    probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    probs = probs.mean(dim=0)
    probs = probs.detach().numpy()
    return probs

## Performing Sentiment Analysis

### Initialize Model & Tokenizer

In [43]:
## pulling a specific model pretrained on sentiment analysis
task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
# MODEL = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Reading Data 

In [26]:
data = pd.read_csv("../data/apify_text_topics_v2.csv")
data.head()

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,...,clean_text,topic_number_1,topic_number_2,topic_number_3,topic_probability_1,topic_probability_2,topic_probability_3,topic_words_1,topic_words_2,topic_words_3
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,...,president biden told reporters monday followin...,28,5.0,4.0,0.481921,0.302021,0.059743,"china,chinese,taiwan,united_state,threat,defen...","biden,president,white_house,administration,joe...","organization,letter,position,school_board,empl..."
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that Rep. Ken C...,14,11,2022,...,the associated press projects that rep ken cal...,45,43.0,46.0,0.397262,0.201822,0.176006,"democrats,republican,republicans,democrat,demo...","flight,town,fire,bar,visitor,hall,beer,trip,ro...","voter,poll,vote,ballot,georgia,republican,voti..."
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,...,house speaker nancy pelosi’s spokesman forcefu...,45,48.0,51.0,0.229805,0.204087,0.188578,"democrats,republican,republicans,democrat,demo...","attack,pelosi,violence,depape,speaker,wing,thr...","senate,vote,democrats,congress,legislation,rep..."
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,...,the fox news decision desk can project that de...,49,36.0,27.0,0.552282,0.129222,0.062635,"candidate,race,senate,republican,campaign,demo...","trump,president,republican,capitol,donald_trum...","school,education,teacher,theory,district,kid,s..."
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,...,four college students were killed around or in...,53,29.0,3.0,0.326747,0.253755,0.078065,"police,officer,victim,suspect,murder,incident,...","crime,los_angele,death,safe,city,resident,viol...","kid,mother,daughter,friend,mom,girl,husband,ba..."


### Sample Testing

In [37]:
sample_row = data.sample(1)
sample_text = sample_row['text'].values[0]
sample_clean_text = sample_row['clean_text'].values[0]

In [38]:
sample_clean_text

'the city of el paso has launched a new migrant data dashboard that gives a glimpse into the extraordinary numbers the border city is seeing as part of the ongoing migrant crisis racking the southwest borderthe migrant situational awareness dashboard available on the city’s website offers a breakdown of the massive numbers being encountered and released into the communityaccording to the data the numbers went from less than releases per week into the community during the summer to over last week the number of migrants in custody has also jumped from fewer than a few weeks ago to over the data show that there are over being released into the community each day on average while the city is providing over meals a day to hungry migrantsfor months el paso has been sounding the alarm about the migrant surge it is facing images have shown migrants including many from venezuela camped out on the streets as shelters and facilities have been overwhelmedmeanwhile the city which has a democratic m

In [39]:
sample_text

'The city of El Paso has launched a new migrant data dashboard that gives a glimpse into the extraordinary numbers the border city is seeing as part of the ongoing migrant crisis racking the southwest border.\n\nThe Migrant Situational Awareness Dashboard, available on the city’s website, offers a breakdown of the massive numbers being encountered and released into the community.\n\nAccording to the data, the numbers went from less than 1,700 releases per week into the community during the summer to over 6,800 last week. The number of migrants in  custody has also jumped from fewer than 3,000 a few weeks ago to over 4,500.\n\nThe data show that there are over 1,000 being released into the community each day on average, while the city is providing over 900 meals a day to hungry migrants.\n\nFor months, El Paso has been sounding the alarm about the migrant surge it is facing. Images have shown migrants, including many from Venezuela, camped out on the streets as shelters and facilities h

In [30]:
import re
## lets remove the all caps place holder text
clean_text = re.sub(r'\b[A-Z]+\b', '', sample_text)

In [44]:
sentiments = predict_sentiment(sample_text, model, tokenizer)
sentiments

SequenceClassifierOutput(loss=None, logits=tensor([[-0.7919,  1.6844, -1.0441],
        [-1.1643,  1.6180, -0.6690]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


array([0.06316391, 0.86477184, 0.07206428], dtype=float32)

In [45]:
sentiments = predict_sentiment(sample_clean_text, model, tokenizer)
sentiments

SequenceClassifierOutput(loss=None, logits=tensor([[-1.1675,  1.6963, -0.7131],
        [-2.3141,  1.4862,  0.6135]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


array([0.03263433, 0.78313947, 0.18422621], dtype=float32)

###### Notes
* So it seems removing placeholder caps lock does have an effect on sentiment analysis.