# Sentiment analysis

In this notebook, three sentiment analysis techniques will be applied:
1. VADER (Valence Aware Dictionary and sEntiment Reasoner) - bag-of-words approach
2. RoBERTa pretrained model from Hugging Face
3. Hugging Face pipeline

## Read in data and `NLTK` basics

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import webvtt

plt.style.use('ggplot')

import nltk

In [3]:
# Path to your .vtt file
vtt_file = '/Users/nico/Desktop/Projects/word_analyses_DW/code/youtube_output/¿Cómo podría el conflicto entre Israel y Hamás ampliarse en Medio Oriente？.es.vtt'

vtt = webvtt.read(vtt_file)

# print the content in WebVTT format
print(vtt.content)

WEBVTT

00:00:02.230 --> 00:00:02.240
todo el mundo habla del riesgo de una

00:00:02.240 --> 00:00:03.909
todo el mundo habla del riesgo de una
escalada<00:00:02.639><c> en</c><00:00:02.800><c> medio</c><00:00:03.080><c> oriente</c><00:00:03.719><c> las</c>

00:00:03.909 --> 00:00:03.919
escalada en medio oriente las

00:00:03.919 --> 00:00:05.950
escalada en medio oriente las
iniciativas<00:00:04.560><c> diplomáticas</c><00:00:05.480><c> ya</c><00:00:05.600><c> se</c><00:00:05.759><c> han</c>

00:00:05.950 --> 00:00:05.960
iniciativas diplomáticas ya se han

00:00:05.960 --> 00:00:08.190
iniciativas diplomáticas ya se han
activado<00:00:06.440><c> e</c><00:00:06.680><c> intentan</c><00:00:07.200><c> frenéticamente</c>

00:00:08.190 --> 00:00:08.200
activado e intentan frenéticamente

00:00:08.200 --> 00:00:10.350
activado e intentan frenéticamente
detener<00:00:08.760><c> la</c><00:00:08.880><c> propagación</c><00:00:09.400><c> del</c><00:00:09.599><c> conflicto</c>

00:00:10.350 -->

## Data cleaning

The subtitle input file needs to be transformed into a clean body of text. For this:
+ Time marks need to be removed.
+ Duplicated lines need to be dropped, since each caption line appears up to three times.
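
The two cleaning steps can be sketched with the standard library alone, using a few cues copied from the output shown earlier. This is only an illustration of the idea: the helper name `clean_vtt_text` and the two regular expressions are assumptions, not part of `webvtt`.

```python
import re

# Sample cues copied from the rolling-caption output shown earlier
SAMPLE_VTT = """WEBVTT

00:00:02.230 --> 00:00:02.240
todo el mundo habla del riesgo de una

00:00:02.240 --> 00:00:03.909
todo el mundo habla del riesgo de una
escalada<00:00:02.639><c> en</c><00:00:02.800><c> medio</c><00:00:03.080><c> oriente</c><00:00:03.719><c> las</c>

00:00:03.909 --> 00:00:03.919
escalada en medio oriente las
"""

# Hypothetical patterns: cue timing lines and inline tags such as <c> or <00:00:02.639>
TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} -->")
INLINE_TAG = re.compile(r"<[^>]+>")


def clean_vtt_text(raw):
    """Drop the header, timing lines and inline tags, then keep each line once."""
    seen = set()
    kept = []
    for line in raw.splitlines():
        line = INLINE_TAG.sub("", line).strip()
        if not line or line == "WEBVTT" or TIMING_LINE.match(line):
            continue
        if line not in seen:  # skip lines already collected
            seen.add(line)
            kept.append(line)
    return " ".join(kept)


cleaned = clean_vtt_text(SAMPLE_VTT)
print(cleaned)
```

Removing duplicates with a set also drops a line that is legitimately repeated later in the video, which is a trade-off of this simple approach.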

In [143]:
# Path to your .vtt file
vtt_file = '/Users/nico/Desktop/Projects/Youtube-project/data/en_DW/‘Beyond worst case scenario’ The situation on the ground in Gaza  DW News.en.vtt'

# Open and extract text from the .vtt file, remove duplicates
text_output = []
seen_lines = set()  # Set to keep track of unique lines
for caption in webvtt.read(vtt_file):
    for line in caption.text.splitlines():
        if line not in seen_lines:  # Check for duplicates
            text_output.append(line)
            seen_lines.add(line)

# Join the unique extracted text into a single string variable
text = ' '.join(text_output)
text

"right turning now to Israel's war on Hamas Israel says its Army carried out a large ground incursion in the Gaza Strip overnight to attack Hamas positions the military described it as its biggest incursion into Gaza yet the United Nations humanitarian coordinator for the Palestinian territories has warned that nowhere in Gaza is safe as Israel steps up its preparations for an expected ground Invasion Israeli tanks and Northern Gaza setting the stage for Israel's next phase of combat a full ground war has not started yet but air strikes are hitting the Gaza Strip nonstop theyve already reduced entire neighborhoods to Rubble because of the bombing destruction and killing I came to seek shelter inside this camp with nine members of my family in two cars I had to sleep out in the open in the heat for 10 days to get a tent Israel says it is striking targets of the islamist militant group Hamas considered a terrorist organization by many Western governments according to the UN about 1.4 mil

## Tokenize the text

I will use a pre-trained ML model, which tokenizes better than a rule-based tokenizer.

### Setup for tokenization

The setup is environment dependent (i.e. it varies with your operating system and hardware). To set it up, follow https://spacy.io/usage

In [None]:
# Setup [It will take some time to run the installation!]
!pip install -U pip setuptools wheel
!pip install -U 'spacy[apple]'
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf



### Tokenizing

In [15]:
import spacy

In [19]:
nlp = spacy.load("en_core_web_trf")  # transformer-based model (not the small one)



In [22]:
%%time
doc = nlp(text)


CPU times: user 3.53 s, sys: 1.34 s, total: 4.87 s
Wall time: 913 ms


In [25]:
for token in doc[:10]:
    print(token)

right
turning
now
to
Israel
's
war
on
Hamas
Israel


#### Token exploratory data analysis

> There are several attributes; for more detail see https://spacy.io/api/token#attributes

After running the previous loop, the variable `token` still holds the last element of the loop, in this case "Israel".

In [26]:
token.text

'Israel'

In [28]:
token.left_edge

Israel

In [29]:
token.right_edge

Israel

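`left_edge` and `right_edge` return the leftmost and rightmost tokens of the token's syntactic subtree. Both returning the token itself indicates that no other words depend on it, so the word stands by itself.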

In [31]:
token.ent_type_

'GPE'

'GPE' indicates a geopolitical entity (countries, cities, states).

In [34]:
doc[8].ent_type_

'ORG'

'Hamas' (`doc[8]`) was classified as an Organization

In [37]:
doc[1].lemma_  # doc[1] is 'turning'; lemma_ returns the base (dictionary) form of the word

'turn'

In [38]:
token.morph # Morphological analysis.

Number=Sing

In [40]:
doc[1].morph # Morphological analysis of 'turning'

Aspect=Prog|Tense=Pres|VerbForm=Part

In [44]:
token.pos_ # stands for part of speech

'PROPN'

### Part-of-speech tagging

In [45]:
for token in doc[:20]:
    print(token.text, token.pos_, token.dep_)

right ADV advmod
turning VERB advcl
now ADV advmod
to ADP prep
Israel PROPN poss
's PART case
war NOUN pobj
on ADP prep
Hamas PROPN pobj
Israel PROPN nsubj
says VERB ccomp
its PRON poss
Army PROPN nsubj
carried VERB ccomp
out ADP prt
a DET det
large ADJ amod
ground NOUN compound
incursion NOUN dobj
in ADP prep
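
For a quick overview of the tagging, the part-of-speech labels printed above can be tallied. This is a minimal sketch with the pairs copied by hand from that output; in a live session one would iterate over `doc[:20]` instead. The variable names `tagged` and `pos_counts` are only illustrative.

```python
from collections import Counter

# (text, pos) pairs copied by hand from the output above
tagged = [
    ("right", "ADV"), ("turning", "VERB"), ("now", "ADV"), ("to", "ADP"),
    ("Israel", "PROPN"), ("'s", "PART"), ("war", "NOUN"), ("on", "ADP"),
    ("Hamas", "PROPN"), ("Israel", "PROPN"), ("says", "VERB"), ("its", "PRON"),
    ("Army", "PROPN"), ("carried", "VERB"), ("out", "ADP"), ("a", "DET"),
    ("large", "ADJ"), ("ground", "NOUN"), ("incursion", "NOUN"), ("in", "ADP"),
]

# Count how often each part-of-speech label occurs
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in pos_counts.most_common():
    print(pos, count)
```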


In [49]:
from spacy import displacy
displacy.render(doc[:20], style='dep')

Did the model clearly delimit the different sentences? (The text has no punctuation.)

In [50]:
displacy.render(doc, style="ent")

### Word Vectors (or word embeddings)

Word vectors are numerical representations of words. They are a good way to tell a computer how close different words are to each other. They also give context about the meaning of a word inside a sentence, for example. They capture both syntactic and semantic meaning (ask Lorenz to explain this).
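
How "close" two words are is usually measured with the cosine similarity of their vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; the vectors shipped with spaCy models have hundreds of dimensions and different values.

```python
import math

# Made-up 3-dimensional vectors for illustration; not taken from any real model
toy_vectors = {
    "war": [0.9, 0.1, 0.0],
    "conflict": [0.8, 0.2, 0.1],
    "tent": [0.0, 0.3, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


war_conflict = cosine_similarity(toy_vectors["war"], toy_vectors["conflict"])
war_tent = cosine_similarity(toy_vectors["war"], toy_vectors["tent"])
print(war_conflict, war_tent)
```

With these toy values, "war" ends up much closer to "conflict" than to "tent", which is the kind of relationship real embeddings are trained to capture.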

In [None]:
!python -m spacy download en_core_web_md

In [59]:
nlp = spacy.load('en_core_web_md')

In [60]:
%%time
# tokenize again with the medium model
doc = nlp(text)

CPU times: user 122 ms, sys: 47.8 ms, total: 170 ms
Wall time: 270 ms


In [66]:

sentence1 = list(doc.sents)[0]
sentence1  # the text is not really split into sentences

right turning now to Israel's war on Hamas Israel says its Army carried out a large ground incursion in the Gaza Strip overnight to attack Hamas positions the military described it as its biggest incursion into Gaza yet the United Nations humanitarian coordinator for the Palestinian territories has warned that nowhere in Gaza is safe as Israel steps up its preparations for an expected ground Invasion Israeli tanks and Northern Gaza setting the stage for Israel's next phase of combat a full ground war has not started yet but air strikes are hitting the Gaza Strip nonstop theyve already reduced entire neighborhoods to Rubble because of the bombing destruction and killing I came to seek shelter inside this camp with nine members of my family in two cars I had to sleep out in the open in the heat for 10 days to get a tent Israel says it is striking targets of the islamist militant group Hamas considered a terrorist organization by many Western governments according to the UN about 1.4 mill

#### We can look for the closest words to a defined word

In [65]:
import numpy as np

your_word = "war"

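# most_similar returns a (keys, best_rows, scores) tuple for the n nearest vectors
ms = nlp.vocab.vectors.most_similar(
    np.array([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity scores (higher means closer)
words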

['war-',
 'dewars',
 'battlecry',
 'conflictual',
 'War-',
 'Zelyony',
 'Bioinvasions',
 'battlefronts',
 'hellion',
 'militares']

#### We can compare how close two documents are

##### Load several text files

In [119]:
# Open and extract text from the .vtt file, remove duplicates
def clean_subtitles(vtt_file):
    text_output = []
    seen_lines = set()  # Set to keep track of unique lines
    for caption in webvtt.read(vtt_file):
        for line in caption.text.splitlines():
            if line not in seen_lines:  # Check for duplicates
                text_output.append(line)
                seen_lines.add(line)

    # Join the unique extracted text into a single string variable
    text = ' '.join(text_output)
    return text

In [68]:
# list all subtitle files in a folder

# function to scan files
from pathlib import Path

def get_files_with_extension(folder_location, file_extension):
    # Ensure the file extension starts with a dot
    if not file_extension.startswith('.'):
        file_extension = '.' + file_extension
    
    # Use pathlib to get all files with the specified extension
    files = list(Path(folder_location).rglob(f'*{file_extension}'))
    
    return [str(file) for file in files]

In [None]:
# Path to the folder containing your .vtt files
dataset_location = '/Users/nico/Desktop/Projects/Youtube-project/data/en_DW'
file_extension = "vtt" # for subtitles
files_list = get_files_with_extension(dataset_location, file_extension)
files_list

In [120]:
# Loading, cleaning and tokenizing in one step
text1 = clean_subtitles(files_list[0])
text2 = clean_subtitles(files_list[1])
text3 = clean_subtitles(files_list[2])

doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)

#### Compare files

In [122]:
doc1.similarity(doc2)


0.9780748129347001

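When comparing two whole documents, they score as almost identical (about 0.98, though not quite 100%). This is probably related to their length and their similar topic: a document vector is the average of its word vectors, so long texts on the same subject end up very close together.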
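
The effect of length can be illustrated with a toy example. spaCy's default document vector is an average of the token vectors, so words shared by both documents dominate the result. The two-dimensional vectors below are made up for illustration, and `document_vector` is a hypothetical helper, not a spaCy function.

```python
import math

# Made-up 2-dimensional word vectors, for illustration only
toy_vectors = {
    "the": [1.0, 1.0],
    "war": [1.0, -1.0],
    "peace": [-1.0, 1.0],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def document_vector(tokens):
    """Average the word vectors component by component."""
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Two "documents" that differ only in one opposite-meaning word
doc_a = ["the"] * 20 + ["war"]
doc_b = ["the"] * 20 + ["peace"]

word_level = cosine_similarity(toy_vectors["war"], toy_vectors["peace"])
doc_level = cosine_similarity(document_vector(doc_a), document_vector(doc_b))
print(word_level, doc_level)
```

Even though the two distinguishing words point in opposite directions, the averaged document vectors are nearly parallel, which is consistent with the very high score observed above.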

In [123]:
list(doc.sents)[0].similarity(list(doc.sents)[2])

0.9768521785736084

Sentences are also scored as very similar, but not identical.

### Sentiment analysis (`TextBlob`)

When analysing sentiment with `TextBlob`, the `sentiment` property returns a namedtuple of the form `Sentiment(polarity, subjectivity)`. The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

In [None]:
!pip install textblob

In [133]:
import spacy
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):

    # Tokenize the text using spaCy
    doc = nlp(text)

    # Keep the lemmas of alphabetic tokens that are not stop words
    filtered_tokens = [token.lemma_ for token in doc if not token.is_stop and token.text.isalpha()]

    # Join the tokens back into a string
    preprocessed_text = " ".join(filtered_tokens)
    return preprocessed_text

# Cleaning up the text (pass the raw string, not the already-parsed `doc3`)
preprocessed_text = preprocess_text(text3)

# Create an instance of TextBlob with the cleaned text
blob = TextBlob(preprocessed_text)

# Getting the polarity of text which is between -1 (negative) and 1 (positive)
polarity = blob.sentiment.polarity

if polarity > 0.3:
    print("Positive")
elif polarity < -0.3:
    print("Negative")
else:
    print("Neutral")

Neutral


When using `TextBlob`, everything seems to be classified as neutral. Is that related to the journalistic editorial style?

#### VADER sentiment scoring

We will use NLTK's `SentimentIntensityAnalyzer` to get the neg/neu/pos scores of the text.

This uses a "bag of words" approach:
1. Stop words are removed.
2. Each word is scored, and the scores are combined into a total score.
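
The "score each word and combine" idea can be sketched as follows. The mini-lexicon and its valences are invented for illustration and are not VADER's real lexicon; the sketch also leaves out VADER's extra rules (negation, intensifiers, punctuation). The `normalize` step follows the formula VADER uses for its compound score, `score / sqrt(score**2 + alpha)` with `alpha = 15`.

```python
import math

# Hypothetical mini-lexicon with made-up valences; VADER's real lexicon is far larger
TOY_LEXICON = {"war": -2.9, "killing": -3.4, "destruction": -3.0, "safe": 1.9, "shelter": 1.0}


def normalize(score, alpha=15):
    """Squash an unbounded sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


def toy_compound(text):
    """Sum the valence of every known word, then normalize the total."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return normalize(total)


short_text = "war destruction"
long_text = " ".join([short_text] * 10)  # same wording repeated ten times

short_score = toy_compound(short_text)
long_score = toy_compound(long_text)
print(round(short_score, 4), round(long_score, 4))
```

Because the raw sum keeps growing with the number of scored words, a long text with a consistent tone is pushed towards -1 or 1. This may explain why whole transcripts receive compound scores close to the extremes.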

In [95]:
!pip install nltk



In [128]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/nico/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [145]:
sia.polarity_scores(text) # it uses the raw text as input, not tokenized

{'neg': 0.102, 'neu': 0.838, 'pos': 0.06, 'compound': -0.9989}

In [146]:
sia.polarity_scores(text1) # it uses the raw text as input, not tokenized

{'neg': 0.074, 'neu': 0.867, 'pos': 0.058, 'compound': -0.9588}

In [131]:
sia.polarity_scores(text2) # it uses the raw text as input, not tokenized

{'neg': 0.119, 'neu': 0.811, 'pos': 0.069, 'compound': -0.9984}

In [132]:
sia.polarity_scores(text3) # it uses the raw text as input, not tokenized

{'neg': 0.102, 'neu': 0.866, 'pos': 0.032, 'compound': -0.9665}

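> 'neu' dominates in every text, with different levels of 'neg', 'pos' and 'compound'. By proportions, as neutral as Switzerland, although the 'compound' score is strongly negative in all four cases.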
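
To turn the compound score into a label, a common convention (the thresholds suggested in VADER's documentation) is to call a text positive at 0.05 or above and negative at -0.05 or below. The sketch below applies that convention to the four compound scores printed above; the function name `label_from_compound` is only illustrative.

```python
# Compound scores copied from the four outputs above
compound_scores = {"text": -0.9989, "text1": -0.9588, "text2": -0.9984, "text3": -0.9665}


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label using a symmetric threshold."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


labels = {name: label_from_compound(score) for name, score in compound_scores.items()}
print(labels)
```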

#### RoBERTa pretrained model
+ Uses a model trained on a large corpus of data.
+ A transformer model accounts for the words but also for the context given by the other words.

In [None]:
!pip install transformers
!pip install scipy

In [157]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [160]:
# Hugging Face provides several pre-trained models for classification
MODEL = "cardiffnlp/xlm-roberta-base-tweet-sentiment-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [147]:
text

"right turning now to Israel's war on Hamas Israel says its Army carried out a large ground incursion in the Gaza Strip overnight to attack Hamas positions the military described it as its biggest incursion into Gaza yet the United Nations humanitarian coordinator for the Palestinian territories has warned that nowhere in Gaza is safe as Israel steps up its preparations for an expected ground Invasion Israeli tanks and Northern Gaza setting the stage for Israel's next phase of combat a full ground war has not started yet but air strikes are hitting the Gaza Strip nonstop theyve already reduced entire neighborhoods to Rubble because of the bombing destruction and killing I came to seek shelter inside this camp with nine members of my family in two cars I had to sleep out in the open in the heat for 10 days to get a tent Israel says it is striking targets of the islamist militant group Hamas considered a terrorist organization by many Western governments according to the UN about 1.4 mil

In [177]:
sia.polarity_scores(text1[:2394])


{'neg': 0.094, 'neu': 0.841, 'pos': 0.065, 'compound': -0.876}

In [187]:
# Run for Roberta Model
# The model accepts a limited number of subword tokens (typically 512), which are neither
# words nor characters; the character slice below is only a manual way of staying under that limit
encoded_text = tokenizer(text1[:2285], return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores_dict = {
    'roberta_neg' : scores[0],
    'roberta_neu' : scores[1],
    'roberta_pos' : scores[2]
}
scores_dict

{'roberta_neg': 0.9649233,
 'roberta_neu': 0.030621981,
 'roberta_pos': 0.004454721}
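
The `softmax` step is what turns the raw model outputs (logits) into scores that sum to 1. A minimal standard-library sketch of the same computation, using hypothetical logits rather than the model's actual values:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order (negative, neutral, positive)
example_logits = [3.0, -0.5, -2.4]
probs = softmax(example_logits)
print(dict(zip(["neg", "neu", "pos"], probs)))
```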

In [188]:
def polarity_scores_roberta(example):
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return scores_dict
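
Since whole transcripts are longer than the model's input limit, one possible workaround is to score the text in pieces and average the results. The sketch below is an assumption, not part of the notebook: `chunk_words`, `score_long_text` and the word budget are hypothetical, and the scorer used in the demo is a stand-in so the example runs without the model. In practice `polarity_scores_roberta` would be passed as `scorer`, with a word budget well under the token limit because one word often maps to several subword tokens.

```python
def chunk_words(text, max_words=300):
    """Split a text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def score_long_text(text, scorer, max_words=300):
    """Score every chunk with `scorer` and average the scores per key."""
    results = [scorer(chunk) for chunk in chunk_words(text, max_words)]
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


def stand_in_scorer(chunk):
    # Stand-in for polarity_scores_roberta; returns fixed, made-up scores
    return {"roberta_neg": 0.5, "roberta_neu": 0.25, "roberta_pos": 0.25}


demo_text = "one two three four five six seven"
demo_chunks = chunk_words(demo_text, max_words=3)
demo_scores = score_long_text(demo_text, stand_in_scorer, max_words=3)
print(demo_chunks, demo_scores)
```

Alternatively, passing `truncation=True` and `max_length=512` to the tokenizer keeps only the beginning of the text, which is simpler but ignores everything after the cut.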