<a href="https://colab.research.google.com/github/felipemaiapolo/zelenskyy_speeches/blob/main/extracting_sentiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting sentiments from texts

In [None]:
%%capture
!pip install --upgrade --force-reinstall numpy=='1.20.3' pandas=='1.3.3' gensim=='4.2.0' wordcloud=='1.8.2.2' tqdm=='4.62.2' fsspec=='2022.5.0'

## Setting up

Loading packages (you may need to restart the notebook runtime after installing them):

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import gensim.downloader as api
from wordcloud import WordCloud
import string
import json

Some useful functions:

In [None]:
#Tokenizer
def tokenize(txt):
    txt=txt.lower()
    txt=txt.replace('’s','')
    txt=txt.replace('“','')
    txt=txt.replace('”','')
    txt=txt.replace('...','')
    txt=txt.translate(str.maketrans('', '', string.punctuation))
    txt=txt.split(' ')
    return txt

#Function that gets the sentiment of a specific word (relies on the global `model` loaded further below)
def get_sentiment(word, pos, neg):
    return np.sum([model.similarity(word, p) for p in pos])-np.sum([model.similarity(word, n) for n in neg])
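
To make the tokenizer's behaviour concrete, here is a small illustrative run. The sample strings are chosen for this sketch, and the block repeats the `tokenize` steps so that it runs on its own. Note that splitting on a single space keeps an empty string wherever two spaces occur in a row.

```python
import string

def tokenize(txt):
    # Same steps as the tokenizer above
    txt = txt.lower()
    txt = txt.replace('’s', '')
    txt = txt.replace('“', '')
    txt = txt.replace('”', '')
    txt = txt.replace('...', '')
    txt = txt.translate(str.maketrans('', '', string.punctuation))
    return txt.split(' ')

# ASCII punctuation is stripped and everything is lowercased
tokens = tokenize("Good health to you, fellow Ukrainians!")
print(tokens)        # ['good', 'health', 'to', 'you', 'fellow', 'ukrainians']

# The curly possessive and curly quotes are removed explicitly
possessive = tokenize("Ukraine’s “heroes”")
print(possessive)    # ['ukraine', 'heroes']

# Two consecutive spaces leave an empty token behind
double_space = tokenize("peace  now")
print(double_space)  # ['peace', '', 'now']
```

Empty tokens are unlikely to appear in an embedding vocabulary, so they would normally be dropped by any later vocabulary filter.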

Defining the lists of positive and negative words:

In [None]:
pos=['good', 'excellent', 'correct', 'best', 'happy', 'positive', 'fortunate']     
neg=['bad','terrible', 'wrong', 'worst', 'disappointed', 'negative', 'unfortunate']
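
The score returned by `get_sentiment` is the sum of a word's similarities to the positive words minus the sum of its similarities to the negative words, where gensim's `similarity` is cosine similarity. The sketch below reproduces that arithmetic with invented 2-D vectors; `toy_vectors`, `cosine` and `toy_sentiment` are hypothetical names used only for illustration and are not part of the notebook or of gensim.

```python
import numpy as np

# Invented 2-D "embeddings" (assumption: real vectors have 300 dimensions)
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "great": np.array([0.8, 0.6]),
    "awful": np.array([-0.6, 0.8]),
}

def cosine(u, v):
    # Cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_sentiment(word, pos, neg):
    # Sum of similarities to positive anchors minus sum of similarities to negative anchors
    sim_pos = sum(cosine(toy_vectors[word], toy_vectors[p]) for p in pos)
    sim_neg = sum(cosine(toy_vectors[word], toy_vectors[n]) for n in neg)
    return sim_pos - sim_neg

score_great = toy_sentiment("great", ["good"], ["bad"])  # 0.8 - (-0.8)
score_awful = toy_sentiment("awful", ["good"], ["bad"])  # -0.6 - 0.6
print(score_great, score_awful)
```

Because the two lists above have the same length (seven words each), the score is centred around zero: a word equally similar to both lists gets a score close to 0.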

## Extracting sentiments from speeches

Loading data:

In [None]:
sp = pd.read_csv('https://github.com/felipemaiapolo/zelenskyy_speeches/raw/main/data/zelensky_speeches.csv')
sp.date = pd.to_datetime(sp.date)
sp.head()

| Unnamed: 0.1 | Unnamed: 0 | url | date | title | text |
|---|---|---|---|---|---|
| 0 | 0 | https://www.president.gov.ua/en/news/den-ukray... | 2022-07-24 22:02:00 | The Day of Ukrainian Statehood on July 28 will... | Good health to you, fellow Ukrainians! An impo... |
| 1 | 1 | https://www.president.gov.ua/en/news/zsu-krok-... | 2022-07-23 23:42:00 | Armed Forces of Ukraine advancing step by step... | Dear Ukrainian men and women! The one hundred ... |
| 2 | 2 | https://www.president.gov.ua/en/news/vijna-ne-... | 2022-07-23 19:04:00 | The war did not break Ukraine and will not bre... | Dear First Ladies and Gentlemen! Dear Ladies a... |
| 3 | 3 | https://www.president.gov.ua/en/news/rosiya-ro... | 2022-07-22 22:28:00 | Russia did everything to destroy Ukraine... | Dear Ukrainian men and women! Dear all our def... |
| 4 | 4 | https://www.president.gov.ua/en/news/mayemo-is... | 2022-07-21 22:25:00 | We have a significant potential for the advanc... | Good health to you, fellow Ukrainians! Today, ... |


Checking embeddings models available in Gensim:

In [None]:
info = api.info()
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:],
        )
    )
    print('\n\n\n')

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.




conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.




fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).




glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B twee

We are going to load **"glove-wiki-gigaword-300"**:

In [None]:
model = api.load("glove-wiki-gigaword-300")



Getting the model vocabulary:

In [None]:
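#A set gives constant-time membership tests; a list would be scanned in full for every `w in vocab_model` check
vocab_model=set(model.key_to_index)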

Getting the vocabulary used in the texts:

In [None]:
vocab=[]

for txt in tqdm(sp.text.tolist()):
    vocab+=tokenize(txt)

vocab=set(vocab)

100%|██████████| 273/273 [00:00<00:00, 1820.29it/s]


Now, we get the sentiments for every word in `vocab` if that word is in `vocab_model`:

In [None]:
sentiments=[get_sentiment(w, pos, neg) for w in tqdm(vocab) if w in vocab_model]
vocab=[w for w in tqdm(vocab) if w in vocab_model]

100%|██████████| 10363/10363 [03:36<00:00, 47.86it/s]
100%|██████████| 10363/10363 [03:31<00:00, 49.05it/s]
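
In the recorded run above, the two progress bars take nearly the same time even though only the first one computes similarities, which suggests that the `w in vocab_model` test, rather than the similarity computation, dominated the runtime. A minimal sketch of why the container type matters for membership tests, using synthetic data and hypothetical names (`model_words`, `queries`):

```python
import time

# Synthetic stand-ins (assumption): a large "model vocabulary" and some query tokens
model_words = [f"w{i}" for i in range(100_000)]
queries = [f"w{i}" for i in range(99_850, 100_000)] + ["missing"] * 20

as_list = list(model_words)
as_set = set(model_words)

start = time.perf_counter()
kept_list = [w for w in queries if w in as_list]  # scans the list for every query
list_seconds = time.perf_counter() - start

start = time.perf_counter()
kept_set = [w for w in queries if w in as_set]    # one hash lookup per query
set_seconds = time.perf_counter() - start

print(f"list: {list_seconds:.4f}s  set: {set_seconds:.6f}s")
```

Both filters keep the same tokens; only the cost per lookup differs, and the gap grows with the size of the vocabulary.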


In [None]:
len(vocab), len(sentiments)

(9477, 9477)

Observing the most negative and positive words in `vocab`:

In [None]:
K=20

[[vocab[k],sentiments[k]] for k in np.argsort(sentiments)[:K]]

[['horrible', -1.909968],
 ['terrible', -1.864738],
 ['horrific', -1.7512368],
 ['senseless', -1.6989087],
 ['shameful', -1.6725583],
 ['blaming', -1.6478922],
 ['appalling', -1.6252333],
 ['horrendous', -1.6073263],
 ['blamed', -1.520782],
 ['caused', -1.4948518],
 ['consequences', -1.494211],
 ['dreadful', -1.4655025],
 ['worst', -1.463681],
 ['catastrophic', -1.4476106],
 ['inaction', -1.4466158],
 ['shocking', -1.4452888],
 ['worse', -1.4109223],
 ['ugly', -1.392833],
 ['disgusting', -1.3827171],
 ['vile', -1.3789988]]

In [None]:
[[vocab[k],sentiments[k]] for k in np.argsort(sentiments)[-K:]]

[['lively', 1.1587266],
 ['diligent', 1.1587312],
 ['cultivate', 1.161393],
 ['outstanding', 1.1703495],
 ['maintain', 1.1762311],
 ['harmonious', 1.1765563],
 ['reliable', 1.1805874],
 ['energetic', 1.1925653],
 ['flexible', 1.1978438],
 ['perfect', 1.1993508],
 ['provides', 1.2047523],
 ['provide', 1.2438107],
 ['guide', 1.2438293],
 ['enjoy', 1.2594359],
 ['stable', 1.2601318],
 ['healthy', 1.261559],
 ['solid', 1.2723947],
 ['innovative', 1.285939],
 ['good', 1.3519733],
 ['best', 1.4320018]]
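
The two listings above rely on `np.argsort`, which returns indices in ascending order of score: the first `K` indices point to the most negative words and the last `K` to the most positive ones, listed from lower to higher. A toy sketch with invented words and scores (`toy_words` and `toy_scores` are hypothetical):

```python
import numpy as np

toy_words = ["war", "peace", "table", "victory", "loss"]
toy_scores = np.array([-1.2, 0.9, 0.0, 1.1, -0.8])  # invented values

K = 2
order = np.argsort(toy_scores)  # indices from lowest to highest score

most_negative = [toy_words[k] for k in order[:K]]
# Reversing the tail lists the highest score first
most_positive = [toy_words[k] for k in order[-K:][::-1]]

print(most_negative)  # ['war', 'loss']
print(most_positive)  # ['victory', 'peace']
```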

Creating a dictionary containing the sentiments of all words in `vocab` that are also in `vocab_model`:

In [None]:
#Pair each word with its sentiment score
ws=dict(zip(vocab, sentiments))

Tokenizing speeches:

In [None]:
texts=[]
for txt in sp.text.tolist():
    texts.append(tokenize(txt))

Calculating sentiment for every speech:

In [None]:
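#`ws` holds exactly the tokens found in the embedding vocabulary, so a constant-time dictionary lookup is enough here
texts_sent=[np.mean([ws[w] for w in txt if w in ws]) for txt in tqdm(texts)]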

100%|██████████| 273/273 [1:30:57<00:00, 19.99s/it]
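
The speech-level score is the plain average of the word scores over all tokens that have an embedding, so a word repeated in a speech counts once per occurrence. A small sketch under stated assumptions: `toy_ws` is a hypothetical stand-in for `ws` with values rounded from the listings above, and `speech_score` is an illustrative helper that is not part of the notebook.

```python
import numpy as np

# Hypothetical word scores, rounded from the outputs shown earlier
toy_ws = {"good": 1.35, "terrible": -1.86, "provide": 1.24}

def speech_score(tokens, word_scores):
    # Keep only tokens that have a score; unknown tokens are ignored
    scores = [word_scores[w] for w in tokens if w in word_scores]
    # np.mean of an empty list yields nan with a warning, so return nan explicitly
    return float(np.mean(scores)) if scores else float("nan")

# "good" appears twice and is counted twice; "unknownword" is skipped
s = speech_score(["good", "good", "unknownword", "terrible"], toy_ws)
print(s)  # (1.35 + 1.35 - 1.86) / 3

empty = speech_score(["unknownword"], toy_ws)
print(empty)  # nan
```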


In [None]:
len(texts_sent)

273

Creating dataset:

In [None]:
sp['sent']=pd.Series(texts_sent)
sp=sp.loc[:,['date', 'title', 'text', 'url', 'sent']]

Saving data:

In [None]:
!mkdir -p data
sp.to_csv('data/texts_sent.csv', index=False)
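
Outside a notebook shell, the same save step can be written in pure Python. The sketch below is one possible variant, not the notebook's own method: it uses a temporary directory and an invented two-row frame (`toy`) so that it runs anywhere. Passing `index=False` avoids writing the row index as an extra column, which is the usual source of `Unnamed: 0`-style columns when a CSV is read back.

```python
import os
import tempfile
import pandas as pd

# Invented two-row frame standing in for `sp` (the scores are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-07-24 22:02:00", "2022-07-23 23:42:00"]),
    "sent": [0.12, 0.08],
})

out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)  # does not fail if the folder already exists
path = os.path.join(out_dir, "texts_sent.csv")

toy.to_csv(path, index=False)

# Dates are stored as text in CSV, so parse them again when reading back
restored = pd.read_csv(path, parse_dates=["date"])
print(restored.dtypes)
```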