# Sentiment Analysis

## Introduction

Next, I'll analyse the sentiment of the headlines. For that I use [PySentimiento](https://github.com/pysentimiento/pysentimiento), a toolkit with models for Spanish text that classifies each post as **positive**, **negative** or **neutral**. Besides that package, I'll also be using **spaCy** and **TextBlob**, as well as the package that combines the two so they can be used together.

I'll look at sentiment in two ways:

1. **Per tweet:** each tweet is analysed on its own.
2. **Per week:** the tweets are grouped by week for each of the newspapers.

Finally, I'll relate the results to Twitter's engagement metrics.
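
As a rough sketch of that last step, the sentiment labels can be joined to the engagement counts on the tweet `id` and averaged per label. The toy rows are taken from the `corpus` and `stats_data` previews further down; the names `corpus_sample`, `stats_sample` and `engagement_by_sentiment` are made up for this illustration, and five tweets are far too few to read anything into the numbers.

```python
import pandas as pd

# Toy rows copied from the corpus / stats_data previews (illustrative only).
ids = [
    "1564039479391838209",
    "1564032331706470401",
    "1564028601053347843",
    "1564023766937731073",
    "1564017585561141248",
]
corpus_sample = pd.DataFrame(
    {"id": ids, "sentiment_output": ["NEG", "NEU", "NEU", "NEU", "NEU"]}
)
stats_sample = pd.DataFrame(
    {"id": ids, "like_count": [6, 2, 18, 1, 8], "retweet_count": [0, 0, 6, 1, 3]}
)

# Join on the tweet id (a string column in both frames); validate guards
# against duplicated ids silently multiplying rows.
merged = corpus_sample.merge(stats_sample, on="id", how="inner", validate="one_to_one")

# Average engagement per predicted sentiment label.
engagement_by_sentiment = merged.groupby("sentiment_output")[
    ["like_count", "retweet_count"]
].mean()
print(engagement_by_sentiment)
```

With the full frames, a median may be a safer summary than the mean, since like and retweet counts tend to be heavily skewed.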

In [76]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio

from dotenv import load_dotenv, find_dotenv
from pysentimiento import create_analyzer

In [77]:
load_dotenv()

BASE_DIR = os.environ.get("BASE_DIR")
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")

In [78]:
pd.set_option("display.max_colwidth", 300)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)
pd.set_option("display.precision", 2)
pd.set_option("display.float_format",  "{:,.3f}".format)

pio.templates.default = "plotly_white"
pio.kaleido.scope.default_scale = 2

gruvbox_colors = ["#fabd2f", "#b8bb26", "#458588", "#fe8019", "#b16286", "#fb4943", "#689d6a", "#d79921", "#98971a", "#83a598", "#d65d0e", "#d3869b", "#cc241d", "#8ec07c", "#b57614", "#79740e", "#076678", "#af3a03", "#8f3f71", "#9d0006", "#4d7b58", "#fbf1c7", "#928374", "#282828"]

In [79]:
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]

### Data Loading

Four files are loaded here: the corpus frame (`corpus`), the tweet statistics (`stats_data`), and the two top-30 frames (`top30_df` and `top30_stats`).

In [80]:
corpus = pd.read_feather(f"{BASE_DIR}/data/processed/corpus-sentiment-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
stats_data = pd.read_feather(f"{BASE_DIR}/data/processed/stats_data-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
top30_df = pd.read_feather(f"{BASE_DIR}/data/processed/top30_df-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
top30_stats = pd.read_feather(f"{BASE_DIR}/data/processed/top30-stats-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
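
One detail worth noting about the paths above: `TIME_STAMPS` holds tuples, so the f-strings embed each tuple's printed form, parentheses and space included. The sketch below shows the resulting file name next to a more compact alternative. The `stamp` helper is hypothetical, not part of the notebook, and using it would require renaming the files already on disk.

```python
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]

# An f-string formats a tuple as "(2022, 35)", so that text ends up in the name.
raw_name = f"corpus-sentiment-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather"


def stamp(year_week):
    """Hypothetical helper: render a (year, week) tuple as e.g. '2022w35'."""
    year, week = year_week
    return f"{year}w{week:02d}"


compact_name = f"corpus-sentiment-{stamp(TIME_STAMPS[0])}-{stamp(TIME_STAMPS[-1])}.feather"

print(raw_name)      # corpus-sentiment-(2022, 35)-(2023, 3).feather
print(compact_name)  # corpus-sentiment-2022w35-2023w03.feather
```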

In [81]:
top30_stats.set_index("index", inplace=True)
corpus.set_index("index", inplace=True)
stats_data.set_index("index", inplace=True)

In [82]:
corpus.head()

index,id,created_at,newspaper,corpus,sentiment_output,sentiment_probas_NEG,sentiment_probas_NEU,sentiment_probas_POS,emotion_output,emotion_probas_others,emotion_probas_joy,emotion_probas_surprise,emotion_probas_sadness,emotion_probas_fear,emotion_probas_anger,emotion_probas_disgust,year,week,year_week
0,1564039479391838209,2022-08-28 23:57:24+00:00,elcomercio_peru,venezuela colombia retoman relaciones diplomáticas rotas hace tres años,NEG,0.631,0.364,0.005,others,0.905,0.004,0.023,0.042,0.014,0.005,0.007,2022,34,2022w34
2,1564032331706470401,2022-08-28 23:29:00+00:00,elcomercio_peru,amlo afirma que familias ya aceptaron plan de rescate de mineros,NEU,0.031,0.889,0.08,others,0.961,0.016,0.009,0.006,0.003,0.002,0.002,2022,34,2022w34
3,1564028601053347843,2022-08-28 23:14:11+00:00,elcomercio_peru,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones,NEU,0.114,0.881,0.005,others,0.942,0.002,0.021,0.009,0.018,0.003,0.005,2022,34,2022w34
5,1564023766937731073,2022-08-28 22:54:58+00:00,elcomercio_peru,autoridades confirman transmisión comunitaria de viruela del mono en panamá,NEU,0.14,0.851,0.009,others,0.95,0.008,0.01,0.017,0.008,0.004,0.003,2022,34,2022w34
7,1564017585561141248,2022-08-28 22:30:25+00:00,elcomercio_peru,las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina,NEU,0.193,0.793,0.014,others,0.891,0.006,0.041,0.009,0.035,0.005,0.012,2022,34,2022w34


In [83]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34924 entries, 0 to 23
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   id                       34924 non-null  object             
 1   created_at               34924 non-null  datetime64[ns, UTC]
 2   newspaper                34924 non-null  object             
 3   corpus                   34924 non-null  object             
 4   sentiment_output         34924 non-null  object             
 5   sentiment_probas_NEG     34924 non-null  float64            
 6   sentiment_probas_NEU     34924 non-null  float64            
 7   sentiment_probas_POS     34924 non-null  float64            
 8   emotion_output           34924 non-null  object             
 9   emotion_probas_others    34924 non-null  float64            
 10  emotion_probas_joy       34924 non-null  float64            
 11  emotion_probas_surprise  34924 

In [84]:
stats_data.head()

index,created_at,possibly_sensitive,id,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,0,0,6,1,,elcomercio_peru,,
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,0,0,2,0,,elcomercio_peru,,
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,6,7,18,1,,elcomercio_peru,,
5,2022-08-28 22:54:58+00:00,False,1564023766937731073,1,0,1,1,,elcomercio_peru,,
7,2022-08-28 22:30:25+00:00,False,1564017585561141248,3,0,8,0,,elcomercio_peru,,


In [85]:
stats_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34924 entries, 0 to 23
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   created_at              34924 non-null  datetime64[ns, UTC]
 1   possibly_sensitive      34924 non-null  bool               
 2   id                      34924 non-null  object             
 3   retweet_count           34924 non-null  int64              
 4   reply_count             34924 non-null  int64              
 5   like_count              34924 non-null  int64              
 6   quote_count             34924 non-null  int64              
 7   referenced_tweets       3593 non-null   object             
 8   newspaper               34924 non-null  object             
 9   edit_history_tweet_ids  21727 non-null  object             
 10  impression_count        4499 non-null   float64            
dtypes: bool(1), datetime64[ns, UTC](1), float64(

## Sentiment per tweet

In [None]:
analyzer = create_analyzer(task="sentiment", lang="es")

In [None]:
emotion_analyzer = create_analyzer(task="emotion", lang="es")

In [None]:
corpus["sentiment"] = corpus["corpus"].apply(lambda x: analyzer.predict(x))
corpus["emotion"] = corpus["corpus"].apply(lambda x: emotion_analyzer.predict(x))

In [None]:
corpus.head()

In [None]:
corpus["sentiment_output"] = corpus["sentiment"].apply(lambda x: x.output)
corpus["sentiment_probas_NEG"] = corpus["sentiment"].apply(lambda x: x.probas["NEG"])
corpus["sentiment_probas_NEU"] = corpus["sentiment"].apply(lambda x: x.probas["NEU"])
corpus["sentiment_probas_POS"] = corpus["sentiment"].apply(lambda x: x.probas["POS"])

In [None]:
corpus["emotion_output"] = corpus["emotion"].apply(lambda x: x.output)
corpus["emotion_probas_others"] = corpus["emotion"].apply(lambda x: x.probas["others"])
corpus["emotion_probas_joy"] = corpus["emotion"].apply(lambda x: x.probas["joy"])
corpus["emotion_probas_surprise"] = corpus["emotion"].apply(lambda x: x.probas["surprise"])
corpus["emotion_probas_sadness"] = corpus["emotion"].apply(lambda x: x.probas["sadness"])
corpus["emotion_probas_fear"] = corpus["emotion"].apply(lambda x: x.probas["fear"])
corpus["emotion_probas_anger"] = corpus["emotion"].apply(lambda x: x.probas["anger"])
corpus["emotion_probas_disgust"] = corpus["emotion"].apply(lambda x: x.probas["disgust"])
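
The cells above pull `output` and each entry of `probas` into its own column, one `apply` per label. A more compact way to do the same thing is sketched below. Since loading the real model is slow, `FakeResult` is a hypothetical stand-in that only mimics the two attributes used above (`output` and `probas`); it is not a PySentimiento class.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing the two attributes used in the notebook."""

    output: str
    probas: dict


df = pd.DataFrame(
    {
        "sentiment": [
            FakeResult("NEG", {"NEG": 0.631, "NEU": 0.364, "POS": 0.005}),
            FakeResult("NEU", {"NEG": 0.031, "NEU": 0.889, "POS": 0.080}),
        ]
    }
)

# Top label per row.
df["sentiment_output"] = df["sentiment"].map(lambda r: r.output)

# Expand every probas dict into columns in one pass, then prefix the names.
probas = pd.DataFrame(
    df["sentiment"].map(lambda r: r.probas).tolist(), index=df.index
).add_prefix("sentiment_probas_")

df = df.join(probas)
print(df.drop(columns="sentiment"))
```

The same pattern would apply to the emotion results with an `emotion_probas_` prefix, with the columns following whatever keys the `probas` dictionaries contain.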

In [None]:
corpus["year"] = corpus["created_at"].dt.isocalendar().year
corpus["week"] = corpus["created_at"].dt.isocalendar().week

In [None]:
corpus["year_week"] = corpus["year"].astype("str") + "w" + corpus["week"].astype("str")
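
To make the `year_week` label concrete, here is a small standard-library sketch of the same ISO-calendar logic that `.dt.isocalendar()` applies. The helper `year_week_label` and its `pad` option are assumptions made for this example. Padding is not what the notebook does, but it keeps the labels in chronological order when sorted as strings.

```python
from datetime import date


def year_week_label(day, pad=False):
    """Build a label like '2022w34' from the ISO year and ISO week of a date."""
    iso_year, iso_week, _ = day.isocalendar()
    week = f"{iso_week:02d}" if pad else str(iso_week)
    return f"{iso_year}w{week}"


# First tweet in the corpus preview: Sunday 28 August 2022.
print(year_week_label(date(2022, 8, 28)))  # 2022w34

# 1 January 2023 still belongs to the last ISO week of 2022, which is why
# the ISO year is used rather than the calendar year.
print(year_week_label(date(2023, 1, 1)))  # 2022w52

# Unpadded labels do not sort chronologically as strings; padded ones do.
print(sorted(["2022w9", "2022w33"]))   # ['2022w33', '2022w9']
print(sorted(["2022w09", "2022w33"]))  # ['2022w09', '2022w33']
```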

In [None]:
corpus.drop(labels=["sentiment", "emotion"], axis=1).reset_index().to_feather(f"{BASE_DIR}/data/processed/corpus-sentiment-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

In [86]:
# Name the count column explicitly: its default name differs across pandas versions.
sentiment_counts = corpus[["newspaper", "year_week", "sentiment_output"]].groupby(by=["newspaper", "year_week"]).value_counts().reset_index(name="counts")

In [64]:
sentiment_counts.head()

,newspaper,year_week,sentiment_output,counts
0,DiarioElPeruano,2022w33,NEU,211
1,DiarioElPeruano,2022w33,NEG,10
2,DiarioElPeruano,2022w33,POS,8
3,DiarioElPeruano,2022w34,NEU,183
4,DiarioElPeruano,2022w34,POS,11


In [74]:
fig = px.bar(
    sentiment_counts,
    x="sentiment_output",
    y="counts",
    facet_row="newspaper",
    facet_col="year_week",
    color="sentiment_output",
    color_discrete_sequence=gruvbox_colors,
    title="Sentiment per newspaper",
    height=2100,
    width=1600
    )

fig.for_each_annotation(lambda a: a.update(text=f"{a.text.split('=')[-1]}"))
fig.update_xaxes(matches=None, showticklabels=True)
fig.update_yaxes(matches=None, showticklabels=True)

fig.write_html(f"{BASE_DIR}/reports/sentiment_bar-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.html")

fig.show()

In [70]:
# Name the count column explicitly: its default name differs across pandas versions.
emotion_counts = corpus[["newspaper", "year_week", "emotion_output"]].groupby(by=["newspaper", "year_week"]).value_counts().reset_index(name="counts")

In [71]:
emotion_counts.head()

,newspaper,year_week,emotion_output,counts
0,DiarioElPeruano,2022w33,others,222
1,DiarioElPeruano,2022w33,joy,6
2,DiarioElPeruano,2022w33,anger,1
3,DiarioElPeruano,2022w34,others,194
4,DiarioElPeruano,2022w34,joy,4


In [75]:
fig = px.bar(
    emotion_counts,
    x="emotion_output",
    y="counts",
    facet_row="newspaper",
    facet_col="year_week",
    color="emotion_output",
    color_discrete_sequence=gruvbox_colors,
    title="Emotion per newspaper",
    height=2100,
    width=2100
    )

fig.for_each_annotation(lambda a: a.update(text=f"{a.text.split('=')[-1]}"))
fig.update_xaxes(matches=None, showticklabels=True)
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_layout(showlegend=False)

fig.write_html(f"{BASE_DIR}/reports/emotion_bar-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.html")

fig.show()

So far the results are inconclusive for both sentiment and emotion: **neutral** and **others** dominate every week. Since we are looking at weekly results, there are still two routes that could give a fuller picture:

1. Check the average class probabilities per week and newspaper, rather than only the top label.
2. Translate the tweets to English and run sentiment analysis on the translation (with the limitations that come with it).
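
To put a number on how dominant the neutral label is, the weekly counts can also be turned into shares within each newspaper and week. This is only a sketch on the three `DiarioElPeruano` rows for `2022w33` shown in the `sentiment_counts` preview; the `share` column is an addition for this example and is not used elsewhere in the notebook.

```python
import pandas as pd

# Rows copied from the sentiment_counts preview (one newspaper, one week).
sentiment_counts = pd.DataFrame(
    {
        "newspaper": ["DiarioElPeruano"] * 3,
        "year_week": ["2022w33"] * 3,
        "sentiment_output": ["NEU", "NEG", "POS"],
        "counts": [211, 10, 8],
    }
)

# Total tweets per newspaper and week, broadcast back onto every row.
totals = sentiment_counts.groupby(["newspaper", "year_week"])["counts"].transform("sum")

# Share of each label within its newspaper-week group.
sentiment_counts["share"] = sentiment_counts["counts"] / totals
print(sentiment_counts)
```

Here 211 of 229 tweets are neutral, roughly 92%, which makes it easier to compare newspapers that tweet at very different volumes.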

In [89]:
sentiment_prob = corpus[["newspaper", "year_week", "sentiment_probas_NEG", "sentiment_probas_NEU", "sentiment_probas_POS"]].groupby(by=["newspaper", "year_week"]).mean()
sentiment_prob.reset_index(inplace=True)

In [97]:
sentiment_prob = sentiment_prob.melt(id_vars=["newspaper", "year_week"], value_vars=["sentiment_probas_NEG", "sentiment_probas_NEU", "sentiment_probas_POS"], var_name="sentiment", value_name="probability")

In [102]:
sentiment_prob["sentiment"] = sentiment_prob["sentiment"].apply(lambda x: x.replace("sentiment_probas_", ""))

In [103]:
sentiment_prob.head()

,newspaper,year_week,sentiment,probability
0,DiarioElPeruano,2022w33,NEG,0.076
1,DiarioElPeruano,2022w34,NEG,0.078
2,DiarioElPeruano,2022w39,NEG,0.054
3,DiarioElPeruano,2022w44,NEG,0.056
4,DiarioElPeruano,2022w49,NEG,0.081


In [111]:
fig = px.bar(
    sentiment_prob,
    x="sentiment",
    y="probability",
    facet_row="newspaper",
    facet_col="year_week",
    color="sentiment",
    color_discrete_sequence=gruvbox_colors,
    title="Sentiment probability per newspaper",
    height=2100,
    width=1600
    )

fig.for_each_annotation(lambda a: a.update(text=f"{a.text.split('=')[-1]}"))
fig.update_xaxes(matches=None, showticklabels=True)
fig.update_yaxes(showticklabels=True)
fig.update_layout(showlegend=False)

fig.write_html(f"{BASE_DIR}/reports/sentiment-prob-bar-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.html")

fig.show()

In [92]:
emotion_prob = corpus[["newspaper", "year_week", "emotion_probas_anger", "emotion_probas_disgust", "emotion_probas_fear", "emotion_probas_joy", "emotion_probas_others", "emotion_probas_sadness", "emotion_probas_surprise"]].groupby(by=["newspaper", "year_week"]).mean()
emotion_prob.reset_index(inplace=True)

In [99]:
emotion_prob = emotion_prob.melt(id_vars=["newspaper", "year_week"], value_vars=["emotion_probas_anger", "emotion_probas_disgust", "emotion_probas_fear", "emotion_probas_joy", "emotion_probas_others", "emotion_probas_sadness", "emotion_probas_surprise"], var_name="emotion", value_name="probability")

In [104]:
emotion_prob["emotion"] = emotion_prob["emotion"].apply(lambda x: x.replace("emotion_probas_", ""))

In [105]:
emotion_prob.head()

,newspaper,year_week,emotion,probability
0,DiarioElPeruano,2022w33,anger,0.008
1,DiarioElPeruano,2022w34,anger,0.006
2,DiarioElPeruano,2022w39,anger,0.004
3,DiarioElPeruano,2022w44,anger,0.004
4,DiarioElPeruano,2022w49,anger,0.004


In [112]:
fig = px.bar(
    emotion_prob,
    x="emotion",
    y="probability",
    facet_row="newspaper",
    facet_col="year_week",
    color="emotion",
    color_discrete_sequence=gruvbox_colors,
    title="Emotion probability per newspaper",
    height=2100,
    width=2100
    )

fig.for_each_annotation(lambda a: a.update(text=f"{a.text.split('=')[-1]}"))
fig.update_xaxes(matches=None, showticklabels=True)
fig.update_yaxes(showticklabels=True)
fig.update_layout(showlegend=False)

fig.write_html(f"{BASE_DIR}/reports/emotion-prob_bar-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.html")

fig.show()

The results are still inconclusive. That is not surprising, though: these are newspaper headlines, which are expected to read as fairly neutral.

In [114]:
corpus["corpus_len"] = corpus["corpus"].str.len()

In [116]:
# Sum only the numeric length column; including "newspaper" in the selection triggers the numeric_only deprecation warning.
corpus[["year_week", "corpus_len"]].groupby(by=["year_week"]).sum()

year_week,corpus_len
2022w33,753242
2022w34,538434
2022w39,610327
2022w44,489837
2022w49,597057
2023w2,451792
