# Exploratory Data Analysis

## Introduction

After cleaning the data, we take a first look at it. Since we want to know how the information changes over time, we will look at tweets from different weeks.

1. **Most common words:** Find them and create word clouds. See if anything needs to be removed.
2. **Size of vocabulary:** Look at the number of unique words used.
3. **Engagement metrics across time:** A more insightful look at the stats obtained during data cleaning.


In [63]:
import json
import numpy as np
import os
import pandas as pd
import plotly.express as px
import plotly.io as pio
import random
import re
import spacy

from dotenv import load_dotenv
from itertools import product
from wordcloud import WordCloud

In [2]:
load_dotenv()

BASE_DIR = os.environ.get("BASE_DIR")
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")

In [83]:
pd.set_option("display.max_colwidth", 300)
pd.set_option("display.max_rows", 25)
pd.set_option("display.precision", 2)
pd.set_option("display.float_format",  "{:,.2f}".format)

pio.templates.default = "plotly_white"
pio.kaleido.scope.default_scale = 2

gruvbox_colors = ["#fe8019", "#d65d0e", "#af3a03", "#8ec07c", "#689d6a", "#4d7b58", "#d3869b", "#b16286", "#8f3f71", "#83a598", "#458588", "#076678", "#fabd2f", "#d79921", "#b57614", "#b8bb26", "#98971a", "#79740e", "#fb4943", "#cc241d",  "#fbf1c7"]
random.shuffle(gruvbox_colors)

In [4]:
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]

### Data Loading

There are three files that I want to load: the corpus frame, the document-term matrix, and the clean data.

In [5]:
corpus = pd.read_feather(f"{BASE_DIR}/data/processed/corpus-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
dtm = pd.read_feather(f"{BASE_DIR}/data/processed/dtm-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
stats_data = pd.read_feather(f"{BASE_DIR}/data/processed/stats_data-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
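Because `TIME_STAMPS[0]` and `TIME_STAMPS[-1]` are tuples, the f-strings above embed their tuple representation in the file names. A minimal sketch of what one resulting path looks like, assuming a placeholder base directory (`/tmp/project` is hypothetical and stands in for `BASE_DIR`):

```python
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]
base_dir = "/tmp/project"  # hypothetical placeholder for BASE_DIR

# Tuples are rendered with parentheses, a comma, and a space
corpus_path = f"{base_dir}/data/processed/corpus-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather"
print(corpus_path)
# /tmp/project/data/processed/corpus-(2022, 35)-(2023, 3).feather
```

The files written during data cleaning must use the same naming scheme for these reads to succeed.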

In [6]:
dtm.set_index("index", inplace=True)
corpus.set_index("index", inplace=True)
stats_data.set_index("index", inplace=True)

In [7]:
dtm.head()

index,venezuela,colombia,retoman,relaciones,diplomáticas,rotas,años,amlo,afirma,familias,...,lironda,quioscos,leerlo,segmentos,unger,divulgador,jubila,lúcido,agradezco,pasados
1564039479391838209,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1563901376391954432,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1563603477875642368,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1563584254411685890,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1563357468478283777,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
corpus.head()

index,id,created_at,newspaper,corpus
0,1564039479391838209,2022-08-28 23:57:24+00:00,elcomercio_peru,venezuela colombia retoman relaciones diplomáticas rotas hace tres años
2,1564032331706470401,2022-08-28 23:29:00+00:00,elcomercio_peru,amlo afirma que familias ya aceptaron plan de rescate de mineros
3,1564028601053347843,2022-08-28 23:14:11+00:00,elcomercio_peru,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones
5,1564023766937731073,2022-08-28 22:54:58+00:00,elcomercio_peru,autoridades confirman transmisión comunitaria de viruela del mono en panamá
7,1564017585561141248,2022-08-28 22:30:25+00:00,elcomercio_peru,las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina


In [8]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34924 entries, 0 to 23
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   id          34924 non-null  object             
 1   created_at  34924 non-null  datetime64[ns, UTC]
 2   newspaper   34924 non-null  object             
 3   corpus      34924 non-null  object             
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 1.3+ MB


In [9]:
stats_data.head()

index,created_at,possibly_sensitive,id,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,0,0,6,1,,elcomercio_peru,,
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,0,0,2,0,,elcomercio_peru,,
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,6,7,18,1,,elcomercio_peru,,
5,2022-08-28 22:54:58+00:00,False,1564023766937731073,1,0,1,1,,elcomercio_peru,,
7,2022-08-28 22:30:25+00:00,False,1564017585561141248,3,0,8,0,,elcomercio_peru,,


## Most Common Words

In [10]:
newspapers = corpus["newspaper"].unique()

In [11]:
# Unique (ISO year, ISO week) pairs present in the corpus
year_weeks = corpus["created_at"].dt.isocalendar()[["year", "week"]].drop_duplicates().to_numpy()
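The year and week used here follow the ISO calendar, which can differ from the calendar year around New Year. A small sketch with two timestamps: the first is taken from the corpus preview above, the second is a made-up date chosen to show the year boundary.

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(
        ["2022-08-28 23:57:24+00:00", "2023-01-01 12:00:00+00:00"], utc=True
    )
)

# isocalendar() returns a frame with ISO year, week, and day columns
iso = dates.dt.isocalendar()
print(iso[["year", "week"]])
# 2022-08-28 falls in ISO week 34 of 2022;
# 2023-01-01 is a Sunday and still belongs to ISO week 52 of 2022
```

This is why both the year and the week are taken from `isocalendar()` rather than mixing the calendar year with the ISO week.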

In [13]:
corpus["year"] = corpus["created_at"].dt.isocalendar().year
corpus["week"] = corpus["created_at"].dt.isocalendar().week

In [14]:
dtm_newspaper = pd.DataFrame(index=dtm.columns)

# Sum term counts over each newspaper's tweets in each ISO week
for year_week, newspaper in product(year_weeks, newspapers):
    year, week = year_week
    mask = (corpus["newspaper"] == newspaper) & (corpus["year"] == year) & (corpus["week"] == week)
    filtered_data = dtm.filter(items=corpus.loc[mask, "id"], axis=0)
    dtm_newspaper[f"{newspaper}-{year}_{week}"] = filtered_data.sum(axis=0)
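The same aggregation could probably be expressed as a single `groupby`. The sketch below uses toy stand-ins (`dtm_toy` and `corpus_toy` are hypothetical, with made-up tweet ids and counts) and is not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy document-term matrix: one row per tweet id, one column per word
dtm_toy = pd.DataFrame(
    {"perú": [1, 0, 2], "lima": [0, 1, 1]},
    index=["t1", "t2", "t3"],
)
# Toy corpus frame with the same ids plus newspaper, ISO year, and ISO week
corpus_toy = pd.DataFrame(
    {
        "id": ["t1", "t2", "t3"],
        "newspaper": ["a", "a", "b"],
        "year": [2022, 2022, 2022],
        "week": [34, 34, 34],
    }
)

# Build one "newspaper-year_week" label per tweet, aligned with the DTM rows
labels = corpus_toy.set_index("id").loc[dtm_toy.index]
keys = labels["newspaper"] + "-" + labels["year"].astype(str) + "_" + labels["week"].astype(str)

# Sum the word counts per label, then transpose so words become the index
dtm_newspaper_toy = dtm_toy.groupby(keys).sum().T
print(dtm_newspaper_toy)
```

One difference to keep in mind: `groupby` only creates columns for combinations that actually occur, whereas the `product` loop also creates all-zero columns for newspaper-week pairs without tweets.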

In [15]:
top30_dict = {}

for newspaper in dtm_newspaper.columns:
    top = dtm_newspaper[newspaper].sort_values(ascending=False).head(30)
    top30_dict[newspaper] = list(zip(top.index, top.values))

In [16]:
top30_dict

{'elcomercio_peru-2022_34': [('agosto', 40.0),
  ('perú', 32.0),
  ('lima', 23.0),
  ('millones', 21.0),
  ('perucheck', 19.0),
  ('covid', 19.0),
  ('años', 17.0),
  ('mundo', 14.0),
  ('ucrania', 14.0),
  ('eeuu', 14.0),
  ('colombia', 13.0),
  ('méxico', 13.0),
  ('venezuela', 12.0),
  ('muerte', 12.0),
  ('mono', 12.0),
  ('viruela', 12.0),
  ('rusia', 12.0),
  ('cambio', 11.0),
  ('tipo', 11.0),
  ('dólar', 11.0),
  ('us', 11.0),
  ('mujer', 10.0),
  ('reporta', 10.0),
  ('contagios', 9.0),
  ('precio', 9.0),
  ('unidos', 9.0),
  ('país', 9.0),
  ('policía', 9.0),
  ('pasó', 8.0),
  ('bono', 8.0)],
 'larepublica_pe-2022_34': [('lrdeportes', 214.0),
  ('politicalr', 199.0),
  ('video', 141.0),
  ('perú', 69.0),
  ('envivo', 57.0),
  ('paredes', 56.0),
  ('castillo', 52.0),
  ('lima', 51.0),
  ('agosto', 50.0),
  ('años', 45.0),
  ('verificadorlr', 43.0),
  ('yenifer', 39.0),
  ('pedro', 35.0),
  ('partido', 35.0),
  ('prisión', 33.0),
  ('alianza', 29.0),
  ('fiscalía', 28.0),
  ('

In [17]:
top30_df = pd.DataFrame.from_records(top30_dict)

In [18]:
top30_df = top30_df.melt(value_vars=top30_df.columns, var_name="newspaper_date", value_name="word_count")

top30_df[["newspaper", "year_week"]] = top30_df["newspaper_date"].str.split(r"-", expand=True)
top30_df[["year", "week"]] = top30_df["year_week"].str.split(r"_", expand=True)
top30_df[["word", "count"]] = pd.DataFrame(top30_df["word_count"].to_list(), index=top30_df.index)

top30_df.drop(["word_count", "newspaper_date", "year_week"], axis=1, inplace=True)

top30_df["year"] = pd.to_numeric(top30_df["year"])
top30_df["week"] = pd.to_numeric(top30_df["week"])
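As an alternative to going through a dictionary of tuples, the long table could be built directly from the aggregated counts. A minimal sketch on toy data (the frame `counts` and its values are made up, and it keeps the top 2 words instead of the top 30):

```python
import pandas as pd

# Toy aggregated counts: words as index, "newspaper-year_week" as columns
counts = pd.DataFrame(
    {"a-2022_34": [5.0, 3.0, 1.0], "b-2022_34": [0.0, 4.0, 2.0]},
    index=["perú", "lima", "covid"],
)

# Reshape to one row per (word, newspaper_date) pair
long_df = counts.rename_axis("word").reset_index().melt(
    id_vars="word", var_name="newspaper_date", value_name="count"
)

# Keep the n most frequent words within each newspaper-week
top2 = (
    long_df.sort_values("count", ascending=False)
    .groupby("newspaper_date")
    .head(2)
    .reset_index(drop=True)
)

# Split the combined label back into newspaper, year, and week
top2[["newspaper", "year_week"]] = top2["newspaper_date"].str.split("-", expand=True)
top2[["year", "week"]] = top2["year_week"].str.split("_", expand=True).astype(int)
top2 = top2.drop(columns=["newspaper_date", "year_week"])
print(top2)
```

Like the cell above, this splits on `-`, so it assumes no newspaper handle contains a hyphen.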

In [19]:
top30_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   newspaper  2160 non-null   object 
 1   year       2160 non-null   int64  
 2   week       2160 non-null   int64  
 3   word       2160 non-null   object 
 4   count      2160 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 84.5+ KB


In [20]:
top30_df.head()

Unnamed: 0,newspaper,year,week,word,count
0,DiarioElPeruano,2022,33,nacional,29.0
1,DiarioElPeruano,2022,33,presidente,27.0
2,DiarioElPeruano,2022,33,país,24.0
3,DiarioElPeruano,2022,33,pedro,22.0
4,DiarioElPeruano,2022,33,castillo,22.0


In [21]:
top30_df.to_feather(f"{BASE_DIR}/data/processed/top30_df-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

In [39]:
top30_df["hot_topics"] = top30_df["word"].map({
    "castillo": "castillo",
    "pedro": "castillo",
    "dina": "boluarte",
    "boluarte": "boluarte",
    "perú": "país",
    "país": "país",
    "congreso": "congreso",
    "covid": "covid",
    "protestas": "protestas",
    "manifestaciones": "protestas"
})
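# Words outside the mapping get an empty label; assign the result instead of
# calling fillna(inplace=True) on a column selection, which newer pandas versions deprecate
top30_df["hot_topics"] = top30_df["hot_topics"].fillna("")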
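Words that are missing from the mapping come back from `Series.map` as `NaN`, which is why they are replaced with an empty string afterwards. A small sketch with a shortened version of the mapping:

```python
import pandas as pd

words = pd.Series(["castillo", "pedro", "lima", "perú"])

# Shortened mapping; "lima" is deliberately left out
topic_map = {"castillo": "castillo", "pedro": "castillo", "perú": "país"}

# Unmapped words become NaN, then an empty string
topics = words.map(topic_map).fillna("")
print(topics.tolist())
# ['castillo', 'castillo', '', 'país']
```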

In [22]:
with open(f"{BASE_DIR}/data/processed/top_30-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.json", "w") as file:
    json.dump(top30_dict, file)

In [84]:
fig = px.bar(
    top30_df,
    x="word",
    y="count",
    facet_row="newspaper",
    facet_col="week",
    color="hot_topics",
    color_discrete_sequence=gruvbox_colors,
    title="Top 30 words per newspaper per week",
    height=3200,
    width=3200
    )

fig.for_each_annotation(lambda a: a.update(text=f"{a.text.split('=')[-1]}"))
fig.update_xaxes(matches=None, showticklabels=True, categoryorder='total descending')
fig.update_yaxes(matches=None, showticklabels=True)

fig.write_html(f"{BASE_DIR}/reports/top30_bar-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.html")

fig.show()

## Number of Words

### Analysis

In [113]:
unique_list = []

# Identify the non-zero items in the document-term matrix
for newspaper in dtm_newspaper.columns:
    uniques = dtm_newspaper[newspaper].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(dtm_newspaper.columns, unique_list)), columns=['newspaper', 'unique_words'])
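The loop over columns can also be written as one vectorized expression: comparing the frame with zero gives a boolean mask, and summing it column-wise counts the words that appear at least once. A sketch on a toy frame (`toy` and its values are made up):

```python
import pandas as pd

# Toy aggregated counts: words as index, "newspaper-year_week" as columns
toy = pd.DataFrame(
    {"a-2022_34": [3.0, 0.0, 1.0], "b-2022_34": [0.0, 0.0, 2.0]},
    index=["perú", "lima", "covid"],
)

# True where a word occurs at least once; summing counts the True values per column
unique_words = (toy != 0).sum()
print(unique_words)
```

The result is a Series indexed by column name, so it could be turned into the same two-column frame with `reset_index`.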

In [114]:
data_words[["newspaper", "year_week"]] = data_words["newspaper"].str.split(r"-", expand=True)
data_words[["year", "week"]] = data_words["year_week"].str.split(r"_", expand=True)

data_words.drop(["year_week"], axis=1, inplace=True)

data_words["year"] = pd.to_numeric(data_words["year"])
data_words["week"] = pd.to_numeric(data_words["week"])

In [122]:
data_words.head()

Unnamed: 0,newspaper,unique_words,year,week
0,elcomercio_peru,2128,2022,34
1,larepublica_pe,4841,2022,34
2,peru21noticias,4087,2022,34
3,tromepe,3290,2022,34
4,Gestionpe,2187,2022,34


In [123]:
data_words.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   newspaper     72 non-null     object
 1   unique_words  72 non-null     int64 
 2   year          72 non-null     int64 
 3   week          72 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 2.4+ KB


Since the number of unique words might be linked to the number of tweets, I will add a column with the number of tweets for each newspaper and week.

In [118]:
tweet_number = pd.DataFrame(corpus.groupby(by=["newspaper", "year", "week"]).count()["id"])
tweet_number.rename(columns={'id':'tweet_number'}, inplace=True)

In [126]:
tweet_number.reset_index(inplace=True)

In [127]:
tweet_number.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   index         70 non-null     int64 
 1   newspaper     70 non-null     object
 2   year          70 non-null     UInt32
 3   week          70 non-null     UInt32
 4   tweet_number  70 non-null     int64 
dtypes: UInt32(2), int64(2), object(1)
memory usage: 2.4+ KB


In [129]:
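# Inner join on newspaper and ISO week; combinations without tweets are dropped
data_words = data_words.merge(tweet_number, on=["newspaper", "year", "week"])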

In [130]:
data_words.head()

Unnamed: 0,newspaper,unique_words,year,week,index,tweet_number
0,elcomercio_peru,2128,2022,34,36,457
1,larepublica_pe,4841,2022,34,53,827
2,peru21noticias,4087,2022,34,59,961
3,tromepe,3290,2022,34,65,807
4,Gestionpe,2187,2022,34,13,848


In [131]:
data_words.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 69
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   newspaper     70 non-null     object
 1   unique_words  70 non-null     int64 
 2   year          70 non-null     int64 
 3   week          70 non-null     int64 
 4   index         70 non-null     int64 
 5   tweet_number  70 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 3.8+ KB
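`data_words` had 72 rows before the merge and has 70 after it, so the inner join dropped two newspaper-week combinations. One way to see which rows would be dropped is a left join with `indicator=True`. The sketch below uses toy frames (`words` and `tweets` are hypothetical) rather than the real data:

```python
import pandas as pd

# Toy version of data_words: newspaper "b" has no tweets in this week
words = pd.DataFrame(
    {
        "newspaper": ["a", "b"],
        "year": [2022, 2022],
        "week": [34, 34],
        "unique_words": [10, 0],
    }
)
# Toy version of tweet_number: only newspaper "a" appears
tweets = pd.DataFrame(
    {"newspaper": ["a"], "year": [2022], "week": [34], "tweet_number": [5]}
)

# A left join keeps every row of `words`; `_merge` marks rows without a match
check = words.merge(tweets, on=["newspaper", "year", "week"], how="left", indicator=True)
dropped = check[check["_merge"] == "left_only"]
print(dropped[["newspaper", "year", "week"]])
```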


In [132]:
data_words["word_tweet_ratio"] = data_words["unique_words"]/data_words["tweet_number"]
data_words.sort_values(by='word_tweet_ratio', ascending=False)

Unnamed: 0,newspaper,unique_words,year,week,index,tweet_number,word_tweet_ratio
11,ensustrece,342,2022,34,41,24,14.25
57,ensustrece,261,2022,49,44,22,11.86
69,ensustrece,186,2023,2,45,16,11.62
33,ensustrece,215,2022,39,42,19,11.32
45,ensustrece,59,2022,44,43,6,9.83
...,...,...,...,...,...,...,...
62,Gestionpe,2421,2023,2,17,919,2.63
50,Gestionpe,2119,2022,49,16,809,2.62
4,Gestionpe,2187,2022,34,13,848,2.58
26,Gestionpe,2655,2022,39,14,1199,2.21
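The ratio normalizes vocabulary size by tweet volume. A complementary way to check whether the two are linked is a correlation coefficient. The sketch below uses only the five week-34 rows shown in the `data_words.head()` output above, so it is an illustration of the call and not a result for the full table:

```python
import pandas as pd

# The five rows printed by data_words.head() after the merge
sample = pd.DataFrame(
    {
        "newspaper": ["elcomercio_peru", "larepublica_pe", "peru21noticias", "tromepe", "Gestionpe"],
        "unique_words": [2128, 4841, 4087, 3290, 2187],
        "tweet_number": [457, 827, 961, 807, 848],
    }
)

# Pearson correlation between vocabulary size and tweet volume
r = sample["unique_words"].corr(sample["tweet_number"])
print(f"Pearson r = {r:.2f}")
```

On the real data the same call would be `data_words["unique_words"].corr(data_words["tweet_number"])`.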


In [133]:
data_words.to_csv(f'{BASE_DIR}/reports/tables/words_tweets-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.csv')

In [137]:
fig = px.scatter(
    data_words,
    x="unique_words",
    y="tweet_number",
    facet_col="week",
    color="newspaper",
    color_discrete_sequence=gruvbox_colors,
    title="Unique words per newspaper",
    width=2400,
    height=600
)

fig.show()