# Data Cleaning

## Introduction

This notebook walks through the steps used to clean and organize the collected data. It covers the following steps:

1. **Cleaning the data -** I will use text pre-processing techniques to get the data into shape.
2. **Organizing the data -** I'll organize the data in a way that is easy to feed into other algorithms.

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of texts
2. **Document-Term Matrix** - word counts in matrix format

### Problem Statement

My goal is to look at the latest headlines of the main newspapers in Perú and note similarities and differences.

In [146]:
import emoji
import json
import numpy as np
import os
import pandas as pd
import plotly.express as px
import plotly.io as pio
import re
import string
import spacy

from collections import Counter
from dotenv import load_dotenv
from itertools import product

Since most of the data we are dealing with is text, I'm going to use some common text pre-processing techniques.

I'm going to follow the MVP __(Minimum Viable Product)__ approach. The main resource I'll be using is a talk from PyOhio by [Alice Zhao](https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb). The cleaning steps I'll be taking are:

**Removing tweets that are outside of the scope**

* Tweets corresponding to the cover page announcement
* Tweets corresponding to caricature of the day
* Tweets corresponding to the horoscope
* ...

**Common data cleaning steps on all text:**

* Make text all lowercase
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**

* Stemming/lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos

In [2]:
load_dotenv()

BASE_DIR = os.environ.get("BASE_DIR")
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")

In [3]:
pd.set_option("display.max_colwidth", 300)
pd.set_option("display.max_rows", 25)
pd.set_option("display.precision", 2)
pd.set_option("display.float_format",  "{:,.2f}".format)

pio.templates.default = "plotly_white"
pio.kaleido.scope.default_scale = 2

gruvbox_colors = ["#458588", "#FABD2F", "#B8BB26", "#CC241D", "#B16286", "#8EC07C", "#FE8019"]

## Data Loading

In [4]:
with open(f'{BASE_DIR}/data/raw/newspapers_id.json', 'r') as read_file:
    newspapers_id = json.load(read_file)

In [43]:
newspaper_df_list = []
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]

for newspaper, (year, week) in product(newspapers_id, TIME_STAMPS):

    with open(f'{BASE_DIR}/data/raw/{year}w{week}_data_{newspaper}.json', 'r') as read_file:
        json_file = json.load(read_file)

    json_data = json_file["data"]

    newspaper_df = pd.json_normalize(json_data)
    newspaper_df["newspaper"] = newspaper

    newspaper_df_list.append(newspaper_df)

In [44]:
data_raw = pd.concat(newspaper_df_list)

In [45]:
data_raw["created_at"] = pd.to_datetime(data_raw["created_at"], infer_datetime_format=True)

In [46]:
data_raw.columns = data_raw.columns.str.removeprefix("public_metrics.")
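
As an illustration of where the prefix comes from: `pd.json_normalize` flattens nested objects into dotted column names. The sketch below uses a made-up record (hypothetical values, shaped like the `public_metrics` object in the raw files) and then strips the prefix the same way as the cell above.

```python
import pandas as pd

# Hypothetical record mimicking the nested structure of the raw JSON files
sample = [
    {
        "id": "1",
        "text": "titular de prueba",
        "public_metrics": {"retweet_count": 2, "like_count": 5},
    }
]

flat = pd.json_normalize(sample)
flat_columns = list(flat.columns)  # nested keys become "public_metrics.<key>"

flat.columns = flat.columns.str.removeprefix("public_metrics.")
clean_columns = list(flat.columns)

print(flat_columns)
print(clean_columns)
```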

In [47]:
data_raw.head()

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE,0,0,6,1,,elcomercio_peru,,
1,2022-08-28 23:49:59+00:00,False,1564037610393280512,“Me dijeron que estaba llevando vergüenza a la universidad”: la profesora obligada a renunciar por postear fotos en bikini https://t.co/zAe98GI7W2,0,0,5,1,,elcomercio_peru,,
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa,0,0,2,0,,elcomercio_peru,,
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7,6,7,18,1,,elcomercio_peru,,
4,2022-08-28 23:09:07+00:00,False,1564027328157683713,Essalud: realizan con éxito operativo de donación de órganos para salvar vida de siete pacientes en espera https://t.co/3sDo7q9Nuu,1,0,11,0,,elcomercio_peru,,


In [10]:
data_raw.tail()

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count
19,2023-01-10 02:37:07+00:00,False,1612639646231527424,"""Se va Tomás Unger, magnífico divulgador científico. Se jubila a los 92 años y lúcido. Yo nunca dejé de leerlo y le agradezco por todos estos años de información, de historias"". #Hildebrandt ahora en su podcast de https://t.co/ofhuiX1sZu https://t.co/UrlOl4SUVd",25,9,226,3,,ensustrece,[1612639646231527424],16189.0
20,2023-01-10 02:33:51+00:00,False,1612638821660721155,"""@peru21noticias está quebrado, es un diario inviable. Entonces, Cecilia Valenzuela, Alfredo Torres y la hermana de Gilberto Hume están haciendo una bolsa para comprarlo. Necesitan un chaleco ilustrado de la derecha. Necesitan un diario combativo y doctrinario"". #Hildebrandt https://t.co/7Xnmt3DAwO",158,29,448,10,,ensustrece,[1612638821660721155],29084.0
21,2023-01-10 02:09:05+00:00,False,1612632590606794756,"""Señora Boluarte: si usted ha decidido no gobernar, renuncie. Usted espera, espera, espera, no sé qué espera y luego autoriza la bala"". #Hildebrandt ahora en https://t.co/ofhuiX1sZu https://t.co/xamhBxUpTI",1350,288,3774,46,,ensustrece,[1612632590606794756],159409.0
22,2023-01-10 02:04:37+00:00,False,1612631466286133248,"""Los peruanos no escarmentamos, no aprendemos, siempre nos creemos por encima de todo. Van 40 muertos, señora @DinaErcilia. El gobierno no dialoga, no hace política y cuando la derecha le dice que está siendo pasivo, pide más muertos. El gobierno no dialoga, balea"". #Hildebrandt https://t.co/4VU...",1036,310,2365,41,,ensustrece,[1612631466286133248],85233.0
23,2023-01-09 20:52:45+00:00,False,1612552984340078610,"Hoy regresa el podcast de César #Hildebrandt. Va cada lunes a la 9 p.m. en https://t.co/ofhuiX1sZu. No se requiere suscripción ni pago alguno para ver la transmisión EN VIVO; para acceder al archivo de vídeos pasados, sí. https://t.co/VHjWhcGSY0",95,34,438,3,,ensustrece,[1612552984340078610],29116.0


In [11]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51235 entries, 0 to 23
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   created_at              51235 non-null  datetime64[ns, UTC]
 1   possibly_sensitive      51235 non-null  bool               
 2   id                      51235 non-null  object             
 3   text                    51235 non-null  object             
 4   retweet_count           51235 non-null  int64              
 5   reply_count             51235 non-null  int64              
 6   like_count              51235 non-null  int64              
 7   quote_count             51235 non-null  int64              
 8   referenced_tweets       5091 non-null   object             
 9   newspaper               51235 non-null  object             
 10  edit_history_tweet_ids  34073 non-null  object             
 11  impression_count        7352 non-null   floa

## First look at the data

From `info()` and a look at the head and tail of the data I can see which variables are categorical and which are numerical. Missing values only appear in the `referenced_tweets`, `edit_history_tweet_ids` and `impression_count` fields. Before the EDA, I'm going to take a look at some metrics for the numerical variables.

**Numerical variables:**

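1. `created_at`: Timestamp of the tweet
2. `retweet_count`: Number of times a tweet was retweeted
3. `reply_count`: Number of replies a tweet has
4. `like_count`: Number of likes a tweet has
5. `quote_count`: Number of times a tweet was quoted
6. `impression_count`: Number of times a tweet was seen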

**Categorical variables:**

1. `id`: Unique identifier of tweet
2. `possibly_sensitive`: Boolean variable of whether a tweet might contain sensitive information
3. `text`: Actual text of the tweet
4. `referenced_tweets`: Whether this tweet is a retweet or a quoted tweet
5. `newspaper`: Twitter handle of the newspaper the tweet belongs to
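
A quick way to confirm which fields have missing values is to count them per column. This is a minimal sketch on a made-up frame (the values are hypothetical; only the column names come from the data above).

```python
import pandas as pd

# Hypothetical miniature of the raw data
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "referenced_tweets": [None, "retweeted", None],
        "impression_count": [None, None, 120.0],
    }
)

missing = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(missing)
print(missing_share)
```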

In [48]:
fig = px.histogram(
    data_raw,
    x="created_at",
    facet_row="newspaper",
    color_discrete_sequence=gruvbox_colors,
    title="Number of tweets per newspaper",
    height=1600,
    width=1000
)

fig.update_traces(xbins_size="D1")
fig.for_each_annotation(lambda a: a.update(text=f"@{a.text.split('=')[-1]}"))

fig.show()

fig.write_image(f"{BASE_DIR}/reports/figures/1-tweeets-per-newspaper-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.jpg")

In [49]:
data_stats = pd.DataFrame()

data_stats["raw_tweet_count"] = data_raw["newspaper"].value_counts()

data_stats = data_stats.merge(data_raw.loc[data_raw["referenced_tweets"].notna(), "newspaper"].value_counts(), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"newspaper": "raw_referenced_tweet_count"}, inplace=True)

data_stats = data_stats.merge(data_raw.loc[data_raw["possibly_sensitive"] == True, "newspaper"].value_counts(), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"newspaper": "raw_possibly_sensitive_count"}, inplace=True)

data_stats = data_stats.merge(data_raw.groupby("newspaper").sum(numeric_only=True).drop("possibly_sensitive", axis=1), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"retweet_count": "raw_retweet_count", "reply_count": "raw_reply_count", "like_count": "raw_like_count", "quote_count": "raw_quote_count"}, inplace=True)

In [78]:
data_raw[data_raw["possibly_sensitive"] == True]

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count,text_clean
108,2022-10-02 03:09:17+00:00,True,1576408955529707521,Bebé recién nacido fue encontrado en la basura al sur de Bogotá https://t.co/WJCnELvkR0,4,1,12,0,,elcomercio_peru,[1576408955529707521],,bebé recién nacido fue encontrado en la basura al sur de bogotá
742,2022-12-05 21:35:28+00:00,True,1599880158810365977,Aumentan a 40 las niñas entre 11 a 14 años que ya se convirtieron en madres en la región Ica https://t.co/cP9HBb7Bkd,1,5,4,0,,diariocorreo,[1599880158810365977],,aumentan a 40 las niñas entre 11 a 14 años que ya se convirtieron en madres en la región ica
249,2022-11-04 17:16:09+00:00,True,1588580874186104832,Vasectomía gratis del 14 al 18 de noviembre en Lima: ¿afecta el deseo sexual? todo sobre este procedimiento para varones https://t.co/DQ9x0n7HBe,1,0,2,0,,diarioojo,[1588580874186104832],,vasectomía gratis del 14 al 18 de noviembre en lima afecta el deseo sexual todo sobre este procedimiento para varones


In [51]:
data_stats["raw_reference_to_tweets_ratio"] = data_stats["raw_referenced_tweet_count"]/data_stats["raw_tweet_count"]
data_stats["raw_sensitive_to_tweets_ratio"] = data_stats["raw_possibly_sensitive_count"]/data_stats["raw_tweet_count"]
data_stats["raw_retweet_to_tweets_ratio"] = data_stats["raw_retweet_count"]/data_stats["raw_tweet_count"]
data_stats["raw_reply_to_tweets_ratio"] = data_stats["raw_reply_count"]/data_stats["raw_tweet_count"]
data_stats["raw_like_to_tweets_ratio"] = data_stats["raw_like_count"]/data_stats["raw_tweet_count"]
data_stats["raw_quote_to_tweets_ratio"] = data_stats["raw_quote_count"]/data_stats["raw_tweet_count"]

I decided to look at ratios relative to the total number of tweets because, as seen in the graph above, there is a big difference in the number of tweets from each newspaper.
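
A short worked example of this normalization, using two raw totals that appear in the stats table of this notebook. The dictionary layout is just for illustration.

```python
# Raw totals taken from the stats table (tweet and retweet counts)
totals = {
    "Gestionpe": {"tweets": 8169, "retweets": 11985},
    "ensustrece": {"tweets": 130, "retweets": 21269},
}

# Retweets per published tweet, which makes the two accounts comparable
retweets_per_tweet = {
    name: counts["retweets"] / counts["tweets"] for name, counts in totals.items()
}

for name, ratio in retweets_per_tweet.items():
    print(f"{name}: {ratio:.2f} retweets per tweet")
```

Although `Gestionpe` published far more tweets, `ensustrece` gets many more retweets per tweet, which the raw totals alone would hide.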

In [52]:
data_stats.T

Unnamed: 0,Gestionpe,peru21noticias,tromepe,larepublica_pe,diarioojo,diariocorreo,elcomercio_peru,ExpresoPeru,DiarioElPeruano,elbuho_pe,larazon_pe,ensustrece
raw_tweet_count,8169.0,7571.0,7186.0,7064.0,5063.0,4841.0,4529.0,2599.0,2305.0,899.0,879.0,130.0
raw_referenced_tweet_count,30.0,244.0,3.0,4065.0,,4.0,258.0,,68.0,395.0,,24.0
raw_possibly_sensitive_count,,2.0,1.0,,2.0,1.0,3.0,,,1.0,,
raw_retweet_count,11985.0,61410.0,8924.0,137796.0,2800.0,15481.0,18388.0,90404.0,12838.0,11472.0,2477.0,21269.0
raw_reply_count,11074.0,45983.0,19886.0,28637.0,7616.0,14963.0,10625.0,82260.0,10078.0,828.0,458.0,5180.0
raw_like_count,46076.0,197619.0,56303.0,142432.0,20775.0,47437.0,40572.0,299029.0,34004.0,9241.0,4666.0,44677.0
raw_quote_count,1717.0,6378.0,1537.0,5968.0,624.0,1983.0,1616.0,9158.0,1135.0,346.0,131.0,918.0
impression_count,3312134.0,3880890.0,1606484.0,5505507.0,405654.0,1201940.0,2661255.0,3780412.0,1696783.0,406532.0,39958.0,761751.0
raw_reference_to_tweets_ratio,0.0,0.03,0.0,0.58,,0.0,0.06,,0.03,0.44,,0.18
raw_sensitive_to_tweets_ratio,,0.0,0.0,,0.0,0.0,0.0,,,0.0,,


In [53]:
data_stats.to_csv(f"{BASE_DIR}/reports/tables/1-raw_stats-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.csv")

## Removing non-relevant tweets

From checking the tweet feeds and the dataframe, I noticed that some tweets do not reflect the discourse of the newspaper, such as horoscopes, caricatures and front-page (portada) posts.

In [119]:
data = data_raw  # alias, not a copy: the in-place drops below also modify data_raw

data.drop(data[data["text"].str.contains('horóscopo diario', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('horóscopo de', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('horóscopo hoy', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('horóscopo y tarot', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('horóscopo', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('Buenos días', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('caricatura de', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('las caricaturas de', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('portada impresa', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('portada de hoy', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('en portada', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('trome gol', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('no te pierdas las chiquitas de hoy', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data[data["text"].str.contains('esta es la portada', flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data.loc[data["text"].str.contains("Aquí la portada del", flags=re.IGNORECASE, regex=True)].index, inplace=True)
data.drop(data.loc[data["text"].str.contains("yapaza", flags=re.IGNORECASE, regex=True)].index, inplace=True)
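
One caveat with dropping by index label: `data_raw` comes from `pd.concat` without resetting the index (the `info()` output shows entries `0 to 23`), so labels repeat across files and `drop` removes every row that shares a label with a match. The sketch below, on a made-up frame with a hypothetical subset of the phrases, contrasts label-based dropping with boolean-mask filtering, which removes only the matching rows.

```python
import re

import pandas as pd

# Toy frame with repeated index labels, as pd.concat produces without ignore_index=True
toy = pd.DataFrame(
    {"text": ["Horóscopo de hoy", "Noticia A", "Noticia B", "Noticia C"]},
    index=[0, 1, 0, 1],
)

# Hypothetical subset of the out-of-scope phrases, joined into one alternation
OUT_OF_SCOPE = ["horóscopo", "caricatura de", "portada impresa"]
pattern = "|".join(re.escape(phrase) for phrase in OUT_OF_SCOPE)

mask = toy["text"].str.contains(pattern, flags=re.IGNORECASE, regex=True)

dropped_by_label = toy.drop(toy[mask].index)  # also removes "Noticia B" (same label 0)
kept_by_mask = toy[~mask]                     # removes only the matching row

print(dropped_by_label)
print(kept_by_mask)
```

Resetting the index right after the concatenation (for example with `ignore_index=True`) would be another way to make label-based drops safe.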

In [55]:
data_stats = data_stats.merge(data["newspaper"].value_counts(), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"newspaper": "clean_tweet_count"}, inplace=True)

data_stats = data_stats.merge(data.loc[data["referenced_tweets"].notna(), "newspaper"].value_counts(), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"newspaper": "clean_referenced_tweet_count"}, inplace=True)

data_stats = data_stats.merge(data.loc[data["possibly_sensitive"] == True, "newspaper"].value_counts(), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"newspaper": "clean_possibly_sensitive_count"}, inplace=True)

data_stats = data_stats.merge(data.groupby("newspaper").sum(numeric_only=True).drop("possibly_sensitive", axis=1), how="left", left_index=True, right_index=True)
data_stats.rename(columns={"retweet_count": "clean_retweet_count", "reply_count": "clean_reply_count", "like_count": "clean_like_count", "quote_count": "clean_quote_count"}, inplace=True)

In [63]:
data_stats.rename(columns={"impression_count_x": "raw_impression_count", "impression_count_y": "clean_impression_count"}, inplace=True)
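
The `_x` and `_y` names come from pandas itself: when both sides of a merge have a column with the same name, the default suffixes are appended. A minimal sketch with made-up numbers, which also shows the `suffixes` parameter as an alternative to renaming afterwards:

```python
import pandas as pd

# Hypothetical per-newspaper totals before and after filtering
raw = pd.DataFrame({"impression_count": [100.0, 50.0]}, index=["a", "b"])
clean = pd.DataFrame({"impression_count": [80.0, 50.0]}, index=["a", "b"])

# Default behaviour: overlapping column names get "_x" (left) and "_y" (right)
merged = raw.merge(clean, how="left", left_index=True, right_index=True)

# Alternative: choose the suffixes up front
explicit = raw.merge(
    clean, how="left", left_index=True, right_index=True, suffixes=("_raw", "_clean")
)

print(list(merged.columns))
print(list(explicit.columns))
```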

In [64]:
data_stats["clean_to_raw_tweet_ratio"] = data_stats["clean_tweet_count"]/data_stats["raw_tweet_count"]
data_stats["clean_to_raw_impression_ratio"] = data_stats["clean_impression_count"]/data_stats["raw_impression_count"]

In [65]:
data_stats["clean_reference_to_tweets_ratio"] = data_stats["clean_referenced_tweet_count"]/data_stats["clean_tweet_count"]
data_stats["clean_sensitive_to_tweets_ratio"] = data_stats["clean_possibly_sensitive_count"]/data_stats["clean_tweet_count"]
data_stats["clean_retweet_to_tweets_ratio"] = data_stats["clean_retweet_count"]/data_stats["clean_tweet_count"]
data_stats["clean_reply_to_tweets_ratio"] = data_stats["clean_reply_count"]/data_stats["clean_tweet_count"]
data_stats["clean_like_to_tweets_ratio"] = data_stats["clean_like_count"]/data_stats["clean_tweet_count"]
data_stats["clean_quote_to_tweets_ratio"] = data_stats["clean_quote_count"]/data_stats["clean_tweet_count"]

In [68]:
data_stats = data_stats.T

In [69]:
data_stats

Unnamed: 0,Gestionpe,peru21noticias,tromepe,larepublica_pe,diarioojo,diariocorreo,elcomercio_peru,ExpresoPeru,DiarioElPeruano,elbuho_pe,larazon_pe,ensustrece
raw_tweet_count,8169.00,7571.00,7186.00,7064.00,5063.00,4841.00,4529.00,2599.00,2305.00,899.00,879.00,130.00
raw_referenced_tweet_count,30.00,244.00,3.00,4065.00,,4.00,258.00,,68.00,395.00,,24.00
raw_possibly_sensitive_count,,2.00,1.00,,2.00,1.00,3.00,,,1.00,,
raw_retweet_count,11985.00,61410.00,8924.00,137796.00,2800.00,15481.00,18388.00,90404.00,12838.00,11472.00,2477.00,21269.00
raw_reply_count,11074.00,45983.00,19886.00,28637.00,7616.00,14963.00,10625.00,82260.00,10078.00,828.00,458.00,5180.00
...,...,...,...,...,...,...,...,...,...,...,...,...
clean_sensitive_to_tweets_ratio,,,,,0.00,0.00,0.00,,,,,
clean_retweet_to_tweets_ratio,1.44,7.70,1.15,19.41,0.51,3.00,3.92,30.58,5.62,10.42,0.70,122.25
clean_reply_to_tweets_ratio,1.34,5.60,2.34,3.58,1.57,2.96,2.00,30.46,4.80,0.78,0.15,29.60
clean_like_to_tweets_ratio,5.47,24.54,6.93,18.16,4.08,9.22,8.24,101.75,16.07,6.13,1.47,277.91


In [70]:
data_stats.to_csv(f"{BASE_DIR}/reports/tables/1-clean_stats-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.csv")

## Text cleaning and tokenization

When it comes to text processing, and especially for tweets, there are some common text patterns that do not add any meaning to the message being conveyed, for example links, hashtags and tags (mentions).
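
As a small illustration of these patterns, the sketch below strips links, mentions and hashtags with three regular expressions. The function name and the sample tweet are assumptions, and this variant removes the whole mention or hashtag, which differs from the cleaning function used in this notebook.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # links
MENTION_RE = re.compile(r"@\w+")      # tags such as @user
HASHTAG_RE = re.compile(r"#\w+")      # hashtags such as #tema


def strip_twitter_markup(text: str) -> str:
    """Remove links, mentions and hashtags, then tidy up the whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_twitter_markup("Hola @user mira #tema https://t.co/abc ahora"))
```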

In [120]:
def clean_text_first_pass(text):
    """Get rid of other punctuation and non-sensical text identified.

    Args:
        text (string): text to be processed.
    """
    text = text.lower()
    text = re.sub(r"http[s]?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)  # Removes links
    text = re.sub(r"\w*\d\w*", "", text)  # Removes any token that contains a digit
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # Removes ASCII punctuation
    text = re.sub("[‘’“”…«»►¿¡|│]", "", text)  # Removes typographic quotes and other symbols
    text = re.sub(r"\n", " ", text)  # Replaces line breaks with spaces

    return text

first_pass = lambda x: clean_text_first_pass(x)

In [121]:
data["text_clean"] = data.text.apply(first_pass)
data["text_clean"]

0                                                                                                                                        venezuela y colombia retoman relaciones diplomáticas rotas hace tres años 
2                                                                                                                                                amlo afirma que familias ya aceptaron plan de rescate de  mineros 
3                                                                                                                                      zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones 
5                                                                                                                                      autoridades confirman transmisión comunitaria de viruela del mono en panamá 
7                                                                                                              las imágenes de los enfrentamientos entre

In [122]:
data[data["possibly_sensitive"] == True]

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count,text_clean,corpus
108,2022-10-02 03:09:17+00:00,True,1576408955529707521,Bebé recién nacido fue encontrado en la basura al sur de Bogotá https://t.co/WJCnELvkR0,4,1,12,0,,elcomercio_peru,[1576408955529707521],,bebé recién nacido fue encontrado en la basura al sur de bogotá,bebé recién nacido fue encontrado en la basura al sur de bogotá
742,2022-12-05 21:35:28+00:00,True,1599880158810365977,Aumentan a 40 las niñas entre 11 a 14 años que ya se convirtieron en madres en la región Ica https://t.co/cP9HBb7Bkd,1,5,4,0,,diariocorreo,[1599880158810365977],,aumentan a las niñas entre a años que ya se convirtieron en madres en la región ica,aumentan a las niñas entre a años que ya se convirtieron en madres en la región ica
249,2022-11-04 17:16:09+00:00,True,1588580874186104832,Vasectomía gratis del 14 al 18 de noviembre en Lima: ¿afecta el deseo sexual? todo sobre este procedimiento para varones https://t.co/DQ9x0n7HBe,1,0,2,0,,diarioojo,[1588580874186104832],,vasectomía gratis del al de noviembre en lima afecta el deseo sexual todo sobre este procedimiento para varones,vasectomía gratis del al de noviembre en lima afecta el deseo sexual todo sobre este procedimiento para varones


In [83]:
data.head()

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count,text_clean
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE,0,0,6,1,,elcomercio_peru,,,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa,0,0,2,0,,elcomercio_peru,,,amlo afirma que familias ya aceptaron plan de rescate de mineros
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7,6,7,18,1,,elcomercio_peru,,,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones
5,2022-08-28 22:54:58+00:00,False,1564023766937731073,Autoridades confirman transmisión comunitaria de viruela del mono en Panamá https://t.co/EBFcdrHz4Y,1,0,1,1,,elcomercio_peru,,,autoridades confirman transmisión comunitaria de viruela del mono en panamá
7,2022-08-28 22:30:25+00:00,False,1564017585561141248,Las imágenes de los enfrentamientos entre seguidores de Cristina Kirchner y la policía en Argentina https://t.co/BYalmVyPBF,3,0,8,0,,elcomercio_peru,,,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina


From checking the resulting text I found that there are some tweets that contain emojis that haven't been removed. For that I will use the `emoji` package.

In [115]:
replace_emojis = lambda x: emoji.replace_emoji(x, "")

In [116]:
data["text_clean"] = data["text_clean"].apply(replace_emojis)
data.head()

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count,text_clean,corpus
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE,0,0,6,1,,elcomercio_peru,,,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa,0,0,2,0,,elcomercio_peru,,,amlo afirma que familias ya aceptaron plan de rescate de mineros,amlo afirma que familias ya aceptaron plan de rescate de mineros
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7,6,7,18,1,,elcomercio_peru,,,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones
5,2022-08-28 22:54:58+00:00,False,1564023766937731073,Autoridades confirman transmisión comunitaria de viruela del mono en Panamá https://t.co/EBFcdrHz4Y,1,0,1,1,,elcomercio_peru,,,autoridades confirman transmisión comunitaria de viruela del mono en panamá,autoridades confirman transmisión comunitaria de viruela del mono en panamá
7,2022-08-28 22:30:25+00:00,False,1564017585561141248,Las imágenes de los enfrentamientos entre seguidores de Cristina Kirchner y la policía en Argentina https://t.co/BYalmVyPBF,3,0,8,0,,elcomercio_peru,,,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina


Also, from redoing the analysis and looking at the Twitter feeds of many of the newspapers, I found some writing patterns, like calls to action, that interact with the audience but do not add any significance to the headline. I'll keep checking the Twitter feeds for such patterns and add them to the cleaning step. I'm not adding them as stop words because these calls to action are groups of words rather than single words.
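
One way to handle such multi-word phrases is to compile them into a single pattern with word boundaries, so that a phrase is only removed when it appears as whole words. This is a minimal sketch; the phrase list is a hypothetical subset and `remove_phrases` is an illustrative name.

```python
import re

# Hypothetical subset of calls to action
CALLS_TO_ACTION = ["click aquí", "lee más", "lee aquí", "en vivo"]

# Longest phrases first, so a longer phrase wins over a shorter one that it contains
_alternatives = "|".join(
    re.escape(phrase) for phrase in sorted(CALLS_TO_ACTION, key=len, reverse=True)
)
PHRASES_RE = re.compile(r"\b(?:" + _alternatives + r")\b")


def remove_phrases(text: str) -> str:
    """Delete the listed phrases and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", PHRASES_RE.sub("", text)).strip()


print(remove_phrases("paro nacional en vivo lee más"))
print(remove_phrases("viven vivos"))  # unchanged: "en vivo" only occurs inside words
```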

In [117]:
def clean_text_second_pass(text):
    """Remove calls to action and other boilerplate phrases identified in the feeds.

    Args:
        text (string): text to be processed.
    """
    text = re.sub("click aquí", "", text)
    text = re.sub("opinión", "", text)
    text = re.sub(r"\brt\s+", "", text)  # Removes the retweet marker only when it is a whole word
    text = re.sub('lee aquí el blog de', '', text)
    text = re.sub('vía gestionpe', '', text)
    text = re.sub('entrevista exclusiva', '', text)
    text = re.sub('en vivo', '', text)
    text = re.sub('entérate más aquí', '', text)
    text = re.sub('lee la columna de', '', text)
    text = re.sub('lee y comenta', '', text)
    text = re.sub('lea hoy la columna de', '', text)
    text = re.sub('escrito por', '', text)
    text = re.sub('una nota de', '', text)
    text = re.sub('aquí la nota', '', text)
    text = re.sub('nota completa aquí', '', text)
    text = re.sub('lee más', '', text)
    text = re.sub('lee aquí', '', text)

    text = re.sub(r" {2,}", " ", text)  # Collapses any run of spaces left by the removals

    return text

second_pass = lambda x: clean_text_second_pass(x)

In [118]:
data["text_clean"] = data.text_clean.apply(second_pass)
data.head()

Unnamed: 0,created_at,possibly_sensitive,id,text,retweet_count,reply_count,like_count,quote_count,referenced_tweets,newspaper,edit_history_tweet_ids,impression_count,text_clean,corpus
0,2022-08-28 23:57:24+00:00,False,1564039479391838209,Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE,0,0,6,1,,elcomercio_peru,,,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años
2,2022-08-28 23:29:00+00:00,False,1564032331706470401,AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa,0,0,2,0,,elcomercio_peru,,,amlo afirma que familias ya aceptaron plan de rescate de mineros,amlo afirma que familias ya aceptaron plan de rescate de mineros
3,2022-08-28 23:14:11+00:00,False,1564028601053347843,Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7,6,7,18,1,,elcomercio_peru,,,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones
5,2022-08-28 22:54:58+00:00,False,1564023766937731073,Autoridades confirman transmisión comunitaria de viruela del mono en Panamá https://t.co/EBFcdrHz4Y,1,0,1,1,,elcomercio_peru,,,autoridades confirman transmisión comunitaria de viruela del mono en panamá,autoridades confirman transmisión comunitaria de viruela del mono en panamá
7,2022-08-28 22:30:25+00:00,False,1564017585561141248,Las imágenes de los enfrentamientos entre seguidores de Cristina Kirchner y la policía en Argentina https://t.co/BYalmVyPBF,3,0,8,0,,elcomercio_peru,,,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina


In [124]:
data.reset_index().to_feather(f"{BASE_DIR}/data/interim/data_clean-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

## Organizing the data

Now I need to get the data into both of the standard text formats:

1. **Corpus -** a collection of text
2. **Document-Term matrix -** word counts in matrix format

In the case of the tweets, I will start by combining all the clean texts and, for the *Document-Term Matrix*, tokenizing the result.

In [125]:
df_clean = pd.read_feather(f"{BASE_DIR}/data/interim/data_clean-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

### Corpus

The corpus corresponds to the clean data from the step above.

In [136]:
df_corpus = data[["created_at", "newspaper", "text_clean"]].reset_index()
df_corpus.rename(columns={"text_clean": "corpus"}, inplace=True)

In [137]:
df_corpus.to_feather(f"{BASE_DIR}/data/processed/corpus-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

### Document-Term Matrix

From the corpus constructed in the step above I'll proceed to tokenize the text to use with further techniques. The target layout is the one produced by scikit-learn's `CountVectorizer`, where every row represents a document and each column is a different term. In the cells below the tokens come from spaCy and the counts from `Counter`.

I'll also remove stop words.
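
To show the target layout on a small scale, here is a minimal sketch that removes stop words and builds a document-term matrix from two made-up headlines. The stop-word set is a tiny hypothetical list and the tokenizer is a plain `split`, not the spaCy pipeline used below.

```python
from collections import Counter

import pandas as pd

# Hypothetical, tiny stop-word list
STOP_WORDS = {"y", "de", "la", "que"}

corpus = [
    "venezuela y colombia retoman relaciones",
    "colombia apoya el plan de colombia",
]

# Whitespace tokenization followed by stop-word removal
docs = [[word for word in text.split() if word not in STOP_WORDS] for text in corpus]

# One row per document, one column per term; absent terms become 0
dtm = pd.DataFrame([Counter(doc) for doc in docs]).fillna(0).astype(int)

print(dtm)
```

Each `Counter` acts as one row, so pandas aligns the terms into shared columns in a single step.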

In [105]:
nlp = spacy.load('es_core_news_sm')

In [141]:
def normalize_text(text):
    """Tokenize text with spaCy and drop punctuation, stop words and whitespace tokens."""
    doc = nlp(text)
    words = [t.text for t in doc if not (t.is_punct or t.is_stop or t.is_space)]

    return words

normalize = lambda x: normalize_text(x)

In [138]:
data_dtm = pd.read_feather(f"{BASE_DIR}/data/processed/corpus-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

In [139]:
data_dtm.head()

Unnamed: 0,index,created_at,newspaper,corpus
0,0,2022-08-28 23:57:24+00:00,elcomercio_peru,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años
1,2,2022-08-28 23:29:00+00:00,elcomercio_peru,amlo afirma que familias ya aceptaron plan de rescate de mineros
2,3,2022-08-28 23:14:11+00:00,elcomercio_peru,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones
3,5,2022-08-28 22:54:58+00:00,elcomercio_peru,autoridades confirman transmisión comunitaria de viruela del mono en panamá
4,7,2022-08-28 22:30:25+00:00,elcomercio_peru,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina


In [143]:
data_dtm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34924 entries, 0 to 34923
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   index       34924 non-null  int64              
 1   created_at  34924 non-null  datetime64[ns, UTC]
 2   newspaper   34924 non-null  object             
 3   corpus      34924 non-null  object             
 4   doc         34924 non-null  object             
dtypes: datetime64[ns, UTC](1), int64(1), object(3)
memory usage: 1.3+ MB


In [142]:
data_dtm["doc"] = data_dtm["corpus"].apply(normalize)
data_dtm["doc"]

0                                                                                       [venezuela, colombia, retoman, relaciones, diplomáticas, rotas, años]
1                                                                                                 [amlo, afirma, familias, aceptaron, plan, rescate, mineros]
2                                                                                    [zelensky, ocupantes, rusos, sentirán, consecuencias, futuras, acciones]
3                                                                                   [autoridades, confirman, transmisión, comunitaria, viruela, mono, panamá]
4                                                                             [imágenes, enfrentamientos, seguidores, cristina, kirchner, policía, argentina]
                                                                                 ...                                                                         
34919                                               

In [144]:
data_dtm.head()

Unnamed: 0,index,created_at,newspaper,corpus,doc
0,0,2022-08-28 23:57:24+00:00,elcomercio_peru,venezuela y colombia retoman relaciones diplomáticas rotas hace tres años,"[venezuela, colombia, retoman, relaciones, diplomáticas, rotas, años]"
1,2,2022-08-28 23:29:00+00:00,elcomercio_peru,amlo afirma que familias ya aceptaron plan de rescate de mineros,"[amlo, afirma, familias, aceptaron, plan, rescate, mineros]"
2,3,2022-08-28 23:14:11+00:00,elcomercio_peru,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones,"[zelensky, ocupantes, rusos, sentirán, consecuencias, futuras, acciones]"
3,5,2022-08-28 22:54:58+00:00,elcomercio_peru,autoridades confirman transmisión comunitaria de viruela del mono en panamá,"[autoridades, confirman, transmisión, comunitaria, viruela, mono, panamá]"
4,7,2022-08-28 22:30:25+00:00,elcomercio_peru,las imágenes de los enfrentamientos entre seguidores de cristina kirchner y la policía en argentina,"[imágenes, enfrentamientos, seguidores, cristina, kirchner, policía, argentina]"


In [145]:
data_dtm.to_feather(f"{BASE_DIR}/data/processed/data-dtm-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")

In [195]:
frecuency_df = []

for tweet in data_dtm.itertuples(index=False, name="Tweet"):
    tweet_count = Counter(tweet.doc)
    index = pd.MultiIndex.from_arrays([[tweet.newspaper], [tweet.created_at]], names=["newspaper", "created_at"])
    df = pd.DataFrame(tweet_count, index=index)
    frecuency_df.append(df)

In [197]:
dtm = pd.concat(frecuency_df, axis=0).fillna(0)  # Stack one row per tweet; columns are the terms
dtm

In [67]:
dtm.to_pickle(f'{BASE_DIR}/data/interim/dtm.pkl')