# Topic Modeling

## Introduction

I'll run topic modeling on the newspaper headlines. I don't expect large differences between the papers; at most, a topic or two may change.

For that I'll use **Latent Dirichlet Allocation (LDA)**, a topic modeling technique, through the *Gensim* package together with *spaCy*.

As for inputs and outputs, I'll use the **Document Term Matrix (DTM)** as the input, and the output will be a list of topics for each newspaper and week.
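To make the input concrete, here is a minimal sketch of what a DTM holds, using only the standard library. The two documents and their ids are made up for illustration; the last step shows the same counts as sparse `(term_id, count)` pairs, which is the bag-of-words shape Gensim works with.

```python
from collections import Counter

# Hypothetical toy data: lemmatized headlines keyed by a made-up document id.
docs = {
    "doc_a": ["venezuela", "colombia", "retomar", "relación"],
    "doc_b": ["colombia", "elección", "elección"],
}

# Vocabulary: one column per distinct lemma, in sorted order.
vocabulary = sorted({lemma for lemmas in docs.values() for lemma in lemmas})

# Document-term matrix: one row per document, one count per vocabulary entry.
dtm_rows = {
    doc_id: [Counter(lemmas)[term] for term in vocabulary]
    for doc_id, lemmas in docs.items()
}

# The same information as sparse (term_id, count) pairs, skipping zero counts.
bow_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in dtm_rows.values()
]

print(vocabulary)
print(dtm_rows)
print(bow_corpus)
```

The real DTM in this notebook is mostly zeros, which is why it is converted to a sparse matrix before being handed to Gensim.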

In [1]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import spacy

from dotenv import load_dotenv
from itertools import product
from gensim import matutils, models
from scipy import sparse

In [2]:
load_dotenv()

BASE_DIR = os.environ.get("BASE_DIR")
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")

In [3]:
pd.set_option("display.max_colwidth", 300)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)
pd.set_option("display.precision", 2)
pd.set_option("display.float_format", "{:,.3f}".format)

pio.templates.default = "plotly_white"
pio.kaleido.scope.default_scale = 2

gruvbox_colors = ["#fabd2f", "#b8bb26", "#458588", "#fe8019", "#b16286", "#fb4934", "#689d6a", "#d79921", "#98971a", "#83a598", "#d65d0e", "#d3869b", "#cc241d", "#8ec07c", "#b57614", "#79740e", "#076678", "#af3a03", "#8f3f71", "#9d0006", "#427b58", "#fbf1c7", "#928374", "#282828"]

In [4]:
TIME_STAMPS = [(2022, 35), (2022, 40), (2022, 45), (2022, 50), (2023, 3)]

In [23]:
dtm = pd.read_pickle(f"{BASE_DIR}/data/processed/dtm-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.pkl")
stats_data = pd.read_feather(f"{BASE_DIR}/data/processed/stats_data-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.feather")
data_dtm = pd.read_pickle(f"{BASE_DIR}/data/processed/data-dtm-{TIME_STAMPS[0]}-{TIME_STAMPS[-1]}.pkl")

In [24]:
stats_data["year"] = stats_data["created_at"].dt.isocalendar().year
stats_data["week"] = stats_data["created_at"].dt.isocalendar().week

data_dtm["year"] = data_dtm["created_at"].dt.isocalendar().year
data_dtm["week"] = data_dtm["created_at"].dt.isocalendar().week

In [25]:
data_dtm["year_week"] = data_dtm["year"].astype("str") + "w" + data_dtm["week"].astype("str")
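
The `year` and `week` columns follow the ISO calendar, so a date near a year boundary can belong to a different ISO year than its calendar year. The sketch below reproduces the label format with the standard library; `year_week_label` is a hypothetical helper, not part of the notebook.

```python
from datetime import date


def year_week_label(day: date) -> str:
    """Hypothetical helper mirroring the "<ISO year>w<ISO week>" labels built above."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}w{iso_week}"


# Sunday 28 August 2022 closes ISO week 34, matching the rows shown in this notebook.
print(year_week_label(date(2022, 8, 28)))  # 2022w34

# Sunday 1 January 2023 still belongs to the last ISO week of 2022.
print(year_week_label(date(2023, 1, 1)))  # 2022w52
```

This is why both values are taken from `isocalendar()` rather than mixing the calendar year with the ISO week.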

In [26]:
newspapers = data_dtm["newspaper"].unique()
year_week = data_dtm["year_week"].unique()

In [27]:
data_dtm.head()

Unnamed: 0,id,created_at,newspaper,corpus,doc,token,lemma,year,week,year_week
0,1564039479391838209,2022-08-28 23:57:24+00:00,elcomercio_peru,venezuela colombia retoman relaciones diplomáticas rotas hace tres años,"(venezuela, colombia, retoman, relaciones, diplomáticas, rotas, hace, tres, años)","[venezuela, colombia, retoman, relaciones, diplomáticas, rotas, años]","[venezuela, colombia, retomar, relación, diplomático, roto, año]",2022,34,2022w34
1,1564032331706470401,2022-08-28 23:29:00+00:00,elcomercio_peru,amlo afirma que familias ya aceptaron plan de rescate de mineros,"(amlo, afirma, que, familias, ya, aceptaron, plan, de, rescate, de, mineros)","[amlo, afirma, familias, aceptaron, plan, rescate, mineros]","[amlo, afirmar, familia, aceptar, plan, rescate, minero]",2022,34,2022w34
2,1564028601053347843,2022-08-28 23:14:11+00:00,elcomercio_peru,zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones,"(zelensky, los, ocupantes, rusos, sentirán, las, consecuencias, de, futuras, acciones)","[zelensky, ocupantes, rusos, sentirán, consecuencias, futuras, acciones]","[zelensky, ocupante, ruso, sentir, consecuencia, futuro, acción]",2022,34,2022w34
3,1564023766937731073,2022-08-28 22:54:58+00:00,elcomercio_peru,autoridades confirman transmisión comunitaria de viruela del mono en panamá,"(autoridades, confirman, transmisión, comunitaria, de, viruela, del, mono, en, panamá)","[autoridades, confirman, transmisión, comunitaria, viruela, mono, panamá]","[autoridad, confirmar, transmisión, comunitario, viruela, mono, panamá]",2022,34,2022w34
4,1564017585561141248,2022-08-28 22:30:25+00:00,elcomercio_peru,las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina,"(las, imágenes, de, los, enfrentamientos, entre, seguidores, de, cristina, kirchner, la, policía, en, argentina)","[imágenes, enfrentamientos, seguidores, cristina, kirchner, policía, argentina]","[imagen, enfrentamiento, seguidor, cristina, kirchner, policía, argentina]",2022,34,2022w34


In [28]:
data_dtm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34851 entries, 0 to 34923
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   id          34851 non-null  object             
 1   created_at  34851 non-null  datetime64[ns, UTC]
 2   newspaper   34851 non-null  object             
 3   corpus      34851 non-null  object             
 4   doc         34851 non-null  object             
 5   token       34851 non-null  object             
 6   lemma       34851 non-null  object             
 7   year        34851 non-null  UInt32             
 8   week        34851 non-null  UInt32             
 9   year_week   34851 non-null  object             
dtypes: UInt32(2), datetime64[ns, UTC](1), object(7)
memory usage: 2.7+ MB


After reading **Gensim's documentation**, I noticed that lemmatizing the words gives better results, so I'll rebuild the DTM using **spaCy**.
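
One way to see why lemmatization can help: inflected forms that would otherwise occupy separate DTM columns collapse into a single one. The sketch below uses a small hand-written lookup table as a stand-in for spaCy's lemmatizer (an assumption for illustration only; the real lemmas come from a spaCy pipeline).

```python
from collections import Counter

# Toy lookup table standing in for spaCy's lemmatizer (illustrative assumption).
TOY_LEMMAS = {
    "relaciones": "relación",
    "relación": "relación",
    "rusos": "ruso",
    "ruso": "ruso",
    "años": "año",
    "año": "año",
}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


tokens = ["relaciones", "relación", "rusos", "ruso", "años"]

raw_counts = Counter(tokens)
lemma_counts = Counter(lemmatize(tokens))

print(len(raw_counts))    # 5 distinct columns before lemmatization
print(len(lemma_counts))  # 3 distinct columns afterwards
print(lemma_counts)
```

Fewer, denser columns give the topic model more evidence per term, which matters for short texts such as headlines.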

In [29]:
dtm.head()

lemma,aa,aaaaatención,aaaatención,aactor,aafp,aaguinagar,aahh,aap,aar,abad,abajo,abancay,abanderar,abandona,abandonado,abandonar,abandono,abandón,abantir,abanto,abarcar,abarrot,abarrotar,abastecer,abastecer él,...,ünsal,​chocolate,​mantequilla,⃣,→,↓,─,⦿,𝗔𝘂𝗱𝗶𝘁𝗼𝗿𝗶𝗼,𝗖𝗮𝘁𝗮́𝗹𝗼𝗴𝗼,𝗗𝗲𝗹,𝗘𝗱𝗶𝘁𝗼𝗿𝗮,𝗣𝗮𝘁𝗿𝗶𝗰𝗶𝗮,𝗣𝗲𝗿𝘂́,𝗦𝗮́𝗯𝗮𝗱𝗼,𝗨́𝗻𝗲𝘁𝗲,𝗩𝗮𝗹𝗹𝗲,𝗱𝗲,𝗱𝗲𝗹,𝗲𝗱𝗶𝘁𝗼𝗿𝗶𝗮𝗹,𝗵𝗿𝘀,𝗹𝗮,𝗻𝗼𝘃𝗶𝗲𝗺𝗯𝗿𝗲,𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝗰𝗶𝗼́𝗻,󠁧󠁢󠁥󠁮󠁧󠁿
id
1558966707611385861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1558966968039997441,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1558967193043361792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1558967616777109510,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1558968396674473985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
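# Despite its name, this dict maps "<newspaper>-<year_week>" to a fitted LDA model.
newspapers_dtm = {}

# The vocabulary is shared by every slice, so build the id -> lemma mapping once.
id2word = dict(enumerate(dtm.columns))

for newspaper, period in product(newspapers, year_week):
    # Tweet ids published by this newspaper during this ISO week.
    mask = (data_dtm["newspaper"] == newspaper) & (data_dtm["year_week"] == period)
    ids = data_dtm.loc[mask, "id"].tolist()

    # Sparse2Corpus expects terms as rows and documents as columns by default.
    tweet_data = dtm.T[ids]
    sparse_dtm = sparse.csc_matrix(tweet_data)
    tweet_corpus = matutils.Sparse2Corpus(sparse_dtm)

    newspapers_dtm[f"{newspaper}-{period}"] = models.LdaModel(corpus=tweet_corpus, id2word=id2word, num_topics=4, passes=20)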

In [59]:
newspaper_lda = pd.DataFrame.from_dict(newspapers_dtm, orient="index")

In [60]:
newspaper_lda.reset_index(inplace=True)
newspaper_lda.rename({0: "lda_model"}, axis=1, inplace=True)

In [61]:
newspaper_lda[["newspaper", "year_week"]] = newspaper_lda["index"].str.split("-", expand=True)

In [62]:
newspaper_lda.drop("index", axis=1, inplace=True)

In [63]:
newspaper_lda["topics"] = newspaper_lda["lda_model"].apply(lambda lda: lda.print_topics())
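
Each entry in the `topics` column is a `(topic_id, string)` pair whose string has the form `weight*"word" + weight*"word" + ...`. If the words and weights are needed as data, the string can be parsed with a regular expression. The sketch below is one way to do that; `parse_topic` is a hypothetical helper, and the example string is shortened from the output of this notebook.

```python
import re

# Matches one `weight*"word"` term inside a formatted topic string.
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Hypothetical helper: split a topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


example = '0.003*"agosto" + 0.002*"cambio" + 0.002*"tipo" + 0.002*"dólar"'
print(parse_topic(example))
# [('agosto', 0.003), ('cambio', 0.002), ('tipo', 0.002), ('dólar', 0.002)]
```

Gensim can also return the pairs directly through `show_topics(formatted=False)`, which avoids parsing at the cost of the rounded, readable strings.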

In [64]:
# One column per topic; the models above were fitted with num_topics=4.
for topic_index in range(4):
    newspaper_lda[f"topic_{topic_index + 1}"] = newspaper_lda["topics"].apply(lambda topics, i=topic_index: topics[i])

In [65]:
newspaper_lda.head()

Unnamed: 0,lda_model,newspaper,year_week,topics,topic_1,topic_2,topic_3,topic_4
0,"LdaModel<num_terms=24473, num_topics=4, decay=0.5, chunksize=2000>",elcomercio_peru,2022w34,"[(0, 0.003*""agosto"" + 0.002*""cambio"" + 0.002*""tipo"" + 0.002*""dólar"" + 0.001*""lima"" + 0.001*""colombia"" + 0.001*""alimentario"" + 0.001*""bono"" + 0.001*""año"" + 0.001*""venezuela""), (1, 0.001*""caso"" + 0.001*""país"" + 0.001*""año"" + 0.001*""eeuu"" + 0.001*""dejar"" + 0.001*""ucrania"" + 0.001*""mono"" + 0.001*""co...","(0, 0.003*""agosto"" + 0.002*""cambio"" + 0.002*""tipo"" + 0.002*""dólar"" + 0.001*""lima"" + 0.001*""colombia"" + 0.001*""alimentario"" + 0.001*""bono"" + 0.001*""año"" + 0.001*""venezuela"")","(1, 0.001*""caso"" + 0.001*""país"" + 0.001*""año"" + 0.001*""eeuu"" + 0.001*""dejar"" + 0.001*""ucrania"" + 0.001*""mono"" + 0.001*""covid"" + 0.001*""incendio"" + 0.001*""informar"")","(2, 0.002*""perú"" + 0.002*""covid"" + 0.001*""contagio"" + 0.001*""morir"" + 0.001*""perucheck"" + 0.001*""mono"" + 0.001*""demanda"" + 0.001*""reportar"" + 0.001*""año"" + 0.001*""policía"")","(3, 0.003*""agosto"" + 0.002*""perú"" + 0.002*""millón"" + 0.002*""peruano"" + 0.001*""mes"" + 0.001*""pasar"" + 0.001*""ruso"" + 0.001*""rusia"" + 0.001*""mundo"" + 0.001*""nuclear"")"
1,"LdaModel<num_terms=24473, num_topics=4, decay=0.5, chunksize=2000>",elcomercio_peru,2022w33,"[(0, 0.004*""agosto"" + 0.003*""dólar"" + 0.003*""cambio"" + 0.003*""tipo"" + 0.002*""precio"" + 0.002*""muerto"" + 0.002*""ataque"" + 0.001*""perú"" + 0.001*""méxico"" + 0.001*""dejar""), (1, 0.003*""perú"" + 0.003*""covid"" + 0.002*""año"" + 0.002*""mil"" + 0.002*""millón"" + 0.002*""caso"" + 0.002*""mono"" + 0.001*""contagio"" ...","(0, 0.004*""agosto"" + 0.003*""dólar"" + 0.003*""cambio"" + 0.003*""tipo"" + 0.002*""precio"" + 0.002*""muerto"" + 0.002*""ataque"" + 0.001*""perú"" + 0.001*""méxico"" + 0.001*""dejar"")","(1, 0.003*""perú"" + 0.003*""covid"" + 0.002*""año"" + 0.002*""mil"" + 0.002*""millón"" + 0.002*""caso"" + 0.002*""mono"" + 0.001*""contagio"" + 0.001*""lima"" + 0.001*""persona"")","(2, 0.001*""año"" + 0.001*""crecer"" + 0.001*""país"" + 0.001*""rusia"" + 0.001*""millón"" + 0.001*""eeuu"" + 0.001*""bono"" + 0.001*""pedir"" + 0.001*""junio"" + 0.001*""chile"")","(3, 0.002*""año"" + 0.001*""agosto"" + 0.001*""pasar"" + 0.001*""millón"" + 0.001*""pedir"" + 0.001*""mundo"" + 0.001*""efemérid"" + 0.001*""perú"" + 0.001*""trump"" + 0.001*""militar"")"
2,"LdaModel<num_terms=24473, num_topics=4, decay=0.5, chunksize=2000>",elcomercio_peru,2022w39,"[(0, 0.003*""elección"" + 0.002*""candidato"" + 0.001*""lima"" + 0.001*""resultado"" + 0.001*""túdecid"" + 0.001*""alcaldía"" + 0.001*""región"" + 0.001*""electoral"" + 0.001*""onpe"" + 0.001*""conocer""), (1, 0.002*""renezp"" + 0.002*""lima"" + 0.001*""huracán"" + 0.001*""iar"" + 0.001*""millón"" + 0.001*""mínimo"" + 0.001*""m...","(0, 0.003*""elección"" + 0.002*""candidato"" + 0.001*""lima"" + 0.001*""resultado"" + 0.001*""túdecid"" + 0.001*""alcaldía"" + 0.001*""región"" + 0.001*""electoral"" + 0.001*""onpe"" + 0.001*""conocer"")","(1, 0.002*""renezp"" + 0.002*""lima"" + 0.001*""huracán"" + 0.001*""iar"" + 0.001*""millón"" + 0.001*""mínimo"" + 0.001*""mil"" + 0.001*""año"" + 0.001*""peruano"" + 0.001*""senamhi"")","(2, 0.004*""perú"" + 0.003*""setiembre"" + 0.002*""precio"" + 0.002*""elección"" + 0.002*""cambio"" + 0.002*""tipo"" + 0.002*""lima"" + 0.001*""consulta"" + 0.001*""bolsonaro"" + 0.001*""dólar"")","(3, 0.003*""rusia"" + 0.002*""ucrania"" + 0.002*""huracán"" + 0.002*""ucraniano"" + 0.001*""ruso"" + 0.001*""florida"" + 0.001*""anexión"" + 0.001*""ian"" + 0.001*""referendo"" + 0.001*""muerto"")"
3,"LdaModel<num_terms=24473, num_topics=4, decay=0.5, chunksize=2000>",elcomercio_peru,2022w44,"[(0, 0.002*""ucrania"" + 0.002*""lula"" + 0.002*""rusia"" + 0.001*""brasil"" + 0.001*""alimentario"" + 0.001*""bono"" + 0.001*""guerra"" + 0.001*""silva"" + 0.001*""exigir"" + 0.001*""golpe""), (1, 0.001*""covid"" + 0.001*""elección"" + 0.001*""brasil"" + 0.001*""contagio"" + 0.001*""policía"" + 0.001*""eeuu"" + 0.001*""hora"" +...","(0, 0.002*""ucrania"" + 0.002*""lula"" + 0.002*""rusia"" + 0.001*""brasil"" + 0.001*""alimentario"" + 0.001*""bono"" + 0.001*""guerra"" + 0.001*""silva"" + 0.001*""exigir"" + 0.001*""golpe"")","(1, 0.001*""covid"" + 0.001*""elección"" + 0.001*""brasil"" + 0.001*""contagio"" + 0.001*""policía"" + 0.001*""eeuu"" + 0.001*""hora"" + 0.001*""lula"" + 0.001*""millón"" + 0.001*""fed"")","(2, 0.002*""corea"" + 0.002*""noviembre"" + 0.002*""perú"" + 0.001*""tipo"" + 0.001*""dólar"" + 0.001*""norte"" + 0.001*""cambio"" + 0.001*""misil"" + 0.001*""sur"" + 0.001*""elección"")","(3, 0.003*""noviembre"" + 0.002*""lima"" + 0.001*""perú"" + 0.001*""brasil"" + 0.001*""conocer"" + 0.001*""millón"" + 0.001*""año"" + 0.001*""crecer"" + 0.001*""octubre"" + 0.001*""gobierno"")"
4,"LdaModel<num_terms=24473, num_topics=4, decay=0.5, chunksize=2000>",elcomercio_peru,2022w49,"[(0, 0.002*""boluarte"" + 0.002*""gobierno"" + 0.002*""presidenta"" + 0.001*""covid"" + 0.001*""pedro"" + 0.001*""castillo"" + 0.001*""año"" + 0.001*""perú"" + 0.001*""rusia"" + 0.001*""unidos""), (1, 0.002*""castillo"" + 0.002*""pedro"" + 0.002*""golpe"" + 0.002*""perú"" + 0.001*""congreso"" + 0.001*""ucrania"" + 0.001*""eeuu""...","(0, 0.002*""boluarte"" + 0.002*""gobierno"" + 0.002*""presidenta"" + 0.001*""covid"" + 0.001*""pedro"" + 0.001*""castillo"" + 0.001*""año"" + 0.001*""perú"" + 0.001*""rusia"" + 0.001*""unidos"")","(1, 0.002*""castillo"" + 0.002*""pedro"" + 0.002*""golpe"" + 0.002*""perú"" + 0.001*""congreso"" + 0.001*""ucrania"" + 0.001*""eeuu"" + 0.001*""sur"" + 0.001*""anunciar"" + 0.001*""bloqueo"")","(2, 0.004*""diciembre"" + 0.003*""perú"" + 0.002*""precio"" + 0.002*""dólar"" + 0.002*""tipo"" + 0.002*""cambio"" + 0.001*""gobierno"" + 0.001*""constitucional"" + 0.001*""policía"" + 0.001*""conocer"")","(3, 0.003*""castillo"" + 0.003*""pedro"" + 0.001*""diciembre"" + 0.001*""lima"" + 0.001*""golpe"" + 0.001*""covid"" + 0.001*""mil"" + 0.001*""rechazar"" + 0.001*""editorial"" + 0.001*""millón"")"
