# Topic Modeling

## Introduction

I'll run topic modeling on the newspaper headlines. I don't expect to find much difference between the papers; at most a topic or two may change.

For that I'll use **Latent Dirichlet Allocation (LDA)**, a topic modeling technique, with the *Gensim* package and *spaCy*.

As for inputs and outputs, I'll work with the **Document Term Matrix (DTM)** as the input, and the output will be a list of topics for each newspaper.

In [1]:
import os
import pickle
import pandas as pd

from dotenv import load_dotenv

load_dotenv()

BASE_DIR = os.getenv("BASE_DIR")

In [8]:
# Load the document-term matrix and transpose it, so terms are rows and documents are columns
tweet_data = pd.read_pickle(f"{BASE_DIR}/data/processed/dtm_stop.pkl").T

In [6]:
from gensim import matutils, models
from scipy import sparse

In [7]:
# Sparse2Corpus treats columns as documents by default, which matches the transposed matrix
sparse_dtm = sparse.csc_matrix(tweet_data)
tweet_corpus = matutils.Sparse2Corpus(sparse_dtm)

In [14]:
# Map each term id (row position in the transposed matrix) to its word
id2word = dict(enumerate(tweet_data.index))

In [20]:
# Fit LDA with 4 topics and 10 passes over the corpus, then show the top words per topic
lda = models.LdaModel(corpus=tweet_corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.001*"trump" + 0.001*"presidente" + 0.001*"país" + 0.001*"alianza" + 0.001*"salud" + 0.001*"uu" + 0.001*"ee" + 0.001*"personas" + 0.001*"capitolio" + 0.001*"personas"'),
 (1,
  '0.003*"normaslegales" + 0.003*"nacional" + 0.002*"presidente" + 0.002*"ley" + 0.002*"salud" + 0.002*"navidad" + 0.002*"minsaperu" + 0.002*"país" + 0.002*"francisco" + 0.002*"pandemia"'),
 (2,
  '0.003*"sagasti" + 0.003*"luis" + 0.002*"vizcarra" + 0.002*"ley" + 0.002*"josé" + 0.002*"francisco" + 0.002*"garcía" + 0.002*"presidente" + 0.001*"césar" + 0.001*"martín"'),
 (3,
  '0.002*"eeuu" + 0.002*"covid" + 0.001*"ley" + 0.001*"mundo" + 0.001*"pandemia" + 0.001*"libertad" + 0.001*"casos" + 0.001*"trump" + 0.001*"personas" + 0.001*"coronavirusenperú"')]
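
The topic strings above follow a `weight*"word"` pattern, which is hard to compare by eye. Here is a minimal sketch of how they could be turned into `(word, weight)` pairs to check which words two topics share. The helper name `parse_topic` is hypothetical, and the two strings are short excerpts copied from topics 1 and 2 of the output above.

```python
import re

# Matches one term of a printed topic, e.g. 0.003*"sagasti"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Hypothetical helper: turn a printed topic into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Excerpts of topics 1 and 2 from the output above
topic_1 = '0.003*"normaslegales" + 0.003*"nacional" + 0.002*"presidente" + 0.002*"ley"'
topic_2 = '0.003*"sagasti" + 0.003*"luis" + 0.002*"vizcarra" + 0.002*"ley"'

pairs_1 = parse_topic(topic_1)
pairs_2 = parse_topic(topic_2)

# Words that appear in both excerpts
shared_words = {word for word, _ in pairs_1} & {word for word, _ in pairs_2}
print(pairs_2)
print(shared_words)
```

When the fitted model is still in memory, `lda.show_topic(topic_id)` returns the same kind of `(word, weight)` pairs directly, so parsing is only needed when working from the printed text.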

Overall, the main themes that emerge from the results are:
* The USA and the assault on the Capitol
* Christmas and the country's legal situation regarding the pandemic, most likely due to the uncertainty around the holidays
* People's names
* Covid and the pandemic

Next, I'll run the same procedure per paper to see what comes up.
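
A rough sketch of what the per-paper procedure could look like, under some assumptions that are not confirmed by the notebook so far: the DTM has one row per headline (documents as rows, terms as columns), and a hypothetical `paper_of_row` Series labels each row with its newspaper. The helper names `dtm_to_bow` and `topics_per_paper` are made up for illustration. Gensim accepts a corpus given as a list of `(term_id, count)` lists, so the sparse matrix step is skipped here.

```python
import pandas as pd


def dtm_to_bow(dtm):
    """Convert a documents-by-terms DataFrame into a bag-of-words corpus.

    Returns a list with one entry per document, each a list of
    (term_id, count) pairs, plus the id-to-word mapping.
    """
    id2word = dict(enumerate(dtm.columns))
    corpus = [
        [(term_id, int(count)) for term_id, count in enumerate(row) if count > 0]
        for row in dtm.to_numpy()
    ]
    return corpus, id2word


def topics_per_paper(dtm, paper_of_row, num_topics=4, passes=10):
    """Fit one LDA model per newspaper (assumes paper_of_row is aligned with dtm.index)."""
    from gensim import models  # imported here so dtm_to_bow works without gensim

    topics = {}
    for paper, rows in dtm.groupby(paper_of_row):
        corpus, id2word = dtm_to_bow(rows)
        lda = models.LdaModel(
            corpus=corpus, id2word=id2word, num_topics=num_topics, passes=passes
        )
        topics[paper] = lda.print_topics()
    return topics


# Toy DTM (invented counts) to show the corpus format
toy_dtm = pd.DataFrame(
    [[2, 0, 1], [0, 3, 0], [1, 1, 0]],
    columns=["ley", "covid", "salud"],
)
toy_corpus, toy_id2word = dtm_to_bow(toy_dtm)
print(toy_corpus)
```

If the real DTM has a single row per newspaper instead, each paper's text would first need to be split into smaller documents, since LDA has little to work with when fitted on one document.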