# Topic modelling of tweets about covid19

In this first exploratory work, we aim to find and visualize the main topics in a subset of our corpus of tweets about covid19. We apply unsupervised learning to detect emerging topics for our study of the social narratives of the pandemic.

In [2]:
corpus_label = 'dhcovid_texts_2020-04-25_es_all' # We select all the tweets in Spanish for April 25th.

with open(corpus_label+'.txt', 'r') as fi:
    tweets_spa_un_dia=fi.readlines()

In [3]:
with open('stopwords-spa.lst', 'r') as s: # Generic Spanish stopwords
    stop_words = s.read()
    
with open("stopwords-spa.extra.lst") as s: # This list is modified according to our corpus
    stop_words += s.read()

stop_words = stop_words.split()

Predictably, stopwords are of no interest for topic modelling, so we filter them out. Based on the results, we also update the list of stopwords to eliminate noise (e.g., 'retwitt') or words that are too obvious (e.g., 'covid19').
Topic modelling focuses on nouns, since this is the category that prototypically carries most of the information load of a text. We use spaCy for part-of-speech detection, but we disable the components of the nlp pipeline that we don't use in order to reduce processing time.

In [4]:
import spacy
from tqdm import tqdm

nlp = spacy.load('es_core_news_sm')

nlp_tweets_spa_un_dia = [nlp(tweet, disable=["parser", "ner", "textcat"]) for tweet in tqdm(tweets_spa_un_dia)]

100%|██████████| 50459/50459 [04:21<00:00, 193.21it/s]


In [5]:
from pprint import pprint
nouns_spa_un_dia = []

# We generate a list of tweets where each tweet is a list containing the nouns of the tweet only
for tweet_doc in tqdm(nlp_tweets_spa_un_dia):
    this_tweet_nouns = []
    
    for word in tweet_doc:
        if word.text not in stop_words and word.pos_=='NOUN':
            this_tweet_nouns.append(word.text)
            
    nouns_spa_un_dia.append(this_tweet_nouns)
        
for tweet_nouns in nouns_spa_un_dia[:15]:
    print("*", " ".join(tweet_nouns))
    

100%|██████████| 50459/50459 [00:07<00:00, 6600.39it/s]

* cifra casos coronavirus paso muertes
* bebe meses bombero
* exvicepresidente declaraciones uso desinfectante pacientes
* campaña trabajo personal linea combate comunidad
* primavera verano
* olvides medidas
* muerte reconfiguracion orden
* banqueros cuarentena coactivas embargos lugar infierno guayaquilenemergencia
* cifras pais compra urgencia ventiladores
* asistente
* lumbreras aqui
* aire ganas estupido enfermedad neta
* vecinos cabeza mierda beisbol mesa hija vecina urbanizacion total policia edificio hija vecina caso
* vivo par menopausia
* india ejemplo apertura comercios





In the previous step, we should also lemmatize in order to reduce the number of word forms. However, we observed that lemmatization with spaCy was not satisfactory for this corpus. We will try other lemmatizers in the future.
Other helpful processing steps that we plan to add here are finding bigrams and removing rare words and overly common words.
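
As a rough illustration of the planned removal of rare and common words, the sketch below filters tokens by document frequency in plain Python. The function name `filter_by_document_frequency`, its thresholds and the toy tweets are assumptions made for this example, not part of our pipeline; Gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)` offers comparable filtering directly on the dictionary.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_docs=2, max_doc_fraction=0.5):
    """Drop tokens that appear in too few or too many documents.

    min_docs and max_doc_fraction are illustrative thresholds (assumptions).
    """
    # Count in how many documents each token appears (not total occurrences)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    n_docs = len(tokenized_docs)
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    }
    return [[token for token in doc if token in keep] for doc in tokenized_docs]


# Toy data, invented for the example
toy_tweets = [
    ["casos", "muertes", "cifra"],
    ["casos", "medidas"],
    ["casos", "muertes"],
    ["casos", "verano"],
]
filtered = filter_by_document_frequency(toy_tweets)
print(filtered)
```

Here 'casos' is dropped because it appears in every toy tweet, and the words that appear in a single tweet are dropped as rare.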

Now we will process our tweets with Gensim, a Python library for unsupervised topic modelling.

In [None]:
import gensim
from gensim import corpora

dictionary = corpora.Dictionary(nouns_spa_un_dia)
doc_term_matrix = [dictionary.doc2bow(tweet_nouns) for tweet_nouns in nouns_spa_un_dia]
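
For readers unfamiliar with the bag-of-words representation built in the cell above, the following plain-Python sketch mimics the idea: each distinct noun gets an integer id, and each tweet becomes a sorted list of `(token_id, count)` pairs. The helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and ids are assigned in first-seen order here, which is an assumption of this sketch and may differ from the ids Gensim assigns.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in first-seen order (assumption)."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy data, invented for the example
toy_tweets = [
    ["vecinos", "hija", "vecina", "hija", "vecina"],
    ["primavera", "verano"],
]
vocab = build_vocabulary(toy_tweets)
toy_matrix = [to_bag_of_words(doc, vocab) for doc in toy_tweets]
print(toy_matrix)
```

Word order is discarded in this representation; only the counts per tweet are passed on to the model.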

We train LDA models for several numbers of topics and save each model.

In [None]:
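import logging
import os

from gensim.models.ldamodel import LdaModel

log_format = '%(asctime)s : %(levelname)s : %(message)s'
logging.basicConfig(format=log_format, level=logging.INFO)

os.makedirs("modelos_gensim", exist_ok=True)  # Folder for the saved models and logs

modelos_LDA = {}
topic_numbers_to_try = range(3, 15)

for n in topic_numbers_to_try:

    # basicConfig only takes effect on its first call, so we attach a
    # dedicated file handler to write the training log of each model
    log_handler = logging.FileHandler(f"modelos_gensim/gensim.LDA_model_n{n:02}.log")
    log_handler.setFormatter(logging.Formatter(log_format))
    logging.getLogger().addHandler(log_handler)

    print(f"Entrenando LDA model con {n} topics")
    modelo = LdaModel(
        doc_term_matrix,
        num_topics=n,
        id2word=dictionary,
        passes=20,
        eval_every=1,
        iterations=50
    )

    archivo_destino = f"modelos_gensim/modelo_gensim_ntopics{n:02}"
    print(f"Guardando en {archivo_destino}")
    modelo.save(archivo_destino)
    modelos_LDA[n] = modelo

    # Stop writing to this model's log file before training the next one
    logging.getLogger().removeHandler(log_handler)
    log_handler.close()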

Once the training is finished, we can use pyLDAvis to visualize the saved models:

In [63]:
from gensim.models.ldamodel import LdaModel
from gensim import corpora
import pyLDAvis.gensim
pyLDAvis.enable_notebook()


def cargar_modelo_LDA(ntopics):
    model_fp = f'modelos_gensim/modelo_gensim_ntopics{ntopics:02}'
    return LdaModel.load(model_fp)

def cargar_modelo_y_graficar(ntopics, tweets_procesados, verbose=False):
    '''
    ntopics: number of topics
    tweets_procesados: list of tweets, where each tweet is a list of strings
    '''
    ldamodel = cargar_modelo_LDA(ntopics)
    
    if verbose:
        print(f"** Modelo con #topics = {ntopics} **")
        for i in range(ntopics):
            topic_words_and_weights = ldamodel.show_topic(i)
            topic_words = " ".join([word for word, weight in topic_words_and_weights])
            print(" * ", topic_words)
        print()
    
    dictionary = corpora.Dictionary(tweets_procesados)
    doc_term_matrix = [dictionary.doc2bow(tweet_terms) for tweet_terms in tweets_procesados]
    return pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)

Let's visualize the model with 10 topics:

In [31]:
cargar_modelo_y_graficar(10, nouns_spa_un_dia) 

We can see that 10 topics is not a good choice, since there is a lot of topic overlap, as shown in the lower left area of the graphic. Let's try visualizing a lower number of topics:

In [57]:
cargar_modelo_y_graficar(4, nouns_spa_un_dia)

There is still some topic overlap. Let's try 5 topics:

In [58]:
cargar_modelo_y_graficar(5, nouns_spa_un_dia)

This is a good number of topics. However, in the future we should calculate the topic coherence for each model so that we can obtain the best number of topics automatically.
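
To give an idea of what such a coherence score measures, here is a minimal plain-Python sketch of a UMass-style coherence, which rewards topics whose top words tend to appear together in the same tweets. The function name `umass_coherence`, the averaging over word pairs and the toy tweets are assumptions made for this example; in practice, Gensim's `CoherenceModel` (for instance with `coherence='u_mass'`) could be applied to each saved model instead.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """Mean of log((D(w_later, w_earlier) + 1) / D(w_earlier)) over ordered word pairs.

    top_words is assumed to be sorted from most to least probable in the topic.
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            df_earlier = doc_freq(top_words[j])
            if df_earlier == 0:
                continue  # Word absent from the corpus: skip the pair
            co_freq = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((co_freq + 1) / df_earlier))

    return sum(scores) / len(scores) if scores else float("nan")


# Toy data, invented for the example
toy_tweets = [
    ["casos", "muertes", "cifra"],
    ["casos", "muertes"],
    ["primavera", "verano"],
    ["casos", "cifra"],
]
coherent_topic = umass_coherence(["casos", "muertes"], toy_tweets)
mixed_topic = umass_coherence(["casos", "verano"], toy_tweets)
print(coherent_topic, mixed_topic)
```

Scores closer to zero indicate words that co-occur more often, so the number of topics could be chosen by comparing the average score of the topics of each saved model.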

We can save the graphic of the best model for our corpus, the 5-topic model, to share it in other applications, such as our project website:

In [70]:
pyLDAvis.save_html(cargar_modelo_y_graficar(5, nouns_spa_un_dia), 'modelos_gensim/modelo05.token-vis.html')

We can try other visualizations, like a circle pack plot, which gives us a good idea of the main words of each topic at a glance:

In [69]:
import pandas as pd


def hacer_csv_para_rawgraphsio(modelo, out_fp, top_n=10):
    records = []
    
    for i in range(modelo.num_topics):
        for word, weight in modelo.show_topic(i, topn=top_n):
            records.append((i, word, weight))

    df = pd.DataFrame(records, columns=["topic_id", "word", "weight"])
    df = df.groupby("topic_id").head(top_n).reset_index(drop=True)
    print(f"Escribiendo CSV en: {out_fp}")
    df.to_csv(out_fp, index=False)
    

modelo = cargar_modelo_LDA(ntopics=5)
fp = "modelos_gensim/modelo05.token.csv"
hacer_csv_para_rawgraphsio(modelo, fp)

Escribiendo CSV en: modelos_gensim/modelo05.token.csv


<img src="modelo.05.tokens.png" />