## Month-on-Month Topic Modelling using LDA

### Data Preparation

In [58]:
import pandas as pd

In [59]:
# Checking the current working directory (where the .csv files are read from and saved to)
import os
print(os.getcwd())

C:\Users\Utilizador\AppData\Local\Programs\Microsoft VS Code


In [60]:
# Checking date columns of the preprocessed .csv

  # 1) Reading both .csv files
df_r = pd.read_csv('df_r.csv')
df_w = pd.read_csv('df_w.csv')

  # 2) Counting the tweets per month in each dataset
months_df_r = df_r['month'].value_counts().sort_index()
months_df_w = df_w['month'].value_counts().sort_index()

  # 3) Printing results
print("Unique months and respective counts in Russian tweets dataset:")
print(months_df_r)

print("\nUnique months and respective counts in Western tweets dataset:")
print(months_df_w)

Unique months and respective counts in Russian tweets dataset:
month
2    1612
3    8960
4    9722
5    2308
Name: count, dtype: int64

Unique months and respective counts in Western tweets dataset:
month
2    1780
3    7501
4    6179
5    1267
Name: count, dtype: int64


Both datasets cover the same four months: 2 (February), 3 (March), 4 (April), and 5 (May). We will use these four months for the monthly LDA comparisons.
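The overlap can also be checked programmatically rather than by eye. A minimal sketch, using the month counts printed above as plain dictionaries; the names `counts_r`, `counts_w`, `common_months`, and `only_one_side` are illustrative and not part of the notebook:

```python
# Month -> tweet count, copied from the value_counts() output above.
counts_r = {2: 1612, 3: 8960, 4: 9722, 5: 2308}  # Russian tweets
counts_w = {2: 1780, 3: 7501, 4: 6179, 5: 1267}  # Western tweets

# Months present in both datasets.
common_months = sorted(set(counts_r) & set(counts_w))
print(common_months)   # [2, 3, 4, 5]

# Months that appear in only one of the two datasets (none here).
only_one_side = sorted(set(counts_r) ^ set(counts_w))
print(only_one_side)   # []
```

With the dataframes loaded, the same check should work directly on the columns, for example `set(df_r['month']) & set(df_w['month'])`.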

In [61]:
# Splitting each of the datasets into four new dataframes, based on the month of the tweet

df_r_2 = df_r[df_r['month'] == 2]
df_r_3 = df_r[df_r['month'] == 3]
df_r_4 = df_r[df_r['month'] == 4]
df_r_5 = df_r[df_r['month'] == 5]

df_w_2 = df_w[df_w['month'] == 2]
df_w_3 = df_w[df_w['month'] == 3]
df_w_4 = df_w[df_w['month'] == 4]
df_w_5 = df_w[df_w['month'] == 5]

# Checking the month values of the new dataframes to confirm if the split was effective
print("Unique months in the Feb Russian tweets dataset:", df_r_2['month'].unique())
print("Unique months in the March Russian tweets dataset:", df_r_3['month'].unique())
print("Unique months in the April Russian tweets dataset:", df_r_4['month'].unique())
print("Unique months in the May Russian tweets dataset:", df_r_5['month'].unique())
print("")
print("Unique months in the Feb Western tweets dataset:", df_w_2['month'].unique())
print("Unique months in the March Western tweets dataset:", df_w_3['month'].unique())
print("Unique months in the April Western tweets dataset:", df_w_4['month'].unique())
print("Unique months in the May Western tweets dataset:", df_w_5['month'].unique())

Unique months in the Feb Russian tweets dataset: [2]
Unique months in the March Russian tweets dataset: [3]
Unique months in the April Russian tweets dataset: [4]
Unique months in the May Russian tweets dataset: [5]

Unique months in the Feb Western tweets dataset: [2]
Unique months in the March Western tweets dataset: [3]
Unique months in the April Western tweets dataset: [4]
Unique months in the May Western tweets dataset: [5]


Now that we have confirmed that the split by month worked, we can save each subset as a separate .csv file so it can be reused in the monthly LDA comparisons.

In [62]:
# Saving the split datasets as new .csv files to be easily accessible for further LDA analysis

df_r_2.to_csv('df_r_2.csv', index = False)
df_r_3.to_csv('df_r_3.csv', index = False)
df_r_4.to_csv('df_r_4.csv', index = False)
df_r_5.to_csv('df_r_5.csv', index = False)

df_w_2.to_csv('df_w_2.csv', index = False)
df_w_3.to_csv('df_w_3.csv', index = False)
df_w_4.to_csv('df_w_4.csv', index = False)
df_w_5.to_csv('df_w_5.csv', index = False)

In [63]:
# Checking if the new .csv files were successfully saved

df_r_2 = pd.read_csv('df_r_2.csv')
df_r_3 = pd.read_csv('df_r_3.csv')
df_r_4 = pd.read_csv('df_r_4.csv')
df_r_5 = pd.read_csv('df_r_5.csv')

df_w_2 = pd.read_csv('df_w_2.csv')
df_w_3 = pd.read_csv('df_w_3.csv')
df_w_4 = pd.read_csv('df_w_4.csv')
df_w_5 = pd.read_csv('df_w_5.csv')

# Checking if the number of rows in each new dataframe matches the value count of its month
print("Number of rows in the Feb Russian tweets dataset:", df_r_2.shape[0])
print("Number of rows in the March Russian tweets dataset:", df_r_3.shape[0])
print("Number of rows in the April Russian tweets dataset:", df_r_4.shape[0])
print("Number of rows in the May Russian tweets dataset:", df_r_5.shape[0])
print("")
print("Number of rows in the Feb Western tweets dataset:", df_w_2.shape[0])
print("Number of rows in the March Western tweets dataset:", df_w_3.shape[0])
print("Number of rows in the April Western tweets dataset:", df_w_4.shape[0])
print("Number of rows in the May Western tweets dataset:", df_w_5.shape[0])

Number of rows in the Feb Russian tweets dataset: 1612
Number of rows in the March Russian tweets dataset: 8960
Number of rows in the April Russian tweets dataset: 9722
Number of rows in the May Russian tweets dataset: 2308

Number of rows in the Feb Western tweets dataset: 1780
Number of rows in the March Western tweets dataset: 7501
Number of rows in the April Western tweets dataset: 6179
Number of rows in the May Western tweets dataset: 1267


The number of rows in each of the new .csv files matches the value counts of each month in the preprocessed dataframes, indicating that the new files are now ready to be reused in the LDA analysis.
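The split, save, and row-count check above repeat the same statement for every month and dataset, so they could be collapsed into a loop. The following is only a sketch under assumptions: `split_by_month` is a hypothetical helper, the toy frame stands in for `df_r`, and no files are written.

```python
import pandas as pd

def split_by_month(df, column="month"):
    """Return a {month: dataframe} dict with one entry per distinct month."""
    return {int(month): group.reset_index(drop=True)
            for month, group in df.groupby(column)}

# Toy stand-in for df_r; the real frames hold thousands of tweets.
toy = pd.DataFrame({
    "month": [2, 3, 3, 4, 5, 5],
    "cleaned_tokens": ["a b", "b c", "c d", "d e", "e f", "f g"],
})

monthly = split_by_month(toy)

# Same sanity check as in the notebook: rows per split vs. value_counts().
expected = toy["month"].value_counts().to_dict()
row_counts = {month: len(frame) for month, frame in monthly.items()}
print(row_counts)
print(row_counts == expected)
```

Each frame in `monthly` could then be saved with `frame.to_csv(f'df_r_{month}.csv', index=False)`, which would reproduce the file names used in this notebook.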

### TF-IDF corpus

To see whether the topics discussed evolved month on month (MoM), we will run LDA topic modelling on each of the newly created dataframes (one per month of tweets). This requires a separate TF-IDF corpus for each month's data. We opted for TF-IDF because it down-weights words that appear in most tweets and up-weights the more distinctive ones, which we expect to yield more informative topics.
Since the new .csv files were created from the already preprocessed dataframes, we will use the cleaned_tokens column to build the separate TF-IDF corpora.
Additionally, since we will be using the Gensim LDA implementation (so that the models can later be visualized), we first build a Gensim dictionary and a BoW corpus for each month. The TF-IDF corpora are derived from the BoW corpora, and the dictionaries are also passed to the model when fitting it.
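To make the BoW-to-TF-IDF step concrete, here is a pure-Python sketch of both stages on a toy corpus. It assumes the weighting that, as far as we are aware, Gensim's `TfidfModel` applies by default: raw term count times log2(N / document frequency), followed by L2 normalisation per document. The helper names `build_dictionary`, `to_bow`, and `tfidf`, and the toy documents, are illustrative and are not the Gensim API.

```python
import math
from collections import Counter

docs = ["troops border city", "sanctions economy", "troops sanctions troops"]
tokenised = [d.split() for d in docs]

def build_dictionary(token_lists):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Sparse bag-of-words: sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def tfidf(bow_corpus):
    """Weight each count by log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_corpus)
    df = Counter(tid for bow in bow_corpus for tid, _ in bow)
    weighted = []
    for bow in bow_corpus:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in bow]
        vec = [(tid, w) for tid, w in vec if w > 0]  # drop zero-weight terms
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

token2id = build_dictionary(tokenised)
bow_corpus = [to_bow(tokens, token2id) for tokens in tokenised]
tfidf_corpus = tfidf(bow_corpus)

print(bow_corpus[2])    # [(0, 2), (3, 1)]
print(tfidf_corpus[1])  # "economy" (1 document) outweighs "sanctions" (2 documents)
```

The point of the sketch is the ordering of the steps: the dictionary fixes the token ids, the BoW corpus stores counts against those ids, and TF-IDF only rescales those counts. This is why the same dictionary can be handed to the LDA model together with the TF-IDF corpus.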

In [64]:
# %pip install gensim
import gensim
from gensim import corpora, models
from gensim.models import TfidfModel, LdaModel

In [65]:
import gensim
from gensim import corpora

# Creating dictionaries for Russian tweets
dictionary_r_2 = corpora.Dictionary([tokens.split() for tokens in df_r_2['cleaned_tokens']])
dictionary_r_3 = corpora.Dictionary([tokens.split() for tokens in df_r_3['cleaned_tokens']])
dictionary_r_4 = corpora.Dictionary([tokens.split() for tokens in df_r_4['cleaned_tokens']])
dictionary_r_5 = corpora.Dictionary([tokens.split() for tokens in df_r_5['cleaned_tokens']])

# Creating dictionaries for Western tweets
dictionary_w_2 = corpora.Dictionary([tokens.split() for tokens in df_w_2['cleaned_tokens']])
dictionary_w_3 = corpora.Dictionary([tokens.split() for tokens in df_w_3['cleaned_tokens']])
dictionary_w_4 = corpora.Dictionary([tokens.split() for tokens in df_w_4['cleaned_tokens']])
dictionary_w_5 = corpora.Dictionary([tokens.split() for tokens in df_w_5['cleaned_tokens']])

# Creating BoW corpus from dictionaries for Russian tweets
bow_corpus_r_2 = [dictionary_r_2.doc2bow(tokens.split()) for tokens in df_r_2['cleaned_tokens']]
bow_corpus_r_3 = [dictionary_r_3.doc2bow(tokens.split()) for tokens in df_r_3['cleaned_tokens']]
bow_corpus_r_4 = [dictionary_r_4.doc2bow(tokens.split()) for tokens in df_r_4['cleaned_tokens']]
bow_corpus_r_5 = [dictionary_r_5.doc2bow(tokens.split()) for tokens in df_r_5['cleaned_tokens']]

# Creating BoW corpus from dictionaries for Western tweets
bow_corpus_w_2 = [dictionary_w_2.doc2bow(tokens.split()) for tokens in df_w_2['cleaned_tokens']]
bow_corpus_w_3 = [dictionary_w_3.doc2bow(tokens.split()) for tokens in df_w_3['cleaned_tokens']]
bow_corpus_w_4 = [dictionary_w_4.doc2bow(tokens.split()) for tokens in df_w_4['cleaned_tokens']]
bow_corpus_w_5 = [dictionary_w_5.doc2bow(tokens.split()) for tokens in df_w_5['cleaned_tokens']]


In [66]:
# Creating the monthly TF-IDF corpora to pass on to the LDA models

# Fitting one TF-IDF model per month on the corresponding BoW corpus
tfidf_r_2 = models.TfidfModel(bow_corpus_r_2)
tfidf_r_3 = models.TfidfModel(bow_corpus_r_3)
tfidf_r_4 = models.TfidfModel(bow_corpus_r_4)
tfidf_r_5 = models.TfidfModel(bow_corpus_r_5)

tfidf_w_2 = models.TfidfModel(bow_corpus_w_2)
tfidf_w_3 = models.TfidfModel(bow_corpus_w_3)
tfidf_w_4 = models.TfidfModel(bow_corpus_w_4)
tfidf_w_5 = models.TfidfModel(bow_corpus_w_5)

# Applying each fitted TF-IDF model to its own BoW corpus
corpus_tfidf_r_2 = tfidf_r_2[bow_corpus_r_2]
corpus_tfidf_r_3 = tfidf_r_3[bow_corpus_r_3]
corpus_tfidf_r_4 = tfidf_r_4[bow_corpus_r_4]
corpus_tfidf_r_5 = tfidf_r_5[bow_corpus_r_5]

corpus_tfidf_w_2 = tfidf_w_2[bow_corpus_w_2]
corpus_tfidf_w_3 = tfidf_w_3[bow_corpus_w_3]
corpus_tfidf_w_4 = tfidf_w_4[bow_corpus_w_4]
corpus_tfidf_w_5 = tfidf_w_5[bow_corpus_w_5]

### Training the LDA Models

Now that we have created Gensim dictionaries and TF-IDF corpora for each month's data, we can pass them on to the Gensim LDA model for topic modelling.

In [67]:
# LDA Models for Russian tweets (10 topics, 10 passes each)
# Note: no random_state is set, so the topics can differ between runs
lda_model_r_2 = gensim.models.LdaMulticore(corpus_tfidf_r_2, num_topics = 10, passes = 10, id2word = dictionary_r_2, iterations = 20)
lda_model_r_3 = gensim.models.LdaMulticore(corpus_tfidf_r_3, num_topics = 10, passes = 10, id2word = dictionary_r_3, iterations = 20)
lda_model_r_4 = gensim.models.LdaMulticore(corpus_tfidf_r_4, num_topics = 10, passes = 10, id2word = dictionary_r_4, iterations = 20)
lda_model_r_5 = gensim.models.LdaMulticore(corpus_tfidf_r_5, num_topics = 10, passes = 10, id2word = dictionary_r_5, iterations = 20)

# LDA Models for Western tweets
lda_model_w_2 = gensim.models.LdaMulticore(corpus_tfidf_w_2, num_topics = 10, passes = 10, id2word = dictionary_w_2, iterations = 20)
lda_model_w_3 = gensim.models.LdaMulticore(corpus_tfidf_w_3, num_topics = 10, passes = 10, id2word = dictionary_w_3, iterations = 20)
lda_model_w_4 = gensim.models.LdaMulticore(corpus_tfidf_w_4, num_topics = 10, passes = 10, id2word = dictionary_w_4, iterations = 20)
lda_model_w_5 = gensim.models.LdaMulticore(corpus_tfidf_w_5, num_topics = 10, passes = 10, id2word = dictionary_w_5, iterations = 20)

### Visualizing the LDA Models (per month) - Russian

In [68]:
# %pip install pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [71]:
%matplotlib inline

# Main topics of war-related Russian tweets in February (2)
vis_r_2 = gensimvis.prepare(lda_model_r_2, corpus_tfidf_r_2, dictionary_r_2)
pyLDAvis.display(vis_r_2)

In [72]:
# Main topics of war-related Russian tweets in March (3)
vis_r_3 = gensimvis.prepare(lda_model_r_3, corpus_tfidf_r_3, dictionary_r_3)
pyLDAvis.display(vis_r_3)

In [73]:
# Main topics of war-related Russian tweets in April (4)
vis_r_4 = gensimvis.prepare(lda_model_r_4, corpus_tfidf_r_4, dictionary_r_4)
pyLDAvis.display(vis_r_4)

In [74]:
# Main topics of war-related Russian tweets in May (5)
vis_r_5 = gensimvis.prepare(lda_model_r_5, corpus_tfidf_r_5, dictionary_r_5)
pyLDAvis.display(vis_r_5)

### Visualizing the LDA Models (per month) - Western

In [75]:
# Main topics of war-related Western tweets in February (2)
vis_w_2 = gensimvis.prepare(lda_model_w_2, corpus_tfidf_w_2, dictionary_w_2)
pyLDAvis.display(vis_w_2)

In [76]:
# Main topics of war-related Western tweets in March (3)
vis_w_3 = gensimvis.prepare(lda_model_w_3, corpus_tfidf_w_3, dictionary_w_3)
pyLDAvis.display(vis_w_3)

In [77]:
# Main topics of war-related Western tweets in April (4)
vis_w_4 = gensimvis.prepare(lda_model_w_4, corpus_tfidf_w_4, dictionary_w_4)
pyLDAvis.display(vis_w_4)

In [78]:
# Main topics of war-related Western tweets in May (5)
vis_w_5 = gensimvis.prepare(lda_model_w_5, corpus_tfidf_w_5, dictionary_w_5)
pyLDAvis.display(vis_w_5)