# Topic Modeling

This notebook explores topic modeling on the summaries dataset. We implement the model with BERTopic.

In [1]:
# Modules to import
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
paths = ['../data','../scripts','../utils']
for path in paths:
    sys.path.append(path)

In [3]:
from dataLoader import loadDataframe

In [4]:
# Load data
path_to_directory = '../../data/cleanData/'
df_movies = loadDataframe('movies', path_to_directory)
df_summaries = loadDataframe('summaries', path_to_directory)


### BERTopic

In [5]:
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from tqdm import tqdm

In [6]:
# Load the spaCy model for English language
nlp = spacy.load("en_core_web_sm")

# Download stopwords and punkt tokenizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Load the English stopwords
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
def preprocess_text(text):
    # Tokenisation
    doc = nlp(text.lower())  # Convert to lowercase
    
    # Lemmatisation and remove stopwords and punctuation
    processed_tokens = [
        token.lemma_ for token in doc if token.text not in stop_words and token.text not in string.punctuation
    ]
    
    char_to_remove = ["'s", " "]
    processed_tokens = [token for token in processed_tokens if token not in char_to_remove]
    
    return " ".join(processed_tokens)
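
One detail of the filter above is worth checking: `token.text not in string.punctuation` is a substring test against the punctuation string, not a per-character test. The sketch below uses only the standard library to show which tokens that test catches, next to one possible stricter alternative. The helper names `is_punct_substring` and `is_punct_chars` are hypothetical and not part of the notebook.

```python
import string

def is_punct_substring(token: str) -> bool:
    # Same test as in preprocess_text: true when the token is a substring
    # of string.punctuation
    return token in string.punctuation

def is_punct_chars(token: str) -> bool:
    # Alternative: true when the token is non-empty and every character
    # is a punctuation character
    return bool(token) and all(ch in string.punctuation for ch in token)

# Compare both tests on a few sample tokens
for tok in ["!", "...", "--", "word"]:
    print(repr(tok), is_punct_substring(tok), is_punct_chars(tok))
```

Multi-character tokens such as `...` or `--` pass through the substring test but are caught by the per-character one. spaCy tokens also expose `token.is_punct`, which could replace the manual check.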

In [8]:
df_summaries.head()

Unnamed: 0,wiki_id,summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


In [None]:
# Apply the preprocessing to the summaries
tqdm.pandas()
df_cleaned_summaries = df_summaries.copy()
df_cleaned_summaries["cleaned_summary"] = df_cleaned_summaries["summary"].progress_apply(preprocess_text)

# Drop the original summary column
df_cleaned_summaries.drop(columns=["summary"], inplace=True)

100%|██████████| 42303/42303 [39:01<00:00, 18.07it/s]  


Not all summaries are in English, so we use the `langdetect` library to filter out the non-English ones.

In [None]:
!pip install langdetect

In [25]:
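# Detect the language of each summary
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. an empty string)
        return "unknown"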

df_cleaned_summaries["language"] = df_cleaned_summaries["cleaned_summary"].progress_apply(detect_language)

100%|██████████| 42303/42303 [06:38<00:00, 106.04it/s]


In [28]:
# Save the cleaned summaries
df_cleaned_summaries.to_csv("../../data/topicModelData/cleaned_summaries.csv", index=False)

In [None]:
percentage_english = df_cleaned_summaries["language"].value_counts(normalize=True)["en"] * 100
print('Percentage of summaries in English: {:.2f}%'.format(percentage_english))

Percentage of summaries in English: 99.53%


We can either drop the non-English summaries or translate them to English. Since they account for less than 0.5% of the dataset, we drop them in this notebook.
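
As a rough check on how much data this discards, the figures printed above (42303 summaries, 99.53% detected as English) can be turned into an approximate count. The sketch below is only arithmetic on those two rounded numbers, so the result is an estimate rather than an exact row count.

```python
total_summaries = 42303   # row count shown by the progress bars above
english_percent = 99.53   # rounded share printed above

# Estimated number of summaries not detected as English
dropped_estimate = round(total_summaries * (100 - english_percent) / 100)
print(f"About {dropped_estimate} summaries are dropped")
```

This puts the loss at around 200 summaries, small enough that dropping them is unlikely to change the topics much.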

In [48]:
# Keep only English summaries; copy so later column assignments do not act on a view
df_cleaned_summaries = df_cleaned_summaries[df_cleaned_summaries["language"] == "en"].copy()

In [49]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

In [18]:
# Embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the summaries
embeddings = embedding_model.encode(df_cleaned_summaries['cleaned_summary'].tolist(), show_progress_bar=True)

Batches:   0%|          | 0/1322 [00:00<?, ?it/s]

In [19]:
# Create the model
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

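# Fit the model
# Note: this call re-embeds the documents; passing embeddings=embeddings to
# fit_transform would reuse the vectors computed in the previous cell.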
topics, probabilities = topic_model.fit_transform(df_cleaned_summaries['cleaned_summary'].tolist())
df_cleaned_summaries['topic'] = topics

2024-11-23 17:36:21,084 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1322 [00:00<?, ?it/s]

2024-11-23 18:08:21,971 - BERTopic - Embedding - Completed ✓
2024-11-23 18:08:21,971 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-23 18:09:14,691 - BERTopic - Dimensionality - Completed ✓
2024-11-23 18:09:14,693 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-23 18:09:28,643 - BERTopic - Cluster - Completed ✓
2024-11-23 18:09:28,649 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-23 18:09:36,390 - BERTopic - Representation - Completed ✓


In [20]:
# Print the topics
topic_info = topic_model.get_topic_info()
print(topic_info)

     Topic  Count                                       Name  \
0       -1  26034                        -1_find_go_get_take   
1        0   1253                             0_su_ho_li_eun   
2        1    954                    1_police_frank_kill_car   
3        2    892              2_murder_detective_bank_crime   
4        3    816                       3_love_marry_raj_get   
..     ...    ...                                        ...   
213    212     10              212_eve_husband_clovis_affair   
214    213     10  213_bhairavamurthy_rudra_rakesh_gajapathy   
215    214     10           214_bahar_sanaka_paromita_maanav   
216    215     10               215_gekko_bud_gondo_bluestar   
217    216     10                216_queen_paige_edvard_snow   

                                        Representation  \
0    [find, go, get, take, film, one, tell, leave, ...   
1    [su, ho, li, eun, wong, hong, jin, master, chi...   
2    [police, frank, kill, car, shoot, drug, murder...   
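
In BERTopic, topic `-1` collects the documents that were not assigned to any cluster, so its count is worth comparing with the total. The sketch below computes that share from a list of topic assignments such as `topics`. The helper name `outlier_share` and the toy list are illustrative only.

```python
from collections import Counter

def outlier_share(topic_assignments):
    # Fraction of documents assigned to BERTopic's outlier topic (-1)
    counts = Counter(topic_assignments)
    return counts[-1] / len(topic_assignments)

# Toy assignments for illustration, not the notebook's real output
toy_topics = [-1, 0, 0, 1, -1, 2, -1, -1]
print(f"{outlier_share(toy_topics):.1%} of documents are outliers")
```

With the counts shown above, 26034 outliers out of roughly 42000 fitted summaries is a bit over 60%, so most summaries were left unassigned in this run.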

In [21]:
# Visualise topics by dominant words
topic_model.visualize_barchart()

In [22]:
# Visualise topics on an intertopic distance map
topic_model.visualize_topics()