# Topic Modeling

This notebook explores topic modeling on the summaries dataset. The model is built with BERTopic.

In [1]:
# Modules to import
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
paths = ['../data','../scripts','../utils']
for path in paths:
    sys.path.append(path)

In [3]:
from dataLoader import loadDataframe

In [4]:
# Load data
path_to_directory = '../../data/cleanData/'
df_movies = loadDataframe('movies', path_to_directory)
df_summaries = loadDataframe('summaries', path_to_directory)

  df[columns_to_convert] = df[columns_to_convert].applymap(eval)
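
The line above is the tail of a warning raised inside `loadDataframe`. It shows that columns stored as strings are parsed with `eval`. A minimal sketch of a safer alternative, assuming those cells hold plain Python literals such as lists or dicts: `ast.literal_eval` parses literals only and refuses arbitrary expressions. The helper name `parse_literal` and the sample strings are hypothetical.

```python
import ast


def parse_literal(cell):
    """Parse a stringified Python literal; return non-string cells unchanged."""
    if not isinstance(cell, str):
        return cell
    return ast.literal_eval(cell)


# Hypothetical cell values, as they might look after a CSV round trip
print(parse_literal("['Drama', 'Comedy']"))
print(parse_literal("{'en': 'English'}"))

# Unlike eval, literal_eval rejects anything that is not a literal
try:
    parse_literal("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

On recent pandas versions (assumption: 2.1 or later, where `DataFrame.map` replaces `applymap`), the loader could call `df[columns_to_convert].map(parse_literal)` instead.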


### BERTopic

In [5]:
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from tqdm import tqdm

In [6]:
# Load the spaCy model for English language
nlp = spacy.load("en_core_web_sm")

# Download stopwords and punkt tokenizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Load the English stopwords
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
def preprocess_text(text):
    # Tokenisation
    doc = nlp(text.lower())  # Convert to lowercase
    
    # Lemmatisation and remove stopwords and punctuation
    processed_tokens = [
        token.lemma_ for token in doc if token.text not in stop_words and token.text not in string.punctuation
    ]
    
    char_to_remove = ["'s", " "]
    processed_tokens = [token for token in processed_tokens if token not in char_to_remove]
    
    return " ".join(processed_tokens)
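
One detail of the filter above: `token.text not in string.punctuation` is a substring test against a single string. It removes single punctuation characters but keeps multi-character tokens such as `...` or `--`. A minimal sketch of a stricter check, using only the standard library; the helper name `is_punctuation` is hypothetical and not part of the notebook.

```python
import string

PUNCT = set(string.punctuation)


def is_punctuation(token_text):
    """Return True when the token is non-empty and made only of ASCII punctuation."""
    return bool(token_text) and all(ch in PUNCT for ch in token_text)


for tok in ["!", "...", "--", "wong"]:
    # First value: the substring test used above; second value: the stricter check
    print(repr(tok), tok in string.punctuation, is_punctuation(tok))
```

spaCy tokens also expose `token.is_punct`, which covers Unicode punctuation as well and may be the simpler choice inside `preprocess_text`.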

In [8]:
df_summaries.head()

Unnamed: 0,wiki_id,summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


In [None]:
# Apply the preprocessing to the summaries
tqdm.pandas()
df_cleaned_summaries = df_summaries.copy()
df_cleaned_summaries["cleaned_summary"] = df_cleaned_summaries["summary"].progress_apply(preprocess_text)

# Drop the original summary column
df_cleaned_summaries.drop(columns=["summary"], inplace=True)

  0%|          | 0/42303 [00:00<?, ?it/s]

100%|██████████| 42303/42303 [39:01<00:00, 18.07it/s]  
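
The run above took about 39 minutes, with `nlp` called once per summary. A possible speed-up, sketched under the assumption that the same lemmas are wanted: stream the texts through `nlp.pipe`, which processes them in batches. The function name `preprocess_texts` and the batch size are illustrative choices, and no timing is claimed here.

```python
import string


def preprocess_texts(texts, nlp, stop_words, batch_size=256):
    """Lowercase, lemmatise and filter a sequence of texts in batches.

    `nlp` is expected to be a loaded spaCy pipeline, i.e. an object whose
    `pipe(texts, batch_size=...)` yields one iterable of tokens per text.
    """
    char_to_remove = {"'s", " "}
    lowered = (text.lower() for text in texts)
    cleaned = []
    for doc in nlp.pipe(lowered, batch_size=batch_size):
        lemmas = [
            token.lemma_
            for token in doc
            # Same filters as preprocess_text: stopwords, punctuation, leftovers
            if token.text not in stop_words
            and token.text not in string.punctuation
            and token.lemma_ not in char_to_remove
        ]
        cleaned.append(" ".join(lemmas))
    return cleaned
```

With the objects defined earlier, this would be called as `preprocess_texts(df_summaries["summary"], nlp, stop_words)`. Loading the pipeline with `spacy.load("en_core_web_sm", disable=["parser", "ner"])` should also help, since the lemmatiser does not need those two components.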


Not all summaries are in English, so we use the `langdetect` library to filter out the non-English ones.

In [25]:
# Detect the language of each summary
from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises when the text has no usable features (e.g. an empty string)
        return "unknown"

df_cleaned_summaries["language"] = df_cleaned_summaries["cleaned_summary"].progress_apply(detect_language)

100%|██████████| 42303/42303 [06:38<00:00, 106.04it/s]


In [11]:
# Save the cleaned summaries
df_cleaned_summaries.to_csv("../../data/topicModelData/cleaned_summaries.csv", index=False)

In [12]:
percentage_english = df_cleaned_summaries["language"].value_counts(normalize=True)["en"] * 100
print('Percentage of summaries in English: {:.2f}%'.format(percentage_english))

Percentage of summaries in English: 99.53%


In [10]:
df_cleaned_summaries = pd.read_csv("../../data/topicModelData/cleaned_summaries.csv")

We could either drop the non-English summaries or translate them to English. Since they make up less than 0.5% of the dataset, we simply drop them in this notebook.

In [13]:
df_cleaned_summaries = df_cleaned_summaries[df_cleaned_summaries["language"] == "en"]

In [14]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer



In [15]:
# Embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the summaries
embeddings = embedding_model.encode(df_cleaned_summaries['cleaned_summary'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 1316/1316 [40:26<00:00,  1.84s/it] 


In [16]:
# Create the model
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

# Fit the model
topics, probabilities = topic_model.fit_transform(df_cleaned_summaries['cleaned_summary'].tolist())
df_cleaned_summaries['topic'] = topics

2024-11-24 10:57:02,619 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1316/1316 [39:33<00:00,  1.80s/it] 
2024-11-24 11:36:39,906 - BERTopic - Embedding - Completed ✓
2024-11-24 11:36:39,906 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-24 11:37:35,978 - BERTopic - Dimensionality - Completed ✓
2024-11-24 11:37:35,989 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-24 11:37:46,514 - BERTopic - Cluster - Completed ✓
2024-11-24 11:37:46,537 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-24 11:37:54,297 - BERTopic - Representation - Completed ✓
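
The log above shows BERTopic encoding all 1316 batches again (about 39 minutes), although the same vectors were already computed in the previous cell. `fit_transform` accepts precomputed vectors through its `embeddings` argument, so the second pass can be skipped. A minimal sketch; the wrapper name `fit_with_precomputed` is hypothetical, and it only adds a length check before delegating to the model.

```python
def fit_with_precomputed(topic_model, docs, embeddings):
    """Fit a BERTopic-style model on docs, reusing already computed embeddings."""
    if len(docs) != len(embeddings):
        raise ValueError(
            f"{len(docs)} documents but {len(embeddings)} embedding rows"
        )
    # BERTopic skips its own encoding step when `embeddings` is provided
    return topic_model.fit_transform(docs, embeddings=embeddings)
```

In this notebook the call would be `topics, probabilities = fit_with_precomputed(topic_model, docs, embeddings)`, with `docs = df_cleaned_summaries['cleaned_summary'].tolist()`.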


In [17]:
# Print the topics
topic_info = topic_model.get_topic_info()
print(topic_info)

     Topic  Count                                 Name  \
0       -1  25994                  -1_find_go_take_get   
1        0   1321                      0_ho_su_li_wong   
2        1   1200          1_mother_husband_year_child   
3        2    921               2_love_marry_singh_get   
4        3    897       3_murder_bank_detective_prison   
..     ...    ...                                  ...   
222    221     10    221_luther_halligan_brewster_jöns   
223    222     10      222_molly_lasch_dwayne_danielle   
224    223     10     223_japanese_chinese_troop_ahmad   
225    224     10  224_bardot_vadim_merteuil_madeleine   
226    225     10        225_pelikán_dato_dániel_zoran   

                                        Representation  \
0    [find, go, take, get, one, leave, tell, man, f...   
1    [ho, su, li, wong, jin, hong, master, eun, chi...   
2    [mother, husband, year, child, life, father, f...   
3    [love, marry, singh, get, father, raja, marria...   
4    [murder,

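BERTopic returns 227 rows: 226 topics (0 to 225) plus the outlier class -1, which contains 25,994 summaries, more than half of the corpus. We need to review the topics to check whether they make sense and whether some are redundant or uninformative.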
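
Before reviewing individual topics, it helps to know how much of the corpus was actually assigned to one. In the table above, topic -1 is BERTopic's outlier class. A minimal sketch, using only the standard library, that computes the outlier share from the `topics` list returned by `fit_transform`; the sample list is a toy input, not the notebook's data.

```python
from collections import Counter


def outlier_share(topics, outlier_label=-1):
    """Return the fraction of documents assigned to the outlier topic."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[outlier_label] / len(topics)


# Toy topic assignments, for illustration only
toy_topics = [-1, 0, 0, 1, -1, 2, -1, 3]
print(f"{outlier_share(toy_topics):.1%} of documents are outliers")
```

If the share is high, recent BERTopic versions provide a `reduce_outliers` method that reassigns outlier documents to a nearby topic. Whether that is appropriate here depends on how noisy those summaries are.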

In [18]:
# Visualise topics by dominant words
topic_model.visualize_barchart()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [19]:
# Visualise topics on a map
topic_model.visualize_topics()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
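
Both calls fail at display time, not while building the figure: Plotly's notebook renderer needs the `nbformat` package. Installing it (`pip install nbformat`) and restarting the kernel should restore inline rendering. As a fallback that does not depend on the notebook renderer, the figure can be written to a standalone HTML file. A minimal sketch, assuming `fig` is a Plotly figure such as the one returned by `topic_model.visualize_barchart()`; the helper name and the file path in the usage note are hypothetical.

```python
from pathlib import Path


def save_figure(fig, path):
    """Write a Plotly figure to a standalone HTML file and return its path."""
    path = Path(path)
    # Create the target folder if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    fig.write_html(str(path))
    return path
```

For example, `save_figure(topic_model.visualize_barchart(), "figures/topic_barchart.html")` would produce a file that opens in any browser.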

In [20]:
# Save the model
topic_model.save("topic_model")



In [21]:
# Save the topics
df_cleaned_summaries.to_csv("../../data/topicModelData/summaries_with_topics.csv", index=False)

### Review topics

Two review steps are planned: merging similar topics and dropping non-informative ones.
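
One rough way to shortlist merge candidates is to compare the keyword lists of two topics. The sketch below uses Jaccard overlap on the leading keywords. This is a heuristic chosen for illustration, not how BERTopic itself measures topic similarity, and the threshold is an arbitrary assumption. The keyword lists are the visible prefixes of the truncated `Representation` column for topics 1 and 2.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard overlap between two keyword lists (0.0 when both are empty)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_candidates(keywords_by_topic, threshold=0.3):
    """Return (topic_a, topic_b, score) for pairs whose overlap reaches the threshold."""
    pairs = []
    for (ta, wa), (tb, wb) in combinations(sorted(keywords_by_topic.items()), 2):
        score = jaccard(wa, wb)
        if score >= threshold:
            pairs.append((ta, tb, score))
    return pairs


# Visible keyword prefixes for topics 1 and 2 (the full lists are truncated in the output)
keywords = {
    1: ["mother", "husband", "year", "child", "life", "father"],
    2: ["love", "marry", "singh", "get", "father", "raja"],
}
print(jaccard(keywords[1], keywords[2]))
print(merge_candidates(keywords))
```

Pairs confirmed by reading their representative documents could then be passed to `topic_model.merge_topics(docs, topics_to_merge)`.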

In [57]:
topic_info.head()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,25994,-1_find_go_take_get,"[find, go, take, get, one, leave, tell, man, f...",[babel focus four interrelated set situation c...
1,0,1321,0_ho_su_li_wong,"[ho, su, li, wong, jin, hong, master, eun, chi...",[leader kong tung clan yen chan ying ruthless ...
2,1,1200,1_mother_husband_year_child,"[mother, husband, year, child, life, father, f...",[david catherine robinson move rundown country...
3,2,921,2_love_marry_singh_get,"[love, marry, singh, get, father, raja, marria...",[story focusse three character different persp...
4,3,897,3_murder_bank_detective_prison,"[murder, bank, detective, prison, crime, polic...",[three outlaws rob bank one wound two partner ...


In [63]:
docs = df_cleaned_summaries['cleaned_summary'].tolist()
topic_distr, _ = topic_model.approximate_distribution(docs)

100%|██████████| 43/43 [02:01<00:00,  2.83s/it]
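
`approximate_distribution` returns one row per document with a score for each topic, so unlike `topics` it can express that a summary mixes several themes. A minimal sketch for reading such a row, written with plain Python sequences so it runs without the fitted model. The scores are made-up toy values, the function name is hypothetical, and column `i` is assumed to correspond to topic `i`.

```python
def top_topics(scores, k=3, min_score=0.0):
    """Return up to k (topic_index, score) pairs above min_score, highest first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(topic, float(score)) for topic, score in ranked[:k] if score > min_score]


# Toy distribution over four topics for a single document
toy_row = [0.05, 0.0, 0.60, 0.35]
print(top_topics(toy_row, k=2))
```

Applied to the notebook's result, `top_topics(topic_distr[0])` would list the strongest topics for the first summary. An empty list means no topic scored above `min_score`.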


### Analysis using df_movies