# Topic Modeling

This notebook explores topic modeling on the summaries dataset. We implement the model using BERTopic.

In [1]:
# Modules to import
import sys
import pandas as pd
import numpy as np

# Paths to add
paths = ['../data','../scripts','../utils']
for path in paths:
    sys.path.append(path)
    
# Data loader
from dataLoader import loadDataframe

# Load data
path_to_directory = '../../data/cleanData/'
df_movies = loadDataframe('movies', path_to_directory)
df_summaries = loadDataframe('summaries', path_to_directory)


### BERTopic

The first step is to preprocess the text: lowercase it, lemmatize the words, and remove stop words and punctuation. We use spaCy for tokenization and lemmatization and NLTK for the English stop-word list.

In [2]:
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from tqdm import tqdm

In [None]:
# Load the spaCy model for English language
nlp = spacy.load("en_core_web_sm")

# Download stopwords and punkt tokenizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Load the English stopwords
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arnau\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
def preprocess_text(text):
    # Lowercase the text, then tokenize and lemmatize it with spaCy
    doc = nlp(text.lower())

    # Keep the lemma of every token that is neither a stop word nor a punctuation character
    processed_tokens = [
        token.lemma_ for token in doc if token.text not in stop_words and token.text not in string.punctuation
    ]

    # Remove leftover possessive markers and whitespace tokens
    char_to_remove = ["'s", " "]
    processed_tokens = [token for token in processed_tokens if token not in char_to_remove]

    # Remove words that appear in most summaries and carry no topical information
    word_to_remove = ['film','movie','story','tell','leave']
    processed_tokens = [token for token in processed_tokens if token not in word_to_remove]

    return " ".join(processed_tokens)
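
The three filters in `preprocess_text` can be tried without loading spaCy. The sketch below is a simplified stand-in, not the notebook's pipeline: `filter_tokens` is a hypothetical helper, the `(text, lemma)` pairs are written by hand in place of spaCy tokens, and `STOP_WORDS` is a toy subset of the NLTK list. It illustrates that stop words and punctuation are matched on the surface form, while the custom word list is matched on the lemma.

```python
import string

# Hypothetical stand-ins for the notebook's lists (the stop-word set is a toy subset)
STOP_WORDS = {"the", "a", "of", "about"}
CHAR_TO_REMOVE = ["'s", " "]
WORD_TO_REMOVE = ["film", "movie", "story", "tell", "leave"]

def filter_tokens(pairs):
    """Apply the same three filters as preprocess_text to (text, lemma) pairs."""
    # 1. Drop stop words and punctuation, matched on the surface text
    lemmas = [lemma for text, lemma in pairs
              if text not in STOP_WORDS and text not in string.punctuation]
    # 2. Drop leftover possessive markers and whitespace tokens
    lemmas = [lemma for lemma in lemmas if lemma not in CHAR_TO_REMOVE]
    # 3. Drop the custom words, matched on the lemma
    lemmas = [lemma for lemma in lemmas if lemma not in WORD_TO_REMOVE]
    return " ".join(lemmas)

# Hand-written (surface text, lemma) pairs for a made-up sentence
pairs = [
    ("the", "the"), ("film", "film"), ("tells", "tell"), ("a", "a"),
    ("story", "story"), ("about", "about"), ("pirates", "pirate"),
    ("stealing", "steal"), ("ships", "ship"), (".", "."),
]
cleaned = filter_tokens(pairs)
print(cleaned)
```

Here "tells" passes the stop-word filter on its surface form and is only removed in the last step, once its lemma "tell" is compared against the custom list.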

In [5]:
# Apply the preprocessing to the summaries
tqdm.pandas()
df_cleaned_summaries = df_summaries.copy()
df_cleaned_summaries["cleaned_summary"] = df_cleaned_summaries["summary"].progress_apply(preprocess_text)

# Drop the original summary column
df_cleaned_summaries.drop(columns=["summary"], inplace=True)

100%|██████████| 42303/42303 [1:20:13<00:00,  8.79it/s]  


Not all summaries are in English, so we use the `langdetect` library to detect the language of each summary and later filter out the non-English ones.

In [6]:
# Detect the language of each summary
from langdetect import detect

def detect_language(text):
    # langdetect raises an exception when it finds no usable features (e.g. empty text)
    try:
        return detect(text)
    except Exception:
        return "unknown"

df_cleaned_summaries["language"] = df_cleaned_summaries["cleaned_summary"].progress_apply(detect_language)

100%|██████████| 42303/42303 [19:08<00:00, 36.82it/s]


In [7]:
# Save the cleaned summaries
df_cleaned_summaries.to_csv("../../data/topicModelData/cleaned_summaries.csv", index=False)

In [None]:
# Load the cleaned summaries
df_cleaned_summaries = pd.read_csv("../../data/topicModelData/cleaned_summaries.csv")

In [None]:
percentage_english = df_cleaned_summaries["language"].value_counts(normalize=True)["en"] * 100
print('Percentage of summaries in English: {:.2f}%'.format(percentage_english))

Percentage of summaries in English: 99.46%


We could either drop the non-English summaries or translate them into English. In this notebook, we drop them.
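
To gauge how much data the language filter discards, the share printed above can be converted back into an approximate row count. This is only a back-of-the-envelope estimate: it uses the 42303 rows reported by the progress bar and the percentage rounded to two decimals, so the exact count may differ by a couple of rows.

```python
# Figures taken from the notebook outputs above
total_summaries = 42303      # rows processed by progress_apply
english_share = 99.46        # printed percentage, rounded to two decimals

# Approximate number of summaries removed by the language filter
non_english = round(total_summaries * (1 - english_share / 100))
remaining = total_summaries - non_english
print(non_english, remaining)
```

Roughly 228 summaries are dropped, which leaves about 42075 documents for the topic model.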

In [None]:
df_cleaned_summaries = df_cleaned_summaries[df_cleaned_summaries["language"] == "en"]

In [None]:
docs = df_cleaned_summaries["cleaned_summary"].tolist()

In [3]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer


1. Embedding

In [13]:
# Embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the summaries
embeddings = embedding_model.encode(docs, show_progress_bar=True)

Batches: 100%|██████████| 1315/1315 [42:35<00:00,  1.94s/it] 


In [14]:
# Save the embeddings
np.save("embeddings.npy", embeddings)

In [None]:
# Load the embeddings
embeddings = np.load("embeddings.npy")

2. UMAP

In [15]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

3. HDBSCAN

In [16]:
hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

4. Vectorizer

In [17]:
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

In [18]:
# Create the model
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       verbose=True)

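# Fit the model
# Note: this call re-encodes all documents; the precomputed array could be reused
# with fit_transform(docs, embeddings=embeddings) to skip the embedding step.
topics, probabilities = topic_model.fit_transform(docs)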
df_cleaned_summaries['topic'] = topics

2024-11-26 20:49:24,887 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1315/1315 [16:42:18<00:00, 45.73s/it]       
2024-11-27 13:31:50,447 - BERTopic - Embedding - Completed ✓
2024-11-27 13:31:50,450 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-27 13:33:40,868 - BERTopic - Dimensionality - Completed ✓
2024-11-27 13:33:40,881 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-27 13:34:07,625 - BERTopic - Cluster - Completed ✓
2024-11-27 13:34:07,634 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-27 13:34:37,968 - BERTopic - Representation - Completed ✓


In [19]:
# Save the model
topic_model.save("topic_model_min_cluster_size_50")



### Review topics

Merge similar topics

Drop non-informative topics

In [4]:
df_cleaned_summaries = pd.read_csv("../../data/topicModelData/cleaned_summaries.csv")
# Keep only the summaries in English
df_cleaned_summaries = df_cleaned_summaries[df_cleaned_summaries["language"] == "en"]
docs = df_cleaned_summaries["cleaned_summary"].tolist()

In [16]:
# Load new model
topic_model = BERTopic.load("topic_model_min_cluster_size_50")
topic_info = topic_model.get_topic_info()

In [17]:
topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,20926,-1_man_kill_make_try,"[man, kill, make, try, time, life, friend, new...",[backwoodsman name adam pontipee new bride mil...
1,0,5220,0_love_father_marry_family,"[love, father, marry, family, come, son, life,...",[ekam – son soil family relation family disput...
2,1,4606,1_kill_house_man_make,"[kill, house, man, make, car, try, police, hom...",[plot bronx borough new york city entire anglo...
3,2,2213,2_life_love_mother_father,"[life, love, mother, father, family, year, new...",[michelle jordan young energetic 8 year old gi...
4,3,1384,3_murder_police_detective_crime,"[murder, police, detective, crime, killer, pri...",[17 year old orphan henri young steal 5.00 gro...
5,4,1165,4_ho_li_master_man,"[ho, li, master, man, china, wong, chinese, ki...",[start lee ho get arm chop white faced henchma...
6,5,638,5_earth_scientist_planet_human,"[earth, scientist, planet, human, alien, space...",[plot concern legion wing serpent rogue group ...
7,6,477,6_confederate_town_union_man,"[confederate, town, union, man, sheriff, war, ...",[colonel jonas fanatical unrepentant confedera...
8,7,369,7_life_antonio_family_juan,"[life, antonio, family, juan, love, argentina,...",[2000 elsa 25 year old woman barely make livin...
9,8,352,8_earth_ship_planet_alien,"[earth, ship, planet, alien, space, human, cre...",[sometime future earth recover robot war devas...


Using the HDBSCAN model with a minimum cluster size of 50 reduces the number of topics from 225 to 44, which is a more manageable number. We can now merge similar topics and drop non-informative ones.

In [32]:
# Topics to merge (one pair active at a time)
# topics_to_merge = [0, 2]    # love and family
# topics_to_merge = [16, 18]  # Nazi Germany
# topics_to_merge = [10, 18]  # French culture
# topics_to_merge = [16, 26]  # pirates
topics_to_merge = [7, 10]     # Spain

topic_model.merge_topics(docs, topics_to_merge)
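
The topic tables before and after merging suggest that topics are renumbered by decreasing size after a merge: the merged "love and family" topic holds 5220 + 2213 = 7433 summaries and keeps id 0, while the former topics 3 and 4 move up to ids 2 and 3. If that is the case, the ids passed to a later merge refer to the renumbered topics, not the original ones. The sketch below reproduces this bookkeeping in plain Python on the counts from the first table. `merge_and_renumber` is a hypothetical helper written for illustration, not BERTopic's implementation.

```python
def merge_and_renumber(counts, topics_to_merge):
    """Merge the given topic ids, then renumber topics by decreasing size.

    The outlier topic -1 is assumed to keep its id and count.
    """
    merged_size = sum(counts[t] for t in topics_to_merge)
    remaining = [size for topic, size in counts.items()
                 if topic not in topics_to_merge and topic != -1]
    sizes = sorted(remaining + [merged_size], reverse=True)
    renumbered = {new_id: size for new_id, size in enumerate(sizes)}
    if -1 in counts:
        renumbered[-1] = counts[-1]
    return renumbered

# Counts of the first topics before merging, taken from the table above
counts_before = {-1: 20926, 0: 5220, 1: 4606, 2: 2213, 3: 1384, 4: 1165}
counts_after = merge_and_renumber(counts_before, [0, 2])
print(counts_after)
```

The resulting counts for ids 0 to 3 match the first rows of the table shown after merging.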

In [33]:
topic_info = topic_model.get_topic_info()
topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,20926,-1_man_kill_make_try,"[man, kill, make, try, time, life, friend, new...",[backwoodsman name adam pontipee new bride mil...
1,0,7433,0_love_father_family_marry,"[love, father, family, marry, life, son, come,...",[young widow commit suicide compel forego mode...
2,1,4606,1_kill_man_house_make,"[kill, man, house, make, car, try, police, hom...",[open michelle mancini drive night storm reali...
3,2,1384,2_murder_police_detective_crime,"[murder, police, detective, crime, killer, pri...",[psychotic killer garland red lynch use campai...
4,3,1165,3_ho_li_master_man,"[ho, li, master, man, china, wong, kill, chine...","[plot|date""farewell concubine study notes"">{{c..."
5,4,673,4_juan_family_miguel_love,"[juan, family, miguel, love, father, man, luis...",[juan oliver want make good impression new job...
6,5,638,5_earth_scientist_planet_human,"[earth, scientist, planet, human, alien, space...",[distant future war human race alien know gami...
7,6,477,6_confederate_town_union_man,"[confederate, town, union, man, war, sheriff, ...",[set state virginia american civil war james s...
8,7,396,7_jean_paris_pierre_marie,"[jean, paris, pierre, marie, love, julien, ant...","["" feel go life like american tourist many tow..."
9,8,356,8_german_nazi_hitler_war,"[german, nazi, hitler, war, berlin, germany, v...",[world war ii wehrmacht colonel claus von stau...


In [11]:
topic_names = {
    -1: "Unclassified",
    0: "Love & Family",
    1: "Crime",
    2: "Investigation",
    3: "Martial Arts",
    4: "Family Drama",
    5: "Sci-Fi Earth",
    6: "Civil War",
    7: "French Life",
    8: "WWII",
    9: "Space",
    10: "Pirates",
    11: "USSR",
    12: "Soldiers",
    13: "Cartoons",
    14: "Monsters",
    15: "Sports",
    16: "Samurai",
    17: "Tom & Jerry",
    18: "Royalty",
    19: "Christmas",
    20: "Stooges",
    21: "Africa",
    22: "Charlie Brown",
    23: "Middle East",
    24: "College",
    25: "Disney",
    26: "Tokyo Life",
    27: "Fantasy",
    28: "Pink Panther",
    29: "Yakuza",
    30: "Jungle",
    31: "Racing",
    32: "Musketeers",
    33: "Laurel & Hardy",
    34: "Betty Boop",
    35: "Godzilla",
    36: "Roman Empire",
    37: "Politics",
    38: "School Life",
    39: "Boxing"
}

topic_model.set_topic_labels(topic_names)

In [36]:
topic_info = topic_model.get_topic_info()
topic_info

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,20926,-1_man_kill_make_try,Unclassified,"[man, kill, make, try, time, life, friend, new...",[backwoodsman name adam pontipee new bride mil...
1,0,7433,0_love_father_family_marry,Love & Family,"[love, father, family, marry, life, son, come,...",[young widow commit suicide compel forego mode...
2,1,4606,1_kill_man_house_make,Crime,"[kill, man, house, make, car, try, police, hom...",[open michelle mancini drive night storm reali...
3,2,1384,2_murder_police_detective_crime,Investigation,"[murder, police, detective, crime, killer, pri...",[psychotic killer garland red lynch use campai...
4,3,1165,3_ho_li_master_man,Martial Arts,"[ho, li, master, man, china, wong, kill, chine...","[plot|date""farewell concubine study notes"">{{c..."
5,4,673,4_juan_family_miguel_love,Family Drama,"[juan, family, miguel, love, father, man, luis...",[juan oliver want make good impression new job...
6,5,638,5_earth_scientist_planet_human,Sci-Fi Earth,"[earth, scientist, planet, human, alien, space...",[distant future war human race alien know gami...
7,6,477,6_confederate_town_union_man,Civil War,"[confederate, town, union, man, war, sheriff, ...",[set state virginia american civil war james s...
8,7,396,7_jean_paris_pierre_marie,French Life,"[jean, paris, pierre, marie, love, julien, ant...","["" feel go life like american tourist many tow..."
9,8,356,8_german_nazi_hitler_war,WWII,"[german, nazi, hitler, war, berlin, germany, v...",[world war ii wehrmacht colonel claus von stau...


In [37]:
# Save the merged model
topic_model.save("topic_model_merged")



Assign each summary to its most probable topic

In [None]:
topic_model = BERTopic.load("topic_model_merged")

In [7]:
# Get the topics for each summary
topics, probabilities = topic_model.transform(docs)

# Add the topics to the dataframe
df_cleaned_summaries["topic"] = topics

Batches: 100%|██████████| 1315/1315 [36:55<00:00,  1.68s/it] 


In [12]:
# Add the names of the topics
df_cleaned_summaries["topic_name"] = df_cleaned_summaries["topic"].map(topic_names)
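
`Series.map` with a dictionary leaves a missing value (NaN) for every topic id that has no key in `topic_names`, so it can be worth checking that all assigned ids are labeled before saving. A minimal sketch of such a check in plain Python follows. `find_unlabeled_topics` and the toy data are hypothetical and only illustrate the idea.

```python
def find_unlabeled_topics(topic_ids, topic_names):
    """Return the sorted topic ids that have no entry in topic_names."""
    return sorted(set(topic_ids) - set(topic_names))

# Toy example: topic 40 was assigned to a summary but never named
example_names = {-1: "Unclassified", 0: "Love & Family", 1: "Crime"}
example_topics = [0, 1, -1, 40, 0]

missing = find_unlabeled_topics(example_topics, example_names)
print(missing)
```

In the notebook, the same check could be run as `find_unlabeled_topics(topics, topic_names)`. An empty list means every summary receives a topic name.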

In [13]:
# Save the dataframe
df_cleaned_summaries.to_csv("../../data/topicModelData/summaries_with_topics.csv", index=False)