## Header 
Author : Younes Moussaif  
Date created : 14.12.21   
Date last modified : 15.12.21  
Description : Notebook implementing topic modelling using BERTopic

### Packages you may need to run the notebook:

- pip install -U scikit-learn
- pip install umap-learn
- pip install -U sentence-transformers
- pip install hdbscan OR conda install -c conda-forge hdbscan
- pip install bertopic 
 
You might also need to install Microsoft Visual C++ version 14.0 or higher: https://docs.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2015-2017-2019-and-2022

### Imports

In [199]:
import pandas as pd
import numpy as np
from bertopic import BERTopic
import torch
import pickle
import random
import spacy

In [200]:
from spacy.lang.en import English

In [201]:
PATH_GENERATED_DATA = './generated_data/'

In [202]:
# use enriched dataframe instead : 

data = pd.read_pickle(PATH_GENERATED_DATA+'df_enriched.pkl', compression='infer', storage_options=None)

### Preprocessing

Some topics contained stopwords such as "and" and "the", so we remove stopwords from the quotes before fitting the model.

In [203]:
# turn the quotes into a list
corpus_sentences = data['quotation'].tolist()

In [115]:
nlp = spacy.load("en_core_web_sm")

In [116]:
def remove_sw(quote):
    """Return the quote with spaCy stopwords removed, tokens re-joined by spaces."""
    doc = nlp(quote)
    # keep only the tokens that spaCy does not flag as stopwords
    words = [token.text for token in doc if not token.is_stop]
    return ' '.join(words)

In [None]:
# takes about 10 minutes to run
corpus_sentences_no_sw = [remove_sw(quote) for quote in corpus_sentences]
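Calling `nlp(quote)` once per quote processes every document separately, which probably accounts for much of the 10-minute runtime. A minimal sketch of a batched variant, assuming spaCy's `nlp.pipe` and that only the `is_stop` flag is needed. The helper name `remove_sw_batch` and the batch size are illustrative and not part of this notebook:

```python
def remove_sw_batch(nlp, quotes, batch_size=256):
    """Remove stopwords from many quotes using spaCy's batched nlp.pipe.

    `nlp` is any loaded spaCy pipeline. The function name and batch_size
    are illustrative choices, not taken from the notebook.
    """
    cleaned = []
    for doc in nlp.pipe(quotes, batch_size=batch_size):
        # keep every token that spaCy does not flag as a stopword
        cleaned.append(' '.join(token.text for token in doc if not token.is_stop))
    return cleaned

# Possible usage (assumes the en_core_web_sm model is installed):
#   nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
#   corpus_sentences_no_sw = remove_sw_batch(nlp, corpus_sentences)
```

Disabling the parser and the named-entity recognizer is an assumption about what can safely be skipped here; the stopword flag itself does not depend on them.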

In [None]:
# save to pickle
with open(PATH_GENERATED_DATA+'BERTopic/corpus_sentences_no_sw.pkl', 'wb') as f:
    pickle.dump(corpus_sentences_no_sw, f)

In [71]:
# load from pickle 
with open(PATH_GENERATED_DATA+'BERTopic/corpus_sentences_no_sw.pkl', 'rb') as f:
    corpus_sentences_no_sw = pickle.load(f)

## Topic modelling with BERTopic

- GitHub: https://github.com/MaartenGr/BERTopic
- Article (from the author of the package): https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8
- Documentation: https://maartengr.github.io/BERTopic/api/bertopic.html

In [204]:
corpus_sentences[1:3]

['more family-friendly and flexible workplaces, and affordable child care, for everyone',
 'We need more women and parents in Parliament. And we need more family-friendly and flexible workplaces, and affordable child care, for everyone.']

In [205]:
# select the GPU if available (note: `device` is not passed to BERTopic in this notebook)
if torch.cuda.is_available():      
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [206]:
device

device(type='cuda')
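The `device` object is never handed to BERTopic in this notebook. sentence-transformers normally picks CUDA on its own when it is available, so the embedding step most likely ran on the GPU anyway. If the device should be set explicitly, one option is to build the embedding model by hand and pass it in. A minimal sketch, assuming the `embedding_model` argument of `BERTopic` and the `device` argument of `SentenceTransformer`. The helper names and the model name `all-MiniLM-L6-v2` are assumptions, not taken from this notebook:

```python
def pick_device_name(cuda_available):
    """Map the result of torch.cuda.is_available() to a device string."""
    return "cuda" if cuda_available else "cpu"

def build_model_on_device(device_name, model_name="all-MiniLM-L6-v2", **bertopic_kwargs):
    """Build a BERTopic model whose sentence embedder runs on `device_name`.

    The model name is an assumption for illustration; replace it with the
    embedding model you actually want to use.
    """
    # imported lazily so the helper above works without the heavy dependencies
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic

    embedder = SentenceTransformer(model_name, device=device_name)
    return BERTopic(embedding_model=embedder, **bertopic_kwargs)

# Possible usage:
#   topic_model = build_model_on_device(pick_device_name(torch.cuda.is_available()), verbose=True)
```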

In [207]:
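# set seed to make the experiment reproducible
# note: this only seeds Python's `random` module (the value printed below is the same on every run);
# BERTopic's run-to-run variation comes mainly from UMAP, which does not read this seed
random.seed(0)
print(random.random())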

0.8444218515250481
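The printed value is identical on every run, so `random.seed(0)` does work for Python's own generator. The topics still change between runs because UMAP, which BERTopic uses for dimensionality reduction, is stochastic and does not use Python's `random` module. A minimal sketch of one way to fix this, assuming a custom UMAP instance is passed through BERTopic's `umap_model` argument. The helper names are hypothetical, and the UMAP values other than `random_state` are meant to mirror BERTopic's documented defaults, which should be checked against the installed version:

```python
import random

def umap_settings(seed=42):
    """UMAP keyword arguments with a fixed random_state.

    The other values are assumed to match BERTopic's defaults;
    verify them for your installed version.
    """
    return {"n_neighbors": 15, "n_components": 5, "min_dist": 0.0,
            "metric": "cosine", "random_state": seed}

def build_seeded_topic_model(seed=42, **bertopic_kwargs):
    """Build a BERTopic model whose UMAP step uses a fixed seed."""
    # imported lazily so the settings helper works without umap/bertopic installed
    from umap import UMAP
    from bertopic import BERTopic

    return BERTopic(umap_model=UMAP(**umap_settings(seed)), **bertopic_kwargs)

# Python's own generator is reproducible once seeded
random.seed(0)
first = random.random()
random.seed(0)
second = random.random()
print(first == second)
```

A fixed `random_state` may make UMAP slower, since it limits parallelism, so this trades some speed for reproducibility.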


In [208]:
corpus_sentences = data['quotation'].tolist()

In [209]:
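# loading the model
# note: nr_topics is passed as the string "25" rather than the integer 25, so it is
# not applied as a target of 25 topics (the log below shows 444 -> 294);
# use nr_topics=25 to request 25 topics
topic_model = BERTopic(nr_topics="25", verbose=True)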

In [210]:
random.seed(0) # no effect on BERTopic: UMAP does not use Python's `random` module

# obtaining the topics by fitting the model to our corpus
topics, probs = topic_model.fit_transform(corpus_sentences_no_sw)

Batches:   0%|          | 0/1729 [00:00<?, ?it/s]

2021-12-15 23:22:17,827 - BERTopic - Transformed documents to Embeddings
2021-12-15 23:22:46,218 - BERTopic - Reduced dimensionality with UMAP
2021-12-15 23:22:52,164 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2021-12-15 23:23:00,019 - BERTopic - Reduced number of topics from 444 to 294


In [None]:
# save to pickle

#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_model.pkl', 'wb') as f:
#    pickle.dump(topic_model, f)
#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topics.pkl', 'wb') as f:
#    pickle.dump(topics, f)
#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_probs.pkl', 'wb') as f:
#    pickle.dump(probs, f)

In [217]:
# load from pickle
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_model.pkl', 'rb') as f:
    topic_model = pickle.load(f)
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topics.pkl', 'rb') as f:
    topics = pickle.load(f)
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_probs.pkl', 'rb') as f:
    probs = pickle.load(f)

In [218]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 23724 | -1_women_rights_men_sexual |
| 1 | 0 | 6626 | 0_child_care_families_parents |
| 2 | 1 | 972 | 1_equal_pay_work_deserve |
| 3 | 2 | 967 | 2_sport_sports_athletes_olympic |
| 4 | 3 | 893 | 3_talaq_triple_muslim_supreme |
| ... | ... | ... | ... |
| 280 | 287 | 11 | 287_incidents_incident_violation_boycotted |
| 279 | 283 | 11 | 283_puzder_fast_theft_ceos |
| 289 | 288 | 10 | 288_condemn_aamir_foregrounding_productions |
| 290 | 289 | 10 | 289_condemn_condemns_tabloid_strongly |


In [219]:
topic_model.get_topic(0)

[('child', 0.021795932698919697),
 ('care', 0.020721303979970813),
 ('families', 0.012209292106688903),
 ('parents', 0.010311399460665685),
 ('quality', 0.010029045212635672),
 ('affordable', 0.009706152740018436),
 ('children', 0.007964286233083017),
 ('tax', 0.006922717497713701),
 ('cost', 0.0068375567358329855),
 ('early', 0.006397330469115812)]

In [220]:
topic_model.get_topic(1)

[('equal', 0.047050339786462086),
 ('pay', 0.04190260913636381),
 ('work', 0.016895808128517167),
 ('deserve', 0.01017407542071993),
 ('paid', 0.009005233500831624),
 ('unequal', 0.008642807062243204),
 ('value', 0.008107994213026883),
 ('earn', 0.007937821169087315),
 ('job', 0.0070908928436217876),
 ('paycheck', 0.006179442658798376)]

In [221]:
topic_model.visualize_topics()

In [222]:
fig = topic_model.visualize_topics()
fig.write_html(PATH_GENERATED_DATA+'BERTopic/visualise_topics.html')

In [223]:
topic_model.visualize_barchart() 

In [224]:
fig2 = topic_model.visualize_barchart() 
fig2.write_html(PATH_GENERATED_DATA+'BERTopic/visualise_barcharts.html')

In [88]:
# We still have too many topics (almost 300), so we reduce them further to 35
# (the log below reports 36 because it also counts the outlier topic -1)

# source : https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8

topics_red, probs_red = topic_model.reduce_topics(corpus_sentences_no_sw, topics, probs, nr_topics=35)

2021-12-15 23:01:09,904 - BERTopic - Reduced number of topics from 292 to 36


In [None]:
# save to pickle

#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_model_reduced.pkl', 'wb') as f:
#    pickle.dump(topic_model, f)
#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topics_reduced.pkl', 'wb') as f:
#    pickle.dump(topics_red, f)
#with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_probs_reduced.pkl', 'wb') as f:
#    pickle.dump(probs_red, f)

In [233]:
# load from pickle
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_model_reduced.pkl', 'rb') as f:
    topic_model = pickle.load(f)
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_topics_reduced.pkl', 'rb') as f:
    topics_red = pickle.load(f)
with open(PATH_GENERATED_DATA+'BERTopic/BERTopic_probs_reduced.pkl', 'rb') as f:
    probs_red = pickle.load(f)

In [234]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 31511 | -1_women_sexual_harassment_rights |
| 1 | 0 | 6638 | 0_child_care_families_quality |
| 2 | 1 | 1348 | 1_equal_pay_work_paid |
| 3 | 2 | 995 | 2_sport_sports_equal_athletes |
| 4 | 3 | 945 | 3_muslim_talaq_triple_bill |
| 5 | 4 | 911 | 4_workplace_harassment_sexual_eeoc |
| 6 | 5 | 846 | 5_violence_based_gender_domestic |
| 7 | 6 | 793 | 6_students_sexual_university_harassment |
| 8 | 7 | 727 | 7_candidates_female_clinton_hillary |
| 9 | 8 | 689 | 8_film_hollywood_films_movie |


In [235]:
topic_model.get_topic(0)

[('child', 0.0921881854408564),
 ('care', 0.09184372356120485),
 ('families', 0.03422215339805453),
 ('quality', 0.023208074708701378),
 ('parents', 0.023107753827824613),
 ('affordable', 0.021888509122598303),
 ('children', 0.021655873006755783),
 ('working', 0.014988081982936663),
 ('need', 0.014623979674174956),
 ('health', 0.0145967362609869)]

In [236]:
topic_model.get_topic(1)

[('equal', 0.16367295551620634),
 ('pay', 0.141265458450939),
 ('work', 0.04976420805945432),
 ('paid', 0.03272955008457474),
 ('deserve', 0.028792649877411015),
 ('women', 0.02725687316352579),
 ('card', 0.02530974399987919),
 ('playing', 0.023369253398065105),
 ('leave', 0.019767881647891936),
 ('woman', 0.016789963522756485)]

In [237]:
topic_model.visualize_topics()

In [238]:
fig3 = topic_model.visualize_topics()
fig3.write_html(PATH_GENERATED_DATA+'BERTopic/visualise_topics_red.html')

In [239]:
topic_model.visualize_barchart()

In [240]:
fig4 = topic_model.visualize_barchart()
fig4.write_html(PATH_GENERATED_DATA+'BERTopic/visualise_barcharts_red.html')

In [241]:
topic_keywords_df = topic_model.get_topic_info()
topic_keywords_df.head()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 31511 | -1_women_sexual_harassment_rights |
| 1 | 0 | 6638 | 0_child_care_families_quality |
| 2 | 1 | 1348 | 1_equal_pay_work_paid |
| 3 | 2 | 995 | 2_sport_sports_equal_athletes |
| 4 | 3 | 945 | 3_muslim_talaq_triple_bill |


In [96]:
import re

In [97]:
# strip only the leading topic id (e.g. "-1_" or "12_") so that digits or hyphens inside keywords are kept
topic_keywords_df['Name'] = topic_keywords_df['Name'].apply(lambda x: re.sub(r'^-?\d+_', '', x).split('_'))

In [98]:
topic_keywords_df.head()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 31511 | [women, sexual, harassment, rights] |
| 1 | 0 | 6638 | [child, care, families, quality] |
| 2 | 1 | 1348 | [equal, pay, work, paid] |
| 3 | 2 | 995 | [sport, sports, equal, athletes] |
| 4 | 3 | 945 | [muslim, talaq, triple, bill] |
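The keywords can also be read from `get_topic` instead of being parsed out of the `Name` string. A minimal sketch, assuming `get_topic` returns a list of `(word, score)` pairs as in the outputs above. `top_keywords` is a hypothetical helper, not a BERTopic function:

```python
def top_keywords(topic_words, n=4):
    """Return the first n words from a list of (word, score) pairs."""
    return [word for word, _score in topic_words[:n]]

# first entries of topic_model.get_topic(0) for the reduced model, copied from the output above
topic_0 = [('child', 0.0921881854408564),
           ('care', 0.09184372356120485),
           ('families', 0.03422215339805453),
           ('quality', 0.023208074708701378),
           ('parents', 0.023107753827824613)]

print(top_keywords(topic_0))  # ['child', 'care', 'families', 'quality']
```

Applied to the whole table, this could look like `topic_keywords_df['Topic'].map(lambda t: top_keywords(topic_model.get_topic(t)))`, which avoids the regular expression altogether.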


In [99]:
#topic_keywords_df.to_pickle(PATH_GENERATED_DATA+'BERTopic/topic_keywords_df.pkl', compression='infer', protocol=5, storage_options=None)

In [101]:
topic_keywords_df = pd.read_pickle(PATH_GENERATED_DATA+'BERTopic/topic_keywords_df.pkl')

In [102]:
# put the topic numbers and the original quotes (with stopwords) into a df
# https://github.com/MaartenGr/BERTopic/issues/189
topic_docs = {topic: [] for topic in set(topics_red)}
for topic, doc in zip(topics_red, corpus_sentences):
    topic_docs[topic].append(doc)
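The topics were fitted on `corpus_sentences_no_sw` but are paired here with `corpus_sentences`. This works only because both lists have the same length and order, and `zip` silently truncates if they do not. A minimal sketch of the same grouping with an explicit length check; `group_docs_by_topic` is a hypothetical helper name:

```python
from collections import defaultdict

def group_docs_by_topic(topics, docs):
    """Group documents by topic label, refusing misaligned inputs."""
    if len(topics) != len(docs):
        raise ValueError(f"{len(topics)} topic labels but {len(docs)} documents")
    grouped = defaultdict(list)
    for topic, doc in zip(topics, docs):
        grouped[topic].append(doc)
    return dict(grouped)

# toy data for illustration only
example = group_docs_by_topic([0, -1, 0], ["quote a", "quote b", "quote c"])
print(example)
```

In the notebook this would be called as `group_docs_by_topic(topics_red, corpus_sentences)`.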

In [103]:
topic_docs.keys()

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, -1])

In [104]:
topics_df = pd.Series(topic_docs, name='quotation').rename_axis('topic').explode().reset_index()

In [105]:
topics_df.head()

| | topic | quotation |
|---|---|---|
| 0 | 0 | more family-friendly and flexible workplaces, ... |
| 1 | 0 | We need more women and parents in Parliament. ... |
| 2 | 0 | Fair pay and fair scheduling, paid family leav... |
| 3 | 0 | Every American deserves a fair shot at success... |
| 4 | 0 | Right now in many states, quality child care i... |


In [106]:
#topics_df.to_pickle(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_topics_df.pkl', compression='infer', protocol=5, storage_options=None)

In [25]:
# Otherwise we can restrict the number of topics 

In [6]:
topics_df = pd.read_pickle(PATH_GENERATED_DATA+'BERTopic/BERTopic_topic_topics_df.pkl')

In [107]:
topics_df.head()

| | topic | quotation |
|---|---|---|
| 0 | 0 | more family-friendly and flexible workplaces, ... |
| 1 | 0 | We need more women and parents in Parliament. ... |
| 2 | 0 | Fair pay and fair scheduling, paid family leav... |
| 3 | 0 | Every American deserves a fair shot at success... |
| 4 | 0 | Right now in many states, quality child care i... |


In [108]:
topic_model.get_topic(0)

[('child', 0.0921881854408564),
 ('care', 0.09184372356120485),
 ('families', 0.03422215339805453),
 ('quality', 0.023208074708701378),
 ('parents', 0.023107753827824613),
 ('affordable', 0.021888509122598303),
 ('children', 0.021655873006755783),
 ('working', 0.014988081982936663),
 ('need', 0.014623979674174956),
 ('health', 0.0145967362609869)]

In [109]:
topics_df[topics_df['topic']==0].head()

| | topic | quotation |
|---|---|---|
| 0 | 0 | more family-friendly and flexible workplaces, ... |
| 1 | 0 | We need more women and parents in Parliament. ... |
| 2 | 0 | Fair pay and fair scheduling, paid family leav... |
| 3 | 0 | Every American deserves a fair shot at success... |
| 4 | 0 | Right now in many states, quality child care i... |


In [110]:
topic_model.get_topic(1)

[('equal', 0.16367295551620634),
 ('pay', 0.141265458450939),
 ('work', 0.04976420805945432),
 ('paid', 0.03272955008457474),
 ('deserve', 0.028792649877411015),
 ('women', 0.02725687316352579),
 ('card', 0.02530974399987919),
 ('playing', 0.023369253398065105),
 ('leave', 0.019767881647891936),
 ('woman', 0.016789963522756485)]

In [111]:
topics_df[topics_df['topic']==1].head()

| | topic | quotation |
|---|---|---|
| 6638 | 1 | if advocating for equal pay for equal work is ... |
| 6639 | 1 | So it will include, whether it's raising the m... |
| 6640 | 1 | The Republicans often say, `Well, there she go... |
| 6641 | 1 | Scott Walker repealed protections for equal pay. |
| 6642 | 1 | Equal pay, paid leave, child care, these are n... |


In [112]:
topics_df['topic'].value_counts()

-1     31511
 0      6638
 1      1348
 2       995
 3       945
 4       911
 5       846
 6       793
 7       727
 8       689
 9       635
 10      616
 11      595
 12      571
 13      571
 14      509
 15      440
 16      439
 17      401
 18      395
 19      392
 20      362
 21      350
 22      335
 23      330
 24      327
 25      320
 26      319
 27      318
 28      311
 29      263
 30      254
 31      220
 32      218
 33      217
 34      217
Name: topic, dtype: int64
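The counts above add up to 55328 quotes, of which 31511 (about 57%) fall into topic -1, the outlier topic. A minimal sketch for computing this share directly from a list of topic labels such as `topics_red`; `outlier_share` is a hypothetical helper name:

```python
from collections import Counter

def outlier_share(topic_labels, outlier_label=-1):
    """Fraction of documents assigned to the outlier topic."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no topic labels given")
    return counts[outlier_label] / total

# toy labels for illustration only
print(outlier_share([-1, -1, 0, 1]))  # 0.5
```

With more than half of the quotes left unassigned, the named topics describe only a minority of the corpus, which is worth keeping in mind when interpreting them.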