
# BERTopic

Author: Xuedan ZOU (xuedan.zou.gr@dartmouth.edu)  

Date: 5/30/2022


## About

We tried the BERTopic method for topic modeling as a comparison to the traditional LDA method. The input to the algorithm is a collection of sentences (split from the original texts), and the output is the set of topics BERTopic discovers together with the top key words of each topic.

References:  

[1] https://maartengr.github.io/BERTopic/index.html 

[2] https://github.com/MaartenGr/BERTopic (code reference)

[3] Grootendorst, Maarten. "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv preprint arXiv:2203.05794 (2022).

## Data

First, install and import the necessary packages, then fetch our pre-prepared folk-story data.

In [None]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Using cached bertopic-0.10.0-py2.py3-none-any.whl (58 kB)
Collecting umap-learn>=0.5.0
  Using cached umap-learn-0.5.3.tar.gz (88 kB)
Collecting hdbscan>=0.8.28
  Using cached hdbscan-0.8.28.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sentence-transformers>=0.4.1
  Using cached sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Using cached transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Using cached huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Using cached tokenizers-0.12.1-cp37-cp37m-manylin

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import gdown
from bertopic import BERTopic

Next, download the data from Google Drive and read it into a DataFrame.

In [None]:
url = "https://drive.google.com/uc?id=1nX3c3JrEYaqaSybF8Yh4ou1C03FUkNGU"

output = 'asia.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]


Downloading...
From: https://drive.google.com/uc?id=1nX3c3JrEYaqaSybF8Yh4ou1C03FUkNGU
To: /content/asia.csv
100%|██████████| 2.31M/2.31M [00:00<00:00, 96.3MB/s]





We then convert the original texts into a list of strings. Because BERTopic treats each input string as a separate document, splitting each story into sentences gives us many more documents to feed into the model.

In [None]:
texts = df['text'].tolist()
data = []
for story in texts:
  # Naive sentence split: break each story on periods.
  for sentence in story.split("."):
    data.append(sentence)
print(len(data))


20910
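The period-based split above also keeps empty and whitespace-only fragments (for example, the piece after a story's final period). A minimal sketch of a stricter splitter is shown below; the helper names `split_into_sentences` and `stories_to_sentences` are hypothetical, and using it would change the sentence counts reported in this notebook.

```python
def split_into_sentences(story):
    """Split one story on periods, dropping empty or whitespace-only pieces."""
    return [piece.strip() for piece in story.split(".") if piece.strip()]


def stories_to_sentences(stories):
    """Flatten a list of stories into one list of sentences."""
    sentences = []
    for story in stories:
        # Skip missing values (pandas reads empty cells as float NaN).
        if not isinstance(story, str):
            continue
        sentences.extend(split_into_sentences(story))
    return sentences


sample = ["A fox ran. It hid.", float("nan"), "The end."]
print(stories_to_sentences(sample))  # ['A fox ran', 'It hid', 'The end']
```

Splitting on `"."` still breaks abbreviations and ignores `?` and `!`, so a dedicated sentence tokenizer may be a better choice for cleaner input.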


## BERTopic 

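We then run BERTopic on the prepared sentences. We tried three configurations: a minimum topic size of 30 (`min_topic_size=30`), automatic topic reduction (`nr_topics="auto"`), and a fixed target of 30 topics (`nr_topics=30`). For each region we used the last setting, which restricts the number of discovered topics to 30; the logs report 31 topics, which appears to include the outlier topic `-1`.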

In [None]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, min_topic_size=30)
topics, probs = topic_model.fit_transform(data) 

Batches:   0%|          | 0/654 [00:00<?, ?it/s]

2022-05-30 15:52:41,685 - BERTopic - Transformed documents to Embeddings
2022-05-30 15:53:09,287 - BERTopic - Reduced dimensionality
2022-05-30 15:53:23,249 - BERTopic - Clustered reduced embeddings


In [None]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics="auto")
topics, probs = topic_model.fit_transform(data) 

In [None]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model.fit_transform(data) 

Batches:   0%|          | 0/654 [00:00<?, ?it/s]

2022-05-30 16:21:53,365 - BERTopic - Transformed documents to Embeddings
2022-05-30 16:22:27,424 - BERTopic - Reduced dimensionality
2022-05-30 16:25:01,423 - BERTopic - Clustered reduced embeddings
2022-05-30 16:25:13,264 - BERTopic - Reduced number of topics from 292 to 31


## Results

The results are shown below. BERTopic offers several ways to inspect and visualize them, such as the topic table and a bar chart of the top key words per topic.

### Asia Result


In [None]:
freq = topic_model.get_topic_info(); freq.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 15345 | `-1_the_and_to_of` |
| 1 | 0 | 725 | `0_you_will_me_your` |
| 2 | 1 | 442 | `1_fairytale_note_traditionally_tale` |
| 3 | 2 | 345 | `2_fish_hunter_the_hook` |
| 4 | 3 | 268 | `3_you_said_are_it` |
| 5 | 4 | 237 | `4_sultan_the_to_sultans` |
| 6 | 5 | 219 | `5_monkey_crab_monkeys_crabs` |
| 7 | 6 | 214 | `6_princess_her_to_the` |
| 8 | 7 | 205 | `7_dragon_dragons_dragonking_the` |
| 9 | 8 | 184 | `8_we_ship_island_sea` |
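Each entry in the `Name` column joins the topic id and its top key words with underscores. As a small illustration, the sketch below splits such a name back into its parts; `parse_topic_name` is a hypothetical helper, not part of BERTopic, and it assumes that key words themselves contain no underscores.

```python
def parse_topic_name(name):
    """Split a name like '5_monkey_crab_monkeys_crabs' into (topic_id, keywords)."""
    topic_id, _, rest = name.partition("_")
    # Drop empty pieces, which appear when a key word is an empty string.
    keywords = [word for word in rest.split("_") if word]
    return int(topic_id), keywords


print(parse_topic_name("5_monkey_crab_monkeys_crabs"))
print(parse_topic_name("-1_the_and_to_of"))
```

Topic `-1` is the outlier topic, so its key words (mostly stop words here) describe the sentences that were not assigned to any cluster.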


In [None]:
new_topics, new_probs = topic_model.reduce_topics(data, topics, probs, nr_topics=15) 

2022-05-30 16:00:09,172 - BERTopic - Reduced number of topics from 92 to 16


In [None]:
topic_model.visualize_barchart(top_n_topics=30)

## Others

We then apply the same method to the other regions; the results are shown below.
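Since every region repeats the same download, split, and fit steps, they could be wrapped in one helper. The following is only a sketch under assumptions: the names `REGION_FILES`, `drive_url`, `load_sentences`, and `fit_region` are hypothetical, the Google Drive file IDs are copied from the cells of this notebook, and `gdown`, `pandas`, and `bertopic` are imported lazily so the module loads even where they are not installed.

```python
# Google Drive file IDs and local filenames, copied from this notebook's cells.
REGION_FILES = {
    "asia": ("1nX3c3JrEYaqaSybF8Yh4ou1C03FUkNGU", "asia.csv"),
    "africa": ("19I5XPS3rMIPAo7STn8oGl_Q498hQS1LP", "africa.csv"),
    "europe": ("1bS7Tqz3JlGCMvrDGn4Yq2IqaGMlrHCPc", "europe.csv"),
    "north_america": ("1WtodI_VHLKb9W9GZfOZOvzFjqOiAOkiy", "north_america_new.csv"),
    "south_america": ("1JT8yzb5Y0eX8unZep75f13XHhaMoAhBX", "south_america.csv"),
    "australia": ("1DbfJ_Luyn-kAbtt0VFCMceAMIsn8ZF5c", "australia.csv"),
}


def drive_url(file_id):
    """Build the direct-download URL used throughout this notebook."""
    return f"https://drive.google.com/uc?id={file_id}"


def load_sentences(region):
    """Download one region's CSV and split its stories into sentences."""
    import gdown
    import pandas as pd

    file_id, filename = REGION_FILES[region]
    gdown.download(drive_url(file_id), filename, quiet=False)
    df = pd.read_csv(filename, header=None)
    df.columns = ["numb", "country", "text"]
    sentences = []
    for story in df["text"].tolist():
        if isinstance(story, str):  # skip missing stories (NaN)
            sentences.extend(story.split("."))
    return sentences


def fit_region(region, nr_topics=30):
    """Fit BERTopic on one region with the settings used in this notebook."""
    from bertopic import BERTopic

    sentences = load_sentences(region)
    model = BERTopic(language="english", calculate_probabilities=True,
                     verbose=True, nr_topics=nr_topics)
    topics, probs = model.fit_transform(sentences)
    return model, topics, probs
```

With this in place, a call such as `model, topics, probs = fit_region("africa")` would stand in for the three cells that each region needs.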

In [None]:
url = "https://drive.google.com/uc?id=19I5XPS3rMIPAo7STn8oGl_Q498hQS1LP"
output = 'africa.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]


Downloading...
From: https://drive.google.com/uc?id=19I5XPS3rMIPAo7STn8oGl_Q498hQS1LP
To: /content/africa.csv
100%|██████████| 726k/726k [00:00<00:00, 69.0MB/s]


In [None]:
texts= df['text'].tolist()
data = []
for story in texts:
  tex = story.split(".")
  for sentence in tex :
    data.append(sentence)
print(len(data))

6071


In [None]:
topic_model_africa = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model_africa.fit_transform(data) 

Batches:   0%|          | 0/190 [00:00<?, ?it/s]

2022-05-30 18:13:37,978 - BERTopic - Transformed documents to Embeddings
2022-05-30 18:13:54,571 - BERTopic - Reduced dimensionality
2022-05-30 18:13:58,319 - BERTopic - Clustered reduced embeddings
2022-05-30 18:14:02,652 - BERTopic - Reduced number of topics from 119 to 31


### Africa Result


In [None]:
freq_africa = topic_model_africa.get_topic_info(); freq_africa.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3616 | `-1_the_he_and_to` |
| 1 | 0 | 226 | `0_wolf_ou_an_jackalse` |
| 2 | 1 | 192 | `1_ses_you_honey_agun` |
| 3 | 2 | 141 | `2_sculpat_hahsie_ses_de` |
| 4 | 3 | 126 | `3_me_you_will_ill` |
| 5 | 4 | 107 | `4_baasjes_baasje_stars_my` |
| 6 | 5 | 87 | `5_said_oh_you_master` |
| 7 | 6 | 85 | `6_tortoise_leopard_the_and` |
| 8 | 7 | 82 | `7_choose_satisfied_im_` |
| 9 | 8 | 82 | `8_sultan_the_daaraaee_gazelle` |


In [None]:
topic_model_africa.visualize_barchart(top_n_topics=30)

In [None]:
url = "https://drive.google.com/uc?id=1bS7Tqz3JlGCMvrDGn4Yq2IqaGMlrHCPc"
output = 'europe.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]


Downloading...
From: https://drive.google.com/uc?id=1bS7Tqz3JlGCMvrDGn4Yq2IqaGMlrHCPc
To: /content/europe.csv
100%|██████████| 3.03M/3.03M [00:00<00:00, 213MB/s]


In [None]:
texts= df['text'].tolist()
data = []
for story in texts:
  tex = story.split(".")
  for sentence in tex :
    data.append(sentence)
print(len(data))

24513


In [None]:
topic_model_europe = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model_europe.fit_transform(data) 

Batches:   0%|          | 0/767 [00:00<?, ?it/s]

2022-05-30 18:24:54,050 - BERTopic - Transformed documents to Embeddings
2022-05-30 18:25:23,191 - BERTopic - Reduced dimensionality
2022-05-30 18:29:25,264 - BERTopic - Clustered reduced embeddings
2022-05-30 18:29:37,237 - BERTopic - Reduced number of topics from 336 to 31


### Europe Result


In [None]:
freq_europe = topic_model_europe.get_topic_info(); freq_europe.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 17572 | `-1_the_and_to_of` |
| 1 | 0 | 803 | `0_you_naggeneen_said_do` |
| 2 | 1 | 574 | `1_ivan_tsar_tsarevich_the` |
| 3 | 2 | 299 | `2_sea_ship_the_and` |
| 4 | 3 | 291 | `3_violette_ourson_agnella_passerose` |
| 5 | 4 | 289 | `4_horse_the_horses_his` |
| 6 | 5 | 287 | `5_fairy_fairies_the_of` |
| 7 | 6 | 273 | `6_blondine_bonnebiche_beauminon_her` |
| 8 | 7 | 235 | `7_bova_korolevich_drushnevna_to` |
| 9 | 8 | 235 | `8_bondeson_comp_10_sharply` |


In [None]:
topic_model_europe.visualize_barchart(top_n_topics=30)

In [None]:
url = "https://drive.google.com/uc?id=1WtodI_VHLKb9W9GZfOZOvzFjqOiAOkiy"
output = 'north_america_new.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]

Downloading...
From: https://drive.google.com/uc?id=1WtodI_VHLKb9W9GZfOZOvzFjqOiAOkiy
To: /content/north_america_new.csv
100%|██████████| 799k/799k [00:00<00:00, 66.3MB/s]





In [None]:
texts= df['text'].tolist()
data = []
for story in texts:
  tex = story.split(".")
  for sentence in tex :
    data.append(sentence)
print(len(data))

7277


In [None]:
topic_model_northamerica = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model_northamerica.fit_transform(data) 

Batches:   0%|          | 0/228 [00:00<?, ?it/s]

2022-06-03 21:40:28,115 - BERTopic - Transformed documents to Embeddings
2022-06-03 21:41:13,027 - BERTopic - Reduced dimensionality
2022-06-03 21:41:17,647 - BERTopic - Clustered reduced embeddings
2022-06-03 21:41:22,003 - BERTopic - Reduced number of topics from 107 to 31


### North America Result

In [None]:
topic_model_northamerica.visualize_barchart(top_n_topics=30)

In [None]:
freq_northamerica = topic_model_northamerica.get_topic_info(); freq_northamerica.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4036 | `-1_the_and_he_of` |
| 1 | 0 | 520 | `0_you_will_your_my` |
| 2 | 1 | 254 | `1_her_she_and_to` |
| 3 | 2 | 171 | `2_rabbit_he_said_and` |
| 4 | 3 | 156 | `3_manabozho_his_of_was` |
| 5 | 4 | 140 | `4_old_man_you_said` |
| 6 | 5 | 135 | `5_lodge_the_of_in` |
| 7 | 6 | 127 | `6_he_his_foot_ball` |
| 8 | 7 | 115 | `7_grasshopper_manito_his_of` |
| 9 | 8 | 110 | `8_magician_owasso_the_mishosha` |


In [None]:
url = "https://drive.google.com/uc?id=1JT8yzb5Y0eX8unZep75f13XHhaMoAhBX"
output = 'south_america.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]


Downloading...
From: https://drive.google.com/uc?id=1JT8yzb5Y0eX8unZep75f13XHhaMoAhBX
To: /content/south_america.csv
100%|██████████| 236k/236k [00:00<00:00, 51.4MB/s]


In [None]:
texts= df['text'].tolist()
data = []
for story in texts:
  tex = story.split(".")
  for sentence in tex :
    data.append(sentence)
print(len(data))

2922


In [None]:
topic_model_southamerica = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model_southamerica.fit_transform(data) 


Batches:   0%|          | 0/92 [00:00<?, ?it/s]

2022-05-30 18:39:56,672 - BERTopic - Transformed documents to Embeddings
2022-05-30 18:40:12,518 - BERTopic - Reduced dimensionality
2022-05-30 18:40:13,342 - BERTopic - Clustered reduced embeddings
2022-05-30 18:40:15,771 - BERTopic - Reduced number of topics from 70 to 31


### South America Result

In [None]:
freq_southamerica = topic_model_southamerica.get_topic_info(); freq_southamerica.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 1294 | `-1_the_of_and_to` |
| 1 | 0 | 160 | `0_you_ill_come_me` |
| 2 | 1 | 133 | `1_he_too_away_and` |
| 3 | 2 | 103 | `2_giant_giants_the_land` |
| 4 | 3 | 101 | `3_tiger_goat_stag_the` |
| 5 | 4 | 86 | `4_rabbit_coyote_ox_the` |
| 6 | 5 | 65 | `5_dionysia_sea_serpent_labismena` |
| 7 | 6 | 63 | `6_man_father_mother_his` |
| 8 | 7 | 62 | `7_hen_white_little_quirrichi` |
| 9 | 8 | 55 | `8_bananas_peddler_wax_banana` |


In [None]:
topic_model_southamerica.visualize_barchart(top_n_topics=30)

In [None]:
url = "https://drive.google.com/uc?id=1DbfJ_Luyn-kAbtt0VFCMceAMIsn8ZF5c"
output = 'australia.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]

Downloading...
From: https://drive.google.com/uc?id=1DbfJ_Luyn-kAbtt0VFCMceAMIsn8ZF5c
To: /content/australia.csv
100%|██████████| 158k/158k [00:00<00:00, 45.9MB/s]


In [None]:
texts= df['text'].tolist()
data = []
for story in texts:
  tex = story.split(".")
  for sentence in tex :
    data.append(sentence)
print(len(data))

1507


In [None]:
topic_model_australia = BERTopic(language="english", calculate_probabilities=True, verbose=True,nr_topics=30)
topics, probs = topic_model_australia.fit_transform(data) 

Batches:   0%|          | 0/48 [00:00<?, ?it/s]

2022-05-30 18:41:22,515 - BERTopic - Transformed documents to Embeddings
2022-05-30 18:41:30,074 - BERTopic - Reduced dimensionality
2022-05-30 18:41:30,275 - BERTopic - Clustered reduced embeddings
2022-05-30 18:41:31,890 - BERTopic - Reduced number of topics from 36 to 31


### Australia Result

In [None]:
freq_australia = topic_model_australia.get_topic_info(); freq_australia.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 586 | `-1_the_and_they_of` |
| 1 | 0 | 106 | `0_she_her_his_he` |
| 2 | 1 | 69 | `1_tree_trees_bark_they` |
| 3 | 2 | 54 | `2_kangaroo_gwineeboo_rat_it` |
| 4 | 3 | 44 | `3_you_my_children_will` |
| 5 | 4 | 44 | `4_water_hole_kurreahs_narran` |
| 6 | 5 | 37 | `5_oom_their_tribe_dayoorls` |
| 7 | 6 | 35 | `6_goomblegubbon_dinewan_her_wings` |
| 8 | 7 | 33 | `7_weeoombeens_piggiebillah_fellows_emu` |
| 9 | 8 | 32 | `8_text___` |


In [None]:
topic_model_australia.visualize_barchart(top_n_topics=30)

In [None]:

url = "https://drive.google.com/uc?id=1TLjy0f_dDjiHf-mdFRUgmEleXLOq5UmB"
output = 'all_region_data_new.csv'
gdown.download(url, output, quiet=False)
# The download is already a plain CSV, so no unzip step is needed.
df = pd.read_csv(output, header=None)
df.columns = ["numb", "country", "text"]

Downloading...
From: https://drive.google.com/uc?id=1TLjy0f_dDjiHf-mdFRUgmEleXLOq5UmB
To: /content/all_region_data_new.csv
100%|██████████| 7.26M/7.26M [00:00<00:00, 28.6MB/s]


In [None]:
texts = df['text'].tolist()
data = []
for story in texts:
  # Skip missing stories: pandas reads empty cells as NaN, which is a float.
  if isinstance(story, float):
    continue
  for sentence in story.split("."):
    data.append(sentence)
print(len(data))

63195


In [None]:
topic_model_all_regions = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model_all_regions.fit_transform(data) 

Batches:   0%|          | 0/1975 [00:00<?, ?it/s]

2022-06-04 02:10:22,265 - BERTopic - Transformed documents to Embeddings
2022-06-04 02:12:23,071 - BERTopic - Reduced dimensionality
2022-06-04 03:35:37,184 - BERTopic - Clustered reduced embeddings


### General Result

In [None]:

freq_all_regions = topic_model_all_regions.get_topic_info(); freq_all_regions.head(30)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 28632 | `-1_she_her_of_and` |
| 1 | 0 | 1993 | `0_de_ou_ses_dat` |
| 2 | 1 | 889 | `1_you_will_me_us` |
| 3 | 2 | 854 | `2_ship_sea_island_shore` |
| 4 | 3 | 847 | `3_nos_mulvey_sharply_note` |
| 5 | 4 | 722 | `4_fish_fisherman_hook_fishing` |
| 6 | 5 | 539 | `5_yes_says_said_right` |
| 7 | 6 | 442 | `6_giant_giants_jack_giantess` |
| 8 | 7 | 393 | `7_tree_trees_pine_branches` |
| 9 | 8 | 384 | `8_fairy_fairies_bienfaisante_fairyland` |


In [None]:
topic_model_all_regions.visualize_barchart(top_n_topics=30)
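One way to compare the runs is the share of sentences that ended up in the outlier topic `-1`. The sketch below computes it from the counts printed in the outputs above (the `-1` row of each topic table and the sentence total printed for each region); `REGION_COUNTS` and `outlier_share` are hypothetical names used only for this illustration.

```python
# (sentences in topic -1, total sentences), copied from the outputs above.
REGION_COUNTS = {
    "Asia": (15345, 20910),
    "Africa": (3616, 6071),
    "Europe": (17572, 24513),
    "North America": (4036, 7277),
    "South America": (1294, 2922),
    "Australia": (586, 1507),
    "All regions": (28632, 63195),
}


def outlier_share(outliers, total):
    """Fraction of sentences that BERTopic left in the outlier topic -1."""
    return outliers / total


for region, (outliers, total) in REGION_COUNTS.items():
    print(f"{region}: {outlier_share(outliers, total):.1%} of sentences in topic -1")
```

A large outlier share means that most sentences were not assigned to any interpretable topic, which is worth keeping in mind when comparing these results with LDA.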