
#**Topic modelling using Latent Semantic Analysis (LSA)**

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

Latent semantic analysis (LSA) is a technique in natural language processing, and in distributional semantics in particular, that analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to both.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Dataset link:-
https://www.kaggle.com/datasets/eswarchandt/amazon-music-reviews?select=Musical_instruments_reviews.csv


In [None]:
import pandas as pd

# load data
df = pd.read_csv('/content/gdrive/My Drive/INFO 5731 TA/Datasets/Musical_instruments_reviews.csv', usecols=['reviewerID', 'reviewText'])
df.head()

Unnamed: 0,reviewerID,reviewText
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac..."
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...
2,A195EZSQDW3E21,The primary job of this device is to block the...
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...


In [None]:
df['reviewText'].isna().value_counts()

False    10254
True         7
Name: reviewText, dtype: int64

In [None]:
print(df.shape)
df = df.dropna()
print(df.shape)

(10261, 2)
(10254, 2)


In [None]:
from gensim.parsing.preprocessing import (
    remove_stopwords, strip_punctuation,
    preprocess_string, strip_short, stem_text,
)

# preprocess given text
def preprocess(text):

    # clean text based on given filters
    CUSTOM_FILTERS = [lambda x: x.lower(),
                                remove_stopwords,
                                strip_punctuation,
                                strip_short,
                                stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)

    return text

# apply function to all reviews
df['Text (Clean)'] = df['reviewText'].apply(lambda x: preprocess(x))

In [None]:
# preview of dataset
df.head()

Unnamed: 0,reviewerID,reviewText,Text (Clean)
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...","[write, here, exactli, suppos, filter, pop, so..."
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,"[product, exactli, afford, realiz, doubl, scre..."
2,A195EZSQDW3E21,The primary job of this device is to block the...,"[primari, job, devic, block, breath, produc, p..."
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,"[nice, windscreen, protect, mxl, mic, prevent,..."
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,"[pop, filter, great, look, perform, like, stud..."


In [None]:
from gensim import corpora

# create a dictionary with the corpus
corpus = df['Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]
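
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`: map each token to an integer id and count occurrences per document. The helpers `build_vocab` and `to_bow` and the toy documents are hypothetical and only for illustration; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical, already-preprocessed reviews.
toy_docs = [["pop", "filter", "great"], ["great", "pedal", "great", "tone"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)
```

Each document becomes a sparse list of `(id, count)` pairs, which is the shape of input that `LsiModel` consumes in the next cell.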

In [None]:
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# The coherence score measures how interpretable the topics are to humans.
# Compute it for models with 2 to 10 topics.
for i in range(2,11):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

Coherence score with 2 clusters: 0.45967954833086194
Coherence score with 3 clusters: 0.4360426022881803
Coherence score with 4 clusters: 0.43212298815202504
Coherence score with 5 clusters: 0.41792118515582066
Coherence score with 6 clusters: 0.39990425361388793
Coherence score with 7 clusters: 0.41304814338254403
Coherence score with 8 clusters: 0.3692227573209305
Coherence score with 9 clusters: 0.34758997063002595
Coherence score with 10 clusters: 0.36856587803793983
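
The scores above can be compared programmatically to choose the number of topics. The sketch below copies the printed values (rounded to four decimals) into a dictionary and picks the highest one; the name `coherence_by_k` is introduced here for illustration, and highest coherence is only one heuristic for choosing the topic count.

```python
# Coherence scores copied from the output above, rounded to four decimals.
coherence_by_k = {
    2: 0.4597, 3: 0.4360, 4: 0.4321, 5: 0.4179, 6: 0.3999,
    7: 0.4130, 8: 0.3692, 9: 0.3476, 10: 0.3686,
}

# Pick the number of topics with the highest coherence.
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(best_k, coherence_by_k[best_k])
```

This is consistent with the next cell, which fits the model with `num_topics=2`.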


In [None]:
# perform SVD on the bag of words with the LsiModel to extract 2 topics
lsi = LsiModel(bow, num_topics=2, id2word=dictionary)

In [None]:
# find the 10 words with the strongest association to the derived topics
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))

Words in 0: 0.329*"sound" + 0.314*"guitar" + 0.242*"string" + 0.232*"pedal" + 0.217*"amp" + 0.214*"like" + 0.198*"us" + 0.163*"plai" + 0.161*"good" + 0.148*"great".
Words in 1: -0.584*"string" + 0.428*"pedal" + -0.380*"guitar" + 0.312*"amp" + -0.161*"tune" + 0.157*"sound" + 0.114*"tone" + -0.099*"tuner" + 0.090*"effect" + 0.080*"tube".


In [None]:
# find the scores given between the review and each topic
corpus_lsi = lsi[bow]
score1 = []
score2 = []
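for doc in corpus_lsi:
    # doc is a list of (topic_id, score) pairs; it can be shorter than
    # num_topics (e.g. for an empty review), so look scores up by topic id
    scores = dict(doc)
    score1.append(round(scores.get(0, 0.0), 2))
    score2.append(round(scores.get(1, 0.0), 2))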

# create data frame that shows scores assigned for both topics for each review
df_topic = pd.DataFrame()
df_topic['Text'] = df['reviewText']
df_topic['Topic 0 score'] = score1
df_topic['Topic 1 score'] = score2
df_topic['Topic']= df_topic[['Topic 0 score', 'Topic 1 score']].apply(lambda x: x.argmax(), axis=1)
df_topic.head(1)

Unnamed: 0,Text,Topic 0 score,Topic 1 score,Topic
0,"Not much to write about here, but it does exac...",0.88,0.22,0


In [None]:
# find a sample review from each topic
df_topic0 = df_topic[df_topic['Topic'] == 0]
df_topic1 = df_topic[df_topic['Topic']==1]
print('Sample text from topic 0:\n {}'.format(df_topic0.sample(1, random_state=2)['Text'].values))
print('\nSample text from topic 1:\n {}'.format(df_topic1.sample(1, random_state=2)['Text'].values))

Sample text from topic 0:
 ["If you need a tube screamer. You can't go wrong with the TS808. The only reason why you shouldn't get it, is if it isn't the right tone for you. I say this because you can't say anything bad about this pedal. It sounds amazing. It just might not be your sound. I personally went with the TS808 because it compliments my amp better than the TS9. People who say this is better than the TS9 or vice versa are choosing the wrong wording. They prefer it. It's not better. Both are great."]

Sample text from topic 1:
 ["While this is not a $300 tape delay...it's a solid pedal that gets it done. It's way better built than other pedals in this price range. I'm sold. I've purchased several JoYo pedals and all of them are built extremely well and perform as good as if not better than top name pedals."]


#**Topic modelling using LDA**

Latent Dirichlet Allocation (LDA) is an example of a topic model. It is used to assign the text in a document to one or more topics.

Dataset link:-
https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv

1. load data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os

#check the file path
file_path = '/content/gdrive/My Drive/Colab Notebooks/5731/Datasets/papers.csv'
if os.path.exists(file_path):
    print("File found!")
else:
    print("File not found. Please check the path.")

File found!


In [None]:
# Importing modules
import pandas as pd

# Read data into papers
papers = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/5731/Datasets/papers.csv')
# Print head
papers.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


2. Data preprocessing

In [None]:
# Drop the columns that are not needed and keep a random sample of 100 papers
papers = papers.drop(columns=['id', 'event_type', 'pdf_name']).sample(100)
# Print out the first rows of papers
papers.head()

Unnamed: 0,year,title,abstract,paper_text
3104,2009,Streaming k-means approximation,We provide a clustering algorithm that approxi...,Streaming k-means approximation\nNir Ailon\nGo...
3364,2010,An analysis on negative curvature induced by s...,"In the neural-network parameter space, an att...",An analysis on negative curvature induced by\n...
3390,1990,Convergence of a Neural Network Classifier,Abstract Missing,Convergence of a Neural Network Classifier\n\n...
4716,2014,An Autoencoder Approach to Learning Bilingual ...,Cross-language learning allows us to use train...,An Autoencoder Approach to Learning\nBilingual...
6392,2017,Revenue Optimization with Approximate Bid Pred...,"In the context of advertising auctions, findin...",Revenue Optimization with Approximate Bid\nPre...


In [None]:
# Load the regular expression library
import re
# Remove punctuation
papers['paper_text_processed'] = \
papers['paper_text'].map(lambda x: re.sub(r'[,.!?]', '', x))
# Convert the text to lowercase
papers['paper_text_processed'] = \
papers['paper_text_processed'].map(lambda x: x.lower())
# Print out the first rows of papers
papers['paper_text_processed'].head()

Unnamed: 0,paper_text_processed
0,767\n\nself-organization of associative databa...
1,683\n\na mean field theory of layer iv of visu...
2,394\n\nstoring covariance by the associative\n...
3,bayesian query construction for neural\nnetwor...
4,neural network ensembles cross\nvalidation and...


In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
def sent_to_words(sentences):
    for sentence in sentences:
        # tokenize and lowercase; deacc=True also strips accent marks
        yield simple_preprocess(str(sentence), deacc=True)
def remove_stopwords(texts):
    # texts are already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words]
            for doc in texts]
data = papers.paper_text_processed.values.tolist()
data_words = list(sent_to_words(data))
# remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1][0][:30])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['self', 'organization', 'associative', 'database', 'applications', 'hisashi', 'suzuki', 'suguru', 'arimoto', 'osaka', 'university', 'toyonaka', 'osaka', 'japan', 'abstract', 'efficient', 'method', 'self', 'organizing', 'associative', 'databases', 'proposed', 'together', 'applications', 'robot', 'eyesight', 'systems', 'proposed', 'databases', 'associate']


In [None]:
# Create a dictionary and a corpus from the preprocessed text data
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])

[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 6), (6, 1), (7, 1), (8, 3), (9, 1), (10, 2), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 4), (18, 8), (19, 1), (20, 1), (21, 2), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]


3. LDA model

In [None]:
from pprint import pprint
# number of topics
num_topics = 10
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print the top keywords for each of the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]



[(0,
  '0.006*"learning" + 0.005*"model" + 0.005*"set" + 0.004*"data" + '
  '0.004*"time" + 0.004*"using" + 0.004*"function" + 0.004*"network" + '
  '0.003*"training" + 0.003*"one"'),
 (1,
  '0.009*"model" + 0.007*"data" + 0.006*"learning" + 0.005*"using" + '
  '0.004*"set" + 0.004*"algorithm" + 0.004*"time" + 0.004*"training" + '
  '0.004*"models" + 0.003*"two"'),
 (2,
  '0.006*"learning" + 0.005*"function" + 0.005*"model" + 0.005*"data" + '
  '0.004*"algorithm" + 0.004*"problem" + 0.004*"time" + 0.004*"one" + '
  '0.003*"set" + 0.003*"models"'),
 (3,
  '0.007*"learning" + 0.006*"algorithm" + 0.006*"model" + 0.005*"data" + '
  '0.005*"set" + 0.005*"function" + 0.004*"using" + 0.004*"time" + '
  '0.004*"matrix" + 0.004*"problem"'),
 (4,
  '0.006*"model" + 0.006*"data" + 0.005*"algorithm" + 0.004*"using" + '
  '0.004*"learning" + 0.004*"one" + 0.004*"number" + 0.004*"set" + '
  '0.004*"image" + 0.003*"models"'),
 (5,
  '0.006*"learning" + 0.006*"data" + 0.006*"algorithm" + 0.005*"set" +

In [None]:
# Get the full topic distribution for a specific topic (e.g., Topic 0)
topic_id = 0  # Change this to any topic you want to check
full_topic_distribution = lda_model.get_topic_terms(topic_id, topn=len(id2word))

# Sum up the full distribution probabilities
total_probability = sum([prob for _, prob in full_topic_distribution])

# Print the total probability (should be close to 1)
print("Total Probability for Topic", topic_id, ":", total_probability)


Total Probability for Topic 0 : 0.9999999137940208


In [None]:
# Retrieve and print all words and their probabilities in Topic 0

# Step 1: Set the topic ID to Topic 0
topic_id = 0  # Change this to view other topics if needed

# Step 2: Retrieve all words and their probabilities for Topic 0
# `topn=len(id2word)` ensures we retrieve all words in the vocabulary
full_topic_distribution = lda_model.get_topic_terms(topic_id, topn=len(id2word))

# Step 3: Print all words and their probabilities for Topic 0
print(f"Topic {topic_id} - Word Distribution:")
for word_id, prob in full_topic_distribution:
    word = id2word[word_id]  # Convert word_id to the actual word
    print(f"Word: {word}, Probability: {prob}")

Streaming output truncated to the last 5000 lines.
Word: actot, Probability: 7.086700293257309e-08
Word: rdnowak, Probability: 7.086700293257309e-08
Word: oftheorem, Probability: 7.086700293257309e-08
Word: bredereck, Probability: 7.086700293257309e-08
Word: ratherpis, Probability: 7.086700293257309e-08
Word: eigehvalue, Probability: 7.086700293257309e-08
Word: conrect, Probability: 7.086700293257309e-08
Word: ajkl, Probability: 7.086700293257309e-08
Word: oaj, Probability: 7.086700293257309e-08
Word: thatpis, Probability: 7.086700293257309e-08
Word: horia, Probability: 7.086699582714573e-08
Word: pimentel, Probability: 7.086698872171837e-08
Word: ningliu, Probability: 7.086698872171837e-08
Word: yuzhao, Probability: 7.086698872171837e-08
Word: chistov, Probability: 7.086698872171837e-08
Word: colorblind, Probability: 7.086698872171837e-08
Word: alarcn, Probability: 7.086698872171837e-08
Word: loong, Probability: 7.086698872171837e-08
Word: binnings, Probability: 7.086698

In [None]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [None]:
# Import pyLDAvis for topic model visualization

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
# Enable Notebook Visualizations
pyLDAvis.enable_notebook()

# Preparing the Visualization
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

# Displaying the Visualization
pyLDAvis.display(LDAvis_prepared)

#**Interpreting the above chart**

The intertopic distance map is a visualization of the topics in a two-dimensional space. The area of each topic circle is proportional to the share of tokens in the corpus that belong to that topic. The circles are positioned with a multidimensional scaling algorithm, which projects the high-dimensional topic-word distributions down to two dimensions, so topics that are closer together have more similar word distributions.
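
To give a feel for what "distance between topics" means, the sketch below computes the Jensen-Shannon divergence between topic-word distributions. pyLDAvis is documented as using this divergence, followed by principal coordinate analysis, for its default layout. The three toy distributions are hypothetical and this is not pyLDAvis's own code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p and q are identical."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a 4-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.60, 0.30, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.20, 0.70]

print(js_divergence(topic_a, topic_b))  # small: similar word distributions
print(js_divergence(topic_a, topic_c))  # larger: different word distributions
```

Topics with a small divergence would be drawn close together on the map; topics with a large divergence would be drawn far apart.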



The bar chart by default shows the 30 most salient terms. The bars indicate the total frequency of each term across the entire corpus. Saliency is a specific metric, defined at the bottom of the visualization, that identifies the terms most informative for telling topics apart across the entire collection of texts. A higher saliency value indicates that a term is both frequent and useful for distinguishing topics.
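
The saliency formula shown at the bottom of the visualization is saliency(w) = P(w) × Σ_t P(t|w) · log(P(t|w) / P(t)). The sketch below evaluates it for two hypothetical terms with the same corpus frequency; the probabilities are made-up inputs, not values taken from the model above.

```python
import math

def saliency(p_word, p_topic_given_word, p_topic):
    """saliency(w) = P(w) * sum_t P(t|w) * log(P(t|w) / P(t))."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_topic_given_word, p_topic)
        if ptw > 0
    )
    return p_word * distinctiveness

p_topic = [0.5, 0.5]  # hypothetical: two equally common topics

# Same corpus frequency, different spread over topics.
generic = saliency(0.02, [0.5, 0.5], p_topic)     # evenly shared across topics
specific = saliency(0.02, [0.95, 0.05], p_topic)  # concentrated in topic 0
print(generic, specific)
```

A term spread evenly over the topics carries no information about which topic is present, so its saliency is zero however frequent it is; a term concentrated in one topic scores higher.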



When you select a topic in the intertopic distance map, or specify a topic in the top panel, the bar chart changes to display the most relevant terms for that topic. A second, darker bar is drawn over each term’s total frequency to show how often the term occurs within the selected topic. If the dark bar entirely eclipses the light bar, the term belongs almost exclusively to the selected topic.



When you select a word in the bar chart, the topics and probabilities by topic of that word are displayed in the intertopic distance map, so you can see which other topics a term might be shared with.

#***LDA vs LSA***
Latent Semantic Analysis (LSA) is a mathematical method that brings out latent relationships within a collection of documents by projecting them onto a lower-dimensional space. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix of word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Rather than looking at each document in isolation, LSA looks at all the documents and the terms within them as a whole to identify relationships.
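
As a minimal sketch of the SVD step, the example below builds a tiny term-document count matrix with NumPy, keeps the top two singular directions, and compares documents in the reduced space. The matrix, the term list and the `cosine` helper are hypothetical; gensim's `LsiModel` performs a comparable truncated decomposition at scale.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents.
terms = ["guitar", "string", "pedal", "amp"]
X = np.array([
    [2, 1, 0, 0],   # guitar
    [1, 2, 0, 0],   # string
    [0, 0, 2, 1],   # pedal
    [0, 0, 1, 2],   # amp
], dtype=float)

# X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent concepts to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vectors[0], doc_vectors[1]))  # two guitar/string documents
print(cosine(doc_vectors[0], doc_vectors[2]))  # guitar document vs pedal document
```

Documents that use related vocabulary end up pointing in the same direction in the reduced space, even when their raw counts differ.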

Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm that uses a probabilistic model to automatically discover the topics that documents contain.

LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents.
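
The assumption that every document mixes several corpus-wide topics can be written as a small generative simulation. This is a sketch only, using the standard library: the vocabulary, the two topic-word distributions and the Dirichlet parameters are hypothetical, and the topic-word distributions are fixed here for brevity, whereas full LDA also draws them from a Dirichlet prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocab, n_words, rng):
    """Generate one document under LDA's assumed process."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from it
        words.append(w)
    return theta, words

vocab = ["network", "training", "matrix", "kernel"]
topic_word = [
    [0.6, 0.4, 0.0, 0.0],  # hypothetical "neural networks" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "linear algebra" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, [0.5, 0.5], vocab, 8, rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the hidden topic mixtures and topic-word distributions that most plausibly produced them.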

#**Topic modelling using BERTopic**
BERTopic documentation:
https://maartengr.github.io/BERTopic/api/bertopic.html

BERTopic is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
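
The class-based TF-IDF (c-TF-IDF) idea can be sketched in a few lines: pool all documents in a cluster into one long document, then weight each term by its frequency in that cluster times log(1 + A / f), where f is the term's frequency across all clusters and A is the average number of tokens per cluster. The function `c_tf_idf` and the toy clusters below are hypothetical simplifications, not BERTopic's implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_cluster):
    """Score each term per cluster: tf(term, cluster) * log(1 + A / f(term))."""
    # Treat every cluster as one long document by pooling its tokens.
    cluster_counts = {
        cluster: Counter(token for doc in docs for token in doc.split())
        for cluster, docs in docs_by_cluster.items()
    }
    # f(term): frequency of the term across all clusters.
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    # A: average number of tokens per cluster.
    avg_tokens = sum(total_counts.values()) / len(cluster_counts)

    scores = {}
    for cluster, counts in cluster_counts.items():
        n_tokens = sum(counts.values())
        scores[cluster] = {
            term: (count / n_tokens) * math.log(1 + avg_tokens / total_counts[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical clusters of cleaned tweets.
clusters = {
    0: ["swimming gold medal", "swimming final tonight"],
    1: ["gold medal tennis", "tennis final"],
}
scores = c_tf_idf(clusters)
for cluster, term_scores in scores.items():
    top_term = max(term_scores, key=term_scores.get)
    print(cluster, top_term)
```

Terms that are frequent inside one cluster but not dominant across the whole collection receive the highest weights, which is why they serve as readable topic labels.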

Dataset Link:-
https://www.kaggle.com/datasets/gpreda/tokyo-olympics-2020-tweets

1. load data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/5731/Datasets/tokyo_2020_tweets.csv')
# Dropping Missing Data (Handling NaN Values)
df = df.dropna()
# Random Sampling 20,000 Rows
# Using the same random_state will ensure the same 20,000 rows are selected each time you run the code
df = df.sample(20000, random_state=42)
df.head()



Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
295003,1419957058938486804,Australia in Belgium 🇦🇺🇪🇺🇧🇪🇱🇺,"Brussels, Belgium",Australian Mission to the European Union and N...,2014-11-26 15:52:22,3766,3226.0,4622.0,True,2021-07-27 09:45:19,"Congrats to 🇦🇺, 🇧🇪, and 🇱🇺 for our achievement...",['Tokyo2020'],Twitter Web App,0.0,0.0,False
63678,1419307203933229065,念真 Nian Zhen,"Enshi, Hubei, China.",一个中国\n\nChina institute of contemporary intern...,2017-06-30 20:07:29,7788,87.0,38102.0,False,2021-07-25 14:43:01,Reality Vs New York Times\n#To...,['Tokyo2020'],Twitter for Android,12.0,57.0,False
101824,1419592988833947650,Indonesia Badminton,Indonesia,Indonesia Badminton Supporter • Click to follo...,2009-11-07 23:26:53,55711,451.0,2081.0,False,2021-07-26 09:38:37,[Live Score] #Tokyo2020 \n\nWD Group A : \n\...,['Tokyo2020'],Twitter for iPhone,0.0,4.0,False
60985,1419311331782447108,FirstSportz,New Delhi,Official Twitter handle of Firstsportz. \nFoll...,2019-11-26 14:14:55,478,258.0,4053.0,False,2021-07-25 14:59:25,Tennis at Tokyo Olympics: Stefanos #Tsitsipas ...,['Tsitsipas'],Twitter for Android,0.0,0.0,False
47225,1419235863100370946,Kurdistan 24 English,"Erbil, Kurdistan",The English service of Kurdistan's leading new...,2015-07-06 08:32:11,139803,5.0,7.0,True,2021-07-25 09:59:32,The 41-year-old gold medalist is an ethnic Kur...,"['Tokyo2020', 'Olympics']",Twitter Web App,3.0,9.0,False


2. Data preprocessing

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')

def clean_text(x):
  x = str(x).lower()
  x = re.sub(r'#[A-Za-z0-9]*', ' ', x)   # drop hashtags
  x = re.sub(r'https?://\S+', ' ', x)    # drop URLs without removing the rest of the line
  x = re.sub(r'@[A-Za-z0-9]+', ' ', x)   # drop @mentions
  tokens = word_tokenize(x)
  x = ' '.join([w for w in tokens if w not in stop_words])
  # strip punctuation; no stray backslash, so Python raises no invalid-escape warning
  x = re.sub(r'[%s]' % re.escape('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~“…”’'), ' ', x)
  x = re.sub(r'\d+', ' ', x)             # drop digits
  x = re.sub(r'\s+', ' ', x)             # collapse whitespace, including newlines
  return x


df['clean_text'] = df.text.apply(clean_text)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet,clean_text
295003,1419957058938486804,Australia in Belgium 🇦🇺🇪🇺🇧🇪🇱🇺,"Brussels, Belgium",Australian Mission to the European Union and N...,2014-11-26 15:52:22,3766,3226.0,4622.0,True,2021-07-27 09:45:19,"Congrats to 🇦🇺, 🇧🇪, and 🇱🇺 for our achievement...",['Tokyo2020'],Twitter Web App,0.0,0.0,False,congrats 🇦🇺 🇧🇪 🇱🇺 achievements far olympics 👏👏👏
63678,1419307203933229065,念真 Nian Zhen,"Enshi, Hubei, China.",一个中国\n\nChina institute of contemporary intern...,2017-06-30 20:07:29,7788,87.0,38102.0,False,2021-07-25 14:43:01,Reality Vs New York Times\n#To...,['Tokyo2020'],Twitter for Android,12.0,57.0,False,reality vs new york times
101824,1419592988833947650,Indonesia Badminton,Indonesia,Indonesia Badminton Supporter • Click to follo...,2009-11-07 23:26:53,55711,451.0,2081.0,False,2021-07-26 09:38:37,[Live Score] #Tokyo2020 \n\nWD Group A : \n\...,['Tokyo2020'],Twitter for iPhone,0.0,4.0,False,live score wd group greysia polii apriyani ra...
60985,1419311331782447108,FirstSportz,New Delhi,Official Twitter handle of Firstsportz. \nFoll...,2019-11-26 14:14:55,478,258.0,4053.0,False,2021-07-25 14:59:25,Tennis at Tokyo Olympics: Stefanos #Tsitsipas ...,['Tsitsipas'],Twitter for Android,0.0,0.0,False,tennis tokyo olympics stefanos becomes first g...
47225,1419235863100370946,Kurdistan 24 English,"Erbil, Kurdistan",The English service of Kurdistan's leading new...,2015-07-06 08:32:11,139803,5.0,7.0,True,2021-07-25 09:59:32,The 41-year-old gold medalist is an ethnic Kur...,"['Tokyo2020', 'Olympics']",Twitter Web App,3.0,9.0,False,year old gold medalist ethnic kurd western ir...


3. BERTopic model

In [None]:
! pip install bertopic



Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-3.2.0-py3-none-any.whl.metadata (10 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)

In [None]:
from bertopic import BERTopic



In [None]:
tweets = df.clean_text.to_list()
timestamp = df.date.to_list()



In [None]:
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(tweets)


In [None]:
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6948,-1_athletes_day_medals_olympics,"[athletes, day, medals, olympics, today, olymp...",[still one win 🇳🇬🇳🇬 seeing three days action o...
1,0,205,0_swimming_swim_swimmers_swims,"[swimming, swim, swimmers, swims, swimmer, phe...","[swimming time 😀, swimming time , swimming swi..."
2,1,162,1_amp_pandelela_mun_xxxii,"[amp, pandelela, mun, xxxii, olympiad, yee, hd...","[best mun yee amp pandelela, 🏀👏🏽 amp get first..."
3,2,161,2_gold_golden_golds_rush,"[gold, golden, golds, rush, whitemoney, girl, ...","[gold 🥇🇮🇹, 🙌🏽 gold , 🤺 gold ]"
4,3,152,3_medals_medal_gold_ody,"[medals, medal, gold, ody, three, ever, first,...","[two gold medals 🥇 🥇, gold medals today, gold..."
...,...,...,...,...,...
353,352,10,352_goodluck_baaaaack_fam_sucks,"[goodluck, baaaaack, fam, sucks, that, good, w...","[goodluck 🇳🇬, good baaaaack fam , goodluck ]"
354,353,10,353_program_debut_bask_debuts,"[program, debut, bask, debuts, resilience, pre...",[😃😃 believe day ️⃣ still another games olympic...
355,354,10,354_lily_zhang_singles_wendy,"[lily, zhang, singles, wendy, offiong, edem, c...",[lily zhang wins round two women 's singles ma...
356,355,10,355_huston_nyjah_neal_jamie,"[huston, nyjah, neal, jamie, clutch, moldauer,...","[nyjah huston something else 's love shoes , n..."


In [None]:
topic_model.visualize_topics()



*   visualize_topics() in BERTopic generates an interactive map of the discovered topics.
*   Each circle represents a topic, and its size reflects how common the topic is in the dataset.
*   Topics that are closer together share more similar terms, while those farther apart are more distinct.
*   The slider at the bottom allows you to explore different topics interactively, and the red circle highlights the selected topic.





In [None]:
topic_model.visualize_barchart(top_n_topics=12, n_words = 10, width = 350, height = 350)





In [None]:
topic_model.visualize_hierarchy(top_n_topics=12, width = 700, height = 700)





**BERTopic - Bigrams**
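
The `n_gram_range=(2, 2)` setting used in this section restricts the vocabulary for topic descriptions to two-word phrases. As a rough sketch of what a bigram is, the hypothetical helper below extracts consecutive word pairs from a cleaned tweet; it is not the vectorizer BERTopic uses internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned tweet.
tweet = "gold medal air pistol mixed team"
bigrams = ngrams(tweet.split(), 2)
print(bigrams)
```

Phrases such as "gold medal" or "air pistol" are often easier to read in a topic label than the individual words on their own.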

In [None]:
bigram_topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(2, 2))
bigram_topics, bigram_probs = bigram_topic_model.fit_transform(tweets)



2024-10-15 22:51:03,780 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/625 [00:00<?, ?it/s]

2024-10-15 22:54:20,541 - BERTopic - Embedding - Completed ✓
2024-10-15 22:54:20,542 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-15 22:54:50,012 - BERTopic - Dimensionality - Completed ✓
2024-10-15 22:54:50,017 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-15 22:58:07,776 - BERTopic - Cluster - Completed ✓
2024-10-15 22:58:07,792 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-15 22:58:08,853 - BERTopic - Representation - Completed ✓


In [None]:
bigram_freq = bigram_topic_model.get_topic_info()
bigram_freq





Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6220,-1_proud moment_gold medal_medal count_good luck,"[proud moment, gold medal, medal count, good l...",[nss volunteer tamil nadu cheering team india ...
1,0,687,0_olympic games_watching olympics_watch olympi...,"[olympic games, watching olympics, watch olymp...",[everything n't perfect performance olympics s...
2,1,311,1_gold medal_gold medals_medal country_silver ...,"[gold medal, gold medals, medal country, silve...",[congratulations winning silver medal country ...
3,2,237,2_rebeca coreia_coreia tandara_tandara garay_g...,"[rebeca coreia, coreia tandara, tandara garay,...",[yes rebeca coreia tandara garay rosamaria mac...
4,3,217,3_air pistol_mixed team_saurabh chaudhary_air ...,"[air pistol, mixed team, saurabh chaudhary, ai...",[ m air pistol mixed team qualification stage ...
...,...,...,...,...,...
336,335,11,335_live end_end germany_germany lead_lead india,"[live end, end germany, germany lead, lead ind...","[ day live end q germany lead india , day liv..."
337,336,11,336_congrats congrats_congrats getting_onions ...,"[congrats congrats, congrats getting, onions c...","[congrats getting , wow congrats moving onions..."
338,337,10,337_guys best_another one_best gals_gals guys,"[guys best, another one, best gals, gals guys,...","[sweet another one , great one go get next one..."
339,338,10,338_ride congratulations_phenomenal ride_incre...,"[ride congratulations, phenomenal ride, incred...","[incredible ride congratulations champion , t..."


In [None]:
bigram_topic_model.visualize_topics()





In [None]:
bigram_topic_model.visualize_hierarchy(top_n_topics=12, width = 700, height = 700)





In [None]:
bigram_topic_model.visualize_barchart(top_n_topics=12, n_words = 10, width = 350, height = 350)





In [None]:
bigram_topic_model.visualize_heatmap(top_n_topics=12, width=800, height=800)



