#**Topic modelling using Latent Semantic Analysis (LSA)**

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

Latent semantic analysis (LSA) is a natural language processing technique, rooted in distributional semantics, that analyzes the relationships between a set of documents and the terms they contain by deriving a set of concepts related to both the documents and the terms.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Dataset link:-
https://www.kaggle.com/datasets/eswarchandt/amazon-music-reviews?select=Musical_instruments_reviews.csv


In [None]:
import pandas as pd

# load data
df = pd.read_csv('/content/gdrive/My Drive/INFO 5731 TA/Datasets/Musical_instruments_reviews.csv', usecols=['reviewerID', 'reviewText'])
df.head()

Unnamed: 0,reviewerID,reviewText
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac..."
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...
2,A195EZSQDW3E21,The primary job of this device is to block the...
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...


In [None]:
df['reviewText'].isna().value_counts()

False    10254
True         7
Name: reviewText, dtype: int64

In [None]:
print(df.shape)
df = df.dropna()
print(df.shape)

(10261, 2)
(10254, 2)


In [None]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation \
                                        , preprocess_string, strip_short, stem_text

# preprocess given text
def preprocess(text):
    
    # clean text based on given filters
    CUSTOM_FILTERS = [lambda x: x.lower(), 
                                remove_stopwords, 
                                strip_punctuation, 
                                strip_short, 
                                stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)
    
    return text

# apply function to all reviews 
df['Text (Clean)'] = df['reviewText'].apply(lambda x: preprocess(x))

In [None]:
# preview of dataset
df.head()

Unnamed: 0,reviewerID,reviewText,Text (Clean)
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...","[write, here, exactli, suppos, filter, pop, so..."
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,"[product, exactli, afford, realiz, doubl, scre..."
2,A195EZSQDW3E21,The primary job of this device is to block the...,"[primari, job, devic, block, breath, produc, p..."
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,"[nice, windscreen, protect, mxl, mic, prevent,..."
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,"[pop, filter, great, look, perform, like, stud..."


In [None]:
from gensim import corpora

# create a dictionary with the corpus
corpus = df['Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]
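
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration only, not gensim's implementation: `build_dictionary` and `to_bow` are hypothetical helper names, the two token lists are made up, and the ids gensim assigns to tokens may differ.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map every distinct token to an integer id (assigned in first-seen order)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two made-up, already preprocessed reviews
docs = [["pop", "filter", "great", "pop"], ["nice", "filter"]]
token2id = build_dictionary(docs)
bow_example = [to_bow(doc, token2id) for doc in docs]
print(token2id)      # {'pop': 0, 'filter': 1, 'great': 2, 'nice': 3}
print(bow_example)   # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1)]]
```

Each review becomes a sparse list of (id, count) pairs, and word order is discarded.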

In [None]:
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# The coherence score estimates how interpretable the topics are to humans.
# Compute the coherence score for models with different numbers of topics
for i in range(2,11):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

Coherence score with 2 clusters: 0.45967954833086194
Coherence score with 3 clusters: 0.4360426022881803
Coherence score with 4 clusters: 0.43212298815202504
Coherence score with 5 clusters: 0.41792118515582066
Coherence score with 6 clusters: 0.39990425361388793
Coherence score with 7 clusters: 0.41304814338254403
Coherence score with 8 clusters: 0.3692227573209305
Coherence score with 9 clusters: 0.34758997063002595
Coherence score with 10 clusters: 0.36856587803793983
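
The printed scores can also be compared programmatically to pick the number of topics. The snippet below is a small sketch that copies the values above (rounded to four decimals) into a dictionary and selects the highest one. `coherence_by_k` is a name introduced here for illustration, and coherence is only a heuristic, so the result is worth checking by eye.

```python
# Coherence scores copied from the output above, rounded to four decimals
coherence_by_k = {
    2: 0.4597, 3: 0.4360, 4: 0.4321, 5: 0.4179, 6: 0.3999,
    7: 0.4130, 8: 0.3692, 9: 0.3476, 10: 0.3686,
}

# Pick the number of topics with the highest coherence
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(best_k, coherence_by_k[best_k])   # 2 0.4597
```

This matches the choice of `num_topics=2` for the final model.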


In [None]:
# perform SVD on the bag of words with the LsiModel to extract 2 topics
lsi = LsiModel(bow, num_topics=2, id2word=dictionary)

In [None]:
# find the 10 words with the strongest association to each derived topic
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))

Words in 0: 0.329*"sound" + 0.314*"guitar" + 0.242*"string" + 0.232*"pedal" + 0.217*"amp" + 0.214*"like" + 0.198*"us" + 0.163*"plai" + 0.161*"good" + 0.148*"great".
Words in 1: -0.584*"string" + 0.428*"pedal" + -0.380*"guitar" + 0.312*"amp" + -0.161*"tune" + 0.157*"sound" + 0.114*"tone" + -0.099*"tuner" + 0.090*"effect" + 0.080*"tube".
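
Each topic string above is a weighted sum of stemmed terms, and in LSA the weights can be negative. One rough way to read a topic is to split its terms by sign. The parser below is a sketch for illustration: `parse_topic` is a hypothetical helper, and the input is a shortened copy of topic 1 from the output above.

```python
import re

def parse_topic(topic_str):
    """Turn '0.428*"pedal" + -0.380*"guitar"' into [(term, weight), ...]."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_str)
    return [(term, float(weight)) for weight, term in pairs]

topic_1 = '-0.584*"string" + 0.428*"pedal" + -0.380*"guitar" + 0.312*"amp" + -0.161*"tune"'
terms = parse_topic(topic_1)

# Split by sign: the two groups pull a review's score in opposite directions.
positive = [t for t, w in terms if w > 0]
negative = [t for t, w in terms if w < 0]
print(positive)   # ['pedal', 'amp']
print(negative)   # ['string', 'guitar', 'tune']
```

Read this way, topic 1 contrasts reviews about pedals and amps with reviews about strings and tuning. The overall sign of an SVD component is arbitrary, so only the contrast between the two groups is meaningful.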


In [None]:
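# look up each review's score (projection) on the two LSI topics
corpus_lsi = lsi[bow]
score1 = []
score2 = []
for doc in corpus_lsi:
    # doc is a sparse list of (topic_id, score) pairs, so a topic can be missing
    scores = dict(doc)
    score1.append(round(scores.get(0, 0.0), 2))
    score2.append(round(scores.get(1, 0.0), 2))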

# create data frame that shows scores assigned for both topics for each review
df_topic = pd.DataFrame()
df_topic['Text'] = df['reviewText']
df_topic['Topic 0 score'] = score1
df_topic['Topic 1 score'] = score2
df_topic['Topic']= df_topic[['Topic 0 score', 'Topic 1 score']].apply(lambda x: x.argmax(), axis=1)
df_topic.head(1)

Unnamed: 0,Text,Topic 0 score,Topic 1 score,Topic
0,"Not much to write about here, but it does exac...",0.88,0.22,0
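
Because LSA scores are signed, taking the `argmax` of the raw scores treats a strongly negative score as a weak association. Whether that is the desired behavior depends on how the topics are interpreted. The sketch below compares the two options. `assign_topic` is a hypothetical helper, the first score pair is the row shown above, and the second pair is made up for illustration.

```python
def assign_topic(scores, use_magnitude=False):
    """Return the index of the winning topic for one document's scores."""
    key = (lambda i: abs(scores[i])) if use_magnitude else (lambda i: scores[i])
    return max(range(len(scores)), key=key)

first_review = [0.88, 0.22]      # scores from the table above
made_up_review = [0.95, -1.40]   # hypothetical: strongly negative on topic 1

print(assign_topic(first_review))                          # 0
print(assign_topic(made_up_review))                        # 0 (raw argmax)
print(assign_topic(made_up_review, use_magnitude=True))    # 1 (largest magnitude)
```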


In [None]:
# find a sample review from each topic
df_topic0 = df_topic[df_topic['Topic'] == 0]
df_topic1 = df_topic[df_topic['Topic']==1]
print('Sample text from topic 0:\n {}'.format(df_topic0.sample(1, random_state=2)['Text'].values))
print('\nSample text from topic 1:\n {}'.format(df_topic1.sample(1, random_state=2)['Text'].values))

Sample text from topic 0:
 ["If you need a tube screamer. You can't go wrong with the TS808. The only reason why you shouldn't get it, is if it isn't the right tone for you. I say this because you can't say anything bad about this pedal. It sounds amazing. It just might not be your sound. I personally went with the TS808 because it compliments my amp better than the TS9. People who say this is better than the TS9 or vice versa are choosing the wrong wording. They prefer it. It's not better. Both are great."]

Sample text from topic 1:
 ["While this is not a $300 tape delay...it's a solid pedal that gets it done. It's way better built than other pedals in this price range. I'm sold. I've purchased several JoYo pedals and all of them are built extremely well and perform as good as if not better than top name pedals."]


#**Topic modelling using LDA**

Latent Dirichlet Allocation (LDA) is a topic model that represents each document as a mixture of topics and can be used to assign the text in a document to particular topics.

Dataset link:-
https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv

In [None]:
# Importing modules
import pandas as pd

# Read data into papers
papers = pd.read_csv('/content/gdrive/My Drive/INFO 5731 TA/Datasets/papers.csv')
# Print head
papers.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


In [None]:
# Drop the unused columns and keep a random sample of 100 papers
papers = papers.drop(columns=['id', 'event_type', 'pdf_name']).sample(100)
# Print out the first rows of papers
papers.head()

Unnamed: 0,year,title,abstract,paper_text
6260,2017,Doubly Accelerated Stochastic Variance Reduced...,We develop a new accelerated stochastic gradie...,Doubly Accelerated\nStochastic Variance Reduce...
1261,2002,Identity Uncertainty and Citation Matching,Abstract Missing,Identity Uncertainty and Citation Matching\n\n...
5155,2015,Infinite Factorial Dynamical Model,We propose the infinite factorial dynamic mode...,Infinite Factorial Dynamical Model\nIsabel Val...
2439,2007,Learning Bounds for Domain Adaptation,Empirical risk minimization offers well-known ...,Learning Bounds for Domain Adaptation\n\nJohn ...
2268,2006,An Efficient Method for Gradient-Based Adaptat...,Abstract Missing,An Efficient Method for Gradient-Based Adaptat...


In [None]:
# Load the regular expression library
import re
# Remove punctuation (a raw string avoids the invalid escape sequence warning)
papers['paper_text_processed'] = \
papers['paper_text'].map(lambda x: re.sub(r'[,.!?]', '', x))
# Convert the paper text to lowercase
papers['paper_text_processed'] = \
papers['paper_text_processed'].map(lambda x: x.lower())
# Print out the first rows of papers
papers['paper_text_processed'].head()

6260    doubly accelerated\nstochastic variance reduce...
1261    identity uncertainty and citation matching\n\n...
5155    infinite factorial dynamical model\nisabel val...
2439    learning bounds for domain adaptation\n\njohn ...
2268    an efficient method for gradient-based adaptat...
Name: paper_text_processed, dtype: object

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# use a set for fast membership tests
stop_words = set(stopwords.words('english'))
stop_words.update(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes; deacc=True also strips accent marks
        yield simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    # texts is already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

data = papers.paper_text_processed.values.tolist()
data_words = list(sent_to_words(data))
# remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1][0][:30])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['doubly', 'accelerated', 'stochastic', 'variance', 'reduced', 'dual', 'averaging', 'method', 'regularized', 'empirical', 'risk', 'minimization', 'tomoya', 'murata', 'ntt', 'data', 'mathematical', 'systems', 'inc', 'tokyo', 'japan', 'murata', 'msicojp', 'taiji', 'suzuki', 'department', 'mathematical', 'informatics', 'graduate', 'school']


In [None]:
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])

[(0, 1), (1, 1), (2, 31), (3, 3), (4, 14), (5, 1), (6, 1), (7, 20), (8, 1), (9, 2), (10, 3), (11, 4), (12, 11), (13, 1), (14, 7), (15, 1), (16, 1), (17, 1), (18, 2), (19, 6), (20, 1), (21, 2), (22, 2), (23, 1), (24, 2), (25, 1), (26, 1), (27, 9), (28, 2), (29, 1)]


In [None]:
from pprint import pprint
# number of topics
num_topics = 10
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]



[(0,
  '0.006*"model" + 0.005*"data" + 0.005*"learning" + 0.004*"algorithm" + '
  '0.004*"models" + 0.004*"function" + 0.004*"set" + 0.003*"using" + '
  '0.003*"one" + 0.003*"matrix"'),
 (1,
  '0.007*"data" + 0.006*"learning" + 0.005*"algorithm" + 0.005*"set" + '
  '0.005*"one" + 0.004*"model" + 0.003*"function" + 0.003*"performance" + '
  '0.003*"method" + 0.003*"two"'),
 (2,
  '0.008*"data" + 0.007*"model" + 0.005*"algorithm" + 0.005*"using" + '
  '0.004*"set" + 0.004*"learning" + 0.004*"log" + 0.003*"information" + '
  '0.003*"results" + 0.003*"one"'),
 (3,
  '0.006*"data" + 0.006*"set" + 0.005*"model" + 0.005*"algorithm" + '
  '0.004*"learning" + 0.004*"function" + 0.004*"results" + 0.003*"training" + '
  '0.003*"models" + 0.003*"one"'),
 (4,
  '0.006*"data" + 0.005*"learning" + 0.004*"algorithm" + 0.004*"set" + '
  '0.004*"log" + 0.004*"one" + 0.004*"using" + 0.004*"model" + '
  '0.004*"function" + 0.003*"number"'),
 (5,
  '0.012*"model" + 0.006*"learning" + 0.005*"data" + 0.004*"

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# prepare the interactive visualization from the trained model
LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)

pyLDAvis.display(LDAvis_prepared)





#**Interpreting the above chart**

The intertopic distance map visualizes the topics in a two-dimensional space. The area of each topic circle is proportional to the share of words in the corpus that belong to that topic. The circles are positioned with a multidimensional scaling algorithm, which projects the distances between topics from a very high-dimensional space down to two dimensions. The distances are based on the words the topics comprise, so topics that are closer together have more words in common.

 

By default, the bar chart shows the 30 most salient terms. The bars indicate the total frequency of each term across the entire corpus. Saliency is a specific metric, defined at the bottom of the visualization, that identifies the words that are most informative for telling topics apart across the entire collection of texts. Higher saliency values indicate that a word is more useful for identifying a specific topic.
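
As a rough illustration of the metric, the sketch below assumes the definition printed at the bottom of the visualization has the form saliency(w) = frequency(w) × Σ_t p(t|w) · log(p(t|w) / p(t)). The function name and all numbers are toy values chosen for this example, not values from the model above.

```python
import math

def saliency(term_freq, p_topic_given_term, p_topic):
    """frequency(w) * sum_t p(t|w) * log(p(t|w) / p(t)); zero-probability topics add 0."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_topic_given_term, p_topic)
        if ptw > 0
    )
    return term_freq * distinctiveness

p_topic = [0.5, 0.5]   # toy corpus with two equally likely topics

# A frequent word spread evenly over the topics says nothing about the topic.
generic = saliency(100, [0.5, 0.5], p_topic)
# A rarer word concentrated in one topic is far more informative.
specific = saliency(20, [0.95, 0.05], p_topic)
print(generic, round(specific, 2))
```

Under this assumed definition, the evenly spread word scores zero despite being five times more frequent.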

 

When you select a topic in the intertopic distance map, or specify a topic in the top panel, the bar chart changes to display the most relevant terms for that specific topic. A second, darker bar is drawn over each term’s total frequency to show how often the term occurs within the selected topic. If the dark bar entirely eclipses the light bar, that term belongs almost exclusively to the selected topic.

 

When you select a word in the bar chart, the intertopic distance map shows which topics contain that word and with what probability, so you can see which other topics share the term.

#***LDA vs LSA***
Latent Semantic Analysis (LSA) is a mathematical method that uncovers latent relationships within a collection of documents by projecting them onto a lower-dimensional space. LSA assumes that words that are close in meaning occur in similar pieces of text (the distributional hypothesis). A matrix of word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among the columns. Rather than looking at each document in isolation, LSA looks at all the documents as a whole, and at the terms within them, to identify relationships.
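
To give a feel for the SVD step, the sketch below finds only the leading singular component of a tiny term-document matrix using power iteration in pure Python. This is a simplified stand-in for illustration, not the algorithm gensim's `LsiModel` uses. `top_singular_triplet` and the toy counts are assumptions made for this example.

```python
import math

def top_singular_triplet(A, iters=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

terms = ["guitar", "string", "pedal", "amp"]
# Toy term-document counts: rows are terms, columns are four short reviews.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]
sigma, term_weights, doc_weights = top_singular_triplet(A)
print(round(sigma, 3))
print({t: round(w, 3) for t, w in zip(terms, term_weights)})
```

In this toy matrix the leading component puts its weight on "guitar" and "string", so it acts as a concept shared by the first two reviews. A full LSA keeps the top k such components instead of one.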

Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm that uses a probabilistic model to automatically discover the topics that documents contain.

LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents.
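
The assumption described above can be written as a small generative story: draw a topic mixture for the document, then for each word draw a hidden topic and a word from that topic. The sketch below simulates this with the standard library only. The two topics, their word probabilities, and the helper names are made up for illustration, and this is not how gensim fits a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document the way LDA assumes documents arise."""
    theta = sample_dirichlet(alpha, rng)   # this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]   # hidden topic of this word
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])      # observed word
    return theta, words

# Two made-up topics, each a probability distribution over words.
topics = [
    {"model": 0.5, "data": 0.3, "learning": 0.2},
    {"neuron": 0.6, "cortex": 0.3, "spike": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible topic mixtures and topic-word distributions.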

#**Topic modelling using BERTopic**
BERTopic documentation:-
https://maartengr.github.io/BERTopic/api/bertopic.html

BERTopic is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
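
To illustrate the class-based TF-IDF idea, the sketch below treats all documents in a cluster as one long document and scores each term with W(t, c) = tf(t, c) · log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the term's frequency across all clusters. This assumes the formula given in the BERTopic documentation. The function, the cluster labels, and the token lists are hypothetical, and the code is not BERTopic's implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a cluster label to the tokens of all its documents joined together."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total_per_term = Counter()
    for counter in counts.values():
        total_per_term.update(counter)
    avg_words_per_class = sum(total_per_term.values()) / len(counts)   # A
    scores = {}
    for c, counter in counts.items():
        n_tokens = sum(counter.values())
        scores[c] = {
            term: (freq / n_tokens) * math.log(1 + avg_words_per_class / total_per_term[term])
            for term, freq in counter.items()
        }
    return scores

# Two made-up clusters of tweet tokens
clusters = {
    "skate": "skateboarding gold medal skate skater".split(),
    "tennis": "tennis gold medal match tiebreak".split(),
}
scores = c_tf_idf(clusters)
top_skate = sorted(scores["skate"], key=scores["skate"].get, reverse=True)[:3]
print(top_skate)
```

Words shared by both clusters ("gold", "medal") score lower than words specific to one cluster, which is what keeps the topic descriptions distinctive.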

Dataset Link:-
https://www.kaggle.com/datasets/gpreda/tokyo-olympics-2020-tweets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/gdrive/My Drive/INFO 5731 TA/Datasets/tokyo_2020_tweets.csv')
df = df.dropna()
df = df.sample(20000, random_state=42)
df.head()


Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,15) have mixed types.Specify dtype option on import or set low_memory=False.



Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
295003,1419957058938486804,Australia in Belgium 🇦🇺🇪🇺🇧🇪🇱🇺,"Brussels, Belgium",Australian Mission to the European Union and N...,2014-11-26 15:52:22,3766,3226.0,4622.0,True,2021-07-27 09:45:19,"Congrats to 🇦🇺, 🇧🇪, and 🇱🇺 for our achievement...",['Tokyo2020'],Twitter Web App,0.0,0.0,False
63678,1419307203933229065,念真 Nian Zhen,"Enshi, Hubei, China.",一个中国\n\nChina institute of contemporary intern...,2017-06-30 20:07:29,7788,87.0,38102.0,False,2021-07-25 14:43:01,Reality Vs New York Times\n#To...,['Tokyo2020'],Twitter for Android,12.0,57.0,False
101824,1419592988833947650,Indonesia Badminton,Indonesia,Indonesia Badminton Supporter • Click to follo...,2009-11-07 23:26:53,55711,451.0,2081.0,False,2021-07-26 09:38:37,[Live Score] #Tokyo2020 \n\nWD Group A : \n\...,['Tokyo2020'],Twitter for iPhone,0.0,4.0,False
60985,1419311331782447108,FirstSportz,New Delhi,Official Twitter handle of Firstsportz. \nFoll...,2019-11-26 14:14:55,478,258.0,4053.0,False,2021-07-25 14:59:25,Tennis at Tokyo Olympics: Stefanos #Tsitsipas ...,['Tsitsipas'],Twitter for Android,0.0,0.0,False
47225,1419235863100370946,Kurdistan 24 English,"Erbil, Kurdistan",The English service of Kurdistan's leading new...,2015-07-06 08:32:11,139803,5.0,7.0,True,2021-07-25 09:59:32,The 41-year-old gold medalist is an ethnic Kur...,"['Tokyo2020', 'Olympics']",Twitter Web App,3.0,9.0,False


In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')

def clean_text(x):
    x = str(x).lower()
    # drop hashtags, URLs (through the end of the line), and @mentions
    x = re.sub(r'#[A-Za-z0-9]*', ' ', x)
    x = re.sub(r'https*://.*', ' ', x)
    x = re.sub(r'@[A-Za-z0-9]+', ' ', x)
    # tokenize and remove stop words (the text is already lowercase)
    tokens = word_tokenize(x)
    x = ' '.join([w for w in tokens if w not in stop_words])
    # replace punctuation and digits with spaces
    x = re.sub(r'[%s]' % re.escape('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~“…”’'), ' ', x)
    x = re.sub(r'\d+', ' ', x)
    # collapse newlines and repeated whitespace
    x = re.sub(r'\n+', ' ', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x


df['clean_text'] = df.text.apply(clean_text)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet,clean_text
295003,1419957058938486804,Australia in Belgium 🇦🇺🇪🇺🇧🇪🇱🇺,"Brussels, Belgium",Australian Mission to the European Union and N...,2014-11-26 15:52:22,3766,3226.0,4622.0,True,2021-07-27 09:45:19,"Congrats to 🇦🇺, 🇧🇪, and 🇱🇺 for our achievement...",['Tokyo2020'],Twitter Web App,0.0,0.0,False,congrats 🇦🇺 🇧🇪 🇱🇺 achievements far olympics 👏👏👏
63678,1419307203933229065,念真 Nian Zhen,"Enshi, Hubei, China.",一个中国\n\nChina institute of contemporary intern...,2017-06-30 20:07:29,7788,87.0,38102.0,False,2021-07-25 14:43:01,Reality Vs New York Times\n#To...,['Tokyo2020'],Twitter for Android,12.0,57.0,False,reality vs new york times
101824,1419592988833947650,Indonesia Badminton,Indonesia,Indonesia Badminton Supporter • Click to follo...,2009-11-07 23:26:53,55711,451.0,2081.0,False,2021-07-26 09:38:37,[Live Score] #Tokyo2020 \n\nWD Group A : \n\...,['Tokyo2020'],Twitter for iPhone,0.0,4.0,False,live score wd group greysia polii apriyani ra...
60985,1419311331782447108,FirstSportz,New Delhi,Official Twitter handle of Firstsportz. \nFoll...,2019-11-26 14:14:55,478,258.0,4053.0,False,2021-07-25 14:59:25,Tennis at Tokyo Olympics: Stefanos #Tsitsipas ...,['Tsitsipas'],Twitter for Android,0.0,0.0,False,tennis tokyo olympics stefanos becomes first g...
47225,1419235863100370946,Kurdistan 24 English,"Erbil, Kurdistan",The English service of Kurdistan's leading new...,2015-07-06 08:32:11,139803,5.0,7.0,True,2021-07-25 09:59:32,The 41-year-old gold medalist is an ethnic Kur...,"['Tokyo2020', 'Olympics']",Twitter Web App,3.0,9.0,False,year old gold medalist ethnic kurd western ir...


In [None]:
! pip install bertopic


In [None]:
from bertopic import BERTopic

In [None]:
tweets = df.clean_text.to_list()
timestamp = df.date.to_list()

In [None]:
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(tweets)


In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,6655,-1_athletes_day_medals_tokyo
1,0,334,0_olympics_olympic_games_watching
2,1,277,1_garay_femenino_tandara_de
3,2,178,2_amp_pandelela_mun_xxxii
4,3,177,3_skateboarding_skateboarders_skate_skater
...,...,...,...
358,357,10,357_weightlifting_category_kg_categor
359,358,10,358_ride_zwemmen_nanna_merrald
360,359,10,359_yessss_wwooohhhhhhhhhh_uhuh_yessssssssss
361,360,10,360_tiebreak_tie_liam_break
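
In BERTopic, topic `-1` collects the documents that the clustering step treated as outliers, so it is usually set aside when reading the results. A quick calculation with the counts above shows how large that group is here. The variable names are illustrative.

```python
n_tweets = 20000     # sample size used when loading the data
n_outliers = 6655    # Count for Topic -1 in the table above
n_topics = 361       # topics 0 through 360

outlier_share = n_outliers / n_tweets
avg_topic_size = (n_tweets - n_outliers) / n_topics
print(f"{outlier_share:.1%} of tweets are outliers")
print(f"{avg_topic_size:.0f} tweets per topic on average")
```

Roughly a third of the tweets are left unassigned, and the remaining topics are small.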


In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=12, n_words = 10, width = 350, height = 350)

In [None]:
topic_model.visualize_hierarchy(top_n_topics=12, width = 700, height = 700)





**BERTopic - Bigrams**

In [None]:
bigram_topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(2, 2))
bigram_topics, bigram_probs = bigram_topic_model.fit_transform(tweets)

2022-11-03 00:22:57,569 - BERTopic - Transformed documents to Embeddings
2022-11-03 00:23:30,739 - BERTopic - Reduced dimensionality
2022-11-03 00:26:03,443 - BERTopic - Clustered reduced embeddings


In [None]:
bigram_freq = bigram_topic_model.get_topic_info() 
bigram_freq

Unnamed: 0,Topic,Count,Name
0,-1,6157,-1_tokyo olympics_congratulations chanu_proud ...
1,0,466,0_gold medals_gold medal_medal table_medal tally
2,1,458,1_watching olympics_olympic games_watch olympi...
3,2,271,2_semi finals_next round_wins match_sweep sweep
4,3,255,3_rebeca coreia_coreia tandara_tandara garay_r...
...,...,...,...
337,336,11,336_around contact_us whatsapp_tours around_co...
338,337,11,337_dream make_make happen_happen leal_dream d...
339,338,10,338_many chances_ahead let_wi chance_chances said
340,339,10,339_hope guys_like meme_guys like_hope like
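
With `n_gram_range=(2, 2)`, topic descriptions are built from pairs of adjacent words instead of single words, which is why names such as `gold medal` appear above. Below is a minimal sketch of bigram extraction, independent of BERTopic's internals. The helper name and the sample text are made up.

```python
def bigrams(text):
    """Return adjacent word pairs from whitespace-separated text."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("first gold medal tokyo olympics"))
# ['first gold', 'gold medal', 'medal tokyo', 'tokyo olympics']
```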


In [None]:
bigram_topic_model.visualize_topics()

In [None]:
bigram_topic_model.visualize_hierarchy(top_n_topics=12, width = 700, height = 700)





In [None]:
bigram_topic_model.visualize_barchart(top_n_topics=12, n_words = 10, width = 350, height = 350)

In [None]:
bigram_topic_model.visualize_heatmap(top_n_topics=12, width=800, height=800)



