# Topic analysis 
unsupervised learning


Vayansky and Kumar (2020) review topic modeling algorithms.
According to their guidelines:
If the average number of words per document is greater than 50 and complex topic relationships (for example, the evolution of topics over time or correlations between topics) are NOT of interest, then Latent Dirichlet Allocation (LDA) is a good choice.
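
The document-length criterion can be checked before committing to a model. The sketch below is a minimal illustration that counts whitespace-separated words; the function name `average_words_per_document` and the three sample posts are assumptions made for the example, not part of the dataset analysis.

```python
from statistics import mean

def average_words_per_document(texts):
    """Return the mean number of whitespace-separated words per document."""
    return mean(len(text.split()) for text in texts)

# Hypothetical mini-corpus; in this notebook the texts would come from the `text` column.
sample_posts = [
    "I don t think it works like that",
    "Thank you for sharing this",
    "Same here",
]
avg_words = average_words_per_document(sample_posts)
print(f"Average words per document: {avg_words:.1f}")
print("Above the 50-word guideline:", avg_words > 50)
```

Applied to the full corpus, the same function could be called on `english_posts['text'].astype(str)` once the data is loaded.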



<small>Vayansky, I., & Kumar, S. A. P. (2020) 'A review of topic modeling methods', Information Systems, 94, 101582.</small>


In [1]:
import pandas as pd

file_path = 'data/english_posts_cleaned.csv'
english_posts = pd.read_csv(file_path)
english_posts.info()

# 522377 posts

In [None]:
data = english_posts.copy()
data.head(3)

Unnamed: 0,id,main_submission_id,comment_parent_id,subreddit,post_type,text,datetime,month,year,text_length,language,language_ft
0,is4ft9s,y2q46p,t3_y2q46p,autism,comment,I don t think it works like that,2022-10-13 05:58:56,10,2022,32,en,en
1,is4gwqj,y2q46p,t3_y2q46p,autism,comment,I do we have handicap add on to our government...,2022-10-13 06:12:48,10,2022,189,en,en
2,is4c22w,y2q46p,t3_y2q46p,autism,comment,Hey u Starflarity thank you for your post at r...,2022-10-13 05:14:14,10,2022,458,en,en


#### LDA - Topic modeling

LDA is based on the Dirichlet distribution, a family of continuous multivariate probability distributions parameterized by a vector alpha of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics; the Dirichlet distribution is the conjugate prior of the categorical distribution and the multinomial distribution. (https://en.wikipedia.org/wiki/Dirichlet_distribution)
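
To make the role of the alpha vector concrete, the following sketch draws samples with NumPy's `Generator.dirichlet`. The two alpha vectors are hypothetical values chosen for illustration: small entries tend to give samples concentrated on one component, large entries give samples close to uniform.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical concentration vectors for a 3-component distribution.
sparse_alpha = np.array([0.1, 0.1, 0.1])    # small values favour peaked samples
flat_alpha = np.array([10.0, 10.0, 10.0])   # large values favour near-uniform samples

sparse_samples = rng.dirichlet(sparse_alpha, size=1000)
flat_samples = rng.dirichlet(flat_alpha, size=1000)

# Every sample is a valid probability vector (non-negative, sums to 1).
print(np.allclose(sparse_samples.sum(axis=1), 1.0))

# Average size of the largest component under each alpha.
print(sparse_samples.max(axis=1).mean(), flat_samples.max(axis=1).mean())
```

In LDA, this is why a small document-topic prior tends to assign each document to only a few topics.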

LDA was initially proposed by Blei et al. (2003) and is based on the following assumptions:
- documents with similar topics use similar groups of words
- latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus
- documents are probability distributions over latent topics
- topics themselves are probability distributions over words
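
The last two assumptions describe a generative story, which can be sketched in a few lines. This is a toy simulation, not the inference procedure: the vocabulary, the number of topics, the prior value 0.5, and the helper `generate_document` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

vocab = np.array(["school", "teacher", "exam", "sleep", "noise", "light"])
n_topics = 2

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

def generate_document(n_words, doc_topic_prior=0.5):
    """Sample one document following the LDA generative story."""
    # Each document is a probability distribution over topics.
    doc_topic = rng.dirichlet(np.full(n_topics, doc_topic_prior))
    words = []
    for _ in range(n_words):
        topic = rng.choice(n_topics, p=doc_topic)       # pick a topic for this word
        word = rng.choice(vocab, p=topic_word[topic])   # pick a word from that topic
        words.append(str(word))
    return doc_topic, words

doc_topic, words = generate_document(n_words=8)
print(doc_topic)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions.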


Explanation of how it works: https://www.youtube.com/watch?v=be7Xd2Ntai8&ab_channel=AnalyticsExcellence

<small>Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003) 'Latent Dirichlet Allocation', Journal of Machine Learning Research, 3, 993-1022.</small>

How to choose the optimal number of topics (for LDA, this must be chosen by us):
- Subject expertise: if you have an idea of how many topics are present in your documents, choose the number accordingly.
- Perplexity: a metric commonly used to evaluate the performance of an LDA model. Lower perplexity values indicate better models. However, perplexity alone does not always reflect the interpretability of the topics.
- Coherence: measures the interpretability of the topics. Higher coherence scores generally indicate better-defined topics.
- Visual analysis: manually inspect the topics generated by the different models. If the topics make sense and are coherent, it is a good indication that the model is good.
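
Coherence can be computed in several ways. The sketch below implements one common variant, a UMass-style score based on document co-occurrence counts, as an illustration only; the function name, the toy documents, and the two candidate topics are assumptions, and libraries such as gensim provide their own coherence implementations.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic's top words (ordered most to least probable).

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

# Hypothetical tokenized documents.
toy_documents = [
    ["school", "teacher", "exam"],
    ["school", "exam", "teacher"],
    ["sleep", "noise", "light"],
    ["sleep", "light"],
    ["school", "sleep"],
]
coherent_topic = ["school", "teacher", "exam"]
mixed_topic = ["school", "noise", "exam"]

print(umass_coherence(coherent_topic, toy_documents))
print(umass_coherence(mixed_topic, toy_documents))
```

Words that rarely appear in the same document pull the score down, so the mixed topic scores lower than the coherent one.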

In [None]:
%%capture
%pip install import_ipynb

In [None]:
%%capture

import import_ipynb
import reddit_post_analysis

# Access the variable from the first notebook
stopwords = reddit_post_analysis.stop_words

In [None]:
stopwords

['s',
 'm',
 'go',
 'u',
 're',
 'Ye',
 'OP',
 've',
 'd',
 'll',
 'ok',
 'ex',
 'Oh',
 'im',
 'NT',
 'bc',
 'don t',
 'doesn t',
 'didn t',
 'isn t',
 'aren t',
 'wasn t',
 'wouldn t',
 'won t',
 'don',
 'doesn',
 'didn',
 'isn',
 'aren',
 'wasn',
 'wouldn',
 'won',
 'get',
 "it's",
 "i'd",
 "aren't",
 'otherwise',
 'her',
 'which',
 "you're",
 "they've",
 "doesn't",
 'than',
 'however',
 'my',
 "i've",
 'into',
 "who's",
 'their',
 "she's",
 'an',
 'and',
 'com',
 'any',
 'under',
 "he'd",
 'other',
 'by',
 'k',
 'on',
 'why',
 "when's",
 'am',
 'do',
 "shouldn't",
 "hasn't",
 "shan't",
 'there',
 'we',
 'does',
 'yourselves',
 "she'll",
 "i'll",
 "you'll",
 'each',
 'not',
 "haven't",
 'i',
 'them',
 'so',
 "they'll",
 'can',
 'few',
 'myself',
 'yours',
 'at',
 'could',
 'would',
 'these',
 "he'll",
 "how's",
 'as',
 "can't",
 'being',
 "we'd",
 'once',
 "couldn't",
 'also',
 'both',
 'over',
 "mustn't",
 'how',
 'having',
 'below',
 'all',
 "weren't",
 'out',
 "didn't",
 "hadn't",

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Get the default English stop words from CountVectorizer
default_stopwords = CountVectorizer(stop_words='english').get_stop_words()

all_stopwords = list(default_stopwords)
# Combine the default and custom stop words
for i in stopwords:
    if i not in all_stopwords:
        all_stopwords.append(i)

# all_stopwords
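
CountVectorizer compares stop words with tokens after lowercasing and tokenization, so entries such as 'OP' or 'don t' can never match a single lowercase token. A possible normalization step is sketched below; `normalize_stopwords` and the two sample lists are hypothetical and only illustrate the idea.

```python
def normalize_stopwords(*stopword_lists):
    """Lowercase entries, split multi-word entries, and return a sorted, de-duplicated list."""
    normalized = set()
    for stopword_list in stopword_lists:
        for entry in stopword_list:
            normalized.update(entry.lower().split())
    return sorted(normalized)

# Hypothetical excerpts standing in for the default and custom lists.
default_sample = ["the", "and", "not"]
custom_sample = ["OP", "don t", "Ye", "and"]

print(normalize_stopwords(default_sample, custom_sample))
```

Using a set also avoids the repeated list membership checks of the loop above.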


In [None]:
%%capture
%pip install pyldavis

In [None]:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
nlp = spacy.load('en_core_web_sm')
%matplotlib inline

In [None]:
# Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. 
# Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. 
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence))) 
data_words = list(sent_to_words(data.text))

# append 'words' column to dataset
data.loc[:, 'words'] = data_words
print(data_words[:2])

[['don', 'think', 'it', 'works', 'like', 'that'], ['do', 'we', 'have', 'handicap', 'add', 'on', 'to', 'our', 'governments', 'student', 'money', 'but', 'if', 'you', 'have', 'it', 'you', 'can', 'only', 'earn', 'an', 'amount', 'of', 'money', 'on', 'the', 'side', 'and', 'you', 'only', 'get', 'it', 'as', 'long', 'as', 'you', 're', 'student', 'obviously']]
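
The output shows that single-character tokens such as 'I' and 't' are dropped. The sketch below reproduces that behaviour with a regular expression; it is a rough approximation written for illustration, not gensim's implementation, and it assumes the defaults of minimum length 2 and maximum length 15.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess: lowercase, keep alphabetic tokens of bounded length."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_simple_preprocess("I don t think it works like that"))
# ['don', 'think', 'it', 'works', 'like', 'that']
```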


In [None]:
# Lemmatization reduces each word to its dictionary form (lemma), e.g. 'works' -> 'work'. Stemming, by contrast, only strips affixes.
# The advantage is that it reduces the total number of unique words in the dictionary.
# As a result, the document-word matrix (created by CountVectorizer in a later step) has fewer columns and is denser.
# This generally helps produce cleaner topics.
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']): 
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out


In [None]:
# Initialize spacy ‘en’ model, keeping only tagger component (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Do lemmatization keeping only Noun and Verb - these are the parts of speech that usually reflect topics
data_lemmatized = lemmatization(data.words, allowed_postags=['NOUN', 'VERB']) 
data.loc[:, 'lemmas'] = data_lemmatized
print(data_lemmatized[:2])

# After this pre-processing, the post text is represented as a collection of words (= bag of words).

['think work', 'add government student money earn amount money side get student']


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522377 entries, 0 to 522376
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   id                  522377 non-null  object
 1   main_submission_id  522377 non-null  object
 2   comment_parent_id   522377 non-null  object
 3   subreddit           522377 non-null  object
 4   post_type           522377 non-null  object
 5   text                522377 non-null  object
 6   datetime            522377 non-null  object
 7   month               522377 non-null  int64 
 8   year                522377 non-null  int64 
 9   text_length         522377 non-null  int64 
 10  language            522377 non-null  object
 11  language_ft         522377 non-null  object
 12  words               522377 non-null  object
 13  lemmas              522377 non-null  object
dtypes: int64(3), object(11)
memory usage: 55.8+ MB


In [None]:
data.head(2)

Unnamed: 0,id,main_submission_id,comment_parent_id,subreddit,post_type,text,datetime,month,year,text_length,language,language_ft,words,lemmas
0,is4ft9s,y2q46p,t3_y2q46p,autism,comment,I don t think it works like that,2022-10-13 05:58:56,10,2022,32,en,en,"[don, think, it, works, like, that]",think work
1,is4gwqj,y2q46p,t3_y2q46p,autism,comment,I do we have handicap add on to our government...,2022-10-13 06:12:48,10,2022,189,en,en,"[do, we, have, handicap, add, on, to, our, gov...",add government student money earn amount money...


In [None]:
cv = CountVectorizer(analyzer='word',
                     max_df=0.9,                        # ignore words that appear in more than 90% of documents
                     min_df=2,                          # ignore words that appear in fewer than 2 documents
                     stop_words=all_stopwords,          # remove stop words
                     lowercase=True,                    # convert all words to lowercase
                     token_pattern='[a-zA-Z0-9]{3,}',   # keep tokens of at least 3 characters
                     max_features=50000,                # keep at most the 50,000 most frequent words
                    )
            

In [None]:
# no need to do train/validate/test split as it is unsupervised learning
# vectorize all text dataset
data_vectorized = cv.fit_transform(data.lemmas)

In [None]:
# 522377 text rows
# each text/document row is represented by a 37574-dimensional vector (the dataset has 37574 features)
data_vectorized

<522377x37574 sparse matrix of type '<class 'numpy.int64'>'
	with 7156031 stored elements in Compressed Sparse Row format>
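
The shape and the number of stored elements are enough to estimate how sparse this matrix is and how large a dense copy would be. The arithmetic below uses only the figures reported in the output above and assumes 8 bytes per int64 cell.

```python
n_docs, n_features = 522_377, 37_574   # shape reported above
stored = 7_156_031                     # non-zero entries reported above

total_cells = n_docs * n_features
density = stored / total_cells
dense_gb = total_cells * 8 / 1e9       # 8 bytes per int64 cell

print(f"density: {density:.4%}")
print(f"non-zero terms per document: {stored / n_docs:.1f}")
print(f"dense int64 size: {dense_gb:.0f} GB")
```

Well under 0.1% of the cells are non-zero, which is why the sparse format is kept for model fitting.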

In [None]:
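# Convert the vectors to a DataFrame.
# A dense copy (data_vectorized.toarray()) would need 522377 x 37574 int64 cells, far more than typical RAM,
# so keep the columns sparse.
vectors = pd.DataFrame.sparse.from_spmatrix(data_vectorized, columns=cv.get_feature_names_out())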

# Concatenate the original DataFrame with the vectors DataFrame
data = pd.concat([data, vectors], axis=1)
data.head(3)

In [None]:
# documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
# learning_decay: float, default=0.7
# learning_method: {'batch', 'online'}, default='batch'
from sklearn.decomposition import LatentDirichletAllocation
max_iter = 10

LDA = LatentDirichletAllocation(n_components=10,            # number of topics
                                max_iter=max_iter,          # maximum number of passes over the training data (epochs)
                                batch_size=512,             # number of docs per mini-batch update; each pass still covers all docs
                                random_state=7,
                                learning_method='online',
                                evaluate_every = -1,        # how often to evaluate perplexity; -1 (default) disables it
                                n_jobs = -1,                # use all available CPUs
                                )
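
With `learning_method='online'`, scikit-learn documents `max_iter` as the maximum number of passes over the training data in `fit`, and `batch_size` as the number of documents used in each update. Under that reading, the number of mini-batch updates can be estimated as follows; the figures are taken from this notebook and the calculation is only a back-of-the-envelope sketch.

```python
import math

n_docs = 522_377   # documents in the vectorized corpus
batch_size = 512
max_iter = 10

updates_per_pass = math.ceil(n_docs / batch_size)
total_updates = updates_per_pass * max_iter

print(f"mini-batch updates per pass: {updates_per_pass}")
print(f"mini-batch updates over {max_iter} passes: {total_updates}")
```

Every document is therefore visited in each pass; a smaller `batch_size` means more, noisier updates per pass rather than fewer documents seen.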

In [None]:
LDA.fit(data_vectorized)
print(LDA)  # Model attributes

LatentDirichletAllocation(batch_size=512, learning_method='online', n_jobs=-1,
                          random_state=7)


In [None]:
# evaluate model performance

# Log Likelihood: Higher the better
print("Log Likelihood: ", LDA.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
# perplexity might not be the best measure to evaluate topic models because it doesn’t consider the context and semantic associations between words.
print("Perplexity: ", LDA.perplexity(data_vectorized))

# See model parameters
pprint(LDA.get_params())


Log Likelihood:  -64526052.89568952
Perplexity:  1648.8671941932269
{'batch_size': 512,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 10,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 7,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
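
The relation between the two printed scores can be checked by hand: perplexity is the exponential of the negative per-word log-likelihood. The sketch below uses the two values printed above; the token count is not read from the data but derived from those two numbers, so it should be treated as an implied figure.

```python
import math

def perplexity(log_likelihood, n_words):
    """Perplexity as the exponential of the negative per-word log-likelihood."""
    return math.exp(-log_likelihood / n_words)

log_likelihood = -64526052.89568952        # value printed above
reported_perplexity = 1648.8671941932269   # value printed above

# Token count implied by the two reported numbers (not read from the data).
implied_n_words = -log_likelihood / math.log(reported_perplexity)

print(f"implied token count: {implied_n_words:,.0f}")
print(f"per-word log-likelihood: {log_likelihood / implied_n_words:.3f}")
```

Because the log-likelihood grows with corpus size, perplexity is the more convenient number when comparing models fitted on different amounts of text.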


In [None]:
# get the vocab of the words
vocab_size = len(cv.get_feature_names_out())
print('Number of words in vocab.: ', vocab_size)

In [None]:
# get the most common words per topic = words that have the highest probabilities of belonging to a topic
# argsort() gets index positions sorted from least to greatest
# top 10 = last 10 values of argsort()
# code reference documentation:
# https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py

import matplotlib.pyplot as plt
import numpy as np

def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(4, 5, figsize=(30, 32), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[-n_top_words:]
        top_features = feature_names[top_features_ind]
        weights = topic[top_features_ind]  # weights of the selected words, in ascending order

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
        fig.suptitle(title, fontsize=40)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

In [None]:
# get the topics
topics = LDA.components_

# plot the top 10 words per topic
feature_names = np.array(cv.get_feature_names_out())
title = 'Top 10 Words per Topic'
plot_top_words(LDA, feature_names, 10, title)
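
The `argsort()` slicing used in `plot_top_words` can be checked on a small example. The feature names and weights below are hypothetical stand-ins for one row of `components_`.

```python
import numpy as np

toy_features = np.array(["school", "sleep", "noise", "teacher", "exam"])
toy_weights = np.array([5.0, 0.2, 0.1, 3.0, 4.0])   # hypothetical row of `components_`

n_top_words = 3
ascending = toy_weights.argsort()[-n_top_words:]    # indices of the 3 largest weights, smallest first
descending = ascending[::-1]                        # same indices, largest first

print(toy_features[ascending].tolist())
print(toy_features[descending].tolist())
```

The ascending order suits `barh`, which draws the first item at the bottom, so the heaviest word ends up at the top of each bar chart.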

In [None]:
# Use GridSearch to determine the best LDA model.
# The most important tuning parameter for LDA models is n_components (number of topics).

# Define Search Param
search_params = {'n_components': [10, 15, 20], 'batch_size': [128, 512]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=10,      
                                random_state=7,
                                learning_method='online',
                                evaluate_every = -1,    # how often to evaluate perplexity; -1 (default) disables it
                                n_jobs = -1,            # use all available CPUs
                              )
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params, cv=None, n_jobs=-1, error_score='raise', refit=True)
# Do the Grid Search
model.fit(data_vectorized)

KeyboardInterrupt: 

In [None]:
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))
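
On the full corpus this search fits many models (each parameter combination times the number of folds), which can take a long time. A way to check the workflow first is to run it on a small sample. The sketch below uses a hypothetical eight-document corpus, two folds, and parallelism in one place only; these choices are assumptions for illustration, not tuned settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

toy_docs = [
    "school teacher exam homework", "teacher exam grade school",
    "sleep noise light night", "noise light sleep tired",
    "school homework grade exam", "night sleep tired noise",
    "teacher school homework grade", "light night tired sleep",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=7),  # batch learning, single process
    param_grid={"n_components": [2, 3, 4]},
    cv=2,        # 2 folds because the toy corpus is tiny
    n_jobs=1,    # parallelise in one place only
)
toy_search.fit(toy_counts)

print(toy_search.best_params_)
print(toy_search.best_score_)   # mean held-out approximate log-likelihood
```

GridSearchCV ranks the candidates by the estimator's `score` method, which for LDA is the approximate log-likelihood on the held-out fold, so higher (less negative) is better.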

In [None]:
# %%capture
# %pip install -U pip setuptools wheel
# %pip install -U 'spacy[apple]'
# !python3 -m spacy download en_core_web_sm
# !python3 -m spacy validate


In [None]:
# %%capture
# %pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

##### Deep learning pipeline

pre-process text --> vectorize (dense embeddings) --> deep learning neural net (hidden layers) --> output (here: topics)

In the DL pipeline, the raw data (after pre-processing) is directly fed to a model. The model is capable of “learning” features from the data. Hence, these features are more in line with the task at hand, so they generally give improved performance. But, since all these features are learned via model parameters, the model loses interpretability. (Vajjala et al., 2020)

<small>Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical Natural Language Processing. O'Reilly Media, Inc.</small>

#### BERTopic
'BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.' (Grootendorst, 2022)

<small>Grootendorst, M. (2022). "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv:2203.05794v1 [cs.CL]. </small>
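
The class-based TF-IDF step can be illustrated on its own. Following the formulation in Grootendorst (2022), a term's weight in a cluster is its frequency in that cluster multiplied by log(1 + A / tf_t), where A is the average number of words per cluster and tf_t is the term's frequency across all clusters. The sketch below is a minimal NumPy version with hypothetical counts, not BERTopic's own code.

```python
import numpy as np

def class_tfidf(term_counts):
    """c-TF-IDF weights for a (n_classes, n_terms) matrix of per-cluster term counts.

    Assumes every term occurs at least once across the clusters.
    """
    term_counts = np.asarray(term_counts, dtype=float)
    avg_words_per_class = term_counts.sum() / term_counts.shape[0]
    term_totals = term_counts.sum(axis=0)   # frequency of each term across all classes
    idf = np.log(1.0 + avg_words_per_class / term_totals)
    return term_counts * idf

# Hypothetical counts: rows are clusters, columns are terms.
terms = ["school", "sleep", "people"]
counts = [[8, 0, 4],
          [0, 9, 5]]

weights = class_tfidf(counts)
for row in weights:
    print(terms[int(row.argmax())])
```

A word like 'people' that is common in every cluster gets a modest weight in each, while cluster-specific words dominate the topic representation.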


In [None]:
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.decomposition import LatentDirichletAllocation
# import torch
# from transformers import DistilBertTokenizer, DistilBertModel
# from pytorch_pretrained_bert import BertTokenizer, BertModel