# Topic modeling

We are going to look at data from the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  These are postings to newsgroups in 20 different categories.  We will focus on 3 to keep things simple (and the computations quick).

Scikit-learn has a function for downloading the data.  See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

**Remember: We are using labeled data but exploring this from the standpoint of unsupervised learning.  It is convenient to use the data because we know that the topics naturally fall into discrete thematic groups, but that information is not passed to the models, nor would we assume it to be known beforehand.**

## LDA

Latent Dirichlet Allocation (LDA): a topic model that infers topics from the word frequencies of a set of documents.

The workflow has three steps:

* Get a "dictionary" that has IDs for all the words along with a record of their word frequencies.
* Use the dictionary to build a "bag of words" for each document: a list of its words and their frequencies.
* Use gensim to generate an LDA model from those bags of words.
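
The first two steps can be sketched in plain Python on a toy corpus. This is only an illustration of the data structures, not gensim's implementation: the documents are made up, and the IDs here are assigned in order of first appearance, which may differ from the order gensim uses.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data, not from 20 Newsgroups)
toy_docs = [
    ["orbit", "launch", "orbit"],
    ["image", "pixel", "image", "image"],
    ["launch", "engine"],
]

# "Dictionary": give each distinct word an integer ID (order of first appearance)
token2id = {}
for doc in toy_docs:
    for word in doc:
        if word not in token2id:
            token2id[word] = len(token2id)

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def to_bow(doc):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())

# "Bag of words" corpus: one list of (word_id, count) pairs per document
toy_corpus = [to_bow(doc) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

Each document is reduced to counts, so word order is discarded; that is the sense in which the representation is a "bag".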

## Gensim

* "Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning."
* [gensim website](https://radimrehurek.com/gensim/)

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
# This will be reminiscent of last week
# but we'll also bring along the rec.motorcycles data

cats = ['sci.space', 'comp.graphics', 'rec.motorcycles']

data = fetch_20newsgroups(categories=cats,
                          remove=('headers','footers','quotes'))

In [None]:
print(data.DESCR)

In [None]:
x = data.data

In [None]:
len(x)

In [None]:
x[0]

In [None]:
data.target_names

In [None]:
data.target

We use NLTK to pre-process the words.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

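# Download the NLTK stop word list and tokenizer models used below
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Assumption: recent NLTK releases may look for 'punkt_tab' instead of 'punkt';
# if word_tokenize raises a LookupError, try nltk.download('punkt_tab')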

In [None]:
myStopWords = list(punctuation) + stopwords.words('english')

In [None]:
x[0]

In [None]:
[w for w in word_tokenize(x[0].lower()) if w not in myStopWords]

In [None]:
docs = []
for i in x:
    docs.append([w for w in word_tokenize(i.lower()) if w not in myStopWords])

In [None]:
docs[0]

In [None]:
from nltk.stem.porter import PorterStemmer
#from nltk.stem import LancasterStemmer

In [None]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [None]:
docs_stemmed = []
for i in docs:
    docs_stemmed.append([p_stemmer.stem(w) for w in i])

In [None]:
docs_stemmed[0]

In [None]:
# ....hmm.....
# there's still a lot of junk, so rather than stemming, let's
# try to retain only the "valid" words

import nltk
nltk.download('words')
nltk_valid_words = set(nltk.corpus.words.words())

In [None]:
# [this will still not be a perfect method]
'nasa' in nltk_valid_words

In [None]:
'algorithm' in nltk_valid_words

In [None]:
'algorithms' in nltk_valid_words

In [None]:
docs_validwords = []
for d in docs: #_stemmed:
    docs_validwords.append([i for i in d if i in nltk_valid_words])

In [None]:
docs[0]

In [None]:
docs_validwords[0]

Here we use gensim to make the dictionary and corpus structures, and to employ the LDA model to extract groups (aka topics) and the distribution of words for each topic.

In [None]:
from gensim import corpora, models
import gensim

In [None]:
# dictionary = corpora.Dictionary(docs_stemmed)
dictionary = corpora.Dictionary(docs_validwords)

In [None]:
len(dictionary)

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5)
# could also trim with keep_n=1000 or similar to keep only the top words
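
As a rough sketch of what the two thresholds mean, the snippet below filters a toy vocabulary by document frequency in plain Python. The function name `filter_by_doc_freq` and the toy documents are hypothetical; gensim's own method additionally caps the vocabulary size with `keep_n`, which is not reproduced here.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Return the tokens that pass both document-frequency thresholds.

    no_below: minimum number of documents a token must appear in (a count)
    no_above: maximum fraction of documents a token may appear in
    """
    n_docs = len(tokenized_docs)
    # Count each token once per document, regardless of repeats
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

# Hypothetical toy documents
toy_docs = [
    ["the", "orbit", "launch"],
    ["the", "image", "pixel"],
    ["the", "launch", "engine"],
    ["the", "image", "orbit"],
]

kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "the" is dropped for appearing in every document, while "pixel" and "engine" are dropped for appearing in only one.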

In [None]:
len(dictionary)

In [None]:
print(dictionary.token2id)

In [None]:
print(dictionary.token2id['algorithm'])

In [None]:
dictionary[36]

In [None]:
# corpus = [dictionary.doc2bow(text) for text in docs_stemmed]
corpus = [dictionary.doc2bow(text) for text in docs_validwords]

In [None]:
print(corpus[0])

In [None]:
print(dictionary.token2id['science'])

In [None]:
dictionary[27]

In [None]:
# docs_stemmed[0]
docs_validwords[0]

In [None]:
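# Note: corpus[0][27] is the 28th (word_id, count) pair in the first document's
# bag of words, not the entry for the word whose ID is 27
wordid = corpus[0][27]
print(dictionary[wordid[0]], wordid[1])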

In [None]:
for i in corpus[0]:
    print(dictionary[i[0]], i[1])

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=3, 
                                           id2word = dictionary, 
                                           passes=200)

In [None]:
ldamodel.show_topics(num_topics=3)

In [None]:
for i in ldamodel.print_topics(num_topics=3, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
data.target_names

In [None]:
import matplotlib.pyplot as plt
import re

In [None]:
re.split(re.escape(' + ') + '|' + re.escape('*'), 'hi + me*4')

In [None]:
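fig, ax = plt.subplots(1, 3, figsize=(8, 6))
for topic_id, topic_str in ldamodel.print_topics(num_topics=3, num_words=20):
    # topic_str looks like: weight*"word" + weight*"word" + ...
    # Use local names here so the document list `x` defined earlier is not overwritten
    terms = []
    weights = []
    # Splitting on ' + ' and '*' alternates weight, word, weight, word, ...
    pieces = re.split(re.escape(' + ') + '|' + re.escape('*'), topic_str)
    for count, j in enumerate(pieces):
        if count % 2 == 0:
            weights.insert(0, float(j))
        else:
            terms.insert(0, j.strip('"'))  # drop the surrounding quotes
    ax[topic_id].barh(terms, weights, height=0.5)
plt.tight_layout()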

# TF-IDF (Term Frequency Inverse Document Frequency)

TF-IDF is similar to bag-of-words, but it down-weights words that appear in many documents across the corpus, so that words concentrated in a few documents count for more.
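
A minimal sketch of the idea in plain Python, on a made-up bag-of-words corpus. It assumes one common weighting scheme (raw count times log base 2 of the inverse document frequency, followed by scaling each document to unit length), which is what gensim's `TfidfModel` is commonly documented to do by default; check the gensim documentation before relying on exact values.

```python
import math
from collections import Counter

# Hypothetical bag-of-words corpus: lists of (word_id, count) pairs
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],   # word 0 appears only in this document
    [(1, 1), (2, 3)],
    [(1, 2), (3, 1)],           # word 1 appears in every document
]
n_docs = len(toy_corpus)

# Document frequency of each word ID
doc_freq = Counter(word_id for doc in toy_corpus for word_id, _ in doc)

def tfidf_weights(bow):
    """Weight each count by log2(N / df), drop zero weights, scale to unit length."""
    raw = [(wid, tf * math.log2(n_docs / doc_freq[wid])) for wid, tf in bow]
    raw = [(wid, w) for wid, w in raw if w > 0]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(wid, w / norm) for wid, w in raw] if norm else []

toy_tfidf = [tfidf_weights(doc) for doc in toy_corpus]
for doc in toy_tfidf:
    print([(wid, round(w, 3)) for wid, w in doc])
```

Word 1 occurs in every document, so its inverse document frequency is zero and it disappears from the weighted vectors, while the rarer word 0 dominates the first document.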

In [None]:
#Initialize the model
tfidf = gensim.models.TfidfModel(corpus)

In [None]:
corpus[0]

In [None]:
# apply transformation
tfidf[corpus[0]]

In [None]:
corpus_transformed = tfidf[corpus]

In [None]:
corpus_transformed[0]

In [None]:
tfidf.num_docs

In [None]:
ldamodel_tfidf = gensim.models.ldamodel.LdaModel(corpus_transformed, 
                                           num_topics=3, 
                                           id2word = dictionary, 
                                           passes=200)

In [None]:
for i in ldamodel_tfidf.print_topics(num_topics=3, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
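fig, ax = plt.subplots(1, 3, figsize=(8, 6))
for topic_id, topic_str in ldamodel_tfidf.print_topics(num_topics=3, num_words=20):
    # topic_str looks like: weight*"word" + weight*"word" + ...
    # Use local names here so the document list `x` defined earlier is not overwritten
    terms = []
    weights = []
    # Splitting on ' + ' and '*' alternates weight, word, weight, word, ...
    pieces = re.split(re.escape(' + ') + '|' + re.escape('*'), topic_str)
    for count, j in enumerate(pieces):
        if count % 2 == 0:
            weights.insert(0, float(j))
        else:
            terms.insert(0, j.strip('"'))  # drop the surrounding quotes
    ax[topic_id].barh(terms, weights, height=0.5)
plt.tight_layout()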

# Scikit-Learn

Returning to our handy Scikit-Learn library:

In [None]:
cats = ['sci.space', 'comp.graphics', 'rec.motorcycles']
data = fetch_20newsgroups(categories=cats,
                   remove=('headers','footers','quotes'))
corpus = data.data

In [None]:
corpus[0]

In [None]:
len(corpus)

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# getting the document-term matrix

# use CountVectorizer to convert the text to a matrix of token counts
# min_frac: words occurring in fewer than this fraction of documents
#           are excluded; an integer is read as a document count rather than a fraction
# max_frac: words occurring in more than this fraction of documents
#           are excluded
# max_words: after exclusions, the most commonly occurring words are retained
#           up to a total of max_words words
# ngram_range: can be used to include pairs, triples, etc of words for larger
#           ranges like (1,2), (1,3), etc
# stop_words: the stop words to be dropped; 'english' selects
#           scikit-learn's built-in English list

min_frac    = 2
max_frac    = 0.75
max_words   = 5000
ngram_range = (1,2)
stop_words  = 'english'

vectorizer = CountVectorizer(
    min_df=min_frac, 
    max_df=max_frac, 
    max_features=max_words, 
    ngram_range=ngram_range,
    stop_words=stop_words
)
# vectorizer = TfidfVectorizer(
#     min_df=min_frac, 
#     max_df=max_frac, 
#     max_features=max_words, 
#     ngram_range=ngram_range,
#     stop_words=stop_words
# )

In [None]:
# Fit and transform the corpus texts into a word frequency matrix
word_freq_matrix = vectorizer.fit_transform(corpus)

# Convert the word frequency matrix to a numerical (numpy) array
word_freq_array = word_freq_matrix.toarray()

# Get the words for column names
words = vectorizer.get_feature_names_out()

# Create a dataframe of the result (used below to inspect the counts)
word_freq_df = pd.DataFrame(word_freq_array, columns=words)

In [None]:
word_freq_df

In [None]:
word_freq_df[['earth','graphics','image',
              'nasa','algorithms','astronomy',
              'bike','dirt','motorcycle']]

In [None]:
# Train the LDA model
# 
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
# 
# initialize the LDA object
# n_components: the number of topics
# max_iter: Number of iterations over all the training data during training
# n_jobs: Number of parallel tasks to use during training (-1 for all available cores)
# learning_method: can be "batch" or "online"
#     batch uses all training data in each EM update
#     online uses mini-batch of training data in each EM update
#            The learning rate is controlled by the learning_decay 
#            and the learning_offset parameters
# learning_decay: should be set between (0.5, 1.0] to guarantee asymptotic convergence
#                 (note: the 0.5 used below sits on the excluded edge of that interval)
# learning_offset: A (positive) parameter that downweights early iterations in online learning
#                  should be > 1.0
# random_state: sets the random number seed for reproducibility

lda = LatentDirichletAllocation(n_components=3,
                                max_iter=200,
                                n_jobs=-1,
                                learning_method='online',
                                learning_offset=50.,
                                learning_decay=.5,
                                random_state=0,
                                evaluate_every=5, # [by default, perplexity change of < 0.1 is used as check for convergence]
                                verbose=1)

# Training the LDA model
lda.fit(word_freq_matrix)

In [None]:
numtopics = 3

# document-by-topic matrix
doc_topic_matrix = lda.transform(word_freq_matrix)

# Convert the document-topic matrix to a dataframe (tabular structure)
doc_topic_df = pd.DataFrame(doc_topic_matrix,
                            columns=(['topic_' + str(i) for i in range(numtopics)]))

# Look at it
# Entries are the probability of the document belonging to the topic
doc_topic_df
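
Because this dataset happens to be labeled, one informal check is to assign each document to its highest-probability topic and count how those assignments line up with the known categories. The sketch below does this in plain Python on made-up numbers; `toy_doc_topic` and `toy_labels` are hypothetical stand-ins for `doc_topic_matrix` and `data.target`.

```python
from collections import Counter

# Hypothetical document-by-topic probabilities (each row sums to 1)
toy_doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
]
# Hypothetical true category index for each document
toy_labels = [2, 0, 2, 1]

# Dominant topic per document: the column index with the largest probability
dominant = [max(range(len(row)), key=row.__getitem__) for row in toy_doc_topic]

# Count (label, dominant topic) pairs, i.e. a small cross-tabulation
crosstab = Counter(zip(toy_labels, dominant))
for (label, topic), n in sorted(crosstab.items()):
    print("label", label, "-> topic", topic, ":", n, "document(s)")
```

Topic indices are arbitrary, so topic 0 need not correspond to category 0; what matters is whether each category maps mostly onto a single topic.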

In [None]:
# topic-by-word matrix
topic_word_matrix = lda.components_

# Convert the topic-word matrix to a dataframe (tabular structure)
topic_word_df = pd.DataFrame(topic_word_matrix, columns=words)

# Look at it
# Entries are NOT probabilities, but they tell you importances of the word-topic correspondence
topic_word_df
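
The rows of `lda.components_` can be read as pseudocounts of how often each word was assigned to each topic. Dividing every row by its sum turns them into per-topic word distributions. A minimal sketch on made-up numbers (`toy_components` and `toy_words` are hypothetical):

```python
# Hypothetical topic-by-word pseudocounts (2 topics, 3 words)
toy_components = [
    [8.0, 1.0, 1.0],
    [0.5, 3.0, 1.5],
]
toy_words = ["orbit", "image", "bike"]

# Normalize each row so that it sums to 1
topic_word_probs = [[value / sum(row) for value in row] for row in toy_components]

for topic_idx, row in enumerate(topic_word_probs):
    print(topic_idx, dict(zip(toy_words, [round(p, 2) for p in row])))
```

With the real model, the same normalization would be applied to `topic_word_matrix`, one row per topic.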

In [None]:
topic_word_df[['earth','graphics','image',
               'nasa','algorithms','astronomy',
               'bike','dirt','motorcycle']]

In [None]:
# interpretation

# document-by-topic matrix
doc_topic_matrix = lda.transform(word_freq_array)

# topic-by-word matrix
topic_word_matrix = lda.components_

# top words in each topic
n_top_words = 10
for topic_idx, topic in enumerate(topic_word_matrix):
    message = "Topic #%d: " % topic_idx
    message += " ".join([words[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]])
    print(message)
    print()
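
The slice `[:-n_top_words - 1:-1]` in the cell above is compact but easy to misread: it walks an ascending sort backwards from the end and stops after `n_top_words` items. A plain-Python sketch of the same indexing on made-up weights (`toy_topic` and `toy_words` are hypothetical):

```python
toy_words = ["orbit", "image", "bike", "nasa"]
toy_topic = [0.2, 5.0, 1.0, 3.0]   # hypothetical weights for one topic
n_top = 2

# Indices that would sort the weights in ascending order (like argsort)
ascending = sorted(range(len(toy_topic)), key=toy_topic.__getitem__)

# Walk backwards from the last index and stop after n_top items
top_ids = ascending[:-n_top - 1:-1]
top_words = [toy_words[i] for i in top_ids]

print(ascending)
print(top_ids)
print(top_words)
```

The result lists the highest-weighted words first, which is the order used when printing each topic.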

In [None]:
# This will scan across number of topics from 1 to 9
# and make a plot showing how the perplexity score changes
# as a function of number of topics
# (perplexity is computed here on the same data used for training)

# train LDA models with different numbers of topics
numtopics = range(1,10)
perplexityscores = []
for i in numtopics:
    print('Training with topics =', i)
    lda = LatentDirichletAllocation(n_components=i,
                                    max_iter=200,
                                    n_jobs=-1,
                                    learning_method='online',
                                    learning_offset=50.,
                                    learning_decay=.5,
                                    random_state=0)
    
    lda.fit(word_freq_array)
    
    perplexityscores.append(lda.perplexity(word_freq_array))
    
plt.plot(numtopics, perplexityscores)
plt.xlabel('number of topics')
plt.ylabel('perplexity')
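
For intuition about the score being plotted: perplexity is the exponential of the negative average log-likelihood per word, so lower values mean the model finds the words less surprising. The sketch below computes it from a list of made-up per-word probabilities; it illustrates the definition only and is not how scikit-learn evaluates its variational bound internally.

```python
import math

def perplexity(word_probs):
    """exp(-average log-probability) over a sequence of word probabilities."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))

# Hypothetical probabilities a model assigns to each of 8 observed words
uniform_guess = [0.25] * 8    # as uncertain as a 4-way coin flip per word
better_model = [0.50] * 8     # as uncertain as a 2-way coin flip per word

print(perplexity(uniform_guess))
print(perplexity(better_model))
```

A perplexity of about 4 corresponds to choosing uniformly among 4 words at each position, and about 2 to choosing between 2.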

# Visualization for NLP

Here we'll use a new-ish library for topic model visualization: topicwizard
* https://x-tabdeveloping.github.io/topicwizard/
* Some things may not play nicely if you try this with other datasets/models, but it's definitely worth exploring

In [None]:
# You can install with pip, e.g.
# !pip install topic-wizard

In [None]:
import topicwizard

In [None]:
vectorizer = CountVectorizer(
    min_df=min_frac, 
    max_df=max_frac, 
    max_features=max_words, 
    ngram_range=ngram_range,
    stop_words=stop_words
)

model = LatentDirichletAllocation(n_components=3,
                                max_iter=200,
                                n_jobs=-1,
                                learning_method='online',
                                learning_offset=50.,
                                learning_decay=.5,
                                random_state=0)

topic_pipeline = topicwizard.pipeline.make_topic_pipeline(vectorizer, model)

topic_pipeline.fit(corpus)

topicwizard.visualize(corpus, model=topic_pipeline)