# Topic modeling

We are going to look at data from the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  These are postings to newsgroups in 20 different categories.

Scikit-learn has a function for downloading the data.  See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

## LDA

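Latent Dirichlet Allocation (LDA) is a topic model that infers topics from the word frequencies in a set of documents. Each topic is a distribution over words, and each document is treated as a mixture of topics.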
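
For reference, the generative story that LDA assumes is usually written as below. The symbols ($K$ topics, $D$ documents, Dirichlet parameters $\alpha$ and $\beta$) are standard textbook notation and are assumptions here; they do not appear in the notebook's code.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of document $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is that word. Fitting the model means estimating $\phi$ and $\theta$ from the observed words.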

* Build a "dictionary" that assigns an ID to every word and records word frequencies.
* Use the dictionary to turn each document into a "bag of words": a list of the document's word IDs and their counts.
* Use gensim to fit an LDA model on the bag-of-words corpus.
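
The first two steps above can be sketched in plain Python on a toy corpus. This is only an illustration of the data structures, not gensim's implementation: the names `token2id` and `doc2bow` mirror the gensim names used later in this notebook, the toy documents are made up, and gensim may assign IDs in a different order.

```python
from collections import Counter

# Made-up, already-tokenized documents (assumption for illustration only).
toy_docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
]

# Step 1: the "dictionary" maps each distinct token to an integer ID,
# assigned here in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)


def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens found in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Step 2: the bag-of-words corpus has one list of (id, count) pairs per document.
toy_corpus = [doc2bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

Tokens that are missing from the dictionary are silently skipped, which is why trimming the dictionary also shrinks every document's bag of words.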

## Gensim

* "Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning."
* [gensim website](https://radimrehurek.com/gensim/)

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
data = fetch_20newsgroups(remove=("headers", "footers", "quotes"))

In [None]:
print(data.DESCR)

In [None]:
x = data.data

In [None]:
len(x)

In [None]:
x[0]

In [None]:
data.target_names

In [None]:
data.target

We use NLTK to pre-process the words.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

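# Download the NLTK resources used below
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # some newer NLTK releases look for this instead of 'punkt'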

In [None]:
# A set makes the "not in myStopWords" checks below much faster than a list
myStopWords = set(punctuation) | set(stopwords.words('english'))

In [None]:
x[0]

In [None]:
[w for w in word_tokenize(x[0].lower()) if w not in myStopWords]

In [None]:
docs = []
for i in x:
    docs.append([w for w in word_tokenize(i.lower()) if w not in myStopWords])

In [None]:
docs[0]

In [None]:
from nltk.stem.porter import PorterStemmer
#from nltk.stem import LancasterStemmer

In [None]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [None]:
docs_stemmed = []
for i in docs:
    docs_stemmed.append([p_stemmer.stem(w) for w in i])

In [None]:
docs_stemmed[0]

Here we use gensim to build the dictionary and corpus structures, then fit an LDA model to extract the topics and the distribution of words for each topic.

In [None]:
from gensim import corpora, models
import gensim

In [None]:
dictionary = corpora.Dictionary(docs_stemmed)

In [None]:
len(dictionary)

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5)
# could also trim with keep_n=1000 or similar to keep only the top words

In [None]:
len(dictionary)

In [None]:
print(dictionary.token2id)

In [None]:
print(dictionary.token2id['patient'])

In [None]:
dictionary[1668]

In [None]:
corpus = [dictionary.doc2bow(text) for text in docs_stemmed]

In [None]:
print(corpus[30])

In [None]:
dictionary[276]

In [None]:
docs_stemmed[30]

In [None]:
wordid = corpus[30][0]
print(dictionary[wordid[0]],wordid[1])

In [None]:
for i in corpus[30]:
    print(dictionary[i[0]], i[1])

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=20, 
                                           id2word = dictionary, 
                                           passes=5)

In [None]:
ldamodel.show_topics(num_topics=20)

In [None]:
for i in ldamodel.print_topics(num_topics=20, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
data.target_names

In [None]:
import matplotlib.pyplot as plt
import re

In [None]:
re.split(re.escape(' + ') + '|' + re.escape('*'), 'hi + me*4')
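
The split in the cell above is preparation for parsing the topic strings that `print_topics` returns, which look like `0.012*"game" + 0.010*"team"`. A minimal sketch of that parsing follows. It uses `str.split` rather than `re.split`, and both the helper name `parse_topic` and the example string are hypothetical.

```python
# Hypothetical topic string in the 'weight*"word" + weight*"word"' format.
topic_string = '0.012*"game" + 0.010*"team" + 0.008*"play"'


def parse_topic(topic_string):
    """Split a topic string into (word, weight) pairs, highest weight first."""
    pairs = []
    for term in topic_string.split(' + '):
        # Each term is 'weight*"word"'; split once on the first '*'.
        weight, word = term.split('*', 1)
        pairs.append((word.strip('"'), float(weight)))
    return pairs


print(parse_topic(topic_string))
```

Splitting each term separately keeps every word paired with its own weight, so there is no need to track whether the current piece is at an even or odd position.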

In [None]:
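fig, ax = plt.subplots(5, 4, figsize=(15, 20))
ax = ax.flatten()
for topic_id, topic_str in ldamodel.print_topics(num_topics=20, num_words=20):
    # topic_str looks like '0.012*"word" + 0.010*"other" + ...'
    words = []    # named words/weights so we don't overwrite x (the raw documents)
    weights = []
    # splitting on ' + ' and '*' yields weight, word, weight, word, ...
    pieces = re.split(re.escape(' + ') + '|' + re.escape('*'), topic_str)
    for count, j in enumerate(pieces):
        if count % 2 == 0:
            weights.insert(0, float(j))
        else:
            words.insert(0, j)
    # inserting at the front puts the highest-weight word at the top of each chart
    ax[topic_id].barh(words, weights, height=0.5)
plt.tight_layout()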

## TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is similar to bag-of-words, but it down-weights words that appear in many documents, so words concentrated in a few documents count for more.
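
To make the weighting concrete, here is a minimal pure-Python sketch on a made-up corpus. It assumes the weight is the raw count times $\log_2(N / \mathrm{df})$, followed by scaling each document to unit length. This is believed to match gensim's `TfidfModel` defaults, but treat that as an assumption; the helper name `tfidf_weights` and the toy data are hypothetical.

```python
import math

# Made-up bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(1, 1), (2, 1)],
    [(1, 3), (3, 1)],
]
num_docs = len(toy_corpus)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_corpus:
    for token_id, _count in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_weights(doc):
    """Return (token_id, weight) pairs: count * log2(N / df), L2-normalized."""
    raw = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in doc
    ]
    # Tokens that occur in every document get weight 0 and are dropped.
    raw = [(token_id, w) for token_id, w in raw if w > 0]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(token_id, w / norm) for token_id, w in raw]


for doc in toy_corpus:
    print(tfidf_weights(doc))
```

Token 1 appears in all three toy documents, so it disappears from every transformed document, while token 0, which appears in only one, dominates the first document.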

In [None]:
# Initialize the model
tfidf = gensim.models.TfidfModel(corpus)

In [None]:
corpus[30]

In [None]:
# apply transformation
tfidf[corpus[30]]

In [None]:
corpus_transformed = tfidf[corpus]

In [None]:
corpus_transformed[30]

In [None]:
tfidf.num_docs

In [None]:
ldamodel_tfidf = gensim.models.ldamodel.LdaModel(corpus_transformed, 
                                           num_topics=20, 
                                           id2word = dictionary, 
                                           passes=20)

In [None]:
for i in ldamodel_tfidf.print_topics(num_topics=20, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
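fig, ax = plt.subplots(5, 4, figsize=(15, 20))
ax = ax.flatten()
for topic_id, topic_str in ldamodel_tfidf.print_topics(num_topics=20, num_words=20):
    # topic_str looks like '0.012*"word" + 0.010*"other" + ...'
    words = []    # named words/weights so we don't overwrite x (the raw documents)
    weights = []
    # splitting on ' + ' and '*' yields weight, word, weight, word, ...
    pieces = re.split(re.escape(' + ') + '|' + re.escape('*'), topic_str)
    for count, j in enumerate(pieces):
        if count % 2 == 0:
            weights.insert(0, float(j))
        else:
            words.insert(0, j)
    # inserting at the front puts the highest-weight word at the top of each chart
    ax[topic_id].barh(words, weights, height=0.5)
plt.tight_layout()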