# Topic modeling

We are going to look at data from the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  These are postings to newsgroups in 20 different categories.

Scikit-learn has a function for downloading the data.  See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

## LDA

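Latent Dirichlet Allocation (LDA) is a topic model that infers topics from the word frequencies of a set of documents. Each topic is a probability distribution over words, and each document is modeled as a mixture of topics.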
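
To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The toy vocabulary, the two topic-word distributions, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical illustrations, not part of gensim or of this notebook's data.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, vocab, rng):
    """Generate one toy document following the LDA generative story."""
    n_topics = len(topic_word)
    # Per-document topic mixture: theta ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, n_topics, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        z = rng.choices(range(n_topics), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["car", "engine", "god", "faith"]  # hypothetical toy vocabulary
# Hypothetical topic-word distributions: an "autos" topic and a "religion" topic
topic_word = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]
theta, doc = generate_document(topic_word, alpha=0.5, n_words=8, vocab=vocab, rng=rng)
print(theta)
print(doc)
```

Fitting an LDA model runs this story in reverse: given only the documents, it estimates topic-word distributions and per-document mixtures that could plausibly have produced them.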

* Build a "dictionary" that assigns an ID to every word and records how often each word occurs.
* Use the dictionary to convert each document into a "bag of words": a list of (word ID, count) pairs.
* Use gensim to fit an LDA model to the resulting corpus.
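
The first two steps can be sketched in plain Python. This is only an illustration of the idea, assuming toy documents; `build_token2id` and `doc_to_bow` are hypothetical helpers, and the IDs gensim assigns to real data will generally differ.

```python
from collections import Counter

# Hypothetical, already-tokenized toy documents
toy_docs = [
    ["car", "engine", "car"],
    ["engine", "spec", "model"],
]

def build_token2id(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Turn one document into a sorted list of (token ID, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_token2id = build_token2id(toy_docs)
toy_corpus = [doc_to_bow(d, toy_token2id) for d in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only which words occur, and how often, is kept.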

## Gensim

* "Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning."
* [gensim website](https://radimrehurek.com/gensim/)

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
data = fetch_20newsgroups(remove=("headers", "footers", "quotes"))

In [3]:
print(data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [4]:
x = data.data

In [5]:
len(x)

11314

In [6]:
x[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [7]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [8]:
data.target

array([7, 4, 4, ..., 3, 1, 8])

We use NLTK to pre-process the text: lowercase each post, split it into word tokens, and drop punctuation and English stop words.

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

In [10]:
# A set gives fast membership tests when filtering every token of every post
myStopWords = set(punctuation) | set(stopwords.words('english'))

In [11]:
[w for w in word_tokenize(x[0].lower()) if w not in myStopWords]

['wondering',
 'anyone',
 'could',
 'enlighten',
 'car',
 'saw',
 'day',
 '2-door',
 'sports',
 'car',
 'looked',
 'late',
 '60s/',
 'early',
 '70s',
 'called',
 'bricklin',
 'doors',
 'really',
 'small',
 'addition',
 'front',
 'bumper',
 'separate',
 'rest',
 'body',
 'know',
 'anyone',
 'tellme',
 'model',
 'name',
 'engine',
 'specs',
 'years',
 'production',
 'car',
 'made',
 'history',
 'whatever',
 'info',
 'funky',
 'looking',
 'car',
 'please',
 'e-mail']

In [12]:
# Tokenize every post, dropping stop words and punctuation
docs = [[w for w in word_tokenize(post.lower()) if w not in myStopWords]
        for post in x]

In [13]:
docs[0]

['wondering',
 'anyone',
 'could',
 'enlighten',
 'car',
 'saw',
 'day',
 '2-door',
 'sports',
 'car',
 'looked',
 'late',
 '60s/',
 'early',
 '70s',
 'called',
 'bricklin',
 'doors',
 'really',
 'small',
 'addition',
 'front',
 'bumper',
 'separate',
 'rest',
 'body',
 'know',
 'anyone',
 'tellme',
 'model',
 'name',
 'engine',
 'specs',
 'years',
 'production',
 'car',
 'made',
 'history',
 'whatever',
 'info',
 'funky',
 'looking',
 'car',
 'please',
 'e-mail']

In [14]:
from nltk.stem.porter import PorterStemmer
#from nltk.stem import LancasterStemmer

In [15]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [16]:
# Reduce every token to its Porter stem
docs_stemmed = [[p_stemmer.stem(w) for w in doc] for doc in docs]

In [17]:
docs_stemmed[0]

['wonder',
 'anyon',
 'could',
 'enlighten',
 'car',
 'saw',
 'day',
 '2-door',
 'sport',
 'car',
 'look',
 'late',
 '60s/',
 'earli',
 '70',
 'call',
 'bricklin',
 'door',
 'realli',
 'small',
 'addit',
 'front',
 'bumper',
 'separ',
 'rest',
 'bodi',
 'know',
 'anyon',
 'tellm',
 'model',
 'name',
 'engin',
 'spec',
 'year',
 'product',
 'car',
 'made',
 'histori',
 'whatev',
 'info',
 'funki',
 'look',
 'car',
 'pleas',
 'e-mail']

Here we use gensim to build the dictionary and corpus structures, then fit an LDA model to extract topics and the distribution of words within each topic.

In [18]:
# gensim is a separate install (e.g. `pip install gensim`);
# without it this import raises ModuleNotFoundError: No module named 'gensim'
from gensim import corpora, models
import gensim

In [None]:
dictionary = corpora.Dictionary(docs_stemmed)

In [None]:
len(dictionary)

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5)
# could also trim with keep_n=1000 or similar to keep only the top words
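
As a rough illustration of what this kind of trimming does, the sketch below filters a toy vocabulary by document frequency. It mirrors the idea behind `no_below` and `no_above` only; `filter_by_doc_freq` is a hypothetical helper, and gensim's exact thresholding may differ in detail (for example in how it rounds the `no_above` fraction).

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

# Hypothetical toy documents
toy_posts = [
    ["car", "engine"],
    ["car", "wheel"],
    ["car", "engine"],
    ["car", "god"],
]
kept = filter_by_doc_freq(toy_posts, no_below=2, no_above=0.5)
print(kept)
```

Here "car" is dropped for appearing in every document, "wheel" and "god" for appearing in only one, which leaves "engine".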

In [None]:
len(dictionary)

In [None]:
print(dictionary.token2id)

In [None]:
print(dictionary.token2id['patient'])

In [None]:
dictionary[1668]

In [None]:
corpus = [dictionary.doc2bow(text) for text in docs_stemmed]

In [None]:
print(corpus[30])

In [None]:
dictionary[276]

In [None]:
docs_stemmed[30]

In [None]:
wordid = corpus[30][0]
print(dictionary[wordid[0]],wordid[1])

In [None]:
for i in corpus[30]:
    print(dictionary[i[0]], i[1])

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=20, 
                                           id2word = dictionary, 
                                           passes=5)

In [None]:
ldamodel.show_topics(num_topics=20)

In [None]:
for i in ldamodel.print_topics(num_topics=20, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
data.target_names

In [None]:
import matplotlib.pyplot as plt
import re

In [None]:
re.split(re.escape(' + ') + '|' + re.escape('*'), 'hi + me*4')
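
The same split can be applied to a topic description. The sketch below assumes the topics are formatted as `weight*"word"` terms joined by ` + `; the example string is made up for illustration, not real model output.

```python
import re

# Hypothetical topic string in the assumed `weight*"word" + ...` format
topic_string = '0.020*"car" + 0.015*"engin" + 0.010*"drive"'

# Splitting on ' + ' or '*' yields weight, word, weight, word, ...
parts = re.split(re.escape(' + ') + '|' + re.escape('*'), topic_string)
weights = [float(p) for p in parts[0::2]]
words = [p.strip('"') for p in parts[1::2]]
pairs = list(zip(words, weights))
print(pairs)
```

Even positions hold the weights and odd positions the words, which is the pattern the plotting loop relies on.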

In [None]:
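fig, ax = plt.subplots(5, 4, figsize=(15, 20))
ax = ax.flatten()
for topic_id, topic in ldamodel.print_topics(num_topics=20, num_words=20):
    # Use fresh names: reusing `x` here would overwrite the list of posts from In [4]
    words = []
    weights = []
    # The split alternates weight, word, weight, word, ...
    for count, part in enumerate(re.split(re.escape(' + ') + '|' + re.escape('*'), topic)):
        if count % 2 == 0:
            weights.insert(0, float(part))
        else:
            words.insert(0, part.strip('"'))
    # Inserting at the front puts the heaviest word at the top of the bar chart
    ax[topic_id].barh(words, weights, height=0.5)
plt.tight_layout()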

## TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is similar to bag-of-words, but it down-weights words that appear in many documents, so terms that are distinctive to a few documents count for more.
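
A minimal sketch of the weighting, assuming the common form $w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_2(N / \mathrm{df}_t)$ followed by scaling each document to unit length. The helper `tfidf_weights` and the toy corpus are hypothetical; gensim's `TfidfModel` follows a similar scheme by default as far as I know, but treat the details here as assumptions.

```python
import math
from collections import Counter

def tfidf_weights(bow_corpus):
    """Re-weight (term ID, count) pairs by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    doc_freq = Counter()
    for bow in bow_corpus:
        doc_freq.update(term_id for term_id, _ in bow)
    transformed = []
    for bow in bow_corpus:
        weighted = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        # Terms present in every document get weight 0 and are dropped
        transformed.append(
            [(tid, w / norm) for tid, w in weighted if w > 0] if norm > 0 else []
        )
    return transformed

# Hypothetical bag-of-words corpus: term 0 occurs in both documents
toy_bow = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 1)],
]
toy_tfidf = tfidf_weights(toy_bow)
print(toy_tfidf)
```

Term 0 appears in every toy document, so its inverse document frequency is zero and it vanishes from both transformed documents, even though it is the most frequent term in the first one.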

In [None]:
# Fit the TF-IDF model on the bag-of-words corpus
tfidf = gensim.models.TfidfModel(corpus)

In [None]:
corpus[30]

In [None]:
# apply transformation
tfidf[corpus[30]]

In [None]:
corpus_transformed = tfidf[corpus]

In [None]:
corpus_transformed[30]

In [None]:
ldamodel_tfidf = gensim.models.ldamodel.LdaModel(
    corpus_transformed,
    num_topics=20,
    id2word=dictionary,
    passes=5,
)

In [None]:
for i in ldamodel_tfidf.print_topics(num_topics=20, num_words=20):
    print(i[0])
    print(i[1])
    print('\n')

In [None]:
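fig, ax = plt.subplots(5, 4, figsize=(15, 20))
ax = ax.flatten()
for topic_id, topic in ldamodel_tfidf.print_topics(num_topics=20, num_words=20):
    # Use fresh names: reusing `x` here would overwrite the list of posts from In [4]
    words = []
    weights = []
    # The split alternates weight, word, weight, word, ...
    for count, part in enumerate(re.split(re.escape(' + ') + '|' + re.escape('*'), topic)):
        if count % 2 == 0:
            weights.insert(0, float(part))
        else:
            words.insert(0, part.strip('"'))
    # Inserting at the front puts the heaviest word at the top of the bar chart
    ax[topic_id].barh(words, weights, height=0.5)
plt.tight_layout()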