# Topic modeling

We are going to look at data from the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  In particular, we'll look at postings to the following newsgroups:
* comp.graphics
* rec.motorcycles
* sci.med

Individual posts to these newsgroups are included as files under the `20fetch` folder, but the files are not organized or labeled by topic.

## LDA

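Latent Dirichlet Allocation (LDA) is a topic model that infers topics from the word frequencies in a set of documents. Each topic is a probability distribution over words, and each document is modeled as a mixture of topics.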
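As a rough illustration of the model behind LDA, the sketch below simulates the generative story it assumes: every topic is a distribution over words, every document draws its own mixture of topics, and each word comes from one topic chosen from that mixture. The vocabulary, the Dirichlet parameters, and the helper names (`sample_dirichlet`, `generate_document`) are made up for this example. Fitting an LDA model works in the opposite direction, estimating the topics and mixtures from observed word counts.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
vocab = ["image", "pixel", "bike", "ride", "doctor", "patient"]  # toy vocabulary (made up)
num_topics = 3

# Each topic is a probability distribution over the whole vocabulary.
topics = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

def generate_document(length):
    """Generate one toy document and return its topic mixture and words."""
    # Each document gets its own mixture over the topics.
    theta = sample_dirichlet([0.5] * num_topics, rng)
    words = []
    for _ in range(length):
        z = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topics[z])[0]          # pick a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document(8)
print(theta)
print(doc)
```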

The workflow has three steps:

* Build a "dictionary" that assigns an ID to every word and records word frequencies.
* Convert each document into a "bag of words": a list of its word IDs and their counts.
* Use gensim to fit an LDA model to the resulting corpus.
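The first two steps can be sketched in plain Python to show what the data structures look like. This is only an illustration on two made-up documents, with hypothetical names (`toy_docs`, `toy_token2id`, `doc_to_bow`); gensim builds the equivalent structures itself, and its ID assignment may differ from the first-appearance order used here.

```python
from collections import Counter

# Two tiny tokenized documents (made up for illustration).
toy_docs = [
    ["bike", "ride", "bike", "helmet"],
    ["doctor", "patient", "ride"],
]

# Step 1: the "dictionary" gives each distinct token an integer ID,
# here in order of first appearance.
toy_token2id = {}
for doc in toy_docs:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def doc_to_bow(doc):
    """Step 2: turn one document into sorted (token_id, count) pairs."""
    counts = Counter(toy_token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [doc_to_bow(d) for d in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per document are kept, which is all LDA uses.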

Nice example: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

## Gensim

* "Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning."
* [gensim website](https://radimrehurek.com/gensim/)

In [None]:
import pandas as pd
from pathlib import Path  
import glob

In [None]:
directory_path = '20fetch'
# Sort so that indexes such as text_files[0] refer to the same file on every run.
text_files = sorted(glob.glob(f"{directory_path}/*"))
text_files[0]

In [None]:
with open(text_files[0]) as f:
    x = f.read()

In [None]:
x

In [None]:
# Read every post; skip files that cannot be opened or decoded.
listOfNews = []
for i in text_files:
    try:
        with open(i) as f:
            listOfNews.append(f.read())
    except (UnicodeDecodeError, OSError):
        pass
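Skipping unreadable files means some posts may never reach the model. One possible alternative, sketched here on a throwaway folder, is to read every file with an encoding that accepts any byte sequence. The helper name `read_posts` is hypothetical, and `latin-1` is an assumption: the actual encoding of the files in `20fetch` is not stated, so decoded text may contain a few wrong characters.

```python
from pathlib import Path
import tempfile

def read_posts(folder, encoding="latin-1"):
    """Return the text of every regular file in `folder`, sorted by name."""
    posts = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # latin-1 maps every byte to a character, so decoding cannot fail.
            posts.append(path.read_text(encoding=encoding))
    return posts

# Demo on a temporary folder rather than the real data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_bytes(b"caf\xe9 racer")  # not valid UTF-8
    Path(tmp, "b.txt").write_bytes(b"plain ascii")
    demo_posts = read_posts(tmp)

print(len(demo_posts))
```

With this approach the number of posts read should match the number of regular files in the folder.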

In [None]:
len(text_files)

In [None]:
len(listOfNews)

In [None]:
listOfNews[45]

In [None]:
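from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
# If the tokenizer or stopword data is missing, download it once with
# nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').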

In [None]:
# Extra tokens to drop: tokenizer artifacts, contractions, and common filler or header words.
extrastop = ['``', "''", "'re", "'s", "'ll", "--", "...",
             "n't", 'one', 'would', 'use', 'subject', 'from', "'m", "'ve"]

In [None]:
# A set keeps the "not in myStopWords" checks fast.
myStopWords = set(punctuation) | set(stopwords.words('english')) | set(extrastop)

In [None]:
[w for w in word_tokenize(listOfNews[0].lower()) if w not in myStopWords]

In [None]:
# Lowercase and tokenize each post, dropping stopwords and punctuation.
listOfNewsWords = []
for i in listOfNews:
    listOfNewsWords.append([w for w in word_tokenize(i.lower()) if w not in myStopWords])

In [None]:
listOfNewsWords[0]

In [None]:
from nltk.stem.porter import PorterStemmer
#from nltk.stem import LancasterStemmer

In [None]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [None]:
# Reduce every word to its Porter stem so that inflected forms share one token.
listOfStemmedWords = []
for i in listOfNewsWords:
    listOfStemmedWords.append([p_stemmer.stem(w) for w in i])

In [None]:
listOfStemmedWords[0]

In [None]:
!pip install gensim

In [None]:
from gensim import corpora, models
import gensim

In [None]:
# Map each unique stemmed token to an integer ID.
dictionary = corpora.Dictionary(listOfStemmedWords)

In [None]:
print(dictionary.token2id)

In [None]:
# Convert each document into a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in listOfStemmedWords]

In [None]:
print(corpus[30])

In [None]:
print(dictionary.token2id['oil'])

In [None]:
# Fit a 3-topic LDA model; passes is the number of full sweeps over the corpus.
ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=3,
                                           id2word=dictionary,
                                           passes=20)

In [None]:
for i in ldamodel.print_topics(num_topics=3, num_words=20):
    print(i)

In [None]:
ldamodel.show_topic(1, topn=20)

In [None]:
# Print the 20 highest-weighted words of each topic.
for i in range(3):
    print('Topic '+str(i))
    for j in ldamodel.show_topic(i, topn=20):
        print(j[0])
    print('\n')

In [None]:
npasses = 100
ntopics = 3

# Refit with more passes, then print the top 20 words of each topic.
ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=ntopics,
                                           id2word=dictionary,
                                           passes=npasses)

for i in range(ntopics):
    print('Topic '+str(i))
    for j in ldamodel.show_topic(i, topn=20):
        print(j[0])
    print('\n')