# Topic Modeling

Two popular choices for topic models are Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). LSI is the simpler process and runs quite quickly. LDA is more complex, so it takes more resources and longer to run, but it tends to give more accurate results.
- LSI looks at the words in a document and their relationships to other words, with the important assumption that every word has only one meaning. (cf. https://en.wikipedia.org/wiki/Latent_semantic_indexing)
- LDA seeks to remedy this limitation by allowing a word to belong to multiple topics. It first groups words by topic, then compares each document against every topic to determine the best fit. (cf. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from string import punctuation
from gensim import corpora, models, similarities 


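stemmer = SnowballStemmer("english")

# tokenize_only() and docs are assumed to be defined earlier in the notebook
docs_tm = [tokenize_only(x) for x in docs]

# build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words("english"))
docs_tm = [[x for x in i if x not in stop_words and x not in punctuation] for i in docs_tm]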

We first create a dictionary that maps each unique word to an integer ID and tracks how often each word occurs across all documents.

In [None]:
# create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(docs_tm)

In [None]:
# remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
# no_below is absolute # of docs, no_above is fraction of corpus
dictionary.filter_extremes(no_below=40, no_above=.70)
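
To make the two thresholds concrete, here is a minimal pure-Python sketch of a document-frequency filter. It is not gensim's implementation: the function name `filter_by_doc_frequency` and the toy documents are hypothetical, and it ignores the additional `keep_n` cap that `filter_extremes` also applies.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Return the tokens kept by a document-frequency filter (illustrative only)."""
    n_docs = len(docs)
    # Count each token once per document it appears in (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        # no_below is an absolute document count; no_above is a fraction of the corpus.
        if no_below <= df <= no_above * n_docs
    }


# Hypothetical toy corpus of four tokenized documents.
toy_docs = [
    ["whale", "sea", "ship"],
    ["whale", "sea", "captain"],
    ["whale", "island", "ship"],
    ["whale", "sea", "storm"],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

With these toy values, a word that appears in every document is dropped as too common, and words that appear in only one document are dropped as too rare.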

The corpus we now create with doc2bow holds one sparse vector per document: a list of (word ID, count) pairs, with the IDs taken from the dictionary.

In [None]:
# convert each tokenized document to a bag-of-words vector using the dictionary
corpus = [dictionary.doc2bow(i) for i in docs_tm]
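
The shape of a bag-of-words vector can be sketched without gensim. The helpers `build_token_ids` and `to_bow` and the toy documents below are hypothetical stand-ins for `corpora.Dictionary` and `doc2bow`; the IDs gensim assigns may differ, so only the structure is meant to carry over.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token ID, count) pairs."""
    # Tokens missing from the mapping (e.g. filtered out earlier) are skipped.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical toy corpus.
toy_docs = [["whale", "sea", "whale"], ["sea", "ship", "storm"]]
token2id = build_token_ids(toy_docs)
bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bow)
```

Each document becomes a short list that only mentions the words it actually contains, which is what keeps the corpus compact even with a large vocabulary.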

We'll make a tf-idf (*term frequency-inverse document frequency*) matrix. Tf-idf weights a word by its frequency within a document, then discounts that weight according to how many documents in the corpus contain the word (more here: https://en.wikipedia.org/wiki/Tf–idf):

In [None]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
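
As a rough illustration of the weighting, the sketch below computes tf-idf by hand on a toy bag-of-words corpus. It assumes the scheme I understand to be gensim's default (raw count times log base 2 of N/df, followed by L2 normalization per document); the function name `tfidf_weights` and the data are hypothetical, and the real `TfidfModel` may differ in details.

```python
import math


def tfidf_weights(bow_corpus):
    """Weight (token ID, count) pairs by tf * log2(N / df), L2-normalized per document."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents does each token ID occur?
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in bow_corpus:
        raw = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        if norm == 0:
            # Every token occurs in every document, so nothing is informative.
            weighted.append([])
            continue
        # Drop zero weights so the vector stays sparse.
        weighted.append([(tid, w / norm) for tid, w in raw if w > 0])
    return weighted


# Hypothetical toy corpus: token 1 appears in all three documents.
toy_bow = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(0, 1), (1, 1), (3, 3)]]
result = tfidf_weights(toy_bow)
for doc in result:
    print(doc)
```

Token 1 occurs in every toy document, so its weight is zero and it disappears from the output, while a rare, frequently repeated token such as 3 dominates its document.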

## LSI

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=6)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(6)

## LDA

In [None]:
# we run chunks of 15 books, and update after every 2 chunks, and make 10 passes
lda = models.LdaModel(corpus, num_topics=6, 
                            update_every=2,
                            id2word=dictionary, 
                            chunksize=15, 
                            passes=10)

lda.show_topics()

In [None]:
# the LDA model was trained on the bag-of-words corpus, so apply it to the same representation
corpus_lda = lda[corpus]
for i, doc in enumerate(corpus_lda):  # the bow->lda transformation is executed here, on the fly
    print(titles[i], doc)
    print()
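
Each transformed document is a list of (topic ID, probability) pairs, so one way to get the "best fit" topic is to take the pair with the largest probability. The sketch below uses hypothetical titles and hand-written topic weights in place of the real `titles` and `corpus_lda`; `dominant_topic` is an illustrative helper, not part of gensim.

```python
def dominant_topic(topic_weights):
    """Return the (topic ID, probability) pair with the highest probability, or None."""
    if not topic_weights:
        return None
    return max(topic_weights, key=lambda pair: pair[1])


# Hypothetical stand-ins for titles and corpus_lda.
sample_titles = ["Book A", "Book B"]
sample_corpus_lda = [
    [(0, 0.12), (3, 0.85)],
    [(1, 0.55), (2, 0.40), (5, 0.05)],
]

assignments = {
    title: dominant_topic(doc)
    for title, doc in zip(sample_titles, sample_corpus_lda)
}
for title, best in assignments.items():
    print(title, best)
```

Keeping only the top topic discards information, so for documents with two similarly weighted topics it may be worth inspecting the full list instead.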

For more with gensim, see the tutorials here: https://radimrehurek.com/gensim/tutorial.html