## Note:

Extra packages to be installed

* **wordcloud** --> `pip install wordcloud`
* **stop_words** -->  `pip install stop-words`

# Introduction to Natural Language Processing

## What is NLP?

>Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.  -- Wikipedia 

* We interact with technology every single day of our lives, but humans and computers communicate in fundamentally different ways.

* Languages like Python, R and C have well-defined rules, so a statement can have only one meaning.

* However, the languages that humans speak are far less structured: a single word or phrase can mean different things in two different scenarios.

* Most communication happens in such natural languages. Hence, NLP is essential to unlock any meaningful utility wherever human language is involved.

## Examples of NLP Applications

1. Machine Translation
2. Search Engines
3. Spam Filters
4. Summarization

In [None]:
# import sys  
# reload(sys)  
# sys.setdefaultencoding('utf8')

## Data:

## The 20 Newsgroups dataset

* [Official Website](http://qwone.com/~jason/20Newsgroups/)
* The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
* The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

## The 20 Newsgroups dataset

* In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn.
* In order to get faster execution times, we will work on a partial dataset with only 4 categories out of the 20 available in the dataset.

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"


## The 20 Newsgroups dataset

The returned dataset is a scikit-learn “bunch”:

* a simple holder object whose fields can be accessed either as Python dict keys or as object attributes, for convenience
* for instance, the *target_names* holds the list of the requested category names:

In [None]:
twenty_train.target_names

## Agenda

* Tokenization
* Stop-word Removal
* Stemming
* Word Cloud
* TF-IDF

## Constructing the Datasets

It would be desirable to split the dataset into the following parts:

* X_train
* y_train
* X_test
* y_test

In [2]:
# train_test_split lives in sklearn.model_selection (the old cross_validation module was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(twenty_train.data, twenty_train.target, train_size = 0.8)



## Constructing the Datasets - X_train

In [None]:
for i in range(5):
    print(X_train[i])

In [None]:
len(X_train)

## Constructing the Datasets - X_test

In [None]:
X_test[0]

## Converting list to Pandas Series

In [3]:
import pandas as pd

X_train = pd.Series(X_train)
X_test = pd.Series(X_test)

X_train[0]

u"From: romdas@uclink.berkeley.edu (Ella I Baff)\nSubject: Re: Good Grief!  (was Re: Candida Albicans: what is it?)\nOrganization: University of California, Berkeley\nLines: 9\nDistribution: world\nNNTP-Posting-Host: uclink.berkeley.edu\n\n   >If anybody, doctors included, said to me to my face that there is no\n   >evidence of the 'yeast connection', I cannot guarantee their safety.\n   >For their incompetence, ripping off their lips is justified as far as\n   >I am concerned.\n\nThis doesn't sound like Candida Albicans to me.\n\nJohn Badanes, DC, CA\nromdas@uclink.berkeley.edu\n"

## Constructing the Datasets - y_train

In [None]:
y_train

## Constructing the Datasets - y_test

In [None]:
y_test

## Tokenization

Tokenization breaks unstructured text into chunks of information that can be counted as discrete elements. These counts of token occurrences in a document can be used directly as a vector representing that document.

This immediately turns an unstructured string (a text document) into a structured, numerical data structure suitable for machine learning.

* Tokenization segments a document into its atomic elements (tokens)
* Typically, our tokens are the words
    - As an example where characters will be more appropriate as tokens, consider Language Detection
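
To make the idea of "counts as a vector" concrete, here is a minimal sketch using only the standard library. The sample sentence and the whitespace tokenizer are illustrative assumptions, not part of the dataset or of NLTK.

```python
from collections import Counter

doc = "the cat sat on the mat"

# Word-level tokens: split on whitespace (a deliberately naive tokenizer).
word_counts = Counter(doc.split())

# Character-level tokens: every non-space character is a token.
char_counts = Counter(ch for ch in doc if not ch.isspace())

# Fixing an order for the vocabulary turns the counts into a vector.
vocabulary = sorted(word_counts)
word_vector = [word_counts[term] for term in vocabulary]

print(vocabulary)
print(word_vector)
print(char_counts.most_common(3))
```

The same document yields very different vectors depending on whether words or characters are chosen as tokens, which is why the choice should follow the task.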

## Tokenization

In [4]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

* The pattern above matches runs of word characters and stops at any non-word character, such as a space or a punctuation mark
* This can cause problems for words like *don’t*, which will be read as two tokens - *don* and *t*
* A better tokenizer is `TreebankWordTokenizer`, which breaks words like *don't* into *do* and *n't*
* NLTK provides a number of pre-constructed tokenizers (like nltk.tokenize.simple)
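
The effect of the `\w+` pattern on contractions can be reproduced with Python's own `re` module. This is a rough sketch: the sample sentence and the second, apostrophe-aware pattern are assumptions made for illustration and are not what NLTK's Treebank tokenizer does internally.

```python
import re

sentence = "Don't split contractions carelessly."

# Same pattern as above: keep runs of word characters only.
word_tokens = re.findall(r"\w+", sentence.lower())

# Illustrative alternative: allow one apostrophe inside a word.
apostrophe_tokens = re.findall(r"\w+(?:'\w+)?", sentence.lower())

print(word_tokens)
print(apostrophe_tokens)
```

The first list contains the fragments *don* and *t*, while the second keeps the contraction as one token.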

## Tokenization

In [5]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [6]:
X_train = X_train.apply(lambda row: row.lower())
X_train = X_train.apply(lambda row: tokenizer.tokenize(row))
X_train.head(20)

0     [from, :, romdas, @, uclink.berkeley.edu, (, e...
1     [from, :, i3150101, @, dbstu1.rz.tu-bs.de, (, ...
2     [from, :, jonh, @, david.wheaton.edu, (, jonat...
3     [from, :, steve.hayes, @, f22.n7101.z5.fidonet...
4     [from, :, dxf12, @, po.cwru.edu, (, douglas, f...
5     [from, :, atterlep, @, vela.acs.oakland.edu, (...
6     [from, :, add, @, sciences.sdsu.edu, (, james,...
7     [from, :, gt7122b, @, prism.gatech.edu, (, bou...
8     [from, :, merlin, @, neuro.usc.edu, (, merlin,...
9     [from, :, reedr, @, cgsvax.claremont.edu, subj...
10    [from, :, g.coulter, @, daresbury.ac.uk, (, g....
11    [from, :, markl, @, hunan.rastek.com, (, mark,...
12    [from, :, lawrence, curcio, <, lc2b+, @, andre...
13    [from, :, news, @, cbnewsk.att.com, subject, :...
14    [from, :, vek, @, allegra.att.com, (, van, kel...
15    [from, :, morley, @, suncad.camosun.bc.ca, (, ...
16    [from, :, jwindley, @, cheap.cs.utah.edu, (, j...
17    [from, :, halat, @, pooh.bears, (, jim, ha

## Stop-word Removal

* Certain parts of speech in English, like conjunctions (“for”, “or”) or the word “the”, carry little meaning for a topic model. These terms are called stop words and need to be removed from our token list.
* The definition of a stop word is flexible and depends on the kind of documents being modeled. For example
    - if we’re topic modeling a collection of music reviews, then terms like *The Who* will have trouble being surfaced because *the* is a common stop word and is usually removed
* One should always carefully consider if any of the likely topics may have common stop-words in them and modify the list of stop-words accordingly
* Let's look at the stopwords from the *stop_words* package, a [relatively conservative list](https://github.com/Alir3z4/stop-words/blob/master/english.txt).

In [None]:
from stop_words import get_stop_words

# create English stop words list
en_stop = get_stop_words('en')
print(en_stop)

## Impact of stop-word removal

In [None]:
X_train = X_train.apply(lambda row: [i for i in row if i not in en_stop])
X_train[0]
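
The advice to adapt the stop-word list to the topics at hand can be sketched as follows, using the *The Who* case from above. The tiny stop list, the sample sentence and the helper names `remove_stop_words` and `protect_phrases` are all hypothetical and exist only for this illustration.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word collection."""
    return [t for t in tokens if t not in stop_words]


def protect_phrases(tokens, phrases):
    """Join known multi-word names into one token before filtering."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append("_".join(phrase))
                i += n
                break
        else:
            # no phrase matched at this position: keep the token as is
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical, very small stop list; real lists are much longer.
stop_words = {"the", "who", "a", "is", "of"}
tokens = "the who played a great show".split()

naive = remove_stop_words(tokens, stop_words)
protected = remove_stop_words(
    protect_phrases(tokens, [("the", "who")]), stop_words
)

print(naive)
print(protected)
```

Merging the phrase before filtering is only one option; removing the affected words from the stop list would be another.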

## Stemming

* Stemming is another common technique that reduces topically similar words to a common root. For example, 
    - *stemming*, *stemmer*, *stemmed* all have similar meanings
    - a stemmer reduces such terms toward the shared root *stem* (how many of the variants are actually merged depends on the algorithm)
    - This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model
* Stemming algorithms differ in how aggressive they are. [The Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) is one of the most widely used methods
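
To show what "reducing words to a root" means mechanically, here is a deliberately crude suffix stripper. It is a toy written for this illustration and is not the Porter algorithm, whose rules are considerably more careful; the function name `toy_stem` and its suffix list are assumptions.

```python
def toy_stem(word):
    """Strip one common suffix. A toy rule set, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "er", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # collapse a doubled final consonant, e.g. "stemm" -> "stem"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


words = ["stemming", "stemmer", "stemmed", "stems", "red"]
stems = [toy_stem(w) for w in words]
print(dict(zip(words, stems)))
```

A rule set this aggressive merges all the *stem* variants, but it would also mangle many ordinary words, which is the trade-off real stemmers try to balance.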

In [None]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
print(p_stemmer)

## The Porter Stemmer

* *p_stemmer* requires all tokens to be type str
* p_stemmer returns the string parameter in stemmed form

In [None]:
# X_train = X_train.apply(lambda row: [p_stemmer.stem(i) for i in row])
# X_train.head()

In [None]:
text = []
for tokens in X_train:
    # stem every token of the current document
    tokens = [p_stemmer.stem(token) for token in tokens]
    # collect all stemmed tokens in one flat list
    text.extend(tokens)
    print(tokens)

## WordCloud

WordClouds are a quick way to check the result of our preprocessing steps and debug them.

In [None]:
textall = " ".join(text)
textall

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

wordcloud = WordCloud().generate(textall)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

## WordCloud with lower max font size

In [None]:
# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(textall)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

## WordCloud with additional stopwords

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# optional mask image; it is loaded here but not passed to WordCloud below
haptik_mask = np.array(Image.open("./images/haptik.png"))

# start from the default stop-word set and add our own words
stopwords = set(STOPWORDS)
stopwords.add("flight")

wc = WordCloud(max_words=2000, stopwords=stopwords)
# generate word cloud
wc = wc.generate(textall)

# show
plt.figure()
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

## Term-Document Matrix: Representing text as numerical data

* **What is TDM**:  A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. 

Consider a corpus of documents and a dictionary of terms containing all the words that appear in the documents. The term-document matrix is then a two-dimensional matrix whose rows are the terms and whose columns are the documents, so each entry (i, j) represents the frequency of term i in document j. ([source](https://www.quora.com/What-is-a-term-document-matrix))

* **Why we care about TDM**:

    1. Content words that appear several times in a document are probably more meaningful than content words that appear just once.
    2. Words that appear frequently in our document as well as in other documents (like "the", "a", "an" etc.) do not convey much meaning.
    3. Infrequently used words are likely to be more interesting than common words.
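
A term-document matrix for a tiny corpus can be built by hand to see the layout described above. The two sample sentences are assumptions chosen for illustration, and whitespace splitting stands in for a real tokenizer.

```python
docs = [
    "good broccoli is good",
    "my brother likes broccoli",
]
tokenized = [d.split() for d in docs]

# The dictionary of terms: every distinct word in the corpus, in sorted order.
terms = sorted({t for doc in tokenized for t in doc})

# Term-document matrix: one row per term, one column per document.
tdm = [[doc.count(term) for doc in tokenized] for term in terms]

# Transposing gives the document-term layout (one row per document).
dtm = [list(col) for col in zip(*tdm)]

for term, row in zip(terms, tdm):
    print(f"{term:10s} {row}")
```

Entry (i, j) of `tdm` is the frequency of term i in document j; `dtm` holds the same numbers with rows and columns swapped.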

## Text Preprocessing using `sklearn`

`sklearn`'s `feature_extraction` module provides a convenient API, **CountVectorizer**, to "convert raw text into a matrix of token counts" along with all the text processing steps we covered.

In [7]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [8]:
vect

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]:
# use TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
vect.set_params(tokenizer=tokenizer.tokenize)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method TreebankWordTokenizer.tokenize of <nltk.tokenize.treebank.TreebankWordTokenizer object at 0x117e9a8d0>>,
        vocabulary=None)

In [10]:
# remove English stop words
vect.set_params(stop_words='english')

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method TreebankWordTokenizer.tokenize of <nltk.tokenize.treebank.TreebankWordTokenizer object at 0x117e9a8d0>>,
        vocabulary=None)

In [11]:
# include 1-grams and 2-grams
vect.set_params(ngram_range=(1, 2))

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method TreebankWordTokenizer.tokenize of <nltk.tokenize.treebank.TreebankWordTokenizer object at 0x117e9a8d0>>,
        vocabulary=None)

In [12]:
# ignore terms that appear in more than 50% of the documents
vect.set_params(max_df=0.5)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.5, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method TreebankWordTokenizer.tokenize of <nltk.tokenize.treebank.TreebankWordTokenizer object at 0x117e9a8d0>>,
        vocabulary=None)

In [13]:
# only keep terms that appear in at least 2 documents
vect.set_params(min_df=2)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method TreebankWordTokenizer.tokenize of <nltk.tokenize.treebank.TreebankWordTokenizer object at 0x117e9a8d0>>,
        vocabulary=None)
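
The `max_df` and `min_df` settings above filter the vocabulary by document frequency. The sketch below illustrates the idea in plain Python on a made-up, already tokenized corpus; it is not scikit-learn's implementation.

```python
from collections import Counter

# Hypothetical tokenized documents, used only for this illustration.
docs = [
    ["graphics", "card", "driver"],
    ["graphics", "image", "format"],
    ["graphics", "card", "memory"],
    ["graphics", "doctor", "image"],
]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

min_df = 2    # keep terms that appear in at least 2 documents
max_df = 0.5  # drop terms that appear in more than 50% of the documents

vocabulary = sorted(
    term
    for term, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(vocabulary)
```

Here *graphics* is dropped because it occurs in every document, and the terms seen only once are dropped because they fall below `min_df`.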

**Note:** `vect` expects raw text documents (one string per document) rather than token lists. Hence, we rebuild X_train from the original data.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(twenty_train.data, twenty_train.target, train_size = 0.8)

X_train

[u'From: brein@jplpost.jpl.nasa.gov (Barry S. Rein)\nSubject: Need survival data on colon cancer\nOrganization: Jet Propulsion Laboratory\nLines: 17\nDistribution: world\nNNTP-Posting-Host: desa.jpl.nasa.gov\n\nA relative of mine was recently diagnosed with colon cancer.  I would like\nto know the best source of survival statistics for this disease when\ndiscovered at its various stages.\n\nI would prefer to be directed to a recent source of this data, rather than\nreceive the data itself.\n\nThank you,\n****************************************************************************\n*                              Barry Rein                                 \n*\n*                       brein@jplpost.jpl.nasa.gov                        \n*\n****************************************************************************\n*                            No clever comment.                           \n* \n****************************************************************************\n',
 u"From: djohns

In [15]:
X_train = pd.Series(X_train)
X_test = pd.Series(X_test)

X_train

0       From: brein@jplpost.jpl.nasa.gov (Barry S. Rei...
1       From: djohnson@cs.ucsd.edu (Darin Johnson)\nSu...
2       From: jemurray@magnus.acs.ohio-state.edu (John...
3       From: mangoe@cs.umd.edu (Charley Wingate)\nSub...
4       From: bosch@rz.uni-karlsruhe.de (Gerhard Bosch...
5       From: healta@saturn.wwc.edu (TAMMY R HEALY)\nS...
6       From: madler@cco.caltech.edu (Mark Adler)\nSub...
7       From: mpaul@unl.edu (marxhausen paul)\nSubject...
8       From: topcat!tom@tredysvr.tredydev.unisys.com ...
9       From: yozzo@watson.ibm.com (Ralph Yozzo)\nSubj...
10      From: vek@allegra.att.com (Van Kelly)\nSubject...
11      From: sp1marse@kristin (Marco Seirio)\nSubject...
12      From: conditt@tsd.arlut.utexas.edu (Paul Condi...
13      From: drt@athena.mit.edu (David R Tucker)\nSub...
14      From: u0mrm@csc.liv.ac.uk (M.R. Mellodew)\nSub...
15      From: cliff@watson.ibm.com (cliff)\nSubject: R...
16      From: sandvik@newton.apple.com (Kent Sandvik)\...
17      From: 

In [16]:
# learn the 'vocabulary' of the training data
vect.fit(X_train)

# examine the fitted vocabulary
# (scikit-learn 1.0 and later provide get_feature_names_out() for this)
vect.get_feature_names()

[u'!',
 u'! !',
 u"! ''",
 u"! 'd",
 u"! 'm",
 u"! 'q",
 u"! 're",
 u"! 's",
 u"! 've",
 u'! (',
 u'! )',
 u'! *',
 u'! **',
 u'! ******************************************************************',
 u'! ************************************************************************',
 u'! ,',
 u'! -',
 u'! --',
 u'! .',
 u'! ...',
 u'! 68070',
 u'! 8-',
 u'! :',
 u'! =',
 u'! >',
 u'! ?',
 u'! [',
 u'! ]',
 u'! ^^^^^^',
 u"! ___.'*",
 u'! ``',
 u'! act-up',
 u'! adagio.panasonic.com',
 u'! allergies',
 u'! archive-server',
 u'! ariadne',
 u'! armstrng',
 u'! article-i.d.',
 u'! ata',
 u'! awdprime.austin.ibm.com',
 u'! bbenowit',
 u'! caf',
 u'! cboesel',
 u'! cheers',
 u'! christians',
 u'! clyde',
 u'! comments',
 u'! course',
 u'! dalcs',
 u'! day',
 u'! did',
 u'! does',
 u'! dyer',
 u'! elroy.jpl.nasa.gov',
 u'! end',
 u'! enjoy',
 u'! ergonomic',
 u'! europa.eng.gtefsd.com',
 u'! expected',
 u'! far',
 u'! fido',
 u'! fight',
 u'! finally',
 u'! geraldo.cc.utexas.edu',
 u'! gerry.palo'

Next, we transform training data into a 'document-term matrix'

In [17]:
simple_train_dtm = vect.transform(X_train)
simple_train_dtm

<1805x62748 sparse matrix of type '<type 'numpy.int64'>'
	with 368081 stored elements in Compressed Sparse Row format>

In [18]:
test_dtm = vect.transform(X_test)

Next, we examine the vocabulary and document-term matrix together

In [20]:
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names()).head(4)

Unnamed: 0,!,! !,! '',! 'd,! 'm,! 'q,! 're,! 's,! 've,! (,...,} texture,} },~,~ (,~~~,~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~,~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ live,~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~,~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |,ÿ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
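
The last sentence of the quote, that word positions are ignored, is easy to check by hand, and it also shows why an n-gram range such as (1, 2) can help. The two sentences and the `ngrams` helper are assumptions made for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a = "dog bites man".split()
b = "man bites dog".split()

# With single words only, both sentences get the same bag of words.
unigrams_equal = Counter(a) == Counter(b)

# Bigrams keep a little word order, so the two bags differ.
bigrams_a = Counter(ngrams(a, 2))
bigrams_b = Counter(ngrams(b, 2))

print(unigrams_equal)
print(sorted(bigrams_a), sorted(bigrams_b))
```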

In [21]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(simple_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [22]:
# make class predictions for test_dtm
y_pred_class = nb.predict(test_dtm)

In [23]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.95132743362831862

In [24]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 87,   0,   0,   4],
       [  0, 102,   2,   2],
       [  0,   4, 119,   3],
       [  2,   3,   2, 122]])
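
The accuracy reported above can be recovered from the confusion matrix itself: the diagonal holds the correctly classified documents. The sketch below copies the matrix from the output and assumes the usual scikit-learn layout, with rows as true classes and columns as predicted classes.

```python
confusion = [
    [87, 0, 0, 4],
    [0, 102, 2, 2],
    [0, 4, 119, 3],
    [2, 3, 2, 122],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Per-class recall: diagonal entry divided by the row total.
recall = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(correct, total, accuracy)
print([round(r, 3) for r in recall])
```

430 of the 452 test documents are on the diagonal, which gives the same value as `metrics.accuracy_score`.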

## Text processing using `gensim`

Install necessary packages: **gensim**  --> `pip install gensim`

## Getting Started with Term Document Matrix using `nltk` and `stop-words`

* Our goal is *texts* - a tokenized, stopped and stemmed list of words for each document
* We loop through all our documents and append each processed document to *texts*
* In the end, *texts* is a list of lists, one list for each of our original documents

In [None]:
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

In [None]:
tokenizer = TreebankWordTokenizer()

In [None]:
# create English stop words list
en_stop = get_stop_words('en')

In [None]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
    

In [None]:
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health." 

In [None]:
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

In [None]:
# list for tokenized documents in loop
texts = []

In [None]:
# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

print(texts, "\n")

print("===========================================")

for line in texts:
    print(line)

## Topic Modeling

https://en.wikipedia.org/wiki/Topic_model
* A type of statistical model for discovering the abstract *topics* that occur in a collection of documents
* Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently:
    - *dog* and *bone* will appear more often in documents about dogs
    - *cat* and *meow* will appear in documents about cats
    - *the* and *is* will appear about equally often in both
* A document typically concerns multiple topics in different proportions
    - In a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words
* The *topics* produced by topic modeling techniques are clusters of similar words

## Latent Dirichlet Allocation

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
* Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar
* For example, if observations are words collected into documents, then
    - each document is a mixture of a small number of topics, and 
    - each word's creation is attributable to one of the document's topics

## LDA Model

* LDA assumes documents are produced from a mixture of topics.
* Those topics then generate words based on their probability distribution, like the ones in our walkthrough model.
* In other words, LDA assumes a document is made from the following steps:
    - Determine the number of words in a document.
        - Let’s say our document has 6 words.
    - Determine the mixture of topics in that document.
        - For example, the document might contain 1/2 the topic “health” and 1/2 the topic “vegetables.”
    - Using each topic’s multinomial distribution, output words to fill the document’s word slots.
        - In our example, the “health” topic is 1/2 our document, or 3 words.
        - The “health” topic might have the word “diet” at 20% probability or “exercise” at 15%, so it will fill the document word slots based on those probabilities.
* Given this assumption of how documents are created, LDA backtracks and tries to figure out what topics would create those documents in the first place.
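
The generative story above can be simulated in a few lines. This is a sketch under stated assumptions: only the 20% for "diet" and 15% for "exercise" come from the example, while every other word and probability is invented so that each distribution sums to 1. Each word slot draws its own topic, so the 3-word share for "health" holds only on average.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (only "diet" and "exercise"
# use the probabilities from the example above).
topics = {
    "health": {"diet": 0.20, "exercise": 0.15, "doctor": 0.65},
    "vegetables": {"broccoli": 0.5, "carrot": 0.3, "spinach": 0.2},
}
topic_mixture = {"health": 0.5, "vegetables": 0.5}
n_words = 6


def sample(distribution):
    """Draw one key of the dict with probability given by its value."""
    items = list(distribution)
    weights = [distribution[item] for item in items]
    return random.choices(items, weights=weights, k=1)[0]


document = []
for _ in range(n_words):
    topic = sample(topic_mixture)            # pick a topic for this slot
    document.append(sample(topics[topic]))   # pick a word from that topic

print(document)
```

Fitting an LDA model runs this story in reverse: given only documents like `document`, it estimates the topic mixtures and the topic-word distributions.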


* To generate an LDA model, we need to understand how frequently each term occurs within each document



To do that, we need to construct a document-term matrix with *gensim*

In [None]:
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
print(dictionary)

* The Dictionary() function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics
* To see each token’s unique integer id, try -

In [None]:
print(dictionary.token2id)

Next, each document must be converted into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation using our dictionary -

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

In [None]:
print(corpus[0])

In [None]:
print(corpus[1])

In [None]:
for line in corpus:
    print(line)

* The doc2bow() method of the dictionary converts a tokenized document into a bag-of-words
* The result, *corpus*, is a list of vectors, one per document
* Each document vector is a series of tuples
* The tuples are (term ID, term frequency) pairs
* Only terms that actually occur are included - terms that do not occur in a document will not appear in that document’s vector
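
The (term ID, term frequency) format can be imitated in plain Python to see what the vectors contain. This sketch only mimics the output format; it is not gensim's implementation, the ids gensim assigns may differ, and the names `toy_texts`, `toy_token2id` and `toy_doc2bow` are hypothetical.

```python
from collections import Counter

# Two small, already stemmed documents, invented for illustration.
toy_texts = [
    ["brocolli", "good", "eat", "good"],
    ["mother", "brother", "eat"],
]

# Assign each new token the next free integer id.
toy_token2id = {}
for text in toy_texts:
    for token in text:
        toy_token2id.setdefault(token, len(toy_token2id))


def toy_doc2bow(text):
    """Return sorted (term id, count) pairs for terms that occur in text."""
    counts = Counter(toy_token2id[t] for t in text if t in toy_token2id)
    return sorted(counts.items())


toy_corpus = [toy_doc2bow(text) for text in toy_texts]
print(toy_token2id)
print(toy_corpus)
```

Terms missing from a document simply have no tuple in its vector, which is what makes the representation sparse.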

Looking at the data above, please answer the following:
* How many times does *basebal* occur in *doc_a*?
* How many times does *basebal* occur in *doc_b*?
* How many times does *health* occur in *doc_e*?
* Give an example of a word that occurs in *doc_a* but doesn't occur in *doc_b*.
* How many times does *brother* occur in all the documents?

## The LDA Model

*corpus* is a (sparse) document-term matrix and now we’re ready to generate an LDA model

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

https://radimrehurek.com/gensim/models/ldamodel.html
* num_topics
    - optional in the gensim API (default 100), but it should be chosen deliberately
    - An LDA model requires the user to determine how many topics should be generated
    - Our document set is small, so we’re only asking for three topics
* id2word
    - needed here so that topics are printed as words
    - The LdaModel class uses our previous dictionary to map ids to strings
* passes
    - optional
    - The number of laps the model will take through corpus
    - More passes generally give the model more chance to converge, at the cost of training time
    - A lot of passes can be slow on a very large corpus.

In [None]:
print(ldamodel)

In [None]:
print(ldamodel.print_topics())

In [None]:
print(ldamodel.print_topics(num_topics=2))

In [None]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

* Each generated topic is separated by a comma
* Within each topic are the three most probable words to appear in that topic

Let's now look at a topic in detail - 

In [None]:
print(ldamodel.print_topic(topicno=0))

In [None]:
print(ldamodel.print_topic(topicno=1))

In [None]:
print(ldamodel.print_topic(topicno=2))

## Refine the model

In [None]:
for line in ldamodel.print_topics(num_topics=3, num_words=3):
    print(line)

* Even though our document set is small the model is reasonable
* Third Topic - health, brocolli and good make sense together

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)

In [None]:
for line in ldamodel.print_topics(num_topics=2, num_words=4):
    print(line)

Let's try it with more passes

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=200)

In [None]:
for line in ldamodel.print_topics(num_topics=2, num_words=4):
    print(line)