# Applications_with_NLP

**Objective:** A general introduction to NLP applications to serve as a foundation for further self study and additional NLP curriculums. This notebook intends to serve as a basis for various techniques such as vectorizing text, classifying text and topic modeling.

**Prerequisites:**

Open your terminal/Anaconda Prompot, cd to the lecture code folder and run the following commands:

`pip install -r requirements.txt`

## Word Embedding

- We have examined methods for preprocessing text in the previous NLP Jump Start, let's examine methods of converting text to vectors of the modeling stage. Within NLP, there are two predominant methods: **frequency based** and **prediction based**.

- When you hear of "bag-of-words" or "TF-IDF", it is referring to frequency based while "word2vec" or "doc2vec" refer to prediction based. 

- Note, **NEITHER** of these embeddings are your actual classifiers/regressors but instead used to understand text behavior or pass as the inputs into a predictive model.

- For purposes of this tutorial, we will focus on the **frequency based** methods as the others often uses pre-trained neural network models and as a result, has less use for understanding of foundational NLP methods. 

- Since frequency based methods lean traditionally on the bag-of-words method, it doesn't capture positions or semantics.  However, despite this, the output vectors of these models can perform well in a variety of NLP predictive problems. (ex. a high frequency of "free", despite various positional text, generally would indicate an email is spam in the classic example)

### CountVectorizer

- Often this method is often referred to as "bag-of-words" but CountVectorizer can includes methods of n-grams to handle more complex tasks. At it's basic principle, it's simply a count of words (or permutation of words) throughout the document.

- In this scheme, features and samples are defined as follows:
 - each **individual token occurrence frequency** is treated as a **feature**.
 - the vector of all the token frequencies for a given **document** is considered a **multivariate sample**.
- A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.


- **The data**: 20 newsgroups dataset that we will use today comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [6]:
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


- Note `newsgroups_train.data` will be a list of strings. The same method and code structure will work if you pass the column of text from your pandas dataframe as well. Let's walk through how to transform the data through CountVectorization and TF-IDF respectively.

In [7]:
print(newsgroups_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [9]:
category_index = newsgroups_train.target[0]
print(newsgroups_train.target_names[category_index])

rec.autos


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
count_vec = CountVectorizer()
X_train_count = count_vec.fit_transform(newsgroups_train.data)
X_train_count.shape

(11314, 130107)

- As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
- In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the `scipy.sparse` matrix

In [12]:
type(X_train_count)

scipy.sparse.csr.csr_matrix

In [13]:
import numpy as np
np.sum(X_train_count.todense()[0])

122

- Array mapping from feature integer indices to feature name

In [14]:
print(len(count_vec.get_feature_names()))

130107


- Print out the token in the corpus every 1000 words.

In [15]:
print(count_vec.get_feature_names()[::1000])

['00', '045032', '0p7', '1102', '13500', '15o6', '185117', '1kqspt', '1rh8zrck', '22182', '26m', '2nw', '34ij', '3le', '47j', '50s2', '5ily', '65h', '6tle', '7566', '7zt8caxgcs', '8fss', '93n', '9v1', '_law', 'a87w', 'ad99s461', 'airplanes', 'amusements', 'apr2207', 'astronomy_', 'b03a', 'basil', 'bevelled', 'bm5ld2i', 'bridegroom', 'byk7', 'c5x9xs', 'carvings', 'changing', 'citilille', 'collegian', 'connetions', 'coutry', 'cufnhuu6', 'dani', 'deglitching', 'dhs03', 'disturbance', 'dressed', 'e94eg9', 'eldon', 'envolved', 'evolutionary', 'f941', 'fetuses', 'flux', 'fronts', 'galations', 'ghosted', 'gqx3t', 'guzzi', 'hardwired', 'hflvh', 'hording', 'hysterically', 'iin', 'inen', 'intl', 'iyb', 'jggx6m', 'july26', 'kbjo9', 'kl5', 'kzm', 'le11w4teb', 'lindquist', 'lqqc0k', 'm4i', 'mainframes', 'maxvill', 'melmon', 'mightily', 'mmle', 'movments', 'mushy', 'nambla', 'nexus', 'notreached', 'objecten', 'ooy', 'overhang', 'paramax', 'performances', 'pissed', 'populous', 'priveledge', 'punjab',

- If we want to count not just **uni-gram** but **bigram** as well, we can set the `ngram_range=(1, 2)`. 
- Check the size of the output matrix

In [16]:
count_vec = CountVectorizer(ngram_range=(1,2))
X_train_count = count_vec.fit_transform(newsgroups_train.data)
X_train_count.shape

(11314, 1181803)

- Print out the unigram and bigrams every 5000 words.

In [17]:
print(count_vec.get_feature_names()[::5000])

['00', '06w', '100 sunnyvale', '12 darryl', '142902', '169 671', '1958 4283', '1acj', '1z6e pmf9f', '23 49', '27 251', '2qqh1d2 pn', '34 lidstrom', '3e4', '44 473', '4tb pb', '583 907', '63d3e4', '6rkxb 15la', '76 38', '814 238', '8y syx_scx', '9_ qw', '_j physiol', 'aborts or', 'accumulated huge', 'adjustable arm', 'age country', 'algorithm literature', 'also highly', 'amtrack mr', 'and commander', 'and pleading', 'annotate subsequent', 'anything compared', 'arabia an', 'arguing on', 'article c5n3x0', 'aset', 'atari or', 'autos they', 'b8nr1d9 sld9', 'baseball haven', 'be muslims', 'beethoven 45th', 'best buy', 'billion year', 'bmp 14', 'botha are', 'brokers agents', 'business men', 'by persecution', 'ca mece7187', 'can slide', 'cars only', 'center supposedly', 'cheapest teams', 'chronicus can', 'claude lecommandeur', 'codeset part', 'come ll', 'completely settled', 'congrete', 'contrib gnuplot3', 'cosmetic industry', 'craziest thing', 'cullen to', 'dachsel internet', 'days their', 'd

- Depending on the size of your corpus, you can also set the `min_df` parameter 
 - ignore terms that have a document frequency lower than the given threshold.
 - the size of the output matrix is much smaller than the previous ones.

In [18]:
count_vec = CountVectorizer(ngram_range=(1,2), min_df=10)
X_train_count = count_vec.fit_transform(newsgroups_train.data)
X_train_count.shape

(11314, 50399)

In [19]:
print(count_vec.get_feature_names()[::1000])

['00', '320x200', 'about 70', 'allowed to', 'and night', 'are limited', 'aurora alaska', 'before but', 'brown edu', 'car and', 'clesun central', 'consider it', 'd3 d1', 'director of', 'easiest way', 'eskimo com', 'few cases', 'free if', 'go in', 'has given', 'hizbollah', 'in canada', 'into consideration', 'it takes', 'knowledgeable', 'lines what', 'marv', 'mob', 'necessity', 'now let', 'of view', 'org au', 'payments', 'postage', 'quality of', 'regulations', 'run at', 'series', 'software package', 'strange', 'techno', 'the century', 'the ozone', 'theologians', 'time so', 'told to', 'university lines', 'virtual memory', 'wetware', 'without being', 'you ll']


### TF-IDF

- A **Term Frequency** is a count of how many times a word occurs in a given document (synonymous with bag of words). The **Inverse Document Frequency** is the the number of times a word occurs in a corpus of documents. TF-IDF goes one step further. It applies a frequency count but then penalizes it by dividing it across the appearance throughout all documents.

- One possible definition of TF-IDF is:

$$tf_{t,d} = log(1+f_{t,d})$$

$$idf_{t,d} = log(1+\frac{N}{df_{t}})$$

$$w_{t,d} = tf_{t,d}\times idf_{t,d}$$

where N is the number of documents in the corpus

**Question:** Why is log used when calculating term frequency and inverse document frequency?

- If term frequency for word 'AI' in doc1 is 10 and doc2 is 20, we can say that doc2 is more relevant than doc1 for word 'AI'. However, if the term frequency of the same word, 'AI' for doc1 is 1 million and doc2 is 2 million, at this point, there is no much different in term of relevant anymore because they both contain a very high count for term 'AI'.

- Adding log is to **dampen** the importance of term that has a high frequency, e.g. Using log base 2, the count of 1 million will be reduced to 19.9! 

- We also add 1 to the log(tf) because when tf is equal to 1, the log(1) is zero. by adding one, we distinguish between tf=0 and tf=1.
- The `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
tf_idf = TfidfVectorizer(ngram_range=(1,2), min_df=10)
X_train_tf = tf_idf.fit_transform(newsgroups_train.data)
X_train_tf.shape

(11314, 50399)

## Predicting Methods

### Naive Bayes Classifier

- Now that we've examined how to prepare text, transform text, we can examine various methods of predicting with our matrix. The traditional model for text predictions is Naive Bayes.

- Naive Bayes assumes that terms within documents are independent of each other. The classic example of this is the email spam detection. Consider implementation of this model on small amounts of texts across each document (ie reviews for negative, positive.) It’s important to note, because of the assumption that the terms would be independent of each other, removing highly correlated features before running the model would improve performance greatly.

- There are two prevailing models: multivariate bernoulli model vs multinomial model:
 - The Multivariate Bernoulli model which ignores the frequency of words and Multinomial model which takes the frequency into account. 
 - Since we are implementing frequency rather than Boolean, it should be apparent that we will be using the Multinomial model.

In [22]:
from sklearn.naive_bayes import MultinomialNB

- We want to predict the category of each news.

In [23]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

- Let's do a mock example of checking what are predicted tags using both sets of inputs.

- First we will create and train the model on the two training sets.

In [24]:
multiNB = MultinomialNB()

cntvecMNB = multiNB.fit(X_train_count, newsgroups_train.target)
tf_idfMNB = multiNB.fit(X_train_tf, newsgroups_train.target)

- Now that the models are trained, we will send new statements using the same CountVectorizer and TF-IDF modeled we trained earlier to transform our new statements as new test statements and predict those new vectors with our Multinomial model.

In [25]:
new_docs = ["""In the ancient and medieval world  
            the etymological Latin root religio was understood as an individual virtue of worship 
            never as doctrine, practice, or actual source of knowledge.  
            Furthermore, religio referred to broad social obligations to family, neighbors, rulers, and even towards God. 
            When religio came into English around the 1200s as religion, it took the meaning of "life bound by monastic vows". 
            The compartmentalized concept of religion, where religious things were separated from worldly things, 
            was not used before the 1500s. The concept of religion was first used in the 1500s to distinguish 
            the domain of the church and the domain of civil authorities.""",
            
           """A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and 
           alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. 
           GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. 
           Modern GPUs are very efficient at manipulating computer graphics and image processing, 
           and their highly parallel structure makes them more efficient than general-purpose CPUs  
           for algorithms where the processing of large blocks of data is done in parallel. 
           In a personal computer, a GPU can be present on a video card, or it can be embedded 
           on the motherboard or—in certain CPUs—on the CPU die"""]

- **Important**: We do **NOT** fit the test data through a new CountVectorizer model or a new TF-IDF model.

In [27]:
new_doc_count = count_vec.transform(new_docs)
new_doc_tfidf = tf_idf.transform(new_docs)

In [28]:
cnt_predicted = cntvecMNB.predict(new_doc_count)
tfidf_predicted = tf_idfMNB.predict(new_doc_count)

In [29]:
for i in cnt_predicted:
    print(newsgroups_train.target_names[i])

soc.religion.christian
comp.graphics


In [30]:
for i in tfidf_predicted:
    print(newsgroups_train.target_names[i])

soc.religion.christian
comp.graphics


- So here we see both predictions accurately label the two but let's evaluate the behavior from an accuracy metrics of the two word embedding inputs.
- Try to see if you can walk through the steps you need to do before reading the code! 

In [31]:
X_test_count = count_vec.transform(newsgroups_test.data)
X_test_count.shape

(7532, 50399)

In [32]:
X_test_tf = tf_idf.transform(newsgroups_test.data)
X_test_tf.shape

(7532, 50399)

In [33]:
countvec_predicted = cntvecMNB.predict(X_test_count)
tfidf_predicted = tf_idfMNB.predict(X_test_tf)

- Now we can generate a classification report to examine how our models perform

In [34]:
from sklearn import metrics
print('The report for CountVectorizer word embedding through a Multinomial model:')
print(metrics.classification_report(newsgroups_test.target, countvec_predicted, target_names= newsgroups_test.target_names))

The report for CountVectorizer word embedding through a Multinomial model:
                          precision    recall  f1-score   support

             alt.atheism       0.76      0.38      0.51       319
           comp.graphics       0.68      0.59      0.63       389
 comp.os.ms-windows.misc       0.79      0.58      0.67       394
comp.sys.ibm.pc.hardware       0.69      0.70      0.69       392
   comp.sys.mac.hardware       0.81      0.66      0.73       385
          comp.windows.x       0.81      0.70      0.75       395
            misc.forsale       0.89      0.72      0.79       390
               rec.autos       0.73      0.86      0.79       396
         rec.motorcycles       0.93      0.80      0.86       398
      rec.sport.baseball       0.91      0.74      0.82       397
        rec.sport.hockey       0.89      0.92      0.91       399
               sci.crypt       0.69      0.92      0.79       396
         sci.electronics       0.77      0.45      0.57       393


In [35]:
print('The report for TF-IDF Vectorizer word embedding through a Multinomial model:')
print(metrics.classification_report(newsgroups_test.target, tfidf_predicted, target_names= newsgroups_test.target_names))

The report for TF-IDF Vectorizer word embedding through a Multinomial model:
                          precision    recall  f1-score   support

             alt.atheism       0.79      0.61      0.69       319
           comp.graphics       0.69      0.69      0.69       389
 comp.os.ms-windows.misc       0.77      0.69      0.72       394
comp.sys.ibm.pc.hardware       0.66      0.74      0.70       392
   comp.sys.mac.hardware       0.85      0.77      0.81       385
          comp.windows.x       0.80      0.77      0.79       395
            misc.forsale       0.85      0.82      0.84       390
               rec.autos       0.81      0.89      0.85       396
         rec.motorcycles       0.91      0.89      0.90       398
      rec.sport.baseball       0.89      0.87      0.88       397
        rec.sport.hockey       0.86      0.97      0.91       399
               sci.crypt       0.78      0.93      0.85       396
         sci.electronics       0.79      0.61      0.69       39

- We can see that TF-IDF performs significantly better in recall. Try to examine why this may be the case now knowing what you learned about the difference between CountVectorizer and TF-IDF.
- Consider instances where CountVectorizer may work better than TF-IDF!

## Topic Modeling

- [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or LDA is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions.

- Each document is modeled as a multinomial distribution of **topics** and each topic is modeled as a multinomial distribution of **words**.

- It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.

![img](https://s3.amazonaws.com/nycdsabt01/lda_graph.png)

- The above chart shows how LDA tries to classify documents. Documents are represented as a distribution of topics. 

- Topics, in turn, are represented by a distribution of all tokens in the vocabulary. But we do not know the number of topics that are present in the corpus and the documents that belong to each topic. In other words, we want to treat the assignment of the documents to topics as a random variable itself which is estimated from the data.

- This sounds complicated, but the process we have discussed above is similar to the Dirichlet process, which we will dig into later after we see the result.

In [36]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer

lemtzer = WordNetLemmatizer()

def lemmatize_stemming(text):
    return lemtzer.lemmatize(text, pos='v')

# Write a function to perform the pre processing steps on the entire dataset
def preprocess(text):
    result=[]
    for token in simple_preprocess(text) :
        if token not in STOPWORDS:
            result.append(lemmatize_stemming(token))
            
    return result

ModuleNotFoundError: No module named 'gensim'

In [None]:
# Uncomment the following lines if this is the first time you use nltk
# import nltk
# nltk.download('wordnet')

- Preview a document after preprocessing

In [None]:
doc_sample = 'This disk has failed many times. I would like to get it replaced.'

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

- Let's now preprocess all the news headlines we have. To do that, we iterate over the list of documents in our training sample

In [None]:
processed_docs  = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))

### Bag of words on the dataset

- Now let's create a dictionary from `processed_docs` containing the index of each word appears in the training set. To do that, let's pass `processed_docs` to `gensim.corpora.Dictionary()` and call it `dictionary`.

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(processed_docs)

In [None]:
len(dictionary.keys())

- Filter out tokens that appear in
 - less than `no_below` documents (absolute number) or
 - more than `no_above` documents (fraction of total corpus size, not absolute number).
 - after (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if None).

In [None]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing less than 15 times
- words appearing in more than 20% of all documents
'''
dictionary.filter_extremes(no_below=15, no_above=0.2, keep_n=50000)

- Checking dictionary created

In [None]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

- Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. 

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [None]:
bow_corpus[0]

### Running LDA using Bag of Words

- Some of the parameters we will be tweaking are:
 - `num_topics` is the number of requested latent topics to be extracted from the training corpus.
 - `id2word` is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
 - `passes` is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, passes is 2, then online training is done in 10 updates:
- We choose the number of topics to be 6 in our lda model as partitioned according to subject matter on the [website](http://qwone.com/~jason/20Newsgroups/). 

In [None]:
%%time
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=6, id2word=dictionary, passes=2)
lda_model.save('lda.model')

- For each topic, we can explore the words occuring in that topic and its relative weight.
- Can you distinguish different topics using the words in each topic and their corresponding weights?

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} Word: {}\n'.format(idx, topic))

- Testing model on unseen document

In [None]:
num = 100
unseen_document = newsgroups_test.data[num]
print(unseen_document)

In [None]:
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

- The main use of the pyLDAvis package to provide interactive visualizations to augment our understanding. Let us see how our topics look:

In [None]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)

- We can understand the document from the following perspective.
 - Topics are represented as a bubble. The size of the bubble is proportional to its prevalence of the corpus.
 - Similar topics appear close together, topics further apart are less similar.
- Upon selecting a topic, the most representative words for the selected topic can be seen. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
- When a topic is selected, the percentage of tokens in the topic is also visible. This measure can be used as an additional measure to weed out irrelevant topics.

### Running LDA using TF-IDF

In [None]:
from gensim import models
import warnings
warnings.filterwarnings('ignore')

- Construct the tfidf corpus from our bag-of-words corpus.

In [None]:
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

In [None]:
%%time
lda_model_tfidf = gensim.models.LdaMulticore(tfidf_corpus, num_topics=6, id2word=dictionary, passes=2)
lda_model_tfidf.save('lda_tfidf.model')

- Can you distinguish different topics using the words in each topic and their corresponding weights?

In [None]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

In [None]:
pyLDAvis.gensim.prepare(lda_model_tfidf, tfidf_corpus, dictionary)

## Dig into the details

### What is Dirichlet distribution?
- The idea of the Dirichlet process is simple; we assign elements to categories following a very simple rule: When assigning the nth element, we assign it to a new category with the probability

$$\frac{\alpha}{\alpha + n -1}$$

or we assign it to an already existing category x with probability

$$\frac{n_x}{\alpha + n -1}$$

where $n_x$ is the number of random variables already assigned to category $X$. What’s $\alpha$?

- In case of the Dirichlet distribution, it is a conjugate prior for the multinomial distribution. 
 - If in the case of the binomial distribution we can think of it in terms of drawing white and black balls with replacement from the urn, then in case of the multinomial distribution we are drawing with replacement N balls appearing in k colors, where each of colors of the balls can be drawn with probabilities $p_1,...,p_k$. 
  - The Dirichlet distribution is a conjugate prior for $p_1,...,p_k$ probabilities and $\alpha_{1},...,\alpha_{k}$ parameters can be thought as pseudocounts of balls of each color assumed a priori.
 - The higher value of $\alpha$, the greater amount of the total "mass" is assigned to it. If $\alpha < 1$, it can be thought as anti-weight that pushes away each point toward extremes, while when it is high, it attracts each point toward some central value (central in the sense that all points are concentrated around it, not in the sense that it is symmetrically central).

![img](http://phyletica.org/images/dpp-3-example.gif)
[Souce](http://phyletica.org/dirichlet-process/)

### LDA in graphical form

- The diagram what we see here is called the **plate notation for LDA**.

![img](https://s3.amazonaws.com/nycdsabt01/lda_plate.png)
[Source](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

- M is the superset of all the documents
- N is the number of words per document
- Inside the rectangle N we see w and z which can thought of as
 - w -> the words observed in document i
 - z -> the random topic for jth word for document i
- theta -> Topic distribution for document i
- alpha -> Parameter to set prior Dirichlet distribution to per document level
- beta -> Parameter to set prior Dirichlet distribution at per topic word level

**Key things to remember when optimizing on alpha and beta**

- **High Alpha** indicates that each document is more likely to have a mixture of all the topics
- **Low Alpha** indicates that each document is more likely to have a mixture of one/two or few of the topics
- **High Beta** indicates that each topic is more likely to have a mixture of all the words
- **Low Beta** indicates that each topic is more likely to have a mixture of one/two or few of the words

- Note: these two parameters are called **alpha** and **eta** respectively in the gensim package.