### LDA background
LDA (Latent Dirichlet Allocation) assumes that each document is a probability distribution over latent topics, and that each topic is a probability distribution over words.
LDA takes a collection of documents and assumes that the words in each document are related. It then tries to figure out the 'recipe' by which each document could have been created. We only need to tell the model how many topics to construct; it then uses that 'recipe' to estimate the topic and word distributions over the corpus. Based on that output, we can identify similar documents within the corpus.

### To understand the LDA process, we have to know how LDA assumes documents are generated:
1. determine the number of words in the document
2. choose a topic mixture for the document over a fixed set of topics (e.g. topic A 20%, topic B 50%, etc.)
3. generate words in the document by:
    - pick a topic based on the document's multinomial distribution
    - pick a word based on the topic's multinomial distribution
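
The recipe above can be sketched in a few lines of NumPy. This is only an illustration, not part of the notebook's pipeline: the vocabulary, the two topics ("food" and "sports") and the function name `generate_document` are made up for the example, and the number of words (step 1) is simply passed in as an argument.

```python
import numpy as np

def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document following LDA's assumed recipe."""
    # step 2: draw the document's topic mixture from a Dirichlet prior
    theta = rng.dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        # step 3a: pick a topic from the document's topic mixture
        z = rng.choice(len(theta), p=theta)
        # step 3b: pick a word from that topic's word distribution
        w = rng.choice(len(vocab), p=topic_word[z])
        topics.append(int(z))
        words.append(vocab[w])
    return theta, topics, words

rng = np.random.default_rng(0)
vocab = ["pizza", "pasta", "goal", "team"]
# two hypothetical topics: row 0 is "food", row 1 is "sports"
topic_word = np.array([[0.5, 0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.5, 0.5]])

theta, topics, words = generate_document(
    n_words=8, alpha=[0.5, 0.5], topic_word=topic_word, vocab=vocab, rng=rng
)
print("topic mixture:", theta)
print("document:", words)
```

Each generated word comes from exactly one topic, but the document as a whole mixes topics according to `theta`.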

### Working backwards
Suppose you have a corpus of documents, and you want LDA to learn the topic mixture of each document over K topics, along with the word distribution of each topic. LDA works backwards from the documents to identify the topics that are likely to have generated the corpus.

### LDA's Magic
1. randomly assign each word in each document to one of the K topics
2. for each word w in each document d
    - assume that all topic assignments except for the current word's are correct
    - calculate two proportions for each topic t:
        1. proportion of words in document d that are currently assigned to topic t = p(topic t | document d)
        2. proportion of all assignments to topic t, across all documents, that belong to word w = p(word w | topic t)
    - multiply those two proportions and assign w a new topic drawn with that probability: p(topic t | document d) * p(word w | topic t)
3. repeat step 2 many times; eventually we reach a roughly steady state where the assignments make sense
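
The procedure above is commonly known as collapsed Gibbs sampling. A minimal sketch of it follows, assuming documents are given as lists of integer word ids; the function name `gibbs_lda`, the toy corpus and the default values of `alpha` and `beta` are illustrative assumptions. The smoothing terms `alpha` and `beta` are added to the raw counts so that no probability is ever exactly zero. Note that this is not how gensim's `LdaMulticore` (imported below) fits the model; gensim uses a variational method instead.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t

    # step 1: randomly assign each word to one of the K topics
    assignments = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        assignments.append(z_d)
        for w, t in zip(doc, z_d):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    # step 2, repeated n_iter times: resample the topic of every word
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # take the current word out of the counts
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1

                # proportional to p(topic t | document d)
                p_topic_given_doc = doc_topic[d] + alpha
                # p(word w | topic t)
                p_word_given_topic = (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                p = p_topic_given_doc * p_word_given_topic
                p /= p.sum()

                # draw a new topic and put the word back into the counts
                t = rng.choice(n_topics, p=p)
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1

    return doc_topic, topic_word

# toy corpus over a 4-word vocabulary (word ids 0..3)
docs = [[0, 1, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4, n_iter=50)
print(doc_topic)
print(topic_word)
```

Normalising each row of `doc_topic` gives an estimate of the per-document topic mixture, and normalising each row of `topic_word` gives an estimate of the per-topic word distribution.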

### alpha (parameter of the Dirichlet prior on the per-document topic distribution)
high: each document is likely to contain a mixture of many topics
low: each document is likely to be dominated by only a few topics

### beta (parameter of the Dirichlet prior on the per-topic word distribution)
high: each topic is likely to spread its probability over many words
low: each topic is likely to concentrate on only a few words
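
The effect of a high versus a low Dirichlet parameter can be checked by sampling. The sketch below draws simulated per-document topic mixtures for two illustrative alpha values (0.1 and 10.0, chosen only for the demonstration) and compares how much weight the single largest topic receives on average. The same reasoning applies to beta and the per-topic word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 10

# 1000 simulated topic mixtures for a low and a high symmetric alpha
low_alpha_docs = rng.dirichlet([0.1] * n_topics, size=1000)
high_alpha_docs = rng.dirichlet([10.0] * n_topics, size=1000)

# average weight of the dominant topic in each simulated document
low_dominant = low_alpha_docs.max(axis=1).mean()
high_dominant = high_alpha_docs.max(axis=1).mean()

print("low alpha, mean weight of dominant topic: ", round(float(low_dominant), 3))
print("high alpha, mean weight of dominant topic:", round(float(high_dominant), 3))
```

With a low alpha the dominant topic takes most of the weight in a document, while with a high alpha the weight is spread much more evenly across the topics.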

### theta (topic distribution for document m)
### z (topic for the n-th word in document m)
### w (specific word)

In [None]:
### step 0: examine and import corpus

In [None]:
### import corpus, only for relevant texts

In [None]:
import os
import codecs
import json

import spacy
import pandas as pd
import itertools as it

from gensim.models import Phrases  # seems slower, but Phraser was not compatible with our code? needs some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

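# the 'en' shortcut was removed in spaCy 3; loading the model by its full name
# works in both spaCy 2 and 3 (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')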

In [None]:
def lda_description(review_text, min_topic_freq = 0.05):
    """
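    accept the original text of a review and
    1. parse it with spaCy,
    2. apply text pre-processing steps,
    3. create a bag-of-words representation,
    4. create an LDA representation, and
    5. print a sorted list of the top topics in the LDA representation

    relies on punct_space, bigram_model, trigram_model, trigram_dictionary,
    lda and topic_names, which are not defined in this cell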
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review if term not in STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda.sort(key = lambda tup: tup[1], reverse = True)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_names[topic_number], round(freq, 3)))