https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/

1) Describe how topic models define a document. Why is this a useful framework? What substantive questions might it answer? Why isn’t it useful and which questions might it be ill-suited to?

LDA assumes, as human readers do, that words carry strong semantic information and that similar documents use similar words. It also assumes that documents are probability distributions over latent topics and that topics are probability distributions over words. 

An important distinction from other models is that LDA works with probability distributions rather than strict word frequencies. 

Topic models define a document as a mixture of a small number of topics, with each of its words attributed to one of those topics. This framework is useful because it allows, for example, geneticists to study the genome in an applied way. As another example, engineers may classify documents and approximate their relation to other topics. 

Using plate notation, the image below can be described as follows: 
- $K$: number of topics
- $M$: number of documents
- $N$: number of words in a given document
- $\beta$: parameter of the Dirichlet prior on the per-topic word distribution
- $\varphi_k$: word distribution for topic $k$ (sums to 1)
- $\alpha$: parameter of the Dirichlet prior on the per-document topic distribution
- $\theta_m$: topic distribution for document $m$ (sums to 1)
- $z$: topic assigned to a given word in a document
- $w$: the specific (observed) word

<img src="lda_k.png">
Source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
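
The generative story behind the plate diagram can be sketched in a few lines of NumPy. This is only an illustrative simulation under assumed settings, not part of the analysis pipeline: the function name `generate_corpus`, the vocabulary size `V`, and all default values are hypothetical choices, and every document is given the same length `N` for simplicity.

```python
import numpy as np

def generate_corpus(K=3, M=5, N=8, V=10, alpha=0.5, beta=0.5, seed=0):
    """Simulate LDA's generative process (illustrative values only).

    K topics, M documents, N words per document, V vocabulary size (assumed).
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from Dirichlet(beta): shape (K, V)
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # One topic distribution per document, drawn from Dirichlet(alpha): shape (M, K)
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    docs = []
    for m in range(M):
        # z: a topic assignment for each word position in document m
        z = rng.choice(K, size=N, p=theta[m])
        # w: each word id is drawn from the word distribution of its topic
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(w)
    return phi, theta, docs

phi, theta, docs = generate_corpus()
print(theta.round(2))  # each row is a document's topic mixture
print(docs[0])         # word ids of the first simulated document
```

Fitting LDA runs this story in reverse: only the words $w$ are observed, and the topic and word distributions are inferred from them.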

# Setup

In [20]:
# import packages
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

# Load Data

In [34]:
# load data from processed dir
df_game_reviews = pd.read_csv(r'../data/processed/game_reviews_processed.csv')

In [35]:
# clean
# keep only English reviews and the recommendationid, review, and timestamp_created columns
def prune_cols(df):
    """Return the English-language reviews in df as a list of single-line strings."""
    df = df.loc[df['language'] == 'english']  # subset to English reviews
    df = df[["recommendationid", "review", "timestamp_created"]]  # drop unnecessary cols
    # cast to str (handles missing values) and flatten newlines to spaces
    return [str(item).replace('\n', ' ') for item in df.review.tolist()]

# keep the full cleaned list for the tokenization step, then preview the first 10
reviews = prune_cols(df_game_reviews)
reviews[:10]

["Game is fun and exciting when it works. Very simple milsim aspects that are much easier to grasp than Arma. However I cannot even play it in it's current state. Every time I get into a firefight the game locks up and completely freezes. My PC more than makes the minimum requirements. Poor optimization ruins the experience for me. Can't recommend it in it's current state.",
 'Excellent game.   Plenty of fun to be had in the game as is. Works very well on my i5, GTX 970, 8gig of RAM (could probably use 16gb)  This game is NOT for everyone however, it requires a solid hour or so of your time and you HAVE to be willing to communicate or you will simply have a poor experience. If you have played Project Reality (BF2) you WILL enjoy this game.  Pros -  Commucation Weapons Gunplay Maps Simple control scheme compared to ArmA.   Cons - It will be VERY frustrating for players that have never played Project Reality or ArmA',
 "Players are too serious and you can't have fun. Instead of playing y

# Tokenize

In [36]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; simple_preprocess itself drops punctuation

data_words = list(sent_to_words(reviews))

print(data_words[:1])

[['game', 'is', 'fun', 'and', 'exciting', 'when', 'it', 'works', 'very', 'simple', 'milsim', 'aspects', 'that', 'are', 'much', 'easier', 'to', 'grasp', 'than', 'arma', 'however', 'cannot', 'even', 'play', 'it', 'in', 'it', 'current', 'state', 'every', 'time', 'get', 'into', 'firefight', 'the', 'game', 'locks', 'up', 'and', 'completely', 'freezes', 'my', 'pc', 'more', 'than', 'makes', 'the', 'minimum', 'requirements', 'poor', 'optimization', 'ruins', 'the', 'experience', 'for', 'me', 'can', 'recommend', 'it', 'in', 'it', 'current', 'state']]


# Lemmatization

In [37]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized texts, keeping only tokens whose POS tag is allowed.

    POS tags: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        # '-PRON-' is the placeholder lemma that spaCy 2.x assigns to pronouns
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize the small English spaCy model, disabling the parser and NER components (for efficiency)
# Load the model by its full package name (the 'en' shortcut is not supported by spaCy 3+)
# Run in terminal first: python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:2])

# Fit LDA with a few different values for K. How does the value of K seem to change your results?

# Using your knowledge of the corpus, choose the best value for K and justify this result substantively. Fit a topic model with this value, interpret it substantively as it relates to your research question, and write these results up.