# Vectorization and Topic Analysis

In this notebook, we use sklearn's tf-idf vectorizer on our restaurant and business review corpus and perform topic analysis on the reviews.

## Import the data

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD

In [2]:
with open('../data/tokenized.pkl', 'rb') as f:
    rests = pickle.load(f)

## Build the custom vectorizer

Because the reviews were already tokenized and preprocessed with spaCy in Notebook 2, I can't use sklearn's tf-idf vectorizer with its default settings: by default it runs its own preprocessor and tokenizer on raw strings. Instead, I construct a custom vectorizer within the tf-idf framework whose tokenizer and preprocessor are identity functions, so each document's existing token list passes through unchanged before being vectorized for topic analysis.

In [4]:
def identity(doc):
    return doc

In [5]:
vectorizer = TfidfVectorizer(analyzer='word', 
                             tokenizer=identity, 
                             preprocessor=identity, 
                             token_pattern=None, 
                             min_df=5, 
                             max_df=0.95, 
                             max_features=100000)
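As a quick sanity check of the identity-function approach, here is a minimal sketch on a toy corpus. The `toy_docs` list and the `toy_*` names are hypothetical stand-ins for the real tokenized reviews, and the pruning options (`min_df`, `max_df`, `max_features`) are left at their defaults so that no terms are dropped from such a small example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # Documents are already lists of tokens, so pass them through unchanged.
    return doc

# Hypothetical pre-tokenized reviews standing in for the real corpus.
toy_docs = [
    ["great", "pizza", "crust"],
    ["great", "sushi", "roll"],
    ["pizza", "slice", "cheese"],
]

toy_vectorizer = TfidfVectorizer(
    analyzer="word",
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,
)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct token.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

If the token lists reach the vectorizer intact, the vocabulary should contain exactly the tokens that appear in the input lists, with no further splitting or normalization.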

## Create the document-term matrix for the review corpus

In [6]:
%%time
doc_term_matrix = vectorizer.fit_transform(rests)

CPU times: user 1min 37s, sys: 3.08 s, total: 1min 40s
Wall time: 1min 40s
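The `min_df=5` and `max_df=0.95` settings used when fitting the vectorizer prune the vocabulary by document frequency. The sketch below illustrates that idea in plain Python on a toy corpus. The `prune_vocabulary` helper and `toy_docs` are hypothetical names for illustration, not sklearn's implementation, and the sketch ignores the `max_features` cap.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the terms whose document frequency lies within the thresholds.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    n_docs = len(docs)
    # Count each term once per document, regardless of repeats within it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Hypothetical pre-tokenized documents.
toy_docs = [
    ["this", "pizza", "great"],
    ["this", "sushi", "great"],
    ["this", "burger", "fry"],
    ["this", "pizza", "crust"],
]
kept = prune_vocabulary(toy_docs, min_df=2, max_df=0.95)
print(kept)
```

Here a term that occurs in every document is removed by the upper threshold, and terms that occur in only one document are removed by the lower threshold.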


## Create topic models and print out topics

Let's compare the topics generated by two decomposition models: Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF).
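To make the contrast concrete, here is a small sketch that fits both decompositions with the scikit-learn classes imported at the top of the notebook (`TruncatedSVD` for LSA and `NMF`), rather than through textacy. The toy corpus, the `top_terms` helper, and the choice of two components are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    return doc

# Hypothetical pre-tokenized reviews with two obvious themes.
toy_docs = [
    ["pizza", "crust", "cheese"],
    ["pizza", "slice", "cheese"],
    ["pizza", "crust", "slice"],
    ["sushi", "roll", "salmon"],
    ["sushi", "roll", "fresh"],
    ["sushi", "salmon", "fresh"],
]

tfidf = TfidfVectorizer(analyzer="word", tokenizer=identity,
                        preprocessor=identity, token_pattern=None)
X = tfidf.fit_transform(toy_docs)

# Order the terms by their column index in the matrix.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

def top_terms(components, terms, n=3):
    # For each component (topic), list the n highest-weighted terms.
    return [[terms[i] for i in np.argsort(row)[::-1][:n]] for row in components]

lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
nmf = NMF(n_components=2, init="nndsvd", max_iter=500).fit(X)

print("LSA:", top_terms(lsa.components_, terms))
print("NMF:", top_terms(nmf.components_, terms))
```

One structural difference worth keeping in mind when reading the topics: NMF constrains all term weights to be non-negative, whereas LSA components come from a singular value decomposition and can contain negative weights.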

In [29]:
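def topic_model(input_matrix, model_name, n_topics):
    """Fit a textacy topic model, then print its top terms and sample documents.

    Relies on the notebook-level `vectorizer` and `rests` defined above.
    """
    model = textacy.TopicModel(model_name, n_topics=n_topics)
    model.fit(input_matrix)
    doc_topic_matrix = model.transform(input_matrix)

    # Print the highest-weighted terms for each topic.
    for topic_idx, top_terms in model.top_topic_terms(vectorizer.get_feature_names()):
        print('topic', topic_idx, ':', '   '.join(top_terms))

    # Print the two highest-weighted documents for the first two topics.
    for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1], top_n=2):
        print(topic_idx)
        for j in top_docs:
            print(rests[j])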
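The document lookup in the function above ranks documents by their weight on each topic. A rough NumPy sketch of that idea follows. It is not textacy's implementation, and the `toy_doc_topic` weights and the `top_docs_per_topic` helper are hypothetical.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents (rows) by 2 topics (columns).
toy_doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [0.6, 0.3],
    [0.1, 0.8],
])

def top_docs_per_topic(doc_topic, top_n=2):
    """Map each topic index to the indices of its highest-weighted documents."""
    return {
        topic: np.argsort(doc_topic[:, topic])[::-1][:top_n].tolist()
        for topic in range(doc_topic.shape[1])
    }

ranking = top_docs_per_topic(toy_doc_topic)
print(ranking)
```

The returned indices can then be used to look up the original token lists, as the function does with `rests[j]`.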

### LSA

In [30]:
# 'lsa' follows from the section heading; n_topics=20 is an assumption that mirrors the NMF cell below.
model = textacy.TopicModel('lsa', n_topics=20)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)

for topic_idx, top_terms in model.top_topic_terms(vectorizer.get_feature_names()):
    print('topic', topic_idx, ':', '   '.join(top_terms))

for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1], top_n=2):
    print(topic_idx)
    for j in top_docs:
        print(rests[j])

topic 0 : numb   be   that   this   they   with   place   have   great   order
topic 1 : great   place   service   friendly   very   amaze   always   staff   this   delicious
topic 2 : pizza   crust   numb   slice   wing   delivery   topping   cheese   pepperoni   order
topic 3 : numb   great   service   minute   price   atmosphere   friendly   sushi   staff   star
topic 4 : place   this   have   they   always   sushi   that   there   never   when
topic 5 : place   this   numb   chicken   sushi   fry   sandwich   fresh   price   sauce
topic 6 : very   sushi   this   restaurant   place   back   be   will   definitely   pizza
topic 7 : sushi   they   very   always   their   have   fresh   price   roll   friendly
topic 8 : sushi   great   that   with   much   price   roll   buffet   really   some
topic 9 : sushi   order   chicken   fry   service   burger   here   back   will   always
topic 10 : burger   very   fry   sushi   burgers   friendly   staff   they   price   be
topic 11 : very   

1
['great', 'coffeeeeeeeee', 'great', 'service', 'very', 'place']
['this', 'place', 'great', 'great', 'service']


### NMF

In [32]:
model = textacy.TopicModel('nmf', n_topics=20)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)

for topic_idx, top_terms in model.top_topic_terms(vectorizer.get_feature_names()):
    print('topic', topic_idx, ':', '   '.join(top_terms))

for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0, 1, 2, 3, 4], top_n=2):
    print(topic_idx)
    for i in top_docs:
        print(rests[i])

topic 0 : that   much   there   just   really   than   what   think   little   well
topic 1 : great   service   atmosphere   price   awesome   excellent   drink   selection   happy   fantastic
topic 2 : pizza   crust   slice   cheese   wing   topping   pepperoni   delivery   sauce   salad
topic 3 : numb   star   minute   only   price   about   give   people   around   hour
topic 4 : with   sauce   salad   which   cheese   side   delicious   also   steak   flavor
topic 5 : place   this   recommend   favorite   vega   have   look   your   from   awesome
topic 6 : they   have   them   also   make   your   when   close   because   offer
topic 7 : very   price   restaurant   service   tasty   excellent   reasonable   portion   clean   good
topic 8 : sushi   roll   fresh   sashimi   quality   price   salmon   japanese   tempura   buffet
topic 9 : order   minute   service   when   wait   ask   after   table   take   never
topic 10 : burger   fry   burgers   cheese   onion   shake   bacon   pa

In [33]:
model.save('../models/nmf.pkl')