# Vectorization and Topic Analysis

In this notebook, we vectorize our restaurant and business review corpus with sklearn's tf-idf vectorizer and then perform topic analysis on the reviews.

## Import the data

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD

In [4]:
with open('../data/tokenized.pkl', 'rb') as f:
    rests = pickle.load(f)

## Build the custom vectorizer

Because I already tokenized and preprocessed the restaurant reviews with spaCy in Notebook 2, I can't use sklearn's tf-idf vectorizer with its default settings: by default its analyzer runs its own preprocessor and tokenizer on raw strings. Instead, I configure the vectorizer with pass-through (identity) functions for both steps, so that it consumes the existing token lists directly and only handles n-gram generation, document-frequency filtering, and tf-idf weighting.

In [5]:
def identity(doc):
    return doc

In [8]:
vectorizer = TfidfVectorizer(analyzer='word', 
                             tokenizer=identity, 
                             preprocessor=identity, 
                             token_pattern=None, 
                             ngram_range=(1, 2),
                             min_df=5, 
                             max_df=0.95, 
                             max_features=100000)
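
As a quick sanity check of the pass-through setup, here is a minimal sketch on a toy corpus. The `toy_docs` token lists are hypothetical stand-ins for the real reviews, and `min_df`, `max_df`, and `max_features` are left at their defaults because three documents are too few for the thresholds used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(doc):
    # Return the token list unchanged; tokenization already happened upstream.
    return doc


# Hypothetical pre-tokenized reviews (assumption: each document is a list of str).
toy_docs = [
    ["great", "pizza", "friendly", "staff"],
    ["slow", "service", "cold", "pizza"],
    ["friendly", "staff", "great", "coffee"],
]

toy_vectorizer = TfidfVectorizer(
    analyzer="word",
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,
    ngram_range=(1, 2),  # unigrams plus bigrams built from adjacent tokens
)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term (unigram or bigram) to its column index.
vocab = toy_vectorizer.vocabulary_
print(toy_matrix.shape)
print(sorted(vocab))
```

The bigrams are formed by joining adjacent tokens with a space, so a term such as `friendly staff` appears in the vocabulary even though no raw string was ever tokenized by sklearn.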

In [10]:
%%time
doc_term_matrix = vectorizer.fit_transform(rests)

CPU times: user 7min 26s, sys: 9.54 s, total: 7min 36s
Wall time: 7min 36s


## Create topic models

Let's compare the topics produced by three models: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-negative Matrix Factorization (NMF).
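
To make the comparison concrete, the sketch below fits all three models on a tiny, hypothetical corpus and prints the top terms per topic. It calls the sklearn estimators directly rather than going through textacy, which is an assumption made for illustration and not necessarily how the rest of this notebook runs them; LSA is represented by `TruncatedSVD`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(doc):
    # Pass-through for pre-tokenized documents.
    return doc


# Hypothetical token lists with two rough themes (food vs. service).
toy_docs = [
    ["pizza", "cheese", "sauce", "pizza"],
    ["pasta", "sauce", "cheese", "pasta"],
    ["pizza", "pasta", "cheese", "sauce"],
    ["waiter", "rude", "slow", "service"],
    ["service", "slow", "waiter", "waiter"],
    ["rude", "service", "slow", "rude"],
]

vec = TfidfVectorizer(analyzer="word", tokenizer=identity,
                      preprocessor=identity, token_pattern=None)
X = vec.fit_transform(toy_docs)

# Invert term -> column index into column index -> term.
id_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

models = {
    "LDA": LatentDirichletAllocation(n_components=2, random_state=0),
    "LSA": TruncatedSVD(n_components=2, random_state=0),
    "NMF": NMF(n_components=2, random_state=0),
}

top_terms = {}
doc_topic_shapes = {}
for name, estimator in models.items():
    doc_topic = estimator.fit_transform(X)  # rows: documents, columns: topics
    doc_topic_shapes[name] = doc_topic.shape
    top_terms[name] = []
    for topic_idx, weights in enumerate(estimator.components_):
        # Rank terms by their weight in this topic, highest first.
        # Note: LSA components can contain negative weights.
        best = np.argsort(weights)[::-1][:3]
        terms = [id_to_term[int(i)] for i in best]
        top_terms[name].append(terms)
        print(name, "topic", topic_idx, ":", "   ".join(terms))
```

On a corpus this small the topics are only illustrative; the point is that all three estimators share the same `fit_transform` / `components_` interface, so the same loop can rank terms for each of them.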

### LDA

In [22]:
%%time
model = textacy.TopicModel('lda', n_components=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix.shape



KeyboardInterrupt: 

In [None]:
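# number of terms (unigrams and bigrams) kept by the vectorizer
len(vectorizer.vocabulary_)

In [None]:
# topic weights for the first five reviews
doc_topic_matrix[:5]

In [None]:
# overall weight of each topic across the corpus
model.topic_weights(doc_topic_matrix)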

In [11]:
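# sklearn's TfidfVectorizer has no `id_to_term` attribute (that belongs to textacy's own Vectorizer),
# so invert `vocabulary_` (term -> column index) to get a column index -> term mapping.
id_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
for topic_idx, top_terms in model.top_topic_terms(id_to_term, top_n=10):
    print('topic', topic_idx, ':', '   '.join(top_terms))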