# Topic Analysis
In this notebook I use the negative review predictions from the modeling stage to perform topic analysis and identify keywords for the negative reviews. I use TextRank, implemented with **`networkx`**, to carry out this analysis.

## Setup

In [0]:
from google.colab import drive
from importlib.machinery import SourceFileLoader
import networkx as nx
import numpy as np
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Mount Google Drive.

In [0]:
ROOT = '/content/drive'
PROJECT = 'My Drive/Thinkful/Final_Capstone_Project/'
PROJECT_PATH = os.path.join(ROOT, PROJECT)

In [3]:
drive.mount(ROOT)

Mounted at /content/drive


Load custom methods and constants.

In [4]:
con = SourceFileLoader('constants', os.path.join(PROJECT_PATH, 'utilities/constants.py')).load_module()
met = SourceFileLoader('methods', os.path.join(PROJECT_PATH, 'utilities/methods.py')).load_module()

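`SourceFileLoader.load_module()` is deprecated in recent Python releases. A minimal sketch of an equivalent loader built on `importlib.util` follows; the helper name `load_source`, the throwaway file, and the constant `RANDOM_STATE` are placeholders for illustration, not part of the project's utilities.

```python
import importlib.util
import os
import tempfile


def load_source(name, path):
    """Load a Python source file as a module (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstrate on a temporary file standing in for utilities/constants.py.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'constants.py')
    with open(demo_path, 'w') as f:
        f.write('RANDOM_STATE = 42\n')  # made-up constant
    demo = load_source('constants', demo_path)

print(demo.RANDOM_STATE)
```

In the notebook, the same helper could be called with `os.path.join(PROJECT_PATH, 'utilities/constants.py')` in place of the temporary path.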


## Load Data
Load the review text for reviews classified as negative.

In [0]:
# df_clean = pd.read_csv(os.path.join(PROJECT_PATH, 'data/negative_review_predictions.csv'))
df_clean = pd.read_csv(os.path.join(PROJECT_PATH, 'data/negative_review_predictions-extremes.csv'))

In [6]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 2 columns):
text      97 non-null object
labels    97 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.6+ KB


## Data Preparation
1. Tokenize words.
2. Concatenate all reviews into one long list.
3. Create vocabulary from text.
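The three steps above can be sketched on a toy input before running them on the real data. The two reviews below are made up for illustration, and the `toy_` names are not used elsewhere in the notebook.

```python
# Two made-up, already-cleaned reviews (placeholders, not project data).
toy_reviews = ['food okay service slow', 'food cold order wrong']

# 1. Tokenize each review on whitespace.
toy_tokens = [review.split() for review in toy_reviews]

# 2. Concatenate all reviews into one long list of words.
toy_corpus = [word for tokens in toy_tokens for word in tokens]

# 3. The vocabulary is the set of distinct words.
#    Sorting gives a reproducible order; a bare set() has no guaranteed order.
toy_vocab = sorted(set(toy_corpus))

print(len(toy_corpus), len(toy_vocab))
```

Sorting the vocabulary is a choice made for this sketch: the row and column order of any matrix built from it then stays the same from run to run.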


In [0]:
df_clean['tokens'] = df_clean['text'].apply(lambda x: x.split())

In [8]:
df_clean.loc[:5, 'tokens']

0    [food, okay, bit, underwhelming, honest, servi...
1    [take, husband, celebrate, birthday, adult, ki...
2    [come, happy, hour, Friday, order, food, come,...
3    [average, food, wait, table, see, table, keep,...
4    [try, Lasagna, cream, pasta, recommend, flavor...
5    [award, unfriendlist, restaurant, deli, go, pl...
Name: tokens, dtype: object

In [0]:
corpus = []
for review in df_clean['tokens'].values:
  corpus += review

In [10]:
print(f'The corpus of negative reviews contains {len(corpus)} words.')

The corpus of negative reviews contains 6257 words.


In [0]:
vocab = list(set(corpus))

In [0]:
# # vocab.remove('good')
# corpus = [x for x in corpus if x != 'good']

In [13]:
print(f'The corpus vocabulary size is {len(vocab)} words.')

The corpus vocabulary size is 1636 words.


## Adjacency
1. For each word in the corpus, count which words appear in the window of four words that follows it.

In [0]:
df_adjacency = pd.DataFrame(columns=vocab, index=vocab, data=0)

In [0]:
for i, word in enumerate(corpus):
  # Neighbors are the (up to) four words that follow position i;
  # the slice is simply shorter near the end of the corpus.
  neighbors = corpus[i+1: i+5]
  if neighbors:
    df_adjacency.loc[word, neighbors] = df_adjacency.loc[word, neighbors] + 1

In [16]:
df_adjacency.shape

(1636, 1636)
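The window logic can be checked by hand on a toy corpus. This is a sketch using `collections.Counter` instead of a DataFrame; the toy words and the names `WINDOW` and `pair_counts` are assumptions made for illustration. How repeated words inside one window should be counted is a design choice, and this sketch counts every occurrence.

```python
from collections import Counter

WINDOW = 4  # number of following words treated as neighbors

toy_corpus = ['food', 'cold', 'order', 'wrong', 'food', 'late']

pair_counts = Counter()
for i, word in enumerate(toy_corpus):
    # Look only forward: the next WINDOW words after position i.
    for neighbor in toy_corpus[i + 1: i + 1 + WINDOW]:
        pair_counts[(word, neighbor)] += 1

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Because only following words are counted, the counts are directional: `('food', 'late')` is recorded but `('late', 'food')` is not. Note also that the notebook's loop runs over the concatenated corpus, so a window near the end of one review reaches into the start of the next.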

## Text Rank
1. Calculate TextRank using networkx: build a graph from the adjacency matrix and score each word with PageRank.

In [0]:
# from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement.
nx_words = nx.from_numpy_array(df_adjacency.values)
ranks = nx.pagerank(nx_words, alpha=0.85, tol=1e-8)

In [0]:
ranked = sorted(((ranks[i], s) for i, s in enumerate(vocab)), reverse=True)

In [19]:
print(ranked[:5])

[(0.010964248987369706, 'food'), (0.010367582941851236, 'order'), (0.009453637391546069, 'good'), (0.009319615825433877, 'place'), (0.008587368792391616, 'come')]
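For intuition about what the PageRank scores represent, here is a simplified power-iteration sketch in plain Python. It is not networkx's implementation; the function `pagerank`, the dict-of-dicts graph format, and the toy weights are assumptions made for illustration. It assumes every neighbor also appears as a key of the graph.

```python
def pagerank(weights, alpha=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank on a dict-of-dicts weighted graph (illustrative)."""
    nodes = list(weights)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_weight = {node: sum(weights[node].values()) for node in nodes}

    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        # Each node passes its rank to neighbors in proportion to edge weight.
        for node in nodes:
            for neighbor, w in weights[node].items():
                new_rank[neighbor] += alpha * rank[node] * w / out_weight[node]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy symmetric co-occurrence weights (made up for this sketch).
toy_graph = {
    'food': {'cold': 1, 'order': 2, 'late': 1},
    'cold': {'food': 1},
    'order': {'food': 2, 'wrong': 1},
    'late': {'food': 1},
    'wrong': {'order': 1},
}

scores = pagerank(toy_graph)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{word}: {score:.4f}')
```

Words that co-occur with many other words, or that sit next to other highly ranked words, collect more score. This is consistent with frequent, well-connected words such as 'food' and 'order' ranking at the top of the results above.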
