In [1]:
#skip_extension allows for placement of %%skip True at cell beginnings to skip cell execution, even if cell is "run"
%load_ext skip_extension
import warnings
warnings.filterwarnings('ignore')

## Project Description & Hypothesis

This notebook uses the [Yelp Dataset](https://www.yelp.com/dataset/challenge) to explore whether the language used in restaurant reviews can be used to classify the price range of the restaurants under review. Restaurants on Yelp often have a number of \\$ signs associated with them, indicating the approximate price per person. This price range attribute runs from 1 to 4 (\\$-\\$\\$\\$\\$), with 1 representing a cost of < \\$10, 2 representing a cost of \\$11-\\$30, 3 representing a cost of \\$31-\\$60, and 4 representing a cost of > \\$61.

In other words, this notebook hypothesizes that the language used in reviews for restaurants of lower price ranges differs from that used in reviews for restaurants of higher price ranges. Reviewers may describe their experiences at pricier restaurants differently than they do for less pricey restaurants.

**Hypothesis**: Lexical features extracted from restaurant reviews can provide information that allows for the differentiation between reviews on restaurants of different price ranges. These features can be used to classify the price range of the restaurant being reviewed.

**Sub-hypothesis**: Furthermore, controlling for the star rating of the review can help improve classification power, as the model will try to find patterns based more on price range-specific language and less on sentiment-specific language (as reflected by the star rating assigned). For example, it will be better able to discern between a 4-star review for a \\$ restaurant and a 4-star review for a \\$\\$\\$ restaurant.

A defining feature of this notebook is its heavy usage of interactive controls, [```ipywidgets```](https://ipywidgets.readthedocs.io/en/latest/), for data exploration and modeling. In many cases, cells will contain numerous capabilities and customization options through their various widgets. Given the size of the dataset and the computational burden associated with generating certain visualizations of this data, several visualizations generated throughout the process of this project are embedded within the notebook itself. There are also interactive code blocks that allow you to scroll through multiple visualizations generated by the same code block (pulling from the figures folder of this repository). The visualization code is always provided for those who would like to reproduce the visualizations, and all original figures are in the ```figures``` folder of this repository.

The raw data contains 6,685,900 reviews and 192,609 unique businesses.

### Class Label Remappings
The default class labels used in this notebook are the price ranges 1, 2, 3, and 4. However, several different remappings of the class labels were also explored to see how model performance was impacted. In addition to the default labeling scheme, three other labeling schemes were explored for a total of four:
1. **Default**: This labeling scheme took the price ranges as they are and used them as the target classes. 
2. **[1\\$ & 2\\$] vs. [3\\$ & 4\\$]**: This labeling scheme bifurcated the price ranges, grouping the two lower ranges, 1\\$ and 2\\$, together, and the two higher ranges, 3\\$ and 4\\$, together.
3. **[1\\$ & 2\\$] vs. [3\\$ & 4\\$] > [1\*-3\*] vs. [4\*-5\*]**: This remapping scheme creates four classes. The first class is reviews on restaurants in the price ranges 1\\$ and 2\\$ with star ratings of one to three stars (1\* to 3\*). The second class is for reviews on restaurants in the same lower price ranges but with star ratings of four or five stars (4\* to 5\*). The third and fourth classes are similar to these two, but they involve the higher two price ranges, 3\\$ and 4\\$.
4. **[1\\$ & 2\\$] vs. [3\\$ & 4\\$] > [1\*-2\*] vs. [4\*-5\*]**: This remapping scheme is very similar to the one above, but it removes reviews with a star rating of three (3\*) from consideration. It still creates four classes, but the two classes that previously contained reviews with star ratings from one to three stars (1\* to 3\*) now only contain reviews with star ratings of one or two stars (1\* to 2\*). Although this relabeling scheme results in lost data compared with the other schemes, it does remove noisy three-star reviews from the analysis.
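
The four schemes above can be sketched as a single mapping function. This is only an illustrative sketch: the function name, the scheme names, and the "1"/"2" labels for the binary scheme are assumptions made for this example, while the four-class numbering follows the order in which the classes are described above.

```python
def remap_label(price_range, stars, scheme="default"):
    """Map a (price_range, review_stars) pair to a class label.

    Returns None when the review is dropped under the chosen scheme.
    Scheme names are hypothetical shorthand for the four schemes above.
    """
    price = int(price_range)
    high_price = price >= 3  # 3$ and 4$ form the upper price group

    if scheme == "default":
        return str(price)
    if scheme == "binary":
        return "2" if high_price else "1"

    if scheme == "stars_1to3":
        low_star_max = 3
    elif scheme == "stars_1to2":
        low_star_max = 2
    else:
        raise ValueError("unknown scheme: {}".format(scheme))

    if stars >= 4:
        high_star = True
    elif stars <= low_star_max:
        high_star = False
    else:
        return None  # 3* reviews are excluded under "stars_1to2"

    # 1 = low price/low stars, 2 = low price/high stars,
    # 3 = high price/low stars, 4 = high price/high stars
    return str(1 + 2 * high_price + high_star)


print(remap_label("2", 3, "stars_1to3"))  # lower price group, 1*-3*
print(remap_label("4", 3, "stars_1to2"))  # three-star review is dropped
```

Returning `None` for excluded reviews makes it easy to filter them out afterwards with a not-null check.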

## Data Acquisition

This notebook utilizes the [Yelp Dataset](https://www.yelp.com/dataset/challenge) that is publicly available for download to researchers and academics. Although publicly available, the dataset comes with limitations regarding its distribution under the *Yelp Dataset Terms of Use*. Therefore, the data necessary for executing this notebook is not stored publicly in the repository associated with this notebook. If you would like to replicate the steps in this notebook, please download the dataset from the aforementioned link.

The dataset is split between five JSON files: 
- ```business.json```, which provides information on the businesses in the dataset (e.g., location, name, category)
- ```review.json```, which has the text reviews, the review star ratings, and other review-related information
- ```user.json```, which contains user/reviewer-related metadata
- ```checkin.json```, which provides data on the various checkins at the businesses
- ```tip.json```, which contains the tip text written by users

Given the focus of the hypothesis, this notebook only makes use of the ```business.json``` and ```review.json``` data. The ```user.json``` data was also ingested for future exploration of user metadata, but this data was not used in this project. 

### Data Ingestion

The Yelp dataset was stored locally using PostgreSQL. Custom Python scripts were used to extract the data from the JSON files and load it into the database. Below, the custom modules ```Yelp_DB_Maker``` and ```Yelp_Data_Importer``` are imported and used for data ingestion. The scripts containing these modules are included along with this notebook in the repository for ease of reproduction. Following the **Imports** cell below, you will need to enter your PostgreSQL credentials along with the path to where you have locally stored the Yelp data. The database name is set through the ```dbname``` variable in that cell; feel free to change it to whatever name you like. The data ingestion code includes progress bars to keep you apprised of ingestion progress.
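
For readers who want a feel for the extraction step, here is a minimal sketch of reading records in batches. It assumes each dataset file stores one JSON object per line; that layout, the helper name `iter_json_lines`, and the sample records are assumptions for illustration, and this is not the code in ```Yelp_Data_Importer```.

```python
import io
import json


def iter_json_lines(handle, batch_size=1000):
    """Yield lists of parsed records from a file with one JSON object per line."""
    batch = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly partial, batch


# Tiny in-memory stand-in for a dataset file (field values are made up)
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.0}\n'
    '{"business_id": "b2", "stars": 3.5}\n'
    '{"business_id": "b3", "stars": 5.0}\n'
)
batches = list(iter_json_lines(sample, batch_size=2))
print([len(b) for b in batches])
```

Reading in batches keeps memory use bounded, which matters for a file as large as ```review.json```; each batch could then be passed to a bulk insert.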

### Imports

In [2]:
import psycopg2
import os
import json
import pickle
import Yelp_DB_Maker
import Yelp_Data_Importer
import pandas as pd

In [3]:
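dbname = 'yelp2'
username = 'postgres'
password = '<your-postgres-password>'  # placeholder: enter your own password and never commit it
dataset_path = '/path/to/yelp_data'    # placeholder: folder containing the Yelp JSON files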

In [None]:
try:
    conn = psycopg2.connect('dbname={} user={} password={}'.format(dbname, username, password))
    conn.set_session(autocommit=True)
    print('Connection to database established.')

except psycopg2.Error:
    print('Database does not exist. Creating database now.')
    conn = psycopg2.connect('dbname={} user={} password={}'.format('postgres', username, password))
    cur = conn.cursor()
    conn.set_session(autocommit=True)
    cur.execute('CREATE DATABASE {}'.format(dbname))
    cur.close()
    conn.close()
    conn = psycopg2.connect('dbname={} user={} password={}'.format(dbname, username, password))
    print('Database created.')
    conn.set_session(autocommit=True)

In [None]:
datafiles = []
for file in os.listdir(dataset_path):
    if len(file.split('.')) >= 2 and file.split('.')[-1].lower() == 'json':
        datafiles.append(file)

print('Available Data Files:\n')
for file in datafiles:
    print('\t' + file)

In [None]:
Yelp_DB_Maker.YelpDBMaker(conn, datafiles).create()
Yelp_Data_Importer.YelpDataImporter(conn, datafiles, dataset_path).populate()

In [None]:
cur = conn.cursor()
cur.execute("""
    SELECT business.business_id, categories, business.stars, price_range, review.review_id, review.stars, review_text, user_info.user_id, elite, average_stars
    FROM business JOIN review ON business.business_id = review.business_id JOIN user_info ON review.user_id = user_info.user_id
""")

cols = ['business_id', 'categories', 'business_stars', 'price_range', 'review_id', 'review_stars',
        'review_text', 'user_id', 'elite', 'user_average_stars']

data = pd.DataFrame(cur.fetchall(), columns=cols)
cur.close()

In [None]:
%%skip True
with open(os.path.join(dataset_path, 'raw_data.pkl'), 'wb') as f:
    pickle.dump(data,f)

In [None]:
%%skip True
#Load the dataframe from pickle file
with open(os.path.join(dataset_path, 'raw_data.pkl'), 'rb') as f:
    data = pickle.load(f)

## Clean the Data for Further Exploration
Arguably, text data requires more purposeful "cleaning" than many other kinds of data (e.g., numerical). Raw text cannot be fed into models; it must first be vectorized, that is, given a mathematical representation that the model can work with. This section of the notebook is devoted to preparing the text data for modeling and to cleaning the data in other ways.

### Process Review Text

The text preprocessing for this project largely relies on [```NLTK```](https://www.nltk.org). Generic English stopwords are removed, although the preprocessing code purposefully retains words that are typically treated as stopwords but convey negativity (e.g., don't, not, but, won't). These words are likely important for capturing the sentiment of reviews and are therefore retained. The preprocessing code below tokenizes the data and offers the option of also lemmatizing the tokens. In the context of this project, distinct data columns containing non-lemmatized tokens and lemmatized tokens were created to compare the effect of lemmatization on model performance.

Given that preprocessing so many reviews is time intensive, progress/loading bars are built into the code. It is recommended that the tokenized data be pickled after completion of the preprocessing.
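
The idea of keeping negation words while dropping other stopwords can be illustrated with a small, self-contained sketch. The stop list here is a made-up miniature (the notebook itself uses NLTK's full English list), and `simple_tokenize` is a hypothetical helper rather than the class defined below.

```python
import re

# Hypothetical miniature stop list; NLTK's real list is much longer.
GENERIC_STOPS = {"the", "was", "and", "it", "is", "not", "but", "don't"}
NEGATIONS = {"not", "but", "don't", "won't", "no"}

# Take the negation words out of the stop list so they survive filtering
STOPS = GENERIC_STOPS - NEGATIONS


def simple_tokenize(text):
    """Lowercase, extract word tokens (keeping an inner apostrophe), drop stop words."""
    tokens = re.findall(r"\w+'?\w+", text.lower())
    return [t for t in tokens if t not in STOPS]


print(simple_tokenize("The food was good but it was not cheap"))
```

Note that the pattern requires at least two word characters, so single-letter tokens never make it into the output.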

### Imports & NLTK Downloads

In [4]:
from tqdm.notebook import tqdm_notebook
import nltk
nltk.download('stopwords')
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
import re
import numpy as np

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alice.naghshineh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alice.naghshineh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/alice.naghshineh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
class YelpPreProcess():
    
    def __init__(self, df, lemmatize=False, tokenize=True):
        self.df = df
        self.tokenize = tokenize
        self.lem = lemmatize
        
    def create_stop_words(self):
        stops = nltk.corpus.stopwords.words('english')

        neg_stops = [
            'no', 'nor', 'not', 'don', "don't",
            'ain', 'aren', "aren't", 'couldn',
            "couldn't", 'didn', "didn't", 'doesn',
            "doesn't", 'hadn', "hadn't", 'hasn',
            "hasn't", 'haven', "haven't", 'isn',
            "isn't", 'mightn', "mightn't", 'mustn',
            "mustn't", 'needn', "needn't", 'shan',
            "shan't", 'shouldn', "shouldn't", 'wasn',
            "wasn't", 'weren', "weren't", "won'",
            "won't", 'wouldn', "wouldn't", 'but',
            "don'", "ain't"
        ]

        common_nonneg_contr = [
            "could've", "he'd", "he'd've", "he'll",
            "he's", "how'd", "how'll", "how's",
            "i'd", "i'd've", "i'll", "i'm",
            "i've", "it'd", "it'd've", "it'll",
            "it's", "let's", "ma'am", "might've",
            "must've", "o'clock", "'ow's'at",
            "she'd", "she'd've", "she'll", "she's",
            "should've", "somebody'd", "somebody'd've",
            "somebody'll", "somebody's", "someone'd",
            "someone'd've", "someone'll", "someone's",
            "something'd", "something'd've", "something'll",
            "something's", "that'll", "that's", "there'd",
            "there'd've", "there're", "there's", "they'd",
            "they'd've", "they'll", "they're", "they've",
            "'twas", "we'd", "we'd've", "we'll",
            "we're", "we've", "what'll", "what're",
            "what's", "what've", "when's", "where'd",
            "where's", "where've", "who'd", "who'd've",
            "who'll", "who're", "who's", "who've",
            "why'll", "why're", "why's", "would've",
            "y'all", "y'all'll", "y'all'd've", "you'd",
            "you'd've", "you'll", "you're", "you've"
        ]

        letters = [
            'a', 'b', 'c', 'd', 'e', 'f', 
            'g', 'h', 'i', 'j', 'k', 'l', 
            'm', 'n', 'o', 'p', 'q', 'r', 
            's', 't', 'u', 'v', 'w', 'x', 
            'y', 'z'
        ]

        ranks = ['st', 'nd', 'rd', 'th']

        #Make sure negative words are not being retained in the original stops list
        stops = [x for x in stops if x not in neg_stops]

        stops = stops + common_nonneg_contr + letters + ranks + [""] + ['us'] + [''] + ["'"] + ["'s"]
        stops = list(set(stops))
        return stops
    
    def clean_and_tokenize(self, text):
        text = text.lower()
        tokenizer = nltk.RegexpTokenizer(r"\w+'?\w+")
        filtered_tokens = [(re.sub(r"[^A-Za-z']", '', token)) for token in tokenizer.tokenize(text)]
        stops = self.create_stop_words()
        tokens = [token for token in filtered_tokens if token not in stops]
        tokens = [re.sub("'s", '', token) for token in tokens]
        return tokens
        
               
    def wordnet_lemmatize(self, tokens_list):
        wnl = nltk.WordNetLemmatizer()
        tag_dict = {"a": wordnet.ADJ,
                    "n": wordnet.NOUN,
                    "v": wordnet.VERB,
                    "r": wordnet.ADV}
        tokens = [wnl.lemmatize(token, pos=tag_dict.get(nltk.pos_tag([token])[0][1][0].lower(), wordnet.NOUN)) 
                  for token in tokens_list]
        return tokens        
    
    def process_text(self):
        if self.tokenize:
            tqdm_notebook.pandas(desc='Tokenization Progress')
            self.df['review_tokens'] = self.df['review_text'].progress_apply(lambda x: self.clean_and_tokenize(x))
        if self.lem:
            tqdm_notebook.pandas(desc='Lemmatization Progress')
            self.df['lemmatized_tokens'] = self.df['review_tokens'].progress_apply(lambda x: self.wordnet_lemmatize(x))
        return self.df

In [None]:
data = YelpPreProcess(df=data, tokenize=True, lemmatize=True).process_text()

In [None]:
#Let's make sure the tokenization process worked properly by observing a subset of the new dataframe.
for i in np.random.randint(low=1, high=1000, size=3):
    print('Original review text:\n')
    print(data.loc[i, 'review_text'])
    print('\n')
    print('Review tokens:\n')
    print(data.loc[i, 'review_tokens'])
    print('\n')

### Other Data Cleaning & Feature Engineering

This section accomplishes several goals towards finalizing the dataset for modeling:

- **Remove non-English reviews**: Unfortunately, the Yelp dataset contains thousands of non-English text reviews. These non-English reviews make up a small fraction of the total reviews, but they are filtered out to focus on English reviews only. This filtering process relies on the [```langid.py```](https://github.com/saffsd/langid.py) language identification tool to identify the language of the reviews. Only those receiving a classification of 'en' were retained.
- **Create restaurant dummy variable**: Although the Yelp dataset contains reviews on businesses of various categories, the large majority of reviews are on restaurants. To control for the potentially confounding effect of business category on the language used in the reviews, and for business category's possible correlation with price range, the analysis in this notebook is eventually limited to restaurants only. Non-restaurants are removed from the analysis later in the notebook, but they are initially retained for visualization purposes. This dummy variable allows for their easy removal later. To create it, businesses whose 'categories' text contains 'Restaurants' (indicating that they are restaurants, potentially among other categorizations) were marked with a 1, and all others with a 0.
- **Construct variables for estimated review length and tokens count**: This notebook also explores the possibility that review length (also measured through the count of tokens in the review after tokenization) is associated with the reviewed businesses' price ranges. Do pricier/fancier restaurants tend to receive, on average, longer reviews?
- **Filter out reviews on restaurants without price range data**: Seeing as this notebook seeks to classify the price ranges of businesses under review, reviews on businesses without recorded price range data are removed from consideration.

### Imports

In [5]:
import langid

In [None]:
with open(os.path.join(dataset_path, 'tokenized_data.pkl'), 'rb') as f:
    data = pickle.load(f)

In [None]:
data['categories'] = data['categories'].fillna('')
data['is_restaurant'] = [1 if 'Restaurants' in x else 0 for x in data['categories']]
data['est_review_len'] = data['review_text'].apply(lambda x: len(x.split()))
data['tokens_len'] = data['review_tokens'].apply(lambda x: len(x))

In [None]:
tqdm_notebook.pandas(desc="Language Detection Progress")
data['language_detected'] = data['review_text'].progress_apply(lambda x: langid.classify(x)[0])

In [None]:
language_counts = dict(data['language_detected'].value_counts())
print('# of reviews classified as non-English: {}'.format(len(data)-language_counts['en']))

In [None]:
data = data[data['language_detected'] == 'en']
data = data[data['price_range'].notnull()]

# Data Exploration & Visualizations

This section of the notebook is dedicated to exploring the data largely through visualizations. The first sub-section provides code for visualizations that rely on libraries outside of ```Yellowbrick```. The second sub-section utilizes several different ```Yellowbrick``` visualizations.

## Non-Yellowbrick

In this sub-section, you will find code that generates multiple visualizations: an interactive set of bar graphs; several graphs looking at the relationship between review length, price ranges, and review ratings; word clouds; and a scatterplot confirming the high correlation between review length and review token count (this high correlation was later used as justification for including token count alone as a feature). The second-to-last code block is interactive and generates the top unigrams and bigrams (based on TF-IDF vectorization) that are correlated with the different class labels and their remappings. The section ends with a visualization looking at the problem of class imbalance within each of the four label remappings.

### Imports

In [6]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, Layout
from IPython.display import Image
import plotly.graph_objs as go
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
import collections
from wordcloud import WordCloud
from scipy.stats import pearsonr
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer

The ```dummy_fun``` function below is used multiple times throughout this notebook. It is primarily used within the text vectorizers to ensure that they do not perform any preprocessing (e.g., tokenization or stop word removal) on the already tokenized review data.

In [7]:
def dummy_fun(text):
    return text
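
As a quick illustration of why an identity function is useful here, the sketch below passes pre-tokenized toy reviews straight into a ```TfidfVectorizer```. The two example documents are made up, and `identity` simply plays the same role as ```dummy_fun``` above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the token list unchanged (same role as dummy_fun above)."""
    return tokens


# Two already-tokenized toy reviews (made up for illustration)
docs = [["not", "good", "service"], ["great", "food"]]

# Passing the identity function as both tokenizer and preprocessor tells the
# vectorizer to use the provided tokens as-is; token_pattern=None avoids the
# warning about an unused default pattern.
vec = TfidfVectorizer(analyzer="word", tokenizer=identity, preprocessor=identity,
                      token_pattern=None, ngram_range=(1, 2))
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
print(matrix.shape)
```

With `ngram_range=(1, 2)`, the vocabulary contains both the original tokens and bigrams built by joining adjacent tokens with a space.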

In [None]:
def plot_bar(column=['price_range', 'review_stars', 
                     'categories', 'price_range -groupby- review_stars', 
                     'review_stars -groupby- price_range', 'price_range -groupby- categories']):
    
    if column in ['price_range', 'review_stars', 'categories']:
        if column == 'price_range':
            filtered_data = dict(data[column].value_counts())
            labels_dict = {k:'{}$'.format(k) for k in list(filtered_data.keys())}
            ticktext= list(labels_dict.values())
            title_text = 'Price_Range'

        if column == 'review_stars':
            filtered_data = dict(data.loc[data['price_range'].notnull()][column].value_counts())
            labels_dict = {k:k*'*' for k in list(filtered_data.keys())}
            ticktext= list(labels_dict.values())
            title_text = 'Review_Stars'

        if column == 'categories':
            column = 'is_restaurant'
            filtered_data = dict(data.loc[data['price_range'].notnull()][column].value_counts())
            ticktext = ['Restaurants', 'Non-Restaurants']
            title_text = 'Categories'            
            column = 'business'
        
        x = list(filtered_data.keys())
        y = list(filtered_data.values())
        fig = go.Figure([go.Bar(x=x, y=y, marker={'color':y, 'colorscale': 'haline'})])

        fig.update_layout(title_text='{} Counts (for data with non-null price_range values)'.format(title_text),
            title_font_size=18,
            xaxis = dict(
            tickmode = 'array',
            tickvals = x,
            title = '{} categories'.format(column),
            tickfont=dict(size=12),
            ticktext=ticktext),
            yaxis = dict(
            tickmode = 'array',
            title = 'count',
            tickfont=dict(size=12)
            )
          )

    elif column == 'price_range -groupby- review_stars':
        price_ranges = ['1', '2', '3', '4']
        filtered_data = list(data.groupby(['price_range', 'review_stars']).size())
        fig = go.Figure(data=[
            go.Bar(name='*', x=price_ranges, y=[filtered_data[i] for i in range(0,20,5)]),
            go.Bar(name='**', x=price_ranges, y=[filtered_data[i] for i in range(1,20,5)]),
            go.Bar(name='***', x=price_ranges, y=[filtered_data[i] for i in range(2,20,5)]),
            go.Bar(name='****', x=price_ranges, y=[filtered_data[i] for i in range(3,20,5)]),
            go.Bar(name='*****', x=price_ranges, y=[filtered_data[i] for i in range(4,20,5)])
            ])
     
        fig.update_layout(barmode='stack',
                     title_text = 'Price_Range Counts Grouped by Review_Stars',
                     title_font_size = 18,
                     xaxis = dict(
                     tickmode = 'array',
                     tickvals = price_ranges,
                     title = 'price_range categories',
                     tickfont = dict(size=12),
                     ticktext=['1$', '2$', '3$', '4$']),
                     yaxis = dict(
                     title = 'count',
                     tickfont = dict(size=12)))
    
    elif column == 'review_stars -groupby- price_range':
        review_stars = ['1', '2', '3', '4', '5']
        filtered_data = list(data.groupby(['review_stars', 'price_range']).size())
        fig = go.Figure(data=[
            go.Bar(name='1$', x=review_stars, y=[filtered_data[i] for i in range(0,20,4)]),
            go.Bar(name='2$', x=review_stars, y=[filtered_data[i] for i in range(1,20,4)]),
            go.Bar(name='3$', x=review_stars, y=[filtered_data[i] for i in range(2,20,4)]),
            go.Bar(name='4$', x=review_stars, y=[filtered_data[i] for i in range(3,20,4)])
            ])

        fig.update_layout(barmode='stack',
                         title_text = 'Review_Stars Counts Grouped by Price_Range',
                         title_font_size = 18,
                         xaxis = dict(
                         tickmode = 'array',
                         tickvals = review_stars,
                         title = 'review_stars categories',
                         tickfont = dict(size=12),
                         ticktext = ['*', '**', '***', '****', '*****']),
                         yaxis = dict(
                         title = 'count',
                         tickfont = dict(size=12)))
        
    else:
        filtered_data = list(data.groupby(['price_range', 'is_restaurant']).size())
        price_ranges = ['1','2','3','4']
        
        fig = go.Figure(data=[
            go.Bar(name='Restaurant', x=price_ranges, y=[filtered_data[i] for i in range(1,8,2)]),
            go.Bar(name='Non-Restaurant', x=price_ranges, y=[filtered_data[i] for i in range(0,8,2)])
            ])
        
        fig.update_layout(barmode='stack',
                         title_text = 'Price_Range Counts Grouped by Category (Restaurant vs. Non-Restaurant)',
                         title_font_size = 18,
                         xaxis = dict(
                         tickmode = 'array',
                         tickvals = price_ranges,
                         title = 'price_range categories',
                         tickfont = dict(size=12),
                         ticktext = ['1$', '2$', '3$', '4$']),
                         yaxis = dict(
                         title = 'count',
                         tickfont = dict(size=12)))        
        
    fig.show()
    
interact_manual(plot_bar);

In [8]:
@interact
def show_bar_plots(file=os.listdir('figures/bar_plots')):
    display(Image(os.path.join('figures/bar_plots/', file)))


In [None]:
fig = px.histogram(data[data['price_range'].notnull()], x='est_review_len')
fig.update_layout(title = 'Review Length Histogram (for data with non-null price_range values)',
                 xaxis = dict(title='review length'),
                 yaxis = dict(title='count'))
fig.show()

<img src="figures/review_length_figs/review_length_histogram.png" width="500" height="300">

In [None]:
fig = go.Figure()
price_ranges = ['1', '2', '3', '4']

for price_range in price_ranges:
    fig.add_trace(go.Box(x=data['price_range'][data['price_range'] == price_range],
                            y=data['est_review_len'][data['price_range'] == price_range],
                            name='{}$'.format(price_range)))

fig.update_layout(title = 'Review Length Boxplots By Price Range',
                 xaxis = dict(showticklabels=False),
                 yaxis = dict(title='review length',
                             tickfont = dict(size=12)))
fig.show()

<img src="figures/review_length_figs/review_length_price_range_boxplots.png" width="500" height="300">

In [None]:
plt.figure(figsize=(12,6))
for price_range in ['1', '2', '3', '4']:
    ax = sns.kdeplot(data[data['price_range'] == price_range]['est_review_len'], 
                     shade=False, label='{}$'.format(price_range))

plt.title('Review Length Density Plot for Price Ranges', size=18)
plt.xlabel('review length', size=14)
plt.ylabel('density', size=14)
plt.legend(title ='price_range', fontsize=12, title_fontsize=12);

<img src="figures/review_length_figs/review_length_kdeplot_by_price_range.png" width="500" height="300">

In [None]:
fig, axes = plt.subplots(2,2, sharex=True, sharey=True, figsize=(10,8))
for ax, price_range in list(zip(axes.flatten(), ['1','2','3','4'])):
    ax.set_xticks([0,250,500,750,1000])
    ax.set_title('price_range = {}'.format(price_range+'$'))
    for star in [1,2,3,4,5]:
        sns.kdeplot(data[(data['price_range'] == price_range) & (data['review_stars'] == star)]['est_review_len'], 
                  shade=False, ax=ax, label='{}'.format(star*'*'))
        
fig.suptitle('Review Length Density Plots for Each Price Range Broken Down by Star Rating', fontsize=14)
fig.text(0.5, 0.04, 'Review Length', ha='center', size=14)
fig.text(0.04, 0.5, 'Density', va='center', rotation='vertical', size=14);    

<img src="figures/review_length_figs/review_length_kdeplots_by_price_range_review_stars.png" width="600" height="400">

In [None]:
data_sample = data.sample(n=5000)
fig = go.Figure(data=go.Scatter(x=data_sample['est_review_len'],
                                y=data_sample['tokens_len'],
                                mode='markers',
                                marker_color=data_sample['tokens_len'])) 

fig.update_layout(title='Review Length vs. Token Count',
                  xaxis=dict(title='Review Length'),
                  yaxis=dict(title='Token Count'))
fig.show()

<img src="figures/review_len_token_count_scatter.png" width="600" height="400">

In [None]:
#These two features are highly correlated! They can be used interchangeably within the model.
corr, _ = pearsonr(data['est_review_len'], data['tokens_len'])
corr
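
For reference, the coefficient returned by ```pearsonr``` is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers; the helper name `pearson_r` and the sample values are illustrative assumptions, not results from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Compute the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up word counts and token counts for four hypothetical reviews
word_counts = [10, 20, 30, 40]
token_counts = [5, 11, 14, 21]
r = pearson_r(word_counts, token_counts)
print(round(r, 3))
```

A value close to 1 indicates that the two measures move together almost linearly, which is the sense in which one of them can stand in for the other as a feature.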

In [None]:
def create_wc(price_range=['1$', '2$', '3$', '4$', 'All Price Ranges'], num_words = (1,1000)):
    price_range_dict = {'1$': '1', '2$': '2', '3$': '3', '4$': '4'}
    if price_range != 'All Price Ranges':
        tokenized_review_list = data[data['price_range'] == price_range_dict[price_range]]['review_tokens'].tolist()
    else:
        tokenized_review_list = data['review_tokens'].tolist()
    tokens_list = [token for tokenized_review in tokenized_review_list for token in tokenized_review]
    c = collections.Counter()
    c.update(tokens_list)
    wc = WordCloud(background_color="white", height=700, width=1000).generate_from_frequencies(dict(c))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    return plt.show()

interact_manual(create_wc);

In [9]:
files = [f for f in os.listdir('figures/word_clouds') if os.path.splitext(f)[1].lower() == '.png']
@interact
def show_word_clouds(file=files):
    display(Image(os.path.join('figures/word_clouds/', file)))


In [None]:
def find_correlated_grams(token_type = ['Non-Lemmatized', 'Lemmatized'], 
                          remap_type = ['Default', '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]', 
                                          '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]']):
    
    if 'remapped_labels' in data.columns:
        data.drop(columns=['remapped_labels'], inplace=True)    
    
    tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun, 
                            token_pattern=None, ngram_range = (1,2))
    
    if token_type == 'Non-Lemmatized':
        file_name = 'non_lemmatized_correlated_grams_'
        tokens = 'review_tokens'
    else:
        file_name = 'lemmatized_correlated_grams_'
        tokens = 'lemmatized_tokens'
    
    if remap_type == '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':
        data.loc[(data['price_range'].isin(['1','2'])) &
                         (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '1'
        data.loc[(data['price_range'].isin(['1','2'])) &
                         (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
        data.loc[(data['price_range'].isin(['3','4'])) &
                         (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '3'
        data.loc[(data['price_range'].isin(['3','4'])) &
                         (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'  
        file_name += 'remap1.pkl'
        
    elif remap_type == '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]':
        data.loc[(data['price_range'].isin(['1','2'])) &
                         (data['review_stars'].isin([1,2])), 'remapped_labels'] = '1'
        data.loc[(data['price_range'].isin(['1','2'])) &
                         (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
        data.loc[(data['price_range'].isin(['3','4'])) &
                         (data['review_stars'].isin([1,2])), 'remapped_labels'] = '3'
        data.loc[(data['price_range'].isin(['3','4'])) &
                         (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
        file_name += 'remap2.pkl' 
        
    else:
        data['remapped_labels'] = data['price_range']
        file_name += 'default.pkl'
        
    remapped_data = data[data['remapped_labels'].notnull()]    
    features = tfidf.fit_transform(remapped_data[tokens])
    corr_grams = pd.DataFrame(columns=['class_name', 'unigrams', 'bigrams'])

    class_names_dict = {'Default': {'1':'1$', '2':'2$', '3':'3$', '4':'4$'},
                        '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':{'1':'[1$ & 2$] > [1*-3*]',
                                                                     '2':'[1$ & 2$] > [4*-5*]',
                                                                     '3':'[3$ & 4$] > [1*-3*]',
                                                                     '4':'[3$ & 4$] > [4*-5*]'},
                        '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]':{'1':'[1$ & 2$] > [1*-2*]',
                                                                     '2':'[1$ & 2$] > [4*-5*]',
                                                                     '3':'[3$ & 4$] > [1*-2*]',
                                                                     '4':'[3$ & 4$] > [4*-5*]'}}
    N = 10
    for class_label in ['1', '2', '3', '4']:
        class_name = class_names_dict[remap_type][class_label]
        features_chi2 = chi2(features, remapped_data['remapped_labels'] == class_label)
        indices = np.argsort(features_chi2[0])
        feature_names = np.array(tfidf.get_feature_names())[indices]
        unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
        bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
        class_label_row = pd.DataFrame([[class_name, unigrams[-N:], bigrams[-N:]]],
                                      columns=['class_name', 'unigrams', 'bigrams'])
        corr_grams = corr_grams.append(class_label_row, ignore_index=True)
        print("Class {}".format(class_name))
        print("  . Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
        print("  . Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))
    
    with open(os.path.join(dataset_path, file_name), 'wb') as f:
        pickle.dump(corr_grams, f)
        
interact_manual(find_correlated_grams);

In [None]:
@interact
def see_correlated_grams(df_file=['non_lemmatized_correlated_grams', 'lemmatized_correlated_grams']):
    pd.set_option('max_colwidth', 800)
    with open(os.path.join('data/correlated_grams', df_file+'.pkl'), 'rb') as f:
        df = pickle.load(f)
    return df

In [None]:
fig, axes = plt.subplots(2,2, sharex=False, sharey=True, figsize=(13,10))
price_range_counts = dict(data['price_range'].value_counts())
price_range_stars_counts = list(data.groupby(['price_range', 'review_stars']).size())
for ax, remap_type in list(zip(axes.flatten(), ['Default', '[1$ & 2$] vs. [3$ & 4$]',
                                                '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]',
                                                '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]'])):
    
    ax.set_title('{}'.format(remap_type))
    if remap_type == 'Default':
        x = ['1$', '2$', '3$', '4$']
        y = [price_range_counts['1'], price_range_counts['2'], price_range_counts['3'], price_range_counts['4']]
    elif remap_type == '[1$ & 2$] vs. [3$ & 4$]':
        x = ['[1$ & 2$]', '[3$ & 4$]']
        y = [price_range_counts['1'] + price_range_counts['2'], price_range_counts['3'] + price_range_counts['4']]
    elif remap_type == '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':
        x = ['[1$ & 2$] > [1*-3*]', '[1$ & 2$] > [4*-5*]','[3$ & 4$] > [1*-3*]','[3$ & 4$] > [4*-5*]']
        y = [sum(price_range_stars_counts[0:3])+sum(price_range_stars_counts[5:8]),
             sum(price_range_stars_counts[3:5])+sum(price_range_stars_counts[8:10]),
             sum(price_range_stars_counts[10:13])+sum(price_range_stars_counts[15:18]),
             sum(price_range_stars_counts[13:15])+sum(price_range_stars_counts[18:20])]
    else:
        x = ['[1$ & 2$] > [1*-2*]', '[1$ & 2$] > [4*-5*]','[3$ & 4$] > [1*-2*]','[3$ & 4$] > [4*-5*]']
        y = [sum(price_range_stars_counts[0:2])+sum(price_range_stars_counts[5:7]),
             sum(price_range_stars_counts[3:5])+sum(price_range_stars_counts[8:10]),
             sum(price_range_stars_counts[10:12])+sum(price_range_stars_counts[15:17]),
             sum(price_range_stars_counts[13:15])+sum(price_range_stars_counts[18:20])]
        
    sns.barplot(x=x, y=y, ax=ax)
    
fig.suptitle('Class Imbalance Across Different Label Remappings', fontsize=14)
fig.text(0.5, 0.04, 'Target Classes', ha='center', size=12)
fig.text(0.04, 0.5, 'Count', va='center', rotation='vertical', size=12);

<img src="figures/class_imbalance.png" width="700" height="400" align='middle'>

## Yellowbrick Visualizations

The Yellowbrick Visualizations sub-section makes use of three Text Modeling Visualizers from [```Yellowbrick```](https://www.scikit-yb.org/en/latest/index.html).

These three Text Modeling Visualizers are:
- [Token Frequency Distribution](https://www.scikit-yb.org/en/latest/api/text/freqdist.html)
- [t-SNE Corpus Visualization](https://www.scikit-yb.org/en/latest/api/text/tsne.html)
- [UMAP Corpus Visualization](https://www.scikit-yb.org/en/latest/api/text/umap_vis.html)

The code for the Token Frequency Distribution is interactive and lets you look at the top token counts for all price ranges together or for each price range individually. The t-SNE and UMAP code, which relies on TF-IDF vectorization, is also interactive: it lets you look at the clustering behavior of the review documents under each label remapping scheme, for different ngram ranges, and for lemmatized vs. non-lemmatized tokens.
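As a rough illustration of what TF-IDF vectorization does to already-tokenized reviews, here is a minimal plain-Python sketch. It assumes the smoothed IDF and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the helper name `tfidf_rows` and the toy documents are made up for this example and are not part of the notebook's pipeline.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """TF-IDF with smoothed IDF and L2-normalized rows for pre-tokenized docs."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {tok: count * idf[tok] for tok, count in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({tok: v / norm for tok, v in raw.items()})
    return rows

# Hypothetical token lists standing in for the 'review_tokens' column.
toy_docs = [['cheap', 'tacos', 'great'], ['tasting', 'menu', 'great'], ['cheap', 'beer']]
rows = tfidf_rows(toy_docs)
print(sorted(rows[0].items(), key=lambda kv: -kv[1]))
```

Tokens that appear in fewer documents (here `tacos`) receive a larger weight than tokens shared across documents (here `great`), which is the property the t-SNE and UMAP projections depend on.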

### Imports

In [None]:
from yellowbrick.text import FreqDistVisualizer
from yellowbrick.text import TSNEVisualizer
from yellowbrick.text import UMAPVisualizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def token_freq_dist(price_range=['1$', '2$', '3$', '4$', 'All Price Ranges'], token_num = (1,75)):

    vectorizer = CountVectorizer(
    tokenizer = dummy_fun,
    preprocessor= dummy_fun,
    token_pattern=None)
    
    price_range_dict = {'1$': '1', '2$': '2', '3$': '3', '4$': '4'}
    if price_range != 'All Price Ranges':
        docs = vectorizer.fit_transform(data[data['price_range'] == price_range_dict[price_range]]['review_tokens'])
    else:
        docs = vectorizer.fit_transform(data['review_tokens'])

    features = vectorizer.get_feature_names()
    visualizer = FreqDistVisualizer(features=features, size=(1080, 720), n=token_num, orient='h',
                                   title='Frequency Distribution of Top {} Tokens for {}'.format(token_num,price_range))
    visualizer.fit(docs)
    visualizer.poof()
    
interact_manual(token_freq_dist);

In [None]:
files = [f for f in os.listdir('figures/token_freq_distributions') if f.endswith('.png')]
@interact
def show_token_distrib(file=files):
    display(Image(os.path.join('figures/token_freq_distributions/', file)))

In [None]:
dropdowns_dict={'Remap Price Range Only':['Default', '[1$ & 2$] vs. [3$ & 4$]', '[1$ & 2$] vs. [4$]'], 
           'Remap Price Range with Review Stars':['[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]', 
                                                  '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]']}

remap_widget = widgets.Dropdown(options = list(dropdowns_dict.keys()))
scheme_widget = widgets.Dropdown()
tokens_widget = widgets.Dropdown(options = ['Non-Lemmatized', 'Lemmatized'])
ngram_widget = widgets.IntSlider(min=1, max=5)
n_widget = widgets.IntSlider(min=500, max=20000, step=500)
vis_widget = widgets.Dropdown(options=['t-SNE', 'UMAP'])

def update(*args):
    # Keep the scheme dropdown in sync with the selected remap type
    scheme_widget.options = dropdowns_dict[remap_widget.value]
remap_widget.observe(update, names='value')
update()  # populate the scheme dropdown before the first interaction

def remap_labels(remap_type, scheme, tokens_type, ngram_range, n_documents, visualization):
    def remap(remap_dict, value):
        return remap_dict.get(value, None)

    if 'remapped_labels' in data.columns:
        data.drop(columns=['remapped_labels'], inplace=True)
     
    if remap_type == 'Remap Price Range Only':
        
        if scheme == 'Default':
            data['remapped_labels'] = data['price_range']
            title = '{} Projection of {} Reviews\n(all price ranges)\n(ngram range: 1 - {})'.format(visualization, n_documents, ngram_range)
        
        elif scheme == '[1$ & 2$] vs. [3$ & 4$]':
            remap_dict = {'1':'1', '2':'1', '3':'2', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            title = '{} Projection of {} Reviews\n([1\\$ & 2\\$] vs. [3\\$ & 4\\$])'.format(visualization, n_documents)
            
        else:
            remap_dict = {'1':'1', '2':'1', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            title = '{} Projection of {} Reviews\n([1\\$ & 2\\$] vs. [4\\$])'.format(visualization, n_documents)
            
    else:
        if scheme == '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':
            data.loc[(data['price_range'].isin(['1','2'])) &
                     (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                     (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                     (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                     (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'  
            title = '{} Projection of {} Reviews\n([1\\$ & 2\\$] vs. [3\\$ & 4\\$] > [1*-3* & 4*-5*])'.format(visualization, n_documents)
  
        else:
            data.loc[(data['price_range'].isin(['1','2'])) &
                     (data['review_stars'].isin([1,2])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                     (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                     (data['review_stars'].isin([1,2])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                     (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
            title = '{} Projection of {} Reviews\n([1\\$ & 2\\$] vs. [3\\$ & 4\\$] > [1*-2* & 4*-5*])'.format(visualization, n_documents)
    
    remapped_data = data[data['remapped_labels'].notnull()]
    remapped_data_sample = remapped_data.sample(n=n_documents)
    target = remapped_data_sample['remapped_labels']
    tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun,
                            token_pattern=None, ngram_range = (1,ngram_range), min_df=10)
    if tokens_type == 'Non-Lemmatized':
        tokens = 'review_tokens'
    else:
        tokens = 'lemmatized_tokens'
    X = tfidf.fit_transform(remapped_data_sample[tokens])
    
    if visualization == 't-SNE':
        tsne = TSNEVisualizer(colormap='viridis', title=title)
        tsne.fit(X, target)
        tsne.poof();
    else:
        umap = UMAPVisualizer(colormap='viridis', title=title)
        umap.fit(X, target)
        umap.poof();
        
widgets.interact_manual(remap_labels, remap_type=remap_widget, scheme=scheme_widget, tokens_type=tokens_widget,
                        ngram_range=ngram_widget, n_documents=n_widget, visualization=vis_widget);

In [None]:
files = [f for f in os.listdir('figures/tsne_umap') if f.endswith('.png')]
@interact
def show_tsne_umaps(file=files):
    display(Image(os.path.join('figures/tsne_umap/', file)))

## Modeling

Code in this section is highly interactive, with numerous widgets to facilitate the process of exploring different models. Specifically, in the initial block of modeling-related code, there are many options for steering model exploration. You can choose how you would like to remap the class labels, the size of the random sample you would like to take of the original dataset for model exploration, and the size of the test set (default is 0.2). You can also select the type of tokens to feed into the model &mdash; non-lemmatized or lemmatized &mdash; along with the vectorizer to use &mdash; ```CountVectorizer``` or ```TfidfVectorizer``` &mdash; and the vectorizer's ngram range.

In addition to the vectorized text, you can choose which extra features to include: the token counts of the reviews and/or the star ratings of the reviews. Star ratings are only available as a feature when the class label remapping scheme is based on price range alone, since the other schemes already build the star rating into the label. This mixing of heterogeneous features is made possible by scikit-learn's ```FeatureUnion```. The code makes 9 classifiers available for comparison (you can select any combination of them, from one to all), and you can easily add more by modifying the ```models_dict``` dictionary in the code.

Finally, there are two widgets that give you control over the ```class_weight``` attribute in the models for which this attribute exists. Switching this setting between 'auto', 'balanced', and 'custom' can significantly change model performance, especially on imbalanced data such as the dataset at hand. If 'custom' is selected, you can enter your own class-weighting scheme as a dictionary whose keys are the class labels. See the [scikit-learn documentation on ```class_weight```](https://scikit-learn.org/dev/glossary.html#term-class-weight) for more information on how to approach this attribute and how class weights are used differently depending on the algorithm.
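To make the 'balanced' option more concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula in plain Python. The helper name `balanced_class_weights` and the label counts are invented for illustration and do not reflect the real class distribution of the Yelp data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical, heavily imbalanced label column (not the real class counts).
toy_labels = ['1'] * 20 + ['2'] * 60 + ['3'] * 15 + ['4'] * 5
weights = balanced_class_weights(toy_labels)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

The rarest class gets the largest weight. A dictionary of this shape, keyed by the string class labels, is also the kind of input the 'custom' option expects.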

For each model pipeline run, the code will let you know how long it took to fit the pipeline to the training data, and it will provide you with both a ```classification_report``` and ```confusion_matrix```. These two model performance summaries were chosen for the richness of the information they provide.
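As a reading aid for the two summaries, the sketch below derives per-class precision and recall from a confusion matrix laid out the way scikit-learn's ```confusion_matrix``` returns it (rows are true classes, columns are predicted classes). The function name and the toy matrix are hypothetical and only meant to show how minority-class recall can stay low even when overall accuracy looks acceptable.

```python
def per_class_precision_recall(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                           # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Hypothetical two-class matrix: the majority class dominates the predictions.
toy_matrix = [[90, 10],
              [15, 5]]
results = per_class_precision_recall(toy_matrix)
for k, (p, r) in enumerate(results):
    print('class {}: precision={:.2f} recall={:.2f}'.format(k, p, r))
```

In this toy case 95 of 120 predictions are correct, yet only a quarter of the minority class is recovered.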

### Imports

In [None]:
import time
import yaml
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
data = data[data['is_restaurant'] == 1]
data = data[data['tokens_len'] > 0]

In [None]:
with open(os.path.join(dataset_path, 'final_data.pkl'), 'wb') as f:
    pickle.dump(data,f)

In [None]:
with open(os.path.join(dataset_path, 'final_data.pkl'), 'rb') as f:
    data = pickle.load(f)

In [None]:
dropdowns_dict={'Remap Price Range Only':['Default', '[1$ & 2$] vs. [3$ & 4$]', '[1$ & 2$] vs. [4$]'], 
               'Remap Price Range with Review Stars':['[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]', 
                                                      '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]']}

style = {'description_width': 'initial'}
remap_widget = widgets.Dropdown(options = dropdowns_dict.keys(), description='Remap type:', 
                                style=style, value='Remap Price Range Only')
scheme_widget = widgets.Dropdown(description='Scheme:')
include_stars_widget = widgets.Dropdown(description='Include Star Ratings?:', style=style)

def update(*args):
    scheme_widget.options = dropdowns_dict[remap_widget.value]
    if remap_widget.value == 'Remap Price Range Only':
        include_stars_widget.options = ['Yes', 'No']
    else:
        include_stars_widget.options = ['No']
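remap_widget.observe(update, names='value')
update()  # populate the scheme and star-rating dropdowns before the first interaction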

data_sample_size = widgets.BoundedFloatText(
    value=0.2,
    min=0,
    max=1,
    step=0.05,
    description='Sample size:',
    disabled=False)

test_size = widgets.BoundedFloatText(
    value=0.2,
    min=0,
    max=0.5,
    step=0.05,
    description='Test size:',
    disabled=False)

tokens_widget = widgets.RadioButtons(
    options=['Non-Lemmatized', 'Lemmatized'],
    description='Token type:',
    disabled=False)

vect_widget = widgets.ToggleButtons(
    options=['TfidfVectorizer', 'CountVectorizer'],
    description='Vectorizer:', style=style)

tokens_len_widget = widgets.RadioButtons(
    options=['Yes', 'No'],
    description='Include Tokens Count?:',
    style=style)

ngram_widget = widgets.IntRangeSlider(min=1, max=5, description='Ngram range:', style=style)

features_widget = widgets.SelectMultiple(
    options=['TfidfVectorizer', 'CountVectorizer', 'Review Tokens Count'],
    description='Features:',
    disabled=False)

models_dict = {'Logistic Regression':LogisticRegression(solver='lbfgs'), 'SGDClassifier':SGDClassifier(),
               'SVC':SVC(gamma='auto'), 'NuSVC':NuSVC(gamma='auto'), 'LinearSVC':LinearSVC(),
               'KNeighborsClassifier':KNeighborsClassifier(), 'BaggingClassifier':BaggingClassifier(),
               'ExtraTreesClassifier':ExtraTreesClassifier(n_estimators=100), 
               'RandomForestClassifier':RandomForestClassifier(n_estimators=100)}

models = widgets.SelectMultiple(
    options=models_dict.keys(),
    description='Models',
    disabled=False)

class_weight_widget = widgets.ToggleButtons(options=['auto', 'balanced', 'custom'], 
                                            description='Models class_weight:', 
                                            style=style, value='auto')

custom_weight_widget = widgets.Text(disabled=True, description='Custom weight scheme:', style=style,
                                    placeholder='N/A')

def update2(*args):
    if class_weight_widget.value == 'custom':
        custom_weight_widget.placeholder='{class_label: weight}'
        custom_weight_widget.disabled = False
    else:
        custom_weight_widget.placeholder='N/A'
        custom_weight_widget.disabled = True
class_weight_widget.observe(update2)

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.field]]

def remap(remap_dict, value):
    return remap_dict.get(value, None)

def compare_models(remap_type, scheme, tokens_type, sample_frac, test_frac, 
                   vectorizer, ngram_range, include_tokens_len, include_stars, 
                   classifiers, class_weight, custom_weight):
    
    if class_weight == 'custom':
        # safe_load parses plain YAML only, which is all a weight dictionary needs
        try:
            custom_weight = yaml.safe_load(custom_weight)
        except yaml.YAMLError:
            custom_weight = None
        if not isinstance(custom_weight, dict):
            raise ValueError('Please enter custom_weight in the following format: {class_label1: weight1, class_label2: weight2, etc.}')

    if 'remapped_labels' in data.columns:
        data.drop(columns=['remapped_labels'], inplace=True)
       
    if remap_type == 'Remap Price Range Only':
        
        if scheme == 'Default':
            data['remapped_labels'] = data['price_range']
            target_names = ['1$', '2$', '3$', '4$']
        
        elif scheme == '[1$ & 2$] vs. [3$ & 4$]':
            remap_dict = {'1':'1', '2':'1', '3':'2', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            target_names = ['[1$ & 2$]', '[3$ & 4$]']
            
        else:
            remap_dict = {'1':'1', '2':'1', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            target_names = ['[1$ & 2$]', '[4$]']
    else:
        if scheme == '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
            target_names = ['[1$ & 2$] > [1*-3*]', '[1$ & 2$] > [4*-5*]', 
                            '[3$ & 4$] > [1*-3*]', '[3$ & 4$] > [4*-5*]']
  
        else:
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([1,2])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([1,2])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
            target_names = ['[1$ & 2$] > [1*-2*]', '[1$ & 2$] > [4*-5*]', 
                            '[3$ & 4$] > [1*-2*]', '[3$ & 4$] > [4*-5*]']
            
    remapped_data = data[data['remapped_labels'].notnull()]
    
    if class_weight == 'custom':
        if set(remapped_data['remapped_labels'].unique()) != set(custom_weight.keys()):
            raise ValueError('Please enter custom_weight in the following format: {class_label1: weight1, class_label2: weight2, etc.}\nMake sure the key-value pairs only include all class labels and their weights.')
    
    remapped_data_sample = remapped_data.sample(frac=sample_frac)
    
    if tokens_type == 'Non-Lemmatized':
        tokens = 'review_tokens'
    else:
        tokens = 'lemmatized_tokens'
        

    if include_tokens_len == 'Yes' and include_stars == 'Yes':
        features = [tokens, 'tokens_len', 'review_stars']
        
    elif include_tokens_len == 'Yes' and include_stars == 'No':
        features = [tokens, 'tokens_len']
    
    elif include_tokens_len == 'No' and include_stars == 'Yes':
        features = [tokens, 'review_stars']
    
    else:
        features = [tokens]
 
    X = remapped_data_sample[features]

    y = remapped_data_sample['remapped_labels']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_frac, stratify=y)

    for clf in classifiers:
        model = models_dict[clf]
        
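        if hasattr(model, 'class_weight'):
            if class_weight == 'custom':
                model.class_weight = custom_weight
            elif class_weight == 'balanced':
                model.class_weight = 'balanced'
            else:
                # 'auto' keeps the estimator default (None); recent scikit-learn
                # releases reject the string 'auto' as a class_weight value
                model.class_weight = None
            
        else:
            print('{} does not have class_weight attribute. Cannot set to {}.\n'.format(model.__class__.__name__, class_weight))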
            
        if vectorizer == 'CountVectorizer':
            vect_step = ('vect', CountVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun,
                                            token_pattern=None, ngram_range=ngram_range))
        else:
            vect_step = ('vect', TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun,
                                            token_pattern=None, ngram_range=ngram_range))

        text = Pipeline([
                        ('selector', TextSelector(tokens)),
                        vect_step
                        ])

        stars = Pipeline([('selector', NumberSelector('review_stars'))])

        tokens_count = Pipeline([
                                ('selector', NumberSelector('tokens_len')),
                                ('standard', StandardScaler())
                                ])

        if include_tokens_len == 'No':
            if include_stars == 'No':
                feats = text
            else:
                feats = FeatureUnion([('text', text),
                                      ('stars', stars)])

        else:
            if include_stars == 'No':
                feats = FeatureUnion([('text', text),
                                      ('tokens_count', tokens_count)])
            else:
                feats = FeatureUnion([('text', text),
                                      ('tokens_count', tokens_count),
                                      ('stars', stars)])        

        pipeline = Pipeline([
            ('features', feats),
            ('clf', model),
        ])

        print(pipeline.steps)
            
        print('Fitting {} pipeline --- '.format(model.__class__.__name__), end='')
        time_start = time.time()
        pipeline.fit(X_train, y_train)
        preds = pipeline.predict(X_test)
        time_stop = time.time()
        elapsed = time_stop - time_start
        print('{:.0f} minutes {:.1f} seconds'.format(elapsed // 60, elapsed % 60))
        print('\n')
        print(classification_report(y_test, preds, target_names=target_names))
        print(confusion_matrix(y_test, preds))
        print('\n')

interact_manual(compare_models, remap_type=remap_widget, scheme=scheme_widget, 
                tokens_type=tokens_widget, sample_frac=data_sample_size, test_frac=test_size,
                vectorizer=vect_widget, ngram_range=ngram_widget, include_tokens_len=tokens_len_widget, 
                include_stars=include_stars_widget, classifiers=models, class_weight=class_weight_widget,
                custom_weight=custom_weight_widget);

### General Observations &mdash; Imbalanced Dataset

After running several iterations of different model pipelines above, it is immediately clear that class imbalance, both in the original default price range labels and in the remapped ones, is hurting model performance. Setting the ```class_weight``` attribute to 'balanced' or to a custom scheme, for the models that support it, markedly improves classification performance on the minority classes (most visibly in their recall scores), but the models still generally perform poorly on those classes.

#### Possible Approaches to Handling the Imbalance

There are myriad approaches for dealing with imbalanced data, as this is a common phenomenon within machine learning problems. This notebook will explore some of these methods using the library [```imbalanced-learn (imblearn)```](https://imbalanced-learn.readthedocs.io/en/stable/index.html), which "is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance." 

Specifically, this notebook looks at the following oversampling techniques of the minority classes:
- [```RandomOverSampler```](https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#random-over-sampler):  This technique "generate[s] new samples in the classes which are under-represented. The most naive strategy is to generate new samples by randomly sampling with replacement the current available samples." Therefore, synthetic data is not created, as the technique duplicates observations from the minority class(es).
- [```ADASYN```](https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn): This technique over-samples the minority class(es) by generating new samples through interpolation. ```ADASYN``` "focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbors classifier."
- [```SMOTE```](https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn): This technique also, like ```ADASYN```, creates synthetic data. "The basic implementation of ```SMOTE``` will not make any distinction between easy and hard samples to be classified using the nearest neighbors rule."
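The difference between duplicating observations and interpolating new ones can be sketched in a few lines of plain Python. This is a simplified illustration, not the ```imblearn``` implementation: the function names are hypothetical, the data is made up, and the interpolation step takes its neighbor as an argument instead of searching for nearest neighbors the way ```SMOTE``` and ```ADASYN``` do.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    """Duplicate minority-class rows (sampling with replacement) until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

def interpolate(point, neighbor, rng):
    """SMOTE-style synthetic point on the segment between two minority samples."""
    gap = rng.random()  # in [0, 1)
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]

# Hypothetical 2-D feature rows with a 3-to-2 class imbalance.
toy_X = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0], [6.0, 4.0]]
toy_y = ['low', 'low', 'low', 'high', 'high']

X_res, y_res = random_oversample(toy_X, toy_y)
print(Counter(y_res))

synthetic = interpolate(toy_X[3], toy_X[4], random.Random(0))
print(synthetic)
```

Random oversampling only ever returns rows that already exist, whereas the interpolated point is new and lies between two existing minority samples.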

The interactive code below allows you to compare model performance across these different oversampling techniques. As with the interactive code before, you can specify the sample size of the dataset you want to work with, the class label remapping strategy, the vectorization method and ngram range, the combination of classifier(s) to compare, and the combination of oversampling method(s) to compare. Note that the widget for selecting oversampling techniques also includes a sampler called 'Standard/Default'. This is a dummy sampler that simply returns the data as is; it is included to make it easy to compare the oversampling methods against the unmodified data. New to this section's interactive code is the option to use cross-validation when comparing the different oversampling techniques. You can specify whether or not you want to use the ```Stratified K-Folds cross-validator``` from scikit-learn and how many splits to use. If you opt not to use cross-validation, you can specify the size of the test set to use when calling ```train_test_split```.

When cross-validation is not selected, the code will return the [```classification_report_imbalanced```](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.metrics.classification_report_imbalanced.html) summary from ```imblearn``` for each fitted pipeline. When cross-validation is selected, the code will return performance stats for each fold, and when all folds are complete, summary stats for all folds.

#### Imports

In [None]:
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

In [None]:
dropdowns_dict={'Remap Price Range Only':['Default', '[1$ & 2$] vs. [3$ & 4$]', '[1$ & $2] vs [$4]'], 
               'Remap Price Range with Review Stars':['[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]', 
                                                      '[1$ & 2$] vs. [3$ & 4$] > [1*-2* & 4*-5*]']}

style = {'description_width': 'initial'}
remap_widget = widgets.Dropdown(options = dropdowns_dict.keys(), description='Remap type:', 
                                style=style, value='Remap Price Range Only')
scheme_widget = widgets.Dropdown(description='Scheme:')

def update(*args):
    scheme_widget.options = dropdowns_dict[remap_widget.value]
remap_widget.observe(update, names='value')
update()  # populate the scheme dropdown before the first interaction

data_sample_size = widgets.BoundedFloatText(
    value=0.2,
    min=0,
    max=1,
    step=0.05,
    description='Sample size:',
    disabled=False)

RANDOM_STATE = 42

models_dict = {'Logistic Regression':LogisticRegression(solver='lbfgs'), 'SGDClassifier':SGDClassifier(),
               'SVC':SVC(gamma='auto'), 'NuSVC':NuSVC(gamma='auto'), 'LinearSVC':LinearSVC(),
               'KNeighborsClassifier':KNeighborsClassifier(), 'BaggingClassifier':BaggingClassifier(),
               'ExtraTreesClassifier':ExtraTreesClassifier(n_estimators=100), 
               'RandomForestClassifier':RandomForestClassifier(n_estimators=100)}

class DummySampler:
    def sample(self, X, y):
        return X, y
    def fit(self, X, y):
        return self
    def fit_resample(self, X, y):
        return self.sample(X, y)

samplers_dict = {'Standard/Default': DummySampler(), 'ADASYN': ADASYN(random_state=RANDOM_STATE),
                 'RandomOverSampler': RandomOverSampler(random_state=RANDOM_STATE),
                 'SMOTE': SMOTE(random_state=RANDOM_STATE)}

models = widgets.SelectMultiple(
    options=models_dict.keys(),
    description='Models',
    disabled=False)

samplers_widget = widgets.SelectMultiple(
    options=samplers_dict.keys(),
    description='Oversampling method:',
    style=style)

vect_widget = widgets.ToggleButtons(
    options=['CountVectorizer', 'TfidfVectorizer'],
    description='Vectorizer:', style=style)

ngram_widget = widgets.IntRangeSlider(min=1, max=5, description='Ngram range:', style=style)

cv_widget = widgets.Dropdown(options=['Yes', 'No'], description='Use cross-validation?:', style=style)

test_size = widgets.BoundedFloatText(disabled=True, description='Test size:', style=style,
                                    placeholder='N/A')
splits_widget = widgets.IntSlider(min=2, max=12, description='Number of splits:', style=style,
                                       disabled=False)

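def update2(*args):
    # Test size only applies to a single train/test split; number of splits only applies to CV
    if cv_widget.value == 'No':
        test_size.disabled = False
        test_size.value=0.2
        test_size.min=0
        test_size.max=0.5
        test_size.step=0.05
        splits_widget.disabled = True
    else:
        splits_widget.disabled=False
        test_size.disabled = True
# React only to changes of the selected value, not to every trait change on the widget
cv_widget.observe(update2, names='value')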
    
class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]
    
def remap(remap_dict, value):
    return remap_dict.get(value, None)

def compare_oversampling_methods(remap_type, scheme, sample_frac, vectorizer, 
                                 ngram_range, classifiers, samplers, 
                                 use_cv, n_splits, test_frac):

    if 'remapped_labels' in data.columns:
        data.drop(columns=['remapped_labels'], inplace=True)
       
    if remap_type == 'Remap Price Range Only':
        
        if scheme == 'Default':
            data['remapped_labels'] = data['price_range']
            target_names = ['1$', '2$', '3$', '4$']
        
        elif scheme == '[1$ & 2$] vs. [3$ & 4$]':
            remap_dict = {'1':'1', '2':'1', '3':'2', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            target_names = ['[1$ & 2$]', '[3$ & 4$]']
            
        else:
            remap_dict = {'1':'1', '2':'1', '4':'2'}
            data['remapped_labels'] = data['price_range'].apply(lambda x: remap(remap_dict, x))
            target_names = ['[1$ & 2$]', '[4$]']
    else:
        if scheme == '[1$ & 2$] vs. [3$ & 4$] > [1*-3* & 4*-5*]':
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([1,2,3])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
            target_names = ['[1$ & 2$] > [1*-3*]', '[1$ & 2$] > [4*-5*]', 
                            '[3$ & 4$] > [1*-3*]', '[3$ & 4$] > [4*-5*]']
  
        else:
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([1,2])), 'remapped_labels'] = '1'
            data.loc[(data['price_range'].isin(['1','2'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '2'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([1,2])), 'remapped_labels'] = '3'
            data.loc[(data['price_range'].isin(['3','4'])) &
                             (data['review_stars'].isin([4,5])), 'remapped_labels'] = '4'
            target_names = ['[1$ & 2$] > [1*-2*]', '[1$ & 2$] > [4*-5*]', 
                            '[3$ & 4$] > [1*-2*]', '[3$ & 4$] > [4*-5*]']
            
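    # Keep only rows that received a label under the chosen scheme, then subsample reproducibly
    remapped_data = data[data['remapped_labels'].notnull()]
    remapped_data_sample = remapped_data.sample(frac=sample_frac, random_state=RANDOM_STATE)
    #remapped_data_sample['review_tokens'] = remapped_data_sample['review_tokens'].apply(lambda x: ' '.join(x))
    # Reset to a 0..n-1 index so the positional fold indices from StratifiedKFold line up with X[...] and y[...]
    X = remapped_data_sample['review_tokens'].reset_index(drop=True)
    y = remapped_data_sample['remapped_labels'].reset_index(drop=True)
    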
    
    vectorizer_dict = {'CountVectorizer':CountVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun,
                                            token_pattern=None, ngram_range=ngram_range),
                   'TfidfVectorizer':TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, preprocessor=dummy_fun,
                                            token_pattern=None, ngram_range=ngram_range)}

    pipelines = [
        ['{}-{}-{}'.format(vectorizer, sampler, clf),
         make_pipeline(vectorizer_dict[vectorizer], samplers_dict[sampler], models_dict[clf])]
        for sampler in samplers for clf in classifiers
    ]

    for name, pipeline in pipelines:
        #print(pipeline.steps)
        if use_cv == 'No':
            # Fix the seed so every pipeline being compared is evaluated on the same split
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_frac, stratify=y,
                                                                random_state=RANDOM_STATE)
            print('Fitting {} pipeline --- '.format(name), end='')
            time_start = time.time()
            pipeline.fit(X_train, y_train)
            preds = pipeline.predict(X_test)
            time_stop = time.time()
            elapsed = time_stop - time_start
            print('{:.0f} minutes {:.1f} seconds'.format(elapsed // 60, elapsed % 60))
            print('\n')
            print(classification_report_imbalanced(y_test, preds, target_names=target_names))
        
        else:
            #print('X: {}'.format(X))
            kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=777)
            accuracy = []
            precision = []
            recall = []
            f1 = []
            fold = 0
            for train, test in kfold.split(X, y):
                fold += 1
                print('Fitting {} pipeline for fold {}/{} --- '.format(name, fold, n_splits), end='')
                time_start = time.time()
                pipeline.fit(X[train], y[train])
                preds = pipeline.predict(X[test])
                time_stop = time.time()
                elapsed = time_stop - time_start
                print('{:.0f} minutes {:.1f} seconds'.format(elapsed // 60, elapsed % 60))
                # Reuse the predictions already made instead of predicting again via pipeline.score
                accuracy.append(np.mean(preds == y[test].to_numpy()) * 100)
                precision.append(precision_score(y[test], preds, average='macro')*100)
                print('\n')
                print('precision: {}'.format(precision_score(y[test], preds, average=None)))
                recall.append(recall_score(y[test], preds, average='macro')*100)
                print('recall: {}'.format(recall_score(y[test], preds, average=None)))
                f1.append(f1_score(y[test], preds, average='macro')*100)
                print('f1 score: {}'.format(f1_score(y[test], preds, average=None)))
                print('\n')
            print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
            print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
            print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
            print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))
            print('\n')

interact_manual(compare_oversampling_methods, remap_type=remap_widget, scheme=scheme_widget,
                sample_frac=data_sample_size, vectorizer=vect_widget,
                ngram_range=ngram_widget, classifiers=models, samplers=samplers_widget,
                use_cv=cv_widget, n_splits=splits_widget, test_frac=test_size);

### Future Work

The ideal way to "fix" imbalanced data is to collect more minority-class observations rather than generating synthetic ones, oversampling the minority class(es), or undersampling the majority class(es). Unfortunately, for many machine learning problems, collecting more data points is impractical: it can be nearly impossible, onerous (resource-wise, time-wise, etc.), or tricky. For example, a government may only publish X observations every 5 years. How would you go about supplementing this data if it's heavily imbalanced? Take another example: a dataset of tweets, some annotated as abusive and others as neutral. This dataset is likely heavily imbalanced in favor of neutral tweets. You could technically use the Twitter API to collect more tweets, but that raises the question: how would you ensure that the additional data contains the abusive tweets you need? Sure, you can use custom API queries to try to tease out tweets that are more likely to be abusive, but certain queries may bias the abusive content you pull (e.g., by searching for pre-determined abusive terms, you may miss tweets that are more subtly abusive).

Yelp data, however, is a bit of an exception. Yelp offers [Fusion APIs](https://www.yelp.com/fusion) through which business and review data can be collected. When making a call to the API, you can request results having specific price range values! You could then pass the business IDs returned from that call to the reviews endpoint to receive up to 3 reviews per business. This doesn't guarantee that the returned reviews will have the desired star ratings (if using a class relabeling scheme that takes into account both price range and review star rating), but you could make multiple calls to the API and then filter out reviews that don't meet the required conditions. There is a potential worry that the distribution of results from the APIs differs from that of the Yelp dataset at hand, but using the APIs to supplement the dataset is one way to add real, fresh observations to the minority classes.
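The workflow above could look roughly like the following sketch, which only builds the requests and filters a response, without sending anything over the network. It assumes the Fusion v3 endpoints ```/businesses/search``` (with its ```price```, ```term```, ```location```, ```limit```, and ```offset``` parameters) and ```/businesses/{id}/reviews```, plus bearer-token authentication; the helper names and the sample review dictionaries are hypothetical, and the current Yelp documentation should be checked before relying on any of it.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.yelp.com/v3"  # assumed Fusion v3 base URL

def search_request(api_key, location, price_levels, limit=20, offset=0):
    """Build a business search request restricted to the given price levels (1-4)."""
    params = {
        "term": "restaurants",
        "location": location,
        "price": ",".join(str(p) for p in price_levels),  # e.g. "3,4"
        "limit": limit,
        "offset": offset,
    }
    return Request("{}/businesses/search?{}".format(API_ROOT, urlencode(params)),
                   headers={"Authorization": "Bearer {}".format(api_key)})

def reviews_request(api_key, business_id):
    """Build a request for the reviews of one business returned by the search."""
    return Request("{}/businesses/{}/reviews".format(API_ROOT, business_id),
                   headers={"Authorization": "Bearer {}".format(api_key)})

def keep_reviews(reviews, wanted_stars):
    """Keep only reviews whose star rating fits the relabeling scheme in use."""
    return [r for r in reviews if r.get("rating") in wanted_stars]

# Hypothetical usage: target the under-represented 3$ and 4$ restaurants,
# then keep only low-star reviews for a [3$ & 4$] > [1*-2*] class
req = search_request("YOUR_API_KEY", "Las Vegas, NV", price_levels=[3, 4])
fake_reviews = [{"rating": 5, "text": "..."}, {"rating": 2, "text": "..."}]
minority_reviews = keep_reviews(fake_reviews, wanted_stars={1, 2})
```

Each request object can be passed to ```urllib.request.urlopen``` and the response body parsed with ```json.load```; repeated calls with increasing ```offset``` would be needed to gather enough reviews for the rarer classes.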

Due to time constraints, this notebook did not explore the Yelp APIs as a way to bolster the representation of minority classes in the dataset, but this will be the next step. It will be interesting to compare model performance on the dataset supplemented with actual, fresh Yelp reviews against performance using the aforementioned oversampling techniques.

It will also be interesting in the future to engineer new text-based features, like POS tags or frequency counts of words from informal/formal lexicons (as a proxy for the formality of the language being used), and to home in on one model and conduct hyperparameter tuning (perhaps with ```GridSearchCV```).
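As a rough illustration of the lexicon idea, the sketch below turns a tokenized review into two formality features. The two tiny word sets are placeholders invented for this example, not real lexicons, and ```formality_features``` is a hypothetical helper; a real implementation would load curated informal/formal word lists.

```python
from collections import Counter

# Placeholder lexicons for illustration only
INFORMAL = {"gonna", "yummy", "awesome", "lol"}
FORMAL = {"exquisite", "impeccable", "ambiance", "sommelier"}

def formality_features(tokens):
    """Share of a review's tokens that appear in each lexicon."""
    counts = Counter(t.lower() for t in tokens)
    n_tokens = sum(counts.values()) or 1  # avoid dividing by zero on empty reviews
    informal_hits = sum(counts[w] for w in INFORMAL)
    formal_hits = sum(counts[w] for w in FORMAL)
    return {"informal_rate": informal_hits / n_tokens,
            "formal_rate": formal_hits / n_tokens}

features = formality_features(["The", "sommelier", "was", "impeccable"])
```

Rates are used instead of raw counts so that long reviews do not look more formal or informal simply because they contain more words. Features like these could be concatenated with the vectorizer output before being passed to a classifier.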