loganjtravis@gmail.com (Logan Travis)

In [1]:
%%capture --no-stdout

# Imports; captures errors to suppress warnings about changing
# import syntax
from itertools import compress
import matplotlib.pyplot as plot
import matplotlib.cm as cm
import matplotlib
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

In [2]:
# Set random seed for repeatability
random.seed(42)

In [3]:
# Set matplotlib to inline to preserve images in PDF
%matplotlib inline

# Summary

From the course page [Week 5 > Task 6 Information > Task 6 Overview](https://www.coursera.org/learn/data-mining-project/supplement/gvCsC/task-4-and-5-overview):

> In this task, you are going to predict whether a set of restaurants will pass the public health inspection tests given the corresponding Yelp text reviews along with some additional information such as the locations and cuisines offered in these restaurants. Making a prediction about an unobserved attribute using data mining techniques represents a wide range of important applications of data mining. Through working on this task, you will gain direct experience with such an application. Due to the flexibility of using as many indicators for prediction as possible, this would also give you an opportunity to potentially combine many different algorithms you have learned from the courses in the Data Mining Specialization to solve a real world problem and experiment with different methods to understand what’s the most effective way of solving the problem.
> 
> **About the Dataset**
>
> You should first [download the dataset](https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz). The dataset is composed of a training subset containing 546 restaurants used for training your classifier, in addition to a testing subset of 12753 restaurants used for evaluating the performance of the classifier. In the training subset, you will be provided with a binary label for each restaurant, which indicates whether the restaurant has passed the latest public health inspection test or not, whereas for the testing subset, you will not have access to any labels. The dataset is spread across three files such that the first 546 lines in each file correspond to the training subset, and the rest are part of the testing subset. Below is a description of each file:
>
> * hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
> * hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have "[None]" in their label field implying that they are part of the testing subset.
> * hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).
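
The quoted description implies that labeled and unlabeled rows can be told apart by the `[None]` marker in the labels file. The following is a minimal sketch of that separation, assuming the text and label lines have already been read into two parallel lists. The function name `split_by_label` and the sample strings are illustrative only; the notebook below reads files that were already split.

```python
def split_by_label(texts, labels):
    """Separate labeled (training) rows from unlabeled (testing) rows."""
    train, test = [], []
    for text, label in zip(texts, labels):
        if label == "[None]":
            # No inspection result available: part of the testing subset
            test.append(text)
        else:
            # "1" means the restaurant failed its inspection
            train.append((text, label == "1"))
    return train, test


# Made-up sample rows, one label per review line
texts = ["clean and tasty", "saw a roach", "no verdict yet"]
labels = ["0", "1", "[None]"]
train, test = split_by_label(texts, labels)
```

Storing the label as a boolean mirrors the `failed_hygiene` column built later in this report.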

# A Note on This Report

I hid much of my code, displaying only the chunks that clarify my process. My previous reports exceeded 15 pages, mostly Python code. Reviewers suggested replacing code with written descriptions for clarity.

# Predictive Model 01: Unigrams and Logistic Regression

I start by representing text as a unigram vector then applying logistic regression. This predictive model gives a useful baseline for future methods. It also highlights the difficulty of the prediction: Logistic regression alone proves an *incredibly* poor predictor!

## Prepare Training Data

In [4]:
# Set paths to data source, work in process ("WIP"), and output
PATH_SOURCE = "source"
PATH_WIP = "wip"
PATH_OUTPUT = "output"

# Set file paths
PATH_SOURCE_TRAIN_TEXT = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat"
PATH_SOURCE_TRAIN_LABELS = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.labels"
PATH_SOURCE_TRAIN_REST = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.additional"
PATH_SOURCE_TARGET_TEXT = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat"
PATH_SOURCE_TARGET_REST = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat.additional"

# Set output paths
PATH_OUTPUT_PRED_LABELS = f"{PATH_OUTPUT}/pred_hygiene.dat.labels"

In [5]:
# Get training text and labels
with open(PATH_SOURCE_TRAIN_TEXT) as f:
    arrTrainText = [l.rstrip() for l in f]
with open(PATH_SOURCE_TRAIN_LABELS) as f:
    arrTrainLabels = [l.rstrip() == "1" for l in f]
dfTrain = pd.DataFrame(data={"failed_hygiene": arrTrainLabels, "review_text": arrTrainText})

In [6]:
# Split data into training and testing sets
dfTrain["review_text_len"] = dfTrain.review_text.str.len()
dfTrain, dfTest = train_test_split(dfTrain, test_size=0.3, random_state=84)

In [7]:
# Inspect first 10 rows
dfTrain.head(10)

| | failed_hygiene | review_text | review_text_len |
|---|---|---|---|
| 463 | True | Lovely place! Great neighborhood feel, excelle... | 17352 |
| 240 | False | The Crab Spring rolls were absolutely amazing!... | 12390 |
| 461 | False | We went about a year ago... the experience was... | 3107 |
| 257 | False | I was expecting a lot more given all the great... | 2566 |
| 407 | False | This joint became a regular stop for us when w... | 4765 |
| 545 | False | A for effort. If you happen to be stuck with s... | 18692 |
| 465 | True | Eat breakfast here.This restaurant has one of ... | 4837 |
| 331 | False | I was going to watch a movie at SIFF but wante... | 9730 |
| 381 | True | All I had here were cha sao bao (BBQ pork buns... | 4814 |
| 66 | False | One of the best Phillies in the city. Service ... | 326 |


In [8]:
# Sanity check: class balance and review length in the training split
dfTrain.groupby(["failed_hygiene"]).agg({
    "review_text": ["count"],
    "review_text_len": ["mean", "std"]
})

| failed_hygiene | review_text count | review_text_len mean | review_text_len std |
|---|---|---|---|
| False | 195 | 7276.015385 | 10327.798184 |
| True | 187 | 9967.219251 | 12589.871449 |


In [9]:
# Sanity check: class balance and review length in the testing split
dfTest.groupby(["failed_hygiene"]).agg({
    "review_text": ["count"],
    "review_text_len": ["mean", "std"]
})

| failed_hygiene | review_text count | review_text_len mean | review_text_len std |
|---|---|---|---|
| False | 78 | 6230.25641 | 8406.959927 |
| True | 86 | 9252.313953 | 10821.455737 |


## Create Unigram Probability Matrix

I chose not to use IDF weighting because the data concatenated all reviews for a single restaurant with no delimiter to split them. Instead, I count terms and then normalize the counts for each restaurant, creating a term probability matrix. A couple of upfront comments:

* Logistic regression works best when the number of samples far exceeds the number of variables. That is not the case for this data set. I expect very poor performance with high sensitivity to model parameters.
* I did not tune the term vectorizer; it includes all terms found in the training data.
* I will tune the parameters in the next model. I intend this model as a *naive* baseline.
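
As a rough illustration of the term probability matrix described above, here is a library-free sketch. It assumes whitespace tokenization and non-empty documents, and the name `term_probabilities` and the two sample reviews are invented for the example. The notebook itself uses `CountVectorizer` with an NLTK-based tokenizer rather than this code.

```python
from collections import Counter


def term_probabilities(documents):
    """Return the vocabulary and one row of term probabilities per document."""
    # Count terms per document (simple whitespace tokens, an assumption)
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        # Divide each count by the document total so the row sums to 1
        rows.append([doc_counts[term] / total for term in vocabulary])
    return vocabulary, rows


docs = ["great food great service", "dirty floor bad food"]
vocabulary, rows = term_probabilities(docs)
```

Each row sums to 1, so a restaurant with many long reviews does not dominate one with only a few short ones.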

In [10]:
class MyTokenizer:
    """String tokenizer that lemmatizes each token with WordNet."""
    def __init__(self):
        self.wnl = nltk.stem.WordNetLemmatizer()
    
    def __call__(self, document):
        """Return lemmatized tokens from a string."""
        return [self.wnl.lemmatize(token)
                for token in nltk.word_tokenize(document)]

In [11]:
# Set token limit
MAX_FEATURES = 100000

# Set document frequency ceiling
MAX_DF = 1.0

# Set document frequency floor
MIN_DF = 1

In [12]:
# Create TF vectorizer 
tf = CountVectorizer(max_features=MAX_FEATURES, max_df=MAX_DF, \
                     min_df=MIN_DF, stop_words="english", \
                     tokenizer=MyTokenizer())

In [13]:
%%time

# Calculate training term frequencies
trainTerms = tf.fit_transform(dfTrain.review_text)

CPU times: user 7.09 s, sys: 609 ms, total: 7.7 s
Wall time: 7.77 s


In [14]:
# Normalize for each restaurant
trainTerms = trainTerms / trainTerms.sum(axis=1)
print("{:,} restaurant reviews extracted into {:,} unigram terms.".format(*trainTerms.shape))

382 restaurant reviews extracted into 23,107 unigram terms.


In [15]:
%%time

# Calculate testing term frequencies; Note: Transform ONLY,
# no additional fitting
testTerms = tf.transform(dfTest.review_text)

CPU times: user 2.39 s, sys: 0 ns, total: 2.39 s
Wall time: 2.4 s


In [16]:
# Normalize for each restaurant
testTerms = testTerms / testTerms.sum(axis=1)

## Train Logistic Regression Model

In [17]:
# Create logistic regression model
model_TF_LR = LogisticRegression(random_state=42)

In [18]:
%%time

# Train logistic regression model
model_TF_LR = model_TF_LR.fit(trainTerms, dfTrain.failed_hygiene)

CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 24.3 ms


In [19]:
# Calculate F1 score
model_TF_LR_Pred = model_TF_LR.predict(testTerms)
model_TF_LR_F1 = f1_score(dfTest.failed_hygiene, model_TF_LR_Pred)
print("Model 01 Logistic Regression of Term Probabilities F-1 Score: {:.6f}".format(model_TF_LR_F1))

Model 01 Logistic Regression of Term Probabilities F-1 Score: 0.000000


In [20]:
# Display confusion matrix
model_TF_LR_CM = confusion_matrix(dfTest.failed_hygiene, model_TF_LR_Pred)
print("True Negatives: {0:,}\nTrue Positives: {3:,}\nFalse Negatives: {2:,}\nFalse Positives: {1:,}".format(*model_TF_LR_CM.ravel()))

True Negatives: 77
True Positives: 0
False Negatives: 86
False Positives: 1


The simple logistic regression performed *horribly*. It predicted only **one** failed hygiene inspection in the test data, and that was a false positive. Review text simply includes too much noise. While we would anticipate hygiene issues to appear in reviews, we should also expect them to drown in a sea of comments unrelated to hygiene: the food, service, location, etc.
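
The zero F-1 score follows directly from the confusion matrix counts. The sketch below recomputes precision, recall, and F-1 from those counts. The helper name `f1_from_counts` is illustrative, and the "always predict failed" comparison is a hypothetical reference point derived from the test split sizes (78 passes, 86 failures), not a model trained in this report.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-1), treating undefined ratios as 0."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Model 01 counts from the confusion matrix above
precision, recall, f1 = f1_from_counts(true_pos=0, false_pos=1, false_neg=86)

# Hypothetical reference: label every test restaurant as failed
_, _, always_failed_f1 = f1_from_counts(true_pos=86, false_pos=78, false_neg=0)
```

With no true positives, both precision and recall are zero, so F-1 is zero regardless of the 77 correct "passed" predictions. The hypothetical always-failed labeling works out to 172 / 250 = 0.688 on the same split.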

# Predictive Model 02: Remove Noise from Unigrams and Logistic Regression

I next try a feature selection method called Recursive Feature Elimination. The 23,107 terms found in the training data far exceed the number of training samples (382). If a simple logistic regression has predictive value, it first needs to train on only the most useful features.
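
The idea behind Recursive Feature Elimination can be sketched in a few lines: fit a model, drop the features with the smallest absolute coefficients, and repeat until the target count remains. The code below is a simplified, library-free illustration on made-up toy data. The helpers `fit_logistic` and `recursive_feature_elimination` are hypothetical names, and the gradient-descent fit is a stand-in for a real solver. It is not scikit-learn's implementation, which this report uses through `sklearn.feature_selection.RFE`.

```python
import math


def fit_logistic(rows, labels, lr=1.0, epochs=300):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def recursive_feature_elimination(rows, labels, n_features_to_select, step):
    """Return indices of the features that survive elimination."""
    active = list(range(len(rows[0])))
    while len(active) > n_features_to_select:
        # Refit on the surviving features only
        subset = [[row[j] for j in active] for row in rows]
        weights, _ = fit_logistic(subset, labels)
        # Drop up to `step` features with the smallest absolute weight
        n_drop = min(step, len(active) - n_features_to_select)
        order = sorted(range(len(active)), key=lambda k: abs(weights[k]))
        drop = set(order[:n_drop])
        active = [f for k, f in enumerate(active) if k not in drop]
    return active


# Toy data: only feature 2 separates the two classes
rows = [
    [0.1, 0.2, 0.9, 0.0],
    [0.2, 0.1, 0.8, 0.1],
    [0.1, 0.1, 1.0, 0.0],
    [0.2, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.0, 0.0],
    [0.2, 0.1, 0.2, 0.1],
]
labels = [1, 1, 1, 0, 0, 0]
selected = recursive_feature_elimination(rows, labels, n_features_to_select=1, step=2)
```

A larger `step` removes more features per refit, trading ranking precision for speed. With roughly 23,000 starting terms, a step of 100 keeps the number of refits in the low hundreds.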

## Find Most Predictive Terms using Recursive Feature Elimination

In [36]:
# Create logistic regression model for RFE
model_TF_LR_RFE = LogisticRegression(random_state=42)

In [37]:
# Create Recursive Feature Elimination instance
rfe_TF_LR_RFE = RFE(model_TF_LR_RFE, n_features_to_select=100, step=100)

In [38]:
%%time

# Reduce features using Recursive Feature Elimination
rfe_TF_LR_RFE = rfe_TF_LR_RFE.fit(trainTerms, dfTrain.failed_hygiene)

CPU times: user 1min 12s, sys: 16.6 s, total: 1min 28s
Wall time: 27 s


In [39]:
# Inspect remaining terms
list(compress(tf.get_feature_names(), rfe_TF_LR_RFE.get_support()))

['!',
 '#',
 '$',
 '&',
 "'ll",
 "'s",
 "'ve",
 ',',
 '.',
 '...',
 '160',
 ':',
 ';',
 '?',
 '``',
 'amp',
 'around..',
 'atmosphere',
 'awesome',
 'bad',
 'banh',
 'bar',
 'beer',
 'best',
 'burger',
 'cash',
 'cheap',
 'cheese',
 'chicken',
 'chinese',
 'city',
 'curry',
 'decor',
 'dim',
 'dish',
 'drink',
 'egg',
 'falafel',
 'far',
 'fast',
 'food',
 'fried',
 'friendly',
 'fry',
 'good',
 'grab',
 'great',
 'happy',
 'hot',
 'just',
 'know',
 'like',
 'love',
 'meal',
 'meat',
 'mi',
 'minute',
 'monday',
 'morning',
 "n't",
 'nice',
 'noodle',
 'order',
 'oyster',
 'pad',
 'parking',
 'philly',
 'pho',
 'pizza',
 'place',
 'pork',
 'pretty',
 'price',
 'quick',
 'really',
 'respectable',
 'rice',
 'roll',
 'room',
 'salad',
 'say',
 'selection',
 'spaghetti',
 'special',
 'special..',
 'spicy',
 'staff',
 'star',
 'sum',
 'super',
 'thai',
 'think',
 'time',
 'tofu',
 'town',
 'tried',
 'try',
 'unless',
 'wa',
 'went']

In [40]:
# Restrict term probability matrices to best terms from RFE
trainTerms_RFE = trainTerms[:, rfe_TF_LR_RFE.get_support(indices=True)]
testTerms_RFE = testTerms[:, rfe_TF_LR_RFE.get_support(indices=True)]

## Train Logistic Regression Model after RFE

In [41]:
%%time

# Train logistic regression model
model_TF_LR_RFE = model_TF_LR_RFE.fit(trainTerms_RFE, dfTrain.failed_hygiene)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.55 ms


In [44]:
# Calculate F1 score
model_TF_LR_RFE_Pred = model_TF_LR_RFE.predict(testTerms_RFE)
model_TF_LR_RFE_F1 = f1_score(dfTest.failed_hygiene, model_TF_LR_RFE_Pred)
print("Model 02 Logistic Regression of Term Probabilities after RFE F-1 Score: {:.6f}".format(model_TF_LR_RFE_F1))

Model 02 Logistic Regression of Term Probabilities after RFE F-1 Score: 0.000000


In [43]:
# Display confusion matrix
model_TF_LR_RFE_CM = confusion_matrix(dfTest.failed_hygiene, model_TF_LR_RFE_Pred)
print("True Negatives: {0:,}\nTrue Positives: {3:,}\nFalse Negatives: {2:,}\nFalse Positives: {1:,}".format(*model_TF_LR_RFE_CM.ravel()))

True Negatives: 77
True Positives: 0
False Negatives: 86
False Positives: 1
