loganjtravis@gmail.com (Logan Travis)

In [145]:
%%capture --no-stdout

# Imports; captures errors to suppress warnings about changing
# import syntax
from itertools import compress
import matplotlib.pyplot as plot
import matplotlib.cm as cm
import matplotlib
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score

In [3]:
# Set random seed for repeatability
random.seed(42)

In [4]:
# Set matplotlib to inline to preserve images in PDF
%matplotlib inline

# Summary

From course page [Week 5 > Task 6 Information > Task 6 Overview](https://www.coursera.org/learn/data-mining-project/supplement/gvCsC/task-4-and-5-overview):

> In this task, you are going to predict whether a set of restaurants will pass the public health inspection tests given the corresponding Yelp text reviews along with some additional information such as the locations and cuisines offered in these restaurants. Making a prediction about an unobserved attribute using data mining techniques represents a wide range of important applications of data mining. Through working on this task, you will gain direct experience with such an application. Due to the flexibility of using as many indicators for prediction as possible, this would also give you an opportunity to potentially combine many different algorithms you have learned from the courses in the Data Mining Specialization to solve a real world problem and experiment with different methods to understand what’s the most effective way of solving the problem.
> 
> **About the Dataset**
> You should first [download the dataset](https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz). The dataset is composed of a training subset containing 546 restaurants used for training your classifier, in addition to a testing subset of 12753 restaurants used for evaluating the performance of the classifier. In the training subset, you will be provided with a binary label for each restaurant, which indicates whether the restaurant has passed the latest public health inspection test or not, whereas for the testing subset, you will not have access to any labels. The dataset is spread across three files such that the first 546 lines in each file correspond to the training subset, and the rest are part of the testing subset. Below is a description of each file:
>
> * hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
> * hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have "[None]" in their label field implying that they are part of the testing subset.
> * hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).

# A Note on This Report

I hid much of my code, displaying only the chunks that clarify my process. My previous reports exceeded 15 pages, mostly Python code, and reviewers suggested replacing code with written descriptions for clarity.

# Predictive Model 01: Unigrams and Logistic Regression

I start by representing text as a unigram vector then applying logistic regression. This predictive model gives a useful baseline for future methods. It also highlights the difficulty of the prediction: Logistic regression alone proves an *incredibly* poor predictor!

## Prepare Training Data

In [33]:
# Set paths to data source, work in process ("WIP"), and output
PATH_SOURCE = "source"
PATH_WIP = "wip"
PATH_OUTPUT = "output"

# Set file paths
PATH_SOURCE_TRAIN_TEXT = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat"
PATH_SOURCE_TRAIN_LABELS = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.labels"
PATH_SOURCE_TRAIN_REST = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.additional"
PATH_SOURCE_TARGET_TEXT = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat"
PATH_SOURCE_TARGET_REST = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat.additional"

# Set output paths
PATH_OUTPUT_PRED_LABELS = f"{PATH_OUTPUT}/pred_hygiene.dat.labels"

In [47]:
# Get training text and labels
with open(PATH_SOURCE_TRAIN_TEXT) as f:
    arrTrainText = [l.rstrip() for l in f]
with open(PATH_SOURCE_TRAIN_LABELS) as f:
    arrTrainLabels = [l.rstrip() == "1" for l in f]
dfTrain = pd.DataFrame(data={"failed_hygiene": arrTrainLabels, "review_text": arrTrainText})

In [48]:
# Split data into training and testing sets
dfTrain["use"] = "train"
# Use .loc for the assignment: .at only accepts a single row label, not a list
dfTrain.loc[random.sample(range(dfTrain.shape[0]), int(dfTrain.shape[0]*0.3)), "use"] = "test"
dfTrain.use = dfTrain.use.astype("category")

In [49]:
# Inspect first 10 rows
dfTrain.head(10)

| | failed_hygiene | review_text | use |
|---|---|---|---|
| 0 | True | The baguettes and rolls are excellent, and alt... | train |
| 1 | True | I live up the street from Betty. &#160;When my... | train |
| 2 | True | I'm worried about how I will review this place... | train |
| 3 | False | Why can't you access them on Google street vie... | train |
| 4 | False | Things to like about this place: homemade guac... | train |
| 5 | True | I had been holding off on visiting Bastille fo... | test |
| 6 | True | I had gone by this place as they were moving i... | test |
| 7 | False | Any chance I get to eat with my hands and have... | test |
| 8 | False | My favorite Thai restaurant in the U-District.... | test |
| 9 | False | I'm pretty sure someone who was born and raise... | train |


In [62]:
# Sanity check on training versus testing split
dfTrain["review_text_len"] = dfTrain.review_text.str.len()
dfTrain.groupby(["use", "failed_hygiene"]).agg({
    "review_text": ["count"],
    "review_text_len": ["mean", "std"]
})

| use | failed_hygiene | review_text (count) | review_text_len (mean) | review_text_len (std) |
|---|---|---|---|---|
| test | False | 86 | 6329.860465 | 10091.451518 |
| test | True | 77 | 10411.194805 | 13845.61325 |
| train | False | 187 | 7274.946524 | 9696.181689 |
| train | True | 196 | 9479.117347 | 11288.419323 |
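
The counts above show that the random split put a somewhat different share of failed restaurants in each subset (77 of 163 in test versus 196 of 383 in train). One way to hold that share constant is a stratified split. The following is a minimal sketch on toy data, assuming scikit-learn's `train_test_split`; the `toy` frame is hypothetical, and this is not the split used in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for dfTrain: 20 restaurants, half of them failed
toy = pd.DataFrame({
    "failed_hygiene": [True, False] * 10,
    "review_text": [f"review {i}" for i in range(20)],
})

# Stratify on the label so both subsets keep the same share of failures
train_idx, test_idx = train_test_split(
    toy.index.to_numpy(),
    test_size=0.3,
    stratify=toy.failed_hygiene,
    random_state=42,
)

# Mark each row with the subset it belongs to
toy["use"] = "train"
toy.loc[test_idx, "use"] = "test"

print(toy.groupby(["use", "failed_hygiene"]).size())
```

Passing `random_state` also makes the split repeatable without relying on the global `random` seed.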


## Create Unigram Frequency Matrix

I chose not to use IDF weighting because the data concatenates all reviews for a single restaurant with no delimiter to split them. Instead, I count terms and then normalize the counts for each restaurant so that each row sums to one.

One caution: The training data does not include all possible terms. The testing data will likely include new terms, as will future reviews. I intend this first model as a simple baseline, so I do not address the problem.
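
To make the caution concrete, here is a small sketch of how a fitted `CountVectorizer` treats a term it never saw. The two-review corpus and the name `toy_vectorizer` are hypothetical; only the default tokenizer is used here, not the lemmatizing tokenizer defined below.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-review corpus standing in for the training text
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["great thai food", "dirty kitchen floor"])

# "sushi" never appeared during fitting, so transform() has no column for it
new_counts = toy_vectorizer.transform(["great sushi"])

print(sorted(toy_vectorizer.vocabulary_))
print(new_counts.toarray())
```

The unseen term is silently dropped, so a test review made up mostly of new vocabulary contributes very little signal to the model.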

In [104]:
# Set token limit
MAX_FEATURES = 100000

# Set document frequency ceiling
MAX_DF = 1.0

# Set document frequency floor
MIN_DF = 1

In [64]:
class MyTokenizer:
    def __init__(self):
        """String tokenizer that lemmatizes tokens with WordNet (no stemming)."""
        self.wnl = nltk.stem.WordNetLemmatizer()
    
    def __call__(self, document):
        """Return tokens from a string."""
        return [self.wnl.lemmatize(token) for \
                        token in nltk.word_tokenize(document)]

In [105]:
# Create TF vectorizer 
tf = CountVectorizer(max_features=MAX_FEATURES, max_df=MAX_DF, \
                     min_df=MIN_DF, stop_words="english", \
                     tokenizer=MyTokenizer())

In [106]:
%%time

# Calculate training term frequencies
trainTerms = tf.fit_transform(dfTrain[dfTrain.use == "train"].review_text)

CPU times: user 6.08 s, sys: 0 ns, total: 6.08 s
Wall time: 6.11 s


In [108]:
# Normalize for each restaurant
trainTerms = trainTerms / trainTerms.sum(axis=1)

In [109]:
%%time

# Calculate testing term frequencies; Note: Transform ONLY,
# no additional fitting
testTerms = tf.transform(dfTrain[dfTrain.use == "test"].review_text)

CPU times: user 2.58 s, sys: 0 ns, total: 2.58 s
Wall time: 2.63 s


In [110]:
# Normalize for each restaurant
testTerms = testTerms / testTerms.sum(axis=1)
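
As far as I understand SciPy's behavior, dividing a sparse count matrix by its row sums returns a dense matrix, and a row that sums to zero (for example, a review made up entirely of stop words or unseen terms) would divide by zero. An alternative sketch, assuming scikit-learn's `normalize` with `norm="l1"`; the `toy_counts` matrix is hypothetical, and this is not the code that produced the results in this report.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import normalize

# Hypothetical term counts: three restaurants, three terms
toy_counts = csr_matrix(np.array([[2.0, 1.0, 1.0],
                                  [0.0, 3.0, 0.0],
                                  [0.0, 0.0, 0.0]]))

# L1 normalization divides each row by the sum of its absolute values,
# which for non-negative counts equals the row sum
toy_probs = normalize(toy_counts, norm="l1", axis=1)

print(issparse(toy_probs))
print(toy_probs.toarray())
```

The result stays sparse, and the all-zero row stays at zero.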

## Train Logistic Regression Model

In [120]:
# Create logistic regression model
model_TF_LR = LogisticRegression(random_state=42)

In [121]:
%%time

# Train logistic regression model
model_TF_LR = model_TF_LR.fit(trainTerms, dfTrain[dfTrain.use == "train"].failed_hygiene)

CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 31.6 ms


In [152]:
# Calculate F1 score
model_TF_LR_TestPred = model_TF_LR.predict(testTerms)
model_TF_LR_TestF1 = f1_score(dfTrain[dfTrain.use == "test"].failed_hygiene, model_TF_LR_TestPred)
print("Model 01 Logistic Regression of Term Probabilities F-1 Score: {:.6f}".format(model_TF_LR_TestF1))

Model 01 Logistic Regression of Term Probabilities F-1 Score: 0.647059


In [158]:
# Display confusion matrix
model_TF_LR_CM = confusion_matrix(dfTrain[dfTrain.use == "test"].failed_hygiene, model_TF_LR_TestPred)
print("True Negatives: {0:,}\nTrue Positives: {3:,}\nFalse Positives: {1:,}\nFalse Negatives: {2:,}".format(*model_TF_LR_CM.ravel()))

True Negatives: 2
True Positives: 77
False Positives: 84
False Negatives: 0


In [141]:
# Create recursive feature eliminator: keep 10 features, dropping 100 per step
rfe = RFE(model_TF_LR, n_features_to_select=10, step=100)

In [142]:
%%time

# Reduce features using Recursive Feature Elimination
rfe = rfe.fit(trainTerms, dfTrain[dfTrain.use == "train"].failed_hygiene)

CPU times: user 28.9 s, sys: 14 s, total: 42.9 s
Wall time: 27 s


In [143]:
list(compress(tf.get_feature_names(), rfe.support_))

['#', '&', '.', '...', '160', ';', '?', 'best', 'thai', 'wa']
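
Most of the selected features are punctuation or fragments of the HTML entity `&#160;` (a non-breaking space) that appears in the raw reviews, which suggests the tokens need cleaning before feature selection. A minimal sketch, assuming only the standard library; `clean_review` is a hypothetical helper and not part of the pipeline above.

```python
import html
import re


def clean_review(text):
    """Decode HTML entities, then keep only lowercase alphabetic tokens."""
    decoded = html.unescape(text)
    return re.findall(r"[a-z]+", decoded.lower())


print(clean_review("I live up the street from Betty. &#160;When my..."))
```

This crude pattern also splits contractions such as "can't" into two tokens, so a real pipeline would likely want a gentler filter.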