loganjtravis@gmail.com (Logan Travis)

In [164]:
%%capture --no-stdout

# Imports; captures errors to suppress warnings about changing
# import syntax
import matplotlib.pyplot as plot
import matplotlib.cm as cm
import matplotlib
import nltk
import pandas as pd
import random
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# Set random seed for repeatability
random.seed(42)

In [3]:
# Set matplotlib to inline to preserve images in PDF
%matplotlib inline

# Summary

From course page [Week 5 > Task 6 Information > Task 6 Overview](https://www.coursera.org/learn/data-mining-project/supplement/gvCsC/task-4-and-5-overview):

> In this task, you are going to predict whether a set of restaurants will pass the public health inspection tests given the corresponding Yelp text reviews along with some additional information such as the locations and cuisines offered in these restaurants. Making a prediction about an unobserved attribute using data mining techniques represents a wide range of important applications of data mining. Through working on this task, you will gain direct experience with such an application. Due to the flexibility of using as many indicators for prediction as possible, this would also give you an opportunity to potentially combine many different algorithms you have learned from the courses in the Data Mining Specialization to solve a real world problem and experiment with different methods to understand what’s the most effective way of solving the problem.
> 
> **About the Dataset**
You should first [download the dataset](https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz). The dataset is composed of a training subset containing 546 restaurants used for training your classifier, in addition to a testing subset of 12753 restaurants used for evaluating the performance of the classifier. In the training subset, you will be provided with a binary label for each restaurant, which indicates whether the restaurant has passed the latest public health inspection test or not, whereas for the testing subset, you will not have access to any labels. The dataset is spread across three files such that the first 546 lines in each file correspond to the training subset, and the rest are part of the testing subset. Below is a description of each file:
>
> * hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
> * hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have "[None]" in their label field implying that they are part of the testing subset.
> * hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).

# A Note on The Training Data

I realized after building my second predictive model that I may have incorrectly interpreted the meaning of `hygiene.dat.labels`. The assignment states, "...a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test." It does not indicate how recent those last inspections were. One restaurant might have failed its hygiene inspection only days before the data set was compiled, while another passed an inspection years ago. Also, the reviews lack time indicators, so a restaurant's reviews might include reviews from a decade ago when it failed a hygiene inspection. How those potentially negative reviews affect the prediction of future failure would depend on many factors.

**In short:** I recommend tempering expectations for accurate prediction. Even if a model works, it will necessarily overfit to the nuances of this training data.

# Predictive Model 01: Unigrams and Logistic Regression

I start by representing text as a unigram vector then applying logistic regression. This predictive model gives a useful baseline for future methods. It also highlights the difficulty of the prediction: Logistic regression alone proves an *incredibly* poor predictor!

## Prepare Training Data

In [54]:
# Set paths to data source, work in process ("WIP"), and output
PATH_SOURCE = "source"
PATH_WIP = "wip"
PATH_OUTPUT = "output"

# Set file paths
PATH_SOURCE_TRAIN_TEXT = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat"
PATH_SOURCE_TRAIN_LABELS = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.labels"
PATH_SOURCE_TRAIN_REST = f"{PATH_SOURCE}/Hygiene/train_hygiene.dat.additional"
PATH_SOURCE_TARGET_TEXT = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat"
PATH_SOURCE_TARGET_REST = f"{PATH_SOURCE}/Hygiene/target_hygiene.dat.additional"

# Set paths to AutoPhrase output
AUTOPHRASE_LOG = "AutoPhrase/models/hygiene/log.txt"
AUTOPHRASE_RESULTS = "AutoPhrase/models/hygiene/AutoPhrase.txt"

In [5]:
# Get training text and labels
with open(PATH_SOURCE_TRAIN_TEXT) as f:
    arrTrainText = [l.rstrip() for l in f]
with open(PATH_SOURCE_TRAIN_LABELS) as f:
    arrTrainLabels = [l.rstrip() == "1" for l in f]
dfTrain = pd.DataFrame(data={"failed_hygiene": arrTrainLabels, "review_text": arrTrainText})

In [6]:
# Record each review's length, then split data into training and testing sets
dfTrain["review_text_len"] = dfTrain.review_text.str.len()
dfTrain, dfTest = train_test_split(dfTrain, test_size=0.3, random_state=84)

In [7]:
# Inspect first 10 rows
dfTrain.head(10)

| | failed_hygiene | review_text | review_text_len |
|---|---|---|---|
| 463 | True | Lovely place! Great neighborhood feel, excelle... | 17352 |
| 240 | False | The Crab Spring rolls were absolutely amazing!... | 12390 |
| 461 | False | We went about a year ago... the experience was... | 3107 |
| 257 | False | I was expecting a lot more given all the great... | 2566 |
| 407 | False | This joint became a regular stop for us when w... | 4765 |
| 545 | False | A for effort. If you happen to be stuck with s... | 18692 |
| 465 | True | Eat breakfast here.This restaurant has one of ... | 4837 |
| 331 | False | I was going to watch a movie at SIFF but wante... | 9730 |
| 381 | True | All I had here were cha sao bao (BBQ pork buns... | 4814 |
| 66 | False | One of the best Phillies in the city. Service ... | 326 |


In [8]:
# Sanity check on training versus testing split
dfTrain.groupby(["failed_hygiene"]).agg({
    "review_text": ["count"],
    "review_text_len": ["mean", "std"]
})

| failed_hygiene | review_text (count) | review_text_len (mean) | review_text_len (std) |
|---|---|---|---|
| False | 195 | 7276.015385 | 10327.798184 |
| True | 187 | 9967.219251 | 12589.871449 |


In [9]:
# Sanity check on training versus testing split
dfTest.groupby(["failed_hygiene"]).agg({
    "review_text": ["count"],
    "review_text_len": ["mean", "std"]
})

| failed_hygiene | review_text (count) | review_text_len (mean) | review_text_len (std) |
|---|---|---|---|
| False | 78 | 6230.25641 | 8406.959927 |
| True | 86 | 9252.313953 | 10821.455737 |
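
The counts above show the two classes landing in slightly different proportions in each split (195/187 in training versus 78/86 in testing). A minimal sketch of how `train_test_split` could hold the class ratio fixed through its `stratify` parameter, assuming a stratified split is acceptable for this task. The toy frame `dfToy` and its contents are hypothetical, not the hygiene data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, half labeled True (failed).
dfToy = pd.DataFrame({
    "failed_hygiene": [True, False] * 50,
    "review_text": ["placeholder review"] * 100,
})

# `stratify` keeps the label ratio the same in both splits.
dfToyTrain, dfToyTest = train_test_split(
    dfToy, test_size=0.3, random_state=84,
    stratify=dfToy.failed_hygiene)

# Share of failed restaurants in each split
print(dfToyTrain.failed_hygiene.mean(), dfToyTest.failed_hygiene.mean())
```

Whether stratifying would change any of the scores reported in this notebook is untested; the sketch only shows the mechanism.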


## Create Unigram Probability Matrix

I chose not to use IDF weighting because the data concatenates all reviews for a single restaurant with no delimiter to split them. Instead, I count terms and then normalize the counts for each restaurant, creating a term probability matrix. A few upfront comments:

* Logistic regression works best when the number of samples far exceeds the number of variables. That is not the case for this data set. I expect very poor performance with high sensitivity to model parameters.
* I did not tune the term vectorizer; it includes all terms found in the training data.
* I will tune the parameters in the next model. I intend this model as a *naive* baseline.
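
The normalization described above can be sketched on a tiny dense matrix. The counts here are hypothetical and the names `counts`, `rowTotals`, and `probs` are illustrative only; the notebook itself works on the sparse output of `CountVectorizer`:

```python
import numpy as np

# Hypothetical term counts: 2 restaurants (rows) x 3 terms (columns).
counts = np.array([[2, 1, 1],
                   [0, 3, 1]])

# Divide each row by its total so every restaurant's terms sum to 1.
rowTotals = counts.sum(axis=1, keepdims=True)
probs = counts / rowTotals

print(probs)
```

One caveat worth keeping in mind: a restaurant whose reviews contain none of the fitted terms would have a row total of zero, and this division would then produce undefined values for that row.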

In [10]:
class MyTokenizer:
    """String tokenizer that lemmatizes each token (no stemming)."""
    def __init__(self):
        self.wnl = nltk.stem.WordNetLemmatizer()
    
    def __call__(self, document):
        """Return lemmatized tokens from a string."""
        return [self.wnl.lemmatize(token)
                for token in nltk.word_tokenize(document)]

In [11]:
# Create TF vectorizer 
tf = CountVectorizer(max_df=1.0, min_df=1, \
                     stop_words="english", \
                     tokenizer=MyTokenizer())

In [12]:
%%time

# Calculate training term frequencies
trainTerms = tf.fit_transform(dfTrain.review_text)

CPU times: user 7.39 s, sys: 344 ms, total: 7.73 s
Wall time: 7.86 s


In [13]:
# Normalize for each restaurant
trainP = trainTerms / trainTerms.sum(axis=1)
print("{:,} restaurant reviews extracted into {:,} unigram terms.".format(*trainP.shape))

382 restaurant reviews extracted into 23,107 unigram terms.


In [14]:
%%time

# Calculate testing term frequencies; Note: Transform ONLY,
# no additional fitting
testTerms = tf.transform(dfTest.review_text)

CPU times: user 2.45 s, sys: 15.6 ms, total: 2.47 s
Wall time: 2.49 s


In [15]:
# Normalize for each restaurant
testP = testTerms / testTerms.sum(axis=1)

## Train Logistic Regression Model

In [16]:
# Create logistic regression model
model_TF_LR = LogisticRegression(random_state=42)

In [17]:
%%time

# Train logistic regression model
model_TF_LR = model_TF_LR.fit(trainP, dfTrain.failed_hygiene)

CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 25.1 ms


In [21]:
def printModelF1(truth, prediction, modelName, includeConfusionMatrix=True):
    """Print a model's F1 score and, optionally, its confusion matrix counts."""
    f1 = f1_score(truth, prediction)
    print("{}\n-----\nF-1 score: {:.6f}".format(modelName, f1))
    if includeConfusionMatrix:
        # ravel() on a binary confusion matrix yields TN, FP, FN, TP
        tn, fp, fn, tp = confusion_matrix(truth, prediction).ravel()
        print("True Negatives: {:,}\nTrue Positives: {:,}\nFalse Negatives: {:,}\nFalse Positives: {:,}".format(tn, tp, fn, fp))

In [22]:
# Calculate F1 score
model_TF_LR_Pred = model_TF_LR.predict(testP)
printModelF1(dfTest.failed_hygiene, model_TF_LR_Pred, \
             "Model 01 Logistic Regression of Term Probabilities")

Model 01 Logistic Regression of Term Probabilities
-----
F-1 score: 0.000000
True Negatives: 77
True Positives: 0
False Negatives: 86
False Positives: 1


The simple logistic regression performed *horribly*. It predicted only **one** failed hygiene inspection in the test data, and that was a false positive. Review text simply includes too much noise. While we would anticipate hygiene issues to appear in reviews, we should also expect them to drown in a sea of non-hygiene related reviews about the food, service, location, etc.
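
To make the score concrete, here is a minimal sketch of the standard F1 definition (the harmonic mean of precision and recall) applied to the counts reported for Model 01. The helper `f1_from_counts` is a hypothetical name introduced for illustration and is not part of the notebook or of scikit-learn:

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1 score from confusion-matrix counts."""
    if tp == 0:
        # No true positives: treat the score as 0 rather than divide by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts reported for Model 01: 0 true positives, 1 false positive,
# 86 false negatives.
print(f1_from_counts(tp=0, fp=1, fn=86))
```

Because the model found none of the 86 failing restaurants, recall is zero and so is F1, regardless of how many passing restaurants it classified correctly.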

# Predictive Model 02: Recursive Feature Elimination of Unigrams Before Logistic Regression

I next try a feature selection method called Recursive Feature Elimination ([`RFE` on SciKit Learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.get_support)). The 23,107 terms found in the training data far exceed the 382 training samples. If a simple logistic regression has predictive value, it first needs to train on only the most useful features. I therefore find the 30 top-ranked terms (by RFE) as a starting point to see what improvements logistic regression has to offer.
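
A minimal sketch of the RFE mechanics on synthetic data, assuming the same `RFE` and `LogisticRegression` classes used in this notebook. The data from `make_classification` and the names `selector` and `keptColumns` are hypothetical stand-ins, not the hygiene term matrix:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the term matrix: 100 samples, 50 features.
X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=42)

# Drop up to 10 of the weakest features per round until 5 remain.
selector = RFE(LogisticRegression(random_state=42),
               n_features_to_select=5, step=10)
selector = selector.fit(X, y)

# Column indices of the surviving features
keptColumns = selector.get_support(indices=True)
print(keptColumns)
```

A larger `step` means fewer refits and a faster run, at the cost of a coarser ranking of the eliminated features.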

## Find Most Predictive Terms using RFE

In [23]:
# Create logistic regression model for RFE
model_TF_LR_RFE = LogisticRegression(random_state=42)

In [24]:
# Create Recursive Feature Elimination instance
rfe_TF_LR_RFE = RFE(model_TF_LR_RFE, n_features_to_select=30, step=100)

In [25]:
%%time

# Reduce features using Recursive Feature Elimination
rfe_TF_LR_RFE = rfe_TF_LR_RFE.fit(trainP, dfTrain.failed_hygiene)

CPU times: user 1min 11s, sys: 16.9 s, total: 1min 28s
Wall time: 28.4 s


In [26]:
# Restrict training terms to best from RFE and calculate new
# relative probabilities
trainTerms_RFE = trainTerms[:, rfe_TF_LR_RFE.get_support(indices=True)]
trainP_RFE = trainTerms_RFE / trainTerms_RFE.sum(axis=1)

In [27]:
# Restrict testing terms to best from RFE and calculate new
# relative probabilities
testTerms_RFE = testTerms[:, rfe_TF_LR_RFE.get_support(indices=True)]
testP_RFE = testTerms_RFE / testTerms_RFE.sum(axis=1)

## Train Logistic Regression Model after RFE

In [28]:
%%time

# Train logistic regression model
model_TF_LR_RFE = model_TF_LR_RFE.fit(trainP_RFE, dfTrain.failed_hygiene)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.61 ms


In [29]:
# Calculate F1 score
model_TF_LR_RFE_Pred = model_TF_LR_RFE.predict(testP_RFE)
printModelF1(dfTest.failed_hygiene, model_TF_LR_RFE_Pred, \
             "Model 02 Logistic Regression of Term Probabilities after RFE")

Model 02 Logistic Regression of Term Probabilities after RFE
-----
F-1 score: 0.442857
True Negatives: 55
True Positives: 31
False Negatives: 55
False Positives: 23


Restricting the logistic model to the 30 best terms improves its performance significantly (F1 rises from 0.00 to 0.44). Tuning the term selection, whether with RFE or earlier in the count vectorizer, might improve prediction quality further.
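
As one illustration of tuning "earlier in the count vectorizer", the sketch below shows how the `min_df` parameter of `CountVectorizer` shrinks the vocabulary before any model sees it. The three-document corpus is hypothetical, and this is only one of several possible knobs:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the concatenated reviews.
docs = ["good food good service",
        "bad food dirty table",
        "good service clean table"]

# min_df=1 keeps every term; min_df=2 keeps terms seen in 2+ documents.
allTerms = CountVectorizer(min_df=1).fit(docs)
sharedTerms = CountVectorizer(min_df=2).fit(docs)

print(len(allTerms.vocabulary_), len(sharedTerms.vocabulary_))
print(sorted(sharedTerms.vocabulary_))
```

Here the vocabulary drops from seven terms to the four that appear in at least two documents. Whether a similar cut would help on the hygiene reviews is an open question.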

# Model 03: Latent Semantic Analysis of Unigrams Before Logistic Regression

RFE selects the best features from an existing data set. The features it removes do not predict *as well* as the features it keeps, but they can still have predictive value. I therefore try a feature decomposition method called Latent Semantic Analysis ([`TruncatedSVD` in SciKit Learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD)). The decomposition process still reduces the number of features, which is useful for logistic regression, but does so by linear combination of those features. Some ability to explain variation is still lost, though usually much less than with feature selection.

I tuned the number of decomposed features to 180. LSA frequently starts with 100 but I found, through trial and error, that 180 produced the best results.
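
The "linear combination" point can be sketched on a small random count matrix. The data and the names `counts`, `lsaToy`, `reduced`, and `manual` are hypothetical; only `TruncatedSVD` itself comes from the notebook:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical count matrix: 20 documents x 50 terms.
rng = np.random.default_rng(42)
counts = csr_matrix(rng.poisson(1.0, size=(20, 50)).astype(float))

# Fit on the "training" counts, then project onto 5 components.
lsaToy = TruncatedSVD(n_components=5, random_state=42).fit(counts)
reduced = lsaToy.transform(counts)

# Each new feature is a weighted sum of the original term counts.
manual = counts @ lsaToy.components_.T
print(reduced.shape, np.allclose(reduced, manual))
```

Every decomposed feature draws on all 50 original terms through its row of `components_`, which is why less information is discarded than when columns are dropped outright.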

In [30]:
# Create Latent Semantic Analysis instance
lsa = TruncatedSVD(n_components=180, random_state=42)

In [31]:
%%time

# Perform Latent Semantic Analysis on training terms
decomp_LSA = lsa.fit(trainTerms)

CPU times: user 5.22 s, sys: 1.12 s, total: 6.34 s
Wall time: 2.16 s


In [135]:
# Transform the raw training term counts into the
# LSA decomposed features
trainTermsLSA = decomp_LSA.transform(trainTerms)

In [136]:
# Transform the raw testing term counts into the
# LSA decomposed features
testTermsLSA = decomp_LSA.transform(testTerms)

In [34]:
# Create logistic regression model from LSA
model_LSA_LR = LogisticRegression(random_state=42)

In [35]:
%%time

# Train logistic regression model on LSA features
model_LSA_LR = model_LSA_LR.fit(trainTermsLSA, dfTrain.failed_hygiene)

CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 51.1 ms


In [36]:
# Calculate F1 score
model_LSA_LR_Pred = model_LSA_LR.predict(testTermsLSA)
printModelF1(dfTest.failed_hygiene, model_LSA_LR_Pred, \
             "Model 03 Logistic Regression of LSA from Term Frequencies")

Model 03 Logistic Regression of LSA from Term Frequencies
-----
F-1 score: 0.674419
True Negatives: 50
True Positives: 58
False Negatives: 28
False Positives: 28


Logistic regression of LSA components yielded good predictive value (within the limits of this training data). I would consider recommending it to a county hygiene inspector, especially one with more restaurants to inspect than time. It has a higher false negative rate than ideal (28 of the 86 failing restaurants in the test set were missed) **but** the method scales well. After initial training, a predictive pipeline could parallelize across restaurants thanks to:

* Simple term count as opposed to IDF weighting that requires evaluation over an entire corpus
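
A minimal sketch of why simple counts parallelize: once the vocabulary is fitted, each restaurant's count vector depends only on its own text, so transforming restaurants one at a time gives the same rows as transforming them together. The documents and the names `trainDocs`, `newDocs`, `batch`, and `oneByOne` are hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews; fit the vocabulary once on "training" text.
trainDocs = ["clean table good food", "dirty floor bad food"]
newDocs = ["good clean food", "bad dirty table", "food food floor"]
vectorizer = CountVectorizer().fit(trainDocs)

# Batch transform versus one restaurant at a time.
batch = vectorizer.transform(newDocs).toarray()
oneByOne = np.vstack([vectorizer.transform([doc]).toarray()
                      for doc in newDocs])

print(np.array_equal(batch, oneByOne))
```

The per-restaurant calls in the list comprehension are independent of one another, so in principle they could be distributed across workers.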


# Model 04: Latent Semantic Analysis of Frequent Phrases before Logistic Regression

I ran [AutoPhrase](https://github.com/shangjingbo1226/AutoPhrase) with custom `MODEL`, `RAW_TRAIN`, and `RAW_LABEL_FILE` parameters to train it against the Yelp reviews in the hygiene training data and the expert labels. Full command:

```bash
MODEL='./models/hygiene' RAW_TRAIN='./wip/train_hygiene.dat' RAW_LABEL_FILE='./wip/train_hygiene.dat.labels' ./auto_phrase.sh 2>&1 | tee ./models/hygiene/log.txt
```

In [55]:
# Print AutoPhrase log file
with open(AUTOPHRASE_LOG, "r") as f:
    print(f.read())

===Compilation===
===Tokenization===
Current step: Tokenizing input file...
real	0m1.996s
user	0m5.688s
sys	0m0.797s
Detected Language: EN
Current step: Tokenizing stopword file...
Current step: Tokenizing wikipedia phrases...
Current step: Tokenizing expert labels...
com.cybozu.labs.langdetect.LangDetectException: no features in text
	at com.cybozu.labs.langdetect.Detector.detectBlock(Detector.java:235)
	at com.cybozu.labs.langdetect.Detector.getProbabilities(Detector.java:221)
	at com.cybozu.labs.langdetect.Detector.detect(Detector.java:209)
	at Tokenizer.detectLanguage(Tokenizer.java:151)
	at Tokenizer.main(Tokenizer.java:824)
Using default setting for unknown languages...
Using default setting for unknown languages...
Using default setting for unknown languages...
Using default setting for unknown languages...
Using default setting for unknown languages...
===Part-Of-Speech Tagging===
Current step: Splitting files...
Current 

In [60]:
# Read AutoPhrase frequent phrases into dataframe
dfPhrases = pd.read_csv(AUTOPHRASE_RESULTS, sep="\t", \
                        names=["score", "phrase"], index_col="phrase")
dfPhrases.reset_index(inplace=True)

In [62]:
# Convert phrase dataframe to vocabulary dictionary for
# use in `CountVectorizer`
phrases = dfPhrases.phrase.to_dict() # {index:phrase}
phrases = {v: k for k, v in phrases.items()} # {phrase:index}

In [68]:
# Create phrase frequency vectorizer 
pf = CountVectorizer(max_df=1.0, min_df=1, \
                     stop_words="english", \
                     tokenizer=MyTokenizer(), \
                     vocabulary=phrases)
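
One detail worth checking with a fixed vocabulary: as far as I understand `CountVectorizer`, its default `ngram_range=(1, 1)` only generates single-word tokens, so multi-word vocabulary entries would never be counted. The sketch below illustrates this on a hypothetical two-entry vocabulary. Whether the AutoPhrase output used here contains multi-word phrases, and whether a wider `ngram_range` would change the results below, are assumptions that were not tested in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical phrase vocabulary in the {phrase: index} form used above.
toyPhrases = {"spring roll": 0, "pork": 1}
doc = ["crab spring roll and pork bun"]

# Default ngram_range=(1, 1) only produces single-word tokens.
unigramCounts = CountVectorizer(
    vocabulary=toyPhrases).fit_transform(doc)

# ngram_range=(1, 2) also produces two-word tokens such as "spring roll".
bigramCounts = CountVectorizer(
    vocabulary=toyPhrases, ngram_range=(1, 2)).fit_transform(doc)

print(unigramCounts.toarray(), bigramCounts.toarray())
```

In this toy case the unigram-only vectorizer counts "pork" but misses "spring roll", while the wider range counts both.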

In [69]:
%%time

# Calculate training phrase frequencies
trainPhrases = pf.fit_transform(dfTrain.review_text)

CPU times: user 6.73 s, sys: 15.6 ms, total: 6.75 s
Wall time: 6.98 s


In [70]:
%%time

# Calculate testing phrase frequencies; Note: Transform ONLY,
# no additional fitting
testPhrases = pf.transform(dfTest.review_text)

CPU times: user 2.52 s, sys: 0 ns, total: 2.52 s
Wall time: 2.65 s


In [71]:
# Create Latent Semantic Analysis instance
lsaPhrases = TruncatedSVD(n_components=180, random_state=42)

In [72]:
%%time

# Perform Latent Semantic Analysis on training phrases
decomp_LSAPhrases = lsaPhrases.fit(trainPhrases)

CPU times: user 1.03 s, sys: 46.9 ms, total: 1.08 s
Wall time: 320 ms


In [74]:
# Transform the raw training phrase counts into the
# LSA decomposed features
trainPhrasesLSA = decomp_LSAPhrases.transform(trainPhrases)

In [76]:
# Transform the raw testing phrase counts into the
# LSA decomposed features
testPhrasesLSA = decomp_LSAPhrases.transform(testPhrases)

In [77]:
# Create logistic regression model from phrase LSA
model_Phrase_LSA_LR = LogisticRegression(random_state=42)

In [78]:
%%time

# Train logistic regression model on phrase LSA features
model_Phrase_LSA_LR = model_Phrase_LSA_LR.fit(trainPhrasesLSA, dfTrain.failed_hygiene)

CPU times: user 31.2 ms, sys: 15.6 ms, total: 46.9 ms
Wall time: 40.5 ms


In [79]:
# Calculate F1 score
model_Phrase_LSA_LR_Pred = model_Phrase_LSA_LR.predict(testPhrasesLSA)
printModelF1(dfTest.failed_hygiene, model_Phrase_LSA_LR_Pred, \
             "Model 04 Logistic Regression of LSA from Phrase Frequencies")

Model 04 Logistic Regression of LSA from Phrase Frequencies
-----
F-1 score: 0.647059
True Negatives: 49
True Positives: 55
False Negatives: 31
False Positives: 29


# Model 05: K-Nearest Neighbor Classifier of Unigrams

In [149]:
# Create KNN classifier
model_KNN = KNeighborsClassifier(n_neighbors=15)

In [150]:
%%time

# Train KNN classifier on training terms
model_KNN = model_KNN.fit(trainTerms, dfTrain.failed_hygiene)

CPU times: user 15.6 ms, sys: 15.6 ms, total: 31.2 ms
Wall time: 1.39 ms


In [151]:
# Calculate F1 score
model_KNN_Pred = model_KNN.predict(testTerms)
printModelF1(dfTest.failed_hygiene, model_KNN_Pred, \
             "Model 05 K-Nearest Neighbor Classifier for Term Frequencies")

Model 05 K-Nearest Neighbor Classifier for Term Frequencies
-----
F-1 score: 0.551282
True Negatives: 51
True Positives: 43
False Negatives: 43
False Positives: 27


# Model 06: Random Forest Classifier of Unigrams

In [211]:
# Create Random Forest classifier
model_RF = RandomForestClassifier(n_estimators=55, criterion="entropy", random_state=42)

In [212]:
%%time

# Train Random Forest classifier on training terms
model_RF = model_RF.fit(trainTerms, dfTrain.failed_hygiene)

CPU times: user 203 ms, sys: 0 ns, total: 203 ms
Wall time: 202 ms


In [213]:
# Calculate F1 score
model_RF_Pred = model_RF.predict(testTerms)
printModelF1(dfTest.failed_hygiene, model_RF_Pred, \
             "Model 06 Random Forest Classifier for Term Frequencies")

Model 06 Random Forest Classifier for Term Frequencies
-----
F-1 score: 0.627219
True Negatives: 48
True Positives: 53
False Negatives: 33
False Positives: 30
