# Assignment 4
## Chantel A. Miller

Create classification model, predicting the outcome of food safety inspection based on the inspectors’ comments

- Leverage the results of your homework from Week-1 and Week-2 to extract free-form text comments from inspectors
- Discard the text from “Health Code” – only keep inspectors’ comments
- Build classification model, predicting the outcome of inspection – your target variable is “Results”
- Explain why you selected a particular text pre-processing technique
- Visualize results of at least two text classifiers and select the most robust one
- You can choose to build a binary classifier (limiting your data to Pass / Fail) or multinomial classifier with all available values in Results

Rules and requirements:

- Your final output and the code should be contained within Jupyter Notebook

In [31]:
# General
import pandas as pd
import numpy as np
from collections import Counter

# Regex
import re

# Plotting 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import metrics

In [3]:
data = pd.read_csv('Food_Inspections.csv')

### Data Exploration

In [4]:
data.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2130049,"JOE & THE JUICE ILLINOIS, LLC",JOE & THE JUICE,2564512.0,Restaurant,Risk 2 (Medium),10 E DELAWARE PL,CHICAGO,IL,60611.0,01/05/2018,License,Pass,,41.899255,-87.627835,"(41.89925505559848, -87.62783463799146)"
1,2130022,BOO BAE TEA INC,BOO BAE TEA INC,2570290.0,,Risk 2 (Medium),1013 W Webster AVE,CHICAGO,IL,60614.0,01/05/2018,License,Not Ready,,41.92162,-87.654051,"(41.92161984057171, -87.65405058579164)"
2,2130018,FRESHII,FRESHII,2446395.0,Restaurant,Risk 1 (High),1166 W MADISON ST,CHICAGO,IL,60607.0,01/05/2018,Canvass,Fail,"9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY P...",41.881731,-87.656851,"(41.881731324473414, -87.65685079354886)"
3,2129964,LINCOLN PARK PRESCHOOL,LINCOLN PARK PRESCHOOL & KINDERGARTEN,2215624.0,Daycare (2 - 6 Years),Risk 1 (High),108 W GERMANIA PL,CHICAGO,IL,60610.0,01/04/2018,License,Pass w/ Conditions,8. SANITIZING RINSE FOR EQUIPMENT AND UTENSILS...,41.910486,-87.631996,"(41.91048634702192, -87.63199583676088)"
4,2129963,ORIGINAL STEAM,ORIGINAL STEAM,2574892.0,,Risk 1 (High),2428 S WALLACE AVE,CHICAGO,IL,60616.0,01/04/2018,License,Not Ready,,41.848386,-87.64196,"(41.84838625123219, -87.64196007758322)"


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163254 entries, 0 to 163253
Data columns (total 17 columns):
Inspection ID      163254 non-null int64
DBA Name           163254 non-null object
AKA Name           160795 non-null object
License #          163239 non-null float64
Facility Type      158603 non-null object
Risk               163188 non-null object
Address            163254 non-null object
City               163110 non-null object
State              163234 non-null object
Zip                163183 non-null float64
Inspection Date    163254 non-null object
Inspection Type    163253 non-null object
Results            163254 non-null object
Violations         130310 non-null object
Latitude           162683 non-null float64
Longitude          162683 non-null float64
Location           162683 non-null object
dtypes: float64(4), int64(1), object(12)
memory usage: 21.2+ MB


In [6]:
data['Results'].unique()

array(['Pass', 'Not Ready', 'Fail', 'Pass w/ Conditions', 'No Entry',
       'Out of Business', 'Business Not Located'], dtype=object)

Keep only the Pass and Fail results and the columns needed for analysis.

In [7]:
data = data.loc[data['Results'].isin(['Pass', 'Fail'])]
data = data[['Results', 'Inspection ID', 'Facility Type', 'Risk', 'Zip', 'Inspection Date', 'Inspection Type', 'Violations']]
data.head()

Unnamed: 0,Results,Inspection ID,Facility Type,Risk,Zip,Inspection Date,Inspection Type,Violations
0,Pass,2130049,Restaurant,Risk 2 (Medium),60611.0,01/05/2018,License,
2,Fail,2130018,Restaurant,Risk 1 (High),60607.0,01/05/2018,Canvass,"9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY P..."
5,Fail,2129953,Restaurant,Risk 1 (High),60624.0,01/04/2018,Complaint,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...
6,Fail,2129949,Restaurant,Risk 1 (High),60608.0,01/04/2018,Canvass,12. HAND WASHING FACILITIES: WITH SOAP AND SAN...
7,Pass,2129931,Restaurant,Risk 1 (High),60645.0,01/04/2018,Complaint,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR..."


### Missing Values

We saw missing values in the Violations column earlier. Are they still present after restricting the data to Pass and Fail results?

In [8]:
len(data[data['Violations'].isnull()])

12701

Remove rows with missing violations, since they have no comment text to use as a predictor.

In [9]:
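# keep only rows with violation text; copy so that adding columns later does not raise SettingWithCopyWarning
data = data[data['Violations'].notnull()].copy()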
data.head()

Unnamed: 0,Results,Inspection ID,Facility Type,Risk,Zip,Inspection Date,Inspection Type,Violations
2,Fail,2130018,Restaurant,Risk 1 (High),60607.0,01/05/2018,Canvass,"9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY P..."
5,Fail,2129953,Restaurant,Risk 1 (High),60624.0,01/04/2018,Complaint,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...
6,Fail,2129949,Restaurant,Risk 1 (High),60608.0,01/04/2018,Canvass,12. HAND WASHING FACILITIES: WITH SOAP AND SAN...
7,Pass,2129931,Restaurant,Risk 1 (High),60645.0,01/04/2018,Complaint,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR..."
8,Pass,2129924,Restaurant,Risk 2 (Medium),60601.0,01/04/2018,License,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...


In [39]:
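### Helper functions

def split_violations(row):
    # each inspection stores all of its violations in one string, separated by '|'
    return row.split('|')
    
def parse_comments(row):
    # row is the list of violation strings returned by split_violations;
    # keep only the text after ' - Comments: ' in each violation
    comments_array = []
    for violation in row:
        comments_array.append(violation.rpartition(' - Comments: ')[2])
    # note: the comments are concatenated without a separator
    return ''.join(comments_array)


def apply_comment_extraction(row):
    # Violations string -> single string of inspector comments
    return parse_comments(split_violations(row))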
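
One detail of the helpers above is worth illustrating: str.rpartition returns the whole input as its last element when the separator is absent, so a violation without a ' - Comments: ' marker would pass its health-code text through unchanged. The sketch below is a hypothetical variant, not the code used in this notebook (the name extract_comments and the sample string are made up for illustration). It skips entries that have no comment and joins the remaining comments with a space.

```python
SEPARATOR = ' - Comments: '

def extract_comments(violations):
    """Return only the inspector comments from a '|'-separated violations string."""
    comments = []
    for violation in violations.split('|'):
        # rpartition gives (head, separator, tail); separator is '' when not found
        _, found, comment = violation.rpartition(SEPARATOR)
        if found:
            comments.append(comment.strip())
    return ' '.join(comments)

# Made-up sample in the same layout as the Violations column
sample = (
    '9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY PRESSURE'
    ' - Comments: NO HOT WATER AT HAND SINK. '
    '| 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE'
    ' - Comments: MUST REPAIR COVING.'
    '| 41. PREMISES MAINTAINED FREE OF LITTER'
)
print(extract_comments(sample))
```

Using this variant instead of the helpers above would change the extracted text slightly, so the matrix shapes and scores reported later would not necessarily be reproduced exactly.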

In [23]:
#test_df['Comments'] = test_df['Violations'].apply(apply_comment_extraction)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [25]:
#test_df['Comments'].iloc[0]

['OBSERVE NO HOT RUNNING WATER ON THE PREMISES; THAT INCLUDES EXPOSED HAND SINKS IN FRONT & REAR PREP AREAS, 3-COMPARTMENT SINK IN REAR PREP, AND HAND SINKS IN BOTH TOILET ROOMS. INSTRUCTED TO CONTACT PLUMMER TO HAVE HOT WATER RESTORED. CRITICAL VIOLATION 7-38-030. ',
 'PREVIOUS MINOR VIOLATION NOT CORRECTED FROM INSPECTION REPORT 1989320, DATED 2/17/2017. VIOLATION INCLUDES; #38; NO RUNNING HOT AND COLD WATER TO TOP LOADING SOFT SERVE MACHINE, INSTRUCTED TO PROVIDE,  \n \nVIOLATION STILL EXISTS. SERIOUS VIOLATION 7-42-090 ',
 'MUST DISCONTINUE USING MILK CRATES AS STORAGE RACKS THROUGHOUT FRONT AND REAR PREP AREAS, AND IN THE WALK IN COOLER. INSTALL CORRECT STORAGE RACKS. ',
 'MUST REPAIR COVING ON WALL ACROSS FROM THE EXPOSED HAND SINK  \nIN THE REAR PREP AREA.']

In [40]:
data['Comments'] = data['Violations'].apply(apply_comment_extraction)

In [29]:
data['Results'] = data['Results'].apply(lambda x: 1 if x == 'Pass' else 0)

In [41]:
data.head()

Unnamed: 0,Results,Inspection ID,Facility Type,Risk,Zip,Inspection Date,Inspection Type,Violations,Comments
2,0,2130018,Restaurant,Risk 1 (High),60607.0,01/05/2018,Canvass,"9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY P...",OBSERVE NO HOT RUNNING WATER ON THE PREMISES; ...
5,0,2129953,Restaurant,Risk 1 (High),60624.0,01/04/2018,Complaint,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...,OBSERVED POTENTIALLY HAZARDOUS FOODS AT IMPROP...
6,0,2129949,Restaurant,Risk 1 (High),60608.0,01/04/2018,Canvass,12. HAND WASHING FACILITIES: WITH SOAP AND SAN...,OBSERVED NO SOAP OR PAPER TOWELS AT EXPOSED HA...
7,1,2129931,Restaurant,Risk 1 (High),60645.0,01/04/2018,Complaint,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",HOLES IN WALLS IN THE FOLLOWING AREAS:LARGE HO...
8,1,2129924,Restaurant,Risk 2 (Medium),60601.0,01/04/2018,License,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,"COOLERS NOT CLEAN INSIDE, LEFTOVER FOODS LEFT ..."


In [42]:
X = data['Comments']
y = data['Results']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


We could clean up the comments first, but let's build a model before doing so.

### Vectorize Dataset

In [37]:
### instantiate the vectorizer
vect = CountVectorizer()

In [43]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<85940x31216 sparse matrix of type '<class 'numpy.int64'>'
	with 5231157 stored elements in Compressed Sparse Row format>

In [46]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<28647x31216 sparse matrix of type '<class 'numpy.int64'>'
	with 1739806 stored elements in Compressed Sparse Row format>
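
To get a feel for how sparse the training document-term matrix is, the figures printed above can be turned into a density and an average number of distinct terms per comment. This is plain arithmetic on the reported shape and stored-element count, assuming every stored element is a non-zero count.

```python
# Shape and stored-element count reported for X_train_dtm above
n_docs, n_terms = 85940, 31216
stored = 5231157

# Fraction of matrix cells that hold a non-zero count
density = stored / (n_docs * n_terms)

# Each stored element is one (comment, term) pair, so this is the
# average number of distinct terms per comment
avg_terms_per_doc = stored / n_docs

print(f'density: {density:.3%}')
print(f'average distinct terms per comment: {avg_terms_per_doc:.1f}')
```

Roughly 0.2% of the cells are filled, which is why a sparse format is used rather than a dense array.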

## Model Building

### Naive Bayes

In [44]:
# instantiate a multinomial naive bayes model

nb = MultinomialNB()

# train and time model using X_train_dtm
%time nb.fit(X_train_dtm, y_train)

Wall time: 81.2 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [47]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.925541941565


In [48]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.82      0.90      0.86      7124
          1       0.97      0.93      0.95     21523

avg / total       0.93      0.93      0.93     28647



In [49]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6398   726]
 [ 1407 20116]]
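
The numbers in the classification report can be recovered from this confusion matrix. The sketch below does the arithmetic by hand, relying on the scikit-learn convention that rows are true labels and columns are predicted labels, with label 0 (Fail) first and label 1 (Pass) second.

```python
# Naive Bayes confusion matrix printed above:
# rows are true labels, columns are predictions, ordered 0 (Fail) then 1 (Pass)
cm = [[6398, 726],
      [1407, 20116]]

true_fail, false_pass = cm[0]   # actual Fail: predicted Fail / predicted Pass
false_fail, true_pass = cm[1]   # actual Pass: predicted Fail / predicted Pass
total = true_fail + false_pass + false_fail + true_pass

accuracy = (true_fail + true_pass) / total
precision_fail = true_fail / (true_fail + false_fail)
recall_fail = true_fail / (true_fail + false_pass)

print(f'accuracy:         {accuracy:.4f}')
print(f'precision (Fail): {precision_fail:.2f}')
print(f'recall (Fail):    {recall_fail:.2f}')
```

The 1,407 passing inspections predicted as failures are what pull the Fail precision down to 0.82.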


### Logistic Regression

In [51]:
# instantiate a logistic regression model
logreg = LogisticRegression()

# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

Wall time: 45.7 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [52]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.972981464028


In [53]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.97      0.92      0.94      7124
          1       0.97      0.99      0.98     21523

avg / total       0.97      0.97      0.97     28647



In [54]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6550   574]
 [  200 21323]]


### Support Vector Machine

In [55]:
# instantiate a linear SVM trained with stochastic gradient descent (default hinge loss)
svm = SGDClassifier(max_iter=100, tol=None)

# train the model using X_train_dtm
%time svm.fit(X_train_dtm, y_train)

Wall time: 3.28 s


SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [60]:
# make class predictions for X_test_dtm
y_pred_class = svm.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.964080008378


In [61]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.94      0.91      0.93      7124
          1       0.97      0.98      0.98     21523

avg / total       0.96      0.96      0.96     28647



In [62]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6517   607]
 [  422 21101]]


### Conclusions

Even without cleaning the comments, all three models predicted the test data with more than 92% accuracy. The best performer was logistic regression, with an accuracy of 97.3%. Although this is already very high, let's try to improve on it by cleaning up the comment text and applying some basic pre-processing.
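
For context on these accuracies, it helps to know what a classifier would score by always predicting the most common class. The short calculation below uses the test-set class counts from the support column of the classification reports above. It is only a reference point, not a model that was fit in this notebook.

```python
# Test-set class counts, taken from the "support" column of the reports above
n_fail, n_pass = 7124, 21523

# Accuracy of always predicting the majority class (Pass)
baseline_accuracy = max(n_fail, n_pass) / (n_fail + n_pass)

print(f'majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

About three quarters of the test inspections are Pass, so the 92% to 97% accuracies are well above this baseline, and the per-class precision and recall remain worth checking alongside accuracy.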

Some pre-processing techniques that might help are: removing stop words; removing words that appear in many documents (inspections), which have less predictive power; and excluding words that are too rare, such as those that appear in only one document (inspection), which also have little predictive power.

In [78]:
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [63]:
### instantiate the vectorizer: remove English stop words and use n-grams of 1 to 3 words
# ignore terms that have a document frequency above 50%
# only include terms that appear in at least two documents
vect = CountVectorizer(stop_words='english', ngram_range=(1, 3), max_df=0.5, min_df=2)
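
To make the effect of these settings concrete, here is a minimal sketch on a made-up four-document corpus (the sentences are invented for illustration and are not taken from the inspection data). It compares the default vocabulary with the one kept after stop-word removal, n-gram expansion, and the max_df and min_df limits.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus standing in for inspector comments
docs = [
    'the violation floor is dirty',
    'the violation floor is clean',
    'the violation sink is dirty',
    'the sink is broken',
]

# Default settings: every token of two or more characters becomes a feature
plain = CountVectorizer().fit(docs)

# Same settings as the vectorizer above
pruned = CountVectorizer(stop_words='english', ngram_range=(1, 3),
                         max_df=0.5, min_df=2).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(pruned.vocabulary_))
```

In this toy case "the" and "is" are dropped as stop words, "violation" is dropped because it appears in more than half of the documents, and any term or n-gram seen in only one document is dropped by min_df. The bigram "violation floor" survives because it appears in exactly two documents.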

In [64]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<85940x627512 sparse matrix of type '<class 'numpy.int64'>'
	with 13537647 stored elements in Compressed Sparse Row format>

In [65]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<28647x627512 sparse matrix of type '<class 'numpy.int64'>'
	with 4340395 stored elements in Compressed Sparse Row format>

## Model Building

### Naive Bayes

In [66]:
# instantiate a multinomial naive bayes model

nb = MultinomialNB()

# train and time model using X_train_dtm
%time nb.fit(X_train_dtm, y_train)

Wall time: 315 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [67]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.941110762034


In [68]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.86      0.91      0.88      7124
          1       0.97      0.95      0.96     21523

avg / total       0.94      0.94      0.94     28647



In [69]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6476   648]
 [ 1039 20484]]


### Logistic Regression

In [70]:
# instantiate a logistic regression model
logreg = LogisticRegression()

# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

Wall time: 1min 19s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [71]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.977275107341


In [72]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.98      0.93      0.95      7124
          1       0.98      0.99      0.99     21523

avg / total       0.98      0.98      0.98     28647



In [73]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6621   503]
 [  148 21375]]


### Support Vector Machine

In [74]:
# instantiate a linear SVM trained with stochastic gradient descent (default hinge loss)
svm = SGDClassifier(max_iter=100, tol=None)

# train the model using X_train_dtm
%time svm.fit(X_train_dtm, y_train)

Wall time: 11.8 s


SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [75]:
# make class predictions for X_test_dtm
y_pred_class = svm.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.963346947324


In [76]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.94      0.91      0.93      7124
          1       0.97      0.98      0.98     21523

avg / total       0.96      0.96      0.96     28647



In [77]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 6485   639]
 [  411 21112]]


### Post cleanup conclusions

As expected, since the models already performed well without cleaning up the comments, there wasn't much room for improvement. The lowest-performing model without cleanup, naive Bayes, went from 92.6% to 94.1% accuracy on the test data. The best-performing model, logistic regression, went from 97.3% to 97.7%. The SVM was essentially unchanged (96.4% before, 96.3% after).
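
The comparison can be summarized in a few lines of arithmetic. The sketch below only restates the test accuracies printed in the cells above and computes the change in percentage points for each model. The dictionary layout and names are chosen here for illustration.

```python
# Test accuracies reported in the cells above: (before, after) the vectorizer changes
accuracies = {
    'Naive Bayes': (0.925541941565, 0.941110762034),
    'Logistic Regression': (0.972981464028, 0.977275107341),
    'SVM (SGD, hinge loss)': (0.964080008378, 0.963346947324),
}

# Change in accuracy, in percentage points
changes = {name: (after - before) * 100
           for name, (before, after) in accuracies.items()}

for name, (before, after) in accuracies.items():
    print(f'{name:<22} {before:.1%} -> {after:.1%} ({changes[name]:+.2f} points)')
```

Logistic regression remains the most accurate of the three in both runs, though it was also the slowest to train (45.7 s and 1 min 19 s wall time).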