## Food Inspection: Creating a Machine Learning Pipeline

The purpose of this notebook is to build a better preprocessing pipeline using scikit-learn's `ColumnTransformer`.  I have used pipelines in the past, but mostly for balancing the data, standardizing it, running different models, and optimizing those models.  Most of the preprocessing in those cases had already been done manually with pandas.  In this notebook, I aim to move into the pipeline the steps for transforming columns based on data type, imputing missing data, and potentially creating custom transformers to handle special cases.  
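
As a rough sketch of where this is heading, the block below wires median imputation, a custom transformer, scaling, and one-hot encoding into a single `ColumnTransformer`. The toy data frame and the `ClipUpper` transformer are hypothetical and only illustrate the pattern; they are not part of this project's data or code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ClipUpper(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: cap each column at a quantile learned in fit."""

    def __init__(self, quantile=0.95):
        self.quantile = quantile

    def fit(self, X, y=None):
        # learn one upper bound per column from the training data
        self.upper_ = np.quantile(np.asarray(X, dtype=float), self.quantile, axis=0)
        return self

    def transform(self, X):
        return np.minimum(np.asarray(X, dtype=float), self.upper_)


# made-up data: one numeric column with a gap and an outlier, one categorical column
toy_df = pd.DataFrame({
    "violation_count": [1.0, 4.0, np.nan, 2.0, 40.0],
    "risk": pd.Categorical(["high", "low", "medium", "medium", "high"]),
})

# numeric columns: impute, then clip, then scale
toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("clip", ClipUpper(quantile=0.8)),
    ("scaler", StandardScaler()),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", toy_numeric, make_column_selector(dtype_exclude="category")),
        ("cat", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="category")),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)

toy_transformed = toy_preprocessor.fit_transform(toy_df)
print(toy_transformed.shape)
```

Placing the imputer first means the custom step never sees missing values, which keeps the transformer itself simple.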

In [1]:
# Import required libraries

# # Code formatter
# # !pip3 install nb_black
# %load_ext nb_black

# eda tools
import numpy as np
import pandas as pd

# visualization dependencies
import matplotlib.pyplot as plt  
import seaborn as sns  

# machine learning libraries
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
# from sklearn.ensemble import RandomForestClassifier  

# pipeline generation
from sklearn.pipeline import Pipeline  
from sklearn.compose import ColumnTransformer  
from sklearn.compose import make_column_selector
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# hide jupyter lab warnings
import warnings
warnings.filterwarnings('ignore')

# expand the number of dataframe columns visible
pd.options.display.max_columns = 100

# make sound when this code executes: Audio(sound_file, autoplay=True)
from IPython.display import Audio
sound_file = './sound/chord.wav'

# display package information
# !conda install -c conda-forge session-info
import session_info
session_info.show()

### Read Dataset

In [2]:
# Read data
restaurant_df = pd.read_csv('./data/manipulated/combined_data.csv', parse_dates=['inspect_date', 'approx_start_date'])

In [3]:
# rename the first column: it is the original index, carried over because the index was included in the export from `feature_extraction.ipynb`
restaurant_df.rename(columns={'Unnamed: 0':'original_index'}, inplace=True)

# simplify risk labels (after one-hot encoding these categories become column names)
risk_labels = {'Risk 1 (High)': 'high', 'Risk 2 (Medium)': 'medium', 'Risk 3 (Low)': 'low'}
# Series.map leaves missing values (and any label not in the dict) as NaN
restaurant_df['risk'] = restaurant_df['risk'].map(risk_labels)

# limit dataset records
df = restaurant_df[restaurant_df['inspect_date'] > '2018-01-01'] 

# define target and feature columns
# maybe add violation id number to the model
target = df['results']

columns = ['risk', 'inspect_type', 'violation_count',
       'vl_must_comply_count',
       'vl_instructed_comply_count',
       'vl_citation_count', 'ward', 
       'license_code', 'bus_activity_id',
       'application_type', 'conditional_approval',
       'bus_age', 'number_of_chains']
features = df[columns].copy()  # copy so the dtype conversions below do not write into a slice of df

# Convert object columns to category
object_columns = features.select_dtypes(['object']).columns
features[object_columns] = features[object_columns].astype('category')

# Convert non-objects to category
features['ward'] = features['ward'].astype('category')
features['license_code'] = features['license_code'].astype('category')

# Convert float to integer
features['bus_age'] = features['bus_age'].astype('int')

# create labels for model
# I would like to add this step to the pipeline, but scikit-learn has no built-in target transformer for classification
# The regression equivalent is TransformedTargetRegressor
# The closest extension is mlinsights and its TransformedTargetClassifier2
label_encoder = LabelEncoder()
label_encoder.fit(target)
encoded_target = label_encoder.transform(target)
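
Because the encoding happens outside the pipeline, the fitted encoder has to be kept around to turn predictions back into readable labels. A minimal sketch of that round trip, using made-up labels rather than the real `results` column:

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical result labels, for illustration only
toy_target = ["Pass", "Fail", "Pass", "Fail", "Pass"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_target)

# classes_ is sorted, so the integer codes follow alphabetical order
print(list(toy_encoder.classes_))

# inverse_transform maps integer codes (e.g. model predictions) back to labels
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
print(list(toy_decoded))
```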


### Pipeline method

In [6]:
import pickle

# load model
with open('model.pkl', 'rb') as f:
    clf_rf = pickle.load(f)

# view model
clf_rf

In [7]:
# create model - split data, select model, fit model and output accuracy as generic outcome
X_train, X_test, y_train, y_test = train_test_split(features, encoded_target, train_size=0.75, random_state=42)

Reference: https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/

In [8]:
# identify numeric and categorical columns
numeric_features = make_column_selector(dtype_exclude='category')
numeric_transformer = Pipeline(steps=[
    # ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
    ])

categorical_features = make_column_selector(dtype_include='category')
categorical_transformer = OneHotEncoder(categories='auto', handle_unknown='ignore') 


# add number and category transformations and specify columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
        ])
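
To check which columns each selector will pick up, a selector can be called directly on a data frame. The sketch below uses a made-up frame whose column names merely mirror the ones in this notebook:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# made-up frame with two numeric and two categorical columns
toy_frame = pd.DataFrame({
    "violation_count": [3, 0, 7],
    "bus_age": [1.5, 12.0, 4.0],
    "risk": pd.Categorical(["high", "low", "medium"]),
    "ward": pd.Series([42, 1, 42]).astype("category"),
})

# a selector is a callable that returns the matching column names
toy_numeric_cols = make_column_selector(dtype_exclude="category")(toy_frame)
toy_categorical_cols = make_column_selector(dtype_include="category")(toy_frame)

print(toy_numeric_cols)
print(toy_categorical_cols)
```

Note that `dtype_exclude='category'` treats every non-category column as numeric, so any column left as `object` or `datetime` would be routed to the scaler.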

In [9]:
# create pipeline
clf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_rf)
                    ])

In [10]:
# train pipeline on data
clf_pipeline.fit(X_train, y_train)

# show accuracy score - still needs evaluation with a classification report
print(clf_pipeline.score(X_test, y_test))

0.6400029987255417


In [32]:
from sklearn.model_selection import cross_validate

cv = KFold(n_splits=10, shuffle=True, random_state=1)
score_methods = ['accuracy', 'precision', 'recall', 'neg_mean_absolute_error']
scores = cross_validate(clf_pipeline, X_train, y_train, scoring=score_methods, cv=cv, n_jobs=-1, return_train_score=True)



# mae = np.absolute(scores['test_neg_mean_absolute_error'])
# print('MAE: %.3f (%.3f)' % (np.mean(mae), np.std(mae)))



In [38]:
print(scores.keys())
print(scores['test_accuracy'].mean(), scores['test_accuracy'].std())
print(scores['train_accuracy'].mean(), scores['train_accuracy'].std())

dict_keys(['fit_time', 'score_time', 'test_accuracy', 'train_accuracy', 'test_precision', 'train_precision', 'test_recall', 'train_recall', 'test_neg_mean_absolute_error', 'train_neg_mean_absolute_error'])
0.6461490699289196 0.00868774646190784
0.6765604270618965 0.0023670199280752314
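
The plain `'precision'` and `'recall'` scorers assume a binary target. If the encoded target turns out to have more than two classes, the averaged variants such as `'precision_macro'` and `'recall_macro'` are one option. A small sketch on synthetic data (the dataset and the forest settings are placeholders, not this project's model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# synthetic three-class problem, for illustration only
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

toy_scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
toy_cv = KFold(n_splits=5, shuffle=True, random_state=1)
toy_scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_toy, y_toy, scoring=toy_scoring, cv=toy_cv,
)

# report mean and standard deviation across folds for each metric
for name in toy_scoring:
    fold_scores = toy_scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```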


In [28]:
onehot_columns = list(clf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out())
numerical_columns = features.columns[features.dtypes != 'category'].tolist()
cols = numerical_columns + onehot_columns

# show dataframe of feature name and model importance
imp_df = pd.DataFrame({
    "Varname": cols,
    "Imp": clf_pipeline.named_steps['classifier'].feature_importances_
})
temp = imp_df.sort_values(by="Imp", ascending=False)
temp.head(50)

| | Varname | Imp |
|---|---|---|
| 0 | violation_count | 0.467353 |
| 3 | vl_citation_count | 0.272344 |
| 10 | inspect_type_Canvass Re-Inspection | 0.023362 |
| 4 | bus_age | 0.023032 |
| 9 | inspect_type_Canvass | 0.018197 |
| 11 | inspect_type_Complaint | 0.017604 |
| 5 | number_of_chains | 0.016485 |
| 12 | inspect_type_Complaint Re-Inspection | 0.0117 |
| 14 | inspect_type_License | 0.008287 |
| 62 | ward_42.0 | 0.006746 |


In [None]:
import pickle

# save the fitted pipeline (preprocessor + classifier)
with open('model2.pkl', 'wb') as f:
    pickle.dump(clf_pipeline, f)

# # load model
# with open('model.pkl', 'rb') as f:
#     clf2 = pickle.load(f)

In [30]:
# show evaluation of training data  
# note1:  this evaluation is mainly useful for comparison against the test data set
# note2:  a large drop in scores from train to test indicates that overfitting is occurring
# note3:  underfitting shows up as large errors even on the training data
from sklearn.metrics import classification_report

# Predict with the full pipeline so the preprocessor runs first.
# Calling the bare classifier (clf_pipeline.steps[1][1]) on the raw features raises
# "ValueError: could not convert string to float: 'high'".
predictions = clf_pipeline.predict(X_train)
labels = y_train

# Get the classification report
report = classification_report(labels, predictions)
print(report)

In [None]:
# show evaluation of testing data  
# Get the predictions and labels
predictions = clf_pipeline.predict(X_test)
labels = y_test

# Get the classification report
report = classification_report(labels, predictions)
print(report)
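
Since the target was label-encoded, the report rows would show integer codes. Passing the encoder's classes as `target_names` is one way to get readable row labels. A sketch with made-up labels and predictions (`demo_encoder` is a stand-in for the notebook's `label_encoder`):

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# stand-in encoder fitted on hypothetical labels
demo_encoder = LabelEncoder().fit(["Fail", "Pass"])

# made-up true and predicted values, already encoded as integers
demo_true = demo_encoder.transform(["Pass", "Fail", "Pass", "Pass", "Fail"])
demo_pred = demo_encoder.transform(["Pass", "Pass", "Pass", "Fail", "Fail"])

# target_names follows the sorted order of the integer codes
demo_report = classification_report(
    demo_true, demo_pred, target_names=list(demo_encoder.classes_)
)
print(demo_report)
```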

In [None]:
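# create ROC/AUC evaluation (assumes a binary target)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2

# Predict positive-class probabilities for the test data;
# ROC AUC is more informative with scores than with hard labels
y_score = clf_pipeline.predict_proba(X_test)[:, 1]

# Calculate the ROC AUC
roc_auc = roc_auc_score(y_test, y_score)

# Plot the ROC curve
RocCurveDisplay.from_estimator(clf_pipeline, X_test, y_test)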

### Analysis  
- compare classification report
- evaluate feature importance  

### Future Steps  
- What features to remove
- What features to add to the model  
- Change CV eval metric from accuracy to precision/recall
- Create new notebook for further evaluation - classification reports, ROC/AUC curves  
- Maybe add food type (geo_feature notebook) and menu information to model  
