***
# What's Cooking? Phase II
***

Here we are starting Phase II, which builds on Phase I, and we will begin building some models. We will keep the models high level and focus on establishing some quick baselines to see whether any classifiers can be eliminated before additional parameter tuning in the next phase of the process.

***
# Index
***

* [**Imports**](#import)
* [**Custom Functions**](#cust)
* [**Data Import**](#read)
* [**Model Evaluation**](#model)
* [**Summary**](#sum)
* [**Next Steps**](#next)

<a id="---"></a>

<a id="import"></a>
# Imports

In [58]:
# imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline  # used to build the vectorizer + classifier pipelines

from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

<a id="cust"></a>
# Custom Functions

In [13]:
# create new features function 
def new_features(df_name):
    '''Create 4 new features for given dataframe'''
    # create feature 1 avg ingredient length
    num_ingred = 0
    avg_length =[]
    for i in df_name.itertuples():
        ing_length = 0
        num_ingred = len(i.ingredients)
        # sum the character length of each ingredient
        for ing in i.ingredients:
            ing_length += len(ing)
        avg_length.append(round(ing_length / num_ingred,2))
    df_name['avg_ingredients_len'] = avg_length  
    # create feature 2 count of ingredients 
    df_name['num_ingredients'] = df_name['ingredients'].apply(len)
    # create feature 3 difference between avg ingredient length and ingredient count
    df_name['difference'] = df_name['avg_ingredients_len'] - df_name['num_ingredients']
    # create feature 4 convert ingredients from list to string 
    df_name['ingredients_str'] = df_name['ingredients'].astype('str')
    return df_name


def hist_bins(d_frame, col_name, title="", y_axis="", x_axis="",
              adder=0, start=0, stop=0):
    '''Create a histogram with adjusted bins.

    `adder` is the bin width and must be non-zero; `start` and `stop`
    bound the bin edges.
    '''
    # bin edges from start through stop in steps of adder
    bin_edges = np.arange(start, stop + adder, adder)

    # mark the median and mean with dashed vertical lines
    plt.axvline(d_frame[col_name].median(),color='y',linestyle='--',lw=2,label='Median')
    plt.axvline(d_frame[col_name].mean(),color='r',linestyle='--',lw=2,label='Mean')

    plt.title(title,fontsize=14)
    plt.ylabel(y_axis,fontsize=14)
    plt.xlabel(x_axis,fontsize=14)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.legend(fontsize=14)
    return plt.hist(d_frame[col_name], bins=bin_edges)

<a id="read"></a>
# Data Import 

Using the custom function from Phase I, we import the two datasets and add our new features:

* **avg_ingredients_len**: the average character length of the ingredient names in each observation
* **num_ingredients**: the number of ingredients in each observation
* **difference**: `avg_ingredients_len` - `num_ingredients`
* **ingredients_str**: the ingredient list converted to a string
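As a quick check of these definitions, the sketch below recomputes the four features by hand for a single recipe. The ingredient list is the one in row 3 of the training data; the helper name `recipe_features` is hypothetical and is not part of the notebook.

```python
def recipe_features(ingredients):
    """Recompute the four engineered features for one recipe (illustrative helper)."""
    num_ingredients = len(ingredients)
    # average character length of the ingredient names, rounded like new_features
    avg_len = round(sum(len(ing) for ing in ingredients) / num_ingredients, 2)
    return {
        "avg_ingredients_len": avg_len,
        "num_ingredients": num_ingredients,
        "difference": avg_len - num_ingredients,
        "ingredients_str": str(ingredients),
    }


features = recipe_features(["water", "vegetable oil", "wheat", "salt"])
print(features)
```

The ingredient lengths are 5, 13, 5 and 4, so the average is 27 / 4 = 6.75 and the difference is 6.75 - 4 = 2.75, which agrees with the values for that row in `df.head()`.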

In [14]:
df = new_features(pd.read_json('whats_cooking_train.json'))
new = new_features(pd.read_json('whats_cooking_test.json'))

In [15]:
df.head()

| | id | cuisine | ingredients | avg_ingredients_len | num_ingredients | difference | ingredients_str |
|---|---|---|---|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... | 12.0 | 9 | 3.0 | ['romaine lettuce', 'black olives', 'grape tom... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... | 10.09 | 11 | -0.91 | ['plain flour', 'ground pepper', 'salt', 'toma... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | 10.33 | 12 | -1.67 | ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] | 6.75 | 4 | 2.75 | ['water', 'vegetable oil', 'wheat', 'salt'] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... | 10.1 | 20 | -9.9 | ['black pepper', 'shallots', 'cornflour', 'cay... |


In [16]:
df.tail()

| | id | cuisine | ingredients | avg_ingredients_len | num_ingredients | difference | ingredients_str |
|---|---|---|---|---|---|---|---|
| 39769 | 29109 | irish | [light brown sugar, granulated sugar, butter, ... | 12.17 | 12 | 0.17 | ['light brown sugar', 'granulated sugar', 'but... |
| 39770 | 11462 | italian | [KRAFT Zesty Italian Dressing, purple onion, b... | 17.0 | 7 | 10.0 | ['KRAFT Zesty Italian Dressing', 'purple onion... |
| 39771 | 2238 | irish | [eggs, citrus fruit, raisins, sourdough starte... | 8.25 | 12 | -3.75 | ['eggs', 'citrus fruit', 'raisins', 'sourdough... |
| 39772 | 41882 | chinese | [boneless chicken skinless thigh, minced garli... | 13.14 | 21 | -7.86 | ['boneless chicken skinless thigh', 'minced ga... |
| 39773 | 2362 | mexican | [green chile, jalapeno chilies, onions, ground... | 12.0 | 12 | 0.0 | ['green chile', 'jalapeno chilies', 'onions', ... |


The training data has 39,774 observations and 7 columns. The testing data has 9,944 observations and 6 columns; it does not contain `cuisine`, which is the variable we want to predict.

In [17]:
print(f"Shape of Training Data:\t {df.shape}")
print(f"Shape of Testing Data:\t {new.shape}")

Shape of Training Data:	 (39774, 7)
Shape of Testing Data:	 (9944, 6)


The features are not missing any values and the data types for each feature should be fine.

<a id="model"></a>
# Model Evaluation

## Null Accuracy/DummyClassifier
Here we use scikit-learn's `DummyClassifier` to establish a null-accuracy baseline. It supports a few naive strategies; below we use `most_frequent`.

Create four feature lists for the different feature options. The target `y` is the `cuisine` column.


In [18]:
# features 
features1 = ['ingredients_str']
features2 = ['avg_ingredients_len','num_ingredients']
features3 = ['difference']
features4 = ['avg_ingredients_len','num_ingredients','difference']

In [19]:
# train, test split
X_train, X_test, y_train, y_test = train_test_split(df[features2], df['cuisine'], random_state=42)

# instantiate the most-frequent DummyClassifier
dumb_mf = DummyClassifier(strategy='most_frequent')

# fit the dummy classifier on the training split
dumb_mf.fit(X_train,y_train)
# predict on the test split
y_pred_mf = dumb_mf.predict(X_test)

# run accuracy score
score_mf = metrics.accuracy_score(y_test,y_pred_mf)
print(f"DummyClassifier Most Frequent Score: {round(score_mf,4)}")

DummyClassifier Most Frequent Score: 0.1961


We have confirmed that by predicting Italian for every recipe our model will be correct **19.61** percent of the time on the test split. We will now start creating simple baselines for multiple models. First we instantiate the classifiers we want to try.
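Null accuracy is just the share of test labels that match the most frequent training label, so it can be cross-checked without scikit-learn. The sketch below uses a small made-up label list rather than the competition data, and the function name `null_accuracy` is an assumption for illustration.

```python
from collections import Counter


def null_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# toy labels (hypothetical), standing in for y_train and y_test
toy_train = ["italian", "italian", "mexican", "indian", "italian", "mexican"]
toy_test = ["italian", "mexican", "italian", "indian"]

label, acc = null_accuracy(toy_train, toy_test)
print(label, acc)
```

On the toy labels, "italian" is the most frequent training class and matches 2 of the 4 test labels, giving 0.5.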

In [53]:
# instantiate classifiers 
vect = CountVectorizer() # token_pattern=r"'([a-z ]+)'"
svm_cl = svm.SVC()
knn = KNeighborsClassifier()
nb = MultinomialNB()
log = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier()

Here we tried `Support Vector Machines`, `K-Neighbors` and `Multinomial Naive Bayes` on two of the features we engineered (`features2`: `avg_ingredients_len` and `num_ingredients`). The best score was 0.2222, which is only slightly higher than our `most frequent` baseline of 0.1961.

Note: SVM takes the longest to run. If it is used during parameter tuning, consider switching from cross-validation to a single train/test split.

In [41]:
# define X and y 
X = df[features2]
y = df['cuisine']

**SVM High Score**

In [42]:
cross_val_score(svm_cl, X, y, cv=5, scoring='accuracy').mean()

0.22220545319749468

**KNN**

In [23]:
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()

0.13936245605822417

**MN Naive Bayes**

In [24]:
cross_val_score(nb, X, y, cv=5, scoring='accuracy').mean()

0.2119977203932037

## Ingredients Only 
Here we focus on using `CountVectorizer` to create a document-term matrix (DTM) of the ingredients only. We already converted each list of ingredients to a string; now we convert those strings to a DTM.
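To see what the default `CountVectorizer` does with a stringified list, the sketch below runs its default analyzer on one recipe from the training data. Only the recipe text comes from the notebook; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# one recipe from the training data, stringified the same way as ingredients_str
doc = str(["water", "vegetable oil", "wheat", "salt"])

# the default analyzer lowercases the text and keeps runs of 2+ word characters
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(doc)
print(tokens)
```

The brackets, quotes and commas are dropped, and multi-word ingredients such as "vegetable oil" are split into separate words, so with default settings the DTM counts words rather than whole ingredients.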

In [44]:
# define X and y 

# ingredients as a string 
X = df['ingredients_str'] # NOTE must be a 1d numpy array or pandas Series
y = df['cuisine']

**NOTE** When creating `X` for the DTM it must be a 1d NumPy array or pandas Series. **`df[features1]`** returns a pandas DataFrame, which will not work with CountVectorizer.
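A minimal sketch of why the DataFrame fails, using a made-up three-row frame: as far as we can tell, `CountVectorizer` simply iterates over its input, and iterating over a DataFrame yields column labels rather than rows.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy frame (hypothetical data) with the same column name as the notebook
toy = pd.DataFrame({"ingredients_str": ["['water', 'salt']",
                                        "['eggs', 'salt']",
                                        "['wheat', 'water']"]})

# Series: one document per row
dtm_series = CountVectorizer().fit_transform(toy["ingredients_str"])

# DataFrame: iteration yields column labels, so the only "document" is the column name
dtm_frame = CountVectorizer().fit_transform(toy[["ingredients_str"]])

print(dtm_series.shape)
print(dtm_frame.shape)
```

The Series produces one DTM row per recipe, while the DataFrame produces a single row built from the column name, which no longer lines up with `y`.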

**Create Pipeline** for Proper Cross-val:
* Create a pipeline for `vect` and each classifier. Inside cross-validation the pipeline refits `vect` on the training folds only, so each of the 5 held-out folds is data the model sees for the first time.

**`CountVect`** has `all default settings`, as do the individual classifiers (apart from `max_iter=1000` on logistic regression).
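The pattern can be sketched on a tiny made-up corpus. The recipes, labels and the name `toy_pipe` below are hypothetical; only `make_pipeline`, `CountVectorizer`, `MultinomialNB` and `cross_val_score` mirror the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy recipes and labels (hypothetical data)
docs = ["tomato basil pasta", "basil olive oil pasta",
        "tomato mozzarella basil", "olive oil garlic pasta",
        "tortilla salsa beans", "beans cumin tortilla",
        "salsa avocado lime", "cumin lime beans"]
labels = ["italian"] * 4 + ["mexican"] * 4

# the vectorizer is refit on the training part of every fold
toy_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(toy_pipe, docs, labels, cv=2, scoring="accuracy")

print(list(toy_pipe.named_steps))
print(scores.mean())
```

Passing the pipeline rather than a pre-built DTM means the vocabulary is learned from the training folds alone, so words that appear only in the held-out fold are ignored at prediction time, as they would be on genuinely new data.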

In [54]:
# countVect and nb
pipe_nb = make_pipeline(vect, nb)
# countVect and knn
pipe_knn = make_pipeline(vect, knn)
# countVect and svm_cl
pipe_svm_cl = make_pipeline(vect, svm_cl)
# countVect and logistic regression
pipe_log = make_pipeline(vect, log)
# countVect and random forest
pipe_rf = make_pipeline(vect, rf)

**KNN** with CountVect all defaults scored **0.6361**

In [46]:
# vect and knn all defaults
cross_val_score(pipe_knn, X, y, cv=5, scoring='accuracy').mean()

0.6360688888829185

**MN Naive Bayes** with CountVect all defaults scored **0.7235**

In [48]:
# vect and nb all defaults
cross_val_score(pipe_nb, X, y, cv=5, scoring='accuracy').mean()

0.723487776272334

**Logistic Regression** with CountVect all defaults scored **0.7834**
* Note, `max_iter` was increased from the default of 100 to 1000 to allow the solver to converge. The document-term matrix has over 6,000 terms, so the solver needs more iterations before converging.

In [55]:
# vect and log reg all defaults (max_iter=1000)
cross_val_score(pipe_log, X, y, cv=5, scoring='accuracy').mean()

0.783376239271474

**Random Forest** with CountVect all defaults scored **0.7563**

In [56]:
# vect and random forest all defaults
cross_val_score(pipe_rf, X, y, cv=5, scoring='accuracy').mean()

0.7562730198958277

**Support Vector Machines** with CountVect all defaults scored **0.7772**

In [57]:
# vect and svm_cl all defaults
cross_val_score(pipe_svm_cl, X, y, cv=5, scoring='accuracy').mean()

0.7772164205653279

<a id="sum"></a>
# Summary Baseline Models 
We wanted to establish some early baselines for a few different classifiers and to compare how the engineered features fared against using just the ingredients.

**Null Accuracy**
Our first baseline was to establish what would happen if we built a most-frequent model. We found that we would be correct **19.61 percent** of the time. That value is low and we should be able to improve on it, but it is a quick baseline that gives every other model a floor to beat.

**Engineered Features**
We wanted to test our engineered features. For that we used three classifiers and two of the features: average ingredient length and number of ingredients. We used 5-fold cross-validation and received the following scores:
* SVM - Had the high score at 0.2222
* KNN - Scored 0.1394
* MN Naive Bayes - Scored 0.2120

KNN was actually worse than the null model, and the other two scored only slightly higher. These features alone are not enough. We can try creating a model that combines the engineered features with the list of ingredients.
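The comparison against the null model can be written out as a small calculation. The scores are rounded from the cell outputs above; note as a caveat that the null accuracy comes from a single train/test split while the other scores are 5-fold means, so the gaps are indicative rather than exact.

```python
# scores reported above, rounded to four decimals
null_accuracy = 0.1961
engineered_scores = {"SVM": 0.2222, "KNN": 0.1394, "MN Naive Bayes": 0.2120}

for name, score in engineered_scores.items():
    gain = score - null_accuracy
    verdict = "beats" if gain > 0 else "trails"
    print(f"{name}: {score:.4f} ({verdict} the null model by {abs(gain):.4f})")

# classifiers that fail to clear the most-frequent baseline
below_null = [name for name, score in engineered_scores.items()
              if score < null_accuracy]
print(below_null)
```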

**Ingredients Only**
For the final baseline we tested models using CountVectorizer to create a DTM from the list of ingredients. We added two classifiers (logistic regression and random forest) because we noticed higher scores. Everything had default settings with the exception of logistic regression (`max_iter=1000`). Again, we used cross-validation with k equal to 5. We also tested k equal to 10 and received a similar score, so we will remain with 5.
* KNN Scored 0.6361
* NB Scored 0.7235
* Log Reg Scored 0.7834
* Random Forest Scored 0.7563
* SVM Scored 0.7772

With the exception of KNN, all of the models scored between 0.72 and 0.78. At this point we will need to continue with all of the models, as it is not clear which will perform best. While KNN is the lowest, it is not uncommon to see a large increase just by adjusting the number of neighbors.


<a id="next"></a>
# Next Steps 
Now we will move on to tuning individual model parameters. We will look at tuning single models and at tuning multiple models together. Recall that we have parameters we can tune within both CountVectorizer and the classifier. For this we will use pipelines and grid search.
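As a rough preview of that approach, here is a minimal sketch of a grid search over a pipeline. The toy corpus and the parameter values are assumptions chosen only to make the example run; they are not the grid the next phase will use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy recipes and labels (hypothetical data)
docs = ["tomato basil pasta", "basil olive oil pasta",
        "tomato mozzarella basil", "olive oil garlic pasta",
        "tortilla salsa beans", "beans cumin tortilla",
        "salsa avocado lime", "cumin lime beans"]
labels = ["italian"] * 4 + ["mexican"] * 4

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# keys follow "<step name>__<parameter>"; the values are arbitrary examples
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(docs, labels)
print(grid.best_params_, grid.best_score_)
```

Because the vectorizer sits inside the pipeline, its settings and the classifier's settings are searched together, and every candidate is still scored on folds the vectorizer has not seen.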