# Build Classification Models

In [4]:
import pandas as pd  # data loading and processing
from sklearn.linear_model import LogisticRegression  # classification model
from sklearn.model_selection import train_test_split, cross_val_score  # data splitting and cross-validation
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve  # evaluation metrics
from sklearn.svm import SVC  # support vector classifier
import numpy as np  # numerical arrays

Let's import the data we cleaned in the previous exercise.

In [5]:
df=pd.read_csv('../data/cleaned_cuisines.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Let us prepare the data for the classification model by separating the labels (the cuisine column) from the features (the ingredient columns).

In [7]:
labels_df = df.cuisine
features_df = df.drop(['Unnamed: 0','cuisine'],axis=1)
print(labels_df.head())
print(features_df.head())

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object
   almond  angelica  anise  anise_seed  apple  apple_brandy  apricot  \
0       0         0      0           0      0             0        0   
1       1         0      0           0      0             0        0   
2       0         0      0           0      0             0        0   
3       0         0      0           0      0             0        0   
4       0         0      0           0      0             0        0   

   armagnac  artemisia  artichoke  ...  whiskey  white_bread  white_wine  \
0         0          0          0  ...        0            0           0   
1         0          0          0  ...        0            0           0   
2         0          0          0  ...        0            0           0   
3         0          0          0  ...        0            0           0   
4         0          0          0  ...        0            0           0   

   whole_grain_wheat_

## Choosing the right classifier

There are many different types of classifiers, because there are many different ways to train on and classify data. Which one to pick depends on the size of the dataset and on the problem you are trying to solve.

Let us try to understand which classifier is useful for this dataset:
<ul>
<li><font color='aqua'>Do we have a lot of data? </font> No, 799 observations is not a lot of data, so a neural network is too heavy.</li>
<li><font color='aqua'>Is it a two-class problem? </font> No, the data has more than two classes, so a plain two-class classifier is ruled out.</li>
<li><font color='aqua'>Are we building rankings with this classifier? </font> No, this data is not being ranked, so we don't need a multiclass boosted decision tree algorithm.</li>
<li><font color='aqua'>Could a decision tree or logistic regression solve this classification? </font> Yes, a decision tree might work, and so could multiclass logistic regression.</li>
</ul>

## Logistic Regression
We have chosen logistic regression for our multiclass problem and will use scikit-learn to build it. But does logistic regression really fit a multiclass problem? In its basic form, logistic regression is a binary classifier that separates two classes, so it cannot be applied to a multiclass problem directly. Instead, the problem is either split into several binary problems or the model is trained with a multiclass loss. In scikit-learn's `LogisticRegression` this choice is made through the `multi_class` parameter.
The common strategies are:

<ul>
<li><font color='aqua'>One vs Rest (ovr): </font> Splits the multiclass problem into one binary problem per class. A binary classifier is trained on each of these problems, and the prediction comes from the classifier that is most confident. Given the three classes 'shirt', 'pants' and 'jacket', the 3 binary problems would be:</li>
<ul>
<li> shirt vs [pants,jacket] </li>
<li> pants vs [shirt,jacket] </li>
<li> jacket vs [shirt,pants] </li>
</ul>
<li><font color='aqua'>One vs One (ovo): </font> Also splits the multiclass problem into binary problems, but here one classifier is trained for every pair of classes. (In scikit-learn this strategy is provided by <code>OneVsOneClassifier</code> rather than by the <code>multi_class</code> parameter.) Given the four classes 'shirt', 'pants', 'cap' and 'jacket', there would be 6 binary problems:</li>
<ul>
<li> shirt vs pants </li>
<li> shirt vs jacket</li>
<li> shirt vs cap </li>
<li> pants vs jacket</li>
<li> pants vs cap </li>
<li> cap vs jacket</li>
</ul>
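<li><font color='aqua'>Cross Entropy: </font> Rather than splitting the problem, a single model is trained to output one probability per class, and the cross-entropy loss quantifies the difference between the predicted and the true probability distributions.</li>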
</ul>
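
The strategies above can be sketched in plain Python. This is only an illustration of how the binary problems are enumerated and how cross entropy is computed, not of how scikit-learn implements them; the helper names `one_vs_rest`, `one_vs_one` and `cross_entropy` and the predicted probabilities are made up for this example.

```python
from itertools import combinations
from math import log


def one_vs_rest(classes):
    """Return one (positive class, remaining classes) split per class."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one(classes):
    """Return every unordered pair of classes."""
    return list(combinations(classes, 2))


def cross_entropy(true_dist, predicted_dist):
    """Cross entropy H(p, q) = -sum(p_i * log(q_i)), using the natural log."""
    return -sum(p * log(q) for p, q in zip(true_dist, predicted_dist) if p > 0)


ovr_problems = one_vs_rest(["shirt", "pants", "jacket"])
ovo_problems = one_vs_one(["shirt", "pants", "cap", "jacket"])
print(len(ovr_problems))  # 3 binary problems
print(len(ovo_problems))  # 6 binary problems

# The true class is the first one; the predicted probabilities are made up
loss = cross_entropy([1, 0, 0], [0.7, 0.2, 0.1])
print(loss)
```

For k classes, one vs rest needs k classifiers while one vs one needs k(k-1)/2, which is why the four-class example yields 6 pairs.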

Let's split the data into training and test sets:

In [8]:
X_train,X_test,y_train,y_test = train_test_split(features_df,labels_df,test_size=0.3)
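
The split above is random, so the accuracy reported further down will vary from run to run. A minimal sketch, on hypothetical toy data, of making the split reproducible with `random_state` and keeping the class proportions equal in both sets with `stratify`; the names `toy_features` and `toy_labels` and the seed value are assumptions for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy stand-in for features_df / labels_df
toy_features = pd.DataFrame({"soy_sauce": [1, 0] * 10, "yogurt": [0, 1] * 10})
toy_labels = pd.Series(["chinese", "indian"] * 10, name="cuisine")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.3,
    random_state=42,      # fixed seed: the same split on every run
    stratify=toy_labels,  # keep class proportions equal in both sets
)
print(y_te.value_counts())
```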


In [9]:
# Develop the model: one-vs-rest logistic regression with the liblinear solver
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

# Mean accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))

Accuracy is 0.8090075062552127
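
A single accuracy figure hides how the model performs on each cuisine. A minimal sketch, on hypothetical toy data, of getting per-class precision and recall with `classification_report`, which the first cell already imports. It wraps the estimator in `OneVsRestClassifier` instead of passing `multi_class='ovr'`, on the assumption that recent scikit-learn releases deprecate that argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: three classes, each marked by its own binary feature
toy_y = np.repeat(["chinese", "indian", "thai"], 30)
toy_X = np.zeros((90, 3), dtype=int)
for column in range(3):
    toy_X[column * 30:(column + 1) * 30, column] = 1

# One binary logistic regression per class (one vs rest)
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(toy_X, toy_y)

# Precision, recall and F1 for every class, not just overall accuracy
report = classification_report(toy_y, ovr_model.predict(toy_X))
print(report)
```

The toy model is scored on its own training data only to keep the sketch short; with the real data, the call would presumably be `classification_report(y_test, model.predict(X_test))`.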


In [17]:
print(f'ingredients: {X_test.iloc[40][X_test.iloc[40]!=0].keys()}')
print(f'cuisine: {y_test.iloc[40]}')

ingredients: Index(['carrot', 'egg_noodle', 'pea', 'peanut_oil', 'pork', 'scallion',
       'sesame_oil', 'soy_sauce', 'starch'],
      dtype='object')
cuisine: chinese


In [16]:
df.iloc[40]

Unnamed: 0        40
cuisine       indian
almond             0
angelica           0
anise              0
               ...  
wood               0
yam                0
yeast              0
yogurt             0
zucchini           0
Name: 40, Length: 382, dtype: object
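
Note that `X_test.iloc[40]` is the 41st row of the shuffled test set, whereas `df.iloc[40]` is the 41st row of the original file, so the two cells above show different recipes (chinese versus indian). A minimal sketch, on a hypothetical toy frame, of recovering the original row through the index label that `train_test_split` preserves; `toy_df` and `toy_test` are made-up names.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy_df = pd.DataFrame({
    "cuisine": ["indian", "thai", "chinese", "korean"],
    "soy_sauce": [0, 0, 1, 1],
})

# Pretend this is a shuffled test subset: the original index labels are kept
toy_test = toy_df.loc[[2, 0, 3]].drop(columns="cuisine")

position = 0
original_label = toy_test.index[position]  # index label of that row in toy_df
same_row = toy_df.loc[original_label]      # label-based lookup finds the same recipe
print(original_label, same_row["cuisine"])
```

With the real data, the equivalent lookup would be `df.loc[X_test.index[40]]`.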