# Recipe classification

Imagine a collaborative recipe website that wants to offer an automatic tagging feature, which assigns a submitted recipe to one of a set of cuisine types such as French, Mexican, Italian, Korean, and so on. This feature will be built with a classifier that takes the ingredients of the recipe as input and outputs the cuisine type.

In this exercise we will build such a classifier using a set of labelled recipes.

In [None]:
import pandas as pd
import numpy as np
import json

## Data preprocessing

First we need to read and encode the dataset for further processing. The file "recipes_train.json" contains an array of recipes, each with a cuisine field holding the cuisine type (a string) and an ingredients field holding an array of ingredients (strings). We first load the data and build an np.array of the cuisine types to predict (y) and a list of recipe ingredients (xtext).

In [None]:
with open("./recipes_train.json") as f:
    data_train=json.load(f)
    y = np.array([recipe["cuisine"] for recipe in data_train])
    xtext = [recipe["ingredients"] for recipe in data_train]

In [None]:
data_train

We must encode the ingredients and define the numeric features that will describe our recipes. To do so we take a classic text-processing approach called bag of words, which in our case becomes a bag of ingredients. Each recipe is associated with a large vector of zeros and ones. Each element of this vector corresponds to one ingredient: it is one if the recipe uses this ingredient and zero otherwise.

We start by computing the list of all ingredients and a dict that maps each ingredient to an integer index:

In [None]:
ingredients = np.unique(np.concatenate(xtext))
# Map each ingredient name to its column index in the feature matrix
dict_ingredients = {ing: i for i, ing in enumerate(ingredients)}

In [None]:
dict_ingredients

We can then encode each recipe with a small function that takes an array of ingredients and returns a vector of size (number of possible ingredients) with ones at the right places. Finally, we stack all these vectors into our data matrix X.

In [None]:
def encode(recipe):
    x = np.zeros((1,len(ingredients)))
    indices = [dict_ingredients[ing] for ing in recipe]
    x[0,indices]=1
    return x
X = np.vstack([encode(recipes) for recipes in xtext])
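
To make the encoding concrete, here is a minimal sketch of the same bag-of-ingredients idea on two made-up recipes. The names `toy_recipes`, `toy_ingredients`, `toy_index` and `encode_toy` are hypothetical and only serve this illustration; the data is not taken from the dataset.

```python
import numpy as np

# Toy recipes (hypothetical data, not taken from recipes_train.json)
toy_recipes = [
    ["garlic", "olive oil", "tomato"],
    ["garlic", "soy sauce"],
]

# Sorted vocabulary and ingredient -> column index mapping
toy_ingredients = np.unique(np.concatenate(toy_recipes))
toy_index = {ing: i for i, ing in enumerate(toy_ingredients)}

def encode_toy(recipe):
    """Return a 0/1 row vector with ones at the recipe's ingredient columns."""
    x = np.zeros((1, len(toy_ingredients)))
    x[0, [toy_index[ing] for ing in recipe]] = 1
    return x

toy_X = np.vstack([encode_toy(r) for r in toy_recipes])
print(toy_ingredients)  # columns: garlic, olive oil, soy sauce, tomato
print(toy_X)            # one row per recipe
```

Each row has as many columns as there are distinct ingredients, so the real matrix X is wide and mostly filled with zeros.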

This dataset is quite large, and since our computational resources are scarce we will remove the columns that correspond to rarely used ingredients (those used in at most 100 recipes) and keep only the recipes of 10 cuisine types.

In [None]:
select_ingredients=ingredients[np.sum(X,axis=0)>100]
selected_type= np.array(['chinese', 'french', 'greek', 'indian', 'italian', 'jamaican',
       'korean', 'mexican', 'moroccan', 'thai'])
Xs=X[:,np.sum(X,axis=0)>100]
Xs=Xs[np.isin(y,selected_type),:]
ys=y[np.isin(y,selected_type)]
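
The filtering relies on two boolean masks, one over columns (ingredient frequency) and one over rows (cuisine type). The sketch below shows the mechanism on a tiny made-up matrix; the threshold of 1 and all `toy_*` names are assumptions chosen for the illustration.

```python
import numpy as np

# Toy 0/1 matrix: 4 recipes x 3 ingredients (hypothetical data)
toy_X = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
toy_y = np.array(["thai", "french", "thai", "greek"])

# Keep ingredients used in more than 1 recipe (the notebook uses > 100)
frequent = toy_X.sum(axis=0) > 1
# Keep only recipes whose label is in the selected set
wanted = np.isin(toy_y, ["thai", "french"])

toy_Xs = toy_X[:, frequent][wanted, :]
toy_ys = toy_y[wanted]
print(toy_Xs.shape, toy_ys)
```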

Our dataset is now ready to be processed. The shape of the data matrix and the final list of ingredients that we will use to recognize the cuisine types are:

In [None]:
Xs.shape

In [None]:
select_ingredients

And the types of cuisine that we must recognize:

In [None]:
selected_type

## Build the training and test set

As usual, we will split the data into a training set and a test set. The "train_test_split" function from scikit-learn is dedicated to this task.

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score


X_train, X_test, y_train, y_test = ## TO ADD
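
As a hint, here is a minimal sketch of how train_test_split behaves on synthetic data. The arrays `demo_X` and `demo_y` are hypothetical stand-ins for Xs and ys, and the use of `stratify` is an optional choice made for this sketch, not something the exercise requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Xs / ys (hypothetical data)
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 2, size=(100, 5))
demo_y = np.array(["italian", "thai"] * 50)

# 70 % train / 30 % test; random_state makes the split reproducible,
# stratify keeps the class proportions similar in both parts
Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0, stratify=demo_y
)
print(Xtr.shape, Xte.shape)
```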

## Random Forest

We will use a random forest as a first attempt at solving the classification problem. We will fit a random forest with a given number of trees and estimate the accuracy of this classifier using cross-validation and the test set:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix


clf = ## DEFINE THE CLASSIFIER
scores = ## USE THE CROSS-VALIDATION FUNCTION TO CLASSIFY THE RECIPES

scores.## PRINT MEAN SCORE
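
The sketch below illustrates cross-validation of a random forest on synthetic data. The value `n_estimators=50`, the 5 folds and the `demo_*` names are arbitrary assumptions for this example, not values prescribed by the exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 0/1 features; the label depends on the first column only
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 2, size=(200, 6))
demo_y = np.where(demo_X[:, 0] == 1, "thai", "french")

# n_estimators=50 is an arbitrary choice for this sketch
demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
# One accuracy value per fold (cv=5 gives five values)
demo_scores = cross_val_score(demo_clf, demo_X, demo_y, cv=5)
print(demo_scores.mean(), demo_scores.std())
```

Reporting the standard deviation next to the mean gives an idea of how much the estimate varies from fold to fold.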

We may also simply fit this model on the training data and produce predictions for the test set with the predict method.

In [None]:
## TRAIN THE MODEL
## USE THE PREDICT FUNCTION ON THE TEST DATA

And compute the accuracy with numpy or with the built-in function of scikit-learn:

In [None]:
## COMPUTE THE ACCURACY SCORE

In [None]:
## USE THE ACCURACY FUNCTION
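
Both routes compute the same quantity, the fraction of predictions that match the true labels. A minimal sketch on hypothetical label vectors (`true_labels` and `pred_labels` are made up for the illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
true_labels = np.array(["thai", "thai", "greek", "french"])
pred_labels = np.array(["thai", "greek", "greek", "french"])

acc_numpy = np.mean(true_labels == pred_labels)         # fraction of matches
acc_sklearn = accuracy_score(true_labels, pred_labels)  # same quantity
print(acc_numpy, acc_sklearn)
```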

Other metrics can be computed with the classification report tool from scikit-learn:

In [None]:
## PRINT THE CLASSIFICATION REPORT

We may also examine the detailed results with the confusion matrix:

In [None]:
## DISPLAY THE CONFUSION MATRIX
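
To read these outputs, recall that in scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes. A small sketch on hypothetical labels, where the explicit `labels` order is a choice made for readability:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions
true_labels = np.array(["thai", "thai", "greek", "french"])
pred_labels = np.array(["thai", "greek", "greek", "french"])

# Fix the class order so rows/columns are easy to read
demo_labels = ["french", "greek", "thai"]
cm = confusion_matrix(true_labels, pred_labels, labels=demo_labels)
print(cm)  # row = true class, column = predicted class

# Per-class precision, recall and F1 score
print(classification_report(true_labels, pred_labels, zero_division=0))
```

Here the last row shows that one of the two true "thai" recipes was predicted as "greek".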

## Questions
### 1) What is the accuracy of this classifier on training data? What do you conclude?


### 2) Try to improve the performance of this classifier in terms of accuracy
You can vary the number of trees over the values [20, 50, 100, 200]. Use the GridSearchCV function seen in the first notebook.
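
A possible shape for such a search, sketched on synthetic data. The 3-fold setting and the `demo_*` names are assumptions for this example; on the real data you would pass X_train and y_train instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the recipe matrix (hypothetical)
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 2, size=(200, 6))
demo_y = np.where(demo_X[:, 0] == 1, "thai", "french")

# Try each number of trees with 3-fold cross-validation
demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50, 100, 200]},
    cv=3,
    scoring="accuracy",
)
demo_grid.fit(demo_X, demo_y)
print(demo_grid.best_params_, demo_grid.best_score_)
```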

### 3) Test another solution (Logistic regression)
You can refer to the documentation of sklearn.linear_model.LogisticRegression.

In [None]:
from sklearn.linear_model import LogisticRegression

### 4) Try to implement a rejection solution to obtain an accuracy of at least 90%
To do this, use the probabilities provided by the classifier and only make a decision when the highest predicted probability is above a certain threshold (to be determined). The predict_proba method should help you.
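
The sketch below shows one way the rejection idea could look on synthetic, noisy data. The threshold of 0.75, the noise level and all `demo_*` names are assumptions for the illustration; the threshold for the recipe data has to be found by you.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
demo_X = rng.integers(0, 2, size=(300, 6)).astype(float)
# Noisy labels: mostly determined by column 0, with about 20% random flips
flip = rng.random(300) < 0.2
demo_y = np.where((demo_X[:, 0] == 1) ^ flip, "thai", "french")

# Fit on the first 200 rows, evaluate on the remaining 100
demo_clf = LogisticRegression().fit(demo_X[:200], demo_y[:200])
proba = demo_clf.predict_proba(demo_X[200:])    # shape (n_samples, n_classes)
pred = demo_clf.classes_[proba.argmax(axis=1)]  # most probable class
confident = proba.max(axis=1) >= 0.75           # hypothetical threshold

coverage = confident.mean()                     # share of samples we answer
# Accuracy restricted to the samples that were not rejected
acc_kept = (np.mean(pred[confident] == demo_y[200:][confident])
            if confident.any() else float("nan"))
print(coverage, acc_kept)
```

Raising the threshold usually increases the accuracy on the accepted samples but lowers the coverage, so both numbers should be reported together.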

### 5) Try to solve the problem with more features using logistic regression. Tune the regularization parameter and compare the results of multinomial and one-vs-rest for handling the multiclass problem.

In [None]:
# Use the same threshold as the column filter applied to Xs below
select_ingredients=ingredients[np.sum(X,axis=0)>20]
selected_type= np.array(['chinese', 'french', 'greek', 'indian', 'italian', 'jamaican',
       'korean', 'mexican', 'moroccan', 'thai'])
Xs=X[:,np.sum(X,axis=0)>20]
Xs=Xs[np.isin(y,selected_type),:]
ys=y[np.isin(y,selected_type)]
X_train, X_test, y_train, y_test = train_test_split(Xs, ys, test_size=0.3, random_state=0)
X_train.shape

In [None]:

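## DEFINE A HYPERPARAMETER GRID
# Note: recent scikit-learn releases expect penalty=None instead of the string "none"
# and deprecate the multi_class parameter; adapt the grid to your installed version.
hyper_params = [{"penalty":["none"],"multi_class":["ovr","multinomial"]},
                {"penalty":["l2"],"multi_class":["ovr","multinomial"],"C":np.linspace(0.1,2,num=10)}]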

from sklearn.model_selection import GridSearchCV

In [None]:
## USE THE GRID SEARCH WITH THE HYPERPARAMETER GRID TO GET THE BEST SET

In [None]:
## FIT YOUR GRID

In [None]:
## PRINT SCORE

In [None]:
import pandas as pd
# grid_res is assumed to be the fitted GridSearchCV object from the cells above
res = pd.DataFrame(grid_res.cv_results_["params"])
res["mean_accuracy"]=grid_res.cv_results_["mean_test_score"]
res["std_accuracy"]=grid_res.cv_results_["std_test_score"]

In [None]:
import matplotlib.pyplot as plt

In [None]:
res

In [None]:
# Mean CV accuracy as a function of C for each multiclass strategy
# (rows without an L2 penalty have no C value and are not drawn)
for strategy in ["ovr", "multinomial"]:
    sub = res[res["multi_class"]==strategy]
    plt.plot(sub["C"], sub["mean_accuracy"], label=strategy)
plt.xlabel("C")
plt.ylabel("mean accuracy")
plt.legend()