## Predicting Nutritional Value for a Variety of Food Items
### Vishnupriya Venkateswaran
### 10/20/2019

### While building a model, we follow these steps:

#### Step 1: Exploratory Data Analysis on the datasets used.
#### Step 2: Visually represent the data and find correlations between variables
#### Step 3: Divide the data into training and testing sets
#### Step 4: Standardize the features and target
#### Step 5: Choose the model
#### Step 6: Train the model on the training data
#### Step 7: Test the model on the testing data
#### Step 8: Evaluate the model's score (R²)



In [1]:
# Import the required Python packages and the helper functions that prepare the dataset
from utils import read_and_clean_data, split_X_y
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Read the clean data
dataset = read_and_clean_data("./data/Nutrients.xlsx")

# Split dataset into X and y
X, y = split_X_y(dataset)

In [12]:
X

array([[15.87, 717, 0.85, ..., 14.2, '1', 0.0],
       [16.72, 718, 0.49, ..., 9.4, '1', 0.0],
       [0.24, 876, 0.28, ..., 205.0, '1', 0.0],
       ...,
       [26.0, 269, 0.0, ..., 164.0, 1.0, 0.0],
       [79.2, 90, 16.1, ..., 164.0, 1.0, 0.0],
       [78.5, 89, 19.8, ..., 164.0, 1.0, 0.0]], dtype=object)

In [13]:
y

array([[2.400e+01, 2.000e-02, 9.000e-02, 2.499e+03, 0.000e+00, 3.000e+00],
       [2.300e+01, 5.000e-02, 5.000e-02, 2.468e+03, 0.000e+00, 4.000e+00],
       [4.000e+00, 0.000e+00, 1.000e-02, 3.069e+03, 0.000e+00, 0.000e+00],
       ...,
       [1.300e+01, 3.600e+00, 1.900e-01, 0.000e+00, 0.000e+00, 0.000e+00],
       [1.000e+01, 3.500e+00, 1.000e+00, 1.000e+02, 0.000e+00, 6.000e+00],
       [1.180e+02, 1.400e+00, 1.000e+00, 1.000e+02, 0.000e+00, 1.500e+01]])

In [3]:
# Split the data into train and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)


In [4]:
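# Transform the input data using sklearn's RobustScaler.
# Scaling puts features measured in different units on a comparable scale;
# RobustScaler centers on the median and scales by the interquartile range, so it is less sensitive to outliers.
# Both scalers are fit on the training data only and then applied to the test data.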
from sklearn.preprocessing import RobustScaler

scl_X = RobustScaler().fit(X_train)
scl_y = RobustScaler().fit(y_train)

input_x_train = scl_X.transform(X_train)
input_y_train = scl_y.transform(y_train)

input_x_test = scl_X.transform(X_test)
input_y_test = scl_y.transform(y_test)
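
To make the scaling step concrete, here is a minimal sketch of what `RobustScaler` computes with its default settings: each column has its median subtracted and is divided by its interquartile range. The array `X_demo` is made-up illustration data, not the nutrient dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: the first column contains one large outlier (100.0).
X_demo = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
    [100.0, 500.0],
])

# Scaling as done by sklearn (default quantile_range is the 25th to 75th percentile).
scaler = RobustScaler().fit(X_demo)
scaled = scaler.transform(X_demo)

# The same computation by hand: subtract the median, divide by the IQR.
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
manual = (X_demo - median) / (q75 - q25)

print(scaled)
print(np.allclose(scaled, manual))
```

Because the median and the interquartile range ignore the extreme value, the outlier in the first column does not compress the other rows toward zero, which is what would happen with a mean and standard deviation.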

#### Model chosen: Random Forest Regressor
#### Since we want to predict multiple values, we wrap it in the Multi Output Regressor.
#### The algorithm uses bootstrapping to draw random subsets of rows from the original dataset, and each decision tree considers a random subset of features when splitting. A decision tree is built on each bootstrap sample, and the predictions of all trees are averaged into one final prediction. The Multi Output Regressor fits one such forest per target, which suits our task of predicting several nutrient values.
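
The bootstrap-and-average idea can be sketched by hand with plain decision trees. This is an illustration under assumptions, not the internals of `RandomForestRegressor`: the data is synthetic, and the names `X_demo`, `y_demo`, and `n_trees` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data with a single target.
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 2.0 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap: sample row indices with replacement, same size as the data.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Each tree predicts separately; the ensemble prediction is the average.
per_tree = np.stack([t.predict(X_demo[:5]) for t in trees])
bagged_prediction = per_tree.mean(axis=0)
print(bagged_prediction)
```

Averaging many trees trained on different bootstrap samples reduces the variance of a single deep tree, which is the main reason a forest usually generalizes better than one tree.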

In [5]:
# Import Model from sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor



In [6]:
# Define the model
RF = MultiOutputRegressor(RandomForestRegressor(n_estimators=100,max_features="sqrt"))
RF.fit(input_x_train, input_y_train)

MultiOutputRegressor(estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
           n_jobs=1)

In [21]:
# Predicting the output
a=RF.predict(input_x_test)
a

array([[-1.32142857e-01,  8.23333333e-01,  3.57535565e+00,
        -7.15048544e-02,  3.91000000e+00, -1.65333333e-01],
       [ 9.33035714e-01,  4.09082126e-01, -6.35983264e-03,
         4.59053398e+00,  0.00000000e+00,  3.66666667e-03],
       [ 7.71875000e+00, -3.84154589e-01,  3.59456067e-01,
         2.11500000e+00,  2.89500000e+01, -1.23333333e-02],
       ...,
       [ 9.53571429e-01, -2.05555556e-01, -2.47322176e-01,
         1.88029612e+01,  1.00000000e-02,  4.05133333e+00],
       [ 1.61428571e-01, -2.84492754e-01, -2.48786611e-01,
         2.40980583e+00,  4.30000000e-01, -1.20000000e-01],
       [-1.13392857e-01, -5.77729469e-01,  2.23748685e-07,
         2.70601942e+00,  4.90000000e-01,  0.00000000e+00]])
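
Because the model was trained on scaled targets, its predictions are in scaled units as well. A possible way to read them as nutrient values again is the scaler's `inverse_transform`. The sketch below uses synthetic data and a small forest, so `X_demo`, `y_demo`, `y_scaler`, and `model` are stand-ins for the notebook's own objects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical data: 3 features, 2 targets on very different scales.
X_demo = rng.normal(size=(100, 3))
y_demo = np.column_stack([
    X_demo[:, 0] * 50.0 + 200.0,
    X_demo[:, 1] * 0.5 + 2.0,
])

# Fit the target scaler and train on the scaled targets, as in the notebook.
y_scaler = RobustScaler().fit(y_demo)
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
model.fit(X_demo, y_scaler.transform(y_demo))

# Predictions come out in scaled units; map them back to the original units.
scaled_pred = model.predict(X_demo[:5])
pred_original_units = y_scaler.inverse_transform(scaled_pred)
print(pred_original_units)
```

In the notebook itself, the equivalent call would presumably be `scl_y.inverse_transform(a)`.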

In [22]:
# R² score on the training data (score() returns the coefficient of determination for regressors, not accuracy)
RF.score(input_x_train, input_y_train)

0.9366197355554688

In [23]:
# R² score on the test data
RF.score(input_x_test, input_y_test)

0.5995799255199504
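
For a regressor, `score` reports the coefficient of determination R², not a classification accuracy. The sketch below computes R² by hand for each target and compares it with `sklearn.metrics.r2_score`; the arrays `y_true` and `y_pred` are made-up numbers for illustration. With several targets, a single averaged score can hide one poorly predicted nutrient, so the per-target values are often worth a look.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values for two targets.
y_true = np.array([[3.0, 10.0], [5.0, 20.0], [7.0, 30.0], [9.0, 40.0]])
y_pred = np.array([[2.5, 12.0], [5.5, 18.0], [7.0, 33.0], [9.5, 37.0]])

# R² = 1 - (residual sum of squares) / (total sum of squares), per column.
ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
manual_r2 = 1.0 - ss_res / ss_tot

# The same quantity from sklearn, per target and averaged over targets.
per_target = r2_score(y_true, y_pred, multioutput="raw_values")
average = r2_score(y_true, y_pred, multioutput="uniform_average")

print(per_target)
print(average)
```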

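#### The Multi Output Random Forest model reaches an R² of about 0.60 on the test data, compared with about 0.94 on the training data. This is not a bad result, but the gap between the two scores suggests the model is overfitting the training set.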