# Intermittent Fasting - A Statistical Self-Study

This study will help me learn how food and exercise affect my weight and overall well-being. I am using historical data (unfortunately not continuous) exported from my MyFitnessPal account. It covers a period of 6 years of unsuccessful weight management strategies.

On 07.12.2017 I decided to try a new strategy called Intermittent Fasting. I came across it in a YouTube video by Dr. Jason Fung, who advocates the health benefits of fasting in general. After some research on the topic I decided to follow the so-called "Warrior Diet" proposed by Ori Hofmekler. However, I took the idea further and restricted the eating plan even more by combining it with a low-carbohydrate diet.

So in the end I eat only once a day, usually at dinner, and when I do I try to minimize the amount of starchy carbohydrates in my meal. In simple words - no rice, bread or potatoes.

## Table of Contents
1. Data description
2. Loading and manipulating the data
3. Exploratory data analysis
4. Calories Equation Model
5. Linear Regression Model
6. Regression Tree Model
7. Neural Network Model
8. Comparison and Evaluation of the different Models
9. Meal optimization
10. Insights from the data

### 1. Data Description

The data I use is collected in my MyFitnessPal account. When I export it in CSV format it comes as three separate files: Exercise-Summary.csv, Measurement-Summary.csv and Nutrition-Summary.csv.

**Exercise-Summary.csv**
* Date - [YYYY-MM-DD] - observation date
* Exercise - [String] - description of the exercise
* Type - [cardio/strength] - type of the exercise
* Exercise Calories - [num] - calories burned during the exercise
* Exercise Minutes - [num] - minutes spent on the exercise
* Sets - [num] - number of sets
* Reps Per Set - [num] - number of repetitions per set
* Kilograms - [num] - kilograms for each repetition
* Steps - [num] - steps count from the exercise (via Google Fit / Mi Fit)

**Measurement-Summary.csv**
* Date - [YYYY-MM-DD] - observation date
* % Body Fat - [num] - percentage of body fat
* Biceps - [num] - biceps circumference in cm
* Calves - [num] - calves circumference in cm
* Hips - [num] - hips circumference in cm
* Neck - [num] - neck circumference in cm
* Tights - [num] - thigh circumference in cm (column name as it appears in the data)
* Waist - [num] - waist circumference in cm
* Weight - [num] - weight in kg

**Nutrition-Summary.csv**
* Date - [YYYY-MM-DD] - observation date
* Meal - [String] - meal name [breakfast/lunch/dinner/snack]
* Calories - [num] - number of calories for the meal
* Fat (g) - [num] - grams of fat for the meal
* Saturated Fat - [num] - grams of saturated fat for the meal
* Polyunsaturated Fat - [num] - grams of polyunsaturated fat for the meal
* Monounsaturated Fat - [num] - grams of monounsaturated fat for the meal
* Trans Fat - [num] - grams of trans fat for the meal
* Cholesterol - [num] - mg of cholesterol for the meal
* Sodium (mg) - [num] - mg of sodium for the meal
* Potassium - [num] - mg of potassium for the meal
* Carbohydrates (g) - [num] - grams of carbohydrates for the meal 
* Fiber - [num] - grams of fiber for the meal
* Sugar - [num] - grams of sugar for the meal
* Protein (g) - [num] - grams of protein for the meal
* Vitamin A - [num] - % of the recommended daily intake
* Vitamin C - [num] - % of the recommended daily intake
* Calcium - [num] - % of the recommended daily intake
* Iron - [num] - % of the recommended daily intake

### 2. Loading and Manipulating the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
%matplotlib inline

First I will read the data into pandas dataframes:

In [2]:
exercise = pd.read_csv('myfitnesspal/Exercise-Summary.csv', index_col = 0)
measurement = pd.read_csv('myfitnesspal/Measurement-Summary.csv', index_col = 0)
nutrition = pd.read_csv('myfitnesspal/Nutrition-Summary.csv', index_col = 0)

Before I start manipulating the data I want to check the exact time period covered by each CSV file, as I am not sure that all three files cover the same interval.

In [3]:
# converting the index of the dataframes into a date-time format
exercise.index = pd.to_datetime(exercise.index)
measurement.index = pd.to_datetime(measurement.index)
nutrition.index = pd.to_datetime(nutrition.index)

# printing the time intervals
print("Exercise Period :  ", exercise.index[0] , ' - ', exercise.index[-1])
print("Measurement Period :  ", measurement.index[0] , ' - ', measurement.index[-1])
print("Nutrition Period :  ", nutrition.index[0] , ' - ', nutrition.index[-1])

# overall start and end dates, used below to build the analysis dataframes
# (min/max are used so the result does not depend on the row order in the files)
startDate = min(exercise.index.min(), measurement.index.min(), nutrition.index.min())
endDate = max(exercise.index.max(), measurement.index.max(), nutrition.index.max())

print("Overall Period: ", startDate, ' - ', endDate)

Exercise Period :   2012-11-10 00:00:00  -  2018-02-15 00:00:00
Measurement Period :   2012-11-10 00:00:00  -  2018-02-15 00:00:00
Nutrition Period :   2012-11-10 00:00:00  -  2018-02-15 00:00:00
Overall Period:  2012-11-10 00:00:00  -  2018-02-15 00:00:00
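
The three files share the same first and last date, but that says nothing about the days in between. A minimal sketch of how the gaps in a date index could be counted is shown here; the dates in `logged` are made up for illustration and are not taken from the export.

```python
import pandas as pd

# Hypothetical log with two missing days (2012-11-12 and 2012-11-14)
logged = pd.to_datetime(['2012-11-10', '2012-11-11', '2012-11-13', '2012-11-15'])

# Every calendar day between the first and the last logged date
all_days = pd.date_range(start=logged.min(), end=logged.max(), freq='D')

# Days of the full range that never appear in the log
missing_days = all_days.difference(logged)
print(len(missing_days), "of", len(all_days), "days have no entry")
```

The same pattern should work on `exercise.index`, `measurement.index` or `nutrition.index` once they have been converted with `pd.to_datetime`.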


Now I will check the structure for each dataframe:

In [4]:
exercise.head()

Date,Exercise,Type,Exercise Calories,Exercise Minutes,Sets,Reps Per Set,Kilograms,Steps
2012-11-10,Dips,Strength,,,7.0,10.0,36.3,
2012-11-10,"Rowing, stationary, very vigorous effort",Cardio,227.0,22.0,,,,
2012-11-10,"Step-ups, vigorous",Cardio,251.0,20.0,,,,
2012-11-11,Chin-Ups,Strength,,,3.0,8.0,36.3,
2012-11-11,Dips,Strength,,,5.0,10.0,36.3,


In [5]:
measurement.head()

Date,% Body Fat,Biceps,Calves,Hips,Neck,Tights,Waist,Weight
2012-11-10,,,,,,,,113.8
2012-11-11,,,,,,,,114.1
2012-11-13,,,,,,,,113.5
2012-11-18,,,,,,,,113.9
2012-11-24,,,,,,,,114.2


In [6]:
nutrition.head()

Date,Meal,Calories,Fat (g),Saturated Fat,Polyunsaturated Fat,Monounsaturated Fat,Trans Fat,Cholesterol,Sodium (mg),Potassium,Carbohydrates (g),Fiber,Sugar,Protein (g),Vitamin A,Vitamin C,Calcium,Iron
2012-11-10,Breakfast,230.0,18.8,2.6,4.0,8.0,0.0,0.0,120.2,190.0,5.4,3.2,1.2,12.8,0.0,0.0,34.0,8.0
2012-11-10,Lunch,805.0,35.0,14.7,0.0,11.9,0.0,182.0,2597.0,0.0,91.0,22.4,0.0,35.0,0.0,0.0,0.0,0.0
2012-11-10,Dinner,829.2,56.1,22.6,3.7,19.5,0.0,0.0,281.0,1627.5,74.0,7.5,72.0,25.7,89.6,127.8,17.8,12.2
2012-11-11,Breakfast,574.0,28.4,18.7,0.0,0.0,0.0,0.0,186.0,0.0,60.7,0.0,53.9,18.5,0.0,0.0,0.0,0.0
2012-11-11,Lunch,805.0,35.0,14.7,0.0,11.9,0.0,182.0,2597.0,0.0,91.0,22.4,0.0,35.0,0.0,0.0,0.0,0.0


The first thing I don't like about this data is the presence of NaN values in the columns. I want to replace them with 0s so I can actually run calculations on the data.

I know that I haven't filled in the variables Sets, Reps Per Set and Kilograms consistently, so there is no point keeping them for the analysis. You can also see that the dataframe has multiple rows with the same index (3 entries for the same date). My aim is to aggregate this data so that there is only one entry per day that summarizes the rows of the original table.

To keep some of the information that the aggregation would otherwise discard, I will add a categorical variable to my exercise dataframe: **Strength Training - [yes/no]**


It will not be much fun to aggregate the dataframe in place, so I will make a new dataframe and transfer the aggregated information into it. Another reason not to do it in place is that my index values (the dates) are not continuous: on some days I haven't logged any activity. In the end I want to perform a time-series analysis of this data and I would prefer not to have "holes" in the time series. Thus I will add all missing days and give them a value of 0.

In [None]:
exercise.fillna(value=0, inplace=True)

# one row per calendar day, including days without any logged exercise
exercise_agg = pd.DataFrame(index=pd.date_range(start=startDate, end=endDate, freq='D'))

exercise_agg['Calories Burned'] = exercise.groupby('Date')['Exercise Calories'].sum()
exercise_agg['Cardio Minutes'] = exercise.groupby('Date')['Exercise Minutes'].sum()
# True on days with at least one exercise of type 'Strength'; days without a log are NaN here
exercise_agg['Strength Training'] = (exercise['Type'] == 'Strength').groupby(level=0).any()
exercise_agg['Steps'] = exercise.groupby('Date')['Steps'].sum()
exercise_agg.fillna(value=0, inplace=True)

# map the flag to yes/no; days that were filled with 0 above count as 'no'
# (the earlier min-of-calories check labelled every day without a log as 'yes')
exercise_agg['Strength Training'] = exercise_agg['Strength Training'].apply(lambda x: 'yes' if x else 'no')

And finally we get our data in the following form:

In [None]:
exercise_agg.head()
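
The aggregation and gap-filling described above can be illustrated on a toy log. This is only a sketch: the three rows are invented, and `reindex` with `fill_value` is one possible way to add the missing days rather than the exact steps used in this notebook.

```python
import pandas as pd

# Toy exercise log: two entries on the first day, nothing on the second day
toy = pd.DataFrame(
    {'Exercise Calories': [227.0, 251.0, 100.0],
     'Type': ['Cardio', 'Cardio', 'Strength']},
    index=pd.to_datetime(['2012-11-10', '2012-11-10', '2012-11-12']),
)
toy.index.name = 'Date'

# Collapse the log to one row per logged day
daily = toy.groupby(level=0)['Exercise Calories'].sum()

# Extend to every calendar day; days without a log get 0
full_index = pd.date_range('2012-11-10', '2012-11-12', freq='D')
daily = daily.reindex(full_index, fill_value=0)
print(daily)
```

After the `reindex` call the series has one value for each of the three days, with 0 for the day that had no entry.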

## Data Engineering for the Measurement Dataframe
Showing the structure of the measurement dataframe

In [None]:
measurement.head()

I know that my measurements for hips, neck and waist are total garbage: I took them only once and not even properly, so I will simply remove them from the dataframe. Next I will extend the time series of measurements and add all missing days. The question is what to fill in for the weight on days when I didn't actually measure it. I will use linear interpolation to fill in all the gaps.

In [None]:
measurement.drop(['Hips','Neck','Waist'], axis=1, inplace=True)

In [None]:
measurement_agg = pd.DataFrame(-1, columns=['Weight', 
                                            'dW',
                                            'Age',
                                            'BMI',
                                            'Height'
                                           ], index=pd.date_range(start=startDate, end=endDate ,freq='D'))

measurement_agg['Weight'] = measurement.groupby('Date')['Weight'].mean()
measurement_agg['Weight'].plot.line()

Look at the gaps in the weight measurements above. Let's fix this:

In [None]:
fig, axes = plt.subplots()
fig.set_size_inches(17, 6)

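# assign the result back; calling interpolate(inplace=True) on a selected column may not update the dataframe
measurement_agg['Weight'] = measurement_agg['Weight'].interpolate()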
measurement_agg['Weight'].plot.line(color='b')
axes.grid(color='black', alpha=0.5, linestyle='-.', linewidth=0.5)
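
To make the gap filling concrete, here is a minimal sketch of linear interpolation on a made-up five-day series with three missing readings. The numbers are hypothetical; only the `interpolate` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2012-11-10', periods=5, freq='D')

# Hypothetical weights: measured on the first and last day only
weight = pd.Series([113.8, np.nan, np.nan, np.nan, 113.0], index=days)

# The default method is 'linear': missing values are spread evenly
# between the neighbouring measurements
filled = weight.interpolate()
print(filled)
```

Each interpolated day changes by the same amount, so the filled values carry no information about what happened between two real measurements.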

In [None]:
measurement_agg.head()

Now let's fill in the data in the rest of the columns:

In [None]:
measurement_agg['Height'] = 1.76  # height in metres, constant during the entire period
measurement_agg['BMI'] = np.round(measurement_agg['Weight'] / measurement_agg['Height']**2,2)
measurement_agg['Age'] = np.round((measurement_agg.index - pd.Timestamp('1988-06-07')) / pd.Timedelta(days=365),1)
measurement_agg['dW'] = measurement_agg['Weight'].diff(periods=1)
measurement_agg.at[measurement_agg.index[0],'dW'] = 0

measurement_agg.fillna(value=0, inplace=True)
measurement_agg.head()
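
As a quick check of the derived columns, the sketch below recomputes BMI (weight in kg divided by height in metres squared) and the day-to-day change dW for a few values. The height is the one used in this notebook; the three weights are the first readings shown earlier, treated here as consecutive days purely for illustration.

```python
import pandas as pd

height_m = 1.76  # height used in the notebook
weight_kg = pd.Series([113.8, 114.1, 113.5])  # treated as consecutive days (assumption)

# BMI = weight / height^2
bmi = (weight_kg / height_m ** 2).round(2)

# dW = change against the previous day; the first day has no predecessor, so it is set to 0
dw = weight_kg.diff().fillna(0)

print(pd.DataFrame({'Weight': weight_kg, 'BMI': bmi, 'dW': dw}))
```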

## Data Engineering for the Nutrition Dataframe
Showing the structure of the nutrition dataframe

In [None]:
nutrition.head()

For this dataframe I again see multiple entries per day, which I don't like at all. The first task is to aggregate them into a new dataframe. Every aggregation loses some of the original information, so I add a few summary variables (number of meals, mean, max and min calories per meal) so that I don't compromise my analysis.

In [None]:
nutrition_agg = pd.DataFrame(0, columns=[   'Number of Meals', 
                                            'Mean Calories per Meal',
                                            'Max Calories per Meal',
                                            'Min Calories per Meal',
                                            'Total Calories',
                                            'Total Calories from Fat %',
                                            'Total Calories from Carbs %',
                                            'Total Calories from Protein %',
                                            'Total Amount of Nutrients g'
                                           ], index=pd.date_range(start=startDate, end=endDate ,freq='D'))

nutrition_agg['Number of Meals'] = nutrition.groupby('Date')['Meal'].count()
nutrition_agg['Total Calories'] = np.round(nutrition.groupby('Date')['Calories'].sum(),0)
nutrition_agg['Mean Calories per Meal'] = np.round(nutrition.groupby('Date')['Calories'].mean(),0)
nutrition_agg['Max Calories per Meal'] = np.round(nutrition.groupby('Date')['Calories'].max(),0)
nutrition_agg['Min Calories per Meal'] = np.round(nutrition.groupby('Date')['Calories'].min(),0)

nutrition_agg['Total Calories from Fat %'] = np.round((nutrition.groupby('Date')['Fat (g)'].sum() * 9 / 
                                              nutrition_agg['Total Calories'])*100,2) 

nutrition_agg['Total Calories from Carbs %'] = np.round((nutrition.groupby('Date')['Carbohydrates (g)'].sum() * 4 / 
                                              nutrition_agg['Total Calories'])*100,2) 

nutrition_agg['Total Calories from Protein %'] = np.round((nutrition.groupby('Date')['Protein (g)'].sum() * 4 / 
                                              nutrition_agg['Total Calories'])*100,2) 

nutrition_agg['Total Amount of Nutrients g'] = np.round(nutrition.groupby('Date')['Fat (g)'].sum() + 
                                                        nutrition.groupby('Date')['Carbohydrates (g)'].sum() +
                                                        nutrition.groupby('Date')['Protein (g)'].sum(),2)
nutrition_agg['Cholesterol mg'] = nutrition.groupby('Date')['Cholesterol'].sum()
nutrition_agg['Potassium mg'] = nutrition.groupby('Date')['Potassium'].sum()
nutrition_agg['Sodium mg'] = nutrition.groupby('Date')['Sodium (mg)'].sum()
nutrition_agg['Fiber g'] = nutrition.groupby('Date')['Fiber'].sum()
nutrition_agg['Sugar g'] = nutrition.groupby('Date')['Sugar'].sum()
nutrition_agg['Vitamin A %'] = nutrition.groupby('Date')['Vitamin A'].sum()
nutrition_agg['Vitamin C %'] = nutrition.groupby('Date')['Vitamin C'].sum()
nutrition_agg['Calcium %'] = nutrition.groupby('Date')['Calcium'].sum()
nutrition_agg['Iron %'] = nutrition.groupby('Date')['Iron'].sum()

In [None]:
nutrition_agg.fillna(value=0, inplace=True)
nutrition_agg.head()
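
The macronutrient percentages above rely on the usual conversion factors of 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. The sketch below applies them to the three meals of 2012-11-10 from the `nutrition.head()` output; the helper names `kcal_per_gram` and `share` are mine.

```python
import pandas as pd

# The three meals logged on 2012-11-10 (values from the nutrition.head() output)
meals = pd.DataFrame(
    {'Calories': [230.0, 805.0, 829.2],
     'Fat (g)': [18.8, 35.0, 56.1],
     'Carbohydrates (g)': [5.4, 91.0, 74.0],
     'Protein (g)': [12.8, 35.0, 25.7]},
    index=pd.to_datetime(['2012-11-10'] * 3),
)

# One row per day
day = meals.groupby(level=0).sum()

# Energy per gram for each macronutrient
kcal_per_gram = {'Fat (g)': 9, 'Carbohydrates (g)': 4, 'Protein (g)': 4}

# Share of the logged calories that each macronutrient accounts for, in percent
share = pd.DataFrame({
    col: day[col] * factor / day['Calories'] * 100
    for col, factor in kcal_per_gram.items()
}).round(2)
print(share)
```

For this day the three shares add up to slightly more than 100%, because the logged calorie totals do not have to match the energy computed from the logged grams.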

## Combining The Data

As all my dataframes have the same number of rows with the same index values, I can simply stitch them together column-wise, which is really easy in pandas:

In [None]:
fullData = pd.concat([nutrition_agg, exercise_agg, measurement_agg], axis=1)

In [None]:
fullData.head()
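
Column-wise concatenation aligns rows by index, so it only gives one clean row per day when the indexes really match. A small sketch with invented numbers shows the check and the stitch; the frames `a` and `b` are hypothetical stand-ins for the aggregated dataframes.

```python
import pandas as pd

idx = pd.date_range('2017-12-07', periods=3, freq='D')
a = pd.DataFrame({'Total Calories': [1800, 1650, 2100]}, index=idx)
b = pd.DataFrame({'Calories Burned': [300, 0, 450]}, index=idx)

# Make sure both frames cover exactly the same days before stitching them together
assert a.index.equals(b.index)

combined = pd.concat([a, b], axis=1)
combined['Net Calories'] = combined['Total Calories'] - combined['Calories Burned']
print(combined)
```

If the indexes differed, `pd.concat` would still run but would introduce NaN values for the days missing from one of the frames.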

So this was all the data I have available. However, to study the effects of intermittent fasting I don't need all of it, as I only started practicing it on **7th December 2017**. Since that date I have logged all my meals and exercises consistently; I only have some gaps in the weight, as it is kind of pointless to weigh yourself every day. Now I will slice the data and use only the numbers gathered from 7th Dec. onwards.

In [None]:
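# take an explicit copy so that new columns can be added without a SettingWithCopyWarning
df = fullData.loc['2017-12-07':].copy()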

I want to add one more column to our dataframe:

In [None]:
df.index.name = 'Date'
df['Net Calories'] = nutrition_agg['Total Calories'] - exercise_agg['Calories Burned']

## Exploratory Data Analysis

In [None]:
# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sns.histplot(df['dW'], bins=5, kde=True)

In [None]:
fig, ax = plt.subplots(figsize=(17,9)) # increasing the original size of the heatmap
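# 'Strength Training' holds text, so only the numeric columns are correlated
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)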

In [None]:
df.describe()

In [None]:
df_vis = df[['Net Calories', 'Total Amount of Nutrients g', 'Fiber g', 'Total Calories from Fat %', 
             'Total Calories from Carbs %', 'Total Calories from Protein %', 'Steps', 'dW']]

sns.pairplot(df_vis)

In [None]:
sns.lmplot(y='dW', x='Net Calories', data=df)

In [None]:
sns.lmplot(y='dW', x='Total Amount of Nutrients g', data=df)

In [None]:
sns.lmplot(y='dW', x='Fiber g', data=df)

In [None]:
sns.lmplot(y='dW', x='Total Calories from Fat %', data=df)

In [None]:
sns.lmplot(y='dW', x='Total Calories from Carbs %', data=df)

In [None]:
sns.lmplot(y='dW', x='Total Calories from Protein %', data=df)

In [None]:
sns.lmplot(y='dW', x='Steps', data=df)

In [None]:
# restrict to numeric columns, as for the heatmap above
sns.clustermap(df.select_dtypes(include='number').corr())

In [None]:
fig, axes = plt.subplots()

fig.set_size_inches(16, 6)

axes.plot(df.index, df['Weight'], lw=3, marker='o', markersize=10)

labels = ["Fat % ", "Carbs %", "Protein %"]
axes.stackplot(df.index, df['Total Calories from Fat %'], 
                         df['Total Calories from Carbs %'], 
                         df['Total Calories from Protein %'],
                         labels = labels)

axes.legend(loc=7, bbox_to_anchor=(1.1, 0.5))
axes.grid(color='b', alpha=0.5, linestyle='dashed', linewidth=0.5)
axes.set_title('Time-History');

In [None]:
fig, axes = plt.subplots()

fig.set_size_inches(15, 6)

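axes.plot(df.index, df['Weight'], lw=3, marker='o', markersize=10, label='Weight')

ax2 = axes.twinx()
ax2.plot(df.index, df['Net Calories'], 'r', label='Net Calories')

# collect the lines from both y-axes so that a single legend shows both of them
lines = list(axes.get_lines()) + list(ax2.get_lines())
axes.legend(lines, [line.get_label() for line in lines], loc=7, bbox_to_anchor=(1.15, 0.5))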
axes.grid(color='b', alpha=0.5, linestyle='dashed', linewidth=0.5)
axes.set_title('Time-History');

## Linear Regression

In [None]:
df.columns

In [None]:
X = df[['Net Calories', 'Steps', 'Total Amount of Nutrients g', 'Fiber g']]

y = df['dW']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

prediction = lm.predict(X_test)

# score() returns the coefficient of determination R^2, not a classification accuracy
r2_lm_train = round(lm.score(X_train, y_train), 2)
r2_lm_test = round(lm.score(X_test, y_test), 2)
print("R^2 on the training set:", r2_lm_train)
print("R^2 on the testing set:", r2_lm_test)

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

plt.scatter(y_test,prediction)

In [None]:
# distribution of the residuals (histplot replaces the deprecated distplot)
sns.histplot(y_test - prediction, kde=True)
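
The numbers printed above are easier to read once it is clear how they are defined. The sketch below computes MAE, MSE, RMSE and R^2 by hand with NumPy on a few invented values of dW; `y_true` and `y_pred` are hypothetical and unrelated to the fitted model.

```python
import numpy as np

# Hypothetical observed and predicted daily weight changes in kg
y_true = np.array([-0.2, 0.1, -0.4, 0.0])
y_pred = np.array([-0.1, 0.0, -0.3, -0.1])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, same unit as y

# R^2 compares the model with always predicting the mean of y_true
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse, 'R^2:', r2)
```

R^2 is 1 for a perfect fit, 0 for a model no better than the mean, and negative when the model is worse than the mean, which is why it should not be read as a percentage of correct predictions.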

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

features = X.columns

forest = RandomForestRegressor(n_estimators=500, random_state = 0, oob_score = True, max_depth=3)
forest.fit(X_train, y_train)

prediction = forest.predict(X_test)

# score() returns the coefficient of determination R^2 for regressors
r2_forest_train = round(forest.score(X_train, y_train), 2)
r2_forest_test = round(forest.score(X_test, y_test), 2)

print("R^2 on the training set:", r2_forest_train)
print("R^2 on the testing set:", r2_forest_test)

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

plt.scatter(y_test,prediction)

In [None]:
# distribution of the residuals (histplot replaces the deprecated distplot)
sns.histplot(y_test - prediction, kde=True)

In [None]:
from io import StringIO  # sklearn.externals.six is no longer shipped with recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydot

import os     
os.environ["PATH"] += os.pathsep + 'C:/anaconda3/Library/bin/graphviz'

dot_data = StringIO()
export_graphviz(forest[0], out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                feature_names=features)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())

## Linear Regression using StatsModels instead of Scikit-Learn

In [None]:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

X_train = sm.add_constant(X_train)
lm1 = sm.OLS(y_train, X_train)
prediction = lm1.fit()

print(prediction.summary())

## New Approach Needed

It seems that I will not be able to find what causes weight loss by looking only at the short-term result, dW. This means that I need to construct new variables and new ways to measure what is actually taking place in my body. So let's start.

But first let's check what we have so far:

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df['Streak'] = range(0,len(df.index),1)
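# BMR estimate in the form of the Mifflin-St Jeor equation for men
# (height is in metres, so 625 corresponds to 6.25 per cm); the age term is subtracted
df['BMR'] = 10*df['Weight'] + 625*df['Height'] - 5*df['Age'] + 5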
df['Caloric Balance'] = df['Net Calories'] - df['BMR']
df['Cumulative Caloric Balance'] = df['Caloric Balance'].cumsum()
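# .ix is no longer available in pandas; 9000 kcal per kg is the conversion assumed in this notebook
df['Calculated Weight kg'] = df['Weight'].iloc[0] + df['Cumulative Caloric Balance']/9000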

fig, axes = plt.subplots()

fig.set_size_inches(15, 6)

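# labels are needed for axes.legend() below to show anything
axes.plot(df.index, df['Weight'], lw=3, marker='o', markersize=10, label='Measured weight')
axes.plot(df.index, df['Calculated Weight kg'], 'r', lw=3, marker='o', markersize=10, label='Calculated weight')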

axes.legend()
axes.grid(color='b', alpha=0.5, linestyle='dashed', linewidth=0.5)
axes.set_title('Time-History');
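
The calorie equation used in the cell above boils down to: daily balance = net calories minus BMR, and weight change = cumulative balance divided by 9000 kcal per kg. A toy version with invented numbers follows; the constant `bmr` of 2100 kcal/day and the four daily values are assumptions made only for this sketch.

```python
import numpy as np

KCAL_PER_KG = 9000     # conversion factor assumed in this notebook
start_weight = 100.0   # hypothetical starting weight in kg
bmr = 2100.0           # hypothetical constant daily expenditure in kcal

# Hypothetical net calories (eaten minus exercise) for four days
net_calories = np.array([1500.0, 1200.0, 2400.0, 1800.0])

balance = net_calories - bmr       # negative values mean a deficit
cumulative = np.cumsum(balance)    # running total of the balance
calc_weight = start_weight + cumulative / KCAL_PER_KG

print(calc_weight)
```

In this example the cumulative deficit after four days is 1500 kcal, which the model translates into a loss of about 0.17 kg.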

You can see that the simple calorie equation gives an almost realistic estimate of the weight loss process. However, there are other factors driving the weight loss, and they appear to be accelerating it. What these factors are is what I want to find out. But first let's calculate what my exact BMR was on each day:

In [None]:
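# back out the caloric balance implied by the measured weight (9000 kcal per kg, as above)
df['Cumulative Caloric Balance A'] = (df['Weight'] - df['Weight'].iloc[0])*9000
df['Caloric Balance A'] = df['Cumulative Caloric Balance A'].diff()
df['Daily BMR'] = df['Caloric Balance A'] - df['Net Calories']

# by construction this balance reproduces the measured weight curve
# (apart from the first row, which is NaN after diff())
df['Caloric Balance'] = df['Net Calories'] + df['Daily BMR']
df['Cumulative Caloric Balance'] = df['Caloric Balance'].cumsum()
df['Calculated Weight kg'] = df['Weight'].iloc[0] + df['Cumulative Caloric Balance']/9000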

fig, axes = plt.subplots()

fig.set_size_inches(15, 6)

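# labels are needed for axes.legend() below to show anything
axes.plot(df.index, df['Weight'], lw=3, marker='o', markersize=10, label='Measured weight')
axes.plot(df.index, df['Calculated Weight kg'], 'r', lw=3, marker='o', markersize=10, label='Calculated weight')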

axes.legend()
axes.grid(color='b', alpha=0.5, linestyle='dashed', linewidth=0.5)
axes.set_title('Time-History');

In [None]:
df['Daily BMR'].describe()
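
The back-calculation can be followed on a toy example: the change in weight is converted into an implied caloric balance, and the difference to the eaten calories is the implied expenditure. The numbers are invented, and the sign convention here is flipped relative to the `Daily BMR` column so that expenditure comes out positive.

```python
import pandas as pd

KCAL_PER_KG = 9000  # conversion factor assumed in this notebook

# Hypothetical measured weights and net calories for four days
weight = pd.Series([100.0, 99.8, 99.9, 99.5])
net = pd.Series([1500.0, 1200.0, 2400.0, 1800.0])

# Balance implied by the change in weight; the first day has no predecessor
implied_balance = weight.diff() * KCAL_PER_KG

# Expenditure that would explain the weight change given what was eaten
implied_expenditure = net - implied_balance

print(implied_expenditure)
```

Even small day-to-day fluctuations in weight translate into swings of thousands of kcal in the implied expenditure, so the describe() output above should be read with that noise in mind.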

Ok now I have my daily BMR values, let's run a regression to see if there is a pattern.

In [None]:
df.columns

In [None]:
'''X = df[['Number of Meals', 
       'Total Calories', 
       'Total Calories from Fat %',
       'Total Calories from Carbs %', 
       'Total Calories from Protein %',
       'Total Amount of Nutrients g', 
       'Cholesterol mg', 
       'Potassium mg',
       'Sodium mg', 
       'Fiber g', 
       'Sugar g', 
       'Vitamin A %', 
       'Vitamin C %',
       'Calcium %', 
       'Iron %', 
       'Cardio Minutes',
       'Net Calories'
       ]]'''

X = df[[
       'Total Calories from Fat %',
       'Potassium mg',
       'Sugar g', 
       'Vitamin C %',
       'Calcium %', 
       'Iron %', 
       'Cardio Minutes',
       'Net Calories'
       ]]

'''X = df[[
       'Total Calories from Carbs %', 
       'Vitamin C %',
       'BMI',
       ]]'''

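y = df['Daily BMR'].copy()
y.iloc[0] = y.mean()  # the first value is NaN after diff(); replace it with the mean (.ix is no longer available)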

In [None]:
X.head()

## Random Forest

In [None]:
features = X.columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
from sklearn.model_selection import GridSearchCV

estimator = RandomForestRegressor()
    
param_grid = { 
        "n_estimators"      : [10],
        "min_samples_leaf"  : range(1,21,1),
        "min_samples_split" : range(2,11,1),
        "max_depth"         : range(2,11,1),
        "oob_score"         : [True],
        "random_state"      : [0]
        }

grid = GridSearchCV(estimator, param_grid, n_jobs=-1, return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

In [None]:
grid.best_score_ , grid.best_params_
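
Instead of copying the best parameters by hand into a new model, the fitted search object can hand back the refitted model directly. The sketch below shows this on synthetic data; `X_toy`, `y_toy` and the small parameter grid are made up for the example and are not the notebook's data or grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: y depends mostly on the first feature
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 3)
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={'max_depth': [2, 4], 'min_samples_leaf': [1, 5]},
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    cv=3,
)
search.fit(X_toy, y_toy)

# With the default refit=True the best parameter set is refitted on all the data
best_forest = search.best_estimator_
print(search.best_params_, search.best_score_)
```

Using `best_estimator_` keeps the model in sync with the search result if the grid or the data change later.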

In [None]:
forest = RandomForestRegressor(n_estimators=10, random_state = 0, 
                               oob_score = True, min_samples_leaf=7, max_depth = 2,
                               min_samples_split=2)
forest.fit(X_train, y_train)

prediction = forest.predict(X_test)

# score() returns the coefficient of determination R^2 for regressors
print("R^2 on the training set:", round(forest.score(X_train, y_train), 2))
print("R^2 on the testing set:", round(forest.score(X_test, y_test), 2))
print("Out-of-bag score:", round(forest.oob_score_, 2))
print('RMSE on Training Set:', np.round(np.sqrt(metrics.mean_squared_error(y_train, forest.predict(X_train))),0))
print('RMSE on Testing Set:', np.round(np.sqrt(metrics.mean_squared_error(y_test, forest.predict(X_test))),0))

plt.scatter(y_test,prediction)

In [None]:
forest.feature_importances_

bar_x = range(len(forest.feature_importances_))

indices = np.argsort(forest.feature_importances_)
sorted_importances = []

for i in indices:
    sorted_importances.append(X.columns[i])

fig, ax = plt.subplots(figsize=(17,6))
plt.barh(bar_x, forest.feature_importances_[indices])
plt.yticks(bar_x, sorted_importances)

In [None]:
tree_scores = [tree.score(X_train, y_train) for tree in forest]

best_tree = tree_scores.index(max(tree_scores))

dot_data = StringIO()
export_graphviz(forest[best_tree], 
                out_file=dot_data,  
                filled=True,
                precision=0,
                special_characters=True,
                feature_names=features,
                leaves_parallel=True,
                rounded=True,
                rotate=False, 
                proportion=True, 
                impurity=False
               )
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())

In [None]:
fig, ax = plt.subplots(figsize=(17,10))

ax.plot(df['Streak'], y, 'o-', label="data")
ax.plot(df['Streak'], forest.predict(X), 'r', label="Tree")
ax.legend(loc='best');
ax.grid(color='g', alpha=0.5, linestyle='dashed', linewidth=0.5)

## OLS

In [None]:
lm2 = sm.OLS(y, X)
prediction = lm2.fit(use_t=True, )

In [None]:
print(prediction.summary2())

In [None]:
prediction.predict([80,500,25,100,20,1,60,1500])

In [None]:
prstd, iv_l, iv_u = wls_prediction_std(prediction)

fig, ax = plt.subplots(figsize=(17,10))

ax.plot(df['Streak'], y, 'o-', label="data")
ax.plot(df['Streak'], prediction.fittedvalues, 'r--', label="OLS")
ax.plot(df['Streak'], forest.predict(X), 'y--', label="Tree")
#ax.plot(df['Streak'], iv_u, 'r--')
#ax.plot(df['Streak'], iv_l, 'r--')
ax.legend(loc='best');
ax.grid(color='g', alpha=0.5, linestyle='dashed', linewidth=0.5)

In [None]:
fig, ax = plt.subplots(figsize=(17,10))
plt.scatter(y,prediction.fittedvalues)

In [None]:
fig, ax = plt.subplots(figsize=(17,10))
# distribution of the OLS residuals (histplot replaces the deprecated distplot)
sns.histplot(y - prediction.fittedvalues, kde=True)

In [None]:
df_corr =  df[[
       'Total Calories from Fat %',
       'Potassium mg',
       'Sugar g', 
       'Vitamin C %',
       'Calcium %', 
       'Iron %', 
       'Cardio Minutes',
       'Net Calories', 
       'Daily BMR'
       ]]

fig, ax = plt.subplots(figsize=(17,10))
sns.heatmap(df_corr.corr(), cmap='coolwarm', annot=True, linewidths=5, )

In [None]:
sns.pairplot(df_corr)

In [None]:
fig, ax = plt.subplots(figsize=(17,10))
sns.barplot(x = df['Streak'], y = df['Net Calories'], color='b')