# Nutritional Analysis and Modeling

This notebook imports your nutritional data from MyFitnessPal using python-myfitnesspal, a Python module developed by coddingtonbear and available on GitHub. Once the data is loaded, the notebook lets you analyse it visually, model the contribution of the base macros to caloric intake, and score that model against your data. The goal is to provide a visual and statistical analysis of nutrition data that anyone can use.

In [None]:
!pip install -r requirements.txt

In [None]:
import pandas as pd
import numpy as np
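import matplotlib as plt  # bound to the top-level package, so plotting calls below go through plt.pyplot
import matplotlib.pyplot  # load the pyplot submodule explicitly so plt.pyplot is always available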
import matplotlib.dates as mdates
import seaborn as sns
from datetime import date, datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import myfitnesspal

In [None]:
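def mfp_df_cleanup(df):
    """Drop empty nutrient columns and cast the remaining nutrients to float."""
    # Remove nutrient columns that were never populated
    df.dropna(axis='columns', how='all', inplace=True)
    # Every column except the two identifiers holds numeric values
    col_list = df.columns.to_list()
    col_list.remove('date')
    col_list.remove('meal')
    df[col_list] = df[col_list].astype('float64')
    return df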
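
As a quick check of what `mfp_df_cleanup` does, the sketch below runs the same steps on a tiny hand-made frame. The values in `toy` are made up for illustration, and the function is repeated so the example runs on its own.

```python
import numpy as np
import pandas as pd

def mfp_df_cleanup(df):
    # Same steps as the helper above, repeated so this sketch is self-contained
    df.dropna(axis='columns', how='all', inplace=True)
    col_list = df.columns.to_list()
    col_list.remove('date')
    col_list.remove('meal')
    df[col_list] = df[col_list].astype('float64')
    return df

# Made-up rows: object dtype everywhere, and a 'fiber' column that was never filled in
toy = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01'],
    'meal': ['breakfast', 'lunch'],
    'calories': [300, 550],
    'fat': [10, 20],
    'fiber': [None, None],
}, dtype=object)

clean = mfp_df_cleanup(toy)
print(clean.dtypes)
```

The empty `fiber` column should be gone, and `calories` and `fat` should now be `float64`.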


In [None]:
sns.set_style()

## Part I: Nutrition Data Analysis

### Import and Prepare Data

Enter your username and password for the MyFitnessPal website below.

In [None]:
user = ''
pwd = ''

Set the date range to analyse in the following format: (yyyy,m,d). Please note that the wider the date range, the longer it will take to acquire the data.

In [None]:
start_date = date(2021,1,1)
end_date = date(2021,12,31)
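
The download loop further down makes one `client.get_date` call per day, so the number of days in the range is a rough guide to how long the import will take. A small sketch of that count, using the same example dates (`n_days` is a name introduced here for illustration):

```python
from datetime import date

start_date = date(2021, 1, 1)
end_date = date(2021, 12, 31)

# Both endpoints are included in the range, hence the + 1
n_days = (end_date - start_date).days + 1
print(f'{n_days} days to download')
```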

In [None]:
df = pd.DataFrame(columns=['date','meal','calories','carbohydrates','fat','protein','sodium','fiber','sugar'])

In [None]:
client = myfitnesspal.Client(user, password=pwd)

In [None]:
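delta = timedelta(days=1)

rows = []

# Walk the range with a separate cursor so start_date keeps its original value
current_date = start_date
while current_date <= end_date:
    day = client.get_date(current_date)
    date_str = day.date.strftime('%Y-%m-%d')
    # One row per meal, holding the date, the meal name, and that meal's totals
    for meal_name, meal in zip(day.keys(), day.meals):
        # Build a fresh dict per row so a meal with no entries does not
        # inherit the previous meal's totals
        row = {'date': date_str, 'meal': meal_name}
        row.update(meal.totals)
        rows.append(row)
    current_date += delta

# DataFrame.append was removed in pandas 2.0; build the frame in one step instead
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)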
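
To see the shape of the table this step produces without logging in, here is a minimal sketch that uses a plain dictionary in place of the MyFitnessPal client. `fake_days` and all of its values are hypothetical stand-ins. The sketch builds a fresh dictionary for every row, which is one way to make sure a meal with nothing logged shows up as missing values.

```python
import pandas as pd

# Hypothetical stand-in for the diary: {date: {meal name: totals}}
fake_days = {
    '2021-01-01': {
        'breakfast': {'calories': 300, 'fat': 10},
        'lunch': {},  # nothing logged for this meal
    },
    '2021-01-02': {
        'breakfast': {'calories': 400, 'fat': 12},
    },
}

rows = []
for day_str, meals in fake_days.items():
    for meal_name, totals in meals.items():
        # Start each row from scratch, then add whatever totals exist
        row = {'date': day_str, 'meal': meal_name}
        row.update(totals)
        rows.append(row)

toy_df = pd.DataFrame(rows)
print(toy_df)
```

Each meal becomes one row, and the empty lunch has `NaN` in its nutrient columns.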


In [None]:
df = mfp_df_cleanup(df)

In [None]:
# Index the per-meal rows by timestamp while keeping the original 'date' column
df['dateindex'] = pd.to_datetime(df['date'])
df = df.set_index(['dateindex'])
# Totals per (date, meal), then totals per date
macro_df = df.groupby(['date','meal']).sum()
date_macro_df = macro_df.groupby(['date']).sum()
# Turn the string dates in the index into a DatetimeIndex (still named 'date')
date_macro_df.index = pd.to_datetime(date_macro_df.index)
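
The two-step aggregation can be hard to picture, so here is a small sketch of the same idea on made-up data: per-meal rows are summed per (date, meal), then per date, and the date index is converted to timestamps. The names `toy`, `per_meal`, and `per_day` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'meal': ['breakfast', 'lunch', 'breakfast'],
    'calories': [300.0, 550.0, 400.0],
    'protein': [20.0, 35.0, 25.0],
})

# One row per (date, meal), then one row per date
per_meal = toy.groupby(['date', 'meal']).sum()
per_day = per_meal.groupby('date').sum()

# Convert the string dates in the index to real timestamps
per_day.index = pd.to_datetime(per_day.index)
print(per_day)
```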

In [None]:
macro_cols = ['fat','carbohydrates','protein']

In [None]:
df.head()

### Summarize Data

In [None]:
df.shape

In [None]:
df.describe().transpose()

In [None]:
df.groupby(['date']).sum().describe().transpose()


### Visualize Data

In [None]:
loc = mdates.MonthLocator(interval=1)
fmt = mdates.DateFormatter('%m-%d-%y')

df['date'] = pd.to_datetime(df['date'])

fig, ax = plt.pyplot.subplots(figsize=(15, 6))
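# x_compat=True keeps matplotlib's own date units so the mdates locator and formatter below apply correctly
df.groupby(['date'])['calories'].sum().plot(kind='line', ax = ax, x_compat=True)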

_ = plt.pyplot.xlabel('Month')
_ = plt.pyplot.ylabel('Calories')

_ = plt.pyplot.title('Total Calories by Day')

ax.xaxis.set_major_locator(loc)
ax.xaxis.set_major_formatter(fmt)

plt.pyplot.show()

In [None]:
f = sns.FacetGrid(df, col="meal")
f.map(plt.pyplot.hist, 'fat')
f.fig.suptitle('Fat Histogram by Meal')
f.tight_layout()
c = sns.FacetGrid(df, col="meal")
c.map(plt.pyplot.hist, 'carbohydrates')
c.fig.suptitle('Carbohydrate Histogram by Meal')
c.tight_layout()
p = sns.FacetGrid(df, col="meal")
p.map(plt.pyplot.hist, 'protein')
p.fig.suptitle('Protein Histogram by Meal')
p.tight_layout()

In [None]:
pp = sns.pairplot(df, hue='meal')
pp.fig.suptitle('Pairplot of All Columns, Colored by Meal',size='large')
pp.tight_layout()

In [None]:
rpp = sns.pairplot(macro_df.groupby('date').sum(),kind='reg')
rpp.fig.suptitle('Pairplot of All Columns with Trendline',size='large')
rpp.tight_layout()

In [None]:
ax = plt.pyplot.axes()
sns.heatmap(macro_df.groupby('meal')[macro_cols].mean(), ax = ax)
ax.set_title('Heatmap of Macros by Meal')
plt.pyplot.show()

In [None]:
fig, ax = plt.pyplot.subplots(figsize=(15, 6))

# The index is already a DatetimeIndex, so resample on it directly; adding a
# 'date' column here would leak into the regression inputs in Part II
date_macro_df[macro_cols].resample(rule='M').mean().plot(kind='line', ax = ax)

_ = plt.pyplot.xlabel('Month')
_ = plt.pyplot.ylabel('Grams')

_ = plt.pyplot.title('Average Macros by Month')

plt.pyplot.show()

### Questions to consider

* How are calories distributed by meal?
* How strongly correlated are protein, fats, and carbs to calorie intake?
* How are fiber, sugar, and sodium intake related to macros?
* How do calories vary over time?
* Is there an observable trend in peaks and troughs of caloric intake?
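
For the correlation question, one option is a plain correlation table from `DataFrame.corr()`. The sketch below uses synthetic data so that it runs on its own; `toy` and its value ranges are assumptions, and in the notebook you would call `.corr()` on `date_macro_df` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic daily grams of each macro (made-up ranges)
toy = pd.DataFrame({
    'fat': rng.uniform(40, 100, n),
    'carbohydrates': rng.uniform(150, 350, n),
    'protein': rng.uniform(60, 160, n),
})
# Calories built from the usual 9/4/4 kcal-per-gram factors
toy['calories'] = 9 * toy['fat'] + 4 * toy['carbohydrates'] + 4 * toy['protein']

corr = toy.corr()
print(corr['calories'].sort_values(ascending=False))
```

Passing `corr` to `sns.heatmap` gives a visual version of the same table.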

## Part II: Statistical Modeling 

### Linear Regression Model - Macros + Fiber, Sugar, and Sodium 

In [None]:
# Predictors: every nutrient column except the target; skip any helper 'date' column
x_cols = [c for c in date_macro_df.columns if c not in ('calories', 'date')]

In [None]:
# Regress daily calories on the macros plus fiber, sugar, and sodium
# Calories are largely determined by the macros, and sugar and fiber overlap with carbohydrates, so expect multicollinearity warnings
X = date_macro_df[x_cols]

cal_reg = LinearRegression().fit(X,date_macro_df['calories'])

print('Intercept: \n', cal_reg.intercept_)
print('Coefficients: \n', cal_reg.coef_)

import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(date_macro_df['calories'], X1).fit()

print(result.summary())

### Linear Regression Model - Macros Only

In [None]:
# Regress daily calories on the three macros only (fat, carbohydrates, protein)
# Calories are the dependent variable; the macros are still related to one another, so warnings may remain
X = date_macro_df[macro_cols]

cal_reg = LinearRegression().fit(X,date_macro_df['calories'])

print('Intercept: \n', cal_reg.intercept_)
print('Coefficients: \n', cal_reg.coef_)

import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(date_macro_df['calories'], X1).fit()

print(result.summary())
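
For a sense of what a macros-only regression should return, the sketch below fits the same kind of model to synthetic data in which calories really are 9 kcal per gram of fat and 4 kcal per gram of carbohydrate and protein, plus some noise. Everything here (the ranges, the noise level, the use of `np.linalg.lstsq` in place of the notebook's `LinearRegression`) is an assumption made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic daily grams of each macro (made-up ranges)
fat = rng.uniform(40, 100, n)
carbs = rng.uniform(150, 350, n)
protein = rng.uniform(60, 160, n)
# 9 kcal/g for fat, 4 kcal/g for carbs and protein, plus logging noise
calories = 9 * fat + 4 * carbs + 4 * protein + rng.normal(0, 25, n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), fat, carbs, protein])
coef, *_ = np.linalg.lstsq(X, calories, rcond=None)
intercept, b_fat, b_carbs, b_protein = coef
print('fat:', round(b_fat, 2), 'carbs:', round(b_carbs, 2), 'protein:', round(b_protein, 2))
```

If your own coefficients are far from these values, incomplete or inaccurate diary entries are one possible explanation worth checking.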

### Log-Log Linear Regression Model - Macros + Fiber, Sugar, and Sodium 

In [None]:
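# Log-transform the daily totals for the log-log models
# Keep numeric columns only, and replace zeros with 1 so that log(1) = 0 stands in for the undefined log(0)
date_macro_df_copy = date_macro_df.select_dtypes(include='number').copy()
date_macro_df_copy.replace(to_replace=0,value=1,inplace=True)
log_date_macro_df = np.log(date_macro_df_copy)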

In [None]:
# Regress log(calories) on the logs of the macros plus fiber, sugar, and sodium
# The predictors are still related to one another, so multicollinearity warnings may appear here too
X = log_date_macro_df[x_cols]

cal_reg = LinearRegression().fit(X,log_date_macro_df['calories'])

print('Intercept: \n', cal_reg.intercept_)
print('Coefficients: \n', cal_reg.coef_)

import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(log_date_macro_df['calories'], X1).fit()

print(result.summary())

### Log-Log Linear Regression Model - Macros Only

In [None]:
# Regress log(calories) on the logs of the three macros only (fat, carbohydrates, protein)
# Coefficients here are on the log scale, so they are not comparable to the 4/4/9 values from the linear models
X = log_date_macro_df[macro_cols]

cal_reg = LinearRegression().fit(X,log_date_macro_df['calories'])

print('Intercept: \n', cal_reg.intercept_)
print('Coefficients: \n', cal_reg.coef_)

import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(log_date_macro_df['calories'], X1).fit()


print(result.summary())
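
Coefficients in a log-log model read differently from those in the linear models: each one is roughly the percent change in calories for a one percent change in that macro. The sketch below illustrates this on synthetic, noise-free data and compares the fitted coefficients with the average share of calories that each macro supplies. The data, the ranges, and the use of `np.linalg.lstsq` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Synthetic daily grams of each macro (made-up ranges)
fat = rng.uniform(40, 100, n)
carbs = rng.uniform(150, 350, n)
protein = rng.uniform(60, 160, n)
calories = 9 * fat + 4 * carbs + 4 * protein  # noise-free on purpose

# Log-log design matrix: intercept plus the log of each macro
X = np.column_stack([np.ones(n), np.log(fat), np.log(carbs), np.log(protein)])
coef, *_ = np.linalg.lstsq(X, np.log(calories), rcond=None)
elasticities = coef[1:]

# Average share of calories coming from each macro, for comparison
shares = np.array([
    np.mean(9 * fat / calories),
    np.mean(4 * carbs / calories),
    np.mean(4 * protein / calories),
])
print('log-log coefficients:', elasticities.round(2))
print('mean calorie shares: ', shares.round(2))
```

On data like this the coefficients should sit close to the calorie shares and add up to about 1, rather than landing near 4, 4, and 9.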

### Questions to consider

* Where do I see multicollinearity warnings? These are a sign that two or more variables are highly correlated. Refer back to the pairplots in Part I to see where that overlap may be.
* Do these results align with what I expect? For example, in the standard linear regressions, do I have coefficients of approximately 4 for protein and carbs and 9 for fat?
* Are there any areas where the results differ from what I expect? For example, is something highly significant or insignificant in the regressions that should or should not be?
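
One way to put a number on multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is a small NumPy version run on synthetic data in which sugar tracks carbohydrates closely. The helper `vif` and the data are written for this example and are not part of the notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(len(y)), others])
        fitted = A @ np.linalg.lstsq(A, y, rcond=None)[0]
        ss_res = np.sum((y - fitted) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
n = 300
carbs = rng.uniform(150, 350, n)
sugar = 0.4 * carbs + rng.normal(0, 5, n)  # sugar moves with carbs by construction
protein = rng.uniform(60, 160, n)

vifs = vif(np.column_stack([carbs, sugar, protein]))
print(dict(zip(['carbohydrates', 'sugar', 'protein'], vifs.round(1))))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of trouble. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` if you prefer not to write your own.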

## Part III: Predictive Modeling

In [None]:
# Fit the macros-only log-log model on a training split and score it on the held-out test split
X_L = log_date_macro_df[macro_cols]

X_train, X_test, Y_train, Y_test = train_test_split(X_L,log_date_macro_df['calories'],test_size=0.30,random_state=0)

cal_reg = LinearRegression().fit(X_train,Y_train)

y_pred = cal_reg.predict(X_test)

# Error metrics (in log-calorie units, since the target is log-transformed)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, y_pred)))

# R^2 of the model on the held-out test set
score = cal_reg.score(X_test,Y_test)

print('Score (R^2): ',score)
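
Because the model is fitted to log(calories), the errors above are in log units, which can be hard to read. One option is to undo the log with `np.exp` and recompute the error in calories. The sketch below shows the idea on made-up numbers; `log_pred` stands in for `cal_reg.predict(X_test)` and `log_true` for `Y_test`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily calories, and predictions that are slightly off on the log scale
true_cal = rng.uniform(1500, 3000, 50)
log_true = np.log(true_cal)
log_pred = log_true + rng.normal(0, 0.05, 50)  # stand-in for cal_reg.predict(X_test)

# Error on the log scale: roughly a relative error
mae_log = np.mean(np.abs(log_true - log_pred))

# Undo the log to get predictions, and errors, in calories
pred_cal = np.exp(log_pred)
mae_cal = np.mean(np.abs(true_cal - pred_cal))

print(f'MAE (log units): {mae_log:.3f}')
print(f'MAE (calories):  {mae_cal:.1f}')
```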

## Conclusion

This notebook started as a project for me to explore some basics of data science and Python while answering key questions about my own health and nutrition. It contains the portion of those exercises that I felt had value for a wider audience.

Some of my original work is left out. In my initial efforts I had access to calorie expenditure data from my Fitbit, and by combining it with the MFP data I was able to see how closely my results mirrored the calories-in, calories-out model. That analysis requires additional data sets that I could not guarantee would be 1) accessible to others or 2) in a format that is easy to import and manipulate, so I chose to omit it. This also meant omitting my time-series regressions using lagged calories burned. As lagged calories burned was not significant, omitting it would make sense even if the calories-burned data had been included. I have also omitted, for now, the work I did on normalizing the data before running my regressions; while the process was successful, it made the results significantly more difficult to interpret.

I hope that this notebook helps you review and analyse your nutrition.

DK - 1/21/22