# GUESS MY WEIGHT 

![guess_your_weight.gif](images/guess_your_weight.gif)

## Table of Contents TOC
[Overview](#overview)<br />
[Data Understanding](#data-understanding)<br />
[Data Preparation](#data-preparation)<br />
[Modeling](#modeling)<br />
[Evaluation](#evaluation)<br />
[Github Repository and Resources](#github-repository-and-resources)<br />


## Overview
Health and wellness is big business, and weight loss in particular. We’re all trying because it’s very, very hard. I recently went on my own weight loss journey, losing about 50 lbs in roughly 18 months. Weighing myself every morning, I agonized over every tenth of a pound, recording it in an app on my phone. I realized that losing big chunks of weight starts with small, incremental progress on the scale. But I didn’t stop there. As a data nerd I thought, “let’s record every meal.” So I did that too. I wondered… given all this data I have, could I predict my weight? My watch and phone capture my exercise, sleep, eating, and so much more. There must be trends here. At a minimum, I should be able to predict whether my weight will go up or down from the previous day. So let’s do it.<br />
[return to TOC](#table-of-contents-TOC)

## Data Understanding
I have much (and probably too much) of this data on my iPhone and Apple Watch. It contains weight information, workouts, heart rate, and meals broken down into subcategories (proteins, fats, etc.). Most important is the weight: it is the variable I will primarily use as the classification target.  

Because it’s my data, there’s more clarity about data entry methods. This is more subjective than a controlled experiment with many participants. I know what data I was diligent about collecting, so I should be able to scrub it appropriately. For instance, I didn’t record my fluids consistently: water, tea, coffee. Water consumption is a big part of this, so I’ll have to be clear about the gaps in the data.<br />
[return to TOC](#table-of-contents-TOC)


## Data Preparation
The data originates in an XML file on my phone; the version used here is stored as a CSV file in a Kaggle repository. After loading it into a Python notebook and digging a little, there are roughly 180 rows of weight entries (approximately 6 months), but it’s not clear how many gaps there are. Every record is stored as an entry with a timestamp and usually a numeric value. Whether it’s heart rate, weight, or caloric info, it’s one numeric entry with associated units. We’re primarily dealing with ints and floats, all numeric, and we’ll be using daily totals/averages. Because we only have one weigh-in per day, we’re only going to use daily values of the other data. So… we know we have approximately 100-180 rows. I can’t say at the moment how many columns, because this will depend on what happens in pre-processing. Which brings me to...

There are two major challenges with the pre-processing. The first deals with the privacy of my personal health data. How do I balance reproducibility requirements with privacy concerns? I need to make the dataset publicly available, including all of my pre-processing steps, but I also want to make sure no one can link it back to me, Andrew Q. Bennett (my real middle name doesn’t start with Q… gotcha!!!!). And the initial dataset is large, maybe 40 MB. The approach we’ll use is to perform some pre-processing locally, and then upload to the kaggle site when it’s ready for public consumption. In my jupyter notebook, I will comment out some of this code so that we can see the work, but it won’t affect the code when we press “run”.

The second is dealing with correlation. For instance, we know that all data related to working out is going to be correlated with each other. The steps, average heart rate, workout calories, etc. will all be correlated with whether I went for a jog that day. Making decisions about which data to use will be a challenge, even with some baseline domain knowledge. There is a treasure trove of data that may have nothing (or very little) to do with weight loss, like Vitamin A intake. PCA will be critical for reducing the number of features without losing too much information. I know about health… but I’m no expert. Maybe Vitamin A intake can help/hurt weight loss.

The main visualization effort will go into making sure the weight data is presented cleanly. A nice regression line showing weight trends over different periods will be very helpful.<br />
[return to TOC](#table-of-contents-TOC)

### Instructions for Google Colab
You do not need to run the code snippets below to follow along. They are a reference in case you'd like to download the dataset from Kaggle yourself; in Google Colab, run them only the very first time.  

In [1]:
! pip install opendatasets
! pip install kaggle



In [2]:
import opendatasets as od
import pandas
 
od.download(
    "https://www.kaggle.com/datasets/andrewmbennett/guess-my-weight-4-25")

Skipping, found downloaded files in ".\guess-my-weight-4-25" (use force=True to force download)


In [14]:
import pandas as pd
import datetime as dt
import numpy as np
from statsmodels.tsa.stattools import adfuller
import tensorflow as tf
from sklearn.model_selection import train_test_split

### Importing csv file

In [4]:
df = pd.read_csv('/content/guess-my-weight-4-25/merge_health_4_25.csv')

In [5]:
df

Unnamed: 0,date,BodyMass_lb,StepCount_count,DistanceWalkingRunning_mi,BasalEnergyBurned_Cal,ActiveEnergyBurned_Cal,FlightsClimbed_count,DietaryFatTotal_g,DietaryFatPolyunsaturated_g,DietaryFatMonounsaturated_g,...,DietaryZinc_mg,DietarySelenium_mcg,DietaryCopper_mg,DietaryManganese_mg,DietaryPotassium_mg,AppleExerciseTime_min,SleepAnalysis_AsleepDeep_hrs,SleepAnalysis_AsleepCore_hrs,SleepAnalysis_AsleepREM_hrs,SleepAnalysis_Awake_hrs
0,2023-08-24,196.9,8895.0,4.163569,2055.322,564.7780,24.0,159.7455,11.8,9.5,...,0.5,9.0,0.3,1.1,1572.0,12.0,0.783333,5.558333,1.766667,0.266667
1,2023-08-25,195.1,9276.0,4.512434,2174.950,793.3800,7.0,62.9275,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,36.0,1.008333,3.700000,1.500000,0.133333
2,2023-08-26,195.1,10883.0,4.948209,2074.476,395.3870,9.0,118.3000,8.3,15.0,...,1.4,13.0,0.5,0.8,1943.0,8.0,1.400000,3.916667,1.558333,0.050000
3,2023-08-27,192.9,19174.0,9.909258,2187.383,895.4360,14.0,79.9300,3.1,2.9,...,1.5,18.0,0.3,0.5,1986.0,45.0,0.891667,5.566667,2.591667,0.066667
4,2023-08-28,192.9,13636.0,6.833914,2186.244,901.5490,21.0,70.8500,4.6,7.1,...,1.3,17.0,0.3,0.9,455.0,43.0,0.641667,5.275000,2.008333,0.158333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193,2024-03-04,175.7,8191.0,4.051709,1983.933,499.0720,4.0,87.7000,7.9,9.9,...,2.8,58.0,0.3,2.8,1023.0,76.0,0.000000,0.000000,0.000000,0.000000
194,2024-03-05,174.2,8882.0,4.448750,2009.083,566.5723,9.0,88.6000,4.8,6.3,...,1.9,123.0,0.1,0.8,2387.0,135.0,0.816667,4.775000,1.858333,2.683333
195,2024-03-06,173.3,2610.0,1.272886,759.761,127.8580,2.0,,,,...,,,,,,3.0,,,,
196,2023-08-23,,7325.0,3.399540,2057.531,476.7400,17.0,80.7000,1.8,0.9,...,0.5,2.0,0.2,0.5,422.0,10.0,0.983333,3.400000,1.091667,0.241667


One of the first things we'll do is convert the date column to a datetime type and then make it the index.

### Date Column Feature and Formatting

In [6]:
#convert the date
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

In [7]:
df['day'] = df['date'].dt.day_name()
df

Unnamed: 0,date,BodyMass_lb,StepCount_count,DistanceWalkingRunning_mi,BasalEnergyBurned_Cal,ActiveEnergyBurned_Cal,FlightsClimbed_count,DietaryFatTotal_g,DietaryFatPolyunsaturated_g,DietaryFatMonounsaturated_g,...,DietarySelenium_mcg,DietaryCopper_mg,DietaryManganese_mg,DietaryPotassium_mg,AppleExerciseTime_min,SleepAnalysis_AsleepDeep_hrs,SleepAnalysis_AsleepCore_hrs,SleepAnalysis_AsleepREM_hrs,SleepAnalysis_Awake_hrs,day
0,2023-08-24,196.9,8895.0,4.163569,2055.322,564.7780,24.0,159.7455,11.8,9.5,...,9.0,0.3,1.1,1572.0,12.0,0.783333,5.558333,1.766667,0.266667,Thursday
1,2023-08-25,195.1,9276.0,4.512434,2174.950,793.3800,7.0,62.9275,0.0,0.0,...,0.0,0.0,0.0,0.0,36.0,1.008333,3.700000,1.500000,0.133333,Friday
2,2023-08-26,195.1,10883.0,4.948209,2074.476,395.3870,9.0,118.3000,8.3,15.0,...,13.0,0.5,0.8,1943.0,8.0,1.400000,3.916667,1.558333,0.050000,Saturday
3,2023-08-27,192.9,19174.0,9.909258,2187.383,895.4360,14.0,79.9300,3.1,2.9,...,18.0,0.3,0.5,1986.0,45.0,0.891667,5.566667,2.591667,0.066667,Sunday
4,2023-08-28,192.9,13636.0,6.833914,2186.244,901.5490,21.0,70.8500,4.6,7.1,...,17.0,0.3,0.9,455.0,43.0,0.641667,5.275000,2.008333,0.158333,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193,2024-03-04,175.7,8191.0,4.051709,1983.933,499.0720,4.0,87.7000,7.9,9.9,...,58.0,0.3,2.8,1023.0,76.0,0.000000,0.000000,0.000000,0.000000,Monday
194,2024-03-05,174.2,8882.0,4.448750,2009.083,566.5723,9.0,88.6000,4.8,6.3,...,123.0,0.1,0.8,2387.0,135.0,0.816667,4.775000,1.858333,2.683333,Tuesday
195,2024-03-06,173.3,2610.0,1.272886,759.761,127.8580,2.0,,,,...,,,,,3.0,,,,,Wednesday
196,2023-08-23,,7325.0,3.399540,2057.531,476.7400,17.0,80.7000,1.8,0.9,...,2.0,0.2,0.5,422.0,10.0,0.983333,3.400000,1.091667,0.241667,Wednesday


In [8]:
# Make Date the index 
df.set_index('date', inplace=True)

In [9]:
# drop the stray rows dated before the Aug. 24 start
df.drop(['2023-08-22', '2023-08-23'], axis=0,inplace=True)

In [10]:
df

date,BodyMass_lb,StepCount_count,DistanceWalkingRunning_mi,BasalEnergyBurned_Cal,ActiveEnergyBurned_Cal,FlightsClimbed_count,DietaryFatTotal_g,DietaryFatPolyunsaturated_g,DietaryFatMonounsaturated_g,DietaryFatSaturated_g,...,DietarySelenium_mcg,DietaryCopper_mg,DietaryManganese_mg,DietaryPotassium_mg,AppleExerciseTime_min,SleepAnalysis_AsleepDeep_hrs,SleepAnalysis_AsleepCore_hrs,SleepAnalysis_AsleepREM_hrs,SleepAnalysis_Awake_hrs,day
2023-08-24,196.9,8895.0,4.163569,2055.322,564.7780,24.0,159.7455,11.8,9.5,36.2203,...,9.0,0.3,1.1,1572.0,12.0,0.783333,5.558333,1.766667,0.266667,Thursday
2023-08-25,195.1,9276.0,4.512434,2174.950,793.3800,7.0,62.9275,0.0,0.0,10.8165,...,0.0,0.0,0.0,0.0,36.0,1.008333,3.700000,1.500000,0.133333,Friday
2023-08-26,195.1,10883.0,4.948209,2074.476,395.3870,9.0,118.3000,8.3,15.0,39.5000,...,13.0,0.5,0.8,1943.0,8.0,1.400000,3.916667,1.558333,0.050000,Saturday
2023-08-27,192.9,19174.0,9.909258,2187.383,895.4360,14.0,79.9300,3.1,2.9,27.9600,...,18.0,0.3,0.5,1986.0,45.0,0.891667,5.566667,2.591667,0.066667,Sunday
2023-08-28,192.9,13636.0,6.833914,2186.244,901.5490,21.0,70.8500,4.6,7.1,16.3000,...,17.0,0.3,0.9,455.0,43.0,0.641667,5.275000,2.008333,0.158333,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-02,174.6,13416.0,6.533640,2048.925,1651.9890,16.0,76.2000,2.4,3.0,25.0000,...,24.0,0.3,1.4,1996.0,148.0,0.000000,0.000000,0.000000,0.000000,Saturday
2024-03-03,175.0,15876.0,7.722016,2048.189,1443.2150,22.0,59.9000,1.3,0.6,9.7000,...,51.0,0.4,3.2,1987.0,173.0,1.108333,3.925000,1.966667,0.300000,Sunday
2024-03-04,175.7,8191.0,4.051709,1983.933,499.0720,4.0,87.7000,7.9,9.9,25.9000,...,58.0,0.3,2.8,1023.0,76.0,0.000000,0.000000,0.000000,0.000000,Monday
2024-03-05,174.2,8882.0,4.448750,2009.083,566.5723,9.0,88.6000,4.8,6.3,21.8000,...,123.0,0.1,0.8,2387.0,135.0,0.816667,4.775000,1.858333,2.683333,Tuesday


### BodyMass Inspection

In [11]:
df['BodyMass_lb'].describe()

count    196.000000
mean     127.684184
std       88.425096
min        0.000000
25%        0.000000
50%      181.100000
75%      188.300000
max      388.500000
Name: BodyMass_lb, dtype: float64

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df['BodyMass_lb'].plot(figsize = (16,7));
plt.show()

In [None]:
null_weights = len(df[df['BodyMass_lb'] < 100])
total = len(df['BodyMass_lb'])
null_weights/total

We have a few issues to resolve. The biggest issue is the number of zero entries. Based on our knowledge of human weight fluctuation, we know it's impossible to weigh 0 pounds. More than likely, these are the dates when a weigh-in was never performed. We should convert these values to NaN so they show up as gaps in the graph instead of drops to zero.

In [None]:
#import numpy as np
#df[df['BodyMass_lb'] == 0]['BodyMass_lb'] = np.NaN 
#df['BodyMass_lb'].replace(0.0,np.NaN)
df.loc[df['BodyMass_lb'] == 0.0,'BodyMass_lb'] = np.nan

In [None]:
df['BodyMass_lb'].plot(figsize = (16,6));
plt.show()

In [None]:
# 388.5 lb is the series maximum and far outside the rest of the data; treat it as a missing entry
df.loc[df['BodyMass_lb'] == 388.5,'BodyMass_lb'] = np.nan

In [None]:
df['BodyMass_lb'].plot(figsize = (16,6));
plt.show()

This confirms that we don't have much weight data prior to late August. When we choose a start date, it might be prudent to consider Aug. 24th as the actual start date. I'm going to modify the data so there's nothing prior to it.

### Important Column inspection - Sleep

In [None]:
col_sleep = ['SleepAnalysis_AsleepDeep_hrs', 'SleepAnalysis_AsleepCore_hrs', 'SleepAnalysis_AsleepREM_hrs', 'SleepAnalysis_Awake_hrs', 'AppleExerciseTime_min']

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['SleepAnalysis_AsleepCore_hrs'], color='blue', label = 'Core')
plt.plot(df['SleepAnalysis_AsleepREM_hrs'], color='red', label = 'REM')
plt.plot(df['SleepAnalysis_AsleepDeep_hrs'], color='green', label = 'Deep')
plt.plot(df['SleepAnalysis_Awake_hrs'], color='yellow', label = 'Awake')
 
plt.title('Sleep_hrs')
plt.legend()
plt.show()

In [None]:
# Replace zero readings with each column's mean. Assigning the result back avoids
# inplace=True on a column selection, which may not update df in recent pandas versions.
for col in ['SleepAnalysis_AsleepDeep_hrs', 'SleepAnalysis_AsleepCore_hrs',
            'SleepAnalysis_AsleepREM_hrs', 'SleepAnalysis_Awake_hrs']:
    df[col] = df[col].replace(to_replace=0, value=df[col].mean())

In [None]:
df

In [None]:
df[col_sleep] = df[col_sleep].fillna(df[col_sleep].mean())

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['SleepAnalysis_AsleepDeep_hrs'], color='green')
plt.plot(df['SleepAnalysis_AsleepCore_hrs'], color='blue')
plt.plot(df['SleepAnalysis_AsleepREM_hrs'], color='red')
plt.plot(df['SleepAnalysis_Awake_hrs'], color='yellow')
 
plt.title('Sleep')
plt.show()

### Important Column inspection - Exercise

In [None]:
col_exercise = ['StepCount_count', 'DistanceWalkingRunning_mi', 'BasalEnergyBurned_Cal', 'ActiveEnergyBurned_Cal', 'FlightsClimbed_count']

In [None]:
df.drop(['StepCount_count', 'DistanceWalkingRunning_mi', 'FlightsClimbed_count'], axis = 1, inplace = True)

In [None]:
plt.rcParams['figure.figsize']=(15,7)

#plt.plot(df['StepCount_count'], color='green')
#plt.plot(df['DistanceWalkingRunning_mi'], color='blue')
plt.plot(df['BasalEnergyBurned_Cal'], color='red')
plt.plot(df['ActiveEnergyBurned_Cal'], color='yellow')
#plt.plot(df['FlightsClimbed_count'], color='yellow')
 
plt.title('Exercise')
plt.show()

In [None]:
#Cond_act = df['ActiveEnergyBurned_Cal'] < 250
#bas_act = df['BasalEnergyBurned_Cal'] < 250

df.loc[df['ActiveEnergyBurned_Cal'] < 250, 'ActiveEnergyBurned_Cal'] = df['ActiveEnergyBurned_Cal'].mean()
df.loc[df['BasalEnergyBurned_Cal'] < 1750, 'BasalEnergyBurned_Cal'] = df['BasalEnergyBurned_Cal'].mean()


In [None]:
plt.rcParams['figure.figsize']=(15,7)

#plt.plot(df['StepCount_count'], color='green')
#plt.plot(df['DistanceWalkingRunning_mi'], color='blue')
plt.plot(df['BasalEnergyBurned_Cal'], color='red', label = 'Basal')
plt.plot(df['ActiveEnergyBurned_Cal'], color='yellow', label = 'Active')
#plt.plot(df['FlightsClimbed_count'], color='yellow')
 
plt.title('Exercise Calories')
plt.legend()
plt.show()

### Important Column inspection - Dietary

In [None]:
df['DietaryCarbohydrates_g'].hist(figsize = (16,6), width = 25);
plt.show()

In [None]:
null_carbs = len(df[df['DietaryCarbohydrates_g'] == 0])
null_carbs/total

Okay, so we have a considerable number of 0 values here. 

Missing at Random (MAR): Data points are missing depending on observed values in other variables, but not on the missing values themselves. This is a more complex scenario, but imputation using observed data can still be effective.
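
One rough way to probe whether gaps depend on an observed variable is to compare the missing rate across groups of that variable. The sketch below uses made-up numbers, and the names `toy`, `carbs_g`, and `day_type` are hypothetical; it is only an illustration of the idea, not a formal test for MAR.

```python
import numpy as np
import pandas as pd

# Hypothetical toy log: two weeks of carbohydrate totals, with 0.0 marking an unlogged day.
toy = pd.DataFrame(
    {"carbs_g": [210.0, 0.0, 190.0, 220.0, 0.0, 0.0, 230.0,
                 205.0, 0.0, 215.0, 198.0, 0.0, 0.0, 240.0]},
    index=pd.date_range("2023-09-04", periods=14, freq="D"),  # starts on a Monday
)

# Treat zeros as missing, then flag them.
toy["carbs_g"] = toy["carbs_g"].replace(0.0, np.nan)
toy["is_missing"] = toy["carbs_g"].isna()

# An observed variable to condition on: weekday vs weekend (Saturday=5, Sunday=6).
toy["day_type"] = np.where(toy.index.dayofweek >= 5, "weekend", "weekday")

# Share of missing days within each group.
missing_rate = toy.groupby("day_type")["is_missing"].mean()
print(missing_rate)
```

If the rates differ a lot between groups, the missingness is related to something we observe, which is the situation where imputing from observed data is usually considered reasonable.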

After doing some previewing, I've determined that those 3 data points, whose carbs are under 75g, are also incomplete. This affects not just the carbohydrate data but all of the dietary information. So, we'll set all of the dietary information to NaN where the daily carbohydrates are less than 75g.


In [None]:
# flag days with under 75 g of carbohydrates as incomplete dietary logs
Nan_cond = df['DietaryCarbohydrates_g'] < 75.0

# blank out every dietary column on those days
col_dietary = [col for col in df.columns if "Dietary" in col]
df.loc[Nan_cond, col_dietary] = np.nan


#df.loc[df['DietaryCarbohydrates_g'] < 75.0,'DietaryCarbohydrates_g'] = np.nan

In [None]:
df.loc['2023-08-29':'2023-09-04','DietaryFatTotal_g':'DietaryProtein_g']

Let's figure out which columns we want to keep.

In [None]:
#let's plot our carbohydrates
df.groupby(['day'])['DietaryCarbohydrates_g'].plot(figsize = (13,8), subplots=False, legend=True);
plt.show()

In [None]:
#let's plot our carbohydrates
df['DietaryCarbohydrates_g'].plot(figsize = (13,8), subplots=False, legend=True);
plt.show()

Okay, so we have some gaps to fill. Let's look at our null data.

In [None]:
nul_carbs = pd.isnull(df['DietaryCarbohydrates_g'])
df[nul_carbs]['DietaryCarbohydrates_g']

A quick scan here shows that we have chunks of time series data missing. The best way to handle this, in my opinion, is to divide the data into subsets and disregard those stretches of missing data.

These chunks will be (08-23: 10-24), (10-31: 12-23), (01-01: 02-05), (02-24: 03-05). They were determined by finding "chunks" of both null and valid data, where a chunk of valid data contains no more than 3 consecutive days of null data. To fill these in, let's start by creating the chunks.
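
The chunk boundaries here were found by eye. As a rough sketch of how the same rule (no more than 3 consecutive null days inside a chunk) could be applied programmatically, the example below labels runs of nulls on a made-up series; the names `s`, `max_gap`, and `chunks` are hypothetical and the dates are not from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: NaN marks a day with no dietary log.
s = pd.Series(
    [200, np.nan, 180, np.nan, np.nan, np.nan, np.nan, 190, 210, np.nan, 205],
    index=pd.date_range("2023-10-20", periods=11, freq="D"),
)

is_null = s.isna()
# A new run starts whenever the null/valid state flips; cumsum gives each run an id.
run_id = is_null.astype(int).diff().ne(0).cumsum()
# Length of the run that each day belongs to.
run_len = is_null.groupby(run_id).transform("size")

# Null runs longer than max_gap days separate one chunk from the next.
max_gap = 3
is_break = is_null & (run_len > max_gap)
chunk_id = is_break.cumsum()[~is_break]

# Collect (first day, last day) for every chunk that survives.
chunks = [
    (days.index[0], days.index[-1])
    for _, days in s[~is_break].groupby(chunk_id)
]
for first, last in chunks:
    print(first.date(), "to", last.date())
```

Short gaps (like the single missing day on the second row) stay inside a chunk and can be imputed, while the four-day gap splits the series in two.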

In [None]:
SepOct = df['2023-08-23':'2023-10-24']
NovDec = df['2023-10-31':'2023-12-23']
Jan = df['2024-01-01':'2024-02-05']
FebMar = df['2024-02-24':'2024-03-05']

Okay, now let's fill in all of the dietary nulls with the column means.

In [None]:
df[col_dietary] = df[col_dietary].fillna(df[col_dietary].mean())

Okay, so this looks promising. We see a little flattening of the curve, but it doesn't mess with our data too much. Let's go ahead and plot this for all of our data, starting in Sept and Oct.

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['DietaryCarbohydrates_g'], color='green', label = 'Carbs')
plt.plot(df['DietaryProtein_g'], color='red', label = 'Protein')
plt.plot(df['DietaryFatTotal_g'], color='blue', label = 'Fats')
 
plt.title('Diet Macros (g)')
plt.legend()
plt.show()

### Focusing on Weight
What we really care about is weight, and the difference of weight.

#### Test for Stationarity
First, let's use a Dickey-Fuller test on our data to see if it is stationary. We're going to use the Dickey-Fuller test from the statsmodels library. This function does not permit null values, and we have some, so we'll have to fill in the missing data. To do this, we'll need to use a certain amount of synthetic data. For starters, let's just look at our data. 

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['BodyMass_lb'], color='green', label = 'BodyMass_lb')
 
plt.title('Weigh-In Data')
plt.legend()
plt.show()

In [None]:
# `option` is not an interpolate() parameter; linear (the pandas default) is named explicitly here
df['BodyMass_lb_inter'] = df['BodyMass_lb'].interpolate(method='linear')
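
To make the interpolation step concrete, here is a small sketch on made-up weigh-ins (the series `w` is hypothetical). With pandas' default linear method, a gap is filled with equal steps between the two known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical weigh-ins with a two-day gap.
w = pd.Series(
    [195.1, np.nan, np.nan, 192.1],
    index=pd.date_range("2023-08-25", periods=4, freq="D"),
)

# Linear interpolation: the 3.0 lb drop is spread evenly across the gap.
filled = w.interpolate(method="linear")
print(filled)
```

A spline fit is also available in pandas through `method='spline'` together with an `order` argument, and it relies on SciPy being installed; whether it would suit this data better is a judgment call.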

In [None]:
import matplotlib.pyplot as plt

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

In [None]:
import matplotlib

plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['BodyMass_lb_inter'], color='blue', label = 'Interpolated_Data')
plt.plot(df['BodyMass_lb'], color='blue', label = 'Actual')

#plt.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%b"))

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 18

plt.rc('axes', titlesize=SMALL_SIZE, labelsize=MEDIUM_SIZE)

plt.rc('font', size=MEDIUM_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE)     # fontsize of the axes title
#plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
#plt.rc('xtick', labelsize=BIGGER_SIZE)    # fontsize of the tick labels
#plt.rc('ytick', labelsize=BIGGER_SIZE)    # fontsize of the tick labels
#plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
#plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

#matplotlib.rc('font', size=BIGGER_SIZE)
#matplotlib.rc('axes', titlesize=BIGGER_SIZE)

plt.title('Daily Weigh-in Data')
plt.xlabel('Dates (Yr-Mo)')
plt.ylabel('Lbs')
#plt.legend()
plt.show()

In [None]:
# Move the interpolated weight up one row, so each date holds the NEXT
# morning's weigh-in. The last row has no following day and becomes NaN.
# shift(-1) does what the earlier row-by-row loop did, without needing np.NaN
# (an alias removed in NumPy 2.0).
df['BodyMass_lb_inter'] = df['BodyMass_lb_inter'].shift(-1)

In [None]:
df

In [None]:
#let's create the weight difference in a new column
df['BodyMass_lb_diff'] = df['BodyMass_lb_inter'].diff() 

In [None]:
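#provide the first differenced entry: next-day weight minus the first day's own weigh-in
first_day = df.index[0]
df.loc[first_day, 'BodyMass_lb_diff'] = df.loc[first_day, 'BodyMass_lb_inter'] - df.loc[first_day, 'BodyMass_lb']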
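
The lead-then-difference steps can be checked on a toy series. In this sketch (the names `w`, `next_day`, and `change` are hypothetical, and the four values are the first weigh-ins from the preview), each date ends up holding the change from that morning to the following morning.

```python
import pandas as pd

w = pd.Series(
    [196.9, 195.1, 195.1, 192.9],
    index=pd.date_range("2023-08-24", periods=4, freq="D"),
    name="weight",
)

next_day = w.shift(-1)      # each date now holds the following morning's weight
change = next_day.diff()    # difference between consecutive next-day weights
# The first row has no predecessor to diff against, so fill it in directly.
change.iloc[0] = next_day.iloc[0] - w.iloc[0]

print(pd.DataFrame({"weight": w, "next_day": next_day, "change": change}))
```

The last row stays NaN because there is no following weigh-in, which is why that row is dropped before modeling.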

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(df['BodyMass_lb_diff'], color='blue', label = 'Weight Diff')

plt.title('Weigh-In Data Difference (lbs)')
plt.legend()
plt.show()

In [None]:
new_df = df.dropna(subset=['BodyMass_lb_diff'])

In [None]:
#SepOct = df['2023-08-23': '2023-10-24']
#NovDec = df['2023-10-31': '2023-12-23']
#Jan = df['2024-01-01': '2024-02-05']
#FebMar = df['2024-02-24': '2024-03-05']

#new_df = pd.concat([SepOct, NovDec, Jan, FebMar])


In [None]:
new_df

In [None]:
new_df['BodyMass_lb_diff'].isna().sum()

Now that we have differenced the data and have no nulls, let's go ahead and run the Dickey-Fuller test.

In [None]:
dftest = adfuller(new_df['BodyMass_lb_diff'])

In [None]:
 # Print Dickey-Fuller test results
print('Results of Dickey-Fuller Test: \n')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 
                                             '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
    

This looks good. The differenced data appears stationary. Let's see how the autocorrelation plots look.

In [None]:
plt.figure(figsize=(12,5))
pd.plotting.autocorrelation_plot(new_df['BodyMass_lb_inter']);


In [None]:
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 14, 5

plot_pacf(new_df['BodyMass_lb_inter'], lags=50);

Both plots look pretty stationary, so that's great. Also, both plots trail off with time. This is a good sign that this series is a good candidate for AutoRegressive (AR) and Moving Average (MA) analysis.

#### ARMA Analysis
We've confirmed our data is stationary. We observed the PACF and ACF plots and saw that both trail off with time. This means our weigh-in data is a good candidate for both AR and MA. To do this, we're going to use our original weigh-in data. There's plenty of missing data, but luckily ARIMA works with missing data. We also know that first-order differencing made our data stationary, so we can jump straight to that when we check for ARMA.

In [None]:
new_df['BodyMass_lb_inter'].isna().sum()

In [None]:
# split into train and test sets by date (train through Dec. 23, test from Dec. 24 on)
#SepOct = new_df['2023-08-25': '2023-10-24']
#NovDec = new_df['2023-10-31': '2023-12-23']
#Jan = new_df['2024-01-01': '2024-02-05']
#FebMar = new_df['2024-02-24': '2024-03-05']

train = new_df['2023-08-25': '2023-12-23']['BodyMass_lb_inter']
test = new_df['2023-12-24': '2024-03-05']['BodyMass_lb_inter']

train_len = len(train)
test_len = len(test)

# walk-forward validation


In [None]:
pip install pmdarima

In [None]:
from pmdarima import auto_arima

model = auto_arima(new_df['BodyMass_lb_inter'], seasonal=False, m=0, stepwise=True)

# Get the best ARIMA model
print(model.summary())

So... we run auto_arima and find that the best-scoring ARIMA arrangement is first order on both the moving average and the autoregressive term. This makes sense: we already determined that the differenced data was stationary, and it appears that both the AR and MA terms matter. To run auto_arima we had to use some synthetic (interpolated) data, but we can also do a manual check and compare the interpolated data with the actual data.
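
For reference, an ARIMA(1,1,1) model without a constant can be written in its standard textbook form. The symbols are notation introduced here rather than taken from the notebook: $y_t$ is the weight on day $t$, $\phi$ is the autoregressive coefficient, $\theta$ is the moving-average coefficient, and $\varepsilon_t$ is white noise.

```latex
\Delta y_t = y_t - y_{t-1}, \qquad
\Delta y_t = \phi \, \Delta y_{t-1} + \varepsilon_t + \theta \, \varepsilon_{t-1}
```

Read this way, the AR term carries part of yesterday's change into today, while the MA term adjusts for yesterday's surprise.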

In [None]:
# Import ARIMA
from statsmodels.tsa.arima.model import ARIMA
import statsmodels.api as sm

# Instantiate an ARIMA(1,1,1) model on the raw (uninterpolated) weigh-ins, as described in the text
mod_arma_raw = ARIMA(new_df['BodyMass_lb'], order=(1,1,1))

In [None]:
# Fit the model to data
res_arma_raw = mod_arma_raw.fit()

In [None]:
# Print out summary information on the fit
print(res_arma_raw.summary())

So it appears we got a significantly more accurate model, which also dropped the y-intercept term. We also have statistically significant coefficients.

In [None]:
# Instantiate an ARIMA(1,1,1) model on the interpolated weigh-in data
mod_arma_inter = ARIMA(new_df['BodyMass_lb_inter'], order=(1,1,1))

# Fit the model to data
res_arma_inter = mod_arma_inter.fit()

# Print out summary information on the fit
print(res_arma_inter.summary())

Interesting. On our first check we got an AIC/BIC in the high 400s, but coefficients with high statistical confidence. This was using the raw, uninterpolated data. Let's see how this looks with the interpolated data.

In [None]:
arma_raw_resid = pd.Series(res_arma_raw.resid)
arma_raw_resid.drop('2023-08-24', axis = 0, inplace = True)

arma_inter_resid = pd.Series(res_arma_inter.resid)
arma_inter_resid.drop('2023-08-24', axis = 0, inplace = True)

In [None]:
arma_inter_resid

In [None]:
plt.rcParams['figure.figsize']=(15,7)

plt.plot(arma_raw_resid, color='blue', label = 'Predictions - Actual Differences')
plt.plot(arma_inter_resid, color='red', label = 'Predictions - Interpolated Differences')
plt.plot(new_df['BodyMass_lb'].diff(), color='green', label = 'Weight Diff')

plt.title('Actual vs Predicted Weigh-In Data (lbs)')
plt.legend()
plt.show()

In [None]:
arma_inter_resid.mean()

So... we can see that the predictions with interpolated differences do a good job of sticking to the general peaks. We don't visually see much of a drop-off in accuracy, even though our model tells us otherwise. I'm inclined to use the residuals from the interpolated data as our error.

But what does this mean about our weight data? It means that our body both encourages and fights whatever weight difference we experienced. It's almost as if one part of our metabolism wants to continue a trend and another part is trying to course correct. Anecdotally, there are stories about how really in-shape super athletes will metabolize excess carbs/fat as opposed to storing them as fat. It's almost as if the body knows our behavior and wants to continue it. Why store fat on an athlete that is in burn mode? Let's just store it as glycogen or get rid of it. On the course correction side, perhaps there's a mechanism in our body that's continually trying to use/store/release all of the calories that our body consumes, perhaps through very small corrections. As if the body says, "Well, I thought I was going to burn X amount of calories, but I only burned Y. So tomorrow, I'll slow down and metabolize less." The course correction could also swing the other way: if there is too much weight gain, the body could metabolize more.

But how does that account for weight loss? That's where our trend of the moving average comes in (also, the drift in the random walk model). Herein lies a conundrum: our original weight difference numbers passed the test for stationarity, but there is a slight trend in the data (-0.12). That is, on average, approximately 0.12 lbs of weight loss per day. It's small enough to not throw off stationarity, yet large enough to lose more than 15 lbs in 6 months.
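
As a sanity check on the size of that trend, the average daily change can be estimated from just the first and last weigh-ins shown in the data preview (196.9 lb on 2023-08-24 and 174.2 lb on 2024-03-05). This is back-of-the-envelope arithmetic, not the model's drift estimate, and the variable names are made up for the example.

```python
from datetime import date

# First and last weigh-ins shown in the data preview.
start_day, start_lb = date(2023, 8, 24), 196.9
end_day, end_lb = date(2024, 3, 5), 174.2

days = (end_day - start_day).days
total_change = end_lb - start_lb
avg_daily_change = total_change / days

print(f"{days} days, {total_change:.1f} lb total, {avg_daily_change:.3f} lb/day")
```

The endpoint estimate lands close to the 0.12 lb per day quoted above, and it shows how a tiny daily average adds up over roughly six months.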

Okay, so, back to our diet information. We now have two separate errors from which we can try to predict some noise. First, we have our original weight change from day to day. Second, we have the residuals from our ARMA model's predictions. So... let's do it.

### Feature Engineering

So, now that we have scrubbed our data, we're going to create a few different target variables, all in the name of weight loss. The first will be just the difference in our weight from day to day. The second will be the residuals from our ARMA predictions.

In [None]:
#let's separate our target and feature columns.
#df['diff_inter'] = arma_inter_resid
df['BodyMass_lb_raw'] = df['BodyMass_lb']

df.drop('BodyMass_lb', axis = 1, inplace = True)
df.drop('day', axis = 1, inplace = True)

The interpolated weight has already been moved up an index, so let's drop the NaNs on the last row.

In [None]:
#drop NAs from the last row
df = df.dropna(subset=['BodyMass_lb_diff'])

We can also use this time to make a category to determine if weight loss occurred. This is relatively simple. Let's call it weight_loss, and we'll give it a 1 if there was weight loss, and 0 if there wasn't. Note that with the 0.01 lb cutoff used below, a day with no change (0 lbs) is labeled the same as weight loss.

In [None]:
df['weight_loss'] = df['BodyMass_lb_diff'] < 0.01
df['weight_loss'] = df['weight_loss'].astype(int)
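
How a zero-change day gets labeled depends on the cutoff. The toy check below compares the `< 0.01` comparison used in the cell above with a strict `< 0` alternative; the four values and the names `diff`, `loss_cutoff_001`, and `loss_strict` are made up for illustration.

```python
import pandas as pd

diff = pd.Series([-1.8, 0.0, 0.4, -0.2], name="BodyMass_lb_diff")

loss_cutoff_001 = (diff < 0.01).astype(int)  # cutoff used in the cell above
loss_strict = (diff < 0).astype(int)         # alternative: only a real decrease counts

print(pd.DataFrame({"diff": diff, "lt_0.01": loss_cutoff_001, "lt_0": loss_strict}))
```

Only the second row differs: a flat day is a 1 under the 0.01 cutoff and a 0 under the strict one, so the choice matters most when interpolation produces runs of identical weights.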

In [None]:
weight_days = pd.DataFrame(df[df['weight_loss'] == 1]['weight_loss'].resample('M').count())
weight_days['weight_gain'] = df[df['weight_loss'] == 0]['weight_loss'].resample('M').count()
weight_days.reset_index(inplace = True)

In [None]:
#weight_days['date'] = weight_days['date'].dt.month
weight_days['date'] = weight_days['date'].dt.month_name().str[:3]

In [None]:
weight_days.set_index('date', inplace = True)

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax = weight_days['weight_gain'].plot.bar(color='black', label = 'Weight Gain Days')
ax = weight_days['weight_loss'].plot.bar(bottom = weight_days['weight_gain'], color ='red', label = 'Weight Loss Days')

ax.set_title('Morning Weigh-In Count')
ax.set_xlabel('Month')
ax.set_ylabel('Days')
ax.legend()
plt.xticks(rotation=None)

In [None]:
# Specify the values of the black bars (height)
weight_gain = weight_days['weight_gain']

# Specify the values of the red bars (height)
weight_loss = weight_days['weight_loss']

# Position of bars on x-axis
ind = np.arange(len(weight_days['weight_gain']))

# Figure size
plt.figure(figsize=(10,5))

# Width of a bar 
width = 0.3       

# Plotting
plt.bar(ind, weight_loss, width, label='Weight Loss Days', color = 'red')
plt.bar(ind + width, weight_gain, width, label='Weight Gain Days', color = 'black')

plt.xlabel('Months')
plt.ylabel('Days')
plt.title('Morning Weigh-In Days by Month')

# xticks()
# First argument - A list of positions at which ticks should be placed
# Second argument -  A list of labels to place at the given locations
plt.xticks(ind + width/2, weight_days.index)

# Finding the best position for legends and putting it
plt.legend(loc='best')
plt.show()

In [None]:
weight_days.loc['totals'] = [weight_days['weight_loss'].sum(), weight_days['weight_gain'].sum()]
weight_days

### PCA
Now that we have all of these feature variables and believe we're in good shape, let's figure out whether we can reduce them to a smaller set of components.

In [None]:
df

In [None]:
targets = df.loc[:,'BodyMass_lb_inter':'weight_loss']
features = df.loc[:,'BasalEnergyBurned_Cal':'SleepAnalysis_Awake_hrs']

In [None]:
targets['BodyMass_lb_diff']

Whoa! That is pretty good. In fact, it's only where there's no data that we see these big gaps.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler() 
features_minmax = scaler_minmax.fit_transform(features)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_std = StandardScaler() 
features_std = pd.DataFrame(scaler_std.fit_transform(features), columns = features.columns)

In [None]:
# Check how strongly the standardized features correlate with one another
#import seaborn as sns
corr_check = features_std.corr()
corr_check

In [None]:
# Compare how much variance 12, 24, and 36 principal components retain
from sklearn.decomposition import PCA

pca_1 = PCA(n_components=12)
pca_2 = PCA(n_components=24)
pca_3 = PCA(n_components=36)

# Only the fitted variance ratios are needed here, so fit without keeping the transformed data
pca_1.fit(features_std)
pca_2.fit(features_std)
pca_3.fit(features_std)

print(np.sum(pca_1.explained_variance_ratio_))
print(np.sum(pca_2.explained_variance_ratio_))
print(np.sum(pca_3.explained_variance_ratio_))


Wow, okay, so we can retain about 80% of the variance in our data with 12 components, down from 45. At the same time, there's a lot of correlation (the heat in our correlation map). It's probably a good time to delve into the data a bit more. Previously, we divided our data into dietary, exercise, and sleep. It turns out we may need to create further subsets. Let's start with dietary.
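Rather than guessing at 12, 24, or 36 components, we could read the answer off the cumulative explained variance. Here is a minimal sketch, assuming a synthetic feature matrix in place of the real one; `components_for_variance` and `demo_features` are hypothetical names that don't appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, threshold=0.80):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`, plus the full curve."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_std)  # keep every component so we can inspect the whole curve
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # first position where the running total reaches the threshold
    n_components = min(int(np.searchsorted(cumulative, threshold)) + 1, len(cumulative))
    return n_components, cumulative


# Synthetic stand-in for the real features: 10 columns driven by 3 hidden signals
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
demo_features = signals @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

n_needed, curve = components_for_variance(demo_features, threshold=0.80)
print(n_needed)
print(curve.round(3))
```

On the real data, the same call with `features` in place of `demo_features` would tell us how many components it takes to cross the 80% mark.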

For dietary information, it's useful to think in levels. It starts with Level 1, `DietaryEnergyConsumed_Cal`. From there we go to Level 2, the macronutrients: `DietaryFatTotal_g`, `DietaryCarbohydrates_g`, and `DietaryProtein_g`. Fortunately for us, we also have what I call Level 3: sub-macronutrients, still measured in grams, which include things like `DietarySugar_g` (a carbohydrate) and `DietaryFatSaturated_g` (a fat). Going further, we have micronutrients, or Level 4, measured in milligrams (or even micrograms), such as `DietarySodium_mg` and `DietaryCholesterol_mg`.

It's the same with sleep. There, we have Level 2 data (REM, Core, and Deep), plus awake hours. Level 1 data, if we wanted it, would be the total hours of sleep we got. So, if we chose to include only Level 1 diet data in our analysis, it would be better to be consistent with sleep as well. The same goes for exercise. We have basal and active calories, or Level 2, and we have exercise minutes. Exercise minutes are really a corollary category of workouts.

There's a big correlative overlap between Level 1, 2, & 3. So, we have to make a decision on what we want to include. Given where we are, let's start with Level 1 and go from there.
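To make the overlap between levels concrete, we could list the column pairs that correlate strongly. This is only a sketch on made-up numbers: `correlated_pairs` is a hypothetical helper, and the demo frame borrows this notebook's column names but not its data.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Made-up data: a Level 1 total built from its two Level 2 parts, plus unrelated sleep
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "BasalEnergyBurned_Cal": rng.normal(1800, 50, 200),
    "ActiveEnergyBurned_Cal": rng.normal(600, 150, 200),
    "SleepAnalysis_AsleepTotal_hrs": rng.normal(7, 1, 200),
})
demo["TotalEnergyBurned_Cal"] = demo["BasalEnergyBurned_Cal"] + demo["ActiveEnergyBurned_Cal"]

flagged = correlated_pairs(demo, threshold=0.8)
print(flagged)
```

In this toy frame the total tracks active calories almost perfectly, which is the kind of redundancy that argues for picking one level at a time.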

To do that, let's create these categories of sub-data. For sleep, we'll have to feature engineer to add it.

In [None]:
#let's add totals for sleep and energy burned
df['SleepAnalysis_AsleepTotal_hrs'] = df['SleepAnalysis_AsleepDeep_hrs'] + df['SleepAnalysis_AsleepCore_hrs'] + df['SleepAnalysis_AsleepREM_hrs']
df['TotalEnergyBurned_Cal'] = df['BasalEnergyBurned_Cal'] + df['ActiveEnergyBurned_Cal']


In [None]:
#combine all 3 - Level 1 
level_1 = ['DietaryEnergyConsumed_Cal', 'TotalEnergyBurned_Cal', 'SleepAnalysis_AsleepTotal_hrs']
level_1_diet = ['DietaryEnergyConsumed_Cal']
level_1_exer = ['TotalEnergyBurned_Cal']
level_1_sleep = ['SleepAnalysis_AsleepTotal_hrs']

feature_1 = df[level_1]

In [None]:
feature_1.corr()

In [None]:
#combine - Level 2
level_2_diet = ['DietaryFatTotal_g', 'DietaryProtein_g', 'DietaryCarbohydrates_g']
level_2_exer = ['BasalEnergyBurned_Cal','ActiveEnergyBurned_Cal']
level_2_sleep = ['SleepAnalysis_AsleepDeep_hrs','SleepAnalysis_AsleepCore_hrs','SleepAnalysis_AsleepREM_hrs', 'SleepAnalysis_Awake_hrs']
level_2 = level_2_diet + level_2_exer + level_2_sleep
feature_2 = df[level_2]

In [None]:
df[level_2_diet].corr()

In [None]:
df[level_2_exer].corr()

In [None]:
#feature engineering - let's create some of the categories for dietary 3
df['DietaryCarbsResidual_g'] = df['DietaryCarbohydrates_g'] - df['DietarySugar_g'] - df['DietaryFiber_g'] 
df['DietaryFatsResidual_g'] = df['DietaryFatTotal_g'] - df['DietaryFatMonounsaturated_g'] -  df['DietaryFatPolyunsaturated_g'] - df['DietaryFatSaturated_g'] 

#let's aggregate the level 3 dietary information
level_3_diet_carbs = ['DietaryCarbsResidual_g', 'DietarySugar_g', 'DietaryFiber_g']
level_3_diet_fat = ['DietaryFatsResidual_g', 'DietaryFatMonounsaturated_g', 'DietaryFatPolyunsaturated_g', 'DietaryFatSaturated_g']
level_3_diet_protein = ['DietaryProtein_g']
level_3_diet = level_3_diet_carbs + level_3_diet_fat + level_3_diet_protein

#combine - Level 3, please note, there is no level 3 for sleep and exercise, we will reuse level 2 info there
level_3 = level_3_diet + level_2_exer + level_2_sleep
feature_3 = df[level_3]

In [None]:
df[level_3_diet].corr()

In [None]:
#now, let's scale the data and redo our correlation matrix; we'll use both minmax and standard scaling for reference
scaler_minmax = MinMaxScaler() 
feature_1_minmax = pd.DataFrame(scaler_minmax.fit_transform(feature_1), columns = feature_1.columns)
feature_1_minmax['date'] = targets['BodyMass_lb_diff'].index
feature_1_minmax = feature_1_minmax.set_index('date')

scaler_std = StandardScaler() 
feature_1_std = pd.DataFrame(scaler_std.fit_transform(feature_1), columns = feature_1.columns)
feature_1_std['date'] = targets['BodyMass_lb_diff'].index
feature_1_std = feature_1_std.set_index('date')


In [None]:
#import seaborn as sns
feature_1_std.corr()

In [None]:
scaler_std = StandardScaler() 
feature_2_std = pd.DataFrame(scaler_std.fit_transform(feature_2), columns = feature_2.columns)
feature_2_std['date'] = targets['BodyMass_lb_diff'].index
feature_2_std = feature_2_std.set_index('date')

In [None]:
feature_2_std.corr()

Okay, so we solved our correlation and components problem... simply by applying domain knowledge and feature engineering. Now, we can run some analysis here.

So... which analysis should we use first? The obvious choice is linear regression. Before we dive in, we should be aware of something in our protocol. The weigh-ins occurred first thing every morning, and they are recorded as weights for that day. But, much like sleep, the weight recorded that morning is really a reflection of the previous day's activities. Put another way, the weight recorded on, say, October 17th has nothing to do with the food, exercise, and sleep on October 17th, even though the data shows them on the same row. It's more accurate to treat the weigh-in on October 17th as the result of behaviors on October 16th. We'll make a new column called "Lagged Weight".
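One way to build that lagged column is with pandas' `shift`. A minimal sketch on a hypothetical four-day log follows; the values are invented, and `BodyMass_lb_diff_lagged` is an illustrative column name, not one that exists in the real DataFrame.

```python
import pandas as pd

# Hypothetical four-day log; the column names mirror the ones used in this notebook
dates = pd.date_range("2023-10-15", periods=4, freq="D", name="date")
log = pd.DataFrame(
    {
        "DietaryEnergyConsumed_Cal": [2100.0, 1800.0, 2500.0, 1900.0],
        "BodyMass_lb_diff": [0.0, 0.4, -0.6, 0.8],
    },
    index=dates,
)

# shift(-1) pulls the next morning's change up onto the day whose behavior produced it
log["BodyMass_lb_diff_lagged"] = log["BodyMass_lb_diff"].shift(-1)

# the final day has no "next morning" yet, so it drops out
aligned = log.dropna(subset=["BodyMass_lb_diff_lagged"])
print(aligned)
```

After the shift, the October 16th row carries the change measured on the morning of October 17th, so features and target describe the same day of behavior.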

### LINEAR REGRESSION

In [None]:
#specify X and y for the Level 1 features
X = feature_1_std
y = targets['BodyMass_lb_diff']

In [None]:
#create model
level_1_model = sm.OLS(y, sm.add_constant(X))
level_1_results = level_1_model.fit()

#print results
print(level_1_results.summary())

In [None]:
feature_1_rolling_2 = feature_1.rolling(2).sum().drop(['2023-08-24'], axis = 0)
y.drop('2023-08-24', axis = 0, inplace = True)

In [None]:
#specify X and y; the first entry was dropped because the 2-day rolling sum is NaN there
X = feature_1_rolling_2

#create model
level_1_model_minmax = sm.OLS(y, sm.add_constant(X))
level_1__minmax_results = level_1_model_minmax.fit()

#print results
print(level_1__minmax_results.summary())

Okay, our model got worse! We have a slightly higher AIC/BIC, and one moving-average variable that is basically 0. And none of our coefficients are statistically significant.
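The rolling-sum cells drop the leading dates by hand. A sketch of a more general approach is below, assuming we let `dropna` find the incomplete windows and then align the target by index; `rolling_features` is a hypothetical helper and the five-day frame is made up.

```python
import pandas as pd


def rolling_features(features, target, window):
    """Sum each feature over `window` days and keep only the dates with a full window."""
    rolled = features.rolling(window).sum().dropna()  # the first window-1 rows are NaN
    return rolled, target.loc[rolled.index]


# Made-up five-day example starting on the same date as the real data
dates = pd.date_range("2023-08-24", periods=5, freq="D", name="date")
demo_X = pd.DataFrame(
    {"DietaryEnergyConsumed_Cal": [2000.0, 2100.0, 1900.0, 2200.0, 1800.0]}, index=dates
)
demo_y = pd.Series([0.2, -0.4, 0.1, -0.3, 0.5], index=dates, name="BodyMass_lb_diff")

X_roll, y_roll = rolling_features(demo_X, demo_y, window=3)
print(X_roll)
print(y_roll)
```

This avoids hard-coding dates like `'2023-08-24'` and keeps `X` and `y` on the same index regardless of the window size.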

In [None]:
feature_1_rolling_3 = feature_1.rolling(3).sum().drop(['2023-08-24', '2023-08-25'], axis = 0)
y.drop('2023-08-25', axis = 0,inplace = True)

In [None]:
#specify X and y; the first two entries were dropped because the 3-day rolling sum is NaN there
X = feature_1_rolling_3

#create model
level_1_model_minmax = sm.OLS(y, sm.add_constant(X))
level_1__minmax_results = level_1_model_minmax.fit()

#print results
print(level_1__minmax_results.summary())

### Linear Regression with Deep Learning

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [None]:
pip install tensorflow_addons

In [None]:
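# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(feature_1, targets['BodyMass_lb_diff'], random_state = 243, test_size = .25)

# Split the training data again to hold out a validation set
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, random_state = 243, test_size = .25)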

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train_final)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_1 = Sequential()

#we'll start with 12 neurons and an input shape of 3 (the Level 1 features)
model_1.add(Dense(12, activation='linear', input_shape=(3,)))
model_1.add(Dense(8, activation='linear'))
model_1.add(Dense(4, activation='linear'))

#output regression layer
model_1.add(Dense(1, activation='linear'))

In [None]:
from keras import optimizers
from tensorflow_addons.metrics import RSquare
# Compile the model; accuracy is not meaningful for regression, so track R-squared instead
#metric = keras.metrics.R2Score()
model_1.compile(loss='mse', optimizer=optimizers.RMSprop(learning_rate=0.001), metrics=[RSquare()])

In [None]:
#fit model
results_1  = model_1.fit(scaled_data_train,
                   y_train_final,
                    epochs=100,
                    validation_data=(scaled_data_val, y_val))

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, targets['BodyMass_lb_diff'], random_state = 243, test_size = .25)

# Split the data
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, random_state = 243, test_size = .25)

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train_final)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_1 = Sequential()

#we'll start with 12 neurons and an input shape of 9 (the Level 2 features)
model_1.add(Dense(12, activation='linear', input_shape=(9,)))
model_1.add(Dense(8, activation='linear'))
model_1.add(Dense(4, activation='linear'))

#output regression layer
model_1.add(Dense(1, activation='linear'))

In [None]:
model_1.compile(loss='mse', optimizer='rmsprop', metrics=[RSquare()])

In [None]:
#fit model
results_1  = model_1.fit(scaled_data_train,
                   y_train_final,
                    epochs=100,
                    validation_data=(scaled_data_val, y_val))

This seems to be getting worse. It may be that we don't have enough data to establish a link with linear regression. Let's try the Level 3 features, and then test binary classification before we go further, since it lets us use alternative algorithms.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, targets['BodyMass_lb_diff'], random_state = 243, test_size = .25)

# Split the data
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, random_state = 243, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train_final)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_1 = Sequential()

#we'll start with 12 neurons and an input shape of 14 (the Level 3 features)
model_1.add(Dense(12, activation='linear', input_shape=(14,)))
model_1.add(Dense(8, activation='linear'))
model_1.add(Dense(4, activation='linear'))

#output regression layer
model_1.add(Dense(1, activation='linear'))

In [None]:
# accuracy is not meaningful for regression, so track R-squared instead
model_1.compile(loss='mse', optimizer=optimizers.RMSprop(learning_rate=0.001), metrics=[RSquare()])

In [None]:
#fit model
results_1  = model_1.fit(scaled_data_train,
                   y_train_final,
                    epochs=100,
                    validation_data=(scaled_data_val, y_val))

### Binomial Classification

### Baseline Model
As we mentioned earlier, let's see what a model would produce by picking weight loss days at random.

In [None]:
#import random module
import random

#initialize baseline dataframe from the weight loss column
baseline = pd.DataFrame(targets['weight_loss'])

#create a predictions column that randomly chooses 0 or 1 (randint's upper bound is exclusive)
baseline['Predictions'] = np.random.randint(0, 2, len(baseline))

#create another column which determines which are correct
baseline['Correct?'] = (baseline['weight_loss'] == baseline['Predictions'])

#count the true and false answers
baseline['Correct?'].value_counts(normalize=True)

In [None]:
aggs = df.groupby('weight_loss').agg(['mean', 'std'])
aggs

Okay, that's not surprising. This is basically a 50-50 model, with a few more weight_loss days than weight_gain days. Our baseline was about 48% accurate.
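A coin flip isn't the only baseline worth beating. Always guessing the more common class is often a tougher bar when the classes are lopsided. Here is a sketch using scikit-learn's `DummyClassifier`; the 52/48 split below is a hypothetical stand-in, not the actual class counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 52 loss days (1) and 48 gain days (0)
y_demo = np.array([1] * 52 + [0] * 48)
X_demo = np.zeros((len(y_demo), 1))  # dummy classifiers ignore the features

# "most_frequent" always predicts the majority class; "uniform" flips a fair coin
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
coin_flip = DummyClassifier(strategy="uniform", random_state=0).fit(X_demo, y_demo)

majority_acc = accuracy_score(y_demo, majority.predict(X_demo))
coin_acc = accuracy_score(y_demo, coin_flip.predict(X_demo))
print(majority_acc, coin_acc)
```

Whichever of the two scores higher on the real labels is the number our classifiers need to clear.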

### K-Nearest Neighbors (KNN)
So, let's run through some of the standard algorithms for each level of features.

#### Level 1 Features

In [None]:
# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns = feature_1.columns)

In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

In [None]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Print precision, recall, accuracy, and F1 for a set of predictions
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
    
print_metrics(y_test, test_preds)

In [None]:
X_transformed = scaler.fit_transform(feature_1)

In [None]:
from sklearn.model_selection import cross_val_score

scores_1 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=10) #10 fold cross validation
scores_1.mean()

So... Our level one cross-validated accuracy was 63%
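One caveat with the cross-validation above: scaling the full feature matrix first lets every fold see statistics from its own held-out rows. A common remedy is to put the scaler inside a pipeline so it is re-fit on each training fold. This is a sketch on synthetic data from `make_classification`, standing in for the three Level 1 features; `knn_pipeline` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three Level 1 features and the weight_loss label
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# The scaler is fit on the training folds only, inside each cross-validation split
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))
scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=10)
print(scores.mean())
```

With the real data, passing `feature_1` and `df['weight_loss']` straight into `cross_val_score` with this pipeline would replace the separate `fit_transform` step.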

#### Level 2 Features

In [None]:
# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
X_train

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns = feature_2.columns)
scaled_df_train.head()

In [None]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))
find_best_k(scaled_data_train, y_train, scaled_data_test, y_test)

In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 21)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

In [None]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Print precision, recall, accuracy, and F1 for a set of predictions
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
    
print_metrics(y_test, test_preds)

Okay, it looks like we have some decent accuracy right out of the gate. Let's check it with cross-validation.

In [None]:
X_transformed = scaler.fit_transform(feature_2)

In [None]:
from sklearn.model_selection import cross_val_score

scores_2 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=10) #10 fold cross validation
scores_2.mean()

So, we got an accuracy of 60%, which is lower than our previous one.
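Part of the gap between the test-set score and the cross-validated score may come from picking `k` by looking at the test set. A sketch of one alternative, assuming we tune `k` with cross-validation on the training data only and touch the test set once at the end. The data is synthetic, and `search` and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine Level 2 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": list(range(1, 26, 2))}  # odd k from 1 to 25

# k is chosen by 5-fold cross-validated F1 on the training data alone
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_f1 = search.score(X_test, y_test)  # the test set is used once, after tuning
print(best_k, test_f1)
```

This searches the same odd values of `k` as `find_best_k`, but the test score it reports is no longer part of the selection.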

#### Level 3 Features

In [None]:
# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

In [None]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))
find_best_k(scaled_data_train, y_train, scaled_data_test, y_test)

In [None]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

In [None]:
X_transformed = scaler.fit_transform(feature_3)
scores_3 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=10) #10 fold cross validation
scores_3.mean()

56% accurate on test data. So, it turns out we were most accurate with our high-level Level 1 data.

### Logistic Regression
#### Level 1 Features

In [None]:
# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, C=1E12, solver='lbfgs')
model_log = logreg.fit(scaled_data_train, y_train)
model_log

In [None]:
X_transformed = scaler.fit_transform(feature_1)

In [None]:
scores_1 = cross_val_score(logreg, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_1.mean()

70% Accurate on cross-validated data. That's pretty good.

### Logistic Regression
#### Level 2 Features

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

In [None]:
logreg = LogisticRegression(fit_intercept=False, C=1E12, solver='lbfgs')
model_log = logreg.fit(scaled_data_train, y_train)
model_log

In [None]:
X_transformed = scaler.fit_transform(feature_2)

In [None]:
scores_2 = cross_val_score(logreg, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_2.mean()

Nearly 61% accurate when cross-validated. So far, Logistic Regression with Level 1 features is giving us our best results.

#### Level 3 Features

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

In [None]:
logreg = LogisticRegression(fit_intercept=False, C=1E12, solver='lbfgs')
model_log = logreg.fit(scaled_data_train, y_train)
model_log

In [None]:
X_transformed = scaler.fit_transform(feature_3)

In [None]:
scores_3 = cross_val_score(logreg, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_3.mean()

So, Level 3 was 64% accurate. This is better than Level 2, though still short of Level 1.

Okay, now let's try a decision tree.

### Decision Tree - Level 1

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier()

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]
}

gs_tree = GridSearchCV(clf, param_grid, cv=20)
gs_tree.fit(X_train, y_train)

gs_tree.best_params_

In [None]:
# Instantiate Decision Tree Classifier
clf = DecisionTreeClassifier (criterion='gini', max_depth = 1, min_samples_split = 5, random_state = 42)

# Fit the classifier
clf.fit(X_train, y_train)

# Predict on the test set
test_preds = clf.predict(X_test)

In [15]:
from sklearn.model_selection import cross_val_score

scores_1 = cross_val_score(clf, feature_1, df['weight_loss'], cv=20) #20 fold cross validation
scores_1.mean()

In [None]:
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(clf)

Okay, we got an accuracy of 68%. This is not bad on the Level 1 data.

### Decision Tree - Level 2

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 142, test_size = .25)

In [None]:
clf = DecisionTreeClassifier()

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]
}

gs_tree = GridSearchCV(clf, param_grid, cv=20)
gs_tree.fit(X_train, y_train)

gs_tree.best_params_

In [None]:
# Instantiate Decision Tree Classifier
clf = DecisionTreeClassifier (criterion='gini', max_depth = 1, min_samples_split = 5, random_state = 142)

# Fit the classifier
clf.fit(X_train, y_train)

# Predict on the test set
test_preds = clf.predict(X_test)

In [None]:
scores_2 = cross_val_score(clf, feature_2, df['weight_loss'], cv=20) #20 fold cross validation
scores_2.mean()

In [None]:
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(clf)

In [None]:
from sklearn import tree

fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (3,3), dpi=300)
tree.plot_tree(clf,
               feature_names = X_train.columns, 
               class_names=np.unique(df['weight_loss']).astype('str'),
               filled = True)
plt.show()

Improvement! We've matched our best score, with 70% accuracy on our model. Let's see how Level 3 does.

### Decision Tree - Level 3

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 142, test_size = .25)

In [None]:
clf = DecisionTreeClassifier()

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]
}

gs_tree = GridSearchCV(clf, param_grid, cv=20)
gs_tree.fit(X_train, y_train)

gs_tree.best_params_

In [None]:
# Instantiate Decision Tree Classifier
clf = DecisionTreeClassifier (criterion='gini', max_depth = 10, min_samples_split = 5, random_state = 142)

# Fit the classifier
clf.fit(X_train, y_train)

# Predict on the test set
test_preds = clf.predict(X_test)

In [None]:

scores_3 = cross_val_score(clf, feature_3, df['weight_loss'], cv=20) #20 fold cross validation
scores_3.mean()

In [None]:
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(clf)

Interesting... our accuracy went down with the more detailed features.

### Naive Bayes
#### Level 1
Let's try Bayes' theorem and see if that helps.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

#Gaussian
GNB = GaussianNB()
GNB.fit(X_train,y_train)
#
# Predict for test set
#
y_pred = GNB.predict(X_test)
print(classification_report(y_test,y_pred))

In [None]:
X_transformed = scaler.fit_transform(feature_1)
scores_1 = cross_val_score(GNB, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_1.mean()

Accuracy of 65%. Not as good as our previous results, but not bad.

### Naive Bayes
#### Level 2
Let's try Bayes' theorem on the Level 2 features and see if that helps.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
#Gaussian
GNB = GaussianNB()
GNB.fit(X_train,y_train)

In [None]:
X_transformed = scaler.fit_transform(feature_2)
scores_2 = cross_val_score(GNB, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_2.mean()

Accuracy of 59%. Not as good as our previous results, but not bad.

### Naive Bayes
#### Level 3
Let's try Bayes' theorem on the Level 3 features and see if that helps.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 42, test_size = .25)

In [None]:
#Gaussian
GNB = GaussianNB()
GNB.fit(X_train,y_train)

In [None]:
X_transformed = scaler.fit_transform(feature_3)
scores_3 = cross_val_score(GNB, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_3.mean()

Okay, so that didn't do much either. Our first Naive Bayes model yielded a result of .53. So... not that good.

Okay, we got 70% true with Gaussian. This is arguably our best result. Something to keep in mind as we go forward.

### SVM
#### Level 1
Let's do an analysis of linear SVM. We'll avoid non-linear for now as the level of complexity might be too high for a binomial classification

In [None]:
from sklearn import svm

# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 142, test_size = .25)

In [None]:
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

In [None]:
from sklearn import svm

#r_range = np.array([0.01, 1, 10])  
#gamma_range = np.array([0.001, 0.01, 0.1]) 
#param_grid = dict(gamma=gamma_range, coef0=r_range)
#details = []
#for gamma in gamma_range:
#     for r in r_range:
#        clf = svm.SVC(kernel='linear', coef0=r , gamma=gamma)
#        clf.fit(X_train_transformed, y_train)
#        score = clf.score(X_test_transformed, y_test)
#        details.append((r, gamma, clf, score))

In [None]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train_transformed, y_train)

In [None]:
clf.score(X_test_transformed, y_test)

In [None]:
X_transformed = std.fit_transform(feature_1)

scores_1 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_1.mean()

In [None]:
#.get_params()

In [None]:
#wow, let's plot the feature importance
pd.Series(clf.coef_[0], index=X_train.columns).nlargest(11).plot(kind='barh', title='Weight Loss Contributors')

65% on an SVM model is good. Let's see if we can improve it.
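A note on reading the coefficient plot: `nlargest` sorts by signed value, so a strongly negative coefficient ends up at the bottom even though it may matter most. Below is a sketch that ranks by magnitude instead. The data is synthetic (the label is built to depend mostly, and negatively, on calories consumed), and `clf_demo`, `coefs`, and `ranked` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Synthetic, already-standardized stand-in for the Level 1 features
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["DietaryEnergyConsumed_Cal", "TotalEnergyBurned_Cal", "SleepAnalysis_AsleepTotal_hrs"],
)

# Made-up rule: eating more pushes toward gain, burning more pushes toward loss
score = (
    2.0 * demo_X["DietaryEnergyConsumed_Cal"]
    - 0.5 * demo_X["TotalEnergyBurned_Cal"]
    + rng.normal(scale=0.3, size=300)
)
demo_y = (score < 0).astype(int)  # 1 = weight loss

clf_demo = SVC(kernel="linear").fit(demo_X, demo_y)
coefs = pd.Series(clf_demo.coef_[0], index=demo_X.columns)

# Rank by absolute size but keep the sign, so direction and strength are both visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Plotting `ranked` with `.plot(kind='barh')` would show the strongest contributors first, whether they push toward loss or gain.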
### SVM
#### Level 2
Let's do an analysis of linear SVM. We'll avoid non-linear for now as the level of complexity might be too high for a binomial classification

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 142, test_size = .25)

In [None]:
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

In [None]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train_transformed, y_train)

In [None]:
X_transformed = std.fit_transform(feature_2)

scores_2 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_2.mean()

In [None]:
#wow, let's plot the feature importance
pd.Series(clf.coef_[0], index=X_train.columns).nlargest(11).plot(kind='barh', title='Weight Loss Contributors')

65% on an SVM model is good. Let's see if we can improve it.
### SVM
#### Level 3

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 142, test_size = .25)

Okay, so after a few SVM rounds, it appears we aren't having great luck.

In [None]:
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

In [None]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train_transformed, y_train)

In [None]:
X_transformed = std.fit_transform(feature_3)

scores_3 = cross_val_score(clf, X_transformed, df['weight_loss'], cv=20) #20 fold cross validation
scores_3.mean()

In [None]:
#wow, let's plot the feature importance
pd.Series(clf.coef_[0], index=X_train.columns).nlargest(14).plot(kind='barh', title='Weight Loss Contributors')

### Deep Learning

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense
#from sklearn.preprocessing import StandardScaler, LabelBinarizer

In [None]:
# Split the data
X_train_total, X_test, y_train_total, y_test = train_test_split(feature_1, df['weight_loss'], random_state = 124, test_size = .25)

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_train_total, y_train_total, random_state = 124, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_1 = Sequential()

#we'll start with 12 neurons and an input shape of 3 (the Level 1 features)
model_1.add(Dense(12, activation='relu', input_shape=(3,)))
model_1.add(Dense(6, activation='tanh'))
model_1.add(Dense(2, activation='relu'))

#output classification layer
model_1.add(Dense(1, activation='sigmoid'))

In [None]:
from keras import optimizers
# Compile the model
model_1.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
#fit model
results_1  = model_1.fit(scaled_data_train,
                    y_train,
                    epochs=100,
                    validation_data=(scaled_data_val, y_val))

Okay, so we've done a fair bit of analysis here. We've used some traditional machine learning algorithms and neural networks, and we haven't exceeded 62-65% accuracy. Some of our best models were KNN, Logistic Regression, and SVM. Our worst models were Linear Regression, Naive Bayes, and Decision Tree. Because we have only numeric data, it's no surprise that Decision Tree performed poorly. The accuracy on Linear Regression was particularly bad. For now, we'll abandon our goal of finding a combined linear regression model with time response, and just focus on classification of weight loss and weight gain.

So how do we optimize the model? A helpful analysis is there on our SVM plot. Let's look at that again.

In [None]:
# Reuse the scaler fit on the training set; do not refit on test data
scaled_data_test = scaler.transform(X_test)

score = model_1.evaluate(scaled_data_test, y_test, verbose=0)
print(score)

Okay, so this overfit our data. It's... a lot of analysis for only 3 variables.

Let's check level 2 testing

In [None]:
# Split the data
X_train_total, X_test, y_train_total, y_test = train_test_split(feature_2, df['weight_loss'], random_state = 124, test_size = .25)

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_train_total, y_train_total, random_state = 124, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_2 = Sequential()

#we'll start with 12 neurons, and an input shape equal to the number of features
model_2.add(Dense(12, activation='tanh', input_shape=(len(X_train.columns),)))
model_2.add(Dense(8, activation='tanh'))
model_2.add(Dense(4, activation='tanh'))

#output classification layer
model_2.add(Dense(1, activation='sigmoid'))

In [None]:
from keras import optimizers
# Compile the model
model_2.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
#fit model
results_2  = model_2.fit(scaled_data_train,
                    y_train,
                    epochs=150,
                    validation_data=(scaled_data_val, y_val))

In [None]:
def visualize_training_results(results):
    history = results.history
    plt.figure()
    plt.plot(history['val_loss'])
    plt.plot(history['loss'])
    plt.legend(['val_loss', 'loss'])
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()
    
    plt.figure()
    plt.plot(history['val_acc'])
    plt.plot(history['acc'])
    plt.legend(['val_acc', 'acc'])
    plt.title('Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.show()

In [None]:
visualize_training_results(results_2)

In [None]:
# Reuse the scaler fit on the training set; do not refit on test data
scaled_data_test = scaler.transform(X_test)

score = model_2.evaluate(scaled_data_test, y_test, verbose=0)
print(score)

Not good either. It appears both models overfit.

Let's check level 3 testing

In [None]:
# Split the data
X_train_total, X_test, y_train_total, y_test = train_test_split(feature_3, df['weight_loss'], random_state = 124, test_size = .25)

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_train_total, y_train_total, random_state = 124, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_val = scaler.transform(X_val)

In [None]:
model_3 = Sequential()

#we'll start with 12 neurons, and an input shape equal to the number of features
model_3.add(Dense(12, activation='tanh', input_shape=(len(X_train.columns),)))
model_3.add(Dense(8, activation='tanh'))
model_3.add(Dense(4, activation='tanh'))

#output classification layer
model_3.add(Dense(1, activation='sigmoid'))

In [None]:
from keras import optimizers
# Compile the model
model_3.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
#fit model
results_3  = model_3.fit(scaled_data_train,
                    y_train,
                    epochs=100,
                    validation_data=(scaled_data_val, y_val))

In [None]:
def visualize_training_results(results):
    history = results.history
    plt.figure()
    plt.plot(history['val_loss'])
    plt.plot(history['loss'])
    plt.legend(['val_loss', 'loss'])
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()
    
    plt.figure()
    plt.plot(history['val_acc'])
    plt.plot(history['acc'])
    plt.legend(['val_acc', 'acc'])
    plt.title('Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.show()

In [None]:
visualize_training_results(results_3)

In [None]:
# Reuse the scaler fit on the training set; do not refit on test data
scaled_data_test = scaler.transform(X_test)

score = model_3.evaluate(scaled_data_test, y_test, verbose=0)
print(score)

So... it looks as though our best models are the Decision Tree with Level 2 data, and Logistic Regression with Level 1 data.
Let's go ahead and iterate on those two models.

Okay, so we achieved 61% accuracy on test data. It's not quite what we achieved with our SVM and decision tree models.

Perfect. So it turns out this model was at its best at epoch 100. So let's stop the model at epoch 100 and then predict on our test data.
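Rather than reading the stopping epoch off the plot, it can be picked from the training history. The sketch below assumes a history dictionary shaped like the `results.history` used in `visualize_training_results`; the helper name `best_epoch` and the numbers are hypothetical.

```python
def best_epoch(history, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of `metric`."""
    values = history[metric]
    return min(range(len(values)), key=values.__getitem__) + 1


# Hypothetical history, in the shape Keras stores in `results.history`.
toy_history = {
    "loss": [0.70, 0.66, 0.62, 0.58, 0.55],
    "val_loss": [0.71, 0.68, 0.67, 0.69, 0.72],
}

print(best_epoch(toy_history))  # 3
```

In this toy history the training loss keeps falling while the validation loss turns upward after epoch 3, which is the usual sign of overfitting. Keras also provides an `EarlyStopping` callback that monitors a metric such as `val_loss` and halts training on its own.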

From the feature importance plot above we can tell a few things.

Surprises:
1. The most important feature is Basal Energy Burned. This feature essentially tells us how heavy we are, meaning, it looks like our existing weight is the biggest predictor of whether or not we will lose weight the next day.

2. Protein is the second largest negative factor contributing to weight loss. We hear a lot about how we need more protein in our diet. It contributes to building and maintaining muscle and organ function. It's possible that weight loss is more than just fat and stored carbohydrates. It's also about losing muscle.

3. Residual fats (characterized as all fats not saturated, monounsaturated, and polyunsaturated) are a small contributor to weight loss. We hear sometimes about "healthy fats," but it's interesting to see them as the 3rd largest contributor to weight loss.

4. Active Calorie burned was a contributing factor to... weight gain? It's incredibly small, so perhaps with more data, we can get a different result. But regardless, it seems that active calorie burn was insignificant, or even a negative coefficient, to weight loss.

Expected:
1. REM sleep and Core Sleep both factor into some weight loss; however, Deep sleep factors slightly against. The deep sleep coefficient is so small that it seems as though Total sleep hours would be a more significant factor for next day weight loss.

2. Dietary Fiber is the second most important factor contributing to weight loss. This is interesting because we hear that having lots of fiber in your diet is important.

3. Sugar is the largest negative factor contributing to weight loss. This confirms what we've heard for a while. Interesting to see it here.

4. Saturated and Polyunsaturated fats were negative contributors to weight loss, with polyunsaturated fats being the largest contributor. 

Next steps. Well, we can combine the sleep into one category, which would simplify the analysis, especially from a PCA perspective. We can also add some rolling sums. As we mentioned before, we didn't have much luck doing a time response strictly with the weight data, but perhaps we could include both rolling sums of the data and an indication of whether there was weight loss the day before or after.

So... let's resume Decision Tree, Logistic Regression, linear SVM 

First, we'll consolidate the sleep data - we'll leave the hours awake data out.
Second, we'll add a 2 day rolling average to each of the numbers.
Third, we'll add a previous day weight loss component.
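The first two steps can be sketched in pandas on a toy frame. The column names and values below are hypothetical stand-ins, not the notebook's actual columns.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame(
    {
        "rem_hours": [1.5, 2.0, 1.0, 1.8],
        "core_hours": [4.0, 3.5, 4.5, 4.2],
        "deep_hours": [0.5, 1.0, 0.5, 0.8],
        "dietary_sugar": [30.0, 55.0, 20.0, 40.0],
    },
    index=pd.to_datetime(["2023-08-24", "2023-08-25", "2023-08-26", "2023-08-27"]),
)

# Step 1: collapse the sleep stages into a single total and drop the originals.
sleep_cols = ["rem_hours", "core_hours", "deep_hours"]
toy["total_sleep_hours"] = toy[sleep_cols].sum(axis=1)
consolidated = toy.drop(columns=sleep_cols)

# Step 2: two-day rolling average. The first day has no predecessor,
# so its row is NaN and gets dropped.
rolled = consolidated.rolling(2).mean().dropna()
print(rolled)
```

Because the first row is lost to the rolling window, the target series has to drop the same date so that features and labels stay aligned.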

### Sleep Consolidation

In [None]:
#let's create our dataset
sleep_consolidation = df[level_3_diet + level_2_exer + level_1_sleep]
targets = df['weight_loss']

### KNN
Okay, let's run a KNN model with the consolidated sleep data and see if anything budges.

In [None]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(sleep_consolidation, targets, test_size=0.25, random_state=24)

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)


In [None]:
#let's search for the best K for our algorithm
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))

In [None]:
find_best_k(X_train_transformed, y_train, X_test_transformed, y_test)
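One caveat: `find_best_k` chooses k by scoring on the test split, so the F1 it reports is likely optimistic. An alternative, sketched here on synthetic data from `make_classification` rather than the notebook's data, is to choose k by cross-validation on the training rows and touch the test split only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and targets.
X, y = make_classification(n_samples=200, n_features=11, random_state=24)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=24
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 26, 2))}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("Best k:", best_k)
print("Cross-validated F1:", search.best_score_)
print("Held-out F1:", search.score(X_test, y_test))
```

The odd-only grid mirrors the `range(min_k, max_k+1, 2)` loop in `find_best_k`.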

In [None]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(X_train_transformed, y_train)

# Predict on the test set
test_preds = clf.predict(X_test_transformed)

In [None]:
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
    
print_metrics(y_test, test_preds)

In [None]:
#let's cross-validate
X_transformed = std.transform(sleep_consolidation)

scores = cross_val_score(clf, X_transformed, targets, cv=10)
scores.mean()

### Logistic Regression

In [None]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(sleep_consolidation, targets, test_size=0.25, random_state=24)

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

X_transformed = std.transform(sleep_consolidation)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, C=100, solver='liblinear')
model_log = logreg.fit(X_train_transformed, y_train)
model_log

In [None]:
y_hat_train = model_log.predict(X_train_transformed)

train_residuals = np.abs(y_train - y_hat_train)
print(pd.Series(train_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(train_residuals, name="Residuals (proportions)").value_counts(normalize=True))

In [None]:
y_hat_test = model_log.predict(X_test_transformed)

test_residuals = np.abs(y_test - y_hat_test)
print(pd.Series(test_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(test_residuals, name="Residuals (proportions)").value_counts(normalize=True))
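For 0/1 labels, the absolute residual is 0 when a prediction is correct and 1 when it is wrong, so the proportion of zeros printed above is the accuracy. A small check on made-up labels:

```python
import numpy as np

# Made-up labels and predictions for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

residuals = np.abs(y_true - y_pred)      # 0 = correct, 1 = wrong
accuracy_from_residuals = 1 - residuals.mean()
accuracy_direct = (y_true == y_pred).mean()

print(accuracy_from_residuals, accuracy_direct)  # 0.75 0.75
```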

In [None]:
#let's cross-validate
scores = cross_val_score(model_log, X_transformed, targets, cv=10)
scores.mean()

65% mean accuracy on cross-validation is not very good.


### SVM 


In [None]:
# Split the data
#X_train, X_test, y_train, y_test = train_test_split(sleep_consolidation, targets, random_state = 42, test_size = .99)

In [None]:
# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(sleep_consolidation)
#X_test_transformed = std.transform(X_test)

X_transformed = std.transform(sleep_consolidation) 

In [None]:
from sklearn import svm

svm = svm.SVC(kernel='linear')
svm.fit(X_train_transformed, targets)

#svm.score(X_test_transformed, y_test)

In [None]:
scores = cross_val_score(svm, X_transformed, targets, cv=15)
scores.mean()

In [None]:
#wow, let's plot the feature importance
pd.Series(svm.coef_[0], index=sleep_consolidation.columns).nlargest(11).plot(kind='barh', title='                         <----- Weight Gain | Weight Loss ----->')

Okay, so we were able to reach 67% mean accuracy on cross-validation. This is an improvement. Additionally, once we consolidated the sleep factors (and removed awake time), we see a decent increase in the contribution of Dietary Sugar. The active calorie burned stayed relatively small, almost nonexistent.
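A note on reading the plot: `nlargest` ranks coefficients by signed value, so strongly negative features land at the bottom, or drop off entirely when the requested count is smaller than the number of features. To compare strength regardless of direction, the coefficients can be ranked by absolute value. The values below are hypothetical, and the sign reading assumes `weight_loss` is coded 1 for a loss day.

```python
import pandas as pd

# Hypothetical linear-SVM coefficients; names and values are illustrative.
coefs = pd.Series(
    {
        "dietary_sugar": -0.40,
        "dietary_fiber": 0.25,
        "protein": -0.30,
        "active_calories": -0.02,
        "sleep_hours": 0.10,
    }
)

# Order by magnitude while keeping each coefficient's sign.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Under that coding assumption, positive values push a day toward the weight-loss class and negative values toward weight gain.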

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(sleep_consolidation, targets, random_state = 243, test_size = .25)

# Split the data
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, random_state = 243, test_size = .25)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
X_train_transform = scaler.fit_transform(X_train_final)
X_val_transform = scaler.transform(X_val)

In [None]:
model_1 = Sequential()

#we'll start with 12 neurons, and an input shape of 11
model_1.add(Dense(12, activation='relu', input_shape=(11,)))
model_1.add(Dense(8, activation='relu'))
model_1.add(Dense(4, activation='relu'))

#output classification layer
model_1.add(Dense(1, activation='sigmoid'))

In [None]:
from keras import optimizers
# Compile the model
model_1.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
#fit model
results_1  = model_1.fit(X_train_transform,
                    y_train_final,
                    epochs=75,
                    validation_data=(X_val_transform, y_val))

In [None]:
# Transform the full training set and the test set with the scaler fit on X_train_final
X_train_transform_1 = scaler.transform(X_train)
X_test_transform = scaler.transform(X_test)

In [None]:
results_train = model_1.evaluate(X_train_transform_1, y_train)
print('----------')
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')

In [None]:
results_test = model_1.evaluate(X_test_transform, y_test)
print('----------')
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')

Okay, so we don't have much improvement here. In fact, we got a decrease from our SVM model, which, so far, has the best output.

But let's continue with our modeling. Let's do a two day rolling sum of all of the features and see if that changes anything.

### 2 Day Rolling Sum
Previously we created a sleep_consolidation set, that looked something like this.
`sleep_consolidation = df[level_3_diet + level_2_exer + level_1_sleep]`

So, let's take that DataFrame and compute a two day rolling sum of each feature.


In [None]:
#let's create our dataset
newdf_roll_sum_2 = sleep_consolidation.rolling(2).sum().drop(['2023-08-24'], axis = 0)
#y.drop('2023-08-25', axis = 0,inplace = True)
newdf_roll_sum_2

targets_roll_sum_2 = targets.drop(['2023-08-24'], axis = 0)
targets_roll_sum_2

### KNN

In [None]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(newdf_roll_sum_2, targets_roll_sum_2, test_size=0.25, random_state=24)

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)


#let's search for the best K for our algorithm
find_best_k(X_train_transformed, y_train, X_test_transformed, y_test)
    

In [None]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(X_train_transformed, y_train)

# Predict on the test set
test_preds = clf.predict(X_test_transformed)

In [None]:
#let's cross-validate
X_transformed = std.transform(newdf_roll_sum_2)

scores = cross_val_score(clf, X_transformed, targets_roll_sum_2, cv=10)
scores.mean()

### Logistic Regression

In [None]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(newdf_roll_sum_2, targets_roll_sum_2, test_size=0.25, random_state=24)

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

X_transformed = std.transform(newdf_roll_sum_2)

In [None]:
#from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, C=1E12, solver='liblinear')
model_log = logreg.fit(X_train_transformed, y_train)
model_log

In [None]:
#let's cross-validate
scores = cross_val_score(model_log, X_transformed, targets_roll_sum_2, cv=10)
scores.mean()

### SVM

In [None]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(newdf_roll_sum_2, targets_roll_sum_2, test_size=0.25, random_state=24)

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

X_transformed = std.transform(newdf_roll_sum_2)

In [None]:
from sklearn import svm

svm = svm.SVC(kernel='linear')
svm.fit(X_train_transformed, y_train)

svm.score(X_test_transformed, y_test)

In [None]:
# cross-validate on the standardized features, as in the earlier SVM cells
scores = cross_val_score(svm, X_transformed, targets_roll_sum_2, cv=15)
scores.mean()

In [None]:
#wow, let's plot the feature importance
pd.Series(svm.coef_[0], index=X_train.columns).nlargest(14).plot(kind='barh', title='Weight Loss Contributors')

### Neural Network

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(newdf_roll_sum_2, targets_roll_sum_2, random_state = 24, test_size = .15)

# Split the data
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, random_state = 24, test_size = .15)

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply it to the validation set
X_train_transform = scaler.fit_transform(X_train_final)
X_val_transform = scaler.transform(X_val)


In [None]:
model_1 = Sequential()

#we'll start with 12 neurons, and an input shape of 11
model_1.add(Dense(12, activation='tanh', input_shape=(11,)))
model_1.add(Dense(8, activation='tanh'))
model_1.add(Dense(4, activation='tanh'))

#output classification layer
model_1.add(Dense(1, activation='sigmoid'))

In [None]:
# Compile the model
model_1.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
#fit model
results_1  = model_1.fit(X_train_transform,
                    y_train_final,
                    epochs=75,
                    validation_data=(X_val_transform, y_val))

### SUMMARY
So, we did not see an improvement with the addition of the rolling sum for 2 days. But I still would like to test some sort of time response idea. Let's see if we can add a feature to represent whether the previous day was a weight loss or a weight gain.

### Add another feature - previous day's weight loss or gain.
So, we know that the rolling sum didn't help. Let's return to our previous day weight gain idea. I'm going to circle back to the Decision Tree - Level 2 and Logistic Regression - Level 1. 
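A previous-day feature can be built by shifting the target column down one row. This is a minimal sketch on a toy series; the column name `prev_day_weight_loss` is a hypothetical choice and the values are made up.

```python
import pandas as pd

# Toy daily labels (1 = weight loss, 0 = weight gain); values are made up.
toy = pd.DataFrame(
    {"weight_loss": [1, 0, 0, 1, 1]},
    index=pd.date_range("2023-08-24", periods=5, freq="D"),
)

# shift(1) moves each label down one day, so every row sees yesterday's outcome.
toy["prev_day_weight_loss"] = toy["weight_loss"].shift(1)

# The first day has no yesterday; drop it and restore the integer dtype.
toy = toy.dropna().astype({"prev_day_weight_loss": int})
print(toy)
```

Shifting by a positive number only looks backward in time, so the new feature does not leak the current day's label into its own row.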

### SUMMARY
We downloaded the data, analyzed it, and decided to run some more analysis.

### Pipeline
Let's establish pipelines for each of our tests. We won't necessarily worry about some of our less accurate ones, but we can start with the basics. We've had good results for Naive Bayes, KNN, and SVM. So, let's run a few tests.
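One possible shape for this, sketched on synthetic data from `make_classification` rather than the notebook's features: wrap each classifier in a scikit-learn `Pipeline` with its own `StandardScaler`, then cross-validate them in a loop. The model settings below are placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and targets.
X, y = make_classification(n_samples=200, n_features=11, random_state=24)

# Placeholder hyperparameters; swap in whatever the earlier cells settled on.
models = {
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=11),
    "svm": SVC(kernel="linear"),
}

mean_scores = {}
for name, model in models.items():
    # The scaler is a pipeline step, so it is refit inside every CV fold.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    mean_scores[name] = cross_val_score(pipe, X, y, cv=10).mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the scaler inside the pipeline removes the separate `fit_transform` and `transform` cells and makes every model go through exactly the same preprocessing.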