# GUESS MY WEIGHT 

![guess_your_weight.gif](images/guess_your_weight.gif)

## Table of Contents TOC
[Overview](#overview)<br />
[Data Understanding](#data-understanding)<br />
[Data Preparation](#data-preparation)<br />
[Modeling](#modeling)<br />
[Evaluation](#evaluation)<br />
[Github Repository and Resources](#github-repository-and-resources)<br />


## Overview
Health and Wellness is a big business. Specifically, weight loss. We’re all trying because it’s very, very hard. I recently went on my own weight loss journey, losing about 50 lbs in roughly 18 months. Weighing myself every morning, I agonized over every tenth of a lb, recording it in an app on my phone. I realized that losing big chunks of weight starts with small, incremental progress on the scale. But I didn’t stop there. As a data nerd, I thought, “let’s record every meal.” So I did that too. I wondered… given all this data I have, could I predict my weight? My watch and phone capture my exercise, sleep, eating, and so much more. There must be trends here. At a minimum, I should be able to predict whether my weight will go up or down from the previous day. So let’s do it.<br />
[return to TOC](#table-of-contents-TOC)

## Data Understanding
I have much (and probably too much) of this data in my iPhone and Apple Watch. It contains weight information, workouts, heart rate, and meals, broken down into subcategories (proteins, fats, etc.). Most important is the weight. That will be the feature that I primarily use for classification.  
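
As a rough illustration of how the daily weigh-ins could become a classification target, here is a minimal sketch. The weights are made up, and the 0/1 encoding (1 = up, 0 = flat or down) is an assumption for the example, not necessarily the encoding used later in the project.

```python
import pandas as pd

# Made-up daily weigh-ins in lb (illustrative values, not the real data)
weights = pd.Series(
    [180.2, 179.8, 180.1, 180.1, 179.5],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="BodyMass_lb",
)

# Change from the previous day's weigh-in; the first day has no previous value
daily_change = weights.diff()

# 1 = weight went up, 0 = weight stayed flat or went down (assumed encoding)
went_up = (daily_change > 0).astype(int)

# Drop the first day, which has nothing to compare against
went_up = went_up[daily_change.notna()]
print(went_up)
```

Days with a missing weigh-in would need extra care, since `diff()` compares against whatever row comes immediately before.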

Because it’s my data, there’s more clarity about data entry methods. This is more subjective than a controlled experiment with many participants. I know what data I was diligent about collecting, so I should be able to scrub it appropriately. For instance, I didn’t record my fluids consistently - water, tea, coffee. Water consumption is a big part of this, so I’ll have to be clear about the gaps in the data.<br />
[return to TOC](#table-of-contents-TOC)


## Data Preparation
The raw data is stored in an XML file on my phone, and the pre-processed version is stored as a CSV file in a Kaggle repository. After loading it into a Python notebook and digging a little, there are roughly 180 rows of weight entries (approximately 6 months), but it’s not clear how many gaps there are. All of the data is stored as entries with time stamps and usually a numeric value. Whether it’s heart rate, weight, or caloric info, each entry is one numeric value with an associated unit. We’re primarily dealing with ints and floats, all numeric, and we’ll be using daily totals/averages. Because we only have one weigh-in per day, we’re only going to use daily values of the other data. So… we know we have approximately 100-180 rows. I can’t say at the moment how many columns, because this will depend on what happens in pre-processing. Which brings me to...

There are two major challenges with the pre-processing. The first deals with the privacy of my personal health data. How do I balance reproducibility requirements with privacy concerns? I need to make the dataset publicly available, including all of my pre-processing steps, but I also want to make sure no one can link it back to me, Andrew Q. Bennett (my real middle name doesn’t start with Q… gotcha!!!!). And the initial dataset is large, maybe 40 MB. The approach we’ll use is to perform some pre-processing locally, and then upload to the Kaggle site when it’s ready for public consumption. In my Jupyter notebook, I will comment out some of this code so that we can see the work, but it won’t affect the code when we press “run”.

The second is dealing with correlated features. For instance, we know that all data related to working out is going to be correlated with each other. The steps, average heart rate, workout calories, etc. will all be correlated with whether I went for a jog that day. Making decisions about which data to use will be a challenge, even with some baseline domain knowledge. There is a treasure trove of data that may have nothing (or very little) to do with weight loss, like Vitamin A intake. PCA will be critical for reducing the number of features without losing too much information. I know about health… but I’m no expert. Maybe Vitamin A intake can help/hurt weight loss.
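
To make the correlation concern concrete, here is a small sketch of how redundant features could be flagged before deciding what to keep. The data is synthetic, the column name `DietaryVitaminA_mcg` and the 0.8 threshold are assumptions made for the example, and this is a simple pairwise check rather than PCA itself.

```python
import numpy as np
import pandas as pd

# Synthetic daily features (hypothetical values): steps and workout calories
# move together, vitamin A is unrelated noise
rng = np.random.default_rng(0)
steps = rng.normal(10000, 2000, size=100)
features = pd.DataFrame({
    "StepCount_count": steps,
    "ActiveEnergyBurned_Cal": steps * 0.06 + rng.normal(0, 20, size=100),
    "DietaryVitaminA_mcg": rng.normal(700, 100, size=100),  # hypothetical column name
})

# Absolute pairwise Pearson correlations between the columns
corr = features.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above an arbitrary threshold are candidates for dropping or combining
threshold = 0.8
pairs = upper.stack()
redundant_pairs = pairs[pairs > threshold]
print(redundant_pairs)
```

A check like this only catches pairwise linear relationships, so it would complement PCA rather than replace it.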

Most of the visualization effort will go into making sure the weight data is presented cleanly. A nice regression line showing weight trends over different periods will be very helpful.<br />
[return to TOC](#table-of-contents-TOC)
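
The regression line mentioned above could be computed along these lines. This is only a sketch with made-up weights; a straight-line least-squares fit via `np.polyfit` is one simple choice, not necessarily the method used in the final analysis.

```python
import numpy as np
import pandas as pd

# Made-up daily weigh-ins in lb that fall by 0.1 lb per day
dates = pd.date_range("2024-01-01", periods=10, freq="D")
weights = pd.Series(180.0 - 0.1 * np.arange(10), index=dates)

# Days since the first weigh-in serve as the x-axis
days = (weights.index - weights.index[0]).days.to_numpy()

# Least-squares straight line: weight ~ slope * days + intercept
slope, intercept = np.polyfit(days, weights.to_numpy(), deg=1)
trend = pd.Series(slope * days + intercept, index=weights.index)

print(f"trend: {slope:.2f} lb/day")
```

Fitting the same line over different date ranges (for example, by slicing `weights` first) would give the trend for each period.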

In [200]:
import pandas as pd
#numpy is needed later when the daily values are merged into one table
import numpy as np

In [201]:
df = pd.read_csv('pre_kaggle/weight_data.csv')
df

Unnamed: 0.1,Unnamed: 0,type,unit,creationDate,startDate,endDate,value
0,793917,HKCategoryTypeIdentifierSleepAnalysis,HKCategoryValueSleepAnalysisInBed,2023-07-24 05:30:00,2023-07-23 21:52:17,2023-07-23 21:53:47,1.0000
1,793918,HKCategoryTypeIdentifierSleepAnalysis,HKCategoryValueSleepAnalysisInBed,2023-07-24 05:30:00,2023-07-23 22:13:46,2023-07-23 22:13:59,1.0000
2,793919,HKCategoryTypeIdentifierSleepAnalysis,HKCategoryValueSleepAnalysisInBed,2023-07-24 05:30:00,2023-07-23 22:14:56,2023-07-23 22:58:00,1.0000
3,793920,HKCategoryTypeIdentifierSleepAnalysis,HKCategoryValueSleepAnalysisAsleepCore,2023-07-24 07:05:22,2023-07-23 23:12:48,2023-07-23 23:40:48,1.0000
4,793921,HKCategoryTypeIdentifierSleepAnalysis,HKCategoryValueSleepAnalysisAsleepDeep,2023-07-24 07:05:22,2023-07-23 23:40:48,2023-07-24 00:06:18,1.0000
...,...,...,...,...,...,...,...
856980,856980,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,ms,2024-03-06 01:01:10,2024-03-06 01:00:08,2024-03-06 01:01:07,44.3289
856981,856981,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,ms,2024-03-06 03:01:11,2024-03-06 03:00:09,2024-03-06 03:01:08,54.3759
856982,856982,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,ms,2024-03-06 05:01:11,2024-03-06 05:00:09,2024-03-06 05:01:08,76.2300
856983,856983,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,ms,2024-03-06 07:01:22,2024-03-06 07:00:20,2024-03-06 07:01:20,45.6944


### EDA - Prescrubbing
Some EDA was previously performed to get the dataset into Kaggle. Now, of course, we have to get our data oriented for our analysis. We know we want our daily weigh-in data to be our target feature. We also know we want one daily value for each variable feature. For instance, there are multiple kinds of sleep data, but perhaps we only care about the total hours that we slept. First, let's strip the data type descriptions down to something readable.

I want to create a table with weight as the target with only one value for each day.

1. Separate sleep and non-sleep - create a check to separate daily values from non-daily values.
    Sample, create table with daily values.
2. Create a 
 

In [202]:
import datetime as dt

#convert the time/date columns to datetime without time zone
df['creationDate'] = pd.to_datetime(df['creationDate']).dt.tz_localize(None)
df['startDate'] = pd.to_datetime(df['startDate']).dt.tz_localize(None)
df['endDate'] = pd.to_datetime(df['endDate']).dt.tz_localize(None)

Let's process the different kinds of data separately. We have a few different categories to consider, the main split being categorical versus numeric. Within the categorical data, we have sleep data and non-sleep data. The sleep data presents a challenge because of the way the data is stored across multiple days. More on this later. 

EDA Scrubbing - Let's clean the names. To do this, we will split the df into three frames and then reconnect them.

In [203]:
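#create two dataframes, category and non-category (quantity)
#.copy() makes each one independent of df, so later assignments don't raise SettingWithCopyWarning
category_df = df[df['type'].str.contains("Category")].copy()
quantity_df = df[df['type'].str.contains("Quantity")].copy()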

In [204]:
#scrub the two categories to make the data types more readable
category_df.loc[:,'type'] = category_df['type'].str.replace('HKCategoryTypeIdentifier', "")
quantity_df.loc[:,'type'] = quantity_df['type'].str.replace('HKQuantityTypeIdentifier', "")

In [205]:
#separate the category (binary) data into sleep and non-sleep frames
#.copy() keeps the later edits to each frame from touching category_df
sleep_df = category_df[category_df['type'].str.contains("Sleep")].copy()
non_sleep_df = category_df[~category_df['type'].str.contains("Sleep")].copy()
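
The split-and-rename steps above can be tried on a tiny stand-in frame. The three rows below are illustrative only (they are not taken from the real export), and `regex=False` is added here as an assumption so the prefixes are treated as literal text.

```python
import pandas as pd

# Tiny stand-in for the export: one row per record, 'type' holds the HealthKit identifier
records = pd.DataFrame({
    "type": [
        "HKCategoryTypeIdentifierSleepAnalysis",
        "HKCategoryTypeIdentifierAppleStandHour",
        "HKQuantityTypeIdentifierBodyMass",
    ],
    "value": [1.0, 1.0, 175.0],
})

# .copy() gives each subset its own data, so later assignments do not
# raise SettingWithCopyWarning
category = records[records["type"].str.contains("Category")].copy()
quantity = records[records["type"].str.contains("Quantity")].copy()

# regex=False treats the prefix as a literal string
category["type"] = category["type"].str.replace("HKCategoryTypeIdentifier", "", regex=False)
quantity["type"] = quantity["type"].str.replace("HKQuantityTypeIdentifier", "", regex=False)

# Split the category records into sleep and non-sleep
is_sleep = category["type"].str.contains("Sleep")
sleep = category[is_sleep]
non_sleep = category[~is_sleep]
```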

### Non-Sleep Data - Categorical and non-Categorical

#### Scrubbing

In [206]:
#scrub unit column (non_sleep)
non_sleep_df.loc[non_sleep_df['unit'].str.contains("HKCategoryValueAppleStand"),'unit'] = non_sleep_df['unit'].str.replace('HKCategoryValueAppleStand', "")
non_sleep_df.loc[non_sleep_df['unit'].str.contains("HKCategoryValueEnvironmentalAudioExposureEvent"),'unit'] = non_sleep_df['unit'].str.replace("HKCategoryValueEnvironmentalAudioExposureEvent", "")

In [207]:
#scrub quantity df 
quantity_df.loc[:,'type'] = quantity_df['type'] + '_' + quantity_df['unit']
quantity_df = quantity_df.drop('unit', axis = 1)

Unnamed: 0.1,Unnamed: 0,type,creationDate,startDate,endDate,value
13687,0,DietaryWater_mL,2023-07-24 10:41:41,2023-07-23 10:41:00,2023-07-23 10:41:00,473.1760
13688,1,DietaryWater_mL,2023-08-23 09:54:44,2023-08-23 09:54:00,2023-08-23 09:54:00,236.5880
13689,2,DietaryWater_mL,2023-08-25 07:02:45,2023-08-25 07:02:00,2023-08-25 07:02:00,473.1760
13690,3,DietaryWater_mL,2023-08-25 07:02:55,2023-08-26 07:02:00,2023-08-26 07:02:00,473.1760
13691,4,DietaryWater_mL,2023-09-05 08:30:09,2023-09-05 08:30:00,2023-09-05 08:30:00,473.1760
...,...,...,...,...,...,...
856980,856980,HeartRateVariabilitySDNN_ms,2024-03-06 01:01:10,2024-03-06 01:00:08,2024-03-06 01:01:07,44.3289
856981,856981,HeartRateVariabilitySDNN_ms,2024-03-06 03:01:11,2024-03-06 03:00:09,2024-03-06 03:01:08,54.3759
856982,856982,HeartRateVariabilitySDNN_ms,2024-03-06 05:01:11,2024-03-06 05:00:09,2024-03-06 05:01:08,76.2300
856983,856983,HeartRateVariabilitySDNN_ms,2024-03-06 07:01:22,2024-03-06 07:00:20,2024-03-06 07:01:20,45.6944


In [208]:
#scrub non-sleep df
non_sleep_df.loc[:,'type'] = non_sleep_df['type'] + '_' + non_sleep_df['unit']
non_sleep_df = non_sleep_df.drop('unit', axis = 1)

### Create new DataFrame
Now that we've done some preliminary scrubbing, let's see if we can create our new DataFrame with only daily information.

#### Scrubbing and Date column formatting

In [209]:
#scrubbed
combined_df = pd.concat([quantity_df, non_sleep_df]).drop(['Unnamed: 0', 'creationDate', 'endDate'], axis = 1)
combined_df = combined_df.rename(columns = {'startDate': 'date'})

In [210]:
combined_df.set_index('date', inplace = True)

Now that we've scrubbed, let's begin creating our new dataframe with daily values. Looking at the data, there is some data we want to add up into a daily total. For other data, we want the daily average and min/max.
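
The difference between the two kinds of daily values can be sketched on a handful of made-up records. The type name `HeartRate_count/min` is assumed here for illustration, and the single `agg` call is a compact alternative to separate mean/min/max passes.

```python
import pandas as pd

# Made-up long-format records indexed by timestamp (illustrative values only)
records = pd.DataFrame(
    {
        "type": ["StepCount_count", "StepCount_count", "HeartRate_count/min",
                 "HeartRate_count/min", "StepCount_count"],
        "value": [1000.0, 2500.0, 60.0, 80.0, 4000.0],
    },
    index=pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 18:00", "2024-01-01 09:00",
        "2024-01-01 21:00", "2024-01-02 10:00",
    ]),
)

# Cumulative quantities: add up everything recorded on the same day
steps = records.loc[records["type"] == "StepCount_count", "value"]
daily_steps = steps.resample("D").sum()

# Snapshot quantities: summarize the day's readings instead of summing them
heart_rate = records.loc[records["type"] == "HeartRate_count/min", "value"]
daily_hr = heart_rate.resample("D").agg(["mean", "min", "max"])
```

One thing to keep in mind: `resample('D').sum()` returns 0 for a day with no records, so a zero in a summed column can mean "nothing was logged" rather than a true zero. Passing `min_count=1` to `sum()` would leave those days as NaN instead.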

In [211]:
#separate the quantity into data to be aggregated and snapshot data for mean/min/max
#rates (units containing "/") and percentages are snapshots; everything else is summed
data_means = combined_df[combined_df['type'].str.contains("/") | combined_df['type'].str.contains("%")]
data_totals = combined_df[~combined_df['type'].str.contains("/") & ~combined_df['type'].str.contains("%")]

In [212]:
#types to move from the totals frame to the means frame
col_switch = ['RunningPower_W',
            'WalkingStepLength_in', 
            'EnvironmentalAudioExposure_dBASPL', 
            'HeadphoneAudioExposure_dBASPL',
            'HeartRateVariabilitySDNN_ms',
            'RunningStrideLength_m', 
            'RunningVerticalOscillation_cm',
            'RunningGroundContactTime_ms']

In [213]:
#let's move those types from the totals dataframe to the means dataframe
data_means =  pd.concat([data_means, data_totals.loc[data_totals['type'].isin(col_switch)]])
data_totals = data_totals.loc[~data_totals['type'].isin(col_switch)]

In [214]:
#let's create the dataframe for the data we are summing
new_df = pd.DataFrame()

for col in data_totals['type'].unique():
    #drop the string column so that only the numeric values are summed
    values = data_totals.loc[data_totals['type'] == col, :].drop('type', axis = 1)
    daily_totals = values.resample('D')
    new_daily = daily_totals.sum()
    new_daily.loc[:,'type'] = col
    new_df = pd.concat([new_df, new_daily])


In [215]:
#now we'll add the daily mean, min and max of the snapshot data to the previous DataFrame
for col in data_means['type'].unique():
    values = data_means.loc[data_means['type'] == col, :]
    values = values.drop('type', axis = 1)
    daily_totals = values.resample('D')
    new_daily = daily_totals.mean()
    new_daily['type'] = col + '_mean'
    new_df = pd.concat([new_df, new_daily])
    new_daily = daily_totals.min()
    new_daily['type'] = col + '_min'
    new_df = pd.concat([new_df, new_daily])
    new_daily = daily_totals.max()
    new_daily['type'] = col + '_max'
    new_df = pd.concat([new_df, new_daily])

### Sleep scrubbing
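Sleep data needs special handling. For example, one night of sleep occurs over two days, but we don't consider this to be sleep on different days. After we wake up, we consider the sleep as occurring "last night" or the "night before". So each sleep record is assigned to the date on which the night began: records that start at noon or later keep their own date, and records that start before noon are moved back one day.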

In [216]:
#scrub unit column (sleep)
sleep_df.loc[sleep_df['unit'].str.contains("HKCategoryValueSleepAnalysis"),'unit'] = sleep_df['unit'].str.replace("HKCategoryValueSleepAnalysis", "")

In [217]:
#scrub sleep column
sleep_df.loc[:,'type'] = sleep_df['type'] + '_' + sleep_df['unit'] + '_hrs'
sleep_df = sleep_df.drop('unit', axis = 1)

In [218]:
#startDate and endDate are already datetimes (converted earlier), so the
#length of each sleep record in hours is just the difference between them
sleep_df['value'] = (sleep_df['endDate'] - sleep_df['startDate'])/dt.timedelta(hours=1)

#remove unwanted sleep types
sleep_df = sleep_df[sleep_df['type'] != 'SleepAnalysis_InBed_hrs']
sleep_df = sleep_df[sleep_df['type'] != 'SleepAnalysis_AsleepUnspecified_hrs']

In [219]:
#sleep_df = sleep_df.drop(['Unnamed: 0', 'creationDate', 'endDate'], axis = 1)
def sleep_date (date):
    if date.hour > 11:
        return date
    else:
        return (date - pd.Timedelta(1, unit='D'))

sleep_df['date'] = sleep_df['startDate'].apply(sleep_date)
sleep_df['date'] = pd.to_datetime(sleep_df['date'])  
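
The date-shifting rule can be checked on a few hand-picked timestamps. This is a sketch with a hypothetical helper name (`night_of`); unlike `sleep_date` above, it also drops the time of day so the result is a plain calendar date.

```python
import pandas as pd

def night_of(start: pd.Timestamp) -> pd.Timestamp:
    """Return the calendar date of the evening a sleep record belongs to."""
    day = start.normalize()  # midnight of the same day, time dropped
    if start.hour > 11:      # noon or later: the night starts on this date
        return day
    return day - pd.Timedelta(days=1)  # before noon: belongs to the previous evening

starts = pd.Series(pd.to_datetime([
    "2023-07-23 21:52:17",  # evening, stays on 2023-07-23
    "2023-07-24 00:06:18",  # after midnight, moved back to 2023-07-23
    "2023-07-24 13:00:00",  # afternoon nap, stays on 2023-07-24
]))
nights = starts.apply(night_of)
print(nights)
```

With the noon cutoff, an afternoon nap is counted toward the coming night rather than the previous one, which may or may not be the desired behavior.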

In [221]:
sleep_df.set_index('date', inplace=True)

In [222]:
sleep_df = sleep_df.drop(['Unnamed: 0', 'creationDate', 'startDate', 'endDate'], axis = 1)

In [223]:
#let's append our sleep data to our larger frame
for col in sleep_df['type'].unique():
    #drop the string column so that only the hours are summed
    values = sleep_df.loc[sleep_df['type'] == col, :].drop('type', axis = 1)
    daily_totals = values.resample('D')
    new_daily = daily_totals.sum()
    new_daily.loc[:,'type'] = col
    new_df = pd.concat([new_df, new_daily])

In [224]:
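#now we're going to take the data from the long frame (one row per date and type)
#and reshape it so that each date is one row and each type is one column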

all_cols = np.array(new_df['type'].unique())
merge_df = pd.DataFrame()

for date in new_df.index.unique():
    test = new_df.loc[(new_df.index == date)].set_index('type')
    test = test.transpose()
    new_cols = np.setdiff1d(all_cols, np.array(test.columns))
    test[new_cols] = np.nan
    test = test[all_cols]
    test['date'] = date
    test = test.set_index('date')
    merge_df = pd.concat([merge_df, test])
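
The same long-to-wide reshaping can also be expressed with `DataFrame.pivot`. The sketch below uses made-up values and assumes there is at most one row per date and type, which is what the daily resampling should produce.

```python
import pandas as pd

# Made-up long-format daily values (one row per date and type)
long_df = pd.DataFrame(
    {
        "type": ["BodyMass_lb", "StepCount_count", "StepCount_count"],
        "value": [175.0, 8000.0, 9000.0],
    },
    index=pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-02"]).rename("date"),
)

# One row per date, one column per type; missing combinations become NaN
wide_df = long_df.pivot(columns="type", values="value")
print(wide_df)
```

`pivot` raises an error if a date and type pair appears more than once, which doubles as a sanity check on the daily aggregation.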

In [225]:
merge_df

date,DietaryWater_mL,BodyMassIndex_count,BodyMass_lb,StepCount_count,DistanceWalkingRunning_mi,BasalEnergyBurned_Cal,ActiveEnergyBurned_Cal,FlightsClimbed_count,DietaryFatTotal_g,DietaryFatPolyunsaturated_g,...,RunningPower_W_mean,RunningPower_W_min,RunningPower_W_max,HeartRateVariabilitySDNN_ms_mean,HeartRateVariabilitySDNN_ms_min,HeartRateVariabilitySDNN_ms_max,SleepAnalysis_AsleepCore_hrs,SleepAnalysis_AsleepDeep_hrs,SleepAnalysis_AsleepREM_hrs,SleepAnalysis_Awake_hrs
2023-07-23,473.176,,,12858.0,6.175728,2032.022,615.7960,19.0,228.96,19.0,...,,,,48.104756,24.73500,82.9265,4.958333,1.033333,1.816667,0.075000
2023-07-24,0.000,26.7,,11354.0,5.570903,2076.231,689.7560,19.0,211.85,0.0,...,,,,59.130508,16.03640,168.1800,5.825000,0.675000,2.125000,0.175000
2023-07-25,0.000,0.0,,9960.0,4.792149,2084.157,540.9530,24.0,59.65,0.0,...,,,,63.203467,30.40250,123.2030,3.525000,1.183333,1.791667,0.016667
2023-07-26,0.000,0.0,,10316.0,5.045083,2087.316,774.5730,16.0,158.82,0.0,...,,,,57.859967,27.56430,101.3990,4.950000,0.841667,1.541667,0.066667
2023-07-27,0.000,0.0,,10599.0,5.087486,2111.616,801.1600,14.0,62.34,0.0,...,,,,43.415391,28.00220,75.4496,4.975000,0.941667,2.158333,0.175000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-02,,0.0,174.6,13416.0,6.533640,2048.925,1651.9890,16.0,162.40,4.8,...,,,,60.246540,25.34630,171.9290,0.000000,0.000000,0.000000,0.000000
2024-03-03,,0.0,175.0,15876.0,7.722016,2048.189,1443.2150,22.0,119.80,2.6,...,,,,57.224815,9.10452,121.8720,3.925000,1.108333,1.966667,0.300000
2024-03-04,,0.0,175.7,8191.0,4.051709,1983.933,499.0720,4.0,175.40,15.8,...,,,,63.538830,24.34150,110.1320,0.000000,0.000000,0.000000,0.000000
2024-03-05,,0.0,174.2,8882.0,4.448750,2009.083,566.5723,9.0,177.20,9.6,...,,,,43.409440,28.54260,73.7540,4.775000,0.816667,1.858333,2.683333


In [227]:
filepath = 'pre_kaggle/merge_health_4_17.csv'

# Export the DataFrame to the specified file
merge_df.to_csv(filepath)
