# GUESS MY WEIGHT 
A program to predict my weight from my health data

![guess_your_weight.gif](images/guess_your_weight.gif)

## Overview
Health and wellness is big business, and weight loss in particular. We’re all trying because it’s very, very hard. I recently went on my own weight loss journey, losing about 50 lbs in roughly 18 months. Weighing myself every morning, I agonized over every tenth of a pound, recording it in an app on my phone. I realized that losing big chunks of weight starts with small, incremental progress on the scale. But I didn’t stop there. As a data nerd I thought, “let’s record every meal.” So I did that too. I wondered… given all this data I have, could I predict my weight? My watch and phone capture my exercise, sleep, eating, and so much more. There must be trends here. At a minimum, I should be able to predict whether my weight will go up or down from the previous day. So let’s do it.

## Data Understanding
I have much (and probably too much) of this data on my iPhone and Apple Watch. It contains weight information, workouts, heart rate, and meals broken down into subcategories (proteins, fats, etc.). Most important is the weight: that is the variable I will primarily use for classification.

Because it’s my data, there’s more clarity about data entry methods, although it is more subjective than a controlled experiment with many participants. I know what data I was diligent about collecting, so I should be able to scrub it appropriately. For instance, I didn’t record my fluids (water, tea, coffee) consistently. Water consumption is a big part of this, so I’ll have to be clear about the gaps in the data.

### Weigh-In Protocol
The routine for entering the weigh-in was pretty basic. I recorded my weight in a third-party app, using the same bathroom scale, in the morning before drinking any fluids but after urinating. A morning weigh-in works well because it's a simple routine. More importantly, though, you likely weigh the least then because you're dehydrated after a night of sleep.

### Apple Health Data
Besides the weigh-in and meal logging, all of the other data is generated by Apple's proprietary software. I cannot speak to its accuracy.

### Meal Logging
All of the meal logging was done to the best of my ability, using judgements about serving sizes, volumes, weights, etc. A kitchen scale was incorporated after January, so the measurements should have improved in accuracy after that time. There are certain weeks with no data, especially around holidays and weekends. You'll have to do your best there.

### Data scrubbing and transfer to Kaggle
To execute this project, personal data was exported from the iPhone, scrubbed, and uploaded to Kaggle for storage. The file is approximately 40 MB, so it needs a public location from which it can be easily downloaded.
#### Data Export from the Source
To begin this project, I AirDropped my health data from my iPhone to my personal laptop.
#### Data Import to Jupyter Notebook
So, we know we're dealing with an .xml file. We'll use the ElementTree (ET) module to bring it into our notebook and then convert it to a pandas DataFrame.

In [81]:
#import relevant functions
import pandas as pd
import xml.etree.ElementTree as ET

In [82]:
#extract data from the xml file and assign the root of the tree
tree = ET.parse("data_raw/export.xml")
root = tree.getroot()

In [83]:
#create pandas dataframe from list of health records
health_records = [x.attrib for x in root.iter('Record')]
record_data = pd.DataFrame(health_records)

In [84]:
#review dataframe to do high level inspection
record_data.head()

| | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | value | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-06-01 13:20:27 -0500 | 2022-05-31 23:00:00 -0500 | 2022-05-31 23:00:00 -0500 | 354.84 | |
| 1 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-11 09:43:30 -0500 | 2022-07-10 23:00:00 -0500 | 2022-07-10 23:00:00 -0500 | 1064.52 | |
| 2 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-13 20:57:54 -0500 | 2022-07-12 23:00:00 -0500 | 2022-07-12 23:00:00 -0500 | 2129.04 | |
| 3 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-14 12:42:54 -0500 | 2022-07-13 23:00:00 -0500 | 2022-07-13 23:00:00 -0500 | 946.24 | |
| 4 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-16 18:11:29 -0500 | 2022-07-15 23:00:00 -0500 | 2022-07-15 23:00:00 -0500 | 2129.04 | |


Looking at the above DataFrame, we can see the entries come in 9 columns, excluding the index. Each entry appears to contain 3 date values, as well as information on the source. The device information is likely blocked out. It looks like we only need the `type` column, one date column, and the `value`. The `sourceName`, `sourceVersion`, and additional timestamps are not needed.
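
As a rough illustration of keeping only the fields named above, the sketch below parses a tiny XML snippet and retains just three attributes. The snippet reuses the first two rows shown above, but its layout (including the `HealthData` root element) and the choice of `startDate` as the single date column are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Small stand-in for export.xml; the element layout is an assumption.
sample_xml = """
<HealthData>
  <Record type="HKQuantityTypeIdentifierDietaryWater" sourceName="MyPlate"
          sourceVersion="4" unit="mL"
          creationDate="2022-06-01 13:20:27 -0500"
          startDate="2022-05-31 23:00:00 -0500"
          endDate="2022-05-31 23:00:00 -0500" value="354.84"/>
  <Record type="HKQuantityTypeIdentifierDietaryWater" sourceName="MyPlate"
          sourceVersion="4" unit="mL"
          creationDate="2022-07-11 09:43:30 -0500"
          startDate="2022-07-10 23:00:00 -0500"
          endDate="2022-07-10 23:00:00 -0500" value="1064.52"/>
</HealthData>
""".strip()

# Assumption: startDate is the one date column worth keeping.
keep = ("type", "startDate", "value")

root = ET.fromstring(sample_xml)
# Build one small dict per Record, ignoring every other attribute.
slim_records = [{key: rec.attrib.get(key) for key in keep}
                for rec in root.iter("Record")]

print(slim_records[0])
```

Filtering at parse time like this would keep the resulting DataFrame smaller, though the notebook itself loads every attribute first and drops columns afterwards.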

### Data Scrubbing prior to Kaggle:
As we mentioned above, we have some columns we'll delete. But before then, we need to focus our data on a date range that is relevant to our weight loss. Let's search for the information related to Body Mass and see what we can find.
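
One possible way to look for the Body Mass entries is to filter on the HealthKit type identifier and check the first and last dates. This is only a sketch: `sample` is a made-up stand-in for `record_data`, the weight values are placeholders, and it assumes weigh-ins are stored under `HKQuantityTypeIdentifierBodyMass` in this export.

```python
import pandas as pd

# Hypothetical stand-in for record_data with placeholder values.
sample = pd.DataFrame({
    "type": ["HKQuantityTypeIdentifierDietaryWater",
             "HKQuantityTypeIdentifierBodyMass",
             "HKQuantityTypeIdentifierBodyMass"],
    "startDate": ["2022-05-31 23:00:00 -0500",
                  "2023-07-23 06:30:00 -0500",
                  "2023-07-24 06:31:00 -0500"],
    "value": ["354.84", "200.0", "199.5"],
})

# Keep only the weigh-in rows.
body_mass = sample[sample["type"] == "HKQuantityTypeIdentifierBodyMass"]

# The timestamps share one format and offset, so string order matches date order.
first_weigh_in = body_mass["startDate"].min()
last_weigh_in = body_mass["startDate"].max()

print(first_weigh_in, last_weigh_in)
```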

In [98]:
record_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2036997 entries, 0 to 2036996
Data columns (total 9 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   type           object
 1   sourceName     object
 2   sourceVersion  object
 3   unit           object
 4   creationDate   object
 5   startDate      object
 6   endDate        object
 7   value          object
 8   device         object
dtypes: object(9)
memory usage: 139.9+ MB


So... we have many entries, approximately 2M, and they're all generic string objects. There's a little scrubbing to do ahead of time. We'll go ahead and delete the 3 columns we don't need: `sourceName`, `sourceVersion`, and `device`.

In [152]:
# drop sourceName, sourceVersion, and device
records_df = record_data.drop(['sourceName', 'sourceVersion', 'device'], axis = 1)

In [153]:
#convert the time/date columns to datetime without time zone
#(tz_localize(None) keeps the local wall-clock time and drops the UTC offset)
records_df['creationDate'] = pd.to_datetime(records_df['creationDate']).dt.tz_localize(None)
records_df['startDate'] = pd.to_datetime(records_df['startDate']).dt.tz_localize(None)
records_df['endDate'] = pd.to_datetime(records_df['endDate']).dt.tz_localize(None)
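
To make the time zone step concrete, here is a minimal sketch on two timestamps taken from the rows shown earlier. It contrasts `tz_localize(None)`, which keeps the local wall-clock time, with `tz_convert(None)`, which shifts to UTC before dropping the zone. The explicit `format` string is an assumption about how the export writes its dates.

```python
import pandas as pd

raw = pd.Series(["2022-06-01 13:20:27 -0500", "2022-07-11 09:43:30 -0500"])

# Parse the strings into time-zone-aware timestamps.
aware = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z")

# Drop the offset but keep the local clock time (what the notebook does).
local_naive = aware.dt.tz_localize(None)

# Alternative: convert to UTC first, then drop the zone.
utc_naive = aware.dt.tz_convert(None)

print(local_naive.iloc[0], utc_naive.iloc[0])
```

Keeping local time seems the better fit here, since meals and weigh-ins are tied to the time of day they happened.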

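Okay, the next goal is to convert the `value` to floats. First, though, let's narrow the data to the relevant date range by dropping every record that starts before the cut-off of July 23, 2023.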

In [161]:
import datetime

#keep only records that start on or after the cut-off date
cut_off = datetime.datetime(2023, 7, 23, 0, 0, 0, 0, tzinfo = None)

#drop rows (axis = 0) whose startDate falls before the cut-off
records_df = records_df.drop(records_df[(records_df['startDate'] < cut_off)].index, axis = 0)
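
The same date filter can also be written as a boolean mask, which avoids the `drop` call and its `axis` argument. A small sketch, with `sample` as a made-up stand-in for `records_df`:

```python
import datetime
import pandas as pd

cut_off = datetime.datetime(2023, 7, 23)

# Hypothetical rows straddling the cut-off.
sample = pd.DataFrame({
    "startDate": pd.to_datetime(["2023-07-22 23:59:59",
                                 "2023-07-23 00:00:00",
                                 "2023-08-01 06:30:00"]),
    "value": ["1", "2", "3"],
})

# Keep rows on or after the cut-off; the original index labels are preserved.
recent = sample[sample["startDate"] >= cut_off]

print(recent)
```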

In [162]:
records_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 856985 entries, 177 to 2036996
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   type          856985 non-null  object        
 1   unit          843298 non-null  object        
 2   creationDate  856985 non-null  datetime64[ns]
 3   startDate     856985 non-null  datetime64[ns]
 4   endDate       856985 non-null  datetime64[ns]
 5   value         856985 non-null  object        
dtypes: datetime64[ns](3), object(3)
memory usage: 45.8+ MB


In [173]:
records_df = records_df.reset_index()

In [192]:
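#string comparison trick: values that start with a digit sort below this threshold,
#so dropping them leaves only the text values such as sleep stages
classifiers = records_df.drop(records_df[records_df['value'] < '99999999.9999999'].index, axis = 0)
classifiers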

| | index | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|---|
| 793917 | 1899798 | HKCategoryTypeIdentifierSleepAnalysis | | 2023-07-24 05:30:00 | 2023-07-23 21:52:17 | 2023-07-23 21:53:47 | HKCategoryValueSleepAnalysisInBed |
| 793918 | 1899799 | HKCategoryTypeIdentifierSleepAnalysis | | 2023-07-24 05:30:00 | 2023-07-23 22:13:46 | 2023-07-23 22:13:59 | HKCategoryValueSleepAnalysisInBed |
| 793919 | 1899800 | HKCategoryTypeIdentifierSleepAnalysis | | 2023-07-24 05:30:00 | 2023-07-23 22:14:56 | 2023-07-23 22:58:00 | HKCategoryValueSleepAnalysisInBed |
| 793920 | 1899801 | HKCategoryTypeIdentifierSleepAnalysis | | 2023-07-24 07:05:22 | 2023-07-23 23:12:48 | 2023-07-23 23:40:48 | HKCategoryValueSleepAnalysisAsleepCore |
| 793921 | 1899802 | HKCategoryTypeIdentifierSleepAnalysis | | 2023-07-24 07:05:22 | 2023-07-23 23:40:48 | 2023-07-24 00:06:18 | HKCategoryValueSleepAnalysisAsleepDeep |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 807599 | 1921352 | HKCategoryTypeIdentifierAudioExposureEvent | | 2024-01-21 11:51:53 | 2024-01-21 11:49:03 | 2024-01-21 11:51:53 | HKCategoryValueEnvironmentalAudioExposureEvent... |
| 807600 | 1921353 | HKCategoryTypeIdentifierAudioExposureEvent | | 2024-02-09 10:46:59 | 2024-02-09 10:44:04 | 2024-02-09 10:46:59 | HKCategoryValueEnvironmentalAudioExposureEvent... |
| 807601 | 1921354 | HKCategoryTypeIdentifierAudioExposureEvent | | 2024-02-12 15:42:33 | 2024-02-12 15:39:43 | 2024-02-12 15:42:33 | HKCategoryValueEnvironmentalAudioExposureEvent... |
| 807602 | 1921355 | HKCategoryTypeIdentifierAudioExposureEvent | | 2024-02-13 08:30:05 | 2024-02-13 08:27:10 | 2024-02-13 08:30:05 | HKCategoryValueEnvironmentalAudioExposureEvent... |


In [193]:
#drop the text values (which sort above this threshold as strings) and keep the numeric strings
numbers = records_df.drop(records_df[records_df['value'] > '99999999.9999999'].index, axis = 0)

| | index | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|---|
| 671690 | 1645644 | HKQuantityTypeIdentifierDietaryThiamin | mg | 2023-11-29 23:35:51 | 2023-11-29 18:00:00 | 2023-11-29 18:00:00 | 0 |
| 660014 | 1628309 | HKQuantityTypeIdentifierDietaryVitaminA | mcg | 2023-09-28 15:30:55 | 2023-09-28 11:00:00 | 2023-09-28 11:00:00 | 0 |
| 660019 | 1628314 | HKQuantityTypeIdentifierDietaryVitaminA | mcg | 2023-09-29 10:26:59 | 2023-09-29 07:00:00 | 2023-09-29 07:00:00 | 0 |
| 819914 | 1996404 | HKQuantityTypeIdentifierDietaryVitaminC | mg | 2023-09-26 10:48:52 | 2023-09-26 14:00:00 | 2023-09-26 14:00:00 | 0 |
| 660021 | 1628316 | HKQuantityTypeIdentifierDietaryVitaminA | mcg | 2023-09-29 19:54:45 | 2023-09-29 17:00:00 | 2023-09-29 17:00:00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 682618 | 1663374 | HKQuantityTypeIdentifierDietaryPotassium | mg | 2023-09-11 21:52:50 | 2023-09-11 17:00:00 | 2023-09-11 17:00:00 | 992 |
| 815406 | 1991896 | HKQuantityTypeIdentifierDietaryPotassium | mg | 2023-09-11 21:52:50 | 2023-09-11 17:00:00 | 2023-09-11 17:00:00 | 992 |
| 814278 | 1990768 | HKQuantityTypeIdentifierDietaryPotassium | mg | 2023-09-07 21:13:59 | 2023-09-07 17:00:00 | 2023-09-07 17:00:00 | 992 |
| 646748 | 1585531 | HKQuantityTypeIdentifierDietarySodium | mg | 2023-10-07 10:04:32 | 2023-10-06 17:00:00 | 2023-10-06 17:00:00 | 999 |


Fantastic, so we have two dataframes: one with numeric values and one with categorical values. Let's go ahead and do some feature engineering on the `value` information.
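
The numeric/text split relies on comparing strings against a threshold. A more explicit alternative, sketched here on a made-up `sample` frame rather than the real `records_df`, is to let `pd.to_numeric` decide which values parse as numbers. This is a swapped-in technique, not what the notebook runs.

```python
import pandas as pd

# Hypothetical stand-in mixing numeric strings and category labels.
sample = pd.DataFrame({
    "type": ["HKQuantityTypeIdentifierDietarySodium",
             "HKCategoryTypeIdentifierSleepAnalysis",
             "HKQuantityTypeIdentifierDietaryWater"],
    "value": ["999", "HKCategoryValueSleepAnalysisInBed", "354.84"],
})

# Values that cannot be parsed become NaN instead of raising an error.
as_number = pd.to_numeric(sample["value"], errors="coerce")
is_numeric = as_number.notna()

# Numeric rows get real floats; the remaining rows keep their text labels.
numeric_rows = sample[is_numeric].assign(value=as_number[is_numeric])
categorical_rows = sample[~is_numeric]

print(numeric_rows)
print(categorical_rows)
```

This also converts the numeric values to floats in the same step, so no separate conversion is needed afterwards.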

In [202]:
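#work on a copy so the original classifiers dataframe is left untouched
classifier_scrubbed = classifiers.copy()
#move the category label into `unit` and flag each occurrence with a 1
classifier_scrubbed['unit'] = classifiers['value']
classifier_scrubbed['value'] = 1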

In [210]:
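#recombine the categorical and numeric records, then drop the helper index column
final_record = pd.concat([classifier_scrubbed, numbers])
final_records = final_record.drop('index', axis = 1)
final_records.info()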

<class 'pandas.core.frame.DataFrame'>
Index: 856985 entries, 793917 to 856984
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   type          856985 non-null  object        
 1   unit          856985 non-null  object        
 2   creationDate  856985 non-null  datetime64[ns]
 3   startDate     856985 non-null  datetime64[ns]
 4   endDate       856985 non-null  datetime64[ns]
 5   value         856985 non-null  object        
dtypes: datetime64[ns](3), object(3)
memory usage: 45.8+ MB


In [212]:
import pickle
#save the cleaned records; the with-block closes the file once the dump finishes
with open('final_records.p', 'wb') as f:
    pickle.dump(final_records, f)

So, it appears there's no other information in the Sleep Analysis value that we need to worry about. So we can convert the final records to a CSV file.

In [None]:
#to_csv returns None when given a path, so there is nothing to assign
final_records.to_csv('weight_data.csv', index = True)
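
One thing to keep in mind when reading the CSV back is that datetime columns come back as plain strings unless they are parsed again. The sketch below round-trips a one-row, made-up frame through an in-memory buffer; the buffer and `sample` are stand-ins for `weight_data.csv` and `final_records`.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for final_records.
sample = pd.DataFrame({
    "type": ["HKQuantityTypeIdentifierDietaryWater"],
    "startDate": pd.to_datetime(["2023-07-23 06:30:00"]),
    "value": ["354.84"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["startDate"])

print(reloaded.dtypes)
```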