# GUESS MY WEIGHT 
A program to predict my weight from my health data

![guess_your_weight.gif](images/guess_your_weight.gif)

## Overview
Health and wellness is big business, weight loss especially. We’re all trying because it’s very, very hard. I recently went on my own weight loss journey, losing about 50 lbs in roughly 18 months. Weighing myself every morning, I agonized over every tenth of a pound and recorded it in an app on my phone. I realized that losing big chunks of weight starts with small, incremental progress on the scale. But I didn’t stop there. As a data nerd I thought, “let’s record every meal.” So I did that too. I wondered: given all this data I have, could I predict my weight? My watch and phone capture my exercise, sleep, eating, and so much more. There must be trends here. At a minimum, I should be able to predict whether my weight will go up or down from the previous day. So let’s do it.

## Data Understanding
I have plenty (probably too much) of this data on my iPhone and Apple Watch. It contains weight information, workouts, heart rate, and meals broken down into subcategories (proteins, fats, etc.). Most important is the weight. That will be the variable I primarily use for classification.
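As a rough sketch of what "classification" could mean here, the daily weigh-ins can be turned into an up/down label by comparing each day with the previous one. The weights and the names `weights` and `went_up` below are hypothetical placeholders, not values from my export.

```python
import pandas as pd

# Hypothetical daily weigh-ins in lb; the real values come from the health export
weights = pd.Series(
    [201.4, 200.8, 201.1, 200.2],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
    name="weight_lb",
)

# Day-over-day change; the first day has no previous value, so it is dropped
change = weights.diff().dropna()

# 1 if the weight went up from the previous day, 0 otherwise
went_up = (change > 0).astype(int)
print(went_up)
```

Each label lines up with the date of the later weigh-in, so features from the preceding day could be joined on that date.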

Because it’s my data, there’s more clarity about data entry methods, though it is more subjective than a controlled experiment with many participants. I know what data I was diligent about collecting, so I should be able to scrub it appropriately. For instance, I didn’t record my fluids consistently: water, tea, coffee. Water consumption is a big part of this, so I’ll have to be clear about the gaps in the data.

### Weigh-In Protocol
The routine for entering the weigh-in was pretty basic. I recorded my weight in a third-party app, using the same bathroom scale, in the morning before drinking any fluids but after urination. A morning weigh-in works well because it's a simple routine. More importantly, though, you likely weigh the least then because you're dehydrated after a night of sleep.

### Apple Health Data
Besides the weigh-in and meal logging, all of the other data is generated by Apple's proprietary software. I cannot speak to its accuracy.

### Meal Logging
All of the meal logging was done to the best of my ability using judgements about serving sizes, volume, weights, etc. A kitchen scale was incorporated after January, so the measurements should be more accurate after that time. There are certain weeks where there is no data, especially around holidays and weekends. Those gaps will have to be handled as well as possible.

### Data scrubbing and transfer to Kaggle
To execute this project, personal data was exported from the iPhone, scrubbed, and uploaded to Kaggle for storage. The file is approximately 40 MB, so it is kept in a public area where it can be easily downloaded.
#### Data Export from the Source
To begin this project, I AirDropped my health data from my iPhone to my personal laptop.
#### Data Import to Jupyter Notebook
So, we know we're dealing with an .xml file. We'll use the ElementTree (ET) module to bring it into our notebook and then convert it to a pandas DataFrame.

In [3]:
#import relevant functions
import pandas as pd
import xml.etree.ElementTree as ET

In [70]:
#extract data from the xml file and assign the root of the tree
tree = ET.parse("data_raw/export.xml")
root = tree.getroot()

In [71]:
#create pandas dataframe from list of health records
health_records = [x.attrib for x in root.iter('Record')]
record_data = pd.DataFrame(health_records)

In [72]:
#review dataframe to do high level inspection
record_data.head()

| | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | value | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-06-01 13:20:27 -0500 | 2022-05-31 23:00:00 -0500 | 2022-05-31 23:00:00 -0500 | 354.84 | |
| 1 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-11 09:43:30 -0500 | 2022-07-10 23:00:00 -0500 | 2022-07-10 23:00:00 -0500 | 1064.52 | |
| 2 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-13 20:57:54 -0500 | 2022-07-12 23:00:00 -0500 | 2022-07-12 23:00:00 -0500 | 2129.04 | |
| 3 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-14 12:42:54 -0500 | 2022-07-13 23:00:00 -0500 | 2022-07-13 23:00:00 -0500 | 946.24 | |
| 4 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-16 18:11:29 -0500 | 2022-07-15 23:00:00 -0500 | 2022-07-15 23:00:00 -0500 | 2129.04 | |


Looking at the above DataFrame, we can see the entries have come in 9 columns, excluding the index. Each entry appears to contain 3 date values, as well as information on the source. The device information appears to be blank, at least in these rows. It looks like we only need the `type` column, one date column, and the `value`. The sourceName, sourceVersion, and additional timestamps are not needed.
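To make the parsing step concrete, here is a minimal sketch that runs on a tiny inline sample instead of the real export. The `sample_xml` string is a hypothetical two-record stand-in built from the rows shown above, and keeping `startDate` as the single date column is an assumption for illustration.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical two-record sample in the same shape as the export
sample_xml = """<HealthData>
  <Record type="HKQuantityTypeIdentifierDietaryWater" sourceName="MyPlate" sourceVersion="4" unit="mL" creationDate="2022-06-01 13:20:27 -0500" startDate="2022-05-31 23:00:00 -0500" endDate="2022-05-31 23:00:00 -0500" value="354.84"/>
  <Record type="HKQuantityTypeIdentifierDietaryWater" sourceName="MyPlate" sourceVersion="4" unit="mL" creationDate="2022-07-11 09:43:30 -0500" startDate="2022-07-10 23:00:00 -0500" endDate="2022-07-10 23:00:00 -0500" value="1064.52"/>
</HealthData>"""

# Each Record element stores its fields as XML attributes,
# so .attrib gives one dict per row
root = ET.fromstring(sample_xml)
records = pd.DataFrame([rec.attrib for rec in root.iter("Record")])

# Keep only the columns identified above (startDate is an assumed choice)
slim = records[["type", "startDate", "value"]]
print(slim)
```

Note that every attribute arrives as a string, which is why the dtypes need attention later.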

### Data Scrubbing prior to Kaggle:
As we mentioned above, we have some columns we'll delete. But before then, we need to focus our data on a date range that is relevant to our weight loss. Let's search out the information related to Body Mass and see what we can find.
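One possible way to find that date range is sketched below on a hypothetical four-row frame. It assumes the weigh-ins are stored under the HealthKit type `HKQuantityTypeIdentifierBodyMass`, which should be checked against the actual export. All values and dates here are made up.

```python
import pandas as pd

# Hypothetical records; values and dates are placeholders
records = pd.DataFrame({
    "type": [
        "HKQuantityTypeIdentifierBodyMass",
        "HKQuantityTypeIdentifierDietaryWater",
        "HKQuantityTypeIdentifierBodyMass",
        "HKQuantityTypeIdentifierDietaryWater",
    ],
    "startDate": pd.to_datetime(["2022-07-01", "2022-05-31", "2024-03-01", "2023-01-15"]),
    "value": ["250.0", "354.84", "200.0", "946.24"],
})

# The first and last weigh-in bound the period we care about
body_mass = records[records["type"] == "HKQuantityTypeIdentifierBodyMass"]
first_weigh_in = body_mass["startDate"].min()
last_weigh_in = body_mass["startDate"].max()

# Keep every record (of any type) that falls inside that window, inclusive
in_range = records[records["startDate"].between(first_weigh_in, last_weigh_in)]
print(in_range)
```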

In [73]:
record_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2036997 entries, 0 to 2036996
Data columns (total 9 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   type           object
 1   sourceName     object
 2   sourceVersion  object
 3   unit           object
 4   creationDate   object
 5   startDate      object
 6   endDate        object
 7   value          object
 8   device         object
dtypes: object(9)
memory usage: 139.9+ MB


So... we have many entries, approximately 2M, and they're all generic string objects. There's a little scrubbing to do ahead of time. We'll go ahead and delete those 3 columns.

In [74]:
# drop sourceName, sourceVersion, and device
record_data.drop(['sourceName', 'sourceVersion', 'device'], axis = 1, inplace = True)

In [75]:
# convert the three timestamp columns from strings to timezone-aware datetimes
time_cols = ['creationDate', 'startDate', 'endDate']
record_data[time_cols] = record_data[time_cols].apply(pd.to_datetime)

In [76]:
record_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2036997 entries, 0 to 2036996
Data columns (total 6 columns):
 #   Column        Dtype                    
---  ------        -----                    
 0   type          object                   
 1   unit          object                   
 2   creationDate  datetime64[ns, UTC-05:00]
 3   startDate     datetime64[ns, UTC-05:00]
 4   endDate       datetime64[ns, UTC-05:00]
 5   value         object                   
dtypes: datetime64[ns, UTC-05:00](3), object(3)
memory usage: 93.2+ MB
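Because the timestamps carry a UTC offset, the calendar day a record belongs to depends on whether you read it in local time or UTC. The sketch below uses two `startDate` strings from the rows above. Grouping by the local day is my assumption about what makes sense for daily totals.

```python
import pandas as pd

# Two startDate strings copied from the sample rows above
stamps = pd.Series(["2022-05-31 23:00:00 -0500", "2022-07-10 23:00:00 -0500"])

# An explicit format with %z keeps the offset, giving timezone-aware values
local = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S %z")

# Calendar day as recorded (local) versus the same instant expressed in UTC
local_day = local.dt.date
utc_day = local.dt.tz_convert("UTC").dt.date

print(local_day.tolist())
print(utc_day.tolist())
```

An entry logged at 23:00 local time lands on the next day in UTC, so daily aggregates should be built from one consistent choice.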


Okay, now let's convert the `value` column to floats. We are fairly confident this is appropriate, but let's inspect it first.

In [77]:
record_data['value'].describe()

count     2036997
unique     242599
top             1
freq        65281
Name: value, dtype: object
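Before casting, it can help to check which entries would fail to convert. A minimal sketch on a hypothetical four-value sample uses `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical sample mixing numeric strings with a category label
values = pd.Series(["354.84", "1", "HKCategoryValueSleepAnalysisInBed", "1064.52"])

# Non-numeric strings become NaN rather than raising an error
as_float = pd.to_numeric(values, errors="coerce")

# The original strings that did not convert
non_numeric = values[as_float.isna()]
print(non_numeric.unique())
```

If `non_numeric` is empty on the real column, a plain float conversion is safe. Otherwise those values need their own handling first.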

In [78]:
record_data[record_data['type'] == 'HKCategoryTypeIdentifierSleepAnalysis']

| | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|
| 1892698 | HKCategoryTypeIdentifierSleepAnalysis | | 2022-07-03 06:00:17-05:00 | 2022-07-02 20:37:03-05:00 | 2022-07-02 21:31:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892699 | HKCategoryTypeIdentifierSleepAnalysis | | 2022-07-03 06:00:17-05:00 | 2022-07-02 21:39:33-05:00 | 2022-07-02 22:12:03-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892700 | HKCategoryTypeIdentifierSleepAnalysis | | 2022-07-03 06:00:17-05:00 | 2022-07-02 22:21:03-05:00 | 2022-07-02 22:36:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892701 | HKCategoryTypeIdentifierSleepAnalysis | | 2022-07-03 06:00:17-05:00 | 2022-07-02 22:37:33-05:00 | 2022-07-02 22:39:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892702 | HKCategoryTypeIdentifierSleepAnalysis | | 2022-07-03 06:00:17-05:00 | 2022-07-02 23:03:03-05:00 | 2022-07-02 23:09:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| ... | ... | ... | ... | ... | ... | ... |
| 1907937 | HKCategoryTypeIdentifierSleepAnalysis | | 2024-03-06 07:12:32-05:00 | 2024-03-06 06:49:57-05:00 | 2024-03-06 06:51:26-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907938 | HKCategoryTypeIdentifierSleepAnalysis | | 2024-03-06 07:12:32-05:00 | 2024-03-06 06:54:57-05:00 | 2024-03-06 06:59:21-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907939 | HKCategoryTypeIdentifierSleepAnalysis | | 2024-03-06 07:12:32-05:00 | 2024-03-06 07:06:57-05:00 | 2024-03-06 07:10:32-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907940 | HKCategoryTypeIdentifierSleepAnalysis | | 2024-03-06 07:12:32-05:00 | 2024-03-06 07:11:03-05:00 | 2024-03-06 07:12:27-05:00 | HKCategoryValueSleepAnalysisInBed |


In [83]:
# summarize the sleep analysis records
record_data[record_data['type'] == 'HKCategoryTypeIdentifierSleepAnalysis'].describe()

| | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|
| count | 15244 | 0.0 | 15244 | 15244 | 15244 | 15244 |
| unique | 1 | 0.0 | | | | 6 |
| top | HKCategoryTypeIdentifierSleepAnalysis | | | | | HKCategoryValueSleepAnalysisInBed |
| freq | 15244 | | | | | 4825 |
| mean | | | 2023-07-23 19:42:21.696601856-05:00 | 2023-07-23 14:48:10.393400576-05:00 | 2023-07-23 15:31:36.053201408-05:00 | |
| min | | | 2022-07-03 06:00:17-05:00 | 2022-07-02 20:30:52-05:00 | 2022-07-02 21:31:33-05:00 | |
| 25% | | | 2023-04-01 05:38:46-05:00 | 2023-04-01 03:24:36-05:00 | 2023-04-01 03:45:28.500000-05:00 | |
| 50% | | | 2023-08-08 07:22:41-05:00 | 2023-08-08 07:01:33-05:00 | 2023-08-08 07:20:33-05:00 | |
| 75% | | | 2023-11-22 07:25:07-05:00 | 2023-11-22 04:59:10-05:00 | 2023-11-22 05:54:25-05:00 | |
| max | | | 2024-03-06 07:12:34-05:00 | 2024-03-06 07:12:28-05:00 | 2024-03-06 07:12:32-05:00 | |


In [86]:
unique_list = record_data[record_data['type'] == 'HKCategoryTypeIdentifierSleepAnalysis']['value'].unique()
unique_list

array(['HKCategoryValueSleepAnalysisAsleepUnspecified',
       'HKCategoryValueSleepAnalysisInBed',
       'HKCategoryValueSleepAnalysisAsleepCore',
       'HKCategoryValueSleepAnalysisAsleepDeep',
       'HKCategoryValueSleepAnalysisAwake',
       'HKCategoryValueSleepAnalysisAsleepREM'], dtype=object)

There does appear to be a piece of data in `value` called `HKCategoryValueSleepAnalysisInBed`. This appears to be a binary classification telling us whether or not the person is in bed. Instead of the hours slept, it's whether or not I'm in bed at this particular moment. Sleep is tricky... so let's do a little more examination of this.
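Even though each record is only a label, every record also has a `startDate` and an `endDate`, so hours can be recovered from the interval length. The sketch below shows the idea on three hypothetical intervals. The timestamps are made up, and whether stages overlap with `InBed` in the real export is something I have not verified.

```python
import pandas as pd

# Hypothetical sleep intervals for one night
sleep = pd.DataFrame({
    "value": [
        "HKCategoryValueSleepAnalysisInBed",
        "HKCategoryValueSleepAnalysisAsleepCore",
        "HKCategoryValueSleepAnalysisAsleepDeep",
    ],
    "startDate": pd.to_datetime(
        ["2024-03-05 22:00:00", "2024-03-05 22:30:00", "2024-03-06 01:30:00"]
    ),
    "endDate": pd.to_datetime(
        ["2024-03-06 06:00:00", "2024-03-06 01:30:00", "2024-03-06 02:30:00"]
    ),
})

# Length of each interval in hours
sleep["hours"] = (sleep["endDate"] - sleep["startDate"]).dt.total_seconds() / 3600

# Total hours per sleep label
hours_by_stage = sleep.groupby("value")["hours"].sum()
print(hours_by_stage)
```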

In [87]:
def categorize_unit(row):
    # map each sleep analysis value to a short label
    if row['value'] == 'HKCategoryValueSleepAnalysisAsleepUnspecified':
        return 'AsleepUnspecified'
    elif row['value'] == 'HKCategoryValueSleepAnalysisInBed':
        return 'inBed'
    elif row['value'] == 'HKCategoryValueSleepAnalysisAsleepCore':
        return 'AsleepCore'
    elif row['value'] == 'HKCategoryValueSleepAnalysisAsleepDeep':
        return 'AsleepDeep'
    elif row['value'] == 'HKCategoryValueSleepAnalysisAwake':
        return 'isAwake'
    elif row['value'] == 'HKCategoryValueSleepAnalysisAsleepREM':
        return 'asleepREM'
    # every other record keeps its existing unit (mL, etc.)
    return row['unit']

record_data['unit'] = record_data.apply(categorize_unit, axis=1)
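The row-by-row `apply` above can be slow on roughly 2M records. A possible vectorized alternative, sketched on a hypothetical three-row frame, strips the shared `HKCategoryValueSleepAnalysis` prefix on the sleep rows only. Note that this produces labels like `InBed` and `AsleepREM`, which are capitalized differently from the ones used in the function.

```python
import pandas as pd

# Hypothetical frame: two sleep records and one water record
records = pd.DataFrame({
    "type": [
        "HKCategoryTypeIdentifierSleepAnalysis",
        "HKCategoryTypeIdentifierSleepAnalysis",
        "HKQuantityTypeIdentifierDietaryWater",
    ],
    "unit": [None, None, "mL"],
    "value": [
        "HKCategoryValueSleepAnalysisInBed",
        "HKCategoryValueSleepAnalysisAsleepREM",
        "354.84",
    ],
})

prefix = "HKCategoryValueSleepAnalysis"
is_sleep = records["type"] == "HKCategoryTypeIdentifierSleepAnalysis"

# Strip the common prefix on sleep rows only; other rows keep their unit
records.loc[is_sleep, "unit"] = (
    records.loc[is_sleep, "value"].str.replace(prefix, "", regex=False)
)
print(records["unit"].tolist())
```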

So, it appears there's no other information in the Sleep Analysis value that we need to worry about. So we can convert the `value` entries to integers.

In [88]:
record_data[record_data['type'] == 'HKCategoryTypeIdentifierSleepAnalysis']

| | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|
| 1892698 | HKCategoryTypeIdentifierSleepAnalysis | AsleepUnspecified | 2022-07-03 06:00:17-05:00 | 2022-07-02 20:37:03-05:00 | 2022-07-02 21:31:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892699 | HKCategoryTypeIdentifierSleepAnalysis | AsleepUnspecified | 2022-07-03 06:00:17-05:00 | 2022-07-02 21:39:33-05:00 | 2022-07-02 22:12:03-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892700 | HKCategoryTypeIdentifierSleepAnalysis | AsleepUnspecified | 2022-07-03 06:00:17-05:00 | 2022-07-02 22:21:03-05:00 | 2022-07-02 22:36:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892701 | HKCategoryTypeIdentifierSleepAnalysis | AsleepUnspecified | 2022-07-03 06:00:17-05:00 | 2022-07-02 22:37:33-05:00 | 2022-07-02 22:39:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| 1892702 | HKCategoryTypeIdentifierSleepAnalysis | AsleepUnspecified | 2022-07-03 06:00:17-05:00 | 2022-07-02 23:03:03-05:00 | 2022-07-02 23:09:33-05:00 | HKCategoryValueSleepAnalysisAsleepUnspecified |
| ... | ... | ... | ... | ... | ... | ... |
| 1907937 | HKCategoryTypeIdentifierSleepAnalysis | inBed | 2024-03-06 07:12:32-05:00 | 2024-03-06 06:49:57-05:00 | 2024-03-06 06:51:26-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907938 | HKCategoryTypeIdentifierSleepAnalysis | inBed | 2024-03-06 07:12:32-05:00 | 2024-03-06 06:54:57-05:00 | 2024-03-06 06:59:21-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907939 | HKCategoryTypeIdentifierSleepAnalysis | inBed | 2024-03-06 07:12:32-05:00 | 2024-03-06 07:06:57-05:00 | 2024-03-06 07:10:32-05:00 | HKCategoryValueSleepAnalysisInBed |
| 1907940 | HKCategoryTypeIdentifierSleepAnalysis | inBed | 2024-03-06 07:12:32-05:00 | 2024-03-06 07:11:03-05:00 | 2024-03-06 07:12:27-05:00 | HKCategoryValueSleepAnalysisInBed |


In [90]:
def inBedInt(value):
    # flag in-bed records as 1 and unspecified-sleep records as 0;
    # every other value is returned unchanged
    if value == 'HKCategoryValueSleepAnalysisInBed':
        return 1
    elif value == 'HKCategoryValueSleepAnalysisAsleepUnspecified':
        return 0
    return value

record_data['value'] = record_data['value'].map(inBedInt)