# GUESS MY WEIGHT 
A program to predict my weight from my health data

![guess_your_weight.gif](images/guess_your_weight.gif)

## Overview
Health and wellness is big business, and weight loss in particular. We’re all trying, and it’s very, very hard. I recently went on my own weight loss journey, losing about 50 lbs in roughly 18 months. Weighing myself every morning, I agonized over every tenth of a lb, recording it in an app on my phone. I realized that losing big chunks of weight starts with small, incremental progress on the scale. But I didn’t stop there. As a data nerd I thought, “let’s record every meal.” So I did that too. I wondered… given all this data I have, could I predict my weight? My watch and phone capture my exercise, sleep, eating, and so much more. There must be trends here. At a minimum, I should be able to predict whether my weight will go up or down from the previous day. So let’s do it.

## Data Understanding
I have much (probably too much) of this data on my iPhone and Apple Watch. It contains weight information, workouts, heart rate, and meals broken down into subcategories (proteins, fats, etc.). Most important is the weight, which is the feature I will primarily use for classification.

Because it’s my data, there’s more clarity about data entry methods. This is more subjective than a controlled experiment with many participants. I know what data I was diligent about collecting, so I should be able to scrub it appropriately. For instance, I didn’t record my fluids (water, tea, coffee) consistently. Water consumption is a big part of this, so I’ll have to be clear about the gaps in the data.

### Weigh-In Protocol
The routine for entering the weigh-in was pretty basic. I recorded my weight in a third-party app, using the same bathroom scale, before I drank any fluids in the morning but after urination. A morning weigh-in works well because it's a simple routine. More importantly, though, you likely weigh the least then because you're dehydrated after a night of sleep.

### Apple Health Data
Besides the weigh-in and meal logging, all of the other data is generated by Apple's proprietary software. I cannot speak to its accuracy.

### Meal Logging
All of the meal logging was done to the best of my ability using judgements about serving sizes, volume, weights, etc. A kitchen scale was incorporated after January, so the measurements should be more accurate after that time. There are certain weeks where there is no data, especially around holidays and weekends. You'll have to do your best there.

### Data Scrubbing and Transfer to Kaggle
To execute this project, personal data was pulled from the iPhone, scrubbed, and uploaded to Kaggle for storage. The file is approximately 40 MB, so Kaggle serves as a public location where it can be easily downloaded.
#### Data Export from the Source
To begin this project, I AirDropped my health data from my iPhone to my personal laptop.
#### Data Import to Jupyter Notebook
So, we know we're dealing with an .xml file. We'll use the ElementTree (ET) module to bring it into our notebook and then convert it to a pandas DataFrame.

In [1]:
#import relevant functions
import pandas as pd
import xml.etree.ElementTree as ET

In [2]:
#extract data from the xml file and assign the root of the tree
tree = ET.parse("data_raw/export.xml")
root = tree.getroot()

In [3]:
#create pandas dataframe from list of health records
health_records = [x.attrib for x in root.iter('Record')]
record_data = pd.DataFrame(health_records)
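
With roughly 2M records, parsing the whole tree at once can use a lot of memory. As a rough sketch of an alternative (not what this notebook does), `ET.iterparse` can stream the file and copy out each `Record` as it is read. The `SAMPLE_XML` string and the `records_to_frame` helper are made up for illustration; the real input would be `data_raw/export.xml`.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up stand-in for data_raw/export.xml (assumption: same Record attribute layout)
SAMPLE_XML = b"""<HealthData>
  <Record type="HKQuantityTypeIdentifierDietaryWater" unit="mL" startDate="2022-05-31 23:00:00 -0500" value="354.84"/>
  <Record type="HKQuantityTypeIdentifierBodyMass" unit="lb" startDate="2023-08-24 06:30:00 -0500" value="200"/>
  <Workout workoutActivityType="HKWorkoutActivityTypeWalking"/>
</HealthData>"""


def records_to_frame(source):
    """Stream an export file (path or file object) and collect Record attributes."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            # copy the attributes before clearing the element
            rows.append(dict(elem.attrib))
            # release the element's attributes and children once copied
            elem.clear()
    return pd.DataFrame(rows)


sample_records = records_to_frame(io.BytesIO(SAMPLE_XML))
print(sample_records)
```

The result has the same shape as `record_data`: one row per `Record`, one column per attribute, all values still strings.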

In [4]:
#review dataframe to do high level inspection
record_data.head()

| | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | value | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-06-01 13:20:27 -0500 | 2022-05-31 23:00:00 -0500 | 2022-05-31 23:00:00 -0500 | 354.84 | |
| 1 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-11 09:43:30 -0500 | 2022-07-10 23:00:00 -0500 | 2022-07-10 23:00:00 -0500 | 1064.52 | |
| 2 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-13 20:57:54 -0500 | 2022-07-12 23:00:00 -0500 | 2022-07-12 23:00:00 -0500 | 2129.04 | |
| 3 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-14 12:42:54 -0500 | 2022-07-13 23:00:00 -0500 | 2022-07-13 23:00:00 -0500 | 946.24 | |
| 4 | HKQuantityTypeIdentifierDietaryWater | MyPlate | 4 | mL | 2022-07-16 18:11:29 -0500 | 2022-07-15 23:00:00 -0500 | 2022-07-15 23:00:00 -0500 | 2129.04 | |


Looking at the above DataFrame, we can see the entries have come in 9 columns, excluding the index. It appears that each entry contains 3 date values, as well as information on the source. The device information is likely blocked out. It looks like we only need the `type` column, one date column, and the `value` column. The `sourceName`, `sourceVersion`, and additional timestamps are not needed.

### Data Scrubbing Prior to Kaggle
As mentioned above, we have some columns to delete. But before that, we need to focus our data on a date range that is relevant to the weight loss. Let's search for the information related to Body Mass and see what we can find.
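
As a minimal sketch of that search, we could filter on the HealthKit body mass type and look at the earliest timestamp. The rows below are made up for illustration (they are not my actual weigh-ins); on the real data the same filter would be applied to `record_data`.

```python
import pandas as pd

# Made-up rows in the same shape as record_data (assumption: values are strings)
sample = pd.DataFrame({
    "type": [
        "HKQuantityTypeIdentifierDietaryWater",
        "HKQuantityTypeIdentifierBodyMass",
        "HKQuantityTypeIdentifierBodyMass",
    ],
    "startDate": [
        "2022-05-31 23:00:00 -0500",
        "2023-08-25 06:31:00 -0500",
        "2023-08-24 06:30:00 -0500",
    ],
    "value": ["354.84", "199.6", "200.0"],
})

# keep only the body mass (weigh-in) records
body_mass = sample[sample["type"] == "HKQuantityTypeIdentifierBodyMass"]

# the timestamps share one fixed format, so the string minimum is the earliest one
first_weigh_in = body_mass["startDate"].min()
print(len(body_mass), first_weigh_in)
```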

In [5]:
#record_data.info()

So... we have many entries, approximately 2M, and they're all generic string objects. There's a little scrubbing to do ahead of time. We'll go ahead and delete those 3 columns `['sourceVersion', 'device', 'sourceName']` as well as any duplicates in the file.

In [22]:
# drop sourceVersion, device, and sourceName
records_df = record_data.drop(['sourceVersion', 'device', 'sourceName'], axis=1)

In [23]:
# drop duplicates in the dataFrame
records_df.drop_duplicates(keep='first', inplace=True)

#### Date Cleanups

I know from my weight loss app that I started the journey around Aug 24, 2023, so we can delete records prior to that date, as we don't need any extraneous data. Converting each of the date columns to `datetime64[ns]` would make them easier to work with, but because every timestamp is a string in the same fixed format, comparing the strings directly against a cutoff also orders them correctly. That is what the next cell does to get rid of the data prior to late August.

In [24]:
#drop rows before the cutoff (string comparison works because all timestamps share one fixed format)
records_df = records_df.drop(records_df[records_df['startDate'] < '2023-08-23 00:00:00 -0500'].index, axis=0)
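
The cell above compares the timestamps as strings. If we wanted real datetimes instead, one way to sketch it is below, on made-up sample rows. It assumes the `YYYY-MM-DD HH:MM:SS ±HHMM` format seen in the export, and it converts everything to UTC so that rows with different offsets end up in one timezone-aware column (so the dtype is `datetime64[ns, UTC]` rather than plain `datetime64[ns]`).

```python
import pandas as pd

# Made-up rows around the cutoff (assumption: same timestamp format as the export)
sample = pd.DataFrame({
    "startDate": [
        "2023-08-22 23:59:59 -0500",
        "2023-08-23 00:07:54 -0500",
        "2024-03-06 01:00:08 -0500",
    ],
    "value": ["1", "2", "3"],
})

parsed = sample.copy()
# parse the strings, honoring the UTC offset, into one timezone-aware column
parsed["startDate"] = pd.to_datetime(
    parsed["startDate"], format="%Y-%m-%d %H:%M:%S %z", utc=True
)

# same cutoff as the notebook cell, written as a timezone-aware timestamp
cutoff = pd.Timestamp("2023-08-23 00:00:00-05:00")
recent = parsed[parsed["startDate"] >= cutoff]
print(recent)
```

The same call could be repeated for `creationDate` and `endDate`.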

In [25]:
records_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 735946 entries, 230 to 2036996
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   type          735946 non-null  object
 1   unit          723968 non-null  object
 2   creationDate  735946 non-null  object
 3   startDate     735946 non-null  object
 4   endDate       735946 non-null  object
 5   value         735946 non-null  object
dtypes: object(6)
memory usage: 39.3+ MB


Okay, so we shrank the DataFrame's memory footprint from 139.9+ MB to 39.3+ MB. So that's good.
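
The `+` in those numbers means pandas is only reporting a lower bound, because it does not measure the strings inside object columns by default. A small sketch of how to see the difference, using made-up rows stored as objects like the ones in this notebook:

```python
import pandas as pd

# Made-up rows stored as generic objects (assumption: mirrors the string columns above)
sample = pd.DataFrame(
    {
        "type": ["HKQuantityTypeIdentifierDietaryWater"] * 1000,
        "value": ["354.84"] * 1000,
    },
    dtype=object,
)

# shallow: counts only the object pointers, which is what info() reports with a "+"
shallow_bytes = sample.memory_usage(deep=False).sum()

# deep: also counts the Python strings the pointers refer to
deep_bytes = sample.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)
```

On the real data, `records_df.info(memory_usage='deep')` would print the fuller figure.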

### Categorical Data Clean-Up

The next step is to clean up the categorical data (which we'll call `classifiers`). As of now, it only appears as an object representing a unit or category that it's measuring. But we'd like it as a 0/1 in the `value` column.

For instance, consider the sleep data. Currently the `value` column holds 'HKCategoryValueSleepAnalysisAsleepDeep', which indicates that some duration of time was classified as AsleepDeep (deep sleep, you might say). We want to keep this information but move it to the `unit` column, and have only a 1 in the `value` column.

We'll create two DataFrames, one for classifiers and one for numeric values. Because the `value` column holds string objects, we can separate them by their alphanumeric order: strings that start with a digit sort before strings that start with a letter. We'll perform the transformations on the classifier DataFrame only and then concatenate the DataFrames.

In [26]:
#create classifiers dataframe by dropping the rows that are numeric (less than a high number)
classifiers = records_df.drop(records_df[records_df['value'] < '99999999.9999999'].index, axis = 0)

In [27]:
#create numeric dataframe by dropping the rows that are alphabetic (greater than a high number)
numbers = records_df.drop(records_df[records_df['value'] > '99999999.9999999'].index, axis = 0)
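
The string comparison above leans on digits sorting before letters. As a hedged alternative (not what this notebook ran), `pd.to_numeric` with `errors="coerce"` can test directly whether each value parses as a number. The sample rows and the `numbers_alt` / `classifiers_alt` names are made up for illustration.

```python
import pandas as pd

# Made-up rows mixing numeric and categorical values, all stored as strings
sample = pd.DataFrame({
    "type": [
        "HKQuantityTypeIdentifierDietaryWater",
        "HKCategoryTypeIdentifierSleepAnalysis",
        "HKQuantityTypeIdentifierHeartRateVariabilitySDNN",
    ],
    "value": ["354.84", "HKCategoryValueSleepAnalysisAsleepDeep", "44.3289"],
})

# values that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(sample["value"], errors="coerce")
is_numeric = as_number.notna()

# split into the two frames without modifying the original
numbers_alt = sample[is_numeric].copy()
classifiers_alt = sample[~is_numeric].copy()
print(len(numbers_alt), len(classifiers_alt))
```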

Fantastic, so we have two DataFrames, one with numeric values and one with categorical classifiers. Let's go ahead and do some feature engineering on the `value` information.

In [28]:
#copy the classifier dataframe and assign the 'value' elements to the 'unit' column
classifier_scrubbed = classifiers.copy()
classifier_scrubbed['unit'] = classifier_scrubbed['value']

In [29]:
#make the column value 1
classifier_scrubbed['value'] = 1


In [30]:
#concatenate the dataframes back together
final_record = pd.concat([classifier_scrubbed, numbers])
final_record

| | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|
| 1900808 | HKCategoryTypeIdentifierSleepAnalysis | HKCategoryValueSleepAnalysisAsleepDeep | 2023-08-23 07:36:55 -0500 | 2023-08-23 00:07:54 -0500 | 2023-08-23 00:17:54 -0500 | 1 |
| 1900809 | HKCategoryTypeIdentifierSleepAnalysis | HKCategoryValueSleepAnalysisAsleepCore | 2023-08-23 07:36:55 -0500 | 2023-08-23 00:17:54 -0500 | 2023-08-23 00:37:54 -0500 | 1 |
| 1900810 | HKCategoryTypeIdentifierSleepAnalysis | HKCategoryValueSleepAnalysisAsleepDeep | 2023-08-23 07:36:55 -0500 | 2023-08-23 00:37:54 -0500 | 2023-08-23 00:57:54 -0500 | 1 |
| 1900811 | HKCategoryTypeIdentifierSleepAnalysis | HKCategoryValueSleepAnalysisAsleepCore | 2023-08-23 07:36:55 -0500 | 2023-08-23 00:57:54 -0500 | 2023-08-23 01:04:24 -0500 | 1 |
| 1900812 | HKCategoryTypeIdentifierSleepAnalysis | HKCategoryValueSleepAnalysisAsleepREM | 2023-08-23 07:36:55 -0500 | 2023-08-23 01:04:24 -0500 | 2023-08-23 01:25:54 -0500 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2036992 | HKQuantityTypeIdentifierHeartRateVariabilitySDNN | ms | 2024-03-06 01:01:10 -0500 | 2024-03-06 01:00:08 -0500 | 2024-03-06 01:01:07 -0500 | 44.3289 |
| 2036993 | HKQuantityTypeIdentifierHeartRateVariabilitySDNN | ms | 2024-03-06 03:01:11 -0500 | 2024-03-06 03:00:09 -0500 | 2024-03-06 03:01:08 -0500 | 54.3759 |
| 2036994 | HKQuantityTypeIdentifierHeartRateVariabilitySDNN | ms | 2024-03-06 05:01:11 -0500 | 2024-03-06 05:00:09 -0500 | 2024-03-06 05:01:08 -0500 | 76.23 |
| 2036995 | HKQuantityTypeIdentifierHeartRateVariabilitySDNN | ms | 2024-03-06 07:01:22 -0500 | 2024-03-06 07:00:20 -0500 | 2024-03-06 07:01:20 -0500 | 45.6944 |


Looks great. The only thing left to do is export this to a CSV file.

In [31]:
#to_csv returns None when given a path, so there is nothing to assign
final_record.to_csv('weight_data_4_18.csv', index=True)