# An Exploratory Data Analysis on Val's Habits

## 👋 Introduction
James Clear, author of the critically acclaimed Atomic Habits, declares that habits are the compound interest of self-improvement. During the second semester of my sophomore year (January 2024 - May 2024), I set up systems that manually and automatically recorded my habits. Understanding my habits lets me fine-tune those systems to improve the skills and relationships I value. Right now, I have a lot of data but haven't used any of it yet. Hence this exploratory data analysis (EDA).
## 💡 Project Description
This EDA focuses on Val's habits during his second semester of sophomore year. The goal is to extract data-driven insights that will be useful for his first semester of junior year. What observations can be made and how can this help? Specifically, the following data are used:
- `habits`: Sourced from a Google Form Val made that records daily habits such as the number of times he brushed his teeth, minutes meditated, and overall mood.
- `exercise`: Sourced from Val's Strava account using the app's API. Strava is an app that lets users record their exercise data.
- `steps`: Sourced from Val's Samsung Galaxy Watch 6 through the Samsung Health app.
- `sleep`: Sourced from Val's Samsung Galaxy Watch 6 through the Samsung Health app.
- `events`: A table containing significant academic or extracurricular events Val participated in.

## 📚 Step 0: Imports and Reading Data

In [1]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly

In [2]:
# Styling and preferences
plt.style.use('ggplot')

In [3]:
# For manually created CSV data
habits = pd.read_csv('data/habits-edited.csv')
events = pd.read_csv('data/events.csv')

In [70]:
# For automated CSV data
steps = pd.read_csv('data/steps-converted.csv')
sleep = pd.read_csv('data/sleep-converted.csv')
exercise = pd.read_csv('data/exercise.csv')
devices = pd.read_csv('data/devices.csv')

- I edited the habits CSV file because there were date inconsistencies (hence the filename `habits-edited.csv`). I did not put a date field on the habit form when I made it, so I relied only on the timestamp. However, on some days I answered the form more than once to "catch up" on days I had missed, and a few days were left blank. I manually went through all dates to fix the timestamps. Luckily, there were only around 126 rows. This is already a helpful insight for next semester's habit form: add a date field.
- Downloading my health data was done through the Samsung Health app. The data provided consists of JSON files and a lot of CSV files. I decided to use the `com.samsung.shealth.tracker.pedometer_step_count.[date of download].csv` for steps and `com.samsung.shealth.sleep.[date of download].csv` for sleep. I also need the `com.samsung.health.device_profile.[date of download].csv` to determine the device ID of my smartwatch. This will be explained more later.
- Using `pd.read_csv` on these files raised a `ParserError`, so I first imported them into Google Sheets and removed the first row. After downloading them again as CSV files (the `-converted` files above), they load into pandas DataFrames without errors.
- Downloading my Strava exercise data was done through https://entorb.net/strava/. The app allowed me to skip the trouble of having to access Strava's API myself.
- Note to self: Possibly include screen time/usage data from mobile phone and tablet (when I gain the energy to do so). Current methods prove to be difficult.
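As an alternative to the Google Sheets round trip described in the list above, the extra first row could probably be skipped at read time. The sketch below is a minimal illustration, assuming the only problem is one non-tabular line before the real header. The helper name `read_samsung_csv` and the sample text are hypothetical and not taken from the actual export.

```python
import io
import pandas as pd

# Hypothetical miniature of an export file: one extra line, then the real
# header and data rows. All values here are made up for illustration.
raw_export = (
    "export-metadata,0,0\n"
    "sleep_score,sleep_duration,deviceuuid\n"
    "78,420,iGosmEieUd\n"
    "61,390,iGosmEieUd\n"
)

def read_samsung_csv(source):
    """Read a CSV whose first physical line is not part of the table."""
    # skiprows=1 drops the first line, so the second line becomes the header
    return pd.read_csv(source, skiprows=1)

sleep_demo = read_samsung_csv(io.StringIO(raw_export))
print(sleep_demo.columns.tolist())
```

With real files, `source` would be a path such as the downloaded CSV. Whether this is enough for every export file is untested here.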

## 🤔 Step 1: Data Understanding

In [6]:
habits.iloc[0]

Timestamp                                                                                   1/15
Toothbrush                                                                                   2.0
Skincare                                                                                Complete
Daily mental well-being                                                                      3.0
Read                                                                                         Yes
Touch typing                                                                                  No
Exercise                                                                             Daily steps
Minutes meditated                                                                            5.0
What were you grateful for today?                                                         malena
Any notable wins today?                                                                      NaN
Any message for future Val?   

The bottom half of the columns are weekly open-ended questions that I ended up not answering because I found them long and tiresome. So, I'll focus on the columns I consistently answered.

In [7]:
habits['Timestamp']

0      1/15
1      1/16
2      1/17
3      1/18
4      1/19
       ... 
122    5/16
123    5/17
124    5/18
125    5/19
126    5/20
Name: Timestamp, Length: 127, dtype: object

I'll need to convert the timestamp column into a proper datatype. But at least I already know that I'll be focusing on the data from January 15 to May 20, the official days of the semester.
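Since the form timestamps had to be fixed by hand, a quick coverage check can confirm that every semester day appears exactly once. The sketch below uses made-up timestamps in the same `month/day` style. The variable names are illustrative, and the year 2024 is appended because the column stores no year.

```python
import pandas as pd

# Made-up timestamps: 1/16 appears twice and 1/18 is missing
timestamps = pd.Series(["1/15", "1/16", "1/16", "1/17", "1/19"])
dates = pd.to_datetime(timestamps + "/2024", format="%m/%d/%Y")

# Every calendar day the semester is expected to cover
semester_days = pd.date_range("2024-01-15", "2024-05-20", freq="D")
print(len(semester_days))  # → 127

# Days answered more than once
duplicated_days = dates[dates.duplicated()].drop_duplicates().tolist()

# Days with no answer, checked against a short window for this sample
sample_window = pd.date_range("2024-01-15", "2024-01-19", freq="D")
missing_days = sample_window.difference(pd.DatetimeIndex(dates)).tolist()

print(duplicated_days, missing_days)
```

The 127 calendar days between January 15 and May 20 match the length of the `Timestamp` column above, which is what one row per day would look like.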

In [8]:
events.head(3)

Unnamed: 0,event,type,date_start,date_end
0,Samsung Mission,extracurricular,"January 12, 2024",
1,Samsung Mission,extracurricular,1/15/24,
2,Samsung Mission,extracurricular,1/18/24,


In [8]:
events['type'].unique()

array(['extracurricular', 'fun', 'hackathon', 'academic', 'long holiday',
       'long test'], dtype=object)

I'll need to convert the date columns into proper format as well. Moreover, I see that I have six main types of events.

In [123]:
steps.head(3)

Unnamed: 0,duration,version_code,run_step,walk_step,com.samsung.health.step_count.start_time,com.samsung.health.step_count.sample_position_type,com.samsung.health.step_count.custom,com.samsung.health.step_count.update_time,com.samsung.health.step_count.create_time,com.samsung.health.step_count.count,com.samsung.health.step_count.speed,com.samsung.health.step_count.distance,com.samsung.health.step_count.calorie,com.samsung.health.step_count.time_offset,com.samsung.health.step_count.deviceuuid,com.samsung.health.step_count.pkg_name,com.samsung.health.step_count.end_time,com.samsung.health.step_count.datauuid
0,4704,4,0,10,2024-06-26 18:25:00,,,2024-06-26 18:32:08,2024-06-26 18:32:08,10,1.666667,7.84,0.32,UTC+0800,0yH08JetXB,com.sec.android.app.shealth,2024-06-26 18:26:00,d2af0334-0836-4e8a-bd58-841f92cbf268
1,9333,4,0,15,2024-06-26 19:50:00,,,2024-06-26 19:55:00,2024-06-26 19:55:00,15,1.138889,10.630001,0.56,UTC+0800,0yH08JetXB,com.sec.android.app.shealth,2024-06-26 19:51:00,fe90e3db-4e31-4df1-bf47-64c4eaeddc9d
2,7604,4,0,19,2024-06-26 19:59:00,,,2024-06-26 20:01:18,2024-06-26 20:01:18,19,2.027778,15.419998,0.57,UTC+0800,0yH08JetXB,com.sec.android.app.shealth,2024-06-26 20:00:00,45406ca0-0b60-43cf-a232-4fdd65e023f0


In [124]:
steps.shape

(9079, 18)

In [125]:
steps.columns

Index(['duration', 'version_code', 'run_step', 'walk_step',
       'com.samsung.health.step_count.start_time',
       'com.samsung.health.step_count.sample_position_type',
       'com.samsung.health.step_count.custom',
       'com.samsung.health.step_count.update_time',
       'com.samsung.health.step_count.create_time',
       'com.samsung.health.step_count.count',
       'com.samsung.health.step_count.speed',
       'com.samsung.health.step_count.distance',
       'com.samsung.health.step_count.calorie',
       'com.samsung.health.step_count.time_offset',
       'com.samsung.health.step_count.deviceuuid',
       'com.samsung.health.step_count.pkg_name',
       'com.samsung.health.step_count.end_time',
       'com.samsung.health.step_count.datauuid'],
      dtype='object')

In [126]:
steps['com.samsung.health.step_count.deviceuuid'].unique()

array(['0yH08JetXB', 'rQMD+kro3I'], dtype=object)

First, this dataset has a lot of columns, so I'll have to create a subset of it. There are also too many rows. The `deviceuuid` column identifies which device recorded each entry. This implies that there are entries with the same date but different device IDs. Since I'll be using my Galaxy Watch 6, the IDs to use are `iGosmEieUd` and `0yH08JetXB`. I found these through `com.samsung.health.device_profile.[date of download].csv`.
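The device IDs could also be looked up programmatically rather than read off by eye. This is a rough sketch that assumes the watch can be recognized by the word "Watch" in the `name` column of the device profile. The three device rows are copied from that table, while the step records are made up. Only `iGosmEieUd` appears among these rows, so the second ID would still need to be confirmed separately.

```python
import pandas as pd

# A few rows copied from the device profile table
devices_demo = pd.DataFrame({
    "name": ["My Device", "My Device", "Galaxy Watch6"],
    "model": ["SM-S921B", "SM-S711B", "SM-R930"],
    "deviceuuid": ["rQMD+kro3I", "yj7lN7gpxt", "iGosmEieUd"],
})

# Assumption: watch rows have "Watch" somewhere in their name
is_watch = devices_demo["name"].str.contains("Watch", case=False, na=False)
watch_ids = devices_demo.loc[is_watch, "deviceuuid"].unique().tolist()
print(watch_ids)  # → ['iGosmEieUd']

# Made-up step records from two devices, filtered down to the watch
steps_demo = pd.DataFrame({
    "deviceuuid": ["rQMD+kro3I", "iGosmEieUd", "iGosmEieUd"],
    "count": [10, 15, 19],
})
watch_steps = steps_demo[steps_demo["deviceuuid"].isin(watch_ids)]
print(watch_steps)
```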

In [10]:
sleep.head(3)

Unnamed: 0,original_efficiency,mental_recovery,factor_01,factor_02,factor_03,factor_04,factor_05,factor_06,factor_07,factor_08,...,com.samsung.health.sleep.custom,com.samsung.health.sleep.modify_sh_ver,com.samsung.health.sleep.update_time,com.samsung.health.sleep.create_time,com.samsung.health.sleep.time_offset,com.samsung.health.sleep.deviceuuid,com.samsung.health.sleep.comment,com.samsung.health.sleep.pkg_name,com.samsung.health.sleep.end_time,com.samsung.health.sleep.datauuid
0,,78.0,32.0,55.0,5.0,2.0,27.0,245.0,133.0,0.0,...,,,2023-12-18 4:53:19,2023-12-16 21:51:31,UTC+0800,iGosmEieUd,,com.sec.android.app.shealth,2023-12-16 21:36:00,16f8cf4c-d949-4d40-b2cb-687457112eac
1,,41.0,11.0,47.0,4.0,19.0,32.0,475.0,79.0,4.0,...,,,2023-12-18 4:53:19,2023-12-17 23:11:09,UTC+0800,iGosmEieUd,,com.sec.android.app.shealth,2023-12-17 23:11:00,5f323226-adc6-4c95-b4a8-4beace795e8b
2,,61.0,8.0,32.0,0.0,19.0,38.0,317.0,42.0,2.0,...,,,2023-12-18 23:04:39,2023-12-18 23:04:29,UTC+0800,iGosmEieUd,,com.sec.android.app.shealth,2023-12-18 22:59:00,127d64de-2f7c-41c8-8ec6-61de5b7f58e8


In [13]:
sleep.shape

(277, 48)

In [14]:
sleep.columns

Index(['original_efficiency', 'mental_recovery', 'factor_01', 'factor_02',
       'factor_03', 'factor_04', 'factor_05', 'factor_06', 'factor_07',
       'factor_08', 'factor_09', 'factor_10', 'integrated_id',
       'has_sleep_data', 'bedtime_detection_delay',
       'wakeup_time_detection_delay', 'total_rem_duration', 'combined_id',
       'sleep_type', 'sleep_latency', 'data_version', 'physical_recovery',
       'original_wake_up_time', 'movement_awakening', 'is_integrated',
       'original_bed_time', 'goal_bed_time', 'quality', 'extra_data',
       'goal_wake_up_time', 'sleep_cycle', 'total_light_duration',
       'efficiency', 'sleep_score', 'sleep_duration', 'stage_analyzed_type',
       'com.samsung.health.sleep.create_sh_ver',
       'com.samsung.health.sleep.start_time',
       'com.samsung.health.sleep.custom',
       'com.samsung.health.sleep.modify_sh_ver',
       'com.samsung.health.sleep.update_time',
       'com.samsung.health.sleep.create_time',
       'com.samsung.hea

This dataset has even more columns. I'm mainly interested in high-level values such as sleep score and sleep time.

In [15]:
exercise.head(3)

Unnamed: 0,id,type,x_gear_name,start_date_local,x_week,x_start_h,name,x_min,x_km,x_min/km,...,start_date,timezone,total_photo_count,trainer,upload_id,upload_id_str,utc_offset,x_date,x_elev_%,x_url
0,11500446337,Run,,02.01.2024 17:11:00,2023-W53,17.2,Afternoon Run,42.8,6.11,7.01,...,02.01.2024 17:11:00,(GMT+00:00) GMT,0,0,,,0,2024-01-02,,https://www.strava.com/activities/11500446337
1,11500452565,Run,,04.01.2024 17:16:00,2023-W53,17.3,Afternoon Run,18.6,3.02,6.14,...,04.01.2024 17:16:00,(GMT+00:00) GMT,0,0,,,0,2024-01-04,,https://www.strava.com/activities/11500452565
2,11500460588,Run,,07.01.2024 18:41:00,2024-W01,18.7,Evening Run,11.0,2.05,5.38,...,07.01.2024 18:41:00,(GMT+00:00) GMT,0,0,,,0,2024-01-07,,https://www.strava.com/activities/11500460588


In [11]:
exercise.shape

(71, 72)

In [16]:
exercise.columns

Index(['id', 'type', 'x_gear_name', 'start_date_local', 'x_week', 'x_start_h',
       'name', 'x_min', 'x_km', 'x_min/km', 'km/h', 'x_max_km/h', 'x_mi',
       'x_min/mi', 'x_mph', 'x_max_mph', 'total_elevation_gain', 'x_elev_m/km',
       'average_heartrate', 'max_heartrate', 'average_cadence',
       'average_watts', 'kilojoules', 'commute', 'private', 'visibility',
       'workout_type', 'x_nearest_city_start', 'x_start_locality',
       'x_end_locality', 'x_dist_start_end_km', 'start_latlng', 'end_latlng',
       'elev_low', 'elev_high', 'kudos_count', 'comment_count',
       'achievement_count', 'athlete', 'athlete_count', 'average_speed',
       'display_hide_heartrate_option', 'distance', 'elapsed_time',
       'external_id', 'flagged', 'from_accepted_tag', 'gear_id',
       'has_heartrate', 'has_kudoed', 'heartrate_opt_out', 'location_city',
       'location_country', 'location_state', 'manual', 'map', 'max_speed',
       'moving_time', 'photo_count', 'pr_count', 'resource_stat

In [17]:
exercise.sport_type.unique()

array(['Run', 'WeightTraining', 'Hike'], dtype=object)

The same observations apply here: there are far more columns than I need, so I'll work with a subset.

In [72]:
devices.name 

Unnamed: 0,manufacturer,providing_step_goal,create_sh_ver,step_source_group,device_type,backsync_step_goal,capability,modify_sh_ver,device_group,update_time,create_time,name,model,connectivity_type,deviceuuid,pkg_name,fixed_name,datauuid
0,Samsung,,,,,,,,360001,2024-01-29 23:56:30,2024-01-29 23:56:30,My Device,SM-S921B,,rQMD+kro3I,com.sec.android.app.shealth,Val's S24,3b454f4f-dffc-f3cc-8fa7-a627ee73ed81
1,Samsung,,,,,,,,360001,2023-12-18 4:52:24,2023-12-18 4:52:24,My Device,SM-S711B,,yj7lN7gpxt,com.sec.android.app.shealth,Val's S23 FE,3229b35c-0992-f7a8-92eb-cd6b1180f8f7
2,Combined,,,,,,,,0,2024-02-02 1:10:07,2024-02-02 1:10:07,Combined,Combined,,VfS0qUERdZ,com.sec.android.app.shealth,Val's Tab S9 FE,7af9722c-cd99-07b7-853a-43ce58e819c5
3,Samsung Electronics,1.0,,106.0,10058.0,1.0,ed84774b-af9d-4eee-9c57-5bb2badf905d.capabilit...,,360003,2023-12-16 16:51:19,2023-11-07 13:11:06,Galaxy Watch6,SM-R930,,iGosmEieUd,com.sec.android.app.shealth,,ed84774b-af9d-4eee-9c57-5bb2badf905d
4,all_target,,,,,,,,0,2024-02-02 1:10:10,2024-02-02 1:10:10,all_target,all_target,,Mk66SbFqK1,com.sec.android.app.shealth,Val's Tab S9 FE,471ca32b-a665-1e9e-7c27-7ac69485dc05


## 🤓 Step 2: Data Preparation
For each dataset, we'll do the following processing steps:
- Drop irrelevant columns and rows
- Change datatypes
- Rename columns
- Feature creation

Other forms of preparation will also be done depending on the nature of each dataset. Once everything's clean, we'll merge all dataframes.

In [33]:
# Some variables
start_date = pd.to_datetime('2024-01-15')
end_date = pd.to_datetime('2024-05-20')
gw6_id1 = 'iGosmEieUd'
gw6_id2 = '0yH08JetXB'

### Habits

In [6]:
habits.columns

Index(['Timestamp', 'Toothbrush', 'Skincare', 'Daily mental well-being',
       'Read', 'Touch typing', 'Exercise', 'Minutes meditated',
       'What were you grateful for today?', 'Any notable wins today?',
       'Any message for future Val?', 'Poop', 'Type',
       'What went well this week?',
       'Which goals did I NOT achieve? Which intentions did I NOT keep?',
       'What is my Most Important Task for this Week? How will I Make Sure I Get it Done?',
       ' How can I Make Things Faster, Easier or Obsolete?',
       'If I were to 10x my goals, what would I do to achieve them?',
       'What am I NOT doing even though I know I should?',
       'Math practice outside class in hours',
       'What was a win for you today? (It's perfectly fine to have none! 😊)',
       'Infinite scrolled? (more than 30 minutes of scrolling)'],
      dtype='object')

In [7]:
habits.isna().sum()

Timestamp                                                                              0
Toothbrush                                                                             7
Skincare                                                                               7
Daily mental well-being                                                                7
Read                                                                                   7
Touch typing                                                                           7
Exercise                                                                               8
Minutes meditated                                                                      7
What were you grateful for today?                                                      7
Any notable wins today?                                                              127
Any message for future Val?                                                          127
Poop                 

In [8]:
habits = habits[['Timestamp', 'Toothbrush', 'Skincare', 'Daily mental well-being',
       'Read', 'Touch typing', 'Exercise', 'Minutes meditated',
       'What were you grateful for today?', 
       # 'Any notable wins today?',
       # 'Any message for future Val?', 
        'Poop', 
       # 'Type',
       # 'What went well this week?',
       # 'Which goals did I NOT achieve? Which intentions did I NOT keep?',
       # 'What is my Most Important Task for this Week? How will I Make Sure I Get it Done?',
       # ' How can I Make Things Faster, Easier or Obsolete?',
       # 'If I were to 10x my goals, what would I do to achieve them?',
       # 'What am I NOT doing even though I know I should?',
       'Math practice outside class in hours',
       "What was a win for you today? (It's perfectly fine to have none! 😊)",
       'Infinite scrolled? (more than 30 minutes of scrolling)']].copy()

In [9]:
# Rename columns
habits.columns = 'date,toothbrush,skincare,mood,read,touch_type,exercise,minutes_meditated,grateful_for,poop,math_practice,win,infinite_scrolled'.split(',')

In [10]:
# Fix timestamp datatype
habits.date = pd.to_datetime(habits.date + '/2024', format='%m/%d/%Y')
# Turn to numeric
habits.toothbrush = pd.to_numeric(habits.toothbrush)
for col in ['read', 'touch_type', 'poop']:
    habits[col] = habits[col].map({'Yes': 1, 'No': 0})
habits.skincare = habits.skincare.map({'Complete': 1, 'Morning only': 0.5, 'Night only': 0.5, 'Incomplete': 0})

In [11]:
habits.head(3)

Unnamed: 0,date,toothbrush,skincare,mood,read,touch_type,exercise,minutes_meditated,grateful_for,poop,math_practice,win,infinite_scrolled
0,2024-01-15,2.0,1.0,3.0,1.0,0.0,Daily steps,5.0,malena,1.0,,,
1,2024-01-16,1.0,0.5,3.0,1.0,0.0,Badminton,0.0,malena,0.0,,,
2,2024-01-17,2.0,1.0,4.0,1.0,1.0,Walk,5.0,malena,1.0,,,
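One caveat with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`, which is indistinguishable from a skipped answer. A small check like the sketch below could be run before mapping. The responses are made up, and the mapping is the skincare one used above.

```python
import pandas as pd

skincare_map = {"Complete": 1, "Morning only": 0.5, "Night only": 0.5, "Incomplete": 0}

# Made-up responses: one blank answer and one label with different
# casing and a trailing space
responses = pd.Series(["Complete", "Night only", None, "complete "])

# Labels that are present in the data but absent from the mapping
unmapped = sorted(set(responses.dropna()) - set(skincare_map))
print(unmapped)  # → ['complete ']

# Both the blank answer and the unmapped label end up as NaN
mapped = responses.map(skincare_map)
print(int(mapped.isna().sum()))  # → 2
```

If `unmapped` is empty, every `NaN` after mapping comes from a genuinely blank answer.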


### Events

In [12]:
events.date_start = pd.to_datetime(events.date_start, format='mixed')
events.date_end = pd.to_datetime(events.date_end, format='mixed')

In [13]:
events

Unnamed: 0,event,type,date_start,date_end
0,Samsung Mission,extracurricular,2024-01-12,NaT
1,Samsung Mission,extracurricular,2024-01-15,NaT
2,Samsung Mission,extracurricular,2024-01-18,NaT
3,Samsung Mission,extracurricular,2024-01-27,NaT
4,Samsung Mission,extracurricular,2024-02-02,NaT
5,Samsung Mission,extracurricular,2024-02-29,NaT
6,Samsung Mission,extracurricular,2024-03-04,NaT
7,Samsung Mission,extracurricular,2024-03-01,NaT
8,Samsung Mission,extracurricular,2024-03-19,NaT
9,Samsung Mission,extracurricular,2024-03-27,NaT


Everything looks good with this dataset. We can move on.

### Steps

In [35]:
steps.head(3)

Unnamed: 0,create_sh_ver,step_count,binning_data,active_time,recommendation,modify_sh_ver,run_step_count,update_time,source_package_name,create_time,...,speed,distance,calorie,walk_step_count,deviceuuid,pkg_name,healthy_step,achievement,datauuid,day_time
0,,9985,e133e7fd-212e-4f0a-9e44-487e4373e2ea.binning_d...,4729744,6000,,475,2023-12-18 17:12:05,com.sec.android.app.shealth,2023-12-18 4:53:19,...,1.674499,7919.9517,324.4396,9510,VfS0qUERdZ,com.sec.android.app.shealth,0,e133e7fd-212e-4f0a-9e44-487e4373e2ea.achieveme...,e133e7fd-212e-4f0a-9e44-487e4373e2ea,1702857600000
1,,908,c2549c34-43e7-4607-a6d6-3b59e4d2186e.binning_d...,513471,6000,,5,2023-12-18 4:53:19,com.sec.android.app.shealth,2023-12-18 4:53:19,...,1.326753,681.25,32.42,903,VfS0qUERdZ,com.sec.android.app.shealth,0,c2549c34-43e7-4607-a6d6-3b59e4d2186e.achieveme...,c2549c34-43e7-4607-a6d6-3b59e4d2186e,1702684800000
2,,6381,a642decc-d8b4-4799-b5cc-4d7e5b0351d8.binning_d...,3488910,6000,,125,2023-12-18 4:53:19,com.sec.android.app.shealth,2023-12-18 4:53:19,...,1.380428,4816.1904,205.31001,6256,VfS0qUERdZ,com.sec.android.app.shealth,0,a642decc-d8b4-4799-b5cc-4d7e5b0351d8.achieveme...,a642decc-d8b4-4799-b5cc-4d7e5b0351d8,1702771200000


In [34]:
steps.columns

Index(['create_sh_ver', 'step_count', 'binning_data', 'active_time',
       'recommendation', 'modify_sh_ver', 'run_step_count', 'update_time',
       'source_package_name', 'create_time', 'source_info', 'speed',
       'distance', 'calorie', 'walk_step_count', 'deviceuuid', 'pkg_name',
       'healthy_step', 'achievement', 'datauuid', 'day_time'],
      dtype='object')

In [36]:
steps.deviceuuid.unique()

array(['VfS0qUERdZ', 'iGosmEieUd', 'yj7lN7gpxt', 'cYtOYUbVZi',
       'gKzgTXX1pl', 'rQMD+kro3I', '0yH08JetXB', 'P0IBIORKg0'],
      dtype=object)

In [52]:
steps = steps[steps.deviceuuid.isin([gw6_id1, gw6_id2])].copy()

In [53]:
# Drop irrelevant columns
steps = steps[['update_time','run_step_count', 'walk_step_count','distance','calorie']].copy()

In [54]:
# Rename columns
steps.columns = 'date,run_step,walk_step,distance,calorie'.split(',')

In [55]:
steps.dtypes

date          object
run_step       int64
walk_step      int64
distance     float64
calorie      float64
dtype: object

In [56]:
# Fix timestamp
steps.date = pd.to_datetime(steps.date).dt.normalize()
# Drop irrelevant rows
steps = steps[(steps['date'] >= start_date) & (steps['date'] <= end_date)].reset_index(drop=True)

In [59]:
steps.shape

(252, 5)

In [62]:
steps = steps.groupby(['date']).sum()

In [65]:
steps.shape

(109, 4)

In [92]:
steps.tail()

Unnamed: 0,date,steps,calories,distance
120,2024-05-16,22652,821.19808,18024.7285
121,2024-05-17,13557,456.38005,10676.5713
122,2024-05-18,20257,657.93967,15882.2533
123,2024-05-19,22586,749.01975,17694.563793
124,2024-05-20,30737,993.369939,23872.081


In [104]:
test = steps[steps.date == pd.to_datetime('2024-05-20')]
# test.groupby(['date'], as_index=False).mean()
test

Unnamed: 0,steps,date,calories,distance
1171,14001,2024-05-20,425.12994,10500.77
1172,16103,2024-05-20,547.71,12898.521
1173,633,2024-05-20,20.529999,472.79


These are not the same step counts shown on my phone! I must use a different dataset. Perhaps the steps trend
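The mismatch can be traced with a little arithmetic: the three May 20 records add up to 14,001 + 16,103 + 633 = 30,737, which is exactly the May 20 total in the grouped table. The sketch below reproduces that and counts the records per day, which may help spot other days where several records were summed. It is a diagnostic only, not a fix, and the column names in `per_day` are illustrative.

```python
import pandas as pd

# The three 2024-05-20 records shown above
may20 = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-20"] * 3),
    "steps": [14001, 16103, 633],
})

# Per-day record count, summed total, and largest single record
per_day = may20.groupby("date")["steps"].agg(
    n_records="count", total="sum", largest="max"
)
print(per_day)
```

Days where `n_records` is greater than 1 are the ones where a plain sum may overstate the count, assuming the records overlap.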

### Merging dataframes

In [196]:
events.rename(columns={'date_start': 'date'}, inplace=True)

In [197]:
df = pd.merge(habits, events, on='date', how='outer')

In [200]:
df.isna().sum()

date                   0
toothbrush             9
skincare               9
mood                   9
read                   9
touch_type             9
exercise              10
minutes_meditated      9
grateful_for           9
poop                   9
math_practice         18
win                   28
infinite_scrolled     28
event                102
type                 102
date_end             125
dtype: int64
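After the outer merge, the missing count for `toothbrush` is 9 instead of the earlier 7, which would be consistent with a couple of event rows having no matching habit date. The `indicator` parameter of `pd.merge` can show where each row came from. This is a minimal sketch with made-up miniature frames, not the real data.

```python
import pandas as pd

# Made-up miniature versions of the two tables
habits_demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    "mood": [3.0, 3.0],
})
events_demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-12", "2024-01-15"]),
    "event": ["Samsung Mission", "Samsung Mission"],
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
merged = pd.merge(habits_demo, events_demo, on="date", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `right_only` are events that fall on dates with no habit entry, such as an event before the semester's first recorded day.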

## 🔎 Feature Understanding