# Bellabeat Smart Device Usage Analysis
### Case Study 2: How Can a Wellness Technology Company Play It Smart?

**Author:** Danijia Haggins  
**Date:** October 2025  
**Tools:** Python, Pandas, Matplotlib, NumPy

## **Scenario** 
I am a consumer insights analyst on the marketing analytics team at Bellabeat, a tech company that makes wellness products for women. The company's CCO believes that analyzing smart device fitness data could provide valuable insights to inform Bellabeat's marketing strategy.

## **Business Task**:
Analyze smart device usage from FitBit users to discover trends in activity, sleep, and wellness habits. Apply these insights to help **Bellabeat** understand how consumers engage with health-tracking devices and improve marketing strategies for the Leaf wellness tracker.

[Data source](https://www.kaggle.com/datasets/arashnic/fitbit)

In this case study I am following 6 steps of the data analysis process: 
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act 


## ASK
**Key stakeholders**:
1. Urška Sršen (Cofounder & Chief Creative Officer)
2. Sando Mur (Cofounder, Executive Team)

**Key Questions Guiding Analysis**
1. What are the main trends in smart device usage?
2. How do these trends relate to Bellabeat customers?
3. How can these insights inform Bellabeat's marketing strategy?

## 1. Prepare

### About this data
* The data is loaded, processed, and analyzed within Kaggle notebooks using Python, and is held in memory for the purposes of this project
* The time range this data represents is March 12, 2016-May 12, 2016
* The source data is stored in CSVs (29 total)
* The data is mostly long format, with a few of the csvs replicated in wide format
* The data has been made available under the [CC0: Public Domain License](https://creativecommons.org/publicdomain/zero/1.0/)

In [1]:
# import the libraries needed

import numpy as np # linear algebra
from datetime import datetime, date, timedelta
import matplotlib.pyplot as plt # plotting 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/working'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/working/__notebook__.ipynb


## 2. Process 
* I'm using Python to process (clean/transform) the data because:
    * I can clean, transform, and analyze the data in one place without switching between platforms such as SQL and Excel.
    * Python scales to high-volume data better than spreadsheets can (an Excel worksheet is limited to 1,048,576 rows, and this data includes more rows than that).
    * I can document every step and analyze in one place, creating easily reproducible results.

### 📈🛑 Data Limitations
* There are two folders containing data, one folder for user data between March 12, 2016 to April 11, 2016 and another for data from April 12, 2016 to May 12, 2016. Each folder contains 11 csvs with the same columns, and will need to be combined in the data cleaning/preprocessing.
* Two participants (IDs 2891001357 and 6391747486) appear only in the first dailyActivity_merged dataset (from 3.12.16 to 4.11.16) and have no recorded activity after April 11, 2016. This may indicate device non-use or dropout during the study period.
* Some csvs only represent a subset of the data, for users between April 12, 2016 and May 12, 2016: dailySteps, dailyIntensities, dailyCalories, and sleepDay. This data is also redundant, as total counts for steps and calories are included in the dailyActivity csvs, and steps are included in the minuteSteps csvs. These csvs are not used in this analysis.
* Some of the minute data is too granular for the purposes of this analysis: minuteCalories, minuteIntensities, and minuteMETS will not be used.
* Users may not wear their devices every day
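
The two participant-level limitations above (users who appear in only one export, and users who do not log every day) can be checked directly. The following is a minimal sketch on made-up toy frames, not the real exports; the names `period_1`, `period_2`, `only_in_first`, and `days_logged` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two dailyActivity exports (values are made up).
period_1 = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "ActivityDate": ["3/12/2016", "3/13/2016", "3/12/2016", "3/12/2016"],
})
period_2 = pd.DataFrame({
    "Id": [1, 1, 3],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],
})

# Ids that appear in the first export but never in the second.
only_in_first = sorted(set(period_1["Id"]) - set(period_2["Id"]))
print(only_in_first)

# Number of distinct days each user logged across both exports.
combined = pd.concat([period_1, period_2], ignore_index=True)
days_logged = combined.groupby("Id")["ActivityDate"].nunique()
print(days_logged)
```

Run against the real frames, a low `days_logged` value would flag users whose totals and averages rest on only a few days of wear.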

### 📁 Data Preparation 
Of the 29 csvs provided in the fitbit dataset, only the following were used for this analysis: 
* daily_activity
* heartrate_seconds
* minute_sleep
* hourly_calories
* hourly_intensities
* hourly_steps
* weight_log_info

These were selected because they contain the most relevant and complete data for understanding user activity and health behavior. Files with incomplete, duplicate, or overly granular data (such as the minute-level calorie, intensity, and MET logs) were excluded to simplify analysis and maintain clarity.

### 🔄 Data Loading
* Import the csvs into dataframes 
* Inspect structure
* Combine csvs with matching columns + data into one dataframe
* Check dataframe data types to avoid computational errors. 

In [2]:
# loading the csv data into separate dataframes
da1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/dailyActivity_merged_1.csv')
da2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/dailyActivity_merged_2.csv')
hs1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/heartrate_seconds_merged_1.csv')
hs2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/heartrate_seconds_merged_2.csv')
hc1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlyCalories_merged_1.csv')
hc2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlyCalories_merged_2.csv')
hi1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlyIntensities_merged_1.csv')
hi2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlyIntensities_merged_2.csv')
hstps1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlySteps_merged_1.csv')
hstps2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlySteps_merged_2.csv')
msl1= pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/minuteSleep_merged_1.csv')
msl2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/minuteSleep_merged_2.csv')
wlg1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/weightLogInfo_merged_1.csv') 
wlg2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/weightLogInfo_merged_2.csv')
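
The fourteen `read_csv` calls above follow one naming pattern, so the paths could also be built in a loop. This is only a sketch of that alternative: the base directory, folder names, and file stems are copied from the cell above, while the helper names `export_path` and `load_all` are hypothetical. `load_all` needs the Kaggle input files, so only the path builder is exercised here.

```python
from pathlib import Path

import pandas as pd

# Base directory and folder names copied from the cell above.
BASE = Path("/kaggle/input/bella-beat-case-study/data")
FOLDERS = {
    1: "fitabase_data_3.12.16_to_4.11.16",
    2: "fitabase_data_4.12.16_to_5.12.16",
}
STEMS = [
    "dailyActivity", "heartrate_seconds", "hourlyCalories",
    "hourlyIntensities", "hourlySteps", "minuteSleep", "weightLogInfo",
]


def export_path(stem: str, part: int) -> Path:
    """Build the path of one export file, e.g. dailyActivity_merged_1.csv."""
    return BASE / FOLDERS[part] / f"{stem}_merged_{part}.csv"


def load_all() -> dict:
    """Read every file into a dict keyed by (stem, part). Requires the Kaggle input files."""
    return {(s, p): pd.read_csv(export_path(s, p)) for s in STEMS for p in FOLDERS}


print(export_path("dailyActivity", 1))
```

Keying the frames by `(stem, part)` also removes the need to type each dataframe name by hand when building a lookup dictionary later.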



In [3]:
# check for nulls / missing data
# create a dictionary with the dataframes and their names
dict1 = {'da1': da1, 
       'da2': da2, 
       'hs1': hs1, 
       'hs2': hs2, 
       'hc1': hc1, 
       'hc2': hc2, 
       'hi1': hi1, 
       'hi2': hi2, 
       'hstps1': hstps1, 
       'hstps2': hstps2, 
       'msl1': msl1, 
       'msl2': msl2, 
       'wlg1': wlg1, 
       'wlg2': wlg2
      }

# iterate over dict1
for name, df in dict1.items():
    print(f"Null values in {name}:")
    null_counts = df.isnull().sum()
    if null_counts.sum() > 0: # can use .any() or .sum()
        print(null_counts[null_counts > 0]) # Print only columns with nulls
    else:
        print("No null values found.")
    print("\n")



Null values in da1:
No null values found.


Null values in da2:
No null values found.


Null values in hs1:
No null values found.


Null values in hs2:
No null values found.


Null values in hc1:
No null values found.


Null values in hc2:
No null values found.


Null values in hi1:
No null values found.


Null values in hi2:
No null values found.


Null values in hstps1:
No null values found.


Null values in hstps2:
No null values found.


Null values in msl1:
No null values found.


Null values in msl2:
No null values found.


Null values in wlg1:
Fat    31
dtype: int64


Null values in wlg2:
Fat    65
dtype: int64




**Note:** 
DataFrames wlg1 and wlg2 contain NaN values in the body fat (`Fat`) column because some users did not record their body fat percentages. This will not cause errors in the analysis, so the values are left as is and noted here.
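
As a quick illustration of why the missing values do not raise errors, pandas aggregations skip NaN by default. The sketch below uses a made-up three-row `Fat` column rather than the real weight log.

```python
import numpy as np
import pandas as pd

# Toy body-fat column with one missing entry (values are made up).
fat = pd.Series([22.0, np.nan, 10.0], name="Fat")

print(fat.mean())         # NaN is skipped by default (skipna=True)
print(fat.count())        # number of non-missing entries
print(fat.isna().mean())  # share of entries that are missing
```

Any statistic computed on `Fat` is therefore based only on the rows where a value was recorded, which is worth stating whenever it is reported.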

In [4]:
# combining csv's with the same columns into one dataframe
daily_activity = pd.concat([da1, da2])
heartrate_seconds = pd.concat([hs1, hs2])
hourly_calories = pd.concat([hc1, hc2])
hourly_intensities = pd.concat([hi1, hi2])
hourly_steps = pd.concat([hstps1, hstps2])
minute_sleep = pd.concat([msl1, msl2])
weight_log_info = pd.concat([wlg1, wlg2])
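
Because `pd.concat` keeps each input's original index by default, the combined frames repeat index labels, and any row present in both exports would be counted twice. Whether the real exports overlap is not established here; the sketch below only shows, on made-up toy frames, how `ignore_index=True` and a duplicate check on `(Id, ActivityDate)` could be used.

```python
import pandas as pd

# Toy frames standing in for two exports that share one (Id, date) row.
first = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDate": ["4/11/2016", "4/12/2016"],
    "TotalSteps": [5000, 7000],
})
second = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDate": ["4/12/2016", "4/13/2016"],
    "TotalSteps": [7000, 6500],
})

# ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat([first, second], ignore_index=True)

# Count rows whose (Id, ActivityDate) pair already appeared earlier.
n_repeats = combined.duplicated(subset=["Id", "ActivityDate"]).sum()
print(n_repeats)

# Keep one row per (Id, ActivityDate); here the later export wins.
deduped = combined.drop_duplicates(subset=["Id", "ActivityDate"], keep="last")
print(len(deduped))
```

Choosing `keep="last"` is an assumption that the later export is the more complete one; `keep="first"` is equally valid if the opposite holds.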

In [5]:
# store the combined dataframes in a list
all_dfs = [daily_activity, heartrate_seconds, hourly_calories, hourly_intensities, hourly_steps, minute_sleep, weight_log_info]

# view first 5 rows of each dataframe
for i, x in enumerate(all_dfs, 1): 
    print(f'Dataframe {i}:')
    display(x.head())
    print('\n')

Dataframe 1:


Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,3/25/2016,11004,7.11,7.11,0.0,2.57,0.46,4.07,0.0,33,12,205,804,1819
1,1503960366,3/26/2016,17609,11.55,11.55,0.0,6.92,0.73,3.91,0.0,89,17,274,588,2154
2,1503960366,3/27/2016,12736,8.53,8.53,0.0,4.66,0.16,3.71,0.0,56,5,268,605,1944
3,1503960366,3/28/2016,13231,8.93,8.93,0.0,3.19,0.79,4.95,0.0,39,20,224,1080,1932
4,1503960366,3/29/2016,12041,7.85,7.85,0.0,2.16,1.09,4.61,0.0,28,28,243,763,1886




Dataframe 2:


Unnamed: 0,Id,Time,Value
0,2022484408,4/1/2016 7:54:00 AM,93
1,2022484408,4/1/2016 7:54:05 AM,91
2,2022484408,4/1/2016 7:54:10 AM,96
3,2022484408,4/1/2016 7:54:15 AM,98
4,2022484408,4/1/2016 7:54:20 AM,100




Dataframe 3:


Unnamed: 0,Id,ActivityHour,Calories
0,1503960366,3/12/2016 12:00:00 AM,48
1,1503960366,3/12/2016 1:00:00 AM,48
2,1503960366,3/12/2016 2:00:00 AM,48
3,1503960366,3/12/2016 3:00:00 AM,48
4,1503960366,3/12/2016 4:00:00 AM,48




Dataframe 4:


Unnamed: 0,Id,ActivityHour,TotalIntensity,AverageIntensity
0,1503960366,3/12/2016 12:00:00 AM,0,0.0
1,1503960366,3/12/2016 1:00:00 AM,0,0.0
2,1503960366,3/12/2016 2:00:00 AM,0,0.0
3,1503960366,3/12/2016 3:00:00 AM,0,0.0
4,1503960366,3/12/2016 4:00:00 AM,0,0.0




Dataframe 5:


Unnamed: 0,Id,ActivityHour,StepTotal
0,1503960366,3/12/2016 12:00:00 AM,0
1,1503960366,3/12/2016 1:00:00 AM,0
2,1503960366,3/12/2016 2:00:00 AM,0
3,1503960366,3/12/2016 3:00:00 AM,0
4,1503960366,3/12/2016 4:00:00 AM,0




Dataframe 6:


Unnamed: 0,Id,date,value,logId
0,1503960366,3/13/2016 2:39:30 AM,1,11114919637
1,1503960366,3/13/2016 2:40:30 AM,1,11114919637
2,1503960366,3/13/2016 2:41:30 AM,1,11114919637
3,1503960366,3/13/2016 2:42:30 AM,1,11114919637
4,1503960366,3/13/2016 2:43:30 AM,1,11114919637




Dataframe 7:



Unnamed: 0,Id,Date,WeightKg,WeightPounds,Fat,BMI,IsManualReport,LogId
0,1503960366,4/5/2016 11:59:59 PM,53.299999,117.506384,22.0,22.969999,True,1459900799000
1,1927972279,4/10/2016 6:33:26 PM,129.600006,285.719105,,46.169998,False,1460313206000
2,2347167796,4/3/2016 11:59:59 PM,63.400002,139.773078,10.0,24.77,True,1459727999000
3,2873212765,4/6/2016 11:59:59 PM,56.700001,125.002104,,21.450001,True,1459987199000
4,2873212765,4/7/2016 11:59:59 PM,57.200001,126.104416,,21.65,True,1460073599000






In [6]:
# checking data types
all_dfs[0].info() # daily_activity, position 0 in all_dfs 
all_dfs[1].info() # heartrate seconds, position 1 in all_dfs
all_dfs[2].info() # hourly calories, position 2 in all_dfs
all_dfs[3].info() # hourly intensities, position 3 in all_dfs
all_dfs[4].info() # hourly steps, position 4 in all_dfs
all_dfs[5].info() # minute sleep, position 5 in all_dfs
all_dfs[6].info() # weight_log_info, position 6 in all_dfs

<class 'pandas.core.frame.DataFrame'>
Index: 1397 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        1397 non-null   int64  
 1   ActivityDate              1397 non-null   object 
 2   TotalSteps                1397 non-null   int64  
 3   TotalDistance             1397 non-null   float64
 4   TrackerDistance           1397 non-null   float64
 5   LoggedActivitiesDistance  1397 non-null   float64
 6   VeryActiveDistance        1397 non-null   float64
 7   ModeratelyActiveDistance  1397 non-null   float64
 8   LightActiveDistance       1397 non-null   float64
 9   SedentaryActiveDistance   1397 non-null   float64
 10  VeryActiveMinutes         1397 non-null   int64  
 11  FairlyActiveMinutes       1397 non-null   int64  
 12  LightlyActiveMinutes      1397 non-null   int64  
 13  SedentaryMinutes          1397 non-null   int64  
 14  Calories      

In [7]:
# convert datetime columns from objects to datetime64 datatypes & check the columns and datatypes
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'])
heartrate_seconds['Time'] = pd.to_datetime(heartrate_seconds['Time'])
hourly_calories['ActivityHour'] = pd.to_datetime(hourly_calories['ActivityHour'])
hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'])
hourly_steps['ActivityHour'] = pd.to_datetime(hourly_steps['ActivityHour'])
minute_sleep['date'] = pd.to_datetime(minute_sleep['date'])
weight_log_info['Date'] = pd.to_datetime(weight_log_info['Date'])
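
Passing an explicit `format` to `pd.to_datetime` avoids relying on format inference. The sketch below uses sample strings copied from the previews above; the format strings are an assumption that every row in a column follows the same pattern.

```python
import pandas as pd

# Sample strings copied from the previews above.
hours = pd.Series(["3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM"])
days = pd.Series(["3/25/2016", "3/26/2016"])

# %I is the 12-hour clock and %p is the AM/PM marker.
parsed_hours = pd.to_datetime(hours, format="%m/%d/%Y %I:%M:%S %p")
parsed_days = pd.to_datetime(days, format="%m/%d/%Y")

print(parsed_hours.dt.hour.tolist())
print(parsed_days.dt.day.tolist())
```

With an explicit format, a row that does not match raises an error instead of being parsed silently in a different way.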
 



In [8]:
# check that datetime columns changed to datetime64
daily_activity['ActivityDate'].info() # daily activity
heartrate_seconds['Time'].info() # heartrate seconds
hourly_calories['ActivityHour'].info() # hourly calories
hourly_intensities['ActivityHour'].info() # hourly intensities
hourly_steps['ActivityHour'].info() # hourly steps
minute_sleep['date'].info() # minute sleep
weight_log_info['Date'].info() # weight_log_info

<class 'pandas.core.series.Series'>
Index: 1397 entries, 0 to 939
Series name: ActivityDate
Non-Null Count  Dtype         
--------------  -----         
1397 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 21.8 KB
<class 'pandas.core.series.Series'>
Index: 3638339 entries, 0 to 2483657
Series name: Time
Non-Null Count    Dtype         
--------------    -----         
3638339 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 55.5 MB
<class 'pandas.core.series.Series'>
Index: 46183 entries, 0 to 22098
Series name: ActivityHour
Non-Null Count  Dtype         
--------------  -----         
46183 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 721.6 KB
<class 'pandas.core.series.Series'>
Index: 46183 entries, 0 to 22098
Series name: ActivityHour
Non-Null Count  Dtype         
--------------  -----         
46183 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 721.6 KB
<class 'pandas.core.series.Series'>
Index: 46183 entrie

In [9]:
# Extracting hour of day from hourly data dataframes and adding a new column
hourly_calories['HourOfDay'] = hourly_calories['ActivityHour'].dt.hour
hourly_intensities['HourOfDay'] = hourly_intensities['ActivityHour'].dt.hour
hourly_steps['HourOfDay'] = hourly_steps['ActivityHour'].dt.hour
minute_sleep['HourOfDay'] = minute_sleep['date'].dt.hour
minute_sleep['Minute'] = minute_sleep['date'].dt.minute
minute_sleep['Second'] = minute_sleep['date'].dt.second


In [10]:
# minute_sleep.head(10)
hourly_calories.head()

Unnamed: 0,Id,ActivityHour,Calories,HourOfDay
0,1503960366,2016-03-12 00:00:00,48,0
1,1503960366,2016-03-12 01:00:00,48,1
2,1503960366,2016-03-12 02:00:00,48,2
3,1503960366,2016-03-12 03:00:00,48,3
4,1503960366,2016-03-12 04:00:00,48,4


## 3. Analyze 
Now that the data has been loaded and preprocessed, it's time for analysis.

**Guiding questions:**
* What surprises did you find in the data?
* What trends and relationships did you find in the data?

**Key Tasks**
* Aggregate data so it's useful and accessible
* Organize and format data
* Perform calculations

### Calculate Aggregates

In [11]:
# avg calories per day
avg_calories_per_day = (
    daily_activity.groupby('ActivityDate')['Calories']
    .mean()
    .round()
    .rename('AvgCaloriesPerDay')
    .reset_index()
    .sort_values('ActivityDate', ascending=True)
)
print(type(avg_calories_per_day))
display(avg_calories_per_day)
print('Min average total calories per day: ' + str(avg_calories_per_day['AvgCaloriesPerDay'].min()))
print('Max Average total calories: ' + str(avg_calories_per_day['AvgCaloriesPerDay'].max()))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,ActivityDate,AvgCaloriesPerDay
0,2016-03-12,2384.0
1,2016-03-13,2128.0
2,2016-03-14,2512.0
3,2016-03-15,2396.0
4,2016-03-16,2882.0
...,...,...
57,2016-05-08,2303.0
58,2016-05-09,2336.0
59,2016-05-10,2229.0
60,2016-05-11,2190.0


Min average total calories per day: 1139.0
Max Average total calories: 2882.0


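Average daily steps appear to increase over time, which could reflect more activity in the warmer spring months. The earliest dates include very few users, though: the March 12 average of 2,772 steps against a total of 5,543 implies only two users, so the early averages should be read with caution.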

In [12]:
# average steps per day
avg_steps_per_day = (
    daily_activity.groupby('ActivityDate')['TotalSteps']
    .mean()
    .round()
    .rename('AvgStepsPerDay') # Rename the 'TotalSteps' series to AvgStepsPerDay
    .reset_index() # convert the groupby index to a regular column 
    .sort_values('ActivityDate', ascending=True)
    
)

print(type(avg_steps_per_day))
display(avg_steps_per_day)


<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,ActivityDate,AvgStepsPerDay
0,2016-03-12,2772.0
1,2016-03-13,1613.0
2,2016-03-14,5728.0
3,2016-03-15,2953.0
4,2016-03-16,7311.0
...,...,...
57,2016-05-08,7049.0
58,2016-05-09,8249.0
59,2016-05-10,7951.0
60,2016-05-11,7520.0
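
A daily average is easier to interpret next to the number of users behind it. The sketch below computes both in one `agg` call on made-up toy rows; the `Users` column name is an illustrative assumption.

```python
import pandas as pd

# Toy daily rows (values are made up): two users on the first date, three later.
toy = pd.DataFrame({
    "Id": [1, 2, 1, 2, 3],
    "ActivityDate": pd.to_datetime(
        ["2016-03-12", "2016-03-12", "2016-04-12", "2016-04-12", "2016-04-12"]
    ),
    "TotalSteps": [3000, 2500, 8000, 7000, 9000],
})

# Mean steps and number of distinct users per date.
per_day = toy.groupby("ActivityDate").agg(
    AvgStepsPerDay=("TotalSteps", "mean"),
    Users=("Id", "nunique"),
)
print(per_day)
```

Dates with a small `Users` count can then be filtered out or flagged before reading a trend into the averages.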


In [13]:
# find out what days people took the most steps?
# pd.set_option('display.max_rows', None)
total_steps_by_day = daily_activity.groupby('ActivityDate').TotalSteps.sum()
total_steps_by_day = total_steps_by_day.to_frame().sort_values('TotalSteps', ascending=True)
print(total_steps_by_day) # 2016-04-12 had the most steps (314095); it is the last row because the sort is ascending, not the last day of the data

              TotalSteps
ActivityDate            
2016-03-18          1317
2016-03-24          1958
2016-03-13          3226
2016-03-12          5543
2016-03-19          5702
...                  ...
2016-04-06        263630
2016-04-21        263795
2016-04-23        267124
2016-04-16        277733
2016-04-12        314095

[62 rows x 1 columns]


In [14]:
# calories vs. steps vs. sedentary minutes
sum_steps_calories_per_user = daily_activity.groupby('Id').agg(
    TotalCalories = ('Calories', 'sum'), 
    TotalSteps = ('TotalSteps', 'sum'), 
    SedentaryMinutes = ('SedentaryMinutes', 'sum')
)

print(sum_steps_calories_per_user)

            TotalCalories  TotalSteps  SedentaryMinutes
Id                                                     
1503960366          90437      596789             41680
1624580081          71689      258360             63278
1644430081         113503      311237             45198
1844505072          68169      123669             49829
1927972279          94405       54570             52275
2022484408         107513      498589             47197
2026352035          64026      213286             29282
2320127002          71834      183884             52814
2347167796          67102      318355             22627
2873212765          79775      313868             47656
2891001357          18187        6189              8799
3372868164          57265      198508             32402
3977333714          62187      433504             29731
4020332650         172372      255135             73018
4057192912          68808       75743             48327
4319703577          87099      319181           
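
Per-user sums depend on how many days each user logged, so per-user means may be a fairer basis for comparing calories, steps, and sedentary minutes. The following sketch shows that variant together with a correlation matrix; the toy values are made up and deliberately perfectly linear, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Toy daily rows (values are made up).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3, 3],
    "Calories": [1800, 2000, 2400, 2600, 3000, 3200],
    "TotalSteps": [4000, 6000, 8000, 10000, 12000, 14000],
    "SedentaryMinutes": [1000, 900, 800, 700, 600, 500],
})

# Average day per user, then pairwise Pearson correlations between the columns.
per_user_mean = toy.groupby("Id")[["Calories", "TotalSteps", "SedentaryMinutes"]].mean()
corr = per_user_mean.corr()
print(corr.round(2))
```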

Let's switch to the hourly dataframes.

In [15]:
# display(hourly_calories.groupby('HourOfDay')[['Calories']].sum())
# print(hourly_calories.groupby('HourOfDay')[['Calories']].sum().max()) # 6pm hour with highest total calories

display(hourly_calories.groupby('HourOfDay')[['Calories']].mean().round())
print(hourly_calories.groupby('HourOfDay')[['Calories']].mean().round().max()) # hour 18 and hour 19 (6 and 7pm) are tied for highest avg calories 


HourOfDay,Calories
0,72.0
1,70.0
2,69.0
3,68.0
4,68.0
5,80.0
6,84.0
7,92.0
8,101.0
9,106.0


Calories    119.0
dtype: float64
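
Calling `.max()` on the grouped result returns the peak value but not the hour it belongs to. One way to recover the hour, including ties, is sketched below on made-up toy rows; `peak_hours` is an illustrative name.

```python
import pandas as pd

# Toy hourly rows (values are made up).
toy = pd.DataFrame({
    "HourOfDay": [17, 17, 18, 18, 19, 19],
    "Calories": [110, 120, 118, 120, 119, 119],
})

hourly_mean = toy.groupby("HourOfDay")["Calories"].mean().round()
peak_value = hourly_mean.max()

# Keep every hour that shares the maximum, so ties are not hidden.
peak_hours = hourly_mean[hourly_mean == peak_value].index.tolist()
print(peak_hours, peak_value)
```

`hourly_mean.idxmax()` is shorter, but it returns only the first hour when several share the maximum.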


In [16]:
hourly_intensities.groupby('HourOfDay')[['TotalIntensity']].mean().round() #highest average intensities tied at 5, 7, 8 pm 

HourOfDay,TotalIntensity
0,2.0
1,1.0
2,1.0
3,0.0
4,1.0
5,4.0
6,7.0
7,10.0
8,14.0
9,15.0


In [17]:
hourly_steps.groupby('HourOfDay')['StepTotal'].mean().round() # highest average steps at 7pm (hour 19), closely followed by 6pm (hour 18)

HourOfDay
0      43.0
1      22.0
2      14.0
3       7.0
4      11.0
5      35.0
6     148.0
7     286.0
8     395.0
9     432.0
10    459.0
11    455.0
12    534.0
13    496.0
14    506.0
15    398.0
16    471.0
17    500.0
18    550.0
19    555.0
20    378.0
21    284.0
22    204.0
23    112.0
Name: StepTotal, dtype: float64
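
The hourly pattern is easier to read as a chart. Below is a minimal Matplotlib sketch that uses the rounded hourly means copied from the output above; the figure size, labels, and title are illustrative choices rather than part of the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Rounded hourly means copied from the output above (hours 0 to 23).
avg_steps = pd.Series(
    [43, 22, 14, 7, 11, 35, 148, 286, 395, 432, 459, 455,
     534, 496, 506, 398, 471, 500, 550, 555, 378, 284, 204, 112],
    index=pd.RangeIndex(24, name="HourOfDay"),
    name="StepTotal",
)

# One bar per hour of the day.
fig, ax = plt.subplots(figsize=(8, 3))
ax.bar(avg_steps.index, avg_steps.values)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Average steps")
ax.set_title("Average steps by hour")

print(avg_steps.idxmax())
```

The chart makes the midday (hour 12) and early-evening (hours 18 and 19) peaks visible at a glance.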

In [18]:
# find minutes sleep, minutes restless, and minutes awake
# each row of minute_sleep is one minute; value codes: 1 = asleep, 2 = restless, 3 = awake
sleep_summary = (
    minute_sleep
    .groupby(['Id', 'logId'])
    .agg(
        start_time=('date', 'min'),
        end_time=('date', 'max'),
        minutes_asleep=('value', lambda x: (x == 1).sum()),
        minutes_restless=('value', lambda x: (x == 2).sum()),
        minutes_awake=('value', lambda x: (x == 3).sum())
    )
    .reset_index()
)

sleep_summary.head()

Unnamed: 0,Id,logId,start_time,end_time,minutes_asleep,minutes_restless,minutes_awake
0,1503960366,11114919637,2016-03-13 02:39:30,2016-03-13 09:44:30,411,15,0
1,1503960366,11126343681,2016-03-14 01:32:00,2016-03-14 07:57:00,354,27,5
2,1503960366,11134971215,2016-03-15 02:36:00,2016-03-15 08:10:00,312,16,7
3,1503960366,11142197163,2016-03-16 03:12:00,2016-03-16 08:14:00,272,26,5
4,1503960366,11142197164,2016-03-16 19:43:00,2016-03-16 20:45:00,61,2,0
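
The per-state minute counts can also be produced without lambdas by cross-tabulating sleep logs against the `value` column. This is an alternative sketch on made-up toy rows, using the same state codes as the cell above; `minutes_in_bed` is a hypothetical extra column.

```python
import pandas as pd

# Toy minute-level rows (made up). value: 1 = asleep, 2 = restless, 3 = awake.
toy = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "logId": [10, 10, 10, 10, 20, 20],
    "value": [1, 1, 2, 3, 1, 1],
})

# One row per sleep log, one column per state, counting minutes in each state.
counts = (
    pd.crosstab([toy["Id"], toy["logId"]], toy["value"])
    .reindex(columns=[1, 2, 3], fill_value=0)
    .rename(columns={1: "minutes_asleep", 2: "minutes_restless", 3: "minutes_awake"})
    .reset_index()
)
counts.columns.name = None

# Total logged minutes per sleep log.
state_cols = ["minutes_asleep", "minutes_restless", "minutes_awake"]
counts["minutes_in_bed"] = counts[state_cols].sum(axis=1)
print(counts)
```

The `reindex` step guarantees all three state columns exist even if a state never occurs in the data.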


In [19]:
# add hours_slept column, where we calculate how many hours users slept rounded to 2 decimal places  
sleep_summary['hours_slept'] = (sleep_summary['minutes_asleep'] / 60).round(2)

# sleep_summary.head(5).sort_values('start_time', ascending=True)

In [20]:
# display(sleep_summary[sleep_summary['hours_slept'] < 7].sort_values('start_time', ascending=True).reset_index(drop=True)) # 599
# display(sleep_summary[sleep_summary['hours_slept'] >= 7].sort_values('start_time', ascending=True).reset_index(drop=True)) # 408

# are users that get more sleep more active?? 

# print(sleep_summary.Id.nunique()) # only 25 users tracked sleep

# sleep_summary.info() # 1007 entries 

# display(sleep_summary.groupby('Id')[['hours_slept']].mean().round(2)) # average hours slept per Id

display(sleep_summary[sleep_summary['Id'] == 6962181067]) # tracked their sleep quite a bit during the study

# seems like many people weren't consistent with tracking their sleep

Unnamed: 0,Id,logId,start_time,end_time,minutes_asleep,minutes_restless,minutes_awake,hours_slept
760,6962181067,11103653021,2016-03-11 23:29:00,2016-03-12 06:38:00,415,15,0,6.92
761,6962181067,11111697363,2016-03-12 21:46:00,2016-03-13 07:58:00,595,15,3,9.92
762,6962181067,11120259470,2016-03-13 23:42:00,2016-03-14 06:47:00,402,23,1,6.70
763,6962181067,11128078273,2016-03-14 23:07:00,2016-03-15 05:29:00,355,21,7,5.92
764,6962181067,11136373061,2016-03-15 23:01:00,2016-03-16 05:53:00,381,29,3,6.35
...,...,...,...,...,...,...,...,...
824,6962181067,11579365841,2016-05-07 23:27:00,2016-05-08 08:55:00,541,22,6,9.02
825,6962181067,11587333914,2016-05-08 23:30:00,2016-05-09 07:46:00,489,8,0,8.15
826,6962181067,11596233059,2016-05-09 22:49:00,2016-05-10 06:49:00,469,10,2,7.82
827,6962181067,11605753758,2016-05-10 23:27:00,2016-05-11 07:26:00,452,26,2,7.53


Some people who participated in the study were not consistent with tracking their sleep.

Recommendations:
1. Modify the watch product so that it can be worn comfortably during sleep, or offer alternative watch bands to make it more comfortable.
2. Create sleep rewards for users who sleep 7 hours or more.
3. Give users reminders to get ready to sleep.
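
Tracking consistency can be quantified rather than judged by eye. The sketch below counts the distinct nights each user logged and the share of logs under 7 hours, using a made-up toy frame shaped like `sleep_summary`; `night`, `nights_logged`, and `share_under_7` are illustrative names.

```python
import pandas as pd

# Toy per-log summary (values are made up), shaped like sleep_summary.
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "start_time": pd.to_datetime(
        ["2016-03-12 23:00", "2016-03-13 23:10", "2016-03-15 22:50", "2016-03-12 23:30"]
    ),
    "hours_slept": [6.5, 7.2, 8.0, 5.9],
})

# Distinct calendar dates on which each user started a sleep log.
nights_logged = (
    toy.assign(night=toy["start_time"].dt.normalize())
    .groupby("Id")["night"]
    .nunique()
)

# Share of logs shorter than the 7-hour threshold.
share_under_7 = (toy["hours_slept"] < 7).mean()
print(nights_logged)
print(share_under_7)
```

Comparing `nights_logged` with the number of days in the study period would give a per-user tracking rate to support the recommendations above.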

In [21]:
daily_activity.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,2016-03-25,11004,7.11,7.11,0.0,2.57,0.46,4.07,0.0,33,12,205,804,1819
1,1503960366,2016-03-26,17609,11.55,11.55,0.0,6.92,0.73,3.91,0.0,89,17,274,588,2154
2,1503960366,2016-03-27,12736,8.53,8.53,0.0,4.66,0.16,3.71,0.0,56,5,268,605,1944
3,1503960366,2016-03-28,13231,8.93,8.93,0.0,3.19,0.79,4.95,0.0,39,20,224,1080,1932
4,1503960366,2016-03-29,12041,7.85,7.85,0.0,2.16,1.09,4.61,0.0,28,28,243,763,1886
