# Bellabeat Smart Device Usage Analysis
### Case Study 2: How Can a Wellness Technology Company Play It Smart?

**Author:** Danijia Haggins  
**Date:** October 2025  
**Tools:** Python, Pandas, Matplotlib, NumPy

## **Scenario** 
I am a consumer insights analyst on the marketing analytics team at Bellabeat, a tech company that makes wellness products for women. The company's CCO believes that analyzing smart device fitness data could provide valuable insights to inform Bellabeat's marketing strategy.

## **Business Task**:
Analyze smart device usage from FitBit users to discover trends in activity, sleep, and wellness habits. Apply these insights to help **Bellabeat** understand how consumers engage with health-tracking devices and to improve marketing strategies for the Leaf wellness tracker.

[Data source](https://www.kaggle.com/datasets/arashnic/fitbit)

In this case study I follow the six steps of the data analysis process:
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act 


## ASK
**Key stakeholders**:
1. Urška Sršen (Cofounder & Chief Creative Officer)
2. Sando Mur (Cofounder, Executive Team)

**Key Questions Guiding Analysis**
1. What are the main trends in smart device usage?
2. How do these trends relate to Bellabeat customers?
3. How can these insights inform Bellabeat's marketing strategy?

## 1. Prepare

### About this data
* The data is loaded, processed, and analyzed within Kaggle notebooks using Python, and is held in memory for the purposes of this project
* The time range this data represents is March 12, 2016-May 12, 2016
* The source data is stored in CSVs (29 total)
* The data is mostly in long format, with a few of the CSVs replicated in wide format
* The data has been made available under the Creative Commons [CC0: Public Domain License](https://creativecommons.org/publicdomain/zero/1.0/)

In [None]:
# importing the libraries needed

import numpy as np # linear algebra
import matplotlib.pyplot as plt # plotting 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/working'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2. Process 
* I'm using Python to process (clean/transform) the data because:
    * I can clean, transform, and analyze the data in one place without switching between platforms such as SQL and Excel.
    * Python scales better than spreadsheets for high-volume data (an Excel worksheet holds at most 1,048,576 rows, and this data includes more rows than that).
    * I can document every step alongside the analysis, which makes the results easy to reproduce.

### 📈🛑 Data Limitations
* There are two folders containing data: one for user data from March 12, 2016 to April 11, 2016 and another for data from April 12, 2016 to May 12, 2016. The folders contain 11 CSVs with matching columns, which need to be combined during data cleaning/preprocessing.
* Two participants (IDs 2891001357 and 6391747486) appear only in the first dailyActivity_merged dataset (from 3.12.16 to 4.11.16) and have no recorded activity after April 11, 2016. This may indicate device non-use or dropout during the study period.
* Some CSVs only cover users between April 12, 2016 and May 12, 2016: dailySteps, dailyIntensities, dailyCalories, and sleepDay. This data is also redundant, as total counts for steps and calories are included in the dailyActivity CSVs, and steps are included in the minuteSteps CSVs. We will not use these CSVs in this analysis.
* Some of the minute data is too granular for this analysis: minuteCalories, minuteIntensities, and minuteMETs will not be used.
*  Users may not wear their devices every day
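
The last limitation can be quantified before drawing conclusions from daily averages. Below is a minimal sketch on a made-up stand-in for `daily_activity` (the `toy_activity` frame and its values are hypothetical). It treats a zero-step day as a rough proxy for a day the device was not worn, which is an assumption rather than something the dataset states.

```python
import pandas as pd

# Toy stand-in for daily_activity; the values are made up for illustration.
toy_activity = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "ActivityDate": pd.to_datetime(
        ["2016-04-10", "2016-04-11", "2016-04-12", "2016-04-10", "2016-04-11"]
    ),
    "TotalSteps": [5000, 0, 7200, 3100, 4000],
})

# Per user: how many distinct days were logged, and how many of those
# days recorded zero steps (assumed proxy for "device not worn").
coverage = toy_activity.groupby("Id").agg(
    days_logged=("ActivityDate", "nunique"),
    zero_step_days=("TotalSteps", lambda s: int((s == 0).sum())),
)
print(coverage)
```

Users with few logged days or many zero-step days could then be flagged before their averages are compared with everyone else's.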

### 📁 Data Preparation 
Of the 29 CSVs provided in the Fitbit dataset, only the following were used for this analysis:
* daily_activity
* heartrate_seconds
* minute_sleep
* hourly_calories
* hourly_intensities
* hourly_steps
* weight_log_info

These were selected because they contain the most relevant and complete data for understanding user activity and health behavior. Files with incomplete, duplicate, or overly granular data (such as the minute-level calorie, intensity, and MET logs) were excluded to simplify analysis and maintain clarity.

### 🔄 Data Loading
* Import the csvs into dataframes 
* Inspect structure
* Combine csvs with matching columns + data into one dataframe
* Check dataframe data types to avoid computational errors. 
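
The loading steps above can also be wrapped in one small helper. The sketch below is illustrative only: `load_and_combine` is a hypothetical function, not part of pandas, and the demo uses two throwaway CSVs with made-up values rather than the Fitbit files.

```python
import os
import tempfile

import pandas as pd


def load_and_combine(paths):
    """Read each CSV, confirm the columns match, and stack the rows."""
    frames = [pd.read_csv(p) for p in paths]
    first_cols = list(frames[0].columns)
    for p, frame in zip(paths, frames):
        if list(frame.columns) != first_cols:
            raise ValueError(f"Column mismatch in {p}")
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny temporary files (made-up values, not the Fitbit data).
tmp_dir = tempfile.mkdtemp()
path_1 = os.path.join(tmp_dir, "demo_1.csv")
path_2 = os.path.join(tmp_dir, "demo_2.csv")
pd.DataFrame({"Id": [1, 2], "TotalSteps": [100, 200]}).to_csv(path_1, index=False)
pd.DataFrame({"Id": [3], "TotalSteps": [300]}).to_csv(path_2, index=False)

combined = load_and_combine([path_1, path_2])
print(combined.dtypes)  # inspect data types after loading
print(len(combined))
```

Raising on a column mismatch surfaces a mislabeled file at load time instead of silently producing NaN-filled columns after the concat.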

In [None]:
# loading the csv data into separate dataframes
da1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/dailyActivity_merged_1.csv')
da2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/dailyActivity_merged_2.csv')
hs1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/heartrate_seconds_merged_1.csv')
hs2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/heartrate_seconds_merged_2.csv')
hc1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlyCalories_merged_1.csv')
hc2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlyCalories_merged_2.csv')
hi1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlyIntensities_merged_1.csv')
hi2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlyIntensities_merged_2.csv')
hstps1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/hourlySteps_merged_1.csv')
hstps2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/hourlySteps_merged_2.csv')
msl1= pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/minuteSleep_merged_1.csv')
msl2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/minuteSleep_merged_2.csv')
wlg1 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_3.12.16_to_4.11.16/weightLogInfo_merged_1.csv') 
wlg2 = pd.read_csv('/kaggle/input/bella-beat-case-study/data/fitabase_data_4.12.16_to_5.12.16/weightLogInfo_merged_2.csv')



In [None]:
# inspecting dataframes structure

# store data frames loaded above in a list 
dfs = [da1, da2, hs1, hs2, hc1, hc2, hi1, hi2, hstps1, hstps2, msl1, msl2, wlg1, wlg2]

# iterate over the list with a loop
# enumerate(iterable, start=0)
for i, x in enumerate(dfs, 1): # i = counter, start=1, x = df from dfs list
    print(f"Dataframe {i}:") # print number of the dataframe
    display(x.head(2)) # use display() for notebook friendly display
    print("\n") # add space between outputs

In [None]:
# check for nulls / missing data
# create a dictionary with the dataframes and their names
dict1 = {'da1': da1, 
       'da2': da2, 
       'hs1': hs1, 
       'hs2': hs2, 
       'hc1': hc1, 
       'hc2': hc2, 
       'hi1': hi1, 
       'hi2': hi2, 
       'hstps1': hstps1, 
       'hstps2': hstps2, 
       'msl1': msl1, 
       'msl2': msl2, 
       'wlg1': wlg1, 
       'wlg2': wlg2
      }

# iterate over dict1
for name, df in dict1.items():
    print(f"Null values in {name}:")
    null_counts = df.isnull().sum()
    if null_counts.sum() > 0: # can use .any() or .sum()
        print(null_counts[null_counts > 0]) # Print only columns with nulls
    else:
        print("No null values found.")
    print("\n")



**Note:** 
DataFrames wlg1 and wlg2 contain NaN values in the body fat column because some users did not record their body fat percentages. This will not cause errors in the analysis, so I am leaving the values as they are and noting it here.
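
A small illustration of why the NaN values are harmless for aggregates: pandas skips them by default. The frame below is made up, and the column names `WeightKg` and `Fat` are assumptions about the weight log layout.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weight log; values are made up.
toy_weight = pd.DataFrame({
    "Id": [1, 1, 2],
    "WeightKg": [60.0, 61.0, 72.5],
    "Fat": [22.0, np.nan, np.nan],
})

fat_mean = toy_weight["Fat"].mean()           # NaN rows are skipped by default
fat_logged = toy_weight["Fat"].notna().sum()  # entries that actually recorded a value
print(fat_mean, fat_logged)
```

The count of non-null entries is worth reporting next to any body fat statistic, since the mean may rest on very few records.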

In [None]:
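# combining CSVs with the same columns into one dataframe
# ignore_index=True rebuilds the index so row labels are not repeated across the two halves
daily_activity = pd.concat([da1, da2], ignore_index=True)
heartrate_seconds = pd.concat([hs1, hs2], ignore_index=True)
hourly_calories = pd.concat([hc1, hc2], ignore_index=True)
hourly_intensities = pd.concat([hi1, hi2], ignore_index=True)
hourly_steps = pd.concat([hstps1, hstps2], ignore_index=True)
minute_sleep = pd.concat([msl1, msl2], ignore_index=True)
weight_log_info = pd.concat([wlg1, wlg2], ignore_index=True)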
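
Because the two folders cover adjacent date ranges, it may be worth checking whether any user-day appears in both halves after combining. This is a hedged sketch on made-up frames (`part_1` and `part_2` are hypothetical). Whether the real files overlap is not established here.

```python
import pandas as pd

# Two made-up halves that share one (Id, ActivityDate) row.
part_1 = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDate": ["4/11/2016", "4/12/2016"],
    "TotalSteps": [4000, 5000],
})
part_2 = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDate": ["4/12/2016", "4/13/2016"],
    "TotalSteps": [5000, 6000],
})

stacked = pd.concat([part_1, part_2], ignore_index=True)

# Flag rows whose (Id, ActivityDate) pair was already seen earlier in the frame.
dup_mask = stacked.duplicated(subset=["Id", "ActivityDate"], keep="first")
n_dupes = int(dup_mask.sum())
deduped = stacked[~dup_mask].reset_index(drop=True)
print(n_dupes, len(deduped))
```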

In [None]:
# store the combined dataframes in a list
all_dfs = [daily_activity, heartrate_seconds, hourly_calories, hourly_intensities, hourly_steps, minute_sleep, weight_log_info]

# view first 5 rows of each dataframe
for i, x in enumerate(all_dfs, 1): 
    print(f'Dataframe {i}:')
    display(x.head())
    print('\n')

In [None]:
# checking data types
all_dfs[0].info() # daily_activity, position 0 in all_dfs 
all_dfs[1].info() # heartrate seconds, position 1 in all_dfs
all_dfs[2].info() # hourly calories, position 2 in all_dfs
all_dfs[3].info() # hourly intensities, position 3 in all_dfs
all_dfs[4].info() # hourly steps, position 4 in all_dfs
all_dfs[5].info() # minute sleep, position 5 in all_dfs
all_dfs[6].info() # weight_log_info, position 6 in all_dfs

In [None]:
# convert datetime columns from objects to datetime64 datatypes & check the columns and datatypes
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'])
heartrate_seconds['Time'] = pd.to_datetime(heartrate_seconds['Time'])
hourly_calories['ActivityHour'] = pd.to_datetime(hourly_calories['ActivityHour'])
hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'])
hourly_steps['ActivityHour'] = pd.to_datetime(hourly_steps['ActivityHour'])
minute_sleep['date'] = pd.to_datetime(minute_sleep['date'])
weight_log_info['Date'] = pd.to_datetime(weight_log_info['Date'])
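
Passing an explicit `format` to `pd.to_datetime` avoids per-value format inference, which can speed up parsing of large columns such as the heart rate data. The sketch below assumes timestamps laid out like `4/12/2016 1:00:00 PM`. That layout is an assumption and should be checked against the actual columns before use.

```python
import pandas as pd

# Made-up timestamp strings in the assumed month/day/year 12-hour layout.
raw_hours = pd.Series(["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM"])

# An explicit format skips format inference and fails loudly on a mismatch.
parsed_hours = pd.to_datetime(raw_hours, format="%m/%d/%Y %I:%M:%S %p")
print(parsed_hours)
```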
 

In [None]:
# check that datetime columns changed to datetime64
daily_activity['ActivityDate'].info() # daily_activity
heartrate_seconds['Time'].info() # heartrate seconds
hourly_calories['ActivityHour'].info() # hourly calories
hourly_intensities['ActivityHour'].info() # hourly intensities
hourly_steps['ActivityHour'].info() # hourly steps
minute_sleep['date'].info() # minute sleep
weight_log_info['Date'].info() # weight_log_info
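
Instead of reading each `.info()` printout, the same check can be done programmatically. This is a minimal sketch using `pandas.api.types.is_datetime64_any_dtype`. The `toy_frames` mapping and its contents are hypothetical stand-ins for the real dataframes and their date columns.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# name -> (dataframe, date column); made-up frames for illustration.
toy_frames = {
    "converted": (pd.DataFrame({"ActivityDate": pd.to_datetime(["2016-04-12"])}), "ActivityDate"),
    "not_converted": (pd.DataFrame({"Date": ["4/12/2016"]}), "Date"),
}

# True where the date column has a datetime64 dtype, False otherwise.
dtype_ok = {
    name: is_datetime64_any_dtype(frame[col])
    for name, (frame, col) in toy_frames.items()
}
print(dtype_ok)
```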

In [None]:
# checking dates: 

# view first 5 rows of each dataframe
for i, x in enumerate(all_dfs, 1): 
    print(f'Dataframe {i}:')
    display(x.head())
    print('\n')



## 3. Analyze 
Now that the data has been loaded and preprocessed, it's time for analysis.

**Guiding questions:**
* What surprises did you find in the data?
* What trends and relationships did you find in the data?

**Key Tasks**
* Aggregate data so it's useful and accessible
* Organize and format data
* Perform calculations

In [None]:
# start with the daily_activity dataframe

daily_activity.head(2)

In [None]:
# print(daily_activity['ActivityDate'].min())
display(daily_activity.sort_values('ActivityDate', ascending=True).head(10))


    

In [None]:
print(daily_activity['Id'].nunique()) # 35 unique ids
unique_ids = daily_activity.Id.unique()
print(type(unique_ids))
unique_ids = unique_ids.tolist()
print(type(unique_ids))
print(unique_ids)
display(daily_activity[daily_activity['Id'] == 2891001357].sort_values('ActivityDate', ascending=True)) 
display(daily_activity[daily_activity['Id'] == 6391747486].sort_values('ActivityDate', ascending=True))
# display(daily_activity[daily_activity.TotalSteps == 0].sort_values('ActivityDate', ascending=True).reset_index(drop=True))
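
Rather than inspecting IDs one at a time, users with no records after a cutoff date can be listed in one pass. The sketch below uses a made-up frame (`toy_activity` and its Ids are hypothetical). The April 11, 2016 cutoff mirrors the end of the first data folder.

```python
import pandas as pd

# Toy stand-in for daily_activity; Ids and dates are made up.
toy_activity = pd.DataFrame({
    "Id": [10, 10, 20, 20, 30],
    "ActivityDate": pd.to_datetime(
        ["2016-04-01", "2016-04-20", "2016-03-15", "2016-04-11", "2016-05-01"]
    ),
})

cutoff = pd.Timestamp("2016-04-11")

# Last logged date per user, then keep users never seen after the cutoff.
last_seen = toy_activity.groupby("Id")["ActivityDate"].max()
dropped_out = last_seen[last_seen <= cutoff].index.tolist()
print(dropped_out)
```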

In [None]:
# calculate aggregates

# avg steps per user 
avg_steps_per_user = daily_activity.groupby('Id').TotalSteps.mean()
avg_steps_per_user = avg_steps_per_user.to_frame().sort_values('TotalSteps', ascending=True)
print(type(avg_steps_per_user))

print(avg_steps_per_user)

# user with the highest average steps: Id 8877689391 (last row, since the sort is ascending)

# sum of all steps per user
total_steps_per_user = daily_activity.groupby('Id').TotalSteps.sum()

total_steps_per_user = total_steps_per_user.to_frame().sort_values('TotalSteps', ascending=True)

print(total_steps_per_user)

In [None]:
# find out what days people took the most steps?
# pd.set_option('display.max_rows', None)
total_steps_by_day = daily_activity.groupby('ActivityDate').TotalSteps.sum()
total_steps_by_day = total_steps_by_day.to_frame().sort_values('TotalSteps', ascending=True)
print(total_steps_by_day) # 2016-04-12 had the most steps = 314095 -- the last row of the sorted output
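
Summed steps per date depend on how many users logged data that day, so a per-weekday mean can be a fairer comparison. This is a minimal sketch on made-up values and shows one possible follow-up, not a result from this dataset.

```python
import pandas as pd

# Toy stand-in for daily_activity; values are made up.
toy_activity = pd.DataFrame({
    "ActivityDate": pd.to_datetime(
        ["2016-04-11", "2016-04-12", "2016-04-18", "2016-04-19"]
    ),
    "TotalSteps": [4000, 9000, 6000, 11000],
})

# Label each row with its weekday name, then average steps per weekday.
toy_activity["Weekday"] = toy_activity["ActivityDate"].dt.day_name()
avg_steps_by_weekday = toy_activity.groupby("Weekday")["TotalSteps"].mean()
print(avg_steps_by_weekday)
```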

In [None]:
sum_steps_calories_per_user = daily_activity.groupby('Id').agg(
    TotalCalories = ('Calories', 'sum'), 
    TotalSteps = ('TotalSteps', 'sum')
)

print(sum_steps_calories_per_user)
# calories vs. steps
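
One way to follow up on the calories vs. steps comparison is a Pearson correlation between the two per-user totals. The sketch below uses a made-up, perfectly linear frame (`toy_totals` is hypothetical), so the coefficient it prints says nothing about the real data.

```python
import pandas as pd

# Toy stand-in for sum_steps_calories_per_user; values are made up.
toy_totals = pd.DataFrame(
    {
        "TotalSteps": [1000, 2000, 3000, 4000],
        "TotalCalories": [1500, 1700, 1900, 2100],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# Series.corr computes the Pearson correlation coefficient by default.
steps_calories_corr = toy_totals["TotalSteps"].corr(toy_totals["TotalCalories"])
print(round(steps_calories_corr, 3))
```

A scatter plot of the same two columns with Matplotlib would show whether a single coefficient hides outliers or distinct user groups.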