# Google Data Analytics Capstone Case Study
This is a data analysis capstone case study (Bellabeat) from the Google Data Analytics Professional Certificate.
## About Bellabeat
Bellabeat, founded by Urška Sršen and Sando Mur, is a high-tech company that creates health-focused smart products for women. The company leverages Sršen's artistic background to design aesthetically pleasing technology that tracks activity, sleep, stress, and reproductive health, empowering women with valuable health insights.
## About the case study
The data analyst is asked to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. Based on that analysis, the analyst then selects one Bellabeat product and applies the discovered insights to it in the presentation.
## 1. Phase 1: Ask
### 1.1 **Business Task** 
Analyze the usage data of non-Bellabeat smart devices to identify consumer behaviour patterns and preferences. The insights gained from this analysis will be used to recommend improvements or new features to one of Bellabeat’s products, enhancing its appeal and usability based on real-world consumer data.
### 1.2 **Key Stakeholders**
* Urška Sršen - Bellabeat cofounder and Chief Creative Officer
* Sando Mur - Bellabeat cofounder and key member of the Bellabeat executive team
* Bellabeat Marketing Analytics Team

## 2. Phase 2: Prepare
### 2.1 Dataset Source
The FitBit Fitness Tracker Data dataset, made available on Kaggle by [Mobius](https://www.kaggle.com/datasets/arashnic/fitbit) (CC0: Public Domain).
### 2.2 Dataset Information
This dataset was generated from a distributed survey conducted via Amazon Mechanical Turk between March 12, 2016, and May 12, 2016. Thirty eligible Fitbit users consented to submit their personal tracker data, which includes minute-level details on physical activity, heart rate, and sleep monitoring. The data can be segmented by export session ID (column A) or timestamp (column B). Differences in the data reflect the use of various Fitbit tracker models and individual tracking habits and preferences.
### 2.3 Dataset Organization
The dataset consists of 18 CSV files, containing information on activity, calories, intensities, and steps, categorized into daily, hourly, and minute intervals. It also includes data on heart rate, metabolic equivalents (METs), sleep duration, and weight. The dataset is organized in a long table format, where each user is assigned a unique ID, and each row represents an individual observation for a specific user. Consequently, each user ID appears multiple times across rows, as the data is tracked by day and time.
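
To make the long format concrete, here is a minimal sketch on made-up rows. The IDs and step counts are hypothetical and are not taken from the Fitbit files; only the column names follow the daily activity table.

```python
import pandas as pd

# Hypothetical long-format rows: one row per user per day (values are made up).
long_df = pd.DataFrame({
    "Id": [1001, 1001, 1001, 1002, 1002],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/14/2016", "4/12/2016", "4/13/2016"],
    "TotalSteps": [13162, 10735, 10460, 8163, 7007],
})

# Each Id repeats once per tracked day, so counting rows per Id
# shows how many days each user logged.
days_per_user = long_df.groupby("Id").size()
print(days_per_user)
```

Counting rows per `Id` in this way is also a quick check of how evenly the users are represented.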
### 2.4 Dataset Credibility and Integrity
The dataset is relatively small, nominally 30 users, and lacks demographic information such as location, age, and health condition, so sampling bias cannot be ruled out. Additionally, some data categories, like weight, have only 8 user submissions, making them difficult to analyze alongside categories like activity, calories, and intensities, which have 33 user submissions (more than the 30 users stated in the dataset description). Furthermore, the dataset was last updated four months ago and the underlying data was collected in 2016, so it cannot be considered current. Consequently, the analytical results of this case study may not accurately reflect the current state of the smart device market.

## 3. Phase 3: Process
In this case study, I aim to investigate the dataset to uncover insights on the following questions:
* Does daily activity impact users' sleep duration?
* During which time period does most daily activity occur?

The following datasets will be used for this case study:
* dailyActivity_merged.csv
* sleepDay_merged.csv
* hourlyIntensities_merged.csv

This notebook is written in Python. For a similar version in R, please refer to this [notebook](https://www.kaggle.com/code/willbao33/bellabeat-case-study-with-r).
### 3.1 Library Imports
The libraries needed to load, preprocess, and visualize the data are imported first.

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os

# List files in the directory
base_path = '/kaggle/input'
for dirname, _, filenames in os.walk(base_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Load the three datasets and preview the first rows of each
data_dir = '/kaggle/input/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16'
daily_activity = pd.read_csv(f'{data_dir}/dailyActivity_merged.csv')
sleep_day = pd.read_csv(f'{data_dir}/sleepDay_merged.csv')
hourly_intensities = pd.read_csv(f'{data_dir}/hourlyIntensities_merged.csv')
daily_activity.head(), sleep_day.head(), hourly_intensities.head()

Notice that the activity and sleep data can be merged on two attributes: Id and date (ActivityDate and SleepDay). The SleepDay attribute contains both date and time, while ActivityDate contains only the date. Similarly, ActivityHour in the intensity data contains both date and time. Thus, some preprocessing is required before merging and further analysis.
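
Why the date keys matter can be illustrated with a small sketch on made-up rows. It assumes, purely for illustration, that one sleep record carries a non-midnight time; whether that ever happens in the real `sleepDay_merged.csv` is not checked here. `dt.normalize()` resets the time part to midnight so that both keys become date-only.

```python
import pandas as pd

# Made-up rows: one sleep record stamped at midnight, one at 11:30 PM.
sleep = pd.DataFrame({
    "Id": [1, 1],
    "SleepDay": ["4/12/2016 12:00:00 AM", "4/13/2016 11:30:00 PM"],
    "TotalMinutesAsleep": [420, 390],
})
activity = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDate": ["4/12/2016", "4/13/2016"],
    "TotalSteps": [9000, 12000],
})

sleep["Date"] = pd.to_datetime(sleep["SleepDay"], format="%m/%d/%Y %I:%M:%S %p")
activity["Date"] = pd.to_datetime(activity["ActivityDate"], format="%m/%d/%Y")

# Merging on the raw timestamps silently drops the 11:30 PM record.
raw_merge = activity.merge(sleep, on=["Id", "Date"])

# normalize() sets the time part to 00:00:00, so both keys are date-only.
sleep["Date"] = sleep["Date"].dt.normalize()
normalized_merge = activity.merge(sleep, on=["Id", "Date"])

print(len(raw_merge), len(normalized_merge))
```

Normalizing the timestamp before merging is a cheap safeguard even when all times are expected to be midnight.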

In [None]:
# Parse the date columns as datetimes and create new data frames with a common 'Date' column
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'], format='%m/%d/%Y')
daily_activity_new = daily_activity.rename(columns={'ActivityDate': 'Date'})

sleep_day['SleepDay'] = pd.to_datetime(sleep_day['SleepDay'], format='%m/%d/%Y %I:%M:%S %p')
sleep_day_new = sleep_day.rename(columns={'SleepDay':'Date'})

hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')
hourly_intensities['Date'] = hourly_intensities['ActivityHour'].dt.date
hourly_intensities['Time'] = hourly_intensities['ActivityHour'].dt.time
hourly_intensities_new = hourly_intensities.drop(columns=['ActivityHour'])
daily_activity_new.head(), sleep_day_new.head(), hourly_intensities_new.head()

In [None]:
# check for numbers of unique users and duplicates
# Count distinct Ids
n_distinct_daily_activity = daily_activity_new['Id'].nunique()
n_distinct_sleep_day = sleep_day_new['Id'].nunique()
n_distinct_hourly_intensities = hourly_intensities_new['Id'].nunique()

print(f'Number of distinct Ids in daily_activity_new: {n_distinct_daily_activity}')
print(f'Number of distinct Ids in sleep_day_new: {n_distinct_sleep_day}')
print(f'Number of distinct Ids in hourly_intensities_new: {n_distinct_hourly_intensities}')

# Count duplicated rows
sum_duplicated_daily_activity = daily_activity_new.duplicated().sum()
sum_duplicated_sleep_day = sleep_day_new.duplicated().sum()
sum_duplicated_hourly_intensities = hourly_intensities_new.duplicated().sum()

print(f'Number of duplicated rows in daily_activity_new: {sum_duplicated_daily_activity}')
print(f'Number of duplicated rows in sleep_day_new: {sum_duplicated_sleep_day}')
print(f'Number of duplicated rows in hourly_intensities_new: {sum_duplicated_hourly_intensities}')

In [None]:
# Remove duplicate rows and rows with missing values
daily_activity_new = daily_activity_new.drop_duplicates().dropna()
sleep_day_new = sleep_day_new.drop_duplicates().dropna()
hourly_intensities_new = hourly_intensities_new.drop_duplicates().dropna()

# Count distinct Ids
n_distinct_daily_activity = daily_activity_new['Id'].nunique()
n_distinct_sleep_day = sleep_day_new['Id'].nunique()
n_distinct_hourly_intensities = hourly_intensities_new['Id'].nunique()

print(f'Number of distinct Ids in daily_activity_new: {n_distinct_daily_activity}')
print(f'Number of distinct Ids in sleep_day_new: {n_distinct_sleep_day}')
print(f'Number of distinct Ids in hourly_intensities_new: {n_distinct_hourly_intensities}')

# Count duplicated rows
sum_duplicated_daily_activity = daily_activity_new.duplicated().sum()
sum_duplicated_sleep_day = sleep_day_new.duplicated().sum()
sum_duplicated_hourly_intensities = hourly_intensities_new.duplicated().sum()

print(f'Number of duplicated rows in daily_activity_new: {sum_duplicated_daily_activity}')
print(f'Number of duplicated rows in sleep_day_new: {sum_duplicated_sleep_day}')
print(f'Number of duplicated rows in hourly_intensities_new: {sum_duplicated_hourly_intensities}')
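
The counting code repeated in the two cells above could be wrapped in a small helper. The sketch below is one possible refactoring rather than part of the original notebook; the function name `summarize` and the demo table are hypothetical.

```python
import pandas as pd

def summarize(frames):
    """Return one row per table with row, distinct-Id, and duplicate-row counts."""
    return pd.DataFrame({
        name: {
            "rows": len(df),
            "distinct_ids": df["Id"].nunique(),
            "duplicated_rows": int(df.duplicated().sum()),
        }
        for name, df in frames.items()
    }).T

# Made-up table with one fully duplicated row (Id 1, Value 5).
demo = pd.DataFrame({"Id": [1, 1, 2, 2], "Value": [5, 5, 7, 8]})

before = summarize({"demo": demo})
after = summarize({"demo": demo.drop_duplicates().dropna()})
print(before)
print(after)
```

In the notebook, the same helper could be called with a dictionary of the three cleaned data frames before and after `drop_duplicates().dropna()`.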

In [None]:
# merge activity and sleep dataset on 'Id' and 'Date'
daily_activity_sleep = pd.merge(daily_activity_new, sleep_day_new, on=['Id', 'Date'])
daily_activity_sleep.head()

In [None]:
# The inner merge already dropped Ids without sleep records; remove any remaining duplicates or missing values, then count the users left
daily_activity_sleep = daily_activity_sleep.drop_duplicates().dropna()
print(daily_activity_sleep['Id'].nunique())
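
An inner merge keeps only the (Id, Date) pairs present in both tables, so it does not show how many activity rows were dropped. A minimal sketch on made-up rows, using the `how`, `indicator`, and `validate` parameters of `pandas.merge`, shows one way to inspect that coverage; the small tables here are hypothetical.

```python
import pandas as pd

# Made-up tables: three activity rows, but only one matching sleep row.
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [9000, 12000, 4000],
})
sleep = pd.DataFrame({
    "Id": [1],
    "Date": pd.to_datetime(["2016-04-12"]),
    "TotalMinutesAsleep": [420],
})

# how="left" keeps every activity row; indicator=True adds a "_merge" column
# recording whether a matching sleep row was found.
# validate="one_to_one" raises if either side has duplicate (Id, Date) keys.
checked = activity.merge(
    sleep, on=["Id", "Date"], how="left", indicator=True, validate="one_to_one"
)
coverage = checked["_merge"].value_counts()
print(coverage)

# Keeping only the matched rows reproduces the result of an inner merge.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

Reporting the `left_only` count alongside the merged data makes it clear how much of the activity data has no sleep record.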

## 4. Phase 4: Analyze
### 4.1.1 
#### Hypothesis: More daily activity (steps and burned calories) leads to more daily sleep time, and more steps lead to more calories burned.
First, I use scatter plots and correlation coefficients to examine the relationships between total steps, calories burned, and total minutes asleep.

In [None]:
# Relationship between total steps and total minutes asleep
# Calculate correlation
cor_steps_sleep = daily_activity_sleep['TotalSteps'].corr(daily_activity_sleep['TotalMinutesAsleep'])
cor_steps_sleep_text = f'Correlation: {cor_steps_sleep:.2f}'

# Create scatter plot with regression line
plt.figure(figsize=(10,6))
sns.regplot(x='TotalSteps', y='TotalMinutesAsleep', data=daily_activity_sleep, scatter_kws={'s':10}, line_kws={'color':'blue'})
plt.title('Total Steps vs. Total Minutes Asleep')
plt.xlabel('Total Steps')
plt.ylabel('Total Minutes Asleep')
plt.annotate(cor_steps_sleep_text, xy=(1,1), xycoords='axes fraction', fontsize=12, ha='right', va='top', color='red')
plt.show()

In [None]:
# Relationship between burned calories and total minutes asleep
# Calculate correlation
cor_cal_sleep = daily_activity_sleep['Calories'].corr(daily_activity_sleep['TotalMinutesAsleep'])
cor_cal_sleep_text = f'Correlation: {cor_cal_sleep:.2f}'

# Create scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Calories', y='TotalMinutesAsleep', data=daily_activity_sleep, scatter_kws={'s':10}, line_kws={'color':'blue'})
plt.title('Burned Calories vs. Total Minutes Asleep')
plt.xlabel('Burned Calories')
plt.ylabel('Total Minutes Asleep')
plt.annotate(cor_cal_sleep_text, xy=(1, 1), xycoords='axes fraction', fontsize=12, ha='right', va='top', color='red')
plt.show()

### 4.1.2 
#### Analyze
The hypothesis suggests that increased daily activity, measured as steps and calories burned, should result in longer sleep duration, as physical exertion can lead to greater bodily fatigue. However, the scatter plots of total steps and of calories burned against total sleep time indicate little to no linear correlation (-0.19 and -0.03, respectively). This lack of correlation may be because the participants have tight schedules that prevent longer sleep, or because they have habitual sleep patterns unaffected by their level of tiredness. Nonetheless, given the small dataset size and potential sampling bias, we cannot definitively conclude that there is no relationship between daily activity and total sleep time.
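
A weak Pearson coefficient only rules out a strong linear relationship. As a complementary check, a rank-based (Spearman) correlation can be computed by ranking each column and then taking the Pearson correlation of the ranks. The sketch below uses made-up values, not the Fitbit data, and is meant only to show the mechanics.

```python
import pandas as pd

# Made-up daily values for illustration only.
df = pd.DataFrame({
    "TotalSteps": [2000, 5000, 8000, 11000, 14000],
    "Calories": [1500, 1800, 2100, 2300, 3500],
    "TotalMinutesAsleep": [450, 400, 430, 380, 410],
})

# Pearson measures linear association between each pair of columns.
pearson = df.corr(method="pearson")

# Pearson on the ranks is Spearman's rank correlation, which only
# assumes a monotonic relationship and is less sensitive to outliers.
spearman = df.rank().corr(method="pearson")

print(pearson.round(2))
print(spearman.round(2))
```

In the notebook, the same two lines could be applied to the `TotalSteps`, `Calories`, and `TotalMinutesAsleep` columns of `daily_activity_sleep`.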

In [None]:
# Relationship between total steps and burned calories
# Calculate correlation
cor_steps_cal = daily_activity_sleep['TotalSteps'].corr(daily_activity_sleep['Calories'])
cor_steps_cal_text = f'Correlation: {cor_steps_cal:.2f}'

# Create scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='TotalSteps', y='Calories', data=daily_activity_sleep, scatter_kws={'s':10}, line_kws={'color':'blue'})
plt.title('Total Steps vs. Burned Calories')
plt.xlabel('Total Steps')
plt.ylabel('Burned Calories')
plt.annotate(cor_steps_cal_text, xy=(1, 1), xycoords='axes fraction', fontsize=12, ha='right', va='top', color='red')
plt.show()

#### Analyze
The hypothesis suggests that more daily steps should lead to more calories burned, as walking is a key contributor to daily caloric expenditure. The visualization reveals a moderate positive correlation between total steps and calories burned. However, the data does not exhibit a strong positive correlation, likely because many steps are taken during routine movements that do not significantly elevate the heart rate. Consequently, even with a high number of daily steps, the total calories burned may remain relatively low. Additionally, the dataset does not specify whether the total steps occurred before or after the recorded sleep duration on the same day. This distinction is important when collecting data, as increased activity levels might influence sleep duration on the following day.
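
One way to probe the timing question is to pair each day's sleep with the previous day's steps for the same user. The sketch below does this on made-up rows; the column name `PrevDaySteps` is introduced here for illustration, and whether a `SleepDay` record refers to the night before or after the activity is an assumption that would need checking against the dataset documentation.

```python
import pandas as pd

# Made-up merged rows: one row per user per day.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12", "2016-04-13"]
    ),
    "TotalSteps": [9000, 12000, 7000, 4000, 6000],
    "TotalMinutesAsleep": [420, 390, 450, 480, 460],
})

df = df.sort_values(["Id", "Date"])

# shift(1) within each user moves the previous row's steps onto the current row.
df["PrevDaySteps"] = df.groupby("Id")["TotalSteps"].shift(1)

# Keep only rows whose previous row really is the previous calendar day,
# so gaps in tracking do not pair non-adjacent days.
prev_date = df.groupby("Id")["Date"].shift(1)
consecutive = (df["Date"] - prev_date) == pd.Timedelta(days=1)
lagged = df[consecutive]

lag_corr = lagged["PrevDaySteps"].corr(lagged["TotalMinutesAsleep"])
print(lagged[["Id", "Date", "PrevDaySteps", "TotalMinutesAsleep"]])
print(f"Lagged correlation: {lag_corr:.2f}")
```

With only a month of data per user, a lagged correlation like this would still be exploratory rather than conclusive.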

### 4.2.1
#### Hypothesis: Most activity happens during after-work hours and before bedtime.
To visualize the hourly activity intensity, I group the intensity records by time of day and compute the mean total intensity for each hour.

In [None]:
hourly_intensity_mean = hourly_intensities_new.groupby('Time').agg({'TotalIntensity':'mean'}).reset_index()
hourly_intensity_mean = hourly_intensity_mean.rename(columns={'TotalIntensity':'MeanTotalIntensity'})

plt.figure(figsize=(12,6))
sns.barplot(x='Time', y='MeanTotalIntensity', data=hourly_intensity_mean, color='steelblue')
plt.title('Mean Total Intensity by Hour of the Day')
plt.xlabel('Time of Day')
plt.ylabel('Mean Total Intensity')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

#### Analyze
From the bar plot, we can observe two distinct time periods during the day where the mean total intensities peak: from 12 to 2 PM and from 5 to 7 PM. The peak between 12 and 2 PM is likely due to activities such as lunch breaks or work breaks. During this time, some individuals might engage in physical activities like walking or running to refresh themselves for the afternoon. The peak between 5 and 7 PM is probably because many people choose to go to the gym or engage in other forms of exercise after work, before having dinner or going to bed.
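
The peak hours can also be read off numerically rather than by eye. The following sketch, on made-up intensity values, groups by the integer hour from `dt.hour` and picks the busiest hours with `nlargest`; the `Hour` column is a name introduced here and does not exist in the notebook's data frames.

```python
import pandas as pd

# Made-up hourly intensity records for two days.
hourly = pd.DataFrame({
    "ActivityHour": pd.to_datetime([
        "2016-04-12 08:00", "2016-04-12 12:00", "2016-04-12 18:00",
        "2016-04-13 08:00", "2016-04-13 12:00", "2016-04-13 18:00",
    ]),
    "TotalIntensity": [10, 20, 30, 6, 24, 26],
})

# Integer hour of day (0 to 23) sorts and labels more cleanly than time objects.
hourly["Hour"] = hourly["ActivityHour"].dt.hour

# Mean intensity per hour across all days, then the two busiest hours.
mean_by_hour = hourly.groupby("Hour")["TotalIntensity"].mean()
top_hours = mean_by_hour.nlargest(2)
print(top_hours)
```

Applied to the real data, listing the top few hours would give a direct check on the lunch-time and early-evening peaks described above.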

## 5. Phase 5: Share
By analyzing the smart device usage data, I would like to share the following results and insights with stakeholders:
1. Correlation Analysis Between Total Steps/Burned Calories and Sleep Time:
* The analysis revealed that daily activity has little to no correlation with users' sleep duration. This is likely because most activities or steps were of low intensity, which did not exhaust users significantly. Additionally, some users may adhere to a fixed sleeping schedule that is unaffected by their daily intensity levels.
2. Correlation Analysis Between Total Steps and Burned Calories:
* A moderate positive correlation was found between total steps and calories burned, indicating that the number of daily steps taken does impact the total daily calories burned. However, the correlation is not strong, possibly because many steps were of low intensity, such as casual walking, rather than higher intensity activities like inclined walking or running, which burn calories more efficiently.
* For Bellabeat smart device users who aim to lose weight, I suggest that the Bellabeat app include features that track daily activities, such as steps and workouts, allow users to set daily activity or calorie-burning goals, and create reminders to encourage users to achieve their daily goals. On days with high activity levels, the Bellabeat app should also remind users to sleep early to better prepare for the following day.
3. Analysis of Mean Total Intensity by Hour:
* The analysis shows that most activities occur during standard lunch breaks (12 to 2 PM) and after work hours (5 to 7 PM).
* Based on this analysis, I recommend that Bellabeat remind users to take breaks for walking or exercise after long periods of working to refresh the body and mind. Additionally, a short walk after meals is suggested to aid digestion and increase circulation.

## 6. Phase 6: Act
### 6.1 Final Conclusion:
Based on our analysis of smart device usage data, we have identified key insights into the correlation between daily activities and sleep duration, the impact of daily steps on calories burned, and the peak times for user activities. Specifically, we found that:

1. Daily activities, such as total steps and burned calories, show little to no correlation with sleep duration.
2. There is a moderate positive correlation between total steps and calories burned.
3. The peak times for user activities are during lunch breaks (12 to 2 PM) and after work hours (5 to 7 PM).

### 6.2 Applying Insights:
Our team and business can leverage these insights to enhance the Bellabeat app and better support our users' health and wellness goals. Specifically:
1. **Sleep and Activity Correlation**:
    * **App Feature**: Introduce features that allow users to track both their daily activities and sleep patterns, helping them understand their behavior better.
    * **User Education**: Educate users about the importance of sleep hygiene and suggest maintaining a consistent sleep schedule irrespective of daily activity levels.

2. **Daily Steps and Calories Burned**:
    * **Activity Tracking**: Implement a robust activity tracking system that records both the number and intensity of steps.
    * **Goal Setting**: Allow users to set daily activity and calorie-burning goals, and provide real-time feedback and reminders to encourage goal completion.
    * **Weight Management**: For users aiming to lose weight, offer tailored recommendations for incorporating higher intensity activities that burn calories more efficiently.
    
3. **Peak Activity Times**:
    * **Break Reminders**: Send reminders during typical break periods (12 to 2 PM and 5 to 7 PM) to encourage physical activity, whether it’s a short walk or a workout.
    * **Post-Meal Activities**: Promote short walks after meals to aid digestion and improve circulation.
    
### 6.3 Next Steps:
1. **Feature Development**: Begin developing the new app features as outlined above, focusing on activity tracking, goal setting, and reminders.
2. **User Testing**: Conduct user testing to gather feedback on these new features and ensure they are user-friendly and effective.
3. **Marketing**: Create a marketing campaign to highlight these new features and their benefits to users.

### 6.4 Additional Data for Expansion:
1. **Demographic Data**: Collect demographic information (age, gender, occupation) to understand how different user groups interact with the app and their specific needs.
2. **Activity Types**: Gather more detailed data on the types of activities users engage in, such as walking, running, or cycling, to provide more personalized recommendations.
3. **Lifestyle Data**: Include data on users' lifestyle factors, such as diet and stress levels, to offer holistic wellness advice.

By implementing these steps, we can enhance the Bellabeat app, providing more value to our users and promoting healthier lifestyles.



