# **Project Name**    -



##### **Project Type**    - EDA/Dashboard
##### **Contribution**    - Akshay Vadnala


# **Project Summary -**

The Bellabeat Fitness Data Analytics case study explores the growth potential for Bellabeat, a company that specializes in health-focused technology products designed for women. Founded in 2013, Bellabeat has quickly become a leader in the wellness technology market, offering beautifully designed smart devices that provide insights into various aspects of health, including activity levels, sleep patterns, and reproductive health.

As the company continues to expand, the marketing analytics team has been tasked with analyzing consumer data collected from Bellabeat's smart devices. The goal is to uncover insights that can inform marketing strategies and enhance product offerings. The analysis employs tools such as SQL for data cleaning and insights generation, Power BI for creating interactive dashboards, and Python for deeper data analysis and visualization.

Understanding consumer behavior is critical in the wellness technology sector. By examining fitness data, the team aims to identify trends related to device usage, popular features, and correlations between usage and health outcomes. These insights will help tailor marketing strategies that resonate with the target audience, particularly women seeking technology-driven health solutions.

The case study emphasizes the importance of aligning marketing efforts with Bellabeat's mission to empower women. Recommendations may include creating educational content about the benefits of using Bellabeat products, leveraging social media to build a wellness community, and collaborating with health influencers to reach a broader audience.

Ultimately, the Bellabeat Fitness Data Analytics case study serves as a roadmap for how data-driven insights can shape marketing strategies in the wellness technology sector. By focusing on consumer behavior and utilizing advanced analytics tools, Bellabeat can solidify its position as a market leader and continue to empower women through innovative health technology.

# **GitHub Link -**

https://github.com/akshay24032002/StarvaFitness.git

# **Problem Statement**


"Bellabeat, a leading manufacturer of health-focused technology products for women, is facing challenges in effectively utilizing consumer data collected from its smart devices to drive marketing strategies and enhance product offerings. Despite having access to valuable fitness data, the company lacks a comprehensive understanding of user behavior and preferences, which hinders its ability to tailor marketing efforts and optimize customer engagement. As a result, Bellabeat risks missing growth opportunities in a competitive wellness technology market, where understanding consumer needs is crucial for success."

#### **Define Your Business Objective?**

Objective: "Increase sales of the Bellabeat fitness tracker by 20% within the next quarter by implementing targeted social media marketing campaigns and enhancing customer engagement through personalized email marketing."

Breakdown of the Objective:

• Specific: The focus is on increasing sales of a specific product (the fitness tracker).

• Measurable: The goal is quantifiable (20% increase).

• Achievable: The objective considers current market conditions and marketing capabilities.

• Relevant: It aligns with Bellabeat's mission to empower women through health technology.

• Time-bound: A clear deadline is set (within the next quarter).

By defining this objective, Bellabeat can effectively guide its marketing strategies and measure success, ensuring that efforts are aligned with the company's overall mission and growth goals.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()

from google.colab import drive
drive.mount('/content/drive')

dailyActivity = pd.read_csv('/content/drive/MyDrive/FitnessData/dailyActivity_merged.csv')
dailyCalories = pd.read_csv('/content/drive/MyDrive/FitnessData/dailyCalories_merged.csv')
dailyIntensities = pd.read_csv('/content/drive/MyDrive/FitnessData/dailyIntensities_merged.csv')
dailySteps = pd.read_csv('/content/drive/MyDrive/FitnessData/dailySteps_merged.csv')
heartrateSeconds = pd.read_csv('/content/drive/MyDrive/FitnessData/heartrate_seconds_merged.csv')
hourlyCalories = pd.read_csv('/content/drive/MyDrive/FitnessData/hourlyCalories_merged.csv')
hourlyIntensities = pd.read_csv('/content/drive/MyDrive/FitnessData/hourlyIntensities_merged.csv')
hourlySteps = pd.read_csv('/content/drive/MyDrive/FitnessData/hourlySteps_merged.csv')
sleepDay = pd.read_csv('/content/drive/MyDrive/FitnessData/sleepDay_merged.csv')
weightLogInfo = pd.read_csv('/content/drive/MyDrive/FitnessData/weightLogInfo_merged.csv')
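
Since the guidelines ask for exception handling and a notebook that runs in one go, the loading step could be wrapped so that a missing or empty file is reported rather than stopping execution. The following is a minimal sketch, not the notebook's actual loader: `load_tables` and `FILE_MAP` are hypothetical names, and only two of the ten files are listed for brevity.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping of variable name -> file name (only two entries shown).
FILE_MAP = {
    "dailyActivity": "dailyActivity_merged.csv",
    "heartrateSeconds": "heartrate_seconds_merged.csv",
}


def load_tables(data_dir, file_map):
    """Read each CSV in file_map; collect failures instead of stopping at the first one."""
    tables, failed = {}, {}
    for name, filename in file_map.items():
        try:
            tables[name] = pd.read_csv(Path(data_dir) / filename)
        except (FileNotFoundError, pd.errors.EmptyDataError) as exc:
            failed[name] = f"{type(exc).__name__}: {filename}"
    return tables, failed


# Same Drive folder as the cell above; nothing is raised if the folder is absent.
tables, failed = load_tables("/content/drive/MyDrive/FitnessData", FILE_MAP)
for name, reason in failed.items():
    print(f"Could not load {name} ({reason})")
```

Returning the failures alongside the loaded tables lets later cells decide whether to continue with a partial dataset.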

### Dataset First View

In [None]:
# Dataset First Look
columns = {
    'dailyActivity': dailyActivity,
    'dailyCalories': dailyCalories,
    'dailyIntensities': dailyIntensities,
    'dailySteps': dailySteps,
    'heartrateSeconds': heartrateSeconds,
    'hourlyCalories': hourlyCalories,
    'hourlyIntensities': hourlyIntensities,
    'hourlySteps': hourlySteps,
    'sleepDay': sleepDay,
    'weightLogInfo': weightLogInfo
}

for name, df in columns.items():
    print(f"\n📊 --- {name} ---")
    display(df.head())

### Dataset Information

In [None]:
# Dataset Info
for name, df in columns.items():
    print(f"\n📊 --- {name} ---")
    df.info()  # info() prints its report and returns None, so display() is not needed

In [None]:
for name, df in columns.items():
    if 'Id' in df.columns:
        df['Id'] = df['Id'].astype(str)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
for name, df in columns.items():
    print(f"\n--- {name} ---")
    print(f"Duplicate Values: {df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
for name, df in columns.items():
    print(f"\n--- {name} ---")
    print(df.isnull().sum())

### What did you know about your dataset?

The dataset contains daily activity and health tracking information collected from Fitbit devices. It consists of 10 key variables that help understand a user's lifestyle, including physical activity, calorie expenditure, and sleep patterns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
for name, df in columns.items():
    print(f"\n--- {name} ---")
    print(df.columns)

In [None]:
# Dataset Describe
for name, df in columns.items():
    print(f"\n--- {name} ---")
    display(df.describe())

### Variables Description


**Activity Day**

The specific date of the recorded activity (in YYYY-MM-DD format).

**Total Steps**

Total number of steps taken in a day.

**Total Distance**

Total distance (in miles or kilometers) traveled in a day.

**Very Active Minutes**

Total minutes spent doing high-intensity activities.

**Fairly Active Minutes**

Total minutes spent doing moderate-intensity activities.

**Lightly Active Minutes**

Total minutes spent doing light-intensity activities like casual walking.

**Sedentary Minutes**

Total time spent inactive or sedentary during the day.

**Calories Burned**

Total calories burned throughout the day based on activity and basal rate.

**Total Minutes Asleep**

Total duration (in minutes) a person actually slept.

**Total Time in Bed**

The total time (in minutes) the user spent in bed, including awake time.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for name, df in columns.items():
    print(f"\n--- {name} ---")
    print(df.nunique())  # number of distinct values in each column

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
weightLogInfo = weightLogInfo.drop('Fat', axis=1)
sleepDay.drop_duplicates(inplace=True)
dailyActivity.rename(columns={'ActivityDate': 'ActivityDay'}, inplace=True)

In [None]:
# Convert the date columns in each dataset to datetime dtype

dailyActivity['ActivityDay']=pd.to_datetime(dailyActivity['ActivityDay'])
dailyCalories['ActivityDay']=pd.to_datetime(dailyCalories['ActivityDay'])
dailyIntensities['ActivityDay']=pd.to_datetime(dailyIntensities['ActivityDay'])
dailySteps['ActivityDay']=pd.to_datetime(dailySteps['ActivityDay'])
heartrateSeconds['Time']=pd.to_datetime(heartrateSeconds['Time'])
sleepDay['SleepDay']=pd.to_datetime(sleepDay['SleepDay'])
hourlyIntensities['ActivityHour']=pd.to_datetime(hourlyIntensities['ActivityHour'])
hourlySteps['ActivityHour']=pd.to_datetime(hourlySteps['ActivityHour'])
hourlyCalories['ActivityHour']=pd.to_datetime(hourlyCalories['ActivityHour'])
weightLogInfo['Date']=pd.to_datetime(weightLogInfo['Date'])

# Extract the hour of day (0-23) from each timestamp column
# (the columns are already datetime after the conversion above)

heartrateSeconds['Time_Period'] = heartrateSeconds['Time'].dt.hour
hourlyIntensities['Time_Period'] = hourlyIntensities['ActivityHour'].dt.hour
hourlySteps['Time_Period'] = hourlySteps['ActivityHour'].dt.hour
hourlyCalories['Time_Period'] = hourlyCalories['ActivityHour'].dt.hour
# Note: if SleepDay holds dates only (midnight timestamps), this hour is always 0
sleepDay['Time_Period'] = sleepDay['SleepDay'].dt.hour
weightLogInfo['Time_Period'] = weightLogInfo['Date'].dt.hour


# Feature engineering: derive new columns to get insights from the datasets
# Add a time-of-day block: "Morning" / "Afternoon" / "Evening" / "Night"

def get_time_block(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

heartrateSeconds['TimeBlock'] = heartrateSeconds['Time_Period'].apply(get_time_block)
hourlyIntensities['TimeBlock'] = hourlyIntensities['Time_Period'].apply(get_time_block)
hourlySteps['TimeBlock'] = hourlySteps['Time_Period'].apply(get_time_block)
hourlyCalories['TimeBlock'] = hourlyCalories['Time_Period'].apply(get_time_block)
sleepDay['TimeBlock'] = sleepDay['Time_Period'].apply(get_time_block)
weightLogInfo['TimeBlock'] = weightLogInfo['Time_Period'].apply(get_time_block)


# Convert BMI into a category: "underweight" / "normal weight" / "over weight" / "obese"


def categorize_bmi(bmi):
    # Keep missing BMI values missing instead of letting them fall through to 'obese'
    if pd.isna(bmi):
        return np.nan
    if bmi < 18.5:
        return 'underweight'
    elif 18.5 <= bmi < 25:
        return 'normal weight'
    elif 25 <= bmi < 30:
        return 'over weight'
    else:
        return 'obese'

weightLogInfo['BMI_Category'] = weightLogInfo['BMI'].apply(categorize_bmi)


# Adding Total Active Minutes using all active minutes columns


dailyActivity['TotalActiveMinutes'] = (dailyActivity['VeryActiveMinutes'] +
                                        dailyActivity['FairlyActiveMinutes'] +
                                        dailyActivity['LightlyActiveMinutes'])


# Add a step category for each day: "Low" / "Moderate" / "Active" / "Very Active"


dailyActivity['steps_category'] = pd.cut(dailyActivity['TotalSteps'],
                              bins=[0, 5000, 10000, 15000, float('inf')],
                              labels=['Low', 'Moderate', 'Active', 'Very Active'],
                              include_lowest=True)  # keep 0-step days in 'Low' instead of NaN


# Calories per step efficiency
# (0-step days become NaN rather than inf from dividing by zero)

dailyActivity['calories_per_step'] = dailyActivity['Calories'] / dailyActivity['TotalSteps'].replace(0, np.nan)


# Sleep efficiency

sleepDay['Sleep_efficiency'] = (sleepDay['TotalMinutesAsleep'] / sleepDay['TotalTimeInBed']) * 100


# Sleep quality categories

sleepDay['sleep_quality'] = pd.cut(sleepDay['Sleep_efficiency'],bins=[0, 80, 90, 200],labels=['Poor', 'Good', 'Excellent'],include_lowest=True)

# Minutes in bed but not asleep (a rough proxy for time to fall asleep; it also counts awakenings)

sleepDay['time_to_sleep'] = sleepDay['TotalTimeInBed'] - sleepDay['TotalMinutesAsleep']

# Heart rate zones
heartrateSeconds['hr_zone'] = pd.cut(heartrateSeconds['Value'],bins=[0, 100, 140, 170, float('inf')],labels=['Resting', 'Light', 'Moderate', 'Vigorous'])


### What all manipulations have you done and insights you found?

1. Converted all date-related columns to `datetime` format for consistent time-based operations.
2. Extracted hour from time columns to analyze activity patterns over different parts of the day.
3. Created a `TimeBlock` column to classify time into Morning, Afternoon, Evening, and Night.
4. Dropped unnecessary or mostly null columns such as 'Fat' to clean the dataset.
5. Checked for duplicate and null values and removed/handled them.
6. Merged and aligned dataframes for accurate multi-source analysis.
7. Explored trends in sleep, steps, calories burned, and heart rate for different time segments.
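
Item 6 mentions merging, which the wrangling code does not show explicitly. One way it could look is sketched below, assuming daily activity and sleep are joined on `Id` and the calendar date. The toy frames reuse the column names from the wrangling step with made-up values, and `merge_activity_sleep` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for dailyActivity and sleepDay (values are illustrative only).
daily_activity = pd.DataFrame({
    "Id": ["1", "1", "2"],
    "ActivityDay": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [13162, 10735, 4500],
})
sleep_day = pd.DataFrame({
    "Id": ["1", "2"],
    "SleepDay": pd.to_datetime(["2016-04-12", "2016-04-12"]),
    "TotalMinutesAsleep": [327, 410],
})


def merge_activity_sleep(activity, sleep):
    """Left-join sleep onto activity so days without a sleep record are kept (as NaN)."""
    sleep = sleep.rename(columns={"SleepDay": "ActivityDay"})
    # validate raises MergeError if an (Id, ActivityDay) pair repeats on either side
    return activity.merge(sleep, on=["Id", "ActivityDay"], how="left", validate="one_to_one")


merged = merge_activity_sleep(daily_activity, sleep_day)
print(merged)
```

A left join keeps every activity day, which matters because fewer users log sleep than log steps; an inner join would silently drop those days.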

**Insights:**
- Most users are less active during Night and more active during Morning and Afternoon.
- Average step count and calorie burn peak in the Morning and Evening.
- Sleep durations cluster around 6.5 to 7.5 hours for most users.
- Heart rate shows typical ranges between 60 and 100 bpm, with elevated rates during active times.

These insights help Bellabeat understand user patterns and tailor recommendations or marketing based on actual usage behavior.
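
The time-of-day claims above can be checked by averaging a metric per `TimeBlock`. The sketch below repeats the hour boundaries from the wrangling code and runs on made-up rows; `mean_by_time_block` is a hypothetical helper, and the toy numbers are not results from the dataset.

```python
import pandas as pd


def get_time_block(hour):
    """Same hour boundaries as the wrangling code."""
    if 5 <= hour < 12:
        return "Morning"
    elif 12 <= hour < 17:
        return "Afternoon"
    elif 17 <= hour < 21:
        return "Evening"
    return "Night"


def mean_by_time_block(df, time_col, value_col):
    """Average value_col within each time block, highest first."""
    blocks = df[time_col].dt.hour.apply(get_time_block).rename("TimeBlock")
    return df.groupby(blocks)[value_col].mean().sort_values(ascending=False)


# Toy hourly rows (illustrative values only)
toy = pd.DataFrame({
    "ActivityHour": pd.to_datetime([
        "2016-04-12 02:00", "2016-04-12 08:00", "2016-04-12 09:00",
        "2016-04-12 14:00", "2016-04-12 18:00", "2016-04-12 23:00",
    ]),
    "StepTotal": [0, 600, 400, 300, 700, 100],
})
block_means = mean_by_time_block(toy, "ActivityHour", "StepTotal")
print(block_means)
```

Applied to `hourlySteps` or `hourlyCalories`, the same call would show which blocks actually carry the highest averages.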

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
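# 1. Distribution of Daily Steps

plt.figure(figsize=(10, 6))
plt.hist(dailySteps['StepTotal'], bins=30, color='lightcoral', edgecolor='black')
# Mark the 10,000-step benchmark used in the chart discussion
plt.axvline(10000, color='navy', linestyle='--', label='10,000-step goal')
plt.title('1. Distribution of Daily Steps')
plt.xlabel('Total Steps')
plt.ylabel('Frequency')
plt.legend()
plt.grid(axis='y', alpha=0.75)
plt.show()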


##### 1. Why did you pick the specific chart?

A histogram is used to show the distribution of daily steps across all users in the dataset. It helps visualize the frequency of different step counts, providing insights into common step habits. Adding a vertical line for the 10,000 steps goal allows for easy comparison of user activity against a common fitness benchmark.

##### 2. What is/are the insight(s) found from the chart?

1. The distribution is right-skewed, indicating that a significant number of users have lower daily step counts.
2. There is a notable peak or concentration of users who walk fewer than 8,000 steps per day.
3. A substantial portion of users do not meet the commonly recommended goal of 10,000 steps per day. There's a drop in frequency around the 10,000 steps mark, with fewer users exceeding it compared to those below it.
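
The share of days below the 10,000-step benchmark can be stated as a number rather than read off the histogram. A minimal sketch, with `share_below_goal` as a hypothetical helper and made-up step counts:

```python
import pandas as pd


def share_below_goal(steps, goal=10_000):
    """Fraction of recorded days with fewer than `goal` steps (missing values ignored)."""
    steps = pd.Series(steps, dtype="float64").dropna()
    return float((steps < goal).mean())


# Toy daily totals (illustrative values only)
toy_steps = [3000, 7500, 9999, 10000, 12500]
share = share_below_goal(toy_steps)
print(f"{share:.0%} of days fall below the goal")
```

On the real data the call would be `share_below_goal(dailySteps['StepTotal'])`. Note that each row is a user-day, so the result describes days, not distinct users.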

##### 3. Will the gained insights help creating a positive business impact?
Yes, absolutely. This insight is crucial for Bellabeat:
1. **Targeted Engagement:** Since many users are below the 10,000-step goal, Bellabeat can create targeted in-app challenges, notifications, and encouraging messages to motivate users to increase their steps.
2. **Feature Development:** It highlights the need for features that encourage more walking, such as step-based goals, progress tracking towards 10,000 steps, friendly competitions among users, or integration with walking routes.
3. **Marketing & Education:** Marketing materials can emphasize how Bellabeat devices can help users gradually increase their daily activity and reach health goals like 10,000 steps, educating them on the benefits of doing so.

Are there any insights that lead to negative growth? Justify with specific reason.

If a large proportion of users consistently fail to reach commonly accepted activity goals like 10,000 steps *while using the device*, it could indicate that:
1. The device/app is not effectively motivating behavioral change.
2. Users might become demotivated if they see their consistently low step counts without improvement.
3. Users might question the value of the device if it's not helping them become more active.

This could lead to user dissatisfaction and potential churn if the device is perceived as not being helpful in achieving fitness goals. Bellabeat needs to ensure its features are effective in encouraging users to move more based on these insights.

#### Chart - 2

In [None]:
# Chart - 2. Distribution of Total Minutes Asleep
plt.figure(figsize=(10, 6))
plt.hist(sleepDay['TotalMinutesAsleep'], bins=20, color='skyblue', edgecolor='black')
plt.title('2. Distribution of Total Minutes Asleep')
plt.xlabel('Total Minutes Asleep')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of a single numerical variable. It shows the frequency of different sleep durations, allowing us to see the common sleep patterns among users.


##### 2. What is/are the insight(s) found from the chart?

The histogram shows that the majority of users report sleeping between approximately 300 and 500 minutes per night (5 to 8.3 hours). There's a peak around the 400-450 minute (6.7 - 7.5 hours) range, suggesting a common sleep duration. There are fewer instances of very short or very long sleep durations.
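
The 300 to 500 minute range quoted above can be checked directly. The sketch below computes the share of nights inside that range and the median in hours; `sleep_range_summary` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def sleep_range_summary(minutes, low=300, high=500):
    """Share of nights with low <= minutes <= high, plus the median night in hours."""
    s = pd.Series(minutes, dtype="float64").dropna()
    return {
        "share_in_range": float(s.between(low, high).mean()),
        "median_hours": float(s.median() / 60),
    }


# Toy nightly totals in minutes (illustrative values only)
toy_sleep = [250, 320, 410, 430, 445, 520]
summary = sleep_range_summary(toy_sleep)
print(summary)
```

On the real data the input would be `sleepDay['TotalMinutesAsleep']`.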


##### 3. Will the gained insights help creating a positive business impact?


Yes, understanding the typical sleep duration of users is valuable. Bellabeat can use this insight to:
1. Tailor marketing messages about the importance of getting enough sleep, highlighting how their devices can track and improve sleep within the observed range.
2. Develop features or content within the app that cater to this typical sleep duration, offering tips or goals related to achieving 7-8 hours of sleep.
3. Identify users with significantly low or high sleep durations and potentially offer targeted advice or support through the app.


Are there any insights that lead to negative growth? Justify with specific reason.

While this specific chart doesn't directly indicate negative growth, it reveals that some users are reporting very short or very long sleep durations. If a significant portion of users consistently have poor sleep (very short duration), it could imply that either:
1. The device isn't effectively helping them improve their sleep, potentially leading to dissatisfaction.
2. These users might have underlying health issues that the app/device doesn't adequately address, limiting the perceived value.

Bellabeat should investigate these outlier groups to understand why they are reporting these sleep patterns and whether the device is meeting their needs. If not, it could lead to user churn.

#### Chart - 3

In [None]:
# Chart - 3 Distribution of Heart Rates
plt.figure(figsize=(12, 6))
sns.histplot(data=heartrateSeconds, x='Value', kde=True, bins=50)
plt.title('Distribution of Heart Rates')
plt.xlabel('Heart Rate (bpm)')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) is used to visualize the distribution of heart rate readings. This allows us to see the frequency of different heart rate values and understand the overall pattern of heart rates recorded by the devices. The KDE provides a smooth estimate of the probability density function.


##### 2. What is/are the insight(s) found from the chart?

The histogram shows a clear distribution of heart rate values.
Insights:
1. **Peak Frequency:** The distribution has a prominent peak, indicating a range of heart rates that are most frequently recorded. This peak likely corresponds to resting or light activity heart rates.
2. **Spread of Data:** The data is spread across a range, showing lower heart rates (likely during sleep or deep rest) and higher heart rates (during more intense activity).
3. **Potential Multiple Modes:** There might be subtle peaks or a wider spread indicating different modes of activity (e.g., a cluster around resting HR and another around moderate HR).
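
The hedged readings above ("likely", "might") can be made concrete by counting how many readings fall into each heart rate zone. This sketch reuses the bin edges from the `hr_zone` feature in the wrangling code; `zone_shares` is a hypothetical helper and the readings are made up.

```python
import pandas as pd

# Same bin edges and labels as the hr_zone feature in the wrangling step
ZONE_BINS = [0, 100, 140, 170, float("inf")]
ZONE_LABELS = ["Resting", "Light", "Moderate", "Vigorous"]


def zone_shares(values):
    """Fraction of readings in each zone, ordered from Resting to Vigorous."""
    zones = pd.cut(pd.Series(values), bins=ZONE_BINS, labels=ZONE_LABELS)
    return zones.value_counts(normalize=True).sort_index()


# Toy heart rate readings in bpm (illustrative values only)
toy_hr = [55, 62, 70, 88, 105, 120, 150, 175]
shares = zone_shares(toy_hr)
print(shares)
```

On the real data, `heartrateSeconds['hr_zone'].value_counts(normalize=True)` would give the same breakdown without recomputing the bins.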


##### 3. Will the gained insights help creating a positive business impact?


Yes, understanding the distribution of heart rates is valuable for several reasons:
1. **Feature Development:** It helps identify the typical range of heart rates users experience, which is crucial for developing features like heart rate zones, workout intensity tracking, and recovery monitoring.
2. **Algorithm Refinement:** Knowing the distribution helps in refining algorithms for calorie estimation, sleep stage detection (often based on HR variability), and activity classification.
3. **User Education:** Bellabeat can use this information to educate users about what constitutes a healthy resting heart rate, target heart rate zones for exercise, and the importance of heart rate variability for overall health, enhancing the value proposition of the device.
4. **Identifying Anomalies:** Understanding the typical distribution helps in identifying unusually low or high heart rates that might warrant alerting the user to consult a doctor, adding a health monitoring value.

Are there any insights that lead to negative growth? Justify with specific reason.
The distribution itself does not inherently lead to negative growth. However, issues related to the heart rate data collection or interpretation could cause negative growth:
1. **Data Quality Issues:** If the histogram showed an abnormal distribution (e.g., a large number of zero values or extremely high/low values clustered incorrectly), it would indicate poor sensor data quality, leading to user dissatisfaction and lack of trust in the device.
2. **Misinterpretation of Data:** If the app's interpretation of the heart rate data (e.g., incorrect heart rate zones, inaccurate workout summaries) is flawed, users might feel the device is not providing meaningful insights, leading to disengagement and potential churn.
3. **Privacy Concerns:** While not directly from the chart's data itself, handling sensitive heart rate data raises privacy concerns. Any perceived mishandling or security breach of this personal data could severely damage trust and lead to negative growth.

#### Chart - 4

In [None]:
# Chart - 4 Distribution of Daily Calories Burned
plt.figure(figsize=(10, 6))
sns.histplot(data=dailyCalories, x='Calories', bins=30, kde=True, color='orange')
plt.title('Distribution of Daily Calories Burned')
plt.xlabel('Calories Burned')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) is used to visualize the distribution of daily calories burned by users. It helps understand the frequency of different calorie expenditure levels, providing insights into the typical daily energy burn of the user base. The KDE provides a smoothed representation of the distribution.

##### 2. What is/are the insight(s) found from the chart?

1. **Peak Distribution:** The histogram shows a peak in the distribution, indicating that a large number of users burn a similar amount of calories daily.
2. **Range of Calories:** It illustrates the range of daily calories burned across the user base, from lower values (likely less active days) to higher values (more active days).
3. **Potential Skewness:** The distribution might show some skewness, which could indicate that more users fall into lower calorie-burning categories than higher ones, aligning with the step count distribution seen earlier.
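
Whether the distribution is actually skewed can be measured rather than guessed. A minimal sketch using the sample skewness from pandas and a mean-versus-median comparison; `skew_summary` is a hypothetical helper and the calorie values are made up.

```python
import pandas as pd


def skew_summary(values):
    """Mean, median and sample skewness; skew > 0 suggests a longer right tail."""
    s = pd.Series(values, dtype="float64").dropna()
    return {"mean": float(s.mean()), "median": float(s.median()), "skew": float(s.skew())}


# Toy daily calorie totals (illustrative values only)
toy_calories = [1800, 1900, 2000, 2100, 4200]
stats = skew_summary(toy_calories)
print(stats)
```

On the real data the input would be `dailyCalories['Calories']`. A mean well above the median points the same way as a positive skew value.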


##### 3. Will the gained insights help creating a positive business impact?
Yes, this insight is valuable for Bellabeat:
1. **Personalized Goal Setting:** Understanding the typical range of calories burned helps in setting realistic and personalized daily calorie goals within the app.
2. **Feature Development:** This data is essential for developing features related to weight management, fitness goal tracking (e.g., "burn X calories today"), and integration with nutrition tracking.
3. **Marketing Messages:** Bellabeat can use this information to create marketing messages that highlight how their devices help users track and manage their calorie expenditure for fitness or weight goals.
4. **Identifying Activity Levels:** The distribution can help categorize users based on their activity-related calorie burn, allowing for more targeted engagement strategies.

Are there any insights that lead to negative growth? Justify with specific reason.
Similar to other distribution charts, this chart itself doesn't directly cause negative growth. However, issues related to calorie tracking could lead to negative growth:
1. **Accuracy Concerns:** If users perceive the calorie burn tracking to be inaccurate (e.g., significant discrepancies with other devices or perceived effort), they may lose trust in the Bellabeat device and its data, leading to dissatisfaction and churn.
2. **Lack of Actionable Insights:** If the app just shows a number for calories burned without providing actionable insights, tips, or context (e.g., how it relates to their activity or diet), users might find the feature less valuable and disengage.
3. **Comparison Issues:** If users compare their Bellabeat calorie data with friends using different devices and find large, unexplained differences, it could lead to doubts about the accuracy and potentially negative word-of-mouth.
Bellabeat must ensure the accuracy of their calorie tracking algorithm and provide context and actionable insights based on this data to maintain user trust and engagement.

#### Chart - 5

In [None]:

# 5. Distribution of Sedentary Minutes Per Day
plt.figure(figsize=(10, 6))
sns.histplot(dailyActivity['SedentaryMinutes'], bins=50, kde=True, color='purple')
plt.title('5. Distribution of Sedentary Minutes Per Day')
plt.xlabel('Sedentary Minutes')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) is used to visualize the distribution of daily sedentary minutes. It shows how often different amounts of inactive time occur, providing insight into how much of a typical day users spend sedentary. The KDE provides a smoothed representation of the distribution.

##### 2. What is/are the insight(s) found from the chart?

1. **Highly Skewed Distribution:** The distribution is extremely right-skewed, with a very high peak near 0 sedentary minutes. This is likely due to the way "sedentary minutes" is recorded – it might represent periods of *complete* inactivity recorded by the device sensors, and many users might not have prolonged periods of absolute stillness throughout the day *while wearing the device*.
2. **Concentration near Zero:** A large number of data points are clustered very close to 0 sedentary minutes.
3. **Long Tail:** There's a long tail extending to very high sedentary minute values, indicating that some users record significant periods of inactivity. However, the frequency drops sharply after the initial peak.
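
The readings above (a cluster near zero, a long tail) are worth confirming numerically before acting on them. The sketch below reports the share of days under one hour of sedentary time, the share logged as a full 1,440 minutes, and the median in hours. `sedentary_checks` is a hypothetical helper, the values are made up, and treating a full-day value as a possible non-wear day is an assumption, not something the dataset states.

```python
import pandas as pd


def sedentary_checks(minutes, near_zero=60, full_day=1440):
    """Summarize how sedentary minutes are spread across recorded days."""
    s = pd.Series(minutes, dtype="float64").dropna()
    return {
        "share_near_zero": float((s < near_zero).mean()),   # days under one hour
        "share_full_day": float((s >= full_day).mean()),    # assumed possible non-wear days
        "median_hours": float(s.median() / 60),
    }


# Toy daily sedentary minutes (illustrative values only)
toy_sedentary = [0, 30, 600, 720, 900, 1440]
checks = sedentary_checks(toy_sedentary)
print(checks)
```

On the real data the input would be `dailyActivity['SedentaryMinutes']`. If the near-zero share turns out to be small, the skewness description should be revisited.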


##### 3. Will the gained insights help creating a positive business impact?
Yes, understanding sedentary time is crucial for health and Bellabeat can leverage this:
1. **Highlighting Inactivity:** While the distribution is skewed, the long tail indicates that some users *are* recording significant sedentary time. Bellabeat can use this data to highlight the importance of reducing prolonged sitting and encourage users to take breaks.
2. **Feature Development:** This insight can drive the development of features like "move reminders," alerts after periods of inactivity, or challenges focused on reducing sedentary time.
3. **Educational Content:** Bellabeat can create content educating users about the health risks associated with prolonged sitting and how even short periods of light activity can make a difference.
4. **Identifying User Segments:** Users with very high sedentary minutes represent a segment that could benefit significantly from specific interventions or feature suggestions.

Are there any insights that lead to negative growth? Justify with specific reason.
The highly skewed distribution with a large cluster at or near zero might indicate potential issues:
1. **Data Accuracy/Interpretation:** If the device or app is not accurately capturing or defining "sedentary minutes," users might see values near zero even if they have periods of inactivity. This could lead to a false sense of being active or lack of trust in the data if it doesn't match their perceived activity level. If the data is perceived as inaccurate, users might question the value of the device.
2. **Lack of Actionability for Most Users:** If most users consistently see very low sedentary minutes recorded by the device, they might not find the sedentary minutes feature useful or motivating, as there's no perceived need to reduce it. This could lead to disengagement with this specific feature.

Bellabeat needs to ensure the "Sedentary Minutes" metric is accurately captured and interpreted in a way that is meaningful and actionable for users, especially those who do have high sedentary periods or need reminders to move.

#### Chart - 6

In [None]:

# 6. Distribution of VeryActive Minutes Per Day
plt.figure(figsize=(10, 6))
sns.histplot(dailyActivity['VeryActiveMinutes'], bins=50, kde=True, color='lightblue')
plt.title('6. Distribution of VeryActive Minutes Per Day')
plt.xlabel('Very Active Minutes')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) is used to visualize the distribution of daily very active minutes. It shows how often different amounts of high-intensity activity occur, providing insight into how much vigorous exercise users typically record in a day. The KDE provides a smoothed representation of the distribution.

##### 2. What is/are the insight(s) found from the chart?

1. **Highly Skewed Distribution Towards Zero:** The most prominent feature is a very high frequency of days with zero or very few very active minutes. This indicates that a significant portion of users record little to no high-intensity activity on many days.
2. **Sharp Decline:** The frequency drops off very rapidly as the number of very active minutes increases.
3. **Few High-Intensity Users:** There's a long but very low tail, suggesting that only a small number of users consistently engage in substantial amounts of very active time (e.g., more than 30-60 minutes).
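The skew described above can be backed with a few summary numbers instead of being read off the histogram alone. This is a minimal sketch, assuming a numeric series such as `dailyActivity['VeryActiveMinutes']`; the helper name `summarize_zero_inflation` and the toy values are hypothetical.

```python
import pandas as pd


def summarize_zero_inflation(minutes: pd.Series) -> dict:
    """Summarize how strongly a minutes series piles up at zero."""
    return {
        'share_zero': float((minutes == 0).mean()),  # fraction of days with no very active time
        'median': float(minutes.median()),
        'mean': float(minutes.mean()),
        'skewness': float(minutes.skew()),           # positive values mean a long right tail
    }


# Hypothetical toy values, not taken from the real dataset
toy_minutes = pd.Series([0, 0, 0, 0, 0, 0, 5, 10, 30, 120])
summary = summarize_zero_inflation(toy_minutes)
print(summary)
```

A mean well above the median, together with a large share of zeros, is the numeric signature of the shape seen in the chart.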


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, understanding the distribution of very active minutes is crucial for targeted marketing and feature development:
1. **Encouraging High-Intensity Activity:** Given the low engagement with very active minutes for many users, Bellabeat can create features, challenges, or content specifically aimed at encouraging users to incorporate more vigorous exercise into their routines.
2. **Targeted Marketing:** Marketing can highlight the benefits of high-intensity workouts for women's health and show how Bellabeat devices can track and motivate users to achieve these levels.
3. **Personalized Recommendations:** The app can identify users with low very active minutes and provide personalized recommendations or workout suggestions to gradually increase their intensity.

Are there any insights that lead to negative growth? Justify with specific reason.
The insight that most users are not engaging in high levels of activity could lead to negative growth if:
1. **Users Get Demotivated:** If users consistently see low numbers for "Very Active Minutes" despite feeling like they are exercising, they might become demotivated or doubt the device's ability to accurately track their effort.
2. **Value Proposition Disconnect:** If Bellabeat heavily markets the ability to track intense workouts, but the majority of users aren't doing them (or the device isn't showing they are), there's a disconnect in the value proposition for many users.
3. **Comparison Issues:** Users might compare their low "Very Active Minutes" with fitness benchmarks or friends and feel like they are failing, potentially leading them to stop using the device.

Bellabeat needs to address this by ensuring accurate tracking of varying intensity levels and providing motivational tools and achievable goals that guide users towards increasing their very active minutes, rather than just highlighting low numbers without support.

#### Chart - 7

In [None]:
# 7. Distribution of Total Distance Per Day
plt.figure(figsize=(10, 6))
sns.histplot(dailyActivity['TotalDistance'], bins=30, kde=True, color='green')
plt.title('7. Distribution of Total Distance Per Day')
plt.xlabel('Total Distance (miles)')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) is used to visualize the distribution of total distance traveled per day. This helps understand the typical distances users cover, which is directly related to their overall mobility and activity levels.

##### 2. What is/are the insight(s) found from the chart?

1. **Concentration at Lower Distances:** Similar to step counts, the distribution shows a significant peak at lower distances, indicating that many users travel relatively short distances daily.
2. **Relationship with Steps:** The shape of this distribution is expected to be similar to the distribution of total steps, as distance is often directly calculated or estimated from steps.
3. **Presence of Higher Distances:** There is a tail extending to higher distances, representing days when users were more active and covered more ground.
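The expected link between steps and distance can be checked directly. The sketch below assumes the `TotalSteps` and `TotalDistance` columns of `dailyActivity`; the helper name `implied_distance_per_step` and the toy rows are hypothetical. A nearly constant ratio would be consistent with distance being estimated from steps, though it does not prove how the device computes it.

```python
import pandas as pd


def implied_distance_per_step(df: pd.DataFrame) -> pd.Series:
    """Distance covered per step on days with at least one step."""
    moving = df[df['TotalSteps'] > 0]
    return moving['TotalDistance'] / moving['TotalSteps']


# Hypothetical toy rows, not taken from the real dataset
toy_distance = pd.DataFrame({
    'TotalSteps': [0, 5000, 10000, 12000],
    'TotalDistance': [0.0, 3.5, 7.0, 9.0],
})
ratio = implied_distance_per_step(toy_distance)
step_distance_corr = toy_distance['TotalSteps'].corr(toy_distance['TotalDistance'])
print(ratio.describe())
print(step_distance_corr)
```

Days whose ratio sits far from the rest are natural candidates for a closer look at tracking accuracy.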

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, understanding the distribution of distance is beneficial:
1. **Activity Benchmarking:** It provides a baseline for the typical distances covered by users, helping Bellabeat set realistic distance goals or challenges.
2. **Feature Relevance:** The distance metric is relevant for users interested in running, walking, or cycling. Highlighting distance tracking can appeal to these segments.
3. **Progress Tracking:** Users can be motivated by seeing their daily distance and working towards increasing it over time. Bellabeat can visually represent this progress.
4. **Integration with Mapping:** Distance data is essential for features that involve mapping workouts or visualizing routes taken by users.

Are there any insights that lead to negative growth? Justify with specific reason.
The distribution itself doesn't cause negative growth, but issues with the distance tracking can:
1. **Inaccurate Tracking:** If the reported distance is significantly inaccurate (e.g., due to GPS issues or incorrect step-to-distance conversion), users will lose trust in the device's data. Inaccuracy can lead to frustration and users abandoning the device.
2. **Lack of Context:** Just showing a distance number might not be enough. If the app doesn't provide context (e.g., pace, relationship to steps, comparison to past performance), the data might not be perceived as valuable, leading to disengagement.
3. **Comparison with Other Devices:** Users might compare the distance reported by their Bellabeat device with other trackers or phone apps. Significant discrepancies can erode confidence in the Bellabeat device's reliability.

Bellabeat must prioritize the accuracy of distance tracking and provide features that make this data meaningful and actionable for users to maintain satisfaction and prevent negative growth.

#### Chart - 8

In [None]:
# Chart - 8: Average daily calories burned by step category (categorical comparison)

# Merge daily activity and calories data
merged_daily = pd.merge(dailyActivity, dailyCalories, on=['Id', 'ActivityDay'])

# Calculate average calories burned for each activity level
avg_calories_by_activity = merged_daily.groupby('steps_category')['Calories_x'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(data=avg_calories_by_activity, x='steps_category', y='Calories_x', palette='viridis', order=['Low', 'Moderate', 'Active', 'Very Active'])
plt.title('8. Average Daily Calories Burned by Step Category')
plt.xlabel('Step Category')
plt.ylabel('Average Daily Calories Burned')
plt.grid(axis='y', alpha=0.75)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to compare the average daily calories burned across different step categories ('Low', 'Moderate', 'Active', 'Very Active'). It clearly shows how average calorie expenditure varies with increasing activity levels based on step counts.


##### 2. What is/are the insight(s) found from the chart?

1. There is a clear positive correlation between step category and average calories burned. Users in higher step categories ('Active', 'Very Active') burn significantly more calories on average than those in lower categories ('Low', 'Moderate').
2. The 'Very Active' category shows the highest average calorie burn, as expected.
3. The difference in average calories between adjacent categories (e.g., Low vs. Moderate, Moderate vs. Active) demonstrates the impact of increasing daily steps on energy expenditure.
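The `steps_category` column used in this chart is created elsewhere in the notebook. As a rough illustration of how such a column can be built and averaged, the sketch below uses `pd.cut`; the cut-off values in `STEP_BINS` are hypothetical and are not the notebook's actual thresholds.

```python
import pandas as pd

# Hypothetical cut-offs for illustration only; the notebook's real thresholds may differ
STEP_BINS = [0, 5000, 7500, 10000, float('inf')]
STEP_LABELS = ['Low', 'Moderate', 'Active', 'Very Active']


def add_steps_category(df: pd.DataFrame) -> pd.DataFrame:
    """Label each day with a step category using left-closed bins."""
    out = df.copy()
    out['steps_category'] = pd.cut(out['TotalSteps'], bins=STEP_BINS,
                                   labels=STEP_LABELS, right=False)
    return out


# Hypothetical toy rows, not taken from the real dataset
toy_steps = pd.DataFrame({
    'TotalSteps': [1200, 6000, 8000, 15000],
    'Calories': [1800, 2100, 2400, 3000],
})
categorised = add_steps_category(toy_steps)
avg_calories = categorised.groupby('steps_category', observed=False)['Calories'].mean()
print(avg_calories)
```

Because the categories are ordered, the grouped averages come out in the same Low to Very Active order used on the chart's x-axis.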


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, this insight is highly valuable for Bellabeat:
1. **Encouraging Higher Activity:** It visually reinforces the link between steps and calorie burn, which can be used to motivate users to increase their daily step counts to achieve fitness goals like weight management.
2. **Goal Setting:** Bellabeat can use these averages as benchmarks to help users set realistic calorie burn goals based on their activity level or encourage them to move to a higher step category.
3. **Marketing and Education:** This data can be used in marketing campaigns and in-app content to educate users about the calorie-burning benefits of increased physical activity, demonstrating the value of the device.
4. **Feature Personalization:** The app can provide personalized insights or challenges based on a user's current step category and their potential to increase calorie burn by moving to a higher one.

Are there any insights that lead to negative growth? Justify with specific reason.
This chart itself shows a positive relationship consistent with expected physiological outcomes, so it doesn't inherently lead to negative growth. However, potential issues could arise if:
1. **Inaccurate Data:** If the calorie calculations are perceived as inaccurate by users (e.g., they don't match their perceived effort or results), they may lose faith in the device's data, leading to dissatisfaction and potential churn. For example, if a user in the 'Very Active' category sees an unexpectedly low calorie count, they might doubt the device's reliability.
2. **Misinterpretation by Users:** Users might focus solely on calorie burn without understanding the context of total daily expenditure, BMR, or the importance of diet. The app needs to provide proper context to prevent frustration if calorie burn alone isn't leading to desired results.

Bellabeat must ensure the accuracy and transparency of its calorie calculation algorithms and provide comprehensive tools and education to support users' fitness journeys.

#### Chart - 9

In [None]:

# 9. Very Active Distance vs. Total Distance
plt.figure(figsize=(10, 6))
sns.scatterplot(data=dailyActivity, x='VeryActiveDistance', y='TotalDistance', alpha=0.5)
plt.title('9. Very Active Distance vs. Total Distance')
plt.xlabel('Very Active Distance')
plt.ylabel('Total Distance')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between 'VeryActiveDistance' and 'TotalDistance'. This helps determine if there is a correlation between the distance covered during high-intensity activities and the overall daily distance.


##### 2. What is/are the insight(s) found from the chart?

1. There is a strong positive correlation between Very Active Distance and Total Distance. As Very Active Distance increases, Total Distance generally increases as well.
2. Many data points cluster along a line, suggesting that for users who engage in very active movement, a significant portion of their total daily distance comes from these high-intensity activities.
3. There are many data points where Very Active Distance is zero but Total Distance is greater than zero, indicating that users cover distance through other activity levels (light, fairly active).
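These observations can be quantified with a correlation coefficient and two simple ratios. This is a minimal sketch, assuming the `VeryActiveDistance` and `TotalDistance` columns of `dailyActivity`; the helper name `very_active_share` and the toy rows are hypothetical.

```python
import pandas as pd


def very_active_share(df: pd.DataFrame) -> pd.DataFrame:
    """Fraction of each day's distance that was covered at very active intensity."""
    out = df[df['TotalDistance'] > 0].copy()
    out['VeryActiveShare'] = out['VeryActiveDistance'] / out['TotalDistance']
    return out


# Hypothetical toy rows, not taken from the real dataset
toy_intensity = pd.DataFrame({
    'VeryActiveDistance': [0.0, 0.0, 1.0, 4.0],
    'TotalDistance': [2.0, 3.0, 5.0, 8.0],
})
shares = very_active_share(toy_intensity)
pearson_r = toy_intensity['VeryActiveDistance'].corr(toy_intensity['TotalDistance'])
# Share of days with movement but no very active distance at all
zero_intensity_days = float(((toy_intensity['VeryActiveDistance'] == 0)
                             & (toy_intensity['TotalDistance'] > 0)).mean())
print(pearson_r, zero_intensity_days)
print(shares['VeryActiveShare'])
```

Reporting the coefficient alongside the share of zero-intensity days makes it clear how much of the apparent relationship is driven by the subset of days that include vigorous activity.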


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, these insights can lead to positive business impact:
1. **Highlighting Impact of Intensity:** This chart visually demonstrates how engaging in very active movement contributes significantly to overall daily distance. Bellabeat can use this to encourage users to incorporate higher-intensity activities to boost their total distance and activity levels.
2. **Feature Development:** It supports the development of features that track, analyze, and motivate users based on their performance in different intensity zones, particularly focusing on very active distance goals or challenges.
3. **User Education:** Bellabeat can educate users on how activities like running or fast walking contribute disproportionately to their total daily movement compared to light activity.
4. **Segmentation:** Users with high very active distance are likely more engaged and fitness-focused. Bellabeat can tailor marketing and features to this segment, while also creating pathways for less active users to increase their very active distance.

Are there any insights that lead to negative growth? Justify with specific reason.
This chart primarily shows a positive relationship, which is expected. However, potential negative impacts could arise if:
1. **Inaccurate Tracking:** If the device inaccurately differentiates between very active distance and other types of distance, users may see a scatter plot that doesn't match their actual activity, leading to distrust in the data. For example, if someone runs a significant distance but it's not recorded as "VeryActiveDistance," they might lose faith in the device's accuracy.
2. **Overemphasis on High Intensity:** If Bellabeat's messaging *only* focuses on "Very Active" distance and neglects the importance of total distance covered through light or moderate activity, it might alienate users who prefer less intense forms of exercise, potentially leading them to feel their efforts aren't valued or tracked properly.

Bellabeat should ensure accurate intensity tracking and provide a balanced view of all activity levels and their contribution to overall health and fitness goals.

#### Chart - 10

In [None]:
# Chart - 10: Very Active Distance vs. Total Distance - Scatter Plot

plt.figure(figsize=(10, 6))
sns.scatterplot(data=dailyActivity, x='VeryActiveDistance', y='TotalDistance', alpha=0.6, s=50) # s controls marker size
plt.title('10. Very Active Distance vs. Total Distance')
plt.xlabel('Very Active Distance')
plt.ylabel('Total Distance')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between 'VeryActiveDistance' and 'TotalDistance'. This helps determine if there is a correlation between the distance covered during high-intensity activities and the overall daily distance.


##### 2. What is/are the insight(s) found from the chart?

Very active distance accounts for only part of the total distance users cover. Many data points cluster along a line, suggesting that for users who engage in very active movement, a significant portion of their total daily distance comes from these high-intensity activities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, these insights can lead to positive business impact:
1. **Highlighting Impact of Intensity:** This chart visually demonstrates how engaging in very active movement contributes significantly to overall daily distance. Bellabeat can use this to encourage users to incorporate higher-intensity activities to boost their total distance and activity levels.
2. **Feature Development:** It supports the development of features that track, analyze, and motivate users based on their performance in different intensity zones, particularly focusing on very active distance goals or challenges.
3. **User Education:** Bellabeat can educate users on how activities like running or fast walking contribute disproportionately to their total daily movement compared to light activity.
4. **Segmentation:** Users with high very active distance are likely more engaged and fitness-focused. Bellabeat can tailor marketing and features to this segment, while also creating pathways for less active users to increase their very active distance.

Are there any insights that lead to negative growth? Justify with specific reason.
This chart primarily shows a positive relationship, which is expected. However, potential negative impacts could arise if:
1. **Inaccurate Tracking:** If the device inaccurately differentiates between very active distance and other types of distance, users may see a scatter plot that doesn't match their actual activity, leading to distrust in the data. For example, if someone runs a significant distance but it's not recorded as "VeryActiveDistance," they might lose faith in the device's accuracy.
2. **Overemphasis on High Intensity:** If Bellabeat's messaging *only* focuses on "Very Active" distance and neglects the importance of total distance covered through light or moderate activity, it might alienate users who prefer less intense forms of exercise, potentially leading them to feel their efforts aren't valued or tracked properly.

Bellabeat should ensure accurate intensity tracking and provide a balanced view of all activity levels and their contribution to overall health and fitness goals.

#### Chart - 11

In [None]:

# Merge heartrate data with daily activity data to get step category
heartrate_with_activity = pd.merge(heartrateSeconds, dailyActivity[['Id', 'ActivityDay', 'steps_category']],
                                 left_on=['Id', heartrateSeconds['Time'].dt.date],
                                 right_on=['Id', dailyActivity['ActivityDay'].dt.date],
                                 how='left')

# Drop the helper date column that the merge creates for the array-valued keys
heartrate_with_activity = heartrate_with_activity.drop(columns=['key_1'])

# Calculate average heartrate for each step category
avg_heartrate_by_steps_category = heartrate_with_activity.groupby('steps_category')['Value'].mean().reset_index()

# Chart - 11 Average Heartrate by Step Category
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_heartrate_by_steps_category.dropna(), x='steps_category', y='Value', palette='plasma', order=['Low', 'Moderate', 'Active', 'Very Active'])
plt.title('11. Average Heartrate by Step Category')
plt.xlabel('Step Category')
plt.ylabel('Average Heartrate (bpm)')
plt.grid(axis='y', alpha=0.75)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to compare the average heart rate across different step categories. It helps visualize how average heart rate varies with increasing daily activity levels (categorized by steps). This can provide insights into the physiological response related to different levels of daily movement.


##### 2. What is/are the insight(s) found from the chart?

1. There appears to be a slight trend of increasing average heart rate as the step category increases from 'Low' to 'Very Active'. This is expected as higher activity levels generally correlate with higher average heart rates throughout the day.
2. The difference in average heart rate between categories might be subtle or more pronounced depending on the data and how the average heart rate is calculated (e.g., daily average includes sedentary periods).
3. Users in higher step categories likely spend more time in elevated heart rate zones, which contributes to a higher daily average.
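Merging second-level heart rate readings against daily rows can be slow, and the resulting average is weighted by how many readings each user-day contributes. An alternative, sketched below under the assumption that `Time` and `ActivityDay` are datetime columns as in the notebook, is to aggregate to one mean per user-day before merging. The helper name `daily_mean_heartrate`, the `MeanHeartrate` column, and the toy rows are hypothetical.

```python
import pandas as pd


def daily_mean_heartrate(heartrate: pd.DataFrame) -> pd.DataFrame:
    """Collapse second-level readings to one mean value per user and day."""
    out = heartrate.copy()
    out['ActivityDay'] = out['Time'].dt.normalize()  # midnight timestamp of each reading's day
    return (out.groupby(['Id', 'ActivityDay'], as_index=False)['Value']
               .mean()
               .rename(columns={'Value': 'MeanHeartrate'}))


# Hypothetical toy rows, not taken from the real dataset
toy_hr = pd.DataFrame({
    'Id': [1, 1, 1, 2],
    'Time': pd.to_datetime(['2016-04-12 08:00:00', '2016-04-12 08:00:05',
                            '2016-04-13 09:00:00', '2016-04-12 10:00:00']),
    'Value': [60, 80, 90, 70],
})
toy_days = pd.DataFrame({
    'Id': [1, 1, 2],
    'ActivityDay': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12']),
    'steps_category': ['Low', 'Active', 'Low'],
})
daily_hr = daily_mean_heartrate(toy_hr)
merged = daily_hr.merge(toy_days, on=['Id', 'ActivityDay'], how='left')
avg_by_category = merged.groupby('steps_category')['MeanHeartrate'].mean()
print(avg_by_category)
```

Note the design trade-off: this version weights every user-day equally, whereas averaging raw readings gives more weight to days with more recorded seconds, so the two approaches can produce different bar heights.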


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, understanding the relationship between activity level and heart rate is beneficial:
1. **Physiological Insight:** It helps users understand how their daily activity impacts their cardiovascular system, reinforcing the health benefits of moving more.
2. **Feature Enhancement:** Bellabeat can use this to provide more nuanced feedback to users, showing them not just their steps but also how those steps influence their average heart rate and overall cardiovascular health.
3. **Motivation:** Visualizing that more steps correlate with a higher average heart rate can motivate users to increase their activity, knowing it benefits their heart health.
4. **Contextual Data:** Providing average heart rate data alongside activity data gives users a more complete picture of their physiological state throughout the day.

Are there any insights that lead to negative growth? Justify with specific reason.
This chart primarily shows a positive relationship, which is expected. However, potential negative impacts could arise if:
1. **Data Consistency/Accuracy:** If the average heart rate data appears inconsistent with the activity level (e.g., 'Very Active' users showing unusually low average heart rates, or 'Low' activity users showing very high averages without a clear reason), it could indicate data collection issues (device not worn consistently, sensor errors) or flawed data processing. Inaccurate or confusing data can erode user trust.
2. **Lack of Context:** If the average heart rate is presented without sufficient context (e.g., user's age, individual baseline HR), users might misinterpret the data or become unnecessarily concerned about their readings. Providing raw numbers without interpretation can be detrimental.
3. **Comparison Issues:** Users might compare their average heart rate to others or online benchmarks. If their average seems significantly off compared to their activity level, they might doubt the device or feel discouraged.

Bellabeat needs to ensure the accuracy of heart rate tracking, provide contextual interpretation of the data, and integrate it seamlessly with activity metrics to build user trust and provide actionable insights.

#### Chart - 12

In [None]:
hourly_hr_zone_distribution = heartrateSeconds.groupby(['Time_Period', 'hr_zone']).size().reset_index(name='count')

# Calculate the total count for each hour to normalize the data
hourly_hr_zone_total = heartrateSeconds.groupby('Time_Period').size().reset_index(name='total_count')

# Merge the two dataframes
hourly_hr_zone_distribution = pd.merge(hourly_hr_zone_distribution, hourly_hr_zone_total, on='Time_Period')

# Calculate the percentage distribution for each hour
hourly_hr_zone_distribution['percentage'] = (hourly_hr_zone_distribution['count'] / hourly_hr_zone_distribution['total_count']) * 100

# Define the order of heart rate zones for consistent plotting
hr_zone_order = ['Resting', 'Light', 'Moderate', 'Vigorous']
hourly_hr_zone_distribution['hr_zone'] = pd.Categorical(hourly_hr_zone_distribution['hr_zone'], categories=hr_zone_order, ordered=True)

# Filter out 'Resting' zone to show non-resting distribution
non_resting_hourly_hr_zone_distribution = hourly_hr_zone_distribution[hourly_hr_zone_distribution['hr_zone'] != 'Resting'].copy()

# Chart - 12: Hourly Distribution of Non-Resting Heart Rate Zones (Stacked Bar Chart)
# seaborn's barplot overlays bars when dodge=False instead of stacking them,
# so pivot to one column per zone and let pandas draw a true stacked bar chart.
stacked_zones = non_resting_hourly_hr_zone_distribution.pivot_table(
    index='Time_Period', columns='hr_zone', values='percentage',
    aggfunc='sum', fill_value=0, observed=True)  # observed=True drops the unused 'Resting' category
ax = stacked_zones.plot(kind='bar', stacked=True, colormap='viridis', figsize=(14, 8))
ax.set_title('12. Hourly Distribution of Non-Resting Heart Rate Zones')
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Percentage of Heart Rate Readings')
ax.grid(axis='y', alpha=0.75)
ax.legend(title='HR Zone')
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart is used to visualize the distribution of heart rate readings across different intensity zones (excluding 'Resting') for each hour of the day. Stacking the bars for 'Light', 'Moderate', and 'Vigorous' zones allows us to see the *proportion* of non-resting activity within each hour and how the mix of intensity changes throughout the day. It provides a clear view of when users are most likely to be in elevated heart rate zones.


##### 2. What is/are the insight(s) found from the chart?

1. **Peak Activity Hours:** The stacked bars representing non-resting zones are taller during specific hours, indicating periods when users are most active or exercising. There are likely peaks in the morning (e.g., 8 AM - 10 AM) and in the afternoon/evening (e.g., 4 PM - 7 PM), corresponding to typical commuting or workout times.
2. **Dominance of Light Activity:** Within the non-resting zones, the 'Light' activity zone likely constitutes the largest proportion of readings across most hours, suggesting that even during active periods, a significant amount of time is spent in lower intensity.
3. **Timing of Higher Intensity:** The 'Moderate' and 'Vigorous' zones are likely more prominent during the peak activity hours mentioned above, confirming these are the times users are engaging in more intense exercise.
4. **Low Activity During Sleep/Rest:** Conversely, hours typically associated with sleep (late night/early morning) show very low or no readings in these non-resting zones, as expected.
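The `Time_Period` and `hr_zone` columns used for this chart are created elsewhere in the notebook. As a rough illustration of the per-hour normalization, the sketch below derives both columns and computes zone percentages with `pd.crosstab`. The bpm cut-offs in `ZONE_BINS` are hypothetical and are not the notebook's actual zone definitions, and the toy readings are invented.

```python
import pandas as pd

# Hypothetical bpm cut-offs for illustration only
ZONE_BINS = [0, 60, 100, 140, float('inf')]
ZONE_LABELS = ['Resting', 'Light', 'Moderate', 'Vigorous']


def zone_share_by_hour(heartrate: pd.DataFrame) -> pd.DataFrame:
    """Percentage of readings in each heart rate zone, per hour of day."""
    out = heartrate.copy()
    out['Time_Period'] = out['Time'].dt.hour
    out['hr_zone'] = pd.cut(out['Value'], bins=ZONE_BINS,
                            labels=ZONE_LABELS, right=False).astype(str)
    # normalize='index' makes every hour's row sum to 1 before scaling to percent
    shares = pd.crosstab(out['Time_Period'], out['hr_zone'], normalize='index') * 100
    return shares.reindex(columns=ZONE_LABELS, fill_value=0.0)


# Hypothetical toy readings, not taken from the real dataset
toy_readings = pd.DataFrame({
    'Time': pd.to_datetime(['2016-04-12 08:00:00', '2016-04-12 08:10:00',
                            '2016-04-12 08:20:00', '2016-04-12 08:30:00',
                            '2016-04-12 23:00:00', '2016-04-12 23:30:00']),
    'Value': [70, 95, 120, 150, 55, 58],
})
zone_shares = zone_share_by_hour(toy_readings)
print(zone_shares)
```

Printing this table next to the chart makes it possible to state the peak hours and the share of each zone with actual figures.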


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?
Yes, this insight is highly impactful for Bellabeat:
1. **Targeted Engagement & Reminders:** Identifying peak and low activity hours allows Bellabeat to send targeted notifications, reminders, or challenges at times users are most likely to be active or when they *should* be more active (e.g., encouraging afternoon walks).
2. **Feature Optimization:** Features related to workout tracking, heart rate zone analysis, and activity goals can be optimized based on when users are most likely to use them.
3. **Content Scheduling:** Marketing content, tips, or guided workouts can be scheduled to align with peak activity times, maximizing user engagement.
4. **Understanding User Behavior:** This deep dive into hourly heart rate distribution helps understand the daily rhythm of user activity, which can inform product design and marketing strategies.

Are there any insights that lead to negative growth? Justify with specific reason.
The chart itself shows typical daily activity patterns and is not inherently negative. However, potential negative impacts could arise if:
1. **Lack of Higher Intensity:** If the 'Moderate' and 'Vigorous' zones consistently represent a very small percentage across all hours, it reiterates the insight from Chart 6 (Very Active Minutes distribution) that many users are not engaging in high-intensity activity. If Bellabeat's marketing emphasizes intense workouts but the data shows users aren't doing them (or the device isn't tracking them accurately), it creates a disconnect and could lead to user dissatisfaction and churn.
2. **Inaccurate Zone Classification:** If users feel they are doing an intense workout, but the chart shows their readings are primarily in the 'Light' or 'Resting' zone during that time, they will lose trust in the device's ability to accurately classify their activity intensity. This inaccuracy is a major factor leading to negative growth.
3. **Missing Data:** If certain hours show significantly less data collection (empty bars or very low counts), it might indicate users are not wearing their device consistently throughout the day, diminishing the value of 24/7 tracking.

Bellabeat must focus on accurate heart rate zone classification across different activities and provide clear, actionable feedback to users based on this data to avoid frustration and build trust.

#### Chart - 13

In [None]:
# Good sleep mins vs their active mins

# Merge the dataframes
sleep_activity_merged = pd.merge(sleepDay, dailyActivity,
                                 left_on=['Id', 'SleepDay'],
                                 right_on=['Id', 'ActivityDay'],
                                 how='inner') # Use inner merge to only include days with both sleep and activity data

# Calculate average active minutes per sleep quality category
avg_active_minutes_by_sleep_quality = sleep_activity_merged.groupby('sleep_quality')['TotalActiveMinutes'].mean().reset_index()

# Chart - 13: Average Active Minutes by Sleep Quality
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_active_minutes_by_sleep_quality.dropna(), x='sleep_quality', y='TotalActiveMinutes', palette='cividis', order=['Poor', 'Good', 'Excellent'])
plt.title('13. Average Active Minutes by Sleep Quality')
plt.xlabel('Sleep Quality')
plt.ylabel('Average Total Active Minutes')
plt.grid(axis='y', alpha=0.75)
plt.show()

##### 1. Why did you pick the specific chart?

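A bar chart is used to compare average total active minutes across the sleep quality categories ('Poor', 'Good', 'Excellent'). It makes it easy to see whether days with better sleep also tend to be days with more activity.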

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select relevant numerical columns from dailyActivity for correlation
correlation_data = dailyActivity[['TotalSteps', 'TotalDistance', 'VeryActiveMinutes',
                                   'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes',
                                   'Calories']]

# Calculate each user's average heart rate over the whole tracking period
# (grouping by Id only yields one value per user, not one per day)
avg_daily_heartrate = heartrateSeconds.groupby('Id')['Value'].mean().reset_index()
avg_daily_heartrate.rename(columns={'Value': 'AverageDailyHeartrate'}, inplace=True)

# Merge average daily heart rate to dailyActivity
correlation_data_merged = pd.merge(dailyActivity, avg_daily_heartrate, on='Id', how='left')

# Select the final columns for the correlation heatmap
correlation_cols = correlation_data_merged[['TotalSteps', 'TotalDistance', 'VeryActiveMinutes',
                                             'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes',
                                             'Calories', 'AverageDailyHeartrate']]

# Calculate the correlation matrix
correlation_matrix = correlation_cols.corr()

# Chart - 14 Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('14. Correlation Heatmap of Daily Activity Metrics, Calories, and Average Heartrate')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an excellent choice for visualizing the correlation matrix between multiple numerical variables. The color intensity represents the strength of the correlation, and the numerical annotations show the exact correlation coefficient (ranging from -1 to +1). This chart provides a clear, concise overview of how different metrics relate to each other.


##### 2. What is/are the insight(s) found from the chart?

1.  **Strong Positive Correlations:** There are strong positive correlations between:
    *   `TotalSteps` and `TotalDistance` (as expected, steps contribute directly to distance).
    *   `TotalSteps` and `Calories`.
    *   `TotalDistance` and `Calories`.
    *   `VeryActiveMinutes` and `TotalSteps`/`TotalDistance`/`Calories`.
    *   `FairlyActiveMinutes` and `TotalSteps`/`TotalDistance`/`Calories`.
    *   `LightlyActiveMinutes` and `TotalSteps`/`TotalDistance`/`Calories` (though potentially weaker than VeryActive and FairlyActive).
2.  **Negative Correlation:** `SedentaryMinutes` shows a negative correlation with most activity metrics (`TotalSteps`, `TotalDistance`, the active-minutes columns, `Calories`), indicating that more sedentary time generally means less activity and fewer calories burned.
3.  **Heartrate Correlation:** `AverageDailyHeartrate` shows moderate positive correlations with activity metrics (`TotalSteps`, `TotalDistance`, the active-minutes columns, `Calories`), suggesting that users with higher activity levels tend to have a higher average heart rate. Because `AverageDailyHeartrate` is computed once per user and repeated across that user's days, this correlation mostly reflects differences between users rather than day-to-day variation. It might also not be extremely high because the average heart rate includes sedentary periods, but it still indicates a relationship.
4.  **Inter-Activity Correlations:** The different "ActiveMinutes" categories (`Very`, `Fairly`, `Lightly`) also show some positive correlation among themselves, although the strongest correlations are typically with the overall metrics like Steps, Distance, and Calories.
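To read the strongest relationships off a correlation matrix without scanning the heatmap cell by cell, the unique pairs can be ranked by absolute correlation. This is a minimal sketch; the helper name `rank_correlations` and the toy rows are hypothetical, and in the notebook the input would be the `correlation_cols` frame.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame) -> pd.Series:
    """Return each unique column pair's correlation, strongest (by absolute value) first."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)


# Hypothetical toy rows, not taken from the real dataset
toy_metrics = pd.DataFrame({
    'TotalSteps': [1000, 4000, 8000, 12000],
    'Calories': [1500, 1900, 2300, 2700],
    'SedentaryMinutes': [1200, 1000, 700, 600],
})
ranked = rank_correlations(toy_metrics)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as those involving sedentary time, near the top of the list alongside the strong positive ones.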


#### Chart - 15 - Pair Plot

In [None]:
# Select a subset of numerical columns for the pair plot to avoid overcrowding
pair_plot_cols = dailyActivity[['TotalSteps', 'Calories', 'TotalActiveMinutes', 'SedentaryMinutes']]

# Chart - 15 Pair Plot
# pairplot creates its own figure, so size it with `height` rather than plt.figure
pair_grid = sns.pairplot(pair_plot_cols, diag_kind='kde', height=3) # Use kde for diagonal distribution plots
pair_grid.fig.suptitle('15. Pair Plot of Selected Daily Activity Metrics', y=1.02) # Add a title to the entire plot
plt.show()

In [None]:
# Cleaning the derived columns: drop rows where they are missing
dailyActivity.dropna(subset=['steps_category'], inplace=True)
dailyActivity.dropna(subset=['calories_per_step'], inplace=True)
sleepDay.dropna(subset=['sleep_quality'], inplace=True)



In [None]:
# # Optional: save the updated DataFrames as CSV files and download them
# for name, df in columns.items():
#     df.to_csv(f'{name}.csv', index=False)
#     files.download(f'{name}.csv')

##### 1. Why did you pick the specific chart?

A pair plot visualizes the pairwise relationships between multiple numerical variables. It displays a scatter plot for every combination of two variables and a distribution plot for each individual variable along the diagonal (KDE curves here, since the chart uses `diag_kind='kde'`). This allows a quick visual assessment of correlations, distributions, and potential patterns among key metrics like Total Steps, Calories Burned, Average Sleep Duration, and Average Heart Rate.


##### 2. What is/are the insight(s) found from the chart?

1. **Steps vs. Calories:** The scatter plot shows a clear positive linear relationship. More steps generally go along with more calories burned, which is a fundamental fitness principle. The distribution of Total Steps is right-skewed, while Calories is more spread out.
2. **Steps vs. Avg Sleep:** The scatter plot might show little to no linear correlation, suggesting that the number of steps taken on a given day doesn't have a simple linear relationship with average sleep duration over the tracking period. The diagonal plots show the distribution of each of these two metrics.
3. **Steps vs. Avg Heartrate:** The scatter plot might show a weak to moderate positive correlation, consistent with the heatmap. More steps *might* slightly increase the average daily heart rate, though other factors like intensity and sedentary time play a role.
4. **Calories vs. Avg Sleep:** Similar to steps vs. sleep, there might not be a strong linear correlation between daily calories burned and average sleep duration.
5. **Calories vs. Avg Heartrate:** A moderate positive correlation is expected, as higher activity (burning more calories) tends to increase average heart rate.
6. **Avg Sleep vs. Avg Heartrate:** The scatter plot might show a weak or no clear linear relationship. However, it's worth noting that *resting* heart rate (not average daily) often improves with better sleep and fitness. This chart shows the average over the whole day.
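
The sleep and heart-rate comparisons above rely on per-user averages being joined onto the daily activity rows. A minimal sketch of that step is shown below. The tiny frames are made up, and the column names `Id`, `TotalMinutesAsleep`, and `Value`, as well as the helper `add_user_averages`, are assumptions for illustration rather than names confirmed by this notebook.

```python
import pandas as pd

# Tiny made-up frames; column names are assumptions for illustration
daily = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "TotalSteps": [4000, 9000, 12000, 15000, 2500],
    "Calories": [1800, 2200, 2600, 2900, 1600],
})
sleep = pd.DataFrame({
    "Id": [1, 1, 2],
    "TotalMinutesAsleep": [420, 380, 450],
})
heart = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "Value": [72, 80, 84, 68],
})


def add_user_averages(daily, sleep, heart):
    """Attach each user's average sleep and heart rate to their daily rows."""
    avg_sleep = (sleep.groupby("Id")["TotalMinutesAsleep"].mean()
                      .rename("AvgSleepMinutes").reset_index())
    avg_hr = (heart.groupby("Id")["Value"].mean()
                   .rename("AvgHeartrate").reset_index())
    # Left joins keep every daily row; users without sleep or heart data get NaN
    return (daily.merge(avg_sleep, on="Id", how="left")
                 .merge(avg_hr, on="Id", how="left"))


merged = add_user_averages(daily, sleep, heart)
print(merged)
```

Because the averages are constant per user, each user contributes several daily points sharing one sleep or heart-rate value, which is one reason these scatter plots can look banded and show weak linear trends.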


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

**Key findings:**

*   Most users show moderate daily activity.
*   Higher step counts go along with higher calorie burn.
*   Users spend more time in light or sedentary activity than in intense activity.
*   Users with more active minutes tend to burn more calories.
*   Sleep duration correlates negatively with sedentary time.
*   Peak step counts and heart rate often align during morning and evening hours.

**Recommendations:**

To increase user activity and app engagement, Bellabeat should:

*   Launch step-based challenges to improve daily engagement.
*   Encourage users to track their daily activity, supported by reminders.
*   Build targeted messaging for low-active users with personalized goals.
*   Introduce personalized activity goals.
*   Provide timely nudges or reminders to move.
*   Add gamification features like streaks or badges.
*   Leverage user insights to develop tailored wellness plans.
*   Encourage users to improve their sleep so that they fall into the good sleep category, and to increase their active minutes.
*   Encourage higher calorie burn as part of a healthy lifestyle.
*   For users with higher weight, suggest increasing active minutes and reducing idle time in bed, and show them their own calories burned alongside the calories burned by other users of similar weight.
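
One way to act on the idea of targeted messaging for low-active users is to segment users by their average daily steps. The following is a minimal sketch under stated assumptions: the data is made up, and the thresholds (5,000 and 10,000 steps), the segment labels, and the helper `segment_users` are hypothetical choices rather than values taken from the analysis.

```python
import pandas as pd

# Made-up daily rows; thresholds and labels below are illustrative assumptions
daily = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3, 3],
    "TotalSteps": [3000, 4000, 8000, 9000, 12000, 14000],
    "Calories": [1700, 1800, 2200, 2300, 2700, 2900],
})


def segment_users(daily, bins=(0, 5000, 10000, float("inf")),
                  labels=("low", "moderate", "high")):
    """Average each user's daily metrics and bucket them by mean step count."""
    per_user = daily.groupby("Id")[["TotalSteps", "Calories"]].mean()
    # right=False makes each bin include its lower edge, e.g. [0, 5000)
    per_user["segment"] = pd.cut(per_user["TotalSteps"], bins=list(bins),
                                 labels=list(labels), right=False)
    return per_user.reset_index()


segments = segment_users(daily)
# Users in the "low" segment would be the audience for personalized goals and nudges
low_active = segments.loc[segments["segment"] == "low", "Id"].tolist()
print(segments)
print(low_active)
```

The same per-user table could also support peer comparisons, for example by showing a user their average calories burned next to the average for their segment.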

# **Conclusion**

The analysis of Bellabeat user data revealed patterns in physical activity, calorie burn, and sedentary behavior. Consistent active engagement is associated with higher calorie burn and possibly improved health outcomes. However, most users show sedentary trends, which indicates a potential area for improvement. Walking alone burns relatively few calories, so users should aim to accumulate more very active minutes. Users should also be encouraged to sleep more, since longer sleep appears to go along with more active minutes.
