# INTRODUCTION
Welcome to the Bellabeat data analysis case study! In this case study, you will perform many real-world tasks of a junior data
analyst. You will imagine you are working for Bellabeat, a high-tech manufacturer of health-focused products for women, and
meet different characters and team members. In order to answer the key business questions, you will follow the steps of the
data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables —
including guiding questions and key tasks — will help you stay on the right path.

# CASE STUDY: How Can a Wellness Technology Company Play It Smart?



## About the company
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products.
Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around
the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with
knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned
itself as a tech-driven wellness company for women.

## Scenario
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused
products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the
global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart
device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of
Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The
insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat
executive team along with your high-level recommendations for Bellabeat’s marketing strategy.




# ASK

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart
devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions
will guide your analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?


## Deliverables
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis

# PREPARE

**About Data**

This Kaggle dataset contains personal fitness tracker data from 30 eligible Fitbit users who consented to the submission of their tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It also includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

The data was collected from April 12, 2016 to May 12, 2016. It comprises 18 CSV files, provided in both wide and long formats.

**Data limitations**

The data has some limitations that could undermine the results of the analysis. The limitations to take into consideration are:
1. Missing demographics
2. Small sample size
3. Short data collection period

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas.api.types import CategoricalDtype
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
daily_activity = pd.read_csv('/kaggle/input/fitbit-tracker-data/dailyActivity_merged.csv')
sleep_day = pd.read_csv('/kaggle/input/fitbit-tracker-data/sleepDay_merged.csv')
weight_log = pd.read_csv('/kaggle/input/fitbit-tracker-data/weightLogInfo_merged.csv')
hourly_intensities = pd.read_csv('/kaggle/input/fitbit-tracker-data/hourlyIntensities_merged.csv')
hourly_steps = pd.read_csv('/kaggle/input/fitbit-tracker-data/hourlySteps_merged.csv')
heart_rate = pd.read_csv('/kaggle/input/fitbit-tracker-data/heartrate_seconds_merged.csv')

## Exploring how data is organized

In [None]:
daily_activity.head(5)

In [None]:
sleep_day.head(5)

In [None]:
weight_log.head(5)

In [None]:
hourly_intensities.head()

In [None]:
hourly_steps.head()

In [None]:
heart_rate.head()

In [None]:
daily_activity.dtypes

In [None]:
sleep_day.dtypes

In [None]:
weight_log.dtypes

In [None]:
hourly_intensities.dtypes

In [None]:
hourly_steps.dtypes

In [None]:
heart_rate.dtypes

## How many unique participants are there in each dataframe?

In [None]:
daily_activity.nunique().Id

In [None]:
sleep_day.nunique().Id

In [None]:
weight_log.nunique().Id

In [None]:
hourly_intensities.nunique().Id

In [None]:
hourly_steps.nunique().Id

In [None]:
heart_rate.nunique().Id

# PROCESS

In [None]:
daily_activity.shape

In [None]:
sleep_day.shape

In [None]:
hourly_intensities.shape

In [None]:
hourly_steps.shape

## Statistics of data

In [None]:
daily_activity.describe()

In [None]:
sleep_day.describe()

In [None]:
hourly_intensities.describe()

In [None]:
hourly_steps.describe()

## Are there any null values?

In [None]:
daily_activity.isna().sum()

In [None]:
sleep_day.isna().sum()

In [None]:
hourly_intensities.isna().sum()

In [None]:
hourly_steps.isna().sum()

No null values were found in the four tables checked above.
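The null checks above cover daily_activity, sleep_day, hourly_intensities and hourly_steps. A minimal sketch of running the same check over any number of tables in one pass is shown here. The two frames and their values are made up for illustration (the `Fat` column is only a stand-in), so the counts say nothing about the real files.

```python
import pandas as pd

# Toy frames standing in for the loaded tables (all values are made up).
frames = {
    "daily_activity": pd.DataFrame({"Id": [1, 2], "TotalSteps": [5000, 7000]}),
    "weight_log": pd.DataFrame({"Id": [1, 2], "Fat": [22.0, None]}),
}

# Total number of missing cells per table.
null_counts = {name: int(df.isna().sum().sum()) for name, df in frames.items()}
print(null_counts)
```

In the notebook itself, the dictionary could hold the six dataframes loaded earlier, which would also cover weight_log and heart_rate.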

## Checking for Duplicates

In [None]:
daily_activity.duplicated().sum()

In [None]:
sleep_day.duplicated().sum()

In [None]:
sleep_day.shape

In [None]:
hourly_intensities.duplicated().sum()

In [None]:
hourly_steps.duplicated().sum()

Three duplicated rows were found in sleep_day; the next step is to remove them.

In [None]:
sleep_day = sleep_day.drop_duplicates().copy()
sleep_day

In [None]:
sleep_day.duplicated().sum()
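`duplicated()` with no arguments only flags rows that match on every column. As a hedged sketch on made-up data, it can also be worth counting rows that repeat the same user and day, since such rows would be a problem even if another column differed. The column names follow sleep_day before renaming; the values are invented.

```python
import pandas as pd

# Toy sleep table (values are made up); the second row repeats the first.
sleep = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "SleepDay": ["4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"],
    "TotalMinutesAsleep": [400, 400, 380, 420],
})

# Rows identical in every column.
n_full_dupes = int(sleep.duplicated().sum())

# Rows that repeat the same user and day, whatever the other columns hold.
n_key_dupes = int(sleep.duplicated(subset=["Id", "SleepDay"]).sum())

# Drop exact duplicates and renumber the index.
deduped = sleep.drop_duplicates().reset_index(drop=True)
print(n_full_dupes, n_key_dupes, len(deduped))
```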

## Data Transformation

Rename the date and time columns so that they are named consistently across tables.

In [None]:
daily_activity = daily_activity.rename(columns={'ActivityDate': 'Date'})

sleep_day = sleep_day.rename(columns={'SleepDay': 'Date'})

hourly_steps = hourly_steps.rename(columns={'ActivityHour' : 'Time'})

hourly_intensities = hourly_intensities.rename(columns={'ActivityHour' : 'Time'})

In [None]:
daily_activity.dtypes

As the output for daily_activity shows, the date column is stored as the object (string) type. The date and time columns in the other tables have the same issue, so we convert them all to datetime before proceeding.

In [None]:
daily_activity['Date'] = pd.to_datetime(daily_activity['Date'], format="%m/%d/%Y")


In [None]:
sleep_day['Date'] = pd.to_datetime(sleep_day['Date'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
hourly_steps['Time'] = pd.to_datetime(hourly_steps['Time'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
hourly_intensities['Time'] = pd.to_datetime(hourly_intensities['Time'], format="%m/%d/%Y %I:%M:%S %p")
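The `format` strings passed to `pd.to_datetime` above use the standard strptime directives. A small sketch with Python's own `datetime.strptime` shows how the two patterns are read. The sample strings are assumptions about what the raw values look like, not rows copied from the files.

```python
from datetime import datetime

DATE_ONLY = "%m/%d/%Y"                   # month/day/four-digit year
DATE_TIME_12H = "%m/%d/%Y %I:%M:%S %p"   # 12-hour clock with AM/PM marker

# Assumed sample strings, for illustration only.
day = datetime.strptime("4/12/2016", DATE_ONLY)
evening = datetime.strptime("4/12/2016 5:00:00 PM", DATE_TIME_12H)
midnight = datetime.strptime("4/12/2016 12:00:00 AM", DATE_TIME_12H)

print(day, evening, midnight)
```

Note that `%I` together with `%p` is what turns "5:00:00 PM" into hour 17; using `%H` on a 12-hour string would fail to parse the AM/PM marker.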

In [None]:
daily_activity.head()

In [None]:
sleep_day.head()

In [None]:
daily_activity['DayOfWeek'] = daily_activity['Date'].dt.day_name()

hourly_steps['DayOfWeek'] = hourly_steps['Time'].dt.day_name()


## Merging

I want to merge daily activity (daily_activity) and daily sleep (sleep_day) data so that one table holds the daily data I want to analyse. I also want to merge hourly step data (hourly_steps) and hourly intensity data (hourly_intensities) so that all hourly data is in one dataset.

In [None]:
# Join on both Id and Date so each row pairs a user's activity with that same user's sleep.
# Joining on Date alone would pair every user's activity with every other user's sleep for that day.
total_daily = pd.merge(daily_activity, sleep_day, on = ['Id', 'Date'])
total_daily.head()

In [None]:
total_daily.shape
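The choice of join keys decides what each merged row means. A minimal sketch on made-up data compares a merge on the date alone with a merge on user and date. The frame names and values here are hypothetical.

```python
import pandas as pd

# Two users, one shared date (all values are made up).
activity = pd.DataFrame({
    "Id": [1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12"]),
    "TotalSteps": [8000, 3000],
})
sleep = pd.DataFrame({
    "Id": [1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12"]),
    "TotalMinutesAsleep": [420, 360],
})

# Date only: every activity row is paired with every sleep row of that date.
by_date = pd.merge(activity, sleep, on="Date")

# Id and Date: one row per user per day; validate raises if keys repeat.
by_user_and_date = pd.merge(activity, sleep, on=["Id", "Date"], validate="one_to_one")

print(len(by_date), len(by_user_and_date))
```

The date-only merge yields `Id_x` and `Id_y` columns because the two `Id` columns no longer have to agree, which is a useful warning sign when inspecting a merged table.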

In [None]:
total_hourly = pd.merge(hourly_steps, hourly_intensities, on = ['Time','Id'])
total_hourly['Date'] = total_hourly['Time'].dt.date
total_hourly['Time'] =total_hourly['Time'].dt.time
total_hourly.head(5)

# ANALYSE

How many steps do our users take daily? How active are they? Is the total number of steps correlated with the number of calories burned?

In [None]:
print('The min date is:',min(daily_activity['Date']))
print('The max date is:',max(daily_activity['Date']))
print('The number of unique dates are:',daily_activity['Date'].nunique())

In [None]:
daily_activity.agg(
    {'TotalSteps': ['mean', 'min', 'max'],
     'Calories': ['mean', 'min', 'max'],  
    })

The users average 7,638 steps per day. According to a 2011 study published by BMC (BioMed Central), 10,000 steps a day is a reasonable target for healthy adults and helps reduce the risk of certain health conditions, such as high blood pressure and heart disease. To map daily steps to an activity level, the following categories can be used:

Inactive - Less than 5,000 steps/day 

Average - Between 7,500 and 9,999 steps/day

Very Active - More than 12,500 steps/day

With a mean of 7,638 steps per day, our users fall into the Average category. They would therefore benefit from increasing their daily step count to improve their health and well-being.
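The category thresholds listed above can be written as a small lookup function. This is only a sketch: the function name is hypothetical, and the `"Uncategorized"` label is an assumption added to cover step counts that fall between the three categories named in the text.

```python
def step_category(avg_steps: float) -> str:
    """Map an average daily step count to the categories listed above."""
    if avg_steps < 5000:
        return "Inactive"
    if 7500 <= avg_steps < 10000:
        return "Average"
    if avg_steps > 12500:
        return "Very Active"
    # Assumption: ranges not named in the text get a placeholder label.
    return "Uncategorized"


print(step_category(7638))
```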

In [None]:
total_daily.plot.scatter(x='TotalSteps', y='Calories', color='purple', alpha=0.5, figsize=(10,5))
plt.title('Total Steps VS Calories')
plt.show()

As the scatter plot above shows, there is a positive relationship between the two variables: the more steps a user takes, the more calories that user tends to burn. As mentioned above, staying active is crucial for maintaining good health, and the number of steps taken each day is a useful indicator of how active a user is.
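A scatter plot shows the direction of a relationship; a correlation coefficient puts a number on its strength. The sketch below computes Pearson's r with `Series.corr` on made-up values, so the result illustrates the call and not the dataset.

```python
import pandas as pd

# Toy daily records (values are made up).
toy = pd.DataFrame({
    "TotalSteps": [2000, 5000, 8000, 11000, 14000],
    "Calories": [1700, 2000, 2200, 2600, 2900],
})

# Pearson correlation between steps and calories (the default method).
r = toy["TotalSteps"].corr(toy["Calories"])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear association. It does not show that steps alone cause the calorie difference, since calories burned also depend on factors such as body weight.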

# SHARE

With the data cleaned and merged, we can now build visualizations and draw conclusions from them.

In [None]:
category = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

category_type = CategoricalDtype(categories=category, ordered=True)

total_daily['DayOfWeek'] = total_daily['DayOfWeek'].astype(category_type)

weekday = total_daily.groupby('DayOfWeek', observed=False).mean().reindex(category)

weekday.filter(['TotalSteps'])
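The ordered `CategoricalDtype` used above is what keeps the weekdays in calendar order; plain strings would sort alphabetically. A minimal sketch on made-up data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_type = CategoricalDtype(categories=days, ordered=True)

# Toy records in arbitrary order (values are made up).
toy = pd.DataFrame({
    "DayOfWeek": ["Wednesday", "Monday", "Sunday", "Monday"],
    "TotalSteps": [7000, 8000, 6000, 9000],
})
toy["DayOfWeek"] = toy["DayOfWeek"].astype(day_type)

# Groups come back in category order; observed=True keeps only days present.
mean_steps = toy.groupby("DayOfWeek", observed=True)["TotalSteps"].mean()
print(mean_steps)
```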


## Most active time of the day

We will now determine the average steps and average intensity for each hour of the day.

In [None]:
hourly_activity = total_hourly[['Id', 'Time', 'StepTotal', 'TotalIntensity']]
avg_hourly_activity = (hourly_activity
                       .groupby('Time', as_index=False)
                       .agg(total_steps_mean =('StepTotal','mean'),
                           total_intensity_mean = ('TotalIntensity','mean')))

In [None]:
plt.figure(figsize=(10,6))  
barplot = sns.barplot(data=avg_hourly_activity, x='Time', y='total_steps_mean', 
                      hue='total_intensity_mean', palette="Purples", dodge=False)
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)

plt.xlabel('Time of the Day', fontsize=10)
plt.ylabel('Average Steps of the Day', fontsize=10)
plt.title('Average Activity of the Day', fontsize=12)

plt.legend(title='Activity level', title_fontsize=10, loc='upper right', 
           bbox_to_anchor=(1.25, 1), fontsize=8, frameon=True)
plt.tight_layout() 
plt.show()

We can see that 17:00 to 19:00 is the most active time of the day, in terms of both intensity and steps, with an average of around 550 to 600 steps per hour during that period.
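The peak can also be read off numerically instead of from the chart. The sketch below averages steps per hour across users and picks the hour with the highest mean. The `Hour` column and all values are made up for illustration.

```python
import pandas as pd

# Toy hourly records for two users (values are made up).
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Hour": [8, 12, 18, 8, 12, 18],
    "StepTotal": [200, 400, 650, 100, 500, 550],
})

# Mean steps per hour of day across all users and days.
avg_by_hour = toy.groupby("Hour")["StepTotal"].mean()

# Hour with the highest average, and the average itself.
peak_hour = int(avg_by_hour.idxmax())
peak_steps = float(avg_by_hour.max())
print(peak_hour, peak_steps)
```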

## Average steps & calories burned

Next, we analyse how users split their tracked time across the different activity levels, including sedentary time.

In [None]:
total_daily['total_activity'] = (total_daily['VeryActiveMinutes'] + 
                               total_daily['FairlyActiveMinutes'] + 
                               total_daily['LightlyActiveMinutes'] + 
                               total_daily['SedentaryMinutes'])

total_daily['pct_very_active'] = (total_daily['VeryActiveMinutes'] / total_daily['total_activity']) * 100
total_daily['pct_fairly_active'] = (total_daily['FairlyActiveMinutes'] / total_daily['total_activity']) * 100
total_daily['pct_lightly_active'] = (total_daily['LightlyActiveMinutes'] / total_daily['total_activity']) * 100
total_daily['pct_sedentary'] = (total_daily['SedentaryMinutes'] / total_daily['total_activity']) * 100


avg_daily_activity = total_daily.groupby('DayOfWeek', observed=False).agg(
    pct_very_active_mean=('pct_very_active', 'mean'),
    pct_fairly_active_mean=('pct_fairly_active', 'mean'),
    pct_lightly_active_mean=('pct_lightly_active', 'mean'),
    pct_sedentary_mean=('pct_sedentary', 'mean'),
    total_step_mean=('TotalSteps', 'mean'),
    total_dist_mean=('TotalDistance', 'mean'),
    very_active_mean=('VeryActiveMinutes', 'mean'),
    lightly_active_mean=('LightlyActiveMinutes', 'mean'),
    activity_mean=('total_activity', 'mean'),
    calories_mean=('Calories', 'mean')).reset_index()

activity_long = avg_daily_activity.melt(
    id_vars=['DayOfWeek'],
    value_vars=['pct_very_active_mean', 'pct_fairly_active_mean', 'pct_lightly_active_mean', 'pct_sedentary_mean'],
    var_name='activity_level',
    value_name='percent_activity')

activity_labels = {'pct_very_active_mean': 'Very Active',
    'pct_fairly_active_mean': 'Fairly Active',
    'pct_lightly_active_mean': 'Lightly Active',
    'pct_sedentary_mean': 'Sedentary'}

activity_long['activity_level'] = activity_long['activity_level'].map(activity_labels)


Next, we want to see whether users reach their daily targets for steps and calories burned.

According to the CDC, the average American takes 3,000 to 4,000 steps per day, and it recommends 10,000 steps per day for general health. For this reason, many pedometers track whether users reach 10,000 steps per day.

Medical News Today: [How many steps should people take per day?](https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day)

Mayo Clinic: [10,000 steps a day: Too low? Too high?](https://www.mayoclinic.org/healthy-lifestyle/fitness/in-depth/10000-steps/art-20317391)

Studies treat 10,000 steps per day as active, which makes it a good baseline. Let's look at how many steps the users take and whether they reach 10,000 steps per day.

Reference: [How Many Steps Should You Aim for Each Day?](https://www.verywellfit.com/how-many-steps-per-day-are-enough-3432827)

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

norm = plt.Normalize(vmin=avg_daily_activity['very_active_mean'].min(), 
                      vmax=avg_daily_activity['very_active_mean'].max())

bars = ax.bar(avg_daily_activity['DayOfWeek'], avg_daily_activity['total_step_mean'], 
              color=plt.cm.Blues(norm(avg_daily_activity['very_active_mean'])))

sm = plt.cm.ScalarMappable(cmap='Blues', norm=norm)
sm.set_array([]) 

cbar = fig.colorbar(sm, ax=ax)
cbar.set_label('Very Active Minutes')

ax.set_xticks(range(len(avg_daily_activity['DayOfWeek'])))  
ax.set_xticklabels(avg_daily_activity['DayOfWeek'], rotation=90)  

ax.set_xlabel('Day of the Week')
ax.set_ylabel('Average Total Steps')
ax.set_title('Average Total Steps vs. Very Active Minutes')

plt.show()

The Fitbit users take more daily steps than the average American but still fewer than the recommended 10,000 steps, and their very active time is less than 30 minutes per day. They may need an additional 30 minutes of activity per day to reach the goal.
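Averages can hide how often the 10,000-step mark is actually reached. As a hedged sketch on made-up data, the share of days at or above the goal can be computed overall and per user. The `STEP_GOAL` constant and `met_goal` column are names introduced here for illustration.

```python
import pandas as pd

STEP_GOAL = 10_000  # baseline discussed above

# Toy daily records for two users (values are made up).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "TotalSteps": [12000, 9000, 4000, 10000],
})

# Flag each day that meets the goal; the mean of a boolean column is a share.
toy["met_goal"] = toy["TotalSteps"] >= STEP_GOAL
share_of_days = float(toy["met_goal"].mean())
share_by_user = toy.groupby("Id")["met_goal"].mean()
print(share_of_days)
print(share_by_user)
```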

## Percentage of activity in minutes

In [None]:
minutes_categories = total_daily[['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']].mean()
minutes_categories.plot.pie(ylabel='Category', title='Average of Minutes Spent in Each Activity Category',autopct='%1.1f%%', fontsize='11', startangle=0, figsize=(10,8))
plt.show()

This pie chart shows that users are sedentary most of the time, spend about a sixth of their time on light activity, and spend only about 2% of their time being very active, that is, doing proper exercise.
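The percentages on the pie chart are each category's mean minutes divided by the sum of all four means. A short sketch of that arithmetic, using made-up means that do not come from the dataset:

```python
import pandas as pd

# Hypothetical mean minutes per category (made-up values).
mean_minutes = pd.Series({
    "VeryActiveMinutes": 20.0,
    "FairlyActiveMinutes": 15.0,
    "LightlyActiveMinutes": 200.0,
    "SedentaryMinutes": 765.0,
})

# Share of total tracked minutes, in percent, rounded to one decimal.
share_pct = (mean_minutes / mean_minutes.sum() * 100).round(1)
print(share_pct)
```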

# ACT

From the above data, we know that:

1. Users most likely work out between 5 pm and 7 pm each day.
2. On weekdays, users are sedentary about 80% of the time.
3. Average steps are around 7,000 to 8,200 per day, which is less than the recommended 10,000 steps; active time is also less than the recommended 30 minutes.
4. Calories burned are around 2,000 to 2,500 each day.

# RECOMMENDATIONS

Based on users' lifestyles, we can try incorporating the following features to help build Bellabeat's marketing strategy:

1. Personalized step goals: The hourly step data shows when users are most active during the day. On average, users take around 7,200 to 8,000 steps per day, which indicates that they may need an additional 30 to 60 minutes of activity per day to reach the recommended 10,000 steps. To help users improve their daily activity gradually, we can incorporate a feature that allows them to set personalized step goals and receive reminders to stay on track.

2. Calorie calculator: we can add a feature that lets users record their daily calorie intake, together with a calorie calculator that shows their net calories, so that users who want to lose weight know whether they are burning more calories than they consume.

3. Weekly and Monthly Achievement Reports: To keep the users motivated, the Bellabeat app could provide customized weekly and monthly reports regarding the total number of steps, burned calories, sleeping habits, weight loss, and total time spent on the different activity levels. The app could send congratulatory messages to those who keep up with good habits, as well as motivational tips for improvement depending on the user's overall performance.

4. A reward-based point system where users earn points for hitting healthy activity and nutrition goals. These points can be redeemed for vouchers, discounts, or gift cards, encouraging consistent participation. Users can earn points by achieving key health milestones, such as meeting step goals, maintaining a balanced calorie intake, tracking workouts, and ensuring adequate sleep.

5. Introducing a meal plan suggestion feature would provide users with a holistic approach to health and wellness. This feature would offer users tailored meal plans, helping them maintain a balanced diet that aligns with their fitness objectives. By leveraging user data, such as daily step count, intensity levels, and calories burned, the app could generate customized meal suggestions that promote healthier eating habits. 
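
The calorie calculator idea in recommendation 2 comes down to simple arithmetic: calories consumed minus calories burned. The sketch below is purely illustrative. The function names and the 50 kcal tolerance are assumptions, not part of any Bellabeat product.

```python
def net_calories(consumed_kcal: float, burned_kcal: float) -> float:
    """Net calories for the day: intake minus expenditure."""
    return consumed_kcal - burned_kcal


def balance_label(net_kcal: float, tolerance_kcal: float = 50.0) -> str:
    """Label the day; the tolerance is a hypothetical value for illustration."""
    if net_kcal > tolerance_kcal:
        return "surplus"
    if net_kcal < -tolerance_kcal:
        return "deficit"
    return "balanced"


print(balance_label(net_calories(1800, 2300)))
```

A user aiming to lose weight would look for days labelled "deficit", which is the case where more calories are burned than consumed.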