# Case Study 02: How can a wellness company play it smart?


Author: Vinícius Alves

Date: 03/31/2023

Version: 1.0

## About the Company: Bellabeat

[Bellabeat](https://bellabeat.com/) is a high-tech company that manufactures health-focused smart products. The company focuses on women and develops gadgets that collect data on activity, sleep, stress, and reproductive health. This has allowed Bellabeat to empower women with knowledge about their own health and habits.

### Products
- **Bellabeat app:** The Bellabeat app provides users with health data related to their activity, sleep, stress,
menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

- **Leaf**: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects
to the Bellabeat app to track activity, sleep, and stress.

- **Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user
activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your
daily wellness.

- **Spring**: This is a water bottle that tracks daily water intake using smart technology to ensure that you are
appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your
hydration levels.

- **Bellabeat membership**: Bellabeat also offers a subscription-based membership program for users.
Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and 
beauty, and mindfulness based on their lifestyle and goals.

# The analysis 

## Ask phase

To help Bellabeat improve a product, a study of user behavior was commissioned, based on data from another company's health-tracking gadgets.

The questions raised in this analysis were:

1. What are some trends in smart device usage? 
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

## Prepare phase:

### The data used

The data used in this analysis is from [Fitbit's gadgets](https://www.kaggle.com/datasets/arashnic/fitbit) available on Kaggle.

This dataset was collected by a distributed survey via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data does not have any personal information of the users.

This data is open-source and has the CC0: Public Domain License, which means anyone can copy, modify, distribute, and work with it, even for commercial purposes, without asking permission.

### How was the data distributed?

The dataset is composed of 18 .csv files. 15 in long format, 3 in wide format.

### Bias

About bias in the data, I think that some points need to be discussed:


#### About the sample size:

Most of the shared files contain data from 33 people, but 33 people are
only a small sample of the entire population that uses smart gadgets.
Some files contain data from only 24, 14, or 8 people.
Because of that, we cannot claim that our data is unbiased.


#### About the missing values:

17 of the 18 files have no missing values, but in
'weightLogInfo_merged.csv' the Fat column is filled in for only two
people. That column should therefore not be used.


#### About confounding variables:

Most of the datasets have readable, self-explanatory variables, but
'minuteSleep_merged.csv' has a variable called 'value' whose meaning is
not clear. It might be a count of minutes slept, in which case a value of 1
would make sense, but some rows have a value of 2 or 3.

According to the [Fitabase Data Dictionary](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf), these values translate to: 1 = asleep, 2 = restless, 3 = awake.
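As a minimal sketch of how these codes could be used, the cell below maps them to labels and counts minutes per state for each user. The rows are made up for illustration; only the 'value' column name and the 1/2/3 meanings come from the text above, and the 'Id' column is assumed to be present as in the other files.

```python
import pandas as pd

# Made-up sample rows (one row = one minute of a sleep log)
minute_sleep = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "value": [1, 1, 2, 3, 1, 1],
})

# Codes as described in the Fitabase data dictionary
SLEEP_STATES = {1: "asleep", 2: "restless", 3: "awake"}
minute_sleep["state"] = minute_sleep["value"].map(SLEEP_STATES)

# Counting rows gives the number of minutes spent in each state per user
minutes_per_state = (
    minute_sleep.groupby(["Id", "state"]).size().unstack(fill_value=0)
)
print(minutes_per_state)
```

Any code outside the mapping would show up as a missing state, which is a quick way to spot unexpected values.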


#### About the selection and measurement bias:

Because this dataset was built with the consent of a small group of people
who used these gadgets, it is already biased: the participants knew that
their data was being collected. So there is selection bias (only people
who own such a gadget and agreed to share their data are included) and
possibly measurement bias (did their behavior change during the period in which the data was collected?).

In [20]:
# Libraries
import pandas as pd
import numpy as np
import statistics as st

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import os


The data files used in this study are listed below, along with some initial checks.

In [21]:
# Getting the archives
import fnmatch

extensions = ['*.csv']

archives = []

folder_path = "Data_Coursera_CaseStudy02/"


# Walk through the folder and its subdirectories to find CSV files
for root, dirs, files in os.walk(folder_path):
    for csv_extension in extensions:
        for filename in fnmatch.filter(files, csv_extension):
            csv_path = os.path.join(root, filename)
            archives.append(csv_path)

archives

['Data_Coursera_CaseStudy02/dailyActivity_merged.csv',
 'Data_Coursera_CaseStudy02/dailyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/dailySteps_merged.csv',
 'Data_Coursera_CaseStudy02/heartrate_seconds_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/hourlySteps_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteSleep_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsWide_merged.csv',
 'Data_Coursera_CaseStudy02/sleepDay_merged.csv',
 'Data_Coursera_CaseStudy0

In [22]:
# Checking the data for bias:


# 1 - Checking the sample size

# For every dataset:
# Check the number of ID (people)
# Check the total line count
# Check the missing values

for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Ids: {len(new_df.Id.unique())}')
  print(f'Total line count: {len(new_df)}')
  print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
  print("-------------------------------------------")





archive:  dailyActivity_merged.csv
Ids: 33
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Ids: 33
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Ids: 33
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Ids: 33
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Ids: 14
Total line count: 2483658
Missing values: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-----------

Data:
https://www.kaggle.com/datasets/arashnic/fitbit?resource=download

## Process phase:


As you can see, I use a Jupyter Notebook to perform this analysis.

First of all, I downloaded the data and put the files into a folder called 'Data_Coursera_CaseStudy02', so I will read every file from there.


To ensure that the data is clean, I performed these steps on every file:
1. Checked for duplicates;
2. Checked for rows with no data;
3. Checked for inconsistent data.


Is there any duplicated value in the files?

In [23]:
for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Duplicated lines: {new_df.duplicated().sum()}')
  print("-------------------------------------------")



archive:  dailyActivity_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteCaloriesNarrow_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteCaloriesWide_merged.csv
Duplicated lines: 0
-----------------------------------

Duplicated rows were found in two of the datasets:

-------------------------------------------

archive:  minuteSleep_merged.csv

Duplicated lines: 543

-------------------------------------------

archive:  sleepDay_merged.csv

Duplicated lines: 3

-------------------------------------------

Removing duplicates

In [24]:
# Removing duplicates
for archive in archives:
  new_df = pd.read_csv(archive)
  if new_df.duplicated().sum() > 0:
    print(f"Removing the duplicates from: {archive.replace('Data_Coursera_CaseStudy02/',' ')}")
    new_df.drop_duplicates(keep='first', inplace=True)
    # Saving the DataFrame without the duplicated rows.
    # Note: to_csv also writes the index as an unnamed first column,
    # which later cells read back with index_col=[0].
    new_df.to_csv(archive)

# Removing the duplicates from:  minuteSleep_merged.csv
# Removing the duplicates from:  sleepDay_merged.csv
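A side effect of saving with the default `to_csv` settings is the extra 'Unnamed: 0' column that appears when the file is read again. The sketch below shows the difference between the default and `index=False` on a made-up DataFrame, using in-memory buffers instead of real files. It is only an illustration of the pandas behavior; the cells in this notebook keep the default and read the files back with `index_col=[0]`.

```python
import io
import pandas as pd

# Made-up data with one duplicated row
df = pd.DataFrame({"Id": [1, 1, 2], "Value": [10, 10, 20]})
deduped = df.drop_duplicates(keep="first")

# Default: the index is written as an unnamed first column
buf_default = io.StringIO()
deduped.to_csv(buf_default)

# index=False: only the original columns are written
buf_no_index = io.StringIO()
deduped.to_csv(buf_no_index, index=False)

cols_default = list(pd.read_csv(io.StringIO(buf_default.getvalue())).columns)
cols_no_index = list(pd.read_csv(io.StringIO(buf_no_index.getvalue())).columns)
print(cols_default)
print(cols_no_index)
```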

### Is there any missing value in the files?

In [25]:
# Searching for rows or column with no data:


for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Total line count: {len(new_df)}')
  print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
  print("-------------------------------------------") 


archive:  dailyActivity_merged.csv
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Total line count: 2483658
Missing values: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  minuteCaloriesNarrow_

Missing values detected in:

-------------------------------------------
archive:  weightLogInfo_merged.csv

Total line count: 67

Missing values: 65

-------------------------------------------



At first this looks strange: 65 of the 67 rows have a missing value.

After checking the data, the cause is the 'Fat' column, which has only 2 values.

That column needs to be removed, because we will not use that data.

The column 'Fat' is removed from 'weightLogInfo_merged.csv' in a later cell.

### Evaluating for inconsistent data:

In [26]:
for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Description:')
  print(new_df.describe())
  print("-------------------------------------------") 

archive:  dailyActivity_merged.csv
Description:
       Unnamed: 0            Id    TotalSteps  TotalDistance  TrackerDistance  \
count  863.000000  8.630000e+02    863.000000     863.000000       863.000000   
mean   470.404403  4.857542e+09   8319.392816       5.979513         5.963882   
std    270.542606  2.418405e+09   4744.967224       3.721044         3.703191   
min      0.000000  1.503960e+09      4.000000       0.000000         0.000000   
25%    240.500000  2.320127e+09   4923.000000       3.370000         3.370000   
50%    471.000000  4.445115e+09   8053.000000       5.590000         5.590000   
75%    708.500000  6.962181e+09  11092.500000       7.900000         7.880000   
max    939.000000  8.877689e+09  36019.000000      28.030001        28.030001   

       LoggedActivitiesDistance  VeryActiveDistance  ModeratelyActiveDistance  \
count                863.000000          863.000000                863.000000   
mean                   0.117822            1.636756         

### Evaluating inconsistent data

---
From the archive:  dailyCalories_merged.csv

We can see that the minimum is 0 calories burned in a day. That is impossible, so these rows need to be removed from the dataset.

---

From the archive: dailyIntensities_merged.csv

For each day, the sum SedentaryMinutes + LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes should be 1440 (the total minutes in a day).

On some days the sum is not 1440, perhaps because the gadget's battery ran out, or because the data was altered.

462 rows (49.1% of the data) do not sum to 1440. Since that is almost half of the dataset, I will keep these rows and continue with this data.

--- 

From the archive:  dailyActivity_merged.csv

All the days with 0 steps will be excluded. Probably on those days the volunteers did not use the gadget. There are 77 rows with 0 TotalSteps data.



---

From the archive:  dailySteps_merged.csv

The same situation as the archive above. Deleted the data with 0 steps.


---

From the archive:  hourlySteps_merged.csv

Here, an hour with 0 steps is meaningful, but a day whose hours sum to 0 steps is not. Such days are candidates for deletion, since they are probably of no use to us. However, they might make a difference in the analysis phase, so for now I will keep them and come back to this later if needed.
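If those days ever need to be removed, one possible approach is sketched below. It is not a cell from this analysis: the rows are made up, and the timestamp format passed to `pd.to_datetime` is an assumption about how 'ActivityHour' is stored. Only the 'Id', 'ActivityHour' and 'StepTotal' column names come from this notebook.

```python
import pandas as pd

# Made-up hourly rows: user 1 has one day with steps and one day with none
hourly = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "ActivityHour": [
        "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM",
        "4/13/2016 8:00:00 AM", "4/13/2016 9:00:00 AM",
        "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM",
    ],
    "StepTotal": [0, 120, 0, 0, 300, 0],
})
hourly["ActivityHour"] = pd.to_datetime(
    hourly["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p"
)
hourly["Day"] = hourly["ActivityHour"].dt.date

# Total steps of the (Id, Day) group, repeated on every hourly row
daily_total = hourly.groupby(["Id", "Day"])["StepTotal"].transform("sum")

# Days with no steps at all, and the data without those days
zero_days = hourly.loc[daily_total == 0, ["Id", "Day"]].drop_duplicates()
kept = hourly[daily_total > 0]
print(zero_days)
```

Single hours with 0 steps stay in `kept`; only whole days that sum to 0 are dropped.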

---

From the archive:  minuteMETsNarrow_merged.csv

After some research, and with "some" help from ChatGPT (which gave me wrong information), I found that no activity a human performs costs less than 0.95 METs.

From the Compendium of Physical Activities:

https://sites.google.com/site/compendiumofphysicalactivities/Activity-Categories/inactivity?authuser=0

Sleep has a value of 0.95 METs. So, every 0 METs value in the dataset will be deleted. 

For anyone who wants to know what MET is: MET stands for Metabolic Equivalent of Task. It is a unit that measures how much energy an activity consumes compared to being at rest.

I got this information from: https://www.omnicalculator.com/sports/met-minutes-per-week#met-definition

---

From the archive:  weightLogInfo_merged.csv

We must remove the column Fat, since it has only 2 values.




In [27]:
# Analyzing dailyIntensities_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis['Sum'] = df_analysis['SedentaryMinutes'] + df_analysis['LightlyActiveMinutes'] + df_analysis['FairlyActiveMinutes'] + df_analysis['VeryActiveMinutes']
len(df_analysis[df_analysis['Sum']!=1440])

462

In [28]:
# Analyzing dailyActivity_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyActivity_merged.csv'
df_analysis = pd.read_csv(dataset)

# Keep only the days with at least one recorded step
df_analysis = df_analysis[df_analysis['TotalSteps']!=0]

df_analysis.to_csv(dataset)

In [29]:
# Analyzing dailySteps_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailySteps_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis

df_analysis = df_analysis[df_analysis['StepTotal']!=0]

df_analysis.to_csv( 'Data_Coursera_CaseStudy02/dailySteps_merged.csv')


In [30]:
# Analyzing minuteMETsNarrow_merged.csv
dataset = 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv'
df_analysis = pd.read_csv(dataset)

# Drop the minutes with a METs value of 0
df_analysis = df_analysis[df_analysis['METs']!=0]

df_analysis.to_csv(dataset)



In [31]:
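# Removing the column 'Fat' from 'weightLogInfo_merged.csv'.
# errors='ignore' makes the cell safe to re-run: once the column has been
# dropped and the file saved, a plain drop raises
# KeyError: "['Fat'] not found in axis".
dataset = 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv'
df_drop_column = pd.read_csv(dataset)
df_drop_column.drop(columns=['Fat'], inplace=True, errors='ignore')
df_drop_column.to_csv(dataset)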

## Analyze and Share phase:


OK. Now we need to develop hypotheses about what this data can show us.




### Average Number of steps

In [None]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")

Average number of steps taken by day: 8319.39281575898 steps


According to some research, a person should take 10,000 steps a day to be considered active. Of course, we have to balance an active life with our jobs, but this average across all users suggests that **most of them are not active people**.
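An average alone does not say how many days actually reach the 10,000-step mark, because a few very active days can pull it up. The sketch below, on made-up daily totals, compares the mean with the median and with the share of days at or above 10,000 steps. 'StepTotal' is the column name used in this notebook; the numbers are purely illustrative.

```python
import pandas as pd

# Made-up daily totals; one very active day pulls the mean up
daily = pd.DataFrame({"StepTotal": [2000, 4000, 6000, 8000, 30000]})

mean_steps = daily["StepTotal"].mean()
median_steps = daily["StepTotal"].median()

# Fraction of days that reach the 10,000-step mark
share_active_days = (daily["StepTotal"] >= 10_000).mean()

print(f"mean={mean_steps:.0f}, median={median_steps:.0f}, "
      f"days at 10k or more: {share_active_days:.0%}")
```

In this toy example the mean is exactly 10,000 even though only one day in five reaches it, which is why a per-user or per-day share is a useful complement to the average.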

### Average Calories burned by day

In [None]:
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])
print(f"Average Calories by day: {np.average(caloriesDay['Calories'])} calories")

Average Calories by day: 2303.609574468085 calories


This number of calories burned per day is in line with what research indicates adult men and women need to burn daily to maintain their weight.

### Average Time in Activities

In [None]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
print(f"Average time per day in lightly active activities: {np.average(dailyActivity['LightlyActiveMinutes']):.2f} minutes")
print(f"Average time per day in fairly active activities: {np.average(dailyActivity['FairlyActiveMinutes']):.2f} minutes")
print(f"Average time per day in very active activities: {np.average(dailyActivity['VeryActiveMinutes']):.2f} minutes")



Average time per day in lightly active activities: 210.02 minutes
Average time per day in fairly active activities: 14.78 minutes
Average time per day in very active activities: 23.02 minutes


From some research with the help of Bing AI:

The World Health Organization (WHO) recommends that adults do at least 150 to 300 minutes of moderate-intensity aerobic physical activity per week, or at least 75 to 150 minutes of vigorous-intensity activity, or an equivalent combination. The Centers for Disease Control and Prevention (CDC) recommends that adults get 150 minutes of moderate-intensity physical activity and 2 days of muscle-strengthening activity each week.

The average is about 210 minutes per day of light activity, which may be related to work, against about 15 minutes of fairly active and 23 minutes of very active time. Light activity clearly dominates, so these users could benefit from spending more time in fairly active and very active activities, although the averages alone do not show how many individual users fall short of the recommendations.
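To see how individual users compare with the weekly CDC figure of 150 minutes, the daily minutes can be scaled to a week per user. The sketch below uses made-up rows and rests on one loud assumption: that Fitbit's fairly active and very active minutes together can stand in for moderate-intensity activity. The column names are the ones from 'dailyActivity_merged.csv'.

```python
import pandas as pd

# Made-up rows for three users
activity = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3, 3],
    "FairlyActiveMinutes": [10, 20, 0, 5, 30, 40],
    "VeryActiveMinutes": [20, 30, 0, 5, 10, 20],
})

# Assumption: fairly + very active minutes approximate moderate-intensity activity
activity["ActiveMinutes"] = (
    activity["FairlyActiveMinutes"] + activity["VeryActiveMinutes"]
)

# Mean minutes per day for each user, scaled to a 7-day week
weekly_minutes = activity.groupby("Id")["ActiveMinutes"].mean() * 7

WEEKLY_TARGET = 150  # CDC figure quoted above, in minutes per week
below_target = weekly_minutes[weekly_minutes < WEEKLY_TARGET]
print(below_target)
```

The share of users in `below_target` would say more about who needs extra activity than the overall averages do.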

### Steps through the day

In [None]:
# Bar chart: steps through the day

hourlySteps = pd.read_csv("Data_Coursera_CaseStudy02/hourlySteps_merged.csv", index_col=[0])
hourlySteps['ActivityHour'] = pd.to_datetime(hourlySteps['ActivityHour'])

hourlySteps['Day'] = hourlySteps['ActivityHour'].dt.date
hourlySteps['Time'] =  hourlySteps['ActivityHour'].dt.time


stepsMean_By_hour = []

for hour in (hourlySteps['Time'].unique()): 
  stepsMean_By_hour.append(np.average(hourlySteps[hourlySteps['Time']==hour]['StepTotal']))



# df['datetime'] = pd.to_datetime(df['datetime'])

data = {'hours': hourlySteps['Time'].unique(),
        'Average steps': stepsMean_By_hour}

fig = px.bar(data, x='hours', y='Average steps', title = 'Average steps by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average steps',
                    side='left'                 
                    )
)

fig.show()

Here we see what would be expected: most of the steps happen during working hours and right after them, perhaps when people leave work or go to the gym.
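The same hourly averages can also be computed without an explicit loop. The sketch below is an alternative, not the cell used above: it groups made-up rows by the hour of 'ActivityHour' and also picks the busiest hour. The column names follow 'hourlySteps_merged.csv'.

```python
import pandas as pd

# Made-up hourly rows for two days
hourly = pd.DataFrame({
    "ActivityHour": pd.to_datetime([
        "2016-04-12 08:00", "2016-04-12 18:00",
        "2016-04-13 08:00", "2016-04-13 18:00",
    ]),
    "StepTotal": [100, 900, 300, 1100],
})

# Average steps for each hour of the day (0-23), across all days
avg_by_hour = hourly.groupby(hourly["ActivityHour"].dt.hour)["StepTotal"].mean()

# Hour of the day with the highest average
peak_hour = avg_by_hour.idxmax()
print(avg_by_hour)
print(f"Peak hour: {peak_hour}")
```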

### Classification of users

Users averaging 10,000 steps or more per day were classified as active.

In [None]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])

# Average per user; numeric_only skips the text date column, which
# would otherwise raise a TypeError in recent pandas versions
dailySteps = dailySteps.groupby(by=['Id']).mean(numeric_only=True)

# Users averaging 10,000 steps or more are classified as active
dailySteps['Status'] = np.where(dailySteps['StepTotal'] >= 10000, 'Active', 'Not Active')

data = {'Status': ['Active', 'Not Active'],
        'number of status': [ len(dailySteps[dailySteps['Status']=='Active']), len(dailySteps[dailySteps['Status']=='Not Active'])]}


fig = px.pie(data, values='number of status', names='Status', title='Percentage of active users based on 10,000 steps daily')
fig.show()

# print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")

As the average daily steps calculation above suggested, when we compute the average per user, almost 80% of them do not reach 10,000 steps per day.

### Intensities by hour

In [None]:
# Bar chart: intensities through the day

hourlyIntensities = pd.read_csv("Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv", index_col=[0])
hourlyIntensities['ActivityHour'] = pd.to_datetime(hourlyIntensities['ActivityHour'])

hourlyIntensities['Day'] = hourlyIntensities['ActivityHour'].dt.date
hourlyIntensities['Time'] =  hourlyIntensities['ActivityHour'].dt.time


intensityMean_By_hour = []

for hour in (hourlyIntensities['Time'].unique()): 
  intensityMean_By_hour.append(np.average(hourlyIntensities[hourlyIntensities['Time']==hour]['TotalIntensity']))



data = {'hours': hourlyIntensities['Time'].unique(),
        'Average Intensities': intensityMean_By_hour}

fig = px.bar(data, x='hours', y='Average Intensities', title = 'Average Intensities by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average Intensities',
                    side='left'                 
                    )
)

fig.show()




This follows the same pattern as the steps by hour, so the same interpretation applies.

### Calories through the day

In [None]:
# Bar chart: average calories by hour

hourlyCalories = pd.read_csv("Data_Coursera_CaseStudy02/hourlyCalories_merged.csv", index_col=[0])
hourlyCalories['ActivityHour'] = pd.to_datetime(hourlyCalories['ActivityHour'])

hourlyCalories['Day'] = hourlyCalories['ActivityHour'].dt.date
hourlyCalories['Time'] =  hourlyCalories['ActivityHour'].dt.time


caloriesAVG_By_hour = []

for hour in (hourlyCalories['Time'].unique()): 
  caloriesAVG_By_hour.append(np.average(hourlyCalories[hourlyCalories['Time']==hour]['Calories']))



# df['datetime'] = pd.to_datetime(df['datetime'])

data = {'hours': hourlyCalories['Time'].unique(),
        'Average calories': caloriesAVG_By_hour}

fig = px.bar(data, x='hours', y='Average calories', title = 'Average calories by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average calories',
                    side='left'                 
                    )
)

fig.show()

Again, the same interpretation applies: most calories are burned during working hours.

### Time Sleep vs Daily Steps

In [None]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)






In [None]:
# New dataframe 
df_analysis_sleep_steps = pd.merge(dailySteps, sleepDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps

In [None]:
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalMinutesAsleep'], y=df_analysis_sleep_steps['StepTotal'], title="Minutes Asleep vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Minutes Asleep',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

In this plot we can see that most of the points cluster around ~440 minutes of sleep and ~10k steps per day.

There does not appear to be a relation between the two variables.
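A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation coefficient between the two columns on made-up rows; values close to 0 indicate a weak linear relation, values close to -1 or 1 a strong one. The column names match the merged DataFrame above, while the data is purely illustrative.

```python
import pandas as pd

# Made-up rows standing in for the merged sleep/steps data
merged = pd.DataFrame({
    "TotalMinutesAsleep": [300, 360, 420, 480, 540],
    "StepTotal": [9000, 7000, 10000, 6000, 8000],
})

# Series.corr uses the Pearson coefficient by default
r = merged["TotalMinutesAsleep"].corr(merged["StepTotal"])
print(f"Pearson r = {r:.2f}")
```

On the real data, the same call on `df_analysis_sleep_steps` would give a single number to report next to the scatter plot.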

### Sleep Time vs Time in Bed

In [32]:
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
sleepDay

fig = px.scatter(x=sleepDay['TotalMinutesAsleep'], y=sleepDay['TotalTimeInBed'], title="Minutes Asleep vs Total time in Bed")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Minutes Asleep',
                      ), 
                    yaxis=dict(
                      title='Total Time in Bed',
                      side='left',
))


fig.show()

A nearly linear relation, but that is kind of obvious: time asleep is part of the time in bed.
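Since the two columns move together, their ratio may be more informative than the scatter plot. The sketch below derives two extra columns on made-up rows: the fraction of time in bed spent asleep and the minutes spent awake in bed. The names 'SleepEfficiency' and 'MinutesAwakeInBed' are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up rows with the two columns used in the plot above
sleep = pd.DataFrame({
    "TotalMinutesAsleep": [400, 450, 300],
    "TotalTimeInBed": [500, 500, 600],
})

# Fraction of the time in bed that was actually spent asleep
sleep["SleepEfficiency"] = sleep["TotalMinutesAsleep"] / sleep["TotalTimeInBed"]

# Minutes in bed without sleeping
sleep["MinutesAwakeInBed"] = sleep["TotalTimeInBed"] - sleep["TotalMinutesAsleep"]
print(sleep)
```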

### Sleep Records vs Daily Steps

In [33]:
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalSleepRecords'], y=df_analysis_sleep_steps['StepTotal'], title="Sleep Records vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Total Sleep Records',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

Again, nothing that we can confirm.

If we had more data on people with 3 sleep records per day, we could check whether those who do not appear to sleep well tend to take fewer steps per day.

### Calories vs Daily Steps

In [35]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
# sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)


In [36]:
# New dataframe 
df_analysis_calories_steps = pd.merge(dailySteps, caloriesDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps

In [37]:
# Plot
fig = px.scatter(x=df_analysis_calories_steps['Calories'], y=df_analysis_calories_steps['StepTotal'], title="Calories vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Calories spent in a Day',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

A clear positive correlation: the more steps you take in a day, the more calories you burn.

We have some outliers, which could be:
1. An error of the device;
2. A point where calories were counted from the steps only (while the calories of every other data point are the sum of steps and other activities).
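One simple way to list such outliers instead of spotting them by eye is to fit a straight line and flag the points that lie far from it. The sketch below does this on made-up numbers with `np.polyfit`; the threshold of 2 standard deviations is an arbitrary choice for illustration, not something derived from this dataset.

```python
import numpy as np

# Made-up data: calories grow linearly with steps, except for one point
steps = np.array([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000], dtype=float)
calories = 1500 + 0.05 * steps
calories[3] = 4000  # artificial outlier

# Least-squares straight line: calories ~ slope * steps + intercept
slope, intercept = np.polyfit(steps, calories, 1)
residuals = calories - (slope * steps + intercept)

# Residuals expressed in units of their own standard deviation
z_scores = residuals / residuals.std()

# Assumed threshold: more than 2 standard deviations from the line
outlier_idx = np.flatnonzero(np.abs(z_scores) > 2)
print(outlier_idx)
```

The flagged rows could then be inspected one by one to decide which of the two explanations above is more likely.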

### Time Activity vs Sleep Time

In [38]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)

In [None]:
# Merging the Dataframes
df_analysis_calories_steps = pd.merge(dailyActivity, sleepDay, on = ['Id', 'Day'])
df_analysis_calories_steps

In [40]:

# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Activities'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=600, width=1200, title_text="Activity Time vs Sleep Time ", hovermode="x unified")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Time spent in lightly activities',
                      ), 
                    yaxis=dict(
                      title='Time Slept',
                      side='left',
))

fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities')
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities')


# fig.update_traces(connectgaps=False)

fig.show()


Here we can see a preference for lightly active activities over fairly or very active ones.

Again, most of these people sleep ~440 minutes a day (7 hours and 20 minutes).

But we also have many points below 300 minutes (5 hours), and that is not healthy.
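To quantify "many points below 300 minutes", the short nights can be counted directly. The sketch below uses made-up rows; the 300-minute threshold is the one mentioned above, and the 'ShortNight' column is a helper introduced only for this example.

```python
import pandas as pd

# Made-up sleep records for two users
sleep = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2],
    "TotalMinutesAsleep": [440, 280, 250, 290, 460],
})

SHORT_SLEEP_MINUTES = 300  # 5 hours

# True for every record below the threshold
sleep["ShortNight"] = sleep["TotalMinutesAsleep"] < SHORT_SLEEP_MINUTES

# Share of all records that are short, and count of short nights per user
share_short = sleep["ShortNight"].mean()
short_per_user = sleep.groupby("Id")["ShortNight"].sum()
print(f"Share of short nights: {share_short:.0%}")
print(short_per_user)
```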

In [41]:
# Density plots from the plot above: 

# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Lightly Activities'),
    row=1, col=1,
)


fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=800, width=1200, title_text="Activity Time vs Sleep Time (density)", hovermode="x unified")

# Horizontal legend above the plot, plus axis titles for the first subplot
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title='',
    ),
    xaxis=dict(title='Time spent in light activities (minutes)', range=[0, 450]),
    yaxis=dict(title='Time slept (minutes)', side='left'),
)

# Axis titles and ranges for the second and third subplots
fig.update_xaxes(row=1, col=2, title='Time spent in fairly active activities (minutes)', range=[-5.5, 55])
fig.update_xaxes(row=1, col=3, title='Time spent in very active activities (minutes)', range=[-10, 85])

# fig.update_traces(connectgaps=False)

fig.show()


Same as above, but as a density plot. It confirms the preference for light activities.

In [42]:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['LightActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Active Distance'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['ModeratelyActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Moderately Active Distance'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['VeryActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active Distance'),
    row=1, col=3
)

fig.update_layout(height=600, width=1200, title_text="Distance in activities vs Sleep Time", hovermode="x unified")

# Horizontal legend above the plot, plus axis titles for the first subplot
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title='',
    ),
    xaxis=dict(title='Distance in light activities'),
    yaxis=dict(title='Time slept (minutes)', side='left'),
)

# Axis titles for the second and third subplots
fig.update_xaxes(row=1, col=2, title='Distance in moderately active activities')
fig.update_xaxes(row=1, col=3, title='Distance in very active activities')


# fig.update_traces(connectgaps=False)

fig.show()

The pattern is the same as in the Activity Time vs Sleep Time plot.
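
The scatter plots give a visual impression only; a correlation coefficient per activity level would make the comparison with sleep time explicit. The sketch below uses the column names from the notebook, but the helper name `correlation_with` and the sample values are hypothetical and chosen only so the result is easy to check by eye.

```python
import pandas as pd


def correlation_with(df: pd.DataFrame, target: str, columns: list[str]) -> pd.Series:
    """Pearson correlation of each column in `columns` with the `target` column."""
    return df[columns].corrwith(df[target])


# Illustrative values only, not taken from the dataset.
sample = pd.DataFrame({
    "LightlyActiveMinutes": [100, 200, 300, 400],
    "VeryActiveMinutes": [40, 30, 20, 10],
    "TotalMinutesAsleep": [400, 420, 440, 460],
})
corr = correlation_with(
    sample, "TotalMinutesAsleep", ["LightlyActiveMinutes", "VeryActiveMinutes"]
)
print(corr)
```

Values close to 0 would indicate little linear relationship between that activity level and sleep time. With a sample this small, any coefficient should be read with caution.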

### Calories vs time in activities

In [43]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])

# Rename the date columns to a common name ('Day') so both frames can be merged on ['Id', 'Day']
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)


In [None]:
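# Merge the DataFrames on user and day.
# Note: if both frames carry a 'Calories' column, pandas suffixes them as 'Calories_x' and 'Calories_y'.
df_analysis_calories_activities = pd.merge(dailyActivity, caloriesDay, on = ['Id', 'Day'])
df_analysis_calories_activities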

In [45]:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=dailyActivity['LightlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Lightly Active'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=dailyActivity['FairlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=dailyActivity['VeryActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=600, width=1200, title_text="Activity Time vs Calories", hovermode="x unified")

# Horizontal legend above the plot, plus axis titles for the first subplot
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title='',
    ),
    xaxis=dict(title='Time spent in light activities (minutes)'),
    yaxis=dict(title='Calories', side='left'),
)

# Axis titles for the second and third subplots
fig.update_xaxes(row=1, col=2, title='Time spent in fairly active activities (minutes)')
fig.update_xaxes(row=1, col=3, title='Time spent in very active activities (minutes)')


# fig.update_traces(connectgaps=False)

fig.show()

### Average time by type of activity

In [46]:
# Bar chart: Average time by type of activity

data = {'labels': ['Lightly', 'Fairly', 'Very active'],
        'Mean time in activities': [np.average(dailyActivity['LightlyActiveMinutes']), np.average(dailyActivity['FairlyActiveMinutes']), np.average(dailyActivity['VeryActiveMinutes'])]}

fig = px.bar(data, x='labels', y='Mean time in activities', title = 'Average time by type of activity')

fig.update_layout(xaxis=dict(
                    title = 'Type of Activity',
                    ), 
                  yaxis=dict(
                    title='Mean time (minutes)',
                    side='left'                 
                    )
)

fig.show()

### Average Steps vs Body Mass Index

In [47]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
weightLogInfo = pd.read_csv("Data_Coursera_CaseStudy02/weightLogInfo_merged.csv", index_col=[0])



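# Average steps per user. numeric_only=True skips any non-numeric column (such as the date),
# which would otherwise raise a TypeError in pandas 2.0+.
dailySteps = dailySteps.groupby(by=['Id']).mean(numeric_only=True)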
dailySteps.reset_index(inplace=True)

df_analysis_steps_BMI = pd.merge(dailySteps, weightLogInfo, on = ['Id'])
df_analysis_steps_BMI


fig = px.scatter(x=df_analysis_steps_BMI['BMI'], y=df_analysis_steps_BMI['StepTotal'], title="BMI vs Average Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'BMI (Body Mass Index)',
                      ), 
                    yaxis=dict(
                      title='Average steps per day',
                      side='left',
))


fig.show()

# data = {'Status': ['Active', 'Not Active'],
#         'number of status': [ len(dailySteps[dailySteps['Status']=='Active']), len(dailySteps[dailySteps['Status']=='Not Active'])]}


# fig = px.pie(data, values='number of status', names='Status', title='Percentage of active users based on 10.000 steps daily')
# fig.show()

# print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")

We do not have much data, but we can presume that people with a BMI higher than 40 have more difficulty exercising.
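
Because a user can appear several times in the weight log, merging it directly with the per-user step averages can plot the same person more than once. A hedged sketch of one way to get exactly one point per user is shown below. The column names `Id`, `StepTotal` and `BMI` come from the notebook; the function name `steps_vs_bmi_per_user` and the sample values are hypothetical.

```python
import pandas as pd


def steps_vs_bmi_per_user(steps: pd.DataFrame, weight: pd.DataFrame) -> pd.DataFrame:
    """One row per user: mean StepTotal and mean BMI, for users present in both frames."""
    mean_steps = steps.groupby("Id", as_index=False)["StepTotal"].mean()
    mean_bmi = weight.groupby("Id", as_index=False)["BMI"].mean()
    return mean_steps.merge(mean_bmi, on="Id", how="inner")


# Illustrative values only, not taken from the dataset.
steps = pd.DataFrame({"Id": [1, 1, 2, 3], "StepTotal": [8000, 12000, 5000, 9000]})
weight = pd.DataFrame({"Id": [1, 1, 2], "BMI": [22.0, 24.0, 41.0]})
per_user = steps_vs_bmi_per_user(steps, weight)
print(per_user)
print(f"Users with both steps and BMI: {per_user['Id'].nunique()}")
```

In the notebook the files are read with `index_col=[0]`, so `Id` is the index there; calling `reset_index()` on each frame first would bring them into the shape this sketch assumes. Counting the users that survive the merge also shows how few points the BMI conclusion rests on.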

## Act phase

It is important to note that the data analyzed in this study comes from a small sample of users of another company's devices, not from Bellabeat clients, and therefore cannot be generalized to the entire Bellabeat client base. The sample size of 7 to 33 people is not large enough to assume that the observed behaviors are representative of the broader population. Any conclusions or insights drawn from this data should be treated with caution, and a larger sample should be gathered before making decisions based on it.
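
As a rough illustration of why 7 to 33 people is too few, the sketch below computes the margin of error of an estimated proportion using the normal approximation. The function name `margin_of_error` is hypothetical, `z = 1.96` corresponds to a 95% interval, and the approximation itself is crude at sample sizes this small, so the numbers indicate an order of magnitude only.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion p with n samples."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (7, 33):
    # p = 0.5 is the worst case (widest interval).
    print(f"n={n}: +/- {margin_of_error(0.5, n):.1%}")
```

Under these assumptions, a share measured on 33 users is uncertain by roughly 17 percentage points in either direction, and one measured on 7 users by roughly 37.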

### Insights from the data:

- Most of these users can't be considered active, with fewer than 10,000 steps per day;
  - The users want data about their health, and they use the gadgets for that. Even so, they do not manage to change their habits and become more active.

- Some users do not sleep well;
  - We could see a lot of data points below 300 minutes a day.

- Most of the users do not have a habit of exercising;
  - The average time spent in light activities may be related to their jobs, and the averages for the other activity levels are too low to be considered healthy.

- The most active time is from 17:00 to 20:00;
  - This could be when people leave work and go to the gym.

- People with a high BMI tend to exercise less;
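
The first insight above (most users average fewer than 10,000 steps per day) can be checked with a short calculation. This is a sketch under stated assumptions: `Id` and `StepTotal` are the column names used in the notebook, while the function name `share_of_users_below_goal` and the sample values are hypothetical.

```python
import pandas as pd


def share_of_users_below_goal(steps: pd.DataFrame, goal: int = 10_000) -> float:
    """Fraction of users whose mean daily StepTotal is below `goal`."""
    mean_steps_per_user = steps.groupby("Id")["StepTotal"].mean()
    return float((mean_steps_per_user < goal).mean())


# Illustrative values only, not taken from the dataset.
steps = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3, 3, 4, 4],
    "StepTotal": [4000, 6000, 11000, 13000, 9000, 9500, 7000, 12000],
})
share = share_of_users_below_goal(steps)
print(f"{share:.0%} of users average fewer than 10,000 steps per day")
```

Averaging per user first keeps users with many logged days from dominating the result.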


### Recommendations

Based on the trends and hypotheses found, I recommend creating new features for the integration of **Bellabeat Leaf** with the **Bellabeat app and/or Bellabeat membership**.

- New sleep tracking system:
  - To help users improve their sleep quality, the Bellabeat app could implement a notification system that sends reminders to users who have experienced poor sleep quality. These notifications could provide helpful tips and suggestions related to sleep hygiene, such as establishing a regular bedtime routine or creating a relaxing sleep environment. Additionally, the app could offer guided meditations or breathing exercises to help users relax before bed and improve the quality of their sleep.

- Game mode for daily step and activity-time tasks (with rewards):
  - Setting daily step goals can be an effective way to motivate Bellabeat users to walk more. By establishing a daily target, users can track their progress and see how close they are to achieving their goal. This can create a sense of accomplishment and provide a tangible incentive to stay active throughout the day. The Bellabeat app could provide users with personalized step targets based on their current activity levels, and offer rewards or incentives for achieving milestones or consistently meeting their goals.

- Social media share options on app:
  - Options to share each day's completed tasks should have an excellent impact and encourage people to stay active. Groups could be formed, creating an environment of competition and companionship in which users help each other.
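
The personalized step targets mentioned in the recommendations above could follow a rule as simple as "a little more than the recent average". The sketch below is purely hypothetical and is not an existing Bellabeat feature: the function name `next_step_target`, the 10% increase, the 10,000-step cap and the rounding to the nearest 100 are all assumptions made for illustration.

```python
from statistics import mean


def next_step_target(recent_daily_steps: list[int],
                     increase: float = 0.10,
                     cap: int = 10_000,
                     round_to: int = 100) -> int:
    """Suggest a daily step target slightly above the user's recent average.

    All defaults are hypothetical: a 10% increase over the recent mean,
    rounded to the nearest multiple of `round_to` and capped at `cap`.
    """
    if not recent_daily_steps:
        raise ValueError("need at least one day of step data")
    baseline = mean(recent_daily_steps)
    raised = baseline * (1 + increase)
    rounded = round(raised / round_to) * round_to
    return int(min(rounded, cap))


# Illustrative values only, not taken from the dataset.
print(next_step_target([4000, 5000, 6000]))
print(next_step_target([9500, 9800, 10200]))
```

Tying the target to each user's own baseline keeps the goal reachable for less active users. Users who already exceed the cap would need a different rule, since this sketch would suggest a target below their current average.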
  


In conclusion, I believe that more research and analysis are needed to better understand the specific needs and behaviors of Bellabeat users, so that the product can continue to be improved and tailored to them.



