# Case Study: How Can a Wellness Technology Company Play It Smart?

Bellabeat is a company that offers services and products focused on women’s wellness and health. They collect users’ data through wearable smart devices. Using available data from Fitbit, the current analysis will offer insights to guide Bellabeat in its marketing strategy. 

The following questions will direct this process to deliver effective guidelines for Bellabeat’s marketing team. 
* What are some trends in Fitbit smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat's marketing strategy?

The stakeholders of this analysis are Urška Sršen (Bellabeat’s cofounder and current Chief People Officer), Sandro Mur (Bellabeat’s cofounder and current Chief Executive Officer), and Bellabeat's marketing analytics team.


> #### 1. [Data Gathering](#data_gathering)
> #### 2. [Data Assessment](#data_assess)
> #### 3. [Data Cleaning](#data_clean)
> #### 4. [Analyzing and Visualizing Data](#data_analyze)
> #### 5. [Conclusion](#conclusion)

In [None]:
# This notebook was developed on Kaggle, where the R environment already includes the packages used here.
# If you need to install them yourself, run the following commands:
# install.packages('tidyverse')
# install.packages('visdat')
# install.packages('reshape2')

# Load packages
library(tidyverse)
library(lubridate) # mdy(), mdy_hms(), floor_date(); attached by tidyverse only from version 2.0.0
library(visdat)
library(reshape2)

<a id="data_gathering"></a>
___
## 1. Data Gathering

The dataset Fitbit Fitness Tracker Data used in this analysis is on [Kaggle](https://www.kaggle.com/datasets/arashnic/fitbit). It was made available by Mobius and comprises the data of around 30 Fitbit users. We will be working with daily activity, heart rate, and sleep data. 

In [None]:
# Load the data set into dataframes
daily_activity <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
hourly_cal <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourly_intensity <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourly_step <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
heart_rate <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

<a id="data_assess"></a>
___
## 2. Data Assessment

I will assess each data frame in order to point out the issues to be solved in the cleaning stage.

#### 2.1 Data Frame: daily_activity

The first one contains data on daily activities.

In [None]:
# Assess daily_activity data frame
head(daily_activity)

# Number of rows and columns
rows <- nrow(daily_activity)
cols <- ncol(daily_activity)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(daily_activity$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
daily_activity %>%  
  select(TotalSteps,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>%
  summary()

In [None]:
# Check for the number of zero-valued observations in the active minutes categories
daily_activity %>%  
  filter(VeryActiveMinutes == 0,
         FairlyActiveMinutes == 0,
         LightlyActiveMinutes == 0) %>%
  nrow()

#### 2.2 Data Frame: sleep_day

This data frame contains data on the users' daily amount of sleep.

In [None]:
# Assess sleep_day data frame
head(sleep_day)

# Number of rows and columns
rows <- nrow(sleep_day)
cols <- ncol(sleep_day)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(sleep_day$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
sleep_day %>%  
  select(TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

#### 2.3 Data Frame: hourly_cal

This data frame describes the users' amount of calories burned per hour.

In [None]:
# Assess hourly_cal data frame
head(hourly_cal)

# Number of rows and columns
rows <- nrow(hourly_cal)
cols <- ncol(hourly_cal)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(hourly_cal$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
hourly_cal %>%  
  select(Calories) %>%
  summary()

#### 2.4 Data Frame: hourly_intensity

This data frame contains data on the intensity of users' activities on an hourly basis.

In [None]:
# Assess hourly_intensity data frame
head(hourly_intensity)

# Number of rows and columns
rows <- nrow(hourly_intensity)
cols <- ncol(hourly_intensity)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(hourly_intensity$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
hourly_intensity %>%  
  select(TotalIntensity,
         AverageIntensity) %>%
  summary()

The column `AverageIntensity` presents the `TotalIntensity` divided by 60, i.e. it computes the activity intensity per minute.
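
The relationship between the two columns can be checked directly. The following is a minimal sketch on made-up values standing in for `hourly_intensity` (the name `toy_intensity` and its numbers are assumptions for illustration only); the same comparison could be run on the real data frame.

```r
# Toy data standing in for hourly_intensity (values are made up)
toy_intensity <- data.frame(
  TotalIntensity   = c(0, 20, 13, 180),
  AverageIntensity = c(0, 0.333333, 0.216667, 3)
)

# Largest absolute gap between the stored average and TotalIntensity / 60
max_gap <- max(abs(toy_intensity$AverageIntensity -
                   toy_intensity$TotalIntensity / 60))

# TRUE when the two columns agree up to rounding of the stored average
consistent <- max_gap < 1e-4
consistent
```

If the check holds on the real data, `AverageIntensity` carries no information beyond `TotalIntensity`.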

#### 2.5 Data Frame: hourly_step

This data frame contains data on the users' amount of steps per hour.

In [None]:
# Assess hourly_step data frame
head(hourly_step)

# Number of rows and columns
rows <- nrow(hourly_step)
cols <- ncol(hourly_step)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(hourly_step$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
hourly_step %>%  
  select(StepTotal) %>%
  summary()

#### 2.6 Data Frame: heart_rate

This data frame describes data on the users' heart rate.

In [None]:
# Assess heart_rate data frame
head(heart_rate)

# Number of rows and columns
rows <- nrow(heart_rate)
cols <- ncol(heart_rate)
cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

# Number of distinct Ids
distinct_ids <- n_distinct(heart_rate$Id)
cat("Number of distinct user id's: ", distinct_ids)

In [None]:
# Summary statistics
heart_rate %>%  
  select(Value) %>%
  summary()

#### 2.7 Data Issues 

> **Issue 1:** The `daily_activity` data frame contains 83 observations with zero values in most of their features, offering little insight into the current analysis.
>
> **Issue 2:** In all data frames, the columns containing date and time are stored as the `character` data type.
>
> **Issue 3:** The observations in the `heart_rate` data frame were recorded every few seconds. In this format, they cannot be compared directly with the other data frames, which present data on a daily or an hourly basis.
>
> **Issue 4:** The data are spread across multiple data frames when they could be combined into one hourly and one daily data frame.


<a id="data_clean"></a>
___
## 3. Data Cleaning

Here I will document the cleaning process using the define-code-test framework. First, I define how to solve the issue. Then, I apply the coding solution and test it to verify the results.

### Issue 1
The `daily_activity` data frame contains 83 observations with zero values in most of their features, offering little insight into the current analysis.

**Define:** drop these rows by keeping only the observations with a non-zero value in at least one of the active minutes columns.

**Code:**

In [None]:
daily_activity <- 
  daily_activity %>%  
    filter(VeryActiveMinutes != 0 |
           FairlyActiveMinutes != 0 |
           LightlyActiveMinutes != 0)

**Test:**

In [None]:
# Check for the number of zero-valued observations in the active minutes categories
daily_activity %>%  
  filter(VeryActiveMinutes == 0,
         FairlyActiveMinutes == 0,
         LightlyActiveMinutes == 0) %>%
  nrow()
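
The filter used in the code step is the logical complement of the one used in the test step: a row is kept when at least one active minutes column is non-zero, and dropped when all three are zero. A small sketch on hypothetical rows (the data frame `toy_activity` and its values are made up) illustrates that the two conditions split the rows without overlap.

```r
# Hypothetical toy rows standing in for daily_activity
toy_activity <- data.frame(
  VeryActiveMinutes    = c(0, 25, 0, 0),
  FairlyActiveMinutes  = c(0, 13, 0, 7),
  LightlyActiveMinutes = c(0, 328, 120, 0)
)

# Rows where every active minutes column is zero (the rows to drop)
all_zero <- with(toy_activity,
                 VeryActiveMinutes == 0 &
                 FairlyActiveMinutes == 0 &
                 LightlyActiveMinutes == 0)

# Rows where at least one active minutes column is non-zero (the rows to keep)
any_active <- with(toy_activity,
                   VeryActiveMinutes != 0 |
                   FairlyActiveMinutes != 0 |
                   LightlyActiveMinutes != 0)

# Every row falls into exactly one of the two groups
kept <- toy_activity[any_active, ]
nrow(kept)
```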

### Issue 2
In all data frames, the columns containing date and time are in the `character` data type.

**Define:** convert these columns from the character data type to date and datetime data types using the `mdy()` and `mdy_hms()` functions from the `lubridate` package.

**Code:**

In [None]:
# Convert from character to date data type column 'ActivityDate' from 'daily_activity' data frame
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)

# Rename 'ActivityDate' column to 'date'
names(daily_activity)[names(daily_activity) == 'ActivityDate'] <- 'date'

In [None]:
# Split the column 'SleepDay' from 'sleep_day' data frame and keep just the date part
# (map_chr() returns a character vector, whereas map() would return a list)
sleep_day$SleepDay <- map_chr(strsplit(sleep_day$SleepDay, split = " "), 1)

# Convert from character to date data type 'SleepDay' column and assign to 'date' column
sleep_day$date <- mdy(sleep_day$SleepDay)

# Drop SleepDay column
sleep_day$SleepDay <- NULL

In [None]:
# Convert from character to datetime data type 'ActivityHour' column from 'hourly_cal' data frame
hourly_cal$ActivityHour <- mdy_hms(hourly_cal$ActivityHour)

# Convert from character to datetime data type 'ActivityHour' column from 'hourly_intensity' data frame
hourly_intensity$ActivityHour <- mdy_hms(hourly_intensity$ActivityHour)

# Convert from character to datetime data type 'ActivityHour' column from 'hourly_step' data frame
hourly_step$ActivityHour <- mdy_hms(hourly_step$ActivityHour)

# Convert from character to datetime data type 'Time' column from 'heart_rate' data frame
heart_rate$Time <- mdy_hms(heart_rate$Time)

**Test:**

In [None]:
# Check the data type for the column 'date' of the 'daily_activity' data frame
glimpse(daily_activity$date)

# Check the data type for the column 'date' of the 'sleep_day' data frame
glimpse(sleep_day$date)

# Check the data type for the column 'ActivityHour' of the 'hourly_cal' data frame
glimpse(hourly_cal$ActivityHour)

# Check the data type for the column 'ActivityHour' of the 'hourly_intensity' data frame
glimpse(hourly_intensity$ActivityHour)

# Check the data type for the column 'ActivityHour' of the 'hourly_step' data frame
glimpse(hourly_step$ActivityHour)

# Check the data type for the column 'Time' of the 'heart_rate' data frame
glimpse(heart_rate$Time)
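
As a small illustration of the conversion applied for Issue 2, the sketch below parses a few made-up strings. The month/day/year layout with an AM/PM marker is an assumption about how the raw columns look; `raw_date` and `raw_datetime` are hypothetical names.

```r
library(lubridate)

# Made-up strings in a month/day/year layout (assumed to resemble the raw columns)
raw_date     <- "4/12/2016"
raw_datetime <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

parsed_date     <- mdy(raw_date)          # Date object
parsed_datetime <- mdy_hms(raw_datetime)  # POSIXct, UTC unless tz is given

class(parsed_date)
class(parsed_datetime)

# 12-hour clock values are converted to hours on a 0-23 scale
hour(parsed_datetime)
```

Checking the class and a few parsed values, rather than only glimpsing the column, helps catch strings that were silently parsed as `NA`.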

### Issue 3
The observations in the `heart_rate` data frame were recorded every few seconds. In this format, they cannot be compared directly with the other data frames, which present data on a daily or an hourly basis.

**Define:** group the observations by hour and compute the mean heart rate for each hour.

**Code:**

In [None]:
# Create a data frame with the users' mean heart rate per hour
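hourly_hrate <-
  heart_rate %>%
  group_by(Id, time = floor_date(Time, '1 hour')) %>%
  # .groups = 'drop' returns an ungrouped data frame and silences the regrouping message
  summarize(mean_hrate = round(mean(Value)), .groups = 'drop')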

**Test:**

In [None]:
# The 'mean_hrate' column contains the average heart rate per hour
head(hourly_hrate)
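
To make the hourly aggregation concrete, here is a minimal sketch on three hypothetical readings for a single user (the data frame `toy_heart` and its values are made up). `floor_date()` maps each timestamp to the start of its hour, so readings from the same hour end up in the same group.

```r
library(dplyr)
library(lubridate)

# Hypothetical readings for one user (not taken from the dataset)
toy_heart <- data.frame(
  Id    = c(1, 1, 1),
  Time  = as.POSIXct(c("2016-04-12 07:21:00",
                       "2016-04-12 07:21:05",
                       "2016-04-12 08:00:10"), tz = "UTC"),
  Value = c(96, 102, 88)
)

toy_hourly <- toy_heart %>%
  group_by(Id, time = floor_date(Time, "hour")) %>%  # 07:21:05 becomes 07:00:00
  summarize(mean_hrate = round(mean(Value)), .groups = "drop")

toy_hourly
```

The two readings in the 07:00 hour collapse into one row with their rounded mean, and the single 08:00 reading stays as it is.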

### Issue 4
The data are spread across multiple data frames when they could be combined into one hourly and one daily data frame.

**Define:** combine the `daily_activity` and `sleep_day` data frames. Make another combination containing the `hourly_cal`, `hourly_intensity`, `hourly_step` and `hourly_hrate` data frames, the last one being the hourly mean heart rate computed from `heart_rate`.

**Code:**

In [None]:
# Combine 'daily_activity' and 'sleep_day' data frames
daily_combined <- left_join(daily_activity, sleep_day, by=c("Id"="Id","date"="date"), multiple = "all")

# Drop duplicates
daily_combined <- unique(daily_combined)

In [None]:
# Merge 'hourly_cal', 'hourly_intensity' and 'hourly_step' data frames
hourly_combined <- list(hourly_cal, hourly_intensity, hourly_step) 
hourly_combined <- hourly_combined %>% reduce(inner_join, by= c("Id","ActivityHour"))

# Merge the resulting data frame 'hourly_combined' with the 'hourly_hrate'
hourly_combined <- left_join(hourly_combined, hourly_hrate, by=c("Id"="Id","ActivityHour"="time"))

**Test:**

In [None]:
head(daily_combined)

In [None]:
head(hourly_combined)

<a id="data_analyze"></a>
___
## 4. Analyzing and Visualizing Data

As the charts below show, **the `daily_combined` data frame has missing values** in the columns related to sleep data. This happens because the data frames used to build it do not cover the same set of users: the `daily_activity` data frame registers 33 distinct Ids, while `sleep_day` has only 24.

**A similar condition affects the `hourly_combined` data frame.** While the `hourly_cal`, `hourly_intensity` and `hourly_step` data frames all contain the same number of user Ids (33), `hourly_hrate` has data for only 14 users. This results in an even higher number of missing values after merging.

I will keep these gaps in mind when analyzing the relationships among the variables in the data frames. I decided not to drop the rows with missing values because part of my analysis focuses on comparisons that do not involve sleep data.

In [None]:
# Missing values in black
vis_miss(daily_combined)

# Missing values in black
vis_miss(hourly_combined)
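
The way a mismatch in user Ids turns into missing values can be reproduced on toy data. This is only a sketch: the tables `toy_daily` and `toy_sleep` and their values are hypothetical, and in the real data a user may also lack sleep records for individual days.

```r
library(dplyr)

# Hypothetical toy tables: three users with activity data, two with sleep data
toy_daily <- data.frame(Id = c(1, 2, 3), TotalSteps = c(9000, 4500, 12000))
toy_sleep <- data.frame(Id = c(1, 3), TotalMinutesAsleep = c(420, 385))

# left_join() keeps every activity row and fills unmatched sleep columns with NA
toy_combined <- left_join(toy_daily, toy_sleep, by = "Id")

# Number of rows without sleep data
missing_sleep <- sum(is.na(toy_combined$TotalMinutesAsleep))

# Ids that appear in the activity table but not in the sleep table
ids_without_sleep <- anti_join(toy_daily, toy_sleep, by = "Id")$Id

toy_combined
```

An `anti_join()` like the one above is a quick way to list which users are responsible for the gaps before deciding whether to keep them.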

However, in both combined data frames, I will drop columns that consist mostly of zero values or that won't be useful in my analysis.

In [None]:
# Drop columns in the 'daily_combined' data frame
daily_combined[,c('TrackerDistance', 'LoggedActivitiesDistance',
                  'VeryActiveDistance', 'ModeratelyActiveDistance',
                  'LightActiveDistance', 'SedentaryActiveDistance',
                  'TotalSleepRecords')] <- list(NULL)

# Check if columns were dropped
glimpse(daily_combined)

In [None]:
# Drop columns in the 'hourly_combined' data frame
hourly_combined$AverageIntensity <- NULL

# Check if columns were dropped
glimpse(hourly_combined)

### 4.1 Exploring 'daily_combined' data frame

Here, I'm going to explore the daily data relationships. The first two questions compare the sleep data with sedentary minutes and steps taken per day. The third works on the relationship between calories burned and the different levels of activity. 

#### 4.1.1 How does the number of steps taken correlate with the amount of sleep?

In [None]:
# Suppress plot warnings
options(warn=-1)

# Correlation between 'TotalMinutesAsleep' and 'TotalSteps'
cor.test(daily_combined$TotalSteps, daily_combined$TotalMinutesAsleep, method = "pearson")

# Change plot size
options(repr.plot.width =12, repr.plot.height =8)

# Scatterplot comparing number of steps and amount of sleep
ggplot(data=daily_combined, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point() + ylim(200, 600) + xlim(0, 20000) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(title="Minutes Asleep by Total Steps",
       subtitle = "Relationship between steps taken and amount of sleep per day",
       x="Daily Steps Taken", y="Minutes Asleep")

> From the plot above and the result of the Pearson's Correlation test, we found an almost **negligible negative correlation** between the number of steps you take a day and the amount of time you sleep.
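
Because `daily_combined` contains missing sleep values, it helps to know that `cor.test()` only uses the complete pairs of observations. The sketch below shows this on made-up numbers (`toy_steps` and `toy_asleep` are hypothetical vectors, with `NA` standing for days without a sleep record).

```r
# Made-up values; NA marks days without a sleep record
toy_steps  <- c(1000, 2000, 3000, 4000, 5000, 6000)
toy_asleep <- c(480, NA, 455, 430, NA, 400)

toy_test <- cor.test(toy_steps, toy_asleep, method = "pearson")

# Number of pairs where both values are present
complete_pairs <- sum(complete.cases(toy_steps, toy_asleep))

# Degrees of freedom of the test: complete pairs minus 2
unname(toy_test$parameter)
```

Comparing the reported degrees of freedom with the number of rows is a simple way to see how many observations actually entered a test.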

<a id="sedentary-sleep"></a>
#### 4.1.2 Does a higher number of sedentary minutes translate into a lower amount of sleep? 

In [None]:
# Correlation between 'TotalMinutesAsleep' and 'SedentaryMinutes'
cor.test(daily_combined$SedentaryMinutes, daily_combined$TotalMinutesAsleep, method = "pearson")

# Scatterplot comparing sedentary minutes and amount of sleep
ggplot(data=daily_combined, aes(x = SedentaryMinutes, y = TotalMinutesAsleep)) +
  geom_point() + xlim(0, 1300) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(title="Minutes Asleep by Sedentary Minutes",
       subtitle = "Comparing minutes spent on sedentary activity and amount of sleep per day",
        x="Sedentary Minutes",
        y="Minutes Asleep") +
  geom_smooth(method = "lm")

In [None]:
# Create a column with the time taken to fall asleep by subtracting the time spent asleep from the time in bed
daily_combined$timeTakenToSleep <- daily_combined$TotalTimeInBed - daily_combined$TotalMinutesAsleep

# Correlation between 'timeTakenToSleep' and 'SedentaryMinutes'
cor.test(daily_combined$SedentaryMinutes, daily_combined$timeTakenToSleep, method = "pearson")

# Scatterplot comparing sedentary minutes and amount of time taken to sleep
ggplot(data=daily_combined, aes(x = SedentaryMinutes, y = timeTakenToSleep)) +
  geom_point() + xlim(0, 1300) + ylim(0, 60) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(title="Time Taken to Sleep by Sedentary Minutes",
       subtitle = "Comparing minutes spent on sedentary activity and time taken to sleep",
        x="Sedentary Minutes",
        y="Time taken to fall asleep (minutes)") +
  geom_smooth(method = "lm")

> Taking the plots above and the Pearson correlation tests into account, the time spent in sedentary activity has a **moderate negative correlation with the amount of sleep** a user gets per day. On the other hand, we see a **very weak negative correlation** between the number of sedentary minutes and **the time taken to fall asleep**.

#### 4.1.3 How active does a user need to be to burn more calories? Does a higher proportion of very active minutes translate into more calories burned? Or are lightly active minutes enough to achieve the same results?

In [None]:
# Correlation between 'LightlyActiveMinutes' and 'Calories'
cor.test(daily_combined$LightlyActiveMinutes, daily_combined$Calories, method = "pearson")

# Correlation between 'FairlyActiveMinutes' and 'Calories'
cor.test(daily_combined$FairlyActiveMinutes, daily_combined$Calories, method = "pearson")

# Correlation between 'VeryActiveMinutes' and 'Calories'
cor.test(daily_combined$VeryActiveMinutes, daily_combined$Calories, method = "pearson")

# Reshape the data into long format
daily_long <- melt(daily_combined, id.vars = "Calories",
                  measure.vars = c("LightlyActiveMinutes", "FairlyActiveMinutes", "VeryActiveMinutes"))

# Create the plot
ggplot(daily_long, aes(value, Calories)) +
  geom_point(alpha = 0.4) +
  facet_grid(variable ~ ., scales = "free_x") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(title = "Calories Burned by Activity Level",
       subtitle = "Relationship between time spent in each activity level and calories burned daily",
       x = "Activity Level (minutes)", y = "Calories Burned") +
  geom_smooth(method = "lm")

<a id="question-activity-level"></a>
> Looking at the test results above, all three categories of active minutes correlate positively with the number of calories burned. However, **only the very active minutes seem to have a significant and strong relationship.** The other two present a very weak correlation.
>
> To better understand how the intensity of the active minutes might affect the `Calories` variable, I decided to compute the proportion of each level of active minutes. In other words, I want to know how a user can maximize calorie burn; for instance, whether their time is better spent on more intense activities than on light ones. First, I will compute the total active minutes, and then I will divide each category by this total. I am leaving out the minutes spent in sedentary activity and time in bed because they usually make up more than half of a day, which makes active minutes look insignificant relative to the overall daily activity.

In [None]:
# Get the total active minutes into a column
daily_combined$ActiveMinutes <- daily_combined$LightlyActiveMinutes +
                                daily_combined$FairlyActiveMinutes +
                                daily_combined$VeryActiveMinutes

# Get the proportion in % of Very Active Minutes to the total of Active Minutes
daily_combined$PropVeryActive <- round(((daily_combined$VeryActiveMinutes / 
                                         daily_combined$ActiveMinutes) * 100), digits = 2)

# Get the proportion in % of Fairly Active Minutes to the total of Active Minutes
daily_combined$PropFairlyActive <- round(((daily_combined$FairlyActiveMinutes / 
                                           daily_combined$ActiveMinutes) * 100), digits = 2)

# Get the proportion in % of Lightly Active Minutes to the total of Active Minutes
daily_combined$PropLightlyActive <- round(((daily_combined$LightlyActiveMinutes / 
                                            daily_combined$ActiveMinutes) * 100), digits = 2)
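
As a worked example of the proportions computed above, consider a hypothetical day with 200 lightly active, 30 fairly active and 20 very active minutes (these numbers are made up for illustration).

```r
# Hypothetical day: 200 lightly, 30 fairly and 20 very active minutes
light  <- 200
fairly <- 30
very   <- 20

active_total <- light + fairly + very  # 250 active minutes

# Share of each level in the total active time, in %
prop_light  <- round(light  / active_total * 100, digits = 2)  # 80
prop_fairly <- round(fairly / active_total * 100, digits = 2)  # 12
prop_very   <- round(very   / active_total * 100, digits = 2)  # 8

# The three shares add up to 100
prop_light + prop_fairly + prop_very
```

Since the three shares always add up to 100 (up to rounding), a higher share for one level necessarily means a lower combined share for the other two, which is worth keeping in mind when reading the correlations that follow.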

In [None]:
# Correlation between 'PropLightlyActive' and 'Calories'
cor.test(daily_combined$PropLightlyActive, daily_combined$Calories, method = "pearson")

# Correlation between 'PropFairlyActive' and 'Calories'
cor.test(daily_combined$PropFairlyActive, daily_combined$Calories, method = "pearson")

# Correlation between 'PropVeryActive' and 'Calories'
cor.test(daily_combined$PropVeryActive, daily_combined$Calories, method = "pearson")

# Reshape the data into long format
daily_long <- melt(daily_combined, id.vars = "Calories",
                  measure.vars = c("PropLightlyActive", "PropFairlyActive", "PropVeryActive"))

# Create the plot
ggplot(daily_long, aes(value, Calories)) +
  geom_point(alpha = 0.4) +
  facet_grid(variable ~ ., scales = "free_x") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(title = "Calories Burned by Activity Level",
       subtitle = "Relationship between the daily proportion of each activity level and calories burned",
       x = "Proportion of Activity Level (%)", y = "Calories Burned") +
  geom_smooth(method = "lm")

> Activities with a light level of intensity make up most of the daily active minutes. Based on the correlation test results and the scatterplots, the proportion of fairly active minutes has an almost negligible relationship with the number of calories burned daily.
>
> In addition, a higher proportion of time spent in light activity doesn't seem to translate into more calories burned. On the contrary, the relationship is moderately negative.
>
> On the other hand, **the only significant positive correlation is with the proportion of very active minutes. The higher the proportion of this activity level in a day, the higher the calories burned.**

### 4.2 Exploring 'hourly_combined' data frame

Taking a more granular approach, we are going to analyze the data on an hourly basis. This might reveal additional insights into users' behavior over the course of a day.

#### 4.2.1 Considering an hourly basis, does a higher activity intensity result in more calories burned? What about more steps taken?

In [None]:
# Correlation between 'Calories' and 'TotalIntensity'
cor.test(hourly_combined$TotalIntensity, hourly_combined$Calories, method = "pearson")

# Scatterplot comparing Calories burned and the Intensity of user's activity
ggplot(data=hourly_combined, aes(x = TotalIntensity, y = Calories)) +
  geom_point(alpha = 0.3) + ylim(0, 600) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size = 20, face = "bold")) +
  labs(title="Calories Burned per Hour by Activity Intensity",
       x="Intensity",
       y="Calories") +
  geom_smooth(method = "lm")

> The correlation test and the scatterplot indicate that **the intensity level of the activity has a strong positive relationship with the number of calories burned per hour.**

In [None]:
# Correlation between 'Calories' and 'StepTotal'
cor.test(hourly_combined$StepTotal, hourly_combined$Calories, method = "pearson")

# Scatterplot comparing Calories burned and the number of steps taken per hour
ggplot(data=hourly_combined, aes(x = StepTotal, y = Calories)) +
  geom_point(alpha = 0.8) + xlim(0, 7500) + ylim(0, 600) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title = element_text(size=18),
        plot.title = element_text(size=20, face="bold")) +
  labs(title="Calories Burned per Hour by Number of Steps Taken",
        x="Steps",
        y="Calories") +
  geom_smooth(method = "lm")

> **When it comes to the number of steps taken per hour, we again find a strong positive correlation with the calories burned**, even if it is not as strong as the correlation with intensity.

#### 4.2.2 What is the correlation between calories burned and mean heart rate?

In [None]:
# Correlation between 'Calories' and 'mean_hrate'
cor.test(hourly_combined$mean_hrate, hourly_combined$Calories, method = "pearson")

# Scatterplot comparing Calories burned and the average heart rate per hour
ggplot(data=hourly_combined, aes(x = mean_hrate, y = Calories, col = TotalIntensity)) +
  geom_point(alpha = 0.7) + ylim(0, 600) + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title = element_text(size=18),
        plot.title = element_text(size=20, face="bold"),
        legend.title=element_text(size=14),
        legend.text=element_text(size=14)) +
  labs(title="Calories Burned by Mean Heart Rate and Intensity Level",
        x="Mean Heart Rate",
        y="Calories", 
        color = 'Activity Intensity Level') +
  geom_smooth(method = "lm")

> Although there is a considerable positive relationship between the calories burned and the mean heart rate, the intensity of the activity seems to play a more important role. Following the bottom part of the scatterplot, we notice that even as the heart rate increases, the number of calories burned and the activity intensity do not change significantly. Instead, the increase in the number of calories burned follows an increase in the activity level. As we have seen before, intensity is strongly correlated with calories.

<a id="active-time-peak"></a>
#### 4.2.3 Over the course of the day, when are Fitbit users most active and when do they burn the most calories?

In [None]:
# Create time column
hourly_combined$time <- format(hourly_combined$ActivityHour, format = "%H")

# Change plot size
options(repr.plot.width =14, repr.plot.height =8)

# Box plot of the distribution of steps taken for each hour of the day
hourly_combined %>% 
  ggplot(aes(x = time, y = StepTotal)) +
  geom_boxplot(outlier.alpha = 0.2) + ylim(0, 2000)  +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title = element_text(size=18),
        plot.title = element_text(size=20, face="bold")) +
  labs(title="Steps Taken per Hour",
        x="Time",
        y="Steps")

# Box plot of the distribution of calories burned for each hour of the day
hourly_combined %>% 
  ggplot(aes(x = time, y = Calories)) +
  geom_boxplot(outlier.alpha = 0.2) + ylim(30, 300) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title = element_text(size=18),
        plot.title = element_text(size=20, face="bold")) +
  labs(title="Calories Burned per Hour",
        x="Time",
        y="Calories")

> Both sets of box plots reveal that **user activity peaks around 18:00 (6 p.m.)**, which is also when the typical number of calories burned per hour is highest. One possible reason is that this is when people usually finish their work shifts and exercise.
>
> It is also worth pointing out that **around 13:00 (1 p.m.), user activity peaks again, though at a slightly lower level.** Although we would need more data to understand this trend, it might reflect the fact that some people prefer to exercise at lunchtime.
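
The peaks read off the box plots could also be checked numerically by ranking the hours by their median values. The sketch below does this on hypothetical records (the data frame `toy_steps_hourly` and its values are made up); the hour is stored as a two-digit string, like the `time` column created above.

```r
library(dplyr)

# Hypothetical hourly records (hour as a two-digit string, like the 'time' column)
toy_steps_hourly <- data.frame(
  time      = c("07", "07", "13", "13", "18", "18"),
  StepTotal = c(150, 250, 500, 700, 800, 1000)
)

# Median steps for each hour of the day, highest first
steps_by_hour <- toy_steps_hourly %>%
  group_by(time) %>%
  summarize(median_steps = median(StepTotal), .groups = "drop") %>%
  arrange(desc(median_steps))

# Hour with the highest median number of steps
peak_hour <- steps_by_hour$time[1]
peak_hour
```

The median is used here because it is the center line drawn by `geom_boxplot()`, so the ranking matches what the plots display.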

<a id="conclusion"></a>
___
## 5. Conclusion

Here I present the main insights of this analysis with recommendations to address each of them. 

* In the analysis of the daily data, the [charts](#question-activity-level) show that spending more time in intense activities (for example, taking more steps in less time) is associated with more calories burned, without the need to spend a long time in light activities. In other words, **users can burn more calories in less time if they do more intense exercise.** 
  * <span style='color:gray'>**Recommendation 1:** using app notifications, we can encourage users to set aside a short part of their day for 10 to 30 very active minutes. There is no need for endless workout routines.</span>


* **Users are usually [most active](#active-time-peak) at around 18:00 (6 p.m.).** A possible reason might be people ending their work shifts and going to the gym or for a run. 
  * <span style='color:gray'>**Recommendation 2:** with that in mind, we could send notifications around 17:00 (5 p.m.) to remind users of the few minutes needed to burn approximately the number of calories in their daily goal, and to motivate them to do so.</span>
  * <span style='color:gray'>**Recommendation 3:** another strategy to engage users is sending notifications or giving them more points for becoming more active at alternative times during the day, e.g., running in the morning.</span>
  

* **When it comes to [sleep routines](#sedentary-sleep), our data suggest that the more time users spend in sedentary activity, the less sleep they get.** At this point, it’s important to state that not all users have sleep data. In addition, minutes spent asleep are often counted within sedentary minutes. Access to more data about sleeping habits could offer more insight into these findings.
  * <span style='color:gray'>**Recommendation 4:** besides collecting additional data on sleeping habits, we could provide users with a way of recording how they spend their sedentary minutes. These details would support the tools we employ to motivate users to become more active.</span>


* Finally, we need to **keep track of users’ engagement with a healthy lifestyle. We want to know what drives them to pursue their goals and how the initial motivation evolves over the weeks.** That way, we can trigger mechanisms to keep them motivated whenever the number of steps taken or the level of activity drops from one day to the next.
  * <span style='color:gray'>**Recommendation 5:** to deliver this, we can offer users a journal where they record their daily or weekly activities, together with a point system.</span>
