# Asking the right questions

### Business Task: *How can we use trends in smart device usage to produce actionable insights that guide Bellabeat's marketing efforts?*

#### About Bellabeat 
Bellabeat is a wearable smart device company co-founded by Urška Sršen and Sando Mur. Their aim is to create fashionable smart fitness products that integrate seamlessly into women's lifestyles. Currently, their products collect data on sleep, stress, activity, and reproductive health. This analysis is tailored towards marketing Bellabeat's wearable smart device 'Leaf' and its accompanying app. 

#### Analysis Objectives:

1. Find Patterns
    + What features encourage users to frequently/consistently use their smart device?
    + Based on these features, which customer segments should Bellabeat's marketing team target?
2. Discover Connections
    + How do daily steps, sleep, activity, intensity, and calories burned correlate with one another?
3. Make Educated Predictions 
    + How can Bellabeat use information about key features and correlated factors to inform their marketing strategy?
    
#### Accessing and Sourcing Reliable Data 
I used the [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit), released under the CC0: Public Domain license (dataset made available by [Möbius](https://www.kaggle.com/arashnic)). This dataset includes 30 individuals' data over 31 days. 

#### Installing and loading necessary packages and libraries

I installed necessary packages for manipulating, processing, analyzing, and visualizing FitBit usage data.

In [None]:
#core R packages for visualization and manipulation
library('tidyverse')
#summary statistics
library('skimr')
#examine and clean data
library('janitor')
#formatting date/month/year
library('lubridate')

#### Loading CSV files 
I decided to explore these data sets and renamed them for consistency: 
  * daily_activity <- dailyActivity_merged.csv
  * hourly_intensities <- hourlyIntensities_merged.csv
  * daily_intensities <- dailyIntensities_merged.csv
  * daily_sleep <- sleepDay_merged.csv

In [None]:
daily_activity <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
hourly_intensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
daily_intensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
daily_sleep <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

# Exploring Raw Data 
Previewing daily_activity data and identifying column names in daily_activity.

In [None]:
head(daily_activity)
colnames(daily_activity)

Previewing daily_sleep data and identifying column names in daily_sleep.

In [None]:
head(daily_sleep)
colnames(daily_sleep)

#### Data Frame Summary Statistics 
Checking to see how many distinct participants are in each data frame.

In [None]:
n_distinct(daily_activity$Id)
n_distinct(daily_intensities$Id)
n_distinct(daily_sleep$Id)
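If the counts differ between data frames, it helps to know which participants are missing from the smaller one before merging. Below is a minimal base R sketch of that check. The Ids are made up for illustration and are not taken from the dataset.

```r
# Hypothetical participant Ids, for illustration only (not from the dataset)
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103)

# Ids that logged activity but never logged sleep
missing_sleep <- setdiff(activity_ids, sleep_ids)
missing_sleep

# Share of activity participants who also have sleep data
sleep_coverage <- mean(activity_ids %in% sleep_ids)
sleep_coverage
```

With the real data, the same idea would apply to `unique(daily_activity$Id)` and `unique(daily_sleep$Id)`.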

Compiling key summary statistics for the daily_activity data frame.

In [None]:
daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         LightlyActiveMinutes,
         FairlyActiveMinutes,
         VeryActiveMinutes,
         Calories) %>%
  summary()

Compiling key summary statistics for the daily_sleep data frame.

In [None]:
daily_sleep %>%
      select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

# Preparing Data Frames for Analysis 
#### Defining Variables 
**active_mins:** Based on [Fitbit's guide to Active Minutes](https://help.fitbit.com/articles/en_US/Help_article/1379.htm) and its definitions of lightly, fairly, and very active minutes, I devised a weighted sum (0.5 × lightly active + 1 × fairly active + 1.75 × very active minutes) to holistically represent 'active' minutes in a day. 

In [None]:
summary_activity <- daily_activity %>% 
  group_by(Id) %>% #group by participant so each summary is computed per Id
  summarize(avg_sedentary_mins = mean(SedentaryMinutes), #compute mean activity levels
            avg_lightly_active_mins = mean(LightlyActiveMinutes),
            avg_fairly_active_mins = mean(FairlyActiveMinutes),
            avg_very_active_mins = mean(VeryActiveMinutes),
            #average daily weighted active minutes (one value per participant)
            active_mins = mean((LightlyActiveMinutes*0.5)+
                               (FairlyActiveMinutes)+(VeryActiveMinutes*1.75))
            ) 
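To make the weighting concrete, here is a small worked example of the weighted sum for a single day. The helper name `weighted_active_mins` and the example minutes are hypothetical. Only the weights (0.5, 1, and 1.75) come from the formula above.

```r
# Weights mirror the active_mins formula: lightly 0.5, fairly 1, very 1.75
weighted_active_mins <- function(lightly, fairly, very) {
  (lightly * 0.5) + fairly + (very * 1.75)
}

# Hypothetical day: 200 lightly, 20 fairly, and 10 very active minutes
example_day <- weighted_active_mins(lightly = 200, fairly = 20, very = 10)
example_day  # 100 + 20 + 17.5 = 137.5
```

A value of 137.5 would fall between the 82.57 and 147.01 cut-offs used in the next step, so this hypothetical day would count as 'Somewhat Active'.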

In [None]:
summary_activity$activity_level <- case_when( #categorizing users based on activity levels
  summary_activity$active_mins >= 206.06 ~ 'Very Active',
  summary_activity$active_mins >= 147.01 ~ 'Fairly Active',
  summary_activity$active_mins >= 82.57 ~ 'Somewhat Active',
  summary_activity$active_mins < 82.57 ~ 'Sedentary'
)

#### Defining Variables 
Here I calculated the sum, average, and number of sleep data entries each participant recorded. I then calculated sleep efficiency for each sleep entry: the percentage of time in bed that was actually spent asleep.  

**avg_time_in_bed** = the average time spent in bed per night, asleep and awake combined; comparing it with time asleep gives an indication of how easy or difficult it is for an individual to fall asleep 

In [None]:
# Keep one row per sleep entry: mutate() adds each participant's totals and means
# alongside the per-entry sleep efficiency (summarize() expects one value per group)
summary_sleep <- daily_sleep %>%  
  group_by(Id) %>% 
  mutate(sum_sleep_mins = sum(TotalMinutesAsleep), 
         avg_sleep_mins = mean(TotalMinutesAsleep), 
         avg_time_in_bed = mean(TotalTimeInBed),
         number_sleep_entries = n(),
         sleep_efficiency = (TotalMinutesAsleep/TotalTimeInBed)*100 
         ) %>%
  ungroup() %>%
  select(Id, sum_sleep_mins, avg_sleep_mins, avg_time_in_bed,
         number_sleep_entries, sleep_efficiency)
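Sleep efficiency can be computed per sleep entry or once per participant from the totals, and the two views answer slightly different questions. The sketch below contrasts them in base R. The minutes are invented for illustration and do not come from the dataset.

```r
# Hypothetical sleep entries for one participant (minutes), for illustration only
minutes_asleep <- c(420, 300, 480)
time_in_bed    <- c(450, 400, 500)

# Efficiency of each individual sleep entry, in percent
entry_efficiency <- (minutes_asleep / time_in_bed) * 100

# Participant-level efficiency computed from the totals
overall_efficiency <- sum(minutes_asleep) / sum(time_in_bed) * 100

round(entry_efficiency, 1)
round(overall_efficiency, 1)
```

The per-entry view keeps night-to-night variation visible, while the participant-level figure gives a single number per person.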

Based on [medical research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302758/), I established a broad set of criteria to classify different recovery levels based on sleep efficiency.  

**Sleep efficiency** impacts how well-rested an individual feels; high sleep efficiency leads to deeper, higher quality sleep while low sleep efficiency is associated with tiredness, restlessness, and sleep disorders.  

Based on sleep efficiency, I categorized individuals into different **recovery levels**. Research suggests that individuals with a sleep efficiency close to 100% are likely sleep deprived. Hence I categorized this as: **'Possible Sleep Deficit'**.

In [None]:
summary_sleep$recovery_level <- case_when(
  summary_sleep$sleep_efficiency >= 95 ~ 'Possible Sleep Deficit',
  summary_sleep$sleep_efficiency >= 85 ~ 'Optimal',
  summary_sleep$sleep_efficiency < 85 ~ 'Possible Sleep Disorder'
)
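The same thresholds can also be expressed with base R's `cut()`, which makes the boundary behaviour explicit. This is only an alternative sketch: the function name `classify_recovery` and the test values are hypothetical, while the 85 and 95 cut-offs are the ones used above.

```r
# Thresholds copied from the case_when() call; the function name is hypothetical
classify_recovery <- function(efficiency_pct) {
  cut(efficiency_pct,
      breaks = c(-Inf, 85, 95, Inf),
      labels = c("Possible Sleep Disorder", "Optimal", "Possible Sleep Deficit"),
      right = FALSE)  # intervals are [lower, upper), so 85 and 95 fall in the higher band
}

levels_out <- as.character(classify_recovery(c(80, 85, 93.3, 95, 99)))
levels_out
```

Because `cut()` returns a factor, the categories also keep a fixed order when used in plots.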

Based on [sleep guidelines](https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html) by the CDC, I defined another set of criteria to classify how well rested an individual was based on their amount of sleep.

In [None]:
summary_sleep$rested_level = case_when(
  summary_sleep$avg_sleep_mins >= 450 ~ 'Well Rested',
  summary_sleep$avg_sleep_mins >= 390 ~ 'Moderately Rested',
  summary_sleep$avg_sleep_mins >= 330 ~ 'Poorly Rested',
  summary_sleep$avg_sleep_mins < 330 ~ 'Very Poorly Rested'
)

Merged sleep and activity dataframes to prepare for visualization.

In [None]:
sleep_activity_merged <- merge(summary_sleep, summary_activity, all = TRUE) # merging sleep and activity data frames to prepare for analysis 

Getting rid of duplicates in sleep_activity_merged.

In [None]:
sleep_activity_merged <- sleep_activity_merged[!duplicated(sleep_activity_merged), ]
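Before dropping duplicates, it can be useful to count how many rows are affected. Here is a minimal sketch on a toy data frame. The rows are invented for illustration.

```r
# Hypothetical merged rows; the third row is an exact copy of the second
toy_merged <- data.frame(
  Id = c(1, 2, 2, 3),
  rested_level = c("Well Rested", "Poorly Rested", "Poorly Rested", "Well Rested"),
  stringsAsFactors = FALSE
)

n_duplicates <- sum(duplicated(toy_merged))            # rows that repeat an earlier row
deduplicated <- toy_merged[!duplicated(toy_merged), ]  # keep first occurrence only

n_duplicates
nrow(deduplicated)
```

A large duplicate count after a merge is often a hint that the join key matched more rows than intended.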

# Analyzing and Visualizing Data 

In [None]:
ggplot(data = summary_sleep) +
  geom_smooth(mapping = aes(x = sleep_efficiency, y = avg_sleep_mins)) +
  labs(title="Sleep Efficiency vs. Average Sleep (mins)") + 
  xlab("Sleep Efficiency (%)") + ylab("Average Sleep (mins)")

We can see that, generally, the more sleep individuals get, the higher their sleep efficiency. There is, however, a dip for individuals with sleep efficiency higher than ~97%. Let's explore this dip in more detail. 

In [None]:
ggplot(data = summary_sleep) +
  geom_smooth(mapping = aes(x = avg_sleep_mins, y = avg_time_in_bed)) +
  geom_point(mapping = aes(x = avg_sleep_mins, y = avg_time_in_bed)) + 
  labs(title = "Time Asleep vs Time in Bed") +
  xlab("Average Sleep (mins)") + ylab("Average Time in Bed (mins)") #x is time asleep, y is time in bed

Based on our visual we can see there is some correlation between time asleep and time spent in bed. 
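One way to put a number on "some correlation" is the Pearson correlation coefficient from base R's `cor()`. The sketch below uses invented values purely to show the call. It does not report a result from the dataset.

```r
# Hypothetical per-entry values in minutes, for illustration only
sleep_mins  <- c(300, 360, 420, 450, 480)
in_bed_mins <- c(340, 390, 450, 500, 510)

# Pearson correlation (the default method of cor())
r <- cor(sleep_mins, in_bed_mins)
round(r, 2)
```

With the data frame built earlier, the equivalent call would be `cor(summary_sleep$avg_sleep_mins, summary_sleep$avg_time_in_bed)`.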

>Marketing Insight: Based on this correlation, Bellabeat could set reminders to encourage users to spend more time in bed. However, there are outliers: individuals who spend long durations in bed (trying to fall asleep). This indicates that some individuals struggle to sleep even with sufficient time in bed; let's explore this idea further. 

*Next Steps:* To explore these outliers, I separated the Time Asleep vs Time in Bed based on recovery levels. This is because research suggests that individuals lacking sleep fall asleep quicker and experience longer durations of deep sleep. 

In [None]:
ggplot(data = summary_sleep) +
  geom_jitter(mapping = aes(x = avg_sleep_mins, y = avg_time_in_bed)) + #jitter used due to high density of points 
  facet_wrap(~recovery_level) + #find relationships by different recovery groups 
    labs(title = "Time Asleep vs. Time in Bed- Based on Recovery") + 
    xlab("Average Sleep Duration (mins)") + ylab("Time in Bed (mins)")

Interestingly we see that:  

  * There seems to be a linear relationship between Time Asleep and Time in Bed for individuals classified as *Optimal* or *Possible Sleep Deficit*
  * There is more variation between Time Asleep and Time in Bed for individuals with abnormally low sleep efficiency (classified as *Possible Sleep Disorder*).  

>Marketing Insights: Use these findings to track sleep efficiency and recovery in relation to sleep disorders. Just as other fitness trackers found broader applications for their tracking metrics amidst the pandemic [(using respiratory rate to predict COVID-19 symptoms)](https://www.whoop.com/thelocker/case-studies-respiratory-rate-covid-19/), Bellabeat can act on this data to shift its branding towards combatting health issues. Bellabeat should play an **active role** in its consumers' health rather than being a bystander. 

*Next Steps:* I explored further into sleep behavior by contrasting rest levels (based on sleep duration) and recovery levels (based on sleep efficiency).

In [None]:
ggplot(data = summary_sleep) +
  geom_bar(mapping = aes(x = rested_level, fill = recovery_level)) +
  theme(axis.text.x = element_text(angle = 90)) +
    labs(title = "Rest vs. Recovery Levels") + 
      xlab("Sleep Sufficiency") + ylab("Sleep Entries (#)")

I noticed that:  

  * Poorly and Very Poorly Rested individuals are most frequently classified as having *Possible Sleep Disorders* and *Low Recovery*   
  * Moderately and Well Rested individuals have optimal and above-optimal recovery levels; this indicates that although they sleep long enough, some may still be in a sleep deficit. There is also a low occurrence of *Possible Sleep Disorders*. 

>Marketing Insights: Make Bellabeat app analytics and marketing more health oriented by educating users. For example, if a user experiences sustained periods of *Poor Rest* and *Low Recovery*, the app could notify the user, explain possible issues, and recommend that they visit their doctor. 

*Next Steps:* I separated Rest vs. Recovery levels based on activity level to explore whether activity levels impacted sleep quality and duration. 

In [None]:
ggplot(data = sleep_activity_merged) +
  geom_bar(mapping = aes(x = rested_level, fill = recovery_level)) +
  labs(title="Rest vs. Recovery Levels Based on Activity Level") +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~activity_level) + xlab("Sleep Sufficiency") + ylab("Sleep Entries (#)")

This visualization surprised me because I predicted that sleep quality and sufficiency would increase with activity levels, instead:  

  * Among *Very* and *Fairly Active* individuals who were *Very Poorly Rested/Poorly Rested*, a relatively high proportion experienced low sleep efficiency (categorized as *Possible Sleep Disorder*)   
  * Across all activity levels, *Moderate* to *Well Rested* individuals mostly experienced high sleep efficiency

>Marketing Insights: The data suggests that the more active you are, the more rest you need (makes sense!). Active, poorly rested individuals may be at higher risk of sleep disorders because their bodies need longer, high-quality sleep to recover; overtraining combined with poor rest can contribute to insomnia. Using this information, Bellabeat can take a more scientific/data-driven approach to daily performance. 

*Next Steps:* Finding out how time in bed is related to recovery levels to determine optimal sleep duration. 

In [None]:
ggplot(data = summary_sleep) +
  geom_histogram(mapping = aes(x = avg_time_in_bed, fill = recovery_level)) +
  theme(axis.text.x = element_text(angle = 50)) + 
  labs(title = "Time in Bed vs. Recovery Levels") + 
  xlab("Time in Bed (Mins)") + ylab("Sleep Entries (#)")

Within the FitBit dataset:  

  * Users who spent **450 to 520 minutes (roughly 7.5 to 8.7 hours)** in bed experienced optimal recovery most frequently  
  * A relatively high proportion of users who spent excessive (more than 10 hours) or insufficient (less than 2 hours) time in bed experienced very poor or abnormally high recovery levels, indicating insufficient sleep or possible sleep disorders.  
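Reading proportions off a stacked histogram is approximate, so a small table can back up the observations above. The following base R sketch bins time in bed around the 450 and 520 minute bounds and computes the share of each recovery level per bin. The data frame `toy_sleep` and its values are hypothetical.

```r
# Hypothetical entries, for illustration only
toy_sleep <- data.frame(
  time_in_bed = c(100, 460, 480, 500, 510, 620),
  recovery_level = c("Possible Sleep Deficit", "Optimal", "Optimal",
                     "Possible Sleep Disorder", "Optimal", "Possible Sleep Disorder")
)

# Bin time in bed using the 450 and 520 minute bounds
toy_sleep$bed_bin <- cut(toy_sleep$time_in_bed,
                         breaks = c(0, 450, 520, Inf),
                         labels = c("up to 450", "450 to 520", "over 520"))

# Row-wise proportions: share of each recovery level within each bin
bin_shares <- prop.table(table(toy_sleep$bed_bin, toy_sleep$recovery_level), margin = 1)
round(bin_shares, 2)
```

In this toy example, three of the four entries in the 450 to 520 minute bin are 'Optimal'. The real shares would need to be computed from `summary_sleep`.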


# Insights and Recommendations
#### **Final conclusions based on my analysis**  
The [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) includes a wide range of users (in terms of activity levels and lifestyles). Based on Bellabeat's website, they are targeting women who are active, young, and fashionable. My main takeaway is a question: how can Bellabeat make the Leaf more inclusive of, and specific to, a broader range of lifestyles? To appeal to a wider audience, Bellabeat should help inactive/unhealthy individuals work towards a healthy lifestyle. Below I outline a few recommendations to help Bellabeat refine their branding.  


#### **How can I apply my findings to Bellabeat's marketing efforts?**
1. Play an active/proactive role in consumer health  
>Use reminders to spend more time in bed based on sleep duration and quality. Highlight the broader application of health metrics relevant to pandemic times. Promote material on how the Leaf can help track sleep quality (and other metrics) to determine health. Provide an overview of incremental improvements to work towards (or maintain) a healthy lifestyle. 

2. Apply a data-driven approach to optimizing health 
>Place more focus on user data analytics and continual feedback. Continuous interactions can help increase engagement among users with different activity levels. Bellabeat could promote this approach through social media channels and blogs. Lastly, educate users about the importance of lesser-known metrics (like sleep efficiency) through blogs, short reels (IG Reels, TikTok), and short informative notifications. 

3. "Gamify" Bellabeat's smart app
>As a smart fitness device user, I was surprised by the inconsistency of these datasets. I rarely miss tracking my day because of the social element- comparing my daily statistics with my friends. Fitbit's initial success relied on adding social and competitive elements to their app UI. Bellabeat could also analyze user data to create comparative statistics customised to each user. I also really like [Whoop's use of communities](https://support.whoop.com/hc/en-us/articles/360043767753-Joining-a-Public-WHOOP-Team) to motivate users with similar interests.  


#### **Additional data to expand on findings**
  * Larger data set (more individuals across a longer period of time) to increase statistical power and bring more reliability/credibility to analysis
  * Detailed demographic data for FitBit users to account for cultural, geographical, and socio-economic factors 
  * Existing data by Bellabeat on 'reproductive health' statistics that they track. This could reveal how activity, sleep, and reproductive health are related and how Bellabeat can guide their female users.  
  
    
##### If you're here, congrats! You've stuck with me throughout this brain-racking but ultra-rewarding process. I'm still a beginner in my data analytics journey so any comments/feedback would be much appreciated!  

##### Some newbie issues I ran into: 
  * When transferring code from R into Kaggle, I encountered some errors that weren't an issue in R but prevented me from properly executing sections of my code in Kaggle. Specifically, I had trouble merging my dataframes by "Id" and struggled with why I couldn't use `group_by` then `summarize` (even though it had worked in R). I would really appreciate it if anyone knows why this happened and can share with me! 
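
One thing that may be worth checking, offered as a guess rather than a diagnosis: older dplyr releases (before 1.0) required every `summarize()` expression to return a single value per group, so a per-row calculation inside `summarize()` could work on one installation and fail on another. The sketch below shows a pattern that should behave the same either way: one value per group, followed by a merge on an explicitly named key. The toy data frames and column values are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical daily records for two participants, for illustration only
toy_activity <- data.frame(Id = c(1, 1, 2, 2), VeryActiveMinutes = c(10, 30, 0, 20))
toy_sleep    <- data.frame(Id = c(1, 2), avg_sleep_mins = c(420, 380))

# Each summarize() expression returns exactly one value per group
toy_summary <- toy_activity %>%
  group_by(Id) %>%
  summarize(avg_very_active_mins = mean(VeryActiveMinutes))

# Name the join key explicitly so merge() does not have to guess it
toy_joined <- merge(toy_sleep, as.data.frame(toy_summary), by = "Id", all = TRUE)
toy_joined
```

Comparing `packageVersion("dplyr")` in both environments would show whether a version difference is actually in play.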