In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Bellabeat Case Study

## Ask phase
Business task: Identify trends in the wearable fitness technology market, apply these trends to Bellabeat products, and provide strategic marketing recommendations based on the comparison.

Key Stakeholder: Urska Srsen

Secondary Stakeholders: Sando Mur, Marketing analytics team

## Prepare phase 

Dataset explanation: Thirty users consented to the collection and submission of FitBit tracker data between March 12, 2016 and May 12, 2016. The data were generated by respondents to a survey distributed via Amazon Mechanical Turk.

Dataset location: https://www.kaggle.com/datasets/arashnic/fitbit?datasetId=1041311&sortBy=voteCount

Source: https://zenodo.org/record/53894#.X9oeh3Uzaao

License: CC0 Public Domain

Privacy: Thirty users consented to the collecting of their smart wearables data.

Bias: Some values are self-reported, which may introduce inaccuracies.

## Process Phase
The zipped folder of datasets was uploaded to the RStudio Cloud project directory.

tidyverse was installed and loaded.

The CSV files were read into data frames for visualization and analysis.

The data frames were inspected to confirm they were read correctly.

In [None]:
install.packages("tidyverse")
library(tidyverse)

In [None]:
DAct <- read_csv("project/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

View(DAct)
head(DAct)
colnames(DAct)
str(DAct)

In [None]:
dsleep <- read_csv("project/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
View(dsleep)

After viewing data frames to examine structure and content, packages are installed to clean the data.

In [None]:
install.packages("here")
install.packages("janitor")
install.packages("skimr")
library(here)
library(janitor)
library(skimr)

Column names were standardized using the clean_names() function from janitor.

In [None]:
DAct <- clean_names(DAct)
dsleep <- clean_names(dsleep)

The activity_date and sleep_day columns were converted to the Date type, and id was converted to character so that it is treated as a label rather than a number.

The reformatted values were checked using View().

The converted data types were checked using str().

In [None]:
DAct <- mutate(DAct, activity_date = as.Date(activity_date, format= "%m/%d/%Y"), id = as.character(id))
View(DAct)
str(DAct)

In [None]:
dsleep <- mutate(dsleep, sleep_day = as.Date(sleep_day, format = "%m/%d/%Y"))
dsleep <- mutate(dsleep, id = as.character(id))
View(dsleep)
str(dsleep)
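
As a quick check of the conversion above, the sketch below parses a few literal strings with the same format string. The sample values are illustrative assumptions modeled on the raw columns, not rows copied from the dataset.

```r
# Illustrative values only; assumed to resemble the raw activity_date and sleep_day columns
raw_activity <- c("4/12/2016", "5/1/2016")
raw_sleep    <- c("4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time in the sleep values is ignored
parsed_activity <- as.Date(raw_activity, format = "%m/%d/%Y")
parsed_sleep    <- as.Date(raw_sleep, format = "%m/%d/%Y")

# Both vectors should be of class Date and agree day for day
class(parsed_activity)
identical(parsed_activity, parsed_sleep)

# A numeric id becomes a label rather than a quantity once converted to character
id_chr <- as.character(1503960366)
id_chr
```

If the two parsed vectors agree, the date columns can safely be used as a join key later on.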

## Analyze Phase

The number of unique users and the number of distinct days were counted for each data frame.

In [None]:
n_distinct(DAct$id)
n_distinct(DAct$activity_date)
n_distinct(dsleep$id)
n_distinct(dsleep$sleep_day)

> n_distinct(DAct$id)
[1] 33
> n_distinct(DAct$activity_date)
[1] 31
> n_distinct(dsleep$id)
[1] 24
> n_distinct(dsleep$sleep_day)
[1] 31
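
The counts above differ (33 ids with activity data, 24 with sleep data), so some ids appear in only one table. A minimal sketch of how that could be checked follows; `activity_toy` and `sleep_toy` are hypothetical stand-ins for DAct and dsleep, and their values are made up.

```r
# Made-up stand-ins for the two data frames; only the id column matters here
activity_toy <- data.frame(id = c("a", "a", "b", "c"), stringsAsFactors = FALSE)
sleep_toy    <- data.frame(id = c("a", "c", "c"), stringsAsFactors = FALSE)

# Base R equivalent of n_distinct()
n_activity <- length(unique(activity_toy$id))
n_sleep    <- length(unique(sleep_toy$id))

# ids that have activity records but no sleep records
missing_sleep <- setdiff(unique(activity_toy$id), unique(sleep_toy$id))
missing_sleep
```

Run against the real data frames, the same `setdiff()` call would list the users who never logged sleep.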

The mean, median, minimum, maximum and quartiles were then computed for selected columns of each data frame.

In [None]:
DAct %>%
  select(total_steps,
         total_distance,
         calories,
         lightly_active_minutes,
         fairly_active_minutes,
         very_active_minutes,
         sedentary_minutes) %>%
  summary()

total_steps    total_distance      calories    lightly_active_minutes fairly_active_minutes
 Min.   :    0   Min.   : 0.000   Min.   :   0   Min.   :  0.0          Min.   :  0.00       
 1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828   1st Qu.:127.0          1st Qu.:  0.00       
 Median : 7406   Median : 5.245   Median :2134   Median :199.0          Median :  6.00       
 Mean   : 7638   Mean   : 5.490   Mean   :2304   Mean   :192.8          Mean   : 13.56       
 3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793   3rd Qu.:264.0          3rd Qu.: 19.00       
 Max.   :36019   Max.   :28.030   Max.   :4900   Max.   :518.0          Max.   :143.00       
very_active_minutes sedentary_minutes
 Min.   :  0.00      Min.   :   0.0   
 1st Qu.:  0.00      1st Qu.: 729.8   
 Median :  4.00      Median :1057.5   
 Mean   : 21.16      Mean   : 991.2   
 3rd Qu.: 32.00      3rd Qu.:1229.5   
 Max.   :210.00      Max.   :1440.0   


dsleep %>%
  select(total_minutes_asleep,
         total_time_in_bed) %>%
  summary()

total_minutes_asleep total_time_in_bed
 Min.   : 58.0        Min.   : 61.0    
 1st Qu.:361.0        1st Qu.:403.0    
 Median :433.0        Median :463.0    
 Mean   :419.5        Mean   :458.6    
 3rd Qu.:490.0        3rd Qu.:526.0    
 Max.   :796.0        Max.   :961.0    

Relationships between variables were visualized with scatter plots to look for correlations.

The first pair examined was total time in bed and total minutes asleep; the plot confirms the expected linear relationship.

In [None]:
ggplot(dsleep, aes(x = total_minutes_asleep, y = total_time_in_bed)) + 
  geom_point()

Total steps and total distance would be expected to have a linear relationship, and in this dataset they do.

In [None]:
ggplot(data = DAct) + 
  geom_point(mapping = aes(x = total_steps, y = total_distance))

The next two plots showed an interesting pattern: two subsets of users appeared along the sedentary minutes axis, suggesting some common factor that divides them.

In [None]:
ggplot(data = DAct) +
  geom_point(mapping = aes(x = total_steps, y = sedentary_minutes))

In [None]:
ggplot(DAct, aes(x = very_active_minutes, y = sedentary_minutes)) + geom_point()

Combining the sleep and daily activity data frames gives a more comprehensive data frame; however, the number of unique ids drops to 24.

In [None]:
combined_data <- merge(DAct, dsleep, by = "id")

n_distinct(combined_data$id)

This merge produced a table with 12441 observations, far more than the 413 in the dsleep data frame.

Because the merge matched on id alone, every activity row for an id was paired with every sleep row for that id.

So, if an id had 25 sleep observations, all 25 were attached to each of that id's activity dates.

Repeating this for every id/date combination inflated the number of observations.
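
The row inflation described above can be reproduced on a small scale. The sketch below uses made-up data frames (`activity_toy` and `sleep_toy` are hypothetical names, not part of the case study) with one id, three activity days and two sleep days.

```r
# One user with three activity days
activity_toy <- data.frame(
  id = c("a", "a", "a"),
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  total_steps = c(5000, 7000, 9000)
)

# The same user with two sleep records
sleep_toy <- data.frame(
  id = c("a", "a"),
  sleep_day = as.Date(c("2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(400, 430)
)

# Matching on id alone pairs every activity row with every sleep row
by_id <- merge(activity_toy, sleep_toy, by = "id")
nrow(by_id)  # 3 activity rows * 2 sleep rows = 6
```

The result has more rows than either input, which is the signature of a many-to-many join.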


Another merge was necessary, this time on two columns: id and date.

In [None]:
combined_data1 <- merge(DAct, dsleep, by.x=c('id', 'activity_date'), by.y=c('id', 'sleep_day'))

This command created a data frame with the expected 413 observations, matching the number of rows in dsleep.
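
A small-scale sketch of the two-column merge, again on made-up data (`activity_toy` and `sleep_toy` are hypothetical names). It also shows `all.x = TRUE`, an option not used in the case study, which would keep activity days that have no sleep record.

```r
activity_toy <- data.frame(
  id = c("a", "a", "a"),
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  total_steps = c(5000, 7000, 9000)
)
sleep_toy <- data.frame(
  id = c("a", "a"),
  sleep_day = as.Date(c("2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(400, 430)
)

# Match on both id and date: one output row per matching user-day
by_id_date <- merge(activity_toy, sleep_toy,
                    by.x = c("id", "activity_date"),
                    by.y = c("id", "sleep_day"))
nrow(by_id_date)  # 2

# Optional variant: keep unmatched activity days, with NA for the sleep columns
left_joined <- merge(activity_toy, sleep_toy,
                     by.x = c("id", "activity_date"),
                     by.y = c("id", "sleep_day"),
                     all.x = TRUE)
nrow(left_joined)  # 3
```

The default inner merge is what makes the row count follow the sleep table; the `all.x = TRUE` variant would instead follow the activity table.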

Using combined_data1, the variables total_steps and total_minutes_asleep were visualized to examine for a trend.

The data show that as total minutes asleep increased, total steps decreased.

This effect was not immediately noticeable using a scatter plot but was more clearly visualized using smooth and quantile graphs.

In [None]:
ggplot(combined_data1, aes(x = total_steps, y = total_minutes_asleep)) + geom_smooth()

In [None]:
ggplot(data = combined_data1, aes(x = total_steps, y = total_minutes_asleep)) + 
  geom_point(alpha = 0.3) + geom_quantile()

### Analysis Conclusions

Summary Results

The median wearable tech user logs 7406 steps, covers 5.25 miles, and burns 2134 calories per day. About 95% of their active minutes are light activity (199 minutes), with 6 and 4 minutes being fairly and very active, respectively. The median user spends 433 minutes asleep. The median and mean were similar for the minutes asleep and lightly active minutes variables. Larger differences were seen for the total steps, total distance, calories, sedentary, fairly active and very active minutes variables. For most of these the mean sits above the median, likely pulled up by outliers (sedentary minutes is the exception, with a mean of 991.2 below the median of 1057.5), which is why the median was used instead of the mean.
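
The 95% figure and the mean-versus-median comparison can be reproduced from the summary() output shown earlier. The sketch below hard-codes those printed values; the variable names are introduced here for illustration.

```r
# Medians of active minutes, copied from the summary() output
median_light  <- 199
median_fairly <- 6
median_very   <- 4

# Share of the median user's active minutes spent in light activity
light_share <- median_light / (median_light + median_fairly + median_very)
round(100 * light_share, 1)

# Mean minus median for a few columns, copied from the same output;
# a positive gap suggests high outliers, a negative gap suggests low ones
means   <- c(total_steps = 7638, very_active_minutes = 21.16, sedentary_minutes = 991.2)
medians <- c(total_steps = 7406, very_active_minutes = 4.00,  sedentary_minutes = 1057.5)
gap <- means - medians
gap
```

The sign of each gap shows which direction the mean is pulled, which is the reason for reporting medians.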

There is a positive correlation between total steps and total distance, and between time in bed and time asleep.

There is a negative correlation between sedentary minutes and lightly active minutes, and between total minutes asleep and total steps.

A mostly negative correlation is seen between sedentary minutes and total steps up to about 11,000 steps, after which the relationship becomes positive. One possible explanation is that sedentary minutes decrease as time spent walking increases, up to a point where the walking begins to fatigue the person significantly, crowding out other forms of exercise and movement and increasing sedentary minutes.
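
One way to probe a change in direction like this is to compute the correlation separately below and above the threshold. The sketch below uses synthetic, deliberately V-shaped values, so it illustrates the method only and says nothing about the real data; `threshold` is taken from the approximate 11,000-step figure above.

```r
# Synthetic data: sedentary minutes fall until 11,000 steps, then rise
steps     <- seq(1000, 20000, by = 1000)
sedentary <- 700 + abs(steps - 11000) / 20

threshold <- 11000
below <- steps <= threshold

# Pearson correlation within each segment
cor_below <- cor(steps[below], sedentary[below])
cor_above <- cor(steps[!below], sedentary[!below])

c(below = cor_below, above = cor_above)
```

On real data the two coefficients would be far from the extremes this toy example produces, but opposite signs would support the pattern seen in the plot.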




## Share Phase

Plots were created for presentation to stakeholders, illustrating the most interesting or surprising trends found.



In [None]:
ggplot(combined_data1, aes(x = lightly_active_minutes, y = sedentary_minutes)) +
  geom_point(alpha = 0.3) +
  geom_smooth()

In [None]:
ggplot(combined_data1, aes(x = very_active_minutes, y = sedentary_minutes)) + 
  geom_point(alpha = 0.3) +
  geom_smooth()

In [None]:
ggplot(combined_data1, aes(x = total_steps, y = total_minutes_asleep)) +
  geom_point(alpha = 0.3) +
  geom_smooth()

Correlation coefficients were calculated for the plotted pairs of variables to interpret the relationships more precisely.

In [None]:
cor(combined_data1$total_steps, combined_data1$total_distance)
cor(combined_data1$total_minutes_asleep, combined_data1$total_steps)
cor(combined_data1$sedentary_minutes, combined_data1$very_active_minutes)
cor(combined_data1$sedentary_minutes, combined_data1$lightly_active_minutes)
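
As an alternative to calling cor() once per pair, it can be applied to several numeric columns at once to get a full matrix. This is a minimal sketch on made-up values (the `toy` data frame is hypothetical); cor() computes Pearson coefficients by default.

```r
# Made-up values, for illustration only
toy <- data.frame(
  total_steps          = c(2000, 4000, 6000, 8000, 10000),
  total_distance       = c(1.4, 2.9, 4.2, 5.8, 7.1),
  total_minutes_asleep = c(480, 450, 430, 440, 400)
)

# Correlation matrix for all column pairs; rows containing NA are dropped
cor_matrix <- round(cor(toy, use = "complete.obs"), 2)
cor_matrix
```

Each cell holds the coefficient for one pair of columns, so the whole set of relationships can be read from a single table.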

A Google Slides presentation was created to present the visualizations, data, conclusions and recommendations.

### Coding addendum
After the case study was reviewed by a data professional, it was suggested to use the ggpairs() function from GGally, an extension to ggplot2, to create a correlogram visualizing the correlations among all relevant variables. This worked beautifully once unnecessary columns had been removed, leaving a data frame containing only the variables of interest.
Using this function at the start of the search for trends would have been immensely helpful and would have saved significant time and effort in finding correlations of note.

In [1]:
install.packages("GGally")
library(GGally)

correlellogram1 <- combined_data1[ -c(1,2,5:10,16)]

ggpairs(correlellogram1)

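
Dropping columns by position depends on the column order staying the same. A hedged alternative is to select the variables of interest by name before plotting. The sketch below uses a made-up data frame (`toy`, `keep` and `pairs_data` are hypothetical names) and falls back to base R's pairs() when GGally is not installed.

```r
# Made-up values, for illustration only
toy <- data.frame(
  id                   = c("a", "b", "c", "d"),
  total_steps          = c(3000, 6000, 9000, 12000),
  sedentary_minutes    = c(1100, 900, 800, 850),
  total_minutes_asleep = c(470, 440, 420, 400),
  stringsAsFactors = FALSE
)

# Select the numeric variables of interest by name rather than by position
keep <- c("total_steps", "sedentary_minutes", "total_minutes_asleep")
pairs_data <- toy[, keep]

if (requireNamespace("GGally", quietly = TRUE)) {
  # Correlogram: scatter plots, densities and correlation coefficients
  print(GGally::ggpairs(pairs_data))
} else {
  # Base R scatter plot matrix as a fallback when GGally is unavailable
  pairs(pairs_data)
}
```

Selecting by name keeps the plot stable if columns are later added to or reordered in the merged data frame.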


## Act Phase

### Conclusions

From a dataset described as covering 30 users of different wearable technology products (33 distinct ids appear in the activity data), the median user:

1. spends the majority of tracked active minutes performing light activity, presumably walking.

2. logs about 10 minutes of combined fairly active and very active time per tracked day (medians of 6 and 4 minutes, respectively).


Trends observed include:

1. a negative correlation between sleep and steps

2. no correlation between time spent in vigorous activity and sedentary time

3. slight negative correlation between time spent in light activity and sedentary time


This dataset is applicable to Bellabeat products, as the dataset represents a mix of users and products. This provides a profile of the general industry rather than a specific subset of industry.


### Recommendations

Marketing strategy should be targeted to the customer who is already active and spends most of their active time performing light activity.
    



### Next Steps

Develop a primary marketing strategy aimed at the population segment that matches the median user.

Develop secondary strategies aimed at those who do not match the median user: those who are sedentary and want to be active, and those who already track very high amounts of activity.

Obtain more user data for wearable technology, and break down the data by product and price to obtain more specific detail on the type of user of each product.