# Bellabeat Case Study (Google Data Analytics Capstone)


## Table of Contents

* [1. Ask-Phase](#1.)
  * [1.1 Objective/Purpose of Analysis](#1.1)
  * [1.2 Stakeholders](#1.2)
<br></br>
* [2. Prepare-Phase](#2.)
  * [2.1 Data Collection](#2.1)
  * [2.2 Data Integrity](#2.2)
     * [2.2.1 Data Source](#2.2.1)
     * [2.2.2 Data Formats](#2.2.2)
     * [2.2.3 Data Credibility](#2.2.3)
     * [2.2.4 Data Security](#2.2.4)
  * [2.3 Conclusion](#2.3)
<br></br> 
* [3. Process-Phase](#3.)
  * [3.1 First Month (March to April)](#3.1)
     * [3.1.1 Load packages](#3.1.1)
     * [3.1.2 Load in all data](#3.1.2)
     * [3.1.3 Preview the data](#3.1.3)
     * [3.1.4 Check timezones of data and computer and adjust if necessary](#3.1.4)
     * [3.1.5 Make column naming more consistent](#3.1.5)
     * [3.1.6 Change data types](#3.1.6)
     * [3.1.7 Check unique 'Id' values for each dataset](#3.1.7)
     * [3.1.8 Check time frame for each dataset](#3.1.8)
     * [3.1.9 Check for NULLs or missing values in each dataset](#3.1.9)
     * [3.1.10 Check for duplicates in each dataset](#3.1.10)
     * [3.1.11 Check for 0s in each dataframe](#3.1.11)
     * [3.1.12 Drop/add columns and replace values where necessary](#3.1.12)
     * [3.1.13 Create new tables](#3.1.13)
     * [3.1.14 Limit decimal places in each table](#3.1.14)
     * [3.1.15 Double check cleaned and newly created dataframes](#3.1.15)
  * [3.2 Second Month (April to May)](#3.2)
     * [3.2.1 Load packages](#3.2.1)
     * [3.2.2 Load in all data](#3.2.2)
     * [3.2.3 Preview the data](#3.2.3)
     * [3.2.4 Check timezones of data and computer and adjust if necessary](#3.2.4)
     * [3.2.5 Make column naming more consistent](#3.2.5)
     * [3.2.6 Change data types](#3.2.6)
     * [3.2.7 Check unique 'Id' values for each dataset](#3.2.7)
     * [3.2.8 Check time frame for each dataset](#3.2.8)
     * [3.2.9 Check for NULLs or missing values in each dataset](#3.2.9)
     * [3.2.10 Check for duplicates in each dataset](#3.2.10)
     * [3.2.11 Check for 0s in each dataframe](#3.2.11)
     * [3.2.12 Drop/add columns and replace values where necessary](#3.2.12)
     * [3.2.13 Create new tables](#3.2.13)
     * [3.2.14 Limit decimal places in each table](#3.2.14)
     * [3.2.15 Double check cleaned and newly created dataframes](#3.2.15)
  * [3.3 Data Binding](#3.3)
     * [3.3.1 Bind respective dataframes from both months together](#3.3.1)
     * [3.3.2 Preview bound dataframes](#3.3.2)
     * [3.3.3 Check each dataframe for problematic values](#3.3.3)
<br></br>
* [4. Analyse-Phase](#4.)
  * [4.1 Analysis: Usage](#4.1)
     * [4.1.1 Load packages](#4.1.1)
     * [4.1.2 Change timezone to UTC](#4.1.2)
     * [4.1.3 Create dataframe for feature usage](#4.1.3)
     * [4.1.4 Create dataframes for days tracked per feature](#4.1.4)
     * [4.1.5 Preview new dataframes](#4.1.5)
     * [4.1.6 Create plots for feature usage](#4.1.6)
     * [4.1.7 Examine user engagement rate](#4.1.7)
     * [4.1.8 Compare total number of trackings for each feature](#4.1.8)
     * [4.1.9 Examine the percentage distribution](#4.1.9)
     * [4.1.10 Examine how much time each feature was used over the entire period](#4.1.10)
     * [4.1.11 Examine relation between tracking activity and time](#4.1.11)
     * [4.1.12 Examine how long the device was worn on average](#4.1.12)
  * [4.2 Analysis: Performance](#4.2)
     * [4.2.1 Load packages](#4.2.1)
     * [4.2.2 Change timezone to UTC](#4.2.2)
     * [4.2.3 Examine daily activity and weight](#4.2.3)
     * [4.2.4 Examine calories](#4.2.4)
     * [4.2.5 Examine intensities](#4.2.5)
     * [4.2.6 Examine METs](#4.2.6)
     * [4.2.7 Examine sleep](#4.2.7)
     * [4.2.8 Examine steps](#4.2.8)
     * [4.2.9 Examine heart rates](#4.2.9)
     * [4.2.10 Examine caloric relations for intensity, METs and steps](#4.2.10)
     * [4.2.11 Examine sleep relations](#4.2.11)
     * [4.2.12 Examine heart rate relations](#4.2.12)
<br></br>
* [5. Share-Phase](#5.)
  * [5.1 Summary of Device Usage Trends](#5.1)
  * [5.2 Analysis and Targets](#5.2)
     * [5.2.1 Dashboards](#5.2.1)
  * [5.3 Conclusion](#5.3)
     * [5.3.1 Recommendations](#5.3.1)
     * [5.3.2 New Marketing Strategy](#5.3.2)
<br></br>
* [6. Act-Phase](#6.)

<div style = "background-color: #ADE8F4">
    
## **1. Ask-Phase** 
</div>

#### **1.1 Objective/Purpose of Analysis** 
    
* facilitate business growth by conceptualising new marketing and business strategies based on usage data from non-Bellabeat smart devices
* identify trends in how smart devices are being used and how this information could be applied to Bellabeat's own products in order to enhance user experience
<br></br>

#### **1.2 Stakeholders** 
    
* Co-Founder/Chief Creative Officer: Urška Sršen   
* Co-Founder/Executive Team: Sando Mur

<div style = "background-color: #ADE8F4">

##  **2. Prepare-Phase** 
</div>

#### **2.1 Data Collection** 

* Data generation needed: no
* Data collection: external source (third party) recommended by chief creative officer
* Data collection method: see 2.2.1 <br></br>

#### **2.2 Data Integrity** 
#### **2.2.1 Data Source**  
* Datasets to be examined: [Fitbit Fitness Tracker Data](http://www.kaggle.com/datasets/arashnic/fitbit/discussion/313589)

  * Source: Kaggle
  * Provided by: Möbius (a healthcare data scientist) 
  * Originally published on: Zenodo
  * Ownership: Robert Furberg, Julia Brinton, Michael Keating, Alexa Ortiz
  * Affiliations: RTI International
  * DOI: 10.5281/zenodo.53894
  * Accessibility: open source (under correct citation, see DOI)
  * Cited: yes
  * Data collection method: survey distributed via Amazon Mechanical Turk
    * time frame: 3/12/2016-5/12/2016 
    * participants: 30
    * consent: given (to original creator)
    * contents: tracking of physical activity, heart rate, sleep <br></br>
    
#### **2.2.2 Data Formats** 

* mainly quantitative (continuous) data
* formats: both wide and long data formats
* structured: yes 
* file formats: .csv
* data storage: stored in 2 folders (downloaded to computer hard drive): <br></br>
  * <u>3/12-4/11 (first month)</u>
    * *Activity (day)*
    * *Calories (hour, minute)*
    * *Intensities (hour, minute)*
    * *METs (minute)*
    * *Sleep (minute)*
    * *Steps (hour, minute)*
    * *Weight Log Info (day)*
    * *Heart Rate (second)* 
      [11 .csv-files] <br></br>
  * <u>4/12-5/12 (second month)</u>
    * *Activity (day)*
    * *Calories (day, hour, minute_narrow, minute_wide)*
    * *Intensities (day, hour, minute_narrow, minute_wide)*
    * *METs (minute)*
    * *Sleep (day, minute)*
    * *Steps (day, hour, minute_narrow, minute_wide)*
    * *Weight Log Info*
    * *Heart Rate (second)* 
      [18 .csv-files] <br></br>
      
#### **2.2.3 Data Credibility** 

To determine the credibility and quality of the data, the 'ROCCC'-analysis will be utilised.

1.) **R**eliability: Low 
                    
* The number of people who participated in the survey only amounts to 30, which is often viewed as the minimum sample size for surveying. In 2016, the year in which the survey was conducted, Fitbit had around 23.6 million active users. This means that at a desirable confidence level of 90-95% a margin of error of 15-18% is to be expected. While it is still in an acceptable range, a larger sample size is strongly advised.     

* The time frame of the survey spans around two months, from the middle of March to the middle of May 2016. The results may be enough to give a first impression of user activity and fitness. A longer time period would, however, be much more beneficial, as it is doubtful that such a short period accurately reflects the users' fitness habits.

* With no further information on the participants' backgrounds or on how the survey was conducted, sampling bias might be an issue. In general, however, Amazon's Mechanical Turk seems to be an "efficient [and] reliable" way of collecting data (see [National Institutes of Health](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5880761/)). 
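
The margin of error quoted above can be reproduced with the standard formula for a sample proportion. The following is a minimal sketch, assuming a simple random sample, the worst-case proportion p = 0.5 and a finite population correction; none of these assumptions is stated in the source, and `margin_of_error` is a hypothetical helper introduced only for illustration.

```r
# Margin of error for a sample proportion.
# Assumptions (not from the source): worst-case proportion p = 0.5 and a
# finite population correction for a population of N users.
margin_of_error <- function(n, conf_level, p = 0.5, N = Inf) {
  z <- qnorm(1 - (1 - conf_level) / 2)                      # two-sided critical value
  fpc <- if (is.finite(N)) sqrt((N - n) / (N - 1)) else 1   # negligible for large N
  z * sqrt(p * (1 - p) / n) * fpc
}

moe_90 <- margin_of_error(n = 30, conf_level = 0.90, N = 23.6e6)
moe_95 <- margin_of_error(n = 30, conf_level = 0.95, N = 23.6e6)

round(100 * c(moe_90, moe_95), 1)  # roughly 15.0 and 17.9 percent
```

With 35 instead of 30 participants the same call gives a slightly smaller margin, which still leaves the sample far from representative.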
                                 
2.) **O**riginality: Medium 
* Third party provider that shared the data from the original source under citation.
* Survey conducted through Amazon's Mechanical Turk.

3.) **C**omprehensiveness: Medium 
* Most of the critical fitness data such as calories, intensities, METs, steps, sleep, heart rates and weight are provided.
* There is no further information on the users, such as age or gender. Since Bellabeat's products are targeted towards women, naturally, data on females would be preferable.
* The data solely focuses on Fitbit and does not include other brands.

4.) **C**urrency: Low
* The survey was conducted in 2016 and is, at the time of this analysis, 8 years old. During this time both technology and the general awareness towards health and fitness have shifted. The survey might thus be outdated. 

5.) **C**itation: Medium
* The original creators are cited with credible affiliations.
* No apparent scientific citation. <br></br>

#### **2.2.4 Data Security** 
* used under original licensing 
* use will be restricted to analytics team
* stored on hard drive (+ open source availability)
* user information is anonymous <br></br>

#### **2.3 Conclusion** 

The data provided comes with several limitations, which could undermine the accuracy and effectiveness of the analysis. The main problematic areas are the extremely low number of participants, the limited time span, the currency of the survey and the lack of information on the users (especially age and gender). Adding further and more diverse data to augment the data analysis process is therefore strongly recommended. 

<div style = "background-color: #ADE8F4">

##  **3. Process-Phase** 
</div>

Due to the size of the data, the tool of choice for this analysis will be R. This phase will be split into three major steps: 1.) the data cleaning of the first month, 2.) the data cleaning of the second month and 3.) the data binding of both months.  

#### **3.1 First Month (March to April)** 
#### **3.1.1 Load packages** 

In [None]:
#### Load packages ####
#=====================#

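library("tidyverse")
library("lubridate") # provides tz(), mdy(), mdy_hms(), second(); only attached by tidyverse from version 2.0.0 onwards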

#### **3.1.2 Load in all data** 

In [None]:
#### Load in all data ####
#========================#

Activity_d_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv")
Calories_h_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyCalories_merged.csv")
Intensities_h_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyIntensities_merged.csv")
Steps_h_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv")
Calories_m_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/minuteCaloriesNarrow_merged.csv")
Intensities_m_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/minuteIntensitiesNarrow_merged.csv")
MET_m_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/minuteMETsNarrow_merged.csv")
Sleep_m_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/minuteSleep_merged.csv")
Steps_m_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/minuteStepsNarrow_merged.csv")
Weight_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/weightLogInfo_merged.csv")
HeartRate_s_MarApr <- read.csv("fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/heartrate_seconds_merged.csv")

#'*Variable names consist of category + time measurement + period indicator*

#### **3.1.3 Preview the data** 

In [None]:
#### Preview the data ####
#========================#

glimpse(Activity_d_MarApr)
glimpse(Calories_h_MarApr)
glimpse(Intensities_h_MarApr)
glimpse(Steps_h_MarApr)
glimpse(Calories_m_MarApr)
glimpse(Intensities_m_MarApr)
glimpse(MET_m_MarApr)
glimpse(Sleep_m_MarApr)
glimpse(Steps_m_MarApr)
glimpse(Weight_MarApr)
glimpse(HeartRate_s_MarApr)

#### **3.1.4 Check timezones of data and computer and adjust if necessary** 

In [None]:
#### Check timezones of data and computer and adjust if necessary ####
#====================================================================#

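# Note: the date columns are still plain character strings at this point, so
# tz() has no time zone attribute to read and is expected to fall back to "UTC"
tz(Activity_d_MarApr$ActivityDate)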
tz(Calories_h_MarApr$ActivityHour)
tz(Intensities_h_MarApr$ActivityHour)
tz(Steps_h_MarApr$ActivityHour)
tz(Calories_m_MarApr$ActivityMinute)
tz(Intensities_m_MarApr$ActivityMinute)
tz(MET_m_MarApr$ActivityMinute)
tz(Sleep_m_MarApr$date)
tz(Steps_m_MarApr$ActivityMinute)
tz(Weight_MarApr$Date)
tz(HeartRate_s_MarApr$Time)

Sys.time()
Sys.setenv(TZ = "UTC")

#### **3.1.5 Make column naming more consistent** 

Upon reviewing the datasets several issues have been detected that are to be fixed. These include:  
* case
* inconsistent column naming
* vague column names 

In [None]:
#### Make column naming more consistent ####
#==========================================#

# Change column names to title case where necessary #

Sleep_m_MarApr <- rename_with(Sleep_m_MarApr, str_to_title)
Sleep_m_MarApr <- rename(Sleep_m_MarApr, "LogId" = Logid)

# Rename date column from each dataset #

Activity_d_MarApr <- rename(Activity_d_MarApr, "ActivityTime" = ActivityDate)
Calories_h_MarApr <- rename(Calories_h_MarApr, "ActivityTime" = ActivityHour)
Intensities_h_MarApr <-  rename(Intensities_h_MarApr, "ActivityTime" = ActivityHour)
Steps_h_MarApr <- rename(Steps_h_MarApr, "ActivityTime" = ActivityHour)
Calories_m_MarApr <- rename(Calories_m_MarApr, "ActivityTime" = ActivityMinute)
Intensities_m_MarApr <- rename(Intensities_m_MarApr, "ActivityTime" = ActivityMinute)
MET_m_MarApr <- rename(MET_m_MarApr, "ActivityTime" = ActivityMinute)
Sleep_m_MarApr <- rename(Sleep_m_MarApr, "ActivityTime" = Date)
Steps_m_MarApr <- rename(Steps_m_MarApr, "ActivityTime" = ActivityMinute)
Weight_MarApr <- rename(Weight_MarApr, "ActivityTime" = Date)
HeartRate_s_MarApr <- rename(HeartRate_s_MarApr, "ActivityTime" = Time)

# Specify vague column names where necessary #

Steps_h_MarApr <- rename(Steps_h_MarApr, "TotalSteps" = StepTotal)

Sleep_m_MarApr <- rename(Sleep_m_MarApr, "SleepStage" = Value)

HeartRate_s_MarApr <- rename(HeartRate_s_MarApr, "HeartRate" =  Value)

#### **3.1.6 Change data types** 

The data types of some columns (mainly 'Id' and 'ActivityTime') are impractical and will therefore be changed. 
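
As a small illustration of why these conversions help, the sketch below uses hypothetical sample values laid out like the raw exports. It assumes the `lubridate` package is installed; the ID is the one mentioned later in this notebook and the timestamps are made up.

```r
library(lubridate)

# Hypothetical sample values in the same layout as the raw exports
raw_id   <- 4388161847
raw_time <- c("3/12/2016 12:00:00 AM", "3/12/2016 11:59:59 PM")

# The ID is a label, not a quantity, so it is stored as text
id_chr <- as.character(raw_id)

# Parse month/day/year timestamps with AM/PM markers into date-times
parsed <- mdy_hms(raw_time, tz = "UTC")

raw_id > .Machine$integer.max        # TRUE: too large for a 32-bit integer
format(parsed, "%Y-%m-%d %H:%M:%S")  # 24-hour representation of both values
```

Storing 'Id' as character avoids accidental arithmetic and scientific notation, while proper date-time values make filtering and grouping by day possible.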

In [None]:
#### Change data types ####
#=========================#

Activity_d_MarApr$Id <- as.character(Activity_d_MarApr$Id)
Activity_d_MarApr$ActivityTime <- mdy(Activity_d_MarApr$ActivityTime)

Calories_h_MarApr$Id <- as.character(Calories_h_MarApr$Id)
Calories_h_MarApr$ActivityTime <- mdy_hms(Calories_h_MarApr$ActivityTime)

Intensities_h_MarApr$Id <- as.character(Intensities_h_MarApr$Id)
Intensities_h_MarApr$ActivityTime <- mdy_hms(Intensities_h_MarApr$ActivityTime)

Steps_h_MarApr$Id <- as.character(Steps_h_MarApr$Id)
Steps_h_MarApr$ActivityTime <- mdy_hms(Steps_h_MarApr$ActivityTime)

Calories_m_MarApr$Id <- as.character(Calories_m_MarApr$Id)
Calories_m_MarApr$ActivityTime <- mdy_hms(Calories_m_MarApr$ActivityTime)

Intensities_m_MarApr$Id <- as.character(Intensities_m_MarApr$Id)
Intensities_m_MarApr$ActivityTime <- mdy_hms(Intensities_m_MarApr$ActivityTime)

MET_m_MarApr$Id <- as.character(MET_m_MarApr$Id)
MET_m_MarApr$ActivityTime <- mdy_hms(MET_m_MarApr$ActivityTime)

Sleep_m_MarApr$Id <- as.character(Sleep_m_MarApr$Id)
Sleep_m_MarApr$ActivityTime <- mdy_hms(Sleep_m_MarApr$ActivityTime)
Sleep_m_MarApr$LogId <- as.character(Sleep_m_MarApr$LogId)

Steps_m_MarApr$Id <- as.character(Steps_m_MarApr$Id)
Steps_m_MarApr$ActivityTime <- mdy_hms(Steps_m_MarApr$ActivityTime)

Weight_MarApr$Id <- as.character(Weight_MarApr$Id)
Weight_MarApr$ActivityTime <- mdy_hms(Weight_MarApr$ActivityTime)
Weight_MarApr$LogId <- as.character(Weight_MarApr$LogId)

HeartRate_s_MarApr$Id <- as.character(HeartRate_s_MarApr$Id)
HeartRate_s_MarApr$ActivityTime <- mdy_hms(HeartRate_s_MarApr$ActivityTime)

#### **3.1.7 Check unique 'Id' values for each dataset** 

To ensure the datasets' comparability, the unique ID-values will be checked below.  

In [None]:
#### Check unique 'Id'-values for each dataset ####
#=================================================#

n_distinct(Activity_d_MarApr$Id)
n_distinct(Calories_h_MarApr$Id)
n_distinct(Intensities_h_MarApr$Id)
n_distinct(Steps_h_MarApr$Id)
n_distinct(Calories_m_MarApr$Id)
n_distinct(Intensities_m_MarApr$Id)
n_distinct(MET_m_MarApr$Id)
n_distinct(Sleep_m_MarApr$Id)
n_distinct(Steps_m_MarApr$Id)
n_distinct(Weight_MarApr$Id)
n_distinct(HeartRate_s_MarApr$Id)

In [None]:
# Compare IDs to make sure they are consistent throughout each dataset #

actd_id <- print(unique(Activity_d_MarApr$Id))
calh_id <- print(unique(Calories_h_MarApr$Id))
inth_id <- print(unique(Intensities_h_MarApr$Id))
steh_id <- print(unique(Steps_h_MarApr$Id))
calm_id <- print(unique(Calories_m_MarApr$Id))
intm_id <- print(unique(Intensities_m_MarApr$Id))
metm_id <- print(unique(MET_m_MarApr$Id))
slem_id <- print(unique(Sleep_m_MarApr$Id))
stem_id <- print(unique(Steps_m_MarApr$Id))
weig_id <- print(unique(Weight_MarApr$Id))
hearts_id <- print(unique(HeartRate_s_MarApr$Id))

match(actd_id, calh_id)
match(actd_id, inth_id)
match(actd_id, steh_id)
match(actd_id, calm_id)
match(actd_id, intm_id)
match(actd_id, metm_id)
match(actd_id, slem_id) 
match(actd_id, stem_id)
match(actd_id, weig_id) 
match(actd_id, hearts_id) 

actd_id[17] 

<div style = "background-color: #e6f2ff">


#### **<u>Findings</u>**

The check has revealed that there is data for a total of 35 participants. This is inconsistent with the dataset description in 2.2.1, which mentions only 30 participants; the actual number of users is hence slightly larger than initially expected. The comparison also shows that the same participants appear across the various datasets. There is, however, considerable variation in the number of users per dataset, as some features were seemingly tracked more than others. Most datasets contain data for 34 participants, with *Activity* (35), *Sleep* (23), *HeartRate* (14) and *Weight* (11) being the exceptions. The only ID that is unique to the *Activity* dataset is '4388161847'.  

</div>
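
The ID comparison can also be expressed with set operations, which return the differing IDs directly instead of match positions. This is only a sketch with hypothetical ID vectors; apart from '4388161847', the values are placeholders and not taken from the data.

```r
# Hypothetical ID vectors standing in for unique(<dataset>$Id)
ids_activity <- c("1000000001", "1000000002", "4388161847")
ids_calories <- c("1000000001", "1000000002")

# IDs present in the activity data but absent from the calories data
only_in_activity <- setdiff(ids_activity, ids_calories)
only_in_activity

# TRUE if every calories ID also occurs in the activity data
all_known <- all(ids_calories %in% ids_activity)
all_known
```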

#### **3.1.8 Check time frame for each dataset** 

To ensure the data covers the intended period, the time range of each dataset is examined below.

In [None]:
#### Check time frame for each dataset ####
#=========================================#

summarise(Activity_d_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Calories_h_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Intensities_h_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Steps_h_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Calories_m_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Intensities_m_MarApr, min(ActivityTime), max(ActivityTime))
summarise(MET_m_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Sleep_m_MarApr, min(ActivityTime), max(ActivityTime))
summarise(Steps_m_MarApr, min(ActivityTime), max(ActivityTime)) 
summarise(Weight_MarApr, min(ActivityTime), max(ActivityTime)) 
summarise(HeartRate_s_MarApr, min(ActivityTime), max(ActivityTime))

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

The check has revealed several time-related issues with the datasets. The main anomalies found centre around two dates that should not be included in the data:
* '2016-03-11'
* '2016-04-12'

The first date most probably stems from participants going to bed before midnight, which triggered the sleep tracker before the official start of the period. To limit the data strictly to the period from '2016-03-12' to '2016-05-12' (as indicated in the dataset description; see 2.2.1), the entries for this date will be removed. 

The second date overlaps with the datasets from the second month (starting at '2016-04-12'); its entries will therefore also be removed.

</div>
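
As an alternative to text matching, the period restriction could be written as a comparison on dates. The sketch below uses a hypothetical toy table and assumes `dplyr` is installed; `period_start` and `period_end` are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical minute-level rows around the period boundaries
toy <- tibble(
  Id = "1000000001",
  ActivityTime = as.POSIXct(
    c("2016-03-11 23:58:00", "2016-03-12 00:01:00",
      "2016-04-11 23:59:00", "2016-04-12 00:00:00"),
    tz = "UTC"
  )
)

period_start <- as.Date("2016-03-12")
period_end   <- as.Date("2016-04-11")

# Keep only rows whose calendar date lies inside the first month
toy_clean <- toy %>%
  filter(as.Date(ActivityTime, tz = "UTC") >= period_start,
         as.Date(ActivityTime, tz = "UTC") <= period_end)

nrow(toy_clean)
```

Comparing dates rather than strings does not depend on how a date-time happens to be printed.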

In [None]:
# Check anomalies #

Sleep_m_MarApr %>% 
  filter(grepl("2016-03-11", ActivityTime)) 

Activity_d_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime)) 

Calories_h_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Intensities_h_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Steps_h_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Calories_m_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Intensities_m_MarApr %>%
  filter(grepl("2016-04-12", ActivityTime))

Sleep_m_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Steps_m_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

MET_m_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

Weight_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

HeartRate_s_MarApr %>% 
  filter(grepl("2016-04-12", ActivityTime))

In [None]:
# Exclude corresponding rows #

Sleep_m_MarApr <- 
  Sleep_m_MarApr[!grepl("2016-03-11", Sleep_m_MarApr$ActivityTime),]

Activity_d_MarApr <- 
  Activity_d_MarApr[!grepl("2016-04-12", Activity_d_MarApr$ActivityTime),]

Calories_h_MarApr <- 
  Calories_h_MarApr[!grepl("2016-04-12", Calories_h_MarApr$ActivityTime),]

Intensities_h_MarApr <- 
  Intensities_h_MarApr[!grepl("2016-04-12", Intensities_h_MarApr$ActivityTime),]

Steps_h_MarApr <- 
  Steps_h_MarApr[!grepl("2016-04-12", Steps_h_MarApr$ActivityTime),]

Calories_m_MarApr <- 
  Calories_m_MarApr[!grepl("2016-04-12", Calories_m_MarApr$ActivityTime),]

Intensities_m_MarApr <- 
  Intensities_m_MarApr[!grepl("2016-04-12", Intensities_m_MarApr$ActivityTime),]

MET_m_MarApr <- 
  MET_m_MarApr[!grepl("2016-04-12", MET_m_MarApr$ActivityTime),]

Sleep_m_MarApr <- 
  Sleep_m_MarApr[!grepl("2016-04-12", Sleep_m_MarApr$ActivityTime),]

Steps_m_MarApr <- 
  Steps_m_MarApr[!grepl("2016-04-12", Steps_m_MarApr$ActivityTime),]

Weight_MarApr <- 
  Weight_MarApr[!grepl("2016-04-12", Weight_MarApr$ActivityTime),]

HeartRate_s_MarApr <- 
  HeartRate_s_MarApr[!grepl("2016-04-12", HeartRate_s_MarApr$ActivityTime),]

#### **3.1.9 Check for NULLs or missing values in each dataset** 

In [None]:
#### Check for NULLs or missing values in each dataset ####
#=========================================================#

Activity_d_MarApr %>% 
  subset(!complete.cases(Activity_d_MarApr))

Calories_h_MarApr %>% 
  subset(!complete.cases(Calories_h_MarApr))

Intensities_h_MarApr %>% 
  subset(!complete.cases(Intensities_h_MarApr))

Steps_h_MarApr %>% 
  subset(!complete.cases(Steps_h_MarApr))

Calories_m_MarApr %>% 
  subset(!complete.cases(Calories_m_MarApr))

Intensities_m_MarApr %>% 
  subset(!complete.cases(Intensities_m_MarApr))

MET_m_MarApr %>% 
  subset(!complete.cases(MET_m_MarApr))

Sleep_m_MarApr %>% 
  subset(!complete.cases(Sleep_m_MarApr))

Steps_m_MarApr %>% 
  subset(!complete.cases(Steps_m_MarApr))

Weight_MarApr %>% 
  subset(!complete.cases(Weight_MarApr))  

HeartRate_s_MarApr %>% 
  subset(!complete.cases(HeartRate_s_MarApr))

# Check missing values #

count(Weight_MarApr, !complete.cases(Weight_MarApr)) 

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

The *Weight* dataset contains 29 missing values (NA), all in the 'Fat' column. The affected rows will be kept for analysis. 

</div>
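
To see which column the missing values sit in, they can be counted per column. A minimal sketch on a hypothetical weight log (all values are made up):

```r
# Hypothetical weight log with a partly missing 'Fat' column
toy_weight <- data.frame(
  Id = c("1000000001", "1000000001", "1000000002"),
  WeightKg = c(52.6, 52.4, 81.0),
  Fat = c(22, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_weight))
na_per_column
```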

#### **3.1.10 Check for duplicates in each dataset** 

In [None]:
#### Check for duplicates in each dataset ####
#============================================#

sum(duplicated(Activity_d_MarApr)) 
sum(duplicated(Calories_h_MarApr)) 
sum(duplicated(Intensities_h_MarApr)) 
sum(duplicated(Steps_h_MarApr)) 
sum(duplicated(Calories_m_MarApr)) 
sum(duplicated(Intensities_m_MarApr)) 
sum(duplicated(MET_m_MarApr)) 
sum(duplicated(Sleep_m_MarApr)) 
sum(duplicated(Steps_m_MarApr)) 
sum(duplicated(Weight_MarApr)) 
sum(duplicated(HeartRate_s_MarApr)) 

# Check and remove duplicates #

Sleep_m_MarApr %>% 
  filter(duplicated(Sleep_m_MarApr)) 

Sleep_m_MarApr <- Sleep_m_MarApr[!duplicated(Sleep_m_MarApr),]

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

*Sleep_m_MarApr* contained 525 duplicates, which have been removed.

</div>
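
The same removal could also be written with `dplyr::distinct()`, which keeps the first copy of each fully identical row. The sketch below runs on a hypothetical toy table, not on the actual sleep data.

```r
library(dplyr)

# Hypothetical sleep rows; the first two are exact duplicates
toy_sleep <- tibble(
  Id = c("1000000001", "1000000001", "1000000001"),
  ActivityTime = as.POSIXct(
    c("2016-03-13 01:00:00", "2016-03-13 01:00:00", "2016-03-13 01:01:00"),
    tz = "UTC"
  ),
  SleepStage = c(1L, 1L, 2L)
)

n_dupes <- sum(duplicated(toy_sleep))    # rows identical to an earlier row
toy_sleep_unique <- distinct(toy_sleep)  # one copy of each row remains

n_dupes
nrow(toy_sleep_unique)
```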

#### **3.1.11 Check for 0s in each dataframe** 

In [None]:
#### Check for 0s in each dataframe ####
#======================================#

filter_all(Activity_d_MarApr, any_vars(. ==0)) 
filter_all(Calories_h_MarApr, any_vars(. ==0))
filter_all(Intensities_h_MarApr, any_vars(. ==0))
filter_all(Steps_h_MarApr, any_vars(. ==0))
filter_all(Calories_m_MarApr, any_vars(. ==0))
filter_all(Intensities_m_MarApr, any_vars(. ==0))
filter_all(MET_m_MarApr, any_vars(. ==0))
filter_all(Sleep_m_MarApr, any_vars(. ==0))
filter_all(Steps_m_MarApr, any_vars(. ==0))
filter_all(Weight_MarApr, any_vars(. ==0))
filter_all(HeartRate_s_MarApr, any_vars(. ==0))


<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

*Activity_d_MarApr*, *Intensities_h_MarApr*, *Steps_h_MarApr*, *Calories_m_MarApr*, *Intensities_m_MarApr*, *MET_m_MarApr* and *Steps_m_MarApr* all contain zero values. These will be kept in mind and dealt with later where necessary.
</div>
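
To judge how widespread the zeros are, they can be counted per numeric column rather than listed row by row. A minimal sketch on a hypothetical toy table, assuming `dplyr` 1.0.0 or later for `across()`:

```r
library(dplyr)

# Hypothetical daily activity rows
toy_activity <- tibble(
  Id = c("1000000001", "1000000002", "1000000003"),
  TotalSteps = c(0L, 8500L, 0L),
  Calories = c(1800, 2300, 0)
)

# Count the zeros in every numeric column
zeros_per_column <- toy_activity %>%
  summarise(across(where(is.numeric), ~ sum(.x == 0)))

zeros_per_column
```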

#### **3.1.12 Drop/add columns and replace values where necessary** 

To make the data easier to work with the following operations will be performed:
* columns that are not needed for analysis will be dropped (columns/values affected by this will be adjusted)
* the 'Intensity' column in *Intensities_m_MarApr* will be duplicated and its values will be translated to strings for better understanding
* the 'METs' column in *MET_m_MarApr* will be divided by 10 to get accurate values (see [Fitabase Data Dictionary](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf))
* at the same time, all MET values below the minimum of 0.9 (sleeping state) will be filtered out
* the 'SleepStage' column in *Sleep_m_MarApr* will be duplicated and its values will be translated to strings for better understanding
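
The value translations listed above can be sketched on small vectors. The example below uses hypothetical values and a named lookup vector as one possible way to map codes to labels; the labels are the ones used in this notebook.

```r
# Lookup vector: names are the numeric codes, values are the labels
intensity_labels <- c("0" = "Sedentary", "1" = "Light",
                      "2" = "Moderate", "3" = "VeryActive")

intensity_codes <- c(0L, 3L, 1L, 0L, 2L)  # hypothetical raw codes
intensity_chr <- unname(intensity_labels[as.character(intensity_codes)])
intensity_chr

# Raw METs are stored multiplied by 10, so divide to obtain actual METs
mets_raw <- c(10, 8, 35)                  # hypothetical raw values
mets <- mets_raw / 10
mets_kept <- mets[mets >= 0.9]            # drop values below the sleeping state
mets_kept
```

A lookup vector keeps the code-to-label mapping in one place, which makes it easy to reuse for the second month.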


In [None]:
#### Drop/add columns and replace values where necessary ####
#===========================================================#

Intensities_m_MarApr <- mutate(Intensities_m_MarApr, "Intensity_chr" = Intensity)

Intensities_m_MarApr$Intensity_chr[Intensities_m_MarApr$Intensity_chr==0] <- "Sedentary"
Intensities_m_MarApr$Intensity_chr[Intensities_m_MarApr$Intensity_chr==1] <- "Light"
Intensities_m_MarApr$Intensity_chr[Intensities_m_MarApr$Intensity_chr==2] <- "Moderate"
Intensities_m_MarApr$Intensity_chr[Intensities_m_MarApr$Intensity_chr==3] <- "VeryActive"

#-----------------------------------------------------------------------------#

MET_m_MarApr <- 
  MET_m_MarApr %>% 
  mutate(METs = METs / 10) %>% # exported METs are stored multiplied by 10
  filter(METs >= 0.9) # 0.9 is the lowest possible value (= sleep)

#-----------------------------------------------------------------------------#

Sleep_m_MarApr <- select(Sleep_m_MarApr, -LogId)
Sleep_m_MarApr <- mutate(Sleep_m_MarApr, "SleepStage_chr" = SleepStage)

second(Sleep_m_MarApr$ActivityTime) <- 0 # to make minute dfs more consistent

Sleep_m_MarApr$SleepStage_chr[Sleep_m_MarApr$SleepStage_chr==1] <- "asleep"
Sleep_m_MarApr$SleepStage_chr[Sleep_m_MarApr$SleepStage_chr==2] <- "restless"
Sleep_m_MarApr$SleepStage_chr[Sleep_m_MarApr$SleepStage_chr==3] <- "awake"

#-----------------------------------------------------------------------------#

Weight_MarApr <- 
  Weight_MarApr %>%
  separate(ActivityTime, into = c("ActivityTime", "Time"), sep = 10) %>% 
  select(-Time, -WeightPounds, -LogId)  

Weight_MarApr$ActivityTime <- ymd(Weight_MarApr$ActivityTime)

#### **3.1.13 Create new tables** 

New dataframes that will prove useful for analysis and compatibility with the second month will be created.

In [None]:
#### Create new tables ####
#==========================#

# Daily dataframes #

Calories_d_MarApr <- 
  Calories_m_MarApr %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>% 
  summarise("Calories" = round(sum(Calories)))

#-----------------------------------------------------------------------------#

Intensities_d_MarApr <- 
  Intensities_m_MarApr %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime), Intensity_chr) %>% 
  count() %>% 
  pivot_wider(names_from = Intensity_chr , values_from = n) %>% 
  rename("SedentaryMinutes" = Sedentary, "LightlyActiveMinutes" = Light, 
         "FairlyActiveMinutes" = Moderate, "VeryActiveMinutes" = VeryActive) 

Intensities_d_MarApr[is.na(Intensities_d_MarApr)] <- 0

Intensities_d_MarApr <- Intensities_d_MarApr[,c(1,2,5,3,4,6)]

#-----------------------------------------------------------------------------#

MET_d_MarApr <-  
  MET_m_MarApr %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>%
  summarise("METs_d" = round(sum(METs))) %>% 
  mutate("Mean_METs_d" = round(METs_d/(60*24), digits = 2)) 

#-----------------------------------------------------------------------------#

Sleep_d_MarApr <- 
  Sleep_m_MarApr %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>% 
  select(-SleepStage)

Sleep_d_MarApr_asleep <- 
  Sleep_d_MarApr %>% 
  filter(SleepStage_chr=="asleep") %>% 
  group_by(Id, ActivityTime) %>% 
  count(SleepStage_chr)

Sleep_d_MarApr_inBed <- 
  Sleep_d_MarApr %>% 
  group_by(Id, ActivityTime) %>% 
  count(SleepStage_chr) %>% 
  summarise(sum(n)) %>% 
  print()

Sleep_d_MarApr <- 
  left_join(Sleep_d_MarApr_asleep, Sleep_d_MarApr_inBed, 
            by = c("Id", "ActivityTime")) %>% 
  rename("TotalMinutesAsleep" = n, "TotalTimeInBed" = `sum(n)`) %>% 
  select(-SleepStage_chr)

#-----------------------------------------------------------------------------#

Steps_d_MarApr <- 
  Steps_m_MarApr %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>% 
  summarise("TotalSteps" = sum(Steps))

# Create a table for Heart Rate per minute #

HeartRate_m_MarApr <- HeartRate_s_MarApr

second(HeartRate_m_MarApr$ActivityTime) <- 0

HeartRate_m_MarApr <-
  HeartRate_m_MarApr %>% 
  group_by(Id, ActivityTime) %>% 
  summarise("Mean_HeartR" = round(mean(HeartRate), digits = 2), 
            "Max_HeartR" = max(HeartRate), 
            "Min_HeartR" = min(HeartRate)) 

In [None]:
## Check and preview new dataframes ##

glimpse(Calories_d_MarApr)
glimpse(Intensities_d_MarApr)
glimpse(MET_d_MarApr) 
glimpse(Sleep_d_MarApr)
glimpse(Steps_d_MarApr)
glimpse(HeartRate_m_MarApr)

#### **3.1.14 Limit decimal places in each table** 

In [None]:
#### Limit decimal places in each table ####
#==========================================#

Activity_d_MarApr$LoggedActivitiesDistance <- 
  round(Activity_d_MarApr$LoggedActivitiesDistance, digits = 2)

Intensities_h_MarApr$AverageIntensity <- 
  round(Intensities_h_MarApr$AverageIntensity, digits = 2)

Calories_m_MarApr$Calories <- 
  round(Calories_m_MarApr$Calories, digits = 2)

#### **3.1.15 Double check cleaned and newly created dataframes** 

In [None]:
#### Double check cleaned and newly created dataframes ####
#=========================================================#

glimpse(Activity_d_MarApr)
glimpse(Calories_d_MarApr)
glimpse(Intensities_d_MarApr)
glimpse(MET_d_MarApr) 
glimpse(Sleep_d_MarApr)
glimpse(Steps_d_MarApr)

glimpse(Calories_h_MarApr)
glimpse(Intensities_h_MarApr)
glimpse(Steps_h_MarApr)

glimpse(Calories_m_MarApr)
glimpse(Intensities_m_MarApr)
glimpse(MET_m_MarApr)
glimpse(Sleep_m_MarApr)
glimpse(Steps_m_MarApr)

glimpse(Weight_MarApr)
glimpse(HeartRate_s_MarApr)
glimpse(HeartRate_m_MarApr)

#### **3.2 Second Month (April to May)** 

In the following, the same data cleaning process used for the first month will be applied to the second month.

*<u>Note:</u> Some redundant steps, such as loading packages or adjusting the time zone, will be repeated throughout the analysis process. This is to show exactly what is needed for each part of the analysis.* 

#### **3.2.1 Load packages** 

In [None]:
#### Load packages ####
#=====================#

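library("tidyverse")
library("lubridate") # provides tz(), mdy(), mdy_hms() and second(); attached explicitly in case the installed tidyverse version does not load it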

#### **3.2.2 Load in all data** 

Some datasets are provided as both wide and long formats. The wide formats will be excluded from the analysis process.

In [None]:
#### Load in all data ####
#========================#

Activity_d_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
Calories_d_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
Intensities_d_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
Sleep_d_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
Steps_d_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
Calories_h_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
Intensities_h_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
Steps_h_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
Calories_m_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")
Intensities_m_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv")
MET_m_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
Sleep_m_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
Steps_m_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv")
Weight_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
HeartRate_s_AprMay <- read.csv("fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")


#### **3.2.3 Preview the data** 

In [None]:
#### Preview the data ####
#========================#

glimpse(Activity_d_AprMay)
glimpse(Calories_d_AprMay)
glimpse(Intensities_d_AprMay)
glimpse(Sleep_d_AprMay)
glimpse(Steps_d_AprMay)
glimpse(Calories_h_AprMay)
glimpse(Intensities_h_AprMay)
glimpse(Steps_h_AprMay)
glimpse(Calories_m_AprMay)
glimpse(Intensities_m_AprMay)
glimpse(MET_m_AprMay)
glimpse(Sleep_m_AprMay)
glimpse(Steps_m_AprMay)
glimpse(Weight_AprMay)
glimpse(HeartRate_s_AprMay)

#### **3.2.4 Check timezones of data and computer and adjust if necessary** 

In [None]:
#### Check timezones of data and computer and adjust if necessary ####
#====================================================================#

tz(Activity_d_AprMay$ActivityDate)
tz(Calories_d_AprMay$ActivityDay)
tz(Intensities_d_AprMay$ActivityDay)
tz(Sleep_d_AprMay$SleepDay)
tz(Steps_d_AprMay$ActivityDay)
tz(Calories_h_AprMay$ActivityHour)
tz(Intensities_h_AprMay$ActivityHour)
tz(Steps_h_AprMay$ActivityHour)
tz(Calories_m_AprMay$ActivityMinute)
tz(Intensities_m_AprMay$ActivityMinute)
tz(MET_m_AprMay$ActivityMinute)
tz(Sleep_m_AprMay$date)
tz(Steps_m_AprMay$ActivityMinute)
tz(Weight_AprMay$Date)
tz(HeartRate_s_AprMay$Time)

Sys.time()
Sys.setenv(TZ = "UTC")

#### **3.2.5 Make column naming more consistent** 

A review of the datasets revealed several naming issues that will be fixed below:  
* inconsistent letter case
* inconsistent column naming
* vague column names 

In [None]:
#### Make column naming more consistent ####
#==========================================#

# Change column names to title case where necessary #

Sleep_m_AprMay <- rename_with(Sleep_m_AprMay, str_to_title)

Sleep_m_AprMay <- rename(Sleep_m_AprMay, "LogId" = Logid)

# Rename date column from each dataset #

Activity_d_AprMay <- rename(Activity_d_AprMay, "ActivityTime" = ActivityDate)
Calories_d_AprMay <- rename(Calories_d_AprMay, "ActivityTime" = ActivityDay)
Intensities_d_AprMay <- rename(Intensities_d_AprMay, "ActivityTime" = ActivityDay)
Sleep_d_AprMay <- rename(Sleep_d_AprMay, "ActivityTime" = SleepDay)
Steps_d_AprMay <- rename(Steps_d_AprMay, "ActivityTime" = ActivityDay)
Calories_h_AprMay <- rename(Calories_h_AprMay, "ActivityTime" = ActivityHour)
Intensities_h_AprMay <- rename(Intensities_h_AprMay, "ActivityTime" = ActivityHour)
Steps_h_AprMay <- rename(Steps_h_AprMay, "ActivityTime" = ActivityHour)
Calories_m_AprMay <- rename(Calories_m_AprMay, "ActivityTime" = ActivityMinute)
Intensities_m_AprMay <- rename(Intensities_m_AprMay, "ActivityTime" = ActivityMinute)
MET_m_AprMay <- rename(MET_m_AprMay, "ActivityTime" = ActivityMinute)
Sleep_m_AprMay <- rename(Sleep_m_AprMay, "ActivityTime" = Date)
Steps_m_AprMay <- rename(Steps_m_AprMay, "ActivityTime" = ActivityMinute)
Weight_AprMay <- rename(Weight_AprMay, "ActivityTime" = Date)
HeartRate_s_AprMay <- rename(HeartRate_s_AprMay, "ActivityTime" = Time)

# Specify vague column names where necessary #

Steps_d_AprMay <- rename(Steps_d_AprMay, "TotalSteps" = StepTotal)

Steps_h_AprMay <- rename(Steps_h_AprMay, "TotalSteps" = StepTotal) 

Sleep_m_AprMay <- rename(Sleep_m_AprMay, "SleepStage" = Value) 

HeartRate_s_AprMay <- rename(HeartRate_s_AprMay, "HeartRate" = Value) 

#### **3.2.6 Change data types** 

The data types of some columns (mainly 'Id' and 'ActivityTime') are impractical and will therefore be changed. 
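
The conversions can be illustrated on single values. This is a minimal sketch: the example strings are made up to resemble the formats in the export, and the variable names are hypothetical.

```r
library(lubridate)

# Illustrative values only (not taken from the data)
id_num   <- 1503960366
day_chr  <- "4/12/2016"
hour_chr <- "4/12/2016 1:00:00 PM"

# IDs are identifiers, not quantities, so they are stored as text
id_chr <- as.character(id_num)

# Month/day/year strings become Date or date-time objects
day_parsed  <- mdy(day_chr)       # Date
hour_parsed <- mdy_hms(hour_chr)  # POSIXct, UTC unless a tz is given

print(id_chr)
print(day_parsed)
print(hour_parsed)
```

Storing 'Id' as character prevents it from being summed or averaged by accident in later steps.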

In [None]:
#### Change data types ####
#=========================#

Activity_d_AprMay$Id <- as.character(Activity_d_AprMay$Id)
Activity_d_AprMay$ActivityTime <- mdy(Activity_d_AprMay$ActivityTime)

Calories_d_AprMay$Id <- as.character(Calories_d_AprMay$Id)
Calories_d_AprMay$ActivityTime <- mdy(Calories_d_AprMay$ActivityTime)

Intensities_d_AprMay$Id <- as.character(Intensities_d_AprMay$Id)
Intensities_d_AprMay$ActivityTime <- mdy(Intensities_d_AprMay$ActivityTime)

Sleep_d_AprMay$Id <- as.character(Sleep_d_AprMay$Id)
Sleep_d_AprMay$ActivityTime <- mdy_hms(Sleep_d_AprMay$ActivityTime)

Steps_d_AprMay$Id <- as.character(Steps_d_AprMay$Id)
Steps_d_AprMay$ActivityTime <- mdy(Steps_d_AprMay$ActivityTime)

Calories_h_AprMay$Id <- as.character(Calories_h_AprMay$Id)
Calories_h_AprMay$ActivityTime <- mdy_hms(Calories_h_AprMay$ActivityTime)

Intensities_h_AprMay$Id <- as.character(Intensities_h_AprMay$Id)
Intensities_h_AprMay$ActivityTime <- mdy_hms(Intensities_h_AprMay$ActivityTime)

Steps_h_AprMay$Id <- as.character(Steps_h_AprMay$Id)
Steps_h_AprMay$ActivityTime <- mdy_hms(Steps_h_AprMay$ActivityTime)

Calories_m_AprMay$Id <- as.character(Calories_m_AprMay$Id)
Calories_m_AprMay$ActivityTime <- mdy_hms(Calories_m_AprMay$ActivityTime)

Intensities_m_AprMay$Id <- as.character(Intensities_m_AprMay$Id)
Intensities_m_AprMay$ActivityTime <- mdy_hms(Intensities_m_AprMay$ActivityTime)

MET_m_AprMay$Id <- as.character(MET_m_AprMay$Id)
MET_m_AprMay$ActivityTime <- mdy_hms(MET_m_AprMay$ActivityTime)

Sleep_m_AprMay$Id <- as.character(Sleep_m_AprMay$Id)
Sleep_m_AprMay$ActivityTime <- mdy_hms(Sleep_m_AprMay$ActivityTime)

Steps_m_AprMay$Id <- as.character(Steps_m_AprMay$Id)
Steps_m_AprMay$ActivityTime <- mdy_hms(Steps_m_AprMay$ActivityTime)

Weight_AprMay$Id <- as.character(Weight_AprMay$Id)
Weight_AprMay$ActivityTime <- mdy_hms(Weight_AprMay$ActivityTime)

HeartRate_s_AprMay$Id <- as.character(HeartRate_s_AprMay$Id)
HeartRate_s_AprMay$ActivityTime <- mdy_hms(HeartRate_s_AprMay$ActivityTime)

#### **3.2.7 Check unique 'Id'-values for each dataset** 

To ensure the datasets' comparability, the unique ID-values will be checked below.  

In [None]:
#### Check unique 'Id'-values for each dataset ####
#=================================================#

n_distinct(Activity_d_AprMay$Id)
n_distinct(Calories_d_AprMay$Id)
n_distinct(Intensities_d_AprMay$Id)
n_distinct(Sleep_d_AprMay$Id)
n_distinct(Steps_d_AprMay$Id)
n_distinct(Calories_h_AprMay$Id)
n_distinct(Intensities_h_AprMay$Id)
n_distinct(Steps_h_AprMay$Id)
n_distinct(Calories_m_AprMay$Id)
n_distinct(Intensities_m_AprMay$Id)
n_distinct(MET_m_AprMay$Id)
n_distinct(Sleep_m_AprMay$Id)
n_distinct(Steps_m_AprMay$Id)
n_distinct(Weight_AprMay$Id)
n_distinct(HeartRate_s_AprMay$Id)


In [None]:
# Compare IDs to make sure they are consistent throughout each dataset #

actd_id_AprMay <- print(unique(Activity_d_AprMay$Id))
cald_id_AprMay <- print(unique(Calories_d_AprMay$Id))
intd_id_AprMay <- print(unique(Intensities_d_AprMay$Id))
sled_id_AprMay <- print(unique(Sleep_d_AprMay$Id))
sted_id_AprMay <- print(unique(Steps_d_AprMay$Id))
calh_id_AprMay <- print(unique(Calories_h_AprMay$Id))
inth_id_AprMay <- print(unique(Intensities_h_AprMay$Id))
steh_id_AprMay <- print(unique(Steps_h_AprMay$Id))
calm_id_AprMay <- print(unique(Calories_m_AprMay$Id))
intm_id_AprMay <- print(unique(Intensities_m_AprMay$Id))
metm_id_AprMay <- print(unique(MET_m_AprMay$Id))
slem_id_AprMay <- print(unique(Sleep_m_AprMay$Id))
stem_id_AprMay <- print(unique(Steps_m_AprMay$Id))
weig_id_AprMay <- print(unique(Weight_AprMay$Id))
hearts_id_AprMay <- print(unique(HeartRate_s_AprMay$Id))

match(actd_id_AprMay, cald_id_AprMay)
match(actd_id_AprMay, intd_id_AprMay)
match(actd_id_AprMay, sled_id_AprMay) 
match(actd_id_AprMay, sted_id_AprMay)
match(actd_id_AprMay, calh_id_AprMay)
match(actd_id_AprMay, inth_id_AprMay)
match(actd_id_AprMay, steh_id_AprMay)
match(actd_id_AprMay, calm_id_AprMay)
match(actd_id_AprMay, intm_id_AprMay)
match(actd_id_AprMay, metm_id_AprMay)
match(actd_id_AprMay, slem_id_AprMay) 
match(actd_id_AprMay, stem_id_AprMay)
match(actd_id_AprMay, weig_id_AprMay) 
match(actd_id_AprMay, hearts_id_AprMay) 


<div style = "background-color: #e6f2ff">


#### **<u>Findings</u>**

The check reveals data for a total of 33 participants. Again, this is inconsistent with the dataset description (cf. 2.2.1), as there are 3 more participants than expected. It also differs from the first month, which involved a total of 35 users, meaning that two users only tracked (or submitted) data for the first half of the period in question. As before, the same participants recur throughout the different datasets, apart from the fluctuations caused by tracking behaviour (see 3.1.7 for reference). Most datasets contain data for 33 participants, except *Sleep* (24), *HeartRate* (14) and *Weight* (8). 

</div>
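
The ID comparison can also be expressed with set operations, which list the missing participants directly instead of showing match positions. A minimal sketch on made-up IDs (the vectors `ids_activity` and `ids_weight` are hypothetical and not taken from the data):

```r
# Toy ID vectors (hypothetical values for illustration only)
ids_activity <- c("1001", "1002", "1003", "1004")
ids_weight   <- c("1002", "1004")

# IDs present in the reference set but missing from the other dataset
missing_in_weight <- setdiff(ids_activity, ids_weight)

# TRUE only if every ID of the smaller dataset also occurs in the reference set
all_known <- all(ids_weight %in% ids_activity)

print(missing_in_weight)
print(all_known)
```

If `all_known` is `TRUE`, the smaller dataset is a true subset of the reference IDs, which is the pattern described in the findings above.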

#### **3.2.8 Check time frame for each dataset** 

To ensure that the data covers the intended period, the time range of each dataset is examined below.

In [None]:
#### Check time frame for each dataset ####
#=========================================#

summarise(Activity_d_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Calories_d_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Intensities_d_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Sleep_d_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Steps_d_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Calories_h_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Intensities_h_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Steps_h_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Calories_m_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Intensities_m_AprMay, min(ActivityTime), max(ActivityTime))
summarise(MET_m_AprMay, min(ActivityTime), max(ActivityTime))
summarise(Sleep_m_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Steps_m_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(Weight_AprMay, min(ActivityTime), max(ActivityTime)) 
summarise(HeartRate_s_AprMay, min(ActivityTime), max(ActivityTime)) 


<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

Checking the time range of each dataset revealed one date that should not be included according to the data description:
* '2016-04-11'

Since this date is already covered by the first month (again, presumably resulting from users who went to bed before 12 AM and thus triggered the tracker), the corresponding data entries will be removed. 

</div>
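
One possible alternative to text matching is to compare calendar dates directly. The following is only a sketch on a toy dataframe (`sleep_toy` is a made-up stand-in for *Sleep_m_AprMay*):

```r
library(dplyr)
library(lubridate)

# Toy minute-level sleep data (values are made up)
sleep_toy <- tibble(
  Id = c("1001", "1001", "1001"),
  ActivityTime = ymd_hms(c("2016-04-11 23:58:00",
                           "2016-04-12 00:01:00",
                           "2016-04-12 00:02:00"), tz = "UTC"),
  SleepStage = c(1, 1, 2)
)

# Keep only rows whose calendar date is not 2016-04-11
sleep_clean <- sleep_toy %>%
  filter(as.Date(ActivityTime) != as.Date("2016-04-11"))

print(sleep_clean)
```

Comparing dates rather than strings does not depend on how a timestamp happens to be printed.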

In [None]:
# Check and remove anomalies #

Sleep_m_AprMay %>% 
  filter(grepl("2016-04-11", ActivityTime)) 

Sleep_m_AprMay <- 
  Sleep_m_AprMay[!grepl("2016-04-11", Sleep_m_AprMay$ActivityTime),]

#### **3.2.9 Check for NULLs or missing values in each dataset** 

In [None]:
#### Check for NULLs or missing values in each dataset ####
#=========================================================#

Activity_d_AprMay %>% 
  subset(!complete.cases(Activity_d_AprMay))

Calories_d_AprMay %>% 
  subset(!complete.cases(Calories_d_AprMay))

Intensities_d_AprMay %>% 
  subset(!complete.cases(Intensities_d_AprMay))

Sleep_d_AprMay %>% 
  subset(!complete.cases(Sleep_d_AprMay))

Steps_d_AprMay %>% 
  subset(!complete.cases(Steps_d_AprMay))

Calories_h_AprMay %>% 
  subset(!complete.cases(Calories_h_AprMay))

Intensities_h_AprMay %>% 
  subset(!complete.cases(Intensities_h_AprMay))

Steps_h_AprMay %>% 
  subset(!complete.cases(Steps_h_AprMay))

Calories_m_AprMay %>% 
  subset(!complete.cases(Calories_m_AprMay))

Intensities_m_AprMay %>% 
  subset(!complete.cases(Intensities_m_AprMay))

MET_m_AprMay %>% 
  subset(!complete.cases(MET_m_AprMay))

Sleep_m_AprMay %>% 
  subset(!complete.cases(Sleep_m_AprMay))

Steps_m_AprMay %>% 
  subset(!complete.cases(Steps_m_AprMay))

Weight_AprMay %>% 
  subset(!complete.cases(Weight_AprMay)) 

HeartRate_s_AprMay %>% 
  subset(!complete.cases(HeartRate_s_AprMay))

# Check missing values #

count(Weight_AprMay, !complete.cases(Weight_AprMay)) 


<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

The *Weight* dataset contains 65 rows with NULLs, all in the 'Fat' column. As before, these will be kept for analysis. 

</div>
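
To see which column the missing values come from, they can be counted per column. A minimal sketch with a made-up stand-in for the *Weight* data (`weight_toy` and its values are hypothetical):

```r
# Toy stand-in for the Weight data (values are made up)
weight_toy <- data.frame(
  Id       = c("1001", "1002", "1003"),
  WeightKg = c(52.6, 60.1, 72.0),
  Fat      = c(22, NA, NA)
)

# Number of missing values per column
na_per_column <- colSums(is.na(weight_toy))

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(weight_toy))

print(na_per_column)
print(rows_with_na)
```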

#### **3.2.10 Check for duplicates in each dataset** 

In [None]:
#### Check for duplicates in each dataset ####
#============================================#

sum(duplicated(Activity_d_AprMay)) 
sum(duplicated(Calories_d_AprMay))
sum(duplicated(Intensities_d_AprMay))
sum(duplicated(Sleep_d_AprMay))  
sum(duplicated(Steps_d_AprMay))
sum(duplicated(Calories_h_AprMay)) 
sum(duplicated(Intensities_h_AprMay)) 
sum(duplicated(Steps_h_AprMay)) 
sum(duplicated(Calories_m_AprMay)) 
sum(duplicated(Intensities_m_AprMay)) 
sum(duplicated(MET_m_AprMay)) 
sum(duplicated(Sleep_m_AprMay)) 
sum(duplicated(Steps_m_AprMay)) 
sum(duplicated(Weight_AprMay)) 
sum(duplicated(HeartRate_s_AprMay)) 

# Check and remove duplicates #

Sleep_d_AprMay %>% 
  filter(duplicated(Sleep_d_AprMay)) 

Sleep_d_AprMay <- Sleep_d_AprMay[!duplicated(Sleep_d_AprMay),]

#-----------------------------------------------------------------------------#

Sleep_m_AprMay %>% 
  filter(duplicated(Sleep_m_AprMay)) 

Sleep_m_AprMay <- Sleep_m_AprMay[!duplicated(Sleep_m_AprMay),]

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

*Sleep_d_AprMay* contained 3, *Sleep_m_AprMay* contained 543 duplicates, which have been removed.

</div>
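
The same check-and-remove pattern can be sketched with `distinct()`, which keeps the first occurrence of every fully identical row. The dataframe `sleep_toy` below is made up for illustration:

```r
library(dplyr)

# Toy daily sleep data with one exact duplicate (values are made up)
sleep_toy <- tibble(
  Id = c("1001", "1001", "1002"),
  ActivityTime = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 327, 410)
)

# Count rows that repeat an earlier row in every column
n_dupes <- sum(duplicated(sleep_toy))

# Drop the repeats, keeping the first occurrence of each row
sleep_unique <- distinct(sleep_toy)

print(n_dupes)
print(sleep_unique)
```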

#### **3.2.11 Check for 0s in each dataframe** 

In [None]:
#### Check for 0s in each dataframe ####
#======================================#

filter_all(Activity_d_AprMay, any_vars(. ==0)) 
filter_all(Calories_d_AprMay, any_vars(. ==0))
filter_all(Intensities_d_AprMay, any_vars(. ==0))
filter_all(Sleep_d_AprMay, any_vars(. ==0))
filter_all(Steps_d_AprMay, any_vars(. ==0))
filter_all(Calories_h_AprMay, any_vars(. ==0))
filter_all(Intensities_h_AprMay, any_vars(. ==0))
filter_all(Steps_h_AprMay, any_vars(. ==0))
filter_all(Calories_m_AprMay, any_vars(. ==0))
filter_all(Intensities_m_AprMay, any_vars(. ==0))
filter_all(MET_m_AprMay, any_vars(. ==0))
filter_all(Sleep_m_AprMay, any_vars(. ==0))
filter_all(Steps_m_AprMay, any_vars(. ==0))
filter_all(Weight_AprMay, any_vars(. ==0))
filter_all(HeartRate_s_AprMay, any_vars(. ==0))


<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

*Activity_d_AprMay*, *Calories_d_AprMay*, *Intensities_d_AprMay*, *Steps_d_AprMay*, *Intensities_h_AprMay*, *Steps_h_AprMay*, *Calories_m_AprMay*, *Intensities_m_AprMay*, *MET_m_AprMay*, *Steps_m_AprMay* all contain zero values, which will be kept in mind and be dealt with later where necessary.
</div>

#### **3.2.12 Drop/add columns and replace values where necessary** 

To make the data easier to work with the following operations will be performed:
* columns that are not needed for analysis will be dropped (columns/values affected by this will be adjusted)
* the 'Intensity' column in *Intensities_m_AprMay* will be duplicated and its values will be translated to strings for better understanding
* the 'METs' column in *MET_m_AprMay* will be divided by 10 to get accurate values (see [Fitabase Data Dictionary](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf))
* at the same time, all MET values below the minimum of 0.9 (sleeping state) will be filtered out
* the 'SleepStage' column in *Sleep_m_AprMay* will be duplicated and its values will be translated to strings for better understanding
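
The recoding and rescaling steps listed above can be sketched on toy data. The lookup vector `intensity_labels` is a hypothetical helper that is not part of the original code. It assumes the codes 0 to 3 map to the labels in this order.

```r
# Toy minute-level intensity codes (values are made up)
intensity_toy <- data.frame(Intensity = c(0, 1, 3, 2, 0))

# Hypothetical lookup: labels in the order of the codes 0, 1, 2, 3
intensity_labels <- c("Sedentary", "Light", "Moderate", "VeryActive")

# Codes start at 0 while R indexes from 1, so shift by one
intensity_toy$Intensity_chr <- intensity_labels[intensity_toy$Intensity + 1]

# MET values are stored multiplied by 10; divide to get the real value,
# then keep only values at or above the sleeping-state minimum of 0.9
met_toy <- data.frame(METs = c(10, 8, 35))
met_toy$METs <- met_toy$METs / 10
met_toy <- met_toy[met_toy$METs >= 0.9, , drop = FALSE]

print(intensity_toy)
print(met_toy)
```

A lookup vector keeps the numeric column untouched and writes the labels to a separate text column in one step.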

In [None]:
#### Drop/add columns and replace values where necessary #### 
#===========================================================#

Intensities_d_AprMay <- 
  select(Intensities_d_AprMay, -SedentaryActiveDistance, -LightActiveDistance,
         -ModeratelyActiveDistance, -VeryActiveDistance)

#-----------------------------------------------------------------------------#

Sleep_d_AprMay <- select(Sleep_d_AprMay, -TotalSleepRecords) 

#-----------------------------------------------------------------------------#

Intensities_m_AprMay <- mutate(Intensities_m_AprMay, "Intensity_chr" = Intensity)

Intensities_m_AprMay$Intensity_chr[Intensities_m_AprMay$Intensity_chr==0] <- "Sedentary"
Intensities_m_AprMay$Intensity_chr[Intensities_m_AprMay$Intensity_chr==1] <- "Light"
Intensities_m_AprMay$Intensity_chr[Intensities_m_AprMay$Intensity_chr==2] <- "Moderate"
Intensities_m_AprMay$Intensity_chr[Intensities_m_AprMay$Intensity_chr==3] <- "VeryActive"

#-----------------------------------------------------------------------------#

MET_m_AprMay <- 
  mutate(MET_m_AprMay, METs/10) %>% 
  select(-METs) %>% 
  rename("METs" = `METs/10`)

MET_m_AprMay <- 
  MET_m_AprMay %>% 
  filter(METs >= 0.9) # filter because lowest value possible (= sleep)

#-----------------------------------------------------------------------------#

Sleep_m_AprMay <- select(Sleep_m_AprMay, -LogId)
Sleep_m_AprMay <- mutate(Sleep_m_AprMay, "SleepStage_chr" = SleepStage)

second(Sleep_m_AprMay$ActivityTime) <- 0 # to make minute dfs more consistent

Sleep_m_AprMay$SleepStage_chr[Sleep_m_AprMay$SleepStage_chr==1] <- "asleep"
Sleep_m_AprMay$SleepStage_chr[Sleep_m_AprMay$SleepStage_chr==2] <- "restless"
Sleep_m_AprMay$SleepStage_chr[Sleep_m_AprMay$SleepStage_chr==3] <- "awake"

#-----------------------------------------------------------------------------#

Weight_AprMay <- 
  Weight_AprMay %>% 
  separate(ActivityTime, into = c("ActivityTime", "Time"), sep = 10) %>% 
  select(-WeightPounds, -LogId, -Time)

Weight_AprMay$ActivityTime <- ymd(Weight_AprMay$ActivityTime)

#### **3.2.13 Create new tables** 

New dataframes that will prove useful for analysis and compatibility with the first month will be created.
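
The per-minute heart rate aggregation can be illustrated on a few made-up readings. This sketch uses `floor_date()` to truncate timestamps to the minute, which is an assumption of the example rather than the approach taken in the notebook:

```r
library(dplyr)
library(lubridate)

# Toy second-level heart rate readings (values are made up)
hr_toy <- tibble(
  Id = "1001",
  ActivityTime = ymd_hms(c("2016-04-12 07:21:00", "2016-04-12 07:21:05",
                           "2016-04-12 07:22:10"), tz = "UTC"),
  HeartRate = c(97, 101, 105)
)

# Truncate each timestamp to its minute, then summarise per user and minute
hr_minute <- hr_toy %>%
  mutate(ActivityTime = floor_date(ActivityTime, unit = "minute")) %>%
  group_by(Id, ActivityTime) %>%
  summarise(Mean_HeartR = round(mean(HeartRate), digits = 2),
            Max_HeartR  = max(HeartRate),
            Min_HeartR  = min(HeartRate),
            .groups = "drop")

print(hr_minute)
```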

In [None]:
#### Create new tables #### 
#=========================#

# Daily dataframe for MET #

MET_d_AprMay <-  
  MET_m_AprMay %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>%
  summarise("METs_d" = round(sum(METs))) %>% 
  mutate("Mean_METs_d" = round(METs_d/(60*24), digits = 2)) 

# Create a table for Heart Rate per minute #

HeartRate_m_AprMay <- HeartRate_s_AprMay

second(HeartRate_m_AprMay$ActivityTime) <- 0

HeartRate_m_AprMay <-
  HeartRate_m_AprMay %>% 
  group_by(Id, ActivityTime) %>% 
  summarise("Mean_HeartR" = round(mean(HeartRate), digits = 2), 
            "Max_HeartR" = max(HeartRate), 
            "Min_HeartR" = min(HeartRate)) 

In [None]:
## Check and preview new dataframe ##

glimpse(MET_d_AprMay)
glimpse(HeartRate_m_AprMay)

#### **3.2.14 Limit decimal places in each table** 

In [None]:
#### Limit decimal places in each table ####
#==========================================#

Activity_d_AprMay$LoggedActivitiesDistance <- 
  round(Activity_d_AprMay$LoggedActivitiesDistance, digits = 2)

Intensities_h_AprMay$AverageIntensity <- 
  round(Intensities_h_AprMay$AverageIntensity, digits = 2)

Calories_m_AprMay$Calories <- 
  round(Calories_m_AprMay$Calories, digits = 2)

#### **3.2.15 Double check cleaned and newly created dataframes** 

In [None]:
#### Double check cleaned and newly created dataframes ####
#=========================================================#

glimpse(Activity_d_AprMay)
glimpse(Calories_d_AprMay)
glimpse(Intensities_d_AprMay)
glimpse(MET_d_AprMay)
glimpse(Sleep_d_AprMay)
glimpse(Steps_d_AprMay)

glimpse(Calories_h_AprMay)
glimpse(Intensities_h_AprMay)
glimpse(Steps_h_AprMay)

glimpse(Calories_m_AprMay)
glimpse(Intensities_m_AprMay)
glimpse(MET_m_AprMay)
glimpse(Sleep_m_AprMay)
glimpse(Steps_m_AprMay)

glimpse(Weight_AprMay)
glimpse(HeartRate_s_AprMay)
glimpse(HeartRate_m_AprMay)

#### **3.3 Data Binding**

Now that both months have been cleaned, the only thing left to do is binding them together, which will be done in the upcoming section. To give the final dataframes more structure they will also be arranged by 'Id' and 'ActivityTime'.
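
A minimal sketch of the binding step on toy data (both dataframes and their values are made up): `bind_rows()` matches columns by name, and `arrange()` then sorts the combined rows.

```r
library(dplyr)

# Toy daily step data for the two months (values are made up)
steps_first  <- tibble(Id = c("1002", "1001"),
                       ActivityTime = as.Date(c("2016-03-12", "2016-03-12")),
                       TotalSteps = c(4000, 9000))
steps_second <- tibble(Id = c("1001", "1002"),
                       ActivityTime = as.Date(c("2016-04-12", "2016-04-12")),
                       TotalSteps = c(11000, 5000))

# Stack both months, then order by user and date
steps_all <- bind_rows(steps_first, steps_second) %>%
  arrange(Id, ActivityTime)

print(steps_all)
```

Because columns are matched by name, both months need identical column names and compatible data types, which is why the same cleaning steps were applied to each month.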

#### **3.3.1 Bind respective dataframes from both months together**

In [None]:
#### Bind respective dataframes from both months together ####
#============================================================#

# Daily Dataframes #

Activity_d <- 
  bind_rows(Activity_d_MarApr, Activity_d_AprMay) %>% 
  arrange(Id, ActivityTime)

Calories_d <- 
  bind_rows(Calories_d_MarApr, Calories_d_AprMay) %>% 
  arrange(Id, ActivityTime)

Intensities_d <- 
  bind_rows(Intensities_d_MarApr, Intensities_d_AprMay) %>% 
  arrange(Id, ActivityTime)

MET_d <- 
  bind_rows(MET_d_MarApr, MET_d_AprMay) %>% 
  arrange(Id, ActivityTime)

Sleep_d <- 
  bind_rows(Sleep_d_MarApr, Sleep_d_AprMay) %>% 
  arrange(Id, ActivityTime)

Steps_d <- 
  bind_rows(Steps_d_MarApr, Steps_d_AprMay) %>% 
  arrange(Id, ActivityTime)

Weight <- 
  bind_rows(Weight_MarApr, Weight_AprMay) %>% 
  arrange(Id, ActivityTime)

# Hourly Dataframes #

Calories_h <- 
  bind_rows(Calories_h_MarApr, Calories_h_AprMay) %>% 
  arrange(Id, ActivityTime)

Intensities_h <- 
  bind_rows(Intensities_h_MarApr, Intensities_h_AprMay) %>% 
  arrange(Id, ActivityTime)

Steps_h <- 
  bind_rows(Steps_h_MarApr, Steps_h_AprMay) %>% 
  arrange(Id, ActivityTime)

# Minute Dataframes #

Calories_m <- 
  bind_rows(Calories_m_MarApr, Calories_m_AprMay) %>% 
  arrange(Id, ActivityTime)

Intensities_m <- 
  bind_rows(Intensities_m_MarApr, Intensities_m_AprMay) %>% 
  arrange(Id, ActivityTime)

MET_m <- 
  bind_rows(MET_m_MarApr, MET_m_AprMay) %>% 
  arrange(Id, ActivityTime)

Sleep_m <- 
  bind_rows(Sleep_m_MarApr, Sleep_m_AprMay) %>% 
  arrange(Id, ActivityTime)

Steps_m <- 
  bind_rows(Steps_m_MarApr, Steps_m_AprMay) %>% 
  arrange(Id, ActivityTime)

HeartRate_m <- 
  bind_rows(HeartRate_m_MarApr, HeartRate_m_AprMay) %>% 
  arrange(Id, ActivityTime)

#### **3.3.2 Preview bound dataframes**

In [None]:
#### Preview bound dataframes ####
#================================#

glimpse(Activity_d)
glimpse(Calories_d)
glimpse(Intensities_d)
glimpse(MET_d)
glimpse(Sleep_d)
glimpse(Steps_d)
glimpse(Weight)

glimpse(Calories_h)
glimpse(Intensities_h)
glimpse(Steps_h)

glimpse(Calories_m)
glimpse(Intensities_m)
glimpse(MET_m)
glimpse(Sleep_m)
glimpse(Steps_m)
glimpse(HeartRate_m)

#### **3.3.3 Check each dataframe for problematic values**

Before moving on to the analysis, it seems advisable to take one final look at the data itself. A simple function that outputs both the maximum and minimum values for each column (in each dataset) will be used to check for any implausible or illogical values.  
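
A minimal sketch of such a per-column check on toy data. The helper `f_max_min_col` and the dataframe `activity_toy` are hypothetical names used only for this illustration:

```r
# Toy dataframe with mixed column types (values are made up)
activity_toy <- data.frame(
  Id         = c("1001", "1002", "1003"),
  TotalSteps = c(9, 10000, 250),
  Fat        = c(22, NA, 25)
)

# Hypothetical helper: maximum and minimum of one column, ignoring NAs
f_max_min_col <- function(x) {
  c(max = max(x, na.rm = TRUE), min = min(x, na.rm = TRUE))
}

# lapply() visits the columns one by one, so numbers are compared as numbers
ranges <- lapply(activity_toy, f_max_min_col)

print(ranges)
```

Working column by column keeps each column's own data type. Matrix-based helpers such as `apply()` first convert a mixed-type dataframe to a character matrix, so their results are returned as text.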

In [None]:
#### Check each dataframe for problematic values ####
#===================================================#

# na.rm = TRUE so that columns with missing values (e.g. 'Fat') still return a result
f_max_min <- function(x){c(max(x, na.rm = TRUE), min(x, na.rm = TRUE))}

# lapply() checks each column in its own data type; apply() would first
# convert the whole dataframe to a character matrix
lapply(Activity_d, f_max_min) 
lapply(Calories_d, f_max_min) 
lapply(Intensities_d, f_max_min) 
lapply(MET_d, f_max_min) 
lapply(Sleep_d, f_max_min) 
lapply(Steps_d, f_max_min) 
lapply(Weight, f_max_min) 

lapply(Calories_h, f_max_min) 
lapply(Intensities_h, f_max_min) 
lapply(Steps_h, f_max_min) 

lapply(Calories_m, f_max_min) 
lapply(Intensities_m, f_max_min) 
lapply(MET_m, f_max_min) 
lapply(Sleep_m, f_max_min) 
lapply(Steps_m, f_max_min) 
lapply(HeartRate_m, f_max_min) 


<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

A check revealed several problematic values that need to be filtered out. These include:
* TotalSteps (0 steps per day seems practically impossible)
* Calories (0 calories per day is impossible)
* SedentaryMinutes (1440 minutes of sedentary time, i.e. sleeping or sitting, seems impossible and is more likely the default value when the device was active but not worn)
* TotalIntensity (in conjunction with sedentary minutes, a combined total intensity of 0 seems impossible) 
* Mean_METs_d (an average MET value below 0.9, the amount of energy burned during sleep, is impossible)

All days with problematic values will be filtered out. It is important to note that these days also need to be removed from the smaller (hour and minute) dataframes. This will be done by grouping the dataframes by day and then filtering on their summed values. 

Another problematic case is the 'TotalTimeInBed' (or 'TotalMinutesAsleep') column from the *Sleep_d* dataframe, as 6 minutes of total sleeping time appears to be extremely low. However, since it is impossible to filter sleeping time without bias (where should the line be drawn between reasonable and unreasonable sleeping times?) and the 6 minutes are theoretically possible (if unlikely), they will remain unaltered.

</div>
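
The idea of removing whole days from a minute-level dataframe can be sketched on toy data (`steps_toy` is made up). The sketch ends with `ungroup()`, which is a choice of this example and may differ from the grouping kept in the notebook's own code:

```r
library(dplyr)

# Toy minute-level step data: one day with steps, one day without (made up)
steps_toy <- tibble(
  Id = "1001",
  ActivityTime = as.POSIXct(c("2016-04-12 08:00:00", "2016-04-12 08:01:00",
                              "2016-04-13 08:00:00", "2016-04-13 08:01:00"),
                            tz = "UTC"),
  Steps = c(0, 12, 0, 0)
)

steps_kept <- steps_toy %>%
  group_by(Id, Day = as.Date(ActivityTime)) %>%  # helper column for the calendar day
  filter(sum(Steps) != 0) %>%                    # drop whole days without any steps
  ungroup() %>%
  select(-Day)                                   # remove the helper column again

print(steps_kept)
```

Single minutes with 0 steps on an otherwise active day are kept, because the condition is evaluated on the daily sum rather than on each row.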

In [None]:
# Check and remove problematic cases where necessary #

## Daily dataframes ##

Activity_d %>% 
  filter(TotalSteps==0)

Activity_d <- filter(Activity_d, TotalSteps!=0)

Activity_d %>% 
  filter(SedentaryMinutes==1440)

Activity_d <- filter(Activity_d, SedentaryMinutes!=1440)

Activity_d %>% 
  filter(Calories==0)

Activity_d %>%    
  mutate("SumMinutes" = VeryActiveMinutes+FairlyActiveMinutes+
           LightlyActiveMinutes+SedentaryMinutes) %>% 
  filter(SumMinutes!=1440)

#-----------------------------------------------------------------------------#

Calories_d %>% 
  filter(Calories==0)

Calories_d <- filter(Calories_d, Calories!=0)

#-----------------------------------------------------------------------------#

Intensities_d %>% 
  filter(SedentaryMinutes==1440)

Intensities_d <- filter(Intensities_d, SedentaryMinutes!=1440)

Intensities_d %>%    
  mutate("SumMinutes" = VeryActiveMinutes+FairlyActiveMinutes+
           LightlyActiveMinutes+SedentaryMinutes) %>% 
  filter(SumMinutes!=1440)

#-----------------------------------------------------------------------------#

MET_d %>% 
  filter(Mean_METs_d < 0.9)

MET_d <-
  MET_d %>% 
  filter(Mean_METs_d >= 0.9)

#-----------------------------------------------------------------------------#

Steps_d %>% 
  filter(TotalSteps==0)

Steps_d <- filter(Steps_d, TotalSteps!=0)

## Hourly dataframes ##

Intensities_h %>% 
  group_by(Id, "Day" =  as.Date(ActivityTime)) %>% 
  filter(sum(TotalIntensity)==0)

Intensities_h <- 
  Intensities_h %>%
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(TotalIntensity)!=0) %>% 
  group_by(Id, ActivityTime) %>% 
  select(-Day)

#-----------------------------------------------------------------------------#

Steps_h %>% 
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(TotalSteps)==0)

Steps_h <- 
  Steps_h %>%
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(TotalSteps)!=0) %>%
  group_by(Id, ActivityTime) %>% 
  select(-Day)

## Minute dataframes ##

Calories_m %>% 
  filter(Calories==0)

Calories_m <- filter(Calories_m, Calories!=0)

#-----------------------------------------------------------------------------#

Intensities_m %>% 
  group_by(Id, "Day" =  as.Date(ActivityTime)) %>% 
  filter(sum(Intensity)==0)

Intensities_m <- 
  Intensities_m %>%
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(Intensity)!=0) %>% 
  group_by(Id, ActivityTime) %>% 
  select(-Day)

#-----------------------------------------------------------------------------#

MET_m %>% 
  group_by(Id, "Day" = as.Date(ActivityTime)) %>%
  mutate(MeanMET = sum(METs)/(60*24)) %>% 
  filter(MeanMET < 0.9)

MET_m <-
  MET_m %>% 
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% # helper column, so the minute timestamps in 'ActivityTime' are not overwritten
  mutate(MeanMET = sum(METs)/(60*24)) %>% 
  filter(MeanMET >= 0.9) %>% 
  group_by(Id, ActivityTime) %>% 
  select(-Day)

#-----------------------------------------------------------------------------#

Steps_m %>% 
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(Steps)==0)

Steps_m <- 
  Steps_m %>%
  group_by(Id, "Day" = as.Date(ActivityTime)) %>% 
  filter(sum(Steps)!=0) %>% 
  group_by(Id, ActivityTime) %>% 
  select(-Day)

<div style = "background-color: #ADE8F4">

## **4. Analyse-Phase**
</div>

The analysis will focus on two main aspects:
* 1.) The device usage (i.e. how often/how long certain features were tracked and how they compare)
* 2.) The performance (i.e. the physical activity and health states of the participants)

The chapter is split accordingly.

#### **4.1 Analysis: Usage**
#### **4.1.1 Load packages** 

In [None]:
#### Load packages ####
#=====================#

library("tidyverse")
library("plotrix")
library("viridis")
library("MESS")

#### **4.1.2 Change timezone to UTC** 

In [None]:
#### Change timezone to UTC ####
#==============================#

Sys.setenv(TZ = "UTC")

#### **4.1.3 Create dataframe for feature usage** 

To check how many times each user tracked a certain feature (e.g. Calories, Steps, Heart Rate), a dataframe consisting of only the ID, activity time and a feature class will be created for each feature. These will then be bound together and used for further processing. For this purpose, one final dataframe for daily (mean) heart rates, which has not been introduced previously, will be created.  
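
The construction of such a feature table can be sketched with two toy dataframes (all names and values below are made up). Columns are selected by name here, which is an assumption of the sketch:

```r
library(dplyr)

# Toy daily dataframes (values are made up)
steps_toy <- tibble(Id = c("1001", "1002"),
                    ActivityTime = as.Date(c("2016-04-12", "2016-04-12")),
                    TotalSteps = c(9000, 4000))
sleep_toy <- tibble(Id = "1001",
                    ActivityTime = as.Date("2016-04-12"),
                    TotalMinutesAsleep = 327)

# Keep ID and date only, and tag every row with its feature name
usage_steps <- steps_toy %>%
  select(Id, ActivityTime) %>%
  mutate(Feature = "Steps", .before = ActivityTime)

usage_sleep <- sleep_toy %>%
  select(Id, ActivityTime) %>%
  mutate(Feature = "Sleep", .before = ActivityTime)

# One long table with a row per user, feature and tracked day
usage_toy <- bind_rows(usage_steps, usage_sleep)

print(usage_toy)
```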

In [None]:
#### Create dataframe for feature usage ####
#==========================================#

# Create daily dataframe for HeartRate #

HeartRate_d <-  
  HeartRate_m %>% 
  group_by(Id, "ActivityTime" = as.Date(ActivityTime)) %>%
  summarise("Mean_HeartRate_d" = round(mean(Mean_HeartR), digits = 2),
            "Max_HeartRate_d" = max(Max_HeartR),
            "Min_HeartRate_d" = min(Min_HeartR)) # daily minimum is taken from the per-minute minima

# Create and prepare dataframes for binding #

Usage_Distance <-
  Activity_d %>% 
  filter(TrackerDistance!=0) %>% 
  select(1,2)  %>% 
  mutate("Feature" = "Distance", .before = ActivityTime) 

Usage_LogActivity <-
  Activity_d %>% 
  filter(LoggedActivitiesDistance!=0) %>% 
  select(1,2)  %>% 
  mutate("Feature" = "LogActivity", .before = ActivityTime) 

Usage_Calories <-
  Calories_d %>% 
  select(1,2) %>% 
  mutate("Feature" = "Calories", .before = ActivityTime)

Usage_Intensities <-  
  Intensities_d %>%
  select(1,2) %>% 
  mutate("Feature" = "Intensities", .before = ActivityTime)

Usage_MET <-
  MET_d %>% 
  select(1,2) %>% 
  mutate("Feature" = "MET", .before = ActivityTime)

Usage_Sleep <-
  Sleep_d %>% 
  select(1,2) %>% 
  mutate("Feature" = "Sleep", .before = ActivityTime)

Usage_Steps <-
  Steps_d %>% 
  select(1,2) %>% 
  mutate("Feature" = "Steps", .before = ActivityTime)

Usage_Weight <-
  Weight %>% 
  select(1,2) %>% 
  mutate("Feature" = "Weight", .before = ActivityTime)

Usage_Fat <-
  Weight %>% 
  filter(!is.na(Fat)) %>% 
  select(1,2) %>% 
  mutate("Feature" = "Fat", .before = ActivityTime)

Usage_BMI <- 
  Weight %>% 
  select(1,2) %>% 
  mutate("Feature" = "BMI", .before = ActivityTime)

Usage_HeartRate <- 
  HeartRate_d %>% 
  select(1,2) %>% 
  mutate("Feature" = "HeartRate", .before = ActivityTime)

# Bind dataframes together #

Usage_All <-
  bind_rows(Usage_Distance, Usage_LogActivity, Usage_Calories, Usage_Intensities,
            Usage_MET, Usage_Sleep, Usage_Steps, Usage_Weight, Usage_Fat, 
            Usage_BMI, Usage_HeartRate)

#### **4.1.4 Create dataframes for days tracked per feature** 

With the bound dataframe from the previous step, the number of days each feature was tracked can now be counted per user. The results are stored in both long and wide format for different purposes. The feature levels and columns are also put in a fixed order, which comes in handy for plotting.
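
As an illustration of the long and wide formats, here is a minimal sketch on invented toy data. It differs from the notebook's code in two assumed ways: `n_distinct()` counts a day only once even if a feature has several entries on that day, and `values_fill` in `pivot_wider()` fills untracked features with 0 directly.

```r
library(dplyr)
library(tidyr)

# Toy usage table; user "A" has two Weight entries on the same day
usage_toy <- data.frame(
  Id = c("A", "A", "A", "A", "B"),
  Feature = c("Steps", "Steps", "Weight", "Weight", "Steps"),
  ActivityTime = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12",
                           "2016-04-12", "2016-04-12"))
)

# Long format: number of distinct days per user and feature
days_long <-
  usage_toy %>%
  group_by(Id, Feature) %>%
  summarise(DaysTracked = n_distinct(ActivityTime), .groups = "drop")

# Wide format: one column per feature, untracked features filled with 0
days_wide <-
  days_long %>%
  pivot_wider(names_from = Feature, values_from = DaysTracked,
              values_fill = 0L)

print(days_long)
print(days_wide)
```

Whether distinct days or raw rows are the better measure depends on how multiple entries per day should be treated.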

In [None]:
#### Create dataframes for days tracked per feature ####
#======================================================#

# Long data #

DaysTracked_All_long <-
  Usage_All %>% 
  group_by(Id, Feature) %>% 
  count(ActivityTime) %>% 
  summarise("DaysTracked" = sum(n)) 

DaysTracked_All_long$Feature <-
  factor(DaysTracked_All_long$Feature, 
         levels = c("Fat", "LogActivity", "BMI", "Weight", "HeartRate", "Sleep",
                    "Distance", "Steps", "Intensities", "MET", "Calories"))

# Wide data #

DaysTracked_All <- 
  DaysTracked_All_long %>% 
  pivot_wider(names_from = Feature, values_from = DaysTracked)

DaysTracked_All[is.na(DaysTracked_All)] <- 0

DaysTracked_All <- DaysTracked_All[,c(1,4,12,3,6,7,8,9,10,5,2,11)]

#### **4.1.5 Preview new dataframes** 

In [None]:
#### Preview new dataframes ####
#==============================#

glimpse(Usage_All)
glimpse(DaysTracked_All_long)
glimpse(DaysTracked_All)

#### **4.1.6 Create plots for feature usage** 

In [None]:
#### Create plots for feature usage ####
#======================================#

# Create single plot that is applied to each feature #

DaysTracked_plots <- 
  lapply(
    names(DaysTracked_All)[-13], \(var)
    
    ggplot(DaysTracked_All, aes(x = Id,
                                y = .data[[var]],
                                fill = .data[[var]])) + 
      geom_col(width = 0.8) +
      ylim(0, 65) +
      theme_minimal() +
      labs(title = "Days Tracked per User") +
      theme(plot.title = element_text(size = 20, face = "bold"),
            legend.title = element_text(size = 14),
            legend.text = element_text(size = 12),
            axis.title = element_text(size = 15),
            axis.text.x = element_text(angle = 90, size = 12),
            axis.text.y = element_text(size = 12)) +

      geom_hline(yintercept = mean(DaysTracked_All[[var]]),
                 colour = "plum2",
                 linewidth = 1) +
      
      annotate("text",                                                           # add mean value to trend line
               x = 37,                                                      
               y = mean(DaysTracked_All[[var]])+2,
               label =  round(mean(DaysTracked_All[[var]]), digits = 1),
               colour = "plum2", 
               fontface = "bold", 
               size = 5.5) +
      
      coord_cartesian(clip = "off") +
      
      labs(tag = sprintf("nUsers: %i",                                           # add tag for number of users
                         DaysTracked_All %>%                                     # who used feature
                           select(Id, all_of(var)) %>%                           # all_of() selects a column by its name string
                           filter(.data[[var]] != 0) %>%
                           nrow()
      )
      ) + 
      
      theme(plot.tag.position = "topright",                                      # adjust parameters of tag
            plot.tag = element_text(colour = "plum2", 
                                    size = 18))
  )

In [None]:
# View plot for each feature #

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

print(DaysTracked_plots[[2]])  # Distance
print(DaysTracked_plots[[3]])  # LoggedActivity
print(DaysTracked_plots[[4]])  # Calories
print(DaysTracked_plots[[5]])  # Intensities
print(DaysTracked_plots[[6]])  # MET
print(DaysTracked_plots[[7]])  # Sleep
print(DaysTracked_plots[[8]])  # Steps
print(DaysTracked_plots[[9]])  # Weight
print(DaysTracked_plots[[10]]) # Fat
print(DaysTracked_plots[[11]]) # BMI
print(DaysTracked_plots[[12]]) # HeartRate

In [None]:
# Combine and compare results #

## Set colour palette for "features" ##

clr_features <-
  c("red", "coral2", "orange", "yellow", "green3","aquamarine2", "turquoise3",
    "dodgerblue1", "mediumpurple", "hotpink", "plum1")

## Stacked bar graph ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(DaysTracked_All_long, aes(Id, DaysTracked, fill = Feature)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = clr_features) +
  labs(title = "Total Days Tracked per User") +
  theme_classic() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),        
        axis.text.x = element_text(angle = 90, size = 12), 
        axis.text.y = element_text(size = 12))

## Circular stacked bar graph ##

options(repr.plot.width = 12, repr.plot.height = 10) # adjust plot size for jupyter notebook

ggplot(DaysTracked_All_long, aes(x = Id, y = DaysTracked, fill = Feature)) +
  scale_fill_manual(values = clr_features) +
  geom_bar(stat = "identity", size = 5) +
  ylim(-50, 650) +
  theme_minimal() +
  coord_radial(start = 0) +
  guides(theta = guide_axis_theta(angle = 90)) +
  labs(title = "Total Days Tracked per User") +
  theme(
    plot.title = element_text(hjust = 0.5, vjust = 1, size = 20, face = "bold"),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),      
    axis.title = element_blank(),
    axis.text.x = element_text(size = 11),
    axis.text.y = element_blank(),
    legend.position = "right"
  ) +
  annotate("text", x = rep(11, 3), y = c(200, 400, 600), 
           label = c("200", "400", "600"), colour = "grey", size = 5)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

<u>Days each feature was tracked on average</u>

* Distance (34.8)
* LogActivity (1.5)
* Calories (55.3)
* Intensities (48.4)
* MET (54.5)
* Sleep (24.6)
* Steps (48.4)
* Weight (2.8)
* Fat (0.1)
* BMI (2.8)
* HeartRate (13.4)

<u>How many users used a feature</u>

* Distance (34)
* LogActivity (7)
* Calories (35)
* Intensities (35)
* MET (34)
* Sleep (25)
* Steps (35)
* Weight (13)
* Fat (3)
* BMI (13)
* HeartRate (15)

<u>Other</u>
* most users tracked their activity regularly
* most users have between 275 and 375 total (daily) trackings (the theoretical maximum is 682, i.e. 11 features × 62 days)
* the tracking activity of '2891001357' and '6391747486' is relatively low; these are the two users that only tracked activity for the first month (see 3.1.7 or 3.2.7 respectively)
* the most used features are (see also 4.1.8):
  * Calories
  * METs
  * Intensities
  * Steps 


</div>

#### **4.1.7 Examine user engagement rate** 

In [None]:
#### Examine user engagement rate ####
#====================================#

# Calculate percentages of user engagement for each feature #

nUsers <- colSums(DaysTracked_All!=0)

Distance_perc <- as.numeric(round((100/35)*nUsers[2], digits = 2)) 
LogActivity_perc <- as.numeric(round((100/35)*nUsers[3], digits = 2))
Calories_perc <- as.numeric(round((100/35)*nUsers[4], digits = 2))
Intensities_perc <- as.numeric(round((100/35)*nUsers[5], digits = 2))
MET_perc <- as.numeric(round((100/35)*nUsers[6], digits = 2))
Sleep_perc <- as.numeric(round((100/35)*nUsers[7], digits = 2))
Steps_perc <- as.numeric(round((100/35)*nUsers[8], digits = 2))
Weight_perc <- as.numeric(round((100/35)*nUsers[9], digits = 2))
Fat_perc <- as.numeric(round((100/35)*nUsers[10], digits = 2))
BMI_perc <- as.numeric(round((100/35)*nUsers[11], digits = 2))
HeartRate_perc <- as.numeric(round((100/35)*nUsers[12], digits = 2))

# Create a dataframe consisting of features' names and their corresponding #
# engagement percentages #

UserEngagement_Features <- 
  c("Distance", "LoggedActivity", "Calories", "Intensities", "MET",
    "Sleep", "Steps", "Weight", "Fat", "BMI", "HeartRate") 

UserEngagement_Percentages <- 
  c(Distance_perc, LogActivity_perc, Calories_perc, Intensities_perc,
    MET_perc, Sleep_perc, Steps_perc, Weight_perc, Fat_perc, BMI_perc,
    HeartRate_perc)

UserEngagement <- data.frame(UserEngagement_Features, UserEngagement_Percentages)

UserEngagement <- arrange(UserEngagement, UserEngagement_Percentages)

In [None]:
# Create plot for user engagement per feature #

## Check and alter margin values ##

par("mar") 
default_par <- par(mar = c(5.1, 4.1, 4.1, 2.1)) # save as default

par(mar = c(0,0,0,0))

## Create circular plot ##

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

Plot_UserEngagement <- 
  function(x, labels, 
           colors = c("red", "coral2", "yellow", "orange", "green3","aquamarine2",
                      "turquoise3", "hotpink", "plum1", "mediumpurple", 
                      "dodgerblue1"), 
           cex.lab = 0.65) {
    require(plotrix)
    plot(0, 
         xlim = c(-1.75, 1.75), 
         ylim = c(-1.75, 1.75), 
         type = "n", 
         axes = F, 
         xlab = NA, 
         ylab = NA)
    
    radii <- seq(1.025, 0.22, length.out = length(x))
    draw.circle(0, 0, radii, border = "lightgrey")
    angles <- (1/4 - x)*2*pi
    draw.arc(0, 0, radii, angles, pi/2, 
             col = colors, 
             lwd = 175/length(x), 
             lend = 2, 
             n = 100)
    ymult <- (par("usr")[4]-par("usr")[3])/
      (par("usr")[2]-par("usr")[1])*par("pin")[1]/par("pin")[2]
    text(x = -0.02, 
         y = radii*ymult, 
         labels = paste(labels," - ", x*100, "%", sep = ""), 
         pos = 2, 
         cex = 1)
  }

Plot_UserEngagement(UserEngagement$UserEngagement_Percentages/100, 
                    UserEngagement$UserEngagement_Features)

# Base graphics calls are issued one after another (they are not chained with "+")
title(main = "User Engagement Ratio per Feature", line = -7.5, cex.main = 2)
text(0, 0.05, "User", cex = 0.9, col = "grey")
text(0, -0.05, "Engagement", cex = 0.9, col = "grey")

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

In conjunction with the findings from 4.1.6 the respective percentage values for user engagement per feature (i.e. x% of the users used feature y) are:

* Distance (97.14%)
* LogActivity (20.00%)
* Calories (100.00%)
* Intensities (100.00%)
* MET (97.14%)
* Sleep (71.43%)
* Steps (100.00%)
* Weight (37.14%)
* Fat (8.57%)
* BMI (37.14%)
* HeartRate (42.86%)

<u>Summary</u>

All users tracked their calories, intensities and steps, and almost all also tracked their distance and MET values. Nearly three quarters of the users tracked their sleep. Heart rate, weight and BMI were tracked by roughly two fifths of the users, while activity logs were used by only one fifth. The least-used feature is fat (8.57%).

</div>
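
The engagement percentages can also be derived in one vectorised step. The following is a minimal sketch in base R on an invented toy table (`days_toy` and its values are not the notebook's data); it divides by `nrow()` rather than a fixed user count, which is an assumption about how the denominator should be defined.

```r
# Toy wide table: days tracked per user and feature (values are invented)
days_toy <- data.frame(
  Id = c("A", "B", "C", "D"),
  Steps = c(30, 62, 12, 5),
  Sleep = c(0, 40, 0, 20),
  Fat = c(0, 0, 0, 1)
)

# Number of users with at least one tracked day, per feature
n_users <- colSums(days_toy[-1] != 0)

# Share of users per feature, sorted in ascending order
engagement <- data.frame(
  Feature = names(n_users),
  Percentage = round(100 * n_users / nrow(days_toy), digits = 2),
  row.names = NULL
)
engagement <- engagement[order(engagement$Percentage), ]
print(engagement)

# The same formula applied to a count reported above: 34 of 35 users
print(round(100 * 34 / 35, digits = 2))
```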

#### **4.1.8 Compare total number of trackings for each feature** 

In [None]:
#### Compare total number of trackings for each feature ####
#==========================================================#

# Prepare dataframe #

Total_nTrackings <- 
  DaysTracked_All_long %>% 
  group_by(Feature) %>% 
  summarise("nTrackings" = sum(DaysTracked)) %>% 
  arrange(nTrackings)

# Create plot #

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

Total_nTrackings %>% 
  ggplot() +
  geom_col(mapping = aes(x = Feature, y = nTrackings, fill = Feature)) +
  scale_fill_manual(values = clr_features) +
  ylim(0, (35*62)) +
  labs(title = "Total Number of Trackings for Each Feature") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        axis.text = element_text(size = 12),
        legend.position = "none") +
  geom_text(aes(x = Feature, 
                y = nTrackings+100, 
                label = nTrackings),
                size = 5)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

The most tracked feature is 'Calories', closely followed by 'MET', while 'Intensities' and 'Steps' are not far behind. All other features show a clear drop in usage by comparison. A likely reason is that these four features are tracked automatically and need no user activation, which would explain why features that do need activation, such as 'HeartRate', show fewer trackings overall. For 'Sleep', the data suggests that the device was not regularly worn during sleep, so sleep was often not recorded. The gap is even more pronounced for the remaining features, which are tracked only on rare occasions. These seem to require manual logging, which most users did not engage in.

</div>

#### **4.1.9 Examine the percentage distribution** 

In [None]:
#### Examine the percentage distribution ####
#===========================================#

# Create dataframe with percentages where smaller fractions are grouped together # 

Summarised_nTrackings <-
  Total_nTrackings %>% 
  group_by(Feature = fct_lump_min(Feature, 100, nTrackings)) %>% 
  summarise("nTrackings" = sum(nTrackings)) 

Summarised_nTrackings <-
  Summarised_nTrackings %>% 
  mutate("Percentage" = ((100/sum(Summarised_nTrackings$nTrackings))
                         *Summarised_nTrackings$nTrackings))

Summarised_nTrackings$Percentage <- 
  round(Summarised_nTrackings$Percentage, digits = 2)

Summarised_nTrackings <- arrange(Summarised_nTrackings, -nTrackings)

# Create own dataframe for the smaller fractions #

## Look up necessary values ##

Total_nTrackings %>%                              
  mutate("Percentage" = ((100/sum(Total_nTrackings$nTrackings))
                         *Total_nTrackings$nTrackings)) 

## Create dataframe and add dummy values for visualisation ##

Other_nTrackings1 <- c("Fat", "LogActivity", "BMI", "Weight", "")
Other_nTrackings2 <- c("0.04 %", "0.53 %", "0.98 %", "0.98 %", "")
Other_nTrackings3 <- c(4, 53, 98, 98, 900)

Other_nTrackings <- 
  data.frame("Feature" = Other_nTrackings1, 
             "Percentage" = Other_nTrackings2, 
             "Count" = Other_nTrackings3)

## Set colour palettes for pie charts ##

Clr_nTrackingsPie_all <-
  c("plum1", "hotpink", "mediumpurple", "dodgerblue1", "turquoise3", 
    "aquamarine2", "green3", "bisque1")

Clr_nTrackingsPie_other <- 
  c("red", "coral2", "orange", "yellow", "white")

In [None]:
## Create pie charts showing the total tracking distribution ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

{
  # For side-by-side view of both charts
  
  par(mfrow=c(1,2), mar = c(0,0,0,0)) 
  
  # Pie chart for total tracking distribution
  
  pie(Summarised_nTrackings$nTrackings, 
      labels = paste0(Summarised_nTrackings$Percentage, " %"), 
      col = Clr_nTrackingsPie_all, 
      border = "white", 
      radius = 1, 
      cex = 1.5) 
  
  legend("bottomleft", 
         legend = Summarised_nTrackings$Feature, 
         ncol = 2, 
         bty = "n", 
         fill = Clr_nTrackingsPie_all, 
         border = "black", 
         cex = 1.5)
  
  title("Total Trackings Distribution", cex.main = 1.75, line = -3.5)
  
  # Pie chart for slice 'Other'
  
  pie(Other_nTrackings$Count,                  
      labels = Other_nTrackings$Percentage, 
      col = Clr_nTrackingsPie_other, 
      border = "white", 
      radius = 1,
      cex = 1.5)
  
  legend("bottomleft", 
         legend = Other_nTrackings$Feature, 
         ncol = 2, 
         bty = "n", 
         fill = Clr_nTrackingsPie_other, 
         border = c(rep("black", 4), "white"), 
         cex = 1.5)
  
  title("Composition of group 'Other'", cex.main = 1.75, line = -3.5)
  }

dev.off()

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

In conjunction with 4.1.8 the corresponding percentages for the usage ratio are:

* Distance (12.13%)
* LogActivity (0.53%)
* Calories (19.31%)
* Intensities (16.89%)
* MET (19.02%)
* Sleep (8.57%)
* Steps (16.88%)
* Weight (0.98%)
* Fat (0.04%)
* BMI (0.98%)
* HeartRate (4.68%)

</div>
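
The shares for the small group could also be computed from the totals instead of being typed in by hand. Below is a minimal sketch in base R with invented totals (`totals_toy` is hypothetical); the threshold of 100 mirrors the value passed to `fct_lump_min()` above.

```r
# Toy totals per feature (invented numbers, not the notebook's results)
totals_toy <- data.frame(
  Feature = c("Calories", "Steps", "Sleep", "Weight", "Fat"),
  nTrackings = c(500, 400, 80, 15, 5)
)

# Share of each feature in all trackings
totals_toy$Percentage <-
  round(100 * totals_toy$nTrackings / sum(totals_toy$nTrackings), digits = 2)

# Features below the threshold form the group 'Other'
threshold <- 100
other_toy <- totals_toy[totals_toy$nTrackings < threshold, ]

# Labels with two decimals for the pie chart
other_toy$Label <- paste0(format(other_toy$Percentage, nsmall = 2), " %")
print(other_toy)
```

Deriving the labels this way keeps the second pie chart in sync if the underlying counts change.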

#### **4.1.10 Examine how much time each feature was used over the entire period** 

In [None]:
#### Examine how much time each feature was used over the entire period ####
#==========================================================================#

# Create and bind dataframes #

Usage_Time_Calories <-
  Calories_m %>% 
  mutate("Feature" = "Calories") %>% 
  select(-Calories)

Usage_Time_Intensities <-
  Intensities_m %>% 
  mutate("Feature" = "Intensities", .after = ActivityTime) %>% 
  select(1:3)

Usage_Time_MET <-
  MET_m %>% 
  mutate("Feature" = "MET", .after = ActivityTime) %>% 
  select(1:3)

Usage_Time_Sleep <-
  Sleep_m %>% 
  mutate("Feature" = "Sleep", .after = ActivityTime) %>% 
  select(1:3)

Usage_Time_Steps <-
  Steps_m %>% 
  mutate("Feature" = "Steps") %>% 
  select(-Steps)

Usage_Time_HeartRate <-
  HeartRate_m %>% 
  mutate("Feature" = "HeartRate", .after = ActivityTime) %>% 
  select(1:3)

#-----------------------------------------------------------------------------#

Usage_Time_All <-
  bind_rows(
    Usage_Time_Calories,
    Usage_Time_Intensities,
    Usage_Time_MET,
    Usage_Time_Sleep,
    Usage_Time_Steps,
    Usage_Time_HeartRate
  )

# Prepare dataframe for plotting #

Usage_Time_All <-
  Usage_Time_All %>% 
  group_by(Feature) %>% 
  count(ActivityTime, name = "Minutes_Used") %>% 
  summarise("Minutes_Used" = sum(Minutes_Used)) %>% 
  mutate("Total_Minutes_Observed" = 62*24*60*35) %>% 
  mutate("UsagePercentage" = round((100/Total_Minutes_Observed)*Minutes_Used, 
                                   digits = 2))

Usage_Time_All$Feature <- factor(Usage_Time_All$Feature, 
                                 levels = c("HeartRate", "Sleep", "MET", 
                                            "Calories", "Intensities", "Steps"))

Usage_Time_All <- arrange(Usage_Time_All, Feature)

In [None]:
# Create plot #

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

par(mar = c(0,0,0,0))

Plot_MinutesUsed <- 
  function(x, labels, 
           colors = c("green3", "aquamarine2", "hotpink", "plum1", 
                      "mediumpurple","dodgerblue1"), 
           cex.lab=0.65) {
    require(plotrix)
    plot(0, 
         xlim = c(-1.75, 1.75), 
         ylim = c(-1.75, 1.75), 
         type = "n", 
         axes = F, 
         xlab = NA, 
         ylab = NA)
    
    radii <- seq(0.75, 0.22, length.out = length(x))
    draw.circle(0, 0, radii, border = "lightgrey")
    angles <- (1/4 - x)*2*pi
    draw.arc(0, 0, radii, angles, pi/2, 
             col = colors, 
             lwd = 100/length(x), 
             lend = 2, 
             n = 100)
    ymult <- (par("usr")[4]-par("usr")[3])/
      (par("usr")[2]-par("usr")[1])*par("pin")[1]/par("pin")[2]
    text(x = -0.02, 
         y = radii*ymult, 
         labels = paste(
           labels," - ", x*100, 
           "% -  -  -  -  -  -  -  -  -  -  -  -  -  -  -", sep = ""), 
         pos = 2, 
         cex = 1.25)
  }

Plot_MinutesUsed(Usage_Time_All$UsagePercentage/100, 
                 Usage_Time_All$Feature)

# Base graphics calls are issued one after another (they are not chained with "+")
text(0, 0.05, "Time", cex = 1.2, col = "grey")
text(0, -0.05, "Used", cex = 1.2, col = "grey")

title(main = "Amount of Time Each Feature was Used", cex.main = 2, line = -7.5)

legend("bottomright", 
       ncol = 3, 
       cex= 1.1, 
       box.lty = 3, 
       text.width = c(0.25,0.25,0.4),
       legend = c(
         "", c("", substitute(paste(italic("HeartRate"))),      # Col 1
               substitute(paste(italic("Sleep"))), 
               substitute(paste(italic("MET"))),
               substitute(paste(italic("Calories"))),
               substitute(paste(italic("Intensities"))),
               substitute(paste(italic("Steps")))),
         substitute(paste(bold("Minutes"))),                    # Col 2
         substitute(paste(bold("Used"))), 
         Usage_Time_All$Minutes_Used[1:6],
         substitute(paste(bold("Minutes"))),                    # Col 3
         substitute(paste(bold("Observed"))), 
         Usage_Time_All$Total_Minutes_Observed[1:6]
       )
)

dev.off()

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* The feature **'HeartRate'** was used **15.1 %** of the time (a total of **471810** minutes) during the observed period of two months
* The feature **'Sleep'** was used **12.21 %** of the time (a total of **381614** minutes) during the observed period of two months
* The feature **'MET'** was used **87.74 %** of the time (a total of **2741738** minutes) during the observed period of two months
* The feature **'Calories'** was used **88.33 %** of the time (a total of **2760108** minutes) during the observed period of two months
* The feature **'Intensities'** was used **77.07 %** of the time (a total of **2408340** minutes) during the observed period of two months
* The feature **'Steps'** was used **76.93 %** of the time (a total of **2404020** minutes) during the observed period of two months

</div>
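
The percentages above follow from simple arithmetic: the observed window is 62 days × 24 hours × 60 minutes × 35 users, and each feature's minutes are divided by that total. A short check in base R, using the minute counts reported in the findings:

```r
# Observation window: 62 days at minute resolution for 35 users
total_minutes <- 62 * 24 * 60 * 35

# Minutes used per feature, copied from the findings above
minutes_used <- c(HeartRate = 471810, Sleep = 381614, MET = 2741738,
                  Calories = 2760108, Intensities = 2408340, Steps = 2404020)

# Share of the observed minutes in which each feature recorded data
usage_percentage <- round(100 * minutes_used / total_minutes, digits = 2)

print(total_minutes)
print(usage_percentage)
```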

#### **4.1.11 Examine relation between tracking activity and time** 

In [None]:
#### Examine relation between tracking activity and time ####
#===========================================================#

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

# Tracking Activity over the entire period of time #

Usage_All %>% 
  group_by(Id, ActivityTime) %>% 
  count(ActivityTime, name = "ActivityCount") %>% 
  ggplot() +
  geom_tile(mapping = aes(x = Id, y = ActivityTime, fill = ActivityCount)) +
  labs(title = "Tracking Activity Over the Whole Period") +
  scale_fill_viridis(option = "magma") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(angle = 90, size = 12),
        axis.text.y = element_text(size = 12))

# Activity per weekday #

Usage_All %>% 
  mutate("Weekday" = weekdays(ActivityTime), .after = ActivityTime) %>% 
  group_by(Id, Weekday) %>% 
  count(Weekday, name = "ActivityCount") %>% 
  ggplot() +
  geom_tile(mapping = aes(x = Weekday, y = Id, fill = ActivityCount)) +
  scale_fill_viridis(option = "viridis") +
  scale_x_discrete(limits = c("Monday", "Tuesday", "Wednesday", "Thursday", 
                              "Friday", "Saturday", "Sunday")) +
  labs(title = "Tracking Activity per Weekday") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20,face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_blank(),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))

# Comparison of activity levels between weekdays #

Usage_All %>% 
  mutate("Weekday" = weekdays(ActivityTime), .after = ActivityTime) %>% 
  group_by(Weekday) %>% 
  count(name = "ActivityCount")  %>% 
  ggplot(aes(Weekday, ActivityCount)) +
  geom_bar(stat = "identity", fill = "steelblue2", width = 0.8, colour = "black") +
  geom_text(aes(label = signif(ActivityCount)), nudge_y = 50, size = 5) +
  scale_x_discrete(limits = c("Monday", "Tuesday", "Wednesday", "Thursday", 
                              "Friday", "Saturday", "Sunday")) +
  labs(title = "Total Tracking Activity per Weekday") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        axis.text = element_text(size = 12))

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* The tracking activity does not show any major patterns. Some users used the device in much the same way throughout the entire period, while for others the tracking activity is rather volatile. The only noticeable trend is an increase (or stabilisation) of tracking activity after the first two weeks. This might suggest that the participants first had to get used to the device or to tracking their activity.

* There are no apparent patterns for tracking activity based on the weekday. The tracking activity is more or less split evenly across every weekday. The only day on which activity was tracked (or the device used) less is Friday.

</div>
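
One caveat for the weekday plots: `weekdays()` returns day names in the session's locale, so the hard-coded English axis limits only match in an English locale. A minimal sketch of a locale-independent alternative in base R is shown below; the example dates are chosen purely for illustration.

```r
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Example dates (chosen for illustration)
dates <- as.Date(c("2016-04-11", "2016-04-12", "2016-04-17"))

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), independent of locale
wday_index <- as.POSIXlt(dates)$wday
wday_index[wday_index == 0] <- 7      # move Sunday to the end of the week

# Ordered factor with fixed English labels
weekday <- factor(weekday_levels[wday_index], levels = weekday_levels)
print(weekday)
```

Because the result is a factor with fixed levels, the plotting order is Monday to Sunday regardless of the system language.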

#### **4.1.12 Examine how long the device was worn on average**

To determine how much the device was worn per day, the sum of minutes spent in the different intensity levels will be compared. The *Intensities_d* dataframe is to be used for this, since it offers more data entries than the *Activity_d* dataframe.
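
The binning step can be illustrated on a few invented values. This is a minimal sketch in base R: `tracked_minutes` is toy data, and plain `round()` is used in place of the notebook's `round_percent()`.

```r
# Toy daily totals of tracked minutes (invented values)
tracked_minutes <- c(1440, 1440, 1000, 700, 200, 450)

# cut() uses right-closed bins by default:
# (-Inf, 300], (300, 600], (600, 900], (900, 1439], (1439, 1440]
wear_range <- cut(
  tracked_minutes,
  breaks = c(-Inf, 300, 600, 900, 1439, 1440),
  labels = c("< 5h", "5-10h", "10-15h", "15 < 24h", "24h")
)

# Frequency and percentage per bin
wear_table <- as.data.frame(table(wear_range), responseName = "Frequency")
wear_table$Percentage <-
  round(100 * wear_table$Frequency / sum(wear_table$Frequency), digits = 2)
print(wear_table)
```

With these breaks only a total of exactly 1440 minutes counts as a full day, and any total above 1440 would become `NA`.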

In [None]:
#### Examine how long the device was worn on average ####
#=======================================================#

# Prepare dataframes #

## Dataframe for time grouped ##

Grouped_TrackedMinutes <-
  Intensities_d %>% 
  mutate("TotalTrackingMinutes" = 
           SedentaryMinutes+LightlyActiveMinutes+
           FairlyActiveMinutes+VeryActiveMinutes, .before = SedentaryMinutes) %>% 
  select(1:3)

Grouped_TrackedMinutes <-
  setNames(
    data.frame(
      table(
        cut(
          Grouped_TrackedMinutes$TotalTrackingMinutes,
          breaks = c(-Inf, 300, 600, 900, 1439, 1440),
          labels = c("< 5h", "5-10h", "10-15h", "15 < 24h", "24h")
        )
      )
    ) , c("Range", "Frequency")
  ) %>% 
  mutate("Percentage" = round_percent(Frequency, 2)) 

Grouped_TrackedMinutes <- arrange(Grouped_TrackedMinutes, -Percentage) 

Grouped_TrackedMinutes <- 
  Grouped_TrackedMinutes %>% 
  mutate("Cumulative_Percentage" = cumsum(Percentage), 
         .after = Percentage) 

Grouped_TrackedMinutes$Range <- 
  factor(Grouped_TrackedMinutes$Range, levels = Grouped_TrackedMinutes$Range)

print(Grouped_TrackedMinutes)

## Dataframe for worn all day or not ##

TrackedMinutes_all_day <-
  data.frame("Range" = c("Not_All_Day", "All_Day"),
             "Frequency" = 
               c(
                 with(Grouped_TrackedMinutes, sum(Frequency[Range!="24h"])),
                 with(Grouped_TrackedMinutes, sum(Frequency[Range=="24h"]))   # select by label, not by row position
               ),
             "Percentage" = 
               c(
                 with(Grouped_TrackedMinutes, sum(Percentage[Range!="24h"])),
                 with(Grouped_TrackedMinutes, sum(Percentage[Range=="24h"]))  # select by label, not by row position
               )
  )

print(TrackedMinutes_all_day)

In [None]:
# Create plots #

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Grouped_TrackedMinutes, aes(x = Range)) +
  geom_bar(aes(y = Percentage),
           fill = "lightskyblue1", 
           stat = "identity", 
           colour = "black", linewidth = 0.8) +
  geom_path(aes(y = Cumulative_Percentage, group = 1), 
            colour = "plum1", 
            lty = 3, linewidth = 1.5) +
  geom_point(aes(y = Cumulative_Percentage), 
             colour = "hotpink", 
             pch = 16, size = 3) +
  theme_classic() +
  labs(title = "Relative Usage of Device") +
  xlab("Hours Used") +
  ylab("Percentage") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(vjust = 0.6, size = 12),
        axis.text.y = element_text(size = 12)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  geom_text(aes(y = Percentage+10, 
                label = paste0(Percentage, " %")), size = 5)

#-----------------------------------------------------------------------------#

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

par(default_par)
pie(TrackedMinutes_all_day$Frequency,
    labels = paste(c("Not Used the Whole Day -", "Used the Whole Day -"), 
                   paste0(TrackedMinutes_all_day$Percentage, "%")),
    col = c("lightskyblue4", "lightskyblue1"),
    border = "white",
    radius = 0.5,
    cex = 1.35)
title(main = "Relative Usage of Device", cex.main = 1.75, line = -6)                     

dev.off()                     

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* The device was worn the entire day on 71.96% of the recorded days
* On 21.9% of the recorded days the device was worn for 15 to under 24 hours
* On 5.02% of the recorded days the device was worn for 10 to 15 hours
* On 0.59% of the recorded days the device was worn for only 5 to 10 hours
* On 0.53% of the recorded days the device was worn for less than 5 hours
  
</div>

#### **4.2 Analysis: Performance**
#### **4.2.1 Load packages** 

In [None]:
#### Load packages ####
#=====================#

library("tidyverse")
library("plotrix")
library("RColorBrewer")
library("treemapify")
library("pals")
library("plotly")
library("gridExtra")
library("grid")

#### **4.2.2 Change timezone to UTC** 

In [None]:
#### Change timezone to UTC ####
#==============================#

Sys.setenv(TZ = "UTC")

#### **4.2.3 Examine daily activity and weight** 

In [None]:
#### Examine Daily Activity and Weight ####
#=========================================#

summary(Activity_d)
summary(Weight)

#### **4.2.4 Examine calories** 

In [None]:
#### Examine Calories ####
#========================#

# Look at statistics #

summary(Calories_d)
summary(Calories_h)
summary(Calories_m)

# Average Calories burned per day #

## Create dataframe to group Calories into ranges ##

Grouped_Calories_d <-
  setNames(
    data.frame(
      table(
        cut(
          Calories_d$Calories, 
          breaks = c(-Inf, 1200, 1500, 2000, 2500, 3000, Inf),
          labels = c("< 1200", "1200-1500", "1500 - 2000", "2000 - 2500", 
                     "2500 - 3000", "> 3000")
        )
      )
    ), c("Range", "Frequency")
  )

## Add percentages ##

Grouped_Calories_d <-
  Grouped_Calories_d %>% 
  mutate("Percentage" = round_percent(Frequency, 2))

## Preview new dataframe ##

print(Grouped_Calories_d)

In [None]:
## Create plot ##

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

{
  plotrix::pie3D(Grouped_Calories_d$Frequency, 
                 border = "white", 
                 theta = 1,
                 radius = 0.8,
                 col = hcl.colors(length(Grouped_Calories_d$Range), "PuBu"), 
                 explode = 0.1,
                 labels = paste0(Grouped_Calories_d$Percentage, " %"), 
                 mar = c(10,10,10,10)
  ) 
  
  title(main = "Average Calories Burned per Day", cex.main = 2.2, line = -2)
  
  legend("bottomright",
         inset = c(-0.16, -0.2),          
         legend = paste0(Grouped_Calories_d$Range, " Cal"), 
         fill = hcl.colors(length(Grouped_Calories_d$Range), "PuBu"), 
         xpd = TRUE, 
         cex = 1.2,
         text.width = 0.6,
         box.lty = 3,
         y.intersp = 1
  )
}

dev.off()

In [None]:
# Calories burnt throughout the week #

## Prepare dataframe ##

Calories_over_Time <-
  Calories_h %>% 
  mutate("Weekday" = weekdays(ActivityTime), .after = ActivityTime)

Calories_over_Time$ActivityTime <- 
  format(Calories_over_Time$ActivityTime, format = "%H:%M:%S")

Calories_over_Time$Weekday <- 
  factor(Calories_over_Time$Weekday, 
         levels = c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", 
                    "Tuesday", "Monday"))

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook
                     
ggplot(Calories_over_Time, aes(x = ActivityTime, y = Weekday, fill = Calories)) +
  geom_tile(lwd = .1, colour="grey40") +
  scale_fill_viridis(option = "inferno") +
  labs(title = "Calories Throughout the Week") +
  theme(plot.title = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(angle = 90, size = 12),
        axis.text.y = element_text(size = 12)) 

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* On average, users burned 2283 calories per day
* The maximum number of calories burned in a single day was 4900<br></br>
* 15.18% of the time the users burned more than 3000 calories per day
* 17.91% of the time the users burned between 2500-3000 calories per day
* 27.16% of the time the users burned between 2000-2500 calories per day
* 27.05% of the time the users burned between 1500-2000 calories per day
* 11.10% of the time the users burned between 1200-1500 calories per day
* 1.60% of the time the users burned less than 1200 calories per day<br></br>
* In general, most calories were burned on Mondays and Saturdays
* During the week most calories were burned between 14:00 and 17:00
* On weekends most calories were burned between 12:00 and 13:00

</div>
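
The `setNames(data.frame(table(cut(...))))` pattern used for the calorie ranges comes back for almost every metric in this notebook. As a minimal sketch, it could be wrapped in one small helper. `group_into_ranges()` and the toy values are hypothetical and not part of the original analysis, and plain rounding stands in for `round_percent()`, which is defined outside this section.

```r
# Hypothetical helper (not part of the original notebook): bin a numeric
# vector into labelled ranges and return counts plus percentage shares.
group_into_ranges <- function(x, breaks, labels) {
  # cut() uses right-closed intervals by default, e.g. (1200, 1500]
  binned <- cut(x, breaks = breaks, labels = labels)
  out <- setNames(data.frame(table(binned)), c("Range", "Frequency"))
  # Plain rounding; the shares are not forced to sum to exactly 100
  out$Percentage <- round(100 * out$Frequency / sum(out$Frequency), 2)
  out
}

# Made-up daily calorie totals, for illustration only
toy_calories <- c(1100, 1450, 1800, 1999, 2000, 2300, 2600, 3100)

calorie_ranges <- group_into_ranges(
  toy_calories,
  breaks = c(-Inf, 1200, 1500, 2000, 2500, 3000, Inf),
  labels = c("< 1200", "1200-1500", "1500-2000", "2000-2500",
             "2500-3000", "> 3000")
)

print(calorie_ranges)
```

Because the intervals are closed on the right, a day with exactly 2000 calories is counted in "1500-2000" and not in "2000-2500". The same holds for the ranges reported in the findings.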

#### **4.2.5 Examine intensities** 

In [None]:
#### Examine Intensities ####
#===========================#

# Look at statistics #

summary(Intensities_d)
summary(Intensities_h)
summary(Intensities_m)

# Average Sedentary Minutes #

## Create dataframe to group SedentaryMinutes into ranges ##

Grouped_Intensities_d <-
  Intensities_d %>% 
  mutate("SedentaryActiveMinutes" = SedentaryMinutes-(8*60), 
         .after = SedentaryMinutes) %>% 
  select(1, 2, 4)

Grouped_Intensities_d <-
  setNames(
    data.frame(
      table(
        cut(
          Grouped_Intensities_d$SedentaryActiveMinutes, 
          breaks = c(-Inf, 240, 480, 660, Inf),
          labels = c("< 240", "240-480", "480-660", "> 660")
        )
      )
    ), c("Range", "Frequency")
  )  

## Add percentages ##

Grouped_Intensities_d <-
  Grouped_Intensities_d %>% 
  mutate("Percentage" = round_percent(Frequency, 2))

In [None]:
# Average proportion of intensity levels #

## Prepare dataframe ##

Comparison_Intensities <-
  Intensities_d %>% 
  pivot_longer(cols = c(3:6), names_to = "ActivityType", values_to = "Minutes") %>% 
  group_by(ActivityType) %>% 
  summarise("Minutes" = sum(Minutes)) %>% 
  mutate("Percentage" = round_percent(Minutes, 2),  
         "ActivityLevel" = "ActivityLevel")

Comparison_Intensities <- Comparison_Intensities[c(3,2,1,4),]

In [None]:
## Create plot ##

{
  pie(Comparison_Intensities$Minutes,
      labels = paste0(Comparison_Intensities$Percentage, " %"),
      col = hcl.colors(nrow(Comparison_Intensities), "PuBu"),   # one colour per activity type
      border = "white", 
      radius = 0.6, cex = 1.35, 
      mar=c(1,1,1,1))
  
  title("Average Ratio of Intensity Levels", line = -0.5, cex.main = 2)
    
  legend("bottom",
         legend = Comparison_Intensities$ActivityType,
         fill = hcl.colors(nrow(Comparison_Intensities), "PuBu"),   # same palette as the pie
         ncol = 4, 
         cex = 1.10, text.width = 0.4)
}

dev.off()

In [None]:
## Create plot ##

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

{
  plotrix::pie3D(Grouped_Intensities_d$Frequency, 
                 border = "white", 
                 theta = 1, 
                 radius = 0.8,
                 col = hcl.colors(length(Grouped_Intensities_d$Range), "PuBu"), 
                 explode = 0.1,
                 labels = paste0(Grouped_Intensities_d$Percentage, " %"), 
                 mar = c(10,10,10,10)
  ) 
  
  title(main = "Average Sedentary Minutes per Day", cex.main = 2.2, line = -2)
  
  legend("bottomright", 
         inset = c(-0.125, -0.22), 
         legend = paste0(Grouped_Intensities_d$Range, " Sedentary Minutes"), 
         fill = hcl.colors(length(Grouped_Intensities_d$Range), "PuBu"), 
         xpd = TRUE, 
         cex = 1.2,
         text.width = 1,
         box.lty = 3,
         y.intersp = 1  )
  }

dev.off()

In [None]:
# Intensities throughout the week #

## Prepare dataframe ##

Intensities_over_Time <-
  Intensities_h %>% 
  mutate("Weekday" = weekdays(ActivityTime), .after = ActivityTime)

Intensities_over_Time$ActivityTime <- 
  format(Intensities_over_Time$ActivityTime, format = "%H:%M:%S")

Intensities_over_Time$Weekday <- 
  factor(Intensities_over_Time$Weekday, 
         levels = c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", 
                    "Tuesday", "Monday"))

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Intensities_over_Time, aes(x = ActivityTime, 
                                  y = Weekday, 
                                  fill = TotalIntensity)) +
  geom_tile(lwd = .1, colour="grey40") +
  scale_fill_viridis(option = "inferno") +
  labs(title = "Intensity Levels Throughout the Week") +
  theme(plot.title = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(angle = 90, size = 12),
        axis.text.y = element_text(size = 12)) 

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* On average, users spent around **1063 minutes** per day in a **sedentary** state (sleeping or sitting), **212 minutes** in a **lightly active** state, **15 minutes** in a **fairly active** state and **23 minutes** in a **very active** state <br></br>
* On average, users were **sedentary** **80.97%** of the time, **lightly active** **16.15%** of the time, **fairly active** **1.13%** of the time and **very active** **1.75%** of the time <br></br>
* The average intensity level is **13.09** per hour or **0.22** per minute (calculated by assigning the values 0-3 to each state; the maximum possible values are 3 x 60 = 180 per hour or 3 per minute) <br></br>
* After subtracting 8 hours (480 minutes) from each day's sedentary minutes, as done in the code above (presumably to exclude sleep), users had
  * **more than 660** remaining sedentary minutes per day **47.76%** of the time
  * **between 480 and 660** remaining sedentary minutes per day **26.74%** of the time
  * **between 240 and 480** remaining sedentary minutes per day **12.34%** of the time
  * **fewer than 240** remaining sedentary minutes per day **13.16%** of the time<br></br>
* On average, users were most active between **14:00 and 17:00** during the **week** and between **12:00 and 13:00** on the **weekend** (this coincides with the findings for calories in 4.2.4)
* In general, most sedentary minutes (apart from sleep) are spent in the morning

</div>
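
The weekly heatmaps map the hourly rows directly to tiles, so every user and date that shares a weekday and hour lands on the same tile, and `geom_tile()` draws those rows on top of each other. If the intent is to show the mean per weekday and hour, the data can be aggregated first. The following is a minimal base R sketch under that assumption; `toy_hourly` and its values are made up, and only the column names mirror `Intensities_h`.

```r
# Made-up hourly records for illustration; column names mirror Intensities_h
toy_hourly <- data.frame(
  ActivityTime = as.POSIXct(
    c("2016-04-11 14:00:00", "2016-04-18 14:00:00",   # two Mondays, 14:00
      "2016-04-11 15:00:00", "2016-04-16 12:00:00"),  # Monday 15:00, Saturday 12:00
    tz = "UTC"
  ),
  TotalIntensity = c(20, 30, 12, 40)
)

# "%u" is the ISO weekday number (1 = Monday, 7 = Sunday), which does not
# depend on the system locale the way weekdays() does
toy_hourly$Weekday <- factor(
  format(toy_hourly$ActivityTime, "%u"),
  levels = as.character(7:1),
  labels = c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday",
             "Tuesday", "Monday")
)
toy_hourly$Hour <- format(toy_hourly$ActivityTime, "%H:%M:%S")

# One row per weekday/hour cell, holding the mean over all matching rows
mean_by_cell <- aggregate(TotalIntensity ~ Weekday + Hour,
                          data = toy_hourly, FUN = mean)

print(mean_by_cell)
```

The aggregated dataframe has exactly one row per tile and can be passed to `ggplot()` with `x = Hour`, `y = Weekday` and `fill = TotalIntensity`. The same approach would apply to the calories and steps heatmaps.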

#### **4.2.6 Examine METs** 

In [None]:
#### Examine METs ####
#====================#

# Look at statistics #

summary(MET_d)
summary(MET_m)

# METs over time #

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(MET_d, aes(x = ActivityTime, y = Id, fill = METs_d)) +
  geom_tile() +
  scale_fill_viridis(option = "inferno") +
  theme_minimal() +
  labs(title = "METs over Time") +
  theme(plot.title = element_text(hjust = .5, size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 12)) 

In [None]:
# Average METs per day #
 
## Create dataframe and determine mean values for in-chart reference ##

MET_d_Averages <-
  MET_d %>% 
  group_by(Id) %>% 
  summarise("AVG_METs_d" = mean(METs_d), "AVG_METs_m" = mean(Mean_METs_d)) 

round(mean(MET_d_Averages$AVG_METs_d), digits = 2)
round(mean(MET_d_Averages$AVG_METs_m), digits = 2)

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(MET_d_Averages, aes(x = Id, y = AVG_METs_d)) + 
  geom_col(fill = "turquoise3", colour="black", width = 1) +
  theme_minimal() +
  labs(title = "Average METs per day") +
  ylab("METs") +
  theme(axis.text.x = element_text(angle = 90),
        plot.margin = unit(c(.5,2,.5,.5), "cm"),
        plot.title = element_text(hjust = 0.5, size = 20),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 12)) +
  scale_y_continuous(breaks = c(500, 1000, 1500, 2000, 2500)) +
  geom_hline(yintercept = round(mean(MET_d_Averages$AVG_METs_d), digits = 2), 
             linewidth = 1, linetype = 2, 
             colour = "darkorange2") +

  annotate("rect", 
           xmin = -.5, xmax = 35.5, 
           ymin = 450, ymax = 900, 
           alpha = .5, 
           fill = "orange", colour = "darkorange3") +

  annotate("text", 
           x = 38.1, 
           y = round(mean(MET_d_Averages$AVG_METs_d), digits = 2)+100,
           label = paste(round(mean(MET_d_Averages$AVG_METs_d), 
                               digits = 2), "/ d"), 
           colour = "darkorange2",
           size = 6) +

  annotate("text", 
           x = 38.1, 
           y = round(mean(MET_d_Averages$AVG_METs_d), digits = 2)-100,
           label = paste(round(mean(MET_d_Averages$AVG_METs_m), 
                               digits = 2), "/ min"), 
           colour = "darkorange2",
           size = 6) +

  annotate("text", 
           x = 38.1, 
           y = 675, 
           label = "450 - 900", 
           colour = "darkorange2",
           size = 6) +

  coord_cartesian(clip = "off")

In [None]:
# Average distribution of MET levels #

## Prepare dataframe to examine average MET proportions ##

Grouped_MET_Levels <-
  MET_m %>% 
  group_by(METs) %>% 
  count(name = "Total_Minutes") %>% 
  summarise("Total_Minutes" = sum(Total_Minutes)) %>% 
  separate(METs, 2, into=c("MET_Level", "Decimal")) %>% 
  select(-Decimal) 

Grouped_MET_Levels$MET_Level <- gsub("\\.", "", Grouped_MET_Levels$MET_Level)    # to remove the dot

Grouped_MET_Levels <-
  Grouped_MET_Levels %>% 
  group_by(MET_Level) %>% 
  summarise("Total_Minutes" = sum(Total_Minutes)) %>% 
  arrange(as.numeric(MET_Level))

Grouped_MET_Levels$MET_Level <-
  factor(Grouped_MET_Levels$MET_Level, 
         levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
                    "12", "13", "14", "15", "18"))

## Create dataframe for percentages ##

Grouped_MET_Levels_relperc <-
  Grouped_MET_Levels %>% 
  mutate("Percentage" = round_percent(Total_Minutes, 2))

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Grouped_MET_Levels, 
       mapping = aes(area = Total_Minutes, 
                     fill = MET_Level, 
                     label = MET_Level)) +
  geom_treemap() +
  scale_fill_manual(values = as.vector(stepped3(18))) +
  labs(title = "Average MET-Levels",
       tag = "Rest: 0.16%")+
  theme(plot.title = element_text(hjust = 0.5, size = 25),
        plot.margin = unit(c(0.5, 3.2, 0.5, 0.5), "cm"),
        plot.tag.position = c(1.0725, 0.88),           
        plot.tag = element_text(colour = "grey", size = 16.5),
        legend.position = c(1.07, 0.32),
        legend.title = element_text(size = 15), 
        legend.text = element_text(size = 15)) +
  geom_treemap_text(label = paste0(Grouped_MET_Levels_relperc$Percentage, "%"),
                    colour = c(rep("white", 3), rep("black", 13)), size = 25) +
  guides(fill = guide_legend(ncol = 2)) 

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* The average METs (energy output) per day are **2065.1** or **1.43** per minute (0.9 being a sleeping state, 22-25 being the maximum values for top-athletic-level training) <br></br>
* There is no apparent pattern in when or how much energy was expended; the MET levels are rather volatile for most users (a few users maintained a high MET level throughout the entire period, while the rest have more mixed MET levels)  <br></br>
* Every user already reaches the recommended weekly minimum of 450-900 METs on a single average day <br></br>
* On average users spent
  * **84.65%** of the time at a level 1 MET-level (sleeping/sitting; this coincides with the high amounts of sedentary time found in 4.2.5)
  * **6.46%** of the time at a level 2 MET-level (light exercises like gardening, cleaning, slow strolling)
  * **4.75%** of the time at a level 3 MET-level (moderate exercises like walking, light resistance training)
  * **1.50%** of the time at a level 4 MET-level (moderate exercises like leisure cycling)
  * **2.64%** of the time at a MET-level of 5 or above (moderate cycling to pro sports) 

</div>
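
The MET levels above are obtained by splitting the MET value as text and removing the dot. As a minimal sketch, the same integer level can be derived numerically with `floor()`. The toy values below are made up; only the `METs` column name is taken from `MET_m`.

```r
# Made-up minute-level MET values, for illustration only
toy_met <- data.frame(METs = c(1.0, 1.2, 1.9, 2.5, 3.0, 10.4, 18.0))

# floor() keeps the integer part, i.e. the MET level
toy_met$MET_Level <- floor(toy_met$METs)

# Minutes per level (table() sorts numeric levels in numeric order)
met_levels <- setNames(data.frame(table(toy_met$MET_Level)),
                       c("MET_Level", "Total_Minutes"))

# Share of all tracked minutes, using plain rounding
met_levels$Percentage <-
  round(100 * met_levels$Total_Minutes / sum(met_levels$Total_Minutes), 2)

print(met_levels)
```

Working on numbers avoids the string handling and keeps two-digit levels such as 10 or 18 intact without a fixed split position.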

#### **4.2.7 Examine sleep** 

In [None]:
#### Examine Sleep ####
#=====================#

# Look at statistics #

summary(Sleep_d)
summary(Sleep_m)

# Average sleep per day #

## Create dataframes to group sleep into ranges ##

Grouped_Sleep_d_Asleep <-
  setNames(
    data.frame(
      table(
        cut(
          Sleep_d$TotalMinutesAsleep, 
          breaks = c(-Inf, (7*60), (9*60), Inf),
          labels = c("< 7", "7-9", "> 9")
        )
      )
    ), c("Range", "Frequency")
  ) %>% 
  mutate("Percentage" = round_percent(Frequency, 2), "Sleep" = "Time Asleep")  

Grouped_Sleep_d_Total <-
  setNames(
    data.frame(
      table(
        cut(
          Sleep_d$TotalTimeInBed, 
          breaks = c(-Inf, (7*60), (9*60), Inf),
          labels = c("< 7", "7-9", "> 9")
        )
      )
    ), c("Range", "Frequency")
  ) %>% 
  mutate("Percentage" = round_percent(Frequency, 2), "Sleep" = "Time in Bed")  

## Bind dataframes ##

Grouped_Sleep_d <- bind_rows(Grouped_Sleep_d_Asleep, Grouped_Sleep_d_Total)

In [None]:
## Create plot ##

options(repr.plot.width = 9, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Grouped_Sleep_d, aes(x = Sleep, y = Percentage, fill = Range)) +
  geom_bar(stat = "identity", width = 0.6, colour = "black") +
  theme_minimal(base_size = 15) +
  scale_fill_brewer(labels = paste0(unique(Grouped_Sleep_d$Range), " h"),   # one label per range
                    palette = "PuBu", 
                    direction = -1) +
  labs(title = "Average Sleep per Day") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        legend.title = element_text(size = 17),
        legend.text = element_text(size = 15),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 16),
        axis.text = element_text(size = 14)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  geom_text(aes(label = paste0(Percentage, " %")), 
            colour = "black",
            position = position_stack(vjust = .5), 
            size = 5) +
  geom_segment(aes(x = 1.3, y = Percentage[3]/2 , 
                   xend = 1.7, yend = Percentage[6]/2), 
               linetype=2, colour = "grey") +
  
  geom_segment(aes(x = 1.3, y = (Percentage[2]/2)+Percentage[3], 
                   xend = 1.7, yend = (Percentage[5]/2)+Percentage[6]), 
               linetype=2, colour = "grey") +
  
  geom_segment(aes(x = 1.3, y = (Percentage[1]/2)+Percentage[2]+Percentage[3], 
                   xend = 1.7, yend = (Percentage[4]/2)+Percentage[5]+Percentage[6]), 
               linetype=2, colour = "grey") 

In [None]:
# Average sleeping times #

## Create dataframes ##
 
Sleep_Time <- Sleep_m  

minute(Sleep_Time$ActivityTime) <- 0

Sleep_Time$ActivityTime <- format(Sleep_Time$ActivityTime, format = "%H:%M:%S")

Sleep_Time_Individual <-
  Sleep_Time %>% 
  group_by(Id, ActivityTime) %>% 
  count(SleepStage_chr, name = "SleepSUM") 

Sleep_Time_All <-
  Sleep_Time %>% 
  group_by(ActivityTime) %>% 
  count(SleepStage_chr, name = "SleepSUM")  

In [None]:
## Create plots ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

Sleep_Time_Individual %>% 
  filter(SleepStage_chr=="asleep") %>% 
  ggplot(aes(x = ActivityTime, y = Id, fill = SleepSUM)) +
  geom_tile(colour = "white", linetype = 1, show.legend = FALSE) +
  scale_fill_viridis(option = "viridis") +
  labs(title = "Average Individual Sleeping Times") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, colour = "grey70", face = "bold"),
        axis.title = element_blank(),
        axis.text = element_text(size = 12, colour = "grey70"),
        axis.text.x = element_text(angle = 90),
        plot.background = element_rect(fill = "black"),
        panel.grid = element_blank(), 
        panel.background = element_rect(fill = "black"))

Sleep_Time_All %>% 
  filter(SleepStage_chr=="asleep") %>% 
  ggplot(aes(x = ActivityTime, y = SleepStage_chr, fill = SleepSUM)) +
  geom_tile(colour = "white", lwd= .3, linetype = 1, show.legend = FALSE) +
  scale_fill_viridis(option = "viridis") +
  theme_minimal() +
  labs(title = "Average Sleeping Time") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, colour = "grey70", face = "bold"),
        axis.title = element_blank(),
        axis.text = element_text(size = 12, colour = "grey70"),
        axis.text.x = element_text(angle = 90),
        plot.background = element_rect(fill = "black"),
        panel.grid = element_blank(), 
        panel.background = element_rect(fill = "black"))

In [None]:
# Sleep Phase Ratio # 

## Create dataframes ##

Sleep_Time_Individual_perc <- 
  Sleep_Time_Individual %>% 
  group_by(SleepStage_chr) %>% 
  summarise("SleepSUM" = sum(SleepSUM))

Sleep_Time_Individual_perc <- 
  round_percent(Sleep_Time_Individual_perc$SleepSUM,2)

In [None]:
## Create plot ##  

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

Sleep_Time_Individual %>% 
  ggplot(aes(x = Id, y = SleepSUM, fill = SleepStage_chr)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_brewer(palette = "Pastel1", direction = -1) +
  theme_minimal() +
  labs(title = "Sleep Phase Ratio", fill = "Sleep Stage") +
  ylab("Total Sleep") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 13), 
        axis.title = element_text(size = 16),
        axis.text.x = element_text(angle = 90, size = 13),
        axis.text.y = element_text(size = 13)) +
  geom_hline(yintercept = Sleep_Time_Individual_perc/100, 
             colour = c("darkgreen", "darkblue", "darkred"),
             linewidth = 1) +
  annotate("text", 
           x = 27, y = Sleep_Time_Individual_perc/100+0.03, 
           label = paste0(c(Sleep_Time_Individual_perc[1], 
                            Sleep_Time_Individual_perc[2],
                            Sleep_Time_Individual_perc[3]), "%"), 
           colour = c("#85CF75", "#699CC8", "#F65649"),
           size = 6) +
  coord_cartesian(clip = "off")

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* On average users **slept** around **6.8 hours** (409 minutes) per day and **spent** a total of **7.41 hours** (445 minutes) **in bed** <br></br>
* Two users spent around a quarter of their sleeping time in a restless state <br></br>
* On average
  * **46.39%** of the time users slept for less than **7 hours**
  * **42.56%** of the time users slept between **7-9 hours**
  * **11.05%** of the time users slept for more than **9 hours** <br></br>
* On average
  * **35.00%** of the time users spent less than **7 hours** in bed
  * **43.84%** of the time users spent between **7-9 hours** in bed
  * **21.16%** of the time users spent more than **9 hours** in bed <br></br>  
* The core sleeping time of users usually starts between **23:00 and 24:00** and ends between **5:00 and 6:00** <br></br>
* Around **92%** of the sleeping time was spent **asleep**, **7% awake** and **1% restless** 

</div>
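
As a quick plausibility check, the share of time asleep can also be derived from the two daily averages reported above. This is only a sketch using the rounded averages from the findings, not a recalculation from the raw data.

```r
# Averages taken from the findings above (rounded to whole minutes)
mean_minutes_asleep <- 409
mean_minutes_in_bed <- 445

# Share of the time in bed that was spent asleep, in percent
sleep_efficiency <- round(100 * mean_minutes_asleep / mean_minutes_in_bed, 1)
print(sleep_efficiency)    # 91.9

# Average time in bed without being asleep, in minutes
minutes_not_asleep <- mean_minutes_in_bed - mean_minutes_asleep
print(minutes_not_asleep)  # 36
```

The result of roughly 92% matches the share of minutes classified as asleep in the sleep phase ratio.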

#### **4.2.8 Examine steps** 

In [None]:
#### Examine Steps ####
#=====================#

# Look at statistics #

summary(Steps_d)
summary(Steps_h)
summary(Steps_m)

# Average Steps taken per day #

## Create dataframe to group Steps into ranges ##

Grouped_Steps_d <-
  setNames(
    data.frame(
      table(
        cut(
          Steps_d$TotalSteps, 
          breaks = c(-Inf, 1000, 2500, 5000, 10000, Inf),
          labels = c("< 1000", "1000-2500", "2500 - 5000", "5000 - 10000", 
                     "> 10000")
        )
      )
    ), c("Range", "Frequency")
  ) %>% 
  mutate("Percentage" = round_percent(Frequency, 2), "Steps" = "Steps")          # dummy value

Grouped_Steps_d <- Grouped_Steps_d[c(5,4,3,2,1),]

In [None]:
## Create plot ##

options(repr.plot.width = 9, repr.plot.height = 7) # adjust plot size for jupyter notebook

Grouped_Steps_d %>% 
  ggplot(mapping = aes(x = Steps, y = Percentage, fill = Range)) +
  geom_bar(stat = "identity", width = 0.5, colour = "black") +
  theme_minimal(base_size = 15) +
  scale_fill_brewer(labels = paste0(c("< 1000", "1000-2500", "2500 - 5000", 
                                      "5000 - 10000", "> 10000"), " Steps"), 
                    palette = "PuBu", 
                    direction = -1) +
  labs(title = "Average Steps taken per Day") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),        
        legend.title = element_text(size = 17),
        legend.text = element_text(size = 15),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 16),
        axis.text.y = element_text(size = 14)) +       
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  geom_text(aes(label = paste0(Percentage, " %")), 
            colour = c(rep("black", 4), "white"),
            position = position_stack(vjust = .5), 
            size = 5) 

In [None]:
# Steps throughout the week #

## Prepare dataframe ##

Steps_over_Time <-
  Steps_h %>% 
  mutate("Weekday" = weekdays(ActivityTime), .after = ActivityTime)

Steps_over_Time$ActivityTime <- 
  format(Steps_over_Time$ActivityTime, format = "%H:%M:%S")

Steps_over_Time$Weekday <- 
  factor(Steps_over_Time$Weekday, 
         levels = c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", 
                    "Tuesday", "Monday"))

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Steps_over_Time, aes(x = ActivityTime, y = Weekday, fill = TotalSteps)) +
  geom_tile(lwd = .1, colour="grey40") +
  scale_fill_viridis(option = "inferno") +
  labs(title = "Steps Throughout the Week") +
  theme(plot.title = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(angle = 90, size = 12),
        axis.text.y = element_text(size = 12)) 

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* On average users accumulated around **8292 steps a day** (which translates to roughly 346 steps per hour or 5.8 steps per minute) <br>
* On average
  * **35.68%** of the time users took more than **10000 steps** a day
  * **36.27%** of the time users took between **5000-10000 steps** a day
  * **17.07%** of the time users took between **2500-5000 steps** a day
  * **7.56%** of the time users took between **1000-2500 steps** a day
  * **3.42%** of the time users took less than **1000 steps** a day <br></br>


* Most steps were taken during the week between **14:00 and 17:00**; on the weekend users generally got in fewer steps

</div>

#### **4.2.9 Examine heart rates** 

In [None]:
#### Examine Heart Rate ####
#==========================#

# Look at statistics # 

summary(HeartRate_d)
summary(HeartRate_m)

# Average, maximum and minimum heart rates #

## Create dataframes ##

Max_Min_HR_wide <-
  HeartRate_d %>% 
  group_by(Id) %>% 
  summarise("Max_HR" = max(Max_HeartRate_d), "Min_HR" = min(Min_HeartRate_d))

Max_Min_HR_long <-
  bind_rows(
    
    HeartRate_d %>% 
      group_by(Id) %>% 
      summarise("Max_Min_HR" = max(Max_HeartRate_d)),
    
    HeartRate_d %>% 
      group_by(Id) %>% 
      summarise("Max_Min_HR" = min(Min_HeartRate_d))
    
  )

In [None]:
## Create plot ##

p_HeartRate_interactive <-
  ggplot() +
  theme_classic() +
  theme(plot.title = element_text(hjust = .5, size = 15),
        axis.text.x = element_text(angle = 90)) +
  geom_line(data = Max_Min_HR_long, 
            aes(x = Id, y = Max_Min_HR), 
            colour = "red", 
            linewidth = 1.5, 
            alpha = .25) +
  geom_line(data = HeartRate_d, 
            aes(x = Id, y = Mean_HeartRate_d), 
            colour = "red", 
            linewidth = 2.5) +
  geom_point(data = Max_Min_HR_wide, 
             aes(x = Id, y = Max_HR), 
             colour = "red", 
             size = 2.5) +
  geom_point(data = Max_Min_HR_wide, 
             aes(x = Id, y = Min_HR), 
             colour = "red", 
             size = 2.5) +
  labs(title = "Heart Rates by Users") +
  ylab("Heart Rate") 

ggplotly(p_HeartRate_interactive)

In [None]:
# Heart rates during sedentary time #

## Create dataframe ##

Intensities_Sedentary <- 
  Intensities_m %>% 
  filter(Intensity_chr=="Sedentary")

Sedentary_HeartRates <- inner_join(Intensities_Sedentary, HeartRate_m, by = c("Id", "ActivityTime"))

Sedentary_HeartRates <-
  Sedentary_HeartRates %>% 
  group_by(Id) %>% 
  summarise("MeanHR" = mean(Mean_HeartR), 
            "MinHR" = min(Mean_HeartR), 
            "MaxHR" = max(Mean_HeartR))

In [None]:
## Create plot ##

Plot_Sedentary_HeartRates <-
  ggplot(Sedentary_HeartRates) +
  geom_point(aes(x = Id, y = MeanHR), fill = "darkseagreen2", size = 3.5, shape = 23) +
  geom_point(aes(x = Id, y = MinHR), fill = "skyblue1", size = 3.5, shape = 25) +
  geom_point(aes(x = Id, y = MaxHR), fill = "salmon", size = 3.5, shape = 24) +
  labs(title = "Minimum, Mean and Maximum Heart Rates During Sedentary Time") +
  ylab("Heart Rate") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 11, face = "bold"),
        axis.text.x = element_text(angle = 90))

ggplotly(Plot_Sedentary_HeartRates)

In [None]:
# Average daily heart rates over the entire period by user #

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

HeartRate_d %>% 
  group_by(Id, ActivityTime) %>% 
  summarise("Mean_HR" = mean(Mean_HeartRate_d)) %>% 
  ggplot(aes(x = ActivityTime, y = Mean_HR)) +
  geom_line(colour = "red") +
  facet_wrap(~Id) +
  theme_minimal() +
  theme(axis.title = element_blank())

In [None]:
# Time spent at different heart rates #

## Create dataframe ##

Grouped_HeartRate_m <-
  setNames(
    data.frame(
      table(
        cut(
          HeartRate_m$Mean_HeartR, 
          breaks = c(-Inf, 50, 80, 100, 150, 180, Inf),
          labels = c("< 50", "50-80", "80-100", "100-150", "150-180", "> 180")
        )
      )
    ), c("Range", "Minutes")
  ) %>% 
  mutate("Percentage" = round_percent(Minutes, 2), "HeartRate" = "HeartRate")    # dummy value

In [None]:
## Create plot ##

options(repr.plot.width = 10.5, repr.plot.height = 7) # adjust plot size for jupyter notebook

ggplot(Grouped_HeartRate_m, aes(x = Minutes, y = Range, fill = Range)) +
  geom_col(colour = c("#F0027F", "#386CB0", "#FFFF99",
                      "#FDC086", "#BEAED4", "#7FC97F")) +
  labs(title = "Total Time Spent at Different Heart Rates") +
  xlab("Total Minutes") +
  ylab("Heart Rate") +
  scale_x_continuous(breaks = c(50000, 100000, 150000, 200000, 250000, 300000), 
                     labels = c("50000", "100000", "150000", "200000", "250000", 
                                "300000")) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 13),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 13)) +
  scale_fill_brewer(palette = "Accent", direction = -1) +
  geom_text(label = paste(Grouped_HeartRate_m$Percentage, "%"), 
            nudge_x = 25000,
           size = 5) +
  coord_cartesian(clip = "off")

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* In general, most users have ordinary mean heart rates <br></br>
* During sedentary time, most users have peak heart rates of over 115 BPM; 33% of the users have peak heart rates of around 155-180 BPM <br></br> 
* For some users the minimum heart rate falls between 40 and 60 BPM (for two users even below 40 BPM)<br>
  → the corresponding mean heart rates look fine, but a check-up is still advisable just in case <br></br>
* For one user the mean heart rate is slightly higher than the normal resting range of 60-100 BPM, which should be checked just in case <br></br>
* More than half of the users reach **peak heart rates** between **180 and 200 BPM** <br></br>
* There do not seem to be any abnormalities for heart rates over time (though some users tracked their heart rates inconsistently) <br></br>
* On average users spent
  * **2.08%** of the time at a heart rate of under **50** BPM
  * **67.42%** of the time at a heart rate of **50-80** BPM
  * **22.98%** of the time at a heart rate of **80-100** BPM
  * **7.19%** of the time at a heart rate of **100-150** BPM
  * **0.32%** of the time at a heart rate of **150-180** BPM
  * **0.01%** of the time at a heart rate of over **180** BPM

</div>
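
The share of users with high peak heart rates during sedentary time can be computed from the joined data instead of being read off the interactive plot. The following is a minimal base R sketch. Both toy dataframes are made up, only their column names mirror `Intensities_m` and `HeartRate_m`, and the 115 BPM threshold is the one mentioned in the findings.

```r
# Made-up minute-level data; column names mirror Intensities_m and HeartRate_m
toy_intensity <- data.frame(
  Id = c("A", "A", "A", "B", "B"),
  ActivityTime = c(1, 2, 3, 1, 2),   # stand-in for minute timestamps
  Intensity_chr = c("Sedentary", "Sedentary", "Very Active",
                    "Sedentary", "Sedentary")
)
toy_heart <- data.frame(
  Id = c("A", "A", "A", "B", "B"),
  ActivityTime = c(1, 2, 3, 1, 2),
  Mean_HeartR = c(62, 70, 150, 58, 120)
)

# Keep sedentary minutes only, then attach the heart rate of the same
# user and minute (merge() performs an inner join by default)
sedentary <- merge(
  toy_intensity[toy_intensity$Intensity_chr == "Sedentary", ],
  toy_heart,
  by = c("Id", "ActivityTime")
)

# Peak sedentary heart rate per user
peaks <- aggregate(Mean_HeartR ~ Id, data = sedentary, FUN = max)

# Share of users whose sedentary peak exceeds the threshold, in percent
share_above <- round(100 * mean(peaks$Mean_HeartR > 115), 2)
print(share_above)
```

In the toy data user A peaks at 70 BPM and user B at 120 BPM while sedentary, so half of the users exceed the threshold.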

#### **4.2.10 Examine caloric relations for intensity, METs and steps** 

In [None]:
#### Examine caloric relations for Intensity, METs and Steps #### 
#===============================================================#

# Join and preview dataframe #

Calories_Intensities_MET_Steps_d <- 
  inner_join(Calories_d, Intensities_d, 
             by = c("Id", "ActivityTime"))

Calories_Intensities_MET_Steps_d <- 
  inner_join(Calories_Intensities_MET_Steps_d, Steps_d, 
             by = c("Id", "ActivityTime"))

Calories_Intensities_MET_Steps_d <- 
  inner_join(Calories_Intensities_MET_Steps_d, MET_d, 
             by = c("Id", "ActivityTime"))

Calories_Intensities_MET_Steps_d <- 
  Calories_Intensities_MET_Steps_d[,c(1,2,8,7,6,5,4,9,10,3)]

glimpse(Calories_Intensities_MET_Steps_d)

In [None]:
# Create plots #

## Intensities - Calories ##

### Prepare plots ###

Plot_Intensities_Calories1 <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = SedentaryMinutes, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  theme(axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Plot_Intensities_Calories2 <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = LightlyActiveMinutes, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  theme(axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Plot_Intensities_Calories3 <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = FairlyActiveMinutes, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  theme(axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Plot_Intensities_Calories4 <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = VeryActiveMinutes, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  theme(axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

### Arrange and display plots ###

options(repr.plot.width = 10, repr.plot.height = 10) # adjust plot size for jupyter notebook

grid.arrange(Plot_Intensities_Calories1,Plot_Intensities_Calories2,
             Plot_Intensities_Calories3,Plot_Intensities_Calories4, 
             top = textGrob("Correlation between Intensity Levels and Calories", 
                            gp = gpar(fontface = "bold", cex = 2)))

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* Generally, more active time unsurprisingly results in more calories being burned; however, the relation is neither linear nor exponential <br>
→ this suggests that it is more about the quality of the time and how effectively it is spent (i.e. if you have lots of sedentary time but spend the few minutes of active time at an increased intensity, you can burn as many calories as someone who has less sedentary time but does not spend the active time as intensively)


</div>
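
The plots above only show a smoothed trend, so a correlation coefficient could put a number on the "not linear" impression. The following is a minimal sketch and not part of the original analysis: `correlation_table()` is a hypothetical helper and the demo data is synthetic. It compares the linear (Pearson) with the rank-based (Spearman) coefficient; a Spearman value clearly above the Pearson value hints at a monotonic but curved relation.

```r
# Hypothetical helper: Pearson vs. Spearman correlation of several
# columns against one target column (base R only).
correlation_table <- function(df, x_cols, y_col) {
  data.frame(
    variable = x_cols,
    pearson = vapply(x_cols, function(col) {
      cor(df[[col]], df[[y_col]], method = "pearson", use = "complete.obs")
    }, numeric(1), USE.NAMES = FALSE),
    spearman = vapply(x_cols, function(col) {
      cor(df[[col]], df[[y_col]], method = "spearman", use = "complete.obs")
    }, numeric(1), USE.NAMES = FALSE)
  )
}

# Synthetic demo data (NOT the Fitbit data): monotonic but curved relation
demo <- data.frame(VeryActiveMinutes = 0:60)
demo$Calories <- 1800 + 400 * sqrt(demo$VeryActiveMinutes)

correlation_table(demo, "VeryActiveMinutes", "Calories")
```

For the real data the helper could be called with `Calories_Intensities_MET_Steps_d`, the four intensity columns and `"Calories"`, assuming those columns are numeric.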

In [None]:
## TotalSteps - Calories / METs - Calories ##

### Prepare plots ###

Plot_Steps_Calories <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = TotalSteps, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Correlation between Calories and Steps") +
  xlab("Total Steps") +
  theme(plot.title = element_text(size = 16.5, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

Plot_METs_Calories <-
  ggplot(Calories_Intensities_MET_Steps_d, 
         aes(x = Mean_METs_d, y = Calories)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Correlation between Calories and METs") +
  xlab("Mean METs / d") +
  theme(plot.title = element_text(size = 16.5, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

### Arrange and display plots ###

options(repr.plot.width = 10, repr.plot.height = 8) # adjust plot size for jupyter notebook

grid.arrange(Plot_Steps_Calories, Plot_METs_Calories, nrow = 1)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

No surprises here: more steps and higher MET-levels result in more calories being burned.

</div>

#### **4.2.11 Examine sleep relations** 

In [None]:
#### Examine Sleep relations ####
#===============================#

# Join and preview dataframe #

Calories_Intensities_MET_Sleep_Steps_d <- 
  inner_join(Calories_Intensities_MET_Steps_d, Sleep_d, 
             by = c("Id", "ActivityTime"))

glimpse(Calories_Intensities_MET_Sleep_Steps_d)

In [None]:
# Create plots #

## Sleep - Calories / Sleep - METs / Sleep - Steps ##

### Prepare plots ###

Plot_Calories_Sleep <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_d, 
         aes(x = Calories, y = TotalMinutesAsleep)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Calories and Minutes Asleep") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

Plot_METs_Sleep <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_d, 
         aes(x = METs_d, y = TotalMinutesAsleep)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "METs and Minutes Asleep") +
  xlab("METs / d") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

Plot_Steps_Sleep <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_d, 
         aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Steps and Minutes Asleep") +
  xlab("Total Steps") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

### Arrange and display plots ###

options(repr.plot.width = 11, repr.plot.height = 9) # adjust plot size for jupyter notebook

grid.arrange(Plot_Calories_Sleep, Plot_METs_Sleep, Plot_Steps_Sleep, nrow = 1)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* More sleep does not mean that more calories are burned
* Higher METs per day only partially result in better sleep; a range of 1500-2500 METs seems to have the opposite effect
* More steps per day generally seem to go along with slightly worse sleep (fewer minutes asleep)

  → this may be due to higher exhaustion and adrenaline production, causing the body to require a phase of recovery 

* Avoid overtraining for better sleep   
</div>

In [None]:
## Sleep - Intensity ##

### Prepare plots ###

Plot_SedentaryMinutes_Sleep <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_d, 
         aes(x = SedentaryMinutes, y = TotalMinutesAsleep)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Sedentary Minutes and Minutes Asleep") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

Plot_VeryActiveMinutes_Sleep <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_d, 
         aes(x = VeryActiveMinutes, y = TotalMinutesAsleep)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Very Active Minutes and Minutes Asleep") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

### Arrange and display plots ###

options(repr.plot.width = 10, repr.plot.height = 8.5) # adjust plot size for jupyter notebook

grid.arrange(Plot_SedentaryMinutes_Sleep, 
             Plot_VeryActiveMinutes_Sleep, nrow = 1)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* More sedentary time generally seems to worsen sleep
* Active minutes (when not overtraining) generally seem to improve sleep

</div>

#### **4.2.12 Examine heart rate relations** 

In [None]:
#### Examine Heart Rate Relations ####
#====================================#

# Join and preview dataframe #

Calories_Intensities_MET_Sleep_Steps_HeartRate_d <-
  inner_join(Calories_Intensities_MET_Sleep_Steps_d, HeartRate_d, 
             by = c("Id", "ActivityTime"))

glimpse(Calories_Intensities_MET_Sleep_Steps_HeartRate_d)

## Heart Rate - Calories / Heart Rate - Sleep ##
 
### Prepare plots ###

Plot_Calories_HeartRate <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_HeartRate_d, 
         aes(x = Calories, y = Mean_HeartRate_d)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Correlation between Heart Rate and Calories") +
  ylab("Heart Rate") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

Plot_Sleep_HeartRate <-
  ggplot(Calories_Intensities_MET_Sleep_Steps_HeartRate_d, 
         aes(x = TotalMinutesAsleep, y = Mean_HeartRate_d)) +
  geom_point(colour = "plum2") +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Correlation between Heart Rate and Sleep") +
  ylab("Heart Rate") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14.5),
        axis.text = element_text(size = 13.5))

### Arrange and display plots ###

options(repr.plot.width = 11, repr.plot.height = 8.5) # adjust plot size for jupyter notebook

grid.arrange(Plot_Calories_HeartRate, Plot_Sleep_HeartRate, nrow = 1)

<div style = "background-color: #e6f2ff">

#### **<u>Findings</u>**

* A mean heart rate of 70-80 BPM per day seems to be the best range for burning calories
* Higher mean heart rates seem related to worse sleep

</div>

<div style = "background-color: #ADE8F4">

## **5. Share-Phase** 
</div>

#### **5.1 Summary of Device Usage Trends**

<u>User Engagement</u>
* User engagement for *Calories*, *Intensities* and *Steps* is 100%
* User engagement for *Distance* and *METs* is 97.14%
* User engagement for *Sleep* is 71.43%
* User engagement for all other features is below 50%
  * *Heart Rate* is only tracked by 42.86%
  * Critically low user engagement is observable for *LogActivity*, *Weight*, *Fat*, *BMI*
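
The engagement figures above are shares of users who used a feature at least once. As a minimal sketch of that arithmetic (the helper name and the counts are assumptions for illustration only, not values taken from the analysis):

```r
# Hypothetical helper: share of users (in percent) who used a feature
# at least once, given the Ids seen for the feature and all known Ids.
engagement_pct <- function(feature_ids, all_ids) {
  n_feature <- length(unique(feature_ids))
  n_total <- length(unique(all_ids))
  round(100 * n_feature / n_total, 2)
}

# Illustrative Ids only (assumed): 35 users in total, 25 of them with
# at least one record for the feature; repeated Ids are counted once
all_ids <- 1:35
feature_ids <- rep(1:25, times = 3)

engagement_pct(feature_ids, all_ids)
```

With the real data, `feature_ids` would be the `Id` column of a feature table and `all_ids` the `Id` column of the table that contains every user.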
 
<u>Usage ratio</u>
* *Calories* and *METs* were tracked around 87-88% of the time
* *Intensities* and *Steps* were tracked around 77% of the time
* *Sleep* and *Heart Rate* were tracked 12-15% of the time
* Every other feature was used less than 5% of the time

<u>Device Usage</u>
* There are no apparent patterns for when activity was tracked (Friday seems to be the only day with slightly less average tracking activity)
* On average the device was worn for the entire day around 72% of the time
  * Around 22% of the time the device was worn for 15 to under 24 hours
  * Around 1% of the time the device was worn for less than 10 hours a day
 
<u>Calories</u>
* Average calories per day: 2283
* Around 40% of the time 2000 calories or less are burned, 15% of the time more than 3000 calories
* Most calories burned on Mondays and Saturdays
* Most calories burned between 14-17 and 12-13 o'clock (weekend)

<u>Intensities</u>
* Average sedentary minutes per day: 1063
* On average around 81% of the time is spent in a sedentary state, fairly and very active activity makes up only around 3% of the total time
* Only around 13% of the time users have less than 240 sedentary minutes a day
* Most sedentary minutes are spent in the morning (9-10 AM), the most active time is between 14-17 and 12-13 o'clock (weekend) 

<u>METs</u>
* Average METs per day: 2065 (1.43/m)
* 100% of the time at least 450-900 METs per week
* Level-1-activity makes up around 85% of the time, activity above level 5 makes up only 2.64% of the time

<u>Sleep</u>
* Average sleeping time per day: 6.8 hours (7.41 hours in bed)
* Time spent asleep makes up 92% of the time in bed, 8% is spent awake or restless
* Around 46% of the time users sleep less than 7 hours
* Core sleeping time is 23/24-5/6 o'clock
  
<u>Steps</u>
* Average steps per day: 8292
* Around 36% of the time more than 10000 steps a day, around 28% of the time less than 5000 steps
* Most steps taken between 14-17 o'clock, on the weekend generally fewer steps 

<u>Heart Rate</u>
* More than half of the users reach peak heart rates of 180-200 BPM
* During sedentary time
  * the majority of users have a peak heart rate of over 115 BPM
  * 33% of the users have peak heart rates of around 155-180 BPM
* For most users the minimum heart rate drops below 60 BPM, for half of the users it drops below 50 BPM
* Around 67% of the time is spent at a heart rate of 50-80 BPM, 7% at 100-150 BPM and 0.4% at over 150 BPM

<u>Interrelationships</u>
* Generally more active time unsurprisingly results in more calories being burned, however, the relation is not linear or exponential 
* Overtraining can cause worse sleep
* More sedentary time worsens sleep
* Higher mean heart rates (longer phases at higher heart rates) seem related to worse sleep


#### **5.2 Analysis and Targets**

<div style = "background-color: #e6f2ff">

* Usage of the *Sleep*, *Heart Rate*, *LogActivity*, *Weight*, *Fat* and *BMI* features has room for improvement <br>
  → possible reasons for the low engagement rate could be that <br>
      a.) for sleep the device needs to be worn overnight and something keeps customers from wanting to do that <br>
          -is the device uncomfortable when worn over extended periods? <br>
          -do people charge their batteries during the night? <br>
          -are people aware of the importance of sleep and the benefits that come from tracking it? <br>
          -do people have any fears that keep them from wearing the device 24/7? (e.g. damaging the device, malfunctions, EMF radiation) <br>
      b.) the features require manual tracking/activation
</div>
<div style = "background-color: #e2fee2">

**<u>Targets</u>**
* Boost user engagement for less-used features
* Increase comfort and wearability
* Reduce manual tracking
</div> 

<div style = "background-color: #e6f2ff">

* The caloric average is 2283 per day
* In general a calorie intake of 2000-2500 calories per day is recommended and calories should be burned accordingly (see [NHS](https://www.nhs.uk/live-well/healthy-weight/managing-your-weight/understanding-calories/#:~:text=An%20ideal%20daily%20intake%20of,women%20and%202%2C500%20for%20men.))
* 40% of the time less than 2000 calories are burned, 15% of the time more than 3000 calories are burned
* Too much of a (constant) calorie deficit can 
  * cause muscle loss instead of fat loss
  * deprive the body of important macronutrients
  * result in eating disorders
  * cause fatigue, dizziness, depression, inattention <br>
(cf. [this article](https://www.everydayhealth.com/weight/can-more-calories-equal-more-weight-loss.aspx))
* Too much of a (constant) calorie surplus can cause obesity, which can lead to
  * heart diseases
  * high blood pressure
  * diabetes
  * cancer  
(cf. [this article](https://www.ncbi.nlm.nih.gov/books/NBK235013/)) <br></br>
* The optimal time for burning calories seems to be late afternoon to early evening, which is partly achieved by most (see [this article](https://goodstretch.uk/time-of-the-day-when-you-burn-the-most-and-least-calories-for-weight-loss/#:~:text=We%20burn%20most%20calories%20in,regardless%20of%20what%20we%20do.))
</div>
<div style = "background-color: #e2fee2">
    
**<u>Targets</u>**
* Guide users and provide feedback on their caloric needs
</div>

<div style = "background-color: #e6f2ff">

* For most users the sedentary time per day is extremely high (1063 minutes on average)
* High sedentary times can cause severe health problems such as
  * an increased mortality risk
  * cancer, diabetes, cardiovascular diseases, depression  
  * degeneration of cognitive health
  * worse sleep <br>
([see WHO](https://www.who.int/news-room/fact-sheets/detail/physical-activity#:~:text=periods%20of%20time.-,Sedentary%20screen%20time%20should%20be%20no,1%20hour%3B%20less%20is%20better.))
* People with more than 240 minutes of sedentary time per day have a medium to high risk of developing health problems ([see here](https://www.medicalnewstoday.com/articles/sitting-down-all-day))
* It is generally thought that a high sedentary time can have effects as bad as those of smoking or obesity
* In conjunction, MET-levels are ok but could be better; the minimum of 450-900 METs per week is met, however
</div>
<div style = "background-color: #e2fee2">
    
**<u>Targets</u>**
* Reduce sedentary time and increase average MET-level 
</div>

<div style = "background-color: #e6f2ff">

* The average sleeping time of 6.8 hours per day is lower than the recommended 7-9 hours
* Adults with less than 7 hours of sleep a night have more health issues
  * mental health problems
  * loss of productivity
  * greater likelihood of death
  * insufficient body regeneration <br>
 (cf. [this article](https://www.nhlbi.nih.gov/health/sleep-deprivation))     
* Irregular sleeping times can worsen the quality of sleep
</div>
<div style = "background-color: #e2fee2">

**<u>Targets</u>**
* Increase average sleeping time
* Increase quality of sleep
* Improve mental health 
</div>
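
The comparison against the recommended 7-9 hours can be sketched as follows. This is only an illustration under assumptions: `classify_sleep()` is a hypothetical helper and the minute values are made up; with the real data it would be applied to a column such as `TotalMinutesAsleep`.

```r
# Hypothetical helper: label nightly sleep against a recommended range
# (default 7-9 hours, both bounds counted as "within").
classify_sleep <- function(minutes_asleep, lower_h = 7, upper_h = 9) {
  hours <- minutes_asleep / 60
  labels <- ifelse(hours < lower_h, "below",
                   ifelse(hours > upper_h, "above", "within"))
  factor(labels, levels = c("below", "within", "above"))
}

# Made-up nights (in minutes), not taken from the dataset
nights <- c(360, 420, 480, 540, 600)
labels <- classify_sleep(nights)

# Share of nights per category
prop.table(table(labels))
```

The share in the "below" category corresponds to the kind of figure reported above for nights with less than 7 hours of sleep.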

<div style = "background-color: #e6f2ff">

* The average step count per day is 8292
* A general aim for steps is 10000 a day, however, newer research suggests 7500 steps are enough to significantly decrease the risk of death (see [Harvard Health Publishing](https://www.health.harvard.edu/staying-healthy/how-many-steps-should-i-take-each-day))
</div>
<div style = "background-color: #e2fee2">

**<u>Targets</u>**
* Encourage users to get in 7500 steps or more per day
</div>

<div style = "background-color: #e6f2ff">

* During sedentary time the majority of users reach peak heart rates between 115 and 180 BPM and minimum heart rates of 60 BPM or less
* Normal resting heart rates range from 55-100 BPM 
* Heart rates below that can be a sign of bradycardia, heart rates above that can be a sign of tachycardia <br>
→ both can cause fatigue and frequent fainting or even lead to an increased risk of strokes, heart failure and cardiac arrest (cf. [American Heart Association](https://www.heart.org/en/health-topics/arrhythmia/about-arrhythmia/tachycardia--fast-heart-rate)) 
* More than half of the users reach peak heart rates between 180 and 200 BPM
* Overtraining can worsen sleep
* The optimal heart rate range for training, which improves performance and reduces exhaustion and overtraining, is 50-80% of the maximum heart rate (often estimated as 220 - age) (cf. [Harvard Health Publishing](https://www.health.harvard.edu/heart-health/what-your-heart-rate-is-telling-you))
</div>
<div style = "background-color: #e2fee2">
    
**<u>Targets</u>**
* Improve cardiovascular health
* Avoid overtraining 
</div>
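
The training range mentioned above (50-80% of a maximum heart rate estimated as 220 - age) can be written down as a small calculation. This is a rough sketch with hypothetical function names; the 220 - age rule is only a common estimate and the example age is arbitrary.

```r
# Hypothetical helper: estimated maximum heart rate and training zone.
# Assumes the rule of thumb max HR = 220 - age.
target_hr_zone <- function(age, lower = 0.5, upper = 0.8) {
  max_hr <- 220 - age
  c(max_hr = max_hr, lower_bpm = lower * max_hr, upper_bpm = upper * max_hr)
}

# Hypothetical helper: is a measured heart rate inside the zone?
in_target_zone <- function(bpm, age) {
  zone <- target_hr_zone(age)
  bpm >= zone[["lower_bpm"]] & bpm <= zone[["upper_bpm"]]
}

target_hr_zone(30)                          # zone for an example age of 30
in_target_zone(c(90, 120, 170), age = 30)   # check three example readings
```

A check like `in_target_zone()` is the kind of rule that feedback during training could be based on.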

#### **5.2.1 Dashboards**

![Bellabeat DB 1.png](attachment:94e79217-c341-4812-bc09-cb5a074421ee.png)

<div style = "background-color: #F2F3F4">
   
</div>

![Bellabeat DB 2.png](attachment:dd8a44fa-9ebe-4107-80a5-9f615766a23d.png)

#### **5.3 Conclusion**
#### **5.3.1 Recommendations**

#### **Gamification**

One way to boost user engagement and encourage more activity might be the implementation of gaming elements, such as experience points and rewards. Each feature could come with a ranking system that would need individual levelling. Charts like radar graphs could be used for visualising current levels or statistics.  

In a further step, these statistics could then be put into a personalised trading card of the user consisting of an avatar, an overall level, strength values for each feature, general attributes that describe user behaviour and so on. 

All of this could be combined with other gamification elements. For example, as you earn more experience points and gain more levels you could be rewarded in-app currency with which you could unlock customisables (e.g. new and funny titles, new visuals or themes for your trading card/avatar). There could also be a streak system for days (experience multipliers based on how many days the device is used in a row) to boost daily usage.  

Adding to this, another way to foster user engagement might be to enable certain amounts of in-app currency to be transferable to real life, by granting slight discounts on Bellabeat's products or other benefits.

#### **Contests and Community Interaction**

Tying in with the gamification elements, there could be optional contests or challenges in which users try to beat ghost data from other users, as well as weekly/daily challenges for the user to complete.

There could be some sort of community feature with which users could motivate each other, possibly a 'like'-feature that sends an anonymous sign of support to a random device (this should be a passive feature to avoid spamming).

#### **Alerts and Reminders**

Alerts and reminders should be issued to provide users with feedback when they are at risk of not meeting their caloric needs or when they accumulate too much sedentary time. There should also be reminders and encouragement for taking short breaks or for not sitting too much. These would need to be dosed properly to avoid a bad user experience.

Users should also be educated and reminded about the negative effects unhealthy lifestyles can have. A life clock could be implemented counting and showing how much life expectancy changes according to the user's behaviour. Occasional fun facts could be issued to the default screen, showing the positive and negative effects of different lifestyles.

During training there could be visual or audible feedback for training in the optimal heart rate range in order to prevent overtraining.

There should be reminders to prepare for bed in a timely manner and options to reduce blue light from the device. 

#### **Introduction of new features**

Device compatibility could be improved to reduce manual tracking, for example by connecting the device to a scale that automatically uploads weight, BMI and fat levels to the device and keeps them up to date.

Personal fitness programmes based on users' individual needs could be included with the device, offering them a roadmap to follow.

Meditation programmes could be included with the device that users could follow to offer support for their mental health. This could be especially useful before or after bed time to improve sleep quality and to start/end a day in a relaxed state.

Battery life should be extended so overnight charging becomes less of an issue.

More variety in watch bands could be offered to improve comfort and wearability (but also style, as Bellabeat's main target group is women).


<div style = "background-color: #f5edf5">

#### **5.3.2 New Marketing Strategy**


The new marketing strategy should concentrate on the following aspects:
* Forging a unique community (the 'Bellabeat family') with new features such as community support, challenges and contests
* Customisable and personalised user experience
* Improved features such as extended battery life, new variety and improved wearability that allow for new levels of comfort
* The 'we'-feeling with which Bellabeat supports its customers and their individual training needs and goals (e.g. through alerts, different programmes and support)

</div>

<div style = "background-color: #ADE8F4">
    
## **6. Act-Phase** 
</div>

Once a decision on the recommendations and new marketing strategy has been made, actions should be taken accordingly. Further surveys (focusing especially on qualitative data) might be conducted to generate more insightful data. After a set period of time (usually a year) another analysis should be conducted to evaluate the results.