In [165]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Project Overview

## Bellabeat

Bellabeat is a high-tech wellness company. Sando Mur and Urška Sršen created smart devices that monitor biometric and lifestyle data to help women better understand how their bodies work and make healthier choices. Bellabeat devices not only support a healthy lifestyle but are also beautiful pieces of jewellery designed by Urška Sršen, artist and co-founder of Bellabeat.

# Ask

### Key business task:

Gain more insight into the behavior of non-Bellabeat smart-device users to find new opportunities to develop the device or improve the marketing strategy.

To address the key business task, my analysis will answer the questions below:

1.  What are some trends in smart device usage?
2.  How could these trends apply to Bellabeat customers?
3.  How could these trends help influence Bellabeat marketing strategy?

# Prepare

### Data sources used:

[FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) (author: Mobius), a dataset hosted on Kaggle. The data license is [CC0: Public Domain](https://wiki.creativecommons.org/wiki/CC0_use_for_data), which means there is no copyright. The data can be copied, modified and distributed.

Installing and loading packages

In [166]:
install.packages('tidyverse')
install.packages('lubridate')
install.packages('janitor')
install.packages('repr')

In [167]:
library(tidyverse)
library(lubridate)
library(janitor)
library(repr)

Importing datasets

In [168]:
activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
calories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
sleep <- read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

# Process (Data Cleaning)

### Tools used:

R for data cleaning, Tableau for visualization

### Data inspecting

View the first and last rows of each dataset to get familiar with the table structure, the variable naming convention and the column data types, and to look for NA values.

***Activity*** **dataset**

In [169]:
head(activity,10)
tail(activity,10)
str(activity)

any(is.na(activity))

Converting column ActivityDate from *char* to *Date*

In [170]:
activity$ActivityDate <- as.Date(activity$ActivityDate, format = "%m/%d/%Y")
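
The conversion above relies on the format string matching the raw text. As a minimal sketch, assuming the dates are stored as month/day/year strings (the values below are toy data, not taken from the dataset), this shows what happens to a value that does not match the format:

```r
# Toy values in month/day/year layout, plus one ISO-style value that does not match
raw_dates <- c("4/12/2016", "5/1/2016", "2016-04-12")

parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Values that do not match the format become NA instead of raising an error,
# so counting NAs after the conversion is a cheap sanity check
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Running the same `sum(is.na(...))` check on the converted column is one way to confirm that no row was silently lost during parsing.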

***Calories*** **dataset**

In [171]:
head(calories, 10)
tail(calories, 10)
str(calories)
any(is.na(calories))

Converting column ActivityDay from *char* to *Date*

In [172]:
calories$ActivityDay <- as.Date(calories$ActivityDay, format = "%m/%d/%Y")

***Sleep*** **dataset**

In [173]:
head(sleep)
tail(sleep)
str(sleep)
any(is.na(sleep))

Converting column SleepDay from *char* to *Date*

In [174]:
sleep$SleepDay <- as.Date(sleep$SleepDay, format = "%m/%d/%Y")
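
If the SleepDay strings also carry a time part, `as.Date()` still works with a date-only format because it ignores trailing characters. The sketch below assumes a layout like `"4/12/2016 12:00:00 AM"` (an assumption about the raw file) and compares this with parsing the full timestamp explicitly:

```r
# Assumed raw layout: date followed by a 12-hour clock time
sleep_day_raw <- "4/12/2016 12:00:00 AM"

# Date-only format: trailing characters after the year are ignored
date_only <- as.Date(sleep_day_raw, format = "%m/%d/%Y")

# Explicit alternative: parse the whole timestamp, then drop the time
# (%p is locale-dependent; this assumes an English-style AM/PM marker)
full_ts <- as.POSIXct(sleep_day_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
date_from_ts <- as.Date(full_ts)

print(date_only)
print(date_from_ts)
```

The explicit version is more verbose but fails visibly (returns NA) if the time part has an unexpected layout.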

Counting unique users in each dataset

In [175]:
n_distinct(activity$Id)
n_distinct(calories$Id)
n_distinct(sleep$Id)
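
When the user counts differ between datasets, it can help to see which users are missing. A small base R sketch with hypothetical ids (not real values from the data):

```r
# Hypothetical user ids for illustration only
activity_ids <- c(1001, 1002, 1003, 1004)
sleep_ids <- c(1001, 1003, 1003)

# Base R equivalent of n_distinct()
n_sleep_users <- length(unique(sleep_ids))

# Users with activity records but no sleep records
missing_sleep <- setdiff(activity_ids, sleep_ids)

print(n_sleep_users)
print(missing_sleep)
```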

Searching for duplicates in each dataset

In [176]:
sum(duplicated(activity))
sum(duplicated(calories))
sum(duplicated(sleep))

Cleaning duplicates

In [177]:
activity <- activity %>% 
  distinct() 

calories <- calories %>% 
  distinct()

sleep <- sleep %>% 
  distinct()

Verification

In [178]:
sum(duplicated(activity))
sum(duplicated(calories))
sum(duplicated(sleep))
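
The checks above only catch rows that are identical in every column. As a hedged sketch on toy data (column names and values are made up), the following also looks for repeated id/day pairs whose other values differ:

```r
# Toy data: row 2 is an exact copy of row 1; rows 4 and 5 share id and day only
toy <- data.frame(
  id = c(1, 1, 2, 2, 2),
  day = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13", "2016-04-13")),
  minutes = c(400, 400, 380, 380, 390)
)

# Fully identical rows
n_full_dupes <- sum(duplicated(toy))

# unique() is the base R counterpart of dplyr::distinct()
toy_unique <- unique(toy)

# Repeated id/day keys that survive full-row de-duplication
n_key_dupes <- sum(duplicated(toy_unique[, c("id", "day")]))

print(n_full_dupes)
print(n_key_dupes)
```

A non-zero key count would mean some user has more than one record for the same day, which needs a decision (keep first, average, or investigate) before merging.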

Cleaning column names for consistent variable naming and renaming the date column in the activity data

In [179]:
activity <- clean_names(activity)
calories <- clean_names(calories)
sleep <- clean_names(sleep)

activity <- activity %>% 
  rename(activity_day = activity_date)

Summary of each dataset

In [180]:
activity %>% 
  select(total_steps, total_distance, sedentary_minutes, calories) %>% 
  summary()

calories %>% 
  select(calories) %>% 
  summary()

sleep %>% 
  select(total_sleep_records, total_minutes_asleep, total_time_in_bed) %>% 
  summary()

# Analyze

In [181]:
activity_calories_merged <- merge(activity, calories, by=c("id", "activity_day", "calories"))
# Total active minutes per day: sum of the three activity intensity levels
activity_calories_merged$activity_minutes <- activity_calories_merged$very_active_minutes +
  activity_calories_merged$fairly_active_minutes +
  activity_calories_merged$lightly_active_minutes
head(activity_calories_merged)
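
By default `merge()` performs an inner join, so rows whose key values do not match in both tables are dropped without a warning. A minimal sketch on toy data (values are illustrative) that counts how many rows are lost:

```r
# Toy tables sharing the same three key columns
left <- data.frame(
  id = c(1, 1, 2),
  activity_day = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  calories = c(1985, 1797, 2100),
  total_steps = c(13162, 10735, 8000)
)
right <- data.frame(
  id = c(1, 1, 2),
  activity_day = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  calories = c(1985, 1797, 2150)  # last value disagrees with `left`
)

merged <- merge(left, right, by = c("id", "activity_day", "calories"))

# Rows lost because at least one key value did not match
n_dropped <- nrow(left) - nrow(merged)
print(n_dropped)
```

Comparing `nrow()` before and after the merge is a quick way to confirm that the two sources agree on every id, day and calorie value.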

In [187]:
options(repr.plot.width=10, repr.plot.height=10)

ggplot(data = activity_calories_merged, mapping = aes(x=total_distance, y=calories)) +
  geom_point()+
  geom_smooth(method = 'loess') +
  theme_light(base_size = 15) +
  labs(x="Total activity distance in km", y="Calories burned",title = "Distance and calories correlation")

In [183]:
options(repr.plot.width=10, repr.plot.height=10)

ggplot(data=activity_calories_merged, aes(x=total_steps))+
  geom_histogram(bins = 20, colour='darkblue', fill='blue', alpha=0.7)+
  geom_vline(aes(xintercept=mean(total_steps), colour='red'), linewidth=0.75)+
  theme_light(base_size=15) +
  labs(colour = "Average amount of steps - 7406", x= "Total steps", title="Total steps recorded in activities")

In [184]:
activity_intensity <- activity %>%
    select(id, activity_day, very_active_distance, moderately_active_distance, light_active_distance)

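# Reshape to long format: one row per id, day and intensity level.
# pivot_longer() comes with tidyverse; melt() would need reshape2, which is never loaded here.
activity_wide <- activity_intensity %>%
    pivot_longer(cols = -c(id, activity_day), names_to = "distance_intensity", values_to = "km")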

head(activity_wide)

In [185]:
options(repr.plot.width=10, repr.plot.height=10)

ggplot(data=activity_wide)+
    geom_violin(aes(x=distance_intensity, y=km, fill=distance_intensity), alpha=0.7)+
    theme_light(base_size=15)+
    guides(x =  guide_axis(angle = 25)) +
    ylim(0,10)+
    labs(x="Distance intensity", y="Kilometres of activity", title="Plot of distance intensity and kilometres of activity")+
    scale_fill_manual(values=c("very_active_distance"="red","moderately_active_distance"="orange","light_active_distance"="yellow"))