   # **CASE STUDY: BELLABEAT**


> # **INTRODUCTION**

We are living witnesses to how technology has changed our world in many ways. I, for example, belong to a generation that can remember the world before internet access. Every day we see people directing their efforts toward improving their lives through technology: transportation, communication, and education are some of the areas in which we rely on smart devices to perform our daily activities more efficiently.
I prepared this case study to complete the Google Data Analytics certification. It analyzes one of the ways we have brought technology into our lives, specifically to improve our health and daily habits. In this case I work on behalf of Bellabeat with data from a device used to track activity and health.
Bellabeat, a high-tech company that manufactures health-focused smart products for women, needs to analyze smart device data to gain insight into how its customers use its products. The company wants to guide its marketing strategy with the findings of this analysis.

> # **TABLE OF CONTENTS**

[1.	SUMMARY](#1)

[2.	ASK](#2)

[3.	PREPARE](#3)

[4.	PROCESS](#4)

[5.	ANALYZE](#5)

[6.	SHARE - ACT](#6)


<a id="1"></a> <br>
> # **1.	SUMMARY**

Founded by Urška Sršen and Sando Mur in 2013, Bellabeat is a high-tech company that manufactures health-focused products specifically for women. Since it was founded, the company has grown and positioned itself in the tech-driven wellness market.
Their product is a tracker called “Leaf” that can be worn as a bracelet, a necklace, or a clip. It tracks data related to the user's activity and health, such as sleep, stress, menstrual cycle, and mindfulness habits. The device connects to the Bellabeat app, which collects this data, shows it to the user, and gives advice to help them build healthy habits and make healthy decisions.
The company’s marketing analytics team believes that analyzing usage data from smart devices will help them understand their customers and gain insight into how they use the devices, which can guide future marketing strategies.


<a id="2"></a> <br>
> # **2.	ASK**

Urška Sršen wants to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. Then, she wants to select one Bellabeat product to apply these insights to. These questions will guide the analysis:

* What are some trends in smart device usage? 
* How could these trends apply to Bellabeat customers? 
* How could these trends help influence Bellabeat marketing strategy? 


**2.1. Business task**

Guide the company’s future marketing strategies by identifying usage trends in smart devices that apply specifically to women.


**2.2. Stakeholders**

* Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
* Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
* Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.


<a id="3"></a> <br>
> # **3. PREPARE**

**3.1. Data source**

For this case study, the company’s CEO suggested working with a public dataset that explores smart device usage: [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) (CC0: Public Domain, dataset made available through [Mobius](https://www.kaggle.com/arashnic)). It is a Kaggle dataset and, after checking its license, I confirmed that it is open and that we can work with it.
The dataset comprises 18 .csv files with information about the use of personal fitness trackers by thirty Fitbit users, who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.


**3.2. Dataset preview**

After downloading the files and storing them in a folder created for this case study only, I reviewed, filtered, and sorted each file to learn more about the data and its quality and to understand how it is organized.
I discovered that the study tracked the activity of 33 users over a period of 31 days between March 2016 and May 2016. It recorded information such as activity, intensity, calories burned, heart rate, weight, and sleep over different time frames.


**3.3. Limitations**

There are some limitations I found in the dataset suggested by the CEO.

* It is out of date (it covers a period between March and May 2016).
* The sample size may be too small to be representative for this kind of study.
* We have no demographic information about the participants, so we do not know whether they are men or women. For this analysis it would be better to use a larger sample of women only.

Acknowledging these limitations, I did some research but unfortunately could not find other datasets, so I decided to continue with the analysis and gain some insight into the general usage of this kind of device.


<a id="4"></a> <br>
> # **4. PROCESS**

During the process phase, I started by setting up my workspace in RStudio Cloud. I created a new project named BelleBeat_CaseStudy, installed the packages that I expected to need, and imported the dataset files. For this study I decided to work only with the daily datasets. Since daily_activity already contains the information held in other tables, such as calories, steps, and intensity of activity, I decided to work with only that table, daily_sleep, and weight_log.
I reviewed each table and checked the number of subjects in each one to get an idea of future merging possibilities.

I cleaned and formatted some columns and documented every step.
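
The check on merging possibilities can be sketched as follows. This is a minimal illustration in base R, assuming hypothetical `Id` vectors (`activity_ids`, `sleep_ids`) in place of the real `Id` columns. The values are made up and are not taken from the dataset.

```r
# Hypothetical Id vectors standing in for the Id columns of the real tables
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids    <- c(101, 103, 105)

# Users present in both tables are the only ones an inner join would keep
shared_ids <- intersect(activity_ids, sleep_ids)
length(shared_ids)

# Users that appear in the activity table but would be lost in that join
dropped_ids <- setdiff(activity_ids, sleep_ids)
dropped_ids
```

Comparing the number of shared users with the number of users in each table gives a quick idea of how much data a merge would discard.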


In [None]:
    # Packages installation and libraries loading 
    
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("skimr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("here")
install.packages("janitor")

library(tidyverse)
library(ggplot2)
library(skimr)
library(dplyr)
library(lubridate)
library(here)
library(janitor)
library(tidyr)

In [None]:
  # Data sets import

daily_activity <- read.csv ("../input/bellabeat/daily_activity.csv" , header = TRUE, sep = ",")
daily_sleep <- read.csv ("../input/bellabeat/daily_sleep.csv" , header = TRUE, sep = ",")

In [None]:
  # Data frames preview and cleaning

n_distinct(daily_activity$Id)
n_distinct(daily_sleep$Id)

sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))

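# Lowercase the column names. clean_names() from janitor is not applied here:
# it would produce snake_case names (e.g. activity_date) that the code below does not expect.
daily_activity <- rename_with(daily_activity, tolower)
daily_sleep <- rename_with(daily_sleep, tolower)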

  
daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))


str(daily_activity)
str(daily_sleep)

<a id="5"></a> <br>
> # **5. ANALYZE**

* I started by determining device usage per user, calculating the share of each day that each user wore the device. I then classified users into three categories:

 * Top user: Time worn ≥ 95%
 * Regular user: Time worn at least 70% and below 95%
 * Occasional user: Time worn < 70%
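
The usage classification above can be sketched on a toy table. This is only an illustration in base R: the data frame `usage` and its values are made up, and it assumes, as the analysis does, that the sum of logged minutes in all intensity bands approximates the time the device was worn.

```r
# Toy data: minutes per day in each intensity band (made-up values)
usage <- data.frame(
  id                   = c(1, 2, 3),
  veryactiveminutes    = c(30, 10, 0),
  fairlyactiveminutes  = c(20, 15, 5),
  lightlyactiveminutes = c(200, 150, 100),
  sedentaryminutes     = c(1190, 900, 600)
)

minutes_per_day <- 24 * 60  # 1440

# Time worn is approximated by the total of all logged minutes
usage$device_usage_min <- with(
  usage,
  veryactiveminutes + fairlyactiveminutes + lightlyactiveminutes + sedentaryminutes
)
usage$device_usage_percent <- usage$device_usage_min / minutes_per_day * 100

# Left-closed intervals: [0, 70) occasional, [70, 95) regular, [95, Inf) top
usage$user_classification <- cut(
  usage$device_usage_percent,
  breaks = c(-Inf, 70, 95, Inf),
  labels = c("occasional_user", "regular_user", "top_user"),
  right  = FALSE
)

# The same thresholds expressed in hours per day
threshold_hours <- c(0.70, 0.95) * 24
```

With these thresholds, 70% and 95% of a day correspond to 16.8 and 22.8 hours.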
 
 

* Classification of users based on activity levels:
“According to the 10,000 Steps Project, people who take fewer than 5,000 steps a day have a sedentary lifestyle. Increasing your activity level to anywhere between 7,500 and 10,000 steps would place you into the moderate, or somewhat active, level. Only those individuals who take more than 12,500 steps each day are considered highly active”. (livestrong.com)

 * Sedentary is less than 5,000 steps per day
 * Low active is 5,000 to 7,499 steps per day
 * Somewhat active is 7,500 to 9,999 steps per day
 * Active is 10,000 to 12,499 steps per day
 * Highly active is 12,500 or more steps per day
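
As a small sketch of how these step bands behave at their boundaries, the helper below maps average daily steps to a level with base R's `cut()`. The function name `classify_steps` is hypothetical and the input values are made up for illustration.

```r
# Hypothetical helper: map average daily steps to an activity level.
# right = FALSE makes each band include its lower bound, e.g. [5000, 7500).
classify_steps <- function(avg_steps) {
  cut(
    avg_steps,
    breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
    labels = c("sedentary", "low_active", "somewhat_active",
               "active", "highly_active"),
    right  = FALSE
  )
}

# Made-up averages chosen to sit on or near the band edges
levels_found <- classify_steps(c(4999, 5000, 7499.5, 10000, 12500))
as.character(levels_found)
```

Because the averages are not whole numbers, left-closed intervals avoid gaps such as the one between 7,499 and 7,500 steps.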
 


In [None]:
  # Summarizing averages
    # Classification of Activity levels
    # User Classification

daily_activity ["total_minutes_active"] <- daily_activity$veryactiveminutes + daily_activity$fairlyactiveminutes + daily_activity$lightlyactiveminutes

daily_activity ["device_usage_min"] <- daily_activity$total_minutes_active + daily_activity$sedentaryminutes

daily_activity ["device_usage_percent"] <- (daily_activity$device_usage_min / 1440) * 100 
  

daily_activity_avg <- daily_activity %>%
  group_by(id) %>%
  summarise (daily_steps_avg = mean(totalsteps),
             daily_calories_avg = mean(calories),
             daily_sedentaryminutes_avg = mean(sedentaryminutes),
             daily_activeminutes_avg = mean(total_minutes_active),
             device_usage_min_avg = mean(device_usage_min),
             device_usage_perc_avg = mean(device_usage_percent)) %>% 
  
  mutate(activity_level = case_when(
    daily_steps_avg < 5000 ~ "sedentary",
    daily_steps_avg >= 5000 & daily_steps_avg < 7500 ~ "low_active",
    daily_steps_avg >= 7500 & daily_steps_avg < 10000 ~ "somewhat_active",
    daily_steps_avg >= 10000 & daily_steps_avg < 12500 ~ "active",
    daily_steps_avg >= 12500 ~ "highly_active"
    )) %>% 
  
  mutate(user_classification = case_when(
    device_usage_perc_avg >= 95 ~ "top_user",
    device_usage_perc_avg >= 70 & device_usage_perc_avg < 95 ~ "regular_user",
    device_usage_perc_avg < 70 ~ "occasional_user"
  ))

daily_activity <- merge(daily_activity, daily_activity_avg, by=("id"))


  # Remove unnecessary columns

daily_activity <- subset(
  daily_activity, select = -c(
    totaldistance, trackerdistance, loggedactivitiesdistance,veryactivedistance,
    moderatelyactivedistance, lightactivedistance, sedentaryactivedistance,
    daily_calories_avg, daily_activeminutes_avg, daily_steps_avg))

  # Average of daily sleep/user

daily_sleep_avg <- daily_sleep %>% 
  group_by(id) %>% 
  summarise (daily_sleep_avg_min = mean(totalminutesasleep))

daily_sleep_avg["daily_sleep_avg_hr"] <- daily_sleep_avg$daily_sleep_avg_min / 60

# Round to two decimals and keep the columns numeric (format() would turn them into character strings)
daily_sleep_avg$daily_sleep_avg_min <- round(daily_sleep_avg$daily_sleep_avg_min, 2)
daily_sleep_avg$daily_sleep_avg_hr <- round(daily_sleep_avg$daily_sleep_avg_hr, 2)


  # Analyzing User classification

user_class_totals <- daily_activity_avg %>% 
  group_by(user_classification) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_classification) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))


user_class_totals %>% 
  ggplot (aes (x = "", y = total_percent, fill = user_classification)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text (hjust = 0.5, size=14, face = "bold")) +
  scale_fill_brewer(palette="Blues") +
  geom_text (aes (label = labels),
            position = position_stack (vjust = 0.5)) +
  labs(title="User Classification")


# Analyzing User's activity levels

user_activity_level <- daily_activity_avg %>% 
  group_by(activity_level) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(activity_level) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_activity_level$activity_level <- factor(user_activity_level$activity_level,
                                             levels = c("highly_active", "active",
                                                        "somewhat_active", "low_active",
                                                        "sedentary"))

user_activity_level %>% 
  ggplot (aes (x = "", y = total_percent, fill = activity_level)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text (hjust = 0.5, size=14, face = "bold")) +
  scale_fill_brewer(palette="YlGnBu", direction = -1) +
  geom_text (aes (label = labels),
             position = position_stack (vjust = 0.5)) +
  labs(title="User's Activity Level")

* I looked for a relationship between device usage and each user's average daily activity, and here is what I found:

In [None]:
  # Relation between Daily Activity and Device usage 

ggplot (data = daily_activity_avg) +
  aes (x = daily_activeminutes_avg, y = device_usage_min_avg, color = user_classification) +
  geom_point(size = 5) +
  theme_minimal() +
  theme(panel.border = element_blank(),
        plot.title = element_text (hjust = 0.5, size=14, face = "bold")) +
  labs (title = "Daily Activity VS Device Usage",
        x = "Daily Activity (min)", y = "Device Usage (min)")

* Another scatter plot I made shows the relationship between each user's daily steps and calories burnt, and here is the result:


In [None]:
  # Relation between Daily Steps and Calories burnt

daily_activity_avg$activity_level <- factor(daily_activity_avg$activity_level,
                                             levels = c("highly_active", "active",
                                                        "somewhat_active", "low_active",
                                                        "sedentary"))

ggplot (data = daily_activity_avg) +
  aes (x = daily_steps_avg, y = daily_calories_avg, color = activity_level) +
  geom_point(size = 5) +
  scale_color_brewer(palette = "YlOrRd", direction = -1) +
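  # One trend line across all users; the color mapping would otherwise split the smoother by activity level
  geom_smooth(aes(group = 1), color = "blue") +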
  theme_minimal() +
  theme(panel.border = element_blank(),
        plot.title = element_text (hjust = 0.5, size=14, face = "bold")) +
  labs (title = "Daily Steps VS Calories Burnt",
        x = "Daily Steps", y = "Calories Burnt")





* And here is the plot of minutes active against calories burnt:

In [None]:
  # Relation between Minutes Active and Calories burnt

ggplot (data = daily_activity_avg) +
  aes (x = daily_activeminutes_avg, y = daily_calories_avg, color = activity_level) +
  geom_point(size = 5) +
  scale_color_brewer(palette = "YlOrRd", direction = -1) +
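  # One trend line across all users; the color mapping would otherwise split the smoother by activity level
  geom_smooth(aes(group = 1), color = "red") +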
  theme_minimal() +
  theme(panel.border = element_blank(),
        plot.title = element_text (hjust = 0.5, size=14, face = "bold")) +
  labs (title = "Minutes Active VS Calories Burnt",
        x = "Minutes Active", y = "Calories Burnt")

<a id="6"></a> <br>
> # **6. SHARE - ACT**

* Starting with device usage, we can say that the sample falls into three groups:

 * The largest group, with 39.4%, was people who used the device almost all the time (at least 22.8 hr/day). This means they wore the device even while sleeping and took it off only briefly, for example to take a bath, and some of them even wore it 24 hours a day.
 * Next come regular users (between 16.8 hr and 22.8 hr per day) with 36.4%, which suggests they wore the device mostly while they were awake.
 * Finally, with 24.2%, come the occasional users. We could say they were not as disciplined as the other users: they forgot to wear the device on some days or wore it only for a few hours.
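
Group shares like the ones above come from counting users per label and dividing by the total. A minimal base R sketch, using a made-up vector of labels (`user_labels`) and not the real per-user results:

```r
# Hypothetical per-user labels, for illustration only
user_labels <- c("top_user", "top_user", "regular_user",
                 "occasional_user", "regular_user")

# Count users per label, then convert the counts to shares of the total
shares <- prop.table(table(user_labels))

# Shares as percentages with one decimal
round(100 * shares, 1)
```

The shares always sum to 1, which is a quick sanity check before plotting them in a pie chart.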



* One of the company's goals should be to encourage users to wear the device every day and for longer periods.

 * Analyzing the activity levels, I noticed that highly active and active people are fewer than the other groups, with most users falling in the moderate activity levels.
Here is an opportunity for the company: developing marketing strategies that show how adopting this kind of technology motivates people to lead more active and healthier lives.

 * The step count feature is an effective way to show how the product helps track healthy habits. The steps vs calories plot shows that the two tend to be related, which is important for confirming the user's progress and keeping them motivated.

 * Another important trend that the company should exploit is the use of the device at different moments of the day and on different occasions, especially if the target market is women and if, as is often assumed, this group tends to be more guided by fashion trends than men. With this, the company could increase sales by getting a single user to buy more than one device, to wear on different occasions while continuing to track their vitals.
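
The relationship between steps and calories mentioned above is read from a scatter plot. One way to put a number on it would be a correlation coefficient and a simple linear fit. The sketch below uses a made-up data frame `toy` (not values from the dataset), so it shows only the technique, not a result of this study.

```r
# Made-up per-user averages, for illustration only (not values from the dataset)
toy <- data.frame(
  daily_steps_avg    = c(3000, 6000, 8000, 11000, 14000),
  daily_calories_avg = c(1800, 2100, 2200, 2600, 2900)
)

# Pearson correlation (the default method of cor())
r <- cor(toy$daily_steps_avg, toy$daily_calories_avg)

# Simple linear fit: the slope estimates extra calories per additional step
fit <- lm(daily_calories_avg ~ daily_steps_avg, data = toy)
slope <- coef(fit)[["daily_steps_avg"]]
```

Applied to `daily_activity_avg`, the same two calls would indicate how strong the trend in the plot is. With only 33 users, any such estimate should be read with caution.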
