# **<center>Project Bellabeat: How Can A Wellness Technology Company Play It Smart?</center>**

<img align = "center" src = "http://appletechtalk.com/wp-content/uploads/2022/07/Bellabeat-logo.jpg">

## **<center>Dubem Ngini</center>**
### **<center>March 12, 2023</center>**


This report is based on a capstone project from the **[Google Data Analytics Project](https://www.coursera.org/learn/google-data-analytics-capstone/supplement/ZsmDD/case-study-2-how-can-a-wellness-company-play-it-smart)** on Coursera which is also posted on **[GitHub](https://github.com/deengini/bellabeat_capstone_project)**. 

The goal of this project is to provide insights to Bellabeat using the six stages of Data Analytics in the Google course: Ask, Prepare, Process, Analyze, Share, and Act. 

## **Step 1: ASK**
The goal of this stage is to identify the business task and consider the key stakeholders. 

### **1.1 Background**

**[Bellabeat](https://bellabeat.com/)** is a femtech health company founded by Urška Sršen and Sando Mur in 2013 which manufactures health-focused smart products. 

Sršen, co-founder and Chief Creative Officer, launched the company by using her art background to develop beautifully designed technology that informs and inspires women around the world to be more proactive about their health and habits.

By 2016, Bellabeat had opened offices around the world and launched multiple products, such as:

* **[Bellabeat app](https://play.google.com/store/apps/details?id=com.bellabeat.cacao&hl=en&gl=US&pli=1)** an app that monitors health data and menstrual cycles by syncing with other Bellabeat smart products.

* **[Leaf](https://bellabeat.com/shop/)** a wellness tracker that can be worn as a bracelet, necklace or clip and is used to track sleep and stress. 

* **[Time](https://bellabeat.com/shop/)** a watch device that connects to the Bellabeat app and tracks activity, sleep and stress.

* **[Spring](https://bellabeat.com/shop/)** a water bottle that connects to the Bellabeat app and tracks daily water intake to make sure that customers stay hydrated during the day.

Bellabeat invests year-round in Google Search and YouTube ads, maintains active [Facebook](https://web.facebook.com/bellabeat/?_rdc=1&_rdr) and [Instagram](https://www.instagram.com/bellabeat/?hl=en) pages, and consistently engages consumers on [Twitter](https://twitter.com/GetBellaBeat) in order to stay updated on current marketing trends.

Here is an example of a promotional campaign by Bellabeat for **[International Women's Day](https://www.youtube.com/watch?v=D_LXAydFIw0&ab_channel=BellabeatWellness)**

However, Urška Sršen believes that Bellabeat can do more to extend its growth, and that the best way to do this is by analyzing customer data from similar products (e.g., Fitbit).

### **1.2 Business Task**

Analyze data from Fitbit fitness trackers to provide insights into how customers use these devices. Then use those insights to recommend improvements to Bellabeat's marketing strategy.

### **1.3 Key Stakeholders**

* Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

* Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

* Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.


### **1.4 Deliverables**

The Bellabeat team expects the following deliverables: 

* A clear summary of the business task.

* A description of all data sources used.

* A comprehensive documentation of any cleaning or manipulation of data.

* A summary of your analysis.

* Supporting visualizations and key findings.

* A list of recommendations based on your analysis.

## **Step 2: PREPARE**

The goal of this stage is to present information about the data being used and acknowledge its limitations. 

### **2.1 Data Source & Format**

* The data used for this project is publicly available on [Kaggle: FitBit: Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) and is stored in 18 CSV files in long format.

* Personal fitness data was generated from a survey of 30 FitBit users via Amazon Mechanical Turk from 12 March 2016 to 12 May 2016. 

* The data included duration of physical activity (in minutes), daily activity, calories spent, steps and distance covered. 

### **2.2 Is Data ROCCC?**

Data credibility and bias are judged by the ROCCC criteria, which stand for Reliable, Original, Comprehensive, Current, and Cited.

* Reliable: **LOW** because it has a small sample size of only 30 participants. 

* Original: **LOW** because it was collected by a third-party provider (Amazon Mechanical Turk). 

* Comprehensive: **MEDIUM** because most of the parameters of the data line up with the parameters of Bellabeat products.

* Current: **LOW** because the data is 7 years old and exercise habits and requirements may have changed since then.

* Cited: **LOW** because the data was collected from a third party and the original source is unknown. 


### **2.3 Limitations of the data**

* The sample size of the data is so small (30) that it cannot be used as an adequate representation of the fitness population. 

* The data is more than 7 years old and so might be irrelevant or outdated.

### **2.4 Data Selection**

The data selected for analysis is the daily_activity_merged.csv file

## **Step 3: PROCESS**

The goal of this stage is to make sure that the data is error-free, complete, relevant and ready for analysis. 

### **3.1 Tools**

R programming language will be used for data cleaning, transformation, and visualization. 


### **3.2 Preparing the environment**

The following R libraries were installed and loaded for easy use. 

In [None]:
# install and load packages 
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

library(tidyverse)
library(skimr)
library(janitor)
library(dplyr)
library(lubridate)

### **3.3 Importing the dataset**

Reading in the daily activity dataset as a csv file.

In [None]:
# import daily activity data
activity_df <- read_csv("daily_activity_merged.csv")

### **3.4 Data cleaning and formatting**


#### 3.4.1 Preview the dataset

First, preview the first six rows of the dataset (the default for `head()`).

In [None]:
# preview dataset 
head(activity_df)

#### 3.4.2 Check for missing values

Then, we can check for missing values 

In [None]:
# check for missing data 
missing_data <- is.na(activity_df)

# count the number of missing values in each column
num_null <- colSums(missing_data)
print(num_null)

From the results, we can see that there are no missing values.

#### 3.4.3 Create a column for total mins and then convert it to hours 

In [None]:
# Create columns for total time spent in minutes and then hours
activity_df2 <- activity_df %>%
  mutate(total_mins = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes) %>% 
  mutate(total_hours = total_mins/60)
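
Because a day has 1440 minutes, the derived total can double as a sanity check on the tracker data. The following is a minimal base R sketch of that idea, not part of the original analysis; `toy_df` and its values are made up for illustration, and the `within_one_day` column is a hypothetical name.

```r
# Toy data standing in for the daily activity table (values are made up)
toy_df <- data.frame(
  VeryActiveMinutes    = c(25, 0, 60),
  FairlyActiveMinutes  = c(13, 0, 30),
  LightlyActiveMinutes = c(328, 0, 200),
  SedentaryMinutes     = c(728, 1440, 1200)
)

# Sum the four activity columns row by row, then convert to hours
minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")
toy_df$total_mins  <- unname(rowSums(toy_df[, minute_cols]))
toy_df$total_hours <- toy_df$total_mins / 60

# A day has 1440 minutes, so a larger total points to a recording problem
toy_df$within_one_day <- toy_df$total_mins <= 1440
print(toy_df)
```

Rows flagged `FALSE` would be worth inspecting before they feed into any percentage or average.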

#### 3.4.4 Change the datatype of the columns 

We can check the data type of the columns

In [None]:
# check datatype of columns 
str(activity_df2)

The column activity date is listed as character type, so we have to change it to date type. 

In [None]:
# change datatype of activity date column from character to Date type
activity_df2$ActivityDate <- as.Date(activity_df2$ActivityDate, format = "%m/%d/%Y")

In [None]:
# Check again for datatype of the columns 

str(activity_df2)

We can see that it works!
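
One thing to watch with `as.Date()` is that strings which do not match the given format become `NA` instead of raising an error. The sketch below illustrates this on hypothetical date strings (the vector `raw_dates` is made up, and the third entry is deliberately in a different format).

```r
# Hypothetical date strings in month/day/year form, plus one in another format
raw_dates <- c("4/12/2016", "5/1/2016", "2016-05-12")

# Parse with an explicit format; strings that do not match become NA
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Count entries that failed to parse so they can be inspected before analysis
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Checking the number of `NA` values after conversion is a cheap way to confirm that every date in the column was read as intended.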

#### 3.4.5 Extract months and days from activity date into separate columns

For further analysis, we may want to isolate days and months from our date column.  

In [None]:
# extract weekday names and month numbers from the date column
activity_df3 <- activity_df2 %>%
  mutate(activity_days = weekdays(ActivityDate)) %>% 
  mutate(activity_months = month(ActivityDate))

Let's see the new columns 

In [None]:
# previewing new columns 
head(activity_df3)

Notice that the month column contains numbers. Let's convert it to month names instead.

In [None]:
# convert month to month names
activity_df3$activity_months <- month.name[activity_df3$activity_months]

Let's look at the data now. 

In [None]:
# previewing new columns 
head(activity_df3)

Perfect, now the months all have names. 

#### 3.4.6 Rearrange the order of columns in the dataset

We can rearrange the columns in the dataset:

In [None]:
# rearrange the columns of the dataset
activity_df4 <- select(activity_df3, "Id", "ActivityDate", "activity_months", "activity_days", "TotalSteps", "TotalDistance", "TrackerDistance", "LoggedActivitiesDistance", "VeryActiveDistance",
                       "ModeratelyActiveDistance", "LightActiveDistance", "SedentaryActiveDistance", "VeryActiveMinutes", "FairlyActiveMinutes", 
                       "LightlyActiveMinutes", "SedentaryMinutes", "total_mins", "total_hours", "Calories")

#### 3.4.7 Rename the columns in the dataset

Notice that the names in the dataset do not follow typical R naming conventions so we can rename them 

In [None]:
# rename all the columns in the dataset
#check the names
names(activity_df4)

# rename with dplyr
activity_df5 <- rename(activity_df4, "id" = "Id", "date" = "ActivityDate", "months" = "activity_months", "week_days" = "activity_days", "total_steps" = "TotalSteps", 
                      "total_dist" = "TotalDistance", "track_dist" = "TrackerDistance", "logged_dist" = "LoggedActivitiesDistance", "very_active_dist" = "VeryActiveDistance",
                      "moderate_active_dist" = "ModeratelyActiveDistance", "light_active_dist" = "LightActiveDistance", 
                      "sedentary_active_dist" = "SedentaryActiveDistance", "very_active_mins" = "VeryActiveMinutes", 
                      "fairly_active_mins" = "FairlyActiveMinutes", "lightly_active_mins" = "LightlyActiveMinutes", 
                      "sedentary_mins" = "SedentaryMinutes", "total_mins" = "total_mins", 
                      "total_hours" = "total_hours", "calories" = "Calories")

In [None]:
# make sure the column names are all unique and consistent
activity_df6 <- clean_names(activity_df5)

In [None]:
# check the names of the columns again 
head(activity_df6)

#### 3.4.8 Validate number of participants 

The number of participants should be 30 and we can cross-check that by looking at the number of ids we have 

In [None]:
# check for number of unique ids 
length(unique(activity_df6$id))

The number of participants is 33!
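
Counting unique IDs shows how many participants there are, but not how many days each one contributed. A minimal sketch of that follow-up check, using a made-up `toy_ids` vector in place of the real `id` column, could look like this:

```r
# Toy IDs standing in for the id column (values are made up)
toy_ids <- c(101, 101, 101, 102, 102, 103)

# Number of distinct participants
n_participants <- length(unique(toy_ids))

# Number of daily records per participant, smallest first
records_per_id <- sort(table(toy_ids))

print(n_participants)
print(records_per_id)
```

Participants with very few records may deserve a closer look, since they contribute less to any daily average.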

### **3.5 Observations**

The data is now ready for analysis but before that, here are some quick observations about the data: 

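* The original data has 15 columns and 940 rows with no missing or null values. The cleaned data has 19 columns after adding the months, week days, total minutes and total hours columns.

* There are 33 unique IDs instead of the 30 participants stated in the dataset description. This may not mean there were more participants (some may have been recorded under more than one ID), but it does mean that our data may be skewed.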

* There is no info on age or gender so we might not be able to tell how different age groups might use FitBit products.

### **Export and Save Cleaned Data**

After cleaning and validating the data, it is important to export it as a csv file.

In [None]:
# export the cleaned data as a csv file
write.csv(activity_df6, "cleaned_data.csv", row.names = FALSE)

## **Step 4: ANALYZE**

The goal of this stage is to identify trends and relationships in the data from summary statistics. 


### **4.1 Perform Calculations**

In order to draw preliminary insights into the data, there are certain statistics that are needed: 

* Min and Max values

* Mean

* Median 

* 1st Quartile 

* 3rd Quartile

In [None]:
# find summary statistics of the data
activity_summary <- data.frame(summary(activity_df6))
activity_summary
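
For readers who want the six statistics listed above in a compact table, the following base R sketch computes them per column. It is an illustration on made-up values, not the notebook's own output; `toy_df` and the helper `six_num()` are hypothetical names.

```r
# Toy data standing in for two columns of the cleaned data (values are made up)
toy_df <- data.frame(
  total_steps = c(0, 4000, 8000, 12000, 16000),
  calories    = c(1500, 2000, 2300, 2600, 3100)
)

# Six-number summary for one numeric vector
six_num <- function(x) {
  c(min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x))
}

# Apply to every column; the result is a matrix with one column per variable
stats <- sapply(toy_df, six_num)
print(stats)
```

Each statistic can then be looked up by name, for example `stats["mean", "total_steps"]`.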

**Interpreting these findings:**

**Steps:** The average number of daily steps recorded by Fitbit was 7638. A study published in **[The Lancet](https://www.thelancet.com/action/showPdf?pii=S2468-2667%2821%2900302-9)** stated that individuals who walked an average of 6000 to 10,000 steps were less likely to face health issues, regardless of gender or age. In fact, their health issues were cut by half. However, it is difficult to judge how many steps were taken by women alone, since gender was not recorded in this dataset. According to this study, age plays a bigger role than gender when it comes to steps: the threshold for people over 60 was 3000 steps, while for people under 60 it was 5000 steps. Because the data does not include age, there is no way of knowing how an elderly demographic might interact with Bellabeat [1].

**Calories burnt:** According to our data, the average number of calories burnt per day was 2304, while the maximum was 4900. According to a **[Healthline article](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#_noHeaderPrefixedContent)** by Katey Davidson, women need to burn about 1980 calories a day, so on average the participants in the survey exceeded that figure, although their gender is not recorded in the data.

**Outliers:** There were several outliers showing that one individual consistently recorded over 20,000 steps a day. This could be due to a recording error, a very athletic individual, or someone with a demanding job, according to a **[Healthline article](https://www.healthline.com/health/average-steps-per-day#occupation)** by Adrienne Santos.
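
The text does not say how the outliers were identified. One common approach, sketched here as an assumption and not as the method used in this report, is Tukey's rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. The `steps` vector below is made up for illustration.

```r
# Toy daily step counts (values are made up); the last one is unusually high
steps <- c(3000, 5000, 7000, 8000, 9000, 11000, 36000)

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q     <- unname(quantile(steps, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- steps[steps < lower | steps > upper]
print(outliers)
```

Flagged values are candidates for inspection, not automatic deletions, since a very active person can legitimately record high step counts.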

## **Step 5: SHARE**

The goal of this stage is to create visualizations and communicate findings to the stakeholders. 

### **5.1 Steps per Day**

The following graph compares the number of steps that are taken on each week day

In [None]:
# create a bar chart of total steps by day of week 
ggplot(activity_df6, aes(x = week_days, y = total_steps, fill = week_days)) +
  geom_bar(stat = "identity") +

  # add chart title, subtitle and caption
  labs(x = "Days of the Week", y = "Total Steps", title = "Total Steps by Day of the Week",
       subtitle = "Sample of Fitbit Tracking Data from March 2016 to May 2016",
       caption = "Data source: Amazon Mechanical Turk") +

  #customize chart theme 
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12),
        plot.caption = element_text(hjust = 0.5, size = 8))

**Insights**

* The most steps were taken on Tuesdays, which is contrary to the assumption that the most steps would be taken on traditional non-workdays like Saturdays and Sundays. In fact, Saturdays and Sundays had the fewest steps, with Sundays the lowest.

* This suggests that most of the participants probably work demanding jobs that get busier during the middle of the week and then choose to rest on Sundays instead.
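
A caveat when reading a chart of summed steps is that a weekday with more recorded days will have a larger total even if people walk the same amount on it. The sketch below, which uses made-up records in a hypothetical `toy_df`, compares the sum with the mean per weekday and orders the days from Monday to Sunday instead of alphabetically.

```r
# Toy records (values are made up): weekday name and steps for that day
toy_df <- data.frame(
  week_days   = c("Monday", "Tuesday", "Tuesday", "Sunday", "Saturday", "Tuesday"),
  total_steps = c(8000, 9000, 7000, 5000, 6000, 8000)
)

# Order the weekdays Monday to Sunday instead of alphabetically
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
toy_df$week_days <- factor(toy_df$week_days, levels = day_levels)

# The sum grows with the number of records; the mean does not
sum_steps  <- tapply(toy_df$total_steps, toy_df$week_days, sum)
mean_steps <- tapply(toy_df$total_steps, toy_df$week_days, mean)
n_records  <- table(toy_df$week_days)

print(rbind(sum = sum_steps, mean = mean_steps, n = n_records))
```

In this toy example Tuesday has the largest total only because it has three records, while its mean matches Monday's. Checking the record count per weekday in the real data would show whether the same effect applies there.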

### **5.2 Calories burned per steps**

The following graph compares calories burned to steps taken

In [None]:
# create a scatter plot chart of total steps by calories
ggplot(activity_df6, aes(x = total_steps, y = calories)) +
  geom_point(color = "steelblue") +

  # add trendline
  geom_smooth(method = "lm", se = FALSE, color = "red") +

  # add chart title, subtitle and caption
  labs(x = "Total Steps", y = "Calories", title = "Total Steps vs. Calories Burned",
       subtitle = "Sample of Fitbit Tracking Data from March 2016 to May 2016",
       caption = "Data source: Amazon Mechanical Turk") +

  #customize chart theme 
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12),
        plot.caption = element_text(hjust = 0.5, size = 8)) +
  annotate("text", x = 5000, y = 200, label = "Note: Outlier data point at above 35,000 steps",
           size = 4, color = "black", hjust = 0)

**Insights**

* From the graph, we can see that there is a positive correlation between calories burnt and number of steps taken. This suggests that taking more steps is associated with burning more calories, which supports physical health.

* As mentioned earlier, there were several outliers in the number of daily steps taken by the participants of the survey. However, the graph shows that the maximum number of steps, at above 35,000, does not correspond with an increase in calories burnt but rather a sharp drop. This might be due to a faulty device or human error from self-reporting.
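
The strength of the relationship seen in the scatter plot can also be expressed as a number. The following sketch, on made-up values and with hypothetical variable names, computes the Pearson correlation coefficient with `cor()` and the slope of the same kind of linear trendline with `lm()`.

```r
# Toy data (values are made up): daily steps and calories
total_steps <- c(2000, 5000, 8000, 11000, 14000)
calories    <- c(1700, 2000, 2250, 2600, 2900)

# Pearson correlation coefficient: +1 is a perfect positive linear relationship
r <- cor(total_steps, calories)

# Slope of the fitted line: estimated extra calories per additional step
fit   <- lm(calories ~ total_steps)
slope <- coef(fit)[["total_steps"]]

print(round(r, 3))
print(slope)
```

Running the same two calls on the cleaned data would put a figure on the trend, keeping in mind that a correlation describes an association and does not by itself show cause.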

### **5.3 Calories per Distance**

The following graph compares the number of calories that are burnt per distance travelled by each participant. 

In [None]:
# create a scatter plot chart of total distance by calories
ggplot(activity_df6, aes(x = total_dist, y = calories)) +
  geom_point(color = "steelblue") +

  # add trendline 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  
  # add chart title, subtitle and caption
  labs(x = "Total Distance", y = "Calories", title = "Total Distance vs. Calories Burned",
       subtitle = "Sample of Fitbit Tracking Data from March 2016 to May 2016",
       caption = "Data source: Amazon Mechanical Turk") +
 
  #customize chart theme 
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12),
        plot.caption = element_text(hjust = 0.5, size = 8))

**Insights**

* Similarly to the graph in section 5.2, there is a positive correlation between distance travelled and calories burned. This indicates that the more active an individual is, the more likely they are to improve their physical health. 

* Again, there seems to be a significant outlier that bucks the trend of the data. This data point is at a similar position to the outlier in section 5.2, so it most likely comes from the same user and may be the same error.

### **5.4 Types of Activities**

The following chart tries to ascertain the lifestyle/habits of the participants by comparing types of activity based on total time spent per activity. 

In [None]:
# create a pie chart of the percentage of mins spent per activity

# calculate total minutes for each category 
total_very_active <- sum(activity_df6$very_active_mins)
total_fairly_active <- sum(activity_df6$fairly_active_mins)
total_lightly_active <- sum(activity_df6$lightly_active_mins)
total_sedentary <- sum(activity_df6$sedentary_mins)
total_mins <- sum(activity_df6$total_mins)

# calculate the percentage for each activity
perc_very_mins <- round((total_very_active / total_mins * 100), 1)
perc_fair_mins <- round((total_fairly_active / total_mins * 100), 1)
perc_light_mins <- round((total_lightly_active / total_mins * 100), 1)
perc_sed_mins <- round((total_sedentary / total_mins * 100), 1)

# Define the data
activity_levels <- c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")
percentages <- c(perc_very_mins, perc_fair_mins, perc_light_mins, perc_sed_mins)

# Create a data frame contain the activity level and percentages of time spent
percent_df <- data.frame(activity_levels, percentages)

# Create a pie chart using ggplot2
ggplot(percent_df, aes(x = "", y = percentages, fill = activity_levels)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0) +
  
  # add percentage labels using geom_text
  geom_text(aes(label = paste0(percentages, "%")),
            position = position_stack(vjust = 0.5)) +

  # add chart title, subtitle and caption            
  labs(title = "Percentage of Time spent per Activity",
       subtitle = "Sample of Fitbit Tracking Data from March 2016 to May 2016",
       caption = "Data source: Amazon Mechanical Turk") +
 
   #customize chart theme 
  theme_void() +
  theme(legend.position = "right", 
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12),
        plot.caption = element_text(hjust = 0.5, size = 12)) +
    
  #customize fill colors  
  scale_fill_manual(values = c("yellow", "orange", "red", "green"))

Since the data points are not really visible on the chart, let's pull up a table of the percentage. 

In [None]:
# Create a data frame and display it as a table
percent_df <- data.frame(activity_levels, percentages)
percent_df

**Insights**

* Sedentary activities made up the largest share of tracked time at 81.3%. This implies that most of the users were carrying out passive activities such as sitting or riding in a car or bus.

* Unfortunately, most participants were not really active which indicates the need for a change in exercise habits or lifestyles.
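
The percentage calculation behind the chart can be written more compactly. The sketch below shows one way to do it in base R with `colSums()`; `toy_df` and its values are made up, so the resulting shares are illustrative only.

```r
# Toy minutes per activity level (values are made up), one row per day
toy_df <- data.frame(
  very_active_mins    = c(20, 30),
  fairly_active_mins  = c(10, 20),
  lightly_active_mins = c(170, 150),
  sedentary_mins      = c(800, 800)
)

# Total minutes per category, then each category's share of all tracked minutes
totals <- colSums(toy_df)
shares <- round(100 * totals / sum(totals), 1)
print(shares)
```

Because the shares are computed from a single vector of totals, they are guaranteed to use the same denominator, which avoids mismatches between separately computed sums.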

## **Step 6: ACT**

The goal of this final stage is to summarize our insights and provide recommendations to the stakeholders based on our conclusions. 

### **6.1 Trends**

* Users were most active during the middle of the week and less active on weekends, perhaps because they are busiest at work midweek and rest on the weekends.

* Users were significantly more prone to engaging in sedentary activities overall, which defeats the purpose of having a fitness product in the first place. 

* Regardless of a preference for sedentary activities, users were still meeting the average goal for steps in a day. 

### **6.2 Relevance to Bellabeat**

Both FitBit and Bellabeat develop and manufacture fitness products that are used by women regardless of age or race. While Bellabeat is more focused on women's health, both companies encourage their customers to pursue a healthier lifestyle and to make healthier decisions.

### **6.3 Recommendations for Bellabeat**

* The Bellabeat marketing team should create promotional and educational material that promotes the importance of hydration to women in demanding jobs. For example, they can promote the Spring water bottle as a must-have accessory for women who might not remember to take adequate water breaks while at work.

* The marketing team should also create promotional content that encourages women to be more active on weekends even if it is for only a short period of time. 

* Finally, the marketing team should create an advertising campaign that highlights the benefits of being an active woman. This campaign can center on the Leaf and Time products and speak to women who may need that extra push to change their usual habits for the better.