# Google Data Analytics Professional Case Study: Bellabeat
## An Analysis of Wellness Technology Usage

## Scenario:

### About the company:

Urška Sršen and Sando Mur founded [Bellabeat](https://bellabeat.com), a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and displays ads on the Google Display Network to support campaigns around key marketing dates.

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.


### Stakeholders: 

1. **Urška Sršen**: Bellabeat co-founder and Chief Creative Officer
2. **Sando Mur**: Mathematician and Bellabeat co-founder; key member of the Bellabeat executive team
3. **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals — as well as how you, as a junior data analyst, can help Bellabeat achieve them.

### [Bellabeat Products](https://bellabeat.com/catalog/):

1. **Bellabeat app**: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
2. **Leaf**: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
3. **Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
4. **Spring**: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.


### Business Tasks Given By Stakeholders:

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

### Problem statement:

Bellabeat wants clear trends identified in customer usage data, both to inform its marketing strategy and to help customers build healthy habits into their daily activity.

## About the Data

### Credibility:

The data used in this study is third-party information gathered via a voluntary survey, stored and distributed on Kaggle.com. This data is completely open source under the CC0: Public Domain license, and the owner of this data has waived all rights to its use. It can be found [here](https://www.kaggle.com/datasets/arashnic/fitbit).

The data set contains enough activity data to support a high-level analysis of usage trends. Note that it describes Fitbit users rather than Bellabeat customers, so any conclusions about Bellabeat users are indirect.

The survey was completed with 30 participants over the course of 30 days. The data consists of 18 tables of calorie, activity time, activity intensities, sleep, steps, heart rate, and weight. There are wide and long formats of the intensities, calories per minute, and steps per minute.

### Potential Issues:

I have identified six potential issues with the data:

1. Some of the tables in this data set are incomplete and do not cover enough of the sample to be included in the analysis. These tables will be explained in more detail later in the Data Preparation phase.

2. The sample size of *n*=30, as stated in the Kaggle.com data description, is too small to support reliable marketing and user analysis. Thus, this analysis will need to remain high-level.

3. There is not sufficient data to make any inferences about the ‘Spring’ water bottle.

4. It is not clear which device collected each data point.

5. The study claims to have 30 participants, but 33 unique user IDs are present.

6. Our most concerning observation: The sample may not be representative of the population in question, and the data description on Kaggle.com is too limited to tell. We are not given the sex for each unique user ID, and the Bellabeat business tasks are targeted toward women. With this being a case study, we will proceed with the data that we have been provided.

## Methodology and Tools

The data consists of multiple tables in both wide and long formats. Given these differing formats, we will use R to conduct this analysis. We will process the data, primarily with dplyr, into a more useful format before we begin the analysis. We won't use the per-minute tables, as that level of detail is too fine for a sample size of n=30.

First, let's start by importing the tables that we are going to use.

In [None]:
# importing the data
dailyactivity <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
dailycalories <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
dailyintensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
dailysteps <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
heartrateseconds <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
hourlycalories <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourlyintensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourlysteps <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
sleepday <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightlog <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")


Now that all of the data that I am going to use is imported, I will load the packages into my instance that I will use to analyze the data.

In [None]:
# loading the packages
library(tidyverse)
library(lubridate)
library(dplyr)
library(tidyr)
library(skimr)

## Data Preparation

Now that my instance is set up, I can manipulate and prepare the data for exploration.

First, I want to look at all distinct users contained in each data frame.

In [None]:
# Checking for all distinct user IDs
n_distinct(dailyactivity$Id)
n_distinct(dailycalories$Id)
n_distinct(dailyintensities$Id)
n_distinct(dailysteps$Id)
n_distinct(heartrateseconds$Id)
n_distinct(hourlycalories$Id)
n_distinct(hourlyintensities$Id)
n_distinct(hourlysteps$Id)
n_distinct(sleepday$Id)
n_distinct(weightlog$Id)

It looks like the heart rate and weight log dataframes each contain fewer than half of the expected users (33), so I will not use these dataframes as part of the analysis.
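
The same coverage check can be run over all tables at once. The following is a minimal sketch on toy data: the `tables` list, the `id_counts` and `low_coverage` names, and the made-up IDs are assumptions for illustration, and the "fewer than half" threshold simply mirrors the rule applied above.

```r
# Toy stand-ins for the imported tables (IDs are invented for illustration)
tables <- list(
  dailyactivity = data.frame(Id = c(1, 1, 2, 3)),
  sleepday      = data.frame(Id = c(1, 2, 2)),
  weightlog     = data.frame(Id = c(3))
)

# Count the distinct user IDs in every table in one pass
id_counts <- sapply(tables, function(df) length(unique(df$Id)))

# Flag tables that cover fewer than half of the largest user count
expected_users <- max(id_counts)
low_coverage <- names(id_counts)[id_counts < expected_users / 2]

print(id_counts)
print(low_coverage)
```

With the real data, the list would hold the imported dataframes instead of the toy ones, and any table named in `low_coverage` would be a candidate for exclusion.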

Now, let's look for any inconsistencies in the dataframes.

In [None]:
glimpse(dailyactivity)
glimpse(dailycalories)
glimpse(dailyintensities)
glimpse(dailysteps)
glimpse(hourlycalories)
glimpse(hourlyintensities)
glimpse(hourlysteps)
glimpse(sleepday)

Before moving forward, the date columns should all be transformed to a common format so that they can be used in a join in a later step of the analysis. I will change all date columns in dataframes from a character format to a datetime format. The rest of the attributes should align as expected.

In [None]:
# Converting each date column (formatted as character) to a date-time format with the as.POSIXct function
# Daily tables hold dates only
dailyactivity$ActivityDate <- as.POSIXct(dailyactivity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
dailycalories$ActivityDay <- as.POSIXct(dailycalories$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
dailyintensities$ActivityDay <- as.POSIXct(dailyintensities$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
dailysteps$ActivityDay <- as.POSIXct(dailysteps$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
# Hourly and sleep tables hold a date plus a 12-hour clock time with AM/PM
hourlycalories$ActivityHour <- as.POSIXct(hourlycalories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourlyintensities$ActivityHour <- as.POSIXct(hourlyintensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourlysteps$ActivityHour <- as.POSIXct(hourlysteps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleepday$SleepDay <- as.POSIXct(sleepday$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
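
One thing worth knowing about this conversion: `as.POSIXct` returns `NA` rather than raising an error when a string does not match the format. Counting `NA` values afterwards is therefore a cheap way to confirm that every row parsed. The sketch below uses invented sample strings that follow the two format strings above, and a fixed `"UTC"` time zone so the example behaves the same on any machine; both are assumptions for illustration.

```r
# Sample strings in the layouts implied by the two format strings
daily_raw  <- c("4/12/2016", "4/13/2016")
hourly_raw <- c("4/12/2016 1:00:00 PM", "4/12/2016 12:00:00 AM")

# A fixed time zone keeps the example reproducible across machines
daily_parsed  <- as.POSIXct(daily_raw, format = "%m/%d/%Y", tz = "UTC")
hourly_parsed <- as.POSIXct(hourly_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# A string in a different layout silently becomes NA
bad <- as.POSIXct("2016-04-12", format = "%m/%d/%Y", tz = "UTC")

print(sum(is.na(daily_parsed)))   # number of failed daily parses
print(sum(is.na(hourly_parsed)))  # number of failed hourly parses
print(is.na(bad))
```

Running `sum(is.na(...))` on each converted column of the real dataframes would give the same kind of confirmation.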

I will double check a few of the dataframes to see if everything looks correctly transformed.

In [None]:
head(dailycalories)
head(hourlysteps)
head(sleepday)

Changing the character format date columns to a datetime format was a success. I can now join these tables into more holistic dataframes. I'll join 3 sets of tables to begin producing some summaries for preliminary insights.

In [None]:
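# joining daily activity and daily calories on user, date, and calories
# (the date columns have different names; keeping Calories in the key avoids a duplicated column)
activitycalories <- full_join(dailyactivity, dailycalories, by=c("Id", "ActivityDate"="ActivityDay", "Calories"))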
# joining daily intensity with daily steps
intensitysteps <- full_join(dailyintensities, dailysteps, by=c("Id", "ActivityDay"))
# joining hourly calories with hourly intensities
caloriesintensities <- full_join(hourlycalories, hourlyintensities, by=c("Id", "ActivityHour"))

# Previewing the new tables
head(activitycalories)
head(intensitysteps)
head(caloriesintensities)
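
Previewing the first rows does not reveal whether a join silently multiplied rows. A simple check is to compare row counts before and after the join and to look for repeated keys. The sketch below uses two toy tables with invented values; `left_tbl`, `right_tbl`, `rows_added`, and `duplicate_keys` are illustrative names, not part of the data set.

```r
library(dplyr)

# Toy daily tables (values invented for illustration)
left_tbl <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  StepTotal = c(5000, 7000, 3000)
)
right_tbl <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  SedentaryMinutes = c(700, 650, 900)
)

joined <- full_join(left_tbl, right_tbl, by = c("Id", "ActivityDay"))

# If the key identifies one row per user-day, the join should not add rows
rows_added <- nrow(joined) - max(nrow(left_tbl), nrow(right_tbl))

# Count user-day combinations that appear more than once after the join
duplicate_keys <- joined %>%
  count(Id, ActivityDay) %>%
  filter(n > 1) %>%
  nrow()

print(rows_added)
print(duplicate_keys)
```

A nonzero value in either check on the real tables would suggest that the join key does not uniquely identify a user-day.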

## Analysis Phase 1: Attribute Trends
### Summaries & looking at relationships between key attributes.

In [None]:
# Producing a summary of each pertinent attribute, using the select command to avoid duplicates
activitycalories %>%
    select(TotalSteps, VeryActiveMinutes, FairlyActiveMinutes, 
           LightlyActiveMinutes, SedentaryMinutes, Calories) %>%
    summary()
intensitysteps %>%
    select(LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance, StepTotal) %>%
    summary()
caloriesintensities %>%
    select(TotalIntensity, AverageIntensity, Calories) %>%
    summary()
sleepday %>%
    select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
    summary()

**Takeaways:**
1. Average lightly active minutes are far higher than the average for any other activity level.
2. The mean amount of sedentary minutes is many times higher than the mean amount of active minutes.
3. The mean step count per day is 7638, which is under the commonly recommended 10,000 steps.
4. According to [Healthline](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#_noHeaderPrefixedContent), the average person needs to consume around 2200-3000 calories per day. The average calories burned per day in the data is 2148, which is below that expected intake range. On average, participants may need to burn more calories to increase workout effectiveness.
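
To put a number on how sedentary time compares with active time, the category means can be turned into shares and a ratio. This is a minimal sketch on toy data: the column names match the daily activity table, but the values and the `share` and `sedentary_ratio` names are invented for illustration.

```r
# Toy daily records (values invented for illustration)
activity <- data.frame(
  VeryActiveMinutes    = c(20, 0, 30),
  FairlyActiveMinutes  = c(10, 5, 15),
  LightlyActiveMinutes = c(200, 150, 250),
  SedentaryMinutes     = c(900, 1100, 700)
)

# Mean minutes per day in each category
mean_minutes <- colMeans(activity)

# Share of tracked time spent in each category
share <- mean_minutes / sum(mean_minutes)

# How many times larger sedentary time is than all active time combined
active_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes")
sedentary_ratio <- mean_minutes[["SedentaryMinutes"]] / sum(mean_minutes[active_cols])

print(round(share, 3))
print(round(sedentary_ratio, 2))
```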

### Looking at relationships between attributes of interest.

In [None]:
# Total steps versus calories
activitycalories %>%
    ggplot(aes(x=TotalSteps, y=Calories)) +
    geom_point(alpha = 0.3,  position = position_jitter()) +
    geom_smooth(method="gam", formula = y ~s(x)) +
    labs(title="Total Steps Vs. Calories")

In this plot, the points appear to fall into 4-5 bands, all trending upward: the more steps taken, the more calories burned. We do not have conclusive weight data to explore this relationship further.
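
The strength of a relationship like this can be summarized with a correlation coefficient and the slope of a simple linear fit. The sketch below uses invented step and calorie values, so its numbers say nothing about the actual data; it only shows the calculation. It also does not control for body weight, which the data cannot support.

```r
# Toy daily values (invented for illustration)
steps    <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories <- c(1700, 1900, 2000, 2300, 2400, 2800)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(steps, calories)

# Slope of a simple linear fit: estimated extra calories per additional step
fit <- lm(calories ~ steps)
slope <- coef(fit)[["steps"]]

print(round(r, 3))
print(round(slope, 4))
```

On the real data, the inputs would be `activitycalories$TotalSteps` and `activitycalories$Calories`.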

In [None]:
# Time in bed versus time sleeping
sleepday %>%
    ggplot(aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) +
    geom_point(alpha = 0.3,  position = position_jitter()) +
    geom_smooth(method="gam", formula = y ~s(x)) +
    labs(title="Time in Bed Vs. Time Sleeping") + annotate("text", x=750, y=300,
        label="too much time in bed without sleeping")

The majority of participants have regular sleep patterns; however, several of our participants seem to spend too much time in bed without sleeping.

**Takeaway:**
1. It might be wise to send a bedtime reminder to put away social media or, for those with sleep problems or insomnia, to recommend a form of meditation.
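
"Too much time in bed without sleeping" can be made concrete as a sleep efficiency ratio: minutes asleep divided by minutes in bed. The following is a minimal sketch on toy records; the 0.85 cutoff and the `SleepEfficiency` column are assumptions for illustration and do not come from the study.

```r
# Toy sleep records (values invented for illustration)
sleep <- data.frame(
  Id = c(1, 2, 3),
  TotalMinutesAsleep = c(420, 300, 450),
  TotalTimeInBed     = c(450, 750, 480)
)

# Share of time in bed actually spent asleep
sleep$SleepEfficiency <- sleep$TotalMinutesAsleep / sleep$TotalTimeInBed

# Hypothetical cutoff: 0.85 is an assumption for this sketch, not a value from the study
low_efficiency_cutoff <- 0.85
restless <- sleep[sleep$SleepEfficiency < low_efficiency_cutoff, ]

print(restless)
```

Records flagged this way would be the ones that a bedtime reminder is meant to address.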

In [None]:
# Looking at calories burnt vs. average intensity levels
caloriesintensities %>%
    ggplot(aes(x=AverageIntensity, y=Calories)) +
    geom_point(alpha = 0.3,  position = position_jitter()) +
    geom_smooth(method="gam", formula = y ~s(x))

**Takeaway:**
1. This plot shows a clear relationship between average intensities and calories burned. Those who maintained an average intensity of ~2.5 or higher greatly increased calories burned. Those who want to burn more calories should consider more intense workouts.

In [None]:
# Checking sedentary minutes per day vs. calories burnt per day
activitycalories %>%
    ggplot(aes(x=Calories, y=SedentaryMinutes)) +
    geom_point(alpha = 0.3,  position = position_jitter()) +
    geom_smooth(method="gam", formula = y ~s(x))

**Takeaways:**
1. The vast majority of participants burned at least ~1500 calories per day.
2. Using ~1750 calories as a minimum benchmark, the fewer sedentary minutes a participant spent per day, the more calories they burned.
3. Just by decreasing sedentary minutes without considering intensity, participants can increase calorie burning.

## Analysis Phase 2: Time Trends

### Here I want to look at some of the trends for general participation in the study.

In [None]:
# Converting the date-time column back to a Date so the plot works; passing the
# time zone used in the earlier conversion keeps each calendar day unchanged
dailyactivity$ActivityDate <-
    as.Date(dailyactivity$ActivityDate, tz = Sys.timezone())

# Average steps per calendar day across all participants
ggplot(data =
       (aggregate(dailyactivity$TotalSteps, by=list(dailyactivity$ActivityDate), mean)
    ) %>%
      drop_na(), aes(x=Group.1, y=x, fill=x)) +
        geom_col() +
        geom_smooth(method = "loess", formula = y ~ x) +
        labs(x="Day", y="Average Steps")


**Key Takeaways:**
1. We can see that engagement for the study decreased drastically on the last day. Participants may have become less interested over the 30-day period.
2. For any future studies, certain strategies to keep engagement for the full study may need to be taken in order to get the highest quality data.
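
A drop in average steps on the final day could also reflect fewer participants reporting, or a partially recorded day, rather than lower effort. Counting the participants who logged data on each day helps separate these explanations. The sketch below uses a toy log with invented values; `activity_log` and `daily` are illustrative names.

```r
# Toy daily log (values invented for illustration)
activity_log <- data.frame(
  Id = c(1, 2, 3, 1, 2, 3, 1),
  ActivityDate = as.Date(c("2016-05-10", "2016-05-10", "2016-05-10",
                           "2016-05-11", "2016-05-11", "2016-05-11",
                           "2016-05-12")),
  TotalSteps = c(8000, 6000, 7000, 9000, 5000, 7000, 1500)
)

# Participants reporting and average steps for each day
participants <- tapply(activity_log$Id, activity_log$ActivityDate,
                       function(id) length(unique(id)))
avg_steps <- tapply(activity_log$TotalSteps, activity_log$ActivityDate, mean)

daily <- data.frame(
  day = as.Date(names(participants)),
  participants = as.vector(participants),
  avg_steps = as.vector(avg_steps)
)

print(daily)
```

If the participant count falls together with the average, the last-day drop is at least partly a reporting effect.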

In [None]:
# Creating weekday column
weekdaysteps <- dailyactivity %>%
    mutate(weekday = weekdays(ActivityDate))
# Ordering by weekday in new column
weekdaysteps$weekday <- ordered(
    weekdaysteps$weekday,
    levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)
# Creating weekday mean steps per day summary
weekdaysteps <- weekdaysteps %>%
    group_by(weekday) %>%
    summarize(dailysteps = mean(TotalSteps))
# Viewing the data in table form
print(weekdaysteps)
# Plot!
ggplot(data = weekdaysteps, aes(x=weekday, y=dailysteps, fill=dailysteps)) +
    geom_col()

**Takeaways:**
1. Participants were most active on Saturdays and Tuesdays and least active on Sundays. This is most likely due to Saturdays being time off of work and Sundays being rest/family time.
2. The middle of the week seems to have a lull in steps taken. It might be beneficial to send reminders to keep exercising on the dreaded "hump days."

## Share and Act
### Presenting key findings and making recommendations.

**Key Findings:**

1. The majority of participants are lightly active for the majority of their workouts. Intensity is more important than step count for achieving an effective workout.
2. Participants are less likely to be active during the middle of the week and on Sunday. Rest days are important, so the concern is more with Wednesday through Friday.
3. Participants saw a large increase in calories burned when sedentary minutes decreased. Many participants logged long periods of sedentary time, decreasing their ability to burn calories for the day. Even just breaking up sedentary periods with short, active periods can bring large health benefits. See [this article](https://www.themuse.com/advice/walking-during-work-good-for-brain-body) from www.themuse.com.

#### Recommendations:

1. Begin using Bellabeat products to send reminders for the following items:
    - Be sure to get in at least one moderately active or higher intensity workout during the middle of the week.
    - Increase intensity during a workout for certain periods of time, tailored to that participant's health goals.
    - While in bed, stay off of social media and choose a healthier method of relaxation to encourage falling asleep quicker.
2. Use the Bellabeat products as an accountability partner: after purchase, users can input physiological, demographic, and lifestyle data, and the app/products can use it to help the user prepare an algorithm-driven health plan to increase muscle mass and decrease body fat.

#### Data Requests:

If further analysis is needed, I would request the following data points.

1. Increasing the sample size: In order to do this well and achieve a 95% confidence level with a 5% margin of error, I would need the current count of total active Bellabeat users.
2. Consistent weight data
3. Consistent heart rate data
4. Demographic data to create a joinable, master user data table
5. A data point in each table that explains which product collected the data
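
The sample size mentioned in the first request can be estimated with Cochran's formula for a proportion plus a finite population correction. This is a sketch under stated assumptions: the `required_sample_size` function is a hypothetical helper, `p = 0.5` is the usual most conservative choice, and the population of 100,000 is a placeholder because the real count of active Bellabeat users is not known here.

```r
# Required sample size for estimating a proportion
# (Cochran's formula with a finite population correction)
required_sample_size <- function(population, confidence = 0.95,
                                 margin = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - confidence) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / margin^2      # size for a very large population
  n  <- n0 / (1 + (n0 - 1) / population)  # finite population correction
  ceiling(n)
}

# Hypothetical user count; the real figure would come from Bellabeat
print(required_sample_size(population = 100000))

# Upper bound when the population is treated as unlimited
print(required_sample_size(population = Inf))
```

Once the actual number of active users is available, it replaces the placeholder in the first call.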

## **Thank you for reading my study!**