## 1. Ask 
Our **business task** is to understand how consumers use their smart devices and what their daily habits are. These insights could reveal new growth opportunities, improve **Bellabeat** products, and support data-driven decisions about future investments.


Our stakeholders are:

* **Urška Sršen**: Bellabeat’s cofounder and Chief Creative Officer
* **Sando Mur**: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
* **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.


## 2. Prepare
We use public data from **Kaggle** that explores smart device users’ daily habits: [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) (CC0: Public Domain, dataset made available through Mobius). This Kaggle dataset contains 18 tables with personal fitness information from thirty Fitbit users who consented to the submission of personal tracker data. Of these 18 tables, we will use only two:

* **dailyActivity_merged.csv** (`activity` from now on): this table puts together activity information from other tables (steps, distance, minutes of very active, moderate or light activity, etc.).
* **sleepDay_merged.csv** (`sleep` from now on): this table summarizes sleep information.



### 2.1. Data credibility

The data we are using does not fully comply with the principles of good data ("**ROCCC**": reliable, original, comprehensive, current, cited) because:

* It is **not reliable** for several reasons. First, this dataset is supposed to track 30 eligible Fitbit users, but it contains 33 different Ids. In addition, according to the website [Business of Apps](https://www.businessofapps.com/data/fitbit-statistics/), Fitbit had 31 million active users in 2020 (data provided by Fitbit). With this population, a confidence level of 95%, and a sample size of 30 people (commonly cited as the minimum sample size for the **Central Limit Theorem (CLT)** to apply), we would have a margin of error of 17.9%, which is extremely high.

* It is **not original**. This data was collected through Amazon Mechanical Turk, but we obtained it through a third party, in this case a Kaggle user.

* It is **not comprehensive**. We want to extract insights for products aimed at women, but we do not know the sex of these Fitbit users, their age, whether they have children, and so on. Also, this data was collected over the course of two months, which can bias the results: gyms are typically busier before summer or after the Christmas holidays, so activity can vary a lot depending on the month of the year.

* It is **not current**, as this data was collected in 2016, eight years ago.

* It is **cited**, although it is third-party data, generated by respondents to a distributed survey via Amazon Mechanical Turk.

With all that said, the conclusions and insights we can take from this analysis must be taken cautiously and as mere guidance.
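
The 17.9% margin of error quoted in the reliability point can be reproduced with the usual formula for a proportion. The sketch below is a minimal illustration in base R, assuming the worst-case proportion `p = 0.5` and a finite population correction; the helper name `margin_of_error` is ours and does not come from any package.

```r
# Margin of error for a proportion at a given confidence level.
# Assumes the worst-case proportion p = 0.5 and applies the finite
# population correction (negligible for a population of 31 million).
margin_of_error <- function(n, population, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)               # about 1.96 for 95%
  fpc <- sqrt((population - n) / (population - 1))   # finite population correction
  z * sqrt(p * (1 - p) / n) * fpc
}

moe <- margin_of_error(n = 30, population = 31e6)
round(moe * 100, 1)  # 17.9 (percent)
```

Because the population is so large, the correction factor is almost exactly 1 and the result depends essentially on the sample size alone.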


### 2.2. Data integrity

According to the principles of data integrity:

* The data is **valid** as it conforms to certain requirements for specific types of information.
* We can assume the data in the `activity` and `sleep` tables is **accurate**, as it is recorded automatically by the Fitbit device, although we cannot guarantee it.
* The data is **not complete** because we only have information from 24 users in the `sleep` table, compared with the 33 different users in the `activity` table.
* The data is **not consistent** as the format of date and datetime differs across the tables.



### 2.3. Download and store the data

We are storing our data in our **RStudio cloud** account in order to work with it. We first load the packages we will need (Tidyverse packages) and import the two tables. 

```{r Install and load packages, echo=TRUE, message=FALSE}
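# Install the packages only if they are missing, then load them
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("lubridate", quietly = TRUE)) install.packages("lubridate")
library(tidyverse)
library(lubridate)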
```

We load the two tables we are going to analyse:

```{r Load data, echo=TRUE, message=FALSE}
# Load data
activity <- read_csv("dataset/activity_clean.csv")
sleep <- read_csv("dataset/sleep_clean.csv")
```

## 3. Process

In this step we are going to check the data for errors, clean the data tables we will be using, and transform the data for upcoming analysis. We will document the cleaning process in a markdown file (our change log).


### 3.1. Explore the data
First, we explore the data in R and check it was properly imported:


```{r Exploring the data}
# Exploring the data
head(activity)
colnames(activity)

head(sleep)
colnames(sleep)
```

### 3.2. Cleaning the data

We will use **SQL** to detect errors and clean the data. For that, we will upload our dataset to **BigQuery**.

We had some issues uploading the tables we needed into BigQuery because of a problem with the datetime format in the `sleep` and `weight` tables. We solved it by opening the files in **Google Sheets** and changing the format to Date Time. After that, we could upload the files into BigQuery.

In our cleaning process we have found:

* There are no null values in the `activity` and `sleep` tables.
* We have cast the `SleepDay` field in the `sleep` table to `date` format, as the time component is always the same. That way, the two tables we are working with are consistent.


```{SQL, message = FALSE}
  SELECT   
    CAST(SleepDay AS date)    
  FROM `case-studies-370312.fitbit.sleep_clean`  
```
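
The same cast can be sketched in R. This is only an illustration: the example strings below assume a "month/day/year time" layout for the raw `SleepDay` values, which is our assumption about the original file.

```r
# Example timestamps in an assumed "month/day/year time" layout.
raw_sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Only the date part is parsed: characters after the matched format are ignored.
sleep_date <- as.Date(raw_sleep_day, format = "%m/%d/%Y")
format(sleep_date)  # "2016-04-12" "2016-04-13"
```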

* We only have one month's worth of data, from **12/04/2016** to **12/05/2016** (DD/MM/YYYY).

```{SQL}
SELECT 
  MAX(ActivityDate) AS Max_activity,
  MIN(ActivityDate) AS Min_activity
FROM `case-studies-370312.fitbit.activity_clean`;

SELECT
  MAX(CAST(SleepDay AS date)) AS Max_sleep,
  MIN(CAST(SleepDay AS date)) AS Min_sleep
FROM `case-studies-370312.fitbit.sleep_clean`;
```
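
For reference, the same range check can be done in R once the dates are parsed. The sketch below uses a small made-up vector of dates (not the real table) and counts the calendar days between 12 April and 12 May 2016.

```r
# Made-up dates standing in for the ActivityDate column.
activity_dates <- as.Date(c("2016-04-12", "2016-04-30", "2016-05-12"))

date_range <- range(activity_dates)                 # earliest and latest date
n_days <- as.integer(diff(date_range)) + 1          # inclusive number of calendar days
n_days  # 31
```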

* We do not have the same number of users in each table:
  * 33 different Ids in the `activity` table.
  * 24 different Ids in the `sleep` table.

```{SQL}
  SELECT   
    DISTINCT Id   
  FROM `case-studies-370312.fitbit.activity_clean`;  

  SELECT   
    DISTINCT Id  
  FROM `case-studies-370312.fitbit.sleep_clean`;
```


* **Id** fields in the two tables we are studying (`activity` and `sleep`) **match** for only 24 users (those in the `sleep` table).

```{SQL}
SELECT DISTINCT(activity.Id)
FROM `case-studies-370312.fitbit.activity_clean` AS activity
  INNER JOIN `case-studies-370312.fitbit.sleep_clean` AS sleep
  ON activity.Id = sleep.Id
```
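
The overlap between the two sets of Ids can also be checked in R with set operations. The Ids below are hypothetical placeholders, not values from the dataset.

```r
# Hypothetical Ids standing in for the Id columns of each table.
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103)

common_ids    <- intersect(activity_ids, sleep_ids)  # users present in both tables
only_activity <- setdiff(activity_ids, sleep_ids)    # users with no sleep records

length(common_ids)     # 2
length(only_activity)  # 2
```

On the real data, `unique(activity$Id)` and `unique(sleep$Id)` would take the place of the two placeholder vectors.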

* We are going to remove the observations from the `activity` table where `TotalDistance = 0`.

```{SQL}
SELECT *
FROM `case-studies-370312.fitbit.activity_clean` 
WHERE TotalDistance > 0
```
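
One practical reason for dropping these rows (our reading, not something stated above) is that the percentage calculations later in the analysis divide by `TotalDistance`. The sketch below uses made-up rows to show what a zero-distance day does to such an average.

```r
# Made-up rows: the second day has no recorded distance.
distance_toy <- data.frame(
  TotalDistance       = c(5.2, 0, 3.1),
  LightActiveDistance = c(3.9, 0, 3.1)
)

# Without the filter, 0 / 0 yields NaN and the mean becomes NaN.
pct_all <- distance_toy$LightActiveDistance * 100 / distance_toy$TotalDistance
mean(pct_all)   # NaN

# With the filter, the average is well defined.
distance_kept <- subset(distance_toy, TotalDistance > 0)
pct_kept <- distance_kept$LightActiveDistance * 100 / distance_kept$TotalDistance
mean(pct_kept)  # 87.5
```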

## 4. Analyze

We will analyze the data using R bearing in mind all the information we discovered throughout the cleaning process.

### 4.1. Understanding some summary statistics

This is the number of observations we have in each data frame:


```{r observations}
nrow(activity)
nrow(sleep)
``` 

For the `activity` dataframe:

```{r}
activity %>%  
  select(TotalSteps,
         TotalDistance,
         VeryActiveDistance,
         ModeratelyActiveDistance,
         LightActiveDistance,
         SedentaryActiveDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes) %>% 
  summary()
```

For the `sleep` dataframe:

```{r}
sleep %>%  
  select(TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
```

### 4.2. Plotting a few explorations

We are going to explore some relationships between the data.


```{r Plot 1, message=FALSE, warning=FALSE, fig.align = 'center'}
ggplot(data = activity, mapping = aes(x = TotalSteps, y = SedentaryMinutes)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Total Steps compared to Sedentary Minutes")
```

**Key takeaway**: There is **no relationship** between `TotalSteps` and `SedentaryMinutes`.

```{r Plot 2, message=FALSE, warning=FALSE, fig.align = 'center'}
ggplot(data=sleep, aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) + geom_point() + geom_smooth(mapping = aes(x=TotalTimeInBed, y=TotalMinutesAsleep), method = "loess") + labs(title = "Total Minutes Asleep compared to Total Time in Bed")
```

**Key takeaway**: As expected, we can see there is a **positive relationship** between `TotalMinutesAsleep` and `TotalTimeInBed`.
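
A visual impression like this can be backed up with a number, for example Pearson's correlation coefficient. The sketch below uses made-up values purely to show the call; on the real data the two vectors would be `sleep$TotalTimeInBed` and `sleep$TotalMinutesAsleep`.

```r
# Made-up values standing in for the two sleep columns.
time_in_bed    <- c(400, 450, 500, 480, 420)
minutes_asleep <- c(370, 410, 470, 440, 390)

# Pearson correlation: close to 1 means a strong positive linear relationship,
# close to 0 means no linear relationship.
r <- cor(time_in_bed, minutes_asleep)
r
```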



### 4.3. Trends

* The average Total Steps for an individual is 8329, equivalent to **5.986 km**.
* The average time of activity is: 
  * 23.04 minutes of Very Active activity
  * 14.79 minutes of Fairly Active activity
  * 210.3 minutes of Lightly Active activity
  * 955.2 minutes of Sedentary activity
* The average time asleep is 419.5 minutes, equivalent to about **7 hours**. That is the minimum recommended for adults according to the [American Academy of Sleep Medicine](https://sleepeducation.org/healthy-sleep/healthy-sleep-habits/).
* The average time spent in bed is 458.6 minutes, equivalent to 7 hours and 38 minutes.
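
The conversions from minutes to hours and minutes in the list above can be checked with integer division and the remainder. The helper name `to_hours_minutes` is ours, used here only for illustration.

```r
# Split a duration in minutes into whole hours and remaining minutes.
to_hours_minutes <- function(minutes) {
  sprintf("%d h %.1f min", as.integer(minutes %/% 60), minutes %% 60)
}

durations <- to_hours_minutes(c(419.5, 458.6))
durations  # "6 h 59.5 min" "7 h 38.6 min"
```

The average time asleep therefore sits just under the 7-hour mark, which is why it is reported as approximately 7 hours.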




## 5. Share

In this phase of the analysis, we have continued working with RStudio to create some visualizations in order to get to further insights.

We previously installed the `tidyverse` packages. In this section, `ggplot2` is going to be essential for creating all the visualizations.



### 5.1. Percentages of intensity in Activity

#### 5.1.1. Percentage average distance according to activity level


```{r Percentage average of distance according to activity, message=FALSE, warning=FALSE, fig.align = 'center'}

# Share of each day's distance covered at each intensity level (in %)
activity_per_distance <- mutate(activity, 
  MediumDistance = (VeryActiveDistance + ModeratelyActiveDistance),                         
  PerVeryActiveDistance = (VeryActiveDistance*100)/TotalDistance,
  PerModeratelyActiveDistance = (ModeratelyActiveDistance*100)/TotalDistance,
  PerLightActiveDistance = (LightActiveDistance*100)/TotalDistance,
  PerSedentaryActiveDistance = (SedentaryActiveDistance*100)/TotalDistance)

# Average of the daily percentages
means_activity_per_distance <- activity_per_distance %>% 
  summarise(avg_pvad = mean(PerVeryActiveDistance),
            avg_pmad = mean(PerModeratelyActiveDistance),
            avg_plad = mean(PerLightActiveDistance),
            avg_psad = mean(PerSedentaryActiveDistance))

means_per_distance_df <- data.frame(
  Type = c("Active", "Moderately", "Light", "Sedentary"), 
  tags = c("A", "M", "L", "S"), 
  means = c(means_activity_per_distance$avg_pvad, means_activity_per_distance$avg_pmad, means_activity_per_distance$avg_plad, means_activity_per_distance$avg_psad))


ggplot(means_per_distance_df, aes(x="", y=means, fill=Type)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + labs(title = "Percentage average distance according to activity level")

```

**Key takeaway**: Of the total distance covered by the participants, 71% was at *light* intensity, 18.6% at *active* intensity, 8.8% at *moderate* intensity, and only 0.04% at *sedentary* intensity. That makes sense: as soon as you start walking, you are no longer sedentary.



#### 5.1.2. Percentage average time according to activity level


```{r Percentage average of minutes according to activity, message=FALSE, warning=FALSE, fig.align = 'center'}

# Share of each day's tracked minutes spent at each intensity level (in %)
activity_per_minutes <- mutate(activity, 
    TotalMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes,
    MediumActivity = VeryActiveMinutes + FairlyActiveMinutes,
    PerVeryActiveMinutes = (VeryActiveMinutes*100)/TotalMinutes, 
    PerFairlyActiveMinutes = (FairlyActiveMinutes*100)/TotalMinutes,
    PerLightlyActiveMinutes = (LightlyActiveMinutes*100)/TotalMinutes,
    PerSedentaryMinutes = (SedentaryMinutes*100)/TotalMinutes)

# Average of the daily percentages
means_activity_per_minutes <- activity_per_minutes %>% 
  summarise(avg_pvam = mean(PerVeryActiveMinutes),
            avg_pfam = mean(PerFairlyActiveMinutes),
            avg_plam = mean(PerLightlyActiveMinutes),
            avg_psam = mean(PerSedentaryMinutes))

means_per_minutes_df <- data.frame(
  Type = c("Active", "Fairly", "Lightly", "Sedentary"), 
  tags = c("A", "F", "L", "S"), 
  means = c(means_activity_per_minutes$avg_pvam, means_activity_per_minutes$avg_pfam, means_activity_per_minutes$avg_plam, means_activity_per_minutes$avg_psam))

ggplot(means_per_minutes_df, aes(x="", y=means, fill=Type)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + labs(title = "Percentage average of minutes according to activity")

```

**Key takeaway**: On average, participants spend 78.2% of their tracked time in *sedentary* mode, followed by 18.5% in *light* mode, 2% in *active* mode, and 1.3% in *fairly active* mode.
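
Note that these figures are averages of daily percentages, which is not the same as the share of all minutes pooled together. The sketch below uses two made-up days to show how the two calculations can differ; neither is wrong, but they answer slightly different questions.

```r
# Two made-up days with different amounts of tracked time.
minutes_toy <- data.frame(
  ActiveMinutes = c(10, 90),
  TotalMinutes  = c(100, 300)
)

# Average of the daily percentages (the approach used in this section).
mean_of_daily_pct <- mean(minutes_toy$ActiveMinutes * 100 / minutes_toy$TotalMinutes)

# Percentage of all minutes pooled across days.
pooled_pct <- sum(minutes_toy$ActiveMinutes) * 100 / sum(minutes_toy$TotalMinutes)

mean_of_daily_pct  # 20
pooled_pct         # 25
```

The average of daily percentages weights every day equally, while the pooled version gives more weight to days with more tracked time.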



### 5.2. Relationship between Activity and Sleep

We will run some more explorations between these two datasets.


```{r Merging tables}
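# Join on user and day (assumes SleepDate and ActivityDate are both dates), not on Id alone
combined_data <- merge(sleep, activity, by.x = c("Id", "SleepDate"), by.y = c("Id", "ActivityDate"))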
```
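
The choice of join keys matters here. The sketch below uses made-up rows for a single user to compare a join on `Id` alone with a join on `Id` and date; the column names `SleepDate` and `ActivityDate` mirror the ones used in this report, but the values are invented.

```r
# Made-up records for one user over two days.
sleep_toy <- data.frame(
  Id = c(1, 1),
  SleepDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(400, 430)
)
activity_toy <- data.frame(
  Id = c(1, 1),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalSteps = c(9000, 7000)
)

# Joining on Id alone pairs every sleep day with every activity day.
by_id <- merge(sleep_toy, activity_toy, by = "Id")
nrow(by_id)       # 4

# Joining on Id and date keeps one row per user and day.
by_id_date <- merge(sleep_toy, activity_toy,
                    by.x = c("Id", "SleepDate"),
                    by.y = c("Id", "ActivityDate"))
nrow(by_id_date)  # 2
```

With a join on `Id` alone, a user with 30 days in each table would contribute 900 rows, so each scatter plot point would no longer represent one day for one user.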

#### 5.2.1. Steps vs. Time asleep

```{r Total Steps compared to Total Minutes Asleep, fig.align = 'center', message=FALSE, warning=FALSE}
ggplot(data = combined_data) + geom_point(mapping = aes(y = TotalMinutesAsleep, x = TotalSteps), color="red3") + geom_smooth(mapping = aes(y = TotalMinutesAsleep, x = TotalSteps)) + labs(title = "Total Steps compared to Total Minutes Asleep")
```

**Key takeaway**: There is no relationship between total steps and total minutes asleep.


#### 5.2.2. Activity level vs. Time asleep

We are going to analyze whether participants who sleep more also spend more minutes in an active mode per day.

```{r Time Asleep compared to Very or Fairly Active Minutes, echo=FALSE, message=FALSE, warning=FALSE, out.width = "50%"}
ggplot(data = combined_data) + geom_point(mapping = aes(y = TotalMinutesAsleep, x = VeryActiveMinutes), color = "deeppink1") + geom_smooth(mapping = aes(y = TotalMinutesAsleep, x = VeryActiveMinutes)) + labs(title = "Time Asleep compared to Very Active Minutes")

ggplot(data = combined_data) + geom_point(mapping = aes(y = TotalMinutesAsleep, x = FairlyActiveMinutes), color = "turquoise3") + geom_smooth(mapping = aes(y = TotalMinutesAsleep, x = FairlyActiveMinutes)) + labs(title = "Time Asleep compared to Fairly Active Minutes")

```

```{r Time Asleep compared to Lightly Active  or Sedentary Minutes, echo=FALSE, message=FALSE, warning=FALSE, out.width = "50%"}
ggplot(data = combined_data) + geom_point(mapping = aes(y = TotalMinutesAsleep, x = LightlyActiveMinutes), color = "sienna1") + geom_smooth(mapping = aes(y = TotalMinutesAsleep, x = LightlyActiveMinutes)) + labs(title = "Time Asleep compared to Lightly Active Minutes")

ggplot(data = combined_data) + geom_point(mapping = aes(y = TotalMinutesAsleep, x = SedentaryMinutes), color = "mediumorchid1") + geom_smooth(mapping = aes(y = TotalMinutesAsleep, x = SedentaryMinutes)) + labs(title = "Time Asleep compared to Sedentary Minutes")
```

**Key takeaway**: There is **no clear relationship** between the minutes spent at any activity intensity and the time asleep.



### 5.3. Sleep patterns

#### 5.3.1. Average minutes asleep depending on the day of the week


```{r Average minutes asleep depending on the day of the week, message=FALSE, warning=FALSE, fig.align='center'}
sleep_wday_df <- mutate(sleep, WeekDay = wday(SleepDate, label = TRUE, abbr = FALSE))

sleep_wday_avg <- sleep_wday_df %>% 
  group_by(WeekDay) %>% 
  summarise(AvgSleep = mean(TotalMinutesAsleep))

ggplot(data = sleep_wday_avg) + geom_col(mapping = aes(x = WeekDay, y = AvgSleep, fill=AvgSleep)) + geom_hline(yintercept = 420, color = "red") + labs(title = "Average minutes asleep depending on the day of the week")
```

**Key takeaway**: On only three days of the week (Sundays, Wednesdays and Saturdays) do participants average at least 420 minutes (7 hours) of sleep.
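
The comparison against the 420-minute line can also be made explicit in a table. The sketch below is a base R illustration on made-up rows; it derives the weekday from the date in a locale-independent way and flags the days whose average reaches the target. The names `sleep_days_toy` and `MeetsTarget` are ours.

```r
# Made-up sleep records (12 April 2016 was a Tuesday).
sleep_days_toy <- data.frame(
  SleepDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-19", "2016-04-20")),
  TotalMinutesAsleep = c(400, 440, 420, 460)
)

# as.POSIXlt()$wday returns 0 for Sunday up to 6 for Saturday.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
sleep_days_toy$WeekDay <- factor(
  day_names[as.POSIXlt(sleep_days_toy$SleepDate)$wday + 1],
  levels = day_names
)

# Average minutes asleep per weekday, flagged against the 7-hour target.
avg_by_day <- aggregate(TotalMinutesAsleep ~ WeekDay, data = sleep_days_toy, FUN = mean)
avg_by_day$MeetsTarget <- avg_by_day$TotalMinutesAsleep >= 420
avg_by_day
```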



### 5.4. Calories

```{r fig.align='center'}
calories_below_2000_df <- activity_per_minutes %>% 
  filter(Calories < 2000)


ggplot(data = calories_below_2000_df) + geom_jitter(mapping = aes(x = MediumActivity, y = Calories)) + labs(title = "Users who spend less than 2000 Calories with medium intensity activity")

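# Per-user maximum and mean of very + fairly active minutes on days below 2000 calories
calories_below_2000_df %>% 
    group_by(Id) %>% 
    summarize(max_medium_activity = max(MediumActivity), mean_medium_activity = mean(MediumActivity))

# Number of distinct users with at least one such day
n_distinct(calories_below_2000_df$Id)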

```

**Key takeaway**: 29 out of 33 different Ids (almost 88% of participants) have at least one day on which they burned fewer than 2000 calories. On those days, no participant averaged more than about 80 minutes of very active or fairly active activity.
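
Because the filter works on daily rows, this count includes any user with at least one day under 2000 calories. The sketch below uses made-up values to contrast that with a stricter per-user reading, where a user's average daily calories must fall under the threshold.

```r
# Made-up daily calories for three users, two days each.
calories_toy <- data.frame(
  Id       = c(1, 1, 2, 2, 3, 3),
  Calories = c(1800, 2400, 1900, 1950, 2500, 2600)
)

# Users with at least one day below 2000 calories (the row-level filter).
users_any_day <- length(unique(calories_toy$Id[calories_toy$Calories < 2000]))

# Users whose average daily calories are below 2000.
mean_by_user <- tapply(calories_toy$Calories, calories_toy$Id, mean)
users_on_average <- sum(mean_by_user < 2000)

users_any_day     # 2
users_on_average  # 1
```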


## 6. Act

### 6.1. Trends and insights identified

* We expected that people who exercise more would also sleep more, but we cannot confirm this hypothesis. In this sample, time asleep does not appear to depend on how much participants exercise.
* Participants spent 78.2% of their time in *sedentary* mode, followed by 18.5% in *light* mode.
* 71% of the distance traveled was at *light* intensity.
* On only three days of the week (Sundays, Wednesdays and Saturdays) did participants average at least 420 minutes (7 hours) of sleep.
* 88% of participants had at least one day on which they burned fewer than 2000 calories.


### 6.2. Recommendations

As mentioned before, the insights from this analysis must be treated with caution. Our recommendations to support Bellabeat's growth are:

* **Bellabeat app** should send notifications when users spend too much time in sedentary mode, when they do not reach their goal, or when they have not burned a certain number of calories.

* **Bellabeat Leaf** should be able to track sleep time automatically.

* **Bellabeat** should track the time of day at which users exercise, since this can influence sleep quality: high-intensity exercise raises adrenaline, which can make it harder to switch off when bedtime arrives.

* **Bellabeat** should run a new analysis on its own data to get more accurate insights and make the most of its efforts.
