# Google's Data Analytics Capstone Project - Track 1

## Introduction

In this project, we'll act as a junior data analyst on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. We have been asked to analyze smart device data to gain insight into how customers use their devices and to help guide Bellabeat's marketing strategy.

For this project, we'll refrain from using Excel or Google Sheets, since the objective is to showcase SQL, R and Tableau (and also because we don't have an Excel license on the PC we're working with). We'll only open files in a spreadsheet view to peek at their contents and structure.

## Ask

We'll begin by defining our business task:

`Identify trends in smart device usage and use these trends to influence Bellabeat marketing strategy`

We can then break this down into the following questions:
* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

## Prepare

We are referred to the following dataset:
<a href="https://www.kaggle.com/datasets/arashnic/fitbit" target="_blank">Fitbit fitness tracker</a> (public domain, available through Kaggle).
It contains data from Fitbit users, including physical activity, heart rate, weight and sleep monitoring. The data is split across 18 CSV files that report it at different levels of granularity, such as daily, hourly or by minute.

We begin by importing these files into BigQuery. After some exploratory analysis, we find several limitations in this dataset:
* Neither gender nor age is disclosed for users
* With only 33 users and 1 month of data, the sample is too small to support reliable conclusions
* Not all segments (sleep, activity, weight) have the same number of users contributing their data
* Some files do not state the units of their measurements (distance, intensity, fat)

![Total users in daily activity](img/users_daily_activity.png) 
    Total users in activity data
![Total users in sleep activity](img/users_sleep_activity.png)
    Total users in sleep data
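
The user counts shown above can be reproduced with a query along these lines. This is only a minimal sketch: the project, dataset, table and column names (`my_project.fitbit.daily_activity`, `my_project.fitbit.sleep_day`, `id`) are placeholders we assume for illustration, not necessarily the names used in our BigQuery project.

```sql
-- Hypothetical table and column names; adjust them to the imported tables.
-- Counts how many distinct users contributed data to each table.
SELECT
  'daily_activity' AS source_table,
  COUNT(DISTINCT id) AS total_users
FROM `my_project.fitbit.daily_activity`

UNION ALL

SELECT
  'sleep_day' AS source_table,
  COUNT(DISTINCT id) AS total_users
FROM `my_project.fitbit.sleep_day`;
```

Putting both counts in one result makes it easy to see at a glance which segments have fewer contributing users.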

To address these limitations we can search for more complete datasets or work under some assumptions. While looking for other datasets we come across the <a href="https://zenodo.org/record/53894#.X9oeh3Uzaao" target="_blank">original source</a> for this data, which contains an additional month of information. We also note that the researchers used a company called Fitabase to source this data. On their website we find a <a href="https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf" target="_blank">data dictionary</a> that helps clarify some of our questions about the data, such as which units correspond to certain columns.

|Placeholder table to outline all data sources used   |   |
|---|---|
|   |   |

## Process

### Parsing dates

When we first open the CSV files in the dataset, the date variables look like timestamps. However, when we try to upload them to BigQuery, they can't be parsed as they are. According to the source, this is because the format differs between devices and with users' personal preferences. We therefore import them as strings, then parse and split them in a separate SQL query using a temp table and the functions `PARSE_DATETIME(format_string, datetime_string)` and `EXTRACT(part FROM datetime_expression)`.

![Parsing and extracting query for heartrate data](img/heartrate_parsed_query.png)
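
In text form, the approach could look roughly like the sketch below. The table name, the column names (`id`, `time_string`, `heart_rate`) and the format string `'%m/%d/%Y %I:%M:%S %p'` are assumptions for illustration only; the format string has to match how the dates actually appear in the imported strings.

```sql
-- Hypothetical names: the raw table stores the timestamp as a STRING column.
-- Step 1: parse the string into a DATETIME and keep the result in a temp table.
CREATE TEMP TABLE heartrate_parsed AS
SELECT
  id,
  -- Assumed format, e.g. '4/12/2016 7:21:00 AM'
  PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', time_string) AS reading_datetime,
  heart_rate
FROM `my_project.fitbit.heartrate_raw`;

-- Step 2: split the parsed DATETIME into the parts needed for analysis.
SELECT
  id,
  reading_datetime,
  EXTRACT(DATE FROM reading_datetime) AS reading_date,
  EXTRACT(HOUR FROM reading_datetime) AS reading_hour,
  EXTRACT(DAYOFWEEK FROM reading_datetime) AS reading_weekday,  -- 1 = Sunday
  heart_rate
FROM heartrate_parsed;
```

Because this uses `CREATE TEMP TABLE`, both statements need to run together as one multi-statement query.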

Even with different datetime formats in the same dataset, these functions parse them correctly. Note that we discovered the second part of the dataset after importing most of the files from the first part. This isn't a problem, since we can append the two parts with `UNION ALL` in each date-parsing query. The temp table can also be skipped by parsing the dates directly in a single query. Some testing is needed to determine which approach is faster, but the latter produces a more readable query.

![Parsing and appending query for heartrate data](img/heartrate_parsed_union_query.png)
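
A text sketch of the single-query variant follows. As before, the table names, column names and format string are assumed placeholders rather than the exact ones in our project, and a `WITH` clause stands in here for the temp table.

```sql
-- Hypothetical names: one raw table per part of the dataset.
-- UNION ALL needs the same number of columns, with compatible types, on both sides.
WITH heartrate_all AS (
  SELECT id, time_string, heart_rate
  FROM `my_project.fitbit.heartrate_part1`

  UNION ALL

  SELECT id, time_string, heart_rate
  FROM `my_project.fitbit.heartrate_part2`
),

heartrate_parsed AS (
  SELECT
    id,
    -- Assumed format, e.g. '4/12/2016 7:21:00 AM'
    PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', time_string) AS reading_datetime,
    heart_rate
  FROM heartrate_all
)

SELECT
  id,
  reading_datetime,
  EXTRACT(DATE FROM reading_datetime) AS reading_date,
  EXTRACT(HOUR FROM reading_datetime) AS reading_hour,
  heart_rate
FROM heartrate_parsed;
```

`UNION ALL` keeps every row from both inputs, including duplicates, so if the two parts overlap in time the overlapping rows would need to be removed in a later step.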

