## Setting up my environment

All datasets were downloaded from Kaggle, as described later in this report. They were hosted locally on my machine via a PostgreSQL server. The following R code set up my environment to run the queries. The SQL shown later in this report is formatted as code for readability only; it is not executable from this notebook at this time.

In [1]:
# Install required packages if they are missing (runs in R)
if (!requireNamespace("DBI", quietly = TRUE)) install.packages("DBI", repos = "https://cloud.r-project.org")
if (!requireNamespace("RPostgres", quietly = TRUE)) install.packages("RPostgres", repos = "https://cloud.r-project.org")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr", repos = "https://cloud.r-project.org")
# glue is loaded with library(glue) in a later cell, so install it here as well
if (!requireNamespace("glue", quietly = TRUE)) install.packages("glue", repos = "https://cloud.r-project.org")
if (!requireNamespace("dotenv", quietly = TRUE)) install.packages("dotenv", repos = "https://cloud.r-project.org")

In [2]:
# Load the package
library(dotenv)

# Load a .env file from the notebook working directory (or provide a full path)
# Returns TRUE on success
loaded <- tryCatch({
  dotenv::load_dot_env(file = ".env")
}, error = function(e) {
  message("Failed to load .env: ", e$message)
  FALSE
})

In [3]:
# Load libraries
library(DBI)
library(RPostgres)
library(dplyr)
library(glue)

# Read values from environment
pg_host <- Sys.getenv("PGHOST", unset = "localhost")
pg_port <- as.integer(Sys.getenv("PGPORT", unset = "5432"))
pg_user <- Sys.getenv("PGUSER", unset = "")
pg_password <- Sys.getenv("PGPASSWORD", unset = "")
pg_db <- Sys.getenv("PGDATABASE", unset = "")

# Show current environment values (does NOT print secrets)
cat("PGHOST:   ", Sys.getenv("PGHOST",   unset = "<not set>"), "\n")
cat("PGPORT:   ", Sys.getenv("PGPORT",   unset = "<not set>"), "\n")
cat("PGDATABASE:", Sys.getenv("PGDATABASE", unset = "<not set>"), "\n")
cat("PGUSER:   ", Sys.getenv("PGUSER",   unset = "<not set>"), "\n")
if (nzchar(Sys.getenv("PGPASSWORD"))) {
  cat("PGPASSWORD: (loaded, not printed)\n")
} else {
  cat("PGPASSWORD: (not set)\n")
}



PGHOST:    localhost 
PGPORT:    5435 
PGDATABASE: erindamm 
PGUSER:    erindamm 
PGPASSWORD: (loaded, not printed)


In [4]:
# Connect to PostgreSQL
con <- dbConnect(
  RPostgres::Postgres(),
  host = pg_host,
  port = pg_port,
  dbname = pg_db,
  user = pg_user,
  password = pg_password
)

cat("Connection established.\n")

# List tables (shows first few if many)
tables <- dbListTables(con)
cat("Tables (first 20):\n")
print(head(tables, 20))


Connection established.
Tables (first 20):
[1] "health_fitness" "survey_605"    


## The Business Problem

[Bellabeat](https://bellabeat.com) is looking to analyze existing data on smart device usage to understand how potential users are tracking different health-related metrics in their daily lives. These trends could be used to identify features that current Bellabeat customers may not know our devices have. These trends could also influence marketing strategy by identifying other Bellabeat products to recommend to existing customers and which features to highlight in future marketing campaigns.

In this analysis we are looking to identify the following trends:

  * frequency of fitness wearable use
  * frequency of tracking fitness activity with wearable technology
  * impact of wearable fitness trackers on purchasing decisions
  * feature used least consistently
  * when users are most/least active (to schedule marketing campaigns)
  * attitudes of users towards their wearable technology



## The Data
  
For this analysis the marketing team started with the [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) (CC0: Public Domain, dataset made available through [Mobius](https://www.kaggle.com/arashnic)). This dataset contains personal fitness information from thirty Fitbit users. These users consented to the submission of their data, including output for physical activity, heart rate, and sleep monitoring. It also includes information about daily physical activity, including steps taken, distance traveled, and heart rate, that can be used to explore consumer usage habits. This dataset has limitations: it does not explore all the features available with Bellabeat devices, and it does not explore any device types other than Fitbit. It also doesn't explore general attitudes towards fitness wearables that could be useful for marketing purposes. For this reason, extra data was needed.

Two additional sets of data were sourced to be evaluated as potential supplements to the existing data. 

The [FitLife: Health & Fitness Tracking](https://www.kaggle.com/datasets/jijagallery/fitlife-health-and-fitness-tracking-dataset) (CC0: Public Domain, dataset made available through [Jija Taheri](https://www.kaggle.com/jijagallery)) dataset was considered and evaluated as a possible data source, but as it is synthetic data, it wasn't captured from actual users. Rather, it is a dataset created specifically for the purposes of data exploration and community learning. This dataset didn't contain any data trends that could be applicable to the business question we are attempting to answer in this case study, but it was a good exercise in cleaning and analyzing data using various tools, including BigQuery, spreadsheets, and Tableau Public. My changelog for this dataset can be found [here](https://672c99c272ad43a8b8b96a3eec6c855f.app.posit.cloud/file_show?path=%2Fcloud%2Fproject%2Fbellabeat_case_study_fitlife_changelog.nb.html).

The [Fitness Consumer Survey Data](https://www.kaggle.com/datasets/harshitaaswani/fitness-consumer-survey-data) dataset (CC0: Public Domain, dataset made available through [Harshita Aswani](https://www.kaggle.com/harshitaaswani)) was also considered and evaluated as a possible data source. This dataset contains survey responses from a variety of respondents about their attitudes and experiences using a fitness wearable. The data was collected from an online survey and all respondents consented to their anonymous responses being shared. The data is primary in nature and appears credible and relatively current. This dataset will be useful in providing context to the Bellabeat executives on broader views of fitness wearables. My changelog for this dataset can be found [here](#fitness_consumer_survey).


## The Analysis

To discuss trends that can apply to Bellabeat customers, we must first understand who those customers are. Using the Fitness Consumer Survey Data, we can see that the respondents in this study range in age from under 18 to 64 years old, with most falling in the 18-24 and 25-34 ranges. To better represent our target market segment, we then filter our data to exclude those who self-identified as "Male." This shows that the 18-24 and 25-34 categories still hold as the top two, with the 45-54 category coming in with the third most respondents.
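
The age breakdown described above could be produced with a query along these lines. This is only a sketch: it assumes the `survey_605` table listed in the connection output holds the survey responses, and the column names `age_group` and `gender` are hypothetical stand-ins for whatever the survey columns are actually called.

```sql
-- Hypothetical column names: age_group, gender (adjust to the real survey schema).
SELECT
  age_group,
  COUNT(*) AS respondent_count
FROM survey_605
-- IS DISTINCT FROM keeps rows where gender is NULL, unlike <> 'Male'.
WHERE gender IS DISTINCT FROM 'Male'
GROUP BY age_group
ORDER BY respondent_count DESC;
```

Dropping the `WHERE` clause would give the unfiltered age distribution for comparison.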

Our customers span a wide range of occupations and education levels. These images will be useful to refer back to when tailoring campaigns to specific slices of the market, though they provide more general categorizations rather than specific occupational data.

When examining how engaged users feel with their fitness wearable, we see interesting trends when we filter our graph by age group. The 18-24-year-olds span from negative to very positive responses regarding engagement, while other age groups tend to answer in neighboring pairs like neutral and somewhat engaged, or somewhat and very engaged.

On the next slide we start to examine the different ways users engage with their fitness wearables. Most customers reported a positive impact on their fitness routines, an acceleration in the achievement of their fitness goals, and an increase in both enjoying exercising and maintaining motivation to exercise.

Continuing our exploration of how users engage with their wearables, we see that users give generally favorable responses in seeing an improvement in sleep patterns, in feeling connected to the fitness community through their wearables, and in impact on both their overall health and well-being.

Now that we know a little bit more about our users and their opinions, we can take a quick peek at some anonymized fitness tracker data. These four charts look overwhelming at first because they show data for every user, but if you scroll through the IDs in the legend, you will see the data for each category for a single user at a time. Please note that not every user tracked sleep data, so some of those charts will appear blank.

So, now we know a bit about the users, and a bit about how they are using products like ours. Let's take a look at their decision-making trends. As you can see, users reported that using a wearable fitness device influenced them to make other health and fitness changes, such as dietary changes, increasing their exercise activity, joining a gym or fitness class, and purchasing other fitness-related products. This is helpful because we can use the information to help us decide which features to highlight in ad campaigns and help target possible repeat customers for new product launches.

Our last slide just repeats the fourth graph from the previous slide because it is important. The majority of respondents reported that using a fitness wearable influenced their decision to purchase other fitness-related products. This shows that there is a potential repeat purchase audience available, which could lead to possible future products and potential partnerships for limited edition jewelry accessories, etc. 


## Fitbase Changelog

This dataset is broken up into two one-month chunks. The first month contains 11 tables covering metrics like steps, calories, and sleep broken down over different granularities of time. The second month contains continuations of these same metrics, but adds 7 additional tables to show daily composites of the metrics as well as wide formats to complement the narrow ones provided where data is broken down to the minute.

The tables were loaded onto my local machine's PostgreSQL server using the following queries to create the tables, plus the data import wizard to fill them in. For the 11 tables that were repeated across both months, both files were combined into the same table.

```sql
CREATE TABLE fitbase_daily_activity_merged (
    participant_id BIGINT,
    activity_date DATE,
    total_steps INTEGER,
    total_distance NUMERIC,
    tracker_distance NUMERIC,
    logged_activities_distance NUMERIC,
    very_active_distance NUMERIC,
    moderately_active_distance NUMERIC,
    light_active_distance NUMERIC,
    sedentary_active_distance NUMERIC,
    very_active_minutes INTEGER,
    fairly_active_minutes INTEGER,
    lightly_active_minutes INTEGER,
    sedentary_minutes INTEGER,
    calories INTEGER)
```
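
As an alternative to the import wizard, both monthly files could be appended to the table above with PostgreSQL's `COPY` command. The sketch below is not what was run for this report. The file paths are placeholders, and it assumes the CSV columns appear in the same order as the table definition and that the server process can read the files.

```sql
-- Placeholder paths; replace with the actual locations of the two monthly CSV files.
COPY fitbase_daily_activity_merged
FROM '/path/to/month_1/daily_activity.csv'
WITH (FORMAT csv, HEADER true);

-- A second COPY appends rows, so both months end up in the same table.
COPY fitbase_daily_activity_merged
FROM '/path/to/month_2/daily_activity.csv'
WITH (FORMAT csv, HEADER true);
```

When the files live on the client machine rather than the server, psql's `\copy` meta-command accepts the same options and reads the file locally.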

```sql
CREATE TABLE fitbase_daily_calories_merged (
    participant_id BIGINT,
    activity_day DATE,
    calories INTEGER)
```

```sql
CREATE TABLE fitbase_daily_intensities_merged (
    participant_id BIGINT,
    activity_day DATE,
    sedentary_minutes INTEGER,
    lightly_active_minutes INTEGER,
    fairly_active_minutes INTEGER,
    very_active_minutes INTEGER,
    sedentary_active_distance NUMERIC,
    light_active_distance NUMERIC,
    moderately_active_distance NUMERIC,
    very_active_distance NUMERIC)
```

```sql
CREATE TABLE fitbase_daily_sleep_merged (
    participant_id BIGINT,
    sleep_day TIMESTAMP,
    total_sleep_records INTEGER,
    total_minutes_asleep INTEGER,
    total_time_in_bed INTEGER)
```

```sql
CREATE TABLE fitbase_daily_steps_merged (
    participant_id BIGINT,
    activity_day DATE,
    step_total INTEGER)
```

```sql
CREATE TABLE fitbase_heartrate_seconds_merged (
    participant_id BIGINT,
    log_time TIMESTAMP,
    heartrate_value INTEGER)
```

```sql
CREATE TABLE fitbase_hourly_calories_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    calories INTEGER)
```

```sql
CREATE TABLE fitbase_hourly_intensities_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    total_intensity INTEGER,
    average_intensity NUMERIC)
```

```sql
CREATE TABLE fitbase_hourly_steps_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    step_total INTEGER)
```

```sql
CREATE TABLE fitbase_minute_calories_narrow_merged (
    participant_id BIGINT,
    activity_minute TIMESTAMP,
    calories NUMERIC)
```

```sql
CREATE TABLE fitbase_minute_calories_wide_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    calories00 NUMERIC,
    calories01 NUMERIC,
    calories02 NUMERIC,
    calories03 NUMERIC,
    calories04 NUMERIC,
    calories05 NUMERIC,
    calories06 NUMERIC,
    calories07 NUMERIC,
    calories08 NUMERIC,
    calories09 NUMERIC,
    calories10 NUMERIC,
    calories11 NUMERIC,
    calories12 NUMERIC,
    calories13 NUMERIC,
    calories14 NUMERIC,
    calories15 NUMERIC,
    calories16 NUMERIC,
    calories17 NUMERIC,
    calories18 NUMERIC,
    calories19 NUMERIC,
    calories20 NUMERIC,
    calories21 NUMERIC,
    calories22 NUMERIC,
    calories23 NUMERIC,
    calories24 NUMERIC,
    calories25 NUMERIC,
    calories26 NUMERIC,
    calories27 NUMERIC,
    calories28 NUMERIC,
    calories29 NUMERIC,
    calories30 NUMERIC,
    calories31 NUMERIC,
    calories32 NUMERIC,
    calories33 NUMERIC,
    calories34 NUMERIC,
    calories35 NUMERIC,
    calories36 NUMERIC,
    calories37 NUMERIC,
    calories38 NUMERIC,
    calories39 NUMERIC,
    calories40 NUMERIC,
    calories41 NUMERIC,
    calories42 NUMERIC,
    calories43 NUMERIC,
    calories44 NUMERIC,
    calories45 NUMERIC,
    calories46 NUMERIC,
    calories47 NUMERIC,
    calories48 NUMERIC,
    calories49 NUMERIC,
    calories50 NUMERIC,
    calories51 NUMERIC,
    calories52 NUMERIC,
    calories53 NUMERIC,
    calories54 NUMERIC,
    calories55 NUMERIC,
    calories56 NUMERIC,
    calories57 NUMERIC,
    calories58 NUMERIC,
    calories59 NUMERIC)
```

```sql
CREATE TABLE fitbase_minute_intensities_narrow_merged (
    participant_id BIGINT,
    activity_minute TIMESTAMP,
    intensities INTEGER)
```

```sql
CREATE TABLE fitbase_minute_intensities_wide_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    intensity00 NUMERIC,
    intensity01 NUMERIC,
    intensity02 NUMERIC,
    intensity03 NUMERIC,
    intensity04 NUMERIC,
    intensity05 NUMERIC,
    intensity06 NUMERIC,
    intensity07 NUMERIC,
    intensity08 NUMERIC,
    intensity09 NUMERIC,
    intensity10 NUMERIC,
    intensity11 NUMERIC,
    intensity12 NUMERIC,
    intensity13 NUMERIC,
    intensity14 NUMERIC,
    intensity15 NUMERIC,
    intensity16 NUMERIC,
    intensity17 NUMERIC,
    intensity18 NUMERIC,
    intensity19 NUMERIC,
    intensity20 NUMERIC,
    intensity21 NUMERIC,
    intensity22 NUMERIC,
    intensity23 NUMERIC,
    intensity24 NUMERIC,
    intensity25 NUMERIC,
    intensity26 NUMERIC,
    intensity27 NUMERIC,
    intensity28 NUMERIC,
    intensity29 NUMERIC,
    intensity30 NUMERIC,
    intensity31 NUMERIC,
    intensity32 NUMERIC,
    intensity33 NUMERIC,
    intensity34 NUMERIC,
    intensity35 NUMERIC,
    intensity36 NUMERIC,
    intensity37 NUMERIC,
    intensity38 NUMERIC,
    intensity39 NUMERIC,
    intensity40 NUMERIC,
    intensity41 NUMERIC,
    intensity42 NUMERIC,
    intensity43 NUMERIC,
    intensity44 NUMERIC,
    intensity45 NUMERIC,
    intensity46 NUMERIC,
    intensity47 NUMERIC,
    intensity48 NUMERIC,
    intensity49 NUMERIC,
    intensity50 NUMERIC,
    intensity51 NUMERIC,
    intensity52 NUMERIC,
    intensity53 NUMERIC,
    intensity54 NUMERIC,
    intensity55 NUMERIC,
    intensity56 NUMERIC,
    intensity57 NUMERIC,
    intensity58 NUMERIC,
    intensity59 NUMERIC)
```

```sql
CREATE TABLE fitbase_minute_mets_narrow_merged (
    participant_id BIGINT,
    activity_minute TIMESTAMP,
    mets INTEGER)
```

```sql
CREATE TABLE fitbase_minute_sleep_merged (
    participant_id BIGINT,
    activity_date TIMESTAMP,
    sleep_value INTEGER,
    log_id BIGINT)
```

```sql
CREATE TABLE fitbase_minute_steps_narrow_merged (
    participant_id BIGINT,
    activity_minute TIMESTAMP,
    steps INTEGER)
```

```sql
CREATE TABLE fitbase_minute_steps_wide_merged (
    participant_id BIGINT,
    activity_hour TIMESTAMP,
    steps00 NUMERIC,
    steps01 NUMERIC,
    steps02 NUMERIC,
    steps03 NUMERIC,
    steps04 NUMERIC,
    steps05 NUMERIC,
    steps06 NUMERIC,
    steps07 NUMERIC,
    steps08 NUMERIC,
    steps09 NUMERIC,
    steps10 NUMERIC,
    steps11 NUMERIC,
    steps12 NUMERIC,
    steps13 NUMERIC,
    steps14 NUMERIC,
    steps15 NUMERIC,
    steps16 NUMERIC,
    steps17 NUMERIC,
    steps18 NUMERIC,
    steps19 NUMERIC,
    steps20 NUMERIC,
    steps21 NUMERIC,
    steps22 NUMERIC,
    steps23 NUMERIC,
    steps24 NUMERIC,
    steps25 NUMERIC,
    steps26 NUMERIC,
    steps27 NUMERIC,
    steps28 NUMERIC,
    steps29 NUMERIC,
    steps30 NUMERIC,
    steps31 NUMERIC,
    steps32 NUMERIC,
    steps33 NUMERIC,
    steps34 NUMERIC,
    steps35 NUMERIC,
    steps36 NUMERIC,
    steps37 NUMERIC,
    steps38 NUMERIC,
    steps39 NUMERIC,
    steps40 NUMERIC,
    steps41 NUMERIC,
    steps42 NUMERIC,
    steps43 NUMERIC,
    steps44 NUMERIC,
    steps45 NUMERIC,
    steps46 NUMERIC,
    steps47 NUMERIC,
    steps48 NUMERIC,
    steps49 NUMERIC,
    steps50 NUMERIC,
    steps51 NUMERIC,
    steps52 NUMERIC,
    steps53 NUMERIC,
    steps54 NUMERIC,
    steps55 NUMERIC,
    steps56 NUMERIC,
    steps57 NUMERIC,
    steps58 NUMERIC,
    steps59 NUMERIC)
```
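
To illustrate how the wide layout relates to the narrow one, the following sketch pivots minute-level rows into one row per participant and hour. It assumes the narrow step data lives in `fitbase_minute_steps_narrow_merged` with one row per participant per minute, and it spells out only the first three minute columns; the remaining columns would follow the same pattern.

```sql
SELECT
  participant_id,
  date_trunc('hour', activity_minute) AS activity_hour,
  -- One FILTER clause per minute of the hour; only three are shown here.
  MAX(steps) FILTER (WHERE EXTRACT(MINUTE FROM activity_minute) = 0) AS steps00,
  MAX(steps) FILTER (WHERE EXTRACT(MINUTE FROM activity_minute) = 1) AS steps01,
  MAX(steps) FILTER (WHERE EXTRACT(MINUTE FROM activity_minute) = 2) AS steps02
FROM fitbase_minute_steps_narrow_merged
GROUP BY participant_id, date_trunc('hour', activity_minute)
ORDER BY participant_id, activity_hour;
```

Comparing this output with the imported wide table is one way to confirm that the two formats carry the same values.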

```sql
CREATE TABLE fitbase_weight_log_info_merged (
    participant_id BIGINT,
    log_date TIMESTAMP,
    weight_kg NUMERIC,
    weight_pounds NUMERIC,
    fat NUMERIC,
    bmi NUMERIC,
    is_manual_report TEXT,
    log_id BIGINT)
```

## 1. Investigating NULL Values

The following queries were run to investigate any possible NULL values in the dataset.

```sql
SELECT *
FROM fitbase_daily_activity_merged
WHERE participant_id IS NULL
    OR activity_date IS NULL
    OR total_steps IS NULL
    OR total_distance IS NULL
    OR tracker_distance IS NULL
    OR logged_activities_distance IS NULL
    OR very_active_distance IS NULL
    OR moderately_active_distance IS NULL
    OR light_active_distance IS NULL
    OR sedentary_active_distance IS NULL
    OR very_active_minutes IS NULL
    OR fairly_active_minutes IS NULL
    OR lightly_active_minutes IS NULL
    OR sedentary_minutes IS NULL
    OR calories IS NULL;
```

```sql
SELECT *
FROM fitbase_daily_calories_merged
WHERE participant_id IS NULL
    OR activity_day IS NULL
    OR calories IS NULL;
```

```sql
SELECT *
FROM fitbase_daily_intensities_merged
WHERE participant_id IS NULL
    OR activity_day IS NULL
    OR sedentary_minutes IS NULL
    OR lightly_active_minutes IS NULL
    OR fairly_active_minutes IS NULL
    OR very_active_minutes IS NULL
    OR sedentary_active_distance IS NULL
    OR light_active_distance IS NULL
    OR moderately_active_distance IS NULL
    OR very_active_distance IS NULL;
```

```sql
SELECT *
FROM fitbase_daily_sleep_merged
WHERE participant_id IS NULL
    OR sleep_day IS NULL
    OR total_sleep_records IS NULL
    OR total_minutes_asleep IS NULL
    OR total_time_in_bed IS NULL;
```

```sql
SELECT *
FROM fitbase_daily_steps_merged
WHERE participant_id IS NULL
    OR activity_day IS NULL
    OR step_total IS NULL;
```

```sql
SELECT *
FROM fitbase_heartrate_seconds_merged
WHERE participant_id IS NULL
    OR log_time IS NULL
    OR heartrate_value IS NULL;
```

```sql
SELECT *
FROM fitbase_hourly_calories_merged
WHERE participant_id IS NULL
    OR activity_hour IS NULL
    OR calories IS NULL;
```

```sql
SELECT *
FROM fitbase_hourly_intensities_merged
WHERE participant_id IS NULL
    OR activity_hour IS NULL
    OR total_intensity IS NULL
    OR average_intensity IS NULL;
```

```sql
SELECT *
FROM fitbase_hourly_steps_merged
WHERE participant_id IS NULL
    OR activity_hour IS NULL
    OR step_total IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_calories_narrow_merged
WHERE participant_id IS NULL
    OR activity_minute IS NULL
    OR calories IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_calories_wide_merged
WHERE 
	   participant_id IS NULL
	OR activity_hour IS NULL
	OR calories00 IS NULL
    OR calories01 IS NULL
    OR calories02 IS NULL
    OR calories03 IS NULL
    OR calories04 IS NULL
    OR calories05 IS NULL
    OR calories06 IS NULL
    OR calories07 IS NULL
    OR calories08 IS NULL
    OR calories09 IS NULL
    OR calories10 IS NULL
    OR calories11 IS NULL
    OR calories12 IS NULL
    OR calories13 IS NULL
    OR calories14 IS NULL
    OR calories15 IS NULL
    OR calories16 IS NULL
    OR calories17 IS NULL
    OR calories18 IS NULL
    OR calories19 IS NULL
    OR calories20 IS NULL
    OR calories21 IS NULL
    OR calories22 IS NULL
    OR calories23 IS NULL
    OR calories24 IS NULL
    OR calories25 IS NULL
    OR calories26 IS NULL
    OR calories27 IS NULL
    OR calories28 IS NULL
    OR calories29 IS NULL
    OR calories30 IS NULL
    OR calories31 IS NULL
    OR calories32 IS NULL
    OR calories33 IS NULL
    OR calories34 IS NULL
    OR calories35 IS NULL
    OR calories36 IS NULL
    OR calories37 IS NULL
    OR calories38 IS NULL
    OR calories39 IS NULL
    OR calories40 IS NULL
    OR calories41 IS NULL
    OR calories42 IS NULL
    OR calories43 IS NULL
    OR calories44 IS NULL
    OR calories45 IS NULL
    OR calories46 IS NULL
    OR calories47 IS NULL
    OR calories48 IS NULL
    OR calories49 IS NULL
    OR calories50 IS NULL
    OR calories51 IS NULL
    OR calories52 IS NULL
    OR calories53 IS NULL
    OR calories54 IS NULL
    OR calories55 IS NULL
    OR calories56 IS NULL
    OR calories57 IS NULL
    OR calories58 IS NULL
    OR calories59 IS NULL;
```
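
For the wide tables, listing all 62 columns by hand is easy to get wrong. PostgreSQL also allows a NULL test on the whole row: `t IS NOT NULL` is true only when every column in the row is non-NULL, so negating it returns the rows with at least one NULL. This sketch is an alternative form of the check above, not the query that was run.

```sql
-- Returns every row that has a NULL in at least one column.
SELECT *
FROM fitbase_minute_calories_wide_merged AS t
WHERE NOT (t IS NOT NULL);
```

Note that `t IS NULL` is not equivalent here, because it is true only when every column in the row is NULL.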

```sql
SELECT *
FROM fitbase_minute_intensities_narrow_merged
WHERE 
	   participant_id IS NULL
	OR activity_minute IS NULL
	OR intensities IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_intensities_wide_merged
WHERE 
       participant_id IS NULL
    OR activity_hour IS NULL
    OR intensity00 IS NULL
    OR intensity01 IS NULL
    OR intensity02 IS NULL
    OR intensity03 IS NULL
    OR intensity04 IS NULL
    OR intensity05 IS NULL
    OR intensity06 IS NULL
    OR intensity07 IS NULL
    OR intensity08 IS NULL
    OR intensity09 IS NULL
    OR intensity10 IS NULL
    OR intensity11 IS NULL
    OR intensity12 IS NULL
    OR intensity13 IS NULL
    OR intensity14 IS NULL
    OR intensity15 IS NULL
    OR intensity16 IS NULL
    OR intensity17 IS NULL
    OR intensity18 IS NULL
    OR intensity19 IS NULL
    OR intensity20 IS NULL
    OR intensity21 IS NULL
    OR intensity22 IS NULL
    OR intensity23 IS NULL
    OR intensity24 IS NULL
    OR intensity25 IS NULL
    OR intensity26 IS NULL
    OR intensity27 IS NULL
    OR intensity28 IS NULL
    OR intensity29 IS NULL
    OR intensity30 IS NULL
    OR intensity31 IS NULL
    OR intensity32 IS NULL
    OR intensity33 IS NULL
    OR intensity34 IS NULL
    OR intensity35 IS NULL
    OR intensity36 IS NULL
    OR intensity37 IS NULL
    OR intensity38 IS NULL
    OR intensity39 IS NULL
    OR intensity40 IS NULL
    OR intensity41 IS NULL
    OR intensity42 IS NULL
    OR intensity43 IS NULL
    OR intensity44 IS NULL
    OR intensity45 IS NULL
    OR intensity46 IS NULL
    OR intensity47 IS NULL
    OR intensity48 IS NULL
    OR intensity49 IS NULL
    OR intensity50 IS NULL
    OR intensity51 IS NULL
    OR intensity52 IS NULL
    OR intensity53 IS NULL
    OR intensity54 IS NULL
    OR intensity55 IS NULL
    OR intensity56 IS NULL
    OR intensity57 IS NULL
    OR intensity58 IS NULL
    OR intensity59 IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_mets_narrow_merged
WHERE 
	   participant_id IS NULL
	OR activity_minute IS NULL
	OR mets IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_sleep_merged
WHERE 
	   participant_id IS NULL
	OR activity_date IS NULL
	OR sleep_value IS NULL
    OR log_id IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_steps_narrow_merged
WHERE 
	   participant_id IS NULL
	OR activity_minute IS NULL
	OR steps IS NULL;
```

```sql
SELECT *
FROM fitbase_minute_steps_wide_merged
WHERE 
	   participant_id IS NULL
	OR activity_hour IS NULL
	OR steps00 IS NULL
    OR steps01 IS NULL
    OR steps02 IS NULL
    OR steps03 IS NULL
    OR steps04 IS NULL
    OR steps05 IS NULL
    OR steps06 IS NULL
    OR steps07 IS NULL
    OR steps08 IS NULL
    OR steps09 IS NULL
    OR steps10 IS NULL
    OR steps11 IS NULL
    OR steps12 IS NULL
    OR steps13 IS NULL
    OR steps14 IS NULL
    OR steps15 IS NULL
    OR steps16 IS NULL
    OR steps17 IS NULL
    OR steps18 IS NULL
    OR steps19 IS NULL
    OR steps20 IS NULL
    OR steps21 IS NULL
    OR steps22 IS NULL
    OR steps23 IS NULL
    OR steps24 IS NULL
    OR steps25 IS NULL
    OR steps26 IS NULL
    OR steps27 IS NULL
    OR steps28 IS NULL
    OR steps29 IS NULL
    OR steps30 IS NULL
    OR steps31 IS NULL
    OR steps32 IS NULL
    OR steps33 IS NULL
    OR steps34 IS NULL
    OR steps35 IS NULL
    OR steps36 IS NULL
    OR steps37 IS NULL
    OR steps38 IS NULL
    OR steps39 IS NULL
    OR steps40 IS NULL
    OR steps41 IS NULL
    OR steps42 IS NULL
    OR steps43 IS NULL
    OR steps44 IS NULL
    OR steps45 IS NULL
    OR steps46 IS NULL
    OR steps47 IS NULL
    OR steps48 IS NULL
    OR steps49 IS NULL
    OR steps50 IS NULL
    OR steps51 IS NULL
    OR steps52 IS NULL
    OR steps53 IS NULL
    OR steps54 IS NULL
    OR steps55 IS NULL
    OR steps56 IS NULL
    OR steps57 IS NULL
    OR steps58 IS NULL
    OR steps59 IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   participant_id IS NULL
	OR log_date IS NULL
	OR weight_kg IS NULL
	OR weight_pounds IS NULL
	OR fat IS NULL
	OR bmi IS NULL
	OR is_manual_report IS NULL
	OR log_id IS NULL;
```

This query returned many rows, and at a quick glance it appears the 'fat' column contains several NULL values. The following script was run to discern whether there were any non-NULL values in the column.

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   participant_id IS NULL
	OR log_date IS NULL
	OR weight_kg IS NULL
	OR weight_pounds IS NULL
	OR fat IS NOT NULL
	OR bmi IS NULL
	OR is_manual_report IS NULL
	OR log_id IS NULL;
```

There were 4 rows returned with values in the 'fat' column. It was noted that all 4 rows also contained the value 'TRUE' in the 'is_manual_report' column and the following query was run to determine the reporting status of the NULL values in the column.

```sql
SELECT fat, is_manual_report
FROM fitbase_weight_log_info_merged
WHERE 
       fat IS NULL;
```
It was determined that the NULL values are associated with both 'TRUE' and 'FALSE' responses. It is likely that only some participants had models that specifically tracked the 'fat' metric. With only 4 rows in the entire table containing non-NULL values, the sample is too small to analyze for useful trends. This column will be disregarded in further analysis.
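
The same breakdown can be obtained in one pass by counting recorded and missing `fat` values for each reporting status. A minimal sketch against the table defined earlier, relying on the fact that `COUNT(column)` skips NULLs while `COUNT(*)` does not:

```sql
-- COUNT(*) counts all rows; COUNT(fat) counts only rows where fat is not NULL.
SELECT
  is_manual_report,
  COUNT(*) AS total_rows,
  COUNT(fat) AS fat_recorded,
  COUNT(*) - COUNT(fat) AS fat_missing
FROM fitbase_weight_log_info_merged
GROUP BY is_manual_report
ORDER BY is_manual_report;
```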

The rest of the columns were evaluated individually to determine if there were any other NULL values. 

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   participant_id IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   weight_kg IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   weight_pounds IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   fat IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   bmi IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   is_manual_report IS NULL;
```

```sql
SELECT *
FROM fitbase_weight_log_info_merged
WHERE 
	   log_id IS NULL;
```

There were no other NULL values to investigate.

## 2. Visualizing the Data

These new tables were exported as single CSV files to be uploaded to Tableau for visual analysis.
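
One way to produce such a file from PostgreSQL is `COPY ... TO`. The sketch below is an assumption about how an export could be done, not a record of the method used here; the output path is a placeholder and the server process needs write access to it.

```sql
-- Placeholder output path; HEADER true writes the column names as the first line.
COPY fitbase_daily_activity_merged
TO '/path/to/export/fitbase_daily_activity_merged.csv'
WITH (FORMAT csv, HEADER true);
```

A query in parentheses can be used in place of the table name to export only selected columns or rows.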

## FitLife Dataset Changelog

This dataset is too large to view in a spreadsheet, so this notebook is a changelog for using SQL to clean the health_fitness_dataset. The table was created using the following query, and the import wizard was then used to upload the data from the CSV file.

```sql
CREATE TABLE health_fitness (
	participant_id INTEGER,
	activity_date DATE,
	age INTEGER,
	gender TEXT,
	height_cm NUMERIC,
	weight_kg NUMERIC,
	activity_type TEXT,
	duration_minutes INTEGER,
	intensity TEXT,
	calories_burned NUMERIC,
	avg_heart_rate INTEGER,
	hours_sleep NUMERIC,
	stress_level INTEGER,
	daily_steps INTEGER,
	hydration_level NUMERIC,
	bmi NUMERIC,
	resting_heart_rate NUMERIC,
	blood_pressure_systolic NUMERIC,
	blood_pressure_diastolic NUMERIC,
	health_condition TEXT,
	smoking_status TEXT,
	fitness_level NUMERIC
)
```

## 1. Investigating NULL Values

Variations of the following script were run for every column to discover if there were any NULL values.
```sql
SELECT *
FROM health_fitness
WHERE participant_id IS NULL;
```
As expected from a synthetic dataset, there were no rows containing NULL values that needed to be excluded from the dataset.
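
Rather than running one query per column, the NULL counts for several columns can be collected in a single pass with `COUNT(*) FILTER`. This is a sketch of that alternative, not the script that was run; only the first four columns of `health_fitness` are written out, and the rest would be added in the same way.

```sql
-- COUNT(*) FILTER (...) counts only the rows that satisfy the condition.
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE participant_id IS NULL) AS participant_id_nulls,
  COUNT(*) FILTER (WHERE activity_date IS NULL) AS activity_date_nulls,
  COUNT(*) FILTER (WHERE age IS NULL) AS age_nulls,
  COUNT(*) FILTER (WHERE gender IS NULL) AS gender_nulls
  -- ...repeat for the remaining columns of health_fitness
FROM health_fitness;
```

A result row of all zeros in the `_nulls` columns corresponds to the finding above that no rows needed to be excluded.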



## 2. Summarizing Data
The following queries were run to summarize the data. [This summary table](https://docs.google.com/spreadsheets/d/1yUZ-9m3Vjc8Dt2WBRqhE59gbnROLvnGts3fppkOWafQ/edit?usp=sharing) was created in Google Sheets to compile these results.

### Summarizing Gender
```sql
SELECT
  gender,
  COUNT(DISTINCT participant_id) AS user_count
FROM health_fitness
GROUP BY gender;
```

### Summarizing Activity Types

First, this query was run to summarize the activity types, split up by intensity:
```sql
SELECT
  activity_type, intensity,
  COUNT(DISTINCT participant_id) AS activity_type_count,
  COUNT(participant_id) AS activity_session_count,
  SUM(duration_minutes) AS activity_duration_count,
  ROUND(AVG(duration_minutes), 2) AS activity_duration_average_in_minutes,
  ROUND(MIN(duration_minutes), 2) AS activity_duration_min_in_minutes,
  ROUND(MAX(duration_minutes), 2) AS activity_duration_max_in_minutes
FROM health_fitness
GROUP BY activity_type, intensity
ORDER BY activity_type,
  CASE intensity
        WHEN 'Low' THEN 1
        WHEN 'Medium' THEN 2
        WHEN 'High' THEN 3
        ELSE NULL
        END;
```

### Intensity Session Breakdown
```sql
SELECT
  intensity,
  COUNT(intensity) AS intensity_session_total_count
FROM health_fitness
GROUP BY  
  intensity
ORDER BY  
  CASE intensity
        WHEN 'Low' THEN 1
        WHEN 'Medium' THEN 2
        WHEN 'High' THEN 3
        ELSE NULL
        END;
```

### Min, Max, Avg Summaries by Activity Type and Intensity
```sql
SELECT
  activity_type,
  intensity,
  COUNT(DISTINCT participant_id) AS distinct_participant_count,
  ROUND(AVG(stress_level), 2) AS stress_average,
  ROUND(MIN(stress_level), 2) AS stress_min,
  ROUND(MAX(stress_level), 2) AS stress_max,
  ROUND(AVG(hours_sleep), 2) AS hours_sleep_average,
  ROUND(MIN(hours_sleep), 2) AS hours_sleep_min,
  ROUND(MAX(hours_sleep), 2) AS hours_sleep_max,
  ROUND(AVG(weight_kg), 2) AS weight_kg_average,
  ROUND(MIN(weight_kg), 2) AS weight_kg_min,
  ROUND(MAX(weight_kg), 2) AS weight_kg_max,
  ROUND(AVG(hydration_level), 2) AS hydration_level_average,
  ROUND(MIN(hydration_level), 2) AS hydration_level_min,
  ROUND(MAX(hydration_level), 2) AS hydration_level_max,
  ROUND(AVG(daily_steps), 0) AS daily_steps_average,
  ROUND(MIN(daily_steps), 0) AS daily_steps_min,
  ROUND(MAX(daily_steps), 0) AS daily_steps_max,
  ROUND(AVG(age), 2) AS average_age,
  ROUND(AVG(duration_minutes), 0) AS duration_average_in_minutes,
  ROUND(MIN(duration_minutes), 0) AS duration_min_in_minutes,
  ROUND(MAX(duration_minutes), 0) AS duration_max_in_minutes,
  ROUND(AVG(bmi), 2) AS bmi_average,
  ROUND(MIN(bmi), 2) AS bmi_min,
  ROUND(MAX(bmi), 2) AS bmi_max,
  ROUND(AVG(resting_heart_rate), 1) AS resting_heart_rate_average,
  ROUND(MIN(resting_heart_rate), 1) AS resting_heart_rate_min,
  ROUND(MAX(resting_heart_rate), 1) AS resting_heart_rate_max,
  ROUND(AVG(blood_pressure_diastolic), 1) AS blood_pressure_diastolic_average,
  ROUND(MIN(blood_pressure_diastolic), 1) AS blood_pressure_diastolic_min,
  ROUND(MAX(blood_pressure_diastolic), 1) AS blood_pressure_diastolic_max,
  ROUND(AVG(blood_pressure_systolic), 1) AS blood_pressure_systolic_average,
  ROUND(MIN(blood_pressure_systolic), 1) AS blood_pressure_systolic_min,
  ROUND(MAX(blood_pressure_systolic), 1) AS blood_pressure_systolic_max,
  ROUND(AVG(calories_burned), 0) AS calories_burned_average,
  ROUND(MIN(calories_burned), 0) AS calories_burned_min,
  ROUND(MAX(calories_burned), 0) AS calories_burned_max
FROM health_fitness
GROUP BY activity_type, intensity
ORDER BY activity_type,
  CASE intensity
    WHEN 'Low' THEN 1
    WHEN 'Medium' THEN 2
    WHEN 'High' THEN 3
    ELSE NULL
  END;
```

### Min, Max, Avg Summaries by Date
```sql
SELECT
  date,
  ROUND(AVG(stress_level), 2) AS stress_average,
  ROUND(MIN(stress_level), 2) AS stress_min,
  ROUND(MAX(stress_level), 2) AS stress_max,
  ROUND(AVG(hours_sleep), 2) AS hours_sleep_average,
  ROUND(MIN(hours_sleep), 2) AS hours_sleep_min,
  ROUND(MAX(hours_sleep), 2) AS hours_sleep_max,
  ROUND(AVG(weight_kg), 2) AS weight_kg_average,
  ROUND(MIN(weight_kg), 2) AS weight_kg_min,
  ROUND(MAX(weight_kg), 2) AS weight_kg_max,
  ROUND(AVG(hydration_level), 2) AS hydration_level_average,
  ROUND(MIN(hydration_level), 2) AS hydration_level_min,
  ROUND(MAX(hydration_level), 2) AS hydration_level_max,
  ROUND(AVG(daily_steps)) AS daily_steps_average,
  ROUND(MIN(daily_steps)) AS daily_steps_min,
  ROUND(MAX(daily_steps)) AS daily_steps_max,
  ROUND(AVG(age), 2) AS average_age,
  ROUND(AVG(duration_minutes)) AS duration_average_in_minutes,
  ROUND(MIN(duration_minutes)) AS duration_min_in_minutes,
  ROUND(MAX(duration_minutes)) AS duration_max_in_minutes,
  ROUND(AVG(bmi)) AS bmi_average,
  ROUND(MIN(bmi)) AS bmi_min,
  ROUND(MAX(bmi)) AS bmi_max,
  ROUND(AVG(resting_heart_rate)) AS resting_heart_rate_average,
  ROUND(MIN(resting_heart_rate)) AS resting_heart_rate_min,
  ROUND(MAX(resting_heart_rate)) AS resting_heart_rate_max,
  ROUND(AVG(blood_pressure_diastolic)) AS blood_pressure_diastolic_average,
  ROUND(MIN(blood_pressure_diastolic)) AS blood_pressure_diastolic_min,
  ROUND(MAX(blood_pressure_diastolic)) AS blood_pressure_diastolic_max,
  ROUND(AVG(blood_pressure_systolic)) AS blood_pressure_systolic_average,
  ROUND(MIN(blood_pressure_systolic)) AS blood_pressure_systolic_min,
  ROUND(MAX(blood_pressure_systolic)) AS blood_pressure_systolic_max,
  ROUND(AVG(calories_burned)) AS calories_burned_average,
  ROUND(MIN(calories_burned)) AS calories_burned_min,
  ROUND(MAX(calories_burned)) AS calories_burned_max
FROM health_fitness
GROUP BY date;
```

### 3. Summarizing While Filtering for Gender
Bellabeat positions itself as a health and wellness company for women, so each summary query from the previous section was rerun with the following clause added. The clause excludes men from the results without altering the dataset itself.

```sql
WHERE gender <> 'M'
```
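
As a sketch of where the clause goes, the intensity breakdown from the previous section would look like this with the filter added. `WHERE` sits between `FROM` and `GROUP BY`, so rows are excluded before they are aggregated. This assumes `gender` is stored as a single-letter code, as the clause above implies.

```sql
-- Intensity breakdown restricted to participants not recorded as male.
SELECT
  intensity,
  COUNT(intensity) AS intensity_session_total_count
FROM health_fitness
WHERE gender <> 'M'   -- rows are filtered here, before grouping
GROUP BY
  intensity
ORDER BY
  CASE intensity
    WHEN 'Low' THEN 1
    WHEN 'Medium' THEN 2
    WHEN 'High' THEN 3
    ELSE NULL
  END;
```

One thing to keep in mind: `gender <> 'M'` also drops rows where `gender` is NULL, because the comparison evaluates to NULL rather than true. If such rows should be kept, `gender IS DISTINCT FROM 'M'` is the PostgreSQL form that treats NULL as an ordinary value.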

## Summing activities by user and date without dividing by intensity
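Splitting the counts by intensity made the results easy to misread. A user who logged sessions at more than one intensity is counted once in each intensity category, so adding the per-intensity counts together overstates the number of unique users for an activity. The queries in this section therefore group by activity type only.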

```sql
SELECT 
    activity_type, 
    COUNT(DISTINCT participant_id) AS unique_users 
FROM health_fitness 
WHERE gender <> 'M' 
GROUP BY activity_type 
ORDER BY activity_type;
```
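
The size of the overcount can be checked directly. The following is a minimal sketch, not one of the queries originally run; the CTE names and the `summed_users` and `overcount` aliases are made up for illustration. It places the sum of the per-intensity distinct counts next to the true distinct count for each activity.

```sql
-- Compare "sum of per-intensity unique users" with the real unique-user count.
WITH per_intensity AS (
  SELECT
    activity_type,
    intensity,
    COUNT(DISTINCT participant_id) AS users_in_intensity
  FROM health_fitness
  WHERE gender <> 'M'
  GROUP BY activity_type, intensity
),
summed AS (
  -- Adding the per-intensity counts counts a user once per intensity used.
  SELECT
    activity_type,
    SUM(users_in_intensity) AS summed_users
  FROM per_intensity
  GROUP BY activity_type
),
overall AS (
  -- Counting distinct participants per activity counts each user once.
  SELECT
    activity_type,
    COUNT(DISTINCT participant_id) AS unique_users
  FROM health_fitness
  WHERE gender <> 'M'
  GROUP BY activity_type
)
SELECT
  activity_type,
  summed_users,
  unique_users,
  summed_users - unique_users AS overcount
FROM overall
JOIN summed USING (activity_type)
ORDER BY activity_type;
```

Any row with an `overcount` above zero is an activity where at least one user appears in more than one intensity category.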

### Activity Session Counts
```sql
SELECT
  activity_type,
  COUNT(activity_type) AS activity_session_count
FROM health_fitness
WHERE gender <> 'M'
GROUP BY
  activity_type
ORDER BY
  activity_type;
```

### Activity Duration Summary
```sql
SELECT
  activity_type,
  SUM(duration_minutes) AS activity_duration_in_minutes
FROM health_fitness
WHERE gender <> 'M'
GROUP BY
  activity_type
ORDER BY
  activity_type;
```

## Data Visualization
The data was visualized using various tables in Tableau Public. The dataset showed only flat trends, with no growth or decline. This makes sense because the data is synthetic, so there were no surprising trends to discover. These visualizations were not used in the final report.

## Conclusion
This dataset was not useful for this case study, but it was a good chance to practice cleaning and analyzing data with tools such as BigQuery, spreadsheets, and Tableau Public.





# Bellabeat fitness consumer survey changelog

This dataset is small enough to view as a spreadsheet: it contains responses from 30 respondents to 21 questions, plus a timestamp. Although the data was examined in a spreadsheet, it was also uploaded to the database using SQL as a practice exercise.

The table was created with the following statement, and the data was then loaded from the CSV file using the import wizard.

```sql
CREATE TABLE survey_605 (
	response_timestamp timestamp,
	age text,
	gender text, 
	education_level text,
	occupation text,
	weekly_exercise_frequency text,
	length_wearable_history text,
	wearable_use_frequency text,
	fitness_data_tracking_frequency text,
	impact_fitness_routine text,
	impact_fitness_motivation text,
	impact_exercise_enjoyment text,
	wearable_engagement text,
	community_connection text,
	impact_fitness_goal_achievement text,
	impact_overall_health text,
	impact_sleep_patterns text,
	impact_overall_wellbeing text,
	influence_exercise_frequency text,
	influence_fitness_purchases text,
	influence_join_gym text,
	influence_dietary_decision text
);
```
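
The import wizard step could also be scripted. The following is a rough sketch using psql's `\copy` meta-command, not the method actually used here. The file name `survey_605.csv` is a placeholder, and the sketch assumes the file has a header row, that its columns are in the same order as the table definition, and that the timestamps are in a format PostgreSQL can parse.

```sql
-- Run from psql. \copy reads the file on the client machine and must stay on one line.
-- 'survey_605.csv' is a hypothetical path; replace it with the real location of the file.
\copy survey_605 FROM 'survey_605.csv' WITH (FORMAT csv, HEADER true)

-- Sanity check: the row count should match the number of respondents in the file.
SELECT COUNT(*) AS loaded_rows
FROM survey_605;
```

With 30 respondents in the survey, `loaded_rows` should come back as 30 if every row loaded.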

## 1. Investigating NULL Values

There were no null values in the dataset.
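
The query used for this check is not recorded in this changelog. One way to run it, shown here as a sketch over a handful of the columns, is to count the rows where each column is NULL. The `null_*` and `blank_age` aliases are made up for illustration, and the same pattern can be repeated for the remaining columns.

```sql
-- Count NULLs per column; every null_* value should be 0 if the column is complete.
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE response_timestamp IS NULL) AS null_response_timestamp,
  COUNT(*) FILTER (WHERE age IS NULL) AS null_age,
  COUNT(*) FILTER (WHERE gender IS NULL) AS null_gender,
  COUNT(*) FILTER (WHERE education_level IS NULL) AS null_education_level,
  COUNT(*) FILTER (WHERE occupation IS NULL) AS null_occupation,
  -- Text columns can also hold empty or whitespace-only strings, which are not NULL.
  COUNT(*) FILTER (WHERE TRIM(age) = '') AS blank_age
FROM survey_605;
```

The `FILTER` clause is available in PostgreSQL 9.4 and later. On older versions, `SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END)` gives the same count.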