**Bellabeat** is a high-tech manufacturer of health-focused products for women. Its smart devices have empowered consumers to take charge of their health data related to activity, sleep, stress, menstrual cycle, and mindfulness habits. Since its founding in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

**Scenario**:
    You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. 
    *Urška Sršen*, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

# Business Task


**Identifying key tasks and deliverables**

Analyze FitBit fitness tracker data to see how users interact with their smart devices and determine trends to guide marketing strategy for Bellabeat.

**Primary Stakeholders:**
* Urška Sršen - Bellabeat's cofounder and CCO
* Sando Mur - Mathematician and Bellabeat's cofounder

The marketing analytics team is tasked with analyzing the smart device usage to gain insights into how people are using their smart devices.
Our task as data analysts is to collect, transform, and organize this data to draw conclusions, make predictions, and drive informed decision-making. 

Steps of the data analysis process:
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act

# Step 1 - Ask
**Ask questions to make data-driven decisions.** 
It's important to ask highly effective questions that are SMART (Specific, Measurable, Action-oriented, Relevant, Time-bound). We want to avoid leading questions that only have a particular response, closed-ended questions that discourage follow-up, and vague questions that are nonspecific and don't provide context. This develops an analytical thinking framework that identifies and defines a problem, and then solves it by using data in an organized step-by-step manner.

## Questions for Analysis ##
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

# Step 2 - Prepare
**Prepare data for exploration**

This project utilizes publicly available data exploring the daily habits of FitBit users. The [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit/data) set contains information from thirty FitBit users who consented to share personal fitness tracker data, including minute-level details on physical activity, heart rate, and sleep monitoring.

This step in the data analysis process ensures ethical data analysis practices, addresses issues of bias and credibility, and notes the accessibility of the data source.
## Data Responsibility ##
**Establishing Data Credibility & Addressing Limitations**

**ROCC Framework**
* **Reliable** - The data includes information from thirty FitBit users who consented to share their fitness tracker data. While this meets the common rule-of-thumb minimum of 30 observations associated with the Central Limit Theorem, it may not fully represent the broader population's behavior.
* **Original** - The dataset was posted by Mobius on Kaggle and distributed via Amazon Mechanical Turk, which makes it third-party data.
* **Comprehensive** - The dataset includes multiple tables on daily activity, heart rate, calories, steps, sleep, and weight data. However, some tables are redundant (e.g. *dailyIntensities_merged* and *dailyActivity_merged*), and the weight log info table only contains data for 8 unique users.
* **Current** - The data was collected between *03/12/2016 - 05/12/2016*, making it outdated and potentially less reflective of current user behavior.

Despite these limitations, this dataset offers valuable insights into smart device usage trends, so we will proceed with our analysis.

## Installing & Loading Necessary Packages ##

In [None]:
# Install Necessary Packages #
# install.packages("tidyverse")
# install.packages("skimr")
# install.packages("here")
# install.packages("janitor")
# install.packages("ggplot2")
# install.packages("lubridate")
# install.packages("sqldf")
# install.packages("plotrix")
install.packages("plotly")

# Load necessary libraries
library(tidyverse)
library(skimr)
library(here)
library(janitor)
library(ggplot2)
library(lubridate)
library(sqldf)
library(plotrix)
library(plotly)
library(purrr)
library(scales)

## Importing FitBit Tracker Data ##

In [None]:
daily_activity <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
sleep_day <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weight_info_log <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
hourly_steps <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
hourly_calories <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
heartrate <- read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')



# Step 3 - Process
**Process data from dirty to clean**

Dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. There are a number of causes of dirty data, including manual data entry errors, batch data imports, data migration, software obsolescence, improper data collection, and human errors during data input. As the data professional in this study, I can take steps to mitigate the impact of dirty data by implementing effective data quality processes.

The process of cleaning data includes transforming data into a more useful format, combining two or more datasets to make information more complete, and removing outliers. 

Principles of Data Integrity
1. Validity - the degree to which measures conform to defined business rules or constraints.
2. Accuracy - the degree of conformity of a measure to a standard or a true value.
3. Completeness - the degree to which all required measures are known.
4. Consistency - the degree to which a set of measures is equivalent across systems.
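
As a rough illustration of how some of these principles can translate into checks, the sketch below runs completeness, validity, and consistency tests on a small made-up table. The `toy_activity` rows and the 1,440-minute rule are assumptions chosen for illustration, not values taken from the FitBit files.

```r
# Hypothetical sample rows shaped like the daily activity table
toy_activity <- data.frame(
  Id = c("1001", "1001", "1002", "1002"),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(8500, NA, 12000, 12000),
  SedentaryMinutes = c(700, 1500, 650, 650)
)

# Completeness: count missing values per column
missing_per_column <- colSums(is.na(toy_activity))

# Validity: a day cannot contain more than 1,440 minutes (assumed business rule)
invalid_minutes <- sum(toy_activity$SedentaryMinutes > 1440)

# Consistency: each Id/Date pair should appear only once
duplicate_days <- sum(duplicated(toy_activity[, c("Id", "Date")]))

print(missing_per_column)
print(invalid_minutes)
print(duplicate_days)
```

Checks like these only flag rows for review; whether to drop, correct, or keep a flagged row remains a judgment call.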

## Inspect the Data

We start with `view()` and `glimpse()` to inspect the structure and a sample of each data frame. This helps identify major issues before diving deeper.

In [None]:
view(daily_activity)
view(sleep_day)
view(weight_info_log)
view(hourly_steps)
view(hourly_calories)
view(heartrate)
glimpse(daily_activity)
glimpse(sleep_day)
glimpse(weight_info_log)
glimpse(hourly_steps)
glimpse(hourly_calories)
glimpse(heartrate)

## Data Cleaning

We can see that there are structural issues with column names and with the consistency of date and time formats. The activity intensity labels also differ between the distance and minutes columns ("light" vs "lightly" and "moderately" vs "fairly"). We need to correct capitalization, spacing, and inconsistent column names to ensure a more logical flow of data organization before merging, filtering, or calculating new values. 
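
To make the date-time handling concrete, here is a minimal sketch that parses one made-up timestamp in the 12-hour format used by the hourly tables and splits it into date and time parts. The sample string and the explicit `tz = "UTC"` are assumptions for illustration; fixing the time zone is one way to keep the derived date from shifting on machines with a different local zone.

```r
# Hypothetical timestamp in the 12-hour format used by the hourly tables
raw_timestamp <- "4/12/2016 1:00:00 PM"

# tz = "UTC" is an assumption made here so the result does not depend on the local zone
parsed <- as.POSIXct(raw_timestamp, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into a Date column and a 24-hour time string
date_part <- as.Date(parsed, tz = "UTC")
time_part <- format(parsed, "%H:%M")

print(date_part)
print(time_part)
```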

In [None]:
# Adjusting Column Names and Separating Date-Time Components 
# Daily Activity
daily_activity <- daily_activity %>%
  rename(Date = ActivityDate,
         ModeratelyActiveMinutes = FairlyActiveMinutes,
         LightlyActiveDistance = LightActiveDistance) %>%
  mutate(Id = as.character(Id),
         Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(Id, Date)
# Sleep Day 
sleep_day <- sleep_day %>%
  rename(DateTime = SleepDay) %>%
  mutate(Id = as.character(Id),
         DateTime = as.POSIXct(DateTime, format = "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(DateTime)) %>%
  arrange(Id, Date) %>%
  select(-DateTime)
# Time dropped since all sleep logs occurred at 12:00 AM 
# Weight Log Info
weight_info_log <- weight_info_log %>%
  rename(DateTime = Date) %>%
  mutate(Id = as.character(Id),
         DateTime = as.POSIXct(DateTime, format = "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(DateTime),
         Time = format(DateTime, "%H:%M")) %>%
  arrange(Id, Date) %>%
  select(-DateTime)
# Hourly Steps
hourly_steps <- hourly_steps %>%
  mutate(Id = as.character(Id),
         ActivityHour = as.POSIXct(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(ActivityHour),
         Time = format(ActivityHour, "%H:%M")) %>%
  arrange(Id, Date) %>%
  select(-ActivityHour)
# Hourly Calories
hourly_calories <- hourly_calories %>%
  mutate(Id = as.character(Id),
         ActivityHour = as.POSIXct(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(ActivityHour),
         Time = format(ActivityHour, "%H:%M")) %>%
  arrange(Id, Date) %>%
  select(-ActivityHour)
# Heartrate
heartrate <- heartrate %>%
  rename(DateTime = Time) %>%
  mutate(Id = as.character(Id),
         DateTime = as.POSIXct(DateTime, format = "%m/%d/%Y %I:%M:%S %p"),
         Date = as.Date(DateTime),
         Time = format(DateTime, "%H:%M:%S")) %>%
  arrange(Id, Date) %>%
  select(-DateTime)

We use `glimpse()` again to confirm that the date and time formats are now consistent across all data sets.

In [None]:
glimpse(daily_activity)
glimpse(sleep_day)
glimpse(weight_info_log)
glimpse(hourly_steps)
glimpse(hourly_calories)
glimpse(heartrate)

We want to eliminate as much unnecessary data as possible, so we will drop *LoggedActivitiesDistance* and *TrackerDistance* from *daily_activity*. In *weight_info_log*, *WeightKg* is redundant since we have *WeightPounds*, *LogId* is not needed for this analysis, and the column *Fat* contains only N/A values, so these can be discarded as well. 
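
Before discarding a column for being empty, it can be worth confirming that it really holds no values. The sketch below shows one way to do that on a made-up table; `toy_weight` and its values are hypothetical.

```r
# Hypothetical rows shaped like the weight log
toy_weight <- data.frame(
  Id = c("1001", "1002", "1003"),
  WeightPounds = c(150.2, 180.5, 135.0),
  Fat = c(NA_real_, NA_real_, NA_real_)
)

# Share of missing values in each column (1 means the column is entirely NA)
na_share <- colMeans(is.na(toy_weight))

# Columns that contain no values at all
empty_columns <- names(na_share)[na_share == 1]

print(na_share)
print(empty_columns)
```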

In [None]:
# Eliminating Unnecessary Columns #
daily_activity <- daily_activity %>% 
  select(-LoggedActivitiesDistance, -TrackerDistance)
weight_info_log <- weight_info_log %>%
  select(-WeightKg, -LogId, -Fat)

**Handling Duplicates**


In [None]:
# Count fully duplicated rows in each data set
sum(duplicated(daily_activity))
sum(duplicated(sleep_day)) 
# Keep one sleep record per user per day
sleep_day <- sleep_day %>%
  distinct(Id, Date, .keep_all = TRUE)
sum(duplicated(weight_info_log))
sum(duplicated(hourly_steps))
sum(duplicated(hourly_calories))
sum(duplicated(heartrate))
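
The two functions used above answer slightly different questions: `duplicated()` counts rows that repeat in full, while `distinct(Id, Date, .keep_all = TRUE)` keeps the first row for each user and day. A small sketch on made-up data (the `toy_sleep` table is hypothetical) shows the effect:

```r
library(dplyr)

# Hypothetical sleep records; the second row repeats the first
toy_sleep <- tibble(
  Id = c("1001", "1001", "1001", "1002"),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(420, 420, 390, 450)
)

# Rows that are exact copies of an earlier row
full_row_duplicates <- sum(duplicated(toy_sleep))

# Keep the first record for each Id/Date pair
deduplicated <- toy_sleep %>%
  distinct(Id, Date, .keep_all = TRUE)

print(full_row_duplicates)
print(nrow(deduplicated))
```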

## Identifying Participants
We start examining the structure of our data by counting the number of distinct user IDs in each data set. Knowing the sample size helps us judge how well our sample of Fitbit users can stand in for the population of all Bellabeat users. 
In general, a larger sample supports a higher **confidence level** (how confident we are in the results), a lower **margin of error** (how much the sample's results are expected to differ from what the result would have been if we had surveyed the entire population), and greater **statistical significance** (the determination of whether the result could be due to random chance or not).
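
As a rough worked example of how sample size drives the margin of error, the sketch below applies the standard normal-approximation formula for a proportion. The inputs are assumptions: a sample of 30 (the figure quoted for this dataset), a 95% confidence level, and the most conservative proportion of 0.5.

```r
# Assumed inputs for illustration
sample_size <- 30          # participants quoted for the dataset
sample_proportion <- 0.5   # most conservative choice
confidence_level <- 0.95

# z-score for a two-sided interval at the chosen confidence level
z_score <- qnorm(1 - (1 - confidence_level) / 2)

# Margin of error for a proportion (normal approximation)
margin_of_error <- z_score * sqrt(sample_proportion * (1 - sample_proportion) / sample_size)

print(round(margin_of_error * 100, 1))
```

With a sample this small, the margin of error comes out at roughly 18 percentage points, which is one reason the findings here are best read as exploratory.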

In [None]:
# Checking the number of distinct ID's per data set #
# Name each data set directly so the labels cannot drift from the list order
list_of_datasets <- list(daily_activity = daily_activity, sleep_day = sleep_day, weight_info_log = weight_info_log,
                         hourly_steps = hourly_steps, hourly_calories = hourly_calories, heartrate = heartrate)
distinct_counts <- map(list_of_datasets, ~ n_distinct(.x$Id))
print(distinct_counts)

## Data Limitations

While conducting this analysis, several important limitations of the FitBit data were identified:

*Small Sample Size*:
The weight_info_log dataset includes only 8 unique users, and the heartrate dataset includes 14 unique users. This is a very small sample relative to the broader Fitbit user population, which limits the generalizability of any findings.

*Sampling Bias*:
The users included in the dataset may not be representative of the general population. Factors such as age, gender, fitness level, and lifestyle habits are unknown and may introduce bias.

*High-Resolution Heart Rate Data*:
The heart rate data is sampled every 5 seconds, resulting in a highly detailed but large dataset. Due to its volume and complexity, heart rate data was kept separate from the main analysis and used only for exploratory purposes.

*Missing Variables*:
The dataset does not include information on important factors such as sleep quality ratings, stress levels, diet, or other health indicators that could provide a more complete picture of wellness.

*Self-Reported and Device-Generated Data*:
Some data points (e.g., manually entered weight logs) may be self-reported, introducing potential inaccuracies. Device-recorded activity levels and sleep patterns are subject to measurement error depending on device wear time and sensor accuracy.

Despite these limitations, the dataset provides a valuable opportunity to practice and demonstrate data cleaning, analysis, and visualization skills. All insights and recommendations drawn from this analysis should be considered exploratory and limited in scope.

**Merging Cleaned Datasets**

In [None]:
# Merging Cleaned Data sets #
# merge daily activity with sleep day for sleep-activity analysis #
activity_sleep <- left_join(daily_activity, sleep_day, by = c("Id", "Date")) %>%
  mutate(TotalActiveMinutes = LightlyActiveMinutes + ModeratelyActiveMinutes + VeryActiveMinutes) %>%
  select(-SedentaryActiveDistance)
view(activity_sleep)
# merge hourly calories with steps
hourly_steps_calories <- hourly_steps %>%
     inner_join(hourly_calories, by = c("Id", "Date", "Time"))
view(hourly_steps_calories)
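
The two merges above use different join types, which affects how many rows survive. The sketch below contrasts them on made-up data; `toy_activity` and `toy_sleep` are hypothetical tables. A left join keeps every activity day and fills missing sleep values with `NA`, while an inner join keeps only the days present in both tables.

```r
library(dplyr)

# Hypothetical activity days for two users
toy_activity <- tibble(
  Id = c("1001", "1001", "1002"),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(8500, 9200, 12000)
)

# Hypothetical sleep records; user 1001 has no record for 2016-04-13
toy_sleep <- tibble(
  Id = c("1001", "1002"),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(420, 450)
)

# Left join: all activity rows are kept, unmatched sleep values become NA
left_result <- left_join(toy_activity, toy_sleep, by = c("Id", "Date"))

# Inner join: only rows with a match in both tables are kept
inner_result <- inner_join(toy_activity, toy_sleep, by = c("Id", "Date"))

print(left_result)
print(inner_result)
```

This is why later summaries of the merged table pass `na.rm = TRUE`: days without a sleep record remain in the data with missing sleep values.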

# Step 4 - Analyze
**Analyze data to answer questions**

Analyzing involves investigating the central themes of the data set to gain statistical information about potential patterns, relationships, and trends within the data.
The first two phases are organizing the data, then formatting and adjusting it. Sorting arranges data in a meaningful order, while filtering displays only data that meets specific criteria. Combining filtering and sorting allows for organizing only relevant data for analysis.
We will begin by summarizing our cleaned data sets to draw general observations about each health metric.

In [None]:
# Daily Activity Summary
cat("Summary of Activity Data:\n")
activity_sleep %>%  
  select(TotalSteps, TotalDistance, TotalActiveMinutes, SedentaryMinutes, Calories, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
# Active Minutes per Category (Very Active, Moderately Active, Lightly Active)
cat("\nSummary of Active Minutes and Distance per Category:\n")
activity_sleep %>%
  select(VeryActiveMinutes, ModeratelyActiveMinutes, LightlyActiveMinutes, VeryActiveDistance, ModeratelyActiveDistance, LightlyActiveDistance) %>%
  summary()
# Weight Info Log Summary (Weight in Pounds, BMI)
cat("\nSummary of Weight Data:\n")
weight_info_log %>%
  select(WeightPounds, BMI) %>%
  summary()
# Heart Rate Summary
cat("\nSummary of Heart Rate Data:\n")
heartrate_daily <- heartrate %>%
  group_by(Id, Date) %>%
  summarise(
    AvgHeartRate = mean(Value, na.rm = TRUE),
    MaxHeartRate = max(Value, na.rm = TRUE),
    MinHeartRate = min(Value, na.rm = TRUE),
    HR_Readings = n(),
    .groups = "drop"
  )
heartrate_summary <- heartrate_daily %>%
  group_by(Id) %>%
  summarise(
    MeanDailyAvgHR = mean(AvgHeartRate, na.rm = TRUE),
    MeanMaxHR = mean(MaxHeartRate, na.rm = TRUE),
    MeanMinHR = mean(MinHeartRate, na.rm = TRUE),
    DaysTracked = n()
  )
head(heartrate_summary)

Next, we analyze the fitness metrics by day of the week.

In [None]:
# Add day of the week for analysis
merged_activity_sleep <- mutate(activity_sleep, 
                                Weekday = wday(Date, label = TRUE))
view(merged_activity_sleep)
# Summarize health data averages by weekday
summarized_activity_sleep <- merged_activity_sleep %>% 
  group_by(Weekday) %>% 
  summarise(AvgDailySteps = mean(TotalSteps, na.rm = TRUE),
            AvgAsleepMinutes = mean(TotalMinutesAsleep, na.rm = TRUE),
            AvgSedentaryMinutes = mean(SedentaryMinutes, na.rm = TRUE),
            AvgLightlyActiveMinutes = mean(LightlyActiveMinutes, na.rm = TRUE),
            AvgFairlyActiveMinutes = mean(ModeratelyActiveMinutes, na.rm = TRUE),
            AvgVeryActiveMinutes = mean(VeryActiveMinutes, na.rm = TRUE), 
            AvgCalories = mean(Calories, na.rm = TRUE),
            AvgActiveMinutes = mean(TotalActiveMinutes, na.rm = TRUE))
view(summarized_activity_sleep)

I would also like to create a user summary table to show how users differ by summarizing by ID. This will deepen our understanding of users' typical activity and sleep profiles.

In [None]:
# Per-User Summary Table
user_summary <- activity_sleep %>%
  group_by(Id) %>%
  summarise(
    DaysTracked = n(),
    AvgSteps = mean(TotalSteps, na.rm = TRUE),
    AvgCalories = mean(Calories, na.rm = TRUE),
    AvgMinutesAsleep = mean(TotalMinutesAsleep, na.rm = TRUE),
    AvgSedentary = mean(SedentaryMinutes, na.rm = TRUE),
    AvgLightActive = mean(LightlyActiveMinutes, na.rm = TRUE),
    AvgFairlyActive = mean(ModeratelyActiveMinutes, na.rm = TRUE),
    AvgVeryActive = mean(VeryActiveMinutes, na.rm = TRUE),
    AvgActiveMinutes = mean(TotalActiveMinutes, na.rm = TRUE),
    AvgTimeInBed = mean(TotalTimeInBed, na.rm = TRUE))
view(user_summary)

# Step 5 - Share
**Share data through the art of visualization**

To effectively communicate insights from the Fitbit user data, we apply visualization techniques that highlight key trends in activity, sleep, and health behaviors.
Visualizations are selected based on the nature of the data and the story we aim to tell:
* Line charts illustrate changes in metrics over time.
* Column charts compare activity intensity levels across days of the week.
* Scatter plots explore relationships between variables, such as steps taken and calories burned.

These visual elements help translate complex datasets into clear, actionable insights that can inform Bellabeat’s product and marketing strategies.


## Unique Participants Per Data Set
The following bar chart shows the number of unique participants in each of the following categories: daily activity, hourly activity, daily sleep, heart rate, and weight info log. 

In [None]:
datasets <- list(
  "Daily Activity" = daily_activity,
  "Daily Sleep" = sleep_day,
  "Hourly Activity" = hourly_steps_calories,
  "Heart Rate" = heartrate,
  "Weight Log" = weight_info_log)
participant_count <- tibble(
  data_type = names(datasets),
  participant_number = sapply(datasets, function(df) n_distinct(df$Id)))
participants <- participant_count %>%
  ggplot(aes(x = reorder(data_type, desc(participant_number)),
             y = participant_number,
             fill = data_type)) +
  geom_col(position = "dodge") +
  geom_label(aes(label = participant_number),
             fill = "white", colour = "black", vjust = 1) +
  scale_y_continuous(breaks = pretty_breaks(n = 10)) +
  scale_fill_viridis_d(option = "plasma") +
  labs(title = "Unique Participants by Data Type",
       x = "Data Type",
       y = "Number of Participants",
       caption = "Fitbit Fitness Tracker Data") +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(size = 12),
        plot.title = element_text(hjust = 0.25, size = 16, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 15)),
        axis.title.y = element_text(margin = margin(r = 15))
       )
participants

In [None]:
manual_auto_count <- weight_info_log %>%
  group_by(IsManualReport) %>%
  summarize(count = n())
manual_auto_count

**Participant Logging Activity Observations:**

* FitBit users logged Daily Activity and Hourly Activity most frequently (100% participation).
* Weight Log had the lowest participant use (24.2%).
* Heart Rate data was logged by less than half of the users (42.4%).
* Daily Sleep data was logged by 72.7% of total users.
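
The percentages above can be reproduced with simple arithmetic. In the sketch below, the weight log and heart rate counts (8 and 14) are the ones reported earlier in this notebook; the total of 33 participants and the 24 sleep participants are assumptions inferred from the stated percentages.

```r
# Participant counts per data type (33 and 24 are inferred, see the note above)
participant_counts <- c(daily_activity = 33, daily_sleep = 24, heart_rate = 14, weight_log = 8)

# Participation as a percentage of all daily activity participants
participation_rate <- round(100 * participant_counts / participant_counts[["daily_activity"]], 1)

print(participation_rate)
```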

**Assumptions:**
* Weight Log entries were mostly manual (61.2%), which may have discouraged frequent logging. Manual entry at a consistent time (e.g., 12:00 AM) could be tedious compared to other metrics that are automatically tracked by the device, contributing to the small sample size for weight data.
* Heart rate data is tracked every 5 seconds, resulting in a large and detailed dataset. Due to the volume and potential device or user variability, it is possible that not all users had continuous or accurate cardiovascular tracking.
* Sleep is automatically tracked based on body movement during the night ([Bellabeat FAQ](https://www.qvc.com/footers/ws/pdf/Bellabeat_FAQs.pdf)). Wearing a device consistently overnight may be uncomfortable for some users, potentially affecting sleep data participation.

**Key Takeaway**

These observations highlight that **automatic tracking tends to yield higher participation rates** compared to manual input tasks.

## Plotting Activity by Weekday

In [None]:
activity_weekday_summary <- merged_activity_sleep %>%
  group_by(Weekday) %>%
  summarise(
    total_active_minutes = sum(TotalActiveMinutes, na.rm = TRUE),
    avg_active_minutes = mean(TotalActiveMinutes, na.rm = TRUE))
activity_weekday_plot <- ggplot(activity_weekday_summary, aes(x = Weekday, y = total_active_minutes, fill = Weekday)) +
  geom_col() +
  geom_label(aes(label = round(avg_active_minutes, 1)),
             fill = "white", color = "black",
             vjust = -0.5,
             nudge_y = 300) +
  scale_y_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.15))) +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Total Active Minutes by Day of the Week",
    x = NULL,
    y = "Total Active Time (minutes)",
    caption = "Data by Fitbit Fitness Tracker"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    text = element_text(size = 10),
    plot.title = element_text(hjust = 0, size = 16, face = "bold"),
    axis.title.y = element_text(margin = margin(r = 15)))
activity_weekday_plot

sedentary_weekday_summary <- merged_activity_sleep %>%
  group_by(Weekday) %>%
  summarise(
    total_sedentary_minutes = sum(SedentaryMinutes, na.rm = TRUE),
    avg_sedentary_minutes = mean(SedentaryMinutes, na.rm = TRUE))
sedentary_weekday_plot <- ggplot(sedentary_weekday_summary, aes(x = Weekday, y = total_sedentary_minutes, fill = Weekday)) +
  geom_col() +
  geom_label(aes(label = round(avg_sedentary_minutes, 1)), 
             fill = "white", colour = "black", 
             vjust = -0.5, 
             nudge_y = 300) +  
  scale_y_continuous(labels = comma, 
                     expand = expansion(mult = c(0, 0.15))) + 
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Total Sedentary Minutes by Day of the Week",
    x = NULL,
    y = "Total Sedentary Time (minutes)",
    caption = "Data by Fitbit Fitness Tracker"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    text = element_text(size = 10),
    plot.title = element_text(hjust = 0, size = 16, face = "bold"),
    axis.title.y = element_text(margin = margin(r = 15)))
sedentary_weekday_plot

**Activity and Sedentary Time by Weekday Observations:**
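* Total active minutes are lowest on Sunday and Monday (daily averages of 208.5 and 229.2 min, respectively).
* Total active minutes are highest on Tuesday, Wednesday, and Thursday (daily averages of 234.6, 223.7, and 216.8 min). Because the bars show totals, weekdays with more logged days can rank higher even when their daily average is lower.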
* Sedentary minutes peak midweek (Tuesday-Thursday) and are lowest from Friday to Monday.
* Sedentary minutes show less variation between users than activity minutes.
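
The charts above plot totals as bar heights and label each bar with the daily average, and the two measures can rank weekdays differently when some weekdays have more logged days than others. The sketch below illustrates this on made-up numbers; `toy_days` is hypothetical and not drawn from the dataset.

```r
# Hypothetical records: Tuesday has three logged days, Sunday only two
toy_days <- data.frame(
  Weekday = c("Tue", "Tue", "Tue", "Sun", "Sun"),
  TotalActiveMinutes = c(200, 210, 190, 250, 240)
)

# Total minutes per weekday (what the bar heights show)
totals <- tapply(toy_days$TotalActiveMinutes, toy_days$Weekday, sum)

# Average minutes per logged day (what the labels show)
averages <- tapply(toy_days$TotalActiveMinutes, toy_days$Weekday, mean)

print(totals)
print(averages)
```

In this toy case Tuesday has the larger total but the smaller average, so comparing averages is the safer basis for statements about how active users are on a given day.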

**Assumptions:**
* Higher activity midweek may align with structured weekday routines.
* Increased sedentary time could reflect work-related inactivity (e.g. desk jobs).
* Weekend routines may disrupt consistent device usage or structured exercise.

**Key Takeaway**
* Users in this sample are both most active and most sedentary midweek, suggesting that **weekday productivity and structured schedules drive engagement**. 

## Daily Steps vs Calories

Let's first look at a scatter plot examining the relationship between total daily steps and calories burned.

In [None]:
install.packages("viridis")
library(scales)
library(dplyr)
library(viridis)

average_point <- activity_sleep %>%
  summarise(
    avg_steps = mean(TotalSteps, na.rm = TRUE),
    avg_calories = mean(Calories, na.rm = TRUE)
  )
daily_steps_calories <- ggplot(activity_sleep, aes(x = TotalSteps, y = Calories, color = TotalDistance)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, linewidth = 0.8, color = "#D44292") +
  geom_vline(xintercept = average_point$avg_steps, linetype = "dashed", color = "#4B2991") +
  geom_hline(yintercept = average_point$avg_calories, linetype = "dashed", color = "#EA4F88") +
  geom_point(data = average_point, 
             aes(x = avg_steps, y = avg_calories), 
             inherit.aes = FALSE,
             color = "black", size = 3, shape = 18) +
  annotate("text", x = Inf, y = Inf, 
           label = paste0("Avg Steps: ", scales::comma(round(average_point$avg_steps, 0)),
                          "\nAvg Calories: ", scales::comma(round(average_point$avg_calories, 0))),
           hjust = 1.1, vjust = 2, size = 3.5, color = "black") +
  scale_color_viridis_c(direction = -1) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Total Daily Steps vs. Daily Calories",
    x = "Total Steps",
    y = "Calories Burned",
    color = "Distance (miles)",
    caption = "Fitbit Fitness Tracker Data"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 10, family = "Calibri"),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(color = "#4B2991", margin = margin(t = 15)),
    axis.title.y = element_text(color = "#EA4F88", margin = margin(r = 15)))

interactive_daily <- ggplotly(daily_steps_calories)
interactive_daily

**Steps and Calories Observations**
* Users in this sample average 7,638 steps and 2,304 calories per day.
* Calories burned increase with total steps, indicating a positive correlation.
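
A correlation coefficient is one way to put a number on the relationship seen in the scatter plot. The sketch below computes a Pearson correlation on made-up values; `toy_days` is hypothetical, and on the real data the same call would take `activity_sleep$TotalSteps` and `activity_sleep$Calories`.

```r
# Hypothetical daily records
toy_days <- data.frame(
  TotalSteps = c(2000, 5000, 7500, 10000, 14000),
  Calories = c(1700, 2000, 2300, 2500, 3000)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
step_calorie_correlation <- cor(toy_days$TotalSteps, toy_days$Calories, method = "pearson")

print(round(step_calorie_correlation, 3))
```

A high coefficient describes association only; it does not show that adding steps causes a specific calorie change for any individual user.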

**Key Takeaway**

For users with weight loss goals, focusing on increasing higher-intensity activity could improve outcomes. The app could offer personalized exercise goals based on calories burned during high-intensity movement or distance covered.

## Hourly Sleep and Sedentary Time

In [None]:
library(plotly)

# Scatter plot of sedentary hours vs. hours asleep, colored by calories burned
sleep_sedentary <- ggplot(
  data = merged_activity_sleep, 
  aes(x = SedentaryMinutes / 60, y = TotalMinutesAsleep / 60, color = Calories)
) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", formula = y ~ x, linewidth = 0.8, color = "#bc3754") + 
  scale_color_viridis_c(direction = -1) + 
  labs(
    title = "Asleep Hours by Sedentary Hours",
    x = "Sedentary Hours",
    y = "Asleep Hours",
    color = "Calories Burned", 
    caption = "Fitbit Fitness Tracker Data"
  ) +
  theme_minimal() +
  theme(
    text = element_text(),
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )

# Make it interactive!
sleep_sedentary_plotly <- ggplotly(sleep_sedentary)

# View the interactive plot
sleep_sedentary_plotly


**Hourly Sedentary and Sleep Observations:**
* Sleep duration remains stable (7-9 hours) at moderate sedentary levels (5-12 hours).
* Sleep sharply declines as sedentary time exceeds 14 hours.
* Users with lower sedentary hours burn more calories.
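
One way to check observations like these numerically is to group days into sedentary-time bands and compare average sleep per band. The sketch below does this on made-up values; `toy_days` and the band cut-offs (12 and 14 hours) are assumptions chosen to mirror the ranges mentioned above.

```r
# Hypothetical daily records (minutes)
toy_days <- data.frame(
  SedentaryMinutes = c(360, 540, 660, 780, 900, 960),
  TotalMinutesAsleep = c(450, 470, 440, 430, 300, 280)
)

# Assign each day to a sedentary band, measured in hours
toy_days$SedentaryBand <- cut(
  toy_days$SedentaryMinutes / 60,
  breaks = c(0, 12, 14, 24),
  labels = c("<=12 h", "12-14 h", ">14 h")
)

# Average hours asleep within each band
avg_sleep_hours <- tapply(toy_days$TotalMinutesAsleep / 60, toy_days$SedentaryBand, mean)

print(round(avg_sleep_hours, 2))
```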

**Assumptions**
* High sedentary time may reflect disengagement from physical activity, contributing to poor sleep quality.
* Users with higher activity levels (lower sedentary time) maintain more consistent sleep patterns.
* Users who are more sedentary might be experiencing irregular sleep patterns, fatigue, or lower overall physical engagement.
* Lower sleep time with high sedentary time could indicate daytime inactivity without the restorative benefits of proper sleep (e.g., sitting on the couch without restful sleep).
* Users who burn more calories tend to have lower sedentary time, suggesting more physical engagement throughout the day.

**Key Takeaway**

Excessive sedentary behavior correlates with reduced sleep among users in this sample, suggesting that **encouraging regular movement may support healthier sleep habits**.

## Proportion of Activity Intensities

This pie chart breaks down the distribution of tracked minutes by activity intensity level: sedentary, lightly active, moderately active, and very active. 

In [None]:
library(plotly)

activity_distribution <- daily_activity %>%
  summarize(
    SedentaryMinutes = sum(SedentaryMinutes, na.rm = TRUE),
    LightlyActiveMinutes = sum(LightlyActiveMinutes, na.rm = TRUE),
    ModeratelyActiveMinutes = sum(ModeratelyActiveMinutes, na.rm = TRUE),
    VeryActiveMinutes = sum(VeryActiveMinutes, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(),
               names_to = "IntensityCategory",
               values_to = "TotalMinutes") %>%
  mutate(Share = TotalMinutes / sum(TotalMinutes) * 100,
         AverageMinutesPerDay = TotalMinutes / nrow(daily_activity))

activity_pie <- plot_ly(
  activity_distribution,
  labels = ~IntensityCategory,
  values = ~Share,
  type = "pie",
  sort = FALSE,
  textinfo = "percent",
  textposition = "outside",
  text = ~paste(round(AverageMinutesPerDay, 1), "minutes/day"),
  hoverinfo = "percent+text",
  outsidetextfont = list(color = "black"),
  insidetextfont = list(color = "black"),
  marker = list(colors = c("#0D0887", "#6A00A8", "#CB4679", "#F0F921"),
                line = list(color = "white", width = 1)),
  showlegend = TRUE,
  legendgrouptitle = list(text = "Intensity levels")
)

activity_pie <- activity_pie %>%
  layout(
    title = list(text = "Distribution of Activity Intensity Levels", font = list(size = 12)),
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
  )

activity_pie <- activity_pie %>%
  add_annotations(
    text = "Fitbit Fitness Tracker Data",
    showarrow = FALSE,
    x = 1,
    y = 0,
    font = list(size = 10, family = "Calibri"),
    align = "right",
    valign = "bottom"
  )

activity_pie

**Activity Intensity Observations**
* 81.3% of total minutes are spent in a sedentary state.
* Lightly active minutes account for 15.8%, the second most common category.
* Very active minutes (1.74%) are slightly higher than moderately active minutes (1.11%), which is an unusual but noticeable pattern.

**Assumptions**
* The high proportion of sedentary time suggests that users in this sample may largely be working adults with desk-bound jobs.
* Very active minutes exceeding moderately active minutes may indicate that users engage in short bursts of high-intensity exercise (e.g., HIIT workouts) but remain inactive for much of the day.

**Key Takeaway**

Users in this sample spend the vast **majority of their time sedentary**, highlighting a potential opportunity to promote more consistent moderate and vigorous physical activity throughout the day.

## Hourly Steps and Calories

While individual daily variations are limited, this heatmap highlights that step activity peaks at mid-morning and early evening hours across weekdays.

In [None]:
# heat map
heatmap_data <- hourly_steps_calories %>%
  mutate(Weekday = wday(Date, label = TRUE, abbr = FALSE),  
         Hour = as.numeric(format(strptime(Time, "%H:%M"), "%H"))
  ) %>%  
  group_by(Weekday, Hour) %>%
  summarize(AverageSteps = mean(StepTotal, na.rm = TRUE), .groups = "drop")  # .groups = "drop" already ungroups

heatmap_plot <- ggplot(heatmap_data, aes(x = Weekday, y = Hour, fill = AverageSteps)) +
  geom_tile(color = "black", size = 0.6) +  
  scale_fill_viridis_c(option = "plasma", labels = scales::comma) + 
  labs(
       title = "Hourly Steps Heatmap by Day of the Week",
       x = "Day of the Week",
       y = "Hour of Day",
       fill = "Average Steps",
       caption = "Fitbit Fitness Tracker Data") +
  theme_minimal() +
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold", margin = margin(t = 15)),
        axis.title.y = element_text(size = 12, face = "bold", margin = margin(r = 15))
       )
heatmap_plotly <- ggplotly(heatmap_plot) %>%
  layout(
    height = 900,  # make the plot taller
    margin = list(l = 60, r = 40, t = 80, b = 60)  # left, right, top, bottom margins in pixels
  )

heatmap_plotly

**Hourly Steps by Weekday Observation:**
* Step activity peaks on Wednesday evenings (4-5 pm) and Saturday early afternoons (around 1 pm).
* Lowest activity occurs consistently between 11 pm and 5 am throughout the week, aligning with typical sleep hours.
* Moderate step activity is also visible during mid-morning and early evening hours on weekdays.
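
Observations like these can also be checked numerically rather than read off the plot. The sketch below ranks weekday-hour cells by average steps and averages the overnight window. `heatmap_demo` is a made-up stand-in with the same three columns as `heatmap_data`; its values are illustrative only.

```r
# Toy stand-in for heatmap_data (made-up values)
heatmap_demo <- data.frame(
  Weekday      = c("Wednesday", "Wednesday", "Saturday", "Saturday", "Monday"),
  Hour         = c(16, 17, 13, 3, 10),
  AverageSteps = c(610, 640, 700, 5, 420)
)

# Rank weekday-hour cells from most to fewest average steps
ranked <- heatmap_demo[order(-heatmap_demo$AverageSteps), ]

# Keep the three busiest cells
top_cells <- head(ranked, 3)
top_cells

# Average steps in the overnight window (11 pm up to, but not including, 5 am)
overnight <- heatmap_demo$Hour >= 23 | heatmap_demo$Hour < 5
overnight_mean <- mean(heatmap_demo$AverageSteps[overnight])
overnight_mean
```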

**Assumptions:**
* Bellabeat users likely have more flexible schedules on Saturday afternoons and Wednesday evenings.
* Mid-morning weekday activity could indicate commuting or short movement breaks during work hours.

**Key Takeaway**

Bellabeat users show structured weekday activity patterns and elevated movement during Saturday afternoons and Wednesday evenings, suggesting that **free time availability significantly influences physical activity levels**.

## Heart Rate Distribution

In [None]:
ggplot(heartrate_daily, aes(x = AvgHeartRate)) +
  geom_density(fill = "plum", alpha = 0.6) +
  labs(
    title = "Distribution of Average Daily Heart Rates",
    x = "Average Heart Rate (bpm)",
    y = "Density"
  ) +
  theme_minimal()

In [None]:
# Prepare heartrate data
heartrate_ready <- heartrate %>%
  mutate(Weekday = wday(Date, label = TRUE, abbr = FALSE))  # Get full weekday name

heartrate_ready <- heartrate_ready %>%
  mutate(Weekday = factor(Weekday,
                          levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

ggplot(heartrate_ready, aes(x = Weekday, y = Value, fill = Weekday)) +
  geom_boxplot(alpha = 0.7, outlier.color = "#21908Cff", outlier.shape = 1) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_viridis_d(option = "plasma") +
  labs(title = "Heart Rate Variability Across Days of the Week",
       x = "Day of the Week",
       y = "Heart Rate (BPM)",
       fill = "Day",
       caption = "Fitbit Fitness Tracker Data") +
  theme_minimal() +
  theme(text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))


**Heart Rate Observations**
* Most daily heart rates fall between ~60 and ~90 BPM (beats per minute).
* Median heart rate increases slightly toward the weekend.
* Weekends show more variability.
* Many extreme outliers (above 150-200 BPM) appear on every day of the week.
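
To put numbers behind a boxplot reading like this, the median, interquartile range, and outlier count can be computed per weekday. The sketch below uses a made-up `heartrate_demo` table with the same `Weekday` and `Value` columns as `heartrate_ready`, and assumes the conventional 1.5 × IQR rule for flagging outliers; the helper `count_outliers` is hypothetical and not part of the notebook.

```r
# Toy stand-in for heartrate_ready (made-up values)
heartrate_demo <- data.frame(
  Weekday = rep(c("Monday", "Saturday"), each = 6),
  Value   = c(62, 68, 70, 72, 75, 160,
              60, 70, 78, 85, 92, 190)
)

# Hypothetical helper: count readings beyond 1.5 * IQR from the quartiles
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# One row per weekday: median, IQR, and number of flagged outliers
summary_by_day <- t(sapply(
  split(heartrate_demo$Value, heartrate_demo$Weekday),
  function(x) c(Median = median(x), IQR = IQR(x), Outliers = count_outliers(x))
))

summary_by_day
```

Comparing the `Median` and `IQR` columns across rows gives a direct check on whether weekend values are higher or more spread out than weekday values.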

**Assumptions**
* Higher weekend heart rates may be linked to more physical and social activity.
* Higher variability on weekends could indicate less consistent routines.

**Key Takeaway**

Overall, Bellabeat users maintain a consistent daily heart rate throughout the week, with a slight increase and greater variability observed during the weekend, possibly due to changes in lifestyle and physical activity patterns. The presence of high heart rate outliers suggests periods of intense activity across all days.

# Step 6 - Act

Now that you have finished creating your visualizations, act on your findings. Prepare the deliverables you have been asked to create, including the high-level recommendations based on your analysis.

**Guiding questions**
* What is your final conclusion based on your analysis?
* How could your team and business apply your insights?
* What next steps would you or your stakeholders take based on your findings?
* Is there additional data you could use to expand on your findings?

## Concluding Findings & Business Recommendations

1. **Heart Rate Monitoring and Alerts**
- Enable customizable heart rate alerts in the app when a user’s heart rate falls below or exceeds healthy thresholds (e.g., below 40 BPM or above 200 BPM), based on medical guidelines.
- Highlight healthy heart rate zones and educate users about moderate- and high-intensity heart rate targets through app notifications.
2. **Boost Sleep and Weight Logging Participation**
3. **Increase Daily Activity**
- Use scientific guidelines (e.g., 8,000–10,000 steps per day and moderate to high-intensity exercise) to design motivational reminders and goal notifications.
- Send personalized nudges after work hours (5–7 PM), when users are most active, to encourage walking, running, or gym activity.
- Educate users on benefits like reduced mortality risk associated with higher step counts.
4. **Combat Sedentary Behavior**
- Implement hourly movement reminders encouraging users to stand or walk briefly, particularly targeting sedentary work hours.
- Highlight health risks of sedentary behavior (based on NIH findings) through gentle educational prompts in the app.
5. **Promote Healthy Weight and Calorie Awareness**
- Introduce healthy eating guidance by suggesting low-calorie meals and snacks through app notifications.
- Encourage partnerships with nutrition-tracking apps like **MyFitnessPal** or MyNetDiary to integrate food and calorie tracking, providing a complete health management system.
6. **Address Women's Health Holistically**
- Add menstrual cycle tracking to the Bellabeat app to help women understand hormonal impacts on mood, energy, sleep, and physical performance.
- Use cycle data to deliver personalized recommendations and help women spot potential health issues early.
7. **Target Audience Positioning**
- Focus marketing on professional, full-time working women who balance demanding careers with the desire to build healthier routines.
- Position Bellabeat not just as a fitness tracker, but as a daily wellness companion: a supportive friend helping women maintain balance in personal, professional, and health goals.

*Key Message: Bellabeat empowers modern women to balance life, work, and health through personalized recommendations, education, and motivation.*

## Overall Strategic Recommendations

- Enhance user engagement by combining fitness, sleep, weight, nutrition, and women’s health tracking in one app.
- Use evidence-based guidelines for all health prompts and notifications to build user trust.
- Strengthen community through gamified achievements, social connections, and rewards.
- Expand personalization by using behavioral and biological data (like heart rate, steps, cycle tracking) to adapt recommendations.