# Explainer notebook
This notebook answers the following questions related to our final assignment in **Social data analysis and visualization (02806)** Spring 2024.

The website with our visualizations and accompanying text can be found on [Medium]()    

The website for the first part of this assignment can be found here: [Project Assignment A](https://clbokea.github.io/)

## Table of Contents
* Motivation
    * What is your dataset?
    * Why did you choose this/these particular dataset(s)?
    * What was your goal for the end user's experience?
* Basic stats
    * Write about your choices in data cleaning and preprocessing
    * Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.
* Data Analysis
    * Describe your data analysis and explain what you've learned about the dataset.
    * If relevant, talk about your machine-learning.
    * Genre. Which genre of data story did you use?
    * Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
    * Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?
* Visualizations.
    * Explain the visualizations you've chosen.
    * Why are they right for the story you want to tell?
* Discussion. 
    * What went well?
    * What is still missing? What could be improved? Why?
* Contributions. 

In [24]:
# Setup of notebook
import os
import json
import numpy as np 
import pandas as pd
print('Setup complete!')

Setup complete!


## Motivation

### What is your dataset?

Our dataset consists of data collected from a Garmin watch worn on the wrist for the past three years. Additionally, we have correlated the quantitative data from the Garmin with qualitative data extracted from a personal diary. The Garmin dataset is derived from the following files.

In [18]:
garmin_base_dir = "../files/Garmin_20241403"
#%ls -R {garmin_base_dir} 

Our main focus has been on these JSON files:

In [19]:
folder = os.path.join(garmin_base_dir, "DI_CONNECT", "DI-Connect-Aggregator")

In [20]:
os.listdir(folder)

['HydrationLogFile_2021-03-10_2021-06-18.json',
 'UDSFile_2012-09-14_2012-12-23.json',
 'UDSFile_2011-05-03_2011-08-11.json',
 'UDSFile_2011-08-11_2011-11-19.json',
 'HydrationLogFile_2021-09-26_2022-01-04.json',
 'UDSFile_2022-07-24_2022-11-01.json',
 'HydrationLogFile_2022-01-04_2022-04-14.json',
 'HydrationLogFile_2021-06-18_2021-09-26.json',
 'HydrationLogFile_2023-08-27_2023-12-05.json',
 'UDSFile_2022-11-01_2023-02-09.json',
 'UDSFile_2023-08-28_2023-12-06.json',
 'HydrationLogFile_2020-08-22_2020-11-30.json',
 'HydrationLogFile_2020-05-14_2020-08-22.json',
 'HydrationLogFile_2022-04-14_2022-07-23.json',
 'UDSFile_2005-11-10_2006-02-18.json',
 'HydrationLogFile_2023-05-19_2023-08-27.json',
 'UDSFile_2014-08-15_2014-11-23.json',
 'UDSFile_2014-01-27_2014-05-07.json',
 'HydrationLogFile_2023-02-08_2023-05-19.json',
 'HydrationLogFile_2022-07-23_2022-10-31.json',
 'UDSFile_2021-09-27_2022-01-05.json',
 'UDSFile_2013-07-11_2013-10-19.json',
 'UDSFile_2022-01-05_2022-04-15.json',
 'Hy

We then read the content of all these files and combined it into one DataFrame.

In [25]:
# Setting up paths and configurations
garmin_base_dir = "../files/Garmin_20241403"
di_connect_path = os.path.join(garmin_base_dir, "DI_CONNECT", "DI-Connect-Aggregator")
columns_of_interest = ['calendarDate', 'totalKilocalories', 'activeKilocalories', 'restingCaloriesFromActivity', 
                       'totalSteps', 'moderateIntensityMinutes', 'vigorousIntensityMinutes', 'userIntensityMinutesGoal', 
                       'minHeartRate', 'maxHeartRate', 'restingHeartRate', 'minAvgHeartRate', 'maxAvgHeartRate',
                       'allDayStress', 'bodyBattery']

# Function to load JSONs and combine them into a filtered DataFrame
def load_and_filter_json(path, start_year=2020):
    all_dfs = []
    for root, _, files in os.walk(path):
        json_files = [f for f in sorted(files) if f.startswith('UDS') and f.endswith('.json')]
        for file in json_files:
            with open(os.path.join(root, file), 'r') as f:
                data = json.load(f)
            df = pd.DataFrame(data) if isinstance(data, list) else pd.DataFrame([data])
            all_dfs.append(df)

    # Combining and filtering the data
    if all_dfs:
        full_df = pd.concat(all_dfs, ignore_index=True)
        full_df['calendarDate'] = pd.to_datetime(full_df['calendarDate'])
        filtered_df = full_df.loc[full_df['calendarDate'].dt.year >= start_year, columns_of_interest]
        return filtered_df
    return pd.DataFrame()  # Return empty DataFrame if no data was loaded

# Apply the function and display the resulting DataFrame
focus_df = load_and_filter_json(di_connect_path)
focus_df.head()

| | calendarDate | totalKilocalories | activeKilocalories | restingCaloriesFromActivity | totalSteps | moderateIntensityMinutes | vigorousIntensityMinutes | userIntensityMinutesGoal | minHeartRate | maxHeartRate | restingHeartRate | minAvgHeartRate | maxAvgHeartRate | allDayStress | bodyBattery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 2020-06-18 | 1923.0 | 446.0 | | 13987.0 | 11.0 | 0.0 | 180.0 | 64.0 | 128.0 | 68.0 | 65.0 | 121.0 | {'userProfilePK': 86607424, 'calendarDate': '2... | {'userProfilePK': 86607424, 'calendarDate': '2... |
| 15 | 2020-06-19 | 1885.0 | 408.0 | | 12455.0 | 2.0 | 10.0 | 180.0 | 55.0 | 160.0 | 64.0 | 56.0 | 158.0 | {'userProfilePK': 86607424, 'calendarDate': '2... | {'userProfilePK': 86607424, 'calendarDate': '2... |
| 16 | 2020-06-20 | 2456.0 | 975.0 | | 26379.0 | 20.0 | 88.0 | 180.0 | 53.0 | 159.0 | 62.0 | 54.0 | 156.0 | {'userProfilePK': 86607424, 'calendarDate': '2... | {'userProfilePK': 86607424, 'calendarDate': '2... |
| 17 | 2020-06-21 | 2202.0 | 734.0 | | 12401.0 | 52.0 | 4.0 | 180.0 | 50.0 | 139.0 | 60.0 | 52.0 | 134.0 | {'userProfilePK': 86607424, 'calendarDate': '2... | {'userProfilePK': 86607424, 'calendarDate': '2... |
| 18 | 2020-06-22 | 2017.0 | 549.0 | | 15256.0 | 5.0 | 43.0 | 180.0 | 53.0 | 152.0 | 60.0 | 54.0 | 149.0 | {'userProfilePK': 86607424, 'calendarDate': '2... | {'userProfilePK': 86607424, 'calendarDate': '2... |


### Why did you choose this/these particular dataset(s)?

Our primary objective has been to write an article for publication on Medium that explores the insights one can obtain from personal Garmin data. Accordingly, our narrative centers on a personal story rather than a broad analysis applicable to people in general. We therefore chose not to examine datasets from a wide range of individuals and focused exclusively on the detailed tracking data of one specific person. In essence, our aim was not to draw general conclusions about health data from a large population, but to explore a distinctly personal perspective.


### What was your goal for the end user's experience?


Our primary objective has been to inspire readers of our article to explore their personal data in ways that might not be immediately apparent to everyone. We have opted for a narrative that emphasizes personal stories to infuse the article with energy, inspiration, and creativity, adopting an approach that leans more towards the artistic than the strictly scientific. Our hope is that this approach will resonate with readers and encourage them to delve deeper into their own data.

## Basic stats.

### Write about your choices in data cleaning and preprocessing
We have already described which files we chose to work with. We identified them by looping through all files in the export, loading the content of each file into a DataFrame, and printing its first 5 rows. Most of the files turned out not to contain any data relevant in this context, but the files in the "DI-Connect-Aggregator" folder contained daily health data, and the files in "DI-Connect-Wellness" contained sleep data. These are the data we have focused on.
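The file inspection described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact code we ran: the function name `preview_json_files` and the parameter `n_rows` are hypothetical, and it assumes each JSON file holds either a list of records or a single object.

```python
import json
import os

import pandas as pd


def preview_json_files(base_dir, n_rows=5):
    """Return {relative_path: first n_rows as a DataFrame} for every JSON file under base_dir."""
    previews = {}
    for root, _, files in os.walk(base_dir):
        for name in sorted(files):
            if not name.endswith(".json"):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, "r", encoding="utf-8") as f:
                    data = json.load(f)
                # A list becomes one row per record; a single object becomes one row
                df = pd.DataFrame(data) if isinstance(data, list) else pd.DataFrame([data])
            except (json.JSONDecodeError, UnicodeDecodeError, ValueError):
                continue  # skip files that cannot be parsed into a table
            previews[os.path.relpath(path, base_dir)] = df.head(n_rows)
    return previews
```

Calling `preview_json_files(garmin_base_dir)` and printing each entry gives a quick overview of which folders hold tabular data worth a closer look.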

We ended up with 22 JSON files from the "DI-Connect-Aggregator" folder and concatenated the data from these files into a single DataFrame. All data from before 2020 was removed, since the watch was not worn before then.

In the DataFrame, 2 columns (aggregatorList and bodyBatteryStatList) contained dictionaries with key-value pairs of data. To make these data more direct to work with, we extracted their contents into cells of the parent DataFrame.
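One possible way to unpack such a dictionary column is sketched below. It expands each dictionary into separate prefixed columns with `pd.json_normalize`, which may differ from the extraction we actually performed. The sample data and the key `exampleValue` are made up for illustration, the helper `expand_dict_column` is hypothetical, and the sketch assumes every cell in the column holds a dictionary (no missing values).

```python
import pandas as pd

# Hypothetical sample: the keys inside the nested dictionaries are invented for illustration
df = pd.DataFrame({
    "calendarDate": ["2020-06-18", "2020-06-19"],
    "allDayStress": [
        {"userProfilePK": 1, "exampleValue": 10},
        {"userProfilePK": 1, "exampleValue": 20},
    ],
})


def expand_dict_column(frame, column):
    """Replace a column of dictionaries with one prefixed column per key."""
    expanded = pd.json_normalize(frame[column].tolist())
    expanded.index = frame.index  # keep rows aligned with the parent DataFrame
    expanded = expanded.add_prefix(f"{column}.")
    return frame.drop(columns=column).join(expanded)


flat = expand_dict_column(df, "allDayStress")
print(flat.columns.tolist())
```

Prefixing the new columns with the original column name keeps keys such as `calendarDate`, which appear both in the parent row and inside the dictionaries, from colliding.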

### Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

## Data Analysis
* Describe your data analysis and explain what you've learned about the dataset.
* If relevant, talk about your machine-learning.
* Genre. Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

## Visualizations.
* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?

## Discussion. 
Think critically about your creation
* What went well?
* What is still missing? What could be improved? Why?

## Contributions. 
Who did what?
* You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That's what you should explain).
* It is not OK simply to write "All group members contributed equally".

## Make sure that you use references when they're needed and follow academic standards.
Handing in the assignment: Simply upload the link to your website via DTU Learn.


