

# Exploring the Fitbit Kaggle Dataset

This notebook aims to explore and analyze the Fitbit dataset. The dataset contains various CSV files with information on daily activities, calories burned, heart rate, sleep patterns, and more. We will start by listing the directory structure of the dataset, then proceed to load and inspect some key CSV files. Basic statistics and visualizations will be generated to understand the data better.

# Table of Contents

1. [Accessing the Kaggle Dataset](#accessing-the-kaggle-dataset)
2. [Data Exploration](#data-exploration)
   - [Segmentation of Dataset Population by the Distribution of Their Variables](#segmentation-of-dataset-population-by-the-distribution-of-their-variables)
3. [Machine Learning](#machine-learning)

# Accessing the Kaggle Dataset
[Back to Table of Contents](#table-of-contents)

This repository includes a GitHub Actions workflow that continuously tests the connection to the Kaggle API. The workflow receives the Kaggle username and key as environment variables, passed securely through GitHub secrets. It then checks that the files are downloaded and unzipped correctly and, if so, imports a CSV file with pandas and prints it.

In this research notebook we use an analogous operation to gather the dataset. Instead of using secrets, the machine that runs this notebook looks for the `kaggle.json` file, which should be in the `.kaggle/` folder, to authenticate and download the file. (This file should be listed in `.gitignore` so that it is not committed or shared.)
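Before calling the API, it can help to confirm that the credentials file is actually where the notebook expects it. The sketch below is only an illustration: `find_kaggle_credentials` is a hypothetical helper, not part of the `kaggle` package, and it assumes the token file holds a `username` and a `key` field under `~/.kaggle/`.

```python
import json
from pathlib import Path


def find_kaggle_credentials(config_dir=None):
    """Return the path to kaggle.json if it exists and holds both fields, else None.

    Hypothetical helper for illustration; it is not part of the kaggle package.
    """
    # Assumed default location: ~/.kaggle/kaggle.json
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    cred_path = config_dir / "kaggle.json"
    if not cred_path.is_file():
        return None
    with open(cred_path, encoding="utf-8") as fh:
        data = json.load(fh)
    # The token file is expected to carry a username and a key.
    if "username" in data and "key" in data:
        return cred_path
    return None
```

Calling `find_kaggle_credentials()` with no argument checks the default folder and returns `None` when the file is missing or incomplete, which gives a clearer signal than an authentication error raised later.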

First, we import the libraries needed for this step:


In [6]:
import os  # To create the data directory
import kaggle  # To download the dataset
import zipfile  # To extract the dataset

 

We can then create the data directory that will hold the dataset, `../data/`, using the `makedirs` function from the `os` module.



In [7]:
# Step 1: Ensure the data directory exists
data_dir = '../data/'
os.makedirs(data_dir, exist_ok=True)

Since our Kaggle credentials are saved in the `~/.kaggle/` folder (mine are not committed to this repository), the `kaggle.api.dataset_download_files` function uses them automatically to authenticate with the API.

Once we define the dataset identifier as `dataset = 'arashnic/fitbit'`, the function accesses the dataset and downloads it:

In [None]:
# Step 2: Use Kaggle API to download the dataset
dataset = 'arashnic/fitbit'  # The dataset identifier on Kaggle
kaggle.api.dataset_download_files(dataset, path=data_dir, unzip=False)
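Re-running the notebook repeats the download even when the archive is already on disk. One possible way to avoid that is sketched below. `download_if_missing` is a hypothetical helper, and the download step is passed in as a callable so the check stays independent of the Kaggle client.

```python
import os


def download_if_missing(zip_path, download):
    """Call download() only when zip_path does not exist yet.

    Hypothetical helper for illustration.
    Returns True if a download was triggered, False if it was skipped.
    """
    if os.path.exists(zip_path):
        return False
    download()
    return True
```

In this notebook the call could look like `download_if_missing('../data/fitbit.zip', lambda: kaggle.api.dataset_download_files(dataset, path=data_dir, unzip=False))`, reusing the same arguments as the cell above.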

Once the file has been downloaded, it can be unzipped as follows:

* First, the path of the zip file and the path where it will be extracted are defined:

In [10]:

zip_file_path = '../data/fitbit.zip'
extract_to_path = '../data/'

* Then, the zip file is extracted:

In [None]:
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_path)
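The two steps above can also be wrapped in small functions that inspect the archive before extracting it. This is a sketch rather than the notebook's own method: the names `list_zip_members` and `extract_zip` are hypothetical, and only standard `zipfile` calls (`namelist`, `testzip`, `extractall`) are used.

```python
import zipfile


def list_zip_members(zip_file_path, suffix=".csv"):
    """Return the sorted names of archive members that end with suffix."""
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        return sorted(name for name in zip_ref.namelist() if name.endswith(suffix))


def extract_zip(zip_file_path, extract_to_path):
    """Check the archive for corrupt members, then extract everything."""
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        # testzip() returns None when every member passes its CRC check.
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(extract_to_path)
```

Listing the members first shows which CSV files the archive contains without writing anything to disk, and the integrity check surfaces a partial download before extraction starts.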

Now that we have successfully unzipped the files, we can take a look at what we are dealing with by printing the structure of the files within the `../data/` folder:

In [12]:
# print the structure of the ../data/ directory

def print_directory_structure(root_dir, indent=''):
    for item in os.listdir(root_dir):
        item_path = os.path.join(root_dir, item)
        if os.path.isdir(item_path):
            print(f"{indent}📁 {item}/")
            print_directory_structure(item_path, indent + '    ')
        else:
            print(f"{indent}📄 {item}")

# Define the root directory
root_directory = '../data'

# Print the directory structure
print(f"Directory structure of {root_directory}:")
print_directory_structure(root_directory)


Directory structure of ../data:
📄 fitbit.zip
📁 mturkfitbit_export_4.12.16-5.12.16/
    📁 Fitabase Data 4.12.16-5.12.16/
        📄 minuteIntensitiesNarrow_merged.csv
        📄 minuteStepsWide_merged.csv
        📄 dailyActivity_merged.csv
        📄 hourlySteps_merged.csv
        📄 dailyIntensities_merged.csv
        📄 minuteCaloriesWide_merged.csv
        📄 hourlyCalories_merged.csv
        📄 minuteStepsNarrow_merged.csv
        📄 dailyCalories_merged.csv
        📄 minuteCaloriesNarrow_merged.csv
        📄 weightLogInfo_merged.csv
        📄 hourlyIntensities_merged.csv
        📄 dailySteps_merged.csv
        📄 minuteMETsNarrow_merged.csv
        📄 heartrate_seconds_merged.csv
        📄 minuteSleep_merged.csv
        📄 minuteIntensitiesWide_merged.csv
        📄 sleepDay_merged.csv
📁 mturkfitbit_export_3.12.16-4.11.16/
    📁 Fitabase Data 3.12.16-4.11.16/
        📄 minuteIntensitiesNarrow_merged.csv
        📄 dailyActivity_merged.csv
        📄 hourlySteps_merged.csv
        📄 hourlyCalorie
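
The listing above relies on `os.listdir`, which returns entries in arbitrary order. For later steps it may be more convenient to collect the CSV paths in a stable, sorted form. The following sketch uses `pathlib` for that. `find_csv_files` and `group_by_folder` are hypothetical helper names, not functions defined elsewhere in this notebook.

```python
from pathlib import Path


def find_csv_files(root_dir):
    """Return all CSV paths under root_dir, sorted for a stable listing."""
    return sorted(Path(root_dir).rglob("*.csv"))


def group_by_folder(csv_paths):
    """Map each parent folder name to the sorted CSV file names it contains."""
    groups = {}
    for path in csv_paths:
        groups.setdefault(path.parent.name, []).append(path.name)
    return {folder: sorted(names) for folder, names in groups.items()}
```

For example, `group_by_folder(find_csv_files('../data'))` would return one entry per export folder, which makes it easy to compare which files each export period provides. Grouping by the parent folder name assumes those names are unique, as they are in the listing above.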

# Data Exploration
[Back to Table of Contents](#table-of-contents)
<!-- Content for Data Exploration -->
## Segmentation of Dataset Population by the Distribution of Their Variables
[Back to Table of Contents](#table-of-contents)
<!-- Content for Segmentation of Dataset Population by the Distribution of Their Variables -->
# Machine Learning
[Back to Table of Contents](#table-of-contents)
<!-- Content for Machine Learning -->

