

# Exploring the Fitbit Kaggle Dataset

This notebook aims to explore and analyze the Fitbit dataset. The dataset contains various CSV files with information on daily activities, calories burned, heart rate, sleep patterns, and more. We will start by listing the directory structure of the dataset, then proceed to load and inspect some key CSV files. Basic statistics and visualizations will be generated to understand the data better.

# Table of Contents

1. [Ask](#ask)
2. [Prepare](#prepare)
   - [Accessing the Kaggle Dataset](#accessing-the-kaggle-dataset)
   - [Data Exploration](#data-exploration)
3. [Process](#process)
4. [Analyze](#analyze)
5. [Share](#share)
6. [Act](#act)


# Ask
[Back to Table of Contents](#table-of-contents)


The business task is to use the Fitbit Fitness Tracker Data to analyze trends in users’ daily habits and provide insights to help guide marketing strategy for Bellabeat. The main questions to guide the analysis are:

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

The key stakeholders are Urška Sršen, cofounder and Chief Creative Officer, and Sando Mur, cofounder and key member of the executive team.

# Prepare 

Deliverable: A description of all data sources used.



## Accessing the Kaggle Dataset
[Back to Table of Contents](#table-of-contents)

This repository includes a GitHub Actions workflow that continuously tests the connection to the Kaggle API. The workflow receives the Kaggle username and key as environment variables populated from GitHub secrets, checks that the files are downloaded and unzipped correctly, and, if so, imports one CSV file with pandas and prints it.

In this research notebook we use an analogous operation to fetch the dataset. Instead of secrets, the machine that runs the notebook looks for a `kaggle.json` file in the `.kaggle/` folder to authenticate and download the data. (This file should be listed in `.gitignore` so that it is not committed or shared.)
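Before calling the API, it can help to confirm that credentials are discoverable at all. The sketch below is illustrative and not part of the `kaggle` package: `find_kaggle_credentials` is a hypothetical helper name, and it assumes the two sources described above, the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (as in the CI workflow) and a `kaggle.json` file holding `username` and `key` fields (as in this notebook). It only checks that credentials are present, not that they are valid.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(config_dir=None, env=None):
    """Return 'env', 'file', or None depending on where credentials are found.

    Hypothetical helper for illustration; it does not contact the Kaggle API.
    """
    env = os.environ if env is None else env

    # Source 1: environment variables, as used by the CI workflow.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Source 2: kaggle.json in the config folder (assumed default: ~/.kaggle/).
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    cred_file = config_dir / "kaggle.json"
    if cred_file.is_file():
        try:
            data = json.loads(cred_file.read_text())
        except json.JSONDecodeError:
            return None
        if isinstance(data, dict) and data.get("username") and data.get("key"):
            return "file"
    return None


print("Credentials source:", find_kaggle_credentials())
```

A result of `None` suggests the download step is likely to fail at authentication, so it is worth fixing before going further.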

First, we import the libraries needed for this step:


In [1]:
import os # to create the directory
import kaggle # To download the dataset
import zipfile # To extract the dataset

 

Next, we create the data directory that will hold the dataset, `../data/`, using the `makedirs` function from the `os` module.



In [2]:
# Step 1: Ensure the data directory exists
data_dir = '../data/'
os.makedirs(data_dir, exist_ok=True)

Because our Kaggle credentials are saved in the `~/.kaggle/` folder (mine are not committed to this repository), the `kaggle.api.dataset_download_files` function uses them automatically to authenticate with the API.

Once we define the dataset identifier as `dataset = 'arashnic/fitbit'`, the function accesses the dataset and downloads it:

In [None]:
# Step 2: Use Kaggle API to download the dataset
dataset = 'arashnic/fitbit'  # The dataset identifier on Kaggle
kaggle.api.dataset_download_files(dataset, path=data_dir, unzip=False)

Once the file has been downloaded, it can be unzipped as follows:

* First, the path to the zip file and the extraction destination are defined:

In [4]:

zip_file_path = '../data/fitbit.zip'
extract_to_path = '../data/'

* Then, the zip file is extracted.

In [5]:
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_path)
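If the download was interrupted or the archive name differs, the extraction above fails with a fairly terse error. As a minimal sketch, the two steps can be wrapped in a function that checks the archive first and returns the list of extracted members. The name `extract_if_valid` is hypothetical and introduced here only for illustration.

```python
import zipfile
from pathlib import Path


def extract_if_valid(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names.

    Raises FileNotFoundError or zipfile.BadZipFile with a readable message
    before any file is written.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        raise FileNotFoundError(f"No archive at {zip_path}; was the download step run?")
    if not zipfile.is_zipfile(zip_path):
        raise zipfile.BadZipFile(f"{zip_path} is not a valid zip archive")

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = sorted(zip_ref.namelist())
        zip_ref.extractall(dest_dir)
    return members
```

In this notebook the call would be `extract_if_valid(zip_file_path, extract_to_path)`, using the two paths defined in the first bullet.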

Now that the files have been unzipped, we can take a look at what we are dealing with by printing the structure of the files within the `../data/` folder:

In [None]:
# print the structure of the ../data/ directory

def print_directory_structure(root_dir, indent=''):
    for item in sorted(os.listdir(root_dir)):  # sorted so the listing is reproducible
        item_path = os.path.join(root_dir, item)
        if os.path.isdir(item_path):
            print(f"{indent}📁 {item}/")
            print_directory_structure(item_path, indent + '    ')
        else:
            print(f"{indent}📄 {item}")

# Define the root directory
root_directory = '../data'

# Print the directory structure
print(f"Directory structure of {root_directory}:")
print_directory_structure(root_directory)
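The full tree can be long, so a per-folder count of CSV files is sometimes easier to scan. The following is a small sketch built on `os.walk`. `count_csv_files` is a hypothetical helper name, and the `../data` path is the same one assumed throughout this notebook.

```python
import os


def count_csv_files(root_dir):
    """Map each folder under root_dir (as a relative path) to its CSV file count."""
    counts = {}
    for current_dir, _subdirs, files in os.walk(root_dir):
        n_csv = sum(1 for name in files if name.lower().endswith(".csv"))
        if n_csv:
            counts[os.path.relpath(current_dir, root_dir)] = n_csv
    return counts


# Only runs when the data folder from the previous steps exists.
if os.path.isdir("../data"):
    for folder, n_csv in sorted(count_csv_files("../data").items()):
        print(f"{folder}: {n_csv} CSV file(s)")
```

Folders without CSV files are left out of the result, and files directly inside the root are reported under `.`.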


## Data Exploration
 
The folder contains two data samples: one covering March 12 to April 11, 2016, and another covering April 12 to May 12, 2016.

We will first assess all the data files in one of these folders (the code below points at the April 12 to May 12 export). Checking for missing values is a good first step. In addition, since several files contain date or datetime columns, printing the first date value of each file helps us understand the granularity of the data.


In [5]:
import os
import pandas as pd

# Define the directory containing the CSV files
directory = '../data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/'  # Replace with your directory path

# Initialize a list to store the results
results = []

# Function to read and process each file
def process_file(filepath):
    # Read the CSV file
    df = pd.read_csv(filepath)
    
    # Get the filename
    filename = os.path.basename(filepath)
    
    # Identify columns with missing values
    missing_values = df.isna().sum()
    missing_columns = missing_values[missing_values > 0]
    
    if not missing_columns.empty:
        for col in missing_columns.index:
            nan_indices = df[df[col].isna()].index.tolist()
            results.append({
                'File': filepath,
                'Column': col,
                'Missing Values': missing_columns[col],
                'NaN Indices': nan_indices,
                'First Datetime Value': None
            })
    
    # Treat string (object) columns as candidate date/datetime columns and record the first non-null value
    datetime_columns = df.select_dtypes(include=['datetime64', 'object']).columns
    for col in datetime_columns:
        try:
            first_value = df[col].dropna().iloc[0]
            results.append({
                'File': filepath,
                'Column': col,
                'Missing Values': 0,
                'NaN Indices': [],
                'First Datetime Value': first_value
            })
        except IndexError:
            results.append({
                'File': filepath,
                'Column': col,
                'Missing Values': 0,
                'NaN Indices': [],
                'First Datetime Value': 'Empty or NaN'
            })

# Iterate through all CSV files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        filepath = os.path.join(directory, filename)
        process_file(filepath)

# Convert the results to a DataFrame and display it
results_df = pd.DataFrame(results)
display(results_df)

|   | File | Column | Missing Values | NaN Indices | First Datetime Value |
|---|------|--------|----------------|-------------|----------------------|
| 0 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityMinute | 0 | [] | 4/12/2016 12:00:00 AM |
| 1 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityHour | 0 | [] | 4/13/2016 12:00:00 AM |
| 2 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityDate | 0 | [] | 4/12/2016 |
| 3 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityHour | 0 | [] | 4/12/2016 12:00:00 AM |
| 4 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityDay | 0 | [] | 4/12/2016 |
| 5 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityHour | 0 | [] | 4/13/2016 12:00:00 AM |
| 6 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityHour | 0 | [] | 4/12/2016 12:00:00 AM |
| 7 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityMinute | 0 | [] | 4/12/2016 12:00:00 AM |
| 8 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityDay | 0 | [] | 4/12/2016 |
| 9 | ../data/mturkfitbit_export_4.12.16-5.12.16/Fit... | ActivityMinute | 0 | [] | 4/12/2016 12:00:00 AM |
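
A single first value shows the format of each column, but the sampling interval only becomes clear from the gap between consecutive timestamps. The sketch below estimates that gap using the standard library alone. The two format strings are assumptions based on the values seen in this output (`4/12/2016 12:00:00 AM` and `4/12/2016`), and `parse_timestamp` and `smallest_step` are hypothetical helper names.

```python
from datetime import datetime

# Formats assumed from the values seen in this dataset; other exports may differ.
ASSUMED_FORMATS = ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y")


def parse_timestamp(text):
    """Parse text with the first matching assumed format, else return None."""
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def smallest_step(values):
    """Return the smallest gap (a timedelta) between distinct parsed values.

    Returns None when fewer than two distinct timestamps can be parsed.
    """
    parsed = sorted({ts for ts in map(parse_timestamp, values) if ts is not None})
    gaps = [later - earlier for earlier, later in zip(parsed, parsed[1:])]
    return min(gaps) if gaps else None


# Made-up sample values, used only to show the call.
sample = ["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 AM"]
print(smallest_step(sample))  # prints 1:00:00
```

Applied to the first few hundred values of a column such as `ActivityMinute` or `ActivityHour`, the result would indicate whether a file is sampled per minute, per hour, or per day.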
