# Data Preprocessing for a Machine Learning Model

## Introduction
This Jupyter notebook preprocesses raw data into feature vectors suitable for fitting a machine learning model. Each feature vector has shape (11,) and contains biometric parameters such as heart rate variability, oxygen saturation, sleep score and stress score. These parameters are the inputs used to predict the readiness value.

## Data Filtering and Transformation
The raw data is collected from a Fitbit smartwatch through various CSV files.

- **Step 1:** Load the CSV files into pandas dataframes.
- **Step 2:** Clean the values: convert timestamps to plain dates and keep only the relevant columns.
- **Step 3:** Merge the dataframes by date to create a consolidated dataframe.
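
The three steps above can be sketched on a toy example. This is a minimal illustration, not the notebook's actual loading code: the in-memory CSV text, its values, and the `rmssd` column name are assumptions made for the example, and `io.StringIO` stands in for files on disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two exported CSV files (made-up values).
hrv_csv = io.StringIO(
    "timestamp,rmssd\n"
    "2023-01-01T00:00:00,40.5\n"
    "2023-01-02T00:00:00,42.0\n"
)
spo2_csv = io.StringIO(
    "timestamp,average_value,lower_bound,upper_bound\n"
    "2023-01-02T00:00:00,95.9,93.0,98.0\n"
    "2023-01-03T00:00:00,96.2,94.1,98.3\n"
)

# Step 1: load each CSV into a dataframe.
hrv = pd.read_csv(hrv_csv)
spo2 = pd.read_csv(spo2_csv)

# Step 2: reduce timestamps to a date string and keep only relevant columns.
for frame in (hrv, spo2):
    frame["date"] = pd.to_datetime(frame["timestamp"]).dt.strftime("%Y-%m-%d")
hrv = hrv[["date", "rmssd"]]
spo2 = spo2[["date", "average_value"]]

# Step 3: merge on date; the default inner join keeps only dates
# that are present in both dataframes.
merged = pd.merge(hrv, spo2, on="date")
print(merged)
```

Because `pd.merge` performs an inner join by default, a day that is missing from any one source is dropped from the consolidated dataframe.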

## Data Vectorization
After extraction, the features are combined into feature vectors. Each day's feature vector is paired with the readiness value recorded on the following day; the vectors serve as the model's input and the readiness values as its output.
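
As a minimal sketch of this pairing, the snippet below uses only the standard library and made-up values. The names `features_by_day`, `readiness_by_day`, and `pair_with_next_day` are hypothetical and do not appear in the notebook. The next-day offset mirrors the `timedelta(days=1)` used in the notebook's code cells, although the notebook itself uses a nested loop over sorted arrays rather than a dictionary lookup.

```python
from datetime import date, timedelta

# Hypothetical daily features and readiness scores (made-up values).
features_by_day = {
    date(2023, 1, 1): [3, 86.8, 95.9],
    date(2023, 1, 2): [12, 80.1, 96.4],
    date(2023, 1, 3): [0, 90.2, 95.1],
}
readiness_by_day = {
    date(2023, 1, 2): 72.0,
    date(2023, 1, 4): 64.0,
}


def pair_with_next_day(features, readiness):
    """Append the next day's readiness score to each day's feature list."""
    vectors = []
    for day in sorted(features):
        target = readiness.get(day + timedelta(days=1))
        if target is not None:  # skip days without a next-day score
            vectors.append(features[day] + [target])
    return vectors


vectors = pair_with_next_day(features_by_day, readiness_by_day)
print(vectors)
```

Days whose following day has no readiness score produce no vector, which is why the number of final vectors can be smaller than both the number of feature rows and the number of readiness rows.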

## Conclusion
By the conclusion of this notebook, we will have transformed the raw data into a structured format suitable for training our machine learning model.


In [34]:
import pandas as pd
import numpy as np
import os

In [36]:
PATH = "Fitbit"  # Root folder containing the exported Fitbit CSV files

#### In the following cells we collect data about physical activity levels throughout the day, heart rate variability, oxygen saturation, sleep score, stress score and daily readiness.

In [23]:
def vectorize_azm(file):
    """Return the total Active Zone Minutes per day from one AZM CSV file."""
    df = pd.read_csv(file)
    # Reduce each timestamp to its calendar date so rows can be grouped by day.
    df['date'] = pd.to_datetime(df['date_time']).dt.strftime('%Y-%m-%d')
    return df.groupby("date")["total_minutes"].sum().to_frame()

In [24]:
dir = f"{PATH}/Active Zone Minutes (AZM)/"
azm_data = None
files = os.listdir(dir)
for file in files:
    tmp = vectorize_azm(f"{dir}{file}")
    if azm_data is None:
        azm_data = tmp
    else:
        azm_data = pd.concat([azm_data, tmp])
azm_data.reset_index(inplace=True)

In [25]:
dir = f"{PATH}/Heart Rate Variability/"
files = os.listdir(dir)
hrv_data = None
for file in files:
    if file.startswith("Daily Heart Rate Variability Summary") and file.endswith(".csv"):
        df = pd.read_csv(f"{dir}{file}")
        df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d')
        if hrv_data is None:
            hrv_data = df
        else:
            hrv_data = pd.concat([hrv_data, df])

In [26]:
dir = f"{PATH}/Oxygen Saturation (SpO2)/"
files = os.listdir(dir)
os_data = None
for file in files:
    if file.startswith("Daily SpO2 - "):
        df = pd.read_csv(f"{dir}{file}")
        df['timestamp'] = pd.to_datetime(df['timestamp']).dt.strftime('%Y-%m-%d')
        df.drop(["lower_bound", "upper_bound"], axis=1, inplace=True)
        if os_data is None:
            os_data = df
        else:
            os_data = pd.concat([os_data, df])

In [8]:
dir = f"{PATH}/Sleep Score/sleep_score.csv"
df = pd.read_csv(dir)
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")
df.drop(["composition_score", "duration_score", "sleep_log_entry_id"], axis=1, inplace=True)
sleep_data = df

In [9]:
dir = f"{PATH}/Stress Score/Stress Score.csv"
df = pd.read_csv(dir)
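# Keep only days with a valid score; copy so the date conversion below
# modifies an independent dataframe rather than a slice of the original.
df = df.loc[~df['CALCULATION_FAILED'], ["DATE", "STRESS_SCORE"]].copy()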
df['DATE'] = pd.to_datetime(df['DATE']).dt.strftime('%Y-%m-%d')
stress_data = df

In [11]:
dir = f"{PATH}/Daily Readiness/"
files = os.listdir(dir)
df = None  # Readiness scores; used later as the model output
for file in files:
    if file.startswith("Daily Readiness Score -"):
        tmp = pd.read_csv(f"{dir}{file}")
        tmp = tmp[["date", "readiness_score_value"]]
        if df is None:
            df = tmp
        else:
            df = pd.concat([df, tmp])

In [10]:
tmp_data = [azm_data, hrv_data, os_data, sleep_data, stress_data]
new_names = {"timestamp": "date", "DATE": "date"}
for i in tmp_data:
    i.rename(columns=new_names, inplace=True)

In [12]:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='date'), tmp_data)

In [13]:
from datetime import datetime, timedelta

In [14]:
data = df_final.values
for i in range(len(data)):
    data[i][0] = datetime.strptime(data[i][0], '%Y-%m-%d')

In [15]:
output = df.values
for i in range(len(output)):
    output[i][0] = datetime.strptime(output[i][0], '%Y-%m-%d')

In [16]:
sorted_data = data[data[:,0].argsort()]
sorted_output = output[output[:, 0].argsort()]

In [17]:
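# Pair each day's features with the readiness score of the following day.
vectors = []
for i in sorted_data:
    target = i[0] + timedelta(days=1)
    for j in range(len(sorted_output)):
        if target == sorted_output[j][0]:
            # Drop the date columns; keep the 11 features followed by the target.
            vectors.append(np.concatenate([i[1:], sorted_output[j][1:]]))
            break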

In [33]:
print(f"Data available: {len(sorted_data)}\nOutput registered: {len(sorted_output)}\nFinal vectors: {len(vectors)}")

Data available: 419
Output registered: 183
Final vectors: 155


In [19]:
np.savetxt("data.csv", vectors, delimiter=",")

**Example:** 

Here is an example of a vector with shape (12,): the first 11 values are the inputs provided to the model, and the last value is the readiness score to be predicted.

In [35]:
vectors[-1]

array([3, 86.819, 48.853, 3.314, 95.9, 77, 21, 82, 53, 0.0976116303219107,
       72, 100.0], dtype=object)
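
To make the input/output split concrete, here is a minimal sketch that separates the 11 features from the target using the example vector shown above. The names `sample`, `X`, and `y` are illustrative and not part of the notebook, and the conversion to `float` is an assumption made so that the array has a numeric dtype.

```python
import numpy as np

# The example vector shown above, wrapped in a 2-D array of floats.
sample = np.array(
    [[3, 86.819, 48.853, 3.314, 95.9, 77, 21, 82, 53,
      0.0976116303219107, 72, 100.0]],
    dtype=float,
)

# The first 11 columns are the model inputs; the last column is the target.
X = sample[:, :-1]
y = sample[:, -1]

print(X.shape, y.shape)
```

The same slicing should apply to the full dataset, for instance to the array returned by `np.loadtxt("data.csv", delimiter=",")`, assuming the file written earlier in this notebook.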