# 📕 04 - Transform Raw Data into Features and Targets

## Introduction

After working through data exploration, validation, and transformation in the first three notebooks, it's time to consolidate that work into a more streamlined pipeline.

In this notebook, we move from an exploratory approach to a structured one. Instead of transforming the data by hand, we use functions defined in `data.py`, located in the `src` directory. Keeping the logic in a module makes it reusable and easier to maintain. Previously we worked with a single month of data to check that our methods were feasible; now we scale up to an entire year's worth of data.

Our main goal remains the same: transforming the raw data into a structured format suitable for machine learning, with clearly defined features and targets.

## Notebook Overview

1. **Load an Entire Year of Raw Data**:
    - We start by loading the raw data for a full year, a significant increase in volume over the single month used in our initial explorations. This sets the stage for the subsequent transformations and gives a fuller view of the data.

2. **Transform Raw Data into TS Data**:
    - With the raw data at hand, we reshape it into a time series format: one row per hour and pickup location, with the number of rides in that hour. Because the data is temporal, this format preserves the sequential dependencies that the next step relies on.

3. **Convert Time Series Data into Features and Target**:
    - Next, we split the time series into features and targets. The features hold the historical ride counts, while the target is the future value we aim to predict. This matches the shape of a typical supervised machine learning task, so the data is ready for model training and evaluation.

4. **Save Tabular Data**:
    - Finally, we persist the data in its final format. This lets us reuse the processed data without repeating the steps above and supports data versioning, which helps keep our experiments reproducible.
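
Step 3 can be illustrated with a small sliding-window sketch in plain Python. This is not the code in `src/data.py` (which is not shown in this notebook); `sliding_window_examples` is a hypothetical helper, and it assumes each example uses `input_seq_len` consecutive hours as features and the hour immediately after them as the target.

```python
def sliding_window_examples(series, input_seq_len, step_size=1):
    """Cut one location's hourly series into (features, target) pairs.

    Hypothetical helper for illustration only; not the function in src/data.py.
    """
    features, targets = [], []
    # The last usable window must leave one value after it for the target.
    last_start = len(series) - input_seq_len - 1
    for start in range(0, last_start + 1, step_size):
        end = start + input_seq_len
        features.append(series[start:end])  # the previous input_seq_len hours
        targets.append(series[end])         # the hour right after the window
    return features, targets


# Toy series of hourly ride counts for a single location (made-up numbers).
hourly_rides = [3, 5, 2, 8, 6, 4, 7]
X, y = sliding_window_examples(hourly_rides, input_seq_len=3, step_size=1)
for window, target in zip(X, y):
    print(window, "->", target)
```

With seven hourly values and a window of three, this yields four examples. A larger `step_size` would skip start positions and produce fewer, less overlapping examples.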


In [1]:
%reload_ext autoreload
%autoreload 2

# import libraries
from src.paths import TRANSFORMED_DATA_DIR
from src.data import load_raw_data
from src.data import transform_raw_data_into_ts_data
from src.data import transform_ts_data_into_features_and_target

# load 2023 raw data
rides = load_raw_data(year=2023)
rides.head()

File 2023-01 was already in local storage
File 2023-02 was already in local storage
File 2023-03 was already in local storage
File 2023-04 was already in local storage
File 2023-05 was already in local storage
File 2023-06 was already in local storage
File 2023-07 was already in local storage
Downloading file 2023-08
2023-08 file is not available
Downloading file 2023-09
2023-09 file is not available
Downloading file 2023-10
2023-10 file is not available
Downloading file 2023-11
2023-11 file is not available
Downloading file 2023-12
2023-12 file is not available


Unnamed: 0,pickup_datetime,pickup_location_id
0,2023-01-01 00:32:10,161
1,2023-01-01 00:55:08,43
2,2023-01-01 00:25:04,48
3,2023-01-01 00:03:48,138
4,2023-01-01 00:10:29,107


In [2]:
# transform raw data into time series data
ts_data = transform_raw_data_into_ts_data(rides)
ts_data.head()

100%|██████████| 265/265 [00:05<00:00, 44.88it/s]


Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2023-01-01 00:00:00,0,1
1,2023-01-01 01:00:00,0,1
2,2023-01-01 02:00:00,0,1
3,2023-01-01 03:00:00,0,1
4,2023-01-01 04:00:00,0,1
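
The table above has one row per hour and location, with `rides` equal to 0 for hours without pickups. As a rough illustration of that kind of aggregation, here is a minimal sketch in plain Python. The function `hourly_counts` and its arguments are hypothetical, and it is not the implementation of `transform_raw_data_into_ts_data`; it only assumes that pickups are truncated to the hour, counted per location, and that missing hours are filled with zero.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(pickups, start, end):
    """Count rides per (hour, location), filling hours without rides with 0.

    Hypothetical helper for illustration only; not the function in src/data.py.
    """
    # Truncate each pickup time to the start of its hour and count occurrences.
    counts = Counter(
        (ts.replace(minute=0, second=0, microsecond=0), loc)
        for ts, loc in pickups
    )
    locations = sorted({loc for _, loc in pickups})
    rows = []
    for loc in locations:
        hour = start
        while hour <= end:
            # A Counter returns 0 for keys it has never seen.
            rows.append((hour, counts[(hour, loc)], loc))
            hour += timedelta(hours=1)
    return rows


# The five raw rows shown earlier in this notebook.
pickups = [
    (datetime(2023, 1, 1, 0, 32, 10), 161),
    (datetime(2023, 1, 1, 0, 55, 8), 43),
    (datetime(2023, 1, 1, 0, 25, 4), 48),
    (datetime(2023, 1, 1, 0, 3, 48), 138),
    (datetime(2023, 1, 1, 0, 10, 29), 107),
]
rows = hourly_counts(pickups, datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 1))
for pickup_hour, rides_in_hour, location_id in rows:
    print(pickup_hour, rides_in_hour, location_id)
```

Each of the five locations gets two rows here: one ride in the 00:00 hour and zero rides in the 01:00 hour.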


In [3]:
# transform ts data into features and target
features, targets = transform_ts_data_into_features_and_target(
    ts_data,
    input_seq_len=24*28*1,  # 28 days of hourly data (roughly one month)
    step_size=1,
)

# check feature and target shapes
print(f'{features.shape=}')
print(f'{targets.shape=}')

  0%|          | 0/265 [00:00<?, ?it/s]

 28%|██▊       | 73/265 [02:32<06:40,  2.08s/it]


KeyboardInterrupt: 
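
A rough count of the examples this call would generate helps explain the long runtime. The sketch below is an estimate under stated assumptions, not a measurement: it assumes the time series covers January through July 2023 (the months that were loaded above), that each example needs `input_seq_len` hours plus one target hour, and that windows start every `step_size` hours. The helper `count_windows` is hypothetical, and the real function in `src/data.py` may count slightly differently.

```python
def count_windows(n_hours, input_seq_len, step_size=1):
    """Number of (features, target) windows that fit in a series of n_hours."""
    # Valid start positions at step 1: each needs input_seq_len hours + 1 target.
    usable = n_hours - input_seq_len
    if usable <= 0:
        return 0
    return (usable - 1) // step_size + 1


n_locations = 265                                # from the progress bar above
n_hours = (31 + 28 + 31 + 30 + 31 + 30 + 31) * 24  # Jan-Jul 2023, assumed coverage
input_seq_len = 24 * 28

per_location_hourly = count_windows(n_hours, input_seq_len, step_size=1)
per_location_daily = count_windows(n_hours, input_seq_len, step_size=24)

print(f"step_size=1:  {per_location_hourly * n_locations:,} examples")
print(f"step_size=24: {per_location_daily * n_locations:,} examples")
```

Under these assumptions, `step_size=1` produces over a million rows of 672 features each, while a daily step produces about 24 times fewer. Whether a larger step is acceptable depends on how much training data the model needs.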

In [5]:
# Consolidate the features and targets into a single DataFrame
# (copy first so that adding the target column does not modify `features`)
tabular_data = features.copy()
tabular_data['target_rides_next_hour'] = targets

# Save the consolidated data as a parquet file for future use
tabular_data.to_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')