# 📕 04 - Transform Raw Data into Training Data


1. **Load an Entire Year of Raw Data**: 
    - We started by loading the raw ride data for all of 2023, a much larger volume than the single month used in the earlier exploration. As the download log below shows, only January through July were available when this notebook was run.

2. **Transform Raw Data into TS Data**: 
    - Next, we aggregated the raw ride records into a time series: the number of rides per hour for each pickup location. This format exposes the temporal structure of the data that the following step relies on.

3. **Convert Time Series Data into Features and Target**: 
    - We then sliced the time series into features and targets. The features hold the ride counts for the preceding `input_seq_len` hours at one location, and the target is the ride count for the hour that follows. This is the standard layout for supervised learning, so the data is ready for model training and evaluation.

4. **Save Tabular Data**: 
    - Finally, we saved the resulting table as a Parquet file. Later notebooks can load it directly instead of repeating the steps above, which also helps keep experiments reproducible.

In [17]:
%reload_ext autoreload
%autoreload 2

# import libraries
from src.paths import TRANSFORMED_DATA_DIR
from src.data import load_raw_data, transform_raw_data_into_ts_data, transform_ts_data_into_training_data

In [12]:
# load 2023 raw data
rides = load_raw_data(year=2023)
rides.head()

2023-01 was already downloaded.
2023-02 was already downloaded.
2023-03 was already downloaded.
2023-04 was already downloaded.
2023-05 was already downloaded.
2023-06 was already downloaded.
2023-07 was already downloaded.
Downloading 2023-08...
2023-08 file not available.
Downloading 2023-09...
2023-09 file not available.
Downloading 2023-10...
2023-10 file not available.
Downloading 2023-11...
2023-11 file not available.
Downloading 2023-12...
2023-12 file not available.


| | pickup_datetime | pickup_location_id |
|---|---|---|
| 0 | 2023-01-01 00:32:10 | 161 |
| 1 | 2023-01-01 00:55:08 | 43 |
| 2 | 2023-01-01 00:25:04 | 48 |
| 3 | 2023-01-01 00:03:48 | 138 |
| 4 | 2023-01-01 00:10:29 | 107 |
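
Because the download log reports several months as unavailable, it can be worth confirming which months actually ended up in `rides`. The helper below is a minimal sketch, not part of `src.data`: the name `months_covered` is hypothetical, and it assumes `pickup_datetime` is already a datetime column, as the output above suggests.

```python
import pandas as pd


def months_covered(rides: pd.DataFrame) -> pd.Series:
    """Count rides per calendar month, sorted chronologically."""
    # Map every pickup timestamp to its calendar month (e.g. 2023-01)
    months = rides["pickup_datetime"].dt.to_period("M")
    return months.value_counts().sort_index()


# Toy data for illustration only (not taken from the real dataset)
sample = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2023-01-01 00:32:10", "2023-01-15 08:00:00", "2023-07-31 23:59:59"]
        )
    }
)
coverage = months_covered(sample)
print(coverage)
```

Calling `months_covered(rides)` on the real data should list only the months that were downloaded successfully.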


In [13]:
# transform raw data into time series data
ts_data = transform_raw_data_into_ts_data(rides)

100%|██████████| 262/262 [00:04<00:00, 56.35it/s]


In [18]:
ts_data.head()

| | pickup_hour | rides | pickup_location_id |
|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 19 | 4 |
| 1 | 2023-01-01 01:00:00 | 28 | 4 |
| 2 | 2023-01-01 02:00:00 | 43 | 4 |
| 3 | 2023-01-01 03:00:00 | 33 | 4 |
| 4 | 2023-01-01 04:00:00 | 12 | 4 |
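
The implementation of `transform_raw_data_into_ts_data` lives in `src.data` and is not shown in this notebook. As a rough illustration of the idea, the sketch below counts rides per hour and location with pandas and fills hours without rides with 0. The function name `rides_to_hourly_ts` is hypothetical, and the real function may differ in its details.

```python
import pandas as pd


def rides_to_hourly_ts(rides: pd.DataFrame) -> pd.DataFrame:
    """Count rides per (location, hour); hours without rides get 0."""
    rides = rides.copy()
    # Truncate each pickup timestamp to the start of its hour
    rides["pickup_hour"] = rides["pickup_datetime"].dt.floor("h")

    # Number of rides observed for each location and hour
    counts = (
        rides.groupby(["pickup_location_id", "pickup_hour"])
        .size()
        .rename("rides")
    )

    # Build the full location x hour grid so empty hours are not missing
    hours = pd.date_range(
        rides["pickup_hour"].min(), rides["pickup_hour"].max(), freq="h"
    )
    locations = sorted(rides["pickup_location_id"].unique())
    full_index = pd.MultiIndex.from_product(
        [locations, hours], names=["pickup_location_id", "pickup_hour"]
    )

    ts = counts.reindex(full_index, fill_value=0).reset_index()
    return ts[["pickup_hour", "rides", "pickup_location_id"]]


# Toy data for illustration only
sample = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2023-01-01 00:32:10", "2023-01-01 00:55:08", "2023-01-01 02:03:48"]
        ),
        "pickup_location_id": [161, 161, 43],
    }
)
ts_sample = rides_to_hourly_ts(sample)
print(ts_sample)
```

Filling the gaps matters for the next step: a sliding window over hourly values only makes sense if every hour is present for every location.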


In [19]:
# transform ts data into features and target
features, targets = transform_ts_data_into_training_data(
    ts_data,
    input_seq_len=24*28*1, # 28 days (4 weeks) of hourly values
    step_size=1,
)

# check feature and target shapes
print(f'{features.shape=}')
print(f'{targets.shape=}')

 19%|█▉        | 51/262 [01:26<06:42,  1.91s/it]
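
To make the roles of `input_seq_len` and `step_size` concrete, here is a minimal sketch of sliding-window slicing for a single location's series. It is an assumption about how such a transformation typically works, not the code of `transform_ts_data_into_training_data`; the helper name `make_windows` is hypothetical.

```python
import numpy as np


def make_windows(series, input_seq_len, step_size=1):
    """Slice one series into (features, targets).

    Each feature row holds `input_seq_len` consecutive values and the
    target is the value immediately after that window.
    """
    series = np.asarray(series)
    # Last window start that still leaves one value for the target
    last_start = len(series) - input_seq_len - 1
    if last_start < 0:
        # Series too short to produce even one (window, target) pair
        return (
            np.empty((0, input_seq_len), dtype=series.dtype),
            np.empty(0, dtype=series.dtype),
        )

    starts = range(0, last_start + 1, step_size)
    features = np.stack([series[i : i + input_seq_len] for i in starts])
    targets = np.array([series[i + input_seq_len] for i in starts])
    return features, targets


# Toy series 0..9 with a window of 4 hours and a step of 2
x, y = make_windows(np.arange(10), input_seq_len=4, step_size=2)
print(f"{x.shape=}")
print(f"{y.shape=}")
```

In this sketch, a series of `n` hourly values yields `(n - input_seq_len - 1) // step_size + 1` rows when `n > input_seq_len`. With `step_size=1` the windows overlap heavily, which helps explain why this cell runs much longer than the previous one.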

In [None]:
# Consolidate the features and targets into a single DataFrame
# (copy first so that `features` itself is not modified)
tabular_data = features.copy()
tabular_data['target_rides_next_hour'] = targets

# Save the consolidated data as a parquet file for future use
tabular_data.to_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')