## Steps
1. Download all raw data for a given year and month.
1. Filter each raw file before saving it.
1. Transform the saved raw data into time-series (TS) data.
1. Convert the TS data into features and targets.
1. Save the transformed data.


The main objective is to write a utility function for each of these steps so we can reuse them later.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os

# Add the parent directory to the Python path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# Now you can import from src
from src.data_utils import load_and_process_rides_data

In [3]:
rides = load_and_process_rides_data(year=2024, months=[1,2,3])

File already exists for 2024-01.
Loading data for 2024-01...
Total records: 1,888,085
Valid records: 1,880,623
Records dropped: 7,462 (0.40%)
Successfully processed data for 2024-01.
File already exists for 2024-02.
Loading data for 2024-02...
Total records: 2,121,501
Valid records: 2,115,129
Records dropped: 6,372 (0.30%)
Successfully processed data for 2024-02.
File already exists for 2024-03.
Loading data for 2024-03...
Total records: 2,663,295
Valid records: 2,654,825
Records dropped: 8,470 (0.32%)
Successfully processed data for 2024-03.
Combining all monthly data...
Data loading and processing complete!
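
The log above reports total, valid, and dropped records per month. The sketch below shows one way such a filter could produce those numbers. It is not the code in `src.data_utils`: the function name `filter_rides` and both validity rules (pickup time inside the requested month, location id present) are assumptions made for illustration.

```python
import pandas as pd


def filter_rides(rides: pd.DataFrame, year: int, month: int) -> pd.DataFrame:
    """Hypothetical filter: keep rides inside the month that have a location id."""
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.DateOffset(months=1)  # first instant of the next month

    # Assumed validity rules, not taken from the notebook
    in_month = (rides["pickup_datetime"] >= start) & (rides["pickup_datetime"] < end)
    has_location = rides["pickup_location_id"].notna()
    valid = rides.loc[in_month & has_location, ["pickup_datetime", "pickup_location_id"]]

    # Report the same three figures as the log output
    total, kept = len(rides), len(valid)
    dropped = total - kept
    pct = 100 * dropped / total if total else 0.0
    print(f"Total records: {total:,}")
    print(f"Valid records: {kept:,}")
    print(f"Records dropped: {dropped:,} ({pct:.2f}%)")
    return valid.reset_index(drop=True)


# Tiny made-up sample: only the first row passes both rules
raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-01-05 10:00", "2024-02-01 00:00", "2023-12-31 23:59", "2024-01-31 23:59",
    ]),
    "pickup_location_id": ["A", "A", "B", None],
})
filtered = filter_rides(raw, year=2024, month=1)
```

The valid counts in the log (1,880,623 + 2,115,129 + 2,654,825) add up to 6,650,577 rows, which agrees with the index of the combined `rides` frame.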


In [4]:
rides

Unnamed: 0,pickup_datetime,pickup_location_id
0,2024-01-22 18:43:19.012,7954.12
1,2024-01-11 19:19:18.721,6771.13
2,2024-01-30 19:17:41.693,5659.11
3,2024-01-27 11:27:01.759,6771.13
4,2024-01-16 15:15:41.000,7443.01
...,...,...
6650572,2024-03-15 11:44:14.141,6115.09
6650573,2024-03-18 16:50:23.185,6115.09
6650574,2024-03-20 11:35:21.922,5826.02
6650575,2024-03-15 09:29:16.009,5826.02


In [5]:
from src.data_utils import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)
ts_data.head()

Unnamed: 0,pickup_hour,pickup_location_id,rides
0,2024-01-01 00:00:00,190 Morgan,0
1,2024-01-01 01:00:00,190 Morgan,0
2,2024-01-01 02:00:00,190 Morgan,0
3,2024-01-01 03:00:00,190 Morgan,0
4,2024-01-01 04:00:00,190 Morgan,0
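
The output above has one row per location and hour, including hours with zero rides. A minimal sketch of a transformation with that shape follows. It is an assumption about how this step could work, not the implementation of `transform_raw_data_into_ts_data`: the name `rides_to_hourly_ts` is hypothetical, and the hour range is taken from the observed data rather than from the calendar month.

```python
import pandas as pd


def rides_to_hourly_ts(rides: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: count rides per location and hour, filling gaps with 0."""
    hourly = rides.assign(pickup_hour=rides["pickup_datetime"].dt.floor("h"))
    counts = hourly.groupby(["pickup_hour", "pickup_location_id"]).size()

    # Full grid of every location crossed with every hour in the observed range
    full_hours = pd.date_range(
        hourly["pickup_hour"].min(), hourly["pickup_hour"].max(), freq="h"
    )
    locations = hourly["pickup_location_id"].unique()
    grid = pd.MultiIndex.from_product(
        [locations, full_hours], names=["pickup_location_id", "pickup_hour"]
    )

    # Reindexing onto the grid inserts the missing (location, hour) pairs as 0
    ts = counts.reorder_levels(["pickup_location_id", "pickup_hour"]).reindex(
        grid, fill_value=0
    )
    ts = ts.rename("rides").astype("int16").reset_index()
    return ts[["pickup_hour", "pickup_location_id", "rides"]]


# Made-up sample: three rides at location A, one at location B
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-01-01 00:10", "2024-01-01 00:50", "2024-01-01 02:05", "2024-01-01 01:30",
    ]),
    "pickup_location_id": ["A", "A", "A", "B"],
})
ts_sample = rides_to_hourly_ts(sample)
```

Filling the gaps explicitly matters for the later windowing step, which relies on every location having the same number of consecutive hourly rows.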


In [6]:
ts_data.shape

(4752384, 3)
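
As a sanity check on this shape, the row count should equal hours times locations. The sketch below assumes the series covers every hour from 2024-01-01 00:00 through 2024-03-31 23:00; the location count is inferred from the division, not read from the data.

```python
import pandas as pd

# Assumed coverage: every hour of January through March 2024 (a leap year)
hours = pd.date_range("2024-01-01 00:00", "2024-03-31 23:00", freq="h")
n_hours = len(hours)

n_rows = 4_752_384  # first element of ts_data.shape
n_locations, remainder = divmod(n_rows, n_hours)

print(n_hours, n_locations, remainder)  # 2184 2176 0
```

A remainder of zero is consistent with each location having exactly one row per hour.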

In [7]:
ts_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4752384 entries, 0 to 4752383
Data columns (total 3 columns):
 #   Column              Dtype         
---  ------              -----         
 0   pickup_hour         datetime64[ns]
 1   pickup_location_id  object        
 2   rides               int16         
dtypes: datetime64[ns](1), int16(1), object(1)
memory usage: 81.6+ MB


In [12]:
top_locations = (
    ts_data.groupby("pickup_location_id")["rides"]
    .sum()
    .sort_values(ascending=False)
    .head(3)
)
top_locations

pickup_location_id
6140.05    28876
6450.05    22705
6822.09    22373
Name: rides, dtype: int16
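
The totals above are reported as `int16`, whose maximum value is 32,767, and the largest total (28,876) is not far below that limit. One cautious option, sketched here on made-up data, is to widen the column to `int64` before summing so the totals cannot exceed the integer range. The frame `ts` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data: location A sums to 33,000, which does not fit in int16
ts = pd.DataFrame({
    "pickup_location_id": ["A"] * 3 + ["B"] * 2,
    "rides": np.array([30_000, 2_000, 1_000, 5, 7], dtype="int16"),
})

totals = (
    ts.assign(rides=ts["rides"].astype("int64"))  # widen before aggregating
    .groupby("pickup_location_id")["rides"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)
```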

In [8]:
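from src.data_utils import transform_ts_data_info_features_and_target_loop

# window_size=24*28: each row holds 672 hourly values (28 days)
# step_size=24: consecutive windows start 24 hours apart
features, targets = transform_ts_data_info_features_and_target_loop(
    ts_data, window_size=24 * 28, step_size=24
)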
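
To make `window_size` and `step_size` concrete, here is a minimal sliding-window sketch for a single location. It is an illustration under stated assumptions, not the code behind `transform_ts_data_info_features_and_target_loop`: the helper `sliding_windows` is hypothetical, and it assumes the target is the single value immediately after each window.

```python
import numpy as np


def sliding_windows(series: np.ndarray, window_size: int, step_size: int):
    """Hypothetical helper: window_size past values as features, next value as target."""
    features, targets = [], []
    # Stop early enough that the target index (start + window_size) stays in range
    for start in range(0, len(series) - window_size, step_size):
        end = start + window_size
        features.append(series[start:end])
        targets.append(series[end])
    return np.array(features), np.array(targets)


# Stand-in for one location's series: 2,184 hourly values (91 days)
hourly = np.arange(2184)
X, y = sliding_windows(hourly, window_size=24 * 28, step_size=24)
print(X.shape, y.shape)  # (63, 672) (63,)
```

With 63 windows per location and the 2,176 locations implied by `ts_data.shape`, this gives 63 × 2,176 = 137,088 rows, which matches the length of `features`. The first target falls at hour index 672, that is 2024-01-29 00:00, in line with the first `pickup_hour` shown in the features table.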


In [9]:
features

Unnamed: 0,rides_t-672,rides_t-671,rides_t-670,rides_t-669,rides_t-668,rides_t-667,rides_t-666,rides_t-665,rides_t-664,rides_t-663,...,rides_t-8,rides_t-7,rides_t-6,rides_t-5,rides_t-4,rides_t-3,rides_t-2,rides_t-1,pickup_hour,pickup_location_id
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-01-29,190 Morgan
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-01-30,190 Morgan
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-01-31,190 Morgan
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-02-01,190 Morgan
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-02-02,190 Morgan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137083,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-03-27,SYS038
137084,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,2024-03-28,SYS038
137085,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-03-29,SYS038
137086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2024-03-30,SYS038


In [10]:
features['pickup_hour'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 137088 entries, 0 to 137087
Series name: pickup_hour
Non-Null Count   Dtype         
--------------   -----         
137088 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 1.0 MB


In [11]:
from src.config import TRANSFORMED_DATA_DIR

# Copy so that adding the target column does not also modify `features`
tabular_data = features.copy()
tabular_data["target"] = targets

tabular_data.to_parquet(TRANSFORMED_DATA_DIR / "tabular_data.parquet", engine="pyarrow")