## Preprocessing and feature engineering

In [69]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import holidays

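In this notebook, we prepare the data for modelling: we re-check the findings from the EDA, extract features from the `date` field, add indicator features, and assemble everything into a preprocessing pipeline.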

### Verifying steps from EDA

In [70]:
data = pd.read_parquet(Path("data") / "train.parquet")

In [71]:
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 counter_id                   0
counter_name                 0
site_id                      0
site_name                    0
bike_count                   0
date                         0
counter_installation_date    0
coordinates                  0
counter_technical_id         0
latitude                     0
longitude                    0
log_bike_count               0
dtype: int64


As found in the EDA, the training set contains no missing values, so no imputation or other missing-value handling is needed.

In [72]:
print(data.describe())

            site_id     bike_count                        date  \
count  4.968270e+05  496827.000000                      496827   
mean   1.053450e+08      60.191475  2021-03-08 07:25:59.668858   
min    1.000070e+08       0.000000         2020-09-01 01:00:00   
25%    1.000475e+08       5.000000         2020-12-05 22:00:00   
50%    1.000562e+08      29.000000         2021-03-08 11:00:00   
75%    1.000563e+08      79.000000         2021-06-09 14:00:00   
max    3.000147e+08    1302.000000         2021-09-09 23:00:00   
std    3.210346e+07      87.590566                         NaN   

        counter_installation_date       latitude      longitude  \
count                      496827  496827.000000  496827.000000   
mean   2019-04-04 07:24:35.245911      48.854343       2.345479   
min           2013-01-18 00:00:00      48.826360       2.265420   
25%           2018-11-29 00:00:00      48.840801       2.314440   
50%           2019-11-06 00:00:00      48.852090       2.353870   
75%

### Feature Extraction 
We cannot feed the raw `date` field directly into the model. To capture the temporal structure of the data, we instead extract features at several time scales (year, month, day, weekday, hour) from `date`. First, we confirm that the column is a complete datetime column:

In [73]:
print(data["date"].head())
print(data["date"].isna().sum())
print(data["date"].dtype)

48321   2020-09-01 02:00:00
48324   2020-09-01 03:00:00
48327   2020-09-01 04:00:00
48330   2020-09-01 15:00:00
48333   2020-09-01 18:00:00
Name: date, dtype: datetime64[us]
0
datetime64[us]


In [74]:
# Extract calendar features from the date and add cyclical (sin/cos) encodings
def _encode_dates(X):
    X = X.copy()
    X["year"] = X["date"].dt.year
    X["month"] = X["date"].dt.month
    X["day"] = X["date"].dt.day
    X["weekday"] = X["date"].dt.weekday
    X["hour"] = X["date"].dt.hour

    # Apply cyclical encoding
    X["hour_sin"] = np.sin(2 * np.pi * X["hour"] / 24)
    X["hour_cos"] = np.cos(2 * np.pi * X["hour"] / 24)
    X["month_sin"] = np.sin(2 * np.pi * X["month"] / 12)
    X["month_cos"] = np.cos(2 * np.pi * X["month"] / 12)
    X["weekday_sin"] = np.sin(2 * np.pi * X["weekday"] / 7)
    X["weekday_cos"] = np.cos(2 * np.pi * X["weekday"] / 7)

    return X

Why did we choose cyclical encoding?

Cyclical encoding was chosen for the temporal features (`hour`, `month`, and `weekday`) to capture their periodic structure:
- `23:00` is close to `00:00`, and December is close to January.
- Encoding each feature as a sine/cosine pair avoids the artificial jump that a plain integer encoding introduces at the wrap-around (for example from 23 to 0). Unlike one-hot encoding, it also preserves the notion that neighbouring values are similar.

While one-hot encoding could be used for features like `weekday` (days may not follow a strongly cyclical pattern), we opted for consistency and simplicity at this stage. Future iterations can experiment with hybrid approaches and evaluate their impact on model performance.
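
To make the "23:00 is close to 00:00" argument concrete, the sketch below compares distances in the encoded space. The helper name `encode_cyclical` is an assumption introduced for illustration; it applies the same sine/cosine formula as `_encode_dates`.

```python
import numpy as np


def encode_cyclical(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


# Encode midnight, noon and 23:00 on a 24-hour cycle
hours = encode_cyclical([0, 12, 23], period=24)

# Euclidean distances in the encoded (sin, cos) space
dist_23_to_0 = np.linalg.norm(hours[2] - hours[0])
dist_12_to_0 = np.linalg.norm(hours[1] - hours[0])

print(f"23:00 vs 00:00: {dist_23_to_0:.3f}")
print(f"12:00 vs 00:00: {dist_12_to_0:.3f}")
```

With the raw integer hour, 23 and 0 are 23 units apart. In the encoded space they are neighbours, while noon and midnight sit on opposite sides of the circle.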

In [75]:
categorical_columns = data.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)

Categorical columns: Index(['counter_id', 'counter_name', 'site_name', 'coordinates',
       'counter_technical_id'],
      dtype='object')


### Feature Interaction 

In [76]:
# TODO: add feature interactions later

### Feature Engineering

In [77]:
# Aggregate statistics by site_id (currently disabled)
# site_stats = data.groupby('site_id')['bike_count'].agg(['mean', 'std']).reset_index()
# site_stats.rename(columns={'mean': 'site_mean_count', 'std': 'site_std_count'}, inplace=True)
#
# data = data.merge(site_stats, on='site_id', how='left')

**Note:** this step is currently disabled and may be dropped, as it does not seem very meaningful.

Why we considered it: Sites with similar conditions (e.g., busy vs. quiet areas) might exhibit consistent traffic patterns. Aggregate features, such as a site's average bike traffic, help capture this site-specific behavior.
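
If site-level aggregates of the target are used, one option is to learn them on the training data only and reuse them at prediction time, so the feature is never computed from the rows it is applied to. The following is a minimal sketch under that assumption. The class `SiteMeanEncoder` and the column name `site_mean_target` are hypothetical and not part of this notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SiteMeanEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn per-site target means in fit, apply them in transform."""

    def __init__(self, site_col="site_id"):
        self.site_col = site_col

    def fit(self, X, y):
        # Align y with X positionally, whatever index y carries
        y = pd.Series(np.asarray(y), index=X.index)
        self.site_means_ = y.groupby(X[self.site_col]).mean()
        self.global_mean_ = float(y.mean())
        return self

    def transform(self, X):
        X = X.copy()
        # Sites unseen during fit fall back to the global training mean
        X["site_mean_target"] = (
            X[self.site_col].map(self.site_means_).fillna(self.global_mean_)
        )
        return X


# Toy data, made up for illustration
train = pd.DataFrame(
    {"site_id": [1, 1, 2, 2], "log_bike_count": [1.0, 3.0, 4.0, 6.0]}
)
new = pd.DataFrame({"site_id": [1, 2, 3]})

encoder = SiteMeanEncoder().fit(train[["site_id"]], train["log_bike_count"])
encoded = encoder.transform(new)
print(encoded)
```

Because the means are stored in `fit`, the same transformer can be applied to data that has no target column.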


In [78]:
# Define weekend indicator
def _add_weekend_indicator(X):
    X = X.copy()
    X["is_weekend"] = (X["weekday"] >= 5).astype(int)
    return X

Why: The EDA showed that bike counts are higher on weekdays than on weekends (likely due to commuting). The `is_weekend` flag lets the model capture this difference.
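
The `>= 5` threshold relies on pandas numbering weekdays from Monday = 0 to Sunday = 6. A quick check on a few dates, chosen here only for illustration:

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday in the first week of September 2020
dates = pd.Series(
    pd.to_datetime(["2020-09-04", "2020-09-05", "2020-09-06", "2020-09-07"])
)

weekday = dates.dt.weekday               # Monday = 0, ..., Sunday = 6
is_weekend = (weekday >= 5).astype(int)  # 1 for Saturday and Sunday

print(pd.DataFrame({"date": dates, "weekday": weekday, "is_weekend": is_weekend}))
```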

In [79]:
# Define rush hour indicator
def _add_rush_hour_indicator(X):
    X = X.copy()
    X["is_rush_hour"] = X["hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)
    return X

Why: The rush hour indicator captures periods of high bike traffic, which typically align with commuting times. The EDA graph shows bike counts peaking around 6-9 AM and 3-6 PM, the morning and evening rush hours. Note that the indicator as implemented flags hours 7-9 and 17-19, so its evening window sits later than the observed 3-6 PM peak and may be worth revisiting. The feature helps the model identify patterns specific to these high-traffic periods.
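
One way to check which hours to flag is to rank the hours by their mean count instead of reading them off a plot. The sketch below assumes a hypothetical helper `top_hours` and uses a small made-up frame. On the real data it would be applied after the `hour` column has been extracted.

```python
import pandas as pd


def top_hours(df, n=6, hour_col="hour", target_col="log_bike_count"):
    """Return the n hours with the highest mean target value, sorted ascending."""
    hourly_mean = df.groupby(hour_col)[target_col].mean()
    return sorted(hourly_mean.nlargest(n).index.tolist())


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "hour": [7, 8, 8, 12, 18, 18, 23],
        "log_bike_count": [3.0, 5.0, 5.5, 2.0, 5.0, 6.0, 0.5],
    }
)

busiest = top_hours(toy, n=2)
print(busiest)
```

Comparing this list with the hours passed to `isin` in `_add_rush_hour_indicator` shows whether the hard-coded hours match the data.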

In [80]:
# Define holiday indicator
def _add_holiday_indicator(X):
    X = X.copy()
    france_holidays = holidays.FR()  # Use the holidays package for France
    X["is_holiday"] = X["date"].dt.date.apply(lambda x: 1 if x in france_holidays else 0)
    return X

### Drop Features with low predictive value

In [81]:
# Define feature dropping for low predictive value and date (as we encoded it)
def _drop_low_value_features(X):
    X = X.copy()
    X.drop(columns=['counter_id', 'counter_name', 'site_name', 'date', 'counter_installation_date', 'counter_technical_id', 'coordinates'], inplace=True)
    return X

### Add average log bike count for each site

In [82]:
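def _add_site_avg_log_bike_count(X):
    X = X.copy()
    # Averages the target (log_bike_count) per site over the rows in X,
    # so X must contain the log_bike_count column
    site_avg = X.groupby("site_id")["log_bike_count"].transform("mean")
    X["site_avg_log_bike_count"] = site_avg
    return X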

### Add daily variance

In [83]:
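def _add_daily_variance(X):
    X = X.copy()
    # Variance of log_bike_count per site and calendar day
    daily_variance = X.groupby(["site_id", "year", "month", "day"])["log_bike_count"].transform("var")
    # Groups with a single observation have an undefined variance (NaN); use 0 instead
    X["daily_variance"] = daily_variance.fillna(0)
    return X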

### Pipeline

In [84]:
# Preprocessing pipeline
preprocessor = Pipeline(steps=[
    ("encode_dates", FunctionTransformer(_encode_dates, validate=False)),
    ("add_rush_hour", FunctionTransformer(_add_rush_hour_indicator, validate=False)),
    ("add_weekend", FunctionTransformer(_add_weekend_indicator, validate=False)),
    ("add_holiday", FunctionTransformer(_add_holiday_indicator, validate=False)),
    ("drop_low_value_features", FunctionTransformer(_drop_low_value_features, validate=False)),
    ("add_site_avg_log_bike_count", FunctionTransformer(_add_site_avg_log_bike_count, validate=False)),
    ("add_daily_variance", FunctionTransformer(_add_daily_variance, validate=False)),
])


In [85]:
data

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,coordinates,counter_technical_id,latitude,longitude,log_bike_count
48321,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.000000
48324,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.693147
48327,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.000000
48330,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,1.609438
48333,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,2.302585
...,...,...,...,...,...,...,...,...,...,...,...,...
929175,300014702-353245971,254 rue de Vaugirard SO-NE,300014702,254 rue de Vaugirard,445.0,2021-09-09 06:00:00,2020-11-29,"48.83977,2.30198",Y2H20114504,48.839770,2.301980,6.100319
929178,300014702-353245971,254 rue de Vaugirard SO-NE,300014702,254 rue de Vaugirard,145.0,2021-09-09 10:00:00,2020-11-29,"48.83977,2.30198",Y2H20114504,48.839770,2.301980,4.983607
929181,300014702-353245971,254 rue de Vaugirard SO-NE,300014702,254 rue de Vaugirard,218.0,2021-09-09 15:00:00,2020-11-29,"48.83977,2.30198",Y2H20114504,48.839770,2.301980,5.389072
929184,300014702-353245971,254 rue de Vaugirard SO-NE,300014702,254 rue de Vaugirard,21.0,2021-09-09 22:00:00,2020-11-29,"48.83977,2.30198",Y2H20114504,48.839770,2.301980,3.091042


In [86]:
def preprocess_data(data):
    # Every step returns a DataFrame, so column names are preserved
    processed = preprocessor.fit_transform(data)
    processed_data = pd.DataFrame(processed, index=data.index)

    return processed_data

In [87]:
# Preprocess the data
processed_data = preprocess_data(data)

In [88]:
# Save the preprocessed data
save_path = 'data/processed_data.parquet'
processed_data.to_parquet(save_path, index=False)

print(f"Dataset saved to {save_path}")

Dataset saved to data/processed_data.parquet


In [89]:
processed_data

Unnamed: 0,site_id,bike_count,latitude,longitude,log_bike_count,year,month,day,weekday,hour,...,hour_cos,month_sin,month_cos,weekday_sin,weekday_cos,is_rush_hour,is_weekend,is_holiday,site_avg_log_bike_count,daily_variance
48321,100007049,0.0,48.846028,2.375429,0.000000,2020,9,1,1,2,...,8.660254e-01,-1.0,-1.836970e-16,0.781831,0.623490,0,0,0,1.981562,2.560114
48324,100007049,1.0,48.846028,2.375429,0.693147,2020,9,1,1,3,...,7.071068e-01,-1.0,-1.836970e-16,0.781831,0.623490,0,0,0,1.981562,2.560114
48327,100007049,0.0,48.846028,2.375429,0.000000,2020,9,1,1,4,...,5.000000e-01,-1.0,-1.836970e-16,0.781831,0.623490,0,0,0,1.981562,2.560114
48330,100007049,4.0,48.846028,2.375429,1.609438,2020,9,1,1,15,...,-7.071068e-01,-1.0,-1.836970e-16,0.781831,0.623490,0,0,0,1.981562,2.560114
48333,100007049,9.0,48.846028,2.375429,2.302585,2020,9,1,1,18,...,-1.836970e-16,-1.0,-1.836970e-16,0.781831,0.623490,1,0,0,1.981562,2.560114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
929175,300014702,445.0,48.839770,2.301980,6.100319,2021,9,9,3,6,...,6.123234e-17,-1.0,-1.836970e-16,0.433884,-0.900969,0,0,0,3.290936,1.605420
929178,300014702,145.0,48.839770,2.301980,4.983607,2021,9,9,3,10,...,-8.660254e-01,-1.0,-1.836970e-16,0.433884,-0.900969,0,0,0,3.290936,1.605420
929181,300014702,218.0,48.839770,2.301980,5.389072,2021,9,9,3,15,...,-7.071068e-01,-1.0,-1.836970e-16,0.433884,-0.900969,0,0,0,3.290936,1.605420
929184,300014702,21.0,48.839770,2.301980,3.091042,2021,9,9,3,22,...,8.660254e-01,-1.0,-1.836970e-16,0.433884,-0.900969,0,0,0,3.290936,1.605420


In [90]:
processed_data.head()

Unnamed: 0,site_id,bike_count,latitude,longitude,log_bike_count,year,month,day,weekday,hour,...,hour_cos,month_sin,month_cos,weekday_sin,weekday_cos,is_rush_hour,is_weekend,is_holiday,site_avg_log_bike_count,daily_variance
48321,100007049,0.0,48.846028,2.375429,0.0,2020,9,1,1,2,...,0.8660254,-1.0,-1.83697e-16,0.781831,0.62349,0,0,0,1.981562,2.560114
48324,100007049,1.0,48.846028,2.375429,0.693147,2020,9,1,1,3,...,0.7071068,-1.0,-1.83697e-16,0.781831,0.62349,0,0,0,1.981562,2.560114
48327,100007049,0.0,48.846028,2.375429,0.0,2020,9,1,1,4,...,0.5,-1.0,-1.83697e-16,0.781831,0.62349,0,0,0,1.981562,2.560114
48330,100007049,4.0,48.846028,2.375429,1.609438,2020,9,1,1,15,...,-0.7071068,-1.0,-1.83697e-16,0.781831,0.62349,0,0,0,1.981562,2.560114
48333,100007049,9.0,48.846028,2.375429,2.302585,2020,9,1,1,18,...,-1.83697e-16,-1.0,-1.83697e-16,0.781831,0.62349,1,0,0,1.981562,2.560114
