## Preprocessing and feature engineering

In [2]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
import holidays

ModuleNotFoundError: No module named 'holidays'

In this notebook, we will prepare the data for the modelling 

### Verifying steps from EDA

In [2]:
data = pd.read_parquet(Path("data") / "train.parquet")
print(data.describe())

            site_id     bike_count                        date  \
count  4.968270e+05  496827.000000                      496827   
mean   1.053450e+08      60.191475  2021-03-08 07:25:59.668858   
min    1.000070e+08       0.000000         2020-09-01 01:00:00   
25%    1.000475e+08       5.000000         2020-12-05 22:00:00   
50%    1.000562e+08      29.000000         2021-03-08 11:00:00   
75%    1.000563e+08      79.000000         2021-06-09 14:00:00   
max    3.000147e+08    1302.000000         2021-09-09 23:00:00   
std    3.210346e+07      87.590566                         NaN   

        counter_installation_date       latitude      longitude  \
count                      496827  496827.000000  496827.000000   
mean   2019-04-04 07:24:35.245911      48.854343       2.345479   
min           2013-01-18 00:00:00      48.826360       2.265420   
25%           2018-11-29 00:00:00      48.840801       2.314440   
50%           2019-11-06 00:00:00      48.852090       2.353870   
75%

### Feature Extraction 
To account for the temporal aspects of the data, we cannot input the `date` field directly into the model. Instead we extract the features on different time-scales from the `date` field:

In [3]:
print(data["date"].head())
print(data["date"].isna().sum())
print(data["date"].dtype)

48321   2020-09-01 02:00:00
48324   2020-09-01 03:00:00
48327   2020-09-01 04:00:00
48330   2020-09-01 15:00:00
48333   2020-09-01 18:00:00
Name: date, dtype: datetime64[us]
0
datetime64[us]


In [4]:
# Define cyclical encoding for dates
def _encode_dates(X):
    X = X.copy()
    X["year"] = X["date"].dt.year
    X["month"] = X["date"].dt.month
    X["day"] = X["date"].dt.day
    X["weekday"] = X["date"].dt.weekday
    X["hour"] = X["date"].dt.hour

    # Apply cyclical encoding
    X["hour_sin"] = np.sin(2 * np.pi * X["hour"] / 24)
    X["hour_cos"] = np.cos(2 * np.pi * X["hour"] / 24)
    X["month_sin"] = np.sin(2 * np.pi * X["month"] / 12)
    X["month_cos"] = np.cos(2 * np.pi * X["month"] / 12)
    X["weekday_sin"] = np.sin(2 * np.pi * X["weekday"] / 7)
    X["weekday_cos"] = np.cos(2 * np.pi * X["weekday"] / 7)

    return X

Why we choose Cyclical Encoding?

Cyclical encoding was chosen for temporal features (`hour`, `month`, and `weekday`) to capture their periodic relationships effectively. For instance:
- `23:00` is close to `00:00`, and December is close to January. 
- This encoding avoids artificial discontinuities inherent in one-hot encoding for such features.

While one-hot encoding could be used for features like `weekday` (as days may not always exhibit strong cyclical patterns), we opted for consistency and simplicity at this stage. Future iterations can experiment with hybrid approaches to evaluate their impact on model performance. 

### Feature Interaction 

In [5]:
## add something maybe later

### Feature Engineering

In [6]:
def _add_weekend(X):
    X = X.copy()
    X["is_weekend"] = X["weekday"].isin([5, 6]).astype(int)  # Saturday and Sunday
    return X

Why: We have seen in EDA that bike_counts is higher on weekdays than on weekends (likely due to commuting). This way we capture the differences. 

In [7]:
# Define rush hour indicator
def _add_rush_hour(X):
    X = X.copy()
    X["is_rush_hour"] = ((X["hour"] >= 7) & (X["hour"] <= 9)) | ((X["hour"] >= 15) & (X["hour"] <= 18))
    return X

Why: The rush hour indicator is created to capture periods of high bike traffic, typically aligned with commuting times. It is based on the observation from the graph that bike counts peak around 6-9 AM and 3-6 PM, representing morning and evening rush hours. This feature helps the model identify patterns specific to these high-traffic periods, improving its ability to predict bike counts.

In [8]:
# Define holiday indicator
def _add_holidays(X):
    X = X.copy()
    france_holidays = holidays.FR()  # Use the holidays package for France
    X["is_holiday"] = X["date"].dt.date.apply(lambda x: 1 if x in france_holidays else 0)
    return X

In [9]:
categorical_columns = data.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)

Categorical columns: Index(['counter_id', 'counter_name', 'site_name', 'coordinates',
       'counter_technical_id'],
      dtype='object')


In [10]:
# Define feature dropping for low predictive value and date
def _drop_low_value_features(X):
    X = X.copy()
    X.drop(columns=['date', 'year', 'month', 'day', 'site_id', 'counter_id', 'date', 'counter_installation_date', 
                    'counter_technical_id', 'coordinates', 'latitude', 'longitude'], inplace=True)
    return X

In [11]:
def _encode_categorical_features(X):
    X = X.copy()
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    encoded_features = encoder.fit_transform(X[["counter_name", "site_name"]])
    encoded_df = pd.DataFrame(
        encoded_features,
        columns=encoder.get_feature_names_out(["counter_name", "site_name"]),
        index=X.index
    )
    # Drop original columns and add encoded features
    X = X.drop(columns=["counter_name", "site_name"], errors="ignore")
    X = pd.concat([X, encoded_df], axis=1)
    return X

### Pipeline

In [12]:
preprocessor = Pipeline(steps=[
    ("encode_dates", FunctionTransformer(_encode_dates, validate=False)),
    ("add_rush_hour", FunctionTransformer(_add_rush_hour, validate=False)),
    ("add_weekend", FunctionTransformer(_add_weekend, validate=False)),
    ("add_holiday", FunctionTransformer(_add_holidays, validate=False)),
    ("encode_categorical", FunctionTransformer(_encode_categorical_features, validate=False)),
    ("drop_low_value_features", FunctionTransformer(_drop_low_value_features, validate=False)),
    ("scaling", StandardScaler()) 
])

In [13]:
def preprocess_data(data):
    # Ensure 'date' is in datetime format
    data["date"] = pd.to_datetime(data["date"], errors="coerce")
    
    # Apply the preprocessor pipeline
    processed_array = preprocessor.fit_transform(data)
    
    # Debugging: Print shape of processed_array
    print(f"Processed array shape: {processed_array.shape}")
    
    # Define column names for each step
    cyclical_features = ["hour_sin", "hour_cos", "month_sin", "month_cos", "weekday_sin", "weekday_cos"]
    categorical_features = data[["counter_name", "site_name"]].nunique().sum()  # Add unique encoding column count
    
    column_names = ["year", "day"] + cyclical_features + list(data["counter_name"].unique()) + list(data["site_name"].unique())

    # Check that the number of columns matches the output array
    if len(column_names) != processed_array.shape[1]:
        print(f"Expected columns: {column_names}")
        print(f"Processed columns: {processed_array.shape[1]}")
        raise ValueError(
            f"Column name mismatch: {len(column_names)} names for {processed_array.shape[1]} columns."
        )
    
    # Create DataFrame with the processed data and column names
    processed_data = pd.DataFrame(processed_array, columns=column_names, index=data.index)
    
    return processed_data

In [14]:
# Preprocess the data
processed_data = preprocess_data(data)

Processed array shape: (496827, 99)
Expected columns: ['year', 'day', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 'weekday_sin', 'weekday_cos', '28 boulevard Diderot E-O', '28 boulevard Diderot O-E', '39 quai François Mauriac NO-SE', '39 quai François Mauriac SE-NO', "18 quai de l'Hôtel de Ville NO-SE", "18 quai de l'Hôtel de Ville SE-NO", 'Voie Georges Pompidou NE-SO', 'Voie Georges Pompidou SO-NE', '67 boulevard Voltaire SE-NO', 'Face au 48 quai de la marne NE-SO', 'Face au 48 quai de la marne SO-NE', "Face 104 rue d'Aubervilliers N-S", "Face 104 rue d'Aubervilliers S-N", 'Face au 70 quai de Bercy N-S', 'Face au 70 quai de Bercy S-N', '6 rue Julia Bartet NE-SO', '6 rue Julia Bartet SO-NE', "Face au 25 quai de l'Oise NE-SO", "Face au 25 quai de l'Oise SO-NE", '152 boulevard du Montparnasse E-O', '152 boulevard du Montparnasse O-E', 'Totem 64 Rue de Rivoli E-O', 'Totem 64 Rue de Rivoli O-E', 'Pont des Invalides S-N', 'Pont de la Concorde S-N', 'Pont des Invalides N-S', 'Face au 8

ValueError: Column name mismatch: 94 names for 99 columns.

In [15]:
# Save the preprocessed data
save_path = 'data/processed_data.parquet'
processed_data.to_parquet(save_path, index=False)

print(f"Dataset saved to {save_path}")

Dataset saved to data/processed_data.parquet


In [16]:
processed_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,89,90,91,92,93,94,95,96,97,98
48321,-0.687192,-1.855487,-0.998576,-1.373043,0.707071,1.225165,-1.369512,0.010388,1.096960,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48324,-0.675775,-1.437903,-0.998576,-1.228553,0.999935,1.000397,-1.369512,0.010388,1.096960,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48327,-0.687192,-1.855487,-0.998576,-1.084064,1.224657,0.707475,-1.369512,0.010388,1.096960,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48330,-0.641525,-0.885886,-0.998576,0.505318,-0.999865,-0.999804,-1.369512,0.010388,1.096960,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48333,-0.584441,-0.468302,-0.998576,0.938786,-1.414037,0.000297,-1.369512,0.010388,1.096960,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
929175,4.393269,1.819632,0.003924,-0.795085,1.414107,0.000297,-1.369512,0.010388,0.605143,-1.272475,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
929178,0.968239,1.146872,0.003924,-0.217128,0.707071,-1.224571,-1.369512,0.010388,0.605143,-1.272475,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
929181,1.801663,1.391143,0.003924,0.505318,-0.999865,-0.999804,-1.369512,0.010388,0.605143,-1.272475,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
929184,-0.447440,0.006702,0.003924,1.516743,-0.707001,1.225165,-1.369512,0.010388,0.605143,-1.272475,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445


In [17]:
processed_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,89,90,91,92,93,94,95,96,97,98
48321,-0.687192,-1.855487,-0.998576,-1.373043,0.707071,1.225165,-1.369512,0.010388,1.09696,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48324,-0.675775,-1.437903,-0.998576,-1.228553,0.999935,1.000397,-1.369512,0.010388,1.09696,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48327,-0.687192,-1.855487,-0.998576,-1.084064,1.224657,0.707475,-1.369512,0.010388,1.09696,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48330,-0.641525,-0.885886,-0.998576,0.505318,-0.999865,-0.999804,-1.369512,0.010388,1.09696,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
48333,-0.584441,-0.468302,-0.998576,0.938786,-1.414037,0.000297,-1.369512,0.010388,1.09696,0.884634,...,-0.193595,-0.135628,-0.135628,-0.135628,-0.193595,-0.193595,-0.193595,-0.193595,-0.19359,-0.188445
