## Preprocessing and feature engineering

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import holidays

In this notebook, we prepare the data for modelling.

### Verifying steps from EDA

In [2]:
data = pd.read_parquet(Path("data") / "train.parquet")

In [3]:
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 counter_id                   0
counter_name                 0
site_id                      0
site_name                    0
bike_count                   0
date                         0
counter_installation_date    0
coordinates                  0
counter_technical_id         0
latitude                     0
longitude                    0
log_bike_count               0
dtype: int64


No column in the training set contains missing values, so no imputation is needed.

In [4]:
print(data.describe())

            site_id     bike_count                        date  \
count  4.968270e+05  496827.000000                      496827   
mean   1.053450e+08      60.191475  2021-03-08 07:25:59.668858   
min    1.000070e+08       0.000000         2020-09-01 01:00:00   
25%    1.000475e+08       5.000000         2020-12-05 22:00:00   
50%    1.000562e+08      29.000000         2021-03-08 11:00:00   
75%    1.000563e+08      79.000000         2021-06-09 14:00:00   
max    3.000147e+08    1302.000000         2021-09-09 23:00:00   
std    3.210346e+07      87.590566                         NaN   

        counter_installation_date       latitude      longitude  \
count                      496827  496827.000000  496827.000000   
mean   2019-04-04 07:24:35.245911      48.854343       2.345479   
min           2013-01-18 00:00:00      48.826360       2.265420   
25%           2018-11-29 00:00:00      48.840801       2.314440   
50%           2019-11-06 00:00:00      48.852090       2.353870   
75%

### Feature Extraction 
The model cannot consume the raw `date` field directly. To capture the temporal structure of the data, we instead extract features at several time scales from `date`:

In [5]:
# Extract calendar features from `date` and add cyclical (sin/cos) encodings
def _encode_dates(X):
    X = X.copy()
    X["year"] = X["date"].dt.year
    X["month"] = X["date"].dt.month
    X["day"] = X["date"].dt.day
    X["weekday"] = X["date"].dt.weekday
    X["hour"] = X["date"].dt.hour

    # Apply cyclical encoding
    X["hour_sin"] = np.sin(2 * np.pi * X["hour"] / 24)
    X["hour_cos"] = np.cos(2 * np.pi * X["hour"] / 24)
    X["month_sin"] = np.sin(2 * np.pi * X["month"] / 12)
    X["month_cos"] = np.cos(2 * np.pi * X["month"] / 12)
    X["weekday_sin"] = np.sin(2 * np.pi * X["weekday"] / 7)
    X["weekday_cos"] = np.cos(2 * np.pi * X["weekday"] / 7)

    # Drop the original date column
    return X.drop(columns=["date"])

Why cyclical encoding?

Cyclical encoding was chosen for the temporal features (`hour`, `month`, and `weekday`) to capture their periodic structure:
- `23:00` is close to `00:00`, and December is close to January; mapping each value to a (sin, cos) pair keeps such neighbours close.
- It avoids the artificial jump that plain integer encoding introduces at the wrap-around (for example, from hour 23 to hour 0), and it preserves the adjacency that one-hot encoding discards.

While one-hot encoding could be used for features like `weekday` (as days may not always exhibit strong cyclical patterns), we opted for consistency and simplicity at this stage. Future iterations can experiment with hybrid approaches to evaluate their impact on model performance. 
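To make the wrap-around argument concrete, here is a minimal sketch that maps hours onto the unit circle with the same formula as `_encode_dates` and compares distances. The helpers `encode_hour` and `distance` are hypothetical names introduced only for this illustration; they are not part of the notebook's pipeline.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour (0-23) to a (sin, cos) point on the unit circle (illustrative helper)."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def distance(hour_a, hour_b):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))


# 23:00 and 00:00 are one hour apart, just like 00:00 and 01:00
d_wrap = distance(23, 0)
d_adjacent = distance(0, 1)
# 00:00 and 12:00 sit on opposite sides of the circle
d_opposite = distance(0, 12)

print(f"23 vs 0: {d_wrap:.3f} | 0 vs 1: {d_adjacent:.3f} | 0 vs 12: {d_opposite:.3f}")
```

With a plain integer encoding, 23 and 0 would be 23 units apart; in (sin, cos) space they are as close as any other pair of consecutive hours.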

In [6]:
data["date"].head()

48321   2020-09-01 02:00:00
48324   2020-09-01 03:00:00
48327   2020-09-01 04:00:00
48330   2020-09-01 15:00:00
48333   2020-09-01 18:00:00
Name: date, dtype: datetime64[us]

In [7]:
categorical_columns = data.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)

Categorical columns: Index(['counter_id', 'counter_name', 'site_name', 'coordinates',
       'counter_technical_id'],
      dtype='object')


### Feature Interaction 

In [8]:
# TODO: add interaction features (none defined yet)

### Feature Engineering

In [9]:
# Aggregate Statistics by site_id
site_stats = data.groupby('site_id')['bike_count'].agg(['mean', 'std']).reset_index()
site_stats.rename(columns={'mean': 'site_mean_count', 'std': 'site_std_count'}, inplace=True)

data = data.merge(site_stats, on='site_id', how='left')

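**Note: this feature is a candidate for removal, as it may not add much signal.**
Why we use it: sites with similar conditions (e.g., busy vs. quiet areas) might exhibit consistent traffic patterns. Aggregate features help capture site-specific behavior, such as average bike traffic. Because these statistics are computed from the target `bike_count`, they should be derived from training rows only to avoid leaking target information into validation or test data.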
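If the site-level aggregates are kept, one way to use them without leaking the target is to fit them on the training rows and reuse them elsewhere. The sketch below assumes a simple train/validation split; the toy frames `train` and `valid` and the global-mean fallback for unseen sites are illustrative assumptions, not the notebook's actual method.

```python
import pandas as pd

# Toy frames standing in for a train/validation split (values are made up)
train = pd.DataFrame({
    "site_id": [1, 1, 1, 2, 2],
    "bike_count": [10, 20, 30, 100, 200],
})
valid = pd.DataFrame({"site_id": [1, 2, 3]})  # site 3 does not appear in train

# Fit the aggregates on the training rows only
site_stats = (
    train.groupby("site_id")["bike_count"]
    .agg(site_mean_count="mean", site_std_count="std")
    .reset_index()
)

# Reuse the training aggregates for the validation rows
valid = valid.merge(site_stats, on="site_id", how="left")

# Unseen sites get NaN after the left merge; fall back to the global training mean
global_mean = train["bike_count"].mean()
valid["site_mean_count"] = valid["site_mean_count"].fillna(global_mean)

print(valid)
```

The left merge keeps every validation row, so sites missing from the training data surface as `NaN` and need an explicit fallback such as the one shown.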


In [10]:
# Define weekend indicator
def _add_weekend_indicator(X):
    X = X.copy()
    X["is_weekend"] = (X["weekday"] >= 5).astype(int)
    return X

Why: The EDA showed that bike counts are higher on weekdays than on weekends (likely due to commuting). The weekend indicator lets the model capture this difference directly.

In [11]:
# Define rush hour indicator
def _add_rush_hour_indicator(X):
    X = X.copy()
    X["is_rush_hour"] = X["hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)
    return X

Why: The rush hour indicator captures periods of high bike traffic, typically aligned with commuting times. The EDA plot of counts by hour shows peaks around 6-9 AM and 3-6 PM, representing the morning and evening rush hours. The indicator itself flags hours 7-9 and 17-19, which only partly overlap these peaks, so the hour list may need adjusting. This feature helps the model identify patterns specific to high-traffic periods.
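Rather than hard-coding the hours, the rush-hour set could be derived from the data. The following is a minimal sketch on synthetic counts (two artificial peaks at 8:00 and 18:00); the frame `toy` and the choice of six hours are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly profile: flat baseline plus bumps at 8:00 and 18:00 (made-up values)
hours = np.arange(24)
counts = (
    20
    + 80 * np.exp(-((hours - 8) ** 2) / 2)
    + 80 * np.exp(-((hours - 18) ** 2) / 2)
)
# Repeat the profile for three artificial days
toy = pd.DataFrame({"hour": np.tile(hours, 3), "bike_count": np.tile(counts, 3)})

# Mean count per hour of day
hourly_mean = toy.groupby("hour")["bike_count"].mean()

# Keep the six busiest hours, mirroring the size of the hard-coded list
rush_hours = sorted(hourly_mean.nlargest(6).index.tolist())
print(rush_hours)

# Build the indicator from the derived hours
toy["is_rush_hour"] = toy["hour"].isin(rush_hours).astype(int)
```

On real data, the same `groupby` on the training set would show whether the hard-coded hours match the observed peaks.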

In [12]:
# Define holiday indicator
def _add_holiday_indicator(X):
    X = X.copy()
    france_holidays = holidays.FR()  # Use the holidays package for France
    X["is_holiday"] = X["date"].dt.date.apply(lambda x: 1 if x in france_holidays else 0)
    return X

### Drop Features with low predictive value

In [13]:
# Define feature dropping for low predictive value
def _drop_low_value_features(X):
    X = X.copy()
    X.drop(columns=["counter_id", "counter_name", "site_name"], inplace=True)
    return X

### Pipeline

In [18]:
# Preprocessing pipeline.
# The holiday step runs first because it reads the `date` column, which
# `_encode_dates` drops; placing it after `_encode_dates` raises KeyError: 'date'.
preprocessor = Pipeline(steps=[
    ("add_holiday", FunctionTransformer(_add_holiday_indicator, validate=False)),
    ("encode_dates", FunctionTransformer(_encode_dates, validate=False)),
    ("add_rush_hour", FunctionTransformer(_add_rush_hour_indicator, validate=False)),
    ("add_weekend", FunctionTransformer(_add_weekend_indicator, validate=False)),
    ("drop_low_value_features", FunctionTransformer(_drop_low_value_features, validate=False)),
])

# Engineered columns kept in the final output
FEATURE_COLUMNS = [
    "hour_sin", "hour_cos", "month_sin", "month_cos",
    "weekday_sin", "weekday_cos", "is_rush_hour",
    "is_weekend", "is_holiday",
]

def preprocess_data(data):
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Ensure 'date' column is a datetime object
    data["date"] = pd.to_datetime(data["date"], errors="coerce")

    # Drop rows with invalid dates
    data = data.dropna(subset=["date"])

    # Run the pipeline; each step returns a DataFrame with the original index
    processed = preprocessor.fit_transform(data)

    # Keep only the engineered feature columns
    return processed[FEATURE_COLUMNS]

In [19]:
# Preprocess the data
processed_data = preprocess_data(data)

# Debug: Check the output type and preview the data
print(type(processed_data))
print(processed_data.head())

# Save the preprocessed data
save_path = 'data/processed_data.parquet'
processed_data.to_parquet(save_path, index=False)

print(f"Dataset saved to {save_path}")