# RAMP on predicting cyclist traffic in Paris

Authors: *Roman Yurchak (Symerio)*; also partially inspired by the air_passengers starting kit.


## Introduction

The dataset was collected with cyclist counters installed by Paris city council in multiple locations. It contains hourly information about cyclist traffic, as well as the following features,
 - counter name
 - counter site name
 - date
 - counter installation date
 - latitude and longitude
 
Available features are quite scarce. However, **we can also use any external data that can help us to predict the target variable.** 

In [49]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.activations import linear, relu, sigmoid
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

## Loading the data with pandas

First, download the data files,
 - [train.parquet](https://github.com/rth/bike_counters/releases/download/v0.1.0/train.parquet)
 - [test.parquet](https://github.com/rth/bike_counters/releases/download/v0.1.0/test.parquet)

and put them into the `data` folder.


Data is stored in [Parquet format](https://parquet.apache.org/), an efficient columnar data format. We can load the train set with pandas,

In [2]:
counters_train = pd.read_parquet(Path("data") / "train.parquet")

In [3]:
counters_train.head()

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,coordinates,counter_technical_id,latitude,longitude,log_bike_count
48321,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.0
48324,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.693147
48327,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,0.0
48330,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,1.609438
48333,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,"48.846028,2.375429",Y2H15027244,48.846028,2.375429,2.302585


In [4]:
counters_train.columns

Index(['counter_id', 'counter_name', 'site_id', 'site_name', 'bike_count',
       'date', 'counter_installation_date', 'coordinates',
       'counter_technical_id', 'latitude', 'longitude', 'log_bike_count'],
      dtype='object')

In [5]:
counters_train.drop(columns=counters_train.columns
                    .difference(['date','site_name','counter_name','bike_count',
                                 'counter_installation_date','log_bike_count']),
                    inplace=True)

In [6]:
counters_train

Unnamed: 0,counter_name,site_name,bike_count,date,counter_installation_date,log_bike_count
48321,28 boulevard Diderot E-O,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,0.000000
48324,28 boulevard Diderot E-O,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,0.693147
48327,28 boulevard Diderot E-O,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,0.000000
48330,28 boulevard Diderot E-O,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,1.609438
48333,28 boulevard Diderot E-O,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,2.302585
...,...,...,...,...,...,...
928450,254 rue de Vaugirard SO-NE,254 rue de Vaugirard,51.0,2021-08-08 18:00:00,2020-11-29,3.951244
928453,254 rue de Vaugirard SO-NE,254 rue de Vaugirard,1.0,2021-08-09 02:00:00,2020-11-29,0.693147
928456,254 rue de Vaugirard SO-NE,254 rue de Vaugirard,61.0,2021-08-09 08:00:00,2020-11-29,4.127134
928459,254 rue de Vaugirard SO-NE,254 rue de Vaugirard,44.0,2021-08-09 10:00:00,2020-11-29,3.806662


## Feature extraction

To account for the temporal aspects of the data, we cannot input the `date` field directly into the model. Instead we extract the features on different time-scales from the `date` field, 

In [17]:
def _encode_dates(X):
    X = X.copy()  # modify a copy of X
    # Encode the date information from the `date` column
    X.loc[:, "year"] = X["date"].dt.year
    X.loc[:, "month"] = X["date"].dt.month
    X.loc[:, "day"] = X["date"].dt.day
    X.loc[:, "weekday"] = X["date"].dt.weekday
    X.loc[:, "hour"] = X["date"].dt.hour

    # Finally we can drop the original columns from the dataframe
    return X.drop(columns=["date"])

To use this function with scikit-learn estimators we wrap it with [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html),

In [18]:
from sklearn.preprocessing import FunctionTransformer

date_encoder = FunctionTransformer(_encode_dates, validate=False)
date_encoder.fit_transform(counters_train[["date"]]).head()

Unnamed: 0,year,month,day,weekday,hour
48321,2020,9,1,1,2
48324,2020,9,1,1,3
48327,2020,9,1,1,4
48330,2020,9,1,1,15
48333,2020,9,1,1,18
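As a quick sanity check on the table above, the sketch below shows the pandas weekday convention (Monday is 0, Sunday is 6) on three hand-picked timestamps. The `is_weekend` column is a hypothetical extra feature added for illustration only; it is not part of the starting kit.

```python
import pandas as pd

# Three hand-picked timestamps: a Tuesday, a Saturday and a Sunday
dates = pd.Series(
    pd.to_datetime(["2020-09-01 02:00", "2020-09-05 18:00", "2020-09-06 08:00"])
)

parts = pd.DataFrame(
    {
        "weekday": dates.dt.weekday,      # Monday=0 ... Sunday=6
        "day_name": dates.dt.day_name(),  # human-readable check
        "hour": dates.dt.hour,
    }
)

# Hypothetical derived feature (an assumption, not in the original notebook)
parts["is_weekend"] = parts["weekday"] >= 5

print(parts)
```

This confirms that `weekday == 1` for 2020-09-01 in the output above denotes a Tuesday.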


Since it is unlikely that, for instance, `hour` is linearly correlated with the target variable, we need to additionally encode these features as categorical for linear models. This is classically done with [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), though other encoding strategies exist.

In [19]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)  # this argument is named `sparse_output` in scikit-learn >= 1.2

enc.fit_transform(_encode_dates(counters_train[["date"]])[["hour"]].tail())

array([[0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.]])
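The array above has only five columns because the encoder was fitted on five rows, and it creates one column per distinct value seen during `fit`. The following sketch, on made-up hour values, contrasts that behaviour with fixing the categories up front so that every hour of the day gets a column. The toy values and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up hours standing in for the last five rows of the training set
hours = pd.DataFrame({"hour": [18, 2, 8, 10, 23]})

# Categories inferred from the data: one column per distinct hour seen
enc_seen = OneHotEncoder()
seen = enc_seen.fit_transform(hours).toarray()

# Categories fixed in advance: one column for each of the 24 hours
enc_full = OneHotEncoder(categories=[list(range(24))])
full = enc_full.fit_transform(hours).toarray()

print(seen.shape)
print(full.shape)
```

When the encoder is fitted on the full training set, as in the pipeline later in this notebook, all 24 hours are normally present, so the inferred categories already cover the whole day.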

## Linear model

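Let's now construct our first model. A [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) regressor is defined as a linear baseline, but the pipeline fitted below uses an XGBoost regressor (`XGBRegressor`). We first load a file of external hourly data, then use a few helper functions defined in `problem.py` of the starting kit to load the public train and test data: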

In [38]:
ext_intense20 = pd.read_csv('ext_data/ext_intense20.csv')
ext_intense20.drop('Unnamed: 0', axis=1, inplace=True)
# Parse the timestamps of this frame (the original referenced an undefined `ext_conf12`)
ext_intense20["date"] = pd.to_datetime(ext_intense20["date"], format="%m/%d/%Y %H:%M")
ext_intense20

Unnamed: 0,date,conf,ww_mix,pmer,ff,t,u,vv,ww,nbas,...,etat_sol,rr1,rr3,hourly,Intensev1,Intensev2,vacances_zone_c,bank_days,strike_rate,Vehicules
0,2020-09-01 00:00:00,0,1.0,102050,1.6,12.6,81,30000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,81.338753
1,2020-09-01 01:00:00,0,0.0,102050,1.6,12.6,81,30000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,81.338753
2,2020-09-01 02:00:00,0,0.0,102050,1.6,12.6,81,30000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,81.338753
3,2020-09-01 03:00:00,0,2.0,101990,1.1,10.8,88,25000,2,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,81.338753
4,2020-09-01 04:00:00,0,0.0,101990,1.1,10.8,88,25000,2,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,81.338753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9884,2021-10-17 20:00:00,0,0.0,102050,2.2,12.5,68,25000,3,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,66.418217
9885,2021-10-17 21:00:00,0,1.0,102120,1.1,9.6,82,25000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,66.418217
9886,2021-10-17 22:00:00,0,0.0,102120,1.1,9.6,82,25000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,66.418217
9887,2021-10-17 23:00:00,0,0.0,102120,1.1,9.6,82,25000,1,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0.0,66.418217


In [57]:
import problem

X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()

In [58]:
X_train = pd.merge_asof(
        X_train, ext_intense20, on="date"
)

In [59]:
X_train.shape

(455163, 31)
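`pd.merge_asof` does not require exact key matches: with its default `direction="backward"`, each left row receives the most recent right row whose `date` is less than or equal to its own, and both frames must be sorted on the key. A minimal sketch on made-up data (the frames `counts` and `weather` and their values are assumptions for illustration):

```python
import pandas as pd

# Left frame: hourly counter rows (toy values)
counts = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-09-01 01:00", "2020-09-01 02:00", "2020-09-01 04:00"]
        ),
        "counter_name": ["A", "A", "A"],
    }
)

# Right frame: external measurements available only every three hours
weather = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-09-01 00:00", "2020-09-01 03:00"]),
        "t": [12.6, 10.8],
    }
)

# Both inputs must be sorted on the key; the row order of the left frame is kept
merged = pd.merge_asof(
    counts.sort_values("date"), weather.sort_values("date"), on="date"
)
print(merged)
```

Because the left row order is preserved, the merged features stay aligned with `y_train` as long as `X_train` was already sorted by `date` before the merge.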

In [60]:
X_train.drop(columns=X_train.columns
                    .intersection(['counter_id','site_id','coordinates','bike_count',
                                 'counter_technical_id','log_bike_count','latitude','longitude']),
                    inplace=True)

In [61]:
X_test = pd.merge_asof(
        X_test, ext_intense20, on="date"
    )

X_test.drop(columns=X_test.columns
                    .intersection(['counter_id','site_id','coordinates','bike_count',
                                 'counter_technical_id','log_bike_count','latitude','longitude']),
                    inplace=True)


`y_train` and `y_test` contain the `log_bike_count` target variable.

The test set is in the future as compared to the train set,

In [62]:
print(
    f'Train: n_samples={X_train.shape[0]},  {X_train["date"].min()} to {X_train["date"].max()}'
)
print(
    f'Test: n_samples={X_test.shape[0]},  {X_test["date"].min()} to {X_test["date"].max()}'
)

Train: n_samples=455163,  2020-09-01 01:00:00 to 2021-08-09 23:00:00
Test: n_samples=41608,  2021-08-10 01:00:00 to 2021-09-09 23:00:00


In [63]:
_encode_dates(X_train[["date"]]).columns.tolist()

['year', 'month', 'day', 'weekday', 'hour']

In [64]:
X_train.columns

Index(['counter_name', 'site_name', 'date', 'counter_installation_date',
       'conf', 'ww_mix', 'pmer', 'ff', 't', 'u', 'vv', 'ww', 'nbas', 'pres',
       'raf10', 'etat_sol', 'rr1', 'rr3', 'hourly', 'Intensev1', 'Intensev2',
       'vacances_zone_c', 'bank_days', 'strike_rate', 'Vehicules'],
      dtype='object')

In [65]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline


date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train[["date"]]).columns.tolist()

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = ["counter_name", "site_name"]

categorical_cols2 = ["counter_name", "site_name", "etat_sol", "ww", "conf", "hourly"]

numerical_cols = ['rr1','Vehicules']


preprocessor = ColumnTransformer(
    [
        ("date", OneHotEncoder(handle_unknown="ignore"), date_cols),
        ("cat", categorical_encoder, categorical_cols2),
        ("num", "passthrough", numerical_cols),
    ]
)

# Gradient-boosted trees are the model actually fitted below;
# the Ridge regressor is kept only as an unused linear baseline.
xreg = xgb.XGBRegressor(max_depth=13, objective='reg:squarederror', learning_rate=0.2, n_estimators=110)
regressor = Ridge()

pipe = make_pipeline(date_encoder, 
                     preprocessor,
                     xreg)

pipe.fit(X_train, y_train)

Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _encode_dates at 0x0000018DF746DB80>)),
                ('columntransformer',
                 ColumnTransformer(transformers=[('date',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['year', 'month', 'day',
                                                   'weekday', 'hour']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['counter_name', 'site_name',
                                                   'etat_sol', 'ww', 'conf',
                                                   'h...
                              feature_types=None, gamma=0, gpu_id=-1,
                              grow_policy='depthwise', importance_type=None,
                           

We then evaluate this model with the RMSE metric,

In [66]:
from sklearn.metrics import mean_squared_error

print(
    f"Train set, RMSE={mean_squared_error(y_train, pipe.predict(X_train), squared=False):.3f}"
)
print(
    f"Test set, RMSE={mean_squared_error(y_test, pipe.predict(X_test), squared=False):.3f}"
)

Train set, RMSE=0.314
Test set, RMSE=0.431
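Since the target is `log_bike_count`, which equals `log(1 + bike_count)` in the table shown earlier, the RMSE above is measured on the log scale. The sketch below computes the same metric by hand with NumPy on made-up values; the helper name `rmse` is an assumption, not a function from the starting kit.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy targets on the log scale, matching log(1 + bike_count)
y_true = np.log1p([0, 1, 4, 9])
# Toy predictions that are off by exactly 0.5 in either direction
y_pred = y_true + np.array([0.5, -0.5, 0.5, -0.5])

print(rmse(y_true, y_pred))
```

Every prediction here is off by 0.5 on the log scale, so the RMSE is 0.5 up to floating-point rounding.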


## Grid search

In [67]:
X = pd.concat([X_train, X_test], axis=0)
Y = pd.concat([pd.Series(y_train), pd.Series(y_test)], axis=0)

In [72]:
parameters = {
    #'xgbregressor__max_depth': [8, 10, 12],
    'xgbregressor__n_estimators': [90, 100, 110],
    'xgbregressor__learning_rate': [0.2, 0.15]
}

xgb1 = xgb.XGBRegressor(max_depth=8, objective='reg:squarederror')
pipe_xgb = make_pipeline(date_encoder, preprocessor, xgb1)

In [80]:
print("The hyper-parameters for the full pipeline are:")
for param_name in pipe_xgb.get_params().keys():
    print(param_name)

The hyper-parameters for the full pipeline are:
memory
steps
verbose
functiontransformer
columntransformer
xgbregressor
functiontransformer__accept_sparse
functiontransformer__check_inverse
functiontransformer__func
functiontransformer__inv_kw_args
functiontransformer__inverse_func
functiontransformer__kw_args
functiontransformer__validate
columntransformer__n_jobs
columntransformer__remainder
columntransformer__sparse_threshold
columntransformer__transformer_weights
columntransformer__transformers
columntransformer__verbose
columntransformer__verbose_feature_names_out
columntransformer__date
columntransformer__cat
columntransformer__num
columntransformer__date__categories
columntransformer__date__drop
columntransformer__date__dtype
columntransformer__date__handle_unknown
columntransformer__date__sparse
columntransformer__cat__categories
columntransformer__cat__drop
columntransformer__cat__dtype
columntransformer__cat__handle_unknown
columntransformer__cat__sparse
xgbregressor__obj
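The names listed above follow the `<step>__<parameter>` convention: `make_pipeline` names each step after its lowercased class name, and nested parameters are addressed with a double underscore. A small sketch with a toy pipeline (the `StandardScaler` and `Ridge` steps are chosen for illustration only):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = make_pipeline(StandardScaler(), Ridge())

# Step names are derived from the class names, in lowercase
step_names = [name for name, _ in toy_pipe.steps]
print(step_names)

# A nested parameter is set through "<step>__<parameter>"
toy_pipe.set_params(ridge__alpha=10.0)
print(toy_pipe.get_params()["ridge__alpha"])
```

The same convention explains the `xgbregressor__` prefix in the keys of the parameter grid.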

In [73]:
clf = GridSearchCV(estimator=pipe_xgb,
                    param_grid=parameters,
                  n_jobs = 13,
                  scoring='neg_root_mean_squared_error',
                  cv = 6)
clf.fit(X, Y)

GridSearchCV(cv=6,
             estimator=Pipeline(steps=[('functiontransformer',
                                        FunctionTransformer(func=<function _encode_dates at 0x0000018DF746DB80>)),
                                       ('columntransformer',
                                        ColumnTransformer(transformers=[('date',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['year',
                                                                          'month',
                                                                          'day',
                                                                          'weekday',
                                                                          'hour']),
                                                                        ('cat',
                                                     

In [74]:
print(clf.best_score_)

-0.7585590881221868


In [75]:
print(clf.best_params_)

{'xgbregressor__learning_rate': 0.2, 'xgbregressor__n_estimators': 110}
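The score above is a negated RMSE, so -0.759 corresponds to a cross-validated RMSE of about 0.759. Note that with an integer `cv` and a regressor, scikit-learn uses unshuffled `KFold`, so on time-ordered data some folds are trained on later periods and validated on earlier ones; the search above also includes the test rows in `X`. One possible alternative, sketched below on synthetic data with a `Ridge` stand-in, is `TimeSeriesSplit`, which always validates on samples that come after the training samples. This is an illustration under those assumptions, not what the notebook ran.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered toy data (an assumption for illustration)
rng = np.random.default_rng(0)
n = 300
X_toy = rng.normal(size=(n, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Each split trains on an initial segment and validates on the next one
cv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in cv.split(X_toy):
    assert train_idx.max() < valid_idx.min()

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X_toy, y_toy)

# best_score_ is the negated RMSE, so flip the sign to report it
print(search.best_params_, -search.best_score_)
```

With the real data, the same `cv` object could be passed to the `GridSearchCV` call above, provided the rows are sorted by `date`.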


In [82]:
import catboost as cb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train[["date"]]).columns.tolist()

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = ["counter_name", "site_name", "etat_sol", "ww", "conf", "hourly"]

# Unlike the XGBoost pipeline above, only `rr1` is passed through here
numerical_cols = ['rr1']

preprocessor = ColumnTransformer(
    [
        ("date", OneHotEncoder(handle_unknown="ignore"), date_cols),
        ("cat", categorical_encoder, categorical_cols),
        ("num", "passthrough", numerical_cols),
    ]
)

CatReg = cb.CatBoostRegressor()

pipe_cat = make_pipeline(date_encoder, preprocessor, CatReg)

#pipe.fit(X_train, y_train)

In [83]:
print("The hyper-parameters of the CatBoost regressor are:")
for param_name in CatReg.get_params().keys():
    print(param_name)

The hyper-parameters of the CatBoost regressor are:
loss_function


In [None]:
grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

In [None]:
# Grid search

# The pipeline step is named "catboostregressor", so the "xgbregressor__" keys
# would be rejected. This assumes the `grid` defined in the previous cell is
# the intended CatBoost search space.
parameters = {f"catboostregressor__{name}": values for name, values in grid.items()}

CatReg = cb.CatBoostRegressor(loss_function='RMSE')
pipe_cat = make_pipeline(date_encoder, preprocessor, CatReg)

clf = GridSearchCV(estimator=pipe_cat,
                   param_grid=parameters,
                   n_jobs=13,
                   scoring='neg_root_mean_squared_error',
                   cv=6)
clf.fit(X, Y)
