# RAMP on predicting cyclist traffic in Paris

Authors: *Zhexuan Qiu & Xinyu Liu*; also partially inspired by the starting kit.


## Introduction

The dataset was collected by cyclist counters that the Paris city council installed at multiple locations. It contains hourly cyclist counts, along with the following features:
 - counter name
 - counter site name
 - date
 - counter installation date
 - latitude and longitude
 
Available features are quite scarce. However, **we can also use any external data that can help us to predict the target variable.** 

In [16]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

# Loading the data with pandas

First, download the data files,
 - [train.parquet](https://github.com/rth/bike_counters/releases/download/v0.1.0/train.parquet)
 - [test.parquet](https://github.com/rth/bike_counters/releases/download/v0.1.0/test.parquet)

and put them into the `data` folder.


Data is stored in [Parquet format](https://parquet.apache.org/), an efficient columnar data format. We can load the train and test sets with pandas:

In [17]:
data_train = pd.read_parquet(Path('data') / 'train.parquet')
data_test =  pd.read_parquet(Path('data') / 'test.parquet')


In [18]:
data_train.head()

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,counter_technical_id,latitude,longitude,log_bike_count
48321,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,0.0
48324,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,0.693147
48327,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,0.0
48330,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,1.609438
48333,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,2.302585
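
The `log_bike_count` values in the rows above are consistent with `log(1 + bike_count)`: for example, a count of 1 gives 0.693147, which is log 2. The following is a minimal check of that reading, assuming this is how the column was derived (the source does not state it explicitly).

```python
import numpy as np

# bike_count values copied from the five rows shown above
bike_count = np.array([0.0, 1.0, 0.0, 4.0, 9.0])

# Assumption: log_bike_count = log(1 + bike_count)
log_bike_count = np.log1p(bike_count)
print(np.round(log_bike_count, 6))
```

If this holds, a prediction on the log scale could be mapped back to a count with `np.expm1`.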


In [19]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 455163 entries, 48321 to 928462
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   counter_id                 455163 non-null  category      
 1   counter_name               455163 non-null  category      
 2   site_id                    455163 non-null  int64         
 3   site_name                  455163 non-null  category      
 4   bike_count                 455163 non-null  float64       
 5   date                       455163 non-null  datetime64[ns]
 6   counter_installation_date  455163 non-null  datetime64[ns]
 7   counter_technical_id       455163 non-null  category      
 8   latitude                   455163 non-null  float64       
 9   longitude                  455163 non-null  float64       
 10  log_bike_count             455163 non-null  float64       
dtypes: category(4), datetime64[ns](2), float64(4), i

In [20]:
data_train.nunique(axis=0)

counter_id                     56
counter_name                   56
site_id                        30
site_name                      30
bike_count                    977
date                         8230
counter_installation_date      22
counter_technical_id           30
latitude                       30
longitude                      30
log_bike_count                977
dtype: int64

In [21]:
data_train.groupby(['site_name', 'counter_name'])['bike_count'].sum().sort_values(ascending=False).head(10).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,bike_count
site_name,counter_name,Unnamed: 2_level_1
Totem 73 boulevard de Sébastopol,Totem 73 boulevard de Sébastopol S-N,1809231.0
Totem 64 Rue de Rivoli,Totem 64 Rue de Rivoli O-E,1406900.0
Totem 73 boulevard de Sébastopol,Totem 73 boulevard de Sébastopol N-S,1357868.0
67 boulevard Voltaire SE-NO,67 boulevard Voltaire SE-NO,1036575.0
Totem 64 Rue de Rivoli,Totem 64 Rue de Rivoli E-O,914089.0
27 quai de la Tournelle,27 quai de la Tournelle SE-NO,888717.0
Quai d'Orsay,Quai d'Orsay E-O,849724.0
Totem Cours la Reine,Totem Cours la Reine O-E,806149.0
Face au 48 quai de la marne,Face au 48 quai de la marne SO-NE,806071.0
Face au 48 quai de la marne,Face au 48 quai de la marne NE-SO,759194.0


In [22]:
print('train dataset starts from ' + str(data_train['date'].min()) + ' and ends at ' + str(data_train['date'].max()))
print('test dataset starts from ' + str(data_test['date'].min()) + ' and ends at ' + str(data_test['date'].max()))

train dataset starts from 2020-09-01 01:00:00 and ends at 2021-08-09 23:00:00
test dataset starts from 2021-08-10 01:00:00 and ends at 2021-09-09 23:00:00


## Using external data

The starting kit provides weather data from Meteo France, which could correlate with cyclist traffic. The match is only approximate, however, as the station is in Orly (15 km from Paris) and reports every 3 hours. Here we also add daily COVID-19 figures for France and flags for workdays, lockdowns, and curfews.

To load the external data,

In [23]:
# Combine all the external data into one file

covid_data_path = Path('data') /  'france_covid_data.csv'
df_covid = pd.read_csv(covid_data_path, parse_dates=['date_without_time'])
df_covid.head()

Unnamed: 0,date_without_time,total_cases,new_cases,new_deaths
0,2020-09-01,326264,5104.0,27
1,2020-09-02,333351,7087.0,26
2,2020-09-03,340473,7122.0,17
3,2020-09-04,349333,8860.0,-20
4,2020-09-05,357927,8594.0,12


In [24]:
weather_data_path = Path('data') /  'weather_data.csv'
df_weather = pd.read_csv(weather_data_path, parse_dates=['date'])
df_weather.head()

Unnamed: 0,numer_sta,date,pmer,tend,cod_tend,dd,ff,t,td,u,...,hnuage1,nnuage2,ctype2,hnuage2,nnuage3,ctype3,hnuage3,nnuage4,ctype4,hnuage4
0,7149,2021-01-01 00:00:00,100810,80,1,270,1.8,272.75,272.15,96,...,600.0,,,,,,,,,
1,7149,2021-01-01 03:00:00,100920,110,3,300,1.7,271.25,270.95,98,...,1500.0,2.0,3.0,3000.0,,,,,,
2,7149,2021-01-01 06:00:00,100950,30,3,290,2.6,271.95,271.65,98,...,480.0,4.0,6.0,2000.0,6.0,3.0,3000.0,,,
3,7149,2021-01-01 09:00:00,101100,150,2,280,1.7,272.45,272.05,97,...,1740.0,3.0,3.0,2800.0,,,,,,
4,7149,2021-01-01 12:00:00,101110,30,0,50,1.0,276.95,274.15,82,...,330.0,4.0,6.0,570.0,7.0,6.0,810.0,,,


In [25]:
df_weather['date_without_time'] = pd.to_datetime(df_weather['date'].dt.date)

In [26]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date_without_time  374 non-null    datetime64[ns]
 1   total_cases        374 non-null    int64         
 2   new_cases          373 non-null    float64       
 3   new_deaths         374 non-null    int64         
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 11.8 KB


In [27]:
def _merge_data(dt1, dt2):
    """Join the daily frame dt2 onto the selected weather columns of dt1."""
    dt1 = dt1.copy()
    # Remember the original row order so it can be restored after the merge
    dt1['orig_index'] = np.arange(dt1.shape[0])
    weather_cols = ['date', 'dd', 'ff', 't', 'td', 'u', 'vv', 'date_without_time', 'orig_index']
    # Inner join on the calendar day: each daily row is repeated for every weather row of that day
    dt1 = pd.merge(
        dt1[weather_cols].sort_values('date_without_time'),
        dt2.sort_values('date_without_time'),
        on='date_without_time',
    )
    # Sort back to the original order
    dt1 = dt1.sort_values('orig_index')
    del dt1['orig_index']
    return dt1
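
To make the join above concrete, here is a small sketch on made-up data. It shows how a daily frame is broadcast onto 3-hourly rows through a shared `date_without_time` key, and what the default inner join does with days that have no match. The toy frames `weather` and `covid` and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy 3-hourly weather rows spanning two days (values are made up)
weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 03:00", "2021-01-02 00:00"]),
    "t": [272.75, 271.25, 270.0],
})
weather["date_without_time"] = weather["date"].dt.normalize()

# Toy daily COVID rows; 2021-01-02 is deliberately missing
covid = pd.DataFrame({
    "date_without_time": pd.to_datetime(["2021-01-01"]),
    "new_cases": [19358.0],
})

# Default inner join: the daily row is repeated for each matching 3-hourly row,
# and weather rows without a matching day are dropped
inner = pd.merge(weather, covid, on="date_without_time")

# A left join keeps every weather row and fills the gaps with NaN
left = pd.merge(weather, covid, on="date_without_time", how="left")

print(len(inner), len(left))
```

Whether dropping or keeping unmatched rows is preferable depends on how the missing values are handled downstream.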

In [28]:
merged_df = _merge_data(df_weather, df_covid)

In [29]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2989 entries, 976 to 232
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               2989 non-null   datetime64[ns]
 1   dd                 2989 non-null   int64         
 2   ff                 2989 non-null   float64       
 3   t                  2989 non-null   float64       
 4   td                 2989 non-null   float64       
 5   u                  2989 non-null   int64         
 6   vv                 2989 non-null   int64         
 7   date_without_time  2989 non-null   datetime64[ns]
 8   total_cases        2989 non-null   int64         
 9   new_cases          2981 non-null   float64       
 10  new_deaths         2989 non-null   int64         
dtypes: datetime64[ns](2), float64(4), int64(5)
memory usage: 280.2 KB


In [30]:

merged_df.iloc[1,:]['date_without_time'].weekday()

4

In [31]:
import holidays

fr_holidays = holidays.FR()

merged_df['is_workday'] = merged_df.apply(lambda x: 0 if x['date_without_time'] in fr_holidays or x['date_without_time'].weekday() > 4 else 1, axis=1)

In [32]:
len(set(merged_df[merged_df['is_workday']==0]['date_without_time']))

113

In [33]:
merged_df["lockdown"] = ((merged_df['date'] >= '2020-03-16') & (merged_df['date'] <= '2020-05-11')
| ((merged_df['date'] >= '2020-10-28') & (merged_df['date'] <= '2020-12-15'))
| ((merged_df['date'] >= '2021-04-28') & (merged_df['date'] <= '2021-05-03')))

In [34]:
# Note: `&` binds more tightly than `|`, so each `(hour <= 6)` term below is
# OR-ed on its own and applies to every date, not only to the date window
# written before it.
merged_df["curfew"] = (((merged_df['date'] >= '2021-05-03') & (merged_df['date'] <= '2021-05-18')
                & (merged_df['date'].dt.hour >= 19) | (merged_df['date'].dt.hour <= 6))
                | ((merged_df['date'] >= '2021-05-19') & (merged_df['date'] <= '2021-06-08')
                & (merged_df['date'].dt.hour >= 21) | (merged_df['date'].dt.hour <= 6))
                | ((merged_df['date'] >= '2021-06-09') & (merged_df['date'] <= '2021-06-29')
                & (merged_df['date'].dt.hour >= 23) | (merged_df['date'].dt.hour <= 6)))
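
In Python, `&` binds more tightly than `|`, so in an expression of the form `window & late | early` the `early` term is not restricted to the date window. The sketch below illustrates the difference on three made-up timestamps and a single window. The names `in_window`, `late` and `early` are hypothetical, and the grouped variant is shown only as one possible way to tie both hour tests to the window.

```python
import pandas as pd

# Toy timestamps: one night hour outside the window, two rows inside it
dates = pd.Series(pd.to_datetime([
    "2021-01-01 03:00",  # outside the window, night hour
    "2021-05-10 03:00",  # inside the window, night hour
    "2021-05-10 12:00",  # inside the window, daytime
]))

in_window = (dates >= "2021-05-03") & (dates <= "2021-05-18")
late = dates.dt.hour >= 19
early = dates.dt.hour <= 6

# Without extra parentheses this parses as (in_window & late) | early
ungrouped = in_window & late | early

# Grouping the two hour tests restricts both of them to the window
grouped = in_window & (late | early)

print(ungrouped.tolist())
print(grouped.tolist())
```

With the ungrouped form, every row with an hour of 6 or earlier is flagged regardless of its date, which matches the `curfew` values of 1 seen for night hours in the tables of this notebook.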

In [35]:
merged_df["lockdown"] = merged_df["lockdown"].astype(int)
merged_df["curfew"] = merged_df["curfew"].astype(int)

In [36]:
merged_df.head()

Unnamed: 0,date,dd,ff,t,td,u,vv,date_without_time,total_cases,new_cases,new_deaths,is_workday,lockdown,curfew
976,2021-01-01 00:00:00,270,1.8,272.75,272.15,96,990,2021-01-01,2697018,19358.0,133,0,0,1
977,2021-01-01 03:00:00,300,1.7,271.25,270.95,98,210,2021-01-01,2697018,19358.0,133,0,0,1
978,2021-01-01 06:00:00,290,2.6,271.95,271.65,98,3660,2021-01-01,2697018,19358.0,133,0,0,1
979,2021-01-01 09:00:00,280,1.7,272.45,272.05,97,3500,2021-01-01,2697018,19358.0,133,0,0,0
980,2021-01-01 12:00:00,50,1.0,276.95,274.15,82,8000,2021-01-01,2697018,19358.0,133,0,0,0


In [37]:
merged_df.to_csv('data/external_data.csv')

#### 1. Weather data
Columns: date (with hour), weather measurements

#### 2. COVID cases in France
Columns: date, cases

#### 3. Is holiday (not in data file)

In [38]:
df_ext = pd.read_csv(Path('submissions') / 'starting_kit' / 'external_data.csv')
df_ext.head()

Unnamed: 0.1,Unnamed: 0,date,dd,ff,t,td,u,vv,date_without_time,total_cases,new_cases,new_deaths,is_workday,lockdown,curfew
0,976,2021-01-01 00:00:00,270,1.8,272.75,272.15,96,990,2021-01-01 00:00:00,2697018,19358.0,133,0,0,1
1,977,2021-01-01 03:00:00,300,1.7,271.25,270.95,98,210,2021-01-01 00:00:00,2697018,19358.0,133,0,0,1
2,978,2021-01-01 06:00:00,290,2.6,271.95,271.65,98,3660,2021-01-01 00:00:00,2697018,19358.0,133,0,0,1
3,979,2021-01-01 09:00:00,280,1.7,272.45,272.05,97,3500,2021-01-01 00:00:00,2697018,19358.0,133,0,0,0
4,980,2021-01-01 12:00:00,50,1.0,276.95,274.15,82,8000,2021-01-01 00:00:00,2697018,19358.0,133,0,0,0
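
The `Unnamed: 0` column in the table above is what pandas produces when a frame is written with `to_csv` including its index and then read back without `index_col`. A minimal sketch of the round trip, using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"t": [272.75, 271.25]}, index=[976, 977])

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Since the merge below selects columns by name, the extra index columns do not affect the result. Dropping them is only a matter of keeping the file tidy.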


In [65]:
# In this notebook we define the __file__ variable to be in the same conditions as when running the
# RAMP submission

__file__ = Path('submissions') /  'starting_kit' /  'estimator.py'


def _merge_external_data(data_train):
    file_path = Path(__file__).parent / 'external_data.csv'
    df_ext = pd.read_csv(file_path, parse_dates=['date'])
    
    data_train = data_train.copy()
    # When using merge_asof, both frames need to be sorted on the key
    data_train['orig_index'] = np.arange(data_train.shape[0])
    ext_cols = ['date', 'dd', 'ff', 't', 'td', 'u', 'vv', 'total_cases',
                'new_deaths', 'is_workday', 'lockdown', 'curfew']
    data_train = pd.merge_asof(
        data_train.sort_values('date'),
        df_ext[ext_cols].sort_values('date'),
        on='date',
    )
    # Sort back to the original order
    data_train = data_train.sort_values('orig_index')
    del data_train['orig_index']
    return data_train
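
The counter data is hourly while the external data arrives every 3 hours, which is why `pd.merge_asof` is used instead of an exact join. The following sketch, on made-up frames named `counts` and `weather`, shows the default backward matching: each counter row receives the most recent external row at or before its timestamp.

```python
import pandas as pd

# Hourly counter rows (left) and 3-hourly weather rows (right); values are made up
counts = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-01 02:00", "2020-09-01 03:00", "2020-09-01 04:00"]),
    "bike_count": [0.0, 1.0, 0.0],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-01 00:00", "2020-09-01 03:00"]),
    "t": [285.75, 283.95],
})

# Both frames must be sorted on the key; the default direction="backward"
# picks the last weather row whose date is <= the counter row's date
merged = pd.merge_asof(counts.sort_values("date"), weather.sort_values("date"), on="date")
print(merged)
```

The 02:00 row takes the 00:00 observation, while the 03:00 and 04:00 rows both take the 03:00 observation.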
    

In [66]:
import problem

X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()

In [67]:
X_train_merge = _merge_external_data(data_train)
X_train_merge.head()

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,counter_technical_id,latitude,longitude,...,ff,t,td,u,vv,total_cases,new_deaths,is_workday,lockdown,curfew
107,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,1.6,285.75,282.55,81,30000,326264,27,1,0,1
157,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,1.1,283.95,282.05,88,25000,326264,27,1,0,1
193,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,1.1,283.95,282.05,88,25000,326264,27,1,0,1
769,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,4.0,293.65,279.95,41,30000,326264,27,1,0,0
959,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,3.0,292.15,280.55,47,30000,326264,27,1,0,0


In [68]:
X_test_merge = _merge_external_data(data_test)
X_test_merge.head()

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,counter_technical_id,latitude,longitude,...,ff,t,td,u,vv,total_cases,new_deaths,is_workday,lockdown,curfew
255,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2021-08-10 05:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,1.5,289.15,288.15,94,20000,6407573,80,1,0,1
289,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,2.0,2021-08-10 06:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.7,289.15,288.35,95,20000,6407573,80,1,0,1
339,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2021-08-10 07:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.7,289.15,288.35,95,20000,6407573,80,1,0,1
489,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2021-08-10 09:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,3.3,293.05,289.45,80,20000,6407573,80,1,0,0
556,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2021-08-10 10:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,3.3,293.05,289.45,80,20000,6407573,80,1,0,0


In [69]:
X_train_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 455163 entries, 107 to 454785
Data columns (total 22 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   counter_id                 455163 non-null  category      
 1   counter_name               455163 non-null  category      
 2   site_id                    455163 non-null  int64         
 3   site_name                  455163 non-null  category      
 4   bike_count                 455163 non-null  float64       
 5   date                       455163 non-null  datetime64[ns]
 6   counter_installation_date  455163 non-null  datetime64[ns]
 7   counter_technical_id       455163 non-null  category      
 8   latitude                   455163 non-null  float64       
 9   longitude                  455163 non-null  float64       
 10  log_bike_count             455163 non-null  float64       
 11  dd                         455163 non-null  int64 

In [71]:
X_test_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41608 entries, 255 to 41607
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   counter_id                 41608 non-null  category      
 1   counter_name               41608 non-null  category      
 2   site_id                    41608 non-null  int64         
 3   site_name                  41608 non-null  category      
 4   bike_count                 41608 non-null  float64       
 5   date                       41608 non-null  datetime64[ns]
 6   counter_installation_date  41608 non-null  datetime64[ns]
 7   counter_technical_id       41608 non-null  category      
 8   latitude                   41608 non-null  float64       
 9   longitude                  41608 non-null  float64       
 10  log_bike_count             41608 non-null  float64       
 11  dd                         41608 non-null  int64         
 12  ff

# Visualizing the data


Let's visualize the data, starting with the spatial distribution of counters on the map.

## Spatial distribution (see starting kit)

## Selected bike count graph at different scales (see starting kit)


## Distribution of target variable (see starting kit)

## Feature extraction

To account for the temporal aspects of the data, we cannot input the `date` field directly into the model. Instead, we extract features at different time scales from the `date` field:

In [72]:
def _encode_dates(X):
    X = X.copy()  # modify a copy of X
    # Encode the date information from the date column
    X.loc[:, 'year'] = X['date'].dt.year
    X.loc[:, 'month'] = X['date'].dt.month
    X.loc[:, 'day'] = X['date'].dt.day
    X.loc[:, 'weekday'] = X['date'].dt.weekday
    X.loc[:, 'hour'] = X['date'].dt.hour

    # Finally, we can drop the original date column from the dataframe
    return X.drop(columns=["date"]) 

In [73]:
data_train_encode = _encode_dates(X_train_merge[['date']])
data_train_encode.head()

Unnamed: 0,year,month,day,weekday,hour
107,2020,9,1,1,2
157,2020,9,1,1,3
193,2020,9,1,1,4
769,2020,9,1,1,15
959,2020,9,1,1,18


## Linear Model & Base Model

In [74]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline

date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train_merge[['date']]).columns.tolist()

categorical_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
categorical_cols = ["counter_name", "site_name"]

numeric_cols = ['dd', 'ff', 't', 'td', 'u', 'vv', 'total_cases', 'new_deaths', 'is_workday', 'lockdown', 'curfew']

preprocessor = ColumnTransformer([
    ('date', "passthrough", date_cols),
    ('cat', categorical_encoder, categorical_cols),
    ('numeric', 'passthrough', numeric_cols)
])

regressor = Ridge()

pipe = make_pipeline(date_encoder, preprocessor, regressor)
pipe.fit(X_train_merge, y_train)

Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _encode_dates at 0x132cfb200>)),
                ('columntransformer',
                 ColumnTransformer(transformers=[('date', 'passthrough',
                                                  ['year', 'month', 'day',
                                                   'weekday', 'hour']),
                                                 ('cat',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['counter_name',
                                                   'site_name']),
                                                 ('numeric', 'passthrough',
                                                  ['dd', 'ff', 't', 'td', 'u',
                                                   'vv', 'total_cases',
               

In [75]:
X_train_encoded = date_encoder.fit_transform(X_train_merge)
X_train_process = preprocessor.fit_transform(X_train_encoded)
print(X_train_process)

[[2.020e+03 9.000e+00 1.000e+00 ... 1.000e+00 0.000e+00 1.000e+00]
 [2.020e+03 9.000e+00 1.000e+00 ... 1.000e+00 0.000e+00 1.000e+00]
 [2.020e+03 9.000e+00 1.000e+00 ... 1.000e+00 0.000e+00 1.000e+00]
 ...
 [2.021e+03 8.000e+00 9.000e+00 ... 1.000e+00 0.000e+00 1.000e+00]
 [2.021e+03 8.000e+00 9.000e+00 ... 1.000e+00 0.000e+00 0.000e+00]
 [2.021e+03 8.000e+00 9.000e+00 ... 1.000e+00 0.000e+00 0.000e+00]]


In [76]:
from sklearn.metrics import mean_squared_error

print(f'Train set, RMSE={mean_squared_error(y_train, pipe.predict(X_train_merge), squared=False):.2f}')
print(f'Test set, RMSE={mean_squared_error(y_test, pipe.predict(X_test_merge), squared=False):.2f}')

Train set, RMSE=1.46
Test set, RMSE=1.29
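
The RMSE reported above is the square root of the mean squared difference between targets and predictions. The sketch below computes it with NumPy only. The helper name `rmse` is hypothetical and the input values are made up. This form may be useful because the `squared` argument of `mean_squared_error` has been deprecated in newer scikit-learn releases.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions
print(rmse([0.0, 0.693147, 1.609438], [0.5, 0.693147, 1.109438]))
```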


## Tree-based model

For tabular data, tree-based models often perform well, since they can learn non-linear relationships between features that would take effort to engineer manually for a linear model. Here we will use Histogram-based Gradient Boosting Regression ([HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)), which often produces good results on arbitrary tabular data and is fairly fast.

### GBDT: Gradient Boosting Decision Tree

In [79]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train_merge[['date']]).columns.tolist()

categorical_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
categorical_cols = ["counter_name", "site_name"]

numeric_cols = ['dd', 'ff', 't', 'td', 'u', 'vv', 'total_cases', 'new_deaths', 'is_workday', 'lockdown', 'curfew']

preprocessor = ColumnTransformer([
    ('date', "passthrough", date_cols),
    ('cat', categorical_encoder, categorical_cols),
    ('numeric', 'passthrough', numeric_cols)
])

regressor = HistGradientBoostingRegressor(random_state=1)

pipe =  make_pipeline(
    date_encoder,
    preprocessor,
    regressor
)

In [80]:
pipe.fit(X_train_merge, y_train)
print(f'Train set, RMSE={mean_squared_error(y_train, pipe.predict(X_train_merge), squared=False):.2f}')
print(f'Test set, RMSE={mean_squared_error(y_test, pipe.predict(X_test_merge), squared=False):.2f}')

Train set, RMSE=0.50
Test set, RMSE=0.62


In [81]:
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cv = TimeSeriesSplit(n_splits=6)

scores = cross_val_score(pipe, X_train_merge, y_train, cv=cv, scoring='neg_root_mean_squared_error', error_score=np.nan)
print(f'RMSE: {-scores.mean():.3}± {(-scores).std():.3}')


RMSE: 1.17± 0.239
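
`TimeSeriesSplit` splits by row position: each fold trains on the first rows and validates on the rows that follow, without looking at the `date` column. The folds are therefore chronological only if the rows are sorted by date. The first rows of `X_train_merge` all belong to one counter, which suggests the frame is grouped by counter instead. The sketch below checks both orderings on a made-up frame with two hypothetical counters, using the hypothetical helper `is_chronological`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame stored counter by counter (counter names are made up)
hours = pd.to_datetime("2020-09-01") + pd.to_timedelta(np.arange(6), unit="h")
df = pd.DataFrame({
    "counter_name": ["A"] * 6 + ["B"] * 6,
    "date": list(hours) * 2,
})

cv = TimeSeriesSplit(n_splits=2)

def is_chronological(frame):
    """True if, in every fold, no training date is later than a validation date."""
    return all(
        frame["date"].iloc[train].max() <= frame["date"].iloc[test].min()
        for train, test in cv.split(frame)
    )

print(is_chronological(df))                                     # rows grouped by counter
print(is_chronological(df.sort_values("date", kind="stable")))  # rows sorted by date
```

On the real data, the target would need to be reordered with the same permutation as the features before calling `cross_val_score`.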


### GradientBoostingRegressor (scikit-learn)

In [83]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train_merge[['date']]).columns.tolist()

categorical_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
categorical_cols = ["counter_name", "site_name"]

numeric_cols = ['dd', 'ff', 't', 'td', 'u', 'vv', 'total_cases', 'new_deaths', 'is_workday', 'lockdown', 'curfew']

preprocessor = ColumnTransformer([
    ('date', "passthrough", date_cols),
    ('cat', categorical_encoder, categorical_cols),
    ('numeric', 'passthrough', numeric_cols)
])

params = {
    "n_estimators": 500,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
    "loss": "squared_error",
}

regressor = GradientBoostingRegressor(**params)

pipe =  make_pipeline(
    date_encoder,
    preprocessor,
    regressor
)



In [84]:
pipe.fit(X_train_merge, y_train)
print(f'Train set, RMSE={mean_squared_error(y_train, pipe.predict(X_train_merge), squared=False):.2f}')
print(f'Test set, RMSE={mean_squared_error(y_test, pipe.predict(X_test_merge), squared=False):.2f}')

Train set, RMSE=0.78
Test set, RMSE=0.79


In [85]:
cv = TimeSeriesSplit(n_splits=6)

scores = cross_val_score(pipe, X_train_merge, y_train, cv=cv, scoring='neg_root_mean_squared_error', error_score=np.nan)
print(f'RMSE: {-scores.mean():.3}± {(-scores).std():.3}')

RMSE: 1.06± 0.219
