# Machine Learning Pipeline

### In this notebook, we create a machine learning pipeline that will perform the following tasks:

### - Roll the Time Series Datasets
### - Extract the "Relevant" Features
### - Build and Train the Forecasting Models

### Note: 
### Here, we build multiple models for each time series and retain only the best performing one. We neither fine-tune the models nor consider any ensemble model, even though we could have done so. Instead, we just consider some basic (hyper)parameters.

### Import packages

In [1]:
import os, sys
import pandas as pd

import time

from sklearn.model_selection import train_test_split

from joblib import Parallel, delayed, parallel_backend
import multiprocessing

import utils as helper

import warnings

warnings.filterwarnings("ignore")

### Global variables

In [2]:
DATA_PATH = "../data"
RESULTS_PATH = "../results"

# STEP 0: Load the Data

In [3]:
df = pd.read_csv(
    "s3://data.atoti.io/notebooks/collateral-shortfall-forecast/data/assets-prices-no-missing-values.csv"
)  # you can replace this path with the path to a local copy of the file

print(f"Size of the data: {df.shape}\n\n")
df.head()

Size of the data: (1491, 9)




Unnamed: 0,Date,AC.PA,BNP.PA,CAP.PA,ENGI.PA,G.MI,RACE.MI,SAN.PA,TIT.MI
0,2018-01-02,43.48,62.09,99.0,14.23,15.02,87.300003,71.760002,0.7255
1,2018-01-03,43.310001,62.639999,101.0,14.29,14.89,88.800003,72.07,0.725
2,2018-01-04,43.599998,63.77,101.349998,14.515,15.0,92.5,73.0,0.734
3,2018-01-05,43.77,63.889999,102.5,14.595,15.2,93.349998,74.360001,0.7385
4,2018-01-06,43.973334,64.093333,102.466667,14.626667,15.203333,93.783333,74.356667,0.743167


# STEP 1: Train / Test Split 
<font size=4>We split the data based on the dates.

<font size=4>We consider the earlier data points to train the model, and the later ones to test it.
    
<font size=4>We use an 80%-20% train/test split: the first 80% of the dates go to the training set and the remaining 20% to the test set.

In [4]:
train_dates = list(df.Date)[: int(0.8 * len(df))]
test_dates = [d for d in list(df.Date) if d not in train_dates]

In [5]:
print(
    f"Number of training points: {len(train_dates)}\nNumber of testing points: {len(test_dates)}"
)

Number of training points: 1192
Number of testing points: 299


### Split the data

In [6]:
train = df[df.Date.isin(train_dates)]
test = df[df.Date.isin(test_dates)]
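Because the split is chronological, every training date should precede every test date. The following is a minimal, library-free sketch of that check on a handful of made-up dates. The helper name `chronological_split` is hypothetical, and ISO `YYYY-MM-DD` strings are assumed so that string order matches date order.

```python
def chronological_split(dates, train_fraction=0.8):
    """Split an ordered list of dates into (train, test) by position."""
    cut = int(train_fraction * len(dates))
    return dates[:cut], dates[cut:]


# Ten consecutive (made-up) days, already sorted
dates = [f"2018-01-{day:02d}" for day in range(2, 12)]
train_dates_demo, test_dates_demo = chronological_split(dates)

# No leakage: the latest training date is strictly before the earliest test date
assert max(train_dates_demo) < min(test_dates_demo)
print(len(train_dates_demo), len(test_dates_demo))
```

The same comparison (`train.Date.max() < test.Date.min()`) can be run on the real `train` and `test` frames as a quick sanity check.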

In [7]:
train

Unnamed: 0,Date,AC.PA,BNP.PA,CAP.PA,ENGI.PA,G.MI,RACE.MI,SAN.PA,TIT.MI
0,2018-01-02,43.480000,62.090000,99.000000,14.230000,15.020000,87.300003,71.760002,0.725500
1,2018-01-03,43.310001,62.639999,101.000000,14.290000,14.890000,88.800003,72.070000,0.725000
2,2018-01-04,43.599998,63.770000,101.349998,14.515000,15.000000,92.500000,73.000000,0.734000
3,2018-01-05,43.770000,63.889999,102.500000,14.595000,15.200000,93.349998,74.360001,0.738500
4,2018-01-06,43.973334,64.093333,102.466667,14.626667,15.203333,93.783333,74.356667,0.743167
...,...,...,...,...,...,...,...,...,...
1187,2021-04-03,33.210000,52.070001,148.400000,12.134800,17.072001,177.350003,84.259998,0.457240
1188,2021-04-04,33.470000,52.030001,148.600000,12.141200,17.093000,177.000003,84.289998,0.456160
1189,2021-04-05,33.730001,51.990001,148.800000,12.147600,17.114000,176.650003,84.319998,0.455080
1190,2021-04-06,33.990002,51.950001,149.000000,12.154000,17.135000,176.300003,84.349998,0.454000


In [8]:
test

Unnamed: 0,Date,AC.PA,BNP.PA,CAP.PA,ENGI.PA,G.MI,RACE.MI,SAN.PA,TIT.MI
1192,2021-04-08,33.180000,52.000000,150.050003,12.290000,17.105000,175.199997,84.709999,0.445300
1193,2021-04-09,33.040001,51.459999,151.000000,12.326000,16.965000,174.000000,85.059998,0.433900
1194,2021-04-10,32.890001,51.446665,151.216665,12.368000,17.000000,174.700002,84.936666,0.435233
1195,2021-04-11,32.740000,51.433332,151.433329,12.410000,17.035000,175.400004,84.813334,0.436567
1196,2021-04-12,32.590000,51.419998,151.649994,12.452000,17.070000,176.100006,84.690002,0.437900
...,...,...,...,...,...,...,...,...,...
1486,2022-01-27,32.160000,64.500000,193.800003,13.672000,18.280001,200.399994,94.889999,0.408100
1487,2022-01-28,32.029999,62.779999,193.000000,13.558000,18.264999,200.500000,94.260002,0.406900
1488,2022-01-29,32.139999,62.853333,194.483332,13.556666,18.370000,201.299998,93.756668,0.410000
1489,2022-01-30,32.250000,62.926666,195.966665,13.555333,18.475000,202.099996,93.253334,0.413100


#### Save the datasets

In [9]:
train.to_csv(os.path.join(DATA_PATH, "assets-prices-train.csv"), index=False)
test.to_csv(os.path.join(DATA_PATH, "assets-prices-test.csv"), index=False)

# STEP 2: Create the Model Pipeline
<font size=4> First, we roll the data and create the target at the same time using the ***make_forecasting_frame()*** function available in tsfresh.

<font size=4> As required by this function (cf. the tsfresh documentation), this is done for each time series separately.

<font size=4> In addition, we create the target for three scenarios corresponding to different forecasting horizons:

<font size=4> **- 1-day horizon**
    
<font size=4> **- 3-day horizon**
    
<font size=4> **- 7-day horizon**
    
<font size=4> Then, we extract the features using tsfresh's ***extract_features()*** function.
    
<font size=4> Finally, we create forecasting models based on different regressors:
    
<font size=4> ***- Simple Linear Regression***
    
<font size=4> ***- Partial Least Squares (PLS) Regression***
    
<font size=4> ***- Orthogonal Partial Least Squares (O-PLS) Regression***
    
<font size=4> ***- XGBoost***

In [10]:
cols = list(df.columns[1:])
cols

['AC.PA', 'BNP.PA', 'CAP.PA', 'ENGI.PA', 'G.MI', 'RACE.MI', 'SAN.PA', 'TIT.MI']

## Roll the Time Series Datasets

<font size=4>Rolling is a very useful operation for time series data. It means sliding a window of a specified size through the data and performing calculations on the values inside the window at each position.
    
<font size=4>The tsfresh documentation provides a good explanation and illustration of the [rolling mechanism](https://tsfresh.readthedocs.io/en/latest/text/forecasting.html#the-rolling-mechanism).
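To make the mechanism concrete, here is a small pure-Python sketch of rolling with a forecasting target. It is not tsfresh's implementation: the function name `roll_with_target` and its `horizon` argument are illustrative assumptions. Each window holds at most `max_timeshift` past values, and the target is the value observed `horizon` steps after the window ends.

```python
def roll_with_target(values, max_timeshift, horizon=1):
    """Return (window, target) pairs for a univariate series.

    window: up to `max_timeshift` consecutive values ending at position t
    target: the value at position t + horizon
    """
    pairs = []
    for t in range(len(values) - horizon):
        start = max(0, t + 1 - max_timeshift)  # early windows are shorter
        window = values[start : t + 1]
        pairs.append((window, values[t + horizon]))
    return pairs


prices = [10, 11, 12, 13, 14, 15]  # made-up values
for window, target in roll_with_target(prices, max_timeshift=3, horizon=1):
    print(window, "->", target)
```

Calling the function with `horizon=3` pairs the same windows with values three steps ahead, which mirrors the idea of the longer-horizon scenarios.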

#### Set the number of CPUs to use for parallel computing

In [11]:
num_cpus = max(
    1, multiprocessing.cpu_count() - 2
)  # optional: adjust to your machine; kept at 1 or more because joblib rejects n_jobs=0

print(f"Number of available cpus: {multiprocessing.cpu_count()}\n")
print(f"Number of cpus to use: {num_cpus}")

Number of available cpus: 16

Number of cpus to use: 14


<font size=4>As seen in the previous step of the analysis, some time series show significant partial autocorrelations up to lag 15. In those cases, the current price is influenced by the prices observed over the preceding 15 days.

<font size=4>This is not the case for all the time series: others have a shorter seasonality, with only a few correlated lags.


<font size=4>So, for simplicity, we retain 2 weeks (14 days) as the window from which to extract the features used to build the forecasting models for all the time series.


<font color='red' size=4>Note that you could instead choose a specific value for each time series, to better match its seasonality and number of correlated lags.
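As a rough illustration of how a per-series window could be chosen, the sketch below computes partial autocorrelations with the Durbin-Levinson recursion and returns the largest lag outside the usual large-sample 95% band of 1.96/sqrt(n). This is an assumption-laden sketch, not the method used in the previous notebook: the function names are hypothetical, the series is simulated, and in practice a library routine such as `statsmodels.tsa.stattools.pacf` would normally be preferred.

```python
import math
import random


def sample_acf(values, max_lag):
    """Sample autocorrelations r[0..max_lag] of a series."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..len(r)-1 (Durbin-Levinson)."""
    pacf, prev = [], []  # prev[j-1] holds phi[k-1][j]
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
        phi_kk = num / den
        current = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


def last_significant_lag(values, max_lag):
    """Largest lag whose |PACF| exceeds the approximate 95% band."""
    threshold = 1.96 / math.sqrt(len(values))
    pacf = pacf_from_acf(sample_acf(values, max_lag))
    significant = [lag for lag, p in enumerate(pacf, start=1) if abs(p) > threshold]
    return max(significant, default=0)


# Simulated AR(1) series, for illustration only (not the notebook's data)
rng = random.Random(0)
series = [0.0]
for _ in range(499):
    series.append(0.7 * series[-1] + rng.gauss(0, 1))

print(last_significant_lag(series, max_lag=15))
```

The returned lag could then be passed as `max_timeshift` for that series instead of the common value of 14.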

In [12]:
max_timeshift = 14
rolling_direction = 1

In [13]:
%%time
result_train = Parallel(n_jobs=num_cpus, prefer="threads")(
    delayed(helper.create_forecasting_frame)(
        train, col, max_timeshift, rolling_direction
    )
    for col in cols
)


CPU times: user 12.7 s, sys: 2.6 s, total: 15.3 s
Wall time: 13.8 s


In [14]:
%%time
result_test = Parallel(n_jobs=num_cpus, prefer="threads")(
    delayed(helper.create_forecasting_frame)(
        test, col, max_timeshift, rolling_direction
    )
    for col in cols
)


CPU times: user 3.94 s, sys: 2.26 s, total: 6.21 s
Wall time: 5.37 s


## Create and Store the Rolled Datasets for the Different Forecasting Horizons

In [15]:
%%time
for horizon in [1, 3, 7]:
    # train
    Parallel(n_jobs=num_cpus, prefer="threads")(
        delayed(helper.create_and_save_dataset)(
            forecasting_frame,
            os.path.join(RESULTS_PATH, "rolled-dataset"),
            horizon,
            "train",
        )
        for forecasting_frame in result_train
    )
    # test
    Parallel(n_jobs=num_cpus, prefer="threads")(
        delayed(helper.create_and_save_dataset)(
            forecasting_frame,
            os.path.join(RESULTS_PATH, "rolled-dataset"),
            horizon,
            "test",
        )
        for forecasting_frame in result_test
    )

CPU times: user 537 ms, sys: 77.6 ms, total: 615 ms
Wall time: 540 ms


## Extract the Features from the Time Series

In [16]:
%%time
helper.generate_features_dataframes(
    os.path.join(RESULTS_PATH, "rolled-dataset"), os.path.join(RESULTS_PATH, "features")
)

Starting features extraction...







...Features extraction completed, all the files are OK!!!


CPU times: user 2min 40s, sys: 15.4 s, total: 2min 56s
Wall time: 6min 13s


## Build the Forecasting Models

#### We create and train the models, then use them to forecast the time series and write the predictions to files.
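The selection step described in the introduction (keep the best performing model per time series) can be sketched as follows. This is only an illustration under stated assumptions: `rmse` and `best_model` are hypothetical names, not functions from the `utils` module, and the numbers are made up rather than taken from the notebook's results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def best_model(y_true, predictions_by_model):
    """Return (model_name, score) for the candidate with the lowest RMSE."""
    scores = {
        name: rmse(y_true, preds) for name, preds in predictions_by_model.items()
    }
    name = min(scores, key=scores.get)
    return name, scores[name]


# Made-up values and placeholder model names, for illustration only
actual = [10.0, 11.0, 12.0]
candidates = {
    "linear": [10.0, 11.0, 14.0],
    "xgboost": [10.0, 12.0, 12.0],
}
print(best_model(actual, candidates))
```

RMSE penalises large errors more heavily than small ones, so a model with one big miss can rank below a model with several small misses.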

In [17]:
%%time
eval_metric = "RMSE"
helper.build_models_and_results_summary(RESULTS_PATH, eval_metric)

processed: predictions-1-day-horizon--AC.PA-test.csv
processed: predictions-1-day-horizon--BNP.PA-test.csv
processed: predictions-1-day-horizon--CAP.PA-test.csv
processed: predictions-1-day-horizon--ENGI.PA-test.csv
processed: predictions-1-day-horizon--G.MI-test.csv

### Update the test dataset as explained in the function *update_test_data()* in the utils file

In [18]:
test = helper.update_test_data(os.path.join(DATA_PATH, "assets-prices-test.csv"))
test



Unnamed: 0,Date,AC.PA,BNP.PA,CAP.PA,ENGI.PA,G.MI,RACE.MI,SAN.PA,TIT.MI
1,2021-04-09,33.040001,51.459999,151.000000,12.326000,16.965000,174.000000,85.059998,0.433900
2,2021-04-10,32.890001,51.446665,151.216665,12.368000,17.000000,174.700002,84.936666,0.435233
3,2021-04-11,32.740000,51.433332,151.433329,12.410000,17.035000,175.400004,84.813334,0.436567
4,2021-04-12,32.590000,51.419998,151.649994,12.452000,17.070000,176.100006,84.690002,0.437900
5,2021-04-13,32.840000,51.639999,153.500000,12.260000,17.084999,176.350006,84.070000,0.435800
...,...,...,...,...,...,...,...,...,...
293,2022-01-26,32.220001,63.950001,193.399994,13.374000,18.049999,203.500000,91.940002,0.409600
294,2022-01-27,32.160000,64.500000,193.800003,13.672000,18.280001,200.399994,94.889999,0.408100
295,2022-01-28,32.029999,62.779999,193.000000,13.558000,18.264999,200.500000,94.260002,0.406900
296,2022-01-29,32.139999,62.853333,194.483332,13.556666,18.370000,201.299998,93.756668,0.410000


Finally, we save the updated test dataset, overwriting the file written in STEP 1.

In [19]:
test.to_csv(os.path.join(DATA_PATH, "assets-prices-test.csv"), index=False)