# An initial training pipeline

+ A `Pipeline` object allows us to sequentially apply transformation steps and, if required, a predictor.
+ `Pipeline` objects compose transformers, i.e., objects that implement `fit` and `transform` methods.
+ The purpose of `Pipeline` objects is to assemble transformers and predictors into a single estimator that can be used in cross-validation.
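
To make the cross-validation use case concrete, here is a minimal sketch on synthetic data. The arrays `X` and `y`, the `Ridge` predictor, and the step names are assumptions chosen for illustration; they are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 200 rows, 3 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One transform step followed by a predictor as the final step.
demo_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", Ridge(alpha=1.0)),
    ]
)

# In each fold, the whole pipeline (scaler included) is fit on the training split
# and then evaluated on the held-out split.
scores = cross_val_score(demo_pipe, X, y, cv=5, scoring="r2")
print(scores.shape)
```

Because the scaler lives inside the pipeline, its mean and standard deviation are re-estimated on each training split, so no statistics leak from the held-out rows.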

In [1]:
%load_ext dotenv
%dotenv ../src/.env
import sys
sys.path.append("../src")
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
from glob import glob
ft_dir = os.getenv("FEATURES_DATA")
ft_glob = glob(ft_dir+'/*.parquet')
df = dd.read_parquet(ft_glob).compute().reset_index()

## Data Preparation

+ As a first example, we will build a pipeline with a single step: [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
+ This step centers the data by subtracting the mean and scales it by the standard deviation (z-score).
+ For simplicity, we will work with a single variable, `log_returns`.
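
As a quick check of what the z-score means, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` on a few toy values. The numbers in `x` are made up for illustration and are not taken from the features data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (illustrative only), shaped as a single column like df[['log_returns']].
x = np.array([[0.01], [-0.02], [0.03], [0.00], [-0.01]])

# Manual z-score: subtract the mean, divide by the population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))
```

Note that `StandardScaler` uses the population standard deviation, while pandas `describe()` reports the sample standard deviation (`ddof=1`), so on small samples the reported `std` of scaled data can differ slightly from 1.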

In [2]:
log_returns = df[['log_returns']].dropna()
log_returns.describe()

| statistic | log_returns |
|---|---|
| count | 2703278.0 |
| mean | 6.466212e-05 |
| std | 0.080019 |
| min | -7.145808 |
| 25% | -0.009246503 |
| 50% | 0.0004882451 |
| 75% | 0.01018463 |
| max | 6.010436 |


+ A `Pipeline` is defined by a list of tuples.
+ Each tuple has the form `("name", <transformer>)`: the name of the step and the transformer (or, for the final step, predictor) object of our choosing.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


In [4]:
pipe = Pipeline(
    [
        ('scaler', StandardScaler())
    ]
)
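
The step name is more than a label. As a brief sketch (the variable `example_pipe` is introduced here only for illustration), the name can be used to retrieve a step and to address its parameters:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

example_pipe = Pipeline([("scaler", StandardScaler())])

# Steps are stored as the list of (name, estimator) tuples that was passed in.
step_names = [name for name, _ in example_pipe.steps]

# A step can be retrieved by its name...
scaler_step = example_pipe.named_steps["scaler"]

# ...and its parameters are addressed as "<step name>__<parameter>".
example_pipe.set_params(scaler__with_mean=False)
print(step_names, scaler_step.with_mean)
```

The same `<step name>__<parameter>` convention is what hyperparameter search tools in scikit-learn use to tune individual steps of a pipeline.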

+ A `Pipeline` object is also a transformer; therefore, it implements `.fit()`, `.transform()`, and `.fit_transform()`.
+ The pipeline can be applied to `log_returns` with `pipe.fit_transform(log_returns)`.

In [5]:
scaled_returns_np = pipe.fit_transform(log_returns)
scaled_returns = pd.DataFrame(scaled_returns_np, columns=log_returns.columns)
scaled_returns.describe()

| statistic | log_returns |
|---|---|
| count | 2703278.0 |
| mean | -3.832278e-18 |
| std | 1.0 |
| min | -89.30222 |
| 25% | -0.1163619 |
| 50% | 0.005293531 |
| 75% | 0.1264695 |
| max | 75.11182 |
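
The difference between `.fit()`, `.transform()`, and `.fit_transform()` matters once there is data the pipeline was not fit on. The following sketch uses toy arrays (`train` and `new` are illustrative assumptions, not part of the features data) to show that `.transform()` reuses the statistics learned during `.fit()`:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): one column, split into "training" and "new" rows.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
new = np.array([[5.0], [6.0]])

split_pipe = Pipeline([("scaler", StandardScaler())])

# fit() learns the mean and standard deviation from the training rows only.
split_pipe.fit(train)

# transform() reuses those learned statistics; it does not refit on the new rows.
new_scaled = split_pipe.transform(new)

# fit_transform(train) is equivalent to fit(train) followed by transform(train).
same = np.allclose(
    split_pipe.fit_transform(train),
    split_pipe.fit(train).transform(train),
)
print(new_scaled.ravel(), same)
```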


## Multiple features

+ Our data contains more than one feature. 
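
The single-step pipeline used so far also accepts several columns at once. As a small sketch, assuming a hypothetical two-column frame (`feature_a` and `feature_b` are placeholder names, not columns from the features data), `StandardScaler` standardizes each column with its own mean and standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature frame; the column names are placeholders.
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [100.0, 300.0, 200.0, 400.0],
    }
)

multi_pipe = Pipeline([("scaler", StandardScaler())])
scaled_toy = pd.DataFrame(multi_pipe.fit_transform(toy), columns=toy.columns)

# Each column is centered and scaled independently of the others.
print(scaled_toy.mean().round(12).tolist())
print(scaled_toy.std(ddof=0).round(12).tolist())
```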


In [10]:
files = glob(os.getenv("FEATURES_DATA")+"*/part*.parquet")
dd.read_parquet(files)


500