# Homework
The purpose of this homework is to go through all the steps of the machine learning pipeline with scikit-learn:
* Make the necessary imports
* Divide the data into train and test sets
* Preprocess the input data
* Test different types of cross-validation
* Train and predict



### Useful resources
Train-test split
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Hyper-parameter search
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

In [1]:
import pandas as pd
import numpy as np
import os, sys

scr_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(scr_dir)


from load import TimeSeriesLoader
import warnings
warnings.filterwarnings('ignore')
# import preprocessing
from preprocess import (
    create_rolling_ts, 
    split_data, 
    flatten
)

from error_metrics import regression_metrics



ROOT_PATH = 'C://Users/gilbe/Documents/aifi-bootcamp'
df = pd.read_csv(f'{ROOT_PATH}/data/aapl.csv')
df['Unnamed: 0'] = pd.to_datetime(df['Unnamed: 0'])
df.set_index('Unnamed: 0', inplace=True)
df.index.rename('Date', inplace=True)
# df.rename(columns={'Unnamed: 0', 'Date'}, inplace=True)

In [2]:

def split_sequence(sequence, n_steps):
    """
    This function produces input and output for a univariate time series
    --Args:
        sequence: sequence to split
        n_steps: number of steps to use for predicting the next time step
    --Return:
        sequence with n_steps as input and the next time step to predict
    """
    X, y = [], []
    for i in range(len(sequence)):
        end_ix = i + n_steps
        if end_ix > len(sequence) - 1:
            break
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return pd.DataFrame(np.array(X)), pd.DataFrame(np.array(y))

In [3]:
price = pd.read_csv(f'{ROOT_PATH}/data/closing_prices.csv')
price.set_index('date', inplace=True)
price.head()

date,AAPL,ABBV,ABT,AMGN,AMZN,AVGO,BA,BAC,BMY,C,...,PFE,PG,PM,T,UNH,V,VZ,WFC,WMT,XOM
2010-01-04,27.294495,,21.129273,49.792737,133.9,16.673423,45.640616,14.63946,19.648366,32.843185,...,13.934719,47.114847,34.70953,18.32377,27.878733,20.694802,20.828887,22.38583,43.935552,53.998112
2010-01-05,27.341684,,20.958563,49.365427,134.69,16.797192,47.135432,15.115313,19.34172,34.098954,...,13.735967,47.130264,34.407096,18.23401,27.834524,20.45766,20.566022,23.000375,43.49806,54.20895
2010-01-06,26.906778,,21.074956,48.994453,132.25,16.929801,48.565255,15.292591,19.334054,35.161528,...,13.6918,46.906716,34.407096,17.971143,28.108625,20.18295,20.274994,23.033151,43.40084,54.677481
2010-01-07,26.857038,,21.249546,48.545833,130.0,16.823714,50.531263,15.796435,19.349386,35.258126,...,13.640272,46.652332,34.83613,17.769366,29.187345,20.370786,20.154309,23.868932,43.425145,54.505686
2010-01-08,27.035592,,21.35818,48.977198,133.52,16.947483,50.043823,15.656478,19.027407,34.67854,...,13.750689,46.590664,34.64623,17.639188,28.913244,20.427137,20.167013,23.647696,43.206399,54.287039


#### Here is an example of how to use the split_sequence function; it only works for 1-dim data

In [4]:
ex1, ex2 = split_sequence(price['AAPL'], n_steps=3)
ex1.head()

,0,1,2
0,27.294495,27.341684,26.906778
1,27.341684,26.906778,26.857038
2,26.906778,26.857038,27.035592
3,26.857038,27.035592,26.797095
4,27.035592,26.797095,26.492278


In [5]:
ex2.head()

,0
0,26.857038
1,27.035592
2,26.797095
3,26.492278
4,26.865966


### Compute returns and plot some of them. You can use the pct_change() method in pandas
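
As a minimal sketch of this step, the block below computes simple returns with `pct_change()` on a small made-up dataframe. The name `toy_price` and the tickers `AAA` and `BBB` are hypothetical stand-ins for the `price` dataframe loaded above, and the plotting call is left commented out because it assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical closing prices for two tickers (stand-in for `price`).
dates = pd.date_range("2010-01-04", periods=6, freq="B")
toy_price = pd.DataFrame(
    {
        "AAA": [100.0, 101.0, 99.0, 102.0, 103.0, 101.5],
        "BBB": [50.0, 50.5, 51.0, 50.0, 49.5, 50.5],
    },
    index=dates,
)

# pct_change() computes simple returns: p_t / p_{t-1} - 1.
# The first row has no previous price and is NaN, so it is dropped.
returns = toy_price.pct_change().dropna()

# Uncomment to draw one panel per ticker (requires matplotlib).
# returns.plot(subplots=True, figsize=(8, 4), title="Daily returns")
print(returns.head())
```

With the real data, the same call on `price` would give one column of returns per stock; columns with missing prices (such as ABBV in 2010) will keep NaN values that have to be handled during preprocessing.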

### Divide data in features and targets
* Here you need to use the split_sequence function to split the data into features and targets
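
To illustrate what the features and targets look like, here is a sketch that builds the same windows as `split_sequence` in a vectorized way. The helper `make_windows` and the series `toy_returns` are hypothetical names introduced for this example, and `sliding_window_view` assumes NumPy 1.20 or newer.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, n_steps):
    """Hypothetical vectorized counterpart of split_sequence for 1-dim data."""
    values = np.asarray(series, dtype=float)
    # Each row of `windows` holds n_steps + 1 consecutive values.
    windows = sliding_window_view(values, n_steps + 1).copy()
    X = pd.DataFrame(windows[:, :-1])  # the first n_steps values are the features
    y = pd.DataFrame(windows[:, -1])   # the value that follows is the target
    return X, y


toy_returns = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01, 0.02])
X, y = make_windows(toy_returns, n_steps=3)
print(X)
print(y)
```

A series of length 6 with `n_steps=3` yields 3 samples: each row of `X` contains three consecutive returns and the matching row of `y` contains the return that comes next.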

### Preprocess features
* You might need to do imputation
* You might need some kind of normalization of the inputs
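
One possible way to combine both steps is a scikit-learn pipeline, sketched below on made-up arrays. The choice of mean imputation and standard scaling is an assumption for illustration, not a requirement of the homework. The important detail is that the preprocessor is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices with missing values (NaN).
X_train = np.array([[0.01, np.nan], [0.03, 0.02], [-0.01, 0.04]])
X_test = np.array([[np.nan, 0.00]])

# Impute missing values with the column mean, then standardize each column.
preprocessor = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Fit on the training set only, so no test-set statistics leak into training.
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

print(X_train_prep)
print(X_test_prep)
```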

### You can do either single regression or multiple regression

* If you choose single regression, you have to select a single stock from the dataframe price.

* If you choose ***multiple regression***, you have to wrap your algorithm with the MultiOutputRegressor class. Multiple regression requires modifying the split_sequence function above so that it returns targets for multiple outputs.

Multiple regression in scikit-learn:

https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression
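
The following is a sketch of how the multiple-regression variant could look, not a reference solution. The function `split_sequence_multi` is a hypothetical modification of `split_sequence`, the data is random, and `Ridge` is only a placeholder estimator.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor


def split_sequence_multi(frame, n_steps):
    """Hypothetical multi-column variant of split_sequence."""
    values = np.asarray(frame, dtype=float)
    X, y = [], []
    for i in range(len(values) - n_steps):
        # Flatten the window of n_steps rows over all columns into one feature row.
        X.append(values[i:i + n_steps].ravel())
        # The next row is the target: one value per column (stock).
        y.append(values[i + n_steps])
    return np.array(X), np.array(y)


# Random stand-in for a dataframe of returns with three hypothetical tickers.
rng = np.random.default_rng(0)
toy_returns = pd.DataFrame(
    rng.normal(0, 0.01, size=(50, 3)), columns=["AAA", "BBB", "CCC"]
)

X, y = split_sequence_multi(toy_returns, n_steps=4)

# MultiOutputRegressor fits one copy of the wrapped estimator per target column.
model = MultiOutputRegressor(Ridge(alpha=1.0))
model.fit(X, y)
pred = model.predict(X[-1:])
print(X.shape, y.shape, pred.shape)
```

Here each sample uses the last 4 rows of all 3 columns (12 features) to predict the next row of all 3 columns.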

### Divide into train and test sets
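
A minimal sketch with `train_test_split` on made-up arrays follows. Passing `shuffle=False` is a choice made here because the data is a time series: it keeps the chronological order, so the test set is the most recent part of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and targets, ordered in time.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the order: the last 20% of the rows form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)
print(y_test)
```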

### Do hyper-parameter search; you are free to choose between GridSearchCV and RandomizedSearchCV
* use cross-validation for i.i.d. data
* use time series cross-validation
* which method gives better results on the test set?

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html


https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
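
The comparison asked for above could be set up roughly as follows. This is a sketch on synthetic data: the `Ridge` estimator, the `alpha` grid, and the 160/40 split are assumptions chosen for illustration. `KFold` with shuffling plays the role of cross-validation for i.i.d. data and `TimeSeriesSplit` the role of time series cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, TimeSeriesSplit

# Synthetic regression data standing in for the windowed returns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Chronological train/test split: the last 40 rows are held out.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
cv_schemes = {
    "kfold": KFold(n_splits=5, shuffle=True, random_state=0),
    "time_series": TimeSeriesSplit(n_splits=5),
}

test_mse = {}
for name, cv in cv_schemes.items():
    search = GridSearchCV(
        Ridge(), param_grid, cv=cv, scoring="neg_mean_squared_error"
    )
    search.fit(X_train, y_train)
    # GridSearchCV refits the best model on the full training set by default.
    test_mse[name] = mean_squared_error(y_test, search.predict(X_test))
    print(name, search.best_params_, test_mse[name])
```

`RandomizedSearchCV` can be swapped in with the same `cv` argument; it takes `param_distributions` and `n_iter` instead of `param_grid`.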

### Do time series split without hyper-parameter search (advanced)

For this you might need a for-loop that iterates through the cv-folds in the dataset and trains the model
in each round. The function split defined below can be used.

Using for-loops together with the zip() function might be useful.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html


In [6]:
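from sklearn.model_selection import TimeSeriesSplit


def split(X, y, n_splits=5):
    """
    Function for time series splitting. It returns
    lists of training and test sets, one entry per fold.
    --Args:
        X: dataframe with Xtrain
        y: dataframe with ytrain
        n_splits: number of time series folds
    --Return:
        train_x_list, test_x_list, train_y_list, test_y_list
    """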
    tscv = TimeSeriesSplit(n_splits=n_splits)

    test_y_list = []
    test_x_list = []
    train_y_list = []
    train_x_list = []

    # Reset the index on new objects so the caller's X and y are not modified.
    if isinstance(X, (pd.DataFrame, pd.Series)):
        X = X.reset_index(drop=True)
    if isinstance(y, (pd.DataFrame, pd.Series)):
        y = y.reset_index(drop=True)

    for train_index, test_index in tscv.split(X):
        if isinstance(X, (pd.DataFrame, pd.Series)):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        else:
            X_train, X_test = X[train_index], X[test_index]
        if isinstance(y, (pd.DataFrame, pd.Series)):
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        else:
            y_train, y_test = y[train_index], y[test_index]

        train_x_list.append(X_train)
        train_y_list.append(y_train)
        test_x_list.append(X_test)
        test_y_list.append(y_test)

    return train_x_list, test_x_list, train_y_list, test_y_list
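
The for-loop with `zip()` mentioned above could look like the sketch below. To keep the example self-contained, the four lists are built directly with `TimeSeriesSplit` instead of calling `split`; they have the same structure as its return values. The synthetic data and the `LinearRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data standing in for Xtrain and ytrain.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=60)

# Compact stand-in for split(X, y, n_splits=5): same four lists.
folds = list(TimeSeriesSplit(n_splits=5).split(X))
train_x_list = [X[train] for train, _ in folds]
test_x_list = [X[test] for _, test in folds]
train_y_list = [y[train] for train, _ in folds]
test_y_list = [y[test] for _, test in folds]

# zip() walks through the four lists in parallel, one fold per iteration.
fold_mse = []
for X_tr, X_te, y_tr, y_te in zip(
    train_x_list, test_x_list, train_y_list, test_y_list
):
    model = LinearRegression().fit(X_tr, y_tr)
    fold_mse.append(mean_squared_error(y_te, model.predict(X_te)))

print("MSE per fold:", fold_mse)
print("Mean MSE:", np.mean(fold_mse))
```

With 60 samples and 5 splits, the training sets grow from 10 to 50 rows while each test set holds the 10 rows that follow, so the model is always evaluated on data that comes after its training window.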