# Implementing Trading with Machine Learning Regression - Part - 2

In the previous notebook, we covered how to import data and create indicators. We also defined the dependent and independent variables for linear regression.

In this notebook, you will learn the machine learning regression technique. We will implement a linear regression model on a Gold ETF that predicts the day's high and the day's low given the day's open, high, low and other defined indicators. The key steps are:
1. Import the Data
2. Preprocess the Data
3. Grid Search Cross-Validation
4. Split Train and Test Data
5. Predict the High and Low Prices

In [2]:
# Import Machine Learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Import the libraries
import numpy as np
import pandas as pd

# For Plotting 
import matplotlib.pyplot as plt 
%matplotlib inline
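# 'seaborn-darkgrid' was renamed to 'seaborn-v0_8-darkgrid' in Matplotlib 3.6
style_name = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style_name)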

# To ignore unwanted warnings
import warnings 
warnings.filterwarnings("ignore")

### Import the Data
The input data is stored in `input_parameters.csv`, which we import here as `gold_prices` to make predictions using a `Pipeline`.

In [3]:
# Read the data
gold_prices = pd.read_csv('data/input_parameters.csv', index_col='Date')

# Printing the data
gold_prices.head()

| Date | Open | High | Low | Close | S_3 | S_15 | S_60 | Corr | Std_U | Std_D | OD | OL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 | NaN | NaN | NaN | NaN | 0.75 | 5.490005 | NaN | NaN |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 | NaN | NaN | NaN | NaN | 0.210007 | 3.139999 | -1.100006 | 3.589996 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 | NaN | NaN | NaN | NaN | 1.139999 | 1.489991 | -1.089996 | 1.009995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 | 132.326665 | NaN | NaN | NaN | 1.190003 | 0.5 | 0.309997 | 1.25 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 | 133.323334 | NaN | NaN | NaN | 0.020004 | 1.399994 | 1.880005 | 1.699997 |


#### Checking for NaN values
Here we will check for NaN values, then drop all the rows that contain NaN values using the `dropna` method.

In [4]:
gold_prices.isna().sum(axis=0)

Open      0
High      0
Low       0
Close     0
S_3       3
S_15     15
S_60     60
Corr     13
Std_U     0
Std_D     0
OD        1
OL        1
dtype: int64

We have 60 NaN values in `S_60`, 15 in `S_15`, 13 in `Corr`, 3 in `S_3`, and 1 each in `OD` and `OL`. Now we will simply drop all the rows containing NaN values using `dropna`.

In [5]:
# Dropping all the NaN values
gold_prices.dropna(inplace=True)

# Checking for NaN values
gold_prices.isna().sum()

Open     0
High     0
Low      0
Close    0
S_3      0
S_15     0
S_60     0
Corr     0
Std_U    0
Std_D    0
OD       0
OL       0
dtype: int64

Now our dataframe `gold_prices` is free from NaN values.
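
The leading NaN counts reported earlier are typical of rolling-window indicators. As a minimal sketch, assuming `S_3` was built as a 3-day moving average of the previous days' closes (an assumption that happens to reproduce the `S_3` values in the sample rows shown earlier), the snippet below shows where the NaN values come from and what `dropna` removes.

```python
import pandas as pd

# Close prices copied from the first five sample rows
close = pd.Series(
    [131.309998, 132.800003, 132.869995, 134.300003, 135.470001],
    index=pd.to_datetime(
        ["2013-04-15", "2013-04-16", "2013-04-17", "2013-04-18", "2013-04-19"]
    ),
    name="Close",
)

# Assumed construction: shift by one day, then average over a 3-day window
s_3 = close.shift(1).rolling(window=3).mean()

# The first three rows have no complete window, so they are NaN
print(s_3.isna().sum())

# dropna removes every row that has at least one NaN
sample = pd.DataFrame({"Close": close, "S_3": s_3})
clean = sample.dropna()
print(clean)
```

Because `dropna` removes whole rows, the indicator with the longest window (`S_60` here) determines how many leading rows are lost.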

In [6]:
# Independent variables
X = gold_prices[['Open', 'S_3','S_15','S_60','OD','OL', 'Corr']]

# Dependent variable for upward deviation
yU = gold_prices['Std_U']

# Dependent variable for downward deviation
yD = gold_prices['Std_D']
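
Judging from the sample rows shown earlier, `Std_U` appears to equal `High - Open` and `Std_D` appears to equal `Open - Low`. This is inferred from the data rather than stated in this notebook. Under that assumption, predicted deviations can be converted back into high and low price levels as sketched below. The `pred_up` and `pred_down` values are made-up placeholders, not model output.

```python
import pandas as pd

# Values copied from the first three sample rows
sample = pd.DataFrame(
    {
        "Open": [136.0, 134.899994, 133.809998],
        "High": [136.75, 135.110001, 134.949997],
        "Low": [130.509995, 131.759995, 132.320007],
        "Std_U": [0.75, 0.210007, 1.139999],
        "Std_D": [5.490005, 3.139999, 1.489991],
    },
    index=pd.to_datetime(["2013-04-15", "2013-04-16", "2013-04-17"]),
)

# Check the assumed definitions against the stored columns
upward_ok = ((sample["High"] - sample["Open"]) - sample["Std_U"]).abs().max() < 1e-5
downward_ok = ((sample["Open"] - sample["Low"]) - sample["Std_D"]).abs().max() < 1e-5
print(upward_ok, downward_ok)

# Hypothetical predicted deviations (placeholders for illustration only)
pred_up = pd.Series([0.8, 0.5, 1.0], index=sample.index)
pred_down = pd.Series([1.2, 0.9, 1.1], index=sample.index)

# Convert deviations back into price levels
predicted_high = sample["Open"] + pred_up
predicted_low = sample["Open"] - pred_down
print(pd.DataFrame({"P_H": predicted_high, "P_L": predicted_low}))
```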

### Data Preprocessing

Feeding preprocessed data to a machine learning model is essential. Raw data often contains errors and inconsistencies, and using it directly can lead to inconsistent and erroneous results.

#### Scaling
Suppose a feature has a variance that is orders of magnitude larger than that of the other features. In that case, it might dominate the objective function and make the estimator unable to learn from the other features correctly. To avoid this, we use `StandardScaler`, which standardizes each feature to zero mean and unit variance.
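
A minimal sketch of the scaling step on made-up numbers (not the gold data): one column is on the scale of prices and the other on the scale of small deviations. After `StandardScaler`, both columns have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative features: a price-like column and a small-deviation column
features = np.array(
    [
        [130.0, 0.5],
        [135.0, 1.5],
        [140.0, -1.0],
        [145.0, 1.0],
    ]
)

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Per-column statistics before and after scaling
print("Original std:", features.std(axis=0))
print("Scaled mean: ", scaled.mean(axis=0))
print("Scaled std:  ", scaled.std(axis=0))
```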

#### Linear Regression
Linear regression predicts a dependent variable from independent variables using a linear equation. Here we use `X` as the independent variables and `yU` and `yD` as the dependent variables.
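
As a small illustration on synthetic data (not the gold data), the sketch below generates a target from a known linear equation, `y = 2*x1 - 3*x2 + 5`, and checks that `LinearRegression` recovers the coefficients and the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a noise-free linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 5.0

model = LinearRegression()
model.fit(X_demo, y_demo)

# The fitted parameters should match the equation used to build y_demo
print("Coefficients:", model.coef_)
print("Intercept:   ", model.intercept_)
```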

#### Pipeline
We define a list of tuples that specify the machine learning tasks in their order of execution.

Each element of `steps` is a `(name, transform)` tuple. The `name` is the label given to the task, and the `transform` is the object that performs it. The pipeline then applies the tasks listed in `steps` sequentially.

Syntax:

    steps = [(name_1, transform_1), (name_2, transform_2), ..., (name_n, transform_n)]
    Pipeline(steps)

We use the following two steps in our pipeline:

1. Scaling the data
2. Fitting the data using the linear regression model

In [7]:
# First we put scaling and then linear regression in the pipeline
steps = [('scaler', StandardScaler()), ('linear', LinearRegression())]

# Defining pipeline
pipeline = Pipeline(steps)

#### Hyperparameters
There are some parameters that the model itself cannot estimate from the data. We still need to account for them, as they play a crucial role in the performance of the system. Such parameters are called hyperparameters. Here we tune only `fit_intercept`, but you can add more hyperparameters to tune this algorithm.

In [8]:
# Here we tune fit_intercept (a boolean) of the pipeline step named 'linear'
parameters = {'linear__fit_intercept': [False, True]}
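
The key `linear__fit_intercept` follows scikit-learn's `<step name>__<parameter name>` convention for parameters nested inside a pipeline. As a sketch, the names that can go into the grid can be listed with `get_params()`; the pipeline is rebuilt here only so the snippet runs on its own.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as in the pipeline defined above
demo_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("linear", LinearRegression())]
)

# Nested parameters are exposed as '<step name>__<parameter name>'
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
for name in tunable:
    print(name)
```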

### Grid Search Cross-Validation
Cross-validation estimates how the model will perform on unseen data, which helps to detect overfitting. We will use scikit-learn's `GridSearchCV` class, which evaluates every combination in the hyperparameter grid using cross-validation.

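We pass `TimeSeriesSplit(n_splits=5)` as `cv`, so the grid search runs five rounds of cross-validation and averages the performance results. `TimeSeriesSplit` splits the training data into successive segments in which each validation fold comes after its training fold, so the model is never validated on data older than the data it was trained on. We use `GridSearchCV` instead of `RandomizedSearchCV` because the hyperparameter grid is small enough to search exhaustively.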
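
To make the splitting concrete, here is a small sketch on 12 dummy observations (an arbitrary size chosen for readability). Each fold trains on an expanding window of earlier observations and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy observations in chronological order
observations = np.arange(12)

splitter = TimeSeriesSplit(n_splits=5)
folds = list(splitter.split(observations))

# Print the index positions used for training and validation in each fold
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```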

In [9]:
# Using TimeSeriesSplit for cross validation
my_cv = TimeSeriesSplit(n_splits=5)

# Define reg as the GridSearchCV object combining the pipeline, hyperparameter grid, and time series split
reg = GridSearchCV(pipeline, parameters, cv=my_cv)
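
The sketch below shows how such a grid search object might be fitted and queried. It uses synthetic data with a known non-zero intercept rather than the gold data, and all `demo_` names are hypothetical, so the numbers it prints say nothing about the actual model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, chronologically ordered data with a non-zero intercept
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(120, 2))
demo_y = 1.5 * demo_X[:, 0] - 0.7 * demo_X[:, 1] + 5.0 + rng.normal(scale=0.1, size=120)

demo_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("linear", LinearRegression())]
)
demo_parameters = {"linear__fit_intercept": [False, True]}
demo_reg = GridSearchCV(demo_pipeline, demo_parameters, cv=TimeSeriesSplit(n_splits=5))

# Fit on the first 100 observations and keep the last 20 for a final check
demo_reg.fit(demo_X[:100], demo_y[:100])
demo_predictions = demo_reg.predict(demo_X[100:])

print("Best parameters:", demo_reg.best_params_)
print("Mean CV score (R^2):", demo_reg.best_score_)
```

After fitting, `best_params_` holds the winning combination, and `predict` uses the pipeline refitted with those parameters on all the data passed to `fit`.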