<a href="https://colab.research.google.com/github/gingerchien/QuantHub/blob/main/LinearRegressionPricePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal: Predicting the Day's High and Low Given Its Open

Objectives:

* Learn the types of regression
* Understand the bias-variance trade-off
* Make predictions

### scikit-learn

* Pre-installed in Colab
* Important for regression problems
* Provides pre-processing classes that rescale the data, such as StandardScaler, MinMaxScaler, and MaxAbsScaler. StandardScaler standardizes each feature to zero mean and unit variance; it does not make the data normally distributed.
* Provides SimpleImputer to fill in NaN values. Dropping every row that contains a NaN would also discard the non-NaN values in that row's other columns.

In [9]:
import sklearn

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
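
As a rough illustration of how the three scalers named above differ, the sketch below applies each one to a small made-up column. The array `values` is hypothetical and is not taken from the gold price data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Hypothetical toy column (not from the gold price data)
values = np.array([[-2.0], [0.0], [2.0], [4.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
min_max = MinMaxScaler().fit_transform(values)         # rescaled to the range [0, 1]
max_abs = MaxAbsScaler().fit_transform(values)         # divided by the largest absolute value

print(standardized.ravel())
print(min_max.ravel())
print(max_abs.ravel())
```

None of the three changes the shape of the distribution; they only shift and rescale it.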

### Hyperparameter Optimization

* In machine learning, hyperparameter optimization (also called model selection) is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. GridSearchCV, imported below, does this by cross-validating every combination in a user-supplied parameter grid.

In [11]:
from sklearn.model_selection import GridSearchCV
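
A minimal sketch of a grid search, assuming synthetic data and a single hyperparameter (`fit_intercept` of `LinearRegression`). The data and the grid are illustrative choices, not the ones used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical noiseless data following y = 3x + 5
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0

# Candidate values for one hyperparameter of LinearRegression
param_grid = {'fit_intercept': [True, False]}

# 4-fold cross-validation scores every candidate on held-out folds
search = GridSearchCV(LinearRegression(), param_grid, cv=4)
search.fit(X, y)

print(search.best_params_)
```

Because the synthetic target has a non-zero intercept, the candidate that fits an intercept should score best on the held-out folds.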

### Linear Regression Model

In [12]:
from sklearn.linear_model import LinearRegression

### Pipeline

A Pipeline lets us specify, in order, the steps that we want the algorithm to follow. Its purpose is to assemble several steps that can be cross-validated together while setting different parameters.

In [13]:
from sklearn.pipeline import Pipeline

# Importing Required Libraries

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Data and Check for NaN Values

In [15]:
df = pd.read_csv('gold_prices.csv', index_col='Date')
df.isna().sum()

Open     0
High     0
Low      0
Close    0
dtype: int64

In [16]:
df.head()

| Date | Open | High | Low | Close |
|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 |


# Drop NaN Values
The zeros in every column above confirm that the raw data contains no NaN values, so dropna() leaves the DataFrame unchanged here.

In [17]:
df = df.dropna()
df.head()

| Date | Open | High | Low | Close |
|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 |


# Create Feature Columns

* Std_U = High - Open
* Std_D = Open - Low
* 3-period moving average: S_3 = Close.shift(1).rolling(window=3).mean()
* 15-period moving average: S_15 = Close.shift(1).rolling(window=15).mean()
* 60-period moving average: S_60 = Close.shift(1).rolling(window=60).mean()
* Today's open minus yesterday's open: OD = Open - Open.shift(1)
* Correlation indicator: Corr = Close.shift(1).rolling(window=10).corr(S_3.shift(1)), the 10-period rolling correlation between the previous close and the lagged 3-period moving average
* Overnight change: OL = Close.shift(1) - Open (yesterday's close minus today's open)

In [18]:
# Calculate the upward and downward deviations from the open
df['Std_U'] = df['High'] - df['Open']
df['Std_D'] = df['Open'] - df['Low']

In [19]:
# Calculate the moving averages of the previous closes as inputs for prediction
df['S_3'] = df['Close'].shift(1).rolling(window=3).mean()
df['S_15'] = df['Close'].shift(1).rolling(window=15).mean()
df['S_60'] = df['Close'].shift(1).rolling(window=60).mean()
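
The effect of combining shift(1) with rolling() can be seen on a tiny made-up series. The prices below are hypothetical; the point of the sketch is that each day's average only uses closes from earlier days, which is also why a window of 3 leaves three leading NaN values.

```python
import pandas as pd

# Hypothetical closing prices for five consecutive days
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# shift(1) moves each close down one row, so day t only sees closes up to day t-1
s_3 = close.shift(1).rolling(window=3).mean()
print(s_3.tolist())  # [nan, nan, nan, 2.0, 3.0]
```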

In [20]:
# Calculate the 10-day rolling correlation between the previous close and the lagged 3-day moving average
df['Corr'] = df['Close'].shift(1).rolling(window=10).corr(df['S_3'].shift(1))

In [21]:
# Calculate how much the open has changed compared to the previous day's open
df['OD'] = df['Open'] - df['Open'].shift(1)

# Calculate the overnight change: the previous day's close minus today's open
df['OL'] = df['Close'].shift(1) - df['Open']
df.tail()

| Date | Open | High | Low | Close | Std_U | Std_D | S_3 | S_15 | S_60 | Corr | OD | OL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-05-08 | 121.540001 | 121.540001 | 120.769997 | 120.910004 | 0.0 | 0.770004 | 120.89 | 120.606668 | 122.611834 | -0.221595 | 0.520004 | -0.330002 |
| 2019-05-09 | 120.959999 | 121.620003 | 120.860001 | 121.199997 | 0.660004 | 0.099998 | 120.976667 | 120.633335 | 122.567001 | -0.290695 | -0.580002 | -0.049995 |
| 2019-05-10 | 121.410004 | 121.730003 | 121.300003 | 121.43 | 0.319999 | 0.110001 | 121.106667 | 120.694668 | 122.522667 | -0.280418 | 0.450005 | -0.210007 |
| 2019-05-13 | 122.629997 | 122.849998 | 122.330002 | 122.669998 | 0.220001 | 0.299995 | 121.18 | 120.765334 | 122.490334 | 0.078028 | 1.219993 | -1.199997 |
| 2019-05-14 | 122.599998 | 122.660004 | 122.120003 | 122.459999 | 0.060006 | 0.479995 | 121.766665 | 120.918667 | 122.467167 | 0.365089 | -0.029999 | 0.07 |


# Import the Data

In [22]:
# Read the data
gold_prices = pd.read_csv(
    'input_parameters.csv', index_col='Date')
gold_prices.head()

| Date | Open | High | Low | Close | S_3 | S_15 | S_60 | Corr | Std_U | Std_D | OD | OL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 | NaN | NaN | NaN | NaN | 0.75 | 5.490005 | NaN | NaN |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 | NaN | NaN | NaN | NaN | 0.210007 | 3.139999 | -1.100006 | 3.589996 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 | NaN | NaN | NaN | NaN | 1.139999 | 1.489991 | -1.089996 | 1.009995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 | 132.326665 | NaN | NaN | NaN | 1.190003 | 0.5 | 0.309997 | 1.25 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 | 133.323334 | NaN | NaN | NaN | 0.020004 | 1.399994 | 1.880005 | 1.699997 |


# Check and Drop NaN Values

In [23]:
# Check for NaN values
gold_prices.isna().sum()

Open      0
High      0
Low       0
Close     0
S_3       3
S_15     15
S_60     60
Corr     13
Std_U     0
Std_D     0
OD        1
OL        1
dtype: int64

In [24]:
# Drop all the NaN values
gold_prices.dropna(inplace=True)

# Check for NaN values
gold_prices.isna().sum()

Open     0
High     0
Low      0
Close    0
S_3      0
S_15     0
S_60     0
Corr     0
Std_U    0
Std_D    0
OD       0
OL       0
dtype: int64

# Scaling the Data

Standardize the dataset by centering it around the mean and then scaling it. Centering shifts each feature so that its mean is 0.0, and scaling divides each entry by the feature's standard deviation, so that every feature ends up with a standard deviation of 1.
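
In formula terms, each value x in a column becomes z = (x - mean) / std. The sketch below checks this by hand against StandardScaler on a hypothetical column; the array `x` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual z-score: subtract the column mean, divide by the population standard deviation
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

z_sklearn = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_sklearn))  # True
```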

In [25]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the data in gold_prices and store it as an array in the variable scaled
scaled = scaler.fit_transform(gold_prices)

# Convert the scaled array back to a DataFrame
scaled_prices = pd.DataFrame(scaled, index=gold_prices.index, columns=gold_prices.columns)
scaled_prices.head()

| Date | Open | High | Low | Close | S_3 | S_15 | S_60 | Corr | Std_U | Std_D | OD | OL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-07-10 | 0.2784 | 0.388435 | 0.264553 | 0.247863 | 0.013875 | 0.444063 | 2.22376 | -1.247226 | 1.480757 | 0.216553 | 0.30828 | 0.675468 |
| 2013-07-11 | 0.751969 | 0.693429 | 0.704618 | 0.747942 | 0.159572 | 0.327314 | 2.195139 | -2.894026 | -0.745596 | 0.730431 | 3.103464 | 4.243061 |
| 2013-07-12 | 0.639286 | 0.684325 | 0.681697 | 0.731221 | 0.400532 | 0.261003 | 2.171491 | -2.2966 | 0.630713 | -0.587816 | -0.739927 | -0.928662 |
| 2013-07-15 | 0.72456 | 0.697981 | 0.761153 | 0.738822 | 0.579342 | 0.266537 | 2.147346 | -0.100679 | -0.320568 | -0.498424 | 0.557849 | -0.06884 |
| 2013-07-16 | 0.828106 | 0.822406 | 0.836026 | 0.846741 | 0.743888 | 0.257452 | 2.119389 | 0.310616 | -0.037213 | -0.07392 | 0.677638 | 0.739637 |


Note that the complete dataset should not be scaled before it is split into train and test sets. The correct approach is to fit the scaler on the train data only and then use the fitted scaler to transform both the train and test sets. This prevents information from the test set from leaking into training.
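
A minimal sketch of leak-free scaling, assuming a chronological 80/20 split. The feature matrix `X` and the split ratio are hypothetical stand-ins for the gold price features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix in time order (one row per day)
X = np.arange(10, dtype=float).reshape(-1, 1)

# Chronological 80/20 split (assumed ratio)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from the train rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the train statistics

print(scaler.mean_)
print(X_test_scaled.ravel())
```

The test rows are never passed to fit(), so their values cannot influence the mean and standard deviation used for scaling.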

# Using the Imputer
The imputer below replaces each NaN with the most frequent value in its column:

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
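
To see what this strategy does, the sketch below runs the same imputer on a hypothetical column with one missing entry. The array `data` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing entry
data = np.array([[1.0], [2.0], [2.0], [np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imp.fit_transform(data)
print(filled.ravel())  # [1. 2. 2. 2.]
```

SimpleImputer also accepts other strategies, such as 'mean' and 'median', which may suit continuous price data better than the most frequent value.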

# Using the Pipeline

* A Pipeline executes a fixed sequence of steps in order.
* It may contain several transformation steps followed by a final estimation step. In the model we are building, the transformation steps handle missing values and standardize the data, and the final step estimates with linear regression.
* The steps passed to Pipeline are stored as a list of tuples, each containing a key and a value. The key is a string naming the step, and the value is the object that performs it.
* The first tuple in the steps list has 'imputation' as the key and imp, the imputer instantiated earlier, as the value. It replaces missing (NaN) values in the data with the most frequently appearing value in the same column.
* The next two steps are StandardScaler(), which centers and scales the data, and LinearRegression(), which applies the linear regression algorithm to estimate the movement of the market. Once the steps are defined, we create the pipeline variable by passing steps as the argument to Pipeline().

* The pipeline chains transformers and estimators together so that they can be used as a single unit. This is useful when the data must pass through several processing steps before the model is trained or used to make predictions.

In [26]:
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

steps = [('imputation', imp),
         ('scaling', StandardScaler()),
         ('linear_regression', LinearRegression())]

pipeline = Pipeline(steps)
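
How the three steps behave as one unit can be sketched on a toy dataset with a missing value. The arrays `X` and `y` are hypothetical and chosen so that, once the NaN is filled, the target is exactly twice the feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column with a missing value, and a target equal to 2 * x
X = np.array([[1.0], [2.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 4.0, 4.0, 8.0])

pipeline = Pipeline([
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('scaling', StandardScaler()),
    ('linear_regression', LinearRegression()),
])

# One call runs imputation, scaling and regression in order
pipeline.fit(X, y)

# Each fitted step stays reachable by the name given in its tuple
print(pipeline.named_steps['imputation'].statistics_)
print(pipeline.predict(np.array([[3.0]])))
```

The imputer fills the NaN with 2.0, the most frequent value, so the regression sees a perfectly linear relationship.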

# Pipeline Example

In [27]:
# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore")

In [28]:
# Sample Dataset
# List containing close prices of Tesla (independent variable)
x= [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37, 775.00, 703.55]

# List containing close prices of Amazon (dependent variable)
y= [2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68, 2510.22, 2447.00]

In [29]:
# Split into Test and Train datasets

split = int(len(x)*0.8)

x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

In [None]:
# Reshape each 1D list into a 2D array of shape (-1, 1). scikit-learn expects one row per sample and one column per feature, and -1 lets NumPy infer the number of rows.

In [31]:
# Reshape training data
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train, (-1, 1))

# Reshape testing data
x_test = np.reshape(x_test, (-1, 1))
y_test = np.reshape(y_test, (-1, 1))

In [32]:
# Put scaling first, then regression, into the pipeline
steps = [('scaler', StandardScaler()),('linear_regression', LinearRegression())]

# Defining the pipeline
pipeline = Pipeline(steps, verbose=True)

In [33]:
# Fit the pipeline on the training set
pipeline.fit(x_train, y_train)

[Pipeline] ............ (step 1 of 2) Processing scaler, total=   0.0s
[Pipeline] . (step 2 of 2) Processing linear_regression, total=   0.0s


In [34]:
# Predict y for x_test using the predict() function
y_pred = pipeline.predict(x_test)

In [35]:
print(y_pred)

[[2418.75009388]
 [2246.40191606]]
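
As a cross-check on what the pipeline does internally, the sketch below repeats the example with the same sample prices and compares the pipeline's predictions with the result of calling the two steps by hand. The variable names `y_pipe`, `y_manual`, `scaler` and `model` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same sample prices as above, already split 80/20 and reshaped to 2D
x_train = np.array([663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37]).reshape(-1, 1)
y_train = np.array([2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68]).reshape(-1, 1)
x_test = np.array([775.00, 703.55]).reshape(-1, 1)

# Pipeline version
pipeline = Pipeline([('scaler', StandardScaler()), ('linear_regression', LinearRegression())])
y_pipe = pipeline.fit(x_train, y_train).predict(x_test)

# Manual version: fit the scaler on train, transform both sets, then regress
scaler = StandardScaler().fit(x_train)
model = LinearRegression().fit(scaler.transform(x_train), y_train)
y_manual = model.predict(scaler.transform(x_test))

print(np.allclose(y_pipe, y_manual))  # True
```

The pipeline fits the scaler on the training data only and reuses it at prediction time, which is the leak-free pattern described in the scaling section.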


# Hyperparameter Tuning