Data Preprocessing

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [None]:
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
dataset_train.head()

In the syntax iloc[:, 1:2], the colon : before the comma selects all rows, and the 1:2 after the comma selects the second column (index 1).

You may wonder why we use 1:2 instead of just 1. Slices in iloc include the left bound and exclude the right bound, so 1:2 covers only column index 1. The difference lies in the type of the result: iloc[:, 1] also selects the second column, but it returns a pandas Series rather than a pandas DataFrame with one column.

By using iloc[:, 1:2], we select a single column while preserving the DataFrame structure. Calling .values on that result then gives a 2D NumPy array of shape (n, 1) instead of a 1D array, which is the layout the scaler used later expects.
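The difference between the two selections is easiest to see on a small example. The sketch below uses a made-up three-row frame (the column names and values are placeholders, not the real stock data) and compares the types and shapes returned.

```python
import pandas as pd

# Toy frame standing in for the stock data (values are made up).
df = pd.DataFrame({"Date": ["d1", "d2", "d3"], "Open": [10.0, 11.0, 12.0]})

as_series = df.iloc[:, 1]     # integer position -> pandas Series
as_frame = df.iloc[:, 1:2]    # slice -> one-column pandas DataFrame

print(type(as_series).__name__, as_series.values.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.values.shape)    # DataFrame (3, 1)
```

The (3, 1) shape is the 2D layout that the rest of the notebook relies on.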

In [None]:
training_set = dataset_train.iloc[:,1:2].values
training_set

Feature scaling


Here, we normalise the values in training_set to the range 0 to 1.
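For a feature range of (0, 1), min-max scaling maps each value x to (x - min) / (max - min), computed per column. The following is a minimal NumPy sketch of that formula on made-up prices; it illustrates the arithmetic and is not a replacement for MinMaxScaler.

```python
import numpy as np

# Toy column of prices (made-up values), shaped (n, 1) like training_set.
prices = np.array([[10.0], [15.0], [20.0]])

col_min = prices.min(axis=0)
col_max = prices.max(axis=0)

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (prices - col_min) / (col_max - col_min)

# Inverse scaling: x = x' * (max - min) + min
restored = scaled * (col_max - col_min) + col_min

print(scaled.ravel().tolist())       # [0.0, 0.5, 1.0]
print(np.allclose(restored, prices)) # True
```

The inverse step is the same idea that is used later to turn scaled predictions back into prices.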

In [None]:
from sklearn.preprocessing import MinMaxScaler
minMax = MinMaxScaler(feature_range = (0,1))
training_set_scaled = minMax.fit_transform(training_set)




The next step is to decide how many time steps the RNN should remember.
We use 60 time steps, so before predicting a price the network looks at the 60 previous stock prices.
The output is a single value: the price at time t + 1.

The loop starts at 60 because the first 60 days of data are used to create the first input sequence for the model.

Each input sequence consists of 60 consecutive stock prices, and the target is the stock price that follows them. If we started at 0, there would not be 60 earlier prices available to build the first sequence.

Starting the loop at 60 ensures that a full 60-value window exists for the first sample. The loop ends at 1258, which is the total number of data points in the training set. Because range excludes its end value, the last index used is 1257, the final row.

This code creates a sliding window of 60 time steps (60 days in this case) and uses that window to predict the value for the next time step. The X_train array will contain 1198 (1258 - 60) rows, with each row containing 60 values (60 days of stock prices). The y_train array will contain 1198 values, each being the stock price on the day immediately after the 60-day window in the corresponding row of X_train.
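The windowing described above can be checked on a toy series. In this sketch, `window = 3` stands in for the 60 time steps and `series` is made-up data, so both names are illustrative only.

```python
import numpy as np

series = np.arange(10, dtype=float)   # toy data: 0.0, 1.0, ..., 9.0
window = 3                            # stands in for the 60 time steps

X, y = [], []
for i in range(window, len(series)):
    X.append(series[i - window:i])    # the `window` values before index i
    y.append(series[i])               # the value at index i is the target

X, y = np.array(X), np.array(y)

print(X.shape, y.shape)   # (7, 3) (7,)
print(X[0].tolist())      # [0.0, 1.0, 2.0]
print(float(y[0]))        # 3.0
```

The number of samples is len(series) - window, which matches 1258 - 60 = 1198 in the notebook.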

In [None]:
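# Build the training windows (X_train) and their targets (y_train)
X_train = []
y_train = []

for i in range(60, 1258):
    X_train.append(training_set_scaled[i-60:i, 0])  # the 60 prices before day i
    y_train.append(training_set_scaled[i, 0])       # the price on day i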

X_train, y_train = np.array(X_train), np.array(y_train)

In [None]:
X_train

Reshaping X_train by adding one more dimension.
At the moment we have a single indicator, the Open price.
The extra dimension holds the number of indicators, so more could be added later.

In [None]:
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1 ))

The line of code X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1)) reshapes the 2D array X_train into a 3D array. This is the input format required by a Recurrent Neural Network (RNN), a type of neural network used for sequence prediction tasks such as stock price prediction.

The first element of the new shape (X_train.shape[0]) is the number of training samples. The second element (X_train.shape[1]) is the number of time steps in each training sample, which is 60 in this case (as we are using 60 previous stock prices to predict the next one). The third element, 1, is the number of features in the input data, which in this case is the stock price at each time step.

So, the resulting shape of X_train will be (1198, 60, 1), which means we have 1198 training samples, each with 60 time steps, and 1 feature (stock price) at each time step.

The third dimension is added to indicate that the data is univariate (i.e., it only has one feature, the stock price), as RNNs can also be used for multivariate time series analysis where the data has multiple features at each time step.
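A small example makes the effect of the reshape visible. The sketch below uses a made-up array with 4 samples and 3 time steps, and also shows that adding a trailing axis with np.newaxis gives the same result.

```python
import numpy as np

# Toy 2D array: 4 samples, 3 time steps each (values are made up).
X = np.arange(12, dtype=float).reshape(4, 3)

# Same call pattern as in the notebook: (samples, time steps, features).
X_3d = np.reshape(X, (X.shape[0], X.shape[1], 1))
print(X_3d.shape)   # (4, 3, 1)

# Adding a trailing axis with np.newaxis is an equivalent way to do it.
same = np.array_equal(X_3d, X[..., np.newaxis])
print(same)         # True
```

The values themselves are unchanged; only the shape gains a final axis of length 1.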

<h1>Building the RNN using LSTM</h1>
A stacked LSTM with dropout regularisation to reduce overfitting

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

Initialising the RNN

In [None]:
rnn = Sequential()



Adding the first LSTM layer and some dropout regularisation

In [None]:
rnn.add(LSTM(units = 50, return_sequences= True, input_shape= (X_train.shape[1], 1)))

In [None]:
rnn.add(Dropout(rate = 0.2)) # 20% of the LSTM outputs are randomly zeroed at each training step
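To illustrate what a dropout rate of 0.2 means, here is a minimal NumPy sketch of inverted dropout, where the kept values are scaled by 1 / (1 - rate) so that the expected total stays the same. It is a simplified illustration on made-up activations, not Keras's actual implementation, and dropout of this kind is applied only during training.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
rate = 0.2
activations = np.ones(10_000)    # made-up layer outputs

# Keep each unit with probability 1 - rate.
keep_mask = rng.random(activations.shape) >= rate

# Inverted dropout: zero the dropped units and scale the kept ones.
dropped = activations * keep_mask / (1.0 - rate)

fraction_zeroed = 1.0 - keep_mask.mean()
print(round(float(fraction_zeroed), 2))  # close to 0.2
print(round(float(dropped.mean()), 2))   # close to 1.0
```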

2nd LSTM Layer

In [None]:
rnn.add(LSTM(units = 50, return_sequences= True))
rnn.add(Dropout(rate = 0.2))

3rd LSTM Layer

In [None]:
rnn.add(LSTM(units = 50, return_sequences= True ))
rnn.add(Dropout(rate = 0.2))

4th LSTM layer

In [None]:
rnn.add(LSTM(units = 50))
rnn.add(Dropout(rate = 0.2))

Adding the output layer

In [None]:
rnn.add(Dense(units = 1)) #Stock price at time t + 1

In [None]:
print(rnn.get_weights())

Compiling the RNN

In [None]:
rnn.compile(optimizer='adam', loss ='mean_squared_error')

Fit to training set

In [None]:
rnn.fit(X_train, y_train, batch_size= 32 , epochs= 100 )

Predict January 2017 
Make the predictions and visualise the results
Get the real stock price

In [None]:
dataset_test   = pd.read_csv('Google_Stock_Price_Test.csv')
real_stock_price = dataset_test.iloc[:,1:2].values

Get the predicted stock price for January 2017
We trained the model on windows of the previous 60 prices.
So to predict each day of January, we need the 60 prices that precede that day.
For the first days of January, most of those 60 prices come from the training set, so we need to concatenate training and test data.
X_train and y_train hold scaled values that are already cut into windows, so they cannot simply be joined with the unscaled test prices.
Re-fitting the scaler on the test data would also change the scaling that the model was trained with.

Instead, we concatenate the original dataset_train and dataset_test, take the rows we need, and scale them with the scaler that was already fitted on the training set.


In [None]:
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']),axis=0) 
# Concatenates the "Open" stock prices from dataset_train and dataset_test into one pandas Series, dataset_total.

inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60 : ].values
# Selects the last 60 training prices followed by all test prices from dataset_total;
# the 60 prices before the first test day are needed to predict that day.

inputs = inputs.reshape(-1,1) #to avoid any format issues
# Reshapes the input data to a 2D array with one column, the layout the scaler expects.

inputs = minMax.transform(inputs)
# Scales the input data to fit the range of values that the RNN model was trained on using the minMax scaler.

# Build one 60-price window per day to predict
X_test = []

for i in range(60, 80):
    X_test.append(inputs[i-60:i, 0])

# Populates the test input list with sequences of 60 scaled prices, one per predicted day (range(60, 80) gives 20 windows).
  
X_test = np.array(X_test)
# Converts the test input data list to a numpy array.

X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1 ))
# Reshapes the test input data to a 3D array to fit the input format expected by the RNN model.

predicted_stock_price = rnn.predict(X_test)
# Predicts the stock prices for the test windows; the outputs are still on the scaled 0 to 1 range.

predicted_stock_price = minMax.inverse_transform(predicted_stock_price)
#Inverse scales the predicted stock prices back to their original values using the minMax scaler.
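The index arithmetic used to select the inputs in the cell above can be checked on toy data. In this sketch, `window = 3` stands in for 60, and `train` and `test` are made-up arrays. The loop bound `window + len(test)` is an assumption that generalises the hard-coded `range(60, 80)`, which matches a test set of 20 rows.

```python
import numpy as np

window = 3                                # stands in for 60
train = np.arange(10, dtype=float)        # toy "training" prices 0..9
test = np.arange(10, 14, dtype=float)     # toy "test" prices 10..13

total = np.concatenate((train, test))

# Last `window` training values followed by all test values.
inputs = total[len(total) - len(test) - window:]

# One window per test day; the end bound is derived from the data.
X_test = np.array(
    [inputs[i - window:i] for i in range(window, window + len(test))]
)

print(X_test.shape)         # (4, 3)
print(X_test[0].tolist())   # [7.0, 8.0, 9.0]
print(X_test[-1].tolist())  # [10.0, 11.0, 12.0]
```

The first window contains only training values and is used to predict the first test day; the last window is used to predict the final test day.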


In [None]:
predicted_stock_price

The raw numbers do not show whether the predictions follow the same trend as the real prices. Let's plot both to see how close the predictions are.

In [None]:
plt.figure(figsize=(12,10))
plt.plot(real_stock_price, color='blue', label="Real Stock price - Jan 2017")
plt.title("Stock prices")

plt.plot(predicted_stock_price, color='red', label="Predicted Stock price - Jan 2017")
plt.xlabel('Time')
plt.ylabel('Stock prices')
plt.legend()