## MLAIF: Predicting Stock Return


The submitted report should include:
1. Present the detailed steps of the exercise (1 point)
2. Find your final best model (highest accuracy you can get) (2 points)
3. Evaluate the performance of the model (1 point)
4. Code (2 points)

### Step 1. Data Loading and Preprocessing

In [1]:
# Import libraries
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [2]:
# Set seeds
np.random.seed(1)

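# Load the dataset from the home directory
# (os.chdir does not expand '~', so expand it explicitly)
os.chdir(os.path.expanduser('~'))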

dataset = pd.read_excel('sample_dataset.xlsx')
dataset = dataset.dropna()
dataset = dataset[['open', 'high', 'low', 'close','volume', 'amount']]


print(dataset.head())

        open       high        low      close    volume       amount
0  3394.5740  3394.5740  3394.5740  3394.5740  177127.0  247062735.6
1  3388.2855  3388.2855  3381.3847  3381.3847  342509.0  506723446.2
2  3379.9758  3379.9758  3377.1050  3377.1050  235704.0  347825009.2
3  3376.9107  3376.9107  3375.5988  3376.6061  298665.0  421905301.0
4  3375.8573  3375.8573  3375.1275  3375.5162  299740.0  426854063.8


### Step 2. Model Construction (feature engineering, deep learning and model evaluation)
In this exercise, we use a deep neural network (NN) and long short-term memory (LSTM) models to predict the direction of the next stock price move.

#### Model 1: NN
The model is based on:
$$ y_{t+1} = f(\text{open}, \text{high}, \text{low}, \text{volume}, \text{amount}) $$

In [3]:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import SGD, Adam
from keras.utils import to_categorical

Using TensorFlow backend.


In [5]:
# Split into input (X1) and output (Y1) variables
rawdata = pd.read_excel('MFIN7034_A3_dataset.xlsx')
X1 = rawdata.drop(['DATETIME', 'close'], axis=1)
Y1 = rawdata['close']

# Label Y1: 1 if the next tick return is positive, else 0
Y1 = Y1.pct_change()        # return from t-1 to t: (Y1 - Y1.shift(1)) / Y1.shift(1)
Y1 = Y1.shift(-1)           # align the return from t to t+1 with row t
Y1 = Y1.drop(Y1.index[-1])  # the last row has no next tick
Y1 = pd.Series(np.where(Y1.values > 0, 1, 0), Y1.index)

# Features X1: 1 if the tick-over-tick change is positive, else 0
X1 = X1.pct_change()
X1 = X1.shift(-1)           # note: row t holds the change from t to t+1, the same interval as the label
X1 = X1.drop(X1.index[-1])  # drop the last row (row 397982 in the original data; NaN after the shift)
for col in ['open', 'high', 'low', 'volume', 'amount']:
    X1[col] = np.where(X1[col] > 0, 1, 0)


# Training set (60%)
X = X1.loc[0:238789,]
Y = Y1.loc[0:238789,]
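
The hard-coded row numbers in this notebook correspond to a roughly 60/20/20 chronological split. As a rough sketch (the helper name `chronological_split` and its default fractions are assumptions for illustration, not part of the original notebook), the boundaries could instead be derived from the number of rows:

```python
def chronological_split(n_rows, train_frac=0.6, cv_frac=0.2):
    """Return (train, cv, test) slices that keep rows in time order.

    Hypothetical helper: the fractions mirror the 60/20/20 split used in
    this notebook, but the function itself is only an illustration.
    """
    train_end = int(n_rows * train_frac)
    cv_end = train_end + int(n_rows * cv_frac)
    return slice(0, train_end), slice(train_end, cv_end), slice(cv_end, n_rows)


# Usage with positional indexing, e.g. X1.iloc[train_idx], Y1.iloc[train_idx]
train_idx, cv_idx, test_idx = chronological_split(10)
print(train_idx, cv_idx, test_idx)
```

Because the slices are half-open and contiguous, no row is skipped or counted twice at the boundaries, which is easy to get wrong with inclusive `.loc` ranges.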



In [None]:
# Create model 1: fully connected NN with 5 hidden layers of 20 neurons each; epochs=20, batch_size=40
n=20
model = Sequential()
model.add(Dense(n, input_dim=5, activation='relu'))
model.add(Dense(n, activation='relu'))
model.add(Dense(n, activation='relu'))
model.add(Dense(n, activation='relu'))
model.add(Dense(n, activation='relu'))
# Use sigmoid for a single-unit binary output (softmax over one unit always returns 1)
model.add(Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X, Y, epochs=20, batch_size=40 )
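
For a single output unit, the choice of output activation matters. The sketch below uses plain NumPy (the `softmax` and `sigmoid` helpers are written here for illustration and are not Keras calls) to show that a softmax over one unit is always 1, whereas a sigmoid gives a usable probability:

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Three example pre-activation values for a layer with ONE output unit
logits = np.array([[-2.0], [0.0], [3.0]])

softmax_out = softmax(logits)   # every row normalizes to exactly 1
sigmoid_out = sigmoid(logits)   # values differ with the input

print(softmax_out.ravel())
print(sigmoid_out.ravel())
```

A network whose output is constant cannot separate the two classes, so its accuracy would simply equal the share of positive labels.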

In [None]:
# Evaluate the model 1
scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

In [None]:
# Tune the architecture and add BatchNormalization layers
model = Sequential()
model.add(Dense(64, input_dim=5))

model.add(BatchNormalization())
model.add(Dense(16, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(2, activation='relu'))
# Use sigmoid for a single-unit binary output (softmax over one unit always returns 1)
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=20, batch_size=40 )

In [None]:
# Evaluate the model again by using CV and testing sets.

scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# .loc slices are inclusive, so start right after the last training row (238789)
Xcv = X1.loc[238790:318887,]
Ycv = Y1.loc[238790:318887,]
#Ycv = to_categorical(Ycv, num_classes=2)
score = model.evaluate(Xcv,Ycv)
print('Accuracy of cross validation set',score[1])

Xtest = X1.loc[318888:398982,]
Ytest = Y1.loc[318888:398982,]
#Ytest = to_categorical(Ytest, num_classes=2)
score = model.evaluate(Xtest,Ytest)
print('Accuracy of testing set',score[1])

#### Model 2: LSTM
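The model is based on the following, where $x_t$ is the close price at time $t$ (the squared term uses the previous close, as in the code below):
$$ y_{t+1} = f(x_t, x_{t-1}^2, \text{volume}, \text{amount}) $$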

In [7]:
# Create new feature: squared close of the previous tick
dataset['close2']= dataset['close'].shift(1)**2

# Define dummy variable: trend of stock price
dataset['Price_Rise'] = np.where(dataset['close'].shift(-1) > dataset['close'], 1, 0)
dataset = dataset.dropna()

dataset.head()

        open       high        low      close    volume       amount      close2  Price_Rise
1  3388.2855  3388.2855  3381.3847  3381.3847  342509.0  506723446.2  11523130.0           0
2  3379.9758  3379.9758  3377.1050  3377.1050  235704.0  347825009.2  11433760.0           0
3  3376.9107  3376.9107  3375.5988  3376.6061  298665.0  421905301.0  11404840.0           0
4  3375.8573  3375.8573  3375.1275  3375.5162  299740.0  426854063.8  11401470.0           0
5  3375.6936  3375.6936  3374.9391  3375.5097  273694.0  394890093.5  11394110.0           1


In [8]:
# Extract X and y
X = dataset.iloc[:, 3:-1]
y = dataset.iloc[:, -1]

# Create training and testing sets
split = int(len(dataset)*0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# Normalize the feature values
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# reshaping - Adding time interval as a dimension for input.
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1])) # time_steps = 1
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

print('The shape of X_train is: \n', X_train.shape)

The shape of X_train is: 
 (318365, 1, 4)
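
With `time_steps = 1`, each sample passed to the LSTM contains a single row. If longer input sequences were wanted, one option (a sketch under assumptions: the helper `make_windows` and the window length of 3 are hypothetical and were not used for the results in this notebook) is to stack consecutive rows into overlapping windows:

```python
import numpy as np


def make_windows(features, labels, time_steps):
    """Stack `time_steps` consecutive rows into one LSTM sample.

    features: array of shape (n_rows, n_features)
    labels:   array of shape (n_rows,)
    Returns X of shape (n_rows - time_steps + 1, time_steps, n_features)
    and y aligned with the LAST row of each window.
    """
    n_samples = features.shape[0] - time_steps + 1
    X = np.stack([features[i:i + time_steps] for i in range(n_samples)])
    y = labels[time_steps - 1:]
    return X, y


# Toy data: 6 rows, 2 features
toy_X = np.arange(12, dtype=float).reshape(6, 2)
toy_y = np.array([0, 1, 0, 1, 1, 0])

X_seq, y_seq = make_windows(toy_X, toy_y, time_steps=3)
print(X_seq.shape, y_seq.shape)  # (4, 3, 2) (4,)
```

The resulting array already has the `(samples, time_steps, features)` layout that the `input_shape` argument of the LSTM layer expects.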


In [9]:
# Building the RNN(LSTM)
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

# Initialising the RNN
# Creating an object of Sequential class to create the RNN.
classifier = Sequential()

# Adding the input layer and the LSTM layer
# input_shape = (len_of_seq, nb_of_features)
classifier.add(LSTM(units = 32, activation = 'relu', input_shape = (X_train.shape[1],X_train.shape[2]), return_sequences=False))

# Adding the output layer
# 1 neuron in the output layer for a 1-dimensional output
classifier.add(Dense(units = 1, activation = 'sigmoid'))

# Compiling the RNN
# Compile the model with the Adam optimizer.
# Binary cross-entropy is the loss minimized when the weights are updated.
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the RNN to the Training set
# Number of epochs increased for better convergence.
classifier.fit(X_train, y_train, batch_size = 200, epochs = 30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1a236714e0>

In [10]:
# Evaluate the model
scores = classifier.evaluate(X_test, y_test)

print("\n%s: %.2f%%" % (classifier.metrics_names[1], scores[1]*100))


acc: 50.78%
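
An accuracy close to 50% is easier to interpret next to a naive baseline. As a minimal sketch (the function `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not results from this dataset), the accuracy of always predicting the most frequent class can be computed as:

```python
import numpy as np


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class (0 or 1)."""
    y_true = np.asarray(y_true)
    positive_rate = y_true.mean()
    return max(positive_rate, 1.0 - positive_rate)


# Toy labels only; in the notebook this would be called on y_test
toy_labels = np.array([1, 0, 0, 1, 0, 0, 0, 1])
baseline = majority_baseline_accuracy(toy_labels)
print("Baseline accuracy: %.2f%%" % (baseline * 100))
```

A model is only informative to the extent that its test accuracy exceeds this number.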


#### Model 3: LSTM
The model is based on:
$$ y_{t+1} = f(\text{volume}, \text{amount}, \text{HL}, \text{OC}, \text{3m MA}, \text{10m MA}, \text{30m MA}, \text{volatility}) $$

In [11]:
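# Reload the base columns so that the column positions used for slicing below
# do not depend on the features added for Model 2
dataset = pd.read_excel('sample_dataset.xlsx').dropna()
dataset = dataset[['open', 'high', 'low', 'close', 'volume', 'amount']]

# Create new features
dataset['H-L'] = dataset['high'] - dataset['low']    # high-low spread
dataset['O-C'] = dataset['close'] - dataset['open']  # close minus open (column keeps the name O-C)
dataset['3m MA'] = dataset['close'].shift(1).rolling(window = 3).mean() # 3-minute moving average of past closes
dataset['10m MA'] = dataset['close'].shift(1).rolling(window = 10).mean()
dataset['30m MA'] = dataset['close'].shift(1).rolling(window = 30).mean()
dataset['Std_dev']= dataset['close'].rolling(5).std() # 5-minute rolling standard deviation (volatility)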

# Define dummy variable: trend of stock price
dataset['Price_Rise'] = np.where(dataset['close'].shift(-1) > dataset['close'], 1, 0)
dataset = dataset.dropna()

dataset.head()

         open       high        low      close    volume       amount     H-L     O-C        3m MA      10m MA       30m MA   Std_dev  Price_Rise
30  3358.1065  3358.1065  3358.1065  3360.5149  244622.0  369415381.4  0.0000  2.4084  3356.039733  3356.71862  3367.570947  2.508297           1
31  3360.4355  3360.4355  3360.4340  3362.4217  256421.0  364852618.5  0.0015  1.9862  3358.064533  3356.62247  3366.435643  3.299309           1
32  3362.3333  3362.3333  3362.3333  3364.5874  227649.0  348879482.7  0.0000  2.2541  3360.299800  3356.95353  3365.803543  3.513844           1
33  3364.9953  3364.9953  3364.9780  3365.5744  219018.0  327010766.0  0.0173  0.5791  3362.508000  3357.73911  3365.386290  3.079289           1
34  3365.9502  3365.9502  3365.9502  3366.4954  225914.0  341952828.2  0.0000  0.5452  3364.194500  3358.75224  3365.018567  2.431807           0
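
The moving averages in this cell are computed on `close.shift(1)`, so each value uses only earlier rows. A small sketch on made-up prices (the toy series is an assumption for illustration) shows the effect, together with the `Price_Rise` label:

```python
import numpy as np
import pandas as pd

# Made-up close prices, for illustration only
toy = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Moving average over the 3 rows BEFORE the current one
toy['3m MA'] = toy['close'].shift(1).rolling(window=3).mean()

# Label: 1 if the next close is higher than the current close, else 0
toy['Price_Rise'] = np.where(toy['close'].shift(-1) > toy['close'], 1, 0)

print(toy)
```

Note that the last row receives the label 0 because its next close is unknown (a comparison with NaN is False), and `dropna()` does not remove it.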


In [12]:
# Extract X and y
X = dataset.iloc[:, 4:-1]
y = dataset.iloc[:, -1]

# Create training and testing sets
split = int(len(dataset)*0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# Normalize the feature values
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# reshaping - Adding time interval as a dimension for input.
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1])) # time_steps = 1
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

print('The shape of X_train is: \n', X_train.shape)

The shape of X_train is: 
 (318342, 1, 8)


In [13]:
# Building the RNN(LSTM)
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

# Initialising the RNN
# Creating an object of Sequential class to create the RNN.
classifier = Sequential()

# Adding the input layer and the LSTM layer
# input_shape = (len_of_seq, nb_of_features)
classifier.add(LSTM(units = 32, activation = 'relu', input_shape = (X_train.shape[1],X_train.shape[2]), return_sequences=False))

# Adding the output layer
# 1 neuron in the output layer for a 1-dimensional output
classifier.add(Dense(units = 1, activation = 'sigmoid'))

# Compiling the RNN
# Compile the model with the Adam optimizer.
# Binary cross-entropy is the loss minimized when the weights are updated.
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the RNN to the Training set
# Number of epochs increased for better convergence.
classifier.fit(X_train, y_train, batch_size = 200, epochs = 30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1a20385550>

In [14]:
# Evaluate the model
scores = classifier.evaluate(X_test, y_test)

print("\n%s: %.2f%%" % (classifier.metrics_names[1], scores[1]*100))

acc: 68.92%


### Step 3: Conclusion

1. Present the detailed steps of the exercise (1 point)
> The general idea is:
>- Load and clean the data;
>- For each of the three models, split the dataset chronologically into training and testing sets (plus a cross-validation set for Model 1);
>- Apply deep learning algorithms while guarding against overfitting;
>- Evaluate each model's performance and choose the final model.
2. Find your final best model (highest accuracy you can get) (2 points)
> The best model is Model 3, with a test accuracy of 68.92%.<br>
>- $$ y_{t+1} = f(\text{volume}, \text{amount}, \text{HL}, \text{OC}, \text{3m MA}, \text{10m MA}, \text{30m MA}, \text{volatility}) $$
>- **Reasons:**
>- This is a time series prediction problem, which adds sequence dependence among the input variables. The LSTM network is a type of recurrent neural network designed for sequential data. With an LSTM, larger architectures can be trained successfully on such data, whereas a plain deep NN has no built-in notion of sequence order.
>- Feature engineering. Instead of relying only on the original variables, we created new features including **spreads, moving averages and volatility**. These help the model extract useful signal while filtering out redundant information.
3. Evaluate the performance of the model (1 point)
>- According to the evaluation metrics on the testing set, the model loss is 0.5739 and the accuracy is 0.6892.
>- This is a large improvement over Model 2, which reached 50.78% accuracy on its testing set.
4. Code (2 points)
>- Please refer to the Jupyter notebook.
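
For the evaluation in item 3, accuracy alone does not show how errors are distributed between the two classes. A possible complement (a sketch only: `binary_confusion_matrix` and the toy arrays are assumptions, and in the notebook the predictions would come from thresholding `classifier.predict(X_test)` at 0.5) is a confusion matrix:

```python
import numpy as np


def binary_confusion_matrix(y_true, y_pred):
    """Return a 2x2 array: rows are true classes (0, 1), columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Toy example, not results from the model
toy_true = np.array([1, 0, 1, 1, 0, 0])
toy_prob = np.array([0.8, 0.4, 0.3, 0.6, 0.7, 0.1])
toy_pred = (toy_prob > 0.5).astype(int)

cm = binary_confusion_matrix(toy_true, toy_pred)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print("accuracy:", accuracy)
```

The diagonal holds the correct predictions, so the accuracy reported by `evaluate` corresponds to the trace divided by the total count.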