# CQF 2022 - January Cohort
## Final Project - Deep Learning

#### Enzo Yamamura

***
### ETH direction LSTM classifier

* Dataset used: [ETH-USD Daily OHLC (Kaggle)](https://www.kaggle.com/datasets/abhimaneukj/ethereum-historical-dataset?resource=download)

***
## Importing necessary libs and data

In [334]:
!pip install ta
!pip install minisom

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [438]:
import numpy as np
import pandas as pd
import plotly.express as px
import io
import ta
from ta import add_all_ta_features
from ta.utils import dropna
from minisom import MiniSom
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Dropout, LeakyReLU
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf
from tensorflow.random import set_seed
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

In [289]:
from google.colab import files
uploaded = files.upload()

Saving ETHUSD.csv to ETHUSD (1).csv


***
## Functions:

In [292]:
def fill_ohlc(df):
  """
  This function takes in an OHLC dataset and fills in the NaN gaps according to the following logic:

  - Open = previous day close

  - Close & Adj Close = next obs Open price

  - High/Low = max / min (open x close)

  It assumes that the columns are named in the usual manner (capitalized first letter).
  """
  OHLC = ['Open',	'High',	'Low',	'Close',	'Volume']
  nan_rows = df[df[OHLC].isna().any(axis=1)].index

  # Workaround considering consecutive NaNs:
  df['Open'], df['Close'] = (df['Open'].fillna(df['Close'].ffill(limit=1)),
                           df['Close'].fillna(df['Open'].bfill(limit=1)))

  
  df[['Open','Close']] = df[['Open','Close']].ffill()
  # Filling Highs and Lows:
  df['High'], df['Low'] = df['High'].fillna(df[['Open','Close']].max(axis =1)), df['Low'].fillna(df[['Open','Close']].min(axis =1))

  # Interpolation makes sense for volumes:
  df['Volume'] = df['Volume'].interpolate()

  print('The following rows were altered:')
  print(df.loc[nan_rows])
  return df

***
## Initial checks on the database

In [293]:
df = pd.read_csv(io.BytesIO(uploaded['ETHUSD.csv']))

In [294]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2267 entries, 0 to 2266
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       2267 non-null   object 
 1   Open       2262 non-null   float64
 2   High       2262 non-null   float64
 3   Low        2262 non-null   float64
 4   Close      2262 non-null   float64
 5   Adj Close  2262 non-null   float64
 6   Volume     2262 non-null   float64
dtypes: float64(6), object(1)
memory usage: 124.1+ KB


In [295]:
# We won't bw using Adj Close, hence:
del df['Adj Close']

In [296]:
# Checking null OHLC
# df[df['Open'].isna()]
OHLC = ['Open',	'High',	'Low',	'Close',	'Volume']
df[df[OHLC].isna().any(axis=1)]

Unnamed: 0,Date,Open,High,Low,Close,Volume
1715,2020-04-17,,,,,
1890,2020-10-09,,,,,
1893,2020-10-12,,,,,
1894,2020-10-13,,,,,
2265,2021-10-19,,,,,


The dataset has 5 days with missing OHLC data.
We'll keep the series stable during the gaps. In order to do that:

* Open = previous day close

* Close & Adj Close = next obs Open price

* High/Low = max / min (open x close)


In [297]:
df = fill_ohlc(df)

The following rows were altered:
            Date         Open         High          Low        Close  \
1715  2020-04-17   172.157379   172.157379   171.618073   171.618073   
1890  2020-10-09   351.455658   365.402466   351.455658   365.402466   
1893  2020-10-12   375.142059   375.142059   375.142059   375.142059   
1894  2020-10-13   375.142059   381.971466   375.142059   381.971466   
2265  2021-10-19  3748.760254  3888.102539  3748.760254  3888.102539   

            Volume  
1715  2.153540e+10  
1890  1.256475e+10  
1893  1.302929e+10  
1894  1.347407e+10  
2265  1.666277e+10  


***
### Adding main technical analysis indicators at once with [TA-LIB](https://technical-analysis-library-in-python.readthedocs.io/en/latest/)

In [299]:
df_ = add_all_ta_features(
    df, open="Open", high="High", low="Low", close="Close", volume="Volume", fillna=True)

  dip[idx] = 100 * (self._dip[idx] / value)
  din[idx] = 100 * (self._din[idx] / value)


In [301]:
df_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2267 entries, 0 to 2266
Data columns (total 92 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       2267 non-null   object 
 1   Open                       2267 non-null   float64
 2   High                       2267 non-null   float64
 3   Low                        2267 non-null   float64
 4   Close                      2267 non-null   float64
 5   Volume                     2267 non-null   float64
 6   volume_adi                 2267 non-null   float64
 7   volume_obv                 2267 non-null   float64
 8   volume_cmf                 2267 non-null   float64
 9   volume_fi                  2267 non-null   float64
 10  volume_em                  2267 non-null   float64
 11  volume_sma_em              2267 non-null   float64
 12  volume_vpt                 2267 non-null   float64
 13  volume_vwap                2267 non-null   float

In [303]:
# Checking out the data:
df_ = df_.set_index('Date')
df_.head()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,volume_em,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-08-07,2.83162,3.53661,2.52112,2.77212,164329.0,-83094.169524,164329.0,-0.505657,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.77212,-99.453675,0.0,0.0
2015-08-08,2.79376,2.79881,0.714725,0.753325,674188.0,-732308.469507,-509859.0,-0.873338,-1361047.0,-393.237393,...,-6.140652,-1.22813,-4.912521,20.12531,4.025062,16.100248,2.019402,-72.82495,-130.287091,-72.82495
2015-08-09,0.706136,0.87981,0.629191,0.701897,532170.0,-955707.370588,-1042029.0,-0.697247,-1170522.0,-47.200547,...,-11.671314,-3.316767,-8.354547,26.813893,8.582828,18.231065,1.499154,-6.826801,-7.071007,-74.680137
2015-08-10,0.713989,0.729854,0.636546,0.708448,405283.0,-736378.155858,-636746.0,-0.414634,-1002925.0,-1.641546,...,-16.4446,-5.942334,-10.502267,27.382146,12.342692,15.039454,1.158356,0.933328,0.928999,-74.443819
2015-08-11,0.708087,1.13141,0.663235,1.06786,1463100.0,329519.921784,826354.0,0.101733,-784527.9,6.851671,...,-19.034512,-8.560769,-10.473743,46.350691,19.144292,27.206399,1.117929,50.732305,41.033526,-61.47858


***
### Correlogram of features



In [377]:
df_.corr().iloc[-1]

Open                 0.997687
High                 0.998906
Low                  0.998896
Close                1.000000
Volume               0.708027
                       ...   
momentum_pvo_hist   -0.011635
momentum_kama        0.992101
others_dr            0.015849
others_dlr           0.016631
others_cr            1.000000
Name: others_cr, Length: 91, dtype: float64

In [378]:
mask = np.triu(np.ones_like(df_.corr(), dtype=bool))

fig = px.imshow(df_.corr().where(np.tril(np.ones(df_.corr().shape)).astype(bool)))
fig.update_layout(
    height = 1500,
    width = 1500,
    margin=dict(l=10, r=10, t=10, b=10), 
    coloraxis_showscale=False)
fig.show()

Top 20 features more associated with the Closing price besides itself.

In [379]:
df_.corr()['Close'].sort_values(ascending=False)[1:20]

others_cr              1.000000
High                   0.998906
Low                    0.998896
Open                   0.997687
trend_ichimoku_conv    0.995153
trend_ema_fast         0.994649
volatility_kcl         0.993759
volatility_kcc         0.993010
trend_ichimoku_a       0.992626
momentum_kama          0.992101
trend_sma_fast         0.992087
volume_vwap            0.990861
volatility_kch         0.990724
volatility_dcm         0.989329
trend_ema_slow         0.988381
trend_ichimoku_base    0.986657
volatility_bbm         0.986503
volatility_dch         0.984285
volatility_bbh         0.983897
Name: Close, dtype: float64

***
### Self Organizing Maps (SOM) for feature selection / dimensionality reduction

In [327]:
# Initializing neurons and Training
data = df_.values
n_neurons = 9
m_neurons = 9
som = MiniSom(n_neurons, m_neurons, data.shape[1], sigma=1.5, learning_rate=.5, neighborhood_function='gaussian', random_seed=0)
 
som.pca_weights_init(data)
som.train(data, 1000, verbose=True)  # random training


Casting complex values to real discards the imaginary part



 [ 1000 / 1000 ] 100% - 0:00:00 left 
 quantization error: 464412290373.0151


In [370]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(data)
# plt.pcolor(frequencies.T, cmap='Blues') 
# plt.colorbar()
# plt.show()
fig=px.imshow(frequencies, color_continuous_scale='Blues', text_auto = True, title ="Self Organizing Map")
fig.update_layout(
    height = 600,
    width = 600,
    margin=dict(l=10, r=10, t=30, b=10), 
    coloraxis_showscale=True)
fig.show()

<Figure size 504x504 with 0 Axes>

***
### Decision Tree Regressor for Feature Selection

Considering we will be trying to classify the direction of the ETH.

In [409]:
# X features
X = df_.drop(['Close','Open','High','Low'], axis=1)
# y target
y = df_['Close']


# Decision tree classifier using 200 trees
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X,y)

RandomForestRegressor(n_estimators=200, random_state=0)

In [414]:
#  Plotting DTR top 20 features:

importances = model.feature_importances_
ft_imp = pd.DataFrame(importances, index=X.columns, columns =['DTR Feature Importance'])

# without others
top20 = ft_imp.sort_values('DTR Feature Importance', ascending = False)[:20]

fig = px.bar_polar(top20, r = top20['DTR Feature Importance'], theta=top20.index, color = top20['DTR Feature Importance'], title ='Top 20 TA indicators - per DTR') 
fig.update_layout(
    height = 700,
    width = 700,
    margin=dict(l=150, r=150, t=30, b=30), 
    coloraxis_showscale=False)
fig.show()


Without others_cr, volume_adi and volatility_dcl:

In [415]:
#  Plotting DTR top 20 features:

importances = model.feature_importances_
ft_imp = pd.DataFrame(importances, index=X.columns, columns =['DTR Feature Importance'])

# without others
top20_ = ft_imp.sort_values('DTR Feature Importance', ascending = False)[3:23]

fig = px.bar_polar(top20_, r = top20_['DTR Feature Importance'], theta=top20_.index, color = top20_['DTR Feature Importance'], title ='Top 20 TA indicators - per DTR (w/ exclusions)') 
fig.update_layout(
    height = 700,
    width = 700,
    margin=dict(l=150, r=150, t=30, b=30), 
    coloraxis_showscale=False)
fig.show()


In [421]:
fig = px.bar(top20[::-1], orientation='h', title = 'DTR Feature Importances')
fig.update_layout(
    showlegend=False,
    xaxis_title='',
    yaxis_title=''
)

In [422]:
fig = px.bar(top20_[::-1], orientation='h', title = 'DTR Feature Importances (w/ exclusions)')
fig.update_layout(
    showlegend=False,
    xaxis_title='',
    yaxis_title=''
)

***
### LSTM Modeling

We'll be using the top 20 features (per DTR feature importance w/o others_cr) obtained above.

As it is a time series, *the order of observations matters*. Therefore we set **shufle=False** in the train-test-split below.

In [452]:
dffinal = pd.concat([df_['Close'], df_[features]], axis =1)

# We scale the data so hasten the LSTM's conversion
scaler = MinMaxScaler() 
scaler.fit(dffinal)
dffinal_scaled = scaler.fit_transform(dffinal)

# Features x Target
features = dffinal_scaled
target = dffinal_scaled [:,0]

x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0, shuffle = False)

In [453]:
# Defining the best Hyperparameters for the multivariate LSTM:

registro = pd.DataFrame(columns=['MSE','MAE','MAPE'])
win_length = 0
batch_size = 0
for win_length in [7,14,28,70]: # days
  for batch_size in [4,8,32,64,256]: # supposedly, these are the best options (empirically)
    num_features = x_train.shape[1]

    # Generator
    train_generator = TimeseriesGenerator(x_train, y_train, length= win_length, sampling_rate = 1, batch_size= batch_size)
    test_generator = TimeseriesGenerator(x_test, y_test, length= win_length, sampling_rate = 1, batch_size= batch_size)

    # Architecture LSTM
    model = Sequential()
    model.add(LSTM(128, input_shape = (win_length, num_features), return_sequences = True))
    model.add(LeakyReLU(alpha=0.5))
    model.add(LSTM(128, return_sequences=True))
    model.add(LeakyReLU(alpha=0.5))
    model.add(Dropout(0.3))
    model.add(LSTM(64, return_sequences=False))
    model.add(Dropout(0.3))
    model.add(Dense(1)) #output layer

    # Defining Early Stopping to avoid overfitting (after 3 attempts)
    early_stopping = EarlyStopping(monitor='val_loss',patience = 3, mode='min')

    # Using exponential decrease:
    def scheduler(epoch, lr):
        return np.clip(lr * tf.math.exp(-0.1), 0.000001, 0.001)


    callbacklr = tf.keras.callbacks.LearningRateScheduler(scheduler)

    # MAE as the Loss Function, althout MSE and MAPE are calculated as well:
    model.compile(loss = 'mae',
                  optimizer = tf.optimizers.Adam(learning_rate=0.001),
                  metrics=['mse','mae','mape'])

    
    # Training the model:
    set_seed(0)
    history = model.fit(train_generator, epochs = 50, validation_data = test_generator,
                                  shuffle=False, callbacks = [early_stopping, callbacklr])
    
    # Metrics for each window and batch configuration:
    registro.loc[f'Batches:{batch_size}, Window:{win_length}']= pd.Series(dict(zip(['MSE','MAE','MAPE'],model.evaluate(test_generator, verbose=0))))

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/5

In [454]:
# Checking best configuration:
registro.index.names = ['Configurations']
registro

Unnamed: 0_level_0,MSE,MAE,MAPE
Configurations,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Batches:4, Window:7",0.160393,0.054571,0.160393
"Batches:8, Window:7",0.078519,0.014338,0.078519
"Batches:32, Window:7",0.078706,0.013577,0.078706
"Batches:64, Window:7",0.099599,0.018735,0.099599
"Batches:256, Window:7",0.096653,0.019846,0.096653
"Batches:4, Window:14",0.116476,0.030836,0.116476
"Batches:8, Window:14",0.354025,0.195593,0.354025
"Batches:32, Window:14",0.075314,0.012487,0.075314
"Batches:64, Window:14",0.075946,0.013825,0.075946
"Batches:256, Window:14",0.152454,0.040889,0.152454


In [455]:
# Top 3
registro.sort_values(['MAE','MSE','MAPE']).head(3)

Unnamed: 0_level_0,MSE,MAE,MAPE
Configurations,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Batches:8, Window:28",0.044783,0.00431,0.044783
"Batches:32, Window:14",0.075314,0.012487,0.075314
"Batches:32, Window:7",0.078706,0.013577,0.078706


As per the brief, the longer the LSTM time window, the higher the model complexity. Therefore I made a judgement call and opted for the third configuration above.

In [456]:
# Using the best hyperparameters:
registro = pd.DataFrame(columns=['MSE','MAE','MAPE'])
win_length = 7
batch_size = 32

num_features = x_train.shape[1]

# Generator
train_generator = TimeseriesGenerator(x_train, y_train, length= win_length, sampling_rate = 1, batch_size= batch_size)
test_generator = TimeseriesGenerator(x_test, y_test, length= win_length, sampling_rate = 1, batch_size= batch_size)

# Architecture LSTM
model = Sequential()
model.add(LSTM(128, input_shape = (win_length, num_features), return_sequences = True))
model.add(LeakyReLU(alpha=0.5))
model.add(LSTM(128, return_sequences=True))
model.add(LeakyReLU(alpha=0.5))
model.add(Dropout(0.3))
model.add(LSTM(64, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(1)) #output layer

# Defining Early Stopping to avoid overfitting (after 3 attempts)
early_stopping = EarlyStopping(monitor='val_loss',patience = 3, mode='min')

# Using exponential decrease:
def scheduler(epoch, lr):
  return np.clip(lr * tf.math.exp(-0.1), 0.000001, 0.001)


callbacklr = tf.keras.callbacks.LearningRateScheduler(scheduler)

# MAE as the Loss Function, althout MSE and MAPE are calculated as well:
model.compile(loss = 'mae',
      optimizer = tf.optimizers.Adam(learning_rate=0.001),
      metrics=['mse','mae','mape'])


# Training the model:
set_seed(0)
history = model.fit(train_generator, epochs = 50, validation_data = test_generator,
                      shuffle=False, callbacks = [early_stopping, callbacklr])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50


#### Metrics evolution per Epoch

In [457]:
metricas = history.history
del metricas ['mape']
del metricas['val_mape']
fig = px.line(metricas, title = 'Loss Function Evolution Per Epoch')
fig.update_layout(xaxis_title = 'Epochs')

#### Predicted x Observed

In [458]:
predictions = model.predict(test_generator)

df_pred = pd.concat([pd.DataFrame(predictions), pd.DataFrame(x_test [:, 1:][win_length:])],axis=1)
rev_trans = scaler.inverse_transform(df_pred)

df_final = df_[predictions.shape[0]*-1:]
df_final ['Close_Pred'] = rev_trans[:,0]

fig = px.line(df_final[['Close','Close_Pred']], title='Predicted x Observed: ETH TS')
fig.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

