<h1 align="center">Predicting EUR/USD with LSTM Network</h1> 
<h3 align="center">Bradley Droegkamp</h3> 

# Introduction
***
Forex price prediction, much like stock price prediction, is a nearly impossible task given the noise in price time series data.  However, profitable trading strategies can be built from models that provide only a sliver of edge.  In this project, I use a Long Short-Term Memory (LSTM - http://colah.github.io/posts/2015-08-Understanding-LSTMs/) network to predict the price, 5 minutes ahead, of the front-month EUR/USD futures contract (EU) listed on the Chicago Mercantile Exchange (CME - https://www.cmegroup.com/trading/fx/g10/euro-fx.html).  The model focuses on a small subset of trading hours (9:15 - 11:15AM CST).
<br>

# Data
***
The data set consists of 1-minute front-month EU price bars from September 27, 2009 to April 18, 2018, though we will only use a subset.  The data was purchased from kibot (www.kibot.com), a vendor of CME intraday data.  Note that the data covers all trading hours: a 23-hour trading day running from 17:00 CST on the prior day to 16:00 CST, Sunday evening through Friday.

### Contract Details
<table class="cmeSpecTable" summary="Contract Specifications Product Table" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td class="prodSpecAtribute">Contract Unit</td>
<td colspan="5" style="text-align:left">125,000 euro</td>
</tr>
<tr>
<td class="prodSpecAtribute" rowspan="1">Trading Hours</td>
<td colspan="3" style="text-align:left">Sunday - Friday 6:00 p.m. - 5:00 p.m. (5:00 p.m. - 4:00 p.m. Chicago Time/CT) with a 60-minute break each day beginning at 5:00 p.m. (4:00 p.m. CT)</td>
</tr>
<tr>
<td class="prodSpecAtribute" rowspan="1">Minimum Price Fluctuation*</td>
<td colspan="3" style="text-align:left">Outrights: .00005 USD per EUR increments ($6.25 USD).<br />Consecutive Month Spreads: (Globex only)&nbsp;&nbsp;0.00001 USD per EUR (1.25 USD)<br />All other Spread Combinations: &nbsp;0.00005 USD per EUR (6.25 USD)</td>
</tr>
<tr>
<td class="prodSpecAtribute">Product Code</td>
<td colspan="5" style="text-align:left">CME Globex: 6E<br />CME ClearPort: EC<br />Clearing: EC</td>
</tr>
<tr>
<td class="prodSpecAtribute" rowspan="1">Listed Contracts</td>
<td colspan="5" style="text-align:left">Contracts listed for the first 3 consecutive months and 20 months in the March quarterly cycle (Mar, Jun, Sep, Dec)</td>
</tr>
<tr>
<td class="prodSpecAtribute">Settlement Method</td>
<td colspan="5" style="text-align:left">Deliverable</td>
</tr>
<tr>
<td class="prodSpecAtribute" rowspan="1">Termination Of Trading</td>
<td colspan="5" style="text-align:left">9:16 a.m. Central Time (CT) on the second business day immediately preceding the third Wednesday of the contract month (usually Monday).</td>
</tr>
<tr>
<td class="prodSpecAtribute">Settlement Procedures</td>
<td colspan="5" style="text-align:left">Physical Delivery<br /><a href="http://www.cmegroup.com/confluence/display/EPICSANDBOX/Euro" target="_blank">EUR/USD Futures Settlement Procedures&nbsp;</a></td>
</tr>
</tbody>
</table>

*Source:  https://www.cmegroup.com/trading/fx/g10/euro-fx_contract_specifications.html*

*\*Min Price Fluctuation changed from 0.0001 to 0.00005 on January 11, 2016 (https://www.cmegroup.com/trading/fx/half-tick.html)*
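The dollar values in the table follow directly from the contract unit and the tick size. The short sketch below reproduces that arithmetic; the constant and function names are illustrative and not part of the original notebook.

```python
from decimal import Decimal

# Values taken from the contract specification table above.
CONTRACT_UNIT_EUR = Decimal("125000")   # euros per contract
TICK_SIZE = Decimal("0.00005")          # USD per EUR, current outright tick
OLD_TICK_SIZE = Decimal("0.0001")       # USD per EUR, before January 11, 2016


def tick_value(tick_size, contract_unit=CONTRACT_UNIT_EUR):
    """Dollar value of a one-tick move for one contract."""
    return tick_size * contract_unit


print("current tick value (USD):", tick_value(TICK_SIZE))
print("pre-2016 tick value (USD):", tick_value(OLD_TICK_SIZE))
```

`Decimal` is used here only to keep the arithmetic exact; with plain floats the same product can pick up rounding noise.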

<br>

#### I.  Bring in the raw data.

In [2]:
# The code was removed by Watson Studio for sharing.

+----------+-----+------+------+------+------+------+
|      Date| Time|  Open|  High|   Low| Close|Volume|
+----------+-----+------+------+------+------+------+
|09/27/2009|18:00|  1.47|1.4701| 1.469|1.4691|   441|
|09/27/2009|18:01|1.4691|1.4691|1.4689| 1.469|    29|
|09/27/2009|18:02| 1.469| 1.469|1.4688|1.4688|    22|
|09/27/2009|18:03|1.4687|1.4691|1.4687|1.4691|    38|
|09/27/2009|18:04|1.4692|1.4693|1.4692|1.4692|    20|
|09/27/2009|18:05|1.4692|1.4693| 1.469|1.4691|    11|
|09/27/2009|18:06|1.4691|1.4692|1.4689|1.4692|    14|
|09/27/2009|18:07|1.4691|1.4691| 1.469| 1.469|     6|
|09/27/2009|18:08| 1.469|1.4691| 1.469|1.4691|     5|
|09/27/2009|18:09| 1.469|1.4692| 1.469|1.4692|     7|
|09/27/2009|18:10|1.4692|1.4692|1.4684|1.4685|    81|
|09/27/2009|18:11|1.4686|1.4687|1.4683|1.4686|    63|
|09/27/2009|18:12|1.4687|1.4688|1.4686|1.4687|     7|
|09/27/2009|18:13|1.4687|1.4692|1.4687|1.4691|    25|
|09/27/2009|18:14| 1.469|1.4691|1.4684|1.4688|    37|
|09/27/2009|18:15|1.4686|1.4

#### II.  Combine the Date and Time columns and convert the timestamps from EST to CST.

In [3]:
from pyspark.sql.functions import unix_timestamp, from_unixtime, concat, col, lit, hour, minute, year, lag
from pyspark.sql.types import TimestampType
from pyspark.sql.window import Window

# Convert Date and Time columns to Timestamps and combine
df_raw_2 = df_raw.select(unix_timestamp(concat(col('Date'), lit(' '), col('Time')), 'MM/dd/yyyy HH:mm')\
                   .cast(TimestampType()).alias('Timestamp'),
                   'Open', 'High', 'Low', 'Close', 'Volume')

# now subtract one hour from the EST timestamps to get CST
df = df_raw_2.select(from_unixtime(unix_timestamp(col('Timestamp')) - 60 * 60).alias('Timestamp'),
                    'Open', 'High', 'Low', 'Close', 'Volume')

df.createOrReplaceTempView('df')
df_2016 = spark.sql("SELECT * FROM df WHERE Timestamp BETWEEN '2016-01-01' AND '2016-12-31' ORDER BY Timestamp")

df_2016.show()

# pandas df for exploring at next step
pdf_plt = df_2016.toPandas()

+-------------------+------+------+------+------+------+
|          Timestamp|  Open|  High|   Low| Close|Volume|
+-------------------+------+------+------+------+------+
|2016-01-03 17:00:00|1.0884|1.0886|1.0882|1.0883|   215|
|2016-01-03 17:01:00|1.0884|1.0884|1.0881|1.0882|    48|
|2016-01-03 17:02:00|1.0883|1.0884|1.0882|1.0883|    37|
|2016-01-03 17:03:00|1.0884|1.0884|1.0879|1.0879|    51|
|2016-01-03 17:04:00|1.0878|1.0878|1.0873|1.0874|   133|
|2016-01-03 17:05:00|1.0874|1.0875|1.0874|1.0875|    47|
|2016-01-03 17:06:00|1.0875|1.0877|1.0874|1.0876|    24|
|2016-01-03 17:07:00|1.0876|1.0876|1.0875|1.0876|    10|
|2016-01-03 17:08:00|1.0876|1.0876|1.0876|1.0876|     3|
|2016-01-03 17:09:00|1.0875|1.0875|1.0874|1.0874|    40|
|2016-01-03 17:10:00|1.0875|1.0875|1.0873|1.0875|   146|
|2016-01-03 17:11:00|1.0875|1.0877|1.0874|1.0876|   195|
|2016-01-03 17:12:00|1.0877|1.0877|1.0877|1.0877|     6|
|2016-01-03 17:13:00|1.0876|1.0876|1.0876|1.0876|    10|
|2016-01-03 17:14:00|1.0877|1.0
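For readers working outside Spark, the same Date/Time merge and Eastern-to-Central shift can be sketched in pandas. This is a minimal illustration on two rows copied from the raw data, assuming the vendor timestamps are US Eastern wall-clock time; the zone names and the variable names are my own choices, not part of the original notebook.

```python
import pandas as pd

# Two sample rows copied from the raw data shown earlier (Eastern time).
raw = pd.DataFrame({
    "Date": ["09/27/2009", "09/27/2009"],
    "Time": ["18:00", "18:01"],
    "Close": [1.4691, 1.4690],
})

# Combine Date and Time into a single naive timestamp.
naive = pd.to_datetime(raw["Date"] + " " + raw["Time"], format="%m/%d/%Y %H:%M")

# Attach the Eastern zone, then convert to Central.
central = naive.dt.tz_localize("America/New_York").dt.tz_convert("America/Chicago")

# Drop the zone info but keep the Central wall-clock time.
raw["Timestamp"] = central.dt.tz_localize(None)
print(raw[["Timestamp", "Close"]])
```

Because Eastern and Central time switch to daylight saving on the same dates, this gives the same result as subtracting a fixed hour, which is what the Spark cell does.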

#### III.  Explore Data

As expected, prices are not stationary.  We will model price changes (returns) rather than price levels, since returns are much closer to stationary and are therefore better suited as model inputs.

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

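# Timestamp arrives from Spark as a string column; convert it so the filter and plots use real datetimes
pdf_plt['Timestamp'] = pd.to_datetime(pdf_plt['Timestamp'])
pdf_day = pdf_plt[(pdf_plt['Timestamp'] >= '2016-01-05') & (pdf_plt['Timestamp'] < '2016-01-06')]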

fig, ax = plt.subplots(1, 2, figsize=(18, 6))
pdf_day.plot(x="Timestamp", y="Close", ax=ax[0], legend=None)
ax[0].set_xlabel("Time")
ax[0].get_figure().autofmt_xdate()
#xfmt = mdates.DateFormatter('%d-%m-%y %H:%M:%S')
#ax[0].xaxis.set_major_formatter(xfmt)
ax[0].set_ylabel("Price")
ax[0].set_title("EUR/USD Price Movement on Jan 5, 2016", fontsize=16)
#ax2_sub1 = ax[0].twinx()
#ax2_sub1.bar(pdf_day.index, pdf_day['Volume'], color='g', alpha=0.5)

pdf_plt.plot(x="Timestamp", y="Close", ax=ax[1], legend=None)
ax[1].set_xlabel("Time")
ax[1].set_ylabel("Price")
ax[1].set_title("EUR/USD Price Movement in 2016", fontsize=16)
ax[1].get_figure().autofmt_xdate()
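To make the stationarity point concrete, the sketch below compares price levels with 1-minute price differences on a synthetic random walk. The data is simulated purely for illustration and is not the EUR/USD series; all names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic 1-minute closes: a random walk starting near 1.10 (illustrative only).
steps = rng.normal(loc=0.0, scale=1e-4, size=2000)
close = pd.Series(1.10 + np.cumsum(steps))

# 1-minute return as a price difference: Close minus the previous Close.
ret = close.diff().dropna()

# Compare the mean of each half: levels drift, differences stay near zero.
half = len(close) // 2
print("price mean, 1st vs 2nd half: ",
      close.iloc[:half].mean(), close.iloc[half:].mean())
print("return mean, 1st vs 2nd half:",
      ret.iloc[:half].mean(), ret.iloc[half:].mean())
```

The level of a random walk wanders, so its mean depends on which stretch you look at, while the differences keep roughly the same distribution throughout.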

Volumes are more concentrated during US daytime hours.

In [5]:
# TODO:  

# 1. INSERT BAR CHART HERE WITH AVG VOLUMES BY HOUR
#problem below is df has timestamp as object.  need this to be datetime64 (prefer without slicing warning)
#times = pd.to_datetime(df.timestamp_col)
#df.groupby([times.hour, times.minute]).value_col.sum()

# 2. FORMAT X-axis ABOVE (and add volume overlay?).  Prettify with candlesticks???
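One possible way to fill in the first TODO is sketched below: parse the string timestamps into a separate Series and group the volume by hour, which avoids assigning back into a sliced frame. The helper name `avg_volume_by_hour` and the sample rows are hypothetical; the column names match the data above.

```python
import pandas as pd


def avg_volume_by_hour(pdf):
    """Average Volume per hour of day for a frame with Timestamp and Volume columns."""
    # Parse into a separate Series so the input frame is left untouched.
    ts = pd.to_datetime(pdf["Timestamp"])
    return pdf["Volume"].groupby(ts.dt.hour).mean()


# Tiny made-up sample in the same layout as pdf_plt.
sample = pd.DataFrame({
    "Timestamp": ["2016-01-05 09:00:00", "2016-01-05 09:01:00",
                  "2016-01-05 17:00:00", "2016-01-06 17:00:00"],
    "Volume": [400, 600, 20, 40],
})

hourly = avg_volume_by_hour(sample)
print(hourly)
```

Calling `hourly.plot.bar()` on the result would draw the bar chart the TODO asks for.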



#### IV.  Set up data for next steps

##### Data will be filtered to 9:15 - 11:15AM CST
 - Common economic releases between 7:15 - 9:00AM CST may bias the data if included, so the window starts after them.
 - The training set will be 2015 and the test set 2016; 2017 is held out for additional validation, explained later.
 - The 9:15 - 11:15AM window is generally a period of higher volume and volatility.
 - This choice can be revisited to build models for other specific time slots.

<br>

At this point, we will define:
 - **Time window** used in the LSTM = **10** minutes
 - **Batch size** = **64**
 - **Forecast window** = **5** minutes
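To show how the time window and forecast window fit together, here is a minimal NumPy sketch of the sample layout: each input is the 10 rows before minute `i`, and the target is the price change from minute `i` to minute `i + 5`. The function `make_windows` and the toy price series are illustrative assumptions, not the notebook's own code.

```python
import numpy as np


def make_windows(returns, closes, time_window=10, forecast_window=5):
    """Pair each block of `time_window` past returns with the future price change."""
    X, y = [], []
    for i in range(time_window, len(closes) - forecast_window):
        X.append(returns[i - time_window:i])           # rows i-10 .. i-1
        y.append(closes[i + forecast_window] - closes[i])  # change over the next 5 rows
    return np.array(X), np.array(y)


# Toy series: the price rises by exactly 1.0 every minute.
closes = np.arange(30, dtype=float)
returns = np.diff(closes, prepend=closes[0])

X, y = make_windows(returns, closes)
print(X.shape, y.shape)
```

With a price that rises one unit per minute, every target equals 5.0, which is an easy way to check that the offsets line up.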

In [106]:
# define important data parameters
forecast_window = 5
time_window = 10
batch_size = 64

# add price returns column
df_train = df.withColumn('tmp_lag_price', lag(df.Close).over(Window.orderBy('Timestamp')))
df_train = df_train.withColumn('price_1min_return', df_train.Close - df_train.tmp_lag_price).na.drop()
df_train = df_train.drop('tmp_lag_price')

# add 5 minute return column
# pass the lag offset positionally, since the keyword name differs between Spark versions
df_train = df_train.withColumn('tmp_long_lag_price', lag(df_train.Close, forecast_window).over(Window.orderBy('Timestamp')))
df_train = df_train.withColumn('price_5min_return', df_train.Close - df_train.tmp_long_lag_price).na.drop()
df_train = df_train.drop('tmp_long_lag_price')

# set train data to year 2015 (test data to 2016, and set aside 2017 data for some later validation)
# include only data between 9:05 and 11:20AM CST.
df_train = df_train.filter(year('Timestamp') == lit(2015))\
                   .filter((hour('Timestamp') >= lit(9)) & (hour('Timestamp') <= lit(11)))\
                   .filter((hour('Timestamp') != lit(9)) | (minute('Timestamp') >= lit(5)))\
                   .filter((hour('Timestamp') != lit(11)) | (minute('Timestamp') <= lit(20)))

# add a flag column used later to keep only rows inside the 9:15 - 11:15 window
df_train = df_train.withColumn('is_in_dataset', ((hour('Timestamp') == lit(9)) & (minute('Timestamp') >= lit(15))) | \
                                                ((hour('Timestamp') == lit(11)) & (minute('Timestamp') <= lit(15))) | \
                                                ((hour('Timestamp') == lit(10))))

df_train.show()

+-------------------+------+------+------+------+------+--------------------+--------------------+-------------+
|          Timestamp|  Open|  High|   Low| Close|Volume|   price_1min_return|   price_5min_return|is_in_dataset|
+-------------------+------+------+------+------+------+--------------------+--------------------+-------------+
|2015-01-02 09:05:00|1.2024|1.2024|1.2022|1.2024|   515|                 0.0|-0.00100000000000...|        false|
|2015-01-02 09:06:00|1.2023|1.2025|1.2022|1.2024|   495|                 0.0|-0.00110000000000...|        false|
|2015-01-02 09:07:00|1.2024|1.2029|1.2022|1.2029|   355| 5.00000000000167E-4|-5.99999999999933...|        false|
|2015-01-02 09:08:00|1.2028| 1.203|1.2028| 1.203|   425|9.999999999998899E-5|                 0.0|        false|
|2015-01-02 09:09:00| 1.203|1.2037| 1.203|1.2035|   587|4.999999999999449E-4|0.001100000000000101|        false|
|2015-01-02 09:10:00|1.2036|1.2038|1.2036|1.2036|   246|9.999999999998899E-5|0.001200000000000..

##### Set up data using 1-minute price returns and Volume as features

In [111]:
from sklearn.preprocessing import MinMaxScaler

pdf_train = df_train.toPandas()

# only need price returns and volumes, but keep dataset check vals for below
train_set = pdf_train.iloc[:, 5:8].values
is_in_dataset_check = pdf_train.iloc[:, -1]

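# feature scaling
sc = MinMaxScaler(feature_range = (0, 1))
train_set_scaled = sc.fit_transform(np.float64(train_set))
# NOTE: the next line overwrites the scaled array with the raw values, so everything
# below (including the printed output) uses unscaled data; remove it to apply the scaling
train_set_scaled = train_set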

# filter data into needed arrays
x_price_train = []
x_volume_train = []
y_train = []

length = len(train_set_scaled)
for i in range(0, length):
    x_volume_train.append(train_set_scaled[max(0, i - time_window):i, 0])
    x_price_train.append(train_set_scaled[max(0, i - time_window):i, 1])


for i in range(0, len(train_set_scaled)):
    y_train.append(train_set_scaled[min(length, i+forecast_window):min(length, i+forecast_window)+1, 2])

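# now that we have the time_window data, remove unwanted entries based on prior is_in_dataset_check
# (windows at the start and end of the data are shorter than the rest, so build object arrays
# explicitly; recent NumPy versions refuse to create ragged arrays implicitly)
x_volume_train = np.array(x_volume_train, dtype=object)
x_price_train = np.array(x_price_train, dtype=object)
y_train = np.array(y_train, dtype=object)
is_in_dataset_check = np.array(is_in_dataset_check, dtype=bool)
x_volume_train = x_volume_train[is_in_dataset_check]
x_price_train = x_price_train[is_in_dataset_check]
y_train = y_train[is_in_dataset_check]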

# reduce size of dataset to be divisible by batch size
x_volume_train = x_volume_train[0:len(x_volume_train) - len(x_volume_train)%batch_size]
x_price_train = x_price_train[0:len(x_price_train) - len(x_price_train)%batch_size]
y_train = y_train[0:len(y_train) - len(y_train)%batch_size]

# combine and reshape for modeling
x_volume_train = np.reshape(np.array(x_volume_train.tolist()), (x_volume_train.shape[0], time_window))
x_price_train = np.reshape(np.array(x_price_train.tolist()), (x_price_train.shape[0], time_window))
X_train = np.dstack((x_price_train, x_volume_train))
y_train = np.reshape(np.array(y_train.tolist()), (y_train.shape[0], 1))
print("Feature set shape (standardized price & volume w/10min window): ")
print(X_train.shape)
print(X_train[0])
print('\n')
print("y var shape (standardized 5min future price return): ")
print(y_train.shape)
print(y_train[0])

# TODO define test, sim sets
# vars for MLR

Feature set shape (standardized price & volume w/10min window): 
(31232, 10, 2)
[[ 0.00e+00  5.15e+02]
 [ 0.00e+00  4.95e+02]
 [ 5.00e-04  3.55e+02]
 [ 1.00e-04  4.25e+02]
 [ 5.00e-04  5.87e+02]
 [ 1.00e-04  2.46e+02]
 [ 1.00e-04  8.03e+02]
 [ 2.00e-04  2.35e+02]
 [-7.00e-04  5.47e+02]
 [-3.00e-04  3.19e+02]]


y var shape (standardized 5min future price return): 
(31232, 1)
[-0.0003]
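The values printed above are on their raw scales (volumes in the hundreds, returns around 1e-4). For reference, the sketch below shows what min-max scaling to the range (0, 1) computes, written in plain NumPy rather than with `MinMaxScaler`. The helper `min_max_scale` is a hypothetical name, and the sample rows are the first five rows of the printed window.

```python
import numpy as np


def min_max_scale(columns):
    """Scale each column to [0, 1] as (x - min) / (max - min)."""
    columns = np.asarray(columns, dtype=np.float64)
    col_min = columns.min(axis=0)
    col_range = columns.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return (columns - col_min) / col_range, col_min, col_range


# Volume and 1-minute return, taken from the first rows of the printed window.
sample = np.array([[515, 0.0], [495, 0.0], [355, 5e-4], [425, 1e-4], [587, 5e-4]])

scaled, col_min, col_range = min_max_scale(sample)
print(scaled)

# The same statistics undo the scaling, e.g. to read predictions in price units.
restored = scaled * col_range + col_min
```

In practice the minimum and range would be computed on the training set only and then reused for the test and validation sets.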


# Methodology
***
### I.  Fit the LSTM network model

In [112]:
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Sequential, load_model
import h5py

lstm = Sequential()
lstm.add(LSTM(40, batch_input_shape=(batch_size,time_window,2), return_sequences=True, recurrent_dropout = 0.1))
lstm.add(LSTM(30, recurrent_dropout = 0.2))
lstm.add(Dropout(0.2))
lstm.add(Dense(20, activation='relu'))
lstm.add(Dropout(0.2))
lstm.add(Dense(5, activation='relu'))
lstm.add(Dense(1))
lstm.compile(loss= 'mae', optimizer= 'adam')
lstm.fit(X_train, y_train, epochs=10, batch_size=batch_size, shuffle=True)
# TODO: build a test set and pass it to fit() via validation_data

lstm.save(filepath="lstm.h5")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2addcfed7940>

In [123]:
# can skip the cell above and load the last saved model from here to save time
lstm = load_model('lstm.h5')
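
The first LSTM layer is declared with `batch_input_shape=(batch_size, time_window, 2)`, which fixes the batch dimension. That is presumably why the data cell trims the sample count to a multiple of `batch_size` (31232 = 488 x 64), and any array later passed to the loaded model would likely need the same treatment. A minimal sketch of that trimming, with a hypothetical helper name and dummy data:

```python
import numpy as np


def trim_to_batch_multiple(arr, batch_size):
    """Drop trailing samples so the length is an exact multiple of batch_size."""
    usable = len(arr) - len(arr) % batch_size
    return arr[:usable]


# Dummy feature array shaped like X_train: (samples, time_window, features).
X_demo = np.zeros((1000, 10, 2))
X_trimmed = trim_to_batch_multiple(X_demo, 64)
print(X_trimmed.shape)
```

Trimming from the end keeps the earliest samples, which matches what the slicing in the data cell does.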