## Prepare data for RandomForest ##

### Due to limitations we are going to use 5 quotes of currency pairs ###

Time and date are rather interesting from this method's perspective. There have always been attempts to use time and date in trading systems. In our models, the classification models can automatically reveal hidden dependencies of the quotes on the time of day and the day of the week. The only thing to do here is to convert these two variables into categorical form. The time becomes a category with 24 levels, and the date becomes a categorical variable with five levels to match the number of weekdays.
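
A minimal sketch of that conversion, assuming the frame is indexed by a `DatetimeIndex` as it is later in this notebook. The helper name `add_time_categories` and the column names `hour` and `weekday` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

def add_time_categories(df):
    """Return a copy of df with categorical hour and weekday columns.

    Assumes df.index is a DatetimeIndex. Weekend timestamps fall outside
    the five weekday levels and become NaN.
    """
    out = df.copy()
    # 24 levels: hours 0..23
    out['hour'] = pd.Categorical(out.index.hour, categories=list(range(24)))
    # 5 levels: Monday=0 .. Friday=4
    out['weekday'] = pd.Categorical(out.index.dayofweek, categories=list(range(5)))
    return out

# Tiny illustrative frame (made-up values, not real quotes)
idx = pd.DatetimeIndex(['2016-05-31 09:21:20', '2016-06-03 23:59:00'])
demo = add_time_categories(pd.DataFrame({'EUR/USD': [1.11, 1.12]}, index=idx))
print(demo.dtypes)
```

Declaring the full list of levels up front keeps the categories stable even when a given sample does not contain every hour or weekday.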

Besides the source predictors, we are going to create additional predictors which, in my opinion, reveal the existence of trends in the source quotes. We are going to use well-known indicators to create them.

The following indicators are going to be employed: 5, 10 and 15 EMA; MACD(12,26,9); RSI with periods 14, 21, 28. On top of them we are going to use increments of quotes and moving averages. All of these conversions are to be applied to all six quotes of the currency pairs.

I will use (number of resulting features in parentheses):
    1. (3) 5, 10 and 15 EMA
    2. (3) quote minus EMA (quotes-EMAs)
    3. (1) MACD(12, 26)
    4. (3) RSI with periods 5, 20, 30
    5. (4) quotes sum A+B, B+C, C+D, D+E
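
For reference, the indicators in the list can be written out as below. This is a sketch that mirrors how the code cells in this notebook compute them (pandas exponentially weighted means with a given span and the default adjusted weighting). The symbols $x_t$ (quote at sample $t$), $s$ (span), $\alpha$, $U_t$ and $D_t$ are notation introduced here for illustration.

$$
\alpha = \frac{2}{s + 1}, \qquad
\mathrm{EMA}_s(x)_t = \frac{\sum_{i=0}^{t} (1 - \alpha)^i \, x_{t-i}}{\sum_{i=0}^{t} (1 - \alpha)^i}
$$

$$
\mathrm{MACD}_t = \mathrm{EMA}_{12}(x)_t - \mathrm{EMA}_{26}(x)_t
$$

$$
U_t = \max(x_t - x_{t-1},\, 0), \qquad
D_t = \max(x_{t-1} - x_t,\, 0)
$$

$$
\mathrm{RS}_t = \frac{\mathrm{EMA}_s(U)_t}{\mathrm{EMA}_s(D)_t}, \qquad
\mathrm{RSI}_t = 100 - \frac{100}{1 + \mathrm{RS}_t}
$$

Note that this RSI variant smooths the up and down moves with an exponential mean, so its values may differ from RSI implementations that use other smoothing schemes.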

In [7]:
import numpy as np
from scipy.signal import argrelextrema
from matplotlib import pyplot
import pandas as pd
%matplotlib inline

In [8]:
target_quote = 'EUR/GBP'
target_variable = target_quote

In [9]:
# sample interval is 30 seconds
df = pd.read_pickle('real_time_quotes.pandas')
df = df.set_index(pd.DatetimeIndex(df['time']))
del df['time']  # not needed anymore
print(df.columns)

Index(['EUR/USD', 'AUD/USD', 'GBP/USD', 'EUR/GBP', 'USD/CHF', 'USD/CAD',
       'EUR/CHF'],
      dtype='object')


In [10]:
def MACD(y, a=26, b=12):
    # MACD line: fast EMA (span b) minus slow EMA (span a)
    return y.ewm(span=b).mean() - y.ewm(span=a).mean()

In [11]:
def RSI(y, window=14):
    dy = y.diff()
    dy.iat[0] = dy.iat[1]  # fill the leading NaN produced by diff()
    u = dy.where(dy > 0, 0.0)     # upward moves; 0 where the quote goes down
    d = (-dy).where(dy < 0, 0.0)  # downward moves as positive values; 0 where it goes up
    # relative strength: ratio of the exponential moving averages of the moves
    rs = u.ewm(span=window).mean() / d.ewm(span=window).mean()
    return 100 - (100 / (1 + rs))

In [13]:
def create_indicators(df, target_variable):
    """Add the indicator columns to df in place, for every quote except the target."""
    del df[target_variable]  # TARGET VARIABLE!! must not be used as a predictor
    quotes = df.columns  # snapshot of the source quote columns
    for quote in quotes:
        # exponential moving averages
        df['ema_5 '+quote] = df[quote].ewm(span=5).mean()
        df['ema_10 '+quote] = df[quote].ewm(span=10).mean()
        df['ema_15 '+quote] = df[quote].ewm(span=15).mean()
        # difference between the quote and each EMA
        df['dema_5 '+quote] = df[quote] - df['ema_5 '+quote]
        df['dema_10 '+quote] = df[quote] - df['ema_10 '+quote]
        df['dema_15 '+quote] = df[quote] - df['ema_15 '+quote]
        df['macd '+quote] = MACD(df[quote])
        df['rsi_5 '+quote] = RSI(df[quote], 5)
        df['rsi_20 '+quote] = RSI(df[quote], 20)
        df['rsi_30 '+quote] = RSI(df[quote], 30)
    # sums of adjacent quotes: A+B, B+C, ...
    for i in range(len(quotes)-1):
        df[quotes[i]+'+'+quotes[i+1]] = df[quotes[i]]+df[quotes[i+1]]

In [14]:
create_indicators(df, target_variable)



In [15]:
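# mean normalization: center each column on its mean and scale by its range
# (values fall within [-1, 1], not 0 to 1)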
df = df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
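
A small worked check of what this scaling does, on made-up numbers: subtracting the mean and dividing by the range gives every column a mean of zero and a spread (max minus min) of exactly one, so values can be negative. The helper name `mean_normalize` is hypothetical.

```python
import numpy as np
import pandas as pd

def mean_normalize(frame):
    """Apply the same per-column scaling as the cell above."""
    return frame.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

# Illustrative values only
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 6.0],
                     'b': [10.0, 10.5, 11.0, 12.5]})
scaled = mean_normalize(demo)

print(scaled)
print(scaled.mean())                 # each column is centered on zero
print(scaled.max() - scaled.min())   # each column spans a range of one
```

If values strictly between 0 and 1 are wanted instead, min-max scaling `(x - min) / (max - min)` would be the usual alternative. Tree-based models such as random forests are generally insensitive to this choice.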

In [16]:
df.head(5)

Unnamed: 0,EUR/USD,AUD/USD,GBP/USD,USD/CHF,USD/CAD,EUR/CHF,ema_5 EUR/USD,ema_10 EUR/USD,ema_15 EUR/USD,dema_5 EUR/USD,...,dema_15 EUR/CHF,macd EUR/CHF,rsi_5 EUR/CHF,rsi_20 EUR/CHF,rsi_30 EUR/CHF,EUR/USD+AUD/USD,AUD/USD+GBP/USD,GBP/USD+USD/CHF,USD/CHF+USD/CAD,USD/CAD+EUR/CHF
2016-05-31 09:21:20.981075,0.05772,0.511997,-0.237415,-0.471842,-0.677477,-0.241043,0.06319,0.06791,0.072244,-0.005452,...,-0.054255,-0.113219,,,,0.442329,0.161887,-0.47047,-0.709676,-0.659365
2016-05-31 09:21:36.619171,0.05772,0.517745,-0.266629,-0.458508,-0.675576,-0.241043,0.06319,0.06791,0.072244,-0.005452,...,-0.054255,-0.113219,,,,0.446905,0.137333,-0.499803,-0.706053,-0.657654
2016-05-31 09:21:42.940439,0.088254,0.511997,-0.268876,-0.485175,-0.673675,-0.216147,0.078685,0.081661,0.085643,0.063864,...,0.023898,-0.105219,0.483027,0.612632,0.654319,0.460635,0.130637,-0.513136,-0.707865,-0.650811
2016-05-31 09:21:48.560504,0.084437,0.511997,-0.266629,-0.471842,-0.673675,-0.216147,0.084138,0.086991,0.090947,0.025447,...,0.000291,-0.101777,0.483027,0.612632,0.654319,0.458347,0.132869,-0.505136,-0.706053,-0.650811
2016-05-31 09:21:54.385702,0.072987,0.517745,-0.219438,-0.445175,-0.678428,-0.207848,0.082375,0.086423,0.090687,-0.016843,...,0.017463,-0.096725,0.483027,0.612632,0.654319,0.456059,0.184208,-0.43847,-0.706959,-0.653377


### READ TARGET VARIABLE AND EVALUATE CORRELATION ###

In [17]:
print(target_quote)

EUR/GBP


In [19]:
#tgpd = pd.read_pickle('target_variable_dup.pandas')

In [50]:
#df['target_variable_dup'] = pd.Series(tgpd, index=df.index)

### CORRELATION BETWEEN INDICATORS AND TARGET VARIABLE ###

In [52]:
df.head(1)

Unnamed: 0,EUR/USD,GBP/USD,USD/CHF,USD/CAD,ema_5 EUR/USD,ema_10 EUR/USD,ema_15 EUR/USD,dema_5 EUR/USD,dema_10 EUR/USD,dema_15 EUR/USD,...,dema_10 USD/CAD,dema_15 USD/CAD,macd USD/CAD,rsi_5 USD/CAD,rsi_20 USD/CAD,rsi_30 USD/CAD,EUR/USD+GBP/USD,GBP/USD+USD/CHF,USD/CHF+USD/CAD,target_variable_dup
0,0.124169,-0.058249,0.05158,-0.617411,0.136002,0.142731,0.150453,-0.002463,-0.003767,-0.00508,...,-0.011715,-0.015433,-0.03777,-0.530555,-0.58853,-0.626984,-0.010826,-0.041888,-0.527556,1.0


In [22]:
def calculate_corr_and_remove(df, serie_binary_class, plot=True):
    """
    Calculate the correlation of each column with the binary
    classification series. Columns whose absolute correlation is
    below 5% are removed and won't be used in the random forest.
    """
    df['target_variable_dup'] = serie_binary_class
    # correlation of every predictor with the target (the target itself is excluded)
    p = df.corr().loc['target_variable_dup'].drop('target_variable_dup')
    p = p.abs().sort_values(ascending=False)
    if plot:
        print(p)
    df.drop(p.index[p < 0.05], axis='columns', inplace=True)
    df.drop('target_variable_dup', axis='columns', inplace=True)

### We will use everything above 5% correlation ###

As a rule of thumb, the absolute correlation between an indicator variable and the target should be at least 5%.

Remove those below 5%.
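
The filter can be illustrated on a tiny synthetic frame. This is a sketch under stated assumptions: the helper `drop_weak_predictors` and the columns `strong` and `flat` are hypothetical, and the data is constructed by hand so that one column tracks the target and the other is exactly uncorrelated with it.

```python
import numpy as np
import pandas as pd

def drop_weak_predictors(features, target, threshold=0.05):
    """Return (filtered copy of features, absolute correlations with target)."""
    corr = features.corrwith(target).abs().sort_values(ascending=False)
    weak = corr.index[corr < threshold]
    return features.drop(columns=weak), corr

# Synthetic binary target alternating 0, 1, 0, 1, ...
target = pd.Series(np.tile([0.0, 1.0], 50))
features = pd.DataFrame({
    # follows the target closely, plus a small wiggle
    'strong': target + np.tile([0.0, 0.0, 0.1, 0.1], 25),
    # pattern built to be uncorrelated with the target
    'flat': np.tile([1.0, 1.0, -1.0, -1.0], 25),
})

kept, corr = drop_weak_predictors(features, target)
print(corr)
print(list(kept.columns))
```

Unlike the in-place approach used in this notebook, the sketch returns a new frame, which makes it easier to compare the columns before and after filtering.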

In [55]:
toremove = p.index[p < 0.05]

In [56]:
df.drop(p.index[p < 0.05], axis='columns', inplace=True)
df.tail(2)

Unnamed: 0,GBP/USD,rsi_5 EUR/USD,rsi_20 EUR/USD,ema_5 GBP/USD,ema_10 GBP/USD,ema_15 GBP/USD,macd GBP/USD,rsi_20 GBP/USD,rsi_30 GBP/USD,rsi_5 USD/CHF,...,dema_5 USD/CAD,dema_10 USD/CAD,dema_15 USD/CAD,macd USD/CAD,rsi_5 USD/CAD,rsi_20 USD/CAD,rsi_30 USD/CAD,EUR/USD+GBP/USD,GBP/USD+USD/CHF,target_variable_dup
0,-0.058249,0.493197,0.585257,-0.058964,-0.061132,-0.063162,-0.007538,0.529163,0.569762,-0.526991,...,-0.007795,-0.011715,-0.015433,-0.03777,-0.530555,-0.58853,-0.626984,-0.010826,-0.041888,1.0
1,-0.03452,0.493197,0.585257,-0.044384,-0.047091,-0.048944,-0.004611,0.529163,0.569762,-0.526991,...,-0.019317,-0.020187,-0.022652,-0.038531,-0.530555,-0.58853,-0.626984,0.016952,-0.052758,1.0


## SAVE ##

In [57]:
df.to_pickle('data_for_classification.pandas')