# DATA PREPROCESSING

This notebook preprocesses the time series: it computes the exponential moving average, the technical indicators, and the class labels used to train the models.

**Importing libraries**


In [1]:
import numpy as np
from talib.abstract import *
import pickle
import pandas as pd

In [2]:
minutes = 5 # forecast horizon in minutes
test_size = 0.2 # 20% of the dataframe for testing

# load dataframe
with open("EURUSD", "rb") as f:
    inputs = pickle.load(f) # inputs is a dict with the keys open, high, low, close, volume
dataframe = pd.DataFrame.from_dict(inputs)

# Technical indicators
## Bollinger Bands
Bollinger Bands are closely tied to volatility, which makes it possible to compare volatility with price levels over a given period. Their main purpose is to give a relative definition of high and low: by definition, prices are high at the upper band and low at the lower band.
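As a rough illustration of how the three bands relate to each other, here is a minimal sketch that builds them with pandas rolling windows. The helper name `bollinger_bands`, the synthetic price series, and the use of the population standard deviation (`ddof=0`) are assumptions made for this example; TA-Lib's `BBANDS` may differ in detail.

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, period: int = 20, n_dev: float = 2.5):
    """Return (upper, middle, lower) bands; the first period-1 rows are NaN."""
    middle = close.rolling(period).mean()        # simple moving average
    # ddof=0 (population std) is an assumption made for this sketch
    std = close.rolling(period).std(ddof=0)
    return middle + n_dev * std, middle, middle - n_dev * std

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
close = pd.Series(1.10 + rng.normal(0, 0.001, 200).cumsum())
bb_up, bb_mid, bb_low = bollinger_bands(close)
```

The bands widen when the rolling standard deviation grows, which is why the indicator is read as a volatility measure.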
## EMA
The exponential moving average (EMA) is a technical chart indicator that tracks the price of an investment (like a stock or commodity) over time. The EMA is a type of weighted moving average (WMA) that gives more weighting or importance to recent price data. Like the simple moving average, the exponential moving average is used to see price trends over time, and watching several EMAs at the same time is easy to do with moving average ribbons.
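The weighting of recent prices can be sketched with the usual recursion, where the smoothing factor is `alpha = 2 / (period + 1)`. The function name `ema_loop`, the toy prices, and seeding with the first observation are assumptions for this example; TA-Lib may seed its EMA differently, so its first values need not match.

```python
import numpy as np
import pandas as pd

def ema_loop(values, period):
    """EMA via the recursion ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}."""
    alpha = 2.0 / (period + 1)
    out = [values[0]]  # seed with the first observation (assumption)
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return np.array(out)

prices = np.array([1.10, 1.12, 1.11, 1.15, 1.14, 1.18])  # toy data
manual = ema_loop(prices, period=3)
# pandas computes the same recursion when adjust=False
vectorized = pd.Series(prices).ewm(span=3, adjust=False).mean().to_numpy()
```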
## RSI
The Relative Strength Index (RSI) is a well-known momentum oscillator that measures the speed (velocity) and the change (magnitude) of directional price movements. Plotted on a chart, the RSI gives a visual way to monitor both the current and the historical strength or weakness of a given market. The strength or weakness is based on closing prices over a specified trading period, creating a reliable metric for price and momentum changes.
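A minimal sketch of the calculation, assuming Wilder-style smoothing expressed as an exponential average with `alpha = 1 / period`. The function name `rsi` and the toy series are illustrative, and the early values will generally differ from TA-Lib's output because the seeding is not the same.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + RS), with RS = average gain / average loss."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # upward moves, 0 otherwise
    loss = (-delta).clip(lower=0)     # downward moves as positive numbers
    # Wilder smoothing written as an EMA with alpha = 1/period (assumption)
    avg_gain = gain.ewm(alpha=1 / period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rising = rsi(pd.Series(np.linspace(1.0, 2.0, 50)))   # only gains
falling = rsi(pd.Series(np.linspace(2.0, 1.0, 50)))  # only losses
```

A series that only rises pushes the RSI to the top of its 0 to 100 range, and a series that only falls pushes it to the bottom.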
## CCI
The Commodity Channel Index (CCI) measures the current price level relative to its average level over a given time window, so the further the price is from the average, the larger the indicator's absolute value. Readings are positive when the price is above the average and negative when it is below. Because it tells the analyst how far the price is from an equilibrium value, the CCI is a natural candidate for detecting overbought and oversold conditions, as will be seen later.
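The relation to the average can be sketched with the commonly quoted formula: the typical price minus its moving average, divided by 0.015 times the mean absolute deviation. The function name `cci` and the synthetic high, low and close series are assumptions for this example.

```python
import numpy as np
import pandas as pd

def cci(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """CCI = (TP - SMA(TP)) / (0.015 * mean absolute deviation of TP)."""
    tp = (high + low + close) / 3.0                  # typical price
    sma = tp.rolling(period).mean()
    mean_dev = tp.rolling(period).apply(
        lambda w: np.abs(w - w.mean()).mean(), raw=True
    )
    return (tp - sma) / (0.015 * mean_dev)

# Synthetic prices, for illustration only
rng = np.random.default_rng(1)
close = pd.Series(1.10 + rng.normal(0, 0.001, 100).cumsum())
high, low = close + 0.0005, close - 0.0005
values = cci(high, low, close)
```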

In [3]:
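# Bollinger Bands from the dict of price arrays: upper, middle and lower band
dataframe['BB_UP'], dataframe['BB_MID'], dataframe['BB_LOW'] = BBANDS(inputs, timeperiod=20, nbdevup=2.5, nbdevdn=2.5, matype=0)
# 100-period exponential moving average
dataframe['EMA'] = EMA(dataframe, timeperiod=100)
# 14-period Relative Strength Index of the close price
dataframe['RSI'] = RSI(dataframe['close'], timeperiod=14)
# 14-period Commodity Channel Index
dataframe['CCI'] = CCI(dataframe['high'], dataframe['low'], dataframe['close'], timeperiod=14)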

The indicators are undefined for their initial warm-up rows, so the dataframe contains NaN values that must be filled.

In [4]:
# fillna(method=...) is deprecated in recent pandas versions; ffill()/bfill() are the equivalents
dataframe.ffill(inplace=True) # ffill: propagate the last valid observation forward
dataframe.bfill(inplace=True) # bfill: fill any remaining leading gaps with the next valid observation
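The effect of filling forward and then backward can be seen on a toy series (the values are made up for illustration). The forward fill handles gaps after the first valid value, and the backward fill handles the leading NaN rows that a forward fill cannot reach.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan])
forward = s.ffill()          # leading NaNs remain, later gaps are filled
filled = forward.bfill()     # leading NaNs take the first valid value
print(filled.tolist())
```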

Generating the classes

In [5]:
# Future close price: shift the series back by `minutes` rows
# (this corresponds to `minutes` minutes only if each row is a 1-minute bar)
future_close = dataframe['close'].shift(-minutes)
# The last `minutes` rows have no future value; fill them with the last known one
future_close = future_close.ffill().bfill()
# 1 = price increase, 0 = price decrease or no change
classes = (future_close > dataframe['close']).astype(int).tolist()
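The labelling step can be illustrated on a toy series. The prices and the two-row `horizon` below are hypothetical values chosen so that the result is easy to check by hand.

```python
import pandas as pd

horizon = 2  # hypothetical forecast horizon, in rows
close = pd.Series([1.0, 1.2, 1.1, 0.9, 1.3, 1.0])

# Row i now holds the close price observed `horizon` rows later
future_close = close.shift(-horizon).ffill()
# 1 if the future price is higher than the current one, else 0
labels = (future_close > close).astype(int)
print(labels.tolist())
```

The last `horizon` rows have no real future price, so their labels come from the filled value and carry no information.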

# Division, Training and Testing 

**Importing libraries**

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

## Random Division
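The data is shuffled before splitting because, in these experiments, shuffled splits produced models with higher accuracy than a chronological split. Neighbouring rows of a time series are highly correlated, however, so a shuffled split can leak information from the test set into training and make the score optimistic.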
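For comparison, a chronological split can be sketched with the `shuffle=False` option of `train_test_split`, which keeps the first rows for training and the last rows for testing. The synthetic frame and the variable names below are assumptions for this example and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, time-ordered data for illustration only
X = pd.DataFrame({"close": np.linspace(1.0, 2.0, 100)})
y = (X["close"].shift(-5).ffill() > X["close"]).astype(int)

# shuffle=False preserves the row order: train on the past, test on the future
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
```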

## Data Distribution
By default, `QuantileTransformer` maps each feature to a "uniform" distribution. This can be changed to "normal", which maps the features to a Gaussian distribution:
`QuantileTransformer(output_distribution='normal')`
This model achieved better accuracy with the default distribution.
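The difference between the two output distributions can be sketched on synthetic, skewed data. The exponential sample and the `n_quantiles=100` setting are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 1))  # skewed synthetic feature

# Default behaviour: values are mapped into the [0, 1] range
uniform = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform"
).fit_transform(X)

# Alternative: values are mapped to an approximately standard normal shape
normal = QuantileTransformer(
    n_quantiles=100, output_distribution="normal"
).fit_transform(X)
```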


In [7]:
# Random (shuffled) split
X_train, X_test, y_train, y_test = train_test_split(dataframe, classes, test_size=test_size, random_state=101)
# Quantile transformation, fitted on the training set only
scaler = QuantileTransformer()
scaler.fit(X_train)
X_train = pd.DataFrame(data=scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(data=scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# Random Forest

In [8]:
from sklearn.ensemble import RandomForestClassifier

**Attention**: this example runs in parallel and is configured to use all the cores of your processor.
This behavior can be changed with the `n_jobs` parameter.
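As a small sketch of limiting the parallelism, the forest below is restricted to two workers. The synthetic dataset, the reduced number of trees, and the name `rf_limited` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# n_jobs=2 uses at most two workers; n_jobs=-1 would use all available cores
rf_limited = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0)
rf_limited.fit(X, y)
```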

In [9]:
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1)

Prediction and score

In [16]:
predictions = rf.predict(X_test)
score = rf.score(X_test, y_test)
print('Score: ', round(score,2))
# save the trained Random Forest with pickle
with open("RF_EURUSD", "wb") as f:
    pickle.dump(rf, f)

Score:  0.79
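A pickled model can later be restored and used for prediction without retraining. The sketch below shows the round trip in memory on a small synthetic model; the data and variable names are assumptions for this example. With a file, `pickle.load` on a handle opened in `"rb"` mode plays the same role as `pickle.loads` here.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialize to bytes and restore; the restored model predicts identically
restored = pickle.loads(pickle.dumps(model))
same = (restored.predict(X) == model.predict(X)).all()
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.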


# Cross-validation

In [14]:
from sklearn.model_selection import cross_val_score

In [15]:
scores = cross_val_score(rf, X_train, y_train, cv=10)
print("Accuracy rf: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy rf: 0.78 (+/- 0.01)
