<img src="https://github.com/WeiTaKuan/zenbot_tading_algorithm/blob/main/ZenBotLogo.png" width="100px" align="left"/>

# Zenbot Machine Learning Training Process

- Author: Max Kuan
- Last Update: 2022.05.02
- Language: Python 3.9.12
- Data Source: Backtested trading results from an [FXTM](http://www.forextime.com/zh-tw/register/open-account?raf=fa060479) MetaTrader 5 account

### Description

Zenbot is an automatic reversal trading algorithm. It was first released on **2021.10.31** and has since been upgraded to a new version. I feel a great sense of achievement about this upgrade. The new Zenbot uses machine learning to choose the position type (long or short) when entering a trade. This notebook walks **step by step** through how I build the model and **set up customised metrics** to evaluate its performance.

In [36]:
import pandas as pd 
pd.options.mode.chained_assignment = None
import numpy as np
from collections import Counter

In [37]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [38]:
data = pd.read_csv('trading_result.csv')

In [39]:
data.head(10)

| | time | tick_volume | spread | moving_average | bias_ma | volume_ma | vol_change | K | D | type | profit | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009/1/2 | 678 | 20 | 1.402423 | -0.007718 | 783.0 | 0.994286 | 59.0 | 29.52694 | short | 149 | 1 |
| 1 | 2009/1/2 | 579 | 20 | 1.400623 | -0.008584 | 775.8 | 0.974256 | 22.48062 | 50.904393 | long | 169 | 1 |
| 2 | 2009/1/2 | 521 | 20 | 1.397657 | -0.004047 | 681.7 | 0.980299 | 74.766355 | 31.142187 | short | 314 | 1 |
| 3 | 2009/1/5 | 808 | 20 | 1.395407 | -0.006956 | 567.4 | 0.947404 | 12.5 | 56.391491 | long | 365 | 1 |
| 4 | 2009/1/5 | 633 | 20 | 1.379803 | -0.014207 | 967.2 | 0.931433 | 55.445545 | 36.960294 | short | 140 | 1 |
| 5 | 2009/1/5 | 612 | 20 | 1.37788 | -0.014863 | 887.1 | 0.956236 | 30.769231 | 48.182703 | long | 309 | 1 |
| 6 | 2009/1/5 | 562 | 20 | 1.376947 | -0.009693 | 867.6 | 0.978018 | 98.901099 | 62.667888 | short | 224 | 1 |
| 7 | 2009/1/6 | 625 | 20 | 1.373907 | -0.010777 | 679.6 | 0.930831 | 29.473684 | 63.928002 | long | -4 | 0 |
| 8 | 2009/1/6 | 295 | 20 | 1.371273 | -0.008075 | 528.9 | 0.922715 | 37.777778 | 24.814815 | short | 455 | 1 |
| 9 | 2009/1/6 | 517 | 20 | 1.350817 | -0.000753 | 939.8 | 0.952854 | 51.851852 | 80.997031 | long | 179 | 1 |


### Objectives

Markets are noisy: if you enter trades at random, you can expect a win rate of only about 50%, which makes it hard to predict whether a given trade will succeed.

Hence, our goals are to:
1. Use historical data to predict the position type for future dates
2. Maximize our profit
3. Reduce the number of trades entered with the wrong position type
4. Exceed the win rate of random entry (50%)

### First, getting the answer right
Our data has a binary label that indicates whether each trade was successful. If a trade was unsuccessful, the opposite position would have been the right one, so we flip its type, **i.e. short -> long**. After this step, the `type` column holds the correct answer for every trade.

In [40]:
data['type'] = np.where(data.label == 0, np.where(data.type == 'short', 'long', 'short'), data.type)

Look at the row at index 7. It was originally a long position with label 0. After the modification it has become a short position, which is the correct answer, while the other rows remain the same.

In [41]:
data.head(10)

| | time | tick_volume | spread | moving_average | bias_ma | volume_ma | vol_change | K | D | type | profit | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009/1/2 | 678 | 20 | 1.402423 | -0.007718 | 783.0 | 0.994286 | 59.0 | 29.52694 | short | 149 | 1 |
| 1 | 2009/1/2 | 579 | 20 | 1.400623 | -0.008584 | 775.8 | 0.974256 | 22.48062 | 50.904393 | long | 169 | 1 |
| 2 | 2009/1/2 | 521 | 20 | 1.397657 | -0.004047 | 681.7 | 0.980299 | 74.766355 | 31.142187 | short | 314 | 1 |
| 3 | 2009/1/5 | 808 | 20 | 1.395407 | -0.006956 | 567.4 | 0.947404 | 12.5 | 56.391491 | long | 365 | 1 |
| 4 | 2009/1/5 | 633 | 20 | 1.379803 | -0.014207 | 967.2 | 0.931433 | 55.445545 | 36.960294 | short | 140 | 1 |
| 5 | 2009/1/5 | 612 | 20 | 1.37788 | -0.014863 | 887.1 | 0.956236 | 30.769231 | 48.182703 | long | 309 | 1 |
| 6 | 2009/1/5 | 562 | 20 | 1.376947 | -0.009693 | 867.6 | 0.978018 | 98.901099 | 62.667888 | short | 224 | 1 |
| 7 | 2009/1/6 | 625 | 20 | 1.373907 | -0.010777 | 679.6 | 0.930831 | 29.473684 | 63.928002 | short | -4 | 0 |
| 8 | 2009/1/6 | 295 | 20 | 1.371273 | -0.008075 | 528.9 | 0.922715 | 37.777778 | 24.814815 | short | 455 | 1 |
| 9 | 2009/1/6 | 517 | 20 | 1.350817 | -0.000753 | 939.8 | 0.952854 | 51.851852 | 80.997031 | long | 179 | 1 |
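
To make the flipping rule concrete, here is a minimal sketch on a made-up four-row frame. The rows are invented for illustration; only the `type` and `label` column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with the same 'type' and 'label' columns as the notebook.
toy = pd.DataFrame({
    "type":  ["short", "long", "long", "short"],
    "label": [1, 0, 1, 0],
})

# The opposite position for every row.
flipped = np.where(toy["type"] == "short", "long", "short")

# Where the trade lost (label == 0), use the opposite position; otherwise keep it.
toy["type"] = np.where(toy["label"] == 0, flipped, toy["type"])

print(toy["type"].tolist())
```

Rows 1 and 3 (the losing trades) change side, while rows 0 and 2 keep their original position.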


### Next, separate the validation and training datasets

Our goal is to predict unseen data from historical data, so we hold back some data for validation and keep it out of training to avoid leakage. Here, we take the most recent one sixth of the rows (about 16.7%) as our validation set.

You might ask why one sixth, not 30% or 21%?

After testing different values, this split gave the best results for training the model. So, make sure you test as many settings as you can before finalising your model.

In [42]:
n_val = int(len(data) / 6)  # hold out the most recent one sixth of the rows
val_data, training_data = data.iloc[-n_val:, :], data.iloc[:-n_val, :]

In [43]:
training_data.shape

(7097, 12)

In [44]:
val_data.shape

(1419, 12)
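
The shapes above add up to 8,516 rows, of which 1,419 (one sixth) are held out. As a rough sketch of the same slicing on a made-up 12-row frame (the `toy` frame and its single column are placeholders, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the trading data: 12 rows ordered by time.
toy = pd.DataFrame({"profit": range(12)})

n_val = len(toy) // 6          # size of the hold-out set (one sixth of the rows)
train = toy.iloc[:-n_val]      # everything except the most recent rows
val = toy.iloc[-n_val:]        # the most recent one sixth

print(train.shape, val.shape)
```

Because the split is by position rather than random, every validation row comes after every training row in time.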

### Column Standardization
Next, we standardise specific columns so that large differences in their value ranges do not distort the model. `StandardScaler` rescales a column to zero mean and unit variance; it does not bound the values to the range 0 to 1. The scaler is fitted on the training set only and then applied to the validation set.
- tick_volume: Varies over a huge range in the market, so it needs scaling.
- K: Always between 0 and 100. No need to scale.
- D: Similar to K and always between 0 and 100. No need to scale.

In [45]:
sc = StandardScaler()

training_data[['tick_volume']] = sc.fit_transform(training_data[['tick_volume']])
val_data[['tick_volume']] = sc.transform(val_data[['tick_volume']])

In [46]:
training_data.head()

| | time | tick_volume | spread | moving_average | bias_ma | volume_ma | vol_change | K | D | type | profit | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009/1/2 | -0.812221 | 20 | 1.402423 | -0.007718 | 783.0 | 0.994286 | 59.0 | 29.52694 | short | 149 | 1 |
| 1 | 2009/1/2 | -0.831648 | 20 | 1.400623 | -0.008584 | 775.8 | 0.974256 | 22.48062 | 50.904393 | long | 169 | 1 |
| 2 | 2009/1/2 | -0.84303 | 20 | 1.397657 | -0.004047 | 681.7 | 0.980299 | 74.766355 | 31.142187 | short | 314 | 1 |
| 3 | 2009/1/5 | -0.78671 | 20 | 1.395407 | -0.006956 | 567.4 | 0.947404 | 12.5 | 56.391491 | long | 365 | 1 |
| 4 | 2009/1/5 | -0.821051 | 20 | 1.379803 | -0.014207 | 967.2 | 0.931433 | 55.445545 | 36.960294 | short | 140 | 1 |
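
The negative values above show that the scaled column is centred on zero rather than squeezed into a fixed range. The sketch below reproduces the arithmetic by hand with NumPy on seven `tick_volume` values taken from the first rows of the data. Treating the last two as "validation" rows is an assumption made only for this illustration, so the numbers will not match the notebook output, which uses statistics from the full training set.

```python
import numpy as np

# First seven tick_volume values from the data; the split into
# "train" and "val" here is made up purely for illustration.
train_vol = np.array([678.0, 579.0, 521.0, 808.0, 633.0])
val_vol = np.array([612.0, 562.0])

# Learn mean and standard deviation from the training rows only...
mu = train_vol.mean()
sigma = train_vol.std()        # population std (ddof=0), as StandardScaler uses

# ...then apply the same shift and scale to both sets.
train_scaled = (train_vol - mu) / sigma
val_scaled = (val_vol - mu) / sigma

print(train_scaled)
print(val_scaled)
```

Reusing `mu` and `sigma` for the validation rows is what keeps information about the validation set out of the training step.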


### Setup baseline
Our goal is to maximise profit and success rate on unseen data, so we first need the profit and success rate of the original trades in the validation set. Our baseline profit is 10,538, and our baseline success rate is 68.9%.

In [47]:
print(f'Validation Set Initial Profit {sum(val_data.profit)}')
print(f'Validation data Initial win rate {len(val_data[val_data.label == 1]) / len(val_data)}')

Validation Set Initial Profit 10538
Validation data Initial win rate 0.6892177589852009
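
The same two numbers can be computed with pandas reductions. As a small sketch, the frame below holds only the `profit` and `label` values of the ten rows printed by `data.head(10)` earlier, not the validation set, so the results differ from the baseline above.

```python
import pandas as pd

# profit and label of the ten rows shown by data.head(10) earlier in this notebook
head = pd.DataFrame({
    "profit": [149, 169, 314, 365, 140, 309, 224, -4, 455, 179],
    "label":  [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
})

baseline_profit = head["profit"].sum()            # total profit of the original trades
baseline_win_rate = (head["label"] == 1).mean()   # share of trades that won

print(baseline_profit, baseline_win_rate)
```

For these ten rows the total profit is 2300 and the win rate is 0.9.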


## Build our model
So, our baseline profit is 10,538, and the win rate is 68.92%. Now we can start building our model. The process has four steps:
1. Feature Selection
2. Model Evaluation
3. Hyperparameter Tuning
4. Finalize Model

### Feature Selection


In [53]:
features_column = ['tick_volume', 'moving_average', 'bias_ma', 'volume_ma', 'vol_change','K', 'D']

In [54]:
train_feature = training_data[features_column]
train_label = training_data['type']

In [57]:
train_feature.head()

| | tick_volume | moving_average | bias_ma | volume_ma | vol_change | K | D |
|---|---|---|---|---|---|---|---|
| 0 | -0.812221 | 1.402423 | -0.007718 | 783.0 | 0.994286 | 59.0 | 29.52694 |
| 1 | -0.831648 | 1.400623 | -0.008584 | 775.8 | 0.974256 | 22.48062 | 50.904393 |
| 2 | -0.84303 | 1.397657 | -0.004047 | 681.7 | 0.980299 | 74.766355 | 31.142187 |
| 3 | -0.78671 | 1.395407 | -0.006956 | 567.4 | 0.947404 | 12.5 | 56.391491 |
| 4 | -0.821051 | 1.379803 | -0.014207 | 967.2 | 0.931433 | 55.445545 | 36.960294 |


In [58]:
train_label.head()

0    short
1     long
2    short
3     long
4    short
Name: type, dtype: object

In [55]:
val_feature = val_data[features_column]
val_label = val_data['type']

### Model Evaluation
Here we use logistic regression and a random forest classifier to check what each model can do. Don't forget that our main objective is to maximise profit, so we judge each model by the total profit it gives on the validation set: the absolute profit of every trade whose position type is predicted correctly, minus the absolute profit of every trade predicted wrongly.

The results show that logistic regression earns less (10,120) than our baseline (10,538), so we set it aside for now. The random forest classifier has slightly lower accuracy (0.6631 versus 0.6765) but a higher total profit (12,866), which exceeds the baseline, so we carry it forward to the next step.

#### Logistic Regression

In [70]:
lr = LogisticRegression()
lr.fit(train_feature, train_label)
# accuracy_score takes (y_true, y_pred)
print(f"Accuracy - {round(accuracy_score(val_label, lr.predict(val_feature)), 4)}")

val_data['predict'] = lr.predict(val_feature)
# 1 where the predicted position type matches the corrected one, else 0
val_data['correct'] = np.where(val_data.type == val_data.predict, 1, 0)
val_data['abs_profit'] = abs(val_data.profit)
# Correct predictions gain |profit|; wrong predictions lose |profit|
total = sum(val_data[val_data.correct == 1].abs_profit) - sum(val_data[val_data.correct == 0].abs_profit)
print(f"Total Profit - {total}")

Accuracy - 0.6765
Total Profit - 10120


#### Random Forest Classifier

In [73]:
rfc = RandomForestClassifier()
rfc.fit(train_feature, train_label)
# accuracy_score takes (y_true, y_pred)
print(f"Accuracy - {round(accuracy_score(val_label, rfc.predict(val_feature)), 4)}")

val_data['predict'] = rfc.predict(val_feature)
# 1 where the predicted position type matches the corrected one, else 0
val_data['correct'] = np.where(val_data.type == val_data.predict, 1, 0)
val_data['abs_profit'] = abs(val_data.profit)
# Correct predictions gain |profit|; wrong predictions lose |profit|
total = sum(val_data[val_data.correct == 1].abs_profit) - sum(val_data[val_data.correct == 0].abs_profit)
print(f"Total Profit - {total}")

Accuracy - 0.6631
Total Profit - 12866
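
The profit calculation is repeated for every model, so it could be factored into a helper. The sketch below is one possible way to do that; `total_profit` is a hypothetical name, not a function from the notebook, and the three trades are made up.

```python
import numpy as np
import pandas as pd

def total_profit(profit, correct):
    """Sum |profit| over correct predictions minus |profit| over wrong ones."""
    profit = np.abs(np.asarray(profit, dtype=float))
    correct = np.asarray(correct, dtype=bool)
    return profit[correct].sum() - profit[~correct].sum()

# Made-up example: three trades, the model gets the first two right.
toy = pd.DataFrame({
    "type":    ["short", "long", "short"],
    "predict": ["short", "long", "long"],
    "profit":  [149, -4, 314],
})
result = total_profit(toy["profit"], toy["type"] == toy["predict"])
print(result)
```

Here the two correct predictions contribute 149 + 4 and the wrong one costs 314, giving -161. The example also shows why accuracy and profit can disagree: one large wrong trade outweighs two small correct ones.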


### Hyperparameter Tuning
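We choose hyperparameters by how the model performs on the validation set, so to limit overfitting we restrict how large the forest and its trees can grow. Here, we test several values for the number of trees (`n_estimators`) and the maximum depth of each tree (`max_depth`). Because a random forest gives a different result on every fit, each combination is trained 25 times and its total profit is averaged, which gives us a steadier basis for picking good hyperparameters.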

In [86]:
n_estimator = [5, 10, 30, 50, 100]
max_depth = [2, 10, 21, 30, 50]
n_runs = 25  # fits per combination, averaged to smooth out randomness
parameter_result = {}
for est in n_estimator:
    for dep in max_depth:
        batch_total = []
        for _ in range(n_runs):
            rfc = RandomForestClassifier(n_estimators=est, max_depth=dep, n_jobs=-1)
            rfc.fit(train_feature, train_label)
            val_data['predict'] = rfc.predict(val_feature)
            val_data['correct'] = np.where(val_data.type == val_data.predict, 1, 0)
            val_data['abs_profit'] = abs(val_data.profit)
            total = sum(val_data[val_data.correct == 1].abs_profit) - sum(val_data[val_data.correct == 0].abs_profit)
            batch_total.append(total)
        # Mean validation profit for this (n_estimators, max_depth) pair
        parameter_result[(est, dep)] = np.mean(batch_total)

In [87]:
parameter_result

{(5, 2): 10050.88,
 (5, 10): 8573.2,
 (5, 21): 7582.16,
 (5, 30): 7833.12,
 (5, 50): 6810.24,
 (10, 2): 10280.64,
 (10, 10): 10327.52,
 (10, 21): 7136.32,
 (10, 30): 8236.56,
 (10, 50): 8420.16,
 (30, 2): 10457.44,
 (30, 10): 10502.24,
 (30, 21): 11047.28,
 (30, 30): 10927.84,
 (30, 50): 11349.84,
 (50, 2): 10546.88,
 (50, 10): 10819.2,
 (50, 21): 11406.24,
 (50, 30): 10628.32,
 (50, 50): 11313.04,
 (100, 2): 11169.04,
 (100, 10): 10584.96,
 (100, 21): 11796.56,
 (100, 30): 11026.88,
 (100, 50): 11287.2}
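
Reading the best pair off a 25-entry dictionary by eye is error-prone, so it can help to rank the entries. A minimal sketch, using five of the entries copied from the output above (the full dictionary works the same way):

```python
# A few (n_estimators, max_depth) entries copied from the parameter_result output above.
parameter_result = {
    (5, 2): 10050.88,
    (30, 21): 11047.28,
    (50, 21): 11406.24,
    (100, 21): 11796.56,
    (100, 50): 11287.2,
}

# Rank the pairs by mean validation profit, best first.
ranked = sorted(parameter_result.items(), key=lambda kv: kv[1], reverse=True)
best_params, best_profit = ranked[0]

print(best_params, best_profit)
```

In the full output above, `(100, 21)` with a mean profit of 11796.56 is also the highest of all 25 combinations.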

However, a random forest is an ensemble: each tree is trained on a bootstrap sample of the rows (bagging) and considers a random subset of the features at each split, and the forest combines the trees' votes at the end. Because of this randomness, every fit produces a slightly different model, so we need to train several times to find a model that comes closer to our objectives with a small drawdown.

In [101]:
def find_best_model(feature, label, val_data, val_feature, baseline=10538):
    """Refit a random forest until its validation profit beats `baseline`."""
    # Note: this loop only ends once some fit exceeds the baseline.
    while True:
        model = RandomForestClassifier(n_estimators=30, max_depth=21, n_jobs=-1)
        model.fit(feature, label)
        val_data['predict'] = model.predict(val_feature)
        val_data['correct'] = np.where(val_data.type == val_data.predict, 1, 0)
        val_data['abs_profit'] = abs(val_data.profit)
        # Correct predictions gain |profit|; wrong predictions lose |profit|
        total = sum(val_data[val_data.correct == 1].abs_profit) - sum(val_data[val_data.correct == 0].abs_profit)
        if total > baseline:
            return model, total
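
The function above keeps retraining until a fit beats the baseline. If the baseline is set too high, that could take a very long time, so one possible variation is to cap the number of attempts and fall back to the best model seen. The sketch below illustrates the idea only: `search_best`, `max_tries` and `fake_train_and_score` are hypothetical names, and the random profits stand in for fitting a real random forest.

```python
import random

def search_best(train_and_score, baseline, max_tries=50):
    """Retrain up to max_tries times; stop early once a result beats baseline.

    train_and_score is a hypothetical callable returning (model, profit).
    If nothing beats the baseline, the best result seen is returned.
    """
    best_model, best_profit = None, float("-inf")
    for _ in range(max_tries):
        model, profit = train_and_score()
        if profit > best_profit:
            best_model, best_profit = model, profit
        if profit > baseline:
            break
    return best_model, best_profit

# Stand-in for "fit a random forest and compute its validation profit".
rng = random.Random(0)

def fake_train_and_score():
    profit = rng.randint(9000, 16000)
    return f"model-{profit}", profit

model, profit = search_best(fake_train_and_score, baseline=10538)
print(model, profit)
```

In the notebook, `train_and_score` would wrap the fit, predict and profit steps from `find_best_model`.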

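We need to check the performance of our model with customised metrics. Our objective is to maximise profit and to correct as many losing trades as possible. Profit is the primary goal, though, so we can tolerate some loss in prediction accuracy.

We set up four metrics, each comparing the original trade with the model's prediction:
1. wsw - win still win: the original trade won and the model keeps the same position
2. ltw - lose turn win: the original trade lost and the model switches to the winning position
3. lsl - lose still lose: the original trade lost and the model keeps the losing position
4. wtl - win turn lose: the original trade won but the model switches to the losing position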


In [99]:
def metrics(val_data):
    wsw = len(val_data[(val_data.type == val_data.predict) & (val_data.label == 1)])
    ltw = len(val_data[(val_data.type == val_data.predict) & (val_data.label == 0)])
    lsl = len(val_data[(val_data.type != val_data.predict) & (val_data.label == 0)])
    wtl = len(val_data[(val_data.type != val_data.predict) & (val_data.label == 1)])
    print(wsw, ltw, lsl, wtl)
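
To see how the four counts are assigned, here is a small sketch on a made-up frame with one row per outcome. `outcome_counts` is a hypothetical variant of the function above that returns a dictionary instead of printing.

```python
import pandas as pd

def outcome_counts(df):
    """Count the four outcomes; df needs 'type', 'predict' and 'label' columns."""
    right = df["type"] == df["predict"]   # model picked the profitable side
    won = df["label"] == 1                # original trade was a win
    return {
        "wsw": int((right & won).sum()),
        "ltw": int((right & ~won).sum()),
        "lsl": int((~right & ~won).sum()),
        "wtl": int((~right & won).sum()),
    }

# Made-up rows: 'type' is the corrected position, 'predict' the model output.
toy = pd.DataFrame({
    "type":    ["short", "short", "long", "long"],
    "predict": ["short", "short", "short", "short"],
    "label":   [1, 0, 0, 1],
})
counts = outcome_counts(toy)
print(counts)
```

Each row lands in exactly one bucket, so the four counts always sum to the number of validation trades.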

Over three trials the profit keeps rising, but look at the third model: it has the highest profit, yet the lowest wsw and ltw counts and the highest wtl count, which indicates it is not a good model to use. So, we need to train repeatedly to find the best model to use.

In [114]:
first_attempt, first_profit = find_best_model(train_feature, train_label, val_data, val_feature, baseline=10538)
print(f"Profit for this model - {first_profit}")
val_data['predict'] = first_attempt.predict(val_feature)
print("\n------ metrics -------")
metrics(val_data)

Profit for this model - 15406

------ metrics -------
871 72 369 107


In [115]:
second_attempt, second_profit = find_best_model(train_feature, train_label, val_data, val_feature, baseline=first_profit)

print(f"Profit for this model - {second_profit}")
val_data['predict'] = second_attempt.predict(val_feature)
print("\n------ metrics -------")
metrics(val_data)

Profit for this model - 15494

------ metrics -------
876 64 377 102


In [116]:
third_attempt, third_profit = find_best_model(train_feature, train_label, val_data, val_feature, baseline=second_profit)

print(f"Profit for this model - {third_profit}")
val_data['predict'] = third_attempt.predict(val_feature)
print("\n------ metrics -------")
metrics(val_data)

Profit for this model - 15674

------ metrics -------
867 63 378 111
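
Raw counts are easier to compare once they are turned into rates. The sketch below does this for the three attempts printed above. The rate definitions (`accuracy`, `ltw_rate`, `wtl_rate`) are one reasonable choice made for this illustration, not metrics defined in the notebook.

```python
# (wsw, ltw, lsl, wtl) as printed for the three attempts above.
attempts = {
    "first":  (871, 72, 369, 107),
    "second": (876, 64, 377, 102),
    "third":  (867, 63, 378, 111),
}

rates = {}
for name, (wsw, ltw, lsl, wtl) in attempts.items():
    total = wsw + ltw + lsl + wtl
    rates[name] = {
        "accuracy": (wsw + ltw) / total,   # share of correct position types
        "ltw_rate": ltw / (ltw + lsl),     # losing trades turned into wins
        "wtl_rate": wtl / (wsw + wtl),     # winning trades turned into losses
    }
    print(name, {k: round(v, 4) for k, v in rates[name].items()})
```

Each attempt covers all 1,419 validation trades. Accuracy is lower for the third attempt than for the first even though its profit is higher, which matches the observation that the third model is not the one to keep.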
