Strategy:
new range definition with fibonacci extensions - the basic idea is that when a range has been broken out of, it makes sense to target the span of that range * 1.618. still working on it but maybe i could chart out what i’m trying so i can see visually whether it works or not. what i’m thinking is that a range could be identified when the highest high and lowest low haven’t changed over 50% or 75% of the lookback period or whatever, because in theory the range holds more weight if it has lasted a longer time. This is like a filter to stop the feature from constantly calling a range breakout during trending periods, and since the fibonacci part of the signal relies quite heavily on the levels of the range being significant, it’s important to try to filter out all the meaningless range breakouts. so i need a vectorised way to keep track of when the hh and ll changes, and keep a constant count of how long it has been since the last change. i could refine it even further by saying if the range only changed by a small amount (maybe 1%-5% of the total range?) and has since continued to be a range, then it’s just a deviation and therefore doesn’t count. so that logic could look like: if all highest highs over the most recent 50% of the lookback period are within 5% of each other, the range high is the mean of those values and a break of the highest of those would be considered a resistance break. if highs and lows conform to these rules then there is a range in effect and a break of either of these levels is considered a range break.

Technical method used to quantify whether stop or target was hit first:
I want to find the index of the first high / low to exceed the profit value and the index of the first low / high to exceed the stop value, then I can see which is first. I can use idxmax and idxmin for this, but I first need to use clip to make sure that the first values to exceed my limits will be considered the first min/max value

i have a few strategies i could try out here:
- if support or resistance is broken whilst 'since_broke_sup' or 'since_broke_res' is above the 'split' threshold, expect continuation in the direction of the break. The approach for this strategy would be to place an oco order on the orderbook at the support/resistance line as soon as the relevant count goes above the split threshold
- if support or resistance is broken whilst the relevant count is below the 'split' threshold, expect rejection/reversal back into/across the channel. the way to take these trades would be to wait for support/resistance to be broken, then open a trade with an oco order on the close of the candle that broke it.
- if price bounces near support or resistance without actually breaking it, that might suggest a reversal too. i'll have to work out if it's a high probability signal, but it certainly gives clear invalidation. these trades would also be taken at market on any close that was close enough to the channel boundary

In [1]:
import time
import pandas as pd
from pathlib import Path
import joblib
import mt.resources.ml_funcs as mlf
import mt.resources.indicators as ind
import mt.resources.binance_funcs as funcs
import mt.resources.features as features
import numpy as np
import json
from itertools import product
from collections import Counter
from datetime import datetime
from pyarrow import ArrowInvalid
import statistics as stats
from plotly.subplots import make_subplots
import plotly.graph_objects as go
if not Path('/pi_2.txt').exists():
    from sklearnex import patch_sklearn
    patch_sklearn()
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import QuantileTransformer, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from sklearn.metrics import confusion_matrix, roc_auc_score, make_scorer
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif, chi2
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
from xgboost import XGBClassifier, DMatrix
import optuna

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
def backtest_oco(df_0, side, lookback, trim_ohlc=2000):
    """i can either target the opposite side of the channel or the mid-point, or both"""

    df_0 = df_0.reset_index(drop=True)
    atr_lb = 10
    df_0 = ind.atr(df_0, atr_lb)
    
    # identify potential entries
    rows = list(df_0.loc[df_0[f"entry_{side[0]}"]].index)
    
    results = []
    for row in rows:
        if row == len(df_0)-1:
            break
        df = df_0[row:row + trim_ohlc].copy().reset_index(drop=True)
        entry = df.close.iloc[0]
        atr = df[f"atr-{atr_lb}"].iloc[0]
        
        if side == 'long':
            highest = df.high.max()
            target = df[f"hh_{lookback}"].iloc[0]
            stop = df[f"ll_{lookback}"].iloc[0] - atr
            rr = abs((target / entry) - 1) / abs((stop / entry) - 1)
            target_hit_idx = df.high.clip(upper=target).idxmax()
            stop_hit_idx = df.low.clip(lower=stop).idxmin()
            if (target > highest) or (stop_hit_idx < target_hit_idx):
                exit_row = stop_hit_idx
                pnl_cat = 0
                pnl = (stop-entry)/entry
            elif target_hit_idx < stop_hit_idx:
                exit_row = target_hit_idx
                pnl_cat = 1
                pnl = (target-entry)/entry
            else:
                exit_row = stop_hit_idx
                pnl_cat = 0
                pnl = 0
        else:
            lowest = df.low.min()
            target = df[f"ll_{lookback}"].iloc[0]
            stop = df[f"hh_{lookback}"].iloc[0] + atr
            rr = abs((target / entry) - 1) / abs((stop / entry) - 1)
            target_hit_idx = df.low.clip(lower=target).idxmin()
            stop_hit_idx = df.high.clip(upper=stop).idxmax()
            if (target < lowest) or (stop_hit_idx < target_hit_idx):
                exit_row = stop_hit_idx
                pnl_cat = 0
                pnl = (entry-stop)/entry
            elif target_hit_idx < stop_hit_idx:
                exit_row = target_hit_idx
                pnl_cat = 1
                pnl = (entry-target)/entry
            else:
                exit_row = stop_hit_idx
                pnl_cat = 0
                pnl = 0

        row_data = df_0.iloc[row-1].to_dict()
        pnl_pct = pnl - (2 * 0.0015)  # subtract trading fees and slippage estimate

        row_res = dict(
            # idx=row,
            r_pct=atr,
            rr=rr,
            lifespan=exit_row,
            pnl_pct=pnl_pct,
            pnl_r=pnl_pct / atr,
            pnl_cat=pnl_cat
        )

        results.append(row_data | row_res)

        msg = f"trade lifespans getting close to trimmed ohlc length ({exit_row / trim_ohlc:.1%}), increase trim ohlc"
        if exit_row / trim_ohlc > 0.9:
            print(msg)
        
    return results

In [3]:
def prepare_strat_data(df, lookback):    
    df[f"ll_{lookback}"] = df.low.rolling(lookback).min()
    df[f"hh_{lookback}"] = df.high.rolling(lookback).max()
    
    df['channel_mid'] = (df[f"hh_{lookback}"] + df[f"ll_{lookback}"]) / 2
    df['channel_width'] = (df[f"hh_{lookback}"] - df[f"ll_{lookback}"]) / df.channel_mid
    
    df['broke_support'] = df.low == df[f"ll_{lookback}"]
    df['broke_resistance'] = df.high == df[f"hh_{lookback}"]
    
    df['close_above_sup'] = df.close > df[f"ll_{lookback}"].shift()
    df['close_below_res'] = df.close < df[f"hh_{lookback}"].shift()
    
    df['channel_position'] = (df.close - df[f"ll_{lookback}"]) / (df[f"hh_{lookback}"] - df[f"ll_{lookback}"])
    
    df['entry_l'] = df.channel_position < 0.05
    df['entry_s'] = df.channel_position > 0.95
    
    df['entry_l_price'] = df.close.loc[df.entry_l]
    df['entry_s_price'] = df.close.loc[df.entry_s]
    
    df['support_diff_z'] = abs(ind.z_score(df[f"ll_{lookback}"].ffill().pct_change(), lookback) * df.broke_support)
    df['resistance_diff_z'] = abs(ind.z_score(df[f"hh_{lookback}"].ffill().pct_change(), lookback) * df.broke_resistance)
    
    return df

In [4]:
def plot_stuff(df):
    df = df.tail(2000)
    
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                        vertical_spacing=0.01,
                        row_heights=[0.8, 0.2])
    
    fig.update_yaxes(title_text="Price", row=1, col=1)
    fig.update_yaxes(title_text="S/R Diff Z-score", row=2, col=1)
    
    fig.add_trace(go.Candlestick(x=df.timestamp, open=df.open, high=df.high, low=df.low, close=df.close, showlegend=False), row=1, col=1)
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df[f"hh_{lookback}"], mode='lines', name='Highest High'), row=1, col=1)
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df[f"ll_{lookback}"], mode='lines', name='Lowest Low'), row=1, col=1)
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df.channel_mid, mode='lines', name='Channel Mid-point'), row=1, col=1)
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df.entry_l_price, mode='markers', 
                             marker_size=10, marker_symbol='triangle-up', 
                             name='Strat 2 Long Entry'), row=1, col=1)
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df.entry_s_price, mode='markers', 
                             marker_size=10, marker_symbol='triangle-down', 
                             name='Strat 2 Short Entry'), row=1, col=1)
    
    fig.add_trace(go.Scatter(x=df['timestamp'], y=df[f"channel_position"], mode='lines', name='Channel Position'), row=2, col=1)
    # fig.add_trace(go.Scatter(x=df['timestamp'], y=df[f"support_diff_z"], mode='lines', name='Support Diff Z Score'), row=2, col=1)
    # fig.add_trace(go.Scatter(x=df['timestamp'], y=df[f"resistance_diff_z"], mode='lines', name='Resistance Diff Z Score'), row=2, col=1)
    fig.update_layout(height=600, width=1600, xaxis_rangeslider_visible=False,
                      margin=go.layout.Margin(l=20, r=20, b=20, t=20))
    fig.show()

In [5]:
# set constants
timeframe = '1h'
side = 'short'
# r_mult = 2
data_len = 50000
lookback = 200
selection_method = '1w_volumes'
num_pairs = 150

Create the Dataset

In [27]:
# generate dataset
pairs = mlf.get_margin_pairs(selection_method, num_pairs)
all_res = []
for pair in pairs:
    df = mlf.get_data(pair, timeframe)
    # df = df.tail(20000).reset_index(drop=True)
    df = mlf.add_features(df, timeframe)
    df = prepare_strat_data(df, lookback)
    print(f"{pair} len: {len(df)}")
    res = backtest_oco(df, side, lookback)
    all_res.extend(res)
res_df = pd.DataFrame(all_res).sort_values('timestamp').reset_index(drop=True)
res_df = res_df.dropna(axis=1)
res_df.tail()

FDUSDUSDT len: 2001
BTCUSDT len: 8760
XRPUSDT len: 8760
LOOMUSDT len: 4980
BNBUSDT len: 8760
SOLUSDT len: 8760
ETHUSDT len: 8760
BONDUSDT len: 8760
STRAXUSDT len: 8760
LINKUSDT len: 8760
MATICUSDT len: 8760
LQTYUSDT len: 5553
DOGEUSDT len: 8760
TRBUSDT len: 8760
ARBUSDT len: 4997
BNTUSDT len: 8760
LTCUSDT len: 8760
BCHUSDT len: 8760
RUNEUSDT len: 8760
REQUSDT len: 8760
TRXUSDT len: 8760
AVAXUSDT len: 8760
OPUSDT len: 6081
STORJUSDT len: 8760
BLZUSDT len: 8760
ATOMUSDT len: 8760
ADAUSDT len: 8760
RNDRUSDT len: 8760
WLDUSDT len: 2053
LEVERUSDT len: 6081
SHIBUSDT len: 8760
TUSDT len: 8760
BANDUSDT len: 8760
BNXUSDT len: 8760
STPTUSDT len: 8760
SUIUSDT len: 4018
OGNUSDT len: 8760
UNFIUSDT len: 8760
YGGUSDT len: 8760
DOTUSDT len: 8760
APEUSDT len: 8760
MKRUSDT len: 8760
TWTUSDT len: 8760
BETAUSDT len: 8760
HOTUSDT len: 8760
XMRUSDT len: 8760
CYBERUSDT len: 1522
BAKEUSDT len: 8760
APTUSDT len: 6081
GALAUSDT len: 8760
ARKUSDT len: 614
DYDXUSDT len: 8760
ZRXUSDT len: 8760
FILUSDT len: 8760
FRO

Unnamed: 0,timestamp,open,high,low,close,base_vol,quote_vol,num_trades,taker_buy_base_vol,taker_buy_quote_vol,...,close_below_res,entry_l,entry_s,atr-10,r_pct,rr,lifespan,pnl_pct,pnl_r,pnl_cat
37082,2023-10-17 19:00:00+00:00,0.2512,0.252,0.2508,0.251,186929.6,46984.9783,375.0,126515.8,31806.0326,...,True,False,False,0.0036,0.0035,12.644,0,-0.0267,-7.627,0
37083,2023-10-17 19:00:00+00:00,6.387,6.425,6.383,6.423,40721.97,260855.477,928.0,19479.37,124774.8301,...,True,False,False,0.0416,0.0396,11.1154,0,-0.0163,-0.4128,0
37084,2023-10-17 19:00:00+00:00,0.3206,0.3229,0.3206,0.3225,52181.44,16777.4345,145.0,36077.34,11596.0558,...,True,True,False,0.0037,0.0035,10.9107,0,-0.0179,-5.1293,0
37085,2023-10-17 19:00:00+00:00,3.844,3.869,3.839,3.867,112758.84,434822.3261,1573.0,50267.69,193739.687,...,True,False,False,0.0384,0.0372,8.6492,1,-0.0161,-0.4317,0
37086,2023-10-17 19:00:00+00:00,0.2424,0.2434,0.2421,0.2433,98507.45,23906.0321,319.0,50081.0,12160.6211,...,True,True,False,0.0023,0.0023,24.5828,0,-0.0187,-8.1184,0


In [7]:
# split features from labels
X, y, _ = mlf.features_labels_split(res_df)

In [8]:
# undersampling
# us = RandomUnderSampler(random_state=0)
us = ClusterCentroids(random_state=0)
X, y = us.fit_resample(X, y)

  super()._check_params_vs_input(X, default_n_init=10)


In [9]:
# split data for fitting and calibration
X_final = X.copy()
y_final = y.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=11, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=11, stratify=y_test)

In [10]:
# feature scaling
cols = list(X_train.columns) # list of strings, names of all features
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

# turn back into dataframes
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_val = pd.DataFrame(X_val, columns=cols)

In [11]:
# remove features with too many missing values
missing_condition = X_train.columns[X_train.isnull().mean(axis=0) < 0.1]
X_train = X_train[missing_condition]
X_test = X_test[missing_condition]
X_val = X_val[missing_condition]

# remove low variance features
variance_condition = X_train.columns[X_train.var() > 0.0]
X_train = X_train[variance_condition]
X_test = X_test[variance_condition]
X_val = X_val[variance_condition]


In [12]:
# remove features that are highly correlated with other features
corr_thresh = 0.5
corr_matrix = X_train.corr()
# Extract the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
        
# Select the features with correlations above the threshold
# Need to use the absolute value
to_drop = [column for column in upper.columns if any(upper[column].abs() > corr_thresh)]

# Dataframe to hold correlated pairs
record_collinear = []

# Iterate through the columns to drop to record pairs of correlated features
for column in to_drop:

    # Find the correlated features
    corr_features = list(upper.index[upper[column].abs() > corr_thresh])

    # Find the correlated values
    corr_values = list(upper[column][upper[column].abs() > corr_thresh])
    drop_features = [column for _ in range(len(corr_features))]    

    # Record the information (need a temp df for now)
    temp_df = pd.DataFrame.from_dict({'drop_feature': drop_features,
                                     'corr_feature': corr_features,
                                     'corr_value': corr_values})

    # Add to dataframe
    record_collinear.append(temp_df)

record_collinear = pd.concat(record_collinear, axis=0, ignore_index=True)
X_train = X_train.drop(list(record_collinear.corr_feature), axis=1)
X_test = X_test.drop(list(record_collinear.corr_feature), axis=1)
X_val = X_val.drop(list(record_collinear.corr_feature), axis=1)
X_train.shape

(2011, 56)

In [13]:
# mutual info feature selection
print(f"feature selection began: {datetime.now().strftime('%Y/%m/%d %H:%M')}")
cols = list(X_train.columns) # list of strings, names of all features
selector = SelectKBest(mutual_info_classif, k=15)
selector.fit(X_train, y_train)
X_train = selector.transform(X_train)
X_test = selector.transform((X_test))
X_val = selector.transform((X_val))

# Turn data back into dataframe with column names
cols_idx = list(selector.get_support(indices=True))
selected_columns = [col for i, col in enumerate(cols) if i in cols_idx]
X_train = pd.DataFrame(X_train, columns=selected_columns)
X_test = pd.DataFrame(X_test, columns=selected_columns)
X_val = pd.DataFrame(X_val, columns=selected_columns)
print(selected_columns)

feature selection began: 2023/10/17 19:04
['ats_z_200', 'day_of_week', 'fractal_trend_age_long', 'fractal_trend_age_short', 'hour_180', 'inside_bar', 'kurtosis_100', 'spooky_num_prox', 'ema_48_above_192', 'high_volume_churn_50', 'vol_delta', 'vol_delta_pct', 'weekly_open_ratio', 'atr-10', 'rr']


In [14]:
# sequential feature selection
pre_selector_model = RandomForestClassifier()
selector = SFS(estimator=pre_selector_model, k_features='best', forward=False,
                     floating=True, verbose=0, scoring='accuracy', n_jobs=-1)
selector = selector.fit(X_train, y_train)
cols_idx = list(selector.k_feature_idx_)
X_train = selector.transform(X_train)
X_test = selector.transform(X_test)
X_val = selector.transform(X_val)

Hyperparameter Optimisation

In [15]:
def objective(trial):
    # Suggest values for hyperparameters
    # criterion = trial.suggest_categorical('criterion', 'gini', 'log_loss')
    # min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
    n_estimators = trial.suggest_int("n_estimators", 50, 150)
    max_depth = trial.suggest_int("max_depth", 12, 32)
    max_features = trial.suggest_float("max_features", 0.1, 1.0)
    max_samples = trial.suggest_float('max_samples', 0.1, 1.0)
    ccp_alpha = trial.suggest_float('ccp_alpha', 1e-5, 1e-2, log=True)

    # Create and fit random forest model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        criterion='log_loss',
        max_depth=max_depth,
        min_samples_split=2,
        max_features=max_features,
        max_samples=max_samples,
        ccp_alpha=ccp_alpha,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)

    # Score model
    scores = cross_val_score(model, X_test, y_test, n_jobs=-1)
    avg_score = stats.mean(scores)

    # Return score
    return avg_score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1000, n_jobs=-1)

# automatically get best params from study
best_trials = [trial.params for trial in study.trials if trial.values[0] >= (study.best_value - 0.001)]
best_df = pd.DataFrame(best_trials)
best_df.describe()
best_params = best_df.median(axis=0).to_dict()

[I 2023-10-17 19:05:54,754] A new study created in memory with name: no-name-f0c95e01-2280-4973-8bc2-3493fea76a4e
[I 2023-10-17 19:05:57,125] Trial 5 finished with value: 0.9656716417910448 and parameters: {'n_estimators': 69, 'max_depth': 20, 'max_features': 0.8635968296186916, 'max_samples': 0.3932852078095972, 'ccp_alpha': 0.00018326764133416067}. Best is trial 5 with value: 0.9656716417910448.
[I 2023-10-17 19:05:57,727] Trial 3 finished with value: 0.9582089552238806 and parameters: {'n_estimators': 79, 'max_depth': 30, 'max_features': 0.986742402460056, 'max_samples': 0.6730816068839555, 'ccp_alpha': 0.005574705607177606}. Best is trial 5 with value: 0.9656716417910448.
[I 2023-10-17 19:05:58,173] Trial 2 finished with value: 0.9656716417910448 and parameters: {'n_estimators': 89, 'max_depth': 25, 'max_features': 0.307828236321006, 'max_samples': 0.5762140276555052, 'ccp_alpha': 0.004591957439770887}. Best is trial 5 with value: 0.9656716417910448.
[I 2023-10-17 19:05:58,954] Tri

In [16]:
# remove features with low permutation importance
imp_model = RandomForestClassifier()
imp_model.fit(X_train, y_train)
# print(f"Score: {imp_model.score(X_test, y_test)}")
importances = permutation_importance(imp_model, X_test, y_test, n_repeats=1000, random_state=42, n_jobs=-1)
imp_mean = pd.Series(importances.importances_mean, index=selector.k_feature_names_)
imp_std = pd.Series(importances.importances_std, index=selector.k_feature_names_)
imp_df = pd.DataFrame({'importances': imp_mean, 'imp_std': imp_std}).sort_values('importances', ascending=False)
imp_df['cum_imp'] = imp_df.importances.cumsum()
final_features = list(imp_mean.index[imp_mean > 0.01])
print(final_features)
# print(imp_df)

['ats_z_200', 'kurtosis_100', 'spooky_num_prox', 'ema_48_above_192', 'vol_delta']


In [20]:
# final validation score before training production model
X_train = pd.DataFrame(X_train, columns=selector.k_feature_names_)
X_val = pd.DataFrame(X_val, columns=selector.k_feature_names_)
X_train = X_train[final_features]
X_val = X_val[final_features]
val_model = RandomForestClassifier(
    n_estimators=int(best_params['n_estimators']),
    criterion='log_loss',
    max_depth=int(best_params['max_depth']),
    min_samples_split=2,
    max_features=best_params['max_features'],
    max_samples=best_params['max_samples'],
    ccp_alpha=best_params['ccp_alpha'],
    random_state=42,
    n_jobs=-1
)
val_model.fit(X_train, y_train)
score = val_model.score(X_val, y_val)
print(f"Final model validation score: {score:.1%}")

Final model validation score: 96.9%


In [18]:
X_final = X_final[final_features]
scaler = MinMaxScaler()
X_final = scaler.fit_transform(X_final)
X_final = pd.DataFrame(X_final, columns=final_features)

final_model = RandomForestClassifier(
    n_estimators=int(best_params['n_estimators']),
    criterion='log_loss',
    max_depth=int(best_params['max_depth']),
    min_samples_split=2,
    max_features=best_params['max_features'],
    max_samples=best_params['max_samples'],
    ccp_alpha=best_params['ccp_alpha'],
    random_state=42,
    n_jobs=-1
)

final_model.fit(X_final, y_final)

In [19]:
# save models and info
mlf.save_models(
    "channel_run",
    f"{lookback}",
    selection_method,
    num_pairs,
    side,
    timeframe,
    data_len,
    final_features,
    pairs,
    final_model,
    scaler,
    len(X_final)
)