## Optimization

### Introduction

While our initial trading strategy has shown promising results, the fact that the strategy is winning and outperforming the benchmark by trading all signals within a given quantile provides us with a solid foundation to build upon. This sets up a classic classification problem where our goal is to refine our strategy further. Our benchmark for success is to deliver better results than simply trading all signals within the selected quantiles.

### Key Focus Areas

### Key Focus Areas

1. **Feature Engineering**:
   - Enhance the dataset with additional features such as lagged values, moving averages, and volatility measures.
   - These features can provide more context to the signals and help improve the accuracy of our predictions.

2. **Classification Algorithms**:
   - Use machine learning algorithms to classify signals more effectively.
   - Evaluate different models (e.g., logistic regression, decision trees, random forests) to identify the best-performing classifier.

3. **Hyperparameter Tuning**:
   - Fine-tune the hyperparameters of our models to achieve optimal performance.
   - Use techniques such as grid search or random search to explore different parameter combinations.

4. **Model Evaluation**:
   - Assess the performance of the optimized strategy using metrics such as accuracy, precision, recall, F1-score, and Sharpe ratio.
   - Compare the optimized strategy to the initial strategy and the baseline to measure improvement.

### Approach

We will begin by enhancing our dataset through feature engineering. Next, we will train and evaluate various classification models to predict the trading signals. Finally, we will perform hyperparameter tuning to further refine our models and achieve the best possible performance. Throughout this process, we will compare the results to our initial strategy to ensure that each optimization step adds value.

### Implementation



### Feature Engineering

In [1575]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import SVC




In [1576]:

data = pd.read_csv("../data/processed/processed_data.csv")

data['ema_3'] = data['equity_curve'].ewm(span=3, adjust=False).mean()
data['ema_5'] = data['equity_curve'].ewm(span=5, adjust=False).mean()
data['ema_7'] = data['equity_curve'].ewm(span=7, adjust=False).mean()
data['ema_10'] = data['equity_curve'].ewm(span=10, adjust=False).mean()
data['ema_20'] = data['equity_curve'].ewm(span=20, adjust=False).mean()

data['ema_3_7_crossover'] = (data['ema_3'] > data['ema_7']).astype(int)
data['ema_5_10_crossover'] = (data['ema_5'] > data['ema_10']).astype(int)
data['ema_5_20_crossover'] = (data['ema_5'] > data['ema_20']).astype(int)
data['ema_10_20_crossover'] = (data['ema_10'] > data['ema_20']).astype(int)

data['sma_3'] = data['equity_curve'].rolling(window=3).mean()
data['sma_5'] = data['equity_curve'].rolling(window=5).mean()
data['sma_7'] = data['equity_curve'].rolling(window=7).mean()
data['sma_10'] = data['equity_curve'].rolling(window=10).mean()
data['sma_20'] = data['equity_curve'].rolling(window=20).mean()

data['sma_3_7_crossover'] = (data['sma_3'] > data['sma_7']).astype(int)
data['sma_5_10_crossover'] = (data['sma_5'] > data['sma_10']).astype(int)
data['sma_10_20_crossover'] = (data['sma_10'] > data['sma_20']).astype(int)
data['sma_5_20_crossover'] = (data['sma_5'] > data['sma_20']).astype(int)

data = data.dropna()

data.to_csv("../data/processed/feature_engineered_data.csv", index=False)

print(data.head(5))

    Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
19          20  0.005743      0.362728        0.000544  0.363872  0.363505   
20          21 -0.005063      0.362925        0.001600  0.363399  0.363312   
21          22 -0.008720      0.363506       -0.012017  0.363452  0.363376   
22          23 -0.003324      0.359138        0.022483  0.361295  0.361964   
23          24 -0.006281      0.367212        0.003951  0.364254  0.363713   

       ema_7    ema_10    ema_20  ema_3_7_crossover  ...  ema_10_20_crossover  \
19  0.363145  0.362914  0.363529                  1  ...                    0   
20  0.363090  0.362916  0.363472                  1  ...                    0   
21  0.363194  0.363023  0.363475                  1  ...                    0   
22  0.362180  0.362317  0.363062                  0  ...                    0   
23  0.363438  0.363207  0.363457                  1  ...                    0   

       sma_3     sma_5     sma_7    sma_10  

In [1577]:
def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    """
    Calculate MACD and Signal Line
    """
    # Calculate the short-term EMA (typically 12 periods)
    data['ema_short'] = data['equity_curve'].ewm(span=short_window, adjust=False).mean()
    
    # Calculate the long-term EMA (typically 26 periods)
    data['ema_long'] = data['equity_curve'].ewm(span=long_window, adjust=False).mean()
    
    # Calculate the MACD Line
    data['macd'] = data['ema_short'] - data['ema_long']
    
    # Calculate the Signal Line
    data['macd_signal'] = data['macd'].ewm(span=signal_window, adjust=False).mean()
    
    # Calculate the MACD Histogram
    data['macd_hist'] = data['macd'] - data['macd_signal']
    
    return data

data = calculate_macd(data)
data = data.drop(columns=['ema_short', 'ema_long'])
print(data.head())

    Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
19          20  0.005743      0.362728        0.000544  0.363872  0.363505   
20          21 -0.005063      0.362925        0.001600  0.363399  0.363312   
21          22 -0.008720      0.363506       -0.012017  0.363452  0.363376   
22          23 -0.003324      0.359138        0.022483  0.361295  0.361964   
23          24 -0.006281      0.367212        0.003951  0.364254  0.363713   

       ema_7    ema_10    ema_20  ema_3_7_crossover  ...     sma_7    sma_10  \
19  0.363145  0.362914  0.363529                  1  ...  0.362221  0.362065   
20  0.363090  0.362916  0.363472                  1  ...  0.362805  0.362330   
21  0.363194  0.363023  0.363475                  1  ...  0.363291  0.362500   
22  0.362180  0.362317  0.363062                  0  ...  0.363465  0.362112   
23  0.363438  0.363207  0.363457                  1  ...  0.363995  0.362949   

      sma_20  sma_3_7_crossover  sma_5_10_crossove

In [1578]:
def compute_rsi(data, window):
    delta = data['equity_curve'].diff(1)
    gain = (delta.where(delta > 0, 0)).rolling(window=window, min_periods=1).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window, min_periods=1).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['rsi_14'] = compute_rsi(data, 14)
data = data.dropna()
print(data.head())

    Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
20          21 -0.005063      0.362925        0.001600  0.363399  0.363312   
21          22 -0.008720      0.363506       -0.012017  0.363452  0.363376   
22          23 -0.003324      0.359138        0.022483  0.361295  0.361964   
23          24 -0.006281      0.367212        0.003951  0.364254  0.363713   
24          25 -0.004420      0.368663        0.017068  0.366458  0.365363   

       ema_7    ema_10    ema_20  ema_3_7_crossover  ...    sma_10    sma_20  \
20  0.363090  0.362916  0.363472                  1  ...  0.362330  0.362463   
21  0.363194  0.363023  0.363475                  1  ...  0.362500  0.362327   
22  0.362180  0.362317  0.363062                  0  ...  0.362112  0.361844   
23  0.363438  0.363207  0.363457                  1  ...  0.362949  0.361842   
24  0.364744  0.364199  0.363953                  1  ...  0.363805  0.362201   

    sma_3_7_crossover  sma_5_10_crossover  sma_10_

In [1579]:
train_data, test_data = train_test_split(data, test_size=0.3, shuffle=False)
base_line_test_data_raw = test_data.copy()

In [1580]:
def assign_baseline_test_data_trade_signal(data):
    data['trade_signal'] = np.where(data['signal'] < 0, -1, 1)
    return data

In [1581]:
base_line_test_data = assign_baseline_test_data_trade_signal(base_line_test_data_raw)
print("Base_line_test_data:")
print(base_line_test_data.head(5))
print(base_line_test_data.tail(5))

Base_line_test_data:
       Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
35003       35004  0.005905      0.778394       -0.012145  0.777618  0.775352   
35004       35005 -0.007350      0.768940       -0.004804  0.773279  0.773215   
35005       35006  0.006328      0.765246        0.000261  0.769262  0.770558   
35006       35007  0.008281      0.765445       -0.004526  0.767354  0.768854   
35007       35008 -0.024470      0.761981        0.002272  0.764668  0.766563   

          ema_7    ema_10    ema_20  ema_3_7_crossover  ...    sma_20  \
35003  0.772961  0.770344  0.767668                  1  ...  0.765034   
35004  0.771956  0.770089  0.767790                  1  ...  0.764933   
35005  0.770278  0.769208  0.767547                  0  ...  0.764601   
35006  0.769070  0.768524  0.767347                  0  ...  0.764124   
35007  0.767298  0.767335  0.766836                  0  ...  0.763827   

       sma_3_7_crossover  sma_5_10_crossover  sma_10_

In [1582]:
train_data['signal_quantile'] = pd.qcut(train_data['signal'], 10, labels=False)
quantile_summary = train_data.groupby('signal_quantile')['signal'].agg(['min', 'max']).reset_index()

bins = [-np.inf] + quantile_summary['max'].tolist() + [np.inf]
print("Bins:")
print(bins)

print("Quantile Summary:")
print(quantile_summary)

positive_signals_std = train_data[train_data['signal'] > 0]['signal'].std()
negative_signals_std = train_data[train_data['signal'] < 0]['signal'].std()

buy_quantiles = quantile_summary[quantile_summary['max'] >= positive_signals_std]['signal_quantile'].tolist()
sell_quantiles = quantile_summary[quantile_summary['min'] <= -negative_signals_std]['signal_quantile'].tolist()

print(f"Buy quantiles: {buy_quantiles}")
print(f"Sell quantiles: {sell_quantiles}")

Bins:
[-inf, -0.0126427120969868, -0.0083365484598235, -0.0053055001645362, -0.0026362883712754, -0.0001847297271523, 0.0022688327178055, 0.0049028381548227, 0.0079980306604964, 0.0123438110295806, 0.0372032712589485, inf]
Quantile Summary:
   signal_quantile       min       max
0                0 -0.041569 -0.012643
1                1 -0.012641 -0.008337
2                2 -0.008336 -0.005306
3                3 -0.005305 -0.002636
4                4 -0.002636 -0.000185
5                5 -0.000184  0.002269
6                6  0.002269  0.004903
7                7  0.004907  0.007998
8                8  0.007999  0.012344
9                9  0.012344  0.037203
Buy quantiles: [7, 8, 9]
Sell quantiles: [0, 1, 2]


In [1583]:
def assign_trading_signals(data, bins, buy_quantiles, sell_quantiles):
    data['signal_quantile'] = pd.cut(data['signal'], bins=bins, labels=False, include_lowest=True)
    
    data['trade_signal'] = 0

    data.loc[data['signal_quantile'].isin(buy_quantiles), 'trade_signal'] = 1
    data.loc[data['signal_quantile'].isin(sell_quantiles), 'trade_signal'] = -1

    return data

train_data_with_signals = assign_trading_signals(train_data, bins, buy_quantiles, sell_quantiles)
test_data_with_signals = assign_trading_signals(test_data, bins, buy_quantiles, sell_quantiles)

print(train_data.head(5))
print(test_data.head(5))

    Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
20          21 -0.005063      0.362925        0.001600  0.363399  0.363312   
21          22 -0.008720      0.363506       -0.012017  0.363452  0.363376   
22          23 -0.003324      0.359138        0.022483  0.361295  0.361964   
23          24 -0.006281      0.367212        0.003951  0.364254  0.363713   
24          25 -0.004420      0.368663        0.017068  0.366458  0.365363   

       ema_7    ema_10    ema_20  ema_3_7_crossover  ...  sma_3_7_crossover  \
20  0.363090  0.362916  0.363472                  1  ...                  1   
21  0.363194  0.363023  0.363475                  1  ...                  0   
22  0.362180  0.362317  0.363062                  0  ...                  0   
23  0.363438  0.363207  0.363457                  1  ...                  0   
24  0.364744  0.364199  0.363953                  1  ...                  1   

    sma_5_10_crossover  sma_10_20_crossover  sma_5_20_cr

In [1584]:

positive_quartile_filtered_train_data = train_data[train_data['signal_quantile'].isin([9])]
positive_quartile_filtered_train_data['profitable_trade'] = (positive_quartile_filtered_train_data['equity_returns'] > 0).astype(int)

print(positive_quartile_filtered_train_data.head(5))

    Unnamed: 0    signal  equity_curve  equity_returns     ema_3     ema_5  \
30          31  0.016719      0.371422        0.000480  0.371272  0.371161   
37          38  0.027988      0.371975        0.007997  0.370754  0.369551   
55          56  0.013490      0.364537        0.005577  0.367462  0.368871   
61          62  0.013657      0.380413       -0.007296  0.378678  0.376638   
73          74  0.024171      0.399530        0.000191  0.397002  0.394435   

       ema_7    ema_10    ema_20  ema_3_7_crossover  ...  sma_5_10_crossover  \
30  0.370654  0.369738  0.367736                  1  ...                   1   
37  0.369020  0.368610  0.367697                  1  ...                   0   
55  0.369651  0.370206  0.370295                  0  ...                   0   
61  0.375312  0.374139  0.372388                  1  ...                   1   
73  0.392150  0.389354  0.383510                  1  ...                   1   

    sma_10_20_crossover  sma_5_20_crossover      m

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  positive_quartile_filtered_train_data['profitable_trade'] = (positive_quartile_filtered_train_data['equity_returns'] > 0).astype(int)


### After much trial and error

I finally was able to get some machine learning models to return statistically significant results.
The largest quartile provided the only results that were statistically significat. 
This should give us an opportunity to at least demonstrate optimization. 

In [1585]:
features = [ 
    'macd',  
    'rsi_14',
    'ema_3_7_crossover',
    'sma_5_10_crossover',
]

# features = [
#     'equity_curve_lag_1',
#     'equity_curve_lag_2',
#     'equity_curve_lag_3',
#     'equity_curve_ma_3',
#     'equity_curve_ma_5',
#     'equity_curve_ma_10',
#     'sma_5',
#     'sma_10',
#     'sma_20',
#     'sma_5_10_crossover',
#     'sma_10_20_crossover',
#     'sma_5_20_crossover'
# ]

target = 'profitable_trade'
X = positive_quartile_filtered_train_data[features]
y = positive_quartile_filtered_train_data[target]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")



X_train shape: (2799, 4)
X_val shape: (700, 4)
y_train shape: (2799,)
y_val shape: (700,)


In [1586]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg.predict(X_val_scaled)

feature_names = X_train.columns  # Replace with your feature names if they are in a DataFrame
coefficients = log_reg.coef_[0] 

coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})

coeff_df['abs_coefficient'] = coeff_df['Coefficient'].abs()
coeff_df = coeff_df.sort_values(by='abs_coefficient', ascending=False)
print(coeff_df.drop(columns='abs_coefficient'))

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_val)

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_val)

              Feature  Coefficient
0                macd    -0.613080
1              rsi_14    -0.354487
2   ema_3_7_crossover    -0.137035
3  sma_5_10_crossover    -0.035336


In [1587]:
print("Logistic Regression Classification Report:")
print(classification_report(y_val, y_pred_log_reg))
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_log_reg))
print("Accuracy:", accuracy_score(y_val, y_pred_log_reg))

print("\nDecision Tree Classification Report:")
print(classification_report(y_val, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_dt))
print("Accuracy:", accuracy_score(y_val, y_pred_dt))

print("\nRandom Forest Classification Report:")
print(classification_report(y_val, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_rf))
print("Accuracy:", accuracy_score(y_val, y_pred_rf))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.49      0.52       345
           1       0.55      0.62      0.58       355

    accuracy                           0.55       700
   macro avg       0.55      0.55      0.55       700
weighted avg       0.55      0.55      0.55       700

Confusion Matrix:
[[169 176]
 [136 219]]
Accuracy: 0.5542857142857143

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.50      0.50       345
           1       0.51      0.51      0.51       355

    accuracy                           0.50       700
   macro avg       0.50      0.50      0.50       700
weighted avg       0.50      0.50      0.50       700

Confusion Matrix:
[[172 173]
 [174 181]]
Accuracy: 0.5042857142857143

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.4

In [1588]:
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_clf.fit(X_train_scaled, y_train)
y_pred_svm = svm_clf.predict(X_val_scaled)

print("SVM Classification Report:")
print(classification_report(y_val, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_svm))
print("Accuracy:", accuracy_score(y_val, y_pred_svm))

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.61      0.58       345
           1       0.58      0.51      0.54       355

    accuracy                           0.56       700
   macro avg       0.56      0.56      0.56       700
weighted avg       0.56      0.56      0.56       700

Confusion Matrix:
 [[212 133]
 [173 182]]
Accuracy: 0.5628571428571428


### Statistically significant results from SVM classification !!

The results show that we would have 394 winning trades using this SVM model over just trading all signals and only have 355 winning trades.
This is calculated buy taking the winnign trades chosen, and then also 

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.61      0.58       345
           1       0.58      0.51      0.54       355

    accuracy                           0.56       700
   macro avg       0.56      0.56      0.56       700
weighted avg       0.56      0.56      0.56       700

Confusion Matrix:
 [[212 133]
 [173 182]]
Accuracy: 0.5628571428571428


The following features:
     - macd
     - rsi_14
     - ema_3_7_crossover
     - sma_5_10_crossover

These should prove useful in at least being able to test out a model to optimize any signal that is in the quartile group 9

### We use a z statistic and a p-value to test to determine if the values we are returing are statistically significant.

In [1590]:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

wins_group1 = 394
wins_group2 = 355

trials_group1 = 700
trials_group2 = 700

count = np.array([wins_group1, wins_group2])
nobs = np.array([trials_group1, trials_group2])

stat, pval = proportions_ztest(count, nobs)

print(f"Z-statistic: {stat}")
print(f"P-value: {pval}")

if pval < 0.05:
    print("The difference is statistically significant.")
else:
    print("The difference is not statistically significant.")

Z-statistic: 2.0897638957788462
P-value: 0.036639014071567236
The difference is statistically significant.
