# Imports
* Pandas, Sckit-Learn ("SKLEARN"), Catboost, Matplotlib, Keras, etc... 

In [1]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score,balanced_accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import os
import argparse
import numpy as np
import pandas as pd
import sys
import pandas as pd
from catboost import CatBoostRegressor
from catboost import Pool
import matplotlib.pyplot as plt
import keras
from IPython.core.display import HTML
from sklearn.metrics import roc_auc_score, log_loss
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import keras
from keras import layers
from keras import ops
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

HTML("""
<style>
.container { width:100% !important; }
</style>
""")


# Regression Analysis in Finance
## Goal: Predict "Simple Returns" or "Log Returns"

In this video we will focus on Simple Returns:

**Simple Returns:** $$\text{Simple Returns} = \frac{S_{t+1} - S_t}{S_t}$$

**Returns and Log Returns**
- **Returns**: The percentage change in the value of a financial asset over a specified period. It is calculated as `(Current Price - Previous Price) / Previous Price`.
- **Log Returns**: The natural logarithm of the ratio of the current price to the previous price. Calculated as `log(Current Price / Previous Price)`. Log returns are useful for mathematical simplicity and are more symmetric compared to simple returns, making them preferable in certain financial analyses.

**Mean Squared Error (MSE)**
* *Definition*: Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.
* *Formula*: `MSE = 1/N * sum((y_i - y_hat_i)^2)`
    - Where:
        - `N` is the number of samples
        - `y_i` is the actual value
        - `y_hat_i` is the predicted value.
* *Interpretation*: A lower MSE indicates a model that is closer to the true data points. The scale of MSE is dependent on the data and thus it can be hard to interpret the raw numbers without context.

**Root Mean Squared Error (RMSE)**
* *Definition*: The square root of the mean squared error.
* *Formula*: `RMSE = sqrt(MSE)`
    - Provides a measure of the average magnitude of the error.
* *Interpretation*: RMSE improves interpretability of MSE by scaling it back to the original data units, making it more understandable.

**Mean Absolute Error (MAE)**
* *Definition*: Measures the average magnitude of the errors in a set of predictions, without considering their direction.
* *Formula*: `MAE = 1/N * sum(|y_i - y_hat_i|)`
* *Interpretation*: Provides an average of the absolute errors between predicted and actual values. Lower values indicate better fit.

**R-squared (R²)**
* *Definition*: Represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
* *Formula*: `R² = 1 - (Residual Sum of Squares / Total Sum of Squares)`
* *Interpretation*: Values range from 0 to 1. A higher R² indicates a better fit between the model and the observed data.

**Adjusted R-squared**
* *Definition*: Adjusted for the number of predictors in the model, providing a more accurate measure in the context of multiple regression.
* *Formula*: Adjusts R² to reflect the number of predictors.
* *Interpretation*: Compensates for the addition of variables that do not improve the model, offering a more precise measure of the goodness of fit.

**Information Coefficient (IC)**
* *Definition*: A measure used in finance to evaluate the predictive skill of a forecast model, particularly in the context of predicting returns.
* *Formula*: `IC = correlation(predicted returns, actual returns)`
* *Interpretation*: Ranges from -1 to 1. A positive IC indicates that the model predictions are directionally correct with respect to actual outcomes. A higher absolute value of IC suggests a stronger predictive capability.

**Benchmark for Random Guessing in Finance**
* *Purpose*: Provides a reference point to evaluate the performance of financial models.
* *Example*: A model predicting returns could be benchmarked against a baseline assumption of zero return, which might represent a naive guess or the average historical return of the market.

**Advantages of Using Various Metrics**
* *Comprehensive Evaluation*: Different metrics provide insights into various aspects of the model's performance, such as accuracy, error magnitude, and directionality.
* *Suitability for Financial Data*: Financial data often exhibits characteristics like volatility and non-normal distribution; certain metrics might be more robust or informative under these conditions.

**Significance in Finance**
* *Risk and Return Analysis*: Regression metrics are critical in quantifying the accuracy and reliability of predictions related to financial returns, essential for risk management and investment strategy.
* *Model Validation*: Rigorous evaluation using these metrics ensures the robustness and reliability of predictive models in finance, guiding investment decisions and strategy formulation.


### Regression Report

In [2]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats import pearsonr
import numpy as np

def get_regression_report(y_true, y_pred, comparison_baseline=0):
    """
    Generate a report of regression metrics including MSE, RMSE, MAE, R-squared, Information Coefficient (IC),
    and optionally compare against a given baseline prediction.
    
    Parameters:
    - y_true: array-like of shape (n_samples,) Ground truth (correct) target values.
    - y_pred: array-like of shape (n_samples,) Estimated target values.
    - comparison_baseline: Optional; a constant value to predict for all instances for baseline comparison.
    """
    
    # Calculate MSE
    mse = mean_squared_error(y_true, y_pred)
    print(f"Mean Squared Error (MSE): {mse}")
    
    # Calculate RMSE
    rmse = np.sqrt(mse)
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    
    # Calculate MAE
    mae = mean_absolute_error(y_true, y_pred)
    print(f"Mean Absolute Error (MAE): {mae}")
    
    # Calculate R-squared
    r_squared = r2_score(y_true, y_pred)
    print(f"R-squared: {r_squared}")


    if comparison_baseline is not None:
        # Create a baseline prediction array
        baseline_pred = np.full_like(y_true, fill_value=comparison_baseline)
        
        # Calculate baseline metrics
        baseline_mse = mean_squared_error(y_true, baseline_pred)
        baseline_rmse = np.sqrt(baseline_mse)
        baseline_mae = mean_absolute_error(y_true, baseline_pred)
        baseline_r_squared = r2_score(y_true, baseline_pred)
        
        # Calculate baseline IC
        baseline_ic, _ = pearsonr(y_true, baseline_pred)
        
        # Print comparison
        print("\nBaseline Comparison (using constant prediction):")
        print(f"Baseline Mean Squared Error (MSE): {baseline_mse}")
        print(f"Baseline Root Mean Squared Error (RMSE): {baseline_rmse}")
        print(f"Baseline Mean Absolute Error (MAE): {baseline_mae}")
        print(f"Baseline R-squared: {baseline_r_squared}")


### Load 5 minute bars
* >500 just removes the one day bug in data on day of stock split they forgot to divide
* Only take regular trading hours

In [3]:
os.chdir('C:\\Users\\adidr\\OneDrive\\Desktop\\ib-tutorials')
df = pd.read_csv('tsla_5mins_fixed.csv',index_col='date',parse_dates=['date']).sort_index()
 

df=df.between_time('9:30','15:59:55')

df

Unnamed: 0_level_0,Open,High,Low,Close,Volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-01-04 09:30:00,239.8333,242.7500,239.0600,242.6833,5496375.0
2021-01-04 09:35:00,242.7100,243.6267,241.8600,242.2833,3742554.0
2021-01-04 09:40:00,242.4200,247.6167,242.3100,247.5333,4520967.0
2021-01-04 09:45:00,247.5100,247.9200,246.4700,246.9067,3496944.0
2021-01-04 09:50:00,246.8467,247.2667,245.0733,245.7933,2272224.0
...,...,...,...,...,...
2023-11-22 15:35:00,233.3200,233.4900,233.2100,233.4500,600678.0
2023-11-22 15:40:00,233.4600,233.7000,233.2700,233.6900,973646.0
2023-11-22 15:45:00,233.6900,234.1500,233.5800,234.0300,994516.0
2023-11-22 15:50:00,234.0200,234.1800,233.6900,233.9500,1005743.0


### Create Regression Label for Next 5 min bars
* We will use ordinary returns

In [4]:
df['LABEL']=(df['Close'].shift(-1)-df.Close)/df.Close

### Extract Features with Pandas TA

In [5]:
import pandas_ta as ta
def extract_features(df):
    df = df.copy()
    ta_list = [{"kind": "rsi"}, {"kind": "rsi", "length": 840},
               {"kind": "rsi", "length": 70}, {"kind": "ppo"}, {"kind": "ppo", "fast": 40, "slow": 200, "signal": 12},
               {"kind": "log_return", "length": 840}, {"kind": "log_return", "length": 1},
               {"kind": "log_return", "length": 2}, {"kind": "log_return", "length": 3},
               {"kind": "aroon"},
               {"kind": "bop"},
               {"kind": "mfi"}, {"kind": "mfi", "length": 70}, {"kind": "vwap", "length": 14},
               {"kind": "vwap", "length": 200},
               {"kind": "mfi"}, {"kind": "mfi", "length": 840},{'kind':'kvo'},
               {"kind": "mfi"},
               {"kind": "adx"}, {"kind": "adx", "length": 70},
               {"kind": "adx"}, {"kind": "adx", "length": 840},
               {"kind": "adx"}, 
               {"kind": "natr"}, {"kind": "natr", "length": 70}, {"kind": "natr", "length": 840},
               {"kind": "squeeze_pro"},
               {"kind": "skew"}, {"kind": "skew", "length": 90}, {"kind": "skew", "length": 200},
               {"kind": "slope", "length": 1}, {"kind": "slope", "length": 60}, {"kind": "slope", "length": 300},
               {"kind": "slope", "length": 1200},
               {"kind": "slope", "length": 5}, {"kind": "slope", "length": 10}, {"kind": "slope", "length": 20},
               {"kind": "slope", "length": 70},
               {"kind": "log_return", "length": 1}, {"kind": "log_return", "length": 80},
               {"kind": "log_return", "length": 3}, {"kind": "log_return", "length": 4},
               {"kind": "log_return", "length": 5},
               {"kind": "log_return", "length": 10}, {"kind": "log_return", "length": 60},
               {"kind": "log_return", "length": 300}, {"kind": "log_return", "length": 1200},{'kind':'adosc'},
               {'kind':'pvr'},{'kind':'pvt'},{'kind':'chop'},{'kind':'rsx'},{'kind':'rvgi'},{'kind':'uo'},{'kind':'tsi'}
               ]
    df['PREV_RETURN']=np.log(df['Close']/df['Close'].shift(1))
    ta_strategy = ta.Strategy(name="C", ta=ta_list)
    df.ta.strategy(ta_strategy)
    
    df[f"""serial_correlation_50_1"""]=df['PREV_RETURN'].rolling(window=50).apply(lambda x: x.autocorr(lag=1), raw=False)
    
    # Drop missing values
    df.dropna(axis=0,inplace=True)
    
    return df

df_with_features=extract_features(df)

df_with_features

Unnamed: 0_level_0,Open,High,Low,Close,Volume,LABEL,PREV_RETURN,RSI_14,RSI_840,RSI_70,...,PVR,PVT,CHOP_14_1_100,RSX_14,RVGI_14_4,RVGIs_14_4,UO_7_14_28,TSI_13_25_13,TSIs_13_25_13,serial_correlation_50_1
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2021-02-03 12:55:00,286.4500,286.8633,285.8833,286.1267,459699.0,0.000769,-0.001443,38.589439,51.500601,49.275659,...,4.0,8.803861e+07,46.669668,43.677265,-0.034304,0.104102,29.863009,-3.577866,-2.195433,-0.020455
2021-02-03 13:00:00,286.1133,286.9533,286.1033,286.3467,255252.0,0.000012,0.000769,40.865114,51.518911,49.560990,...,2.0,8.805823e+07,45.906915,36.725154,-0.128016,0.019149,28.282276,-5.574056,-2.678093,-0.017953
2021-02-03 13:05:00,286.3000,286.6233,286.0333,286.3500,132573.0,0.000408,0.000012,40.900491,51.519186,49.565307,...,2.0,8.805838e+07,44.408598,31.507992,-0.184191,-0.071943,30.175838,-7.219708,-3.326895,-0.015336
2021-02-03 13:10:00,286.3633,286.7033,286.3433,286.4667,115692.0,-0.000477,0.000407,42.217004,51.528917,49.719704,...,2.0,8.806310e+07,42.641712,28.070207,-0.213231,-0.145325,29.110498,-8.288995,-4.035767,-0.018006
2021-02-03 13:15:00,286.5000,286.7267,286.0033,286.3300,135681.0,-0.000570,-0.000477,41.063086,51.516790,49.539482,...,3.0,8.805662e+07,43.298426,25.518553,-0.226796,-0.191609,32.049469,-9.509211,-4.817687,-0.031670
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-22 15:30:00,233.0800,233.4300,233.0100,233.3100,691981.0,0.000600,0.000986,51.925258,50.569267,46.142918,...,1.0,9.422860e+08,54.969588,49.854528,-0.073070,-0.115703,60.141164,-4.119316,-5.710006,-0.179047
2023-11-22 15:35:00,233.3200,233.4900,233.2100,233.4500,600678.0,0.001028,0.000600,53.352797,50.586787,46.412396,...,2.0,9.423221e+08,53.069949,52.551986,-0.025758,-0.088811,59.429339,-2.570393,-5.261490,-0.181043
2023-11-22 15:40:00,233.4600,233.7000,233.2700,233.6900,973646.0,0.001455,0.001028,55.777095,50.616828,46.874686,...,1.0,9.424222e+08,48.674596,55.741677,0.025860,-0.047715,65.324600,-0.558022,-4.589566,-0.177164
2023-11-22 15:45:00,233.6900,234.1500,233.5800,234.0300,994516.0,-0.000342,0.001454,59.025889,50.659374,47.525290,...,1.0,9.425669e+08,39.471512,59.718936,0.105390,0.005421,71.625497,2.135027,-3.628910,-0.171414


### Split over time to train/test

In [6]:
split_fraction = 0.6
train_split = int(split_fraction * int(df_with_features.shape[0]))

train = df_with_features.iloc[:train_split,:]
test = df_with_features.iloc[train_split:,:]

print("TRAIN DATE RANGE",train.index.min(),train.index.max())
print("TEST DATE RANGE",test.index.min(),test.index.max())

TRAIN DATE RANGE 2021-02-03 12:55:00 2022-10-10 12:00:00
TEST DATE RANGE 2022-10-10 12:05:00 2023-11-22 15:50:00


### Define Feature Columns to Use

In [7]:
feature_cols=['RSI_14', 'RSI_840',''
       'RSI_70', 'PPO_12_26_9', 'PPOh_12_26_9', 'PPOs_12_26_9',
       'PPO_40_200_12', 'PPOh_40_200_12', 'PPOs_40_200_12', 'LOGRET_840',
       'LOGRET_1', 
              'LOGRET_2', 'LOGRET_3', 'AROOND_14', 'AROONU_14',
       'AROONOSC_14', 
              'BOP', 'MFI_14','serial_correlation_50_1']

 
    

### Decision Tree

In [8]:
reg = tree.DecisionTreeRegressor(max_depth=6)
reg = reg.fit(train[feature_cols], train['LABEL'])

In [9]:
preds = reg.predict(test[feature_cols])
get_regression_report(test.LABEL,preds)

Mean Squared Error (MSE): 1.6765344847466015e-05
Root Mean Squared Error (RMSE): 0.004094550628269971
Mean Absolute Error (MAE): 0.0023935647084233585
R-squared: -0.06640980591678058

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.57213945807666e-05
Baseline Root Mean Squared Error (RMSE): 0.003965021384654388
Baseline Mean Absolute Error (MAE): 0.0023566963917073256
Baseline R-squared: -6.232866151645311e-06




### Random Forest

In [10]:
reg = RandomForestRegressor(n_estimators=100,max_depth=5)
reg = reg.fit(train[feature_cols], train['LABEL'])


In [11]:
preds = reg.predict(test[feature_cols])
get_regression_report(test.LABEL,preds)

Mean Squared Error (MSE): 1.626612917011525e-05
Root Mean Squared Error (RMSE): 0.004033128955304461
Mean Absolute Error (MAE): 0.002403405477466389
R-squared: -0.03465570253043082

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.57213945807666e-05
Baseline Root Mean Squared Error (RMSE): 0.003965021384654388
Baseline Mean Absolute Error (MAE): 0.0023566963917073256
Baseline R-squared: -6.232866151645311e-06




### Catboost

In [12]:
reg=   CatBoostRegressor(learning_rate=0.0001,random_seed=1213)

test_pool = Pool(
        data=test[feature_cols],
        label=test.LABEL
)
reg.fit(train[feature_cols], train['LABEL'], use_best_model=False, eval_set=test_pool, plot=False)#,sample_weight=train['SAMPLE_WEIGHT'])

0:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 148ms	remaining: 2m 28s
1:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 160ms	remaining: 1m 20s
2:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 172ms	remaining: 57.2s
3:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 185ms	remaining: 46s
4:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 198ms	remaining: 39.4s
5:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 210ms	remaining: 34.8s
6:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 220ms	remaining: 31.2s
7:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 230ms	remaining: 28.6s
8:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 240ms	remaining: 26.4s
9:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (0)	total: 249ms	remaining: 24.7s
10:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (10)	total: 256ms	remaining: 23s
11:	learn: 0.0041211	test: 0.0039650	best: 0.0039650 (

111:	learn: 0.0041208	test: 0.0039650	best: 0.0039650 (95)	total: 1.18s	remaining: 9.38s
112:	learn: 0.0041208	test: 0.0039650	best: 0.0039650 (95)	total: 1.19s	remaining: 9.36s
113:	learn: 0.0041208	test: 0.0039650	best: 0.0039650 (113)	total: 1.2s	remaining: 9.34s
114:	learn: 0.0041208	test: 0.0039650	best: 0.0039650 (113)	total: 1.21s	remaining: 9.33s
115:	learn: 0.0041208	test: 0.0039650	best: 0.0039650 (113)	total: 1.22s	remaining: 9.31s
116:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (116)	total: 1.23s	remaining: 9.29s
117:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (116)	total: 1.24s	remaining: 9.28s
118:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (116)	total: 1.25s	remaining: 9.26s
119:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (116)	total: 1.26s	remaining: 9.24s
120:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (120)	total: 1.27s	remaining: 9.22s
121:	learn: 0.0041207	test: 0.0039650	best: 0.0039650 (120)	total: 1.28s	remaining: 9.2s
122:	learn: 0.

221:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.22s	remaining: 7.78s
222:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.23s	remaining: 7.77s
223:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.24s	remaining: 7.75s
224:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.25s	remaining: 7.74s
225:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.25s	remaining: 7.72s
226:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.26s	remaining: 7.71s
227:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.27s	remaining: 7.7s
228:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.28s	remaining: 7.69s
229:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.29s	remaining: 7.68s
230:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.3s	remaining: 7.66s
231:	learn: 0.0041204	test: 0.0039650	best: 0.0039650 (200)	total: 2.31s	remaining: 7.65s
232:	learn: 

334:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (334)	total: 3.25s	remaining: 6.45s
335:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (335)	total: 3.26s	remaining: 6.44s
336:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (336)	total: 3.27s	remaining: 6.43s
337:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (337)	total: 3.28s	remaining: 6.42s
338:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (337)	total: 3.29s	remaining: 6.42s
339:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (339)	total: 3.3s	remaining: 6.41s
340:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (340)	total: 3.31s	remaining: 6.4s
341:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (340)	total: 3.32s	remaining: 6.39s
342:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (340)	total: 3.33s	remaining: 6.38s
343:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (340)	total: 3.34s	remaining: 6.38s
344:	learn: 0.0041201	test: 0.0039650	best: 0.0039650 (340)	total: 3.35s	remaining: 6.37s
345:	learn: 

439:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (437)	total: 4.29s	remaining: 5.46s
440:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (437)	total: 4.3s	remaining: 5.45s
441:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (441)	total: 4.31s	remaining: 5.44s
442:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (441)	total: 4.32s	remaining: 5.43s
443:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (443)	total: 4.33s	remaining: 5.42s
444:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (443)	total: 4.34s	remaining: 5.41s
445:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (443)	total: 4.35s	remaining: 5.4s
446:	learn: 0.0041198	test: 0.0039650	best: 0.0039650 (443)	total: 4.36s	remaining: 5.39s
447:	learn: 0.0041197	test: 0.0039650	best: 0.0039650 (443)	total: 4.37s	remaining: 5.38s
448:	learn: 0.0041197	test: 0.0039650	best: 0.0039650 (448)	total: 4.38s	remaining: 5.37s
449:	learn: 0.0041197	test: 0.0039650	best: 0.0039650 (448)	total: 4.39s	remaining: 5.36s
450:	learn: 

545:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.32s	remaining: 4.42s
546:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.33s	remaining: 4.41s
547:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.34s	remaining: 4.4s
548:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.35s	remaining: 4.39s
549:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.36s	remaining: 4.38s
550:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (532)	total: 5.37s	remaining: 4.37s
551:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (551)	total: 5.38s	remaining: 4.36s
552:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (551)	total: 5.39s	remaining: 4.35s
553:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (553)	total: 5.39s	remaining: 4.34s
554:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (554)	total: 5.41s	remaining: 4.33s
555:	learn: 0.0041194	test: 0.0039650	best: 0.0039650 (554)	total: 5.42s	remaining: 4.33s
556:	learn:

653:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.35s	remaining: 3.36s
654:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.36s	remaining: 3.35s
655:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.37s	remaining: 3.34s
656:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.38s	remaining: 3.33s
657:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.39s	remaining: 3.32s
658:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.4s	remaining: 3.31s
659:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.41s	remaining: 3.3s
660:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.42s	remaining: 3.29s
661:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.43s	remaining: 3.28s
662:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.44s	remaining: 3.27s
663:	learn: 0.0041191	test: 0.0039650	best: 0.0039650 (582)	total: 6.45s	remaining: 3.26s
664:	learn: 

751:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (747)	total: 7.38s	remaining: 2.43s
752:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (747)	total: 7.39s	remaining: 2.42s
753:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (747)	total: 7.4s	remaining: 2.42s
754:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (747)	total: 7.42s	remaining: 2.41s
755:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.43s	remaining: 2.4s
756:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.44s	remaining: 2.39s
757:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.45s	remaining: 2.38s
758:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.46s	remaining: 2.37s
759:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.47s	remaining: 2.36s
760:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.47s	remaining: 2.35s
761:	learn: 0.0041188	test: 0.0039650	best: 0.0039650 (755)	total: 7.49s	remaining: 2.34s
762:	learn: 

862:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (861)	total: 8.41s	remaining: 1.33s
863:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (861)	total: 8.42s	remaining: 1.32s
864:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (864)	total: 8.43s	remaining: 1.31s
865:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (865)	total: 8.43s	remaining: 1.3s
866:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (866)	total: 8.44s	remaining: 1.29s
867:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (867)	total: 8.45s	remaining: 1.28s
868:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (868)	total: 8.46s	remaining: 1.27s
869:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (868)	total: 8.47s	remaining: 1.26s
870:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (868)	total: 8.48s	remaining: 1.25s
871:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (868)	total: 8.49s	remaining: 1.25s
872:	learn: 0.0041185	test: 0.0039650	best: 0.0039650 (868)	total: 8.49s	remaining: 1.24s
873:	learn:

954:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (953)	total: 9.23s	remaining: 435ms
955:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (955)	total: 9.23s	remaining: 425ms
956:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (956)	total: 9.24s	remaining: 415ms
957:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (957)	total: 9.25s	remaining: 406ms
958:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (957)	total: 9.26s	remaining: 396ms
959:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (957)	total: 9.27s	remaining: 386ms
960:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (960)	total: 9.28s	remaining: 377ms
961:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (961)	total: 9.29s	remaining: 367ms
962:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (962)	total: 9.29s	remaining: 357ms
963:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (963)	total: 9.3s	remaining: 347ms
964:	learn: 0.0041182	test: 0.0039650	best: 0.0039650 (963)	total: 9.31s	remaining: 338ms
965:	learn:

<catboost.core.CatBoostRegressor at 0x223e00de530>

In [13]:
preds = reg.predict(test[feature_cols])
get_regression_report(test.LABEL,preds)

Mean Squared Error (MSE): 1.5720857433077337e-05
Root Mean Squared Error (RMSE): 0.003964953648288632
Mean Absolute Error (MAE): 0.002356667514547424
R-squared: 2.7934016084696367e-05

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.57213945807666e-05
Baseline Root Mean Squared Error (RMSE): 0.003965021384654388
Baseline Mean Absolute Error (MAE): 0.0023566963917073256
Baseline R-squared: -6.232866151645311e-06




### Neural Networks

In [14]:
model = keras.Sequential(
    [
        layers.Dense(64, activation="relu", name="layer1"),
        layers.Dense(64,activation='relu',name='layer2'),
        layers.Dense(1, activation='linear',name="output"),
    ]
)

#### For NN we need to standardize our data

In [15]:
scaler = StandardScaler()
train_X=scaler.fit_transform(train[feature_cols])
test_X=scaler.transform(test[feature_cols])

In [16]:
optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(
    optimizer=optimizer, loss="mse", metrics=["mse"]
)

model.fit(
    x=train_X,
    y=train['LABEL'],
    batch_size=32,
    epochs=20,
    verbose=2,
    validation_data=(test_X,test['LABEL'])
)


Epoch 1/20
1034/1034 - 4s - 4ms/step - loss: 0.0033 - mse: 0.0033 - val_loss: 7.4457e-04 - val_mse: 7.4470e-04
Epoch 2/20
1034/1034 - 2s - 2ms/step - loss: 4.2980e-04 - mse: 4.2982e-04 - val_loss: 3.9564e-04 - val_mse: 3.9571e-04
Epoch 3/20
1034/1034 - 3s - 3ms/step - loss: 1.7454e-04 - mse: 1.7464e-04 - val_loss: 1.2957e-04 - val_mse: 1.2959e-04
Epoch 4/20
1034/1034 - 3s - 3ms/step - loss: 8.8693e-05 - mse: 8.8719e-05 - val_loss: 6.8930e-05 - val_mse: 6.8942e-05
Epoch 5/20
1034/1034 - 3s - 3ms/step - loss: 5.3156e-05 - mse: 5.3137e-05 - val_loss: 4.8630e-05 - val_mse: 4.8639e-05
Epoch 6/20
1034/1034 - 3s - 3ms/step - loss: 4.1720e-05 - mse: 4.1730e-05 - val_loss: 3.3561e-05 - val_mse: 3.3568e-05
Epoch 7/20
1034/1034 - 2s - 2ms/step - loss: 3.7094e-05 - mse: 3.7113e-05 - val_loss: 2.0944e-05 - val_mse: 2.0949e-05
Epoch 8/20
1034/1034 - 3s - 3ms/step - loss: 2.5774e-05 - mse: 2.5780e-05 - val_loss: 2.9960e-05 - val_mse: 2.9965e-05
Epoch 9/20
1034/1034 - 3s - 2ms/step - loss: 2.3018e-05 

<keras.src.callbacks.history.History at 0x223e01aaaa0>

In [17]:
#pearsonr(y_true, baseline_pred)
np.isnan(preds).mean()

0.0

In [18]:
preds = model.predict(test_X)
get_regression_report(test['LABEL'],preds)

[1m689/689[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step
Mean Squared Error (MSE): 1.6153888984135274e-05
Root Mean Squared Error (RMSE): 0.004019190090569899
Mean Absolute Error (MAE): 0.002411486468059978
R-squared: -0.027516330448557724

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.57213945807666e-05
Baseline Root Mean Squared Error (RMSE): 0.003965021384654388
Baseline Mean Absolute Error (MAE): 0.0023566963917073256
Baseline R-squared: -6.232866151645311e-06




### RNNs like LSTMs
* We can also try "TRANSFORMERS"

In [19]:
def create_sequences(X, y, sequence_length):
    X_seqs, y_seqs = [], []
    for i in range(len(X) - sequence_length):
        X_seqs.append(X[i:i+sequence_length])
        y_seqs.append(y[i+sequence_length])
    return np.array(X_seqs), np.array(y_seqs)
sequence_length = 12  # For example

# Assuming train_X and test_X are already scaled and feature_cols are selected
train_X_seq, train_y_seq = create_sequences(train_X, train['LABEL'].values, sequence_length)
test_X_seq, test_y_seq = create_sequences(test_X, test['LABEL'].values, sequence_length)


In [20]:
from tensorflow.keras import layers, Sequential
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler

model = Sequential([
    # LSTM layer
    layers.LSTM(64, input_shape=(sequence_length, len(feature_cols)), name="lstm_layer"),
    # You might need to adjust the number of units in LSTM and Dense layers
    layers.Dense(32,activation='relu'),
    layers.Dense(1, activation='linear', name="output")
])

# Compilation remains the same
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss="mse")

  super().__init__(**kwargs)


In [21]:
# Model fitting
model.fit(
    x=train_X_seq,
    y=train_y_seq,
    batch_size=32,
    epochs=20,
    verbose=2,
    validation_data=(test_X_seq, test_y_seq)
)

Epoch 1/20
1033/1033 - 12s - 12ms/step - loss: 0.0010 - val_loss: 1.7071e-04
Epoch 2/20
1033/1033 - 9s - 9ms/step - loss: 8.7906e-05 - val_loss: 6.9226e-05
Epoch 3/20
1033/1033 - 9s - 9ms/step - loss: 4.6291e-05 - val_loss: 4.5930e-05
Epoch 4/20
1033/1033 - 10s - 9ms/step - loss: 3.4388e-05 - val_loss: 2.9402e-05
Epoch 5/20
1033/1033 - 9s - 9ms/step - loss: 2.4026e-05 - val_loss: 1.9902e-05
Epoch 6/20
1033/1033 - 9s - 9ms/step - loss: 1.8989e-05 - val_loss: 1.6464e-05
Epoch 7/20
1033/1033 - 9s - 9ms/step - loss: 1.7456e-05 - val_loss: 1.5937e-05
Epoch 8/20
1033/1033 - 9s - 9ms/step - loss: 1.7282e-05 - val_loss: 1.5878e-05
Epoch 9/20
1033/1033 - 10s - 10ms/step - loss: 1.7324e-05 - val_loss: 1.5999e-05
Epoch 10/20
1033/1033 - 9s - 9ms/step - loss: 1.7269e-05 - val_loss: 1.7279e-05
Epoch 11/20
1033/1033 - 9s - 9ms/step - loss: 1.7432e-05 - val_loss: 1.6610e-05
Epoch 12/20
1033/1033 - 9s - 9ms/step - loss: 1.7385e-05 - val_loss: 1.5721e-05
Epoch 13/20
1033/1033 - 9s - 9ms/step - loss: 1.

<keras.src.callbacks.history.History at 0x223e63e7010>

In [22]:
preds = model.predict(test_X_seq)
get_regression_report(test_y_seq,preds)

[1m689/689[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step
Mean Squared Error (MSE): 1.59010646021985e-05
Root Mean Squared Error (RMSE): 0.003987613898335508
Mean Absolute Error (MAE): 0.0023950085120863045
R-squared: -0.011029216429802924

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.5727708124668302e-05
Baseline Root Mean Squared Error (RMSE): 0.0039658174598269525
Baseline Mean Absolute Error (MAE): 0.002357138391258099
Baseline R-squared: -6.780635399827872e-06




### RNN with OHLCV

In [23]:
feature_cols=['RSI_14', 'RSI_840',''
       'RSI_70', 'PPO_12_26_9', 'PPOh_12_26_9', 'PPOs_12_26_9',
       'PPO_40_200_12', 'PPOh_40_200_12', 'PPOs_40_200_12', 'LOGRET_840',
       'LOGRET_1', 
              'LOGRET_2', 'LOGRET_3', 'AROOND_14', 'AROONU_14',
       'AROONOSC_14', 
              'BOP', 'MFI_14','serial_correlation_50_1']

ohlcv_cols=['Open','High','Low','Close','Volume']


feature_cols=feature_cols+ohlcv_cols


In [24]:
scaler = StandardScaler()
train_X=scaler.fit_transform(train[feature_cols])
test_X=scaler.transform(test[feature_cols])

In [25]:
# Model fitting
model.fit(
    x=train_X_seq,
    y=train_y_seq,
    batch_size=32,
    epochs=20,
    verbose=2,
    validation_data=(test_X_seq, test_y_seq)
)

Epoch 1/20
1033/1033 - 11s - 10ms/step - loss: 1.7312e-05 - val_loss: 1.5743e-05
Epoch 2/20
1033/1033 - 10s - 10ms/step - loss: 1.7295e-05 - val_loss: 1.5792e-05
Epoch 3/20
1033/1033 - 10s - 10ms/step - loss: 1.7341e-05 - val_loss: 1.5957e-05
Epoch 4/20
1033/1033 - 10s - 10ms/step - loss: 1.7253e-05 - val_loss: 1.5721e-05
Epoch 5/20
1033/1033 - 10s - 10ms/step - loss: 1.7367e-05 - val_loss: 1.5975e-05
Epoch 6/20
1033/1033 - 10s - 9ms/step - loss: 1.7247e-05 - val_loss: 1.5777e-05
Epoch 7/20
1033/1033 - 10s - 10ms/step - loss: 1.7259e-05 - val_loss: 1.6062e-05
Epoch 8/20
1033/1033 - 9s - 9ms/step - loss: 1.7293e-05 - val_loss: 1.5758e-05
Epoch 9/20
1033/1033 - 10s - 9ms/step - loss: 1.7273e-05 - val_loss: 1.5816e-05
Epoch 10/20
1033/1033 - 10s - 10ms/step - loss: 1.7301e-05 - val_loss: 1.5731e-05
Epoch 11/20
1033/1033 - 9s - 9ms/step - loss: 1.7308e-05 - val_loss: 1.6221e-05
Epoch 12/20
1033/1033 - 9s - 9ms/step - loss: 1.7268e-05 - val_loss: 1.5921e-05
Epoch 13/20
1033/1033 - 10s - 10m

<keras.src.callbacks.history.History at 0x223e616a830>

In [26]:
preds = model.predict(test_X_seq)
get_regression_report(test_y_seq,preds)

[1m689/689[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step
Mean Squared Error (MSE): 1.5730106920932906e-05
Root Mean Squared Error (RMSE): 0.003966119882319861
Mean Absolute Error (MAE): 0.0023565877973210246
R-squared: -0.00015930206515624157

Baseline Comparison (using constant prediction):
Baseline Mean Squared Error (MSE): 1.5727708124668302e-05
Baseline Root Mean Squared Error (RMSE): 0.0039658174598269525
Baseline Mean Absolute Error (MAE): 0.002357138391258099
Baseline R-squared: -6.780635399827872e-06


