Given the OHLCV data from Binance, create all the needed features.

In [69]:
import ccxt
import pandas as pd
import polars as pl
import os
import numpy as np
import plotly.express as px

# GLOBAL VARIABLES
# TODO: switch to relative paths
TICKER_DATA_PATH = r"C:\Users\Damja\CODING_LOCAL\trading\data\ticker_specific_data_BINANCE"

ModuleNotFoundError: No module named 'plotly'

### TO-DO's
- create checks for correct dates
- test data quality
- create features
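
The date check from the list above could look roughly like the following. This is a minimal sketch, assuming a `Date` column and a regular 30-minute bar spacing; the helper name `check_dates` is hypothetical and not part of the notebook.

```python
import pandas as pd

def check_dates(df, date_col="Date", freq="30min"):
    """Return basic timestamp diagnostics for a bar DataFrame (hypothetical helper)."""
    dates = pd.to_datetime(df[date_col])
    # Every timestamp we would expect between the first and last bar
    expected = pd.date_range(dates.min(), dates.max(), freq=freq)
    return {
        "is_sorted": bool(dates.is_monotonic_increasing),
        "n_duplicates": int(dates.duplicated().sum()),
        "missing_bars": expected.difference(pd.DatetimeIndex(dates)),
    }

# Toy example: the 01:00 bar is missing
toy = pd.DataFrame({"Date": pd.to_datetime(
    ["2024-01-01 00:00", "2024-01-01 00:30", "2024-01-01 01:30"])})
report = check_dates(toy)
print(report)
```

Running the same check on the loaded `df` before creating features would surface gaps that silently distort the `diff` and `rolling` features.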

In [5]:
pairs = pd.read_csv("pairs.csv")
NUM_PAIRS_TO_LOAD = 100
pairs = pairs.iloc[:NUM_PAIRS_TO_LOAD, 0].values
pairs = [pair.replace("USD", "USDT") for pair in pairs]

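Note: Polars 'scan' functions (e.g. pl.scan_parquet) use lazy evaluation, so the query can be optimized before any data is loaded; pl.read_parquet, used below, loads the file eagerly.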

In [6]:
pair = pairs[2]

In [7]:
df = pl.read_parquet(f'{TICKER_DATA_PATH}/{pair.replace("/", "")}.parquet').to_pandas()

In [1]:
#df.collect_schema().names()

#### Feature Creation
- 30 mins momentum
- change in volume
- current volume / 24h average volume (check for weekend effects?) --> maybe add a flag: is weekend, or is transition to weekend
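
The volume-ratio and weekend ideas could be sketched as below. It assumes 30-minute bars (so 24h = 48 bars) and the `Date` / `usd_volume` columns; the function and the new column names are hypothetical. Note that the rolling average here includes the current bar, and that `weekend_change` flags a flip in either direction, not only the change into the weekend.

```python
import pandas as pd

BARS_PER_DAY = 48  # assumption: 30-minute bars

def add_volume_features(df):
    """Add a volume ratio and weekend flags (hypothetical column names)."""
    out = df.copy()
    avg_24h = out["usd_volume"].rolling(BARS_PER_DAY).mean()
    out["usd_volume_vs_24h_avg"] = out["usd_volume"] / avg_24h
    is_weekend = out["Date"].dt.dayofweek >= 5  # Saturday = 5, Sunday = 6
    out["is_weekend"] = is_weekend
    # True on the first bar after the weekend state flips
    previous = is_weekend.shift(1, fill_value=is_weekend.iloc[0])
    out["weekend_change"] = is_weekend != previous
    return out

# Toy data: Friday 2024-01-05 and Saturday 2024-01-06, constant volume
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-05", periods=96, freq="30min"),
    "usd_volume": 100.0,
})
feats = add_volume_features(toy)
print(feats.iloc[46:50])
```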

In [8]:
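# Bars are 30 minutes apart, so diff(1) = 30 min, diff(2) = 60 min, diff(6) = 3 h
df['usd_volume_30m_delta'] = df['usd_volume'].diff()
df['usd_volume_60m_delta'] = df['usd_volume'].diff(2)
df['usd_volume_3hm_delta'] = df['usd_volume'].diff(6)

# Percentage price change over each horizon.
# NOTE: the denominator is the current close, not the close at the start of the
# horizon; df['close'].pct_change(n) * 100 gives the conventional simple return.
df['return_30m'] = df['close'].diff()/df['close']*100
df['return_60m'] = df['close'].diff(2)/df['close']*100
df['return_3h'] = df['close'].diff(6)/df['close']*100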

Calculate the rolling 6h low and high prices. Their difference is a good proxy for coin-specific volatility within this timeframe.

In [9]:
df['6h_low'] = df['low'].rolling(12).min()
df['6h_high'] = df['high'].rolling(12).max()
df['6h_high_minus_low'] = df['6h_high'] - df['6h_low']

df['6h_close_volatility'] = df['close'].rolling(12).std()

df.columns

Index(['Date', 'open', 'high', 'low', 'close', 'volume', 'usd_volume',
       'usd_volume_30m_delta', 'usd_volume_60m_delta', 'usd_volume_3hm_delta',
       'return_30m', 'return_60m', 'return_3h', '6h_low', '6h_high',
       '6h_high_minus_low', '6h_close_volatility'],
      dtype='object')

In [10]:
df['Date'] > pd.to_datetime("2025-01-01")

0        False
1        False
2        False
3        False
4        False
         ...  
54316     True
54317     True
54318     True
54319     True
54320     True
Name: Date, Length: 54321, dtype: bool

### Training Step

What is the naive benchmark that we need to beat when looking at just one pair?
- Better than long-term buy and hold?
- Better in terms of risk measures (max drawdown, Sharpe ratio, ...)?
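
The two risk measures named above can be computed for a buy-and-hold benchmark roughly as follows. This is a sketch, assuming 30-minute bars traded around the clock and a zero risk-free rate; the function names and the annualization constant are illustrative choices, not something the notebook defines.

```python
import numpy as np
import pandas as pd

BARS_PER_YEAR = 48 * 365  # assumption: 30-minute bars, market open 24/7

def max_drawdown(equity):
    """Largest peak-to-trough loss of an equity curve, as a negative fraction."""
    running_peak = equity.cummax()
    return float((equity / running_peak - 1.0).min())

def sharpe_ratio(returns, periods_per_year=BARS_PER_YEAR):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return float(returns.mean() / returns.std() * np.sqrt(periods_per_year))

# Toy buy-and-hold example on four closes
close = pd.Series([100.0, 110.0, 99.0, 120.0])
equity = close / close.iloc[0]          # value of 1 unit invested at the start
returns = close.pct_change().dropna()   # per-bar simple returns

print(max_drawdown(equity))   # drop from 110 to 99 relative to the peak
print(sharpe_ratio(returns))
```

Applying the same two functions to a strategy's equity curve gives a like-for-like comparison against buy and hold.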


One idea is to be invested only during lucrative phases rather than at all times, which would lend itself to using leverage in those phases. Such a strategy would need to be evaluated on its drawdown events and its overall hit ratio. It could combine either a lower hit ratio with high payout potential, or a high hit ratio with higher leverage, i.e. high-conviction trades only.
The question is which models lend themselves to such a strategy.
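
To make the hit-ratio versus payout trade-off concrete, here is a small sketch that summarizes a list of per-trade returns. The function name and the toy numbers are purely illustrative; it assumes the list contains at least one winning and one losing trade.

```python
import numpy as np

def trade_stats(trade_returns):
    """Return (hit ratio, payoff ratio, mean return per trade)."""
    r = np.asarray(trade_returns, dtype=float)
    wins, losses = r[r > 0], r[r < 0]
    hit_ratio = len(wins) / len(r)
    # Average win relative to average loss (needs at least one of each)
    payoff_ratio = wins.mean() / abs(losses.mean())
    expectancy = r.mean()
    return hit_ratio, payoff_ratio, expectancy

# Low hit ratio, large payout per winning trade
low_hit = trade_stats([0.09, -0.02, -0.02, -0.02])
# High hit ratio, small payout per winning trade
high_hit = trade_stats([0.01, 0.01, 0.01, -0.02])
print(low_hit)
print(high_hit)
```

Both toy profiles have a positive mean return per trade, which is why the hit ratio alone is not enough to rank strategies.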

What do we want to predict?
- buy/sell signals (i.e. buy/hold/sell in the next 30 mins)
- or direct returns
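
Both target options can be derived from the same forward return. The sketch below is one possible construction, assuming 30-minute bars and a hypothetical threshold of 0.1% for separating buy/sell from hold; the threshold and the function name are illustrative and would need tuning (e.g. against trading fees).

```python
import numpy as np
import pandas as pd

def make_targets(close, threshold=0.001):
    """Build a regression target (next-bar return) and a class target (1/0/-1)."""
    fwd_return = close.shift(-1) / close - 1.0   # return over the next bar
    signal = np.select(
        [fwd_return > threshold, fwd_return < -threshold],
        [1, -1],          # 1 = buy, -1 = sell
        default=0,        # 0 = hold
    )
    targets = pd.DataFrame(
        {"fwd_return": fwd_return, "signal": signal}, index=close.index
    )
    return targets.dropna()  # the last bar has no future close

close = pd.Series([100.0, 101.0, 101.05, 100.0])
targets = make_targets(close)
print(targets)
```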

In [None]:
# keep only data from 2024-01-01 onward

df = df[df['Date'] >= pd.to_datetime("2024-01-01")]


In [None]:
# Despite the name, this is the close one bar (30 min) ahead:
# shift(-1) moves each future value up by one row.
y_lagged_30m = df['close'].shift(-1)

In [42]:
# Pair each timestamp with the close of the following bar.
# Building from a dict keeps proper dtypes (datetime64 / float64).
y_lagged_30m = pd.DataFrame({
    'Date': df['Date'].iloc[:-1].values,
    'close': df['close'].shift(-1).dropna().values,
})
y_lagged_30m

Unnamed: 0,Date,close
0,2024-01-01 00:00:00,0.6162
1,2024-01-01 00:30:00,0.6173
2,2024-01-01 01:00:00,0.6185
3,2024-01-01 01:30:00,0.6169
4,2024-01-01 02:00:00,0.6154
...,...,...
19207,2025-02-03 08:30:00,2.3832
19208,2025-02-03 09:00:00,2.3733
19209,2025-02-03 09:30:00,2.3668
19210,2025-02-03 10:00:00,2.432


In [43]:
from sklearn.linear_model import Ridge

In [46]:
FEAT_COLS = ['usd_volume',
       'usd_volume_30m_delta', 'usd_volume_60m_delta', 'usd_volume_3hm_delta',
       'return_30m', 'return_60m', 'return_3h', '6h_low', '6h_high',
       '6h_high_minus_low', '6h_close_volatility']

In [53]:
X = df[["Date"] + FEAT_COLS]
X = X[ X["Date"].isin(y_lagged_30m["Date"]) ]

In [59]:
model = Ridge().fit(X[FEAT_COLS], y_lagged_30m["close"])



In [67]:
preds = model.predict(X[FEAT_COLS])

In [65]:
def mse(ytrue, ypred):
    return np.mean((ytrue - ypred)**2)

In [68]:
mse(y_lagged_30m["close"], preds)

0.00027467536222299384
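
The MSE above is measured on the same rows the model was fitted on, and on price levels. A natural reference point is the naive "no change" forecast (next close = current close), evaluated on a held-out, chronologically later slice. The sketch below shows that baseline under those assumptions; `persistence_mse` and the 20% test fraction are hypothetical choices, and the toy series is not the notebook's data.

```python
import numpy as np
import pandas as pd

def mse(ytrue, ypred):
    return float(np.mean((np.asarray(ytrue) - np.asarray(ypred)) ** 2))

def persistence_mse(close, test_fraction=0.2):
    """MSE of predicting next close = current close on the last part of the series."""
    y_next = close.shift(-1).iloc[:-1]   # target: close one bar ahead
    y_naive = close.iloc[:-1]            # naive forecast: current close
    split = int(len(y_next) * (1 - test_fraction))  # chronological split, no shuffling
    return mse(y_next.iloc[split:], y_naive.iloc[split:])

close = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])
baseline = persistence_mse(close)
print(baseline)
```

Fitting the Ridge model on the rows before the same split point and scoring it on the rows after would make its MSE directly comparable to this baseline.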