# Meta Labeling

Meta-labeling is particularly helpful when you want to achieve higher F1-scores. First, we build a model that achieves high recall, even if the precision is not particularly high. Second, we correct for the low precision by applying meta-labeling to the positives predicted by the primary model.

The central idea is to create a secondary ML model that learns how to use the primary model. This leads to improved performance metrics, including: Accuracy, Precision, Recall, and F1-Score etc.

Binary classification problems present a trade-off between type-I errors (false positives) and type-II errors (false negatives). In general, increasing the true positive rate of a binary classifier will tend to increase its false positive rate. The receiver operating characteristic (ROC) curve of a binary classifier measures the cost of increasing the true positive rate, in terms of accepting higher false positive rates.

In [1]:
import sys
import os
import pandas as pd

# Añadir el directorio raíz del proyecto al PYTHONPATH
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import quantstats as qs

Cálculo de los labels sobre los datos originales. Si esto estuviese bien hecho, se habria hecho en el apartado 2. Pero como en ese momento no tenía ni puta idea, pues lo hago ahora con los datos completos. ya los uniré después.

En este notebook cojo los datos originales (sin separar en in-sample y out-of-sample) y aplico 'three barrier method' sobre ellos.

In [45]:
# Importar datos
#SPY
spy_data = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\00_api_data\SPY_all.parquet')
spy_dollar_imb = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\01_imbalance_bars\SPY_dollar_imbalance.parquet')
spy_volume_imb = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\01_imbalance_bars\SPY_volume_imbalance.parquet')


#BTC
btc_data = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\00_api_data\BTC_all.parquet')
btc_dollar_imb = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\01_imbalance_bars\BTC_dollar_imbalance.parquet')
btc_volume_imb = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\01_imbalance_bars\BTC_volume_imbalance.parquet')

# paso todas las columnas date a formato datetime, y las convierto en índice.
# Convertir la columna 'date' a datetime para todos los DataFrames
spy_data['date'] = pd.to_datetime(spy_data['date'])
spy_dollar_imb['date'] = pd.to_datetime(spy_dollar_imb['date'])
spy_volume_imb['date'] = pd.to_datetime(spy_volume_imb['date'])

btc_data['date'] = pd.to_datetime(btc_data['date'])
btc_dollar_imb['date'] = pd.to_datetime(btc_dollar_imb['date'])
btc_volume_imb['date'] = pd.to_datetime(btc_volume_imb['date'])

# Establecer la columna 'date' como índice para todos los DataFrames
spy_data.set_index('date', inplace=True)
spy_dollar_imb.set_index('date', inplace=True)
spy_volume_imb.set_index('date', inplace=True)

btc_data.set_index('date', inplace=True)
btc_dollar_imb.set_index('date', inplace=True)
btc_volume_imb.set_index('date', inplace=True)

# Three barrier method

### Funciones de ayuda

funciones basadas en: https://towardsdatascience.com/financial-machine-learning-part-1-labels-7eeed050f32e

In [50]:
# función para calcular la volatilidad. El libro usa la diaría. aquí usaremos la horaria.
def calculate_rolling_volatility(prices: pd.Series, span: int = 100, time_delta: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.Series:
    """
    Calculate the rolling volatility of a time series of prices.

    Parameters:
    ----------
    prices : pd.Series
        A pandas Series representing the price data, indexed by time.
    span : int, optional
        The span for the exponential weighted moving average, by default 100.
    time_delta : pd.Timedelta, optional
        The time difference used to compute returns, by default 1 hour.

    Returns:
    -------
    pd.Series
        A pandas Series containing the rolling volatility of the price series.
    """

    # 1. Calcular los retornos de la forma p[t]/p[t-1] - 1
    # 1.1 Encontrar los timestamps de los valores p[t-1]
    previous_indices = prices.index.searchsorted(prices.index - time_delta)
    previous_indices = previous_indices[previous_indices > 0]

    # 1.2 Alinear los timestamps de p[t-1] con los timestamps de p[t]
    aligned_indices = pd.Series(prices.index[previous_indices-1], index=prices.index[prices.shape[0] - previous_indices.shape[0]:])

    # 1.3 Obtener valores por timestamps, y luego calcular los retornos
    returns = prices.loc[aligned_indices.index] / prices.loc[aligned_indices.values].values - 1

    # 2. Estimar la desviación estándar móvil (volatilidad) usando media ponderada exponencialmente
    rolling_volatility = returns.ewm(span=span).std()

    return rolling_volatility

In [51]:
def get_horizons(prices: pd.Series, time_delta: pd.Timedelta = pd.Timedelta(minutes=60)) -> pd.Series:
    """
    Calculate the future time horizons for a given time series of prices.

    Parameters:
    ----------
    prices : pd.Series
        A pandas Series representing the price data, indexed by time.
    time_delta : pd.Timedelta, optional
        The time difference used to calculate future time horizons, by default 60 minutes.

    Returns:
    -------
    pd.Series
        A pandas Series containing the future time horizons, indexed by the original timestamps.
    """

    # 1. Buscar los índices de los timestamps desplazados hacia adelante por time_delta
    future_indices = prices.index.searchsorted(prices.index + time_delta)
    
    # 2. Filtrar los índices que están dentro del rango del DataFrame
    future_indices = future_indices[future_indices < prices.shape[0]]
    
    # 3. Obtener los timestamps correspondientes a los índices futuros
    future_times = prices.index[future_indices]
    
    # 4. Crear una Serie con los timestamps futuros, indexada por los timestamps originales
    time_horizons = pd.Series(future_times, index=prices.index[:future_times.shape[0]])

    return time_horizons


In [52]:
import pandas as pd

def get_touches(prices: pd.Series, events: pd.DataFrame, factors: list = [1, 1]) -> pd.DataFrame:
    """
    Calculate the earliest stop loss and take profit times for given events.

    Parameters:
    ----------
    prices : pd.Series
        Series with price data indexed by time.
    events : pd.DataFrame
        DataFrame with the following columns:
            - t1: Timestamp of the next horizon.
            - threshold: Unit height of the top and bottom barriers.
            - side: The direction (side) of each bet.
    factors : list, optional
        Multipliers for the threshold to set the height of the top and bottom barriers.
        Default is [1, 1], meaning the barriers are at 1x the threshold.

    Returns:
    -------
    pd.DataFrame
        DataFrame with columns 'stop_loss' and 'take_profit' indicating the earliest times
        at which the stop loss and take profit levels are touched for each event.
    """

    # Crear una copia del DataFrame 'events' con la columna 't1'
    touch_times = events[['t1']].copy(deep=True)

    # Calcular el umbral superior (barrera superior)
    if factors[0] > 0:
        upper_threshold = factors[0] * events['threshold']
    else:
        upper_threshold = pd.Series(index=events.index)  # sin umbral superior

    # Calcular el umbral inferior (barrera inferior)
    if factors[1] > 0:
        lower_threshold = -factors[1] * events['threshold']
    else:
        lower_threshold = pd.Series(index=events.index)  # sin umbral inferior

    # Iterar sobre cada evento para calcular el stop loss y take profit
    for event_index, horizon_time in events['t1'].items():
        price_path = prices[event_index:horizon_time]  # Precios en el camino
        returns_path = (price_path / prices[event_index] - 1) * events.loc[event_index, 'side']  # Retornos en el camino
        touch_times.loc[event_index, 'stop_loss'] = returns_path[returns_path < lower_threshold[event_index]].index.min()  # Primer stop loss
        touch_times.loc[event_index, 'take_profit'] = returns_path[returns_path > upper_threshold[event_index]].index.min()  # Primer take profit

    return touch_times


In [53]:
# def get_horizons(prices, delta=pd.Timedelta(minutes=60)):
#     t1 = prices.index.searchsorted(prices.index + delta)
#     t1 = t1[t1 < prices.shape[0]]
#     t1 = prices.index[t1]
#     t1 = pd.Series(t1, index=prices.index[:t1.shape[0]])
#     return t1

In [54]:
# def get_touches(prices: pd.Series, events: pd.DataFrame, factors=[1, 1]) -> pd.DataFrame:
#     """
#     Calculate the earliest stop loss and take profit for given events.
    
#     Parameters:
#     prices : pd.Series
#         Series with price data.
#     events : pd.DataFrame
#         DataFrame with columns:
#             - t1: Timestamp of the next horizon
#             - threshold: Unit height of top and bottom barriers
#             - side: The side of each bet
#     factors : list
#         Multipliers of the threshold to set the height of top/bottom barriers.
        
#     Returns:
#     pd.DataFrame
#         DataFrame with columns 'stop_loss' and 'take_profit' for each event.
#     """
#     out = events[['t1']].copy(deep=True)
    
#     if factors[0] > 0:
#         thresh_uppr = factors[0] * events['threshold']
#     else:
#         thresh_uppr = pd.Series(index=events.index)  # no upper threshold
    
#     if factors[1] > 0:
#         thresh_lwr = -factors[1] * events['threshold']
#     else:
#         thresh_lwr = pd.Series(index=events.index)  # no lower threshold
    
#     for loc, t1 in events['t1'].items():  # Cambiado a 'items' en lugar de 'iteritems'
#         df0 = prices[loc:t1]  # path prices
#         df0 = (df0 / prices[loc] - 1) * events.loc[loc, 'side']  # path returns
#         out.loc[loc, 'stop_loss'] = df0[df0 < thresh_lwr[loc]].index.min()  # earliest stop loss
#         out.loc[loc, 'take_profit'] = df0[df0 > thresh_uppr[loc]].index.min()  # earliest take profit
    
#     return out


In [55]:
def get_labels(touches: pd.DataFrame) -> pd.DataFrame:
    """
    Assign labels to events based on the first touch of stop loss or take profit levels.

    Parameters:
    ----------
    touches : pd.DataFrame
        DataFrame containing the earliest stop loss and take profit times for each event.

    Returns:
    -------
    pd.DataFrame
        A DataFrame with an additional 'label' column indicating the outcome:
        - 1 for take profit.
        - -1 for stop loss.
        - 0 if neither was touched.
    """

    labels = touches.copy(deep=True)

    # Calcular el primer nivel tocado (stop loss o take profit) ignorando valores NaN
    first_touch = touches[['stop_loss', 'take_profit']].min(axis=1)

    # Asignar etiquetas según el primer nivel tocado
    for event_index, touch_time in first_touch.items():
        if pd.isnull(touch_time):
            labels.loc[event_index, 'label'] = 0  # No se tocó ningún nivel
        elif touch_time == touches.loc[event_index, 'stop_loss']:
            labels.loc[event_index, 'label'] = -1  # Se tocó el stop loss
        else:
            labels.loc[event_index, 'label'] = 1  # Se tocó el take profit

    return labels


In [56]:
# import pandas as pd

# def get_labels(touches: pd.DataFrame) -> pd.DataFrame:
#     out = touches.copy(deep=True)
#     # pandas df.min() ignores NaN values
#     first_touch = touches[['stop_loss', 'take_profit']].min(axis=1)
    
#     for loc, t in first_touch.items():
#         if pd.isnull(t):
#             out.loc[loc, 'label'] = 0
#         elif t == touches.loc[loc, 'stop_loss']:
#             out.loc[loc, 'label'] = -1
#         else:
#             out.loc[loc, 'label'] = 1
            
#     return out


In [57]:
import pandas as pd

def process_ohlc_data(data_ohlc: pd.DataFrame) -> pd.DataFrame:
    """
    Process OHLC data to generate event labels for trading strategies.

    This function calculates the volatility threshold, the time horizons, and assigns
    labels for stop loss and take profit events.

    Parameters:
    ----------
    spy_data : pd.DataFrame
        A DataFrame containing OHLC data with at least a 'close' price column.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the original data with additional columns:
        - 'threshold': Volatility threshold calculated from the close prices.
        - 't1': Time horizons for each event.
        - 'label': Event labels (-1 for stop loss, 1 for take profit, 0 for neither).
    """

    # Asignar umbral de volatilidad calculado a partir de los precios de cierre
    data_ohlc = data_ohlc.assign(threshold=calculate_rolling_volatility(spy_data.close)).dropna()

    # Asignar horizontes temporales calculados
    data_ohlc = data_ohlc.assign(t1=get_horizons(data_ohlc)).dropna()

    # Crear DataFrame de eventos con las columnas 't1' y 'threshold'
    events = data_ohlc[['t1', 'threshold']]

    # Asignar la columna 'side' con valor 1 para indicar posiciones largas únicamente
    events = events.assign(side=pd.Series(1., index=events.index))

    # Calcular los niveles de stop loss y take profit
    touches = get_touches(data_ohlc.close, events, factors=[1, 1])

    # Asignar etiquetas basadas en los niveles de stop loss y take profit
    touches = get_labels(touches)

    # Asignar las etiquetas finales al DataFrame original
    data_ohlc = data_ohlc.assign(label=touches.label)

    return data_ohlc


### Obtención de labeles

In [58]:
new_spy = process_ohlc_data(spy_data)

In [59]:
new_spy.to_parquet('spy_with_labels.parquet')

In [61]:
spy_volume_imb = process_ohlc_data(spy_volume_imb)
spy_volume_imb.to_parquet('spy_volume_with_labels.parquet')

In [63]:
spy_dollar_imb = process_ohlc_data(spy_dollar_imb)
spy_dollar_imb.to_parquet('spy_dollar_with_labels.parquet')

In [64]:
btc_data = process_ohlc_data(btc_data)
btc_data.to_parquet('btc_with_labels.parquet')

In [65]:
btc_volume_imb = process_ohlc_data(btc_volume_imb)
btc_volume_imb.to_parquet('btc_volume_with_labels.parquet')

In [66]:
btc_dollar_imb = process_ohlc_data(btc_dollar_imb)
btc_dollar_imb.to_parquet('btc_dollar_with_labels.parquet')

# Matrices de confusión

Una vez tengo los labeles, puedo obtener las matrices de confusion de los datos de los datos predichos por el algoritmo.

El meta labeling es una clasificación binaria. Osea que los labeles {-1,1} son aciertos, y los {0} errores. 

In [2]:
from sklearn.metrics import confusion_matrix,classification_report

In [4]:
BTC_ORIGINAL_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\BTC_original_returns.parquet')
BTC_VOLUME_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\BTC_volume_returns.parquet')
BTC_DOLLAR_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\BTC_dollar_returns.parquet')

SPY_ORIGINAL_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\SPY_original_returns.parquet')
SPY_VOLUME_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\SPY_volume_returns.parquet')
SPY_DOLLAR_RET = pd.read_parquet(r'C:\Users\adelapuente\Desktop\math_tfm\04_validation\SPY_dollar_returns.parquet')


SPY_ORIGINAL_RET.set_index('date', inplace=True)
SPY_VOLUME_RET.set_index('date', inplace=True)
SPY_DOLLAR_RET.set_index('date', inplace=True)

BTC_ORIGINAL_RET.set_index('date', inplace=True)
BTC_VOLUME_RET.set_index('date', inplace=True)
BTC_DOLLAR_RET.set_index('date', inplace=True)

### SPY Original

In [6]:
new_spy = pd.read_parquet('spy_with_labels.parquet')
spy_original_union = SPY_ORIGINAL_RET.merge(new_spy['label'], left_index=True, right_index=True)

spy_original_labels = spy_original_union['label'].apply(lambda x: 1 if x != 0 else 0)
spy_original_ppo_predictions = spy_original_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

spy_original_union_cm = classification_report(spy_original_labels, spy_original_ppo_predictions)
print(spy_original_union_cm)

              precision    recall  f1-score   support

           0       0.35      0.35      0.35     17465
           1       0.65      0.65      0.65     31804

    accuracy                           0.54     49269
   macro avg       0.50      0.50      0.50     49269
weighted avg       0.54      0.54      0.54     49269



### SPY Volume

In [7]:
spy_volume_imb = pd.read_parquet('spy_volume_with_labels.parquet')
spy_volume_union = SPY_VOLUME_RET.merge(spy_volume_imb['label'], left_index=True, right_index=True)

spy_volume_labels = spy_volume_union['label'].apply(lambda x: 1 if x != 0 else 0)
spy_volume_ppo_predictions = spy_volume_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

spy_volume_union_cm = classification_report(spy_volume_labels, spy_volume_ppo_predictions)
print(spy_volume_union_cm)

              precision    recall  f1-score   support

           0       0.38      0.37      0.38      1304
           1       0.63      0.64      0.63      2149

    accuracy                           0.54      3453
   macro avg       0.50      0.50      0.50      3453
weighted avg       0.53      0.54      0.54      3453



### SPY Dollar

In [8]:
spy_dollar_imb = pd.read_parquet('spy_dollar_with_labels.parquet')
spy_dollar_union = SPY_DOLLAR_RET.merge(spy_dollar_imb['label'], left_index=True, right_index=True)

spy_dollar_labels = spy_dollar_union['label'].apply(lambda x: 1 if x != 0 else 0)
spy_dollar_ppo_predictions = spy_dollar_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

spy_dollar_union_cm = classification_report(spy_dollar_labels, spy_dollar_ppo_predictions)
print(spy_dollar_union_cm)

              precision    recall  f1-score   support

           0       0.39      0.37      0.38      3161
           1       0.63      0.65      0.64      5195

    accuracy                           0.55      8356
   macro avg       0.51      0.51      0.51      8356
weighted avg       0.54      0.55      0.54      8356



### BTC Original

In [9]:
btc_data = pd.read_parquet('btc_with_labels.parquet')
btc_original_union = BTC_ORIGINAL_RET.merge(btc_data['label'], left_index=True, right_index=True)

btc_original_labels = btc_original_union['label'].apply(lambda x: 1 if x != 0 else 0)
btc_original_ppo_predictions = btc_original_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

btc_original_union_cm = classification_report(btc_original_labels, btc_original_ppo_predictions)
print(btc_original_union_cm)

              precision    recall  f1-score   support

           0       0.02      0.18      0.04       903
           1       0.98      0.86      0.92     49867

    accuracy                           0.85     50770
   macro avg       0.50      0.52      0.48     50770
weighted avg       0.97      0.85      0.90     50770



### BTC Volume

In [10]:
btc_volume_imb = pd.read_parquet('btc_volume_with_labels.parquet')
btc_volume_union = BTC_VOLUME_RET.merge(btc_volume_imb['label'], left_index=True, right_index=True)

btc_volume_labels = btc_volume_union['label'].apply(lambda x: 1 if x != 0 else 0)
btc_volume_ppo_predictions = btc_volume_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

btc_volume_union_cm = classification_report(btc_volume_labels, btc_volume_ppo_predictions)
print(btc_volume_union_cm)

              precision    recall  f1-score   support

           0       0.02      0.60      0.04       128
           1       0.98      0.39      0.56      5961

    accuracy                           0.40      6089
   macro avg       0.50      0.50      0.30      6089
weighted avg       0.96      0.40      0.55      6089



### BTC Dollar

In [11]:
btc_dollar_imb = pd.read_parquet('btc_dollar_with_labels.parquet')
btc_dollar_union = BTC_DOLLAR_RET.merge(btc_dollar_imb['label'], left_index=True, right_index=True)

btc_dollar_labels = btc_dollar_union['label'].apply(lambda x: 1 if x != 0 else 0)
btc_dollar_ppo_predictions = btc_dollar_union['predicted_actions'].apply(lambda x: 1 if x != 0 else 0)

btc_dollar_union_cm = classification_report(btc_dollar_labels, btc_dollar_ppo_predictions)
print(btc_dollar_union_cm)

              precision    recall  f1-score   support

           0       0.02      0.31      0.03       850
           1       0.98      0.64      0.78     45678

    accuracy                           0.64     46528
   macro avg       0.50      0.48      0.40     46528
weighted avg       0.96      0.64      0.76     46528

