# **Predicting Funding Rate Direction: Model Development**

## **Introduction**

In this notebook, we develop models that predict the **direction of the funding rate movement** for Bitcoin perpetual futures contracts on Binance. The funding rate is the periodic payment exchanged between long and short position holders; it sets the cost of holding a position and therefore influences traders' strategies. A reliable forecast of its direction can inform trading decisions such as when to open, close, or hedge a position.

Our objectives are:

- **Data Preprocessing and Feature Engineering**: Clean and prepare the data, extract meaningful features, and handle any data-related challenges.
- **Model Training and Evaluation**: Train various machine learning models to predict the funding rate direction and evaluate their performance.
- **Model Improvement and Selection**: Enhance model performance through techniques like hyperparameter tuning and select the best-performing model.


In [None]:
import sys
import os

# Resolve the absolute path of the project's root directory (the parent of the current working directory)
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Add the project root to sys.path so that local modules can be imported
if project_root not in sys.path:
    sys.path.append(project_root)

# Import project modules
from utilities import (
    load_data,
    preprocess_data,
    create_features,
    train_classification_model,
    save_model
)

from utilities.functions import (
    add_lag_features,
    add_technical_indicators,
    apply_smote,
    perform_hyperparameter_tuning,
    evaluate_classification_model,
    plot_feature_importance
)

from config import (
    BINANCE_BTC_PERP_CSV,
    MODEL1_PATH,
    SCALER1_PATH,
    RANDOM_STATE
)

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

## **Step 1: Data Preprocessing and Feature Engineering**

In this step, we prepare the data for modeling by performing preprocessing tasks and creating new features that may improve model performance.

### **1.1 Determining Funding Rate Direction**

We create the target variable `direction` to indicate whether the funding rate is expected to **increase (`1`)** or **decrease/remain the same (`0`)** in the next time period.

- **Methodology**:
  - Shift the `funding_rate` column back by one period (`shift(-1)`) so that each row carries the next period's funding rate.
  - Compare this future funding rate with the current funding rate to determine the direction.
  - Set `direction` to `1` if the future funding rate is strictly higher; otherwise, set it to `0`.
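
The labelling rule above can be illustrated on a small synthetic series. This is a minimal sketch with made-up funding rates; `label_direction` is a hypothetical helper name and not part of the project's `utilities` module. It assumes that the final row, which has no future value to compare against, should be dropped rather than labelled.

```python
import pandas as pd


def label_direction(df: pd.DataFrame, column: str = "funding_rate") -> pd.DataFrame:
    """Return a copy with a binary `direction` column (hypothetical helper)."""
    out = df.copy()
    # shift(-1) moves each value up one row, so row t holds the value from t+1
    future = out[column].shift(-1)
    # Rows without a future value (the last row) cannot be labelled; drop them
    out = out.loc[future.notna()].copy()
    # 1 if the next funding rate is strictly higher, otherwise 0
    out["direction"] = (future.loc[out.index] > out[column]).astype(int)
    return out


# Made-up funding rates, for illustration only
toy = pd.DataFrame({"funding_rate": [0.0001, 0.0003, 0.0003, 0.0002, 0.0005]})
labelled = label_direction(toy)
print(labelled)
```

An unchanged funding rate (second row to third row) is labelled `0`, which matches the "decrease/remain the same" class defined above.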

### **1.2 Feature Engineering**

To enhance the model's predictive power, we generate additional features:

- **Lag Features**:
  - `funding_rate_lag1`, `funding_rate_lag2`: Funding rate from one and two periods earlier.
  - `open_interest_lag1`: Open interest from the previous period.
  - `mark_price_lag1`: Mark price from the previous period.

- **Technical Indicators**:
  - `funding_rate_ma3`, `funding_rate_ma5`: 3-period and 5-period moving averages of the funding rate.

- **Cyclical Time Features**:
  - Convert time-based features (hour, day, month) into cyclical features using sine and cosine transformations to capture periodic patterns.

- **Data Handling**:
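  - **Missing Values**: Filled the `NaN` values that lagging and moving averages leave in the first rows using backward fill (`bfill`), which copies the next valid observation into each gap.
  - **Scaling**: Standardized numerical features to zero mean and unit variance, fitting the scaler on the training split only.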
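
The feature types listed above can be sketched on synthetic data as follows. The project's `add_lag_features`, `add_technical_indicators`, and `create_features` are not shown in this notebook, so this is only an illustration of the general techniques: `encode_cyclical` is a hypothetical helper, the 8-hour spacing and the values are made up, and the moving average is assumed to include the current row.

```python
import numpy as np
import pandas as pd


def encode_cyclical(values, period):
    """Map values with a known period onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# Made-up funding rates at 8-hour intervals, for illustration only
index = pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=8))
toy = pd.DataFrame(
    {"funding_rate": [0.0001, 0.0002, 0.0003, 0.0002, 0.0001, 0.0004]},
    index=index,
)

# Lag feature: the value observed one period earlier (first row becomes NaN)
toy["funding_rate_lag1"] = toy["funding_rate"].shift(1)

# 3-period moving average including the current row (first two rows become NaN)
toy["funding_rate_ma3"] = toy["funding_rate"].rolling(window=3).mean()

# Cyclical encoding of the hour: 23:00 and 00:00 end up close together
toy["hour_sin"], toy["hour_cos"] = encode_cyclical(toy.index.hour, period=24)

print(toy)
```

The same helper would apply to day and month features with a different `period`; which period the project uses for `day_sin`/`day_cos` (day of week or day of month) is not stated here.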

In [None]:
# Load and preprocess data
df = load_data(BINANCE_BTC_PERP_CSV)
df = preprocess_data(df)
df = create_features(df)

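# Create the 'direction' target variable
df['future_funding_rate'] = df['funding_rate'].shift(-1)
# The last row has no future value, so its label would be undefined; drop it
df = df.dropna(subset=['future_funding_rate']).copy()
df['direction'] = (df['future_funding_rate'] > df['funding_rate']).astype(int)
df.drop(columns=['future_funding_rate'], inplace=True)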

# Add lag features
df = add_lag_features(df)

# Add technical indicators
df = add_technical_indicators(df)

# Backward-fill the NaNs that lag and rolling-window features leave in the first rows
df.bfill(inplace=True)

## **Step 2: Model Training and Evaluation**

### **2.1 Random Forest Classifier**

#### **Description**

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mode of the classes as the prediction.

#### **Implementation**

- **Handling Class Imbalance**: Used **SMOTE (Synthetic Minority Over-sampling Technique)** to balance the classes in the training data.
- **Hyperparameter Tuning**: Employed `GridSearchCV` to find the optimal hyperparameters, such as the number of estimators and maximum depth.
- **Training**: Trained the Random Forest model with the best-found hyperparameters.
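
A grid search of this kind might look like the sketch below. The project's `perform_hyperparameter_tuning` is not shown in this notebook, so the parameter grid, the `f1` scoring, and the use of `TimeSeriesSplit` for chronological folds are all assumptions made for illustration. The data is synthetic, and the SMOTE step is omitted to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data, for illustration only: the label depends on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Assumed search space; the project's actual grid may differ
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on data after its training window
    scoring="f1",
)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all of X, y with the best parameters
print(search.best_params_)
```

`TimeSeriesSplit` is used here instead of shuffled folds because the observations are ordered in time; whether the project does the same is not confirmed by this notebook.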

#### **Evaluation**

- **Metrics Used**: The same metrics as for logistic regression (Section 2.2), for consistency: accuracy, precision, recall, and ROC AUC.
- **Results**:
  - Observed improvements in predictive performance over logistic regression.
  - Evaluated the model's ability to predict the minority class accurately.
  - Plotted feature importance to understand which features contributed most to the predictions.
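
Feature importance for a fitted Random Forest can be read from its `feature_importances_` attribute. The sketch below uses synthetic data with made-up feature names and prints a ranking instead of plotting; the project's `plot_feature_importance` is not shown here and may present the values differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only: the label depends on one feature
rng = np.random.default_rng(0)
names = ["informative", "noise_a", "noise_b"]
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=names)
y = (X["informative"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking.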

In [None]:
# Proceed only if df is not empty
if not df.empty:
    # Define features and target
    feature_columns = [
        'funding_rate_lag1', 'funding_rate_lag2',
        'funding_rate_ma3', 'funding_rate_ma5',
        'open_interest', 'open_interest_lag1',
        'mark_price', 'mark_price_lag1',
        'hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'month_sin', 'month_cos'
    ]
    X = df[feature_columns]
    y = df['direction']

    # Split the data
    split_index = int(0.8 * len(X))
    X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
    y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

    if not X_train.empty and not X_test.empty:
        # Scale numerical features
        numeric_features = [
            'funding_rate_lag1', 'funding_rate_lag2',
            'funding_rate_ma3', 'funding_rate_ma5',
            'open_interest', 'open_interest_lag1',
            'mark_price', 'mark_price_lag1'
        ]
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train[numeric_features])
        X_test_scaled = scaler.transform(X_test[numeric_features])

        # Convert scaled features back to DataFrame
        X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=numeric_features, index=X_train.index)
        X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=numeric_features, index=X_test.index)

        # Combine scaled numerical features with cyclical features
        cyclical_features = ['hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'month_sin', 'month_cos']
        X_train_prepared = pd.concat(
            [X_train_scaled_df.reset_index(drop=True), X_train[cyclical_features].reset_index(drop=True)],
            axis=1
        )
        X_test_prepared = pd.concat(
            [X_test_scaled_df.reset_index(drop=True), X_test[cyclical_features].reset_index(drop=True)],
            axis=1
        )

        # Apply SMOTE to balance class distribution
        X_train_resampled, y_train_resampled = apply_smote(X_train_prepared, y_train.reset_index(drop=True))

        # Initialize and train the Random Forest model
        rf_model = RandomForestClassifier(random_state=RANDOM_STATE, class_weight='balanced')
        rf_model = train_classification_model(rf_model, X_train_resampled, y_train_resampled)

        # Evaluate the initial model
        print("\nInitial Model Evaluation:")
        evaluate_classification_model(rf_model, X_test_prepared, y_test)

        # Perform hyperparameter tuning
        best_rf_model = perform_hyperparameter_tuning(X_train_resampled, y_train_resampled)

        # Evaluate the best model
        print("\nBest Model Evaluation After Hyperparameter Tuning:")
        evaluate_classification_model(best_rf_model, X_test_prepared, y_test)

        # Optionally adjust the classification threshold; 0.5 is the usual default cut-off,
        # so change custom_threshold to trade precision against recall
        y_proba = best_rf_model.predict_proba(X_test_prepared)[:, 1]
        custom_threshold = 0.5
        print(f"\nEvaluation with Custom Threshold ({custom_threshold}):")
        evaluate_classification_model(best_rf_model, X_test_prepared, y_test, y_proba, threshold=custom_threshold)

        # Plot feature importance
        plot_feature_importance(best_rf_model, X_train_prepared.columns)

        # Save the trained model and scaler
        save_model(best_rf_model, MODEL1_PATH)
        save_model(scaler, SCALER1_PATH)


### **2.2 Logistic Regression**

#### **Description**

Logistic Regression is a linear model commonly used for binary classification problems. It models the probability that a given input belongs to a particular category.

#### **Implementation**

- **Handling Class Imbalance**: Addressed through class weighting or resampling, where the class distribution required it.
- **Feature Scaling**: Applied `StandardScaler` to numerical features to normalize the data.
- **Training**: Trained the logistic regression model using the processed training data.
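
This notebook contains no code cell for the logistic regression model, so the following is only a sketch of how such a baseline could be set up. The data is synthetic, and `class_weight="balanced"`, `max_iter=1000`, and the 80/20 chronological split are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(400, 2))
y = (X[:, 0] > 6.0).astype(int)  # roughly one positive in six

# Chronological split: the first 80% of rows train, the last 20% test
split_index = int(0.8 * len(X))
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# The pipeline fits the scaler on the training data only, then the classifier
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_train, y_train)

accuracy = baseline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics without extra bookkeeping.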

#### **Evaluation**

- **Metrics Used**:
  - **Accuracy**: Overall correctness of the model.
  - **Precision**: Correct positive predictions out of all positive predictions.
  - **Recall**: Correct positive predictions out of all actual positives.
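  - **ROC AUC Score**: Probability that the model ranks a randomly chosen positive example above a randomly chosen negative one; 0.5 corresponds to random guessing and 1.0 to perfect separation.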

- **Results**:
  - Presented the classification report and confusion matrix.
  - Analyzed the model's performance, particularly on the minority class.
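
The metrics above can be computed with `sklearn.metrics`. The labels and scores below are made up for illustration and are not results from this notebook; the project's `evaluate_classification_model` is not shown and may report them differently.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Turn probabilities into class predictions with an explicit threshold
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the thresholded labels

print(f"accuracy={accuracy}, precision={precision}, recall={recall}, auc={auc}")
```

Accuracy, precision, and recall depend on the chosen threshold, while ROC AUC is computed from the raw scores and does not.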

### **2.3 Model Comparison**

- Compared the performance of logistic regression and Random Forest models.
- Discussed which model performed better and why.
- Considered factors like overfitting, computational efficiency, and interpretability.