# Feature Engineering of Microstructure Data

## Goals:
- Understand the importance of feature engineering in analyzing microstructure data.
- Explore techniques to extract meaningful features from raw microstructure datasets.
- Demonstrate how these features can improve the performance of machine learning models.

# 1. What is Market Microstructure? 


In [1]:
# Import statements
import os
from argparse import Namespace

# Move to the project root so that the `src` package can be imported
os.chdir(os.path.abspath(".."))

from src.train import run_pipeline

print("Current Working Directory:", os.getcwd())

Current Working Directory: /Users/carloteufel/Documents/Projects/FeatureEngineering




### 1. OHLCV Data Only

In [2]:
# Baseline run: OHLCV bars only, no technical indicators
args = Namespace(
    dataset_version="Dataset2",
    data_type="ohlcv",
    model="BasicXGBOOST1",
    featurizer="Featurizer2",
    seq_len=100,
    predict_horizon=10,
    make_new_features=False,
    use_kfold=True,
    use_ta=False,
)
run_pipeline(args)

Dataset Columns: ['open_ohlcv', 'high_ohlcv', 'low_ohlcv', 'close_ohlcv', 'volume_ohlcv']
Cross-validation scores: [0.4367928  0.43898284 0.43886118 0.46623677 0.4421868 ]
Mean accuracy: 0.4446120777061281
Evaluating model
Accuracy: 0.46700734071460437
Evaluating accuracy over time


100%|██████████| 5/5 [00:00<00:00,  7.46it/s]


Saving model
Model saved to saved_models/BasicXGBOOST1_Dataset2_Featurizer2_2025-04-01_19:10:33


### 2. OHLCV + Technical Indicators

In [3]:
# Same configuration as the baseline, with technical indicators enabled
args = Namespace(
    dataset_version="Dataset2",
    data_type="ohlcv",
    model="BasicXGBOOST1",
    featurizer="Featurizer2",
    seq_len=100,
    predict_horizon=10,
    make_new_features=False,
    use_kfold=True,
    use_ta=True,
)
run_pipeline(args)

Dataset Columns: ['open_ohlcv', 'high_ohlcv', 'low_ohlcv', 'close_ohlcv', 'volume_ohlcv', 'sma_5_ohlcv', 'sma_10_ohlcv', 'sma_20_ohlcv', 'sma_30_ohlcv', 'sma_60_ohlcv', 'sma_200_ohlcv', 'ema_12_ohlcv', 'ema_26_ohlcv', 'ema_50_ohlcv', 'ema_200_ohlcv', 'rsi_14_ohlcv', 'rsi_30_ohlcv', 'rsi_50_ohlcv', 'vwap_typical_ohlcv_ohlcv', 'lowest_low_ohlcv', 'highest_high_ohlcv', 'stochastic_k_ohlcv', 'stochastic_d_ohlcv']
Cross-validation scores: [0.80987144 0.80995255 0.79624447 0.80046234 0.80042179]
Mean accuracy: 0.8033905179056658
Evaluating model
Accuracy: 0.8048018818185505
Evaluating accuracy over time


100%|██████████| 23/23 [00:12<00:00,  1.87it/s]


Saving model
Model saved to saved_models/BasicXGBOOST1_Dataset2_Featurizer2_2025-04-01_19:12:06
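
The column list above includes moving averages and RSI values derived from the close price. As a rough illustration of how such columns could be produced, here is a minimal pandas sketch. It is not the project's `Featurizer2` code: the function name `add_basic_indicators` is hypothetical, only three of the listed columns are reproduced, and the RSI uses a simple rolling mean of gains and losses (other smoothing conventions exist and give different values).

```python
import pandas as pd


def add_basic_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a few indicator columns from 'close_ohlcv'."""
    out = df.copy()
    close = out["close_ohlcv"]

    # Simple moving average over the last 5 bars (NaN until 5 bars exist).
    out["sma_5_ohlcv"] = close.rolling(window=5).mean()

    # Exponential moving average with a 12-bar span, computed recursively.
    out["ema_12_ohlcv"] = close.ewm(span=12, adjust=False).mean()

    # RSI over 14 bars, using plain rolling means of gains and losses
    # (an assumption; the pipeline may smooth differently).
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta).clip(lower=0).rolling(window=14).mean()
    rs = gain / loss
    out["rsi_14_ohlcv"] = 100 - 100 / (1 + rs)
    return out


# Toy usage: a steadily rising close price.
demo = pd.DataFrame({"close_ohlcv": [float(i) for i in range(1, 31)]})
out = add_basic_indicators(demo)
print(out[["sma_5_ohlcv", "ema_12_ohlcv", "rsi_14_ohlcv"]].tail(3))
```

Each value at a given bar depends only on that bar and earlier ones, which matters when the target lies `predict_horizon` bars in the future.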


### 3. OHLCV + Best Bid and Offer (BBO)

In [None]:
# Combine best bid/offer features with the OHLCV bars, no technical indicators
args = Namespace(
    dataset_version="Dataset2",
    data_type=["bbo", "ohlcv"],
    model="BasicXGBOOST1",
    featurizer="Featurizer2",
    seq_len=100,
    predict_horizon=10,
    make_new_features=False,
    use_kfold=True,
    use_ta=False,
)
run_pipeline(args)

Dataset Columns: ['open_ohlcv', 'high_ohlcv', 'low_ohlcv', 'close_ohlcv', 'volume_ohlcv', 'mean_bid_bbo', 'mean_ask_bbo', 'std_bid_bbo', 'std_ask_bbo', 'median_ask_bbo', 'median_bid_bbo', 'mean_spread_bbo', 'std_spread_bbo', 'median_spread_bbo', 'total_bid_size_bbo', 'total_ask_size_bbo', 'max_spread_bbo', 'min_spread_bbo', 'mean_spread_change_bbo', 'std_spread_change_bbo', 'median_spread_change_bbo', 'volume_avg_price_bbo', 'time_avg_price_bbo', 'order_book_imbalance_bbo', 'order_book_imbalance_change_bbo', 'order_book_imbalance_std_bbo', 'order_book_imbalance_median_bbo', 'order_book_imbalance_avg_bbo', 'order_book_imbalance_max_bbo', 'order_book_imbalance_min_bbo', 'order_book_imbalance_change_std_bbo', 'order_book_imbalance_change_median_bbo', 'order_book_imbalance_change_avg_bbo', 'bid_ask_ratio_bbo', 'bid_ask_ratio_change_bbo', 'bid_ask_ratio_std_bbo', 'bid_ask_ratio_median_bbo', 'bid_ask_ratio_avg_bbo', 'bid_ask_ratio_max_bbo', 'bid_ask_ratio_min_bbo', 'bid_ask_ratio_change_std_

  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


Cross-validation scores: [0.4367928  0.43898284 0.43886118 0.46623677 0.4421868 ]
Mean accuracy: 0.4446120777061281
Training model




Evaluating model
Accuracy: 0.46700734071460437
Evaluating accuracy over time


100%|██████████| 46/46 [00:46<00:00,  1.02s/it]
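
The BBO columns listed in this section summarize quote-level data (spread, sizes, order book imbalance) per bar. The sketch below shows one way such aggregates could be computed with pandas. It is an illustration under assumptions, not the project's featurizer: the function name `bbo_bar_features`, the raw quote column names (`bid_px`, `ask_px`, `bid_sz`, `ask_sz`), the 1-minute bar size, and the imbalance definition `(bid_sz - ask_sz) / (bid_sz + ask_sz)` are all hypothetical choices.

```python
import pandas as pd


def bbo_bar_features(quotes: pd.DataFrame, bar: str = "1min") -> pd.DataFrame:
    """Hypothetical helper: aggregate top-of-book quotes into per-bar features.

    Expects a DatetimeIndex and the (assumed) columns
    'bid_px', 'ask_px', 'bid_sz', 'ask_sz'.
    """
    q = quotes.copy()

    # Quoted spread and size imbalance for every quote update.
    q["spread"] = q["ask_px"] - q["bid_px"]
    q["imbalance"] = (q["bid_sz"] - q["ask_sz"]) / (q["bid_sz"] + q["ask_sz"])

    # Group quote updates into fixed-width time bars.
    grouped = q.resample(bar)
    return pd.DataFrame(
        {
            "mean_spread_bbo": grouped["spread"].mean(),
            "max_spread_bbo": grouped["spread"].max(),
            "total_bid_size_bbo": grouped["bid_sz"].sum(),
            "total_ask_size_bbo": grouped["ask_sz"].sum(),
            "order_book_imbalance_avg_bbo": grouped["imbalance"].mean(),
        }
    )


# Toy usage: three quote updates spanning two one-minute bars.
idx = pd.to_datetime(
    ["2025-01-01 00:00:10", "2025-01-01 00:00:40", "2025-01-01 00:01:05"]
)
quotes = pd.DataFrame(
    {
        "bid_px": [100.0, 100.0, 100.5],
        "ask_px": [100.5, 101.0, 101.0],
        "bid_sz": [10, 30, 20],
        "ask_sz": [30, 10, 20],
    },
    index=idx,
)
feats = bbo_bar_features(quotes)
print(feats)
```

The resulting frame is indexed by bar start time, so it could be joined to OHLCV bars on the same timestamps. A ratio feature of this kind becomes infinite or undefined when its denominator is zero, so checking the output for non-finite values before scaling is a reasonable precaution.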
