# Comprehensive Feature Engineering Pipeline

This notebook consolidates the feature engineering scripts under `ml/features` and `ml/scripts/prepare_features.py` into a single, step-by-step workflow. You can run the cells sequentially to produce daily or intraday feature sets and optionally save the results to Parquet.


## 1. Environment Setup

Configure the Python path so the existing feature engineering modules can be imported, and load standard dependencies used throughout the pipeline.


In [None]:
from __future__ import annotations

import argparse
import os
import sys
import warnings
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

import numpy as np
import pandas as pd
import psycopg2
from IPython.display import display

from dotenv import load_dotenv

load_dotenv()


def _locate_project_root(start: Path, marker: str = "trading-pipeline") -> Path:
    for candidate in [start, *start.parents]:
        if candidate.name == marker:
            return candidate
    return start


try:
    ROOT = _locate_project_root(Path(__file__).resolve())
except NameError:
    ROOT = _locate_project_root(Path.cwd().resolve())

if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print(f"Project root: {ROOT}")

Project root: /Users/mac/learning/trading-pipeline


In [21]:
from ml.features import (
    PriceFeatureEngineer,
    VolumeFeatureEngineer,
    TechnicalIndicatorsFeatureEngineer,
    NewsFeatureEngineer,
    TimeFeatureEngineer,
    CandlestickFeatureEngineer,
    ConfluenceFeatureEngineer,
    PriceFeatureConfig,
    VolumeFeatureConfig,
    TechnicalIndicatorConfig,
    NewsFeatureConfig,
    TimeFeatureConfig,
    ConfluenceConfig,
    RuleBasedSentimentModel,
    LLMSentimentModel,
    combine_sentiment_scores,
)
from ml.scripts.prepare_features import (
    prepare_daily_features,
    prepare_intraday_features,
    _connect_db,
    _load_daily_bars,
    _load_intraday_bars,
    _load_news,
    _enrich_news_with_sentiment,
)

## 2. Feature Engineering Components

The core classes and helpers live in `ml/features` and `ml/scripts/prepare_features.py`, and the cell above imports them into the notebook. The project root must therefore be on `sys.path` (see Section 1). Once the imports succeed, you can run the rest of the pipeline cell by cell.


## 3. Configure Feature Parameters

Adjust the configuration dataclasses below to tune the feature generation process. The defaults match the production script but you can modify them interactively.
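
As a minimal sketch of overriding a default, the standard-library `dataclasses.replace` builds a modified copy without touching the original instance. `DemoPriceConfig` below is a hypothetical stand-in that borrows a few field names from the `PriceFeatureConfig` repr shown in this section; it is not the real class, and the real configs may validate their fields differently.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass
class DemoPriceConfig:
    """Hypothetical stand-in for PriceFeatureConfig (illustration only)."""

    return_windows: Tuple[int, ...] = (5, 10, 20)
    sma_windows: Tuple[int, ...] = (5, 10, 20, 50, 200)
    keep_na: bool = False


defaults = DemoPriceConfig()

# replace() returns a new instance; `defaults` keeps its original values.
short_horizon = replace(defaults, sma_windows=(5, 10, 20), keep_na=True)

print(defaults.sma_windows)
print(short_horizon.sma_windows)
```

Keeping the defaults untouched makes it easy to compare a tuned run against the production settings in the same session.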


In [22]:
price_config = PriceFeatureConfig()
volume_config = VolumeFeatureConfig()
technical_config = TechnicalIndicatorConfig()
news_config = NewsFeatureConfig()
time_config = TimeFeatureConfig()
confluence_config = ConfluenceConfig()

configs = {
    "price_config": price_config,
    "volume_config": volume_config,
    "technical_config": technical_config,
    "news_config": news_config,
    "time_config": time_config,
    "confluence_config": confluence_config,
}

configs

{'price_config': PriceFeatureConfig(return_windows=(5, 10, 20), sma_windows=(5, 10, 20, 50, 200), ema_windows=(12, 26), volatility_windows=(5, 20), price_position_windows=(5, 20), true_range_window=14, keep_na=False),
 'volume_config': VolumeFeatureConfig(volume_windows=(5, 20), trend_windows=(5, 20), ratio_baseline_window=20, spike_threshold=2.0, dry_threshold=0.5),
 'technical_config': TechnicalIndicatorConfig(rsi_length=14, macd_fast=12, macd_slow=26, macd_signal=9, stochastic_k=14, stochastic_d=3, stochastic_smooth=3, bollinger_length=20, bollinger_std=2.0, atr_length=14, adx_length=14, ema_length=14, sma_length=14, wma_length=14),
 'news_config': NewsFeatureConfig(lookback_days=7, trend_min_periods=2, fill_numeric=0.0, fill_count=0),
 'time_config': TimeFeatureConfig(market_open_hour=9, market_open_minute=30, market_close_hour=16, market_close_minute=0, opening_window_minutes=30, closing_window_minutes=30, lunch_hours=(11, 14), morning_hours=(9, 12), afternoon_hours=(12, 16), sess

## 4. Database Connection

The loader utilities expect `DATABASE_URL` (or `DATABASE_URL_HOST`) to point at the Postgres instance that stores bar and news data. Set it directly in the notebook if needed.
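
The lookup described above might look roughly like the sketch below. `resolve_database_url` is a hypothetical helper, not part of `ml/scripts/prepare_features.py`, and the precedence (`DATABASE_URL` first, then `DATABASE_URL_HOST`) is an assumption; check `_connect_db` for the actual order.

```python
import os
from typing import Mapping, Optional


def resolve_database_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty connection string, or None if neither is set.

    Assumption: DATABASE_URL wins when both variables are present.
    """
    for name in ("DATABASE_URL", "DATABASE_URL_HOST"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


# Example with an explicit mapping instead of the real environment.
print(resolve_database_url({"DATABASE_URL_HOST": "postgresql://user:password@host:5432/database"}))
```

Passing the environment as a mapping keeps the helper easy to exercise without exporting real credentials.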


In [24]:
# Optional: provide credentials here if they are not already exported in the shell.
# os.environ["DATABASE_URL"] = "postgresql://user:password@host:5432/database"

try:
    conn = _connect_db()
    conn.close()
    print("✅ Database connection verified")
except Exception as exc:  # pragma: no cover - purely for interactive use
    print(f"⚠️  Unable to verify database connection: {exc}")


✅ Database connection verified


## 5. Load Raw Data

Select a ticker (or leave `None` for all tickers) and load the daily, intraday, and news records directly from the database. These helpers are the same functions the standalone script uses. Each bar loader also returns a table of detected gaps alongside the bars.


In [25]:
ticker = "AAPL"  # Replace with a specific ticker symbol or iterable of tickers

conn = _connect_db()
try:
    daily_bars, daily_gaps = _load_daily_bars(conn, ticker)
    intraday_bars, intraday_gaps = _load_intraday_bars(conn, ticker, time_config)
    news_articles = _load_news(conn, ticker)
finally:
    conn.close()

print(
    f"Loaded {len(daily_bars)} daily rows, {len(intraday_bars)} intraday rows, "
    f"and {len(news_articles)} news articles"
)

warning_tables = {
    "daily_gap_warnings": daily_gaps,
    "intraday_gap_warnings": intraday_gaps,
}
warning_tables = {name: frame for name, frame in warning_tables.items() if not frame.empty}

if warning_tables:
    print("\nWarning details:")
    for name, frame in warning_tables.items():
        print(f"- {name}: {len(frame)} rows flagged")
        display(frame.head())

    daily_bars.attrs.setdefault("warnings", {})["daily_gap_warnings"] = warning_tables.get("daily_gap_warnings")
    intraday_bars.attrs.setdefault("warnings", {})["intraday_gap_warnings"] = warning_tables.get("intraday_gap_warnings")


daily_bars.head()


⚠️  AAPL: detected 13 daily gaps (>3 days) at row(s) [31, 35, 44, 68, 96, 136, 203, 298, 317, 360, 385, 412, 452]
⚠️  AAPL: detected 678 intraday gaps (>15 minutes) at row(s) [187, 373, 378, 380, 549, 572, 726, 743, 905, 915, 920, 1085, 1269, 1453, 1460, 1461, 1462, 1627, 1787, 1789, 1792, 1796, 1797, 1802, 1803, 1892, 1896, 1898, 1909, 1915, 2074, 2253, 2433, 2441, 2445, 2450, 2603, 2618, 2623, 2773, 2794, 2948, 3136, 3150, 3315, 3337, 3340, 3485, 3490, 3493, 3656, 3843, 3871, 4026, 4043, 4206, 4395, 4580, 4769, 4770, 4944, 5132, 5315, 5319, 5324, 5492, 5504, 5506, 5509, 5665, 5675, 5836, 5858, 6008, 6013, 6016, 6018, 6172, 6363, 6551, 6740, 6927, 7117, 7297, 7485, 7670, 7854, 8042, 8227, 8417, 8606, 8791, 8808, 8976, 9161, 9184, 9348, 9531, 9714, 9903, 10093, 10277, 10469, 10661, 10680, 10844, 11027, 11035, 11198, 11204, 11208, 11211, 11370, 11390, 11551, 11561, 11732, 11918, 12106, 12298, 12314, 12485, 12515, 12670, 12857, 13041, 13051, 13221, 13248, 13404, 13596, 13783, 13973, 1416



Unnamed: 0,ticker,row_index,previous_date,current_date,gap_days
0,AAPL,31,2023-12-22,2023-12-26,4.0
1,AAPL,35,2023-12-29,2024-01-02,4.0
2,AAPL,44,2024-01-12,2024-01-16,4.0
3,AAPL,68,2024-02-16,2024-02-20,4.0
4,AAPL,96,2024-03-28,2024-04-01,4.0




Unnamed: 0,ticker,row_index,previous_timestamp,current_timestamp,gap_seconds
0,AAPL,187,2023-11-10 00:55:00,2023-11-10 09:00:00,29100.0
1,AAPL,373,2023-11-11 00:55:00,2023-11-13 09:00:00,201900.0
2,AAPL,378,2023-11-13 09:25:00,2023-11-13 09:50:00,1500.0
3,AAPL,380,2023-11-13 10:00:00,2023-11-13 10:20:00,1200.0
4,AAPL,549,2023-11-14 00:55:00,2023-11-14 09:00:00,29100.0


Unnamed: 0,ticker,date,open,high,low,close,volume,transactions,volume_weighted_avg_price,is_monday,is_friday,days_since_prev_close
0,AAPL,2023-11-09,182.96,184.12,181.81,182.41,53763540,545660,182.9116,0,0,0
1,AAPL,2023-11-10,183.97,186.565,183.53,186.4,66177922,610938,185.4104,0,1,1
2,AAPL,2023-11-13,185.82,186.03,184.21,184.8,43627519,530407,184.8317,1,0,3
3,AAPL,2023-11-14,187.7,188.11,186.3,187.44,60108378,609218,187.2038,0,0,1
4,AAPL,2023-11-15,187.845,189.5,187.78,188.01,53790499,564160,188.4206,0,0,1
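
The calendar columns in the table above (`is_monday`, `is_friday`, `days_since_prev_close`) can be reproduced with a short sketch. This is an illustration in plain Python, not the loader's actual code; `calendar_flags` is a hypothetical name, and the first row's value of 0 for `days_since_prev_close` is taken from the output shown here.

```python
from datetime import date
from typing import Dict, List


def calendar_flags(dates: List[date]) -> List[Dict[str, int]]:
    """Derive weekday flags and the day count since the previous bar."""
    rows = []
    previous = None
    for current in dates:
        rows.append(
            {
                "is_monday": int(current.weekday() == 0),  # Monday is 0
                "is_friday": int(current.weekday() == 4),  # Friday is 4
                # The first bar has no predecessor, so it gets 0.
                "days_since_prev_close": 0 if previous is None else (current - previous).days,
            }
        )
        previous = current
    return rows


flags = calendar_flags([date(2023, 11, 9), date(2023, 11, 10), date(2023, 11, 13)])
for row in flags:
    print(row)
```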


In [26]:
daily_gaps

Unnamed: 0,ticker,row_index,previous_date,current_date,gap_days
0,AAPL,31,2023-12-22,2023-12-26,4.0
1,AAPL,35,2023-12-29,2024-01-02,4.0
2,AAPL,44,2024-01-12,2024-01-16,4.0
3,AAPL,68,2024-02-16,2024-02-20,4.0
4,AAPL,96,2024-03-28,2024-04-01,4.0
5,AAPL,136,2024-05-24,2024-05-28,4.0
6,AAPL,203,2024-08-30,2024-09-03,4.0
7,AAPL,298,2025-01-17,2025-01-21,4.0
8,AAPL,317,2025-02-14,2025-02-18,4.0
9,AAPL,360,2025-04-17,2025-04-21,4.0
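
The warning text reports daily gaps of more than 3 days, and every row listed is a 4-day span, which is consistent with a weekend plus one market holiday. The sketch below shows one way such a check could work. `find_daily_gaps` is a hypothetical helper written in plain Python; the real `_load_daily_bars` implementation may differ.

```python
from datetime import date
from typing import Dict, List


def find_daily_gaps(dates: List[date], max_gap_days: int = 3) -> List[Dict[str, object]]:
    """Flag consecutive bars that are more than `max_gap_days` apart."""
    gaps = []
    for row_index in range(1, len(dates)):
        previous, current = dates[row_index - 1], dates[row_index]
        gap_days = (current - previous).days
        if gap_days > max_gap_days:
            gaps.append(
                {
                    "row_index": row_index,
                    "previous_date": previous,
                    "current_date": current,
                    "gap_days": gap_days,
                }
            )
    return gaps


holiday_week = [date(2023, 12, 21), date(2023, 12, 22), date(2023, 12, 26), date(2023, 12, 27)]
print(find_daily_gaps(holiday_week))
```

With the threshold at 3 days, an ordinary Friday-to-Monday weekend is not flagged, so only holiday weekends and genuinely missing data show up.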


In [33]:
intraday_gaps[(intraday_gaps.gap_seconds != 201900.0) & (intraday_gaps.gap_seconds != 29100.0)]

Unnamed: 0,ticker,row_index,previous_timestamp,current_timestamp,gap_seconds
2,AAPL,378,2023-11-13 09:25:00,2023-11-13 09:50:00,1500.0
3,AAPL,380,2023-11-13 10:00:00,2023-11-13 10:20:00,1200.0
5,AAPL,572,2023-11-14 11:15:00,2023-11-14 12:00:00,2700.0
7,AAPL,743,2023-11-15 10:55:00,2023-11-15 11:15:00,1200.0
9,AAPL,915,2023-11-16 09:50:00,2023-11-16 10:10:00,1200.0
...,...,...,...,...,...
662,AAPL,90143,2025-10-23 10:35:00,2025-10-23 10:55:00,1200.0
669,AAPL,91434,2025-10-31 23:55:00,2025-11-03 09:00:00,205500.0
672,AAPL,91821,2025-11-05 09:45:00,2025-11-05 10:05:00,1200.0
674,AAPL,92008,2025-11-06 10:05:00,2025-11-06 10:30:00,1500.0
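
Filtering on exact values removes only the two most common overnight and weekend spans. Row 669 in the output above (2025-10-31 23:55 to 2025-11-03 09:00, 205500 seconds) also spans a weekend but passes the filter because its duration differs. A duration threshold is one alternative; the sketch below uses an assumed 4-hour cut-off, which is an illustrative choice rather than a value taken from the pipeline.

```python
from typing import List

# Assumed cut-off: anything longer is treated as an overnight or weekend break.
MAX_INTRASESSION_SECONDS = 4 * 60 * 60


def intrasession_gaps(
    gap_seconds: List[float],
    max_seconds: float = MAX_INTRASESSION_SECONDS,
) -> List[float]:
    """Keep only gaps short enough to have occurred inside one session."""
    return [gap for gap in gap_seconds if gap <= max_seconds]


# Durations copied from the gap tables in this section.
observed = [29100.0, 201900.0, 1500.0, 1200.0, 2700.0, 205500.0]
print(intrasession_gaps(observed))
```

The equivalent pandas filter would be `intraday_gaps[intraday_gaps.gap_seconds <= 4 * 3600]`.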


In [34]:
intraday_bars.iloc[370:390]

Unnamed: 0,ticker,timestamp,open,high,low,close,volume,transactions,volume_weighted_avg_price,date,is_monday,is_friday,seconds_since_prev_bar
370,AAPL,2023-11-11 00:45:00+00:00,186.0,186.02,186.0,186.02,582,27,186.017,2023-11-11 00:00:00+00:00,0,0,300
371,AAPL,2023-11-11 00:50:00+00:00,186.05,186.05,186.05,186.05,1157,33,186.0457,2023-11-11 00:00:00+00:00,0,0,300
372,AAPL,2023-11-11 00:55:00+00:00,186.02,186.26,186.02,186.26,6316,125,186.1469,2023-11-11 00:00:00+00:00,0,0,300
373,AAPL,2023-11-13 09:00:00+00:00,185.35,185.63,185.35,185.63,4829,196,185.5468,2023-11-13 00:00:00+00:00,1,0,201900
374,AAPL,2023-11-13 09:05:00+00:00,185.56,185.58,185.48,185.58,6884,156,185.5374,2023-11-13 00:00:00+00:00,1,0,300
375,AAPL,2023-11-13 09:15:00+00:00,185.51,185.63,185.51,185.63,2811,100,185.5805,2023-11-13 00:00:00+00:00,1,0,600
376,AAPL,2023-11-13 09:20:00+00:00,185.66,185.7,185.66,185.7,1304,41,185.6726,2023-11-13 00:00:00+00:00,1,0,300
377,AAPL,2023-11-13 09:25:00+00:00,185.66,185.66,185.5,185.51,4056,80,185.5878,2023-11-13 00:00:00+00:00,1,0,300
378,AAPL,2023-11-13 09:50:00+00:00,185.59,185.61,185.59,185.61,1141,28,185.6051,2023-11-13 00:00:00+00:00,1,0,1500
379,AAPL,2023-11-13 10:00:00+00:00,185.6,185.6,185.55,185.55,762,38,185.5793,2023-11-13 00:00:00+00:00,1,0,600


## 6. Enrich News Sentiment (Optional)

Combine vendor, rule-based, and LLM sentiment scores. The two flags in the next cell control the models: `use_rule_sentiment` toggles the rule-based model, and `llm_model_name` selects the LLM model.
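
To make the idea of combining scores concrete, here is a minimal sketch that averages whichever scores are present and falls back to a neutral 0.0 when none are. This is an assumption for illustration only: `average_available_scores` is a hypothetical function, and the imported `combine_sentiment_scores` may use a different signature, weighting, or fallback.

```python
from typing import Optional, Sequence


def average_available_scores(scores: Sequence[Optional[float]]) -> float:
    """Mean of the scores that are present; 0.0 (neutral) when none are."""
    present = [score for score in scores if score is not None]
    if not present:
        return 0.0
    return sum(present) / len(present)


# Vendor score present, rule-based score missing, LLM score present.
print(average_available_scores([0.5, None, -0.1]))
```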


In [None]:
use_rule_sentiment = True
llm_model_name = "finbert"  # e.g. "ProsusAI/finbert"

if news_articles.empty:
    enriched_news = news_articles.copy()
else:
    enriched_news = _enrich_news_with_sentiment(
        news_articles,
        llm_model_name=llm_model_name,
        use_rule_sentiment=use_rule_sentiment,
    )

enriched_news.head()


In [None]:
enriched_news[enriched_news.sentiment_score != 0.0]

## 7. Build Daily Feature Set

Run the individual feature engineers (price, volume, technical, news, confluence) explicitly to inspect intermediate outputs before combining them.


In [None]:
price_engineer = PriceFeatureEngineer(price_config)
volume_engineer = VolumeFeatureEngineer(volume_config)
technical_engineer = TechnicalIndicatorsFeatureEngineer(technical_config)
news_engineer = NewsFeatureEngineer(news_config)
confluence_engineer = ConfluenceFeatureEngineer(
    price_config=price_config,
    volume_config=volume_config,
    technical_config=technical_config,
    news_config=news_config,
    confluence_config=confluence_config,
)

price_features = price_engineer.create_features(daily_bars)
volume_features = volume_engineer.create_features(price_features)
technical_features = technical_engineer.create_features(price_features)

daily_with_news = news_engineer.create_features(enriched_news, price_features[["ticker", "date"]])

confluence_features = confluence_engineer.create_features(
    price_features,
    intraday_ohlcv=None,
    news_df=enriched_news if not enriched_news.empty else None,
)

confluence_features.head()


### Quick Daily Pipeline Helper

If you prefer the scripted behaviour, call `prepare_daily_features` directly. This wraps the steps above and optionally writes the result to disk.


In [None]:
daily_feature_frame = prepare_daily_features(
    ticker=ticker,
    price_config=price_config,
    volume_config=volume_config,
    technical_config=technical_config,
    news_config=news_config,
    confluence_config=confluence_config,
    llm_model_name=llm_model_name,
    use_rule_sentiment=use_rule_sentiment,
)

daily_feature_frame.head()


## 8. Build Intraday Feature Set

Create candlestick, volume, and time-of-day features for intraday bars. Optionally limit the session or flag missing neighbours.
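
As a rough illustration of time-of-day features, the sketch below flags the opening and closing windows using the `TimeFeatureConfig` defaults printed in Section 3 (open 09:30, close 16:00, 30-minute windows). The function name, the output keys, the half-open interval boundaries, and the use of naive exchange-local timestamps are all assumptions; `TimeFeatureEngineer` may behave differently.

```python
from datetime import datetime, time, timedelta
from typing import Dict

# Values copied from the TimeFeatureConfig defaults shown earlier.
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)
OPENING_WINDOW = timedelta(minutes=30)
CLOSING_WINDOW = timedelta(minutes=30)


def session_flags(timestamp: datetime) -> Dict[str, int]:
    """Flag session membership for a naive, exchange-local timestamp."""
    open_at = datetime.combine(timestamp.date(), MARKET_OPEN)
    close_at = datetime.combine(timestamp.date(), MARKET_CLOSE)
    return {
        "in_session": int(open_at <= timestamp < close_at),
        # Assumed half-open intervals: start inclusive, end exclusive.
        "is_opening_window": int(open_at <= timestamp < open_at + OPENING_WINDOW),
        "is_closing_window": int(close_at - CLOSING_WINDOW <= timestamp < close_at),
    }


print(session_flags(datetime(2023, 11, 13, 9, 45)))
```

The intraday bars shown earlier carry UTC timestamps, so they would need converting to exchange-local time before a check like this applies.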


In [None]:
intraday_feature_frame = prepare_intraday_features(
    ticker=ticker,
    volume_config=volume_config,
    time_config=time_config,
)

intraday_feature_frame.head()
