Feature engineering

The goal of feature engineering is to create derived variables that help a strategy identify patterns in price movements, such as momentum, volatility, and overbought/oversold signals.

Since the assets I am using are FX pairs, I used ChatGPT to help identify the features each strategy needs. The ones used here are the following:

Features required by each strategy
| **Strategy**                 | **Required Features**                         | **How it’s used**                                                                                     |
| ---------------------------- | --------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **Moving Average Crossover** | SMA(50), SMA(200)                             | Buy when SMA(50) > SMA(200), sell when SMA(50) < SMA(200)                                             |
| **Momentum Strategy**        | Daily return, cumulative return (n days)      | Rank assets or decide buy/sell based on positive momentum                                             |
| **Mean Reversion**           | RSI(14), Bollinger Bands (20-day SMA ± 2 std) | Buy when oversold (RSI<30 or price < lower band), sell when overbought (RSI>70 or price > upper band) |

Notice that these overlap, so you can compute all the features efficiently in a single pipeline.
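
The rules in the table can be sketched as signal columns. This is only an illustration on synthetic prices, not the backtest logic of this project: the helper name `add_signals`, the column names, and the +1 / 0 / -1 encoding are assumptions, and RSI is left out so the sketch needs nothing beyond pandas and numpy.

```python
import numpy as np
import pandas as pd

def add_signals(close: pd.Series, fast: int = 50, slow: int = 200,
                mom_lookback: int = 20, bb_window: int = 20) -> pd.DataFrame:
    """Hypothetical helper: turn a close series into +1 / 0 / -1 signals."""
    out = pd.DataFrame({"Close": close})

    # Moving average crossover: +1 when the fast SMA is above the slow SMA.
    sma_fast = close.rolling(fast).mean()
    sma_slow = close.rolling(slow).mean()
    out["MA_Signal"] = np.sign(sma_fast - sma_slow)

    # Momentum: +1 when the n-day return is positive, -1 when negative.
    out["Mom_Signal"] = np.sign(close / close.shift(mom_lookback) - 1)

    # Mean reversion: buy below the lower band, sell above the upper band.
    mid = close.rolling(bb_window).mean()
    std = close.rolling(bb_window).std(ddof=0)
    out["MR_Signal"] = 0.0
    out.loc[close < mid - 2 * std, "MR_Signal"] = 1.0
    out.loc[close > mid + 2 * std, "MR_Signal"] = -1.0
    return out

# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(1.10 + rng.normal(0, 0.005, 300).cumsum(),
                  index=pd.bdate_range("2022-01-03", periods=300))
signals = add_signals(close)
print(signals.dropna().tail())
```

The crossover and momentum columns stay NaN until their windows are filled, which is why the printout drops those rows first.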


**IMPORTANT**

Redundant or irrelevant features increase the model’s complexity and training time without adding new information.
In finance, where signals are already weak, this leads to:

- Overfitting on random correlations.
- Degraded out-of-sample performance (your backtest looks good but live trading fails).
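
One rough way to spot features that add little new information is to look at pairwise correlations. The sketch below uses synthetic prices; the helper name `highly_correlated_pairs` and the 0.95 threshold are assumptions, not a rule from this project.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.95):
    """Hypothetical helper: list feature pairs whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    found = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                found.append((cols[i], cols[j], round(value, 3)))
    return found

# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(1)
close = pd.Series(1.10 + rng.normal(0, 0.005, 500).cumsum())

features = pd.DataFrame({
    "SMA_50": close.rolling(50).mean(),
    "SMA_55": close.rolling(55).mean(),  # nearly a duplicate of SMA_50
    "Return": close.pct_change(),
}).dropna()

redundant = highly_correlated_pairs(features)
for a, b, c in redundant:
    print(f"{a} vs {b}: |corr| = {c}")
```

A pair flagged here is a candidate for dropping one of the two columns. Correlation only catches linear redundancy, so treat it as a first filter.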



In [8]:
import pandas as pd
import numpy as np
import ta #technical analysis library


In [12]:
#load the cleaned csv file

df = pd.read_csv('/Users/akilfiros/Desktop/Projects/Side Projects /Quant-Backtesting/Data/market_data_cleaned.csv', parse_dates=['Date.'],index_col='Date.')

print(df.columns.tolist())


['EURUSD=X.Open', 'EURUSD=X.High', 'EURUSD=X.Low', 'EURUSD=X.Close', 'EURUSD=X.Volume', 'EURUSD=X.Dividends', 'EURUSD=X.Stock Splits', 'GBPUSD=X.Open', 'GBPUSD=X.High', 'GBPUSD=X.Low', 'GBPUSD=X.Close', 'GBPUSD=X.Volume', 'GBPUSD=X.Dividends', 'GBPUSD=X.Stock Splits']


In [13]:
#defining the tickers

tickers = ['EURUSD=X','GBPUSD=X'] #with only two assets, listing them by hand makes little difference; with many pairs, use the multi-pair code further down, which identifies the FX pairs automatically.

In [14]:
#loop through each ticker and then compute the features
for ticker in tickers:
    close_col = f"{ticker}.Close"
    close = df[close_col]

    #features to be added below

    # Returns
    df[f'{ticker}_Return'] = close.pct_change() #gives the daily returns, the building blocks for both momentum and volatility
    df[f'{ticker}_CumReturn'] = (1 + df[f'{ticker}_Return']).cumprod() - 1

    # Moving Averages (for crossover)
    df[f'{ticker}_SMA_50'] = close.rolling(window=50).mean() #smooths price action for your moving average crossover
    df[f'{ticker}_SMA_200'] = close.rolling(window=200).mean()

    # Momentum (20-day lookback)
    lookback = 20 #helps detect sustained trends (e.g., 20-day winners vs losers)
    df[f'{ticker}_Momentum'] = close / close.shift(lookback) - 1

    # RSI (14-day)
    df[f'{ticker}_RSI_14'] = ta.momentum.RSIIndicator(close, window=14).rsi() #provide mean reversion signals directly

    # Bollinger Bands
    # Bollinger Bands are a volatility-based indicator built around a moving average
    #
    # They consist of three lines:
    # 1. Middle Band → usually a 20-day Simple Moving Average (SMA) of price.
    # 2. Upper Band (High) → SMA + 2 × standard deviation of the last 20 prices.
    # 3. Lower Band (Low) → SMA − 2 × standard deviation of the last 20 prices.
    #
    # The idea:
    # - When price moves above the upper band, it’s considered overbought (possibly due for a fall).
    # - When price moves below the lower band, it’s oversold (possibly due for a bounce).
    # - The distance between bands (the “band width”) expands when volatility rises and contracts when volatility falls.

    boll = ta.volatility.BollingerBands(close, window=20, window_dev=2) #provide mean reversion signals directly
    df[f'{ticker}_BB_High'] = boll.bollinger_hband()
    #BB_High=SMA20+2×rolling std deviation
    #tells me the upper price boundary within which ~95% of recent price action should fall if prices are normally distributed.
    #Used in trading logic as an overbought threshold.
    df[f'{ticker}_BB_Low'] = boll.bollinger_lband()
    # BB_Low=SMA20−2×rolling std deviation
    # gives the lower boundary of the expected price range.
    # Used as an oversold threshold — e.g., if price < BB_Low, your mean-reversion strategy might trigger a buy signal.
    df[f'{ticker}_BB_Width'] = df[f'{ticker}_BB_High'] - df[f'{ticker}_BB_Low']
    # measures the width (or spread) between the upper and lower bands
    # It’s a proxy for volatility — when markets are calm, BB_Width is small; when volatile, it expands.
    # can use this to filter signals — e.g., “only trade when volatility is low, because reversions are more stable.”

# With only one ticker you do not need the for loop above: the close column is
# all that feature engineering needs here, so the code reduces to:
#
# close = df['EURUSD=X.Close']
#
# # Returns
# df['Return'] = close.pct_change()
# df['Cum_Return'] = (1 + df['Return']).cumprod() - 1
#
# # Moving Averages (for crossover)
# df['SMA_50'] = close.rolling(window=50).mean()
# df['SMA_200'] = close.rolling(window=200).mean()
#
# # Momentum (n-day lookback)
# lookback = 20
# df['Momentum'] = close / close.shift(lookback) - 1  # 20-day price change %
#
# # RSI (for mean reversion)
# df['RSI_14'] = ta.momentum.RSIIndicator(close, window=14).rsi()
#
# # Bollinger Bands (for mean reversion)
# boll = ta.volatility.BollingerBands(close, window=20, window_dev=2)
# df['BB_High'] = boll.bollinger_hband()
# df['BB_Low'] = boll.bollinger_lband()
# df['BB_Width'] = df['BB_High'] - df['BB_Low']
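
The band formulas in the comments above can be reproduced with plain pandas, which is a handy sanity check on the library output. This sketch runs on synthetic prices and assumes the bands use the population standard deviation (`ddof=0`), which is my understanding of how `ta` computes them; if that assumption is wrong, the values will differ slightly from the library's.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(2)
close = pd.Series(1.10 + rng.normal(0, 0.005, 60).cumsum())

window, n_std = 20, 2
mid = close.rolling(window).mean()
std = close.rolling(window).std(ddof=0)  # assumption: population std

bands = pd.DataFrame({
    "BB_Mid": mid,
    "BB_High": mid + n_std * std,
    "BB_Low": mid - n_std * std,
})
# The width is the distance between the bands, i.e. 2 * n_std * std.
bands["BB_Width"] = bands["BB_High"] - bands["BB_Low"]

print(bands.dropna().head())
```

Because the width is just a multiple of the rolling standard deviation, it carries the same information as a 20-day volatility estimate of the price level.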

In [15]:
# Drop NaN values created by rolling calculations
df.dropna(inplace=True)
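
Dropping NaNs removes the warm-up rows of the longest rolling window, so with a 200-day SMA the first 199 rows disappear. A small synthetic sketch (the frame `df_demo` and its column names are illustrative, not the project's data) makes the cost visible:

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(3)
idx = pd.bdate_range("2022-01-03", periods=300)
df_demo = pd.DataFrame({"Close": 1.10 + rng.normal(0, 0.005, 300).cumsum()}, index=idx)

df_demo["SMA_50"] = df_demo["Close"].rolling(50).mean()
df_demo["SMA_200"] = df_demo["Close"].rolling(200).mean()

print(df_demo.isna().sum())  # NaN count per column before dropping

rows_before = len(df_demo)
df_clean = df_demo.dropna()
print(f"Dropped {rows_before - len(df_clean)} of {rows_before} rows")
```

Note that `dropna()` removes a row if any column is NaN, so a single sparse column (for example an unused volume or dividends field) can shrink the dataset far more than the rolling windows do.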

In [16]:
#save to new csv file

df.to_csv('/Users/akilfiros/Desktop/Projects/Side Projects /Quant-Backtesting/Data/market_data_features.csv')
print("Feature engineering complete. Saved to 'market_data_features.csv'.")

Feature engineering complete. Saved to 'market_data_features.csv'.


--------------------------


The following cells are for the case where the FX variable (the asset) covers multiple currency pairs that you want to compare.

If you are modelling just a single pair, keep only the features for that pair. Say you want to train a model for EURUSD alone: prefixed features such as GBPUSD_SMA50 or USDJPY_RSI14 will either be entirely NaN (because those pairs are not loaded) or be unrelated to EURUSD's behaviour. In both cases the model learns nothing useful from them.

Even worse, they add noise and reduce generalization, especially for ML models such as Random Forest and XGBoost.

When do multi-pair features make sense?

Only add features from multiple pairs if:
- You believe there is cross-correlation or contagion (e.g., EURUSD moves with GBPUSD), and
- Your strategy explicitly exploits that, e.g. statistical arbitrage or pairs trading.

Then you can safely include correlated-pair features like GBPUSD_Return or EURGBP_RSI.
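
Before adding another pair's features, it can help to check whether the two pairs actually move together. A minimal sketch with synthetic returns that share a common driver (the 60-day window, the noise levels, and the series names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 500
common = rng.normal(0, 0.004, n)  # shared driver, e.g. broad USD moves

# Each pair's return = shared driver + its own independent noise.
eur_ret = pd.Series(common + rng.normal(0, 0.002, n), name="EURUSD_Return")
gbp_ret = pd.Series(common + rng.normal(0, 0.002, n), name="GBPUSD_Return")

# Rolling 60-day correlation between the two return series.
rolling_corr = eur_ret.rolling(60).corr(gbp_ret)
print(rolling_corr.describe())
```

If the rolling correlation on real data is unstable or close to zero, cross-pair features are more likely to add noise than signal.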


In [7]:
# #load the cleaned csv file
#
# df = pd.read_csv('/Users/akilfiros/Desktop/Projects/Side Projects /Quant-Backtesting/Data/market_data_cleaned.csv', parse_dates=['Date.'],index_col='Date.')

# #identify the pair of FX you want to compare automatically
#
# pairs = sorted({col.split('.')[0] for col in df.columns if '.Close' in col})
# print(f"Detected pairs: {pairs}")
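
As a quick check of the detection logic, here it is applied to a subset of the column names printed earlier in this notebook. This variant uses `endswith` and `rsplit`, which is slightly stricter than the `in` / `split` version above; that choice is mine, not something the original code requires.

```python
columns = [
    "EURUSD=X.Open", "EURUSD=X.High", "EURUSD=X.Low", "EURUSD=X.Close",
    "GBPUSD=X.Open", "GBPUSD=X.High", "GBPUSD=X.Low", "GBPUSD=X.Close",
]

# Keep the part before '.Close'; the set removes duplicates and sorted() fixes the order.
pairs = sorted({col.rsplit(".", 1)[0] for col in columns if col.endswith(".Close")})
print(f"Detected pairs: {pairs}")  # Detected pairs: ['EURUSD=X', 'GBPUSD=X']
```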

In [10]:
# #create an empty list to collect the processed dataframe
#
# processed_df = []

In [19]:
# #function that builds the features for a single pair
#
# def generate_features(df,pair):
#     close = df[f"{pair}.Close"]
#
#     #create a copy to hold the pair specific features
#     feat = pd.DataFrame(index=df.index)
#     feat[f"{pair}_Close"] = close
#
#     #Returns
#     feat[f"{pair}_Return"] = close.pct_change()
#     feat[f"{pair}_CumReturn"] = (1+feat[f"{pair}_Return"]).cumprod() - 1
#
#     #Moving Averages (for cross over)
#     feat[f"{pair}_SMA_50"] = close.rolling(window=50).mean() #for 50 days
#     feat[f"{pair}_SMA_200"] = close.rolling(window=200).mean() # for 200 days
#
#     #Momentum (20-days)
#     lookback = 20
#     feat[f"{pair}_Momentum"] = close/close.shift(lookback) - 1
#
#     #RSI (14-day)
#     feat[f"{pair}_RSI_14"] = ta.momentum.RSIIndicator(close,window=14).rsi()
#
#     # Bollinger Bands
#     boll = ta.volatility.BollingerBands(close, window=20, window_dev=2)
#     feat[f"{pair}_BB_High"] = boll.bollinger_hband()
#     feat[f"{pair}_BB_Low"] = boll.bollinger_lband()
#
#     return feat
#
# #Loop over all the pairs
# for pair in pairs:
#     pair_feature =generate_features(df,pair)
#     processed_df.append(pair_feature)
#
# #combine all features horizontally
# features_df = pd.concat(processed_df, axis=1)
#
# #drop NaN Values from rolling Calculations
# features_df.dropna(inplace=True)
#
# #save to csv file
# features_df.to_csv("/Users/akilfiros/Desktop/Projects/Side Projects /Quant-Backtesting/Data/market_data_features.csv")
# print("✅ Multi-pair feature engineering complete. Saved to 'market_data_features.csv'.")

#have not run the code yet

# What this does (according to ChatGPT):
#
# - Auto-detects all FX pairs by scanning column names like EURUSD=X.Close.
#
# - For each pair, it computes:
#   Returns
#   Cumulative Return
#   50 & 200-day SMA
#   Momentum (20-day)
#   RSI (14-day)
#   Bollinger Bands (20-day ± 2 std)
#
# - Saves a unified DataFrame containing all pairs and all features.

# expected output columns, according to ChatGPT:
# EURUSD=X_Close, EURUSD=X_Return, EURUSD=X_SMA_50, ...,
# GBPUSD=X_Close, GBPUSD=X_Return, GBPUSD=X_SMA_50, ...
# USDJPY=X_Close, ...
