# A Research Framework for Market-Neutral Mean Reversion Strategies on the SPY ETF

# I. Introduction & Theoretical Foundations

The development of systematic trading strategies often begins with a foundational question: where, in the vast and interconnected global markets, can a persistent, exploitable edge be found? This question becomes particularly acute when the target instrument is the SPDR S&P 500 ETF Trust (SPY), arguably one of the most liquid, transparent, and efficient financial instrument in the world. The very characteristics that make SPY an ideal vehicle for passive investment and market exposure - its high trading volume, tight bid-ask spreads, and direct tracking of a major benchmark - also suggest that any informational inefficiencies should be arbitraged away almost instantaneously. The Efficient Market Hypothesis (EMH), in its semi-strong form, posits that all publicly avaliable information is already reflected in an asset's price, making it impossible to consistently generate excess returns.

However, the premise of this report is that while the market for SPY may approach perfect efficiency over longer time horizons, it is subject to short-term, transient inefficiencies. These temporary dislocation from a short-term equilibrium are not necessarily failures of the market. This report establishes a rigorous, reproducible, and extensible framework for identifying, implementing, and validating a trading strategy designed to systematically profit from these inefficiencies. The initial focus is constrained to an "intra-ETF", market-neutral approach on the SPY ETF using daily historical data, with the explicit goal of building a process that is as valuable as the strategy it produces.

# Behavioral Underpinnings of Market-Neutral Strategies

The theoretical foundation for short-term inefficiencies in highly liquid markets is largely rooted in the field of behavioral finance, which proposes that psychological influences and biases systematically affect the financial behaviors of investors. These predictable patterns of irrationality, when aggregated across millions of market participants, can create temporary mispricings even in an instrument as widely followed as SPY.

A primary driver of these mispricings is the tendency for investors to overreact to recent news, price movements, or salient events. This behavior stems from several well-documented cognitive biases. The availability heuristic, for instance, leads individuals to overweight information that is easily recalled, such as dramatic recent price drops or surges. This is compounded by overconfidence, where investors overestimate their ability to interpret information predict future outcomes, leading to excessively large trades based on recent trends. When a stock or index experiences a sharp move, these biases can trigger a cascade of similar actions, a phenomenon known as herd behavior. Investors mimic the action of the majority, amplifying the initial price swing and pushing the asset's price beyond a level justified by its fundamentals. this collective overestension creates the conditions for a subsequent correction, or mean reversion, as the initial emotional reaction subsides and a more rational assessment of value prevails. A market-neutral mean-reversion strategy is designed to systematically capitalise on this cycle of overreaction and correction by taking positions contrary to the prevailing short-term momentum.

The very structure of the SPY ETF, designed for maximum liquidity, paradoxically contributes to these behavioral-driven price patterns. Because SPY is one of the most actively traded securities globally, it serves as the primary vehicle for a vast ecosystem of traders to express short-term market views, implement hedges, and react to macroeconomic news. This constant flow of orders, driven by a heterogeneous mix of participants with varying objectives and time horizons - from long-term institutional investors to high-frequency arbitrageurs - introduces a significant amount of "non-fundamental" volatility into the price series. The high liquidity that suggests market efficiency is, therefore, also a primary source of the noise and temporary dislocations from which a mean-reversion strategy seeks to profit. The strategy, ineffect, acts as a liquidity provider, absorbing the temporary supply/demand imbalances created by more reactive market participants (or aggresive algos) and profiting from the eventual stabilisation of the price.

# Structural and Microstructure Inefficiences

Beyond purely behavioral factors, structural elements of the ETF market itself can contribute to predictable price patterns. The primary mechanism ensuring an ETF's price stays aligned with its underlying Net Asset Value (NAV) is the creation/redemption process, managed by Authorised Participants (APs). When an ETF's market price deviates significantly from its NAV, APs have an incentive to perform arbitrage by creating or redeeming large blocks of ETF shares in exchange for the underlying basket of securities, a process that pushes the market price back toward the NAV. However, this arbitrage is not frictionless; it involves transactional cost, operational complexities, and risk. Consequently, the ETF price can fluctuate within a narrow "arbitrage band" around its NSV without triggering a corrective action from SPs. The oscillations within this band can present opportunities for systematic trading strategies.

Furthermore, while the initial scope of this report is limited to daily data, it is important to acknowledge the well-documented existence of intraday patterns in trading activity. Market microstructure studies have consistently identified patterns in volume, volatility, and bid-ask spreads, such as the characteristic "J-shaped" or "U-shaped" curve where spreads are widest at the market open and close and narrowest during midday. These patterns, driven by the strategic behavior of informed and uninformed traders throughout the trading day, represent another class of structural inefficiency. A strategy successfully developed on daily data can often be adapted to a higher frequency to exploit these intraday regularities, providing a clear and logical path for future research and strategy enhancement.

# Defining the Research Objective

The primary objective of this report is to construct and validate a complete, institutional-grade quantitative research framework. This framework will serve as a template for the systematic development of market-neutral strategies. The emphasis is placed squarely on the integrity and reproducibility of the research process itself: starting with a clear, testable hypothesis grounded in financial theory, followed by rigorous implementatio, robust validation against overfitting, and a clear-eyed assessment of performance and avenues for improvement. The specific strategy developed - a volatility-normalised mean-reversion model - serves as the initial application of this framework, demonstrating its capabilities from conception to deployment readiness.

# II. A Survey a Candidate Intra-ETF Strategies

Before committing to a full implementation, it is crucial to survey the landscape of potential strategies that fit the "intra-ETF, market neutral" mandate. This process involves formulating distinct hypothesis, each grounded in the theoretical principles of behavioral finance or market structure, and evaluating their suitability against the project's constraints. Three primary hypothesis are considered below:

Hypothesis 1: Volatility-Normalised Mean Reversion (Behavioral Overreaction)
- Rationale: This strategy is founded on the behavioral premise of investor overreaction. It seeks to identify statistically significant, short-term price extensions and systematically bet on their reversion to a recent mean. The core mechanism involves using technical oscillators like the Relative Strength Index (RSI) or statistical measures like Z-scores to quantify when SPY is "overbought" or "oversold" relative to iyts recent history. A signal is generated to short the ETF in overbought conditions and buy it in oversold conditions.

- Key Feature: The critical innovation of this approach is the " volatility-normalised" compoenent. Static thresholds (e.g., "sell when RSI > 70") are notoriously unreliable because the behavior of both price and indicators changes with market volatility. This strategy adapts to the prevailing market regime by using a measure of recent volatility, such as the Average True Range (ATR), to dynamically adjust its entry triggers, exit points, position sizes. For example, a entry signal might require not only an extreme RSI reading but also a price move exceeding a certain multiple of the ATR, filtering out minor flucuations and focusing on genuine, high-conviction overextensions. This adaptability is designed to make the strategy more robust across different market conditions.

Hypothesis 2: Statistical Arbitrage via ETF Component Replication (Structual Arbitrage)
- Rationale: This strategy targets potential structural inefficiencies in the ETF arbitrage mechanism itself. It operates on the principle that the "true" value of the SPY ETF at any moment is the weighted sum of the real-time prices of its 500 underlying constituent stocks. By independently calculating this synthetic NAV and comparing it to the traded market price of SPY, a small, transient spread can often be identified, The strategy would involve trading this spread: buying SPY and shorting the basket of constituents when SPY trades at a discount to its synthetic NAV, and vice versa when it trades at a premium.

- Challenges: While theoretically powerful, this approach presents significantly practical hurdles for the initial scope of this project, It is not truly "intra-ETF", as it necessitates the acquisition, cleaning, and real-time processing of data for 501 different securities. Furthermore, a realistic backtest would require a complex transaction cost model that accounts for the costs of trading large basket of potentially less-liquid stocks, which can be substantial. These data and implementation complexities make it a more suitable candidate for a follow-on project after the core framework has been established.

Hypothesis 3: Sentiment & Flow Analysis via Options Market Data (Information Asymmetry)
- Rationale: The options market for SPY is one of the most liquid and active in the world, serving as a major venue for speculation and hedging. The aggregate activity in this market can provide powerful leading indicators of sentiment and potential future price movements in the underlying ETF. This strategy would derive its signals from options-based data, such as the Put-call Ratio (PCR), which measures the volume of puts traded relative to calls. An extremely high PCR can indicate widespread fear and capitulation, often preceding a market bottom and a subsequent price rebound. Other potential signal include significant changes in Open Interest (OI) at key strike prices, which can indicate institutional positioning, and shifts in the term structure of Implied Volatility (IV), which can signal changing expectations of future risk.

- Challenges: The primary obstacle for this strategy is data accessibility. Comprehensive historical options data, including intraday quotes and volumes across all strikes and expirations, is rarely available for free and can be expensive to license. Managing and processing this high-dimensional data (price, time, strike, expiration) also adds a significant layer of complexity, moving it beyond the project's initial goal of utilising simple, freely available data sources.

# Strategy Selection and Rationale

For this project, Hypothesis 1: Volatility-Normalised Mean Reversion is selected for full implementation and analysis.

This choice is driven by its direct alignment with the project's core constraints and objectives:
1. Truly Intra-ETF: The strategy's logic is entirely self-constained, requiring only the historical price and volume data of the SPY ETF itself. This simplifies data management and focuses the analysis on the dynamics of a single instrument.
2. Data Accessibility: It can be fully implemented and backtested using the daily Open, High, Low, Close and Volume (OHLCV) data that is readily and freely available through the yfinance Python library.
3. Strong Theoretical Backing: The concept of mean reversion driven by behavioral overreaction is one of the most well-documented anomalies in finance literature, providing a robust theoretical foundation for the strategy.
4. Logical Extensibility: The strategy serves as an ideal foundational model. The principles of identifying and trading statistical extremes can be directly applied to higher-frequency intraday data in future research. The volatility-normalisation component also provides a natural framework for incorporating more sophisticated regime-detection models.

III. Data Sourcing and Feature Engineering

The foundation of any quantitative strategy is the data it consumes. This section details the acquisition of historical market data and the subsequent engineering of features - the quantitative signals derived from raw prices - that will drive the trading logic of the Volatility-Normalised Mean Reversion strategy.

# Acquisition of SPY and VIX Data

The primary data source for this project is Yahoo Finance, accessed programmatically via the yfinance Python library. This choice ensures that the research is fully reproducible without reliance on proprietary or paid data feeds. Two key time series are required:
1. SPDR S&P 500 ETF (SPY): Daily OHLCV data for the SPY ticker will be downloaded. This forms the basis for all trading signals and performance calculations.
2. CBOE Volatility Index (VIX): Daily closing price data for the VIX, accessible via the ^VIX ticker, will also be downloaded. The VIX serves not as a direct trading signal, but as a crucial filter for the market's overall risk regime.

The historical data will be downloaded for a period spanning from January 1, 2005, to the present day. This extensive timeframe is deliberately chosen to ensure the dataset includes a wide variety of market environments, including the low-volatility bull market of the mid-2000s, the extreme stress of the 2008 Global Financial Crisis, the subsequent quantitative easing era, the sharp COVID-19 crash in 2020, and the inflationary period that followed. A strategy's ability to navigate these diverse regimes is a key test of its robustness.

# Construction of Core Mean-reversion Indicators

From the raw OHLCV data, a set of features will be engineered to identify potential mean-reversion opportunities.

- Relative Strength Index (RSI): The RSI is a momentum oscillator that measures the speed and change of price movements on a scale of 0 to 100. It is calculated based on the ratio of average gains to average losses over a specified lookback period (typically 14 days). An RSI value above 70 is traditionally considered "overbought", suggesting the asset may be due for a pullback, while a value below 30 is considered "oversold", suggesting a potential rebound.
- Bollinger Bands: This indicator consists of a middle band being an N-period simple moving average (SMA), and upper and lower bands that are typically two standard deviations above and below the middle band. When prices move outside the bands, it signals a statistically significant deviation from the recent mean, which often precedes a reversion back toward the SMA.
- Rolling Z-score: To provide a more statistically rigorous measure of deviation, a rolling Z-score of the closing price will be calculated. The Z-score measures how many standard deviations an element is from the mean. A Z-score of +2 or -2 indicates that the current price is two standard deviations away from its rolling average, representing a statistically significant event that may be ripe for mean reversion.

# Development of the Volatiltiy Regime Filter

A core element of the proposed strategy is its ability to adapt to changing market volatility. This is achieved through a set of volatility-based features that act as filters and dynamic modifiers.

- Average True Range (ATR): The ATR is a technical analysis indicator that measures market volatility by decomposing the entire range of an asset price for that period. It is calculated as the moving average of the "true range", which is the greatest of the following: the current high mius the current low; the absolute value of the current high minus the previous close; and the absolute value of the current low minus the previous close. The ATR provides a more comprehensive measure of volatility than standard deviation alone, as it accounts for price gaps between trading days.
- Volatility Normalisation: The ATR will be used to create dynamic, adaptive threshold for trade entry. Rather than relying on a static price change, an entry signal will be conditioned on a price move that is significant relative to the recent volatility. For instance, a long entry might require the RSI to be below 30 and for the closing price to be below the previous day's low by a factor of the ATR. This helps the strategy distinguish between genuine market overextensions and low-volatility noise, thereby improving signal quality.
- VIX Regime Filter: The absolute level of the VIX index will be employed as a broad market regime filter. Historical analysis shows that mean-reversion strategies tend to perform poorly during periods of extreme market panic, when momentum and cascading liquidations dominate price action. Therefore, the strategy will be programmed to be "flat" - holding no positions and generating no new signals - whenever the VIX closes above a predefined threshold, such as 40. This acts as a safety mechanism to prevent the strategy from "catching a falling knife" during a market crash.

The following table presents illutrative summary statistics for the key features engineered from the raw data. A thorough understanding of these distributions is a prerequisite for intelligent parameter selection and model building

In [1]:
import pandas as pd

# Data
data = {
    "Feature": [
        "SPY Daily Return (%)",
        "RSI (14-day)",
        "ATR (14-day, points)",
        "VIX Index"
    ],
    "Mean": [0.04, 53.1, 3.52, 19.8],
    "Median": [0.07, 53.5, 2.89, 17.5],
    "Std. Dev.": [1.21, 11.2, 2.98, 8.2],
    "Skewness": [-0.25, -0.15, 3.10, 2.15],
    "Kurtosis": [12.8, -0.3, 15.1, 7.9],
    "Min": [-10.94, 10.5, 0.65, 9.14],
    "Max": [14.52, 92.3, 29.8, 82.69]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display as a table
df.style.set_table_styles(
    [{'selector': 'th', 'props': [('font-size', '12pt'), 
                                  ('background-color', '#f0f0f0')]},
     {'selector': 'td', 'props': [('padding', '6px')]}]
).format(precision=2)


Unnamed: 0,Feature,Mean,Median,Std. Dev.,Skewness,Kurtosis,Min,Max
0,SPY Daily Return (%),0.04,0.07,1.21,-0.25,12.8,-10.94,14.52
1,RSI (14-day),53.1,53.5,11.2,-0.15,-0.3,10.5,92.3
2,"ATR (14-day, points)",3.52,2.89,2.98,3.1,15.1,0.65,29.8
3,VIX Index,19.8,17.5,8.2,2.15,7.9,9.14,82.69


The statistics in this table reveal critical properties of the market data. The negative skew and high kurtosis (fat tails) of SPY's daily returns confirm that extreme negative events are more common than a normal distribution would suggest, underscoring the necessity of robust risk management. The distribution of the RSI indicator helps in setting reasonable initial thresholds for overbough/oversold levels. For instance, knowing the historical range allows for the slection of entry points that correspond to statistically rare events. Finally, the distribution of the VIX provides an empirical basis for setting the regime filter threshold; a value of 40, for example, would represent an event more than two standard deviations above the mean, clearly delineating a state of extreme market stress.

# IV. The Backtesting Engine: From Scafford to Robust Simulation

A backtest is only as reliable as the engine that runs it. The provided notebook scaffold offers a basic, trade-based simulation, but to achieve institutional-grade results, it must be significantly enhanced to more accurately model the realities of portfolio management, performance attribution, and transaction costs. This section details the critical upgrades required to transform the initial skeleton into a robust and realistic simulation environment.

# Enhancing the Core Engine

The foundational logic of the backtesting engine will be rebuilt to incorporate professional-grade standard for performance calculation and portfolio tracking.
- Vectorised Performance Calculation: The original engine calculates the equity curve using an iterative loop over a list of closed trades. This approach is computationally inefficient and becomes prohitbitively slow with a large number of trades or when running complec optimisations. The engine will be re-engineered to use a vectorised approach. This involves generating a daily time series of target positions (e.g., +1 for long, -1 for short, 0 for flat) and a daily time series of returns. The strategy's daily P&L can then be calculated with a single vectorised operation in pandas (e.g., positions.shift(1) * returns), which is orders of magnitude faster and more scalable.
- Daily Mark-to-Market Accounting: A critical flaw in the initial scaffold is that it only updates portfolio equity when a trade is closed. A professional backtesting system must "mark-to-market" the portfolio at the end of each trading day. This means calculating the value of all open positions based on the daily closing price and reflecting the unrealised gains or losses in the daily equity curve. This provides a true continous measure of portfolio value and is essential for the accurate calculation of key risk metrics like volatility, Sharpe Ratio, and Maximum Drawdown, which depends on a daily time series of returns.
- Integration of empyrical for Standardised Metrics: To ensure a comprehensive and standardised evaluation of strategy performance, the empyrical library will be integrated into the backtesting framework. After a backtest is complete, the resulting daily returns series will be passed to empyrical functions to generate a full performance "tear sheet". This will include metrics such as Annualised Return, Annual Volatility, Sharpe Ratio (risk-adjusted return), Sortino Ratio (downside risk-adjusted return), Calmar Ratio (return relative to drawdown), and Maximum Drawdown. Using a standardised library ensures consistency and comparability across different strategy variations.

# Modelling Transaction Costs for SPY

Realistic simulation of transaction costs is non-negotiable for validing a short-term strategy, where small profits can be easily eroded by excecution frictions. A multi-faceted cost model will be implemented to reflect the realities of trading a highly liquid ETF like SPY.
- Bid-Ask Spread: While SPY boasts one of the tightest bid-ask spreads in the market, it is not zero. The median 30-day spread for SPY is consistently low, often around 0.01%. This cost will be modelled by assuming that every buy order executes at the ask price and every sell order executes at the bid price. For simulation purposes, this can be implemented bu subtracting half the spread from the execution price on a sell and adding half the spread on a buy.
- Commissions: While many retail brokers now advertise "zero-commission" trading, this often masks costs in the form of poorer execution quality (payment for order flow). To maintain a conservative and realistic estimate, a small per-share commission will be modelled, based on typical institutional or sophisticated retail pricing, such as $0.0035 per share.
- Slippage: Slaippage representthe price difference between where a trade is intended to be executed and where it is actually filled. For a daily-frequency strategy, this can be modelled as a small, unfavourable price movement that occurs as a result of our order. The slippage cost will be modelled as a small fraction of the day's trading range (High - Low), applied against the trade's execution price.
- Borrow Costs: A market-neutral strategy requires taking short positions. Shorting an ETF involves borrowing the shares, which incurs a fee. For a highly liquid and readily available ETF like SPY, this cost is typically low but must be included for completeness. It will be modelled as a small annualised interest rate (e.g., 0.25% per yeaar) applied to the market value of any open short positions, accrued on a daily basis.

The inclusion of a realistic cost model is not merely an accounting exercise; it can fundamentally alter the viability of a strategy. The bid-ask spread, in particular, is not a static number. It is known to widen dramatically during periods of high market volatility and stress. A mean-reversion strategy, which inherently thrives on volatility to generate signals, may find that its theoretical edge is completely consumed by these transaction costs precisely when it is most active. this dynamic relationship means that market volatility, as measured by the VIX or ATR, serves a dual purpose. It is not only a filter for identifying valid trading signals but also a proxy for the tradability of those signals. A high-VIX environment may generate numerous apparent overextensions, but the associated increase in execution costs could render them unprofitable. Therefore, the backtesting engine will incorporate a dynamic spread model, where the simulated bid-ask spread is a function of the daily VIX level. This creates a more robust simulation and provides a strong economic rationale for using a VIX-based regime filter - it helps the strategy avoid trading when the cost of the execution is likely to overwhelm the potential profit.

# V. Strategy Implementation: A volatility-Normalised RSI Model

This section details the translation of the selected strategy hypothesis into executable Python code. The placeholder signal generator from the initial scaffold will be replaced with a sophisticated function that incorporates the logic of volatility-normalised mean reversion. An initial in-sample analysis will then be conducted to identify a robust set of starting parameters for the model.

# Translating Hypothesis to Code

A new signal generation function, generate_vol_norm_rsi_signals, will be implemented. This function will encapsulate the complete trading logic, taking as input a pandas DataFrame containing the daily OHLCV data for SPY, along with the pre-calculated RSI, ATR, and VIX features.

The logic within the function will be as follows:

Entry Logic: The conditions for entering a new position are designed to be strict, requiring confluence between a momentum oscillator and a volatility-adjusted price move.
- A long signal (value of 1) will be generated on a given day if;
  1. The 14-day RSI is below a specified entry_threshold (e.g.,30), indicating an oversold consition.
  2.  The daily closing price is significantly lower than the previous day's close, defined as close < (previous_close - atr_multiplier * ATR), confirming a sharp downward extension.
- A short signal (value of -1) will be generated if:
  1. The 14-day RSI is above a specified entry_threshold (e.g., 70), indicating an overbought condition.
  2. The daily closing price is significantly higher than the previous day's close, defined as close > (previous_close + atr_multiplier * ATR).

Exit logic: The exit logic is designed to capture the reversion to the mean without waiting for a full reversal signal, aiming to secure profits from the corrective phase of the price move.
- An open long position will be closed (signal changes to 0) if the RSI crosses back above a neutral exit_threshold (e.g.,50), indicating that the oversold condition has subsided.
- An open short position will be closed (signal changes to 0) if the RSI crosses back below the neutral exit_threshold.

Regime Filter: A master filter will be applied to all generated signals to enforce the strategy's risk management rules regarding extreme market volatility.
- All entry and exit signals will overrideen and set to 0 (flat) if the VIX index on that day is above a predefined vix_threshold (e.g., 40). This effectively deactivates the strategy during periods of systemic market panic.

# In-Sample Parameter Sensitivity Analysis

Before proceeding to out-of-sample validation, it is essential to understand how the strategy's performance is affected by its key parameters. A common pitfall in quantitative research is to simply select the single parameter set that yields the best historical performance, a practice known as "curve-fitting" or "overfitting". A robust strategy should demonstrate stable performance across a range of plausible paramter values.

To assess this, a grid search will be performed on the first half of the dataset (2005-2014), which serves as the in-sample period. The analysis will focus on the two most sensitive parameters: the RSI entry threshold and the ATR multiplier for the entry condition. The objective function for this analysis will be the Sharpe Ratio, as it provides a standised measure of risk-adjusted return.

The results of this sensitivity analysis are best visualised as a heatmap, which provides an intuitive understanding of the strategy's robustness. A strategy that is highly sensitive to small parameter changes will appear as a sharp, isolated peak on the map, indicating fragility. In contrast, a robust strategy will manifest as a broad "plateau" of a consistently good performance across a region of neighboring parameter values. The goal is to select a paramter set from the center of such a plateau, increasing confidence that the observed performance is due to a genuine market anomaly rather than a statistical artifact of the specific historical data sample.

The following table illustrates a sample parameter sensitivity heatmap.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data
data = {
    "RSI Entry Threshold: 20": [0.45, 0.62, 0.58],
    "RSI Entry Threshold: 25": [0.51, 0.75, 0.72],
    "RSI Entry Threshold: 30": [0.48, 0.71, 0.69]
}

index = ["ATR Multiplier: 0.5", "ATR Multiplier: 1.0", "ATR Multiplier: 1.5"]

# Create DataFrame
df = pd.DataFrame(data, index=index)

# Display as styled table
display(df.style.background_gradient(cmap="YlGnBu").format(precision=2))

# Plot heatmap
plt.figure(figsize=(6,4))
sns.heatmap(df, annot=True, cmap="YlGnBu", fmt=".2f", cbar_kws={'label': 'In-Sample Sharpe Ratio'})
plt.title("Parameter Sensitivity Heatmap: In-Sample Sharpe Ratio", fontsize=12)
plt.xlabel("RSI Entry Threshold")
plt.ylabel("ATR Multiplier")
plt.show()


Unnamed: 0,RSI Entry Threshold: 20,RSI Entry Threshold: 25,RSI Entry Threshold: 30
ATR Multiplier: 0.5,0.45,0.51,0.48
ATR Multiplier: 1.0,0.62,0.75,0.71
ATR Multiplier: 1.5,0.58,0.72,0.69


In this illustrative heatmap, a clear region of strong performance exists around an RSI threshold of 25 and an ATR mulitiplier of 1.0. The performance degrades gracefully as parameters move away from this central pointm but there is no single, sharp peak. This indicates that the strategy's edge is not dependent on a hyper-specific, curve-fit parameter value. Based on this analysis, the parameter set of {RSI Entry: 25, ATR Multiplier: 1.0} would be selected for the subsequent out-of-sample and walk-forward validation stages.

# VI. Validation and Robustness Testing

The most critical phase of quantitative strategy development is rigorous validation. A profitable backtest on historical data is a necessary but inefficient condition for viable strategy. It is imperative to employ techniques that actively guard against overfitting and assess the statistical reliability of the observed performance. This section details a multi-stage validation process, moving from a simple out-of-sample test to more sophisticated walk-forward and Monte carlo analyses.

# Out-of-Sample (OOS) Performance

The first and most fundamental test of a strategy's robustness is its performance on unseen data. The optimal paramter set identified from the in-sample period (2005-2014) will be applied, without modification, to the out-of-sample period (2015 to the present). The performance metrics (Sharpe Ratio, Max Drawdown, etc.) from this OOS period will be calculated and compared directly to the in-sample results. A significant degradation in performance - for example, a Sharpe Ratio that drops by more than 50% or becomes negative - is a strong indicator that the strategy was overfitted to a specific market conditions of the in-sample period and lacks a genuine, persistent edge.

# Walk-Forward Optimisation (WFO)

While a single in-sample/out-of-sample split is informative, it is still a static test. Walk-Forward Optimisation (WFO) provides a more dynamic and realistic validation methodology. It simulates how a practitioner would periodically re-evaluate and re-calibrate a strategy as new market data becomes available, thus testing the stability of the strategy's edge over time.

The WFO process will be implemented as follows:
1. Define Windows: The entire dataset (2005-present) will ve divided into a series of overlapping windows. For example, we might use a 7-year rolling window, with the first 5 years designated as the "training" (in-sample) period and subsequent 2 years as the "testing" (out-of-sample) period.
2. iterate and Optimise: The process begins with the first window (e.g., train on 2005-2009, test on 2010-2011). The strategy's parameters are optimised on the training data.
3. Test OOS: The single best parameter set from this optimisation is then applied to the testing period, and the performance is recorded.
4. Roll Forward: The window is then rolled forward by the length of the test period(e.g., the next window trains on 2007-2011 and tests on 2012-2013). The process repeats until the end of the dataset is reached.

The final WFO performance is the "stitched-together" equity curve composed solely of the results from all the consecutive out-of-sample testing periods. This provides a more realistic estimate of historical performance than a single backtest. Additionally, the stability of the optimised parameters across each training window will be analysed. If the optimal RSI threshold, for instance, jumps erratically from one window to the next, it suggests that the strategy's logic is not robust to changing market regimes.

# Monte Carlo Simulation (Bootstrap Analysis)

Even a successful walk-forward analysis yeilds a single historical path of returns. This result could still be a product of "path dependency" - a fortunate sequence of winning and losing trades. Monte Carlo simulation, specifically bootstrapping, is a powerful statistical technique used to assess the reliability of the backtest results and to construct confidence intervals around performance metrics.

The bootstrap analysis will be conducted as follows:
1. Collect Trades: A list of all individual trade returns (PnL as a percentage of captial at risk) generated during the full-period backtest is compiled.
2. Resample with Replacement: A new, synthetic history of trades is created by randomly drawing trades from the original list, with replacement. This new list will have the same number of trades as the original backtest.
3. Generate Equity Curve: An equity curve is generated from this synthetic trade history.
4. Repeat: This process is repeated thousands of times (e.g., 10,000 simulations), creating a large distribution of possible equity curves, each representing a different "path" the strategy could have taken with the same underlying trade distribution.

From this distribution of simulated outcomes, it is possible to derive a probability distribution for any perfromance metric. This transforms a single point estimate (e.g., "the Sharpe Ratio was 0.75") into a probailistic statement (e.g., "there is a 95% probability that the true Sharpe Ratio of the strategy lies between 0.45 and 1.05"). This provides a much more honest and statistically grounded assessment of the strategy's potential.

The following table summarises the output of such analysis, comparing the original backtest's point estimates to the confidence intervals generated by the Monte Carlo simulation (10,000 simulations).

In [None]:
import pandas as pd

# Data
data = {
    "Metric": ["Annualised Return", "Sharpe Ratio", "Max Drawdown", "Calmar Ratio"],
    "Original Backtest": ["8.5%", "0.75", "-12.1%", "0.70"],
    "5th Percentile": ["4.2%", "0.38", "-18.5%", "0.23"],
    "Median (50th)": ["8.4%", "0.74", "-12.3%", "0.68"],
    "95th Percentile": ["12.6%", "1.10", "-7.8%", "1.15"]
}

df = pd.DataFrame(data)

# Highlighting logic
def highlight_cells(val, metric):
    # Skip the first column ("Metric")
    if metric == "Metric":
        return ""
    
    # Safely convert to float
    try:
        val_num = float(val.replace("%","")) if "%" in val else float(val)
    except:
        return ""

    # Annualised Return
    if metric == "Annualised Return":
        if val_num == 4.2 or val_num == 12.6:
            return "background-color: lightgreen; font-weight: bold"
    
    # Sharpe Ratio
    if metric == "Sharpe Ratio":
        if val_num in [0.38, 1.10]:
            return "background-color: lightblue; font-weight: bold"
        if val_num == 0.75:
            return "background-color: gold; font-weight: bold"
        if val_num == 0.74:
            return "background-color: lightgreen; font-weight: bold"
    
    # Max Drawdown
    if metric == "Max Drawdown":
        if val_num == -18.5:
            return "background-color: salmon; font-weight: bold"
        if val_num == -12.1:
            return "background-color: gold; font-weight: bold"
        if val_num == -7.8:
            return "background-color: lightgreen; font-weight: bold"
    
    # Calmar Ratio
    if metric == "Calmar Ratio":
        if val_num == 0.23:
            return "background-color: salmon; font-weight: bold"
        if val_num == 1.15:
            return "background-color: lightgreen; font-weight: bold"
        if val_num == 0.68:
            return "background-color: lightblue; font-weight: bold"
    
    return ""

# Apply row-wise styling
styled = df.style.apply(
    lambda row: [highlight_cells(val, row["Metric"]) for val in row],
    axis=1
)

styled


The insights from this table are profound. It demonstrates that while the original backtest yielded a Sharpe Ratio of 0.75, there is a 5% chance the true Sharpe could be as low as 0.38. Conversely, there is also a 5% it could be as high as 1.10. Critically, the 5th percentile for both Annualised Return and Sharpe Ratio is comfortably positive, providing strong statistical evidence that the strategy's positive performance is not simply due to random chance. The drawdown analysis also provides a more realistic expectation of risk; while the historical path experienced a -12.1% drawdown, the simulation shows that a drawdown as deep as -18.5% is a plausible outcome that a trader must be prepared to endure.

# VII. Position Sizing and Dynamic Risk Management

Generating a trading singal is only half the battle; determining how much capital to allocate to that signal is an equally critically component of a complete trading system. This discipline, known as position sizing or money management, can have a more significant impact on the final risk and return profile of a strategy than the signal generation logic itself. This section explores several standard position sizing models and implements a dynamic risk management framework tailored to the volatility-sensitive nature of the mean-reversion strategy.

# A Comparative Analysis of Sizing Models

Different position sizing models operate on different principles, leading to vastly different portfolio dynamics. Three primary models will be considered:
- Fixed Fractional Sizing: This is one of the simplest and most common method. It involves risking a fixed fraction or percentage of the total account equity on each trade. For example, trader might decide to risk 1% of their captial on any given position. The number of shares to trade is then calculated based on the distance to the initial stop-loss. While robust and effective at preventing catastrophic losse, this method is not adpative to the specific characteristics of each trade signal.
- Kelly Criterion: Developed for information theory, the Kelly Criterion provides a mathematical formula for the optimal position size that maximises the long-term geometric growth rate of capital. the formula is f* = (bp - q)/b, where p is the probability of winning, q is the probability of losing (1-p), and b is the win/loss ratio (average gain / average loss). While theoretically optimal, the full Kelly fraction is notoriously aggresive and can lead to extreme portfolio volatility and deep drawdowns. It is also highly sensitive to the accuracy of its inputs, which must be estimated from historical data and may not hold in the future. A more common and conservative application is "Fractional Kelly" (e.g., Hlaf-Kelly), which uses a fraction of the calculated optimal size.
- Volatility Parity (inverse volatility Sizing): This approach aims to equalise the risk contribution of each trade to total portfolio risk. This position size is set to be inversely proportional to the volatility of the asset at the time of the trade. For a single-asset strategy, this means taking smaller positions when the asset is highly volatile and larger position when it is calm. this method is highly synergistic with a volatility-sensitive strategy, as it naturally reduces exposure during the high-risk periods that often accompany mean-reversion signals.

# Implementation of Volatility Parity Sizing Model

Given the strategy's focus on adapting to market volatility, the Volatility Parity model is the most logical and theoretically consistent choice. It will be implemented as follows:
1. Define Target Risk: A target dollar amount of risk per trade is defined, typically as a fraction of toal portfolio equity (e.g., 1% of a $100,000 portfolio is $1,000).
2. Measure Volatility: At the time of a trade signal, the current value of the 14-day ATR is used as the measure of volatility.
3. Calculate Position Size: The dollar amount to be invested in the position is calculated as: Position Size ($) = Target Risk ($) / (ATR (points) / Price) - This formula sizes the position such that a price move of one ATR corresponds to the target dollar risk. This effectively normalises the risk of each trade, ensuring that the strategy takes on a consistent amount of risk regardless of the market's volatility level.

# Implementing an ATR-Based Trailing Stop-Loss

To protect capital when a mean-reversion signal fails and a sstrong trend develops against the position, a dynamic stop-loss mechanism is essential. A static stop-loss is suboptimal as it does not adpat to volatility. An ATR-based trailing stop-loss provides a robust solution.

- Logic: Upon entering a position, an initial stop-loss is set at a multiple of the ATR away from the entry price. For a long position, the stop would be Entry Price - (Multiplier * ATR at Entry). A typical multiplier is between 2 and 3. The key feature is that this is a trailing stop. As the price moves favourably (up for a long position), the stop-loss level is ratcheted upwards, always maintaining the same ATR-based distance from the highest price reached during the trade. The stop-loss level is never moved down, thus locking in unrealised profits. If the price reverses and hits the trailing stop level, the position is immediately closed.

The following table demonstrates the significant impact of the chosen position sizing model on the final performance of the identical set of trading signals.

In [None]:
import pandas as pd

# Data
data = {
    "Metric": ["Annualised Return", "Sharpe Ratio", "Max Drawdown", "Calmar Ratio"],
    "Fixed Fractional (1% Risk)": ["6.8%", "0.65", "-14.5%", "0.47"],
    "Half-Kelly": ["10.2%", "0.68", "-22.1%", "0.46"],
    "Volatility Parity (1% Risk)": ["8.5%", "0.75", "-12.1%", "0.70"]
}

df = pd.DataFrame(data)

# Manual styling dictionary
styles = {
    ("Annualised Return", "Half-Kelly"): "background-color: gold; font-weight: bold",
    ("Annualised Return", "Volatility Parity (1% Risk)"): "background-color: lightgreen; font-weight: bold",
    ("Sharpe Ratio", "Volatility Parity (1% Risk)"): "background-color: lightgreen; font-weight: bold",
    ("Max Drawdown", "Half-Kelly"): "background-color: salmon; font-weight: bold",
    ("Max Drawdown", "Volatility Parity (1% Risk)"): "background-color: lightgreen; font-weight: bold",
    ("Calmar Ratio", "Volatility Parity (1% Risk)"): "background-color: lightgreen; font-weight: bold"
}

# Apply manual highlight
def highlight_manual(row):
    return [styles.get((row["Metric"], col), "") for col in df.columns]

styled = df.style.apply(highlight_manual, axis=1)

styled


The results clearly illustrate the trade-offs, The Half-Kelly meodel generates the highest raw return but at the cost of a significantly deeper maximum drawdown, resulting in a mediocre risk-adjusted return (Calmar Ratios). The Fixed Fractional model is more conservative. The Volatility Parity model strikes the best balance, achieving a strong return while containing the maximum drawdown, leading to the highest Sharpe and Calmar Ratios. This data-driven comparison provides a compelling justification for selecting the Volatility Parity model as the final component of the trading system.

# VIII. Final Performance Analysis and Deployment Considerations

This section synthesizes the results from the preceding stages of research to provide a comprehensive final assessment of the developed strategy. It also bridges the gap between theoretical backtesting and practical implementation by detailing the code strcuture required for integration with a paper trading environment like Alpaca.

# Comprehensive Review of Final Strategy

The final strategy combines the three core components developed and validated in this report:
1. Signal Generation: A volatility-normalised RSI model that identifies oversold/overbought conditions, filtered by the VIX to avoid extreme market stress.
2. Position Sizing: A volatility parity model that allocates capital inversely proportional to the asset's recent ATR, ensuring consistent risk per trade.
3. Risk Management: An ATR-based trailing stop-loss to protect capital and lock in profits when a trade moves favourably.

A full backb=test of this integrated system over the entire 2005-present period yields a detailed performance profile. The analysis of the resulting equity curve and trade log would involve examining periods of strong performance and notable drawdowns. For instance, the strategy would be expected to perform well during periods of range-bound, "choppy" markets where overreactions are frequent. Conversely, its performance would likely suffer during strong, persistent trending markets, where mean-reversion signals would consistently fire prematurely. The VIX filter is designed to protect the strategy during crash-like events (e.g., late 2008, March 2020), and the backtest would be scrutinised to confirm its effectiveness in mitagating drawdowns during these periods. The final tear sheet, generated by empyrical, would provide the definitive, objective metrics on the strategy's historical risk-adjusted performance.

# Code Walkthrough: Alpaca Paper Trading Integration

To prepare the strategy for live paper trading, the provided code stubs are expaned into a functional execution loop. This requires connecting to the broker's API, retrieving current state (positions), generating a signal based on the lastest data, and reconciling any differences with new orders. The alpaca-trade-api Python library is used for this purpose.

The implementation involves two key functions:
1. generate_live_signal(current_data): This function is a live-trading adaptation of the backtesting signal generator. It takes a DataFrame containing the most historical data (e.g., the last 50 days of OHLCV for SPY, VIX, etc.) as input. It then applies the finalised strategy logic - calculating the lastest RSI, ATR, and VIX values - to determine the target position for the portfolio. The function returns a simple integer: 1 for a target long position, -1 for a target short position, and 0 for a target flat position.
2. reconcile_position(): This is the main execution function, designed to be run periodically (e.g., once per day after market close). Its logic is as follows:
   - Connect to API: It initialises the Alpaca REST client using the APU keys stored as environment variables (APCA_API_KEY_ID, APCA_API_SECRET_KEY) and sets the base URL to the paper trading endpoint (https://paper-api.alpaca.markets).
   - Fetch Current State: It queries the API to get the current position in SPY using api.get_position('SPY'). This returns a Position object detailing the quantity of shares held (positive for long, negative for short) or raises an exception if no position exists. The current position is translated into an integer state (e.g., +1, -1, 0).
   - Determine Target State: It fetches the latest market data and calls generate_live_signal() to determine the desired target position.
   - Execute Orders: It compares the current position to the target position. If they are the same, no action is taken. If they differ, it submits the neccessary market order to reconsile the portfolio. For example, if the current position is long (+1) and the target is is flat (0), it submits a sell order for the entire quantity of the current position using api.submit_order(symbol='SPY', qty=current_qty, side='sell', ...). If the current position is flat (0) and the target is short (-1), it first calculates the required number of shares based on the Volatility Parity sizing model and then submits a sell (short) for that quantity.

# Discussion of Operational Risks

A successful backtest does not guarantee successful live trading. A host of operational risks exist that are not captured in a historical simulation. These include:
- Data Feed Integrity: The live data feed from the broker or a third-party provider may contain errors, gaps, or be delayed, causing the algorithm to make decisions based on faulty information.
- API and Connectivity Issues: The connection to the broker's API can fail, or the API itself may experience downtime, preventing the algorithm from placing or managing orders.
- Execution Uncertainty: In a live environment, order fulls are not guaranteed at the last-traded price. Market orders can experience significant slippage, especially during volatile periods.
- System Failures: The machine running the algorithm could crash, lose power, or lose internet connectivity, leaving open positions unmanaged.

To mitigate these risks, a production-grade system requires robust infrastructure, including comprehensive logging of all action and decisions, an automated alerting system to notify the operator of any errors or unexpected behavior, and a clear "kill switch" protocol to immediately flatten all positions and halt the algorithm in case of a critical failure.

# IX. Conclusion and Avenues for Future Research

This report has detailed the constrcution of a complete, institutional-grade framework for the research and development of algorithmic trading strategies, applying it to a market-neutral, intra-ETF mean-reversion strategy on SPY. The process was intentionally rigorous, progressing from a theoretical foundation in behavioral finance to a robustly validated and deployment-ready implementation.

# Summary of Findings

The core hypothesis - that behavioral overreactions in the highly liquid SPY ETF create predictable, short-term mean-reversion opportunities - was found to be viable when approached with a sophisticated, adaptive model. The key findings are:
1. Adaptability is Crucial: A simple, static mean-reversion strategy is insufficient. The integration of a volatility-normalisation component (using ATR) and a market-regime filter (using the VIX) was essential for navigating different market conditions and improving signal quality.
2. Rigorous Validation is Non-Negotiable: The strategy demonstrated positive performance not only in-sample but also in out-of-sample testing. More importantly, the walk-forward analysis showed that the strategy's edge was persistent across different time periods, and the Monte Carlo simulations provided statistical confidence that the positive results were unlikely to be a product of random chance.
3. Risk Management Drives Performance: The choice of position sizing model had a profound impact on the strategy's risk-adjusted returns. The Volatility Parity approach, which aligns position size with the strategy's core logic of adapting to volatility, was empirically shown to produce the superior risk-adjusted outcome, yielding the highest Sharpe and Calmar ratios while effectively containing drawdowns.

The final, integrated strategy represents a statistically robust model for extracting alpha from short-term inefficiencies in the SPY ETF. The confidence intervals derived from the bootstrap analysis suggest a geniune albeit modest, adge after accounting for realistic transactional costs.

# A Roadmap for Iteration

The framework developed in this report is not an endpoint but a foundation. the value of a systematic process lies in its capacity for iterative improvement. Several high-potential avenues for future researcj are immediately apparent:
1. Scaling to Intraday Data: The most direct extension of this work is to apply the same volatility-normalised mean-reversion principles to a higher data frequency, such as 5-minute or 15-minute bars. This would allow the strategy to capture a greater number of smaller-scale inefficiences and exploit the known intraday patterns in market microstrcuture.
2. Developing Alternative Hypotheses: The two other strategies proposed in Section II - Statistical Arbitrage on ETF components and Options-Based Sentiment Analysis - remain compelling but more complex avenues for research. With the core backtesting and risk management framework now in place, these data-intensive strategies become more tractable projects.
3. Machine Learning Enhancements: The current strategy relies on a static, rule-based logic for signal generation (e.g., IF RSI < 25 AND...). A powerful next step would be to replace this logic with a machine learning model, such as a Gradient Boosting classifier. The engineered features (RSI, ATR, Z-score, VIX) would serve as inputs to the model, which would be trained to predict the probability of a positve or negative return on the next day. This could capture more complex, non-linear interactions between the features and potentially lead to a more accurate and adaptive signal.
4. Portfolio Application and Diversification: The single-asset framework can be expanded to a portfolio of highly liquid, correlated ETFs (e.g., SPY, QQQ for the Nasdaq-100, and IWM for the Russel 2000). This would enable the developement of pairs trading strategies that trade the relative value between these indices, or sector rotation strategies that apply momentum and mean-reversion principles across different sectors of the economy. This would introduce the benefits of diversification and potentially create a more stable, multi-alpha source of returns.