<a href="https://colab.research.google.com/github/etay-atar/machine_learning/blob/main/Predict_SandP500_Price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Next-Day S&P 500 Movements with Machine Learning

Researcher: **Etay Atar**

The goal is to build and evaluate a machine learning model that predicts tomorrow’s S&P 500 price direction (up or down) based on more than 20 years of historical market data.

**Objective**

Develop an end-to-end pipeline for time-series prediction of the S&P 500 index.

Focus on daily next-day forecasting (classification: up vs. down).

Evaluate predictive performance both with machine learning metrics (accuracy, AUC) and trading strategy metrics (CAGR, Sharpe ratio, drawdowns).

Ensure robust testing using walk-forward backtesting, which mimics real-world trading conditions and prevents lookahead bias.
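To make the walk-forward idea concrete, here is a minimal sketch of how expanding-window splits could be generated. The function name `walk_forward_splits` and its parameters `initial_train` and `step` are illustrative assumptions, not part of this notebook's pipeline.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_rows: int, initial_train: int, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Every test index is strictly later than every training index, so the
    model is never evaluated on a period it has already seen.
    """
    if initial_train <= 0 or step <= 0:
        raise ValueError("initial_train and step must be positive")
    start = initial_train
    while start < n_rows:
        end = min(start + step, n_rows)
        # Train on everything before `start`, test on the next `step` rows.
        yield range(0, start), range(start, end)
        start = end


# Toy example: 10 rows, first 4 used for the initial fit, 3 rows per test fold.
splits = list(walk_forward_splits(n_rows=10, initial_train=4, step=3))
for train_idx, test_idx in splits:
    print(f"train rows 0..{train_idx[-1]}, test rows {test_idx[0]}..{test_idx[-1]}")
```

In a real backtest the model would be refit on each training range and scored only on the matching test range, in chronological order.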


**Research Question**

Step 1 — Forecasting:
“Can machine learning models accurately predict the next-day movement of the S&P 500 index based on historical price data and technical indicators?”

Step 2 — Trading:
“If such predictions are used in a simple trading strategy (via the SPY ETF), would the strategy generate consistent profits after accounting for transaction costs and market frictions?”
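The Step 1 question implies a binary label for each trading day. As a rough sketch, the label could be built by shifting the closing price back by one day. The prices below are made-up toy values, and the column names `Tomorrow` and `Target` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy closing prices (made-up values, not real S&P 500 data).
prices = pd.DataFrame(
    {"Close": [100.0, 101.0, 100.5, 102.0, 102.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="B"),
)

# Tomorrow's close, aligned to today's row.
prices["Tomorrow"] = prices["Close"].shift(-1)

# The last row has no "tomorrow" yet, so drop it instead of labeling it.
labeled = prices.dropna(subset=["Tomorrow"]).copy()

# Assumed label: 1 if tomorrow's close is strictly higher than today's, else 0.
labeled["Target"] = (labeled["Tomorrow"] > labeled["Close"]).astype(int)

print(labeled)
```

Treating an unchanged close as "down" (0) is one possible convention. Only `Target` is the prediction target; `Tomorrow` contains future information and must never be used as a feature.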

We start with data collection:


In [None]:
# Import yfinance, which is used to download the S&P 500 price data
import yfinance as yf

In [None]:
# Create a Ticker object for ^GSPC (the S&P 500 index)
sp500 = yf.Ticker("^GSPC")

In [None]:
# Query the full available price history of the index
sp500 = sp500.history(period="max")

In [None]:
# Display the data
sp500

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 1927-12-30 00:00:00-05:00 | 17.660000 | 17.660000 | 17.660000 | 17.660000 | 0 | 0.0 | 0.0 |
| 1928-01-03 00:00:00-05:00 | 17.760000 | 17.760000 | 17.760000 | 17.760000 | 0 | 0.0 | 0.0 |
| 1928-01-04 00:00:00-05:00 | 17.719999 | 17.719999 | 17.719999 | 17.719999 | 0 | 0.0 | 0.0 |
| 1928-01-05 00:00:00-05:00 | 17.549999 | 17.549999 | 17.549999 | 17.549999 | 0 | 0.0 | 0.0 |
| 1928-01-06 00:00:00-05:00 | 17.660000 | 17.660000 | 17.660000 | 17.660000 | 0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-09-15 00:00:00-04:00 | 6603.490234 | 6619.620117 | 6602.069824 | 6615.279785 | 5045020000 | 0.0 | 0.0 |
| 2025-09-16 00:00:00-04:00 | 6624.129883 | 6626.990234 | 6600.109863 | 6606.759766 | 5359510000 | 0.0 | 0.0 |
| 2025-09-17 00:00:00-04:00 | 6604.870117 | 6624.390137 | 6551.149902 | 6600.350098 | 5805340000 | 0.0 | 0.0 |
| 2025-09-18 00:00:00-04:00 | 6626.850098 | 6656.799805 | 6611.890137 | 6631.959961 | 5292400000 | 0.0 | 0.0 |


In [None]:
# Check the DataFrame's columns, dtypes, and non-null counts
sp500.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 24546 entries, 1927-12-30 00:00:00-05:00 to 2025-09-19 00:00:00-04:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          24546 non-null  float64
 1   High          24546 non-null  float64
 2   Low           24546 non-null  float64
 3   Close         24546 non-null  float64
 4   Volume        24546 non-null  int64  
 5   Dividends     24546 non-null  float64
 6   Stock Splits  24546 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 1.5 MB


In [None]:
# Try another ticker: SPY, the tradable ETF that tracks the S&P 500
sp500_SPY = yf.Ticker("SPY")
sp500_SPY = sp500_SPY.history(period="max")
sp500_SPY

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Capital Gains |
|---|---|---|---|---|---|---|---|---|
| 1993-01-29 00:00:00-05:00 | 24.330326 | 24.330326 | 24.209280 | 24.313034 | 1003200 | 0.000 | 0.0 | 0.0 |
| 1993-02-01 00:00:00-05:00 | 24.330323 | 24.485954 | 24.330323 | 24.485954 | 480500 | 0.000 | 0.0 | 0.0 |
| 1993-02-02 00:00:00-05:00 | 24.468674 | 24.555136 | 24.416797 | 24.537844 | 201300 | 0.000 | 0.0 | 0.0 |
| 1993-02-03 00:00:00-05:00 | 24.572437 | 24.814530 | 24.555144 | 24.797237 | 529400 | 0.000 | 0.0 | 0.0 |
| 1993-02-04 00:00:00-05:00 | 24.883672 | 24.952842 | 24.606995 | 24.900965 | 531500 | 0.000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-09-15 00:00:00-04:00 | 657.816255 | 659.212348 | 657.517097 | 659.082703 | 63772400 | 0.000 | 0.0 | 0.0 |
| 2025-09-16 00:00:00-04:00 | 659.641138 | 659.950340 | 657.387438 | 658.175232 | 61169000 | 0.000 | 0.0 | 0.0 |
| 2025-09-17 00:00:00-04:00 | 658.185266 | 659.890500 | 652.491031 | 657.357544 | 101952200 | 0.000 | 0.0 | 0.0 |
| 2025-09-18 00:00:00-04:00 | 660.060044 | 663.051750 | 658.444528 | 660.429016 | 90459200 | 0.000 | 0.0 | 0.0 |


In [None]:
# Check the DataFrame's columns, dtypes, and non-null counts
sp500_SPY.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8217 entries, 1993-01-29 00:00:00-05:00 to 2025-09-19 00:00:00-04:00
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Open           8217 non-null   float64
 1   High           8217 non-null   float64
 2   Low            8217 non-null   float64
 3   Close          8217 non-null   float64
 4   Volume         8217 non-null   int64  
 5   Dividends      8217 non-null   float64
 6   Stock Splits   8217 non-null   float64
 7   Capital Gains  8217 non-null   float64
dtypes: float64(7), int64(1)
memory usage: 577.8 KB


The data from the two tickers serve two different goals:

1. **Forecasting goal → ^GSPC**

Question: “Can I predict whether the S&P 500 will go up or down tomorrow?”

Focus: prediction accuracy (ML metrics like accuracy, AUC, etc.).

Data: index values only (no dividends; the index itself cannot be traded directly).

Outcome: you know how well your model forecasts the market’s movement.

It's like a weather forecast: you only care about whether the prediction matches reality.

2. **Trading goal → SPY**

Question: “If I trade my predictions with real money, will I make a profit?”

Focus: profitability (PnL, CAGR, Sharpe ratio, drawdowns, after costs).

Data: ETF prices (OHLCV), dividends, transaction costs.

Outcome: you see how much money you would (or wouldn’t) make in practice.

It's like running a business simulation: you care about actual money, not just being “right.”
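The trading metrics named above can be sketched as follows. This is an illustrative calculation on made-up returns, assuming 252 trading days per year, a zero risk-free rate for the Sharpe ratio, and a flat proportional cost per position change. The helper names `net_returns` and `performance` are hypothetical.

```python
import numpy as np

TRADING_DAYS = 252  # assumption: trading days per year


def net_returns(asset_returns, positions, cost_per_trade=0.0005):
    """Daily strategy returns after a flat cost on every position change.

    positions[t] is the exposure held during day t (1 = long, 0 = cash) and
    must be decided using information available before day t.
    """
    asset_returns = np.asarray(asset_returns, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # The initial entry from cash counts as a trade as well.
    trades = np.abs(np.diff(positions, prepend=0.0))
    return positions * asset_returns - trades * cost_per_trade


def performance(returns):
    """Return CAGR, annualized Sharpe ratio (risk-free rate assumed 0), and max drawdown."""
    returns = np.asarray(returns, dtype=float)
    # Equity curve starting from 1.0 so an initial loss counts as a drawdown.
    equity = np.concatenate(([1.0], np.cumprod(1.0 + returns)))
    years = len(returns) / TRADING_DAYS
    cagr = equity[-1] ** (1.0 / years) - 1.0
    std = returns.std(ddof=1)
    sharpe = np.sqrt(TRADING_DAYS) * returns.mean() / std if std > 0 else float("nan")
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = (equity / running_peak - 1.0).min()
    return {"cagr": cagr, "sharpe": sharpe, "max_drawdown": max_drawdown}


# Toy example with made-up daily returns and positions.
net = net_returns(
    asset_returns=[0.01, -0.02, 0.015, 0.0],
    positions=[1, 1, 0, 1],
    cost_per_trade=0.001,
)
stats = performance(net)
print(net, stats)
```

With only four days of data the annualized numbers are not meaningful; the example only shows how costs enter the return series and how each metric is derived from it.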

________________________________________________________________________

My goal now is to model the market itself (longer history, more comprehensive), so I will stick with ^GSPC.