# Stock Price Prediction Across Market Sectors

This project applies machine learning to the problem of stock price prediction, with an emphasis on sector-level diversity and company-level representation. The analysis covers all 11 sectors defined by the Global Industry Classification Standard (GICS). For each sector, two leading stocks have been selected, giving a predefined list of 22 well-established and widely traded companies.

The goal is to develop a generalizable and reproducible prediction pipeline, while gaining insight into the behavior of stocks across different industries. 

### GICS Sectors Covered:
- Information Technology  
- Health Care  
- Financials  
- Consumer Discretionary  
- Communication Services  
- Industrials  
- Consumer Staples  
- Energy  
- Utilities  
- Real Estate  
- Materials


### 📈 Dataset

This project uses historical daily stock price data downloaded from Yahoo Finance with the [yfinance](https://pypi.org/project/yfinance/) package. Each record includes the Open, High, Low, Close, and Adjusted Close prices, along with Volume.

We selected 22 companies across 11 sectors of the US stock market:

| Sector                    | Tickers         |
|--------------------------|-----------------|
| Information Technology   | AAPL, MSFT      |
| Health Care              | JNJ, UNH        |
| Financials               | JPM, BAC        |
| Consumer Discretionary   | AMZN, TSLA      |
| Communication Services   | GOOGL, META     |
| Industrials              | UNP, RTX        |
| Consumer Staples         | PG, KO          |
| Energy                   | XOM, CVX        |
| Utilities                | NEE, DUK        |
| Real Estate              | AMT, PLD        |
| Materials                | LIN, SHW        |

These companies were selected due to their market leadership, high liquidity, and rich historical data. They serve as strong representatives of their sectors and offer a diverse foundation for building and evaluating time series forecasting models.

Raw data is saved in `data/raw/` as individual CSV files.
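The download step itself is not shown here, so the following is only a minimal sketch of how it could look. The sector mapping and the `data/raw/` folder come from the text above; the helper names `raw_csv_path` and `download_raw`, the `<TICKER>.csv` file naming, and the default start date are assumptions made for illustration.

```python
from pathlib import Path

# Sector -> tickers, copied from the table above.
SECTOR_TICKERS = {
    "Information Technology": ["AAPL", "MSFT"],
    "Health Care": ["JNJ", "UNH"],
    "Financials": ["JPM", "BAC"],
    "Consumer Discretionary": ["AMZN", "TSLA"],
    "Communication Services": ["GOOGL", "META"],
    "Industrials": ["UNP", "RTX"],
    "Consumer Staples": ["PG", "KO"],
    "Energy": ["XOM", "CVX"],
    "Utilities": ["NEE", "DUK"],
    "Real Estate": ["AMT", "PLD"],
    "Materials": ["LIN", "SHW"],
}

RAW_DIR = Path("data/raw")  # folder named in the text


def raw_csv_path(ticker: str) -> Path:
    """Hypothetical naming scheme: one CSV per ticker, named after it."""
    return RAW_DIR / f"{ticker}.csv"


def download_raw(ticker: str, start: str = "1999-12-01") -> Path:
    """Download daily history for one ticker and save it as CSV.

    The start date is an assumption, chosen to match the earliest date
    that appears in the merged data later in this document.
    """
    import yfinance as yf  # imported lazily so the mapping works offline

    # auto_adjust=False keeps both "Close" and "Adj Close" columns.
    history = yf.Ticker(ticker).history(start=start, auto_adjust=False)
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    path = raw_csv_path(ticker)
    history.to_csv(path)
    return path
```

Calling `download_raw(ticker)` for every ticker in `SECTOR_TICKERS` would populate `data/raw/` with 22 files; it needs network access and the `yfinance` package.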


### Feature Engineering: 

Extensive feature engineering is implemented in the `03_feature_engineering.py` script (in the `src` folder). The script adds ~200 features, including some of the most widely used trading indicators. Below is a snapshot of what is covered:

- High-Low
- Price-Open
- lag features for closing price
- Volume and volatility indicators
- EMA, SMA, Bollinger Bands
- Swing direction
- Index (NASDAQ and SP500) derived features
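To make a few of these concrete, here is a small pandas sketch of some of the listed features for a single stock. It is not the code from `03_feature_engineering.py`: the function name `add_basic_features`, the window length, and every column name except the `Close_lag<k>` pattern are assumptions, and "Price-Open" is interpreted here as Close minus Open.

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame, n_lags: int = 3, window: int = 5) -> pd.DataFrame:
    """Add a handful of illustrative features to one stock's OHLC data."""
    out = df.copy()

    # Daily range and intraday move (assumed definitions).
    out["High_Low"] = out["High"] - out["Low"]
    out["Price_Open"] = out["Close"] - out["Open"]

    # Lagged closing prices: Close_lag1 is the previous row's close.
    for k in range(1, n_lags + 1):
        out[f"Close_lag{k}"] = out["Close"].shift(k)

    # Simple and exponential moving averages of the close.
    sma = out["Close"].rolling(window).mean()
    out[f"SMA_{window}"] = sma
    out[f"EMA_{window}"] = out["Close"].ewm(span=window, adjust=False).mean()

    # Bollinger Bands: moving average plus/minus two rolling standard deviations.
    rolling_std = out["Close"].rolling(window).std()
    out[f"BB_upper_{window}"] = sma + 2 * rolling_std
    out[f"BB_lower_{window}"] = sma - 2 * rolling_std
    return out


# Tiny synthetic example (not real prices).
demo = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "High": [11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
    "Low": [9.0, 10.0, 11.0, 12.0, 13.0, 14.0],
    "Close": [10.5, 11.5, 12.5, 13.5, 14.5, 15.5],
})
features = add_basic_features(demo, n_lags=2, window=5)
print(features.tail(2))
```

Rolling features are undefined for the first `window - 1` rows and lag features for the first `k` rows, so those cells are `NaN`. With several tickers in one frame, the same function would be applied per ticker (for example through `groupby`) so that lags do not leak across stocks.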

### Exploratory Data Analysis to guide modeling strategy

Before moving on to modelling and price prediction, one question needs an answer: how many models should we build?
To address it, we consider the following scenarios:

1. If stocks cluster clearly by sector or behavior

    🏢 <span style="color:lime">We will build per-sector models </span>

2. If each stock behaves uniquely

    📈 <span style="color:purple">We will build stock-specific models </span>

3. If patterns are consistent across stocks

    🌐 <span style="color:orange">We will build one unified model. </span>

### Let's start exploring the data with UMAP

First let's load and merge individual stock data.

```python
import pandas as pd
from pathlib import Path

# Folder with one merged CSV per stock, relative to the repository root
# (adjust this path to match your own checkout)
input_folder = Path("data/merged")
all_data = []

for file in input_folder.glob("*.csv"):
    ticker = file.stem  # assumes filename is the ticker or contains it
    df = pd.read_csv(file, parse_dates=['Date'])
    df['TICKER'] = ticker
    all_data.append(df)

# Stack the per-stock frames into one long DataFrame
merged_df = pd.concat(all_data, ignore_index=True)
print(f"The size of the merged DataFrame is: {merged_df.shape}")
print("-------------------------------------------------------------------------------------------------------------------------------------------------------")
print("The first few rows of the merged DataFrame:")
merged_df.head()
```



```text
The size of the merged DataFrame is: (134415, 182)
-------------------------------------------------------------------------------------------------------------------------------------------------------
The first few rows of the merged DataFrame:
```


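|   | Date | TICKER | Sector | MarketCap | MarketCapBin | Close | High | Low | Open | Volume | ... | Close_lag15 | Close_lag16 | Close_lag17 | Close_lag18 | Close_lag19 | Close_lag20 | Target_Raw_Close | Target_Log_Return | Target_%_Return | Target_Direction |
|---|------|--------|--------|-----------|--------------|-------|------|-----|------|--------|-----|-------------|-------------|-------------|-------------|-------------|-------------|------------------|-------------------|-----------------|------------------|
| 0 | 1999-12-01 | BAC | Financials | 342700523520 | Large Cap | 15.007516 | 15.299239 | 14.813034 | 14.813034 | 9380200 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.618541 | -0.02626 | -0.025919 | 0 |
| 1 | 1999-12-02 | BAC | Financials | 342700523520 | Large Cap | 14.618541 | 15.088538 | 14.391645 | 14.87785 | 12395600 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.894057 | 0.018672 | 0.018847 | 1 |
| 2 | 1999-12-03 | BAC | Financials | 342700523520 | Large Cap | 14.894057 | 15.104746 | 14.813023 | 14.87785 | 14423800 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.351138 | -0.037133 | -0.036452 | 0 |
| 3 | 1999-12-06 | BAC | Financials | 342700523520 | Large Cap | 14.351138 | 14.748205 | 14.132346 | 14.553723 | 13820000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 13.61372 | -0.052751 | -0.051384 | 0 |
| 4 | 1999-12-07 | BAC | Financials | 342700523520 | Large Cap | 13.61372 | 14.245786 | 13.548893 | 14.213372 | 20245200 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 13.143729 | -0.035133 | -0.034523 | 0 |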
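The four `Target_*` columns in the rows above appear to describe the next trading day's close. The sketch below reproduces them from the BAC closing prices shown. The formulas are inferred from the displayed numbers, not taken from the feature engineering script, so treat them as an assumption.

```python
import numpy as np
import pandas as pd

# BAC closes from the five rows above, plus the next close taken from
# the last row's Target_Raw_Close.
close = pd.Series([15.007516, 14.618541, 14.894057, 14.351138, 13.61372, 13.143729])

next_close = close.shift(-1)  # tomorrow's close aligned to today's row

targets = pd.DataFrame({
    "Close": close,
    "Target_Raw_Close": next_close,
    "Target_Log_Return": np.log(next_close / close),
    "Target_%_Return": next_close / close - 1,
})

# The final row has no "tomorrow", so drop it before deriving the direction.
targets = targets.dropna().copy()
targets["Target_Direction"] = (targets["Target_%_Return"] > 0).astype(int)

print(targets.round(6))
```

For the first row this gives a simple return of 14.618541 / 15.007516 - 1 and a log return equal to the natural logarithm of the same ratio, which agree with the displayed values after rounding.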


How many features do we have to explore?

```python
# Print number of features
print(f"Number of features: {len(merged_df.columns)}")
print("-------------------------------------------------------------------------------------------------------------------------------------------------------")

# Print list of features
columns = merged_df.columns.tolist()
print("List of features (10 per line):")
for i in range(0, len(columns), 10):
    print(columns[i:i+10])
```

```text
Number of features: 182
-------------------------------------------------------------------------------------------------------------------------------------------------------
List of features (10 per line):
['Date', 'TICKER', 'Sector', 'MarketCap', 'MarketCapBin', 'Close', 'High', 'Low', 'Open', 'Volume']
['NASDAQ_return_1', 'NASDAQ_return_5', 'NASDAQ_return_10', 'NASDAQ_return_20', 'NASDAQ_lag_1', 'NASDAQ_lag_2', 'NASDAQ_lag_3', 'NASDAQ_lag_4', 'NASDAQ_lag_5', 'NASDAQ_lag_6']
['NASDAQ_lag_7', 'NASDAQ_lag_8', 'NASDAQ_lag_9', 'NASDAQ_lag_10', 'NASDAQ_lag_11', 'NASDAQ_lag_12', 'NASDAQ_lag_13', 'NASDAQ_lag_14', 'NASDAQ_lag_15', 'NASDAQ_lag_16']
['NASDAQ_lag_17', 'NASDAQ_lag_18', 'NASDAQ_lag_19', 'NASDAQ_lag_20', 'SP500_return_1', 'SP500_return_5', 'SP500_return_10', 'SP500_return_20', 'SP500_lag_1', 'SP500_lag_2']
['SP500_lag_3', 'SP500_lag_4', 'SP500_lag_5', 'SP500_lag_6', 'SP500_lag_7', 'SP500_lag_8', 'SP500_lag_9', 'SP500_lag_10', 'SP500_lag_11', 'SP500_lag_12']
['SP500_lag_13', 'SP50
```

Next, let's check whether all stocks cover the same time frame.

```python
# Ensure 'Date' column is datetime
merged_df['Date'] = pd.to_datetime(merged_df['Date'])

# Get date range per TICKER
stock_ranges = merged_df.groupby('TICKER')['Date'].agg(start='min', end='max').reset_index()

print("🗓️ Available timeframe per stock:")
print(stock_ranges)
```


```text
🗓️ Available timeframe per stock:
   TICKER      start        end
0    AAPL 1999-12-01 2025-06-20
1     AMT 1999-12-01 2025-06-20
2    AMZN 1999-12-01 2025-06-20
3     BAC 1999-12-01 2025-06-20
4     CVX 1999-12-01 2025-06-20
5     DUK 1999-12-01 2025-06-20
6   GOOGL 2004-08-19 2025-06-20
7     JNJ 1999-12-01 2025-06-20
8     JPM 1999-12-01 2025-06-20
9      KO 1999-12-01 2025-06-20
10    LIN 1999-12-01 2025-06-20
11   META 2012-05-18 2025-06-20
12   MSFT 1999-12-01 2025-06-20
13    NEE 1999-12-01 2025-06-20
14     PG 1999-12-01 2025-06-20
15    PLD 1999-12-01 2025-06-20
16    RTX 1999-12-01 2025-06-20
17    SHW 1999-12-01 2025-06-20
18   TSLA 2010-06-29 2025-06-20
19    UNH 1999-12-01 2025-06-20
20    UNP 1999-12-01 2025-06-20
21    XOM 1999-12-01 2025-06-20
```


The great majority of stocks have data going back to 1999. Google (GOOGL), Facebook (META), and Tesla (TSLA) are the exceptions, with histories starting in 2004, 2012, and 2010 respectively. To work with an identical date range for every stock, we restrict the data to the time frame defined by the youngest listing, META, whose price history starts on 2012-05-18. We will use this date as the starting point for all stocks.

```python
# Define the desired start date - here based on META's first available date
cutoff_date = pd.to_datetime("2012-05-18")

# Filter merged_df to only include data on or after the cutoff
merged_df = merged_df[merged_df['Date'] >= cutoff_date]

# View the timeframe after filtering
print("Filtered merged_df shape:", merged_df.shape)
print("🗓️ Final start dates by stock:")
print(merged_df.groupby('TICKER')['Date'].agg(start='min', end='max').reset_index())
```

```text
Filtered merged_df shape: (72402, 182)
🗓️ Final start dates by stock:
   TICKER      start        end
0    AAPL 2012-05-18 2025-06-20
1     AMT 2012-05-18 2025-06-20
2    AMZN 2012-05-18 2025-06-20
3     BAC 2012-05-18 2025-06-20
4     CVX 2012-05-18 2025-06-20
5     DUK 2012-05-18 2025-06-20
6   GOOGL 2012-05-18 2025-06-20
7     JNJ 2012-05-18 2025-06-20
8     JPM 2012-05-18 2025-06-20
9      KO 2012-05-18 2025-06-20
10    LIN 2012-05-18 2025-06-20
11   META 2012-05-18 2025-06-20
12   MSFT 2012-05-18 2025-06-20
13    NEE 2012-05-18 2025-06-20
14     PG 2012-05-18 2025-06-20
15    PLD 2012-05-18 2025-06-20
16    RTX 2012-05-18 2025-06-20
17    SHW 2012-05-18 2025-06-20
18   TSLA 2012-05-18 2025-06-20
19    UNH 2012-05-18 2025-06-20
20    UNP 2012-05-18 2025-06-20
21    XOM 2012-05-18 2025-06-20
```
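The cutoff above is typed in by hand. As a possible alternative, it could be derived from the data as the latest of the per-stock first dates, so it stays correct if the ticker list changes. The sketch below shows the idea on a tiny made-up frame; the variable names are illustrative and only the `TICKER` and `Date` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for merged_df: AAPL starts earlier than META.
toy = pd.DataFrame({
    "TICKER": ["AAPL", "AAPL", "AAPL", "META", "META"],
    "Date": pd.to_datetime([
        "1999-12-01", "2012-05-18", "2012-05-21",
        "2012-05-18", "2012-05-21",
    ]),
})

# First available date per stock, then the latest of those first dates.
first_dates = toy.groupby("TICKER")["Date"].min()
common_start = first_dates.max()

# Keep only rows on or after the common start date.
filtered = toy[toy["Date"] >= common_start]

# Sanity check: every stock should now have the same number of rows.
rows_per_ticker = filtered.groupby("TICKER").size()
print(common_start.date(), rows_per_ticker.to_dict())
```

If the row counts per ticker still differ after filtering, some stock has missing trading days inside the window and may need separate handling.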


```python
import plotly.express as px

# Recompute the per-stock date range on the filtered data
stock_ranges = merged_df.groupby("TICKER")["Date"].agg(start="min", end="max").reset_index()

# One horizontal bar per ticker, spanning its available dates
fig = px.timeline(stock_ranges, x_start="start", x_end="end", y="TICKER", title="Stock Data Availability")
fig.update_yaxes(autorange="reversed")
fig.show()
```


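Running this cell raised `ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed`, so the figure was not rendered. Installing `nbformat` (version 4.2.0 or later) in the notebook environment and restarting the kernel resolves this.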