# 🌳📈 Stock Price Prediction Across Market Sectors

This project applies machine learning to stock price prediction, with an emphasis on sector-level diversity and company-level representation. The analysis covers all 11 sectors defined by the Global Industry Classification Standard (GICS). For each sector, leading stocks were selected from a predefined list of well-established, widely traded companies (23 tickers in the dataset loaded below).

The goal is to develop a generalizable and reproducible prediction pipeline, while gaining insight into the behavior of stocks across different industries. 

### GICS Sectors Covered:
- Information Technology  
- Health Care  
- Financials  
- Consumer Discretionary  
- Communication Services  
- Industrials  
- Consumer Staples  
- Energy  
- Utilities  
- Real Estate  
- Materials

### This notebook explores tree-based models and feature importance

## 🏗️ 1. Project Overview

In this notebook, we use machine learning to predict future stock returns.

We will:

- Load & merge historical stock data across 11 GICS sectors.
- Prepare and engineer features (200+ indicators).
- Train tree-based regression models (Random Forest, LightGBM).
- Analyze feature importances to understand key drivers.
- Evaluate model performance.
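
As a rough illustration of the last three steps, the sketch below fits a Random Forest on synthetic data and reads its impurity-based feature importances. The data, the names `X_demo` and `y_demo`, and the hyperparameters are assumptions made for this example only; they are not the notebook's actual features or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): 500 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
# Only the first column carries signal; the other two are pure noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)

# Hyperparameters are arbitrary choices for the demo
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = model.feature_importances_
top_feature = int(np.argmax(importances))
print(importances.round(3), "-> most important column:", top_feature)
```

On real market data the ranking is rarely this clean, since correlated indicators tend to share importance among themselves.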

## 🔄 2. Load and Merge Data

In [None]:
import pandas as pd
from pathlib import Path

# Define path to data
data_path = Path("/Users/beatawyspianska/Desktop/AIML_Projects/predict_stock_price/stock-price-predictor/data/merged")

# Load all CSV files and combine into a single DataFrame
dfs = []
for file in data_path.glob("*.csv"):
    ticker = file.stem
    df = pd.read_csv(file, parse_dates=['Date'])
    df['TICKER'] = ticker
    dfs.append(df)

# Merge into single DataFrame
df = pd.concat(dfs, ignore_index=True)

# Sanity check
print(f"✅ Loaded data shape: {df.shape}")
df.head()

✅ Loaded data shape: (141049, 155)


| | Date | TICKER | Sector | MarketCap | MarketCapBin | Close | High | Low | Open | Volume | ... | Target_Raw_Close_21d | Target_%_Return_7d | Target_Log_Return_7d | Target_Direction_7d | Target_%_Return_14d | Target_Log_Return_14d | Target_Direction_14d | Target_%_Return_21d | Target_Log_Return_21d | Target_Direction_21d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999-12-01 | BAC | Financials | 368534880256 | Large Cap | 15.007504 | 15.299226 | 14.813022 | 14.813022 | 9380200 | ... | 13.014065 | -0.073434 | -0.07627 | 0 | -0.147948 | -0.160108 | 0 | -0.132829 | -0.14252 | 0 |
| 1 | 1999-12-02 | BAC | Financials | 368534880256 | Large Cap | 14.618542 | 15.088539 | 14.391646 | 14.877851 | 12395600 | ... | 12.56028 | -0.066241 | -0.068537 | 0 | -0.139689 | -0.150462 | 0 | -0.140798 | -0.151751 | 0 |
| 2 | 1999-12-03 | BAC | Financials | 368534880256 | Large Cap | 14.894052 | 15.10474 | 14.813018 | 14.877845 | 14423800 | ... | 11.814767 | -0.109902 | -0.116423 | 0 | -0.14037 | -0.151253 | 0 | -0.206746 | -0.231612 | 0 |
| 3 | 1999-12-06 | BAC | Financials | 368534880256 | Large Cap | 14.351131 | 14.748198 | 14.132339 | 14.553717 | 13820000 | ... | 11.944418 | -0.084698 | -0.088501 | 0 | -0.114624 | -0.121743 | 0 | -0.167702 | -0.183565 | 0 |
| 4 | 1999-12-07 | BAC | Financials | 368534880256 | Large Cap | 13.613721 | 14.245787 | 13.548894 | 14.213373 | 20245200 | ... | 12.965451 | -0.044048 | -0.045047 | 0 | -0.060714 | -0.062636 | 0 | -0.047619 | -0.04879 | 0 |
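
The loading loop can also be wrapped in a small function, which makes it easier to reuse and to test. The sketch below is one possible shape, not the notebook's code: the function name `load_merged`, the sorted file order, and the empty-directory check are additions for illustration, and the two demo files contain made-up values.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_merged(data_dir: Path) -> pd.DataFrame:
    """Read every CSV in data_dir and stack them, tagging rows with the file stem."""
    frames = []
    # sorted() makes the concatenation order deterministic across file systems
    for csv_file in sorted(data_dir.glob("*.csv")):
        frame = pd.read_csv(csv_file, parse_dates=["Date"])
        frame["TICKER"] = csv_file.stem
        frames.append(frame)
    if not frames:
        # pd.concat raises a less descriptive error on an empty list
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    return pd.concat(frames, ignore_index=True)


# Tiny demo with two made-up files (values are illustrative, not real prices)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"Date": ["2020-01-02", "2020-01-03"], "Close": [1.0, 1.1]}).to_csv(
        tmp_dir / "AAA.csv", index=False
    )
    pd.DataFrame({"Date": ["2020-01-02"], "Close": [2.0]}).to_csv(
        tmp_dir / "BBB.csv", index=False
    )
    demo = load_merged(tmp_dir)

print(demo.shape)
print(demo["TICKER"].tolist())
```

Sorting the file list changes the row order compared with the plain `glob` loop above, so the ticker order in later outputs may differ.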


## 🧹 3. Basic Cleaning & Filtering
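Drop rows with missing values (for simplicity).

Keep only rows from 2013-03-07 onward so that all stocks share the same timeframe. The cutoff is driven by META, the most recently listed ticker in the set (IPO in May 2012); its rows presumably become complete only once the longer-lookback indicators have enough history.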

In [2]:
# Drop missing values
df.dropna(inplace=True)
print(f"✅ After dropping NA: {df.shape}")

# Align all tickers to a common start date (cutoff driven by META's late listing)
start_date = pd.to_datetime("2013-03-07")
df = df[df['Date'] >= start_date].reset_index(drop=True)
print(f"✅ After aligning start date: {df.shape}")

# Confirm unique stocks
print("Unique tickers:", df['TICKER'].nunique(), df['TICKER'].unique())

✅ After dropping NA: (135989, 155)
✅ After aligning start date: (70840, 155)
Unique tickers: 23 ['BAC' 'UNP' 'AMT' 'PG' 'AMZN' 'KO' 'NEE' 'MSFT' 'SHW' 'RTX' 'NVDA' 'XOM'
 'CVX' 'PLD' 'LIN' 'JPM' 'TSLA' 'GOOGL' 'META' 'DUK' 'UNH' 'AAPL' 'JNJ']
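
The cutoff date above is hard-coded. A possible alternative, sketched here on a made-up toy frame, is to derive it from the data: take each ticker's first available date after dropping missing values, then use the latest of those as the common start. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up toy frame: ticker "NEW" starts one day later than ticker "OLD"
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-02", "2020-01-03"]
    ),
    "TICKER": ["OLD", "OLD", "OLD", "NEW", "NEW"],
    "Close": [10.0, 10.5, 10.2, 50.0, 51.0],
})

# First available date per ticker, then the latest of those first dates
first_dates = toy.groupby("TICKER")["Date"].min()
common_start = first_dates.max()

aligned = toy[toy["Date"] >= common_start].reset_index(drop=True)

# After alignment every ticker should contribute the same number of rows
rows_per_ticker = aligned.groupby("TICKER").size()
print(common_start.date(), rows_per_ticker.to_dict())
```

The same per-ticker row count is a useful sanity check on the real data. In the output above, 70,840 rows across 23 tickers works out to 3,080 rows per ticker, which is consistent with an evenly aligned panel.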


## ✂️ 4. Prepare Features and Targets
Here we will:

- Drop identifiers, raw price and volume columns, trend bookkeeping columns, and all target columns to create the feature matrix `X`.
- Use `'Target_%_Return_21d'` as the prediction target `y`.
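
For reference, the target columns in the preview table are consistent with simple forward-return definitions: the percentage return is the future close divided by the current close minus one, and the log return is the logarithm of the same ratio. The sketch below checks this against the first BAC row. The function name `forward_targets` is hypothetical, and the direction rule (1 when the return is positive, else 0) is an assumption, since the preview only shows negative returns labeled 0.

```python
import math


def forward_targets(close_now: float, close_future: float) -> dict:
    """Compute forward-return targets from the current and future close."""
    ratio = close_future / close_now
    pct_return = ratio - 1.0
    return {
        "pct_return": pct_return,
        "log_return": math.log(ratio),
        # Assumed rule: 1 for a positive return, 0 otherwise
        "direction": int(pct_return > 0),
    }


# First BAC row of the preview table: Close and Target_Raw_Close_21d
targets = forward_targets(15.007504, 13.014065)
print(targets)
```

The printed values should land close to the table's `Target_%_Return_21d` (-0.132829) and `Target_Log_Return_21d` (-0.14252) for that row.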

In [None]:
# Columns to exclude from X
exclude_cols = [
    'Date', 'Close', 'Open', 'High', 'Low', 'Volume', 'TICKER', 'Sector',  'trend_id',
    'trend_duration', 'is_trend_start',
    'Target_Raw_Close_daily', 'Target_Raw_Close_7d', 'Target_Raw_Close_14d', 'Target_Raw_Close_21d',
    'Target_%_Return_daily', 'Target_%_Return_7d', 'Target_%_Return_14d', 'Target_%_Return_21d',
    'Target_Log_Return_daily', 'Target_Log_Return_7d', 'Target_Log_Return_14d', 'Target_Log_Return_21d',
    'Target_Direction_daily', 'Target_Direction_7d', 'Target_Direction_14d', 'Target_Direction_21d'
]

# Features & target
X = df.drop(columns=exclude_cols, errors='ignore')  # errors='ignore' skips names that are absent
y = df['Target_%_Return_21d']

# Sanity check
print(f"✅ Feature matrix shape: {X.shape}")
print(f"✅ Target shape: {y.shape}")
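
Two quick checks are worth running on a feature matrix built this way: that no target column slipped through, and that no text column remains. In the preview table, `MarketCapBin` holds strings such as "Large Cap" and is not listed in `exclude_cols`; whether it is encoded later is not shown in this section. The sketch below runs both checks on a made-up miniature frame. The indicator name `rsi_14` and the deliberately incomplete exclude list are assumptions for illustration.

```python
import pandas as pd

# Made-up miniature of the merged frame; most column names mirror the preview table
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "TICKER": ["AAA", "AAA"],
    "MarketCapBin": ["Large Cap", "Large Cap"],
    "Close": [10.0, 10.5],
    "rsi_14": [55.0, 60.0],  # hypothetical indicator name
    "Target_%_Return_7d": [0.01, -0.02],
    "Target_%_Return_21d": [0.03, 0.01],
})

# Deliberately incomplete exclude list, to show what the checks catch
exclude = ["Date", "Close", "TICKER", "Target_%_Return_21d"]
X_toy = toy.drop(columns=exclude, errors="ignore")

# Check 1: any remaining target column would leak future information
leaked = [col for col in X_toy.columns if col.startswith("Target_")]

# Check 2: non-numeric columns need encoding before most tree libraries accept them
non_numeric = X_toy.select_dtypes(exclude="number").columns.tolist()

print("Leaked targets:", leaked)
print("Non-numeric features:", non_numeric)
```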
