# Stock Price Prediction Across Market Sectors

This project applies machine learning to the problem of stock price prediction, with an emphasis on sector-level diversity and company-level representation. The analysis covers all 11 sectors defined by the Global Industry Classification Standard (GICS). Each sector is represented by two leading stocks, giving a predefined list of 22 well-established and widely traded companies.

The goal is to develop a generalizable and reproducible prediction pipeline, while gaining insight into the behavior of stocks across different industries. 

### GICS Sectors Covered:
- Information Technology  
- Health Care  
- Financials  
- Consumer Discretionary  
- Communication Services  
- Industrials  
- Consumer Staples  
- Energy  
- Utilities  
- Real Estate  
- Materials


### 📈 Dataset

This project uses historical daily stock price data downloaded with [yfinance](https://pypi.org/project/yfinance/), an open-source Python package for retrieving Yahoo Finance data. Each daily record includes the Open, High, Low, Close, and Adjusted Close prices, plus trading Volume.

We selected 22 companies across 11 sectors of the US stock market:

| Sector                    | Tickers         |
|--------------------------|-----------------|
| Information Technology   | AAPL, MSFT      |
| Health Care              | JNJ, UNH        |
| Financials               | JPM, BAC        |
| Consumer Discretionary   | AMZN, TSLA      |
| Communication Services   | GOOGL, META     |
| Industrials              | UNP, RTX        |
| Consumer Staples         | PG, KO          |
| Energy                   | XOM, CVX        |
| Utilities                | NEE, DUK        |
| Real Estate              | AMT, PLD        |
| Materials                | LIN, SHW        |

These companies were selected due to their market leadership, high liquidity, and rich historical data. They serve as strong representatives of their sectors and offer a diverse foundation for building and evaluating time series forecasting models.

Raw data is saved in `data/raw/` as individual CSV files.
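
The download step itself is not shown in this notebook. Below is a minimal sketch of how it could look, assuming the `yfinance` package's `Ticker.history` method, the full available history (`period="max"`), and one file per ticker named `<TICKER>.csv`. The helper names, the file naming, and the date range are assumptions for illustration, not the project's confirmed setup.

```python
from pathlib import Path

# Tickers from the table above, grouped by GICS sector
SECTOR_TICKERS = {
    "Information Technology": ["AAPL", "MSFT"],
    "Health Care": ["JNJ", "UNH"],
    "Financials": ["JPM", "BAC"],
    "Consumer Discretionary": ["AMZN", "TSLA"],
    "Communication Services": ["GOOGL", "META"],
    "Industrials": ["UNP", "RTX"],
    "Consumer Staples": ["PG", "KO"],
    "Energy": ["XOM", "CVX"],
    "Utilities": ["NEE", "DUK"],
    "Real Estate": ["AMT", "PLD"],
    "Materials": ["LIN", "SHW"],
}

RAW_DIR = Path("data/raw")


def raw_csv_path(ticker, raw_dir=RAW_DIR):
    """Return the (assumed) CSV location for one ticker."""
    return Path(raw_dir) / f"{ticker}.csv"


def download_all(raw_dir=RAW_DIR, period="max"):
    """Download daily history for every ticker and save one CSV per ticker."""
    import yfinance as yf  # imported lazily so the helpers above work without it

    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    for tickers in SECTOR_TICKERS.values():
        for ticker in tickers:
            # auto_adjust=False keeps both the Close and Adj Close columns
            history = yf.Ticker(ticker).history(period=period, auto_adjust=False)
            history.to_csv(raw_csv_path(ticker, raw_dir))
```

Calling `download_all()` requires network access; the dictionary and `raw_csv_path` can be reused on their own to iterate over the saved files.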


### Basic Feature Engineering: 
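The first pass adds three kinds of features to each file:
- `High-Low`: the daily trading range (High minus Low)
- `Price-Open`: the intraday change (Close minus Open)
- `Close_lag1` to `Close_lag20`: the closing price on each of the previous 20 trading days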

In [55]:
# This script creates High-Low, Price-Open, and lagged closing-price features for the past 20 trading days
import os
import pandas as pd

# Set the directory containing your stock CSVs
directory = '/Users/beatawyspianska/Desktop/AIML_Projects/predict_stock_price/stock-price-predictor/data/raw/modified' # Replace with your actual path

# Loop through all CSV files in the folder
for file in os.listdir(directory):
    if file.endswith('.csv'):
        file_path = os.path.join(directory, file)
        
        # Load the CSV
        data = pd.read_csv(file_path)
        
        # Ensure required columns exist
        required_cols = ['High', 'Low', 'Close', 'Open']
        if not all(col in data.columns for col in required_cols):
            print(f"Skipping {file}: missing required columns.")
            continue

        # Add engineered features
        data['High-Low'] = data['High'] - data['Low']
        data['Price-Open'] = data['Close'] - data['Open']
        
        # Create lag features for 'Close'
        for i in range(1, 21):
            data[f'Close_lag{i}'] = data['Close'].shift(i)
        
        # Drop rows with any NaNs caused by lagging
        data = data.dropna(subset=[f'Close_lag{n}' for n in range(1, 21)]).reset_index(drop=True)

        # Overwrite the original file in place with the modified data
        data.to_csv(file_path, index=False)

        print(f"Processed and saved: {file}")


Processed and saved: modified_RTX.csv
Processed and saved: modified_SHW.csv
Processed and saved: modified_CVX.csv
Processed and saved: modified_XOM.csv
Processed and saved: modified_TSLA.csv
Processed and saved: modified_AMT.csv
Processed and saved: modified_META.csv
Processed and saved: modified_NEE.csv
Processed and saved: modified_UNP.csv
Processed and saved: modified_GOOGL.csv
Processed and saved: modified_AAPL.csv
Processed and saved: modified_BAC.csv
Processed and saved: modified_KO.csv
Processed and saved: modified_JNJ.csv
Processed and saved: modified_PG.csv
Processed and saved: modified_DUK.csv
Processed and saved: modified_UNH.csv
Processed and saved: modified_AMZN.csv
Processed and saved: modified_JPM.csv
Processed and saved: modified_LIN.csv
Processed and saved: modified_MSFT.csv
Processed and saved: modified_PLD.csv
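
Because the loop above writes each file back to the same path, running the cell a second time would recompute the lags on the already trimmed data and drop another 20 rows. One way to make the step easier to test and re-run is to keep the transformation in a pure function and leave file handling to the caller. The sketch below is an illustration, not the project's code; `add_basic_features` and its `n_lags` parameter are hypothetical names, and the toy prices are made up.

```python
import pandas as pd


def add_basic_features(data, n_lags=20):
    """Return a copy of `data` with High-Low, Price-Open, and Close lag columns."""
    out = data.copy()
    out["High-Low"] = out["High"] - out["Low"]
    out["Price-Open"] = out["Close"] - out["Open"]

    lag_cols = []
    for i in range(1, n_lags + 1):
        col = f"Close_lag{i}"
        out[col] = out["Close"].shift(i)  # value of Close i rows earlier
        lag_cols.append(col)

    # The first n_lags rows have no complete history, so they are dropped
    return out.dropna(subset=lag_cols).reset_index(drop=True)


# Toy data (made-up numbers, for illustration only)
toy = pd.DataFrame({
    "Open":  [10.0, 11.0, 12.0, 13.0],
    "High":  [11.0, 12.5, 13.0, 14.0],
    "Low":   [9.5, 10.5, 11.5, 12.5],
    "Close": [10.5, 12.0, 12.5, 13.5],
})
features = add_basic_features(toy, n_lags=2)
print(features[["Close", "Close_lag1", "Close_lag2", "High-Low", "Price-Open"]])
```

With two lags, the first two of the four toy rows are dropped, and the input frame is left unchanged, so the function can be applied repeatedly to the raw data without losing further rows.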


### Let's create our Target column

Because we are predicting the next-day Close, the target for each row is the closing price of the following trading day.
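
A minimal sketch of such a target, assuming it is built by shifting `Close` back by one row so that each row holds the next trading day's close. The column name `Target` and the helper `add_next_day_target` are hypothetical, the rows are assumed to be sorted by date, and the toy prices are made up.

```python
import pandas as pd


def add_next_day_target(df, target_col="Target"):
    """Return a copy of `df` with the next row's Close as the target."""
    out = df.copy()
    # shift(-1) pulls the following row's Close up to the current row
    out[target_col] = out["Close"].shift(-1)
    # The last row has no next-day value, so it cannot be used for training
    return out.dropna(subset=[target_col])


# Toy data (made-up numbers, for illustration only)
prices = pd.DataFrame({"Close": [100.0, 101.5, 99.0, 102.0]})
with_target = add_next_day_target(prices)
print(with_target)
```

Using a negative shift keeps every feature on a row strictly from the current day or earlier, while the target comes from the day after.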

In [None]:
import pandas as pd
import pandas_ta as ta
import os

from pathlib import Path

# Define input and output folders
input_folder = Path("/Users/beatawyspianska/Desktop/AIML_Projects/predict_stock_price/stock-price-predictor/data/raw/modified")
output_folder = Path("/Users/beatawyspianska/Desktop/AIML_Projects/predict_stock_price/stock-price-predictor/data/processed")
output_folder.mkdir(parents=True, exist_ok=True)

# Define a function to add features
def engineer_features(df):
    # Work on a copy of the argument so the caller's DataFrame is not modified
    df = df.copy()

    # Ensure datetime format and sort
    df['Date'] = pd.to_datetime(df['Date'])
    df.sort_values('Date', inplace=True)
    df.set_index('Date', inplace=True)

    # Rolling mean and standard deviation of Close over 5-, 10-, and 20-day windows
    df['RollingMean_5'] = df['Close'].rolling(window=5).mean()
    df['RollingStd_5'] = df['Close'].rolling(window=5).std()
    df['RollingMean_10'] = df['Close'].rolling(window=10).mean()
    df['RollingStd_10'] = df['Close'].rolling(window=10).std()
    df['RollingMean_20'] = df['Close'].rolling(window=20).mean()
    df['RollingStd_20'] = df['Close'].rolling(window=20).std()

    # Price momentum
    df['Return_1'] = df['Close'].pct_change(1)
    df['Return_5'] = df['Close'].pct_change(5)
    df['Return_10'] = df['Close'].pct_change(10)
    df['Return_20'] = df['Close'].pct_change(20)

