# Sentiment-Enhanced Stock Predictor — End-to-End Workflow

This notebook walks through the full MVP pipeline:

1. Setup & imports
2. Build dataset (prices + sentiment)
3. Feature engineering & target creation
4. EDA
5. Train LSTM models (per ticker)
6. Predict & conservative backtest
7. Visualize results & compute metrics

**Pre-reqs**
- Create `.env` with `NEWSAPI_KEY=...`
- Install requirements: `pip install -r requirements.txt`
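- Run the setup commands from the repo root. The notebook itself is assumed to sit one directory below it, since the setup cell treats `Path.cwd().parent` as the repo root.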
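
Before running the build step it can help to confirm that `NEWSAPI_KEY` is actually visible to Python. The sketch below is a standard-library stand-in, not the project's own loader: `read_env_file` and `get_newsapi_key` are hypothetical helper names, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_newsapi_key(path=".env"):
    """Prefer an exported variable, then fall back to the .env file."""
    return os.environ.get("NEWSAPI_KEY") or read_env_file(path).get("NEWSAPI_KEY")
```

If `get_newsapi_key()` returns `None`, the headline fetch will most likely fail, so it is worth checking before the dataset build.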


In [4]:
# Cell: Setup & imports
import os
import sys
import json
import datetime as dt
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

repo_root = Path.cwd().parent  # the notebook runs one level below the repo root
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# helper plotting defaults
plt.rcParams["figure.figsize"] = (10, 5)

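# convenience utils
from IPython.display import display  # explicit import so the helper also works outside a notebook cell

def show_df_head(df, n=5):
    """Render the first n rows of a DataFrame."""
    display(df.head(n))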

print("Repo root:", repo_root)


Repo root: /Users/abhirupghosh/Documents/Work/job_preparation/case_studies/sentiment-enhanced-stock-predictor


## Configuration: tickers, date range, model params
Set the assets and hyperparameters for the run below.

In [5]:
# Cell: configuration
TICKERS = ["AAPL", "MSFT", "NVDA"]          # start small for speed
START = "2023-01-01"
END = dt.date.today().isoformat()
FEATURES_PATH = "data/processed/features.parquet"
LOOKBACK = 10
EPOCHS = 8   # keep small for quick runs; increase later
THRESHOLD = 0.75
HOLD_DAYS = 3

print("Tickers:", TICKERS, "Start:", START, "End:", END)


Tickers: ['AAPL', 'MSFT', 'NVDA'] Start: 2023-01-01 End: 2025-09-04
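
`LOOKBACK` presumably sets how many past rows each LSTM sample sees. A minimal sketch of that windowing, under the assumption that each window is labelled with the target of its last row; `make_windows` is a hypothetical helper and not a function from `src`.

```python
import numpy as np


def make_windows(features, targets, lookback=10):
    """Turn a (n_rows, n_features) array into (n_windows, lookback, n_features).

    Window i covers rows i .. i+lookback-1 and takes the target of its
    last row as its label.
    """
    features = np.asarray(features, dtype=np.float32)
    targets = np.asarray(targets)
    n_windows = len(features) - lookback + 1
    if n_windows <= 0:
        # Not enough rows for a single window
        empty = np.empty((0, lookback, features.shape[1]), dtype=np.float32)
        return empty, targets[:0]
    X = np.stack([features[i:i + lookback] for i in range(n_windows)])
    y = targets[lookback - 1:]
    return X, y
```

Windows should be built per ticker so that a sequence never spans two different stocks.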


# 1) Build dataset (prices + sentiment + features + targets)

This step:
- fetches prices (via `src.data_fetch.fetch_prices`)
- fetches headlines via NewsAPI and scores them with VADER (via `src.sentiment.get_daily_sentiment`)
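- computes technical features (via `src.features.add_technical_features`) and merges in the daily sentiment (via `merge_sentiment`)
- computes `future_return_3d` (the return of `Close` over the next 3 trading days) and the binary `target`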
- saves result to `data/processed/features.parquet`
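
The return and moving-average columns can be sketched as follows. This is not the code in `src.features`; `add_returns_and_target` is a hypothetical helper. The formulas for `Return_1d`, `SMA_5` and `future_return_3d` agree with the sample rows in the saved features, but the labelling rule (`target = 1` when the forward return is positive) is an assumption.

```python
import pandas as pd


def add_returns_and_target(df, horizon=3, sma_windows=(5, 10, 20)):
    """Add Return_1d, SMA_n, future_return_{horizon}d and target per ticker."""
    df = df.sort_values(["Ticker", "Date"]).copy()
    close = df.groupby("Ticker")["Close"]

    # Backward-looking features: only use data up to the current row
    df["Return_1d"] = df["Close"] / close.shift(1) - 1
    for w in sma_windows:
        df[f"SMA_{w}"] = close.transform(lambda s, w=w: s.rolling(w).mean())

    # Forward-looking label: close `horizon` rows ahead vs. today's close
    col = f"future_return_{horizon}d"
    df[col] = close.shift(-horizon) / df["Close"] - 1

    # Assumption: label is 1 when the forward return is positive.
    # The last `horizon` rows per ticker have no forward return; they get
    # target 0 here and should be dropped before training.
    df["target"] = (df[col] > 0).astype(int)
    return df
```

Grouping by `Ticker` before shifting keeps one stock's prices from leaking into another's returns.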


In [6]:
# Cell: build dataset
# Run src.build_dataset.main to create the features parquet.
# Signature: main(tickers, start, end, out_path)
from importlib import import_module
mod = import_module("src.build_dataset")
mod.main(TICKERS, START, END, FEATURES_PATH)
print("Saved features to:", FEATURES_PATH)


Saved features to data/processed/features.parquet with shape (2007, 18)
Saved features to: data/processed/features.parquet


# Inspect the created features

In [7]:
# Cell: load features and quick check
df = pd.read_parquet(FEATURES_PATH)
print("Features shape:", df.shape)
print("Columns:", df.columns.tolist())
show_df_head(df)

Features shape: (2007, 18)
Columns: ['Date', 'Ticker', 'Close', 'High', 'Low', 'Open', 'Volume', 'Return_1d', 'SMA_5', 'SMA_10', 'SMA_20', 'Vol_10', 'date', 'sentiment', 'ticker', 'sentiment_3d', 'future_return_3d', 'target']


| | Date | Ticker | Close | High | Low | Open | Volume | Return_1d | SMA_5 | SMA_10 | SMA_20 | Vol_10 | date | sentiment | ticker | sentiment_3d | future_return_3d | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | AAPL | 123.330643 | 129.079559 | 122.443158 | 128.468186 | 112117500 | | | | | | 2023-01-03 | 0.0 | | 0.0 | 0.03638 | 1 |
| 1 | 2023-01-04 | AAPL | 124.602707 | 126.870724 | 123.340509 | 125.125335 | 89113600 | 0.010314 | | | | | 2023-01-04 | 0.0 | | 0.0 | 0.029994 | 1 |
| 2 | 2023-01-05 | AAPL | 123.281342 | 125.993097 | 123.024963 | 125.361998 | 80962700 | -0.010605 | | | | | 2023-01-05 | 0.0 | | 0.0 | 0.045673 | 1 |
| 3 | 2023-01-06 | AAPL | 127.81736 | 128.47804 | 123.153145 | 124.257571 | 87754700 | 0.036794 | | | | | 2023-01-06 | 0.0 | | 0.0 | 0.029857 | 1 |
| 4 | 2023-01-09 | AAPL | 128.339996 | 131.554669 | 128.083618 | 128.655553 | 70790800 | 0.004089 | 125.474409 | | | | 2023-01-09 | 0.0 | | 0.0 | 0.025048 | 1 |
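
The first rows show empty rolling-window columns, an empty `ticker` column, and a `date` column that repeats `Date`. A small check along these lines can quantify that before training. `summarize_features` is a hypothetical helper, and treating the lowercase `date`/`ticker` columns as leftover merge keys is an assumption based on the sample rows only.

```python
import pandas as pd


def summarize_features(df):
    """Report missing values per column and drop duplicate merge keys."""
    na_counts = df.isna().sum()
    # Keep only columns that actually contain missing values
    report = na_counts[na_counts > 0].sort_values(ascending=False)

    # Assumption: lowercase `date`/`ticker` duplicate `Date`/`Ticker`
    redundant = [c for c in ("date", "ticker") if c in df.columns]
    cleaned = df.drop(columns=redundant)
    return report, cleaned
```

Calling `report, cleaned = summarize_features(df)` on the loaded features gives a per-column count of missing values and a frame without the redundant keys.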
