# Advanced Stock Price Prediction

This notebook implements an advanced stock price prediction project suitable for a resume/GitHub. It includes data download, EDA, feature engineering (MA, EMA, RSI), classical ML (Linear Regression, Random Forest), LSTM deep learning, hyperparameter tuning, multi-step forecasting, evaluation, and instructions for saving and pushing the project to GitHub.

**Notes:**
- This notebook is intended to be run locally. It uses `yfinance` to download data. If you run this in an environment without internet access, first download the CSV into `data/` and adjust the data path.
- Cells marked `🚀` are key steps for training and evaluation.


In [8]:
%%bash
# Setup: install required packages (run once in your environment).
# The %%bash cell magic must be the first line of the cell.
pip install --quiet yfinance pandas numpy matplotlib scikit-learn tensorflow keras seaborn ta
echo 'Packages installed (or already present).'

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

sns.set(style='darkgrid')

print('Imports ok')

In [None]:
# Parameters
TICKER = 'AAPL'  # change as needed
START_DATE = '2015-01-01'
END_DATE = datetime.today().strftime('%Y-%m-%d')
WINDOW = 60  # sequence length for LSTM
FUTURE_DAYS = 7  # multi-step forecast horizon
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)  # also seed TensorFlow so LSTM runs are more repeatable

os.makedirs('data', exist_ok=True)
os.makedirs('model', exist_ok=True)


In [None]:
# Download historical data using yfinance
import yfinance as yf

csv_path = f'data/{TICKER}.csv'
if not os.path.exists(csv_path):
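    df = yf.download(TICKER, start=START_DATE, end=END_DATE)
    # Recent yfinance releases may return MultiIndex columns (field, ticker)
    # even for a single ticker; flatten them so df['Close'] is a Series.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    df.to_csv(csv_path)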
    print('Downloaded to', csv_path)
else:
    df = pd.read_csv(csv_path, index_col=0, parse_dates=True)
    print('Loaded existing CSV:', csv_path)

# show head
print(df.head())


In [None]:
# Basic EDA
print('Shape:', df.shape)
df.info()  # info() prints directly and returns None, so no print() wrapper

plt.figure(figsize=(12,5))
plt.plot(df['Close'])
plt.title(f'{TICKER} Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.show()

# Daily returns
if 'Adj Close' in df.columns:
    df['Return'] = df['Adj Close'].pct_change()
else:
    df['Return'] = df['Close'].pct_change()

plt.figure(figsize=(12,4))
plt.plot(df['Return'].fillna(0))
plt.title('Daily Returns')
plt.show()


## Feature engineering
We will add commonly-used technical indicators: moving averages (MA), exponential moving averages (EMA), and RSI. You can add more indicators from the `ta` package.

In [None]:
# Feature engineering functions

def add_technical_indicators(df):
    df = df.copy()
    df['MA10'] = df['Close'].rolling(window=10).mean()
    df['MA21'] = df['Close'].rolling(window=21).mean()
    df['EMA10'] = df['Close'].ewm(span=10, adjust=False).mean()
    df['EMA21'] = df['Close'].ewm(span=21, adjust=False).mean()
    # RSI
    delta = df['Close'].diff()
    up = delta.clip(lower=0)
    down = -1 * delta.clip(upper=0)
    roll_up = up.rolling(14).mean()
    roll_down = down.rolling(14).mean()
    rs = roll_up / roll_down
    df['RSI'] = 100.0 - (100.0 / (1.0 + rs))
    # Volatility (rolling std of returns)
    df['Volatility'] = df['Return'].rolling(window=21).std()
    # Fill the warm-up NaNs of the rolling windows; fillna(method=...) is deprecated in pandas 2.x
    df = df.bfill().ffill()
    return df

# apply
if 'Return' not in df.columns:
    df['Return'] = df['Close'].pct_change()

df = add_technical_indicators(df)
print(df[['Close','MA10','MA21','EMA10','EMA21','RSI','Volatility']].tail())
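
As a sanity check on the RSI logic above, here is a minimal sketch on synthetic data. The helper name `simple_rsi` and the toy series are assumptions for illustration. It mirrors the simple rolling-mean variant used in `add_technical_indicators`, not Wilder's smoothed RSI.

```python
import numpy as np
import pandas as pd

def simple_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI from simple rolling means of gains and losses (hypothetical helper)."""
    delta = close.diff()
    gains = delta.clip(lower=0)        # positive moves, zeros elsewhere
    losses = -delta.clip(upper=0)      # magnitude of negative moves
    avg_gain = gains.rolling(period).mean()
    avg_loss = losses.rolling(period).mean()
    rs = avg_gain / avg_loss           # infinite when there are no losses
    return 100.0 - 100.0 / (1.0 + rs)

# A strictly rising series has no losses, so RSI saturates at 100.
rsi_rising = simple_rsi(pd.Series(np.arange(1.0, 31.0)))

# A series that alternates +1 / -1 has equal average gain and loss, so RSI sits at 50.
rsi_flat = simple_rsi(pd.Series([10.0 + (i % 2) for i in range(40)]))

print(rsi_rising.dropna().unique(), rsi_flat.dropna().round(6).unique())
```

The first `period` values are NaN because the rolling window is not yet full, which is why the notebook fills or drops the warm-up rows.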


### Prepare supervised dataset for classical ML models
We will use lag features (previous n days' close) and the technical indicators to train Linear Regression and Random Forest.

In [None]:
# Create lag features
LAGS = 5

feat_df = df.copy()
for lag in range(1, LAGS+1):
    feat_df[f'lag_{lag}'] = feat_df['Close'].shift(lag)

# target: next-day Close
feat_df['target'] = feat_df['Close'].shift(-1)
feat_df = feat_df.dropna()

features = [f'lag_{i}' for i in range(1, LAGS+1)] + ['MA10','MA21','EMA10','EMA21','RSI','Volatility']
X = feat_df[features]
y = feat_df['target']

print('Feature columns:', X.columns.tolist())
print('Shapes', X.shape, y.shape)


In [None]:
# Time-series split (train/test) - keep chronological order
split_idx = int(len(X)*0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

print('Train size:', X_train.shape, 'Test size:', X_test.shape)
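
A single 80/20 split yields only one estimate of out-of-sample error. One common alternative is walk-forward (expanding-window) evaluation, sketched below with NumPy only. The function name `expanding_window_splits` and the fold sizing are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def expanding_window_splits(n_samples: int, n_splits: int = 3):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any remainder so every sample is used.
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield np.arange(0, train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(100, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on everything before its test block, so no future rows are visible during fitting.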


In [None]:
# Baseline: Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print('Linear Regression - MSE:', mse_lr, 'MAE:', mae_lr, 'R2:', r2_lr)

plt.figure(figsize=(12,5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_lr, label='LR Predicted')
plt.legend()
plt.show()


In [None]:
# Random Forest: quick hyperparameter search
rf = RandomForestRegressor(random_state=SEED)
params = {'n_estimators':[50,100], 'max_depth':[5,10,None]}

# Use small grid for speed; expand for serious tuning.
# TimeSeriesSplit keeps every validation fold after its training fold.
gcv = GridSearchCV(rf, params, cv=TimeSeriesSplit(n_splits=3), scoring='neg_mean_squared_error', n_jobs=-1)
gcv.fit(X_train, y_train)
print('Best params:', gcv.best_params_)

best_rf = gcv.best_estimator_
rf_pred = best_rf.predict(X_test)

mse_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
r2_rf = r2_score(y_test, rf_pred)

print('Random Forest - MSE:', mse_rf, 'MAE:', mae_rf, 'R2:', r2_rf)

plt.figure(figsize=(12,5))
plt.plot(y_test.values, label='Actual')
plt.plot(rf_pred, label='RF Predicted')
plt.legend()
plt.show()


## Prepare data for LSTM
We'll use the `Close` price scaled with MinMaxScaler and create sequences of length `WINDOW`.

In [None]:
# LSTM data preparation using Close price
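close_vals = df[['Close']].values
# Fit the scaler on the training portion only (first 80% of sequences plus
# the initial window) so the test-set price range does not leak into scaling.
n_fit = WINDOW + int((len(close_vals) - WINDOW) * 0.8)
scaler = MinMaxScaler()
scaler.fit(close_vals[:n_fit])
close_scaled = scaler.transform(close_vals)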

# create sequences
def create_sequences(data, window=60):
    Xs, ys = [], []
    for i in range(window, len(data)):
        Xs.append(data[i-window:i, 0])
        ys.append(data[i, 0])
    return np.array(Xs), np.array(ys)

X_seq, y_seq = create_sequences(close_scaled, WINDOW)

# split chronologically
split_idx_seq = int(len(X_seq)*0.8)
X_train_seq, X_test_seq = X_seq[:split_idx_seq], X_seq[split_idx_seq:]
y_train_seq, y_test_seq = y_seq[:split_idx_seq], y_seq[split_idx_seq:]

# reshape for LSTM
X_train_seq = X_train_seq.reshape((X_train_seq.shape[0], X_train_seq.shape[1], 1))
X_test_seq = X_test_seq.reshape((X_test_seq.shape[0], X_test_seq.shape[1], 1))

print('LSTM shapes', X_train_seq.shape, X_test_seq.shape)
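
To make the windowing concrete, the sketch below applies the same sliding-window idea to a ten-element toy array. `make_windows` and the toy data are assumptions for illustration; the logic follows `create_sequences` above.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int):
    """Each sample is `window` consecutive values; the label is the value that follows."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])
        y.append(series[i])
    return np.array(X), np.array(y)

toy = np.arange(10, dtype=float)           # 0.0, 1.0, ..., 9.0
X_toy, y_toy = make_windows(toy, window=3)

# Keras LSTM layers expect (samples, timesteps, features); here features = 1.
X_toy_lstm = X_toy.reshape((X_toy.shape[0], X_toy.shape[1], 1))

print(X_toy.shape, y_toy.shape, X_toy_lstm.shape)
print(X_toy[0], '->', y_toy[0])
```

A series of length `n` produces `n - window` samples, which is why the sequence arrays are shorter than the original price series.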


### Build and train LSTM model
We use a 2-layer LSTM with Dropout and EarlyStopping.

In [None]:
# Build LSTM
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(WINDOW,1)),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.summary()

# Callbacks
es = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
mc = ModelCheckpoint('model/lstm_best.h5', monitor='val_loss', save_best_only=True)

# Train (you can increase epochs). validation_split holds out the last 10%
# of the training sequences, so the test set is not used for early stopping.
history = model.fit(X_train_seq, y_train_seq, epochs=30, batch_size=32,
                    validation_split=0.1, callbacks=[es, mc])

# Plot loss
plt.figure(figsize=(8,4))
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.legend()
plt.title('LSTM training loss')
plt.show()


In [None]:
# LSTM predictions and inverse transform
lstm_pred_scaled = model.predict(X_test_seq)

# inverse transform
lstm_pred = scaler.inverse_transform(lstm_pred_scaled)
y_test_actual = scaler.inverse_transform(y_test_seq.reshape(-1,1))

mse_lstm = mean_squared_error(y_test_actual, lstm_pred)
mae_lstm = mean_absolute_error(y_test_actual, lstm_pred)
r2_lstm = r2_score(y_test_actual, lstm_pred)

print('LSTM - MSE:', mse_lstm, 'MAE:', mae_lstm, 'R2:', r2_lstm)

plt.figure(figsize=(12,5))
plt.plot(y_test_actual, label='Actual')
plt.plot(lstm_pred, label='LSTM Predicted')
plt.legend()
plt.show()


## Multi-step Forecast (next N days)
We will use the trained LSTM to forecast the next `FUTURE_DAYS` by iteratively feeding back predictions.

In [None]:
# Multi-step forecast using last WINDOW days

def multi_step_forecast(model, recent_window, scaler, future_days=7):
    # recent_window: array of shape (WINDOW,)
    preds = []
    current_seq = recent_window.copy()
    for _ in range(future_days):
        x = current_seq.reshape((1, current_seq.shape[0], 1))
        pred_scaled = model.predict(x, verbose=0)[0,0]  # verbose=0 avoids a progress bar per step
        preds.append(pred_scaled)
        # append and slide
        current_seq = np.append(current_seq[1:], pred_scaled)
    # inverse transform
    preds = np.array(preds).reshape(-1,1)
    preds_inv = scaler.inverse_transform(preds)
    return preds_inv.flatten()

# get recent window
recent_window = close_scaled[-WINDOW:].reshape(-1)
future_preds = multi_step_forecast(model, recent_window, scaler, FUTURE_DAYS)
print('Next', FUTURE_DAYS, 'day predictions (LSTM):')
print(future_preds)

# visualize
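# Use business days so forecasts do not land on weekends (market holidays are not excluded)
future_dates = pd.bdate_range(start=df.index[-1] + pd.Timedelta(days=1), periods=FUTURE_DAYS)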
plt.figure(figsize=(10,4))
plt.plot(df.index[-200:], df['Close'].tail(200), label='Recent Actual')
plt.plot(future_dates, future_preds, marker='o', linestyle='--', label='Forecast')
plt.legend()
plt.show()
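
The feedback loop can be seen in isolation by swapping the LSTM for a trivial stand-in predictor. In the sketch below, `recursive_forecast` and `last_plus_one` are hypothetical names, and the stand-in rule is chosen only so the output is easy to verify by hand.

```python
import numpy as np

def recursive_forecast(predict_next, recent_window, steps: int) -> np.ndarray:
    """Predict one step, append the prediction to the window, drop the oldest value, repeat."""
    window = np.asarray(recent_window, dtype=float).copy()
    preds = []
    for _ in range(steps):
        nxt = float(predict_next(window))
        preds.append(nxt)
        window = np.append(window[1:], nxt)  # slide the window forward by one step
    return np.array(preds)

def last_plus_one(window: np.ndarray) -> float:
    """Stand-in for a trained model: next value = last value + 1."""
    return window[-1] + 1.0

demo = recursive_forecast(last_plus_one, [1.0, 2.0, 3.0], steps=4)
print(demo)
```

Because every step after the first consumes earlier predictions rather than observed prices, errors tend to compound as the horizon grows, so longer forecasts should be read with more caution.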


## Compare models (summary)
We'll summarize metrics for Linear Regression, Random Forest, and LSTM.

In [None]:
# Aggregate metrics
results = pd.DataFrame({
    'Model': ['LinearRegression','RandomForest','LSTM'],
    'MSE': [mse_lr, mse_rf, mse_lstm],
    'MAE': [mae_lr, mae_rf, mae_lstm],
    'R2': [r2_lr, r2_rf, r2_lstm]
})
results
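
When reading the table, it may help to compare against a naive persistence forecast (tomorrow's close equals today's close), since next-day prices are strongly autocorrelated. The sketch below uses toy numbers; `persistence_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def persistence_metrics(close) -> dict:
    """Score the rule 'next close = current close' on a price series."""
    close = np.asarray(close, dtype=float)
    actual = close[1:]    # what actually happened on the next day
    naive = close[:-1]    # the naive prediction: carry the last close forward
    err = actual - naive
    return {'MSE': float(np.mean(err ** 2)), 'MAE': float(np.mean(np.abs(err)))}

toy_close = [100.0, 102.0, 101.0, 105.0, 107.0]
baseline = persistence_metrics(toy_close)
print(baseline)
```

A model whose test MSE and MAE are not clearly below this baseline on the same test period is probably doing little more than echoing the previous close.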


## Save models and artifacts
Save trained models, scalers, and example plots for the GitHub repo.

In [None]:
import joblib

# Save sklearn models
joblib.dump(lr, 'model/linear_regression.pkl')
joblib.dump(best_rf, 'model/random_forest.pkl')
# Save scaler for LSTM
joblib.dump(scaler, 'model/close_scaler.pkl')

print('Saved model artifacts to model/ (linear_regression.pkl, random_forest.pkl, close_scaler.pkl, lstm_best.h5)')


## Project structure & Git commands

Add these files to your GitHub repo and push:

```
stock-price-prediction-advanced/
├── data/                 # CSVs (do NOT push large raw data if you have many files)
├── model/                # saved models
├── notebooks/            # this notebook
├── src/                  # helper scripts (download, preprocess, train, predict)
├── streamlit_app/        # optional: streamlit app for demo
├── requirements.txt
└── README.md
```

Example git commands:

```bash
git init
git add .
git commit -m "Initial commit: advanced stock price prediction notebook"
git branch -M main
git remote add origin https://github.com/<your-username>/stock-price-prediction-advanced.git
git push -u origin main
```

Add screenshots & brief summary to README.md, and include instructions to run the notebook and reproduce results.

## Final tips to make it resume-ready
- Add a concise README with project goal, tech stack, how to run, and results.
- Include screenshots of plots and the Streamlit app (if you add one).
- Add a small `requirements.txt` and instructions to create a `venv`.
- Provide a short demo video/GIF in the repo if possible.

---

You're ready to run this notebook locally. Good luck!