# Technology Stocks Signal Forecasting with Machine Learning and Deep Learning

This notebook walks through a full mini-pipeline for technology equities:

* Download daily prices with **yfinance**.
* Engineer features to predict **next-day returns**.
* Train multiple models (logistic regression, random forest, and a small **LSTM**).
* Generate **buy/sell signals** using predicted probabilities.
* Run a simplified **options-style strategy** backtest with transaction costs and slippage.

The notebook is heavily commented for students with a background similar to **STAT 453 (deep learning / generative models)**.


## Setup and Imports

If you are running this outside the prepared environment, uncomment the installation cell below.


In [None]:
# !pip install yfinance tensorflow scikit-learn matplotlib seaborn


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import yfinance as yf

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.pipeline import Pipeline
import tensorflow as tf
from tensorflow.keras import layers, models

plt.style.use('seaborn-v0_8')
pd.options.display.float_format = '{:.4f}'.format


## Configuration
Feel free to tweak the ticker list, date range, and modeling hyperparameters.

In [None]:
# Universe of large-cap tech names
TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "META", "NVDA", "TSLA", "AMD", "AVGO", "CRM"]

START_DATE = "2016-01-01"
END_DATE = None  # None fetches up to the most recent available date

# Modeling parameters
LOOKBACK_DAYS = 10  # sequence length for the LSTM
VAL_SIZE = 0.2
TEST_SIZE = 0.15
RANDOM_STATE = 42

# Trading parameters
TRANSACTION_COST_BPS = 10  # in basis points; 1 bp = 0.01%
SLIPPAGE_BPS = 5  # 0.05% per trade
OPTION_PREMIUM = 0.01  # 1% premium cost for the simplified options-style bet
CALL_DELTA = 0.6  # sensitivity of payoff to underlying price moves
PUT_DELTA = -0.5
PROB_BUY_THRESHOLD = 0.55
PROB_SELL_THRESHOLD = 0.45
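
As a quick check on the units above, the sketch below converts basis points into a fractional cost per opened position. The helper name `bps_to_fraction` is hypothetical and not part of the notebook; the two constants are copied from the configuration cell.

```python
# Illustrative only: `bps_to_fraction` is a hypothetical helper, not part of the notebook.
TRANSACTION_COST_BPS = 10  # copied from the configuration cell
SLIPPAGE_BPS = 5           # copied from the configuration cell


def bps_to_fraction(bps):
    """Convert basis points to a fractional return (1 bp = 0.0001)."""
    return bps / 10_000


# Combined cost charged each time a position is opened
cost_per_trade = bps_to_fraction(TRANSACTION_COST_BPS + SLIPPAGE_BPS)
print(f"Cost per opened position: {cost_per_trade:.4%}")
```

With the default settings, each opened position gives up 15 bps before any option premium is considered.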


## Data Download
We use `yfinance.download` to pull daily OHLCV data for each ticker. The data are stored in a tidy DataFrame with a ticker column.

In [None]:
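def fetch_data(tickers, start, end=None):
    '''Download daily OHLCV (plus adjusted close) per ticker and stack into one long DataFrame.'''
    frames = []
    for t in tickers:
        # auto_adjust=False keeps the separate 'Adj Close' column used downstream
        df = yf.download(t, start=start, end=end, progress=False, auto_adjust=False)
        # Some yfinance versions return (field, ticker) MultiIndex columns; keep the field level
        if isinstance(df.columns, pd.MultiIndex):
            df.columns = df.columns.get_level_values(0)
        df = df.rename(columns=str.lower)
        df['ticker'] = t
        frames.append(df)
    # The index is named 'Date', so reset_index() produces a 'Date' column
    data = pd.concat(frames).reset_index().rename(columns={'Date': 'date', 'index': 'date'})
    data = data.sort_values(['date', 'ticker']).reset_index(drop=True)
    return data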

prices = fetch_data(TICKERS, START_DATE, END_DATE)
print(prices.head())


## Feature Engineering
We build simple yet intuitive features:

* Daily returns and next-day returns (target).
* Rolling mean/volatility of returns.
* Momentum proxies (rolling max/min, RSI-like feature).
* Volume z-score.

Each ticker is processed independently and then concatenated.

In [None]:
def engineer_features(df):
    feats = []
    for t, g in df.groupby('ticker'):
        g = g.copy()
        g['return'] = g['adj close'].pct_change()
        g['next_return'] = g['return'].shift(-1)
        g['target'] = (g['next_return'] > 0).astype(int)
        g['ret_ma_5'] = g['return'].rolling(5).mean()
        g['ret_ma_10'] = g['return'].rolling(10).mean()
        g['ret_std_10'] = g['return'].rolling(10).std()
        g['high_roll_10'] = g['high'].rolling(10).max() / g['close'] - 1
        g['low_roll_10'] = g['low'].rolling(10).min() / g['close'] - 1
        g['volume_z'] = (g['volume'] - g['volume'].rolling(20).mean()) / g['volume'].rolling(20).std()
        # Simple RSI-style oscillator
        up = g['close'].diff().clip(lower=0).rolling(14).mean()
        down = -g['close'].diff().clip(upper=0).rolling(14).mean()
        rs = up / (down + 1e-9)
        g['rsi'] = 100 - 100 / (1 + rs)
        g['rsi'] = (g['rsi'] - 50) / 50  # center around zero
        g = g.dropna()
        feats.append(g)
    return pd.concat(feats).reset_index(drop=True)

features = engineer_features(prices)
print(features[['date', 'ticker', 'return', 'target']].head())
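
To make the target construction concrete, here is a minimal sketch on a made-up four-day price series for a single hypothetical ticker. It mirrors the `return`, `next_return`, and `target` columns from `engineer_features`, and shows which rows `dropna()` removes.

```python
import pandas as pd

# Toy adjusted-close series for a single hypothetical ticker
toy = pd.DataFrame({"adj close": [100.0, 102.0, 101.0, 103.0]})
toy["return"] = toy["adj close"].pct_change()   # today's return; first row is NaN
toy["next_return"] = toy["return"].shift(-1)    # tomorrow's return; last row is NaN
toy["target"] = (toy["next_return"] > 0).astype(int)

# Rows containing NaN (the first and last here) are what dropna() removes
clean = toy.dropna()
print(clean)
```

The last row gets `target = 0` only because `NaN > 0` evaluates to `False`; it carries no real label, which is why dropping it matters.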


## Train / Validation / Test Split
We keep the chronological order to avoid look-ahead bias. The most recent `TEST_SIZE` fraction of rows is held out for testing, and the last `VAL_SIZE` fraction of the remaining rows is used for validation.

In [None]:
def time_based_split(df, test_size=0.15, val_size=0.2):
    df = df.sort_values('date')
    n = len(df)
    test_cut = int(n * (1 - test_size))
    val_cut = int(test_cut * (1 - val_size))
    train = df.iloc[:val_cut]
    val = df.iloc[val_cut:test_cut]
    test = df.iloc[test_cut:]
    return train, val, test

train_df, val_df, test_df = time_based_split(features, TEST_SIZE, VAL_SIZE)
print(len(train_df), len(val_df), len(test_df))
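
Because the split above cuts on row positions, a single trading day can end up partly in one set and partly in the next. One possible alternative, sketched here under the assumption that this matters for your use case, is to cut on unique dates instead. The function name `date_based_split` and the synthetic tickers are hypothetical.

```python
import pandas as pd


def date_based_split(df, test_size=0.15, val_size=0.2):
    """Hypothetical variant of time_based_split that cuts on unique dates."""
    dates = pd.Series(df["date"].unique()).sort_values().reset_index(drop=True)
    test_cut = int(len(dates) * (1 - test_size))
    val_cut = int(test_cut * (1 - val_size))
    val_start, test_start = dates.iloc[val_cut], dates.iloc[test_cut]
    train = df[df["date"] < val_start]
    val = df[(df["date"] >= val_start) & (df["date"] < test_start)]
    test = df[df["date"] >= test_start]
    return train, val, test


# Synthetic panel: 3 made-up tickers over 20 business days
days = pd.bdate_range("2024-01-01", periods=20)
panel = pd.DataFrame([(d, t) for d in days for t in ["AAA", "BBB", "CCC"]],
                     columns=["date", "ticker"])
train, val, test = date_based_split(panel)
print(len(train), len(val), len(test))
```

Every date now belongs to exactly one split, at the cost of split sizes that only approximate the requested fractions.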


## Classical Machine Learning Models
We treat each day-ticker observation as a tabular feature vector. Ticker is one-hot encoded to allow models to learn per-name effects.

In [None]:
FEATURE_COLS = ['ret_ma_5', 'ret_ma_10', 'ret_std_10', 'high_roll_10', 'low_roll_10', 'volume_z', 'rsi', 'return']

# One-hot encode ticker
train_X = pd.get_dummies(train_df[['ticker'] + FEATURE_COLS], columns=['ticker'])
val_X = pd.get_dummies(val_df[['ticker'] + FEATURE_COLS], columns=['ticker'])
test_X = pd.get_dummies(test_df[['ticker'] + FEATURE_COLS], columns=['ticker'])

# Align dummy columns
train_X, val_X = train_X.align(val_X, join='left', axis=1, fill_value=0)
train_X, test_X = train_X.align(test_X, join='left', axis=1, fill_value=0)

target_train = train_df['target']
target_val = val_df['target']
target_test = test_df['target']

models = {}

log_reg = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500, class_weight='balanced'))
])
log_reg.fit(train_X, target_train)
models['Logistic Regression'] = log_reg

rf_clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,
    min_samples_leaf=5,
    random_state=RANDOM_STATE,
    n_jobs=-1
)
rf_clf.fit(train_X, target_train)
models['Random Forest'] = rf_clf
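
The dummy-column alignment step is easy to misread, so here is a small sketch of what `align(..., join='left')` does when the splits do not contain the same tickers. The frames and ticker symbols are made up for illustration.

```python
import pandas as pd

# Toy frames: the training split sees two made-up tickers, the test split sees a different pair
train_raw = pd.DataFrame({"ticker": ["AAA", "BBB"], "rsi": [0.1, -0.2]})
test_raw = pd.DataFrame({"ticker": ["AAA", "CCC"], "rsi": [0.3, 0.0]})

train_X = pd.get_dummies(train_raw, columns=["ticker"])
test_X = pd.get_dummies(test_raw, columns=["ticker"])

# join='left' keeps exactly the training columns: dummies missing from the test
# split are filled with 0, and dummies that exist only in the test split are dropped
train_X, test_X = train_X.align(test_X, join="left", axis=1, fill_value=0)
print(list(test_X.columns))
```

This keeps the feature matrix shape identical across splits, which is what the fitted scikit-learn models expect.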


## Sequence Model (LSTM)
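We build sequences of length `LOOKBACK_DAYS` for each ticker, so the model can learn short-term temporal patterns. Each sequence covers `LOOKBACK_DAYS` consecutive rows and is labeled with the `target` of the row immediately after the window. Features are standardized per ticker so that differences in scale across names (for example, more volatile stocks) do not dominate. Note that the statistics are computed separately within each split, so validation and test sequences are scaled with their own means and standard deviations.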

In [None]:
def build_sequences(df, lookback=10):
    sequences = []
    labels = []
    for t, g in df.groupby('ticker'):
        g = g.sort_values('date')
        feat_mat = g[FEATURE_COLS].values
        # Standardize per ticker
        mean = feat_mat.mean(axis=0, keepdims=True)
        std = feat_mat.std(axis=0, keepdims=True) + 1e-9
        feat_mat = (feat_mat - mean) / std
        for i in range(len(g) - lookback):
            sequences.append(feat_mat[i:i+lookback])
            labels.append(g['target'].iloc[i+lookback])
    return np.array(sequences, dtype=np.float32), np.array(labels, dtype=np.int32)

train_seq_X, train_seq_y = build_sequences(train_df, LOOKBACK_DAYS)
val_seq_X, val_seq_y = build_sequences(val_df, LOOKBACK_DAYS)
test_seq_X, test_seq_y = build_sequences(test_df, LOOKBACK_DAYS)

print("LSTM train shape:", train_seq_X.shape)

# Use tf.keras.Sequential: the name `models` now refers to the dict of fitted models above
lstm_model = tf.keras.Sequential([
    layers.Input(shape=(LOOKBACK_DAYS, len(FEATURE_COLS))),
    layers.LSTM(32, return_sequences=False),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

lstm_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
]

history = lstm_model.fit(
    train_seq_X, train_seq_y,
    validation_data=(val_seq_X, val_seq_y),
    epochs=25,
    batch_size=64,
    verbose=1,
    callbacks=callbacks
)
models['LSTM'] = lstm_model
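
The windowing in `build_sequences` can be checked on a tiny array. This sketch uses made-up features and targets and reproduces only the indexing logic (no standardization and no model), so the resulting shape and label rows are easy to inspect.

```python
import numpy as np

lookback = 3
n_rows, n_feats = 6, 2
# Made-up feature matrix and binary targets for one hypothetical ticker
feat_mat = np.arange(n_rows * n_feats, dtype=np.float32).reshape(n_rows, n_feats)
target = np.array([0, 1, 0, 1, 1, 0])

windows, labels, label_rows = [], [], []
for i in range(n_rows - lookback):
    windows.append(feat_mat[i:i + lookback])  # rows i .. i+lookback-1
    labels.append(target[i + lookback])       # target of the row right after the window
    label_rows.append(i + lookback)

X = np.stack(windows)
print(X.shape, label_rows)
```

The first `lookback` rows never supply a label, which is why the LSTM produces fewer predictions per ticker than the tabular models.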


## Evaluation Helpers

In [None]:
def evaluate_classifier(model, X, y_true, name):
    # Handle keras vs scikit-learn predict_proba interface
    if hasattr(model, 'predict_proba'):
        prob = model.predict_proba(X)[:, 1]
    else:
        prob = model.predict(X).ravel()
    pred = (prob >= 0.5).astype(int)
    acc = accuracy_score(y_true, pred)
    auc = roc_auc_score(y_true, prob)
    print(f"\n{name} Accuracy: {acc:.3f}, ROC-AUC: {auc:.3f}")
    print(classification_report(y_true, pred, digits=3))
    return prob

probs_test = {}
for name, model in models.items():
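    if name == 'LSTM':
        # The LSTM consumes sequences, so evaluate it on the sequence arrays and labels
        probs_test[name] = evaluate_classifier(model, test_seq_X, test_seq_y, name)
    else:
        probs_test[name] = evaluate_classifier(model, test_X, target_test, name)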


## Feature Importance
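We visualize coefficient magnitudes for logistic regression and mean decrease in impurity for the random forest. Because the logistic regression sits behind a `StandardScaler`, its coefficients are on standardized features and can be compared in magnitude.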

In [None]:
def plot_feature_importance(model, feature_names, title):
    plt.figure(figsize=(8,4))
    if hasattr(model, 'coef_'):
        imp = np.abs(model.coef_[0])
    else:
        imp = model.feature_importances_
    idx = np.argsort(imp)[::-1][:15]
    plt.barh(np.array(feature_names)[idx][::-1], imp[idx][::-1])
    plt.title(title)
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()

plot_feature_importance(log_reg['clf'], train_X.columns, 'Logistic Regression | |coef|')
plot_feature_importance(rf_clf, train_X.columns, 'Random Forest Feature Importance')


## Signal Generation
We convert predicted probabilities into discrete trading signals.

* **Buy / call-style bet:** probability > `PROB_BUY_THRESHOLD`.
* **Sell / put-style bet:** probability < `PROB_SELL_THRESHOLD`.
* Otherwise: stay flat.
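
The three rules above can be sketched in vectorized form as follows. The probabilities are made up, and the thresholds are copied from the configuration cell; note that a probability exactly equal to a threshold stays flat because the comparisons are strict.

```python
import numpy as np

PROB_BUY_THRESHOLD = 0.55   # copied from the configuration cell
PROB_SELL_THRESHOLD = 0.45  # copied from the configuration cell

probs = np.array([0.60, 0.50, 0.40, 0.55])  # made-up predicted probabilities
signal = np.where(probs > PROB_BUY_THRESHOLD, 1,
                  np.where(probs < PROB_SELL_THRESHOLD, -1, 0))
print(signal)
```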


In [None]:
def generate_signals(df, probs, model_name):
    signals = df[['date', 'ticker', 'next_return']].copy()
    signals['prob_pos'] = probs
    signals['signal'] = 0
    signals.loc[signals['prob_pos'] > PROB_BUY_THRESHOLD, 'signal'] = 1
    signals.loc[signals['prob_pos'] < PROB_SELL_THRESHOLD, 'signal'] = -1
    signals['model'] = model_name
    return signals

# Align test probabilities to dataframe rows for classical models
signals_all = []
for name in ['Logistic Regression', 'Random Forest']:
    sig = generate_signals(test_df.reset_index(drop=True), probs_test[name], name)
    signals_all.append(sig)

# LSTM sequences need to be realigned because sequence building drops the first LOOKBACK_DAYS rows per ticker.
# Collect the original index label of the row that supplies each sequence's label,
# iterating in the same ticker/date order as build_sequences.
seq_indices = []
for t, g in test_df.groupby('ticker'):
    g = g.sort_values('date')  # keep original index labels so test_df.loc picks the right rows
    for i in range(len(g) - LOOKBACK_DAYS):
        seq_indices.append(g.index[i + LOOKBACK_DAYS])

lstm_probs = pd.Series(probs_test['LSTM'], index=seq_indices)
lstm_df = test_df.loc[seq_indices].reset_index(drop=True)
signals_all.append(generate_signals(lstm_df, lstm_probs.values, 'LSTM'))

signals_df = pd.concat(signals_all).reset_index(drop=True)
signals_df.head()


## Simplified Options-Style Strategy & Backtest
We approximate option payoffs using delta times the underlying return minus a fixed premium:

* **Call-like (signal = +1):** payoff = max(CALL_DELTA * return - premium, -premium)
* **Put-like (signal = -1):** payoff = max(PUT_DELTA * return - premium, -premium)

Transaction costs and slippage reduce returns each time a position is opened. We compound P&L equally weighted across tickers each day.
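
A few worked numbers may help with the payoff formulas above. The sketch below uses a hypothetical helper, `toy_payoff`, with constants copied from the configuration cell, and also computes how far the underlying has to move before a call-like bet covers its premium and costs.

```python
CALL_DELTA, PUT_DELTA = 0.6, -0.5  # copied from the configuration cell
OPTION_PREMIUM = 0.01              # copied from the configuration cell
COST = (10 + 5) / 10_000           # transaction cost + slippage, as a fraction


def toy_payoff(ret, signal):
    """Delta-times-return payoff, floored at the premium paid (hypothetical helper)."""
    if signal == 0:
        return 0.0
    delta = CALL_DELTA if signal == 1 else PUT_DELTA
    return max(delta * ret - OPTION_PREMIUM, -OPTION_PREMIUM)


# Made-up (return, signal) scenarios: winning call, losing call, winning put
for ret, sig in [(0.03, 1), (-0.02, 1), (-0.04, -1)]:
    gross = toy_payoff(ret, sig)
    print(f"return={ret:+.2%} signal={sig:+d} gross={gross:+.4f} net={gross - COST:+.4f}")

# Underlying move needed for a call-like bet to cover premium and costs
call_breakeven = (OPTION_PREMIUM + COST) / CALL_DELTA
print(f"call breakeven move: {call_breakeven:.2%}")
```

Under these assumptions a call-like bet needs an up-move of roughly 1.9% in a single day just to break even, which is worth keeping in mind when reading the backtest results.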

In [None]:
def option_payoff(ret, signal):
    if signal == 1:
        raw = CALL_DELTA * ret - OPTION_PREMIUM
    elif signal == -1:
        raw = PUT_DELTA * ret - OPTION_PREMIUM
    else:
        return 0.0
    return max(raw, -OPTION_PREMIUM)  # limited downside to premium


def backtest(signals, cost_bps=10, slippage_bps=5):
    df = signals.copy()
    df['day'] = pd.to_datetime(df['date'])
    daily_results = []
    for (day, model), g in df.groupby(['day', 'model']):
        pnl = 0
        n_trades = (g['signal'] != 0).sum()
        for _, row in g.iterrows():
            pnl += option_payoff(row['next_return'], row['signal'])
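        # Apply linear costs for each opened position
        total_bps = (cost_bps + slippage_bps) * n_trades
        pnl -= total_bps / 10000
        # Equal-weight across the tickers available that day
        pnl /= max(len(g), 1)
        daily_results.append({'date': day, 'model': model, 'pnl': pnl})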
    res = pd.DataFrame(daily_results).sort_values('date')
    res['equity'] = (1 + res['pnl']).groupby(res['model']).cumprod()
    return res

bt_results = backtest(signals_df, TRANSACTION_COST_BPS, SLIPPAGE_BPS)
bt_results.head()


## Performance Comparison
We visualize equity curves and summarize key metrics like cumulative return and hit rate (fraction of profitable days).

In [None]:
def summarize_backtest(bt_df):
    summaries = []
    for model, g in bt_df.groupby('model'):
        cum_return = g['equity'].iloc[-1] - 1
        hit_rate = (g['pnl'] > 0).mean()
        vol = g['pnl'].std() * np.sqrt(252)
        sharpe = (g['pnl'].mean() * 252) / (vol + 1e-9)
        summaries.append({
            'Model': model,
            'Cumulative Return': cum_return,
            'Hit Rate': hit_rate,
            'Sharpe (naive)': sharpe
        })
    return pd.DataFrame(summaries)

summary_df = summarize_backtest(bt_results)
print(summary_df)

plt.figure(figsize=(10,5))
for model, g in bt_results.groupby('model'):
    plt.plot(g['date'], g['equity'], label=model)
plt.legend()
plt.title('Equity Curves')
plt.ylabel('Equity (start=1)')
plt.xlabel('Date')
plt.tight_layout()
plt.show()
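
To see how the summary metrics relate to the daily P&L, here is a small sketch on a made-up five-day P&L series. It follows the same formulas as `summarize_backtest`, using NumPy instead of pandas.

```python
import numpy as np

pnl = np.array([0.010, -0.005, 0.020, 0.000, -0.010])  # made-up daily P&L
equity = np.cumprod(1 + pnl)           # compounding, as in the backtest
cum_return = equity[-1] - 1
hit_rate = (pnl > 0).mean()            # a flat day does not count as a hit
vol = pnl.std(ddof=1) * np.sqrt(252)   # ddof=1 matches the pandas default
sharpe = pnl.mean() * 252 / (vol + 1e-9)
print(f"cum_return={cum_return:.4f} hit_rate={hit_rate:.2f} sharpe={sharpe:.2f}")
```

With so few observations the annualized Sharpe is not meaningful; the point is only to show how each number is derived.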


## Next Steps
* Tune thresholds and premium assumptions.
* Experiment with richer features (macro signals, volatility indices).
* Try other sequence architectures (1D CNNs, Transformers) or calibration methods for probabilities.
