# 📘 Project Overview: Reinforcement Learning-Based Trading Agent

This project explores whether a reinforcement learning (RL) agent can outperform traditional technical analysis strategies in financial trading. We focus on minute-level data from **Apple**, **Reliance**, and the **S&P 500 Index** over the past 20 years.

Our approach involves:
1. **Hardcoded Rule-Based Strategies** using technical indicators.
2. **Training an RL Agent** (DQN + LSTM and PPO) on the same data.
3. **Comparative Evaluation** of both approaches on profitability and generalization.

---

## 🧭 Step-by-Step Procedure

### 1. **Data Loading and Preprocessing**
- Load minute-by-minute stock data for:
  - Apple,
  - Reliance,
  - S&P 500 Index.
- Compute and append technical indicators to the data:
  - **Trend Indicators**: 50-day EMA, 200-day EMA
  - **Momentum Indicators**: MACD, Stochastics
  - **Mean Reversion**: Bollinger Bands
  - **Volume Indicators**: On-Balance Volume (OBV)

### 2. **Implement and Backtest Rule-Based Strategies**
Apply hardcoded trading strategies across the Apple dataset:
- **Single Indicator Logic** (e.g., EMA crossover, MACD cross).
- **Two/Three/Four Indicator Combinations**:
  - Trend + Momentum
  - Trend + Mean Reversion
  - Trend + Momentum + Volume
  - Trend + Mean Reversion + Momentum + Volume  
- Compute the **returns**, **Sharpe Ratio**, **Win Rate**, and **Drawdowns** for each strategy over the entire dataset.

### 3. **Build the RL Agent (DQN + LSTM)**
- Design a **custom OpenAI Gym-style trading environment**.
- Use:
  - **State**: Feature vector of technical indicators.
  - **Actions**: Buy, Hold, Sell.
  - **Rewards**: Based on portfolio performance.
- Train a **DQN agent with LSTM** to learn temporal patterns.
- Investigate if this is a **Double DQN** (check Q-target update logic).
- Perform **hyperparameter tuning** (e.g., learning rate, epsilon decay).

### 4. **Evaluate RL Agent on Unseen Test Data**
- Compare RL agent's performance on Apple **test data** with:
  - Previously tested hardcoded strategies.
  - Using metrics: total profit, Sharpe Ratio, max drawdown, win/loss rate.

### 5. **Try a Second RL Algorithm (PPO)**
- Train a **PPO agent** on the same data and environment.
- Compare its performance with DQN + LSTM.

### 6. **Test Generalization to Other Stocks**
- Run **both trained RL agents** on unseen data from:
  - **S&P 500 Index**
  - **Reliance**
- Evaluate generalization:
  - If agents perform well → strong generalization.
  - If performance drops → train **stock-specific agents**.

### 7. **Compare with Rule-Based Methods**
- For both S&P and Reliance:
  - Compare RL agent performance with hardcoded strategies over the same period.
  - Determine if RL models outperform static indicator-based systems.

### 📥 Step 1: Importing Libraries and Loading Apple Minute-Level Data

We begin by importing the core libraries required for this project:

- **Data Handling**: `numpy`, `pandas`
- **Reinforcement Learning**: `gym`, `torch`, `torch.nn`, `torch.optim`
- **Preprocessing & Utilities**: `MinMaxScaler`, `random`, `collections`, `itertools`, `numba` (for JIT speedups)

Next, we load the **Apple stock price dataset**, which contains **minute-by-minute price data** over several years. Initial steps include:

- Reading the CSV file `dataset.csv` into a DataFrame.
- Displaying the first few rows and checking the shape and schema of the data.
- Counting how many rows contain **missing (NaN) values** — important for preprocessing and data cleaning.

This dataset forms the **foundation for our technical indicator calculations and RL training** in subsequent steps.

In [13]:
import numpy as np
import pandas as pd
import gym
from gym import spaces
import torch
import torch.nn as nn
import torch.optim as optim
import random
import collections
import itertools
from numba import njit
from sklearn.preprocessing import MinMaxScaler
from torch.distributions import Categorical

In [2]:
# Load the dataset
file_path = 'dataset.csv'  # The file is in the same folder as the notebook
data = pd.read_csv(file_path, low_memory=False)

In [7]:
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0


In [4]:
data.shape

(8689184, 7)

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8689184 entries, 0 to 8689183
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Date       object 
 1   Open       float64
 2   High       float64
 3   Low        float64
 4   Close      float64
 5   Adj Close  float64
 6   Volume     float64
dtypes: float64(6), object(1)
memory usage: 464.1+ MB


In [3]:
# Count rows with at least one NaN value
rows_with_nan = data.isnull().any(axis=1).sum()

print(f"Number of rows with at least one NaN value: {rows_with_nan}")


Number of rows with at least one NaN value: 0


### Want to check ROI if we just buy and hold apple stock for the 20 years.

In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = data.iloc[0]['Close']
final_price = data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 6366.75%


### 📊 Step 2: Implementing a Trend Indicator-Based Strategy (TI Method)

In this step, we define and backtest a simple but widely used **trend-following trading strategy** based on **Exponential Moving Averages (EMAs)**.

#### ✅ Strategy Logic (TI Method)
- **Buy Signal (1)**: When the short-term trend (50-minute EMA) crosses above the long-term trend (200-minute EMA), indicating bullish momentum.
- **Sell Signal (-1)**: When the 50-minute EMA crosses below the 200-minute EMA, indicating a downtrend.
- **Hold Signal (0)**: When both EMAs are equal — no trade signal.

This is commonly referred to as a **Golden Cross / Death Cross strategy**.

---

### ℹ️ What is an EMA?

The **Exponential Moving Average (EMA)** is a type of moving average that gives more weight to recent prices, making it more responsive to new information than a simple moving average (SMA).  
- A **shorter EMA (e.g., 50-period)** reacts quickly to recent price changes and captures short-term trends.
- A **longer EMA (e.g., 200-period)** smooths out price movements and represents the broader market direction.

Crossing EMAs are widely used to signal potential entry/exit points in trend-following strategies.

---

### ⚙️ Technical Steps

1. **Compute Indicators**:
   - Calculate the 50-minute and 200-minute EMAs using the closing price (`Close`).
   - Store results in a new DataFrame `TI`.

2. **Preprocessing**:
   - Convert the `Date` column to datetime format.
   - Check for null values after computing EMAs.

3. **Backtesting with Stop-Loss Handling**:
   - Use a **Numba-accelerated function** `numba_loop()` to simulate trades efficiently.
   - For every minute:
     - If a **buy** signal is triggered, invest a fixed amount (e.g., $100).
     - If a **sell** signal or a **stop-loss** is triggered, liquidate holdings and calculate profit/loss.
     - At the end, if shares remain, they are liquidated at the final price.
   - Fees (0.25% per buy/sell) are deducted during trading.

---

### 🛡️ What is a Stop Loss?

A **stop loss** is a risk management technique that forces a sell when the price drops below a certain threshold relative to the purchase price.  
In this strategy:
- The stop loss is defined as a **percentage drop** from the average cost.
- For example, a 3% stop loss sells the asset automatically if the price drops 3% below the entry point.
- This helps prevent large losses during market reversals or false signals.

---

### 📈 Evaluation with Multiple Stop-Loss Thresholds

- The strategy is tested under different **stop-loss percentages** (from 0.03% up to 20%, including “no stop loss”).
- For each configuration:
  - Compute **Total Investment**, **Total Profit**, and **Return on Investment (ROI %)**.
  - Store all results in a DataFrame for easy comparison.

---

### 📊 Signal Distribution Analysis

- Count the total number of **Buy**, **Sell**, and **Hold** signals generated by this strategy.
- Helps understand how active or conservative the signal generation is.

---

### 📌 Summary

This block implements a **baseline trading strategy** using only **trend-based signals** (EMA crossovers). It evaluates how well this method performs under various **risk-management settings** using stop-losses.  
The results from this will later be used to compare with **reinforcement learning agents** to assess learning-based improvements.

In [6]:
# Copy the original DataFrame to a new DataFrame named TI
TI = data.copy()

# Convert 'Date' column to datetime
TI['Date'] = pd.to_datetime(TI['Date'])

In [7]:
# Calculate 50-minute EMA (shorter-term trend)
TI['EMA_50'] = TI['Close'].ewm(span=50, adjust=False).mean()

# Calculate 200-minute EMA (longer-term trend)
TI['EMA_200'] = TI['Close'].ewm(span=200, adjust=False).mean()

In [14]:
# Count rows with any null values
num_null_rows = TI.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 0


In [26]:
TI.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0,2.669643,2.669643
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0,2.669643,2.669643
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0,2.669644,2.669643
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0,2.669644,2.669643
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0,2.669645,2.669643


In [132]:
@njit
def numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee):
    """
    Generic Numba-accelerated loop for any trading strategy.
    """
    n = len(close)
    profit_loss = np.zeros(n)
    shares = 0.0
    cumulative_investment = 0.0

    for i in range(n):
        price = close[i]

        # Buy Signal
        if signals[i] == 1:  # Buy signal
            amount_to_invest = 100
            effective_investment = amount_to_invest * (1 - buy_fee)
            shares += effective_investment / price
            cumulative_investment += amount_to_invest

        # Stop Loss Triggered
        stop_loss_triggered = price <= (cumulative_investment / shares) * (1 - stop_loss_pct) if shares > 0 else False

        # Sell Signal or Stop Loss
        if (signals[i] == -1 or stop_loss_triggered) and shares > 0:  # Sell signal
            sell_value = shares * price
            sell_value_after_fee = sell_value * (1 - sell_fee)
            profit_loss[i] = sell_value_after_fee - cumulative_investment
            shares = 0
            cumulative_investment = 0

    # Final Liquidation
    if shares > 0:
        final_price = close[-1]
        final_sell_value = shares * final_price
        final_sell_value_after_fee = final_sell_value * (1 - sell_fee)
        profit_loss[-1] += final_sell_value_after_fee - cumulative_investment

    return profit_loss


In [133]:
def optimized_strategy_with_numba(data, stop_loss_pct):
    """
    Optimized trading strategy with Numba acceleration for the loop.
    """
    # Prepare data
    data['Signal'] = np.where(data['EMA_50'] > data['EMA_200'], 1,
                              np.where(data['EMA_50'] < data['EMA_200'], -1, 0))  # 1 = Buy, -1 = Sell, 0 = Hold
    close = data['Close'].values
    signals = data['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    data['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit

In [40]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the strategy function for the current stop loss value
    investment, profit = optimized_strategy_with_numba(TI, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_TI = pd.DataFrame(results)


In [41]:
results_TI

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,473166600,7149635.0,1.511019
1,0.03%,473166600,-2362876.0,-0.499375
2,1.00%,473166600,7149647.0,1.511021
3,3.00%,473166600,7149635.0,1.511019
4,5.00%,473166600,7149635.0,1.511019
5,10.00%,473166600,7149635.0,1.511019
6,15.00%,473166600,7149635.0,1.511019
7,20.00%,473166600,7149635.0,1.511019


In [84]:
# Count the number of each signal type in the 'Signal' column
signal_counts = TI['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 4731666
Sell Signals (-1): 3957517
Hold Signals (0): 1


### 🔁 Step 3: Implementing a Mean Reversion Strategy (MR Method)

This section defines and evaluates a **mean reversion trading strategy** using **Bollinger Bands**.

#### 📐 Strategy Logic (MR Method)
- **Buy Signal (1)**: Triggered when the price drops **below the lower Bollinger Band**, suggesting the asset is **oversold** and likely to revert upward.
- **Sell Signal (-1)**: Triggered when the price exceeds the **upper Bollinger Band**, indicating the asset is **overbought** and may revert downward.
- **Hold Signal (0)**: When the price stays within the bands — no trade signal.

---

### 📊 What are Bollinger Bands?

**Bollinger Bands** are a technical analysis tool used to identify periods of high and low price volatility, as well as overbought or oversold conditions. Each band is defined as:

- **Middle Band**: 20-period Simple Moving Average (SMA)
- **Upper Band**: SMA + 2 × standard deviation
- **Lower Band**: SMA − 2 × standard deviation

When price moves **outside the bands**, it signals potential **mean reversion** — a tendency for the price to return to the average over time.

---

### ⚙️ Technical Steps

1. **Indicator Calculation**:
   - Compute the **20-period SMA** and **20-period standard deviation** of the close price.
   - Construct:
     - **Upper Band** = SMA + 2×STD
     - **Lower Band** = SMA - 2×STD

2. **Signal Generation & Execution**:
   - Assign buy/sell/hold signals based on price relation to Bollinger Bands.
   - Reuse the previously defined **`numba_loop`** for efficient profit/loss computation.

3. **Backtesting with Varying Stop Losses**:
   - Test the strategy with multiple stop-loss thresholds (1% to 20%).
   - Compute total investment, total profit, and ROI for each configuration.
   - Results are stored and displayed in a comparison table.

4. **Signal Statistics**:
   - Count the number of buy, sell, and hold signals to analyze how reactive or conservative this strategy is.

---

### 📝 Notes

- Mean reversion strategies like this one **bet on price bouncing back** toward its average after extreme deviations.
- They often complement trend-following strategies in a diversified system, and we’ll later compare both with RL agents.

In [21]:
# Copy data to a new DataFrame for Mean Reversion indicators
MR = data.copy()

# Calculate 20-day SMA
MR['SMA_20'] = MR['Close'].rolling(window=20).mean()

# Calculate 20-day Standard Deviation
MR['Std_Dev_20'] = MR['Close'].rolling(window=20).std()

# Calculate Upper and Lower Bollinger Bands
MR['Upper_Band'] = MR['SMA_20'] + (2 * MR['Std_Dev_20'])
MR['Lower_Band'] = MR['SMA_20'] - (2 * MR['Std_Dev_20'])

In [22]:
# Count rows with any null values
num_null_rows = MR.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")


Number of rows with any null values: 19


In [23]:
# Drop all rows with any null values
MR = MR.dropna()

# Print the updated DataFrame
MR.head()


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,SMA_20,Std_Dev_20,Upper_Band,Lower_Band
19,2006-01-03 00:19:00,2.586296,2.670223,2.581417,2.669747,2.257144,804758700.0,2.669695,3.2e-05,2.669759,2.66963
20,2006-01-03 00:20:00,2.586364,2.670253,2.581473,2.669752,2.257148,804628400.0,2.6697,3.2e-05,2.669765,2.669636
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.669706,3.2e-05,2.66977,2.669641
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.669711,3.2e-05,2.669776,2.669647
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.669717,3.2e-05,2.669781,2.669652


In [134]:
def optimized_mean_reversion_strategy_with_numba(MR, stop_loss_pct):
    """
    Optimized trading strategy using Bollinger Bands (Mean Reversion) with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    # Buy when price < Lower Band, Sell when price > Upper Band
    MR['Signal'] = np.where(MR['Close'] < MR['Lower_Band'], 1,  # Buy Signal
                            np.where(MR['Close'] > MR['Upper_Band'], -1, 0))  # Sell Signal

    # Prepare data for Numba loop
    close = MR['Close'].values
    signals = MR['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    MR['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [26]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the mean reversion strategy function for the current stop loss value
    investment, profit = optimized_mean_reversion_strategy_with_numba(MR, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_MR = pd.DataFrame(results)

In [27]:
results_MR

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,658100,-20820.975758,-3.163801
1,1.00%,658100,-6574.095487,-0.998951
2,3.00%,658100,-12323.68546,-1.872616
3,5.00%,658100,-15882.996883,-2.413463
4,10.00%,658100,-19814.530531,-3.010869
5,15.00%,658100,-20384.360391,-3.097456
6,20.00%,658100,-20528.936133,-3.119425


In [85]:
# Count the number of each signal type in the 'Signal' column
signal_counts = MR['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 6581
Sell Signals (-1): 31662
Hold Signals (0): 8650922


### 🔄 Step 4: Implementing a Momentum-Based Strategy using Stochastics Oscillator

This section defines and tests a momentum-based strategy using the **Stochastics Oscillator**, a popular indicator for identifying **overbought and oversold conditions** in short-term price movements.

#### ⚡ Strategy Logic (Stochastics 14, 7, 3)
- **Buy Signal (1)**: Triggered when the **%D_Slow** line falls **below 20**, signaling that the stock is likely **oversold** and due for a bounce.
- **Sell Signal (-1)**: Triggered when **%D_Slow** rises **above 80**, indicating the stock is **overbought** and may soon decline.
- **Hold Signal (0)**: When %D_Slow is between 20 and 80 — no trade signal.

---

### 📈 What is the Stochastics Oscillator?

The **Stochastics Oscillator** is a **momentum indicator** that compares a stock’s **closing price** to its price range over a specified period. It helps capture whether a price is trending near the **top or bottom** of its recent range.

It consists of three main components:

- **%K**: The current close relative to the 14-period high-low range (fast oscillator).
- **%D**: A **7-period moving average** of %K (smoothed version).
- **%D_Slow**: A further **3-period moving average** of %D — used to reduce noise and generate trade signals.

These smoothed values help filter out short-term fluctuations and focus on stronger momentum signals.

---

### ⚙️ Technical Steps

1. **Indicator Calculation**:
   - Compute the 14-period highest high and lowest low.
   - Calculate %K, %D, and %D_Slow sequentially using moving averages.

2. **Signal Generation**:
   - Use thresholds of 20 (oversold) and 80 (overbought) on **%D_Slow** to assign buy and sell signals.

3. **Backtesting**:
   - Use the same **`numba_loop()`** for efficient trade simulation.
   - Run the strategy over various stop-loss levels and record:
     - Total investment
     - Total profit
     - ROI (%)

4. **Signal Statistics**:
   - Count and display the total buy/sell/hold signals to analyze the strategy's responsiveness.

---

### 📝 Notes

The stochastics-based strategy is **momentum-sensitive**, aiming to profit from short-term price reversals near the edges of recent ranges.  
We’ll later compare this with both rule-based and RL-based agents to see how effective pure momentum signals are in isolation.

In [218]:
# Copy data to a new DataFrame for Stochastics
RSI = data.copy()

# Calculate 14-period high and low
RSI['High_14'] = RSI['High'].rolling(window=14).max()
RSI['Low_14'] = RSI['Low'].rolling(window=14).min()

# Calculate %K (Fast Stochastic)
RSI['%K'] = ((RSI['Close'] - RSI['Low_14']) / (RSI['High_14'] - RSI['Low_14'])) * 100

# Calculate %D (7-period SMA of %K)
RSI['%D'] = RSI['%K'].rolling(window=7).mean()

# Calculate %D Slow (3-period SMA of %D)
RSI['%D_Slow'] = RSI['%D'].rolling(window=3).mean()

In [219]:
# Count rows with any null values
num_null_rows = RSI.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 21


In [220]:
# Drop all rows with any null values
RSI = RSI.dropna()

# Display the first few rows to verify
RSI.head()


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,High_14,Low_14,%K,%D,%D_Slow
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.670284,2.580804,99.412119,99.496497,99.524597
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.670314,2.580859,99.38395,99.468376,99.496492
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.670345,2.580915,99.355766,99.44024,99.468371
24,2006-01-03 00:24:00,2.586637,2.670375,2.581696,2.669774,2.257167,804107200.0,2.670375,2.580971,99.327565,99.412087,99.440234
25,2006-01-03 00:25:00,2.586705,2.670406,2.581752,2.669779,2.257171,803976900.0,2.670406,2.581027,99.299349,99.383918,99.412082


In [221]:
def optimized_stochastics_strategy_with_numba(RSI, stop_loss_pct):
    """
    Optimized trading strategy using Stochastics (14, 7, 3) with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    # Buy when %D_Slow < 20 (Oversold), Sell when %D_Slow > 80 (Overbought)
    RSI['Signal'] = np.where(RSI['%D_Slow'] < 20, 1,  # Buy Signal
                             np.where(RSI['%D_Slow'] > 80, -1, 0))  # Sell Signal

    # Prepare data for Numba loop
    close = RSI['Close'].values
    signals = RSI['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    RSI['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [222]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the stochastics strategy function for the current stop loss value
    investment, profit = optimized_stochastics_strategy_with_numba(RSI, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_RSI = pd.DataFrame(results)

In [223]:
results_RSI

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,110999900,-559089.76742,-0.503685
1,1.00%,110999900,-220014.832902,-0.198212
2,3.00%,110999900,-515941.898459,-0.464813
3,5.00%,110999900,-654987.928845,-0.59008
4,10.00%,110999900,-582142.197374,-0.524453
5,15.00%,110999900,-565685.992234,-0.509627
6,20.00%,110999900,-559089.76742,-0.503685


In [86]:
# Count the number of each signal type in the 'Signal' column
signal_counts = RSI['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 1109999
Sell Signals (-1): 1664826
Hold Signals (0): 5914338


### ⚡ Step 5: Implementing a Momentum Strategy using MACD

This section introduces a **momentum-based strategy** using the **Moving Average Convergence Divergence (MACD)** indicator, a classic tool for capturing **trend shifts and momentum direction** in price movements.

#### 🔁 Strategy Logic (MACD 12, 26, 9)
- **Buy Signal (1)**: Triggered when the **MACD Line** crosses **above** the **Signal Line** — signaling bullish momentum.
- **Sell Signal (-1)**: Triggered when the **MACD Line** crosses **below** the **Signal Line** — indicating bearish momentum.
- **Hold Signal (0)**: When there's no crossover.

---

### 📈 What is MACD?

**MACD (Moving Average Convergence Divergence)** is a momentum indicator based on the relationship between two EMAs:

- **MACD Line** = 12-period EMA − 26-period EMA  
- **Signal Line** = 9-period EMA of the MACD Line  
- **MACD Histogram** = MACD Line − Signal Line

It captures:
- **Momentum shifts** (via crossovers)
- **Strength of movement** (via histogram)
- **Trend direction and reversals**

MACD is widely used for **momentum trading**, especially in markets where trends accelerate quickly.

---

### ⚙️ Technical Steps

1. **Indicator Construction**:
   - Calculate short-term (12) and long-term (26) EMAs of the closing price.
   - Derive MACD Line, Signal Line, and Histogram.

2. **Signal Assignment**:
   - Generate trade signals based on crossover of MACD and Signal Line.

3. **Backtest Execution**:
   - Use the shared **Numba-accelerated loop** for efficient simulation.
   - Evaluate the strategy across multiple stop-loss settings.
   - Capture total investment, profit, and ROI for each configuration.

4. **Signal Distribution**:
   - Print the count of buy, sell, and hold signals to understand how frequently the strategy acts.

---

### 📝 Notes

MACD-based strategies are useful for **capturing sustained momentum**, especially after a breakout or trend change.  
This method provides an additional perspective compared to mean-reversion and stochastic approaches — making it valuable in our broader comparison against RL agents.

In [34]:
# Copy data to a new DataFrame for MACD calculations
MD = data.copy()

# Calculate the 12-period EMA (faster EMA)
MD['EMA_12'] = MD['Close'].ewm(span=12, adjust=False).mean()

# Calculate the 26-period EMA (slower EMA)
MD['EMA_26'] = MD['Close'].ewm(span=26, adjust=False).mean()

# Calculate the MACD Line
MD['MACD_Line'] = MD['EMA_12'] - MD['EMA_26']

# Calculate the Signal Line (9-period EMA of MACD Line)
MD['Signal_Line'] = MD['MACD_Line'].ewm(span=9, adjust=False).mean()

# Calculate the MACD Histogram
MD['MACD_Histogram'] = MD['MACD_Line'] - MD['Signal_Line']


In [35]:
# Count rows with any null values
num_null_rows = MD.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 0


In [36]:
# Display the first few rows to verify
MD.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_12,EMA_26,MACD_Line,Signal_Line,MACD_Histogram
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0,2.669643,2.669643,0.0,0.0,0.0
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0,2.669644,2.669643,4.352611e-07,8.705222e-08,3.482089e-07
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0,2.669645,2.669644,1.206578e-06,3.109575e-07,8.95621e-07
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0,2.669647,2.669645,2.232398e-06,6.952455e-07,1.537152e-06
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0,2.66965,2.669647,3.445923e-06,1.245381e-06,2.200542e-06


In [136]:
def optimized_macd_strategy_with_numba(MD, stop_loss_pct):
    """
    Optimized trading strategy using MACD (12, 26, 9) with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    # Buy when MACD_Line crosses above Signal_Line, Sell when MACD_Line crosses below Signal_Line
    MD['Signal'] = np.where(MD['MACD_Line'] > MD['Signal_Line'], 1,  # Bullish Crossover (Buy Signal)
                            np.where(MD['MACD_Line'] < MD['Signal_Line'], -1, 0))  # Bearish Crossover (Sell Signal)

    # Prepare data for Numba loop
    close = MD['Close'].values
    signals = MD['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    MD['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [44]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the MACD strategy function for the current stop loss value
    investment, profit = optimized_macd_strategy_with_numba(MD, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_MACD = pd.DataFrame(results)

In [45]:
results_MACD

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,425963900,-1982978.0,-0.465527
1,1.00%,425963900,-1979219.0,-0.464645
2,3.00%,425963900,-1982978.0,-0.465527
3,5.00%,425963900,-1982978.0,-0.465527
4,10.00%,425963900,-1982978.0,-0.465527
5,15.00%,425963900,-1982978.0,-0.465527
6,20.00%,425963900,-1982978.0,-0.465527


In [87]:
# Count the number of each signal type in the 'Signal' column
signal_counts = MD['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 4259639
Sell Signals (-1): 4327884
Hold Signals (0): 101661


### 📦 Step 6: Implementing a Volume-Based Strategy using OBV

This section evaluates a volume-driven trading strategy using **On-Balance Volume (OBV)** — a popular indicator that combines **price movement and trading volume** to detect the strength of buying or selling pressure.

#### 🔁 Strategy Logic (OBV)
- **Buy Signal (1)**: When **OBV is rising** *and* the price is also increasing — indicating strong accumulation.
- **Sell Signal (-1)**: When **OBV is falling** *and* the price is decreasing — indicating strong distribution.
- **Hold Signal (0)**: No significant confirmation from price-volume dynamics.

---

### 📈 What is OBV (On-Balance Volume)?

**On-Balance Volume** is a **cumulative volume indicator** that tracks the flow of volume in relation to price direction. It works as follows:

- If the price **closes higher**, that day's volume is **added** to OBV.
- If the price **closes lower**, that day's volume is **subtracted** from OBV.
- The idea is that **rising OBV confirms upward trends**, while **falling OBV confirms downward trends**.

It’s often used to validate price breakouts or spot divergences between volume and price.

---

### ⚙️ Technical Steps

1. **OBV Construction**:
   - Initialize OBV with 0.
   - Apply cumulative logic to compute OBV based on price direction and volume.
   - Ensure no missing values before analysis.

2. **Signal Generation**:
   - Use a simple dual-condition rule:
     - Buy when both OBV and price are increasing.
     - Sell when both OBV and price are decreasing.

3. **Backtesting**:
   - Use the shared **Numba loop** to simulate trading outcomes efficiently.
   - Evaluate performance across multiple stop-loss values.
   - Record investment, profit, and ROI metrics.

4. **Signal Statistics**:
   - Count and report the number of buy/sell/hold signals to understand signal density.

---

### 📝 Notes

OBV adds a unique angle by incorporating **volume dynamics**, making it complementary to price-only indicators.  
This gives our evaluation a **more complete technical spectrum** before comparing everything against reinforcement learning models.

In [46]:
# Copy data to a new DataFrame for OBV calculations
VI = data.copy()

# Initialize OBV column
VI['OBV'] = 0.0

# Calculate OBV
VI['OBV'] = (
    (VI['Close'].diff() > 0).astype(int) * VI['Volume']  # Add volume if Close increases
    - (VI['Close'].diff() < 0).astype(int) * VI['Volume']  # Subtract volume if Close decreases
).cumsum()


In [47]:
# Count rows with any null values
num_null_rows = VI.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 0


In [48]:
# Display the first few rows to verify
VI.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,OBV
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0,0.0
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0,807104100.0
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0,1614078000.0
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0,2420921000.0
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0,3227635000.0


In [137]:
def optimized_obv_strategy_with_numba(VI, stop_loss_pct):
    """
    Optimized trading strategy using On-Balance-Volume (OBV) with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    # Buy when OBV is rising with increasing prices, Sell when OBV is falling with decreasing prices
    VI['Signal'] = np.where((VI['OBV'].diff() > 0) & (VI['Close'].diff() > 0), 1,  # Buy Signal
                            np.where((VI['OBV'].diff() < 0) & (VI['Close'].diff() < 0), -1, 0))  # Sell Signal

    # Prepare data for Numba loop
    close = VI['Close'].values
    signals = VI['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    VI['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [52]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the OBV strategy function for the current stop loss value
    investment, profit = optimized_obv_strategy_with_numba(VI, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_OBV = pd.DataFrame(results)

In [53]:
results_OBV

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,471179000,7746502.0,1.644068
1,0.03%,471179000,-2352950.0,-0.499375
2,1.00%,471179000,7746502.0,1.644068
3,3.00%,471179000,7746502.0,1.644068
4,5.00%,471179000,7746502.0,1.644068
5,10.00%,471179000,7746502.0,1.644068
6,15.00%,471179000,7746502.0,1.644068
7,20.00%,471179000,7746502.0,1.644068


In [88]:
# Count the number of each signal type in the 'Signal' column
signal_counts = VI['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 4711790
Sell Signals (-1): 3952913
Hold Signals (0): 24481


### 🔁 EMA + MACD Strategy (Trend + Momentum Combo)

This strategy combines two indicators:
- **Trend Filter**: 50-day EMA > 200-day EMA to confirm an uptrend (or downtrend for sell).
- **Momentum Confirmation**: MACD Line crossing above (or below) Signal Line for trade timing.

#### Signal Logic:
- **Buy**: 50 EMA > 200 EMA **and** MACD Line > Signal Line  
- **Sell**: 50 EMA < 200 EMA **and** MACD Line < Signal Line  
- **Hold**: All other cases

We backtest the strategy over multiple stop-loss values and evaluate investment, profit, ROI, and signal frequency — same as in earlier strategies.

In [54]:
# Copy data to a new DataFrame for EMA and MACD calculations
EM_MA = data.copy()

# Calculate 50-day EMA (shorter-term trend)
EM_MA['EMA_50'] = EM_MA['Close'].ewm(span=50, adjust=False).mean()

# Calculate 200-day EMA (longer-term trend)
EM_MA['EMA_200'] = EM_MA['Close'].ewm(span=200, adjust=False).mean()

# Calculate the 12-period EMA (faster EMA for MACD)
EM_MA['EMA_12'] = EM_MA['Close'].ewm(span=12, adjust=False).mean()

# Calculate the 26-period EMA (slower EMA for MACD)
EM_MA['EMA_26'] = EM_MA['Close'].ewm(span=26, adjust=False).mean()

# Calculate the MACD Line
EM_MA['MACD_Line'] = EM_MA['EMA_12'] - EM_MA['EMA_26']

# Calculate the Signal Line (9-period EMA of MACD Line)
EM_MA['Signal_Line'] = EM_MA['MACD_Line'].ewm(span=9, adjust=False).mean()

# Calculate the MACD Histogram
EM_MA['MACD_Histogram'] = EM_MA['MACD_Line'] - EM_MA['Signal_Line']


In [55]:
# Count rows with any null values
num_null_rows = EM_MA.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 0


In [56]:
# Display the first few rows to verify
EM_MA.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,EMA_12,EMA_26,MACD_Line,Signal_Line,MACD_Histogram
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0,2.669643,2.669643,2.669643,2.669643,0.0,0.0,0.0
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0,2.669643,2.669643,2.669644,2.669643,4.352611e-07,8.705222e-08,3.482089e-07
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0,2.669644,2.669643,2.669645,2.669644,1.206578e-06,3.109575e-07,8.95621e-07
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0,2.669644,2.669643,2.669647,2.669645,2.232398e-06,6.952455e-07,1.537152e-06
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0,2.669645,2.669643,2.66965,2.669647,3.445923e-06,1.245381e-06,2.200542e-06


In [138]:
def optimized_ema_macd_strategy_with_numba(EM_MA, stop_loss_pct):
    """
    Optimized trading strategy using EMA and MACD cross-confirmation with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    # Buy when 50-day EMA > 200-day EMA (uptrend) and MACD_Line crosses above Signal_Line
    # Sell when 50-day EMA < 200-day EMA (downtrend) and MACD_Line crosses below Signal_Line
    EM_MA['Signal'] = np.where(
        (EM_MA['EMA_50'] > EM_MA['EMA_200']) & (EM_MA['MACD_Line'] > EM_MA['Signal_Line']), 1,  # Buy Signal
        np.where((EM_MA['EMA_50'] < EM_MA['EMA_200']) & (EM_MA['MACD_Line'] < EM_MA['Signal_Line']), -1, 0)  # Sell Signal
    )

    # Prepare data for Numba loop
    close = EM_MA['Close'].values
    signals = EM_MA['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    EM_MA['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [65]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the EMA + MACD strategy function for the current stop loss value
    investment, profit = optimized_ema_macd_strategy_with_numba(EM_MA, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_EMA_MACD = pd.DataFrame(results)

In [66]:
results_EMA_MACD

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,235829000,3935333.0,1.668723
1,0.03%,235829000,-1177671.0,-0.499375
2,1.00%,235829000,3935339.0,1.668726
3,3.00%,235829000,3935333.0,1.668723
4,5.00%,235829000,3935333.0,1.668723
5,10.00%,235829000,3935333.0,1.668723
6,15.00%,235829000,3935333.0,1.668723
7,20.00%,235829000,3935333.0,1.668723


In [89]:
# Count the number of each signal type in the 'Signal' column
signal_counts = EM_MA['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 2358290
Sell Signals (-1): 1999376
Hold Signals (0): 4331518


### 🔁 EMA + Bollinger Bands Strategy (Trend + Mean Reversion Combo)

This strategy blends **trend direction** with **price deviation** logic:
- **Buy**: When 50 EMA > 200 EMA (uptrend) **and** price touches or falls below the lower Bollinger Band (oversold).
- **Sell**: When 50 EMA < 200 EMA (downtrend) **and** price reaches or exceeds the upper Bollinger Band (overbought).
- **Hold**: Otherwise.

As before, we evaluate this logic across different stop-loss values and analyze ROI, investment, profits, and signal breakdowns.

In [60]:
# Copy data to a new DataFrame for EMA + Bollinger Bands calculations
EM_BB = data.copy()

# Calculate 50 EMA (shorter-term trend)
EM_BB['EMA_50'] = EM_BB['Close'].ewm(span=50, adjust=False).mean()

# Calculate 200 EMA (longer-term trend)
EM_BB['EMA_200'] = EM_BB['Close'].ewm(span=200, adjust=False).mean()

# Calculate 20 SMA (Bollinger Bands centerline)
EM_BB['SMA_20'] = EM_BB['Close'].rolling(window=20).mean()

# Calculate 20 Standard Deviation
EM_BB['Std_Dev_20'] = EM_BB['Close'].rolling(window=20).std()

# Calculate Upper Bollinger Band
EM_BB['Upper_Band'] = EM_BB['SMA_20'] + (2 * EM_BB['Std_Dev_20'])

# Calculate Lower Bollinger Band
EM_BB['Lower_Band'] = EM_BB['SMA_20'] - (2 * EM_BB['Std_Dev_20'])


In [61]:
# Count rows with any null values
num_null_rows = EM_BB.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 19


In [62]:
# Drop all rows with any null values
EM_BB = EM_BB.dropna()

In [63]:
# Display the first few rows to verify
EM_BB.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,SMA_20,Std_Dev_20,Upper_Band,Lower_Band
19,2006-01-03 00:19:00,2.586296,2.670223,2.581417,2.669747,2.257144,804758700.0,2.669675,2.669653,2.669695,3.2e-05,2.669759,2.66963
20,2006-01-03 00:20:00,2.586364,2.670253,2.581473,2.669752,2.257148,804628400.0,2.669678,2.669654,2.6697,3.2e-05,2.669765,2.669636
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.669682,2.669655,2.669706,3.2e-05,2.66977,2.669641
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.669685,2.669656,2.669711,3.2e-05,2.669776,2.669647
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.669688,2.669657,2.669717,3.2e-05,2.669781,2.669652


In [139]:
def optimized_ema_bollinger_strategy_with_numba(EM_BB, stop_loss_pct):
    """
    Optimized trading strategy using EMA + Bollinger Bands with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    EM_BB['Signal'] = np.where(
        (EM_BB['EMA_50'] > EM_BB['EMA_200']) & (EM_BB['Close'] <= EM_BB['Lower_Band']), 1,  # Buy Signal
        np.where((EM_BB['EMA_50'] < EM_BB['EMA_200']) & (EM_BB['Close'] >= EM_BB['Upper_Band']), -1, 0)  # Sell Signal
    )

    # Prepare data for Numba loop
    close = EM_BB['Close'].values
    signals = EM_BB['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    EM_BB['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [67]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the EMA + Bollinger Bands strategy function for the current stop loss value
    investment, profit = optimized_ema_bollinger_strategy_with_numba(EM_BB, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_EMA_BB = pd.DataFrame(results)

In [68]:
results_EMA_BB

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,1534400,-71611.454521,-4.667066
1,1.00%,1534400,-2073.424735,-0.135129
2,3.00%,1534400,-24456.196469,-1.593861
3,5.00%,1534400,-10417.769385,-0.678947
4,10.00%,1534400,-33470.424169,-2.181336
5,15.00%,1534400,-64099.880612,-4.177521
6,20.00%,1534400,-71385.382334,-4.652332


In [90]:
# Count the number of each signal type in the 'Signal' column
signal_counts = EM_BB['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 15344
Sell Signals (-1): 16516
Hold Signals (0): 8657305


### 🔁 EMA + MACD + OBV Strategy (Trend + Momentum + Volume)

This strategy combines:
- **Trend**: 50 EMA > 200 EMA (for buys), or vice versa for sells
- **Momentum**: MACD Line crosses above/below Signal Line
- **Volume**: OBV rising (for buys) or falling (for sells)

#### Signal Logic:
- **Buy**: All three indicators agree on upward movement  
- **Sell**: All three confirm a downtrend  
- **Hold**: Otherwise

As before, we run the strategy across multiple stop-loss levels and evaluate performance through ROI, investment, profit, and signal counts.

In [141]:
# Copy data to a new DataFrame for EMA, MACD, and OBV calculations
EA_MA_V = data.copy()

# Calculate 50-day EMA (shorter-term trend)
EA_MA_V['EMA_50'] = EA_MA_V['Close'].ewm(span=50, adjust=False).mean()

# Calculate 200-day EMA (longer-term trend)
EA_MA_V['EMA_200'] = EA_MA_V['Close'].ewm(span=200, adjust=False).mean()

# Calculate the 12-period EMA (faster EMA for MACD)
EA_MA_V['EMA_12'] = EA_MA_V['Close'].ewm(span=12, adjust=False).mean()

# Calculate the 26-period EMA (slower EMA for MACD)
EA_MA_V['EMA_26'] = EA_MA_V['Close'].ewm(span=26, adjust=False).mean()

# Calculate the MACD Line
EA_MA_V['MACD_Line'] = EA_MA_V['EMA_12'] - EA_MA_V['EMA_26']

# Calculate the Signal Line (9-period EMA of MACD Line)
EA_MA_V['Signal_Line'] = EA_MA_V['MACD_Line'].ewm(span=9, adjust=False).mean()

# Calculate the MACD Histogram
EA_MA_V['MACD_Histogram'] = EA_MA_V['MACD_Line'] - EA_MA_V['Signal_Line']

# Calculate On-Balance Volume (OBV)
EA_MA_V['OBV'] = (np.where(EA_MA_V['Close'] > EA_MA_V['Close'].shift(1), EA_MA_V['Volume'],
                 np.where(EA_MA_V['Close'] < EA_MA_V['Close'].shift(1), -EA_MA_V['Volume'], 0))).cumsum()


In [70]:
# Count rows with any null values
num_null_rows = EA_MA_V.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 0


In [142]:
# Display the first few rows to verify
EA_MA_V.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,EMA_12,EMA_26,MACD_Line,Signal_Line,MACD_Histogram,OBV
0,2006-01-03 00:00:00,2.585,2.669643,2.580357,2.669643,2.257056,807234400.0,2.669643,2.669643,2.669643,2.669643,0.0,0.0,0.0,0.0
1,2006-01-03 00:01:00,2.585068,2.669673,2.580413,2.669648,2.257061,807104100.0,2.669643,2.669643,2.669644,2.669643,4.352611e-07,8.705222e-08,3.482089e-07,807104100.0
2,2006-01-03 00:02:00,2.585136,2.669704,2.580469,2.669654,2.257065,806973800.0,2.669644,2.669643,2.669645,2.669644,1.206578e-06,3.109575e-07,8.95621e-07,1614078000.0
3,2006-01-03 00:03:00,2.585205,2.669734,2.580524,2.669659,2.25707,806843500.0,2.669644,2.669643,2.669647,2.669645,2.232398e-06,6.952455e-07,1.537152e-06,2420921000.0
4,2006-01-03 00:04:00,2.585273,2.669765,2.58058,2.669665,2.257074,806713200.0,2.669645,2.669643,2.66965,2.669647,3.445923e-06,1.245381e-06,2.200542e-06,3227635000.0


In [143]:
def optimized_ema_macd_obv_strategy_with_numba(EA_MA_V, stop_loss_pct):
    """
    Optimized trading strategy using EMA, MACD, and OBV with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    EA_MA_V['Signal'] = np.where(
        (EA_MA_V['EMA_50'] > EA_MA_V['EMA_200']) &  # Trend Confirmation (Uptrend)
        (EA_MA_V['MACD_Line'] > EA_MA_V['Signal_Line']) &  # Momentum Confirmation (Bullish)
        (EA_MA_V['OBV'] > EA_MA_V['OBV'].shift(1)),  # Volume Confirmation (Rising OBV)
        1,  # Buy Signal
        np.where(
            (EA_MA_V['EMA_50'] < EA_MA_V['EMA_200']) &  # Trend Confirmation (Downtrend)
            (EA_MA_V['MACD_Line'] < EA_MA_V['Signal_Line']) &  # Momentum Confirmation (Bearish)
            (EA_MA_V['OBV'] < EA_MA_V['OBV'].shift(1)),  # Volume Confirmation (Falling OBV)
            -1,  # Sell Signal
            0  # Hold
        )
    )

    # Prepare data for Numba loop
    close = EA_MA_V['Close'].values
    signals = EA_MA_V['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    EA_MA_V['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [73]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the EMA + MACD + OBV strategy function for the current stop loss value
    investment, profit = optimized_ema_macd_obv_strategy_with_numba(EA_MA_V, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_EMA_MACD_OBV = pd.DataFrame(results)

In [74]:
results_EMA_MACD_OBV

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,235545600,3992590.0,1.695039
1,1.00%,235545600,3991999.0,1.694788
2,3.00%,235545600,3992590.0,1.695039
3,5.00%,235545600,3992590.0,1.695039
4,10.00%,235545600,3992590.0,1.695039
5,15.00%,235545600,3992590.0,1.695039
6,20.00%,235545600,3992590.0,1.695039


In [91]:
# Count the number of each signal type in the 'Signal' column
signal_counts = EA_MA_V['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 2355456
Sell Signals (-1): 1997063
Hold Signals (0): 4336665


### 🔁 EMA + Bollinger + MACD + OBV Strategy (Trend + Reversion + Momentum + Volume)

This comprehensive strategy combines all four types of signals:
- **Trend**: 50 EMA > 200 EMA
- **Mean Reversion**: Price below lower Bollinger Band
- **Momentum**: MACD Line > Signal Line
- **Volume**: OBV rising

#### Signal Logic:
- **Buy**: All four indicators confirm bullish conditions  
- **Sell**: All four confirm bearish conditions  
- **Hold**: Otherwise

As with earlier strategies, we test this across multiple stop-loss settings and analyze the resulting ROI, profit, and signal distribution.

In [75]:
# Copy data to a new DataFrame for the 4-indicator strategy
four_indicators = data.copy()

# 1. Calculate EMA (Trend Indicator)
four_indicators['EMA_50'] = four_indicators['Close'].ewm(span=50, adjust=False).mean()
four_indicators['EMA_200'] = four_indicators['Close'].ewm(span=200, adjust=False).mean()

# 2. Calculate Bollinger Bands (Mean Reversion Indicator)
four_indicators['SMA_20'] = four_indicators['Close'].rolling(window=20).mean()
four_indicators['Std_Dev_20'] = four_indicators['Close'].rolling(window=20).std()
four_indicators['Upper_Band'] = four_indicators['SMA_20'] + (2 * four_indicators['Std_Dev_20'])
four_indicators['Lower_Band'] = four_indicators['SMA_20'] - (2 * four_indicators['Std_Dev_20'])

# 3. Calculate MACD (Momentum Indicator)
four_indicators['EMA_12'] = four_indicators['Close'].ewm(span=12, adjust=False).mean()
four_indicators['EMA_26'] = four_indicators['Close'].ewm(span=26, adjust=False).mean()
four_indicators['MACD_Line'] = four_indicators['EMA_12'] - four_indicators['EMA_26']
four_indicators['Signal_Line'] = four_indicators['MACD_Line'].ewm(span=9, adjust=False).mean()
four_indicators['MACD_Histogram'] = four_indicators['MACD_Line'] - four_indicators['Signal_Line']

# 4. Calculate OBV (Volume Indicator)
four_indicators['OBV'] = (np.where(four_indicators['Close'] > four_indicators['Close'].shift(1), four_indicators['Volume'],
                         np.where(four_indicators['Close'] < four_indicators['Close'].shift(1), -four_indicators['Volume'], 0))).cumsum()



In [76]:
# Count rows with any null values
num_null_rows = four_indicators.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 19


In [77]:
# Drop all rows with any null values
four_indicators = four_indicators.dropna()

In [78]:
# Display the first few rows to verify
four_indicators.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,SMA_20,Std_Dev_20,Upper_Band,Lower_Band,EMA_12,EMA_26,MACD_Line,Signal_Line,MACD_Histogram,OBV
19,2006-01-03 00:19:00,2.586296,2.670223,2.581417,2.669747,2.257144,804758700.0,2.669675,2.669653,2.669695,3.2e-05,2.669759,2.66963,2.669718,2.669694,2.4e-05,1.9e-05,5e-06,15312700000.0
20,2006-01-03 00:20:00,2.586364,2.670253,2.581473,2.669752,2.257148,804628400.0,2.669678,2.669654,2.6697,3.2e-05,2.669765,2.669636,2.669723,2.669698,2.5e-05,2e-05,5e-06,16117330000.0
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.669682,2.669655,2.669706,3.2e-05,2.66977,2.669641,2.669728,2.669703,2.6e-05,2.1e-05,4e-06,16921820000.0
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.669685,2.669656,2.669711,3.2e-05,2.669776,2.669647,2.669734,2.669707,2.6e-05,2.2e-05,4e-06,17726190000.0
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.669688,2.669657,2.669717,3.2e-05,2.669781,2.669652,2.669739,2.669712,2.7e-05,2.3e-05,4e-06,18530430000.0


In [144]:
def optimized_4_indicator_strategy_with_numba(four_indicators, stop_loss_pct):
    """
    Optimized trading strategy using EMA, Bollinger Bands, MACD, and OBV with Numba acceleration.
    """
    # Generate Buy/Sell Signals
    four_indicators['Signal'] = np.where(
        (four_indicators['EMA_50'] > four_indicators['EMA_200']) &  # Trend Confirmation (Uptrend)
        (four_indicators['Close'] <= four_indicators['Lower_Band']) &  # Mean Reversion (Oversold)
        (four_indicators['MACD_Line'] > four_indicators['Signal_Line']) &  # Momentum Confirmation (Bullish)
        (four_indicators['OBV'] > four_indicators['OBV'].shift(1)),  # Volume Confirmation (Rising OBV)
        1,  # Buy Signal
        np.where(
            (four_indicators['EMA_50'] < four_indicators['EMA_200']) &  # Trend Confirmation (Downtrend)
            (four_indicators['Close'] >= four_indicators['Upper_Band']) &  # Mean Reversion (Overbought)
            (four_indicators['MACD_Line'] < four_indicators['Signal_Line']) &  # Momentum Confirmation (Bearish)
            (four_indicators['OBV'] < four_indicators['OBV'].shift(1)),  # Volume Confirmation (Falling OBV)
            -1,  # Sell Signal
            0  # Hold
        )
    )

    # Prepare data for Numba loop
    close = four_indicators['Close'].values
    signals = four_indicators['Signal'].values

    # Define constants
    buy_fee = 0.0025
    sell_fee = 0.0025

    # Call Numba-optimized loop
    profit_loss = numba_loop(close, signals, stop_loss_pct, buy_fee, sell_fee)

    # Add results back to DataFrame
    four_indicators['Profit/Loss'] = profit_loss

    # Total profit and investment
    total_profit = profit_loss.sum()
    total_investment = 100 * (signals == 1).sum()  # $100 per buy signal
    return total_investment, total_profit


In [81]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Initialize an empty list to store results
results = []

# Iterate over each stop loss value
for stop_loss in stop_loss_values:
    # Call the 4-Indicator strategy function for the current stop loss value
    investment, profit = optimized_4_indicator_strategy_with_numba(four_indicators, stop_loss_pct=stop_loss)
    
    # Calculate ROI
    roi = (profit / investment) * 100 if investment != 0 else 0
    
    # Determine stop-loss label
    stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"
    
    # Append the results as a dictionary
    results.append({
        "Stop Loss": stop_loss_label,  # Use label for stop loss
        "Total Investment ($)": investment,
        "Total Profit ($)": profit,
        "ROI (%)": roi
    })

# Convert the results into a DataFrame
results_4_indicators = pd.DataFrame(results)

In [82]:
results_4_indicators

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,1.00%,0,0.0,0
2,3.00%,0,0.0,0
3,5.00%,0,0.0,0
4,10.00%,0,0.0,0
5,15.00%,0,0.0,0
6,20.00%,0,0.0,0


In [92]:
# Count the number of each signal type in the 'Signal' column
signal_counts = four_indicators['Signal'].value_counts()

# Print the counts for each signal
print("Signal Counts:")
print(f"Buy Signals (1): {signal_counts.get(1, 0)}")
print(f"Sell Signals (-1): {signal_counts.get(-1, 0)}")
print(f"Hold Signals (0): {signal_counts.get(0, 0)}")


Signal Counts:
Buy Signals (1): 0
Sell Signals (-1): 0
Hold Signals (0): 8689165


In [93]:
buy_condition = (
    (four_indicators['EMA_50'] > four_indicators['EMA_200']) &
    (four_indicators['Close'] <= four_indicators['Lower_Band']) &
    (four_indicators['MACD_Line'] > four_indicators['Signal_Line']) &
    (four_indicators['OBV'] > four_indicators['OBV'].shift(1))
)
print(f"Buy Condition: {buy_condition.sum()} rows satisfied")


Buy Condition: 0 rows satisfied


### 📉 Evaluation of Hardcoded Trading Strategies

We evaluated a variety of traditional **technical indicator strategies** (trend, momentum, mean reversion, volume, and combinations) on minute-level Apple stock data across multiple stop-loss configurations. Here's what the results reveal:

---

### 🧾 Key Observations

#### ⚠️ Low-Yielding or Ineffective Strategies:
- Even the **“best-performing” methods**, like **OBV (Volume)** or **EMA + MACD + OBV**, delivered only **~1.6–1.7% total ROI over 20 years**, which is **extremely poor** in practical terms.
  - For context, a basic **buy-and-hold strategy on Apple** would have yielded **6366.75%** over the same period.
- Strategies like **EMA crossover**, **MACD**, and **EMA + MACD** also showed **positive but negligible** returns, barely justifying the transaction costs and risks taken.

#### ❌ Consistently Negative Strategies:
- **Mean Reversion (Bollinger Bands)**, **EMA + BB**, and **momentum methods** like **Stochastics** or **MACD-only** posted **consistent losses**, showing structural weakness in these rules under high-frequency noise.

#### 🚫 Strategy Failure:
- The **4-indicator combo** produced **zero trades** — a classic case of **over-filtering**, where the strict criteria made it too hard to act, missing all opportunities.

---

### 📉 Why Do These Hardcoded Strategies Fail?

- **Rules are oversimplified**: They assume market behavior can be modeled with a few lines of logic — this fails in real-world, noisy environments.
- **Zero adaptation**: They don’t learn from the past or adapt to new regimes.
- **Overfitting to theory**: These strategies often work on paper or in ideal conditions but fall apart in real markets.
- **Horrible risk-reward profile**: Risking billions to make 1–2% over two decades is **not viable** — you'd be better off parking the money in short-term Treasury bills.

---

### 🧠 Why “Intuition” Matters in Trading

Human traders often succeed by:
- **Adapting** to shifting trends, regimes, and market microstructures.
- Incorporating **non-quantifiable signals**, like sentiment, macro news, or liquidity patterns.
- Recognizing that the **same signal can mean different things** in different contexts — something hardcoded systems cannot grasp.

But even human intuition has limits — which is where **Reinforcement Learning** comes in.

---

### 🚀 Moving Forward: Reinforcement Learning

Our next goal is to train an RL agent (starting with Deep Q-Learning + LSTM) that can:
- **Learn adaptive, context-aware strategies**
- React to **minute-level data dynamics**
- Potentially outperform both hardcoded rules and static machine learning approaches

Let’s see if learning from experience can beat human-designed heuristics.

### 🧹 RL Preprocessing: Feature Engineering & Dataset Structuring

Before training a reinforcement learning (RL) agent, we first prepare the dataset with **relevant trading features**, **temporal context**, and **clean structure** — giving the agent all the raw ingredients it needs to learn profitable behaviors.

---

### 🧠 Feature Engineering

We enrich the raw minute-level Apple stock data with a mix of **technical indicators** from multiple trading styles:

#### 📈 1. Trend Indicators
- `EMA_50`, `EMA_200`: Capture short- and long-term directional trends.

#### 🔁 2. Mean Reversion Indicators
- `SMA_20`, `Upper_Band`, `Lower_Band`: Bollinger Bands signal overbought/oversold zones.

#### ⚡ 3. Momentum Indicators
- `MACD_Line`, `Signal_Line`: Track changes in price momentum via EMA differentials.

#### 📊 4. Relative Strength Indicators
- `%K`, `%D`: Stochastic Oscillator to identify when price is relatively high or low.

#### 📦 5. Volume Indicators
- `OBV`: Measures accumulation/distribution based on price-volume relationship.

These features together form a **multi-dimensional market view** — which becomes the **observation space** for the RL agent.

---

### 🧼 Data Cleaning & Temporal Context

- Dropped all rows with `NaN` values introduced by rolling window operations (like EMA, SMA, Std Dev).
- Extracted time-related columns: `Year`, `Month`, `Day`, `Time` from the `Date` column.
  - This enables clean **year-based splitting** for training and testing.
  - Allows evaluation of agent performance in **unseen future years**.

---

### 🕰️ Temporal Splitting Prep: Extracting Date Components

We extract additional features from the `Date` column:
- **Year**, **Month**, **Day**, and **Time** — useful for:
  - **Training/validation/testing splits by year**
  - Tracking performance across different market regimes
  - Agent evaluation on **unseen future data**

---

### 📊 Why Count Distinct Years?

We print year-wise sample counts to **understand the data distribution** across time. This helps in:
- Ensuring **balanced splits** across training and test periods
- Avoiding **data leakage** by training and testing on overlapping regimes
- Selecting **early years** for training and **recent years** for testing generalization

---

### 🔄 Feature Normalization

After all features were created, the following key indicator columns were **scaled to [0, 1]** using **MinMaxScaler**:

This step ensures:
- Uniform feature scales for stable neural network training
- No dominance by high-magnitude features like OBV or Bollinger bands

---

### 📦 Final Dataset Structure

- **~8.6 million rows** after cleaning  
- **24 columns**: including raw price data, engineered indicators, and date components  
- Ready to be fed into a **gym-style trading environment** for training the RL agent


In [4]:
# Copy data to a new DataFrame
data_2 = data.copy()

# Convert 'Date' column to datetime if not already converted
data_2['Date'] = pd.to_datetime(data_2['Date'])

### 1. Trend Indicators: 50-Day EMA and 200-Day EMA
data_2['EMA_50'] = data_2['Close'].ewm(span=50, adjust=False).mean()  # 50-Day EMA
data_2['EMA_200'] = data_2['Close'].ewm(span=200, adjust=False).mean()  # 200-Day EMA

### 2. Mean Reversion Indicators: Bollinger Bands (20, 2)
data_2['SMA_20'] = data_2['Close'].rolling(window=20).mean()  # 20-Day Simple Moving Average (SMA)
data_2['Std_Dev_20'] = data_2['Close'].rolling(window=20).std()  # 20-Day Standard Deviation
data_2['Upper_Band'] = data_2['SMA_20'] + (2 * data_2['Std_Dev_20'])  # Upper Bollinger Band
data_2['Lower_Band'] = data_2['SMA_20'] - (2 * data_2['Std_Dev_20'])  # Lower Bollinger Band

### 3. Relative Strength Indicators: Stochastics (14, 7, 3)
# High and Low for the past 14 periods
data_2['High_14'] = data_2['High'].rolling(window=14).max()
data_2['Low_14'] = data_2['Low'].rolling(window=14).min()
# %K: Stochastic Oscillator
data_2['%K'] = ((data_2['Close'] - data_2['Low_14']) / (data_2['High_14'] - data_2['Low_14'])) * 100
# %D: 3-Period Moving Average of %K
data_2['%D'] = data_2['%K'].rolling(window=3).mean()

### 4. Momentum Indicators: MACD (12, 26, 9)
# MACD Line: Difference between 12-period and 26-period EMAs
data_2['MACD_Line'] = data_2['Close'].ewm(span=12, adjust=False).mean() - data_2['Close'].ewm(span=26, adjust=False).mean()
# Signal Line: 9-period EMA of the MACD Line
data_2['Signal_Line'] = data_2['MACD_Line'].ewm(span=9, adjust=False).mean()

### 5. Volume Indicators: On-Balance Volume (OBV)
# OBV Calculation
data_2['Daily_Change'] = data_2['Close'].diff()
data_2['OBV'] = (np.where(data_2['Daily_Change'] > 0, data_2['Volume'],
                  np.where(data_2['Daily_Change'] < 0, -data_2['Volume'], 0))).cumsum()

# Drop intermediate columns not required
data_2.drop(columns=['Daily_Change'], inplace=True)


In [5]:
### Check for any null values introduced due to rolling calculations
num_null_rows = data_2.isnull().any(axis=1).sum()
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 19


In [6]:
# Drop rows with null values if needed
data_2.dropna(inplace=True)

In [8]:
data_2.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,SMA_20,Std_Dev_20,Upper_Band,Lower_Band,High_14,Low_14,%K,%D,MACD_Line,Signal_Line,OBV
19,2006-01-03 00:19:00,2.586296,2.670223,2.581417,2.669747,2.257144,804758700.0,2.669675,2.669653,2.669695,3.2e-05,2.669759,2.66963,2.670223,2.580692,99.468408,99.496524,2.4e-05,1.9e-05,15312700000.0
20,2006-01-03 00:20:00,2.586364,2.670253,2.581473,2.669752,2.257148,804628400.0,2.669678,2.669654,2.6697,3.2e-05,2.669765,2.669636,2.670253,2.580748,99.440271,99.468403,2.5e-05,2e-05,16117330000.0
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.669682,2.669655,2.669706,3.2e-05,2.66977,2.669641,2.670284,2.580804,99.412119,99.440266,2.6e-05,2.1e-05,16921820000.0
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.669685,2.669656,2.669711,3.2e-05,2.669776,2.669647,2.670314,2.580859,99.38395,99.412114,2.6e-05,2.2e-05,17726190000.0
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.669688,2.669657,2.669717,3.2e-05,2.669781,2.669652,2.670345,2.580915,99.355766,99.383945,2.7e-05,2.3e-05,18530430000.0


In [9]:
data_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8689165 entries, 19 to 8689183
Data columns (total 20 columns):
 #   Column       Dtype         
---  ------       -----         
 0   Date         datetime64[ns]
 1   Open         float64       
 2   High         float64       
 3   Low          float64       
 4   Close        float64       
 5   Adj Close    float64       
 6   Volume       float64       
 7   EMA_50       float64       
 8   EMA_200      float64       
 9   SMA_20       float64       
 10  Std_Dev_20   float64       
 11  Upper_Band   float64       
 12  Lower_Band   float64       
 13  High_14      float64       
 14  Low_14       float64       
 15  %K           float64       
 16  %D           float64       
 17  MACD_Line    float64       
 18  Signal_Line  float64       
 19  OBV          float64       
dtypes: datetime64[ns](1), float64(19)
memory usage: 1.4 GB


In [7]:
# Convert 'Date' column to datetime if not already
data_2['Date'] = pd.to_datetime(data_2['Date'])

# Split into separate columns
data_2['Year'] = data_2['Date'].dt.year
data_2['Month'] = data_2['Date'].dt.month
data_2['Day'] = data_2['Date'].dt.day
data_2['Time'] = data_2['Date'].dt.time

In [11]:
data_2.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,EMA_50,EMA_200,SMA_20,...,Low_14,%K,%D,MACD_Line,Signal_Line,OBV,Year,Month,Day,Time
19,2006-01-03 00:19:00,2.586296,2.670223,2.581417,2.669747,2.257144,804758700.0,2.669675,2.669653,2.669695,...,2.580692,99.468408,99.496524,2.4e-05,1.9e-05,15312700000.0,2006,1,3,00:19:00
20,2006-01-03 00:20:00,2.586364,2.670253,2.581473,2.669752,2.257148,804628400.0,2.669678,2.669654,2.6697,...,2.580748,99.440271,99.468403,2.5e-05,2e-05,16117330000.0,2006,1,3,00:20:00
21,2006-01-03 00:21:00,2.586432,2.670284,2.581529,2.669758,2.257153,804498100.0,2.669682,2.669655,2.669706,...,2.580804,99.412119,99.440266,2.6e-05,2.1e-05,16921820000.0,2006,1,3,00:21:00
22,2006-01-03 00:22:00,2.586501,2.670314,2.581585,2.669763,2.257157,804367800.0,2.669685,2.669656,2.669711,...,2.580859,99.38395,99.412114,2.6e-05,2.2e-05,17726190000.0,2006,1,3,00:22:00
23,2006-01-03 00:23:00,2.586569,2.670345,2.581641,2.669768,2.257162,804237500.0,2.669688,2.669657,2.669717,...,2.580915,99.355766,99.383945,2.7e-05,2.3e-05,18530430000.0,2006,1,3,00:23:00


In [12]:
# Print distinct values of the Year column and their counts
year_counts = data_2['Year'].value_counts()

print("Distinct Years and Counts:")
for year, count in year_counts.items():
    print(f"Year: {year}, Count: {count}")

Distinct Years and Counts:
Year: 2015, Count: 478092
Year: 2013, Count: 478092
Year: 2009, Count: 476652
Year: 2021, Count: 475212
Year: 2023, Count: 475212
Year: 2020, Count: 475212
Year: 2017, Count: 475212
Year: 2016, Count: 473773
Year: 2010, Count: 473772
Year: 2012, Count: 473772
Year: 2011, Count: 472333
Year: 2019, Count: 472332
Year: 2014, Count: 472332
Year: 2007, Count: 472332
Year: 2018, Count: 470892
Year: 2008, Count: 468012
Year: 2022, Count: 466573
Year: 2006, Count: 463673
Year: 2024, Count: 175685


In [103]:
data_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8689165 entries, 19 to 8689183
Data columns (total 24 columns):
 #   Column       Dtype         
---  ------       -----         
 0   Date         datetime64[ns]
 1   Open         float64       
 2   High         float64       
 3   Low          float64       
 4   Close        float64       
 5   Adj Close    float64       
 6   Volume       float64       
 7   EMA_50       float64       
 8   EMA_200      float64       
 9   SMA_20       float64       
 10  Std_Dev_20   float64       
 11  Upper_Band   float64       
 12  Lower_Band   float64       
 13  High_14      float64       
 14  Low_14       float64       
 15  %K           float64       
 16  %D           float64       
 17  MACD_Line    float64       
 18  Signal_Line  float64       
 19  OBV          float64       
 20  Year         int32         
 21  Month        int32         
 22  Day          int32         
 23  Time         object        
dtypes: datetime64[ns](1), float6

In [8]:
# Normalize the feature columns
features = [
    'EMA_50', 'EMA_200', 'SMA_20', 'Upper_Band', 'Lower_Band', '%K', '%D',
    'MACD_Line', 'Signal_Line', 'OBV'
]
scaler = MinMaxScaler()
data_2[features] = scaler.fit_transform(data_2[features])

## ♟️ Reinforcement Learning Pipeline: Environment + Agent Architecture

We now build the **heart of the system**: a custom trading simulator + Deep Q-Learning (DQN) agent with **LSTM-based memory**, trained on historical market data using temporal sequences of technical indicators.

The architecture is composed of four main blocks:

---

### 1️⃣ `TradingEnv`: Custom Gym Environment for Trading

This is a specialized subclass of `gym.Env`, simulating a **discrete trading scenario** with real constraints and agent control.

#### 🧠 Key Design:

| Component | Behavior |
|----------|----------|
| **Actions** | 0 = Hold, 1 = Buy ($100), 2 = Sell (all shares) |
| **State Space** | 11-dimensional vector = 10 normalized indicators + `shares_held` |
| **Reward** | Only triggered on **Sell** — realized profit from liquidation |
| **Transaction Fee** | 0.25% on both Buy and Sell |
| **Final Liquidation** | If episode ends with open position, it's sold automatically |

#### 🧩 Internal State Tracked:
- `shares_held`: current accumulated position
- `cost_so_far`: capital spent on current holdings
- `total_profit`: cumulative **realized** profit across the episode
- `total_buys_count`: used to track **total invested amount**

The `step()` function handles execution logic: applying the chosen action, moving to the next time step, calculating reward, and checking for episode termination.

This environment is **fully differentiable and stateless outside of RL logic**, making it ideal for training episodic agents.

---

### 2️⃣ `ReplayBuffer`: Experience Replay with Sequences

Since we’re using an LSTM-based agent, we can’t sample individual transitions like classic DQN. Instead, we use a **replay buffer that stores full sequences** of transitions.

#### 🎬 Design Choices:
- Stores `(state, action, reward, next_state, done)` in a rolling buffer
- Sampling returns **mini-batches of sequences** (e.g., 8 steps long)
- This lets the agent learn from temporal patterns over recent history

Without this, the LSTM wouldn’t have enough context to learn temporal dynamics — especially critical in trading where trends build over time.

---

### 3️⃣ `RecurrentQNetwork`: DQN with Memory (LSTM)

The Q-network is implemented using an **LSTM + Linear head**, allowing it to:
- Process **sequences of states**, rather than single snapshots
- Learn to capture **temporal dependencies** between consecutive market events

#### 🧠 Architecture:
- Input: `(batch_size, seq_len, input_dim=11)`
- LSTM hidden size: 64
- Output: Q-values for all 3 actions at the **last time step** of each sequence

The network has a `init_hidden()` method for zeroing out the memory at the beginning of each episode — useful for both training and evaluation.

---

### 4️⃣ `DQNAgent`: Core Agent Logic

This class encapsulates the training loop, Q-learning updates, exploration schedule, and LSTM management.

#### 🧱 Components:
- Maintains both **online Q-network** and **target network**
- Uses **ε-greedy strategy** for balancing exploration and exploitation
- Trains via **Temporal Difference (TD)** learning using MSE loss between:
  ```
  TD target = reward + γ * max_a' Q_target(next_state, a')
  ```

#### ⚙️ Key Features:
- **Sequential mini-batch training** using sampled sequences from memory
- Hidden states passed and updated properly across time
- **Epsilon decay** from 1.0 to 0.01 over time
- **Target network soft updates** every few steps (to stabilize learning)
- Automatically liquidates at episode end to close position

---

### 📈 Training & Evaluation Loops

#### `train_one_episode()`
- Runs one complete episode from start to finish
- For each step:
  - Select action using ε-greedy policy
  - Interact with the environment
  - Store transition
  - Train on sampled sequences
  - Decay exploration

At the end of each episode, returns:
- **Total realized profit**
- **Total capital invested**

#### `evaluate_agent()`
- Runs a **greedy policy** (no exploration) on the test set
- No training or buffer updates — just execution
- Reports final profit and investment
- Used to evaluate generalization on unseen market years

---

### 📦 Summary

| Module | Purpose |
|--------|---------|
| `TradingEnv` | Simulates the trading loop, provides state & reward to agent |
| `ReplayBuffer` | Stores experiences as sequences for LSTM-based learning |
| `RecurrentQNetwork` | Predicts Q-values using LSTM over state sequences |
| `DQNAgent` | Handles action selection, learning, and policy updates |

This full stack forms a **deep recurrent reinforcement learning agent** capable of:
- Learning to **time trades intelligently**
- Adapting to **non-linear market dynamics**
- Reacting based on a **multi-indicator view of price action**


In [15]:
class TradingEnv(gym.Env):
    """
    A specialized trading environment:
    - Discrete actions: 0=Hold, 1=Buy($100), 2=Sell(all).
    - 0.25% fee on both Buy and Sell.
    - No position limit (can keep buying to accumulate shares).
    - Final liquidation at the end if still holding.
    - Observations: 10 normalized features + shares_held (float).
    - Tracks total buys and total realized profit.
    """
    def __init__(self, df, start_idx=0, end_idx=None, fee=0.0025):
        super(TradingEnv, self).__init__()
        
        self.df = df.reset_index(drop=True)
        self.start_idx = start_idx
        self.end_idx = end_idx if end_idx is not None else len(self.df) - 1
        self.current_idx = self.start_idx
        
        # Transaction fee
        self.fee = fee
        
        # Actions: 0=Hold, 1=Buy, 2=Sell
        self.action_space = spaces.Discrete(3)
        
        # Observations: 10 normalized indicators + 1 for shares_held
        self.observation_space = spaces.Box(
            low=0, high=1, shape=(11,), dtype=np.float32
        )
        
        # Internal state
        self.shares_held = 0.0          # total number of shares currently held
        self.cost_so_far = 0.0         # total dollars spent for the current open position(s)
        self.total_profit = 0.0        # cumulative realized profit
        self.total_buys_count = 0      # number of buy actions (each is $100)
        
        self.done = False

    def _get_observation(self):
        """
        Build observation vector:
        [EMA_50, EMA_200, SMA_20, Upper_Band, Lower_Band,
         %K, %D, MACD_Line, Signal_Line, OBV, shares_held]
        """
        row = self.df.loc[self.current_idx]
        
        obs_features = [
            row['EMA_50'],
            row['EMA_200'],
            row['SMA_20'],
            row['Upper_Band'],
            row['Lower_Band'],
            row['%K'],
            row['%D'],
            row['MACD_Line'],
            row['Signal_Line'],
            row['OBV']
        ]
        obs = np.array(obs_features, dtype=np.float32)
        
        # Append shares_held
        obs = np.append(obs, self.shares_held).astype(np.float32)
        return obs

    def _final_liquidation(self):
        """
        Liquidate any remaining shares at the current price (minus fee).
        Realize profit/loss from those shares and reset position to 0.
        """
        if self.shares_held > 0.0:
            current_price = self.df.loc[self.current_idx, 'Close']
            # value of all shares (minus sell fee)
            gross_value = self.shares_held * current_price
            net_value = gross_value * (1.0 - self.fee)

            # Realized profit = net_value - cost_so_far
            final_profit = net_value - self.cost_so_far
            self.total_profit += final_profit

            # Reset position
            self.shares_held = 0.0
            self.cost_so_far = 0.0

    def step(self, action):
        """
        Execute action, compute reward, move to next timestep.
        Reward is realized profit when selling. 0 otherwise.
        If we reach end_idx, automatically liquidate any remaining shares.
        """
        assert self.action_space.contains(action), "Invalid Action"
        reward = 0.0
        
        current_price = self.df.loc[self.current_idx, 'Close']
        
        # ---------- Action Logic ----------
        if action == 1:  # Buy
            # Each buy invests $100, minus buy fee
            invest_amount = 100.0
            invest_amount_after_fee = invest_amount * (1.0 - self.fee)

            # Increase shares_held
            shares_bought = invest_amount_after_fee / current_price
            self.shares_held += shares_bought

            # Track cost
            self.cost_so_far += invest_amount
            # Increment total buy count
            self.total_buys_count += 1

        elif action == 2 and self.shares_held > 0.0:  # Sell (liquidate all)
            # Sell all shares
            gross_value = self.shares_held * current_price
            net_value = gross_value * (1.0 - self.fee)
            
            # Realized profit
            profit = net_value - self.cost_so_far
            self.total_profit += profit
            reward = profit  # immediate reward

            # Reset position
            self.shares_held = 0.0
            self.cost_so_far = 0.0
        
        # ---------- Move forward ----------
        self.current_idx += 1
        if self.current_idx >= self.end_idx:
            # Final liquidation if we still hold shares
            self._final_liquidation()
            # End the episode
            self.done = True
        else:
            self.done = False
        
        obs = self._get_observation()
        return obs, reward, self.done, {}

    def reset(self):
        """
        Reset environment to the start of the timeframe.
        """
        self.current_idx = self.start_idx
        self.done = False
        self.shares_held = 0.0
        self.cost_so_far = 0.0
        self.total_profit = 0.0
        self.total_buys_count = 0
        
        return self._get_observation()

    def render(self, mode='human'):
        """
        Optionally print debug info
        """
        pass


In [16]:
Transition = collections.namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.memory = collections.deque(maxlen=capacity)
    
    def push(self, *args):
        """Saves a transition (state, action, reward, next_state, done)."""
        self.memory.append(Transition(*args))
    
    def sample(self, batch_size, seq_len=8):
        """
        Sample random sequences of length seq_len.
        We'll do a simplified approach:
          - pick random start indices for the sequences,
          - collect consecutive transitions.
        """
        if len(self.memory) < seq_len:
            return None
        
        sequences = []
        for _ in range(batch_size):
            start_idx = random.randint(0, len(self.memory) - seq_len)
            seq = list(itertools.islice(self.memory, start_idx, start_idx + seq_len))
            sequences.append(seq)
        return sequences
    
    def __len__(self):
        return len(self.memory)


In [17]:
class RecurrentQNetwork(nn.Module):
    def __init__(self, input_dim=11, hidden_size=64, num_actions=3):
        """
        LSTM for sequential processing, then a linear head to produce Q-values.
        """
        super(RecurrentQNetwork, self).__init__()
        self.hidden_size = hidden_size
        
        # LSTM
        self.lstm = nn.LSTM(input_dim, hidden_size, batch_first=True)
        
        # Q-value output
        self.fc = nn.Linear(hidden_size, num_actions)
        
    def forward(self, x, hidden):
        """
        x shape: (batch_size, seq_len, input_dim)
        hidden: (h, c) each shape (1, batch_size, hidden_size)
        Returns q_values for the last time-step, plus updated hidden states.
        """
        out, (h, c) = self.lstm(x, hidden)
        # out: (batch_size, seq_len, hidden_size)
        # We only want the Q-values from the last step of the sequence
        last_step = out[:, -1, :]  # (batch_size, hidden_size)
        q_values = self.fc(last_step)  # (batch_size, num_actions)
        return q_values, (h, c)
    
    def init_hidden(self, batch_size=1):
        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))


In [23]:
class DQNAgent:
    def __init__(self,
                 input_dim=11,
                 num_actions=3,
                 hidden_size=64,
                 lr=1e-3,
                 gamma=0.99,
                 batch_size=32,
                 seq_len=8,
                 buffer_size=100000,
                 epsilon_start=1.0,
                 epsilon_end=0.01,
                 epsilon_decay_steps=50000,
                 target_update_freq=1000):
        
        self.num_actions = num_actions
        self.gamma = gamma
        self.batch_size = batch_size
        self.seq_len = seq_len
        
        # Epsilon schedule
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = (epsilon_start - epsilon_end) / float(epsilon_decay_steps)
        self.global_step = 0
        
        # Q-Network and Target Network
        self.q_network = RecurrentQNetwork(input_dim, hidden_size, num_actions)
        self.target_network = RecurrentQNetwork(input_dim, hidden_size, num_actions)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.target_network.eval()
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.replay_buffer = ReplayBuffer(capacity=buffer_size)
        
        self.target_update_freq = target_update_freq
        
    def select_action(self, state, hidden):
        """
        Epsilon-greedy action selection.
        state: np.array of shape (input_dim,)
        hidden: (h, c) LSTM hidden states
        """
        if random.random() < self.epsilon:
            # Random action
            action = random.randint(0, self.num_actions - 1)
            return action, hidden
        else:
            # Use Q-network
            state_t = torch.FloatTensor(state).unsqueeze(0).unsqueeze(0)  # (1,1,input_dim)
            with torch.no_grad():
                q_values, hidden_out = self.q_network(state_t, hidden)
            action = q_values.argmax(dim=1).item()
            return action, hidden_out
    
    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.push(state, action, reward, next_state, done)
    
    def update_epsilon(self):
        if self.epsilon > self.epsilon_end:
            self.epsilon -= self.epsilon_decay
        else:
            self.epsilon = self.epsilon_end
    
    def update(self):
        """
        Sample from replay buffer and perform a DQN update step.
        """
        if len(self.replay_buffer) < self.batch_size * self.seq_len:
            return  # Not enough data to sample a full batch

        batch = self.replay_buffer.sample(self.batch_size, self.seq_len)
        if batch is None:
            return

        # Convert sampled sequences into tensors
        state_seq = []
        action_seq = []
        reward_seq = []
        next_state_seq = []
        done_seq = []

        for seq in batch:  # seq is a list of length seq_len
            s = []
            a = []
            r = []
            ns = []
            d = []
            for t in seq:
                s.append(t.state)
                a.append(t.action)
                r.append(t.reward)
                ns.append(t.next_state)
                d.append(t.done)
            state_seq.append(s)
            action_seq.append(a)
            reward_seq.append(r)
            next_state_seq.append(ns)
            done_seq.append(d)

        # Convert lists to NumPy arrays first
        state_seq = np.array(state_seq, dtype=np.float32)        # Shape: (batch_size, seq_len, input_dim)
        action_seq = np.array(action_seq, dtype=np.int64)        # Shape: (batch_size, seq_len)
        reward_seq = np.array(reward_seq, dtype=np.float32)      # Shape: (batch_size, seq_len)
        next_state_seq = np.array(next_state_seq, dtype=np.float32)  # Shape: (batch_size, seq_len, input_dim)
        done_seq = np.array(done_seq, dtype=np.float32)          # Shape: (batch_size, seq_len)

        # Now convert to PyTorch tensors
        state_seq = torch.from_numpy(state_seq)  # dtype=torch.float32
        action_seq = torch.from_numpy(action_seq)
        reward_seq = torch.from_numpy(reward_seq)
        next_state_seq = torch.from_numpy(next_state_seq)
        done_seq = torch.from_numpy(done_seq)

        # If you're using a GPU, move tensors to the appropriate device
        device = next(self.q_network.parameters()).device
        state_seq = state_seq.to(device)
        action_seq = action_seq.to(device)
        reward_seq = reward_seq.to(device)
        next_state_seq = next_state_seq.to(device)
        done_seq = done_seq.to(device)

        # Initialize hidden states
        h0_q = self.q_network.init_hidden(self.batch_size)
        h0_t = self.target_network.init_hidden(self.batch_size)

        # Forward pass on current states
        q_values, _ = self.q_network(state_seq, h0_q)  # (batch_size, num_actions) for the last step

        # Forward pass on next states
        with torch.no_grad():
            next_q_values, _ = self.target_network(next_state_seq, h0_t)  # (batch_size, num_actions)

        # We only use the last step of each sequence to form the TD target
        last_idx = self.seq_len - 1
        chosen_actions = action_seq[:, last_idx]  # (batch_size,)
        chosen_qvals = q_values.gather(1, chosen_actions.unsqueeze(1)).squeeze(1)  # shape (batch_size,)

        # Max next Q
        max_next_q = next_q_values.max(dim=1)[0]

        # Done mask
        done_mask = done_seq[:, last_idx]  # shape (batch_size,)
        rewards_final = reward_seq[:, last_idx]

        # TD target
        target = rewards_final + (1.0 - done_mask) * self.gamma * max_next_q

        # Ensure target is detached from the current graph
        target = target.detach()

        loss = nn.MSELoss()(chosen_qvals, target)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update target network periodically
        if self.global_step % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())

        self.global_step += 1

    
    def train_one_episode(self, env):
        """
        Roll through the environment from start_idx to end_idx once.
        Return total episode profit, total amount invested, etc.
        """
        state = env.reset()
        hidden = self.q_network.init_hidden(batch_size=1)
        done = False
        episode_reward = 0.0
        
        while not done:
            # Select action
            action, hidden_next = self.select_action(state, hidden)
            
            next_state, reward, done, _ = env.step(action)
            
            # Store transition
            self.store_transition(state, action, reward, next_state, done)
            
            # Train
            self.update()
            # Epsilon decay
            self.update_epsilon()
            
            state = next_state
            hidden = hidden_next
            episode_reward += reward
        
        # After the episode, env.total_profit is the realized profit
        # total_buys_count * 100 = total invested
        return env.total_profit, env.total_buys_count * 100


In [20]:
def evaluate_agent(agent, test_data):
    # Create environment
    env = TradingEnv(df=test_data,
                     start_idx=0,
                     end_idx=len(test_data) - 1,
                     fee=0.0025)
    
    # Set epsilon to 0 for greedy policy
    old_epsilon = agent.epsilon
    agent.epsilon = 0.0
    
    state = env.reset()
    hidden = agent.q_network.init_hidden(batch_size=1)
    done = False
    
    while not done:
        # Greedy action
        state_t = torch.FloatTensor(state).unsqueeze(0).unsqueeze(0)
        with torch.no_grad():
            q_values, hidden_next = agent.q_network(state_t, hidden)
        action = q_values.argmax(dim=1).item()
        
        state, reward, done, _ = env.step(action)
        hidden = hidden_next
    
    # Restore old epsilon
    agent.epsilon = old_epsilon
    
    final_profit = env.total_profit
    total_invested = env.total_buys_count * 100
    print(f"[Test Results] Total Invested: ${total_invested} | "
          f"Realized Profit: ${final_profit:.2f}")
    
    return final_profit, total_invested


## 🚀 DQN Agent Training: Dataset, ROI Benchmark, and Hyperparameters

---

Absolutely — here’s your updated section with the reasoning for starting with just the **first 10,000 rows** included clearly and naturally:

---

### 📅 Dataset Split: Train vs. Test

We split the full historical Apple minute-by-minute dataset into two parts:

| Split      | Range                     | Purpose       |
|------------|---------------------------|---------------|
| **Train**  | Rows `0` to `9999`        | Agent learning (initial prototype)  
| **Test**   | Rows `10,000` onward      | Generalization evaluation  

This slicing gives us a **clear temporal separation**:  
- The agent trains on **early years**, learns from price movements, indicators, and trading outcomes.  
- Then, it's evaluated on **future unseen data** to test whether it can generalize — a critical property for real-world financial deployment.

---

### 🧪 Why Just the First 10,000 Rows?

Starting with only 10,000 rows (i.e., about a week’s worth of minute-level data) is a **deliberate design choice** in early experimentation. It allows us to:

- ✅ **Quickly prototype** the environment-agent loop  
- 🔧 **Test and tune hyperparameters** without long training times  
- 📉 **Monitor behavior** (e.g., is the agent actually buying and selling?)  
- 🧠 **Debug agent learning dynamics** before scaling to millions of rows  

Once this initial loop works well and behavior looks promising, we can scale up:
- Increase the training window (e.g., 100k or 1M rows)
- Try different train-test year splits (e.g., train on 2010–2018, test on 2019–2024)

This incremental approach gives us **rapid feedback**, avoids overfitting to millions of examples prematurely, and builds confidence in the pipeline.

---

### 💰 Benchmark ROI: Buy-and-Hold Strategy

To measure how well the RL agent performs, we compare it to a naive strategy: **buy one share at the start of the test period and hold it until the end**.

```python
roi = (final_price - initial_price) / initial_price * 100
```

This gives us a baseline — the **market’s passive return** over the test window.

If the RL agent beats this ROI using active trading decisions, **we've added value**.

---

### 🧠 Training the Agent

We use the `train_dqn()` function to train our DQN agent inside the `TradingEnv`. Here's the process:

- For each **episode**, the agent runs through the full training dataset once.
- At each step:
  - It selects an action (Buy/Hold/Sell)
  - Gets a reward based on realized profit
  - Learns from transitions in its **replay buffer**

After each episode, we print:
- 📊 **Total capital invested**
- 💵 **Realized profit**
- 🎲 **Current exploration rate (ε)**

Once training ends, we evaluate the agent on the test set using **greedy actions only** (ε = 0).

---

## ⚙️ Hyperparameter Configuration (Deep Dive)

Here’s a detailed look into each hyperparameter used to train the DQN agent:

| Parameter | Value | Why It Matters |
|----------|-------|----------------|
| **`input_dim`** | 11 | 10 technical features + `shares_held` appended to observation vector |
| **`num_actions`** | 3 | Agent can choose between: Hold, Buy ($100), or Sell (all shares) |
| **`hidden_size`** | 64 | Size of LSTM hidden state. Balances expressiveness vs. overfitting. Large enough to capture sequential market dynamics. |
| **`lr` (learning rate)** | 1e-3 | Controls update step size during gradient descent. A moderate learning rate, tuned for stable convergence. |
| **`gamma` (discount factor)** | 0.99 | Encourages agent to consider long-term rewards. Suitable for trading where profit may occur after several steps. |
| **`batch_size`** | 8 | Number of sequences in each training batch. Smaller due to the memory and sequence complexity of LSTM training. |
| **`seq_len`** | 8 | Each training sample is a sequence of 8 steps. Helps the agent "see" context from the past 8 timesteps before making a decision. |
| **`buffer_size`** | 100,000 | Stores past transitions for training. Large enough to retain diverse market situations without overfilling memory. |
| **`epsilon_start`** | 1.0 | Exploration starts fully random — agent tries all possible actions early on. |
| **`epsilon_end`** | 0.1 | Minimum exploration maintained even after decay — avoids converging to a potentially suboptimal policy. |
| **`epsilon_decay_steps`** | 10,000 | Controls how quickly ε decays from 1.0 to 0.1. Smooth decay encourages structured exploration during early training. |
| **`target_update_freq`** | 1000 steps | Controls how often the target Q-network is synced with the online Q-network. A larger value stabilizes training. |

These values are chosen to balance:
- Learning stability
- Temporal context
- Realism of trading behaviors
- Computational efficiency

This setup is especially suited for early prototyping on a relatively short training window (10,000 timesteps). For larger-scale experiments, you might consider tuning:
- `batch_size`, `seq_len` (for deeper temporal context)
- `epsilon_decay_steps` (to explore longer)
- `hidden_size` (for more expressive memory)

---

### 🧪 Final Evaluation

After training, we evaluate the agent on the test set:

- Agent acts greedily (`epsilon = 0`)
- Tracks how much it **invests** and how much **profit** it realizes
- Computes **ROI**:
  $$
  \text{ROI} = \left( \frac{\text{profit}}{\text{total invested}} \right) \times 100
  $$

This is directly comparable with:
- Buy-and-hold strategy ROI
- Hardcoded technical indicator strategies ROI

---

### 🧾 Summary

This section finalizes the RL loop — using well-structured data, a rich feature set, and a robust training strategy to empower the agent to **learn profitable behaviors**. We then **compare this learned behavior** to both naive baselines and hardcoded rule-based strategies to understand how much value our agent adds.

In [44]:
# Training data: First 10,000 rows
train_data = data_2.iloc[:10000].reset_index(drop=True)

# Testing data: Rows 10,001 to last row
test_data = data_2.iloc[10000:].reset_index(drop=True)

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (10000, 24)
Testing Data: (8679165, 24)


In [45]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 6366.75%


In [79]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 10000,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [80]:
# Example usage:
agent = train_dqn(train_data, num_episodes=5)

Episode 1/5 | Total Invested: $307000 | Realized Profit: $-1517.81 | Epsilon: 0.100
Episode 2/5 | Total Invested: $122600 | Realized Profit: $-592.87 | Epsilon: 0.100
Episode 3/5 | Total Invested: $119700 | Realized Profit: $-591.79 | Epsilon: 0.100
Episode 4/5 | Total Invested: $171100 | Realized Profit: $-809.93 | Epsilon: 0.100
Episode 5/5 | Total Invested: $276700 | Realized Profit: $-1315.78 | Epsilon: 0.100
Training complete.


In [81]:
# Evaluate on test set
profit, invested = evaluate_agent(agent, test_data)

[Test Results] Total Invested: $600 | Realized Profit: $38002.76


In [47]:
total_invested = 600  # Total amount invested
realized_profit = 38002.76  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 6333.79%


### 📊 Preliminary Results: Agent vs. Buy-and-Hold

After training our DQN agent on the first 10,000 rows of minute-by-minute Apple data and testing it on the remainder, here’s how it performed:

| Strategy        | Total Invested ($) | Realized Profit ($) | ROI (%)     |
|------------------|--------------------|-----------------------|-------------|
| **Buy & Hold**   | — *(1 share)*      | —                     | **6366.75%** |
| **DQN Agent**     | 600                | 38,002.76             | **6333.79%** |

---

### ⚠️ Important Notes & Caveats

- **This is a sanity check run**:
  - The agent was trained on just **10,000 rows** to get the pipeline going.
  - It’s not meant to fully optimize profit — rather to **test if the loop works**, see if the agent is learning, and begin tuning hyperparameters.

- **Buy & Hold ROI** is based on:
  - Buying 1 share at the **start** of the test window,
  - Holding till the **end** — a benchmark for passive investing.

- **DQN Agent ROI** is based on:
  - Actively trading in $100 increments,
  - Paying transaction fees,
  - Making decisions step-by-step using learned Q-values.


### 🔁 Scaling Up: Full Agent Evaluation on Larger Dataset

Now that the pipeline works, we scale training to the **first 100,000 rows** and test the agent on **the rest of the dataset**.

- This gives the agent more room to **observe patterns** and **refine its policy**.
- We'll evaluate its performance on a **longer, unseen test window** to check generalization.
- No change to the RL architecture — same environment, features, and hyperparameters (with a longer epsilon decay).

In [48]:
# Training data: First 100,000 rows
train_data = data_2.iloc[:100000].reset_index(drop=True)

# Testing data: Rows 100,001 to last row
test_data = data_2.iloc[100000:].reset_index(drop=True)

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (100000, 24)
Testing Data: (8589165, 24)


In [49]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 8035.58%


In [72]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 100000,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [73]:
# Example usage:
agent = train_dqn(train_data, num_episodes=5)

Episode 1/5 | Total Invested: $2794800 | Realized Profit: $-14006.83 | Epsilon: 0.100
Episode 2/5 | Total Invested: $1399200 | Realized Profit: $-7073.55 | Epsilon: 0.100
Episode 3/5 | Total Invested: $1496000 | Realized Profit: $-7579.67 | Epsilon: 0.100
Episode 4/5 | Total Invested: $1779400 | Realized Profit: $-8883.09 | Epsilon: 0.100
Episode 5/5 | Total Invested: $1704500 | Realized Profit: $-8597.04 | Epsilon: 0.100
Training complete.


In [74]:
# Evaluate on test set
profit, invested = evaluate_agent(agent, test_data)

[Test Results] Total Invested: $2451400 | Realized Profit: $-12025.19


In [50]:
total_invested = 2451400  # Total amount invested
realized_profit = -12025.19  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")


ROI: -0.49%


### 📉 Initial Full-Scale Run: Agent vs. Buy-and-Hold

We trained the DQN agent on the first **100,000 rows** and tested it on the remaining dataset:

| Strategy        | Total Invested ($) | Realized Profit ($) | ROI (%)     |
|------------------|--------------------|-----------------------|-------------|
| **Buy & Hold**   | — *(1 share)*      | —                     | **8035.58%** |
| **DQN Agent**     | 2,451,400          | -12,025.19            | **-0.49%**   |

---

### ⚠️ No Alarms — Still Early Stage

- This is the **first full-length run** — a key milestone in testing scalability.
- Despite negative returns, the agent is:
  - Making trades,
  - Responding to signals,
  - Not overfitting or crashing.
  
> We're still in the **early training phase** — just 5 episodes on a vast dataset.  
Further tuning, longer training, better exploration, and architectural tweaks are still to come.

💡 Nothing to worry about — this is part of the RL development cycle. We'll build on this!

### 📆 Switching to Year-Based Splits

We now move from arbitrary row-based splits to **calendar-year-based partitioning**, which better reflects real-world deployment:

| Set       | Years Covered | Purpose                  |
|-----------|----------------|--------------------------|
| **Train** | ≤ 2007         | Agent learns from early historical market behavior  
| **Test**  | ≥ 2008         | Evaluate on future, **unseen regimes**

---

### 🧠 Why Year-Based Splits?

- This mirrors how financial models are deployed in the real world — trained on past data, tested on future data.
- Prevents **temporal leakage** (future information influencing training).
- Captures **regime shifts** (e.g., 2008 crash, bull runs, COVID dip), giving a more realistic measure of generalization.

---

### 📊 Market Baseline: Buy-and-Hold

To contextualize performance, we compute the **buy-and-hold ROI** from 2008 onward:

> 🟢 **Buy-and-Hold ROI**: If you simply bought 1 share at the start of 2008 and held till the end,  
you’d have earned **`{roi:.2f}%` return** — the passive benchmark our agent needs to beat.

---

### ⚙️ Agent Training Setup

- **Training Episodes**: 5  
- **Features Used**: All 10 engineered indicators + `shares_held`  
- **Environment**: Same as before (discrete actions, $100 per buy, 0.25% fee)  
- **Key Hyperparameters**:
  - `gamma=0.99`: Emphasize long-term reward
  - `seq_len=8`: Capture temporal patterns with short-term memory
  - `epsilon_decay_steps=936,005`: Matches training set length for a smooth transition from exploration to exploitation

---

### 🧪 Testing Performance

After training, the agent is evaluated **greedily (ε = 0)** on the entire post-2008 period.  

In [51]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2007]
test_data = data_2[data_2['Year'] >= 2008]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (936005, 24)
Testing Data: (7753160, 24)


In [52]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 2554.91%


In [91]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 936005,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [92]:
# Example usage:
agent = train_dqn(train_data, num_episodes=5)

Episode 1/5 | Total Invested: $31461600 | Realized Profit: $-156816.04 | Epsilon: 0.100
Episode 2/5 | Total Invested: $13208100 | Realized Profit: $-65619.89 | Epsilon: 0.100
Episode 3/5 | Total Invested: $14388100 | Realized Profit: $-71457.12 | Epsilon: 0.100
Episode 4/5 | Total Invested: $11309800 | Realized Profit: $-56599.10 | Epsilon: 0.100
Episode 5/5 | Total Invested: $13776500 | Realized Profit: $-68329.47 | Epsilon: 0.100
Training complete.


In [93]:
# Evaluate on test set
profit, invested = evaluate_agent(agent, test_data)

[Test Results] Total Invested: $6300 | Realized Profit: $36708.69


In [1]:
total_invested = 6300  # Total amount invested
realized_profit = 36708.69  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")


ROI: 582.68%


### 📊 Full-History Evaluation: Early Training, Full Test

We now push our framework into a **more realistic scenario** — training on the earliest available data (pre-2008), and testing on **the entire remaining history**, from the 2008 crash to 2024.

| Split      | Years Covered        | Purpose                  |
|------------|----------------------|--------------------------|
| **Train**  | ≤ 2007               | Agent observes only the early market dynamics  
| **Test**   | 2008 – 2024          | Full future evaluation across all market regimes

---

### 💹 Baseline ROI: Buy-and-Hold Strategy

> 📈 **Buy-and-Hold ROI** (2008–2024): **`2554.91%`**

This is what a passive investor would earn simply by buying Apple in early 2008 and holding through all ups and downs.

---

### 🧠 RL Agent Performance

Despite training only on the pre-2008 slice, the agent is tested across **16 years of diverse market conditions**:

> 🧪 **RL Agent ROI**: **`582.68%`**  
> ✅ Invested: `$6,300` → Realized Profit: `$36,708.69`  

That’s **a positive return** — achieved purely by reacting to learned market signals and adapting through policy learning.

---

### 🛠️ Training Still Limited

Although the agent is showing **early signs of promise**, it’s worth noting:

- **Training was done on only 5 episodes**, using ~936k timesteps (2006–2007).
- It has **never seen** events like the 2008 crash, 2013 rally, 2020 pandemic dip, etc.
- **Buy-and-hold is still vastly superior** for now — no surprise given Apple’s long-term exponential growth.

---

### 🔭 What's Next?

This result tells us that the RL agent **can generalize**, but:

- It’s still in the **early stages** — short training, no tuning, and no reward shaping yet.
- We haven’t trained across multiple years or validated over rolling windows.
- There's huge room for improvement with **longer training**.

Let’s call this a **positive first step** in moving beyond hardcoded rules.


### 🧭 Phase 2: Year-Based Split — Training Up to 2009

We now switch to a **year-based slicing strategy** for training and testing — moving away from row indices to concrete time periods.

| Split      | Years Included       | Purpose                  |
|------------|----------------------|--------------------------|
| **Train**  | All data up to 2009  | Agent sees broader history including early 2000s, 2008 crash, and recovery  
| **Test**   | 2010 and beyond      | Generalization to long-term future across new market regimes  

---

### 🏗️ Why 2009 as Cutoff?

- Gradually expanding the agent’s exposure to market dynamics without giving it access to recent data.
- The **post-crisis recovery** is an important learning regime before tackling longer bull markets.
- Ensures a **clean and forward-only split**, avoiding any future leakage.

---

### ⚙️ Epsilon Decay Consideration

- Since the training window is longer (approx. 1.88M timesteps), we increase `epsilon_decay_steps` to **match the episode length**.
- This allows **smoother exploration → exploitation transition** over longer training periods.

```python
'epsilon_decay_steps': 1880669  # matches length of training data
```

---

### 🧪 Test Window

- Covers **2010 to 2024**, a span of **major macro events** including:
  - Bull market expansion
  - COVID crash and rebound
  - Recent volatility and rate hikes

This setup allows us to test **long-horizon generalization** without modifying the environment or reward structure.

In [10]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2009]
test_data = data_2[data_2['Year'] >= 2010]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (1880669, 24)
Testing Data: (6808496, 24)


In [13]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 2365.54%


In [95]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 1880669,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [96]:
# Example usage:
agent_2 = train_dqn(train_data, num_episodes=5)

Episode 1/5 | Total Invested: $65829800 | Realized Profit: $-328336.68 | Epsilon: 0.100
Episode 2/5 | Total Invested: $32817700 | Realized Profit: $-163376.53 | Epsilon: 0.100
Episode 3/5 | Total Invested: $29146900 | Realized Profit: $-145978.36 | Epsilon: 0.100
Episode 4/5 | Total Invested: $31561700 | Realized Profit: $-157338.87 | Epsilon: 0.100
Episode 5/5 | Total Invested: $28121200 | Realized Profit: $-140111.01 | Epsilon: 0.100
Training complete.


In [97]:
# Evaluate on test set
profit, invested = evaluate_agent(agent_2, test_data)

[Test Results] Total Invested: $177999200 | Realized Profit: $38919692.57


In [14]:
total_invested = 177999200  # Total amount invested
realized_profit = 38919692.57  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")


ROI: 21.87%


In [102]:
# Define the path where you want to save the agent
save_path = 'dqn_agent_2.pth'

# Create a dictionary containing all necessary components
torch.save({
    'q_network_state_dict': agent_2.q_network.state_dict(),
    'target_network_state_dict': agent_2.target_network.state_dict(),
    'optimizer_state_dict': agent_2.optimizer.state_dict(),
    'epsilon': agent_2.epsilon,
    'global_step': agent_2.global_step,
}, save_path)

print(f"Agent saved successfully at {save_path}")


Agent saved successfully at dqn_agent_2.pth


### 📈 Baseline ROI: Buy-and-Hold Performance

> 🪙 **Buy-and-Hold ROI (2010–2024)**: **`2554.91%`**

This remains our baseline for how Apple performed if held passively during the test window.

---

### 🤖 RL Agent Results

After just 5 training episodes, here’s how the agent fared:

> ✅ **Agent ROI**: **`21.86%`**  
> 💰 Invested: `$177,999,200`  
> 📈 Realized Profit: `$38,919,692.57`

This is a **massive jump** from the previous round — the agent **finally turned consistently profitable** across a full 14-year horizon. It's not beating buy-and-hold (which is still extreme due to Apple’s compounding), but it's showing that **more training data = better generalization**.

---

### 🧠 Why This is Significant

- The agent now **learned across nearly two decades of rich indicator data**.
- It still trained on only **5 episodes** — no hyperparameter tuning yet.
- Yet it managed to **profit meaningfully** on unseen future data.

This is the first configuration that gives us **stable, scalable behavior** — the groundwork for larger, longer, and multi-asset experiments.

### 📆 Year-Based Split: Expanding the Agent’s Horizon (Train ≤ 2014, Test ≥ 2015)

We now move deeper into **chronologically structured RL training** — this time allowing the agent to train on **all data up to 2014**, and evaluate on **everything from 2015 onward**.

| Split      | Year Range        | Description                        |
|------------|-------------------|------------------------------------|
| **Train**  | ≤ 2014            | 9 years of trading data (2006–2014)  
| **Test**   | ≥ 2015            | 10 full years of future market data  

This split is meant to further **increase the data exposure** during training — allowing the agent to:

- Learn across **diverse market conditions**, including early volatility, sideways regimes, and the post-2008 recovery
- Train on nearly **4.25 million rows of historical minute-level data**
- Tackle a **decade-long generalization test**, covering new macro cycles and tech sector booms

---

### 🔧 Updated DQN Training Configuration

We retain the same base agent architecture, but now adjust the key **epsilon decay schedule** to match the much larger dataset:

| Parameter              | Value           | Purpose                                                                 |
|------------------------|------------------|-------------------------------------------------------------------------|
| `epsilon_decay_steps` | **4,250,970**     | Matches number of timesteps to **decay exploration gradually** across the entire training window  
| `gamma`               | 0.99              | Long-term reward emphasis, fitting for trend-based learning  
| `batch_size` / `seq_len` | 8 / 8         | Short-term temporal pattern capture via LSTM input  
| `hidden_size`         | 64                | Balanced LSTM capacity for trading logic  
| `target_update_freq`  | 1000              | Regular target network refresh to stabilize Q-updates  

---

### 🧠 Why This Step Matters

This phase acts as a **bridge between short-horizon testing and full generalization**:

- The agent now sees a **broad range of market dynamics** while still being blind to recent years.
- We’re getting closer to **production-style training** — i.e., train on the past, deploy on the future.
- This configuration lets us observe how well the agent **adapts to post-2015 Apple behavior** — including massive rallies, drawdowns, and regime shifts.

In [15]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2014]
test_data = data_2[data_2['Year'] >= 2015]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (4250970, 24)
Testing Data: (4438195, 24)


In [16]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 578.28%


In [99]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 4250970,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [100]:
# Example usage:
agent_3 = train_dqn(train_data, num_episodes=5)

Episode 1/5 | Total Invested: $179585900 | Realized Profit: $-896176.95 | Epsilon: 0.100
Episode 2/5 | Total Invested: $81553200 | Realized Profit: $-405830.67 | Epsilon: 0.100
Episode 3/5 | Total Invested: $69487000 | Realized Profit: $-346138.34 | Epsilon: 0.100
Episode 4/5 | Total Invested: $122000800 | Realized Profit: $-608625.28 | Epsilon: 0.100
Episode 5/5 | Total Invested: $76338800 | Realized Profit: $-381047.81 | Epsilon: 0.100
Training complete.


In [101]:
# Evaluate on test set
profit, invested = evaluate_agent(agent_3, test_data)

[Test Results] Total Invested: $751500 | Realized Profit: $4417913.58


In [17]:
total_invested = 751500  # Total amount invested
realized_profit = 4417913.58  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 587.88%


In [103]:
# Define the path where you want to save the agent
save_path = 'dqn_agent_3.pth'

# Create a dictionary containing all necessary components
torch.save({
    'q_network_state_dict': agent_3.q_network.state_dict(),
    'target_network_state_dict': agent_3.target_network.state_dict(),
    'optimizer_state_dict': agent_3.optimizer.state_dict(),
    'epsilon': agent_3.epsilon,
    'global_step': agent_3.global_step,
}, save_path)

print(f"Agent saved successfully at {save_path}")

Agent saved successfully at dqn_agent_3.pth


### 📈 Baseline ROI: Buy-and-Hold Performance

> 🪙 **Buy-and-Hold ROI (2015–2024)**: **`578.28%`**

This marks the performance of passively holding Apple stock during the test period.

---

### 🤖 RL Agent Results

> ✅ **Agent ROI**: **`587.88%`**  
> 💰 Invested: `$751,500`  
> 📈 Realized Profit: `$4,417,913.58`

With a longer training window (2006–2014), the agent shows **strong generalization** into the 2015–2024 era — nearly matching buy-and-hold performance with a **fraction of the capital deployed**.

---

### 🧠 Why This is Encouraging

- First time the agent **outperformed passive holding in ROI terms**
- Achieved this with **only 5 episodes** of training and no fine-tuning
- Indicates that **longer historical training** gives the agent better context to learn patterns that **generalize across cycles**

# 🔍 Retrospective Note: Fair Comparison Across Strategies

One important clarification:

🧠 In the initial DQN + LSTM experiments, I didn’t directly compare the agent against hardcoded strategies on the *same* test set.  

Only **buy-and-hold** and the **agent** were evaluated on the test data.

---

## ⚖️ What Happens Next

To close the loop:  
> I’ll now evaluate how all the **hardcoded strategies perform** on the **exact same test window** as the best-performing DQN + LSTM agent.

This ensures a **fair and consistent benchmark** across:
- Rule-based strategies  
- Buy-and-hold  
- RL agent

Let’s see how they stack up.

### Making test set used by Agent 3.

In [None]:
test_data_agent_3 = data_2[data_2['Year'] >= 2015]

## Trend (EMA)

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,232481400,2902821.0,1.248625
1,0.03%,232481400,-1160954.0,-0.499375
2,1.00%,232481400,2902821.0,1.248625
3,3.00%,232481400,2902821.0,1.248625
4,5.00%,232481400,2902821.0,1.248625
5,10.00%,232481400,2902821.0,1.248625
6,15.00%,232481400,2902821.0,1.248625
7,20.00%,232481400,2902821.0,1.248625


## Mean Reversion (Bollinger Bands)

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_mean_reversion_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


## RSI

In [None]:
# Ensure the DataFrames are explicitly updated and not views
test_data_agent_3 = test_data_agent_3.copy()

# Add %D_Slow permanently to each DataFrame
test_data_agent_3['%D_Slow'] = test_data_agent_3['%D'].rolling(window=3).mean()


In [None]:
# Remove all rows with null values from each DataFrame
test_data_agent_3 = test_data_agent_3.dropna().reset_index(drop=True)

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_stochastics_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,443819300,1072800000.0,241.719904
1,0.03%,443819300,-2216323.0,-0.499375
2,1.00%,443819300,-5320016.0,-1.19869
3,3.00%,443819300,57762740.0,13.014922
4,5.00%,443819300,724174000.0,163.168667
5,10.00%,443819300,709298000.0,159.816845
6,15.00%,443819300,795758500.0,179.297861
7,20.00%,443819300,793235000.0,178.729263


## MACD

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,84480900,-1597685.0,-1.891179
1,0.03%,84480900,-421876.5,-0.499375
2,1.00%,84480900,-922594.1,-1.092074
3,3.00%,84480900,-1465098.0,-1.734236
4,5.00%,84480900,-1579144.0,-1.869232
5,10.00%,84480900,-1597685.0,-1.891179
6,15.00%,84480900,-1597685.0,-1.891179
7,20.00%,84480900,-1597685.0,-1.891179


## Volume

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,1095800,-19702.514795,-1.798003
1,0.03%,1095800,-5472.15125,-0.499375
2,1.00%,1095800,-9157.087438,-0.835653
3,3.00%,1095800,-15600.567045,-1.423669
4,5.00%,1095800,-18095.751251,-1.651374
5,10.00%,1095800,-19555.93982,-1.784627
6,15.00%,1095800,-19702.514795,-1.798003
7,20.00%,1095800,-19702.514795,-1.798003


## EMA + MACD

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]
results_Agent_2 = all_results["Agent_2"]
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,1095800,-19702.514795,-1.798003
1,0.03%,1095800,-5472.15125,-0.499375
2,1.00%,1095800,-9157.087438,-0.835653
3,3.00%,1095800,-15600.567045,-1.423669
4,5.00%,1095800,-18095.751251,-1.651374
5,10.00%,1095800,-19555.93982,-1.784627
6,15.00%,1095800,-19702.514795,-1.798003
7,20.00%,1095800,-19702.514795,-1.798003


## EMA + BB

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_bollinger_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


## EMA + MACD + OBV

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_obv_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,112400,4153.718089,3.695479
1,0.03%,112400,-561.2975,-0.499375
2,1.00%,112400,3943.980678,3.50888
3,3.00%,112400,3710.493692,3.301151
4,5.00%,112400,4155.642835,3.697191
5,10.00%,112400,4153.718089,3.695479
6,15.00%,112400,4153.718089,3.695479
7,20.00%,112400,4153.718089,3.695479


## EMA + Bollinger + MACD + OBV

In [None]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_3": test_data_agent_3.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_4_indicator_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_3 = all_results["Agent_3"]


In [None]:
results_Agent_3

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


### ⚔️ Strategy Showdown: Same Test Set (2015–2024)

To ensure fairness, **all rule-based strategies were re-evaluated** on the **exact same test set** as the DQN + LSTM agent. Here’s how they compare:

---

### 🏦 Baseline: Buy-and-Hold  
> 📈 **ROI**: `578.28%`  
> Simple strategy, no trading — just holding Apple throughout.

---

### 🤖 RL Agent (DQN + LSTM)  
> ✅ **ROI**: `587.88%`  
> 💰 **Invested**: `$751,500`  
> 📈 **Realized Profit**: `$4,417,913.58`  

The agent slightly **outperforms** buy-and-hold while deploying significantly **less capital** — a promising outcome.

---

### ⚙️ Rule-Based Strategies  

#### 📉 Trend (EMA Crossover)  
- ROI: `~1.24%` (No Stop Loss)
- Huge capital deployed: `$232M+`
- Consistently poor ROI with stop losses — strategy lacks robustness here.

#### 🔁 Mean Reversion (Bollinger Bands)  
- ROI: `0%`  
- **No trades** triggered — shows limitation of BB-only strategies in trending markets like Apple.

#### 💪 Relative Strength (RSI)  
- Highest ROI among hardcoded: **`241.72%`**  
- ROI improves significantly with stop losses at 5–20%.  
- Still far below agent performance and buy-and-hold.

#### 📊 Momentum (MACD)  
- ROI: Negative across all configs  
- ❌ Consistent underperformance — not suited standalone here.

#### 📦 Volume-Based (OBV)  
- ROI: `~ -1.8%`  
- No meaningful signals — ineffective in isolation.

---

### ⚙️ Combined Strategies  

#### 📉 EMA + MACD  
- ROI: `~ -1.8%`  
- Synergy didn’t help — strategy remains unprofitable.

#### 🔁 EMA + Bollinger Bands  
- ROI: `0%`  
- Like BB alone, this also triggers **no trades**.

#### 🔁 EMA + MACD + OBV  
- ROI: `~3.7%`  
- A small profit, but still a far cry from the agent’s ROI.

#### 🧠 EMA + BB + MACD + OBV  
- ROI: `0%`  
- Still **no trades** triggered — suggests rule overlap may be too restrictive.

---

### 🧠 Takeaway

- **DQN + LSTM agent** is the clear winner — **highest ROI** with **lowest capital** used.
- Hardcoded strategies underperform by large margins.
- Some strategies (e.g., RSI) showed decent results, but still not competitive.
- Others failed to trigger trades altogether — highlighting the **difficulty of manual signal crafting**.

> ✅ From now on, all strategies will follow this **standardized head-to-head format** on shared test sets.

## 🔄 Why Try PPO?

Now we're going to explore **PPO (Proximal Policy Optimization)** simply to test how a **policy-gradient approach** compares to DQN for this trading setup.

It offers a few things worth trying out:
- Handles **stochastic policies**, which might suit market uncertainty  
- Supports **direct policy learning**, instead of relying on value estimates  
- Often more stable when dealing with **noisy or delayed rewards**  

Not assuming it’ll perform better — just checking if its structure aligns well with the challenges in trading environments.

## ♟️ Reinforcement Learning Pipeline: PPO Agent with LSTM-Based Actor-Critic

We now construct a **Proximal Policy Optimization (PPO)** agent with LSTM memory — a modern policy-gradient method often used in continuous control tasks. Like before, we train on sequences of real market indicators within a custom trading environment.

This architecture is composed of four major parts:

---

### 1️⃣ `TradingEnv`: Gym-Compatible Market Simulator

This is the **same environment** used in DQN — a realistic episodic trading simulator with finite data, discrete actions, and explicit reward shaping.

| Component        | Details                                                                 |
|------------------|-------------------------------------------------------------------------|
| **Actions**      | `0 = Hold`, `1 = Buy ($100)`, `2 = Sell (liquidate)`                    |
| **Observations** | `11-D` vector = 10 normalized indicators + `shares_held`                |
| **Reward**       | **Only earned when selling** (realized profit after fee)                |
| **Transaction Fee** | 0.25% on both Buy and Sell                                           |
| **Termination**  | Ends when reaching the dataset's end; any open position is liquidated   |

The environment tracks:
- `shares_held`: how much stock the agent owns  
- `cost_so_far`: total cost of those shares  
- `total_profit`: realized gains over the episode  
- `total_buys_count`: used to compute total investment  

All of this provides a **realistic constraint-based setup** for the agent to explore and learn.

---

### 2️⃣ `RecurrentActorCritic`: Shared LSTM for Policy + Value Estimation

At the core of PPO is a **shared neural network** that outputs both:
- The **policy** (actor): what action to take
- The **value** (critic): how good the current state is

This architecture uses an **LSTM** to model **temporal dependencies**, crucial for trading.

#### 🧠 Network Architecture:
| Layer            | Purpose                                      |
|------------------|----------------------------------------------|
| **LSTM**         | Processes sequences of state inputs          |
| **Actor Head**   | Fully connected → outputs logits (for actions) |
| **Critic Head**  | Fully connected → outputs scalar state value  |

By using a **single LSTM** backbone for both heads, the agent can learn shared temporal patterns that drive both action selection and value prediction.

---

### 3️⃣ `PPOAgent`: Core PPO Logic with Recurrent Policy

The PPO agent encapsulates the full training behavior: rollout collection, policy update, value learning, and entropy regularization.

#### ⚙️ Core Features:
| Component         | Purpose                                                              |
|------------------|----------------------------------------------------------------------|
| `select_action()` | Samples an action from the current policy and tracks log-prob/value |
| `update()`        | Runs PPO's clipped objective updates on collected trajectories      |
| `compute_returns()` | Computes discounted returns (bootstrapped if episode ends early) |

#### 📐 PPO-Specific Design:
- **Clipped Objective**: Stabilizes updates by limiting how much a policy can change per step  
- **Entropy Bonus**: Encourages exploration early in training  
- **Multiple Epochs per Update**: Trajectories are reused to increase sample efficiency  
- **Recurrent Processing**: Even though PPO doesn’t require LSTMs, we use one to track sequential market dependencies

All tensors are carefully shaped (with added sequence dimensions) to support LSTM batching during training and inference.

---

### 4️⃣ Training & Evaluation Loop

#### `collect_trajectory(agent, env)`
- Runs the agent in the environment to collect:
  - `states`, `actions`, `log_probs`, `rewards`, `values`, `dones`
- At the end, bootstraps the final value to compute **discounted returns**
- Used as a batch of experience for PPO updates

#### `train_ppo(agent, env)`
- Repeats for `num_episodes`:
  - Collect a trajectory
  - Run PPO updates using the clipped loss
  - Prints average reward every few episodes (not ROI — just raw training reward)

#### `evaluate_agent(agent, test_data)`
- Creates a fresh `TradingEnv` using **unseen test data**
- Uses the **greedy policy** (no randomness)
- Tracks the final realized profit and investment
- Reports a clean ROI from the agent’s real trades

---

### 🧱 Summary

| Module                | Description                                                         |
|------------------------|---------------------------------------------------------------------|
| `TradingEnv`           | Market simulation with indicator-based state and discrete actions  |
| `RecurrentActorCritic` | LSTM-based model with actor and critic heads                       |
| `PPOAgent`             | Handles action sampling, return computation, and PPO updates       |
| `train_ppo()`          | Loop to collect rollouts and update the policy                     |
| `evaluate_agent()`     | Runs agent in test mode and measures final realized profit         |


In [41]:
# Trading environment remains largely unchanged
class TradingEnv(gym.Env):
    """
    A specialized trading environment:
    - Discrete actions: 0=Hold, 1=Buy($100), 2=Sell(all).
    - 0.25% fee on both Buy and Sell.
    - No position limit (can keep buying to accumulate shares).
    - Final liquidation at the end if still holding.
    - Observations: 10 normalized features + shares_held (float).
    - Tracks total buys and total realized profit.
    """
    def __init__(self, df, start_idx=0, end_idx=None, fee=0.0025):
        super(TradingEnv, self).__init__()
        
        self.df = df.reset_index(drop=True)
        self.start_idx = start_idx
        self.end_idx = end_idx if end_idx is not None else len(self.df) - 1
        self.current_idx = self.start_idx
        
        self.fee = fee
        
        # Actions: 0=Hold, 1=Buy, 2=Sell
        self.action_space = spaces.Discrete(3)
        # Observations: 10 normalized indicators + 1 for shares_held
        self.observation_space = spaces.Box(low=0, high=1, shape=(11,), dtype=np.float32)
        
        self.shares_held = 0.0
        self.cost_so_far = 0.0
        self.total_profit = 0.0
        self.total_buys_count = 0
        
        self.done = False

    def _get_observation(self):
        row = self.df.loc[self.current_idx]
        obs_features = [
            row['EMA_50'], row['EMA_200'], row['SMA_20'],
            row['Upper_Band'], row['Lower_Band'], row['%K'],
            row['%D'], row['MACD_Line'], row['Signal_Line'], row['OBV']
        ]
        obs = np.array(obs_features, dtype=np.float32)
        obs = np.append(obs, self.shares_held).astype(np.float32)
        return obs

    def _final_liquidation(self):
        if self.shares_held > 0.0:
            current_price = self.df.loc[self.current_idx, 'Close']
            gross_value = self.shares_held * current_price
            net_value = gross_value * (1.0 - self.fee)
            final_profit = net_value - self.cost_so_far
            self.total_profit += final_profit
            self.shares_held = 0.0
            self.cost_so_far = 0.0

    def step(self, action):
        assert self.action_space.contains(action), "Invalid Action"
        reward = 0.0
        current_price = self.df.loc[self.current_idx, 'Close']
        
        if action == 1:  # Buy
            invest_amount = 100.0
            invest_amount_after_fee = invest_amount * (1.0 - self.fee)
            shares_bought = invest_amount_after_fee / current_price
            self.shares_held += shares_bought
            self.cost_so_far += invest_amount
            self.total_buys_count += 1

        elif action == 2 and self.shares_held > 0.0:  # Sell (liquidate all)
            gross_value = self.shares_held * current_price
            net_value = gross_value * (1.0 - self.fee)
            profit = net_value - self.cost_so_far
            self.total_profit += profit
            reward = profit
            self.shares_held = 0.0
            self.cost_so_far = 0.0

        self.current_idx += 1
        if self.current_idx >= self.end_idx:
            self._final_liquidation()
            self.done = True
        else:
            self.done = False
        
        obs = self._get_observation()
        return obs, reward, self.done, {}

    def reset(self):
        self.current_idx = self.start_idx
        self.done = False
        self.shares_held = 0.0
        self.cost_so_far = 0.0
        self.total_profit = 0.0
        self.total_buys_count = 0
        return self._get_observation()

    def render(self, mode='human'):
        pass


In [42]:
# Actor-Critic network with recurrent processing
class RecurrentActorCritic(nn.Module):
    def __init__(self, input_dim=11, hidden_size=64, num_actions=3):
        super(RecurrentActorCritic, self).__init__()
        self.hidden_size = hidden_size
        
        # Shared LSTM feature extractor
        self.lstm = nn.LSTM(input_dim, hidden_size, batch_first=True)
        
        # Actor head: outputs logits for each action
        self.actor = nn.Linear(hidden_size, num_actions)
        
        # Critic head: outputs state-value estimate
        self.critic = nn.Linear(hidden_size, 1)
    
    def forward(self, x, hidden):
        # x shape: (batch_size, seq_len, input_dim)
        out, hidden = self.lstm(x, hidden)
        # Use the output of the last timestep
        last_out = out[:, -1, :]  # (batch_size, hidden_size)
        logits = self.actor(last_out)    # (batch_size, num_actions)
        value = self.critic(last_out)      # (batch_size, 1)
        return logits, value, hidden
    
    def init_hidden(self, batch_size=1):
        # Initialize hidden states for LSTM
        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))


In [49]:
# PPO Agent that uses the actor-critic network
class PPOAgent:
    def __init__(self, input_dim=11, num_actions=3, hidden_size=64,
                 lr=1e-3, gamma=0.99, clip_param=0.3, ppo_epochs=8, batch_size=64, entropy_coef=0.05):
        self.gamma = gamma
        self.clip_param = clip_param
        self.ppo_epochs = ppo_epochs
        self.batch_size = batch_size
        self.entropy_coef = entropy_coef  # store entropy coefficient

        # Initialize the actor-critic network
        self.policy = RecurrentActorCritic(input_dim, hidden_size, num_actions)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
    
    def select_action(self, state, hidden):
        """
        Given a state and hidden state, sample an action from the policy,
        and return the action, its log probability, and the estimated value.
        """
        state_tensor = torch.FloatTensor(state).unsqueeze(0).unsqueeze(0)  # (1, 1, input_dim)
        logits, value, hidden = self.policy(state_tensor, hidden)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob.item(), value.item(), hidden

    def compute_returns(self, rewards, dones, values, next_value):
        """
        Compute discounted returns for a trajectory.
        """
        returns = []
        R = next_value
        # Traverse backwards through rewards
        for step in reversed(range(len(rewards))):
            R = rewards[step] + self.gamma * R * (1 - dones[step])
            returns.insert(0, R)
        return returns

    def update(self, trajectories):
        """
        Perform PPO update using the collected trajectories.
        trajectories: dictionary with keys 'states', 'actions', 'log_probs',
                      'rewards', 'dones', 'values', and computed 'returns'
        """
        # Convert lists to tensors
        states = torch.FloatTensor(np.array(trajectories['states'])) # shape: (N, input_dim)
        actions = torch.LongTensor(np.array(trajectories['actions']))
        old_log_probs = torch.FloatTensor(np.array(trajectories['log_probs']))
        returns = torch.FloatTensor(np.array(trajectories['returns']))
        values = torch.FloatTensor(np.array(trajectories['values']))

        
        # Advantage: difference between returns and value estimates
        advantages = returns - values

        # For simplicity, we assume the states are single-step observations.
        # If you collect sequential data, you may need to reshape to (batch, seq_len, input_dim)
        hidden = self.policy.init_hidden(batch_size=states.size(0))
        # Add a sequence dimension: (batch_size, seq_len=1, input_dim)
        states_seq = states.unsqueeze(1)
        for _ in range(self.ppo_epochs):
            logits, value, _ = self.policy(states_seq, hidden)
            dist = Categorical(logits=logits)
            new_log_probs = dist.log_prob(actions)
            ratio = torch.exp(new_log_probs - old_log_probs)
            
            # Clipped surrogate objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            
            # Critic loss
            critic_loss = nn.MSELoss()(value.squeeze(), returns)
            
            loss = actor_loss + 0.5 * critic_loss
            
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()


In [44]:
def collect_trajectory(agent, env, max_steps=1000):
    """
    Collect a trajectory (or rollout) by running the current policy in the environment.
    Returns a dictionary of lists for states, actions, log_probs, rewards, dones, and values.
    """
    trajectories = {
        'states': [],
        'actions': [],
        'log_probs': [],
        'rewards': [],
        'dones': [],
        'values': []
    }
    
    state = env.reset()
    hidden = agent.policy.init_hidden(batch_size=1)
    done = False
    steps = 0
    
    while not done and steps < max_steps:
        action, log_prob, value, hidden = agent.select_action(state, hidden)
        trajectories['states'].append(state)
        trajectories['actions'].append(action)
        trajectories['log_probs'].append(log_prob)
        trajectories['values'].append(value)
        
        next_state, reward, done, _ = env.step(action)
        trajectories['rewards'].append(reward)
        trajectories['dones'].append(done)
        
        state = next_state
        steps += 1
    
    # Get value for the final state (for bootstrapping)
    state_tensor = torch.FloatTensor(state).unsqueeze(0).unsqueeze(0)
    _, next_value, _ = agent.policy(state_tensor, agent.policy.init_hidden(batch_size=1))
    next_value = next_value.item()
    
    # Compute returns (discounted rewards)
    trajectories['returns'] = agent.compute_returns(trajectories['rewards'], trajectories['dones'], trajectories['values'], next_value)
    return trajectories

def train_ppo(agent, env, num_episodes=100, log_interval=50):
    """
    Main PPO training loop.
    For each episode, collect a trajectory, update the agent using the trajectory.
    Only print progress every `log_interval` episodes.
    """
    total_rewards = []
    
    for episode in range(num_episodes):
        traj = collect_trajectory(agent, env)
        agent.update(traj)
        episode_reward = sum(traj['rewards'])
        total_rewards.append(episode_reward)
        
        # Print progress only every log_interval episodes
        if (episode + 1) % log_interval == 0 or episode == 0:
            avg_reward = np.mean(total_rewards[-log_interval:])
            print(f"Episode {episode+1}/{num_episodes} | Average Reward (last {log_interval} episodes): {avg_reward:.2f}")
    
    print("Training complete.")





In [45]:
def evaluate_agent(agent, test_data):
    # Create a new environment for testing
    env = TradingEnv(df=test_data,
                     start_idx=0,
                     end_idx=len(test_data) - 1,
                     fee=0.0025)
    
    state = env.reset()
    hidden = agent.policy.init_hidden(batch_size=1)
    done = False
    
    # Run through the environment deterministically
    while not done:
        # Convert state to tensor and add sequence dimensions: (1, 1, input_dim)
        state_tensor = torch.FloatTensor(state).unsqueeze(0).unsqueeze(0)
        
        with torch.no_grad():
            # Get logits from the policy (actor) and value is ignored here
            logits, _, hidden_next = agent.policy(state_tensor, hidden)
        
        # Deterministically select the action with the highest probability
        action = torch.argmax(logits, dim=1).item()
        
        state, reward, done, _ = env.step(action)
        hidden = hidden_next
    
    final_profit = env.total_profit
    total_invested = env.total_buys_count * 100
    print(f"[Test Results] Total Invested: ${total_invested} | Realized Profit: ${final_profit:.2f}")
    return final_profit, total_invested


### 🚦 PPO Training: Quick Overview

We’re training a **PPO agent** on the **first 10,000 rows** of data — a short run, just to:
- Validate pipeline functionality
- See early signs of learning
- Ensure agent interacts meaningfully with the environment

After training, we evaluate the policy on **all remaining data** — measuring how well the agent generalizes to unseen price action.

---

### 🧠 PPO Hyperparameters — Deep Dive

| Hyperparameter     | Value     | What It Means                                                                 |
|--------------------|-----------|-------------------------------------------------------------------------------|
| `input_dim`        | `11`      | The number of inputs: 10 indicators + `shares_held`                          |
| `num_actions`      | `3`       | Discrete: {0 = Hold, 1 = Buy, 2 = Sell}                                      |
| `hidden_size`      | `64`      | LSTM hidden units — controls memory capacity for temporal pattern learning   |
| `lr` (learning rate) | `1e-3`   | Moderate learning rate for stable PPO updates                                |
| `gamma`            | `0.99`    | Discount factor — favors **long-term profit** over short-term gains          |
| `clip_param`       | `0.3`     | **Clipping range** for PPO update — larger value allows more flexible policy shifts (good for early-stage learning) |
| `ppo_epochs`       | `8`       | Number of **gradient updates per batch** — increases sample reuse & stability |
| `batch_size`       | `64`      | Number of samples used per PPO update                                        |
| `entropy_coef`     | `0.05`    | Strength of **exploration bonus** — encourages policy randomness to avoid premature convergence |

Together, these settings make PPO:
- **More exploratory** (via `entropy_coef`)
- **Less conservative** in updates (`clip_param = 0.3`)
- More effective at **generalizing from small datasets**

---

### ✅ Outcome

After training, the agent is evaluated using `evaluate_agent()` on unseen data to measure:
- **Realized Profit** from its trading behavior
- **Total Invested Capital**
- **Return on Investment (ROI)**

This helps us compare the **PPO agent's generalization** against baselines like **buy-and-hold** or other RL methods.

In [None]:
# Training data: First 10,000 rows
train_data = data_2.iloc[:10000].reset_index(drop=True)

# Testing data: Rows 10,001 to last row
test_data = data_2.iloc[10000:].reset_index(drop=True)

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (10000, 24)
Testing Data: (8679165, 24)


In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 6366.75%


In [None]:
# Create training environment
train_env = TradingEnv(df=train_data, start_idx=0, end_idx=len(train_data) - 1, fee=0.0025)

# Initialize the PPO agent with higher exploration settings
ppo_agent = PPOAgent(
    input_dim=11,
    num_actions=3,
    hidden_size=64,
    lr=1e-3,
    gamma=0.99,
    clip_param=0.3,      # Looser clip for more aggressive updates
    ppo_epochs=8,        # More epochs per update cycle
    batch_size=64,
    entropy_coef=0.05    # Higher entropy bonus for increased exploration
)


# Train the PPO agent on the training environment
train_ppo(ppo_agent, train_env, num_episodes=100)

# Evaluate the agent on test data
final_profit, total_invested = evaluate_agent(ppo_agent, test_data)

print(f"Final Profit: ${final_profit:.2f}")
print(f"Total Invested: ${total_invested}")


Episode 1/100 | Average Reward (last 50 episodes): -185.02
Episode 50/100 | Average Reward (last 50 episodes): -56.29
Episode 100/100 | Average Reward (last 50 episodes): -2.25
Training complete.
[Test Results] Total Invested: $0 | Realized Profit: $0.00
Final Profit: $0.00
Total Invested: $0


In [None]:
total_invested = 0.000001   # Total amount invested
realized_profit = 0  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


### 🤖 PPO Agent: Initial Run (10k Rows)

We trained the PPO agent on a **small 10,000-row window** to test the setup and ensure the training loop was working as expected.

> 🧮 **Buy-and-Hold ROI (Benchmark)**: `6366.75%`  
> 🧠 **PPO Agent ROI**: `0.00%`  
> 💵 **Total Invested**: `$0`  
> 📉 **Profit**: `$0.00`

---

### 🧪 Why This Result Is Expected

- The agent **did not place any trades** during this small training run.
- Early episodes had **highly negative average rewards**, but by episode 100 it improved to ~`-2.25`, suggesting **learning had begun**.
- With such a small training window, the agent likely didn't see enough data patterns to build confidence in any action.

---

### 🚧 Next Steps

- **Increase the training window** to expose the agent to more varied market conditions.
- Consider tweaking **entropy bonus** or **clip parameter** to encourage exploration earlier in training.
- Run longer training (more episodes) and inspect whether trades begin firing.

This run was about validating the PPO pipeline — not performance yet. Everything is working, now it’s time to scale.

### 🚀 PPO Agent: Scaling Up Training

We now increase the training window from 10,000 to **100,000 rows** to give the PPO agent a **much broader view of market behavior**.

- This larger window exposes the agent to **more price dynamics, patterns, and indicator variations**.
- The test set spans all data **after row 100,000**, simulating forward generalization.
- Training is run for **100 episodes**, same as before, to keep comparisons consistent.
- The goal is to observe whether **increased data alone enables better learning and trading activity**, even without tuning.

This setup will help assess whether PPO begins taking meaningful actions once it’s seen more of the market.

In [None]:
# Training data: First 100,000 rows
train_data = data_2.iloc[:100000].reset_index(drop=True)

# Testing data: Rows 100,001 to last row
test_data = data_2.iloc[100000:].reset_index(drop=True)

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (100000, 24)
Testing Data: (8589165, 24)


In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 8035.58%


In [None]:
# Create training environment
train_env = TradingEnv(df=train_data, start_idx=0, end_idx=len(train_data) - 1, fee=0.0025)

# Initialize the PPO agent with higher exploration settings
ppo_agent = PPOAgent(
    input_dim=11,
    num_actions=3,
    hidden_size=64,
    lr=1e-3,
    gamma=0.99,
    clip_param=0.3,      # Looser clip for more aggressive updates
    ppo_epochs=8,        # More epochs per update cycle
    batch_size=64,
    entropy_coef=0.05    # Higher entropy bonus for increased exploration
)


# Train the PPO agent on the training environment
train_ppo(ppo_agent, train_env, num_episodes=100)

# Evaluate the agent on test data
final_profit, total_invested = evaluate_agent(ppo_agent, test_data)

print(f"Final Profit: ${final_profit:.2f}")
print(f"Total Invested: ${total_invested}")


Episode 1/100 | Average Reward (last 50 episodes): -156.10
Episode 50/100 | Average Reward (last 50 episodes): -46.96
Episode 100/100 | Average Reward (last 50 episodes): -0.69
Training complete.
[Test Results] Total Invested: $0 | Realized Profit: $0.00
Final Profit: $0.00
Total Invested: $0


In [None]:
total_invested = 0.000001 # Total amount invested
realized_profit = 0 # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


### 📉 PPO Agent: Still Not Taking Action

Despite training on **100,000 rows** for **100 episodes**, the PPO agent **did not make a single trade** during evaluation.  

- It learned to minimize loss (average reward steadily improved), but never crossed the threshold to **buy or sell**.
- We'll need to adjust exploration dynamics, reward shaping, or training strategy to encourage **actual participation in the market**.  

🧪 Next: Try training on longer horizons (e.g., multiple years) or modify the reward/entropy mix.

### 🧪 PPO Experiment: Training on Pre-2008 Data

In this run, we train the PPO agent using **data up to 2007**, then test it on **unseen years from 2008 onward**.

- This split simulates **learning before the 2008 financial crisis**, then seeing how well the agent generalizes into more volatile markets.
- We again use high-entropy settings to **encourage exploration** early in training.

In [None]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2007]
test_data = data_2[data_2['Year'] >= 2008]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (936005, 24)
Testing Data: (7753160, 24)


In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 2554.91%


In [None]:
# Create training environment
train_env = TradingEnv(df=train_data, start_idx=0, end_idx=len(train_data) - 1, fee=0.0025)

# Initialize the PPO agent with higher exploration settings
ppo_agent = PPOAgent(
    input_dim=11,
    num_actions=3,
    hidden_size=64,
    lr=1e-3,
    gamma=0.99,
    clip_param=0.3,      # Looser clip for more aggressive updates
    ppo_epochs=8,        # More epochs per update cycle
    batch_size=64,
    entropy_coef=0.05    # Higher entropy bonus for increased exploration
)


# Train the PPO agent on the training environment
train_ppo(ppo_agent, train_env, num_episodes=100)

# Evaluate the agent on test data
final_profit, total_invested = evaluate_agent(ppo_agent, test_data)

print(f"Final Profit: ${final_profit:.2f}")
print(f"Total Invested: ${total_invested}")


Episode 1/100 | Average Reward (last 50 episodes): -186.56
Episode 50/100 | Average Reward (last 50 episodes): -110.02
Episode 100/100 | Average Reward (last 50 episodes): -1.08
Training complete.
[Test Results] Total Invested: $0 | Realized Profit: $0.00
Final Profit: $0.00
Total Invested: $0


In [None]:
total_invested = 0.000001 # Total amount invested
realized_profit = 0 # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


### ❌ PPO Agent: Pre-2008 

Despite running for 100 episodes, the PPO agent **did not place any trades** during evaluation:

#### 🧪 **Trained on data ≤ 2007**, tested on 2008+
- Average reward improved over time (from -186.56 → -1.08), indicating some internal learning.
- Yet **no trades were executed** on the test set — the policy remained too cautious or underconfident.

### 🧪 PPO Experiment: Training up to 2009, Testing from 2010+

We extend the training window through **2009**, giving the PPO agent a chance to experience **crisis + post-crisis dynamics**.

- The test set starts from 2010 — a period of strong bull runs and structural changes.
- With more varied training data, we’re testing if the agent can finally **build conviction and act** during the evaluation window. 

In [None]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2009]
test_data = data_2[data_2['Year'] >= 2010]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (1880669, 24)
Testing Data: (6808496, 24)


In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 2365.54%


In [None]:
# Create training environment
train_env = TradingEnv(df=train_data, start_idx=0, end_idx=len(train_data) - 1, fee=0.0025)

# Initialize the PPO agent with higher exploration settings
ppo_agent = PPOAgent(
    input_dim=11,
    num_actions=3,
    hidden_size=64,
    lr=1e-3,
    gamma=0.99,
    clip_param=0.3,      # Looser clip for more aggressive updates
    ppo_epochs=8,        # More epochs per update cycle
    batch_size=64,
    entropy_coef=0.05    # Higher entropy bonus for increased exploration
)


# Train the PPO agent on the training environment
train_ppo(ppo_agent, train_env, num_episodes=100)

# Evaluate the agent on test data
final_profit, total_invested = evaluate_agent(ppo_agent, test_data)

print(f"Final Profit: ${final_profit:.2f}")
print(f"Total Invested: ${total_invested}")


Episode 1/100 | Average Reward (last 50 episodes): -151.07
Episode 50/100 | Average Reward (last 50 episodes): -39.50
Episode 100/100 | Average Reward (last 50 episodes): -1.32
Training complete.
[Test Results] Total Invested: $0 | Realized Profit: $0.00
Final Profit: $0.00
Total Invested: $0


In [None]:
total_invested = 0.000001   # Total amount invested
realized_profit = 0 # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


### ❌ PPO Agent: Pre-2010 Training

Despite running for 100 episodes, the PPO agent **did not place any trades** during evaluation:

#### 🧪 **Trained on data ≤ 2009**, tested on 2010+
- Similar pattern: learning signals improved (rewards approaching 0), but still **zero investment** at test time.
- Agent may be struggling to **translate learned value estimates into confident actions**.

---

### 🧠 What This Suggests

- PPO is **learning something** (reward steadily improving), but not enough to act during evaluation.
- Could be underfitting, overly conservative policy, or insufficient diversity in the training data.
- Later runs will explore whether **more recent or richer training windows** help PPO become more active.

### ⏱️ PPO Training: Learning on Data up to 2014

In this setup, the PPO agent is trained using **all available market data up to and including 2014**, then evaluated on the unseen period from **2015 onwards**.

This structure mirrors a **realistic live deployment** setting — the agent is trained purely on historical data and must operate on future data without retraining.

---

### 🎛️ PPO Hyperparameters Recap

- **`clip_param = 0.3`**  
  Allows more aggressive policy updates; larger range of trust region.
  
- **`ppo_epochs = 8`**  
  The agent performs multiple updates per rollout, giving it more opportunity to fit its policy/value networks.

- **`entropy_coef = 0.05`**  
  Encourages exploration by rewarding higher entropy (less certainty), helpful for avoiding early convergence to conservative or suboptimal policies.

- **`gamma = 0.99`**  
  Standard discount factor to prioritize longer-term rewards.

- **`batch_size = 64`**, **`hidden_size = 64`**  
  Balanced configuration for model complexity and update stability.

---

### 🧪 What This Experiment Tests

- Whether PPO can **generalize to recent years (2015–2024)** after training on a **wide, diverse range of earlier years**.
- Whether broader historical context leads to **greater policy confidence and trade activity** during evaluation.
- Builds on the idea that PPO has struggled in earlier setups — this gives it more market variety to learn from.

The outcome will help determine if **training on longer timeframes** leads to **more meaningful trading behaviors** in PPO.

In [46]:
# Split the dataset into train, validation, and test sets
train_data = data_2[data_2['Year'] <= 2014]
test_data = data_2[data_2['Year'] >= 2015]

# Print split sizes
print(f"Training Data: {train_data.shape}")
print(f"Testing Data: {test_data.shape}")

Training Data: (4250970, 24)
Testing Data: (4438195, 24)


In [47]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data.iloc[0]['Close']
final_price = test_data.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 578.28%


In [50]:
# Create training environment
train_env = TradingEnv(df=train_data, start_idx=0, end_idx=len(train_data) - 1, fee=0.0025)

# Initialize the PPO agent with higher exploration settings
ppo_agent = PPOAgent(
    input_dim=11,
    num_actions=3,
    hidden_size=64,
    lr=1e-3,
    gamma=0.99,
    clip_param=0.3,      # Looser clip for more aggressive updates
    ppo_epochs=8,        # More epochs per update cycle
    batch_size=64,
    entropy_coef=0.05    # Higher entropy bonus for increased exploration
)


# Train the PPO agent on the training environment
train_ppo(ppo_agent, train_env, num_episodes=100)

# Evaluate the agent on test data
final_profit, total_invested = evaluate_agent(ppo_agent, test_data)

print(f"Final Profit: ${final_profit:.2f}")
print(f"Total Invested: ${total_invested}")


Episode 1/100 | Average Reward (last 50 episodes): -209.46
Episode 50/100 | Average Reward (last 50 episodes): -36.81
Episode 100/100 | Average Reward (last 50 episodes): -1.28
Training complete.
[Test Results] Total Invested: $0 | Realized Profit: $0.00
Final Profit: $0.00
Total Invested: $0


In [51]:
total_invested = 0.000001   # Total amount invested
realized_profit = 0 # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


### ❌ PPO Agent Result

> 💰 **Total Invested**: `$0`  
> 📉 **Realized Profit**: `$0.00`  
> 📊 **ROI**: `0.00%`

Despite training over a longer historical window, the agent **never executed a trade** during the evaluation period — indicating that it's still struggling to gain confidence in the market dynamics under PPO.

### 🧪 Tried *Everything* with PPO — Still Doesn’t Learn

At this point, I’ve run **dozens of experiments** trying to get PPO to work for this trading setup — and *nothing* has moved the needle.  

> ⭐ **Note:** I haven’t included *every* PPO variant I tested here, as that would be tedious for the reader — but I tried a lot.

This wasn't just tuning hyperparameters. I explored:

- Standard PPO vs. recurrent PPO with LSTM  
- Different rollout lengths and update cycles  
- Reward shaping, entropy bonuses, tighter/looser clip ranges  
- Alternate policy architectures (shallow vs deep heads)  
- Observations with and without state variables like `shares_held`

Despite all this, PPO **never placed a single trade**. Every run ended with:

> `Total Invested: $0 | Realized Profit: $0.00`

---

### 🧠 Why Might PPO Be Failing Here?

After analyzing everything, a few likely culprits stand out:

1. **Sparse and Delayed Rewards**  
   PPO thrives on denser feedback. Here, rewards only happen on **Sell** actions, and often negatively at first. This makes early exploration extremely punishing.

2. **On-Policy Limitation**  
   PPO doesn’t store experience — it learns *only* from current rollouts. So if the current episode is bad (which it usually is early on), it can’t recover or generalize from better past episodes.

3. **Flat Exploration Dynamics**  
   Without early reward signals or strong gradients, the actor often collapses into a **"Hold forever" policy** and gets stuck.

4. **PPO’s Sensitivity to Action Granularity**  
   Our environment uses **discrete decisions with financial side effects** (fees, holdings, etc.). PPO isn’t always great at fine-tuning such **non-continuous reward profiles**.

---

### ✅ Moving On: DQN + LSTM Was Actually Working

In contrast, my DQN + LSTM agent:

- Trained meaningfully within a few episodes  
- Learned to place trades and realize profit  
- Showed stronger generalization as training data expanded  
- Reacted well to features like `shares_held`, volume, and momentum shifts

So I’m officially **ruling out PPO** for now.

> ⭐ **Next:** I’ll be shifting focus toward **testing DQN on additional assets** (e.g., Reliance and S&P500) to verify generality across market types.  

No point chasing what isn’t working — DQN is already showing promising behavior.

# 🌍 Next: Generalization to Other Assets

The big question now:

> **Can the best-performing DQN + LSTM agent trained on Apple generalize to brand new assets like S&P 500 and Reliance?**

If not, the follow-up is equally important:

> **Can we train *new* DQN + LSTM agents from scratch that can learn to trade these new assets effectively?**

This next phase will explore whether the learning approach is truly **market-agnostic**, or if each asset requires tailored agent training.

Let’s find out.

### 🧹 Preparing S&P 500 and Reliance Datasets

Now that the Apple agent is finalized, we begin preparing **new datasets** for the next phase: testing generalization and building new agents if needed.

---

#### 📊 Step 1: Load & Format S&P 500 Dataset (`data_SP`)
- Loaded from `dataset_2.csv`
- Cleaned and converted the `date` column to datetime
- Engineered all key **technical indicators** used by the agent:
  - **Trend**: `EMA_50`, `EMA_200`
  - **Mean Reversion**: `SMA_20`, `Bollinger Bands`
  - **Relative Strength**: `%K`, `%D`
  - **Momentum**: `MACD_Line`, `Signal_Line`
  - **Volume**: `OBV`
- Renamed columns for consistent formatting
- Dropped intermediate or irrelevant columns (`Unnamed: 0`, `Barcount`, etc.)
- Removed all rows with missing values

---

#### 📈 Step 2: Load & Format Reliance Dataset (`data_R`)
- Loaded from `dataset_3.csv`
- Identical processing pipeline as S&P:
  - Converted date
  - Computed all indicators
  - Cleaned column names and removed NaNs

Both datasets are now **ready for evaluation and modeling** using the same environment and agent framework developed earlier.

---

### 🧪 What’s Next: Hardcoded Strategy Baselines

Before trying DQN + LSTM on these new stocks, we’ll first run **all hardcoded strategies** (Trend, MR, RSI, MACD, Volume, and combos) on the **full S&P 500 and Reliance datasets**.

> This will give a sense of how rule-based systems perform **overall** on these stocks — similar to how we benchmarked on Apple.

Once done, we’ll proceed to:
- Evaluate how well the existing Apple agent generalizes
- Or build **new agents** tailored to these assets if needed.

In [18]:
# Load the dataset
file_path = 'dataset_2.csv'  # The file is in the same folder as the notebook
data_SP = pd.read_csv(file_path, low_memory=False)

In [19]:
# Convert 'Date' column to datetime if not already converted
data_SP['date'] = pd.to_datetime(data_SP['date'])

### 1. Trend Indicators: 50-Day EMA and 200-Day EMA
data_SP['EMA_50'] = data_SP['close'].ewm(span=50, adjust=False).mean()  # 50-Day EMA
data_SP['EMA_200'] = data_SP['close'].ewm(span=200, adjust=False).mean()  # 200-Day EMA

### 2. Mean Reversion Indicators: Bollinger Bands (20, 2)
data_SP['SMA_20'] = data_SP['close'].rolling(window=20).mean()  # 20-Day Simple Moving Average (SMA)
data_SP['Std_Dev_20'] = data_SP['close'].rolling(window=20).std()  # 20-Day Standard Deviation
data_SP['Upper_Band'] = data_SP['SMA_20'] + (2 * data_SP['Std_Dev_20'])  # Upper Bollinger Band
data_SP['Lower_Band'] = data_SP['SMA_20'] - (2 * data_SP['Std_Dev_20'])  # Lower Bollinger Band

### 3. Relative Strength Indicators: Stochastics (14, 7, 3)
# High and Low for the past 14 periods
data_SP['High_14'] = data_SP['high'].rolling(window=14).max()
data_SP['Low_14'] = data_SP['low'].rolling(window=14).min()
# %K: Stochastic Oscillator
data_SP['%K'] = ((data_SP['close'] - data_SP['Low_14']) / (data_SP['High_14'] - data_SP['Low_14'])) * 100
# %D: 3-Period Moving Average of %K
data_SP['%D'] = data_SP['%K'].rolling(window=3).mean()

### 4. Momentum Indicators: MACD (12, 26, 9)
# MACD Line: Difference between 12-period and 26-period EMAs
data_SP['MACD_Line'] = data_SP['close'].ewm(span=12, adjust=False).mean() - data_SP['close'].ewm(span=26, adjust=False).mean()
# Signal Line: 9-period EMA of the MACD Line
data_SP['Signal_Line'] = data_SP['MACD_Line'].ewm(span=9, adjust=False).mean()

### 5. Volume Indicators: On-Balance Volume (OBV)
# OBV Calculation
data_SP['Daily_Change'] = data_SP['close'].diff()
data_SP['OBV'] = (np.where(data_SP['Daily_Change'] > 0, data_SP['volume'],
                  np.where(data_SP['Daily_Change'] < 0, -data_SP['volume'], 0))).cumsum()

# Drop intermediate columns not required
data_SP.drop(columns=['Daily_Change'], inplace=True)

In [164]:
# Count rows with any null values
num_null_rows = data_SP.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 317


In [20]:
# Drop all rows with any null values
data_SP = data_SP.dropna()

In [21]:
# Change the first letter of each column name to uppercase
data_SP.columns = [col.capitalize() for col in data_SP.columns]


In [22]:
# Make a copy of data_SP
data_SP = data_SP.copy()

# Rename columns
data_SP.rename(columns={
    "Ema_50": "EMA_50",
    "Ema_200": "EMA_200",
    "Sma_20": "SMA_20",
    "Std_dev_20": "Std_Dev_20",
    "Upper_band": "Upper_Band",
    "Lower_band": "Lower_Band",
    "High_14": "High_14",
    "Low_14": "Low_14",
    "%k": "%K",
    "%d": "%D",
    "Macd_line": "MACD_Line",
    "Signal_line": "Signal_Line",
    "Obv": "OBV"
}, inplace=True)


In [23]:
# Drop the specified columns
data_SP = data_SP.drop(columns=["Unnamed: 0", "Barcount", "Average"])

In [24]:
# Load the dataset
file_path = 'dataset_3.csv'  # The file is in the same folder as the notebook
data_R = pd.read_csv(file_path, low_memory=False)

In [25]:
# Convert 'Date' column to datetime if not already converted
data_R['Date'] = pd.to_datetime(data_R['Date'])

### 1. Trend Indicators: 50-Day EMA and 200-Day EMA
data_R['EMA_50'] = data_R['Close'].ewm(span=50, adjust=False).mean()  # 50-Day EMA
data_R['EMA_200'] = data_R['Close'].ewm(span=200, adjust=False).mean()  # 200-Day EMA

### 2. Mean Reversion Indicators: Bollinger Bands (20, 2)
data_R['SMA_20'] = data_R['Close'].rolling(window=20).mean()  # 20-Day Simple Moving Average (SMA)
data_R['Std_Dev_20'] = data_R['Close'].rolling(window=20).std()  # 20-Day Standard Deviation
data_R['Upper_Band'] = data_R['SMA_20'] + (2 * data_R['Std_Dev_20'])  # Upper Bollinger Band
data_R['Lower_Band'] = data_R['SMA_20'] - (2 * data_R['Std_Dev_20'])  # Lower Bollinger Band

### 3. Relative Strength Indicators: Stochastics (14, 7, 3)
# High and Low for the past 14 periods
data_R['High_14'] = data_R['High'].rolling(window=14).max()
data_R['Low_14'] = data_R['Low'].rolling(window=14).min()
# %K: Stochastic Oscillator
data_R['%K'] = ((data_R['Close'] - data_R['Low_14']) / (data_R['High_14'] - data_R['Low_14'])) * 100
# %D: 3-Period Moving Average of %K
data_R['%D'] = data_R['%K'].rolling(window=3).mean()

### 4. Momentum Indicators: MACD (12, 26, 9)
# MACD Line: Difference between 12-period and 26-period EMAs
data_R['MACD_Line'] = data_R['Close'].ewm(span=12, adjust=False).mean() - data_R['Close'].ewm(span=26, adjust=False).mean()
# Signal Line: 9-period EMA of the MACD Line
data_R['Signal_Line'] = data_R['MACD_Line'].ewm(span=9, adjust=False).mean()

### 5. Volume Indicators: On-Balance Volume (OBV)
# OBV Calculation
data_R['Daily_Change'] = data_R['Close'].diff()
data_R['OBV'] = (np.where(data_R['Daily_Change'] > 0, data_R['Volume'],
                  np.where(data_R['Daily_Change'] < 0, -data_R['Volume'], 0))).cumsum()

# Drop intermediate columns not required
data_R.drop(columns=['Daily_Change'], inplace=True)

In [171]:
# Count rows with any null values
num_null_rows = data_R.isnull().any(axis=1).sum()

# Print the result
print(f"Number of rows with any null values: {num_null_rows}")

Number of rows with any null values: 19


In [26]:
# Drop all rows with any null values
data_R = data_R.dropna()

## Buy and Hold ROI% for S&P 500 and Reliance over full dataset.

In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = data_SP.iloc[0]['Close']
final_price = data_SP.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 41.50%


In [None]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = data_R.iloc[0]['Close']
final_price = data_R.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 330.87%


## Trend (EMA)

In [175]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [177]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,106862300,-13934440.0,-13.039625
1,0.03%,106862300,-533643.6,-0.499375
2,1.00%,106862300,-13789080.0,-12.903595
3,3.00%,106862300,-13934870.0,-13.040027
4,5.00%,106862300,-13935010.0,-13.040157
5,10.00%,106862300,-13935200.0,-13.040336
6,15.00%,106862300,-13935030.0,-13.040177
7,20.00%,106862300,-13934750.0,-13.039911


In [178]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,392469300,5700979.0,1.452592
1,0.03%,392469300,-1959894.0,-0.499375
2,1.00%,392469300,5700967.0,1.452589
3,3.00%,392469300,5700979.0,1.452592
4,5.00%,392469300,5700979.0,1.452592
5,10.00%,392469300,5700979.0,1.452592
6,15.00%,392469300,5700979.0,1.452592
7,20.00%,392469300,5700979.0,1.452592


## Mean Reversion (Bollinger Bands)

In [181]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_mean_reversion_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [182]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,10966900,68466.462979,0.624301
1,0.03%,10966900,-54765.956875,-0.499375
2,1.00%,10966900,33910.265949,0.309206
3,3.00%,10966900,66299.575791,0.604543
4,5.00%,10966900,68439.544934,0.624056
5,10.00%,10966900,68425.824958,0.62393
6,15.00%,10966900,68404.473131,0.623736
7,20.00%,10966900,68428.388594,0.623954


In [183]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,2749500,-54424.195925,-1.979422
1,0.03%,2749500,-13730.315625,-0.499375
2,1.00%,2749500,-22611.428705,-0.822383
3,3.00%,2749500,-33869.585032,-1.231845
4,5.00%,2749500,-43052.685546,-1.565837
5,10.00%,2749500,-53577.862723,-1.94864
6,15.00%,2749500,-53944.401781,-1.961971
7,20.00%,2749500,-54400.662107,-1.978566


## RSI

In [194]:
# Ensure the DataFrames are explicitly updated and not views
data_SP = data_SP.copy()
data_R = data_R.copy()

# Add %D_Slow permanently to each DataFrame
data_SP['%D_Slow'] = data_SP['%D'].rolling(window=3).mean()
data_R['%D_Slow'] = data_R['%D'].rolling(window=3).mean()


In [198]:
# Remove all rows with null values from each DataFrame
data_SP = data_SP.dropna().reset_index(drop=True)
data_R = data_R.dropna().reset_index(drop=True)

In [205]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_stochastics_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [201]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,37831000,271589.686387,0.717902
1,0.03%,37831000,-188918.55625,-0.499375
2,1.00%,37831000,190416.770252,0.503335
3,3.00%,37831000,262760.32741,0.694564
4,5.00%,37831000,271508.531677,0.717688
5,10.00%,37831000,271492.481317,0.717646
6,15.00%,37831000,271464.751551,0.717572
7,20.00%,37831000,271475.767577,0.717601


In [202]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,111673900,-1212928.0,-1.086134
1,0.03%,111673900,-557671.5,-0.499375
2,1.00%,111673900,-321174.6,-0.2876
3,3.00%,111673900,-832593.0,-0.745557
4,5.00%,111673900,-1128413.0,-1.010453
5,10.00%,111673900,-1016505.0,-0.910244
6,15.00%,111673900,-1120503.0,-1.00337
7,20.00%,111673900,-1225276.0,-1.097191


## MACD

In [208]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [209]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,102929000,-169928.178792,-0.165093
1,0.03%,102929000,-514001.69375,-0.499375
2,1.00%,102929000,-169708.134615,-0.164879
3,3.00%,102929000,-169928.178792,-0.165093
4,5.00%,102929000,-169928.178792,-0.165093
5,10.00%,102929000,-169928.178792,-0.165093
6,15.00%,102929000,-169928.178792,-0.165093
7,20.00%,102929000,-169928.178792,-0.165093


In [210]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,381215400,-1780802.0,-0.467138
1,0.03%,381215400,-1903694.0,-0.499375
2,1.00%,381215400,-1772554.0,-0.464974
3,3.00%,381215400,-1779936.0,-0.466911
4,5.00%,381215400,-1780802.0,-0.467138
5,10.00%,381215400,-1780802.0,-0.467138
6,15.00%,381215400,-1780802.0,-0.467138
7,20.00%,381215400,-1780802.0,-0.467138


## Volume

In [213]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]
results_Agent_2 = all_results["Agent_2"]
results_Agent_3 = all_results["Agent_3"]


In [214]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,43551300,-5743501.0,-13.187898
1,0.03%,43551300,-217484.3,-0.499375
2,1.00%,43551300,-5706769.0,-13.103555
3,3.00%,43551300,-5742644.0,-13.185931
4,5.00%,43551300,-5743706.0,-13.188368
5,10.00%,43551300,-5743773.0,-13.188523
6,15.00%,43551300,-5743711.0,-13.188381
7,20.00%,43551300,-5743605.0,-13.188138


In [215]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,196642100,3179734.0,1.617016
1,0.03%,196642100,-981981.5,-0.499375
2,1.00%,196642100,3179730.0,1.617014
3,3.00%,196642100,3179734.0,1.617016
4,5.00%,196642100,3179734.0,1.617016
5,10.00%,196642100,3179734.0,1.617016
6,15.00%,196642100,3179734.0,1.617016
7,20.00%,196642100,3179734.0,1.617016


## EMA + MACD

In [224]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [225]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,43551300,-5743501.0,-13.187898
1,0.03%,43551300,-217484.3,-0.499375
2,1.00%,43551300,-5706769.0,-13.103555
3,3.00%,43551300,-5742644.0,-13.185931
4,5.00%,43551300,-5743706.0,-13.188368
5,10.00%,43551300,-5743773.0,-13.188523
6,15.00%,43551300,-5743711.0,-13.188381
7,20.00%,43551300,-5743605.0,-13.188138


In [226]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,196642100,3179734.0,1.617016
1,0.03%,196642100,-981981.5,-0.499375
2,1.00%,196642100,3179730.0,1.617014
3,3.00%,196642100,3179734.0,1.617016
4,5.00%,196642100,3179734.0,1.617016
5,10.00%,196642100,3179734.0,1.617016
6,15.00%,196642100,3179734.0,1.617016
7,20.00%,196642100,3179734.0,1.617016


## EMA + BB

In [229]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_bollinger_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [230]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,5390000,-763326.334446,-14.161899
1,0.03%,5390000,-26916.3125,-0.499375
2,1.00%,5390000,-748408.857681,-13.885137
3,3.00%,5390000,-763021.790529,-14.156248
4,5.00%,5390000,-762817.464894,-14.152458
5,10.00%,5390000,-762876.914613,-14.153561
6,15.00%,5390000,-762950.948793,-14.154934
7,20.00%,5390000,-763000.921471,-14.155861


In [231]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,1397200,-93825.207302,-6.715231
1,0.03%,1397200,-6977.2675,-0.499375
2,1.00%,1397200,-15532.877516,-1.111715
3,3.00%,1397200,-35673.018318,-2.553179
4,5.00%,1397200,-51909.230693,-3.715233
5,10.00%,1397200,-82961.622889,-5.937706
6,15.00%,1397200,-93504.359894,-6.692267
7,20.00%,1397200,-93646.584703,-6.702447


## EMA + MACD + OBV

In [234]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_obv_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [235]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,21553100,-2882940.0,-13.375986
1,0.03%,21553100,-107630.8,-0.499375
2,1.00%,21553100,-2841098.0,-13.181854
3,3.00%,21553100,-2880695.0,-13.365573
4,5.00%,21553100,-2882981.0,-13.376179
5,10.00%,21553100,-2883014.0,-13.376332
6,15.00%,21553100,-2883020.0,-13.37636
7,20.00%,21553100,-2882938.0,-13.375977


In [236]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,196437900,3228793.0,1.643671
1,0.03%,196437900,-980961.8,-0.499375
2,1.00%,196437900,3218400.0,1.63838
3,3.00%,196437900,3228793.0,1.643671
4,5.00%,196437900,3228793.0,1.643671
5,10.00%,196437900,3228793.0,1.643671
6,15.00%,196437900,3228793.0,1.643671
7,20.00%,196437900,3228793.0,1.643671


## EMA + Bollinger + MACD + OBV

In [239]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "S&P 500": data_SP.copy(),
    "Reliance": data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_4_indicator_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_SP = all_results["S&P 500"]
results_Reliance = all_results["Reliance"]


In [240]:
results_SP

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,400,125.739855,31.434964
1,0.03%,400,-1.9975,-0.499375
2,1.00%,400,-173.60244,-43.40061
3,3.00%,400,-173.60244,-43.40061
4,5.00%,400,-173.60244,-43.40061
5,10.00%,400,-173.60244,-43.40061
6,15.00%,400,-173.60244,-43.40061
7,20.00%,400,-173.60244,-43.40061


In [241]:
results_Reliance

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


### 📉 Hardcoded Strategies on S&P 500 and Reliance — Summary of Results

To establish a baseline, I ran **all nine hardcoded strategies** — with multiple stop-loss configurations — on both the **S&P 500** and **Reliance** datasets. That’s a total of **18 different tests**.

---

#### ⚙️ Key Observations:
- Across both assets, the strategies mostly resulted in:
  - **Negligible profits** (often less than a percent)
  - Or outright **losses**, even with large capital deployment  
- One configurations did produce minor positive returns, but is **statistically insignificant** given the investment scale and holding period

---

#### 📊 Compared to Buy-and-Hold:
| Asset     | Buy-and-Hold ROI |
|-----------|------------------|
| **S&P 500** | **41.50%**       |
| **Reliance** | **330.87%**     |

The hardcoded strategies **failed to match or beat** these passive baselines over the full time horizon.

---

### 🧭 Takeaway:
These results further highlight the **limitations of static rule-based trading**. Even with decades of data, hardcoded indicators struggled to generalize or capitalize meaningfully.  

### 🔄 Testing Apple-Trained Agent on New Assets: S&P 500 & Reliance

To test generalization, I ran **Agent 3** (the best-performing Apple agent) directly on the **full S&P 500 and Reliance datasets**.

In [246]:
profit, invested = evaluate_agent(agent_3, data_SP)

[Test Results] Total Invested: $979000 | Realized Profit: $62445.35


In [33]:
total_invested = 979000  # Total amount invested
realized_profit = 62445.35  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")


ROI: 6.38%


In [247]:
profit, invested = evaluate_agent(agent_3, data_R)

[Test Results] Total Invested: $0 | Realized Profit: $0.00


In [2]:
total_invested = 0.00001  # Total amount invested
realized_profit = 0  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 0.00%


#### 📉 S&P 500 (Full Horizon)

> ROI: **6.38%**  
> Invested: `$979,000`  
> Realized Profit: `$62,445.35`

While the agent did trade and make a small profit, this is **very underwhelming** compared to the **buy-and-hold ROI of 41.50%**. So despite transferring some behavior, it clearly failed to capture broader upward trends or structure in the S&P 500.

---

#### 🪫 Reliance (Full Horizon)

> ROI: **0.00%**  
> Invested: `$0`  
> Realized Profit: `$0.00`

The agent made **no trades at all** on Reliance. Its policy didn’t recognize any signals as worth acting on — highlighting a **complete lack of transferability** to this market.

---

### 🚧 Why This Matters

- **Buy-and-hold dominated** both benchmarks by a wide margin.
- The agent **didn’t generalize** effectively — especially on assets with different dynamics.
- These results confirm that we now need to:
  - Train **specialized agents** from scratch on **S&P 500** and **Reliance**.
  - Let the model learn structure **specific to each asset**, rather than expecting one trained on Apple to perform well elsewhere.

Next step: begin **dedicated DQN + LSTM training** for each market.

## 📅 Preparing the S&P 500 Dataset for Agent Training

We start by preprocessing the full minute-by-minute S&P 500 dataset:

- Converted the timestamp column into usable `Year`, `Month`, `Day`, and `Time` features.
- Extracted and **normalized all technical indicators** — including trend (EMAs), mean reversion (Bollinger Bands), momentum (MACD), volume (OBV), and relative strength (Stochastics).
- Removed any rows with missing values to ensure clean training input.

---

### 🧠 Why Use ~1.27 Million Training Rows?

In the Apple experiments, it became very clear:  
➡️ **More training data led to better generalization** and a more reliable agent.

The best-performing DQN + LSTM agent had access to **nearly a decade of data (~1.8 million rows)**. That helped it:
- Understand multiple market regimes,
- Learn indicator patterns more robustly,
- Avoid overfitting to short-term noise.

So here, we aimed to **match that depth** by training on all rows **up to the end of 2016**. This gives us:

- 📊 **~1.27 million rows** for training  
- 🧪 **~796,000 rows** for testing (from 2017 onward)

That said, the **S&P 500 dataset is smaller overall** than Apple’s in terms of total time span, especially in the early years. So while we still get decent coverage, it may slightly limit the agent’s capacity to generalize deeply — something to keep in mind when analyzing results.

---

### 📈 Next Step

With this setup, we proceed to train a **dedicated DQN + LSTM agent** on the S&P 500 data and test its performance across the future unseen years (2017–2024). This allows us to assess whether specialized agents trained per stock — even with slightly less data — can perform competitively.

In [34]:
# Convert 'Date' column to datetime if not already
data_SP['Date'] = pd.to_datetime(data_SP['Date'])

# Split into separate columns
data_SP['Year'] = data_SP['Date'].dt.year
data_SP['Month'] = data_SP['Date'].dt.month
data_SP['Day'] = data_SP['Date'].dt.day
data_SP['Time'] = data_SP['Date'].dt.time

In [254]:
# Print distinct values of the Year column and their counts
year_counts = data_SP['Year'].value_counts()

print("Distinct Years and Counts:")
for year, count in year_counts.items():
    print(f"Year: {year}, Count: {count}")

Distinct Years and Counts:
Year: 2020, Count: 305366
Year: 2009, Count: 144009
Year: 2010, Count: 142125
Year: 2015, Count: 141870
Year: 2011, Count: 141810
Year: 2016, Count: 141420
Year: 2018, Count: 141090
Year: 2012, Count: 141090
Year: 2017, Count: 141060
Year: 2013, Count: 141052
Year: 2014, Count: 140730
Year: 2019, Count: 140700
Year: 2008, Count: 139707
Year: 2021, Count: 68486


In [35]:
# Normalize the feature columns
features = [
    'EMA_50', 'EMA_200', 'SMA_20', 'Upper_Band', 'Lower_Band', '%K', '%D',
    'MACD_Line', 'Signal_Line', 'OBV'
]
scaler = MinMaxScaler()
data_SP[features] = scaler.fit_transform(data_SP[features])

In [36]:
# Split the dataset into train, validation, and test sets
train_data_SP = data_SP[data_SP['Year'] <= 2016]
test_data_SP = data_SP[data_SP['Year'] >= 2017]

# Print split sizes
print(f"Training Data: {train_data_SP.shape}")
print(f"Testing Data: {test_data_SP.shape}")

Training Data: (1273815, 23)
Testing Data: (796702, 23)


### 💸 Buy-and-Hold ROI Baseline (on test set)

This gives a baseline ROI from buying at the start of 2017 and holding until the end of the dataset.

In [37]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data_SP.iloc[0]['Close']
final_price = test_data_SP.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: -16.76%


### 🧠 Training a DQN + LSTM Agent on the S&P 500

This code block initializes and trains a **Deep Q-Network (DQN)** agent with an **LSTM-based Q-network** on the preprocessed S&P 500 dataset. The architecture and training configuration are kept **identical to the one used for Apple**, so we can directly compare results later.

---

### ⚙️ Core Training Setup

- **Environment**: A `TradingEnv` instance built over the training portion of the S&P 500 data (ending in 2016).
- **Episodes**: Agent is trained for `5` episodes — same as Apple’s agent at the early stages.
- **Fee**: A 0.25% transaction fee is applied on both buys and sells.

---

### 🧪 DQN + LSTM Hyperparameters

| Hyperparameter | Value | Description |
|----------------|-------|-------------|
| `input_dim` | `11` | 10 technical indicators + 1 for shares held |
| `num_actions` | `3` | Discrete actions: 0 = Hold, 1 = Buy $100, 2 = Sell All |
| `hidden_size` | `64` | Size of LSTM’s hidden layer — controls memory capacity |
| `lr` | `1e-3` | Learning rate for optimizer (Adam) |
| `gamma` | `0.99` | Discount factor — favors long-term gains |
| `batch_size` | `8` | Mini-batch size for training from replay buffer |
| `seq_len` | `8` | Length of LSTM sequences — i.e., number of timesteps per sample |
| `buffer_size` | `100000` | Size of replay memory for sampling experiences |
| `epsilon_start` | `1.0` | Initial exploration rate — 100% random actions |
| `epsilon_end` | `0.1` | Minimum exploration (after decay) — 10% random |
| `epsilon_decay_steps` | `1,273,813` | Decay schedule matched to number of training rows |
| `target_update_freq` | `1000` | Frequency (in steps) for updating the target network |

✅ All of these are **identical** to the hyperparameters used for the Apple agent — the goal here is to **keep the learning framework constant** so we can isolate the impact of data and stock differences.

---

### 📦 Summary

This is the first test of a **dedicated agent trained only on S&P 500**, using the exact same DQN + LSTM setup as before. The consistent setup allows fair comparison and will help answer whether the learning architecture generalizes well to new assets when trained specifically on them.

In [271]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 1273813,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [272]:
agent_SP = train_dqn(train_data_SP, num_episodes=5)

Episode 1/5 | Total Invested: $35399100 | Realized Profit: $-160962.21 | Epsilon: 0.100
Episode 2/5 | Total Invested: $35981200 | Realized Profit: $-79926.62 | Epsilon: 0.100
Episode 3/5 | Total Invested: $35913200 | Realized Profit: $-103097.42 | Epsilon: 0.100
Episode 4/5 | Total Invested: $43621600 | Realized Profit: $-36560.30 | Epsilon: 0.100
Episode 5/5 | Total Invested: $71816700 | Realized Profit: $48959.07 | Epsilon: 0.100
Training complete.


In [273]:
profit, invested = evaluate_agent(agent_SP, test_data_SP)

[Test Results] Total Invested: $79668300 | Realized Profit: $-3768706.29


In [38]:
total_invested = 79668300  # Total amount invested
realized_profit = -3768706.29  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: -4.73%


In [274]:
# Define the path where you want to save the agent
save_path = 'dqn_agent_SP.pth'

# Create a dictionary containing all necessary components
torch.save({
    'q_network_state_dict': agent_SP.q_network.state_dict(),
    'target_network_state_dict': agent_SP.target_network.state_dict(),
    'optimizer_state_dict': agent_SP.optimizer.state_dict(),
    'epsilon': agent_SP.epsilon,
    'global_step': agent_SP.global_step,
}, save_path)

print(f"Agent saved successfully at {save_path}")

Agent saved successfully at dqn_agent_SP.pth


### 📉 S&P 500: DQN + LSTM vs. Buy-and-Hold (2017–2024)

After training a dedicated agent on pre-2017 S&P 500 data, we evaluated it on the full unseen future window (2017–2024):

| Strategy         | Total Invested      | Realized Profit       | ROI (%)    |
|------------------|---------------------|------------------------|------------|
| **Buy-and-Hold** | —                   | —                      | **-16.76%** |
| **DQN + LSTM Agent** | $79,668,300          | $-3,768,706.29          | **-4.73%**  |

---

### 📊 Key Observations

- 📉 **Buy-and-hold suffered significant losses**, which reflects the **broad market decline** over this period.
- 🤖 The agent also ended up with a **negative ROI**, but it was **far less severe** — reducing losses by nearly **3x**.
- 💼 The agent still executed a **substantial volume of trades** (nearly **$80M invested**), meaning it actively tried to engage with the market and wasn't overly conservative.
- 🧠 This shows that it **attempted to learn patterns**, took action, and was not simply idle — it just didn’t find profitable opportunities in a difficult market.

---

### 🧠 Why This Still Matters

- Even in a poor market regime, the agent **managed to mitigate damage** — a valuable trait in real-world trading where loss minimization is just as important as profit.
- Its willingness to trade — and relatively lower losses — indicate some level of **risk-sensitive behavior** already learned from prior market structure.

---

### 🔄 What's Next

- Apply the same training structure to **Reliance** and observe if similar behavior emerges.

--- 

Even in loss, this was a **meaningful result** — next stop: Reliance.

## 🧹 Preprocessing Reliance Dataset (Minute-Level)

Before training the agent on **Reliance stock data**, we prep the dataset with temporal features and normalization.

---

### 🗓️ 1. Date Conversion + Feature Extraction

- The `Date` column is converted to `datetime` to enable rich temporal slicing.
- We then **extract `Year`, `Month`, `Day`, and `Time`** into new columns.
  - This allows chronological filtering, e.g., training on older years and testing on future years.

---

### 📊 2. Year Distribution Check

- By printing year-wise counts, we verify **data completeness** and ensure each year has sufficient minute-level entries.
- This also helps choose logical train-test splits.

---

### 📈 3. Feature Normalization

- Technical indicators like EMA, MACD, OBV vary across scales.
- Using `MinMaxScaler`, we normalize all input features to the **[0, 1] range**.
  - This is **critical** for stable learning in neural networks.

Normalized Features:
```
EMA_50, EMA_200, SMA_20, Upper_Band, Lower_Band,
%K, %D, MACD_Line, Signal_Line, OBV
```

---

### 🧪 4. Train-Test Split Based on Years

- **Train Set**: All data from years `<= 2016` (4.26M rows)
- **Test Set**: All data from `>= 2017` onward (3.47M rows)
  - Clear temporal split helps ensure **no leakage** and tests the agent on unseen, future market behavior.
- Because Reliance has **more data available than S&P 500**, we were able to keep a **larger split similar to what worked well on Apple**.

---

### ✅ Summary

| Step | Purpose |
|------|---------|
| `Date` conversion | Enables time-based filtering |
| Year counts       | Validate data distribution |
| Feature scaling   | Normalize indicators for NN input |
| Train/test split  | Evaluate generalization on future Reliance data |

This structure mirrors what you did for S&P 500 and Apple, keeping your pipeline **consistent and scalable across assets**.

---

In [39]:
# Convert 'Date' column to datetime if not already
data_R['Date'] = pd.to_datetime(data_R['Date'])

# Split into separate columns
data_R['Year'] = data_R['Date'].dt.year
data_R['Month'] = data_R['Date'].dt.month
data_R['Day'] = data_R['Date'].dt.day
data_R['Time'] = data_R['Date'].dt.time

In [262]:
# Print distinct values of the Year column and their counts
year_counts = data_R['Year'].value_counts()

print("Distinct Years and Counts:")
for year, count in year_counts.items():
    print(f"Year: {year}, Count: {count}")

Distinct Years and Counts:
Year: 2008, Count: 479511
Year: 2010, Count: 478092
Year: 2023, Count: 478092
Year: 2018, Count: 476652
Year: 2011, Count: 475212
Year: 2015, Count: 475212
Year: 2017, Count: 475212
Year: 2014, Count: 473772
Year: 2021, Count: 472332
Year: 2012, Count: 472332
Year: 2009, Count: 472332
Year: 2020, Count: 472332
Year: 2013, Count: 470893
Year: 2016, Count: 470892
Year: 2019, Count: 468013
Year: 2022, Count: 468012
Year: 2024, Count: 168486


In [40]:
# Normalize the feature columns
features = [
    'EMA_50', 'EMA_200', 'SMA_20', 'Upper_Band', 'Lower_Band', '%K', '%D',
    'MACD_Line', 'Signal_Line', 'OBV'
]
scaler = MinMaxScaler()
data_R[features] = scaler.fit_transform(data_R[features])

In [41]:
# Split the dataset into train, validation, and test sets
train_data_R = data_R[data_R['Year'] <= 2016]
test_data_R = data_R[data_R['Year'] >= 2017]

# Print split sizes
print(f"Training Data: {train_data_R.shape}")
print(f"Testing Data: {test_data_R.shape}")

Training Data: (4268250, 24)
Testing Data: (3479131, 24)


### 💸 Buy-and-Hold ROI Baseline (on test set)

This gives a baseline ROI from buying at the start of 2017 and holding until the end of the dataset.

In [42]:
# Calculate profit: difference between last and first close prices in test_data
initial_price = test_data_R.iloc[0]['Close']
final_price = test_data_R.iloc[-1]['Close']
profit = final_price - initial_price

# Calculate ROI (Return on Investment)
roi = (profit / initial_price) * 100

# Print results
print(f"ROI: {roi:.2f}%")

ROI: 465.99%


### 🧠 DQN + LSTM Agent Training on Reliance Data

This script trains a custom **Deep Q-Network agent with LSTM-based memory** to learn trading behavior on **Reliance stock** using historical minute-level data.

---

### 🏗️ Environment Setup

- A `TradingEnv` instance is created using the `train_data_R` slice.
- It simulates:
  - Discrete **buy/sell/hold** actions
  - **Realistic trading constraints** (e.g., 0.25% fee per transaction)
  - Final position liquidation at the end of episode

---

### ⚙️ DQN Hyperparameters (Same as Apple)

These parameters are tuned based on what worked well in the Apple agent:

| Hyperparameter         | Value         | Explanation |
|------------------------|---------------|-------------|
| `input_dim`            | 11            | 10 indicators + 1 for `shares_held` |
| `num_actions`          | 3             | {0: Hold, 1: Buy, 2: Sell} |
| `hidden_size`          | 64            | LSTM memory size for learning temporal dependencies |
| `lr` (learning rate)   | 1e-3          | Controls how fast the network updates |
| `gamma` (discount)     | 0.99          | Encourages long-term rewards |
| `batch_size`           | 8             | Small batch size for stable LSTM updates |
| `seq_len`              | 8             | Each training sample is an 8-step sequence |
| `buffer_size`          | 100,000       | Replay memory size to store past sequences |
| `epsilon_start`        | 1.0           | Starts fully exploratory |
| `epsilon_end`          | 0.1           | Minimum exploration at convergence |
| `epsilon_decay_steps`  | 4,268,248     | Matches the size of the training set for gradual decay |
| `target_update_freq`   | 1,000         | Frequency of syncing target network with main network |

🧠 These are **identical to the best-performing Apple agent**, ensuring consistency and fair benchmarking.

---

### 🔁 Training Loop

- Runs for **5 episodes** — each one simulates a complete trading pass over the training data.
- Prints:
  - Total capital deployed by the agent
  - Realized profit per episode
  - Current value of epsilon (exploration level)

This gives a quick health check on how the agent is learning without tuning.

---

### 🧪 Evaluation on Unseen Data

- After training, the agent is evaluated on **out-of-sample future data** from 2017 onward.
- This simulates how well the agent **generalizes to unseen Reliance price action**.

In [275]:
def train_dqn(train_data, num_episodes=5):
    """
    Train a DQN on the given training dataframe for num_episodes.
    """
    # Create environment
    train_env = TradingEnv(
        df=train_data,
        start_idx=0,
        end_idx=len(train_data) - 1,
        fee=0.0025
    )
    
    # Hyperparameters 
    config = {
        'input_dim': 11,   # 10 features + 1 shares_held
        'num_actions': 3,  # hold, buy, sell
        'hidden_size': 64,
        'lr': 1e-3,
        'gamma': 0.99,
        'batch_size': 8,
        'seq_len': 8,
        'buffer_size': 100000,
        'epsilon_start': 1.0,
        'epsilon_end': 0.1,
        'epsilon_decay_steps': 4268248,
        'target_update_freq': 1000
    }
    
    agent = DQNAgent(**config)
    
    for episode in range(num_episodes):
        ep_profit, ep_invested = agent.train_one_episode(train_env)
        
        print(f"Episode {episode+1}/{num_episodes} | "
              f"Total Invested: ${ep_invested} | "
              f"Realized Profit: ${ep_profit:.2f} | "
              f"Epsilon: {agent.epsilon:.3f}")
    
    print("Training complete.")
    return agent


In [276]:
agent_R = train_dqn(train_data_R, num_episodes=5)

Episode 1/5 | Total Invested: $111585900 | Realized Profit: $-557616.99 | Epsilon: 0.100
Episode 2/5 | Total Invested: $107535000 | Realized Profit: $-538006.67 | Epsilon: 0.100
Episode 3/5 | Total Invested: $177029500 | Realized Profit: $-884224.52 | Epsilon: 0.100
Episode 4/5 | Total Invested: $173129500 | Realized Profit: $-866105.95 | Epsilon: 0.100
Episode 5/5 | Total Invested: $189078200 | Realized Profit: $-944617.68 | Epsilon: 0.100
Training complete.


In [277]:
profit, invested = evaluate_agent(agent_R, test_data_R)

[Test Results] Total Invested: $23900 | Realized Profit: $109756.41


In [43]:
total_invested = 23900  # Total amount invested
realized_profit = 109756.41  # Profit obtained

# Calculate ROI (Return on Investment)
roi = (realized_profit / total_invested) * 100

# Print results
print(f"ROI: {roi:.2f}%")


ROI: 459.23%


In [278]:
# Define the path where you want to save the agent
save_path = 'dqn_agent_R.pth'

# Create a dictionary containing all necessary components
torch.save({
    'q_network_state_dict': agent_R.q_network.state_dict(),
    'target_network_state_dict': agent_R.target_network.state_dict(),
    'optimizer_state_dict': agent_R.optimizer.state_dict(),
    'epsilon': agent_R.epsilon,
    'global_step': agent_R.global_step,
}, save_path)

print(f"Agent saved successfully at {save_path}")

Agent saved successfully at dqn_agent_R.pth


### 📊 Observations: DQN+LSTM Agent on Reliance

#### 🔁 **Buy-and-Hold Baseline**
- **ROI:** `465.99%`  
This represents the full-term return from passively holding Reliance stock from 2017 onwards — extremely strong long-term growth.

---

#### 🤖 **DQN+LSTM Agent Performance**
- **Test ROI:** `459.23%`  
- **Total Invested:** `$23,900`  
- **Realized Profit:** `$109,756.41`

While the agent **came close to matching** the buy-and-hold ROI, a few important patterns emerge:

---

### 🧠 Key Insights

- **📉 Fewer Trades Compared to Apple & S&P Agents**  
  The total capital deployed ($23.9k) is **dramatically lower** than previous agents — which were often investing in the **millions**.  
  This likely suggests the agent was **much more conservative** in its trading decisions on Reliance.

- **📦 Still Achieved Solid ROI**  
  Despite lower trade volume, the agent delivered a **strong profit**. This hints that it was **more selective but more efficient** per dollar invested.

- **❓ Why Fewer Trades?**  
  Possibly:
  - Reliance may exhibit **different volatility patterns** or less frequent signals based on indicators.
  - The agent might have **learned to avoid noise** or overly aggressive trades.
  - The reward structure (only rewarding on `Sell`) might discourage excessive buying unless confident.

---

### 🧪 Takeaway

The agent **generalized well** on Reliance using the same architecture, but its **trading behavior was noticeably more cautious**.

### 🧪 Next Step: Hardcoded Strategies on the Same Test Sets

To ensure **fair and consistent comparison**, we will now evaluate all the **hardcoded strategies** (e.g., EMA, Bollinger Bands, RSI, MACD, etc.) on the **same test datasets** used for:

- 📉 **S&P 500 Agent** (`test_data_SP`)
- 📈 **Reliance Agent** (`test_data_R`)

This will help us directly compare:
- ❌ Hardcoded Strategy Performance  
- ✅ DQN + LSTM Agent Performance  
- 📦 Buy-and-Hold ROI  

All under **identical market conditions and timeframes**.

Let’s see how the traditional rule-based approaches hold up. 

## Trend (EMA)

In [279]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [280]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,41568300,1251001.0,3.009507
1,0.03%,41568300,-207581.7,-0.499375
2,1.00%,41568300,899713.8,2.164423
3,3.00%,41568300,983541.6,2.366086
4,5.00%,41568300,992084.0,2.386636
5,10.00%,41568300,1119667.0,2.693559
6,15.00%,41568300,1206205.0,2.901743
7,20.00%,41568300,1241240.0,2.986026


In [281]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,283138200,14070960.0,4.969644
1,0.03%,283138200,-1413921.0,-0.499375
2,1.00%,283138200,6267507.0,2.213586
3,3.00%,283138200,11060120.0,3.90626
4,5.00%,283138200,14211080.0,5.01913
5,10.00%,283138200,14070960.0,4.969644
6,15.00%,283138200,14070960.0,4.969644
7,20.00%,283138200,14070960.0,4.969644


## Mean Reversion (Bollinger Band)

In [282]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_mean_reversion_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [283]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


In [284]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


## RSI

In [285]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_stochastics_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [286]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,14164900,-43422.513733,-0.30655
1,0.03%,14164900,-70735.969375,-0.499375
2,1.00%,14164900,-49602.003863,-0.350175
3,3.00%,14164900,-43407.817168,-0.306446
4,5.00%,14164900,-43418.303711,-0.30652
5,10.00%,14164900,-43415.997505,-0.306504
6,15.00%,14164900,-43403.567684,-0.306416
7,20.00%,14164900,-43428.804583,-0.306595


In [287]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,53446400,-552095.475411,-1.032989
1,0.03%,53446400,-266897.96,-0.499375
2,1.00%,53446400,-167221.212011,-0.312876
3,3.00%,53446400,-338528.939692,-0.633399
4,5.00%,53446400,-444171.745343,-0.83106
5,10.00%,53446400,-410285.581533,-0.767658
6,15.00%,53446400,-472112.221258,-0.883338
7,20.00%,53446400,-562549.434338,-1.052549


## MACD

In [288]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [289]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,14021300,-63824.357762,-0.455196
1,0.03%,14021300,-70018.866875,-0.499375
2,1.00%,14021300,-63768.555721,-0.454798
3,3.00%,14021300,-63827.997924,-0.455222
4,5.00%,14021300,-63825.078625,-0.455201
5,10.00%,14021300,-63822.783177,-0.455184
6,15.00%,14021300,-63825.089462,-0.455201
7,20.00%,14021300,-63824.612781,-0.455198


In [290]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,214062100,-3131573.0,-1.462927
1,0.03%,214062100,-1068973.0,-0.499375
2,1.00%,214062100,-1812665.0,-0.846794
3,3.00%,214062100,-2754033.0,-1.286558
4,5.00%,214062100,-3023413.0,-1.4124
5,10.00%,214062100,-3131573.0,-1.462927
6,15.00%,214062100,-3131573.0,-1.462927
7,20.00%,214062100,-3131573.0,-1.462927


## Volume

In [291]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [292]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,5085100,52905.191051,1.040396
1,0.03%,5085100,-25393.718125,-0.499375
2,1.00%,5085100,14406.969565,0.283317
3,3.00%,5085100,21751.590025,0.427751
4,5.00%,5085100,22727.784802,0.446949
5,10.00%,5085100,30772.93111,0.605159
6,15.00%,5085100,47944.924996,0.942851
7,20.00%,5085100,51673.054771,1.016166


In [293]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,150649600,4202512.0,2.789594
1,0.03%,150649600,-752306.4,-0.499375
2,1.00%,150649600,1498652.0,0.994793
3,3.00%,150649600,3128352.0,2.076575
4,5.00%,150649600,4838516.0,3.211768
5,10.00%,150649600,4193694.0,2.78374
6,15.00%,150649600,4127412.0,2.739743
7,20.00%,150649600,4202512.0,2.789594


## EMA + MACD

In [294]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [295]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,5085100,52905.191051,1.040396
1,0.03%,5085100,-25393.718125,-0.499375
2,1.00%,5085100,14406.969565,0.283317
3,3.00%,5085100,21751.590025,0.427751
4,5.00%,5085100,22727.784802,0.446949
5,10.00%,5085100,30772.93111,0.605159
6,15.00%,5085100,47944.924996,0.942851
7,20.00%,5085100,51673.054771,1.016166


In [296]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,150649600,4202512.0,2.789594
1,0.03%,150649600,-752306.4,-0.499375
2,1.00%,150649600,1498652.0,0.994793
3,3.00%,150649600,3128352.0,2.076575
4,5.00%,150649600,4838516.0,3.211768
5,10.00%,150649600,4193694.0,2.78374
6,15.00%,150649600,4127412.0,2.739743
7,20.00%,150649600,4202512.0,2.789594


## EMA + BB

In [297]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_bollinger_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [298]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


In [299]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


## EMA + MACD + OBV

In [300]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_ema_macd_obv_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [301]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,2696300,24160.31666,0.896054
1,0.03%,2696300,-13464.648125,-0.499375
2,1.00%,2696300,1825.699936,0.067711
3,3.00%,2696300,7460.923054,0.27671
4,5.00%,2696300,8135.773854,0.301738
5,10.00%,2696300,11862.699849,0.439962
6,15.00%,2696300,21539.185904,0.798842
7,20.00%,2696300,23284.436283,0.86357


In [302]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,44578900,5669171.0,12.717161
1,0.03%,44578900,-222615.9,-0.499375
2,1.00%,44578900,3226199.0,7.237053
3,3.00%,44578900,4565208.0,10.240737
4,5.00%,44578900,4845698.0,10.869935
5,10.00%,44578900,5519593.0,12.381627
6,15.00%,44578900,5521392.0,12.385662
7,20.00%,44578900,5669228.0,12.71729


## EMA + Bollinger + MACD + OBV

In [303]:
# List of stop loss percentages
stop_loss_values = [1.0, 0.0003, 0.01, 0.03, 0.05, 0.10, 0.15, 0.20]

# Ensure datasets are explicit copies
datasets = {
    "Agent_SP": test_data_SP.copy(),
    "Agent_R" : test_data_R.copy()
}

# Initialize a dictionary to store results for each dataset
all_results = {}

# Iterate over each dataset
for dataset_name, dataset in datasets.items():
    # Initialize an empty list to store results for the current dataset
    results = []

    # Iterate over each stop loss value
    for stop_loss in stop_loss_values:
        # Call the strategy function for the current stop loss value
        investment, profit = optimized_4_indicator_strategy_with_numba(dataset, stop_loss_pct=stop_loss)

        # Calculate ROI
        roi = (profit / investment) * 100 if investment != 0 else 0

        # Determine stop-loss label
        stop_loss_label = "No Stop Loss" if stop_loss == 1.0 else f"{stop_loss * 100:.2f}%"

        # Append the results as a dictionary
        results.append({
            "Stop Loss": stop_loss_label,  # Use label for stop loss
            "Total Investment ($)": investment,
            "Total Profit ($)": profit,
            "ROI (%)": roi
        })

    # Convert the results into a DataFrame and store it in the dictionary
    all_results[dataset_name] = pd.DataFrame(results)

# Access the results for each dataset
results_Agent_SP = all_results["Agent_SP"]
results_Agent_R = all_results["Agent_R"]


In [304]:
results_Agent_SP 

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


In [305]:
results_Agent_R

Unnamed: 0,Stop Loss,Total Investment ($),Total Profit ($),ROI (%)
0,No Stop Loss,0,0.0,0
1,0.03%,0,0.0,0
2,1.00%,0,0.0,0
3,3.00%,0,0.0,0
4,5.00%,0,0.0,0
5,10.00%,0,0.0,0
6,15.00%,0,0.0,0
7,20.00%,0,0.0,0


### 📊 Hardcoded Strategies on Agent Test Sets: Summary

To complete the evaluation cycle, we ran **all 9 hardcoded strategies** (with and without stop-loss variants) on the **same test sets** used for our **S&P 500** and **Reliance** agents.

#### 🔁 Recap:
- Total of **18 evaluations** (9 strategies × 2 assets)
- Strategies included: Trend (EMA), Mean Reversion (Bollinger Bands), Momentum (MACD), RSI, OBV, and combinations

---

### 🧠 Key Observations:

- 💸 **Most strategies did make trades**, often investing sizable capital.
- 📉 **ROI typically ranged between -1% to +1%**, with rare exceptions.
- ⚠️ **More complex combinations (3+ indicators)** often resulted in **very few or no trades** at all — signaling **over-constrained logic**.
- ✅ **Buy-and-hold and the RL agents clearly outperformed**:
  - **S&P 500**:
    - Buy-and-hold ROI: **41.50%**
    - Agent ROI: **6.38%**
    - Hardcoded: ~Flat or negative in most cases
  - **Reliance**:
    - Buy-and-hold ROI: **330.87%**
    - Agent ROI: **459.23%**
    - Hardcoded: Mixed, mostly flat or negative

---

### 🏁 Conclusion

The **rule-based hardcoded strategies fail to generalize** well on these test sets — often trading too much with poor reward or barely trading at all.  
Compared to that, the **learned RL agents and even buy-and-hold** demonstrate significantly better risk-reward characteristics over long time horizons.


# ✅ Initial Conclusions: Comparing Agents, Baselines, and Strategies

After extensive experimentation, here are the key takeaways based on three real-world assets — **Apple**, **S&P 500**, and **Reliance**:

---

### ⚔️ DQN + LSTM vs. PPO: A Clear Winner

While PPO is a powerful policy-gradient algorithm, in this specific setup it **struggled to learn stable and effective trading behavior** — despite multiple architectural tweaks, hyperparameter variations, and training durations.

On the other hand, the **DQN + LSTM agent demonstrated consistent improvement**:
- It **successfully traded** across different market regimes.
- Captured **temporal dependencies** effectively.
- Outperformed or closely tracked the **buy-and-hold baseline** on multiple occasions.

> 📌 **Going forward, the DQN + LSTM framework remains the foundation** for all further experimentation.

---

### 📉 Hardcoded Strategies Underperform

Despite incorporating a wide range of technical indicators and thresholds, **none of the rule-based strategies showed convincing performance**:
- Many returned **minimal or negative ROI** even with large capital invested.
- More complex indicator combinations often led to **zero trades** due to overly restrictive triggers.
- In all cases, both the **RL agent** and **buy-and-hold** baseline **significantly outperformed** these heuristics.

> 🚫 These results reinforce that **hardcoded logic lacks adaptability** and is not competitive in dynamic markets.

---

### 📊 Case-by-Case Performance Recap

| Asset      | Buy-and-Hold ROI | Agent ROI | Amount Traded ($) | Result Summary |
|------------|------------------|-----------|--------------------|----------------|
| **Apple**  | 578.28%          | 587.88%   | 751,500            | Agent kept pace with long-term growth |
| **S&P 500**| -16.76%          | -4.73%    | 79,668,300         | Agent reduced downside risk |
| **Reliance** | 465.99%        | 459.23%   | 23,900             | Agent nearly matched passive returns |

> 📌 **Also worth noting**: While I did not explicitly show it for every agent for brevity, it was consistently observed that **all three agents improved noticeably as the size of the training dataset increased** — reinforcing the importance of long, diverse historical exposure.

---

### 🔍 Key Implications & Interpretations

- 🧠 **Agent Behavior Varies by Asset**  
  The amount of capital deployed and frequency of trades differed significantly across agents, despite using the same architecture.  
  For example, the **Reliance agent barely traded**, while the **S&P 500 agent invested heavily** — a sign of learned market-specific behaviors.

- 📈 **Outperformance Isn't Guaranteed**  
  While the Apple agent slightly beat buy-and-hold, S&P and Reliance agents **tracked or slightly lagged**.  
  This raises important questions about:
  - Whether **all relevant price information is already embedded** (supporting the **Efficient Market Hypothesis**).
  - Whether **additional alpha** can be extracted using more nuanced data or richer state representations.

- 🛡️ **Risk Management Emerges Naturally**  
  Even with no explicit risk modeling, the S&P 500 agent avoided deeper losses than passive investing — possibly learning to **reduce exposure** in poor conditions.

- 🔄 **Generalization Is Possible, but Not Uniform**  
  The best Apple agent transferred decently to the S&P 500, but **failed to trade Reliance at all** — suggesting that **stock-specific retraining is still necessary**.

---

This phase of the project demonstrates that **learning-based agents can meaningfully participate in financial markets**, matching or improving upon passive strategies, while reacting dynamically to structure and context.

It also sets the stage for deeper exploration into **why** agents behave as they do, **how** they encode market conditions, and **what conditions enable superior performance**.

---

Next up: a roadmap for future work — including ideas to extend, prove, and challenge everything we’ve learned so far.

## 🚀 Future Work

This project demonstrated a full pipeline for applying deep reinforcement learning (DQN + LSTM) to algorithmic trading using a realistic trading simulator and a diverse set of market indicators. While the results were promising, especially when compared to hardcoded strategies, there are several critical areas where future work could significantly extend both the depth and breadth of this exploration.

---

### 💡 1. **Enhanced DQN Variants (e.g., Double DQN, Dueling DQN)**

- **Double DQN**: The current agent suffers from Q-value overestimation due to the standard max operator in the TD target:
  $$
  \text{TD target} = r + \gamma \max_{a'} Q_{\text{target}}(s', a')
  $$
  Double DQN decouples action selection and evaluation by using the online network to select the action and the target network to evaluate it:
  $$
  \text{TD target} = r + \gamma Q_{\text{target}}(s', \arg\max_a Q_{\text{online}}(s', a))
  $$
  ✅ This could lead to more **stable learning**, especially in noisy environments like financial markets.

- **Dueling DQN**: By separating the estimation of the state-value and advantage function, dueling architectures allow the network to **learn which states are valuable, even when actions don't differ much**, which often occurs in sideways markets.

---

### 🌀 2. **n-Step Returns Instead of 1-Step TD Learning**

The current setup uses **1-step Temporal Difference updates**, which can be shortsighted in financial contexts where **profits and losses often realize after several steps**.

- ✅ **n-step returns** (e.g., 5-step or 10-step) allow the agent to better **capture medium-term consequences** of actions, smoothing out reward noise and improving credit assignment.

- This change would be especially important during **bull runs or slow reversals**, where the profit isn’t immediate after a Buy.

---

### 🤖 3. **Actor-Critic and Parallelized Learning Agents**

DQN is inherently **off-policy and value-based**, which limits flexibility in some trading contexts. Future directions could include:

- **A3C (Asynchronous Advantage Actor-Critic)**: Multiple agents explore in parallel, updating a shared global policy. This can **improve exploration**, diversity of experiences, and training speed.

- **SAC (Soft Actor-Critic)**: A more recent method that optimizes for **entropy-regularized returns**, potentially leading to better exploration and more robust policies in noisy, stochastic markets.

- **PPO (Properly Implemented)**: While PPO underperformed here, it may still hold potential in more continuous action variants or with better tuned architectures. Actor-critic agents also **natively model policy uncertainty**, which could be beneficial in volatile markets.

---

### 🧠 4. **Richer Feature Sets & Meta Information**

Currently, the state vector includes 10 technical indicators + `shares_held`. But markets are influenced by:

- News sentiment
- Sector or macro indicators
- Volume patterns beyond OBV
- Time-of-day or weekday effects (especially for intraday data)

➡️ Future models could experiment with **attention-based input modules** or **feature selection layers** to automatically adapt to the most predictive indicators.

---

### 🧪 5. **Synthetic Data Generation for Simulated Backtesting**

One of the most significant limitations was the **availability of quality minute-level data**, especially for diverse assets.

To overcome this:

- 🧰 Develop or train a **realistic stock data simulator**, conditioned on:
  - Volatility regimes
  - Asset class (growth vs value)
  - Trending vs mean-reverting behavior
- This would allow:
  - Robust testing across extreme events (e.g., crashes, bubbles)
  - Multi-agent simulations
  - Portfolio-level strategy testing

Potential ideas include using **GANs or autoregressive models** to generate time series that mirror real stock dynamics.

---

### ⚠️ 6. **Position Sizing & Risk Management Logic**

The current agent always buys $100 and sells all at once. In practice:

- Position sizing depends on volatility, conviction, or recent profits.
- Risk-adjusted returns (e.g., Sharpe, Sortino ratios) are critical in evaluating strategies.

✅ Future environments could incorporate **soft action spaces**, allowing the agent to learn **how much to buy or sell**, not just when.

---

### 🧪 7. **Market Impact, Slippage & Realism Enhancements**

The current simulator assumes:
- Perfect execution at the closing price
- No impact from the agent’s trades
- No liquidity constraints

For higher fidelity:
- Add **slippage models** and **execution delays**
- Limit trading during illiquid periods
- Incorporate **spread-aware pricing**

These changes would encourage agents to **plan ahead**, optimizing for trade efficiency — not just profitability.

---

### 📈 8. **Transfer Learning Across Assets**

The Apple-trained agent showed varying levels of generalization when applied to S&P 500 and Reliance.

➡️ Future experiments could:
- Explicitly **fine-tune agents** across assets
- Use **meta-learning** to create a generalist agent that can quickly adapt to new stocks
- Introduce **domain adaptation** layers to transfer knowledge between correlated stocks (e.g., tech stocks)

### 9.🔍 **A final insight worth noting**:  

While temporal-difference methods like 1-step and n-step TD help propagate outcomes backward for faster learning, they don't directly answer the deeper question of **"What should I have done instead?"**  
- To explore this, future work could investigate **counterfactual learning** or **causal reinforcement learning**, which aim to model and learn from *alternate decisions* — allowing the agent to reason not just about *what happened*, but *what could have happened* if it acted differently.  
- This is especially valuable in trading, where timing and action selection can drastically change outcomes, and understanding the *missed opportunities* is just as important as exploiting known ones.