# 👩‍💻 Simulating a Real-World Task Using RL

## 📋 Overview
In this lab, you'll train a DQN (Deep Q-Network) agent to trade a stock using daily price data. The exemplar solution generates a synthetic price series, and you can substitute real historical prices. This hands-on activity introduces the core elements of training an RL agent: creating the trading environment, defining the reward structure, and implementing the DQN algorithm. By the end of this lab, you will have practical experience applying reinforcement learning to financial data and will understand the challenges of, and strategies for, effective model training and deployment.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Set up a stock trading environment for an RL agent
- ✅ Implement a DQN agent for learning stock trading strategies
- ✅ Train and evaluate the DQN agent using stock price data (synthetic or real)
- ✅ Analyze agent performance and explore improvements

## Task 1: Environment Setup

**Context:** The trading environment defines what your RL agent observes, which actions it can take, and how it is rewarded, so it is the first thing to build.

**Steps:**

1. Create a `TradingEnv` class that represents the stock trading environment.
2. Define the states (e.g., price, simple moving average), actions (hold, buy, sell), and reward structure (profit calculation).

```python
# Task 1: Environment Setup

# imports
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Your code here...
```

💡 **Tip:** Keep the state small, for example the current price, a short simple moving average, and the current position (flat or long).

⚙️ **Test Your Work:**
- Print the initial state of the environment.

**Expected output:** The starting state with initial values for price, position, etc.
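As a rough illustration of the state described in the steps above, the sketch below builds a `[price, SMA, position]` vector from a price array. The helper name `make_state`, the 5-step window, and the sample prices are assumptions made for this example, not requirements of the lab.

```python
import numpy as np

def make_state(prices, t, position, window=5):
    """Return [price, SMA over the last `window` prices, position] at step t."""
    start = max(0, t - window + 1)  # the window is shorter near the start of the series
    sma = float(np.mean(prices[start:t + 1]))
    return np.array([prices[t], sma, position], dtype=np.float32)

# Made-up prices, used only to show the shape of the state
prices = np.array([100.0, 101.0, 99.0, 102.0, 104.0, 103.0], dtype=np.float32)

initial_state = make_state(prices, t=0, position=0)  # price 100, SMA 100, no position
later_state = make_state(prices, t=5, position=1)    # SMA over the last 5 prices, long

print(initial_state)
print(later_state)
```

At `t=0` the moving average equals the price itself because only one observation is available. Your environment's `reset()` would typically return a state like `initial_state`.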

## Task 2: Implement Replay Buffer

**Context:** A replay buffer stores past experiences so that training can sample them at random, which reduces the correlation between consecutive updates and makes training more stable.

**Steps:**

1. Define a `ReplayBuffer` class to manage storing and sampling of experiences during training.
2. Ensure buffer operations such as `push()` and `sample()` are implemented correctly.

```python
# Task 2: Implement Replay Buffer
```

💡 **Tip:** Choose a buffer capacity (e.g., 1000) and a batch size (e.g., 32), and start sampling only once the buffer holds at least one full batch.

⚙️ **Test Your Work:**
- Print the contents of the replay buffer after a few steps.

**Expected output:** A list of stored experiences.
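One possible way to get fixed-capacity behaviour is `collections.deque` with `maxlen`, which drops the oldest item automatically. The class name `DequeReplayBuffer` and the toy integer states below are assumptions for illustration. This is a sketch of one design, not the only valid one.

```python
import random
from collections import deque

class DequeReplayBuffer:
    """Fixed-capacity buffer; the oldest experience is dropped once it is full."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # random.sample raises ValueError if batch_size > len(self.buffer)
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose a list of experiences into five tuples (one per field)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

# Toy usage: integers stand in for real state vectors
buffer = DequeReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=float(step), next_state=step + 1, done=False)

print(list(buffer.buffer))  # only the 3 most recent experiences remain
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(states)
```

Guarding the call with `len(buffer) >= batch_size` in the training loop avoids the `ValueError` noted in the comment.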

## Task 3: Implement DQN Network

**Context:** The DQN network approximates the Q-values for state-action pairs.

**Steps:**

1. Define a `DQN` class using `torch.nn` to create a neural network.
2. Implement the forward pass that returns Q-values for given states.

```python
# Task 3: Implement DQN Network
```

💡 **Tip:** Use `torch.nn.Linear` for defining fully connected layers, and `torch.optim.Adam` for optimization.

⚙️ **Test Your Work:**
- Print the network architecture and output for a sample input.

**Expected output:** Q-values for the input state.
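To make the expected shapes concrete, here is a framework-free NumPy sketch of what a small fully connected Q-network computes. It is not the `torch.nn` model the task asks you to build. The layer sizes, the random weights, and the function name `q_forward` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden_dim, n_actions = 3, 64, 3  # illustrative sizes

# Randomly initialised weights stand in for trained parameters
W1 = rng.normal(scale=0.1, size=(state_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_actions))
b2 = np.zeros(n_actions)

def q_forward(states):
    """Map a batch of states (batch, state_dim) to Q-values (batch, n_actions)."""
    hidden = np.maximum(0.0, states @ W1 + b1)  # linear layer followed by ReLU
    return hidden @ W2 + b2                     # one Q-value per action

# Two made-up states: [price, SMA, position]
batch = np.array([[100.0, 99.5, 0.0],
                  [101.0, 100.2, 1.0]])
q_values = q_forward(batch)
greedy_actions = q_values.argmax(axis=1)  # index of the highest Q-value per state

print(q_values.shape)
print(greedy_actions)
```

Whatever architecture you choose, the output should have one row per input state and one column per action (hold, buy, sell).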

## Task 4: Train the DQN Agent

**Context:** Training the DQN agent involves simulating episodes and updating Q-values.

**Steps:**

1. Implement the training loop to simulate episodes, perform action selection, and update the Q-values.
2. Use epsilon-greedy action selection, experience replay, and periodic target network updates.
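The Q-value update regresses each predicted Q-value toward a bootstrapped target, `reward + gamma * max Q(next_state)`, with the bootstrap term dropped on terminal transitions. The NumPy sketch below shows that arithmetic on made-up numbers. The arrays are illustrative, and in the lab `next_q` would come from the target network.

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.0, 2.5, -1.0])
dones = np.array([0.0, 0.0, 1.0])    # 1.0 marks a terminal transition
next_q = np.array([[0.2, 1.0, 0.5],  # Q-values of the next state, one row per sample
                   [0.0, 0.3, 0.1],
                   [4.0, 5.0, 6.0]])

max_next_q = next_q.max(axis=1)                         # best achievable next-state value
targets = rewards + gamma * max_next_q * (1.0 - dones)  # no bootstrapping when done

print(targets)
```

The third target equals the reward alone because the episode ended on that transition.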

```python
# Task 4: Train the DQN Agent
```

💡 **Tip:** Track total rewards and portfolio values over episodes for evaluation.

⚙️ **Test Your Work:**
- Print the total rewards and epsilon values for each episode.

**Expected output:** Training progress with rewards and decay in exploration rate.
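To get a feel for how quickly exploration fades, the sketch below computes a multiplicative epsilon schedule and a plain-Python epsilon-greedy choice. The helper names `epsilon_schedule` and `select_action` are hypothetical. The default values mirror the exemplar's `train_dqn` signature but are otherwise just one reasonable choice.

```python
import random

def epsilon_schedule(episodes, start=1.0, end=0.05, decay=0.995):
    """Epsilon used in each episode under multiplicative decay with a floor."""
    eps, values = start, []
    for _ in range(episodes):
        values.append(eps)
        eps = max(end, eps * decay)
    return values

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

schedule = epsilon_schedule(episodes=50)
print(f"first: {schedule[0]:.3f}, last: {schedule[-1]:.3f}")

greedy = select_action([0.1, 0.7, 0.2], epsilon=0.0)  # always picks index 1
print(greedy)
```

With a decay of 0.995, epsilon is still roughly 0.78 after 50 episodes, so the agent acts mostly at random for the whole run. A faster decay or more episodes may be worth trying if the rewards look noisy.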

## Task 5: Analyze and Visualize Results

**Context:** Analyzing performance helps evaluate how well the agent learned the trading strategy.

**Steps:**

1. Plot the portfolio values over episodes to visualize performance trends.

2. Evaluate the effectiveness of the trading strategy learned by the agent.

```python
# Task 5: Analyze and Visualize Results
```

💡 **Tip:** Use `matplotlib` for plotting portfolio value trends over time.

⚙️ **Test Your Work:**
- Display a plot of the portfolio values over episodes.

**Expected output:** A clear plot of the agent's portfolio value over time. Growth is not guaranteed, particularly after a short training run.
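Alongside the plot, a numeric summary can make trends easier to judge. The sketch below assumes the training loop returns one list of portfolio values per episode, as the exemplar's `train_dqn` does. The function name `summarize_episodes` and the sample histories are made up for illustration.

```python
def summarize_episodes(portfolio_histories, window=5):
    """Final portfolio value per episode plus a trailing moving average."""
    finals = [history[-1] for history in portfolio_histories if history]
    smoothed = []
    for i in range(len(finals)):
        recent = finals[max(0, i - window + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    return finals, smoothed

# Made-up histories purely to show the data shape: one list of values per episode
histories = [[0.0, -1.0, -2.0], [0.0, 1.0, 0.5], [0.0, 2.0, 3.5]]
finals, smoothed = summarize_episodes(histories, window=2)

print(finals)    # [-2.0, 0.5, 3.5]
print(smoothed)  # [-2.0, -0.75, 2.0]
```

Plotting `smoothed` against the episode index is one way to see whether end-of-episode value tends to improve as training progresses.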

### ✅ Success Checklist

- Successfully set up the TradingEnv class with states, actions, and rewards
- Implemented and tested the ReplayBuffer class for storing experiences
- Defined and tested the DQN neural network for Q-value approximation
- Trained the DQN agent through simulated episodes
- Analyzed and visualized the agent's performance on the stock price data

### 🔍 Common Issues & Solutions

**Problem:** Incorrect state or action updates in the environment.   
**Solution:** Verify the logic in the `step()` function and ensure states and rewards are correctly updated.  

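**Problem:** Replay buffer errors, such as sampling more experiences than the buffer holds.   
**Solution:** Check buffer capacity, ensure experiences are added and sampled correctly, and call `sample()` only once the buffer contains at least `batch_size` experiences.  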

**Problem:** DQN network not learning.   
**Solution:** Adjust learning parameters (e.g., learning rate, gamma) and verify the training loop implementation.

### 🔑 Key Points

- Setting up the trading environment correctly is crucial for effective RL training.
- Replay buffers help stabilize training through experience replay.
- DQN agents use neural networks to approximate Q-values and learn optimal strategies.
- Analyzing training performance helps refine and improve agent strategies.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


# Synthetic price generator (simple geometric random walk)
def generate_random_walk(length=252,            # ~1 trading year
                         start_price=100.0,
                         mu=0.0005,             # daily drift (≈12% annualised)
                         sigma=0.01,            # daily volatility (≈16% annualised)
                         seed=None):
    """
    Generate `length` synthetic prices using a geometric random walk.

    Price_{t+1} = Price_t * exp( (mu - 0.5*sigma^2) + sigma * Z_t )
    where Z_t ~ N(0, 1).

    Returns
    -------
    np.ndarray[float32]  shape=(length,)
        Simulated close prices.
    """
    if seed is not None:
        np.random.seed(seed)

    # Pre-draw standard normals
    z = np.random.normal(size=length - 1)
    log_returns = (mu - 0.5 * sigma**2) + sigma * z
    prices = np.empty(length, dtype=np.float32)
    prices[0] = start_price
    prices[1:] = start_price * np.exp(np.cumsum(log_returns))
    return prices


# Environment
class TradingEnv:
    def __init__(self, prices):
        self.prices = prices
        self.reset()

    def reset(self):
        self.t = 0
        self.position = 0  # 0 = no position, 1 = long
        self.entry_price = 0.0
        self.cash = 0.0
        self.portfolio_value = []
        return self._get_state()

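    def _get_state(self):
        price = float(self.prices[self.t])
        # Simple moving average over the current price and up to 5 previous prices
        sma = float(np.mean(self.prices[max(0, self.t - 5):self.t + 1]))
        # State vector: [current price, moving average, position flag]
        return np.array([price, sma, float(self.position)], dtype=np.float32)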

    def step(self, action):
        # Guard: episode done
        if self.t >= len(self.prices) - 1:
            return np.zeros(3, dtype=np.float32), 0.0, True

        price = self.prices[self.t]
        reward = 0.0

        # Actions: 0 = hold, 1 = buy, 2 = sell
        if action == 1 and self.position == 0:  # Buy
            self.position = 1
            self.entry_price = price
        elif action == 2 and self.position == 1:  # Sell
            reward = price - self.entry_price
            self.position = 0
            self.cash += reward

        # Track portfolio value (cash + current position value)
        current_pos_value = price - self.entry_price if self.position == 1 else 0.0
        total_value = self.cash + current_pos_value
        self.portfolio_value.append(total_value)

        self.t += 1
        done = self.t >= len(self.prices) - 1

        next_state = np.zeros(3, dtype=np.float32) if done else self._get_state()
        return next_state, float(reward), done


# Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0

    def push(self, state, action, reward, next_state, done):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.pos] = (state, action, reward, next_state, done)
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


# DQN Network
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, output_dim)
        )

    def forward(self, x):
        return self.net(x)


# Training function
def train_dqn(prices, episodes=100, gamma=0.99,
              epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.995,
              batch_size=32):

    env = TradingEnv(prices)
    input_dim = 3
    output_dim = 3  # hold, buy, sell

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    policy_net = DQN(input_dim, output_dim).to(device)
    target_net = DQN(input_dim, output_dim).to(device)
    target_net.load_state_dict(policy_net.state_dict())
    target_net.eval()

    optimizer = optim.Adam(policy_net.parameters(), lr=0.001)
    replay_buffer = ReplayBuffer()

    epsilon = epsilon_start
    portfolio_values_over_episodes = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0.0

        while True:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randint(0, output_dim - 1)
            else:
                state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device)
                with torch.no_grad():
                    q_values = policy_net(state_t)
                action = q_values.argmax().item()

            next_state, reward, done = env.step(action)
            replay_buffer.push(state, action, reward, next_state, done)

            state = next_state
            total_reward += reward

            # Training step
            if len(replay_buffer) >= batch_size:
                states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

                states = torch.tensor(np.vstack(states), dtype=torch.float32).to(device)
                actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1).to(device)
                rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1).to(device)
                next_states = torch.tensor(np.vstack(next_states), dtype=torch.float32).to(device)
                dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1).to(device)

                q_values = policy_net(states).gather(1, actions)
                with torch.no_grad():
                    next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + gamma * next_q_values * (1 - dones)

                loss = nn.MSELoss()(q_values, target_q_values)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if done:
                break

        # Decay epsilon
        epsilon = max(epsilon_end, epsilon * epsilon_decay)

        # Update target network every 10 episodes
        if episode % 10 == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Store portfolio value history for this episode
        portfolio_values_over_episodes.append(env.portfolio_value.copy())

        print(f"Episode {episode + 1}/{episodes} │ Total Reward: {total_reward:.2f} │ Epsilon: {epsilon:.3f}")

    return portfolio_values_over_episodes


# Main execution
if __name__ == "__main__":
    # Simulate one year (252 trading days) of prices
    prices = generate_random_walk(length=252, start_price=100.0, seed=42)

    # Train the DQN agent
    portfolio_values = train_dqn(prices, episodes=50)

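    # Plot the price series and the final episode's portfolio value (Task 5)
    fig, (ax_price, ax_value) = plt.subplots(2, 1, sharex=True)
    ax_price.plot(prices)
    ax_price.set_title("Synthetic Price Series (Random Walk)")
    ax_price.set_ylabel("Price")
    ax_value.plot(portfolio_values[-1])
    ax_value.set_title("Portfolio Value (Final Episode)")
    ax_value.set_xlabel("Day")
    ax_value.set_ylabel("Cumulative P&L")
    plt.tight_layout()
    plt.show()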
```