# 👩‍💻 Train Your First Q-Learning and REINFORCE Agents

## 📋 Overview
In this lab, you'll train two reinforcement learning agents in a simple GridWorld environment: one using Q-Learning and one using the REINFORCE algorithm. Q-Learning is a value-based method, while REINFORCE is a policy-based method. By implementing both and comparing their results, you'll see how each one learns and makes decisions, and you'll be better placed to choose between them for other RL problems.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Set up and simulate an RL environment like GridWorld
- ✅ Implement and train a Q-Learning agent
- ✅ Implement and train a REINFORCE agent
- ✅ Compare and analyze the performance of value-based and policy-based RL techniques

## Task 1: Environment Setup

**Context:** Setting up the GridWorld environment is the first step for your RL agents.

**Steps:**

1. Create a simple GridWorld-like environment for your agents to explore.
2. Define basic states, actions, and a reward structure with a goal for agents to reach.
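
One way to make the states and actions in the steps above concrete is to store positions as `(row, col)` tuples and map each action name to a movement offset. The sketch below is only an illustration: the names `GRID_SIZE`, `ACTIONS`, `to_index`, and `move` are hypothetical, and the four-action layout is an assumption rather than a requirement of the lab.

```python
# Hypothetical helpers for a 5x5 grid; all names here are illustrative assumptions.
GRID_SIZE = 5

# Each action maps to a (row, col) offset.
ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def to_index(state, size=GRID_SIZE):
    """Flatten a (row, col) position into a single integer, e.g. for a Q-table row."""
    row, col = state
    return row * size + col


def move(state, action, size=GRID_SIZE):
    """Apply an action and clamp the result so the agent stays on the grid."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), size - 1)
    col = min(max(state[1] + d_col, 0), size - 1)
    return (row, col)


print(move((0, 0), "right"))  # (0, 1)
print(move((0, 0), "up"))     # (0, 0): blocked by the top edge
print(to_index((4, 4)))       # 24
```

Clamping keeps every action legal in every cell, which keeps the action space the same size everywhere and simplifies a tabular agent.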

In [None]:
# imports
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Your code here...

💡 **Tip:** Use a grid size of 5x5 for simplicity.

⚙️ **Test Your Work:**
- Print the initial state of the environment.

**Expected output:** The starting position of the agent on the grid.

## Task 2: Implement Q-Learning Agent

**Context:** Q-Learning is a value-based method: the agent learns an estimate of the expected discounted return for each state-action pair and uses those estimates to choose actions.

**Steps:**

1. Set up a Q-table and implement Q-Learning, allowing the agent to learn the optimal actions through experience-driven updates.
2. Initialize exploration parameters, implement the learning loop, and run multiple episodes to refine the Q-table.
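
The experience-driven update in step 1 is the standard Q-Learning rule, `Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))`. A minimal sketch of a single update follows. The function name `q_update`, the `done` flag, and the list-of-lists table layout are assumptions made for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-Learning update in place and return the new value."""
    # A terminal next state contributes no future value.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Two states, two actions, all values start at zero.
q = [[0.0, 0.0], [0.0, 0.0]]
q[1][1] = 1.0  # pretend state 1 already has a useful action
new_value = q_update(q, state=0, action=0, reward=-0.04, next_state=1)
print(round(new_value, 4))  # 0.086
```

Here the target is `-0.04 + 0.9 * 1.0 = 0.86`, and the estimate moves 10% of the way from 0 toward it.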

In [None]:
# Task 2: Implement Q-Learning Agent

💡 **Tip:** Use parameters like `alpha`, `gamma`, and `epsilon` for learning rate, discount factor, and exploration rate respectively.
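
The exploration rate `epsilon` is typically used in an epsilon-greedy rule: explore with probability `epsilon`, otherwise exploit the current estimates. The sketch below assumes a hypothetical helper named `epsilon_greedy` and uses only the standard library.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Break ties by taking the first maximal action.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded only to make the demo repeatable
picks = [epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly action 1, the greedy choice
```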

⚙️ **Test Your Work:**
- Print the Q-table after training.

**Expected output:** The Q-table with learned values for each state-action pair.
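
A raw Q-table can be hard to read. One option, sketched below, is to print the greedy action for each cell. The function name `greedy_policy_grid` and the row-major layout (`index = row * size + col`) are assumptions, so adapt them to match your own state indexing.

```python
def greedy_policy_grid(q_table, size, action_symbols):
    """Return one string per grid row showing the highest-valued action in each cell."""
    lines = []
    for row in range(size):
        cells = []
        for col in range(size):
            values = q_table[row * size + col]
            best = max(range(len(values)), key=lambda a: values[a])
            cells.append(action_symbols[best])
        lines.append(" ".join(cells))
    return lines


# Toy 2x2 example with two actions: 0 = right, 1 = down.
toy_q = [
    [0.5, 0.1],  # cell (0, 0): right is better
    [0.0, 0.9],  # cell (0, 1): down is better
    [0.7, 0.2],  # cell (1, 0): right is better
    [0.0, 0.0],  # cell (1, 1): goal, never updated, ties resolve to action 0
]
for line in greedy_policy_grid(toy_q, size=2, action_symbols=[">", "v"]):
    print(line)
```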

## Task 3: Implement REINFORCE Agent

**Context:** REINFORCE is a policy-based method: the agent adjusts the parameters of its policy directly, using the rewards collected over complete episodes.

**Steps:**

1. Construct a policy network to approximate your agent’s policy.
2. Implement the REINFORCE algorithm, which calculates policy updates based on episodic reward trajectories.
3. Train the policy network over several episodes to learn effective sequences of actions.
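
The policy updates in step 2 are weighted by the return that follows each action. A common choice, shown here as a sketch rather than the only valid formulation, is the discounted reward-to-go, where each step's return is its own reward plus `gamma` times the return of the next step. The helper name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return from each time step to the end of the episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


episode_rewards = [-0.04, -0.04, 1.0]
print([round(g, 4) for g in discounted_returns(episode_rewards, gamma=0.9)])
# [0.734, 0.86, 1.0]
```

Each action's log-probability is then weighted by the matching return when building the loss.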

In [None]:
# Task 3: Implement REINFORCE Agent

💡 **Tip:** Use `torch.nn` for implementing the policy network.
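
As a rough illustration of how a `torch.nn` policy can be sampled while keeping gradients, the sketch below uses `torch.distributions.Categorical`. The network shape (25 one-hot inputs, 32 hidden units, 2 actions) and the return value of `1.0` are assumptions chosen only for the demo.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical tiny policy: 25 one-hot state inputs, 2 actions.
policy = nn.Sequential(nn.Linear(25, 32), nn.ReLU(), nn.Linear(32, 2))

state = torch.zeros(25)
state[0] = 1.0  # one-hot encoding of state index 0

# Build a distribution from raw logits, sample, and keep the differentiable log-probability.
dist = Categorical(logits=policy(state))
action = dist.sample()
log_prob = dist.log_prob(action)

# REINFORCE-style loss for a single step with an illustrative return of 1.0.
loss = -log_prob * 1.0
loss.backward()

print(action.item(), log_prob.item())
```

Sampling and computing the log-probability from the same distribution object means the network only needs one forward pass per step.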

⚙️ **Test Your Work:**
- Print the total rewards and steps taken during the training process.

**Expected output:** The performance metrics for each episode.

## Task 4: Comparative Analysis

**Context:** Comparing the results of Q-Learning and REINFORCE agents helps evaluate their strengths and weaknesses.

**Steps:**

1. Run a set number of episodes for both agents and record their performance, noting the cumulative rewards achieved and the consistency of reaching the goal state.
2. Compare the learning processes and outcomes of the Q-Learning and REINFORCE agents.
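
To record performance as described in step 1, it can help to log per-episode rewards and step counts for each agent and then summarize them the same way. The sketch below assumes a hypothetical helper `summarize` that looks at the most recent episodes. The numbers are made up purely to show the call.

```python
from statistics import mean


def summarize(episode_rewards, episode_steps, window=100):
    """Summarize a training run using its last `window` episodes."""
    recent_rewards = episode_rewards[-window:]
    recent_steps = episode_steps[-window:]
    return {
        "mean_reward": mean(recent_rewards),
        "mean_steps": mean(recent_steps),
        "best_reward": max(episode_rewards),
    }


# Made-up numbers purely to show the call; they are not results from either agent.
demo_rewards = [0.2, 0.4, 0.6, 0.8]
demo_steps = [20, 16, 12, 8]
print(summarize(demo_rewards, demo_steps, window=2))
```

Running the same summary on both agents' logs gives directly comparable figures for final reward and path length.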

In [None]:
# Task 4: Comparative Analysis

💡 **Tip:** Use visualizations or statistical summaries to aid comparison.

⚙️ **Test Your Work:**
- Compare the performance metrics of both agents and discuss their strategies, speed of learning, and adaptability.

**Expected output:** A detailed comparative analysis document.

### ✅ Success Checklist

- Successfully set up the GridWorld environment with defined states and rewards
- Implemented and trained a Q-Learning agent
- Implemented and trained a REINFORCE agent
- Compared the performance of both agents
- Provided reflections and recommendations based on findings

### 🔍 Common Issues & Solutions

**Problem:** Agent actions not updating the state correctly.   
**Solution:** Ensure the actions are correctly defined and update the state as intended.

**Problem:** Rewards not being calculated correctly.   
**Solution:** Verify the reward logic and ensure it's applied correctly for each action.

**Problem:** Agent not learning effectively.   
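**Solution:** Tune the learning rate (`alpha`), discount factor (`gamma`), and exploration rate (`epsilon`). Too little exploration, or a learning rate that is too high, is a common cause of poor training.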
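
One common way to adjust the exploration rate is to start high and decay it as training progresses. The schedule below is a sketch under assumed values: the function name `decayed_epsilon` and the defaults for `start`, `end`, and `decay` are illustrative, not prescribed by this lab.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially shrink epsilon per episode, never dropping below `end`."""
    return max(end, start * decay ** episode)


for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```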

### 🔑 Key Points

- Q-Learning is a value-based method that updates state-action values to learn the optimal policy.
- REINFORCE is a policy-based method that directly optimizes the policy based on rewards received.
- Comparing different RL approaches helps in understanding their strengths, weaknesses, and suitable applications.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Environment Setup: GridWorld
class GridWorld:
    def __init__(self):
        self.state = (0, 0)
        self.goal = (4, 4)
        self.size = 5

    def reset(self):
        self.state = (0, 0)
        return self.state

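    def step(self, action):
        x, y = self.state
        # A move that would leave the grid is ignored, so the state stays unchanged.
        if action == "right" and x < self.size - 1:
            x += 1
        elif action == "down" and y < self.size - 1:
            y += 1
        self.state = (x, y)
        # +1 for reaching the goal, a small penalty for every other step.
        reward = 1 if self.state == self.goal else -0.04
        return self.state, reward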

# Q-Learning implementation
def train_q_learning(env, num_episodes=1000):
    q_table = np.zeros((env.size * env.size, 2))
    alpha = 0.1
    gamma = 0.9
    epsilon = 0.1

    for episode in range(num_episodes):
        state = env.reset()
        state_index = state[0] * env.size + state[1]
        done = False

        while not done:
            if np.random.uniform(0, 1) < epsilon:
                action_index = np.random.choice(2)
            else:
                action_index = np.argmax(q_table[state_index])

            action = ["right", "down"][action_index]
            next_state, reward = env.step(action)
            next_state_index = next_state[0] * env.size + next_state[1]
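            # Q-Learning update: move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
            td_target = reward + gamma * np.max(q_table[next_state_index])
            q_table[state_index, action_index] += alpha * (td_target - q_table[state_index, action_index])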
            state, state_index = next_state, next_state_index
            if state == env.goal:
                done = True
    return q_table

# REINFORCE implementation
class PolicyNetwork(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super().__init__()
        self.fc1 = nn.Linear(num_inputs, 128)
        self.fc2 = nn.Linear(128, num_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=-1)

def select_action(policy_net, state, env):
    # Create one-hot encoded state
    state_tensor = torch.zeros(env.size * env.size)
    state_index = state[0] * env.size + state[1]
    state_tensor[state_index] = 1

    with torch.no_grad():
        probs = policy_net(state_tensor)
    action = np.random.choice(2, p=probs.numpy())
    return action

def train_reinforce(env, num_episodes=1000):
    policy_net = PolicyNetwork(num_inputs=env.size * env.size, num_actions=2)
    optimizer = optim.Adam(policy_net.parameters(), lr=0.01)

    for episode in range(num_episodes):
        state = env.reset()
        rewards = []
        log_probs = []
        done = False

        while not done:
            # Create one-hot encoded state
            state_tensor = torch.zeros(env.size * env.size)
            state_index = state[0] * env.size + state[1]
            state_tensor[state_index] = 1

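            action = select_action(policy_net, state, env)
            # select_action samples without tracking gradients, so run the network
            # again here to get a log-probability that can be backpropagated.
            probs = policy_net(state_tensor)
            log_prob = torch.log(probs[action])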

            next_state, reward = env.step(["right", "down"][action])
            rewards.append(reward)
            log_probs.append(log_prob)
            state = next_state
            if state == env.goal:
                done = True

        # Undiscounted total reward for the episode; every step is weighted by this same value
        returns = sum(rewards)

        # Update policy network
        policy_loss = []
        for log_prob in log_probs:
            policy_loss.append(-log_prob * returns)
        policy_loss = torch.stack(policy_loss).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        if episode % 100 == 0:
            print(f"Episode {episode}, Total Reward: {returns}")

# Running the agents and comparing results
env = GridWorld()
print("Training Q-Learning agent...")
q_table = train_q_learning(env)
print("\nTraining REINFORCE agent...")
train_reinforce(env)

print("\nQ-Learning Q-Table:")
print(q_table)
```

</details>
