🎥 Recommended Video: [Reinforcement Learning: Deep Q Learning and Policy Gradient](https://www.youtube.com/watch?v=k0eMEhgTYZQ&t=13s)



## **1. Q-Learning**: Learning Through Trial and Error

Imagine you’re teaching a robot to navigate a simple maze. The robot knows nothing about the maze initially but learns through trial and error. As it moves, it receives rewards for getting closer to the goal and penalties for hitting walls. Over time, it develops a strategy to navigate efficiently. This learning process is the essence of **Q-Learning**.

Q-Learning is a model-free reinforcement learning algorithm that helps an agent learn the value of taking specific actions in different states, using a **Q-table** to store these values.

### Key Concepts:
- **Q-table**: A grid where rows represent states and columns represent actions. Each cell holds the Q-value: the expected cumulative (discounted) reward for taking that action in that state and acting optimally afterward.
- **Bellman Equation**: The basis of the Q-Learning update rule. It updates Q-values by combining the immediate reward with the discounted maximum Q-value of the next state.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
$$

#### **Breaking Down the Terms:**
- **$\alpha$**: Learning rate, controlling how much new information overrides old knowledge.
- **$\gamma$**: Discount factor, balancing the importance of future rewards versus immediate rewards.
- **$r$**: Immediate reward received for taking action $a$ in state $s$.
- **$\max_{a'} Q(s', a')$**: The highest Q-value over all actions $a'$ in the next state $s'$, i.e. the estimated return of acting greedily from there.
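
To make the update rule concrete, here is a minimal sketch of a single update. The numbers for `q_sa`, `r`, and `max_next_q` are assumptions chosen for illustration, not values from a real environment:

```python
# One Q-learning update with illustrative (assumed) numbers
alpha = 0.1       # learning rate
gamma = 0.9       # discount factor
q_sa = 0.5        # current estimate Q(s, a)
r = 1.0           # immediate reward
max_next_q = 2.0  # max over a' of Q(s', a')

target = r + gamma * max_next_q   # value the estimate is pulled toward
error = target - q_sa             # how far off the current estimate is
new_q_sa = q_sa + alpha * error   # move a fraction alpha toward the target

print(target, error, new_q_sa)
```

The target here is about 2.8, so the estimate moves from 0.5 to roughly 0.73. With a small $\alpha$, each update only nudges the estimate, which is why many episodes are needed before the values settle.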

#### Balancing Exploration and Exploitation
Q-Learning is commonly paired with an **epsilon-greedy strategy**: with probability $\epsilon$ the agent explores, and otherwise it exploits.
- **Exploration**: Trying new actions to discover better strategies.
- **Exploitation**: Choosing the best-known action based on current knowledge.
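
The choice between the two can be sketched as a small helper. The function name `epsilon_greedy` and the sample Q-values are hypothetical, introduced only for this illustration:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of a Q-table (hypothetical helper)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))              # exploit: best-known action

rng = np.random.default_rng(0)
q_row = np.array([0.2, 0.7])  # assumed Q-values for two actions in one state

# epsilon = 0 never explores, so the best-known action is always chosen
greedy = epsilon_greedy(q_row, epsilon=0.0, rng=rng)

# epsilon = 1 always explores, so both actions are picked about equally often
picks = [epsilon_greedy(q_row, epsilon=1.0, rng=rng) for _ in range(1000)]
print(greedy, sum(picks) / len(picks))
```

In practice $\epsilon$ usually starts high and decays over time, so the agent explores widely at first and relies more on what it has learned later on.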

---

### Example: Robot in a Simple Maze
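Let’s simulate a robot learning to move through a very simple maze using Q-Learning. The maze here is a one-dimensional corridor of six states: the robot starts at state 0, can step left or right, and receives a reward of 1 when it reaches the goal at state 5.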

```python
import numpy as np

# Define the environment
num_states = 6
num_actions = 2  # Example: 0 = left, 1 = right
q_table = np.zeros((num_states, num_actions))

# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 1.0  # Exploration rate
epsilon_decay = 0.995
min_epsilon = 0.01

# Simulate Q-Learning
for episode in range(1000):
    state = 0  # Start state
    done = False
    
    while not done:
        # Exploration vs Exploitation
        if np.random.rand() < epsilon:
            action = np.random.randint(num_actions)  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit
        
        # Simulate action (example: move right or left)
        if action == 1:  # Move right
            next_state = state + 1
        else:  # Move left
            next_state = state - 1
        
        # Ensure next_state stays within valid bounds
        next_state = np.clip(next_state, 0, num_states - 1)
        
        # Reward and done condition
        reward = 1 if next_state == num_states - 1 else 0  # Reward at goal state
        done = next_state == num_states - 1  # Episode ends at goal state
        
        # Update Q-value using Bellman Equation
        q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action])
        
        state = next_state
    
    # Decay epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

print("Q-Table:")
print(q_table)
```

---





Output of the code above (exact values vary slightly between runs because exploration is random):

Q-Table:
[[0.59048997 0.6561    ]
 [0.59048961 0.729     ]
 [0.65609989 0.81      ]
 [0.72899988 0.9       ]
 [0.80999924 1.        ]
 [0.         0.        ]]
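
The printed values can be checked against the reward structure. The sketch below copies the table from the output above and assumes the same `gamma = 0.9`; it extracts the greedy policy and compares the "right" column with the values you would expect by hand:

```python
import numpy as np

gamma = 0.9
num_states = 6

# Q-table copied from the output above (columns: 0 = left, 1 = right)
q_table = np.array([
    [0.59048997, 0.6561],
    [0.59048961, 0.729],
    [0.65609989, 0.81],
    [0.72899988, 0.9],
    [0.80999924, 1.0],
    [0.0, 0.0],
])

# Greedy policy: best action in each non-terminal state
policy = np.argmax(q_table[:-1], axis=1)

# Moving right from state s reaches the goal in (num_states - 1 - s) steps.
# Only the final step pays reward 1, so Q(s, right) = gamma ** (steps - 1).
steps_to_goal = num_states - 1 - np.arange(num_states - 1)
expected_right = gamma ** (steps_to_goal - 1)

print(policy)
print(np.allclose(q_table[:-1, 1], expected_right))
```

The greedy policy is "move right" in every state, and the right-hand column matches the discounted values 0.9^4, 0.9^3, and so on up to 1. The last row stays at zero because the goal state ends the episode, so no action is ever taken from it.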


## **2. Deep Q-Learning (DQN)**: Scaling to Complex Environments

Now imagine the maze becomes vastly more complex, with thousands of states. A Q-table becomes impractical because it needs one row per state. Here’s where **Deep Q-Learning (DQN)** comes in, replacing the Q-table with a **neural network** (we'll look at neural networks in later modules) to approximate Q-values.

### Key Concepts:
- **Neural Network**: Learns to predict Q-values for each action given a state.
- **Experience Replay**: Stores past experiences (state, action, reward, next state, and whether the episode ended) in a buffer and samples them randomly to train the network. This reduces correlation between consecutive training samples and improves stability.
- **Target Network**: A separate network that provides stable target Q-values, updated less frequently than the main policy network.
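
As a rough sketch of how a target network can lag behind the policy network, the snippet below uses two tiny stand-in networks and a dummy loss. The layer sizes, the `sync_every` interval, and the loss are all assumptions for illustration; a real DQN would use the Bellman-based loss instead:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two tiny stand-in networks (sizes are arbitrary, for illustration only)
policy_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(policy_net.state_dict())  # start identical

optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.1)
sync_every = 5  # assumed sync interval, in training steps

x = torch.randn(8, 4)
for step in range(1, 11):
    # Dummy loss, used only to change the policy network's weights
    loss = policy_net(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:
        # Hard update: copy the policy weights into the target network
        target_net.load_state_dict(policy_net.state_dict())

# The loop ends right after a sync, so both networks hold the same weights
in_sync = all(
    torch.equal(p, t)
    for p, t in zip(policy_net.parameters(), target_net.parameters())
)
print(in_sync)
```

Between syncs the target network stays frozen while the policy network keeps changing, so the training targets do not shift on every single step.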

---

### Example: Implementing DQN

```python
# Import deep learning libraries
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

# Define the DQN network
class DQN(nn.Module):
    def __init__(self, input_size, output_size):
        super(DQN, self).__init__()
        # Define a simple feed-forward network
        self.network = nn.Sequential(
            nn.Linear(input_size, 64),  # First hidden layer with 64 neurons
            nn.ReLU(),                 # Activation function
            nn.Linear(64, 64),         # Second hidden layer with 64 neurons
            nn.ReLU(),                 # Activation function
            nn.Linear(64, output_size) # Output layer matching the action space size
        )
    
    def forward(self, x):
        # Forward pass through the network
        return self.network(x)

# Experience Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # Stores a fixed number of experiences
    
    def push(self, state, action, reward, next_state, done):
        # Add a new experience to the buffer
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        # Randomly sample a batch of experiences
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)  # Return the current size of the buffer

# DQN Agent
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available
        
        # Initialize the policy network and target network
        self.policy_net = DQN(state_size, action_size).to(self.device)
        self.target_net = DQN(state_size, action_size).to(self.device)
        
        # Copy weights from the policy network to the target network
        self.target_net.load_state_dict(self.policy_net.state_dict())
        
        # Optimizer and replay buffer
        self.optimizer = optim.Adam(self.policy_net.parameters())
        self.memory = ReplayBuffer()
        
        # Training hyperparameters
        self.batch_size = 32
        self.gamma = 0.99  # Discount factor
        self.epsilon = 0.1  # Exploration rate
        
    def select_action(self, state):
        # Choose an action using epsilon-greedy policy
        if random.random() < self.epsilon:
            return random.randrange(self.policy_net.network[-1].out_features)  # Explore
        
        with torch.no_grad():
            state = torch.FloatTensor(state).unsqueeze(0).to(self.device)  # Prepare state tensor
            return self.policy_net(state).max(1)[1].item()  # Exploit
    
    def learn(self):
        # Skip learning if not enough experiences in memory
        if len(self.memory) < self.batch_size:
            return
        
        # Sample a batch of experiences
        batch = self.memory.sample(self.batch_size)
        batch_state, batch_action, batch_reward, batch_next_state, batch_done = zip(*batch)
        
        # Convert batch data to tensors
        state = torch.FloatTensor(batch_state).to(self.device)
        action = torch.LongTensor(batch_action).to(self.device)
        reward = torch.FloatTensor(batch_reward).to(self.device)
        next_state = torch.FloatTensor(batch_next_state).to(self.device)
        done = torch.FloatTensor(batch_done).to(self.device)
        
        # Compute current Q-values for chosen actions
        current_q = self.policy_net(state).gather(1, action.unsqueeze(1))
        
        # Compute next Q-values using the target network
        next_q = self.target_net(next_state).max(1)[0].detach()
        
        # Compute the target Q-value using the Bellman equation
        target_q = reward + (1 - done) * self.gamma * next_q
        
        # Calculate the loss between current and target Q-values
        loss = nn.MSELoss()(current_q.squeeze(1), target_q)  # squeeze(1) keeps the batch dimension
        
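        # Perform backpropagation and update the policy network's weights
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target(self):
        # Sync the target network with the policy network (call this periodically)
        self.target_net.load_state_dict(self.policy_net.state_dict())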
```
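
The target computation in `learn` relies on the `done` flag to switch off the future-reward term for terminal transitions. Here is a small NumPy sketch of that one line, using assumed batch values in place of real network outputs:

```python
import numpy as np

gamma = 0.99
reward = np.array([0.0, 1.0, 0.5])
done = np.array([0.0, 1.0, 0.0])    # 1.0 marks a transition that ended the episode
next_q = np.array([2.0, 5.0, 0.0])  # assumed max Q-values from the target network

# For terminal transitions (done = 1) the target is just the reward;
# otherwise the discounted next-state value is added.
target_q = reward + (1 - done) * gamma * next_q
print(target_q)
```

The second transition has `done = 1`, so its target is just the reward of 1.0 even though `next_q` is 5.0. Without the mask, the network would learn to expect reward after the episode has already ended.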

---

By evolving from Q-Learning to Deep Q-Learning, we unlock the ability to handle more complex environments, setting the stage for powerful reinforcement learning applications.