# Mini Projects: CartPole, FrozenLake, Q-learning, and DQN

## 📚 Learning Objectives

By completing this notebook, you will:
- Implement the Q-learning algorithm
- Apply Q-learning to the FrozenLake environment
- Apply Q-learning to the CartPole environment (with state discretization)
- Understand Deep Q-Network (DQN) concepts
- Train and evaluate RL agents on classic environments

## 🔗 Prerequisites

- ✅ OpenAI Gym setup
- ✅ Understanding of states, actions, rewards
- ✅ Epsilon-Greedy exploration strategy
- ✅ Python knowledge (functions, classes, loops, dictionaries)
- ✅ NumPy knowledge
- ✅ Basic understanding of neural networks (for DQN section)

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 1**:
- Mini projects: applying RL in games like CartPole and FrozenLake, implementing Q-learning and DQN
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 1 Practical Content

---

## Introduction

This notebook combines multiple mini projects:
1. **Q-learning on FrozenLake**: Classic grid-world problem with discrete states
2. **Q-learning on CartPole**: Continuous state space requiring discretization
3. **DQN Introduction**: Deep reinforcement learning for high-dimensional states

These projects demonstrate practical RL applications on classic benchmark environments.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import gym  # the reset()/step() calls below use the API of gym >= 0.26 (also provided by gymnasium)
import random
from collections import defaultdict

print("✅ Libraries imported!")
print("\nMini Projects: CartPole, FrozenLake, Q-learning, and DQN")
print("=" * 60)

## Part 1: Q-Learning Algorithm Implementation


In [None]:
print("=" * 60)
print("Part 1: Q-Learning Algorithm Implementation")
print("=" * 60)

def epsilon_greedy_action(q_table, state, epsilon, n_actions):
    """Choose an action using the epsilon-greedy strategy."""
    if random.random() < epsilon:
        # Explore: pick a uniformly random action
        return random.randint(0, n_actions - 1)
    else:
        # Exploit: pick the action with the highest Q-value in this state
        return np.argmax(q_table[state])

def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """
    Q-learning update rule:
    Q(s,a) = Q(s,a) + α[r + γ * max(Q(s',a')) - Q(s,a)]
    """
    current_q = q_table[state, action]
    max_next_q = np.max(q_table[next_state])
    new_q = current_q + alpha * (reward + gamma * max_next_q - current_q)
    q_table[state, action] = new_q  # updates the table in place
    return q_table

print("\nQ-Learning Algorithm:")
print("  1. Initialize Q-table (states × actions)")
print("  2. For each episode:")
print("     a. Initialize state")
print("     b. While not done:")
print("        - Choose action using epsilon-greedy")
print("        - Take action, observe reward and next state")
print("        - Update Q-table: Q(s,a) = Q(s,a) + α[r + γ*max(Q(s',a')) - Q(s,a)]")
print("        - Set state = next_state")
print("  3. Return Q-table")

print("\nKey Parameters:")
print("  - α (alpha): Learning rate (0.0 to 1.0)")
print("  - γ (gamma): Discount factor (0.0 to 1.0)")
print("  - ε (epsilon): Exploration rate")
print("  - Q-table: State-action value function")

print("\n✅ Q-Learning algorithm understood!")
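
To make the update rule above concrete, here is a single update worked through by hand. The 3-state, 2-action table and the transition values are hypothetical and chosen only so the arithmetic is easy to follow; they do not come from any environment in this notebook.

```python
import numpy as np

# Hypothetical 3-state, 2-action table used only for this worked example.
q_table = np.zeros((3, 2))
q_table[1] = [0.5, 0.8]  # pretend state 1 already has some learned values

alpha, gamma = 0.1, 0.99
state, action, reward, next_state = 0, 1, 1.0, 1

# TD target: r + gamma * max_a' Q(s', a') = 1.0 + 0.99 * 0.8 = 1.792
td_target = reward + gamma * np.max(q_table[next_state])

# TD error: target - Q(s, a) = 1.792 - 0.0 = 1.792
td_error = td_target - q_table[state, action]

# Move Q(s, a) a fraction alpha toward the target: 0.0 + 0.1 * 1.792 = 0.1792
q_table[state, action] += alpha * td_error

print(f"Q(0, 1) after one update: {q_table[state, action]:.4f}")  # 0.1792
```

With `alpha = 0.1` the estimate moves only 10% of the way toward the target, which is why many visits to the same state-action pair are needed before the values settle.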

## Part 2: Q-Learning on FrozenLake


In [None]:
print("\n" + "=" * 60)
print("Part 2: Q-Learning on FrozenLake")
print("=" * 60)

# Create FrozenLake environment
env = gym.make('FrozenLake-v1', is_slippery=True)

# Q-learning parameters
n_states = env.observation_space.n
n_actions = env.action_space.n
q_table = np.zeros((n_states, n_actions))

# Training parameters
n_episodes = 5000
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 1.0  # Start with full exploration
epsilon_decay = 0.9995
epsilon_min = 0.01

rewards_history = []

print(f"\nEnvironment: FrozenLake-v1")
print(f" States: {n_states}")
print(f" Actions: {n_actions}")
print(f"\nTraining for {n_episodes} episodes...")

# Training loop
for episode in range(n_episodes):
    state, info = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Choose action using epsilon-greedy
        action = epsilon_greedy_action(q_table, state, epsilon, n_actions)

        # Take action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-table
        q_learning_update(q_table, state, action, reward, next_state, alpha, gamma)

        state = next_state
        total_reward += reward

    rewards_history.append(total_reward)

    # Decay epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_history[-500:])
        # FrozenLake pays 1 only on reaching the goal, so the mean reward equals the success rate
        success_rate = np.mean(rewards_history[-500:])
        print(f"  Episode {episode+1}: Avg reward = {avg_reward:.3f}, Success rate = {success_rate:.3f}, ε = {epsilon:.3f}")

env.close()

# Evaluate trained agent
print(f"\nEvaluating trained agent...")
env = gym.make('FrozenLake-v1', is_slippery=True)
test_episodes = 100
successes = 0

for episode in range(test_episodes):
    state, info = env.reset()
    done = False

    while not done:
        action = np.argmax(q_table[state])  # Greedy policy
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        if reward > 0:
            successes += 1
            break

success_rate = successes / test_episodes
print(f"Success rate: {success_rate:.2%} ({successes}/{test_episodes})")

env.close()

# Visualize learning
plt.figure(figsize=(10, 6))
window_size = 100
if len(rewards_history) >= window_size:
    smoothed = np.convolve(rewards_history, np.ones(window_size)/window_size, mode='valid')
    plt.plot(smoothed, label=f'Smoothed (window={window_size})')
plt.plot(rewards_history, alpha=0.3, label='Raw rewards')
plt.xlabel('Episode', fontsize=12)
plt.ylabel('Reward', fontsize=12)
plt.title('Q-Learning on FrozenLake: Training Progress', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Q-Learning on FrozenLake complete!")
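
The multiplicative epsilon decay used in training can be checked with a little arithmetic before running anything. The sketch below uses the decay settings from this notebook (`0.9995` for FrozenLake, `0.995` for CartPole, floor `0.01`); the helper names `final_epsilon` and `episodes_to_floor` are made up for this illustration.

```python
import math

def final_epsilon(decay, n_episodes, start=1.0, floor=0.01):
    """Epsilon after n_episodes multiplicative decays, clipped at the floor."""
    return max(floor, start * decay ** n_episodes)

def episodes_to_floor(decay, start=1.0, floor=0.01):
    """Smallest n with start * decay**n <= floor, i.e. n >= ln(floor/start) / ln(decay)."""
    return math.ceil(math.log(floor / start) / math.log(decay))

for name, decay in [("FrozenLake", 0.9995), ("CartPole", 0.995)]:
    print(f"{name}: epsilon after 5000 episodes = {final_epsilon(decay, 5000):.3f}, "
          f"episodes to reach the floor = {episodes_to_floor(decay)}")
```

With a decay of `0.9995`, epsilon is still about 0.08 after 5,000 episodes and would need roughly 9,200 episodes to reach `epsilon_min`, so the FrozenLake agent never becomes fully greedy during training. With `0.995`, the floor is reached after roughly 920 episodes.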

## Part 3: Q-Learning on CartPole (with State Discretization)


In [None]:
print("\n" + "=" * 60)
print("Part 3: Q-Learning on CartPole (with State Discretization)")
print("=" * 60)

def discretize_state(observation, bins):
    """Discretize a continuous observation into a single state index."""
    state_idx = 0
    # Order of values: cart_pos, cart_vel, pole_angle, pole_vel
    for value, edges in zip(observation, bins):
        # np.digitize returns 0..len(edges); clip to len(edges) - 1 so each
        # dimension has exactly len(edges) cells and indices never collide
        idx = min(int(np.digitize(value, edges)), len(edges) - 1)
        state_idx = state_idx * len(edges) + idx
    return state_idx

# Create CartPole environment
env = gym.make('CartPole-v1')

# Discretization bins
n_bins = 10
cart_pos_bins = np.linspace(-2.4, 2.4, n_bins)
cart_vel_bins = np.linspace(-3.0, 3.0, n_bins)
pole_angle_bins = np.linspace(-0.2, 0.2, n_bins)
pole_vel_bins = np.linspace(-3.0, 3.0, n_bins)
bins = [cart_pos_bins, cart_vel_bins, pole_angle_bins, pole_vel_bins]

n_states = n_bins ** 4
n_actions = env.action_space.n
q_table = np.zeros((n_states, n_actions))

# Training parameters
n_episodes = 5000
alpha = 0.1
gamma = 0.99
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

rewards_history = []

print(f"\nEnvironment: CartPole-v1")
print(f" Discrete states: {n_states}")
print(f" Actions: {n_actions}")
print(f" Bins per dimension: {n_bins}")
print(f"\nTraining for {n_episodes} episodes...")

# Training loop
for episode in range(n_episodes):
    obs, info = env.reset()
    state = discretize_state(obs, bins)
    total_reward = 0
    done = False

    while not done:
        # Choose action
        action = epsilon_greedy_action(q_table, state, epsilon, n_actions)

        # Take action
        next_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        next_state = discretize_state(next_obs, bins)

        # Update Q-table
        q_learning_update(q_table, state, action, reward, next_state, alpha, gamma)

        state = next_state
        total_reward += reward

    rewards_history.append(total_reward)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_history[-500:])
        print(f"  Episode {episode+1}: Avg reward = {avg_reward:.2f}, ε = {epsilon:.3f}")

env.close()

# Visualize learning
plt.figure(figsize=(10, 6))
window_size = 100
if len(rewards_history) >= window_size:
    smoothed = np.convolve(rewards_history, np.ones(window_size)/window_size, mode='valid')
    plt.plot(smoothed, label=f'Smoothed (window={window_size})')
plt.plot(rewards_history, alpha=0.3, label='Raw rewards')
plt.xlabel('Episode', fontsize=12)
plt.ylabel('Reward', fontsize=12)
plt.title('Q-Learning on CartPole: Training Progress', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Q-Learning on CartPole complete!")
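
The binning in Part 3 relies on `np.digitize`, and one detail is worth checking in isolation: with `n_bins` edges it can return `n_bins + 1` distinct indices (0 through `n_bins`), because values below the first edge and values at or above the last edge each get their own index. The sketch below uses the same cart-position edges as the notebook; the sample values are invented, and `np.ravel_multi_index` is shown as one possible way to flatten four per-dimension indices into a single state index.

```python
import numpy as np

n_bins = 10
edges = np.linspace(-2.4, 2.4, n_bins)  # cart-position edges from Part 3

# Invented sample positions, including values outside the edge range.
samples = np.array([-3.0, -2.4, 0.0, 2.39, 2.4, 5.0])

raw = np.digitize(samples, edges)       # indices in 0..n_bins (11 possible values)
clipped = np.clip(raw, 0, n_bins - 1)   # indices in 0..n_bins - 1 (10 possible values)

print("raw:    ", raw.tolist())      # [0, 1, 5, 9, 10, 10]
print("clipped:", clipped.tolist())  # [0, 1, 5, 9, 9, 9]

# Flatten four per-dimension indices into one index in 0..n_bins**4 - 1.
flat = np.ravel_multi_index((1, 2, 3, 4), (n_bins,) * 4)
print("flat index:", int(flat))      # 1234
```

Keeping every per-dimension index below `n_bins` is what guarantees that distinct bin combinations map to distinct rows of an `n_bins ** 4` Q-table.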

## Part 4: Introduction to Deep Q-Network (DQN)


In [None]:
print("\n" + "=" * 60)
print("Part 4: Introduction to Deep Q-Network (DQN)")
print("=" * 60)

print("\nDQN Overview:")
print("  - Uses neural network to approximate Q-function (instead of Q-table)")
print("  - Handles high-dimensional state spaces (e.g., images)")
print("  - Key innovations:")
print("    1. Experience Replay: Store transitions, sample randomly for training")
print("    2. Target Network: Separate network for stable Q-targets")
print("    3. Neural Network: Approximate Q(s,a) for continuous/high-dim states")

print("\nDQN Architecture:")
print("  Input: State (e.g., image, observation vector)")
print("  Network: Fully connected or CNN layers")
print("  Output: Q-values for each action")

print("\nDQN vs Q-Learning:")
print("  Q-Learning:")
print("    - Uses Q-table (discrete states only)")
print("    - Limited to small state spaces")
print("    - Fast for tabular problems")
print("  DQN:")
print("    - Uses neural network (continuous/high-dim states)")
print("    - Scales to complex problems (e.g., Atari games)")
print("    - Requires more computation and tuning")

print("\nDQN Algorithm (High-Level):")
print("  1. Initialize Q-network and target network")
print("  2. For each episode:")
print("     a. Observe state")
print("     b. Choose action using epsilon-greedy (using Q-network)")
print("     c. Take action, store transition in replay buffer")
print("     d. Sample batch from replay buffer")
print("     e. Compute targets using target network")
print("     f. Update Q-network using loss: (Q(s,a) - target)^2")
print("     g. Periodically update target network")

print("\nNote: Full DQN implementation will be covered in Unit 3 (Deep RL)")
print("This is an introduction to the concepts.")

print("\n✅ DQN introduction complete!")
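
Two of the steps listed above, storing and sampling transitions and computing targets, can be sketched without any deep learning library. This is a minimal illustration and not a DQN implementation: the `ReplayBuffer` class and `dqn_targets` function are hypothetical names, the transitions are random numbers, and a fixed random linear map stands in for the target network.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), without bootstrapping on terminal steps."""
    not_done = 1.0 - dones.astype(float)
    return rewards + gamma * not_done * next_q_values.max(axis=1)


# Usage with made-up transitions (4-dimensional states, 2 actions).
rng = np.random.default_rng(0)
buffer = ReplayBuffer(capacity=1000)
for _ in range(64):
    buffer.push(rng.normal(size=4), int(rng.integers(2)), 1.0,
                rng.normal(size=4), bool(rng.random() < 0.1))

states, actions, rewards, next_states, dones = buffer.sample(32)

# Stand-in for the target network: a fixed random linear map from state to Q-values.
target_weights = rng.normal(size=(4, 2))
next_q_values = next_states @ target_weights

targets = dqn_targets(rewards, next_q_values, dones)
print(targets.shape)  # (32,)
```

In a real agent the Q-network would then be trained to reduce the squared difference between its prediction for the taken action and these targets.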

## Summary

### Key Concepts:
1. **Q-Learning**: Off-policy TD learning algorithm
   - Updates: Q(s,a) = Q(s,a) + α[r + γ*max(Q(s',a')) - Q(s,a)]
   - Uses Q-table for discrete states
   - Epsilon-greedy exploration

2. **FrozenLake**: Discrete grid-world environment
   - Well suited to tabular Q-learning
   - 16 states, 4 actions (default 4×4 map)
   - Slippery and non-slippery variants (`is_slippery=True` or `False`)

3. **CartPole**: Continuous state space
   - Requires state discretization for Q-learning
   - 4 continuous state variables → discrete bins
   - Alternative: Use DQN for continuous states

4. **DQN**: Deep Q-Network
   - Neural network approximates Q-function
   - Handles high-dimensional/continuous states
   - Experience replay and target networks for stability

### Implementation Highlights:
- **Q-Learning**: Tabular method, fast for discrete problems
- **State Discretization**: Convert continuous to discrete for Q-learning
- **Epsilon Decay**: Reduce exploration over time
- **DQN**: Deep learning extension for complex problems

### Best Practices:
- Start with high epsilon (exploration), decay over time
- Tune learning rate (alpha) and discount factor (gamma)
- Monitor learning curves and success rates
- Use experience replay and target networks for DQN stability

### Next Steps:
- Unit 2: Advanced Q-learning (SARSA, TD methods)
- Unit 3: Deep RL (DQN, Actor-Critic, PPO)
- Unit 4: Exploration strategies
- Unit 5: Advanced applications

**Reference:** Course 09, Unit 1: "Introduction to Reinforcement Learning" - Mini projects practical content