# Class 4 Notebook – Reinforcement Learning: Basics and Q-Learning Demo

This notebook introduces **Reinforcement Learning (RL)**, a type of machine learning where an agent learns to make decisions by interacting with an environment.

Unlike supervised learning (Classes 2–3) and unsupervised learning (other Class 4 notebooks), **reinforcement learning** involves:
- An **agent** that takes actions
- An **environment** that responds to actions
- **Rewards** that guide learning
- **Learning through trial and error**

**Objective**: Understand reinforcement learning concepts and implement a simple Q-Learning example where an agent learns to navigate a grid world.

**Model type**: Q-Learning (value-based reinforcement learning).

**Key idea**: The agent learns a Q-table that stores the expected future rewards for each state-action pair. Through exploration and exploitation, the agent improves its policy over time.

We'll follow a step-by-step workflow, with short concept sections on RL and Q-Learning along the way:

1. Install the required libraries
2. Import libraries and check the environment
3. Create a simple grid world environment
4. Implement the Q-Learning algorithm
5. Train the agent
6. Visualize training progress
7. Visualize the learned policy, then test the trained agent

Run the import cell in Step 2 to confirm your environment works.

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/class-4-unsupervised-learning/class-4-unsupervised-learning/04_class_4_reinforcement_learning_basics.ipynb)

> Tip: This notebook assumes you're comfortable with basic Python, NumPy, and Matplotlib from Classes 2 and 3.

## What is Reinforcement Learning?

**Supervised Learning** (Classes 2–3):
- Learn from labeled examples (input → output pairs)
- Goal: Predict the correct output for new inputs
- Examples: House price prediction, image classification

**Unsupervised Learning** (Class 4):
- Learn from unlabeled data
- Goal: Discover hidden patterns or structures
- Examples: Clustering, dimensionality reduction

**Reinforcement Learning** (This notebook):
- Learn through interaction with an environment
- Goal: Maximize cumulative reward over time
- Examples: Game playing (Chess, Go), robotics, autonomous vehicles

**Key Components**:
- **Agent**: The learner/decision maker
- **Environment**: The world the agent interacts with
- **State**: Current situation
- **Action**: What the agent can do
- **Reward**: Feedback signal (positive = good, negative = bad)
- **Policy**: Strategy for choosing actions
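
These components interact in a loop: the agent observes a state, its policy picks an action, and the environment answers with a reward and a new state. Here is a minimal sketch of that loop. It assumes a hypothetical one-dimensional corridor (`corridor_step`) and a random policy (`random_policy`); neither is part of the grid world built later in this notebook.

```python
import random

# Hypothetical toy environment: a corridor with positions 0..4 and the goal at 4.
GOAL = 4

def corridor_step(state, action):
    """Environment: apply an action (-1 = left, +1 = right), return (next_state, reward, done)."""
    next_state = min(GOAL, max(0, state + action))
    done = next_state == GOAL
    reward = 10 if done else -1
    return next_state, reward, done

def random_policy(state):
    """Policy: ignores the state and picks an action uniformly at random."""
    return random.choice([-1, 1])

def run_episode(policy, max_steps=1000, seed=0):
    """Run one episode from position 0 and return (total_reward, steps)."""
    random.seed(seed)
    state, total_reward, steps, done = 0, 0, 0, False
    while not done and steps < max_steps:
        action = policy(state)                               # agent acts
        state, reward, done = corridor_step(state, action)   # environment responds
        total_reward += reward                               # rewards accumulate
        steps += 1
    return total_reward, steps

total_reward, steps = run_episode(random_policy)
print(f"Random policy: total reward = {total_reward}, steps = {steps}")
```

A policy that always moves right, `run_episode(lambda s: 1)`, reaches the goal in 4 steps. The random policy usually wanders for longer and collects more -1 penalties. Closing that gap is what learning is for.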

## STEP 1: Install Required Libraries

If running locally, install the required packages. In Colab, these are already available.

In [None]:
# Install required libraries (run this if needed)
# Uncomment the line below if running locally and packages aren't installed
# !pip install numpy matplotlib

## STEP 2: Import Libraries

Import NumPy for numerical operations and Matplotlib for visualization.

In [None]:
# Environment sanity check + imports
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

try:
    import numpy as np
    import matplotlib.pyplot as plt

    print("NumPy:", np.__version__)
    print("All libraries imported successfully!")
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy matplotlib")
    raise

## Q-Learning: Concept

**Q-Learning** is a value-based reinforcement learning algorithm that learns the optimal action-value function Q(s, a).

**Q-Table**: A table that stores Q-values for each state-action pair.
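- Q(s, a) = expected cumulative (discounted) future reward when taking action 'a' in state 's' and acting optimally afterward
- Within a state, the action with the highest Q-value is the one currently estimated to be best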
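
As an illustration, a Q-table for a 4x4 grid with four actions could be held in a single NumPy array. This is only a sketch: the array layout and the example values are assumptions, and the implementation later in this notebook uses a dictionary keyed by `(row, col)` instead.

```python
import numpy as np

n_rows, n_cols, n_actions = 4, 4, 4
q_table = np.zeros((n_rows, n_cols, n_actions))   # one Q-value per (row, col, action)

# Made-up values for state (2, 3): pretend action 1 currently looks best.
q_table[2, 3] = [0.5, 9.0, -1.0, 0.2]

state = (2, 3)
best_action = int(np.argmax(q_table[state]))   # indexing with the tuple selects that state's row of Q-values
print(q_table.shape, best_action, q_table[state][best_action])   # (4, 4, 4) 1 9.0
```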

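**Q-Learning Update Rule**:
```
Q(s, a) ← Q(s, a) + α * [R + γ * max_a' Q(s', a') - Q(s, a)]
```

Where `←` means the table entry is overwritten with the new value, and:
- **α (alpha)**: Learning rate (how far each update moves Q(s, a) toward the new estimate)
- **R**: Immediate reward
- **γ (gamma)**: Discount factor (how much future rewards count relative to immediate ones)
- **s'**: Next state
- **max_a' Q(s', a')**: Best Q-value in the next state, taken over all actions a'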
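
To make the rule concrete, here is a single update worked step by step. All numbers (α = 0.1, γ = 0.95, R = -1, and the Q-values) are made up for illustration.

```python
import numpy as np

alpha, gamma = 0.1, 0.95                    # illustrative learning rate and discount factor
reward = -1.0                               # immediate reward R
q_current = 0.0                             # Q(s, a) before the update
q_next = np.array([0.0, 2.0, 1.0, 4.0])     # Q(s', a') for each action a'

td_target = reward + gamma * q_next.max()   # -1 + 0.95 * 4 = 2.8
td_error = td_target - q_current            # 2.8 - 0 = 2.8
q_updated = q_current + alpha * td_error    # 0 + 0.1 * 2.8 = 0.28

print(f"target = {td_target:.2f}, error = {td_error:.2f}, updated Q = {q_updated:.2f}")
```

The estimate moves only a tenth of the way toward the target, which is why many visits to the same state-action pair are needed before its value settles.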

**Exploration vs Exploitation**:
- **Exploration**: Try random actions to discover better strategies
- **Exploitation**: Use the best-known action
- **ε-greedy**: Choose random action with probability ε, otherwise choose best action
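
A minimal sketch of ε-greedy selection, separate from the agent class defined later. The helper name `epsilon_greedy` and the Q-values are assumptions for this example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: any action, uniformly
    return int(np.argmax(q_values))               # exploit: highest Q-value

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2, 0.0])   # made-up Q-values; action 1 is best

picks = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"Share of greedy picks: {greedy_share:.2f}")
```

With ε = 0.2 and four actions, the greedy action should come up about 1 - ε + ε/4 = 0.85 of the time, because a random draw can also land on it.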

## STEP 3: Create Grid World Environment

We'll create a simple 4x4 grid world where:
- The agent starts at position (0, 0)
- The goal is at position (3, 3)
- The agent can move: Up, Down, Left, Right (a move into a wall leaves it where it is)
- Reaching the goal gives +10 reward and ends the episode
- Every other step gives -1 reward (encourages finding the shortest path)
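
These rules fix the best total reward an episode can earn, which is a useful benchmark when reading the training output later. The arithmetic, as a short sketch:

```python
size = 4
start, goal = (0, 0), (size - 1, size - 1)

# With only Up/Down/Left/Right moves, the fewest steps is the Manhattan distance.
min_steps = abs(goal[0] - start[0]) + abs(goal[1] - start[1])   # 3 + 3 = 6

# Every step costs -1 except the final one, which earns +10 instead.
best_return = -1 * (min_steps - 1) + 10                          # -5 + 10 = 5

print(f"Shortest path: {min_steps} steps, best total reward: {best_return}")
```

An agent that has learned a shortest path should therefore finish its greedy episodes in 6 steps with a total reward of 5.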

In [None]:
# Concept: Define a simple grid world environment
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.start = (0, 0)
        self.goal = (size-1, size-1)
        self.state = self.start
        
        # Actions: 0=Up, 1=Down, 2=Left, 3=Right
        self.actions = [0, 1, 2, 3]
        
    def reset(self):
        """Reset environment to start state"""
        self.state = self.start
        return self.state
    
    def step(self, action):
        """Take an action and return (next_state, reward, done)"""
        row, col = self.state
        
        # Move based on action
        if action == 0:  # Up
            row = max(0, row - 1)
        elif action == 1:  # Down
            row = min(self.size - 1, row + 1)
        elif action == 2:  # Left
            col = max(0, col - 1)
        elif action == 3:  # Right
            col = min(self.size - 1, col + 1)
        
        self.state = (row, col)
        
        # Check if goal reached
        if self.state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1  # Small penalty for each step
            done = False
        
        return self.state, reward, done

# Test the environment
env = GridWorld()
print(f"Grid World Environment ({env.size}x{env.size})")
print(f"Start: {env.start}")
print(f"Goal: {env.goal}")
print("Actions: Up(0), Down(1), Left(2), Right(3)")

## STEP 4: Implement Q-Learning Algorithm

We'll implement the Q-Learning algorithm with:
- Q-table initialization
- ε-greedy action selection
- Q-value updates

In [None]:
# Concept: Implement Q-Learning algorithm
class QLearning:
    def __init__(self, env, learning_rate=0.1, discount_factor=0.95, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.env = env
        self.learning_rate = learning_rate  # α
        self.discount_factor = discount_factor  # γ
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        # Initialize Q-table: Q[state][action]
        # State is represented as (row, col)
        self.q_table = {}
        for row in range(env.size):
            for col in range(env.size):
                self.q_table[(row, col)] = np.zeros(len(env.actions))
    
    def get_state_index(self, state):
        """Return the Q-table key for a state (the (row, col) tuple is used as-is)"""
        return state
    
    def choose_action(self, state):
        """Choose action using ε-greedy policy"""
        if np.random.random() < self.epsilon:
            # Exploration: random action
            return np.random.choice(self.env.actions)
        else:
            # Exploitation: best action
            return np.argmax(self.q_table[state])
    
    def update_q_value(self, state, action, reward, next_state, done):
        """Update Q-value using Q-Learning update rule"""
        current_q = self.q_table[state][action]
        
        if done:
            target_q = reward
        else:
            target_q = reward + self.discount_factor * np.max(self.q_table[next_state])
        
        # Q-Learning update
        self.q_table[state][action] = current_q + self.learning_rate * (target_q - current_q)
    
    def decay_epsilon(self):
        """Decay exploration rate, never dropping below epsilon_min"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# Initialize Q-Learning agent
q_agent = QLearning(env)
print("Q-Learning agent initialized!")
print(f"Initial Q-table size: {len(q_agent.q_table)} states")
print(f"Q-values per state: {len(q_agent.q_table[(0,0)])} actions")

## STEP 5: Train the Agent

We'll train the agent for multiple episodes, allowing it to learn the optimal policy through exploration and exploitation.
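
How much exploration happens depends on the epsilon schedule. The sketch below replays the decay rule from `QLearning` on its own, using the default values from that class (1.0, 0.995, 0.01) and assuming one decay per episode over 500 episodes.

```python
epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01   # QLearning defaults

schedule = []
for episode in range(500):
    schedule.append(epsilon)          # epsilon used during this episode
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay      # same rule as decay_epsilon()

for ep in (0, 99, 249, 499):
    print(f"Episode {ep + 1}: epsilon = {schedule[ep]:.3f}")
print(f"After 500 episodes: epsilon = {epsilon:.3f}")
```

With these settings epsilon ends near 0.08, so the agent still takes a random action in roughly 8% of steps at the end of training. It would take about 920 episodes to reach `epsilon_min`.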

In [None]:
# Concept: Train the Q-Learning agent
def train_agent(agent, env, episodes=1000):
    """Train the agent for specified number of episodes"""
    episode_rewards = []
    episode_steps = []
    
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        steps = 0
        done = False
        
        while not done:
            # Choose action
            action = agent.choose_action(state)
            
            # Take action
            next_state, reward, done = env.step(action)
            
            # Update Q-value
            agent.update_q_value(state, action, reward, next_state, done)
            
            state = next_state
            total_reward += reward
            steps += 1
            
            # Prevent infinite loops
            if steps > 100:
                break
        
        # Decay epsilon
        agent.decay_epsilon()
        
        episode_rewards.append(total_reward)
        episode_steps.append(steps)
        
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_steps = np.mean(episode_steps[-100:])
            print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, Avg Steps = {avg_steps:.2f}, Epsilon = {agent.epsilon:.3f}")
    
    return episode_rewards, episode_steps

# Train the agent
print("Training Q-Learning agent...")
rewards, steps = train_agent(q_agent, env, episodes=500)

## STEP 6: Visualize Training Progress

Let's plot how the agent's performance improves over time.

In [None]:
# Concept: Visualize training progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot rewards over time
ax1.plot(rewards, alpha=0.6, linewidth=0.5)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Rewards per Episode')
ax1.grid(True, alpha=0.3)

# Plot steps per episode
ax2.plot(steps, alpha=0.6, linewidth=0.5, color='orange')
ax2.set_xlabel('Episode')
ax2.set_ylabel('Steps per Episode')
ax2.set_title('Steps per Episode')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal average reward (last 100 episodes): {np.mean(rewards[-100:]):.2f}")
print(f"Final average steps (last 100 episodes): {np.mean(steps[-100:]):.2f}")
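
Per-episode curves are noisy, so a moving average can make the trend easier to see. Here is a sketch using `np.convolve`. The helper `moving_average` and the synthetic data are assumptions for this example, not output from the training run above.

```python
import numpy as np

def moving_average(values, window=20):
    """Average each run of `window` consecutive values (output is shorter by window - 1)."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic stand-in for per-episode rewards: an upward trend plus noise.
rng = np.random.default_rng(0)
fake_rewards = np.linspace(-100, 5, 500) + rng.normal(0, 10, 500)

smoothed = moving_average(fake_rewards, window=20)
print(len(fake_rewards), len(smoothed))   # 500 481
```

Applied to this notebook, `moving_average(rewards)` could be drawn on `ax1` against `range(19, len(rewards))` so that each smoothed point lines up with the last episode in its window.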

## STEP 7: Visualize Learned Policy

Let's visualize the optimal policy the agent learned by showing the best action for each state.

In [None]:
# Concept: Visualize the learned policy
def visualize_policy(agent, env):
    """Visualize the learned policy as arrows on a grid"""
    fig, ax = plt.subplots(figsize=(8, 8))
    
    # Arrow directions for each action
    # (dx, dy) in plot coordinates; rows are drawn top-down, so y increases upward
    arrows = {
        0: (0, 0.3),   # Up
        1: (0, -0.3),  # Down
        2: (-0.3, 0),  # Left
        3: (0.3, 0)    # Right
    }
    
    arrow_labels = {0: '↑', 1: '↓', 2: '←', 3: '→'}
    
    # Draw grid
    for row in range(env.size):
        for col in range(env.size):
            state = (row, col)
            
            # Color: start = green, goal = red, others = white
            if state == env.start:
                color = 'lightgreen'
            elif state == env.goal:
                color = 'lightcoral'
            else:
                color = 'white'
            
            # Draw cell
            rect = plt.Rectangle((col - 0.5, env.size - row - 0.5), 1, 1, 
                                facecolor=color, edgecolor='black', linewidth=2)
            ax.add_patch(rect)
            
            # Draw arrow for best action
            if state != env.goal:
                best_action = np.argmax(agent.q_table[state])
                dx, dy = arrows[best_action]
                ax.arrow(col, env.size - row, dx, dy, 
                        head_width=0.15, head_length=0.1, fc='blue', ec='blue')
                
                # Show Q-value
                q_val = agent.q_table[state][best_action]
                # Placed in the cell's upper-left corner so it does not overlap the arrow
                ax.text(col - 0.3, env.size - row + 0.35, f'{q_val:.1f}', 
                       ha='center', va='center', fontsize=8)
    
    ax.set_xlim(-0.5, env.size - 0.5)
    # Cells are drawn at y = env.size - row, so they span 0.5 .. env.size + 0.5
    ax.set_ylim(0.5, env.size + 0.5)
    ax.set_aspect('equal')
    ax.set_title('Learned Policy (Arrows show best action per state)')
    ax.axis('off')
    
    # Add legend
    ax.text(env.size + 0.5, env.size - 1, 'Start', color='green', fontsize=12, weight='bold')
    ax.text(env.size + 0.5, env.size - 2, 'Goal', color='red', fontsize=12, weight='bold')
    
    plt.tight_layout()
    plt.show()

# Visualize the learned policy
visualize_policy(q_agent, env)
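
For a quick check without Matplotlib, the greedy policy can also be printed as text. In the sketch below, the helper `policy_as_text` and the hand-made 2x2 Q-table are assumptions for illustration; the values are not learned.

```python
import numpy as np

ARROWS = {0: "↑", 1: "↓", 2: "←", 3: "→"}   # same action numbering as GridWorld

def policy_as_text(q_table, size, goal):
    """Return one string per grid row, with an arrow for each state's greedy action."""
    lines = []
    for row in range(size):
        cells = []
        for col in range(size):
            if (row, col) == goal:
                cells.append("G")
            else:
                cells.append(ARROWS[int(np.argmax(q_table[(row, col)]))])
        lines.append(" ".join(cells))
    return lines

# Hypothetical hand-made Q-table for a 2x2 grid (not learned values).
demo_q = {
    (0, 0): np.array([0.0, 1.0, 0.0, 2.0]),   # Right is best
    (0, 1): np.array([0.0, 3.0, 0.0, 0.0]),   # Down is best
    (1, 0): np.array([0.0, 0.0, 0.0, 3.0]),   # Right is best
    (1, 1): np.zeros(4),                      # goal state
}
print("\n".join(policy_as_text(demo_q, size=2, goal=(1, 1))))
```

Because the trained agent stores its table in the same dictionary format, a call such as `policy_as_text(q_agent.q_table, env.size, env.goal)` should work on it as well.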

## Test the Trained Agent

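Let's test the agent by having it navigate from start to goal using the learned policy. With exploration switched off, both the policy and the environment are deterministic, so every test episode follows the same path.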

In [None]:
# Concept: Test the trained agent
def test_agent(agent, env, episodes=5):
    """Run greedy test episodes (no exploration) and return the path of the last one"""
    action_names = {0: 'Up', 1: 'Down', 2: 'Left', 3: 'Right'}
    
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        steps = 0
        path = [state]
        
        print(f"\nEpisode {episode + 1}:")
        print(f"  Start: {state}")
        
        done = False
        while not done and steps < 20:
            # Use best action (no exploration)
            action = np.argmax(agent.q_table[state])
            next_state, reward, done = env.step(action)
            
            print(f"  Step {steps + 1}: {state} -> {action_names[action]} -> {next_state} (reward: {reward})")
            
            state = next_state
            total_reward += reward
            steps += 1
            path.append(state)
            
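            if done:
                print(f"  Goal reached! Total reward: {total_reward}, Steps: {steps}")
                break
        
        # Report failures instead of ending the episode silently
        if not done:
            print(f"  Goal not reached within {steps} steps. Total reward: {total_reward}")
    
    return path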

# Test the agent
test_path = test_agent(q_agent, env, episodes=3)
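
A greedy policy can get stuck, for example by walking into a wall forever, so it helps to check success explicitly instead of reading the printed steps. The sketch below is standalone: it reimplements the movement rule, and the helper `greedy_path` and the hand-made Q-table are assumptions for illustration.

```python
import numpy as np

SIZE, GOAL = 4, (3, 3)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # Up, Down, Left, Right

def step(state, action):
    """Same movement rule as GridWorld.step: moves are clipped at the walls."""
    row = min(SIZE - 1, max(0, state[0] + MOVES[action][0]))
    col = min(SIZE - 1, max(0, state[1] + MOVES[action][1]))
    return (row, col)

def greedy_path(q_table, start=(0, 0), max_steps=20):
    """Follow argmax actions from start; return (path, reached_goal)."""
    path, state = [start], start
    while state != GOAL and len(path) <= max_steps:
        state = step(state, int(np.argmax(q_table[state])))
        path.append(state)
    return path, state == GOAL

# Hypothetical Q-table: prefer Right until the last column, then prefer Down.
demo_q = {
    (r, c): np.array([0.0, 1.0 if c == SIZE - 1 else 0.0, 0.0, 1.0 if c < SIZE - 1 else 0.0])
    for r in range(SIZE) for c in range(SIZE)
}

path, reached = greedy_path(demo_q)
print(reached, len(path) - 1)   # True 6
```

An all-zero table makes `np.argmax` return action 0 (Up) everywhere, so the agent never leaves the start cell and `greedy_path` reports `False`.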

## Key Learning

**Reinforcement Learning** is a powerful approach for learning through interaction:

- **Agent learns by trial and error**: no labeled examples needed
- **Q-Learning** learns optimal action-values through exploration and exploitation
- **ε-greedy policy** balances exploration (trying new actions) and exploitation (using best-known actions)
- **Rewards guide learning**: the agent learns to maximize cumulative reward

**Key Concepts**:
- **State**: Current situation
- **Action**: What the agent can do
- **Reward**: Feedback signal
- **Q-Value**: Expected future reward for a state-action pair
- **Policy**: Strategy for choosing actions

**Real-World Applications**:
- Game playing (Chess, Go, video games)
- Robotics (navigation, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource allocation

**Next Steps**: Explore more advanced RL algorithms like Deep Q-Networks (DQN), Policy Gradient methods, and multi-agent reinforcement learning.