# What is Reinforcement Learning?

Welcome to your journey into Reinforcement Learning! This notebook will introduce you to RL using simple language, real-world analogies, and interactive examples.

## What You'll Learn

By the end of this notebook, you'll understand:
- What reinforcement learning is (explained with everyday examples!)
- The agent-environment interaction loop
- Key concepts: states, actions, rewards, and policies
- How RL differs from other types of machine learning
- How to build and visualize your first RL environment

**Prerequisites:** Basic Python. Absolutely no prior RL knowledge needed!

**Time:** ~30 minutes

---
## The Big Picture: Learning by Doing

### Analogy: Training a Dog

Imagine you're training a puppy to sit:

1. **You say "sit"** (the puppy observes the situation)
2. **The puppy does something** - maybe it sits, maybe it jumps, maybe it does nothing
3. **You give feedback:**
   - If it sits: "Good boy!" + treat (+reward)
   - If it doesn't: No treat (no reward)
4. **The puppy learns** which actions lead to treats!

**This is the core idea behind reinforcement learning!**

The puppy (agent) learns through trial and error, receiving rewards for good behavior.

```
    You say "sit"           Puppy sits           You give treat
    ┌─────────┐            ┌─────────┐           ┌─────────┐
    │  STATE  │  ────────> │ ACTION  │ ────────> │ REWARD  │
    │ ("sit") │            │ (sits)  │           │ (treat) │
    └─────────┘            └─────────┘           └─────────┘
```
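To make "the puppy learns which actions lead to treats" a little more concrete, here is a minimal sketch. It is a toy model, not a real RL algorithm: the action names, the `treat_for` function, and the idea of trying each action in turn are all assumptions made up for illustration.

```python
# Hypothetical toy model of the puppy example: three possible actions,
# and only "sit" earns a treat (reward 1). All names are illustrative.
ACTIONS = ["sit", "jump", "do_nothing"]

def treat_for(action):
    """The trainer's feedback: 1 treat for sitting, 0 otherwise."""
    return 1 if action == "sit" else 0

# The puppy keeps a running average of treats earned per action.
treat_average = {action: 0.0 for action in ACTIONS}
tries = {action: 0 for action in ACTIONS}

for trial in range(9):
    action = ACTIONS[trial % len(ACTIONS)]  # try each action in turn
    reward = treat_for(action)
    tries[action] += 1
    # Incremental mean: new_avg = old_avg + (reward - old_avg) / n
    treat_average[action] += (reward - treat_average[action]) / tries[action]

best_action = max(treat_average, key=treat_average.get)
print(treat_average)                     # {'sit': 1.0, 'jump': 0.0, 'do_nothing': 0.0}
print("Favorite action:", best_action)   # Favorite action: sit
```

After a few trials the puppy's averages point clearly at "sit". Real RL methods decide what to try next in smarter ways than simply cycling through the options.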

---
## The Three Types of Machine Learning

Before diving deeper, let's see how RL compares to other approaches:

### 1. Supervised Learning (Like a Teacher)
- **Analogy:** A teacher shows you problems WITH answers
- "This photo is a cat. This photo is a dog. Now you try!"
- You learn from labeled examples

### 2. Unsupervised Learning (Like Exploring)
- **Analogy:** Finding patterns in a pile of Legos
- "These pieces seem to go together, those are different"
- No one tells you what's right or wrong

### 3. Reinforcement Learning (Like a Video Game)
- **Analogy:** Learning to play a new video game
- No instruction manual - you try things and see what happens
- Good moves = points, bad moves = game over
- You learn the best strategy through experience!

| Type | Learns From | Example |
|------|-------------|--------|
| Supervised | Labeled answers | "This email is spam" |
| Unsupervised | Patterns in data | "These customers are similar" |
| **Reinforcement** | **Trial and error + rewards** | **"Moving right got me 10 points!"** |
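
One way to see the difference is to look at the shape of the data each approach learns from. The snippet below is only a sketch with made-up values; the variable names are hypothetical and nothing here comes from a real dataset.

```python
# Illustrative data only; none of these values come from a real dataset.

# Supervised learning: every input comes with the correct answer (a label).
supervised_data = [
    ("Win a free prize now!!!", "spam"),
    ("Meeting moved to 3pm", "not spam"),
]

# Unsupervised learning: inputs only, no labels. The algorithm looks for structure.
unsupervised_data = [
    {"age": 25, "monthly_spend": 40},
    {"age": 27, "monthly_spend": 45},
    {"age": 61, "monthly_spend": 300},
]

# Reinforcement learning: experience the agent gathers by acting.
# Each record is (state, action, reward, next_state).
rl_experience = [
    ((0, 0), "RIGHT", 10, (0, 1)),
    ((0, 1), "LEFT", 0, (0, 0)),
]

for state, action, reward, next_state in rl_experience:
    print(f"In {state}, doing {action} gave reward {reward} and led to {next_state}")
```

Notice that the RL records are produced by the agent's own choices: different actions would have produced different data.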

---
## The RL Loop: Agent and Environment

Every RL problem has two main characters:

### The Agent (The Learner)
- Makes decisions
- Takes actions
- Learns from experience
- **Examples:** A robot, an AI player, a trading algorithm

### The Environment (The World)
- Where the agent lives and acts
- Responds to actions
- Provides feedback (rewards)
- **Examples:** A maze, a video game, the stock market

They interact in a loop:

In [None]:
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Draw a diagram of the RL loop
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_facecolor('#f8f9fa')

# Environment box (top)
env_box = FancyBboxPatch((2, 6), 10, 2.5, boxstyle="round,pad=0.1", 
                          facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=3)
ax.add_patch(env_box)
ax.text(7, 7.5, 'ENVIRONMENT', ha='center', va='center', 
        fontsize=18, fontweight='bold', color='#1976d2')
ax.text(7, 6.8, '(The World: maze, game, robot world...)', ha='center', 
        fontsize=11, color='#1976d2', style='italic')

# Agent box (bottom)
agent_box = FancyBboxPatch((4, 1.5), 6, 2.5, boxstyle="round,pad=0.1", 
                            facecolor='#e8f5e9', edgecolor='#388e3c', linewidth=3)
ax.add_patch(agent_box)
ax.text(7, 3, 'AGENT', ha='center', va='center', 
        fontsize=18, fontweight='bold', color='#388e3c')
ax.text(7, 2.3, '(The Learner: robot, AI, algorithm...)', ha='center', 
        fontsize=11, color='#388e3c', style='italic')

# Arrow: State (environment -> agent) - LEFT SIDE
ax.annotate('', xy=(4.5, 4.2), xytext=(3, 6),
            arrowprops=dict(arrowstyle='->', lw=3, color='#7b1fa2'))
ax.text(2.2, 5.2, 'STATE', fontsize=14, color='#7b1fa2', fontweight='bold')
ax.text(2.2, 4.7, '"Where am I?"', fontsize=10, color='#7b1fa2', style='italic')

# Arrow: Reward (environment -> agent) - RIGHT SIDE  
ax.annotate('', xy=(9.5, 4.2), xytext=(11, 6),
            arrowprops=dict(arrowstyle='->', lw=3, color='#f57c00'))
ax.text(11.2, 5.2, 'REWARD', fontsize=14, color='#f57c00', fontweight='bold')
ax.text(11.2, 4.7, '"+10 points!"', fontsize=10, color='#f57c00', style='italic')

# Arrow: Action (agent -> environment) - CENTER
ax.annotate('', xy=(7, 6), xytext=(7, 4),
            arrowprops=dict(arrowstyle='->', lw=3, color='#d32f2f'))
ax.text(7.3, 5.2, 'ACTION', fontsize=14, color='#d32f2f', fontweight='bold')
ax.text(7.3, 4.7, '"Go right!"', fontsize=10, color='#d32f2f', style='italic')

# Title
ax.text(7, 9.3, 'The Reinforcement Learning Loop', 
        ha='center', fontsize=20, fontweight='bold', color='#333')

# Step explanations at the bottom
steps = [
    "1. Agent sees the STATE (current situation)",
    "2. Agent takes an ACTION (makes a decision)", 
    "3. Environment gives REWARD (feedback: good or bad) and the next STATE",
    "4. Repeat! Agent learns which actions lead to high rewards"
]
for i, step in enumerate(steps):
    ax.text(7, 0.8 - i*0.25, step, ha='center', fontsize=10, color='#555')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("THE RL LOOP - Think of it like this:")
print("="*60)
print("\n1. You're playing a video game (you are the AGENT)")
print("2. The game shows you the screen (that's the STATE)")
print("3. You press a button (that's your ACTION)")
print("4. You get points or lose a life (that's the REWARD)")
print("5. The game continues... you learn what works!")
print("\n" + "="*60)

---
## Key Vocabulary (With Examples!)

Let's learn the RL vocabulary using a familiar example: **Pac-Man**!

| Term | Definition | Pac-Man Example |
|------|------------|----------------|
| **Agent** | The decision-maker | Pac-Man himself |
| **Environment** | The world | The maze with ghosts and dots |
| **State** | Current situation | Positions of Pac-Man, ghosts, remaining dots |
| **Action** | What agent can do | Move UP, DOWN, LEFT, or RIGHT |
| **Reward** | Feedback signal | +10 for eating a dot, +200 for eating a ghost, -500 for dying (example values) |
| **Policy** | The strategy | "If ghost is near, run away. Otherwise, eat dots." |
| **Episode** | One complete game | From start until Pac-Man dies or wins |
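
The policy in the table ("If ghost is near, run away. Otherwise, eat dots.") can be written as a small function. This is a rough sketch under simplifying assumptions: the state dictionary, `danger_radius`, and the two high-level behaviors are invented for illustration, and a real Pac-Man policy would return one of the four moves instead.

```python
# Hypothetical, heavily simplified Pac-Man state: positions are (row, col).
def manhattan_distance(a, b):
    """Grid distance when only up/down/left/right moves are allowed."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def pacman_policy(state, danger_radius=2):
    """Return 'RUN_AWAY' if any ghost is within danger_radius, else 'EAT_DOTS'."""
    pacman = state["pacman"]
    ghost_near = any(
        manhattan_distance(pacman, ghost) <= danger_radius
        for ghost in state["ghosts"]
    )
    return "RUN_AWAY" if ghost_near else "EAT_DOTS"

safe_state = {"pacman": (5, 5), "ghosts": [(0, 0), (9, 9)]}
risky_state = {"pacman": (5, 5), "ghosts": [(5, 6), (9, 9)]}
print(pacman_policy(safe_state))   # EAT_DOTS
print(pacman_policy(risky_state))  # RUN_AWAY
```

The key point is that a policy is just a mapping from state to action. Here we wrote it by hand; in RL the agent has to learn it.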

### More Real-World Examples

**Self-Driving Car:**
- Agent: The car's AI
- State: Camera images, sensor data, speed, location
- Actions: Accelerate, brake, turn left/right
- Rewards: +1 for safe driving, -100 for accidents, +10 for reaching destination

**Robot Learning to Walk:**
- Agent: The robot
- State: Joint angles, balance sensors
- Actions: Move each leg motor
- Rewards: +1 for each step forward, -10 for falling
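
The walking robot's rewards can be sketched as a small reward function. The function name and the made-up episode are hypothetical, and the sketch assumes two details the list above does not specify: a fall overrides any forward progress on that step, and standing still earns 0.

```python
def walking_reward(moved_forward, fell_over):
    """Reward for one time step of the hypothetical walking robot."""
    if fell_over:
        return -10  # assumption: falling overrides any forward progress
    return 1 if moved_forward else 0  # assumption: standing still earns 0

# One made-up episode: each tuple is (moved_forward, fell_over).
episode = [(True, False), (True, False), (False, False), (True, False), (False, True)]
rewards = [walking_reward(moved, fell) for moved, fell in episode]

print(rewards)       # [1, 1, 0, 1, -10]
print(sum(rewards))  # -7
```

Three good steps were not enough to make up for one fall, so the total is negative. Choosing reward values like these is a design decision that shapes what the agent ends up learning.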

---
## Let's Build Our First Environment!

We'll create a simple **Grid World** - a 5x5 grid where an agent must navigate to a goal.

```
┌───┬───┬───┬───┬───┐
│ A │   │   │   │   │   A = Agent (starts here)
├───┼───┼───┼───┼───┤
│   │   │   │   │   │
├───┼───┼───┼───┼───┤
│   │   │   │   │   │
├───┼───┼───┼───┼───┤
│   │   │   │   │   │
├───┼───┼───┼───┼───┤
│   │   │   │   │ G │   G = Goal (get here!)
└───┴───┴───┴───┴───┘
```

**The Rules:**
- Agent can move: UP, DOWN, LEFT, or RIGHT
- Each step gives a reward of -1 (this encourages finding the shortest path)
- The step that reaches the goal gives +10 instead
- Hitting a wall = stay in place (the step still gives -1)
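
As a quick sanity check on these rules, here is a small worked calculation of the best possible total reward. It assumes (as the `SimpleGridWorld` code in this notebook does) that the final step into the goal earns +10 in place of the usual -1; the variable names are just for illustration.

```python
size = 5
start, goal = (0, 0), (size - 1, size - 1)

# Moves are only up/down/left/right, so the shortest path length is the
# Manhattan distance between start and goal.
shortest_path_steps = abs(goal[0] - start[0]) + abs(goal[1] - start[1])

# Every step earns -1 except the final one, which earns +10 instead.
best_total_reward = -1 * (shortest_path_steps - 1) + 10

print(shortest_path_steps)  # 8
print(best_total_reward)    # 3
```

So no policy can score higher than 3 in this grid, which gives us a useful yardstick for judging policies.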

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

class SimpleGridWorld:
    """
    A simple 5x5 Grid World environment.
    
    This is our first RL environment! The agent starts at (0,0) 
    and must reach the goal at (4,4).
    """
    
    def __init__(self, size=5):
        self.size = size
        self.goal = (size-1, size-1)  # Bottom-right corner
        self.action_names = ['UP', 'RIGHT', 'DOWN', 'LEFT']
        self.reset()
    
    def reset(self):
        """Reset the agent to the starting position."""
        self.agent_pos = [0, 0]  # Top-left corner
        return self.get_state()
    
    def get_state(self):
        """Return the current state (agent's position)."""
        return tuple(self.agent_pos)
    
    def step(self, action):
        """
        Take an action and return the result.
        
        Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT
        
        Returns: (new_state, reward, done)
        """
        # Move based on action; max/min keep the agent inside the grid,
        # so walking into a wall leaves it in place
        if action == 0:    # UP
            self.agent_pos[0] = max(0, self.agent_pos[0] - 1)
        elif action == 1:  # RIGHT
            self.agent_pos[1] = min(self.size - 1, self.agent_pos[1] + 1)
        elif action == 2:  # DOWN
            self.agent_pos[0] = min(self.size - 1, self.agent_pos[0] + 1)
        elif action == 3:  # LEFT
            self.agent_pos[1] = max(0, self.agent_pos[1] - 1)
        else:
            raise ValueError(f"Invalid action {action!r}; expected 0, 1, 2, or 3")
        
        # Check if we reached the goal
        done = tuple(self.agent_pos) == self.goal
        
        # Calculate reward
        if done:
            reward = 10  # Big reward for reaching goal!
        else:
            reward = -1  # Small penalty for each step (encourages efficiency)
        
        return self.get_state(), reward, done
    
    def render(self):
        """Display the grid with the agent's position."""
        grid = np.zeros((self.size, self.size))
        
        # Mark goal
        grid[self.goal] = 2
        
        # Mark agent
        grid[tuple(self.agent_pos)] = 1
        
        # Create visualization
        fig, ax = plt.subplots(figsize=(6, 6))
        
        # Custom colormap: white=empty, blue=agent, green=goal
        colors = ['white', '#2196F3', '#4CAF50']
        cmap = ListedColormap(colors)
        
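        # Fix the value range so 0/1/2 always map to white/blue/green,
        # even when the agent stands on the goal and no cell holds a 2
        ax.imshow(grid, cmap=cmap, vmin=0, vmax=2)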
        
        # Add grid lines
        for i in range(self.size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=2)
            ax.axvline(i - 0.5, color='black', linewidth=2)
        
        # Add labels
        ax.text(self.agent_pos[1], self.agent_pos[0], 'A', 
                ha='center', va='center', fontsize=24, fontweight='bold', color='white')
        ax.text(self.goal[1], self.goal[0], 'G', 
                ha='center', va='center', fontsize=24, fontweight='bold', color='white')
        
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Grid World\nA=Agent, G=Goal', fontsize=14)
        
        plt.tight_layout()
        plt.show()

# Create and display our environment!
env = SimpleGridWorld(size=5)
print("Welcome to Grid World!")
print("="*40)
print(f"Grid size: {env.size}x{env.size}")
print(f"Agent starts at: {env.get_state()}")
print(f"Goal is at: {env.goal}")
print(f"Actions: {env.action_names}")
print("\nRewards:")
print("  - Each step: -1 (so we want to be efficient!)")
print("  - Reaching goal: +10")
print("="*40)

env.render()

---
## Let's Watch the Agent-Environment Loop in Action!

We'll manually step through a few moves to see exactly how the loop works.

In [None]:
# Create a fresh environment
env = SimpleGridWorld(size=5)

print("THE AGENT-ENVIRONMENT LOOP IN ACTION")
print("="*50)
print("\nWe'll manually take 4 steps to see how it works:\n")

# Define a sequence of actions to demonstrate
demo_actions = [1, 1, 2, 2]  # RIGHT, RIGHT, DOWN, DOWN

total_reward = 0

for step, action in enumerate(demo_actions, 1):
    # Get current state BEFORE the action
    old_state = env.get_state()
    
    # Take the action
    new_state, reward, done = env.step(action)
    total_reward += reward
    
    # Print what happened
    print(f"Step {step}:")
    print(f"  State (before): {old_state}")
    print(f"  Action taken:   {env.action_names[action]}")
    print(f"  State (after):  {new_state}")
    print(f"  Reward:         {reward}")
    print(f"  Done?           {done}")
    print(f"  Total reward:   {total_reward}")
    print("-" * 40)

print(f"\nAfter 4 steps, the agent has moved from (0,0) to {env.get_state()}")
print(f"Total reward collected: {total_reward}")
print("\nLet's visualize the current position:")
env.render()

---
## What is a Policy?

A **policy** is the agent's strategy - it tells the agent what to do in each situation.

### Analogy: GPS Navigation

Think of a policy like GPS directions:
- **State:** Your current location
- **Policy:** The GPS rules that tell you which way to turn
- **Action:** The turn you actually make

A good policy gets you to your destination quickly. A bad policy gets you lost!
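
For small problems, a policy can be as simple as a lookup table from state to action, much like a list of GPS turn instructions. The sketch below uses a hypothetical 2x2 grid with the goal at (1, 1); the table and the function name are made up for illustration.

```python
# A tiny hypothetical policy for a 2x2 grid with the goal at (1, 1).
# Keys are states (row, col); values are the action to take there.
policy_table = {
    (0, 0): "RIGHT",
    (0, 1): "DOWN",
    (1, 0): "RIGHT",
}

def table_policy(state):
    """Look up the action for this state."""
    return policy_table[state]

print(table_policy((0, 0)))  # RIGHT
print(table_policy((0, 1)))  # DOWN
```

The goal state has no entry because the episode ends there. A policy can also be written as a function with rules instead of a table.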

### Let's Compare Two Policies:

In [None]:
def random_policy(state):
    """
    A RANDOM policy - just picks any action randomly.
    
    This is like closing your eyes and picking a direction!
    Not very smart, but it's a starting point.
    """
    return np.random.randint(0, 4)  # Random: UP, RIGHT, DOWN, or LEFT


def smart_policy(state):
    """
    A SMART policy - always moves toward the goal.
    
    This policy knows the goal is at (4,4) and always moves closer to it.
    It's like having a GPS that knows exactly where to go!
    """
    row, col = state
    goal_row, goal_col = 4, 4
    
    # If we're not at the goal row, move down
    if row < goal_row:
        return 2  # DOWN
    # If we're at goal row but not goal column, move right
    elif col < goal_col:
        return 1  # RIGHT
    # We're at the goal!
    else:
        return 0  # Doesn't matter, we're done


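def run_episode(env, policy, policy_name, max_steps=50):
    """
    Run one episode using the given policy, stopping at the goal or after max_steps.
    
    policy_name is a label for the caller's convenience and is not used here.
    Returns (total_reward, steps, path, done); done is True only if the goal was reached.
    """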
    state = env.reset()
    total_reward = 0
    steps = 0
    path = [state]  # Track the path taken
    
    for step in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        path.append(state)
        
        if done:
            break
    
    return total_reward, steps, path, done


# Compare the two policies!
print("COMPARING POLICIES: Random vs Smart")
print("=" * 50)

# Smart policy
env = SimpleGridWorld(size=5)
smart_reward, smart_steps, smart_path, smart_success = run_episode(env, smart_policy, "Smart")

print("\nSMART POLICY (always moves toward goal):")
print(f"  Path taken: {' -> '.join([str(p) for p in smart_path])}")
print(f"  Steps taken: {smart_steps}")
print(f"  Total reward: {smart_reward}")
print(f"  Reached goal: {'Yes!' if smart_success else 'No'}")

# Random policy (run multiple times since it's random)
print("\n" + "-"*50)
print("\nRANDOM POLICY (picks random directions):")

random_rewards = []
random_steps_list = []
successes = 0

for i in range(100):  # Run 100 episodes
    env = SimpleGridWorld(size=5)
    reward, steps, path, success = run_episode(env, random_policy, "Random")
    random_rewards.append(reward)
    random_steps_list.append(steps)
    if success:
        successes += 1

print(f"  (Results averaged over 100 episodes)")
print(f"  Average steps: {np.mean(random_steps_list):.1f}")
print(f"  Average reward: {np.mean(random_rewards):.1f}")
print(f"  Success rate: {100 * successes / len(random_rewards):.0f}%")

print("\n" + "="*50)
print("\nCONCLUSION:")
print(f"  Smart policy: {smart_steps} steps, {smart_reward} reward (optimal!)")
print(f"  Random policy: ~{np.mean(random_steps_list):.0f} steps, ~{np.mean(random_rewards):.0f} reward")
print("\n  The goal of RL is to LEARN a smart policy through experience!")
print("  We don't want to hard-code it - we want the agent to discover it.")

---
## Visualizing the Policies

In [None]:
def visualize_path(path, title):
    """Visualize a path through the grid."""
    fig, ax = plt.subplots(figsize=(7, 7))
    
    # Draw grid
    for i in range(6):
        ax.axhline(i, color='black', linewidth=2)
        ax.axvline(i, color='black', linewidth=2)
    
    # Draw path
    path_x = [p[1] + 0.5 for p in path]
    path_y = [p[0] + 0.5 for p in path]
    
    # Draw line
    ax.plot(path_x, path_y, 'b-', linewidth=3, alpha=0.5, label='Path')
    
    # Draw points
    for i, (x, y) in enumerate(zip(path_x, path_y)):
        if i == 0:
            ax.scatter(x, y, s=300, c='blue', zorder=5, label='Start')
            ax.text(x, y, 'S', ha='center', va='center', fontsize=14, color='white', fontweight='bold')
        elif i == len(path) - 1:
            ax.scatter(x, y, s=300, c='green', zorder=5, label='End')
            ax.text(x, y, 'E', ha='center', va='center', fontsize=14, color='white', fontweight='bold')
        else:
            ax.scatter(x, y, s=100, c='lightblue', zorder=4, edgecolors='blue')
            ax.text(x, y, str(i), ha='center', va='center', fontsize=10)
    
    # Mark goal
    ax.add_patch(plt.Rectangle((4, 4), 1, 1, color='lightgreen', alpha=0.5))
    ax.text(4.5, 4.5, 'GOAL', ha='center', va='center', fontsize=12, fontweight='bold', color='green')
    
    ax.set_xlim(0, 5)
    ax.set_ylim(5, 0)  # Inverted y-axis
    ax.set_aspect('equal')
    ax.set_title(title, fontsize=14)
    ax.legend(loc='upper right')
    
    plt.tight_layout()
    plt.show()

# Show smart policy path
env = SimpleGridWorld(size=5)
_, _, smart_path, _ = run_episode(env, smart_policy, "Smart")
visualize_path(smart_path, f"Smart Policy: {len(smart_path)-1} steps (Optimal!)")

# Show a random policy path
np.random.seed(42)  # For reproducibility
env = SimpleGridWorld(size=5)
_, _, random_path, _ = run_episode(env, random_policy, "Random", max_steps=30)
visualize_path(random_path, f"Random Policy: {len(random_path)-1} steps (Inefficient!)")

---
## Exploration vs Exploitation: A Key Challenge

One of the most important concepts in RL is the **exploration-exploitation trade-off**.

### The Restaurant Analogy

Imagine you're in a new city and want to find a good restaurant:

**Exploitation (Use what you know):**
- You found a decent Italian place yesterday
- You could go there again - you know it's okay!
- Safe choice, but maybe there's something better?

**Exploration (Try something new):**
- There's a Thai place you've never tried
- It might be amazing... or terrible
- Risky, but you might discover your new favorite!

### The Balance

- **Too much exploitation:** You miss better options
- **Too much exploration:** You waste time on bad choices

**Good RL agents balance both!**
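
The restaurant trade-off can be put into numbers with a small simulation. This is only a sketch: the ratings, the third restaurant ("Burger"), and the `total_enjoyment` helper are invented for illustration, and the ratings are fixed so the results are easy to check by hand.

```python
# Made-up "true" enjoyment scores; the diner does not know these in advance.
true_rating = {"Italian": 6, "Thai": 9, "Burger": 3}
nights = 10

def total_enjoyment(explore_plan):
    """Total enjoyment over all nights.

    explore_plan: restaurants to visit on the first nights, one per night.
    Once the plan runs out, the diner goes to the best place known so far.
    """
    known = {"Italian": true_rating["Italian"]}  # already tried yesterday
    total = 0
    for night in range(nights):
        if night < len(explore_plan):
            choice = explore_plan[night]        # EXPLORE: follow the plan
        else:
            choice = max(known, key=known.get)  # EXPLOIT: best known so far
        rating = true_rating[choice]
        known[choice] = rating
        total += rating
    return total

print(total_enjoyment([]))                      # 60 (never explores)
print(total_enjoyment(["Thai", "Burger"]))      # 84 (explores a little)
print(total_enjoyment(["Thai", "Burger"] * 5))  # 60 (never stops exploring)
```

A little exploration costs one disappointing meal but finds the best restaurant. Exploring forever never cashes in on what was learned.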

In [None]:
def epsilon_greedy_policy(state, best_action_func, epsilon=0.1):
    """
    Epsilon-Greedy Policy: The simplest way to balance exploration and exploitation!
    
    - With probability (1 - epsilon): EXPLOIT - take the best known action
    - With probability epsilon: EXPLORE - take a random action
    
    Args:
        state: Current state
        best_action_func: Function that returns the best action for a state
        epsilon: Probability of exploring (0.1 = 10% exploration)
    """
    if np.random.random() < epsilon:
        # EXPLORE: Random action
        return np.random.randint(0, 4)
    else:
        # EXPLOIT: Best known action
        return best_action_func(state)

# Demonstrate epsilon-greedy
print("EPSILON-GREEDY POLICY DEMONSTRATION")
print("="*50)
print("\nThis policy usually picks the best action,")
print("but sometimes explores randomly.\n")

# Simulate 1000 decisions at state (0,0)
# Best action from (0,0) is to go DOWN or RIGHT
def best_action(state):
    return 2  # DOWN is best from (0,0)

for epsilon in [0.0, 0.1, 0.3, 0.5, 1.0]:
    actions = [epsilon_greedy_policy((0,0), best_action, epsilon) for _ in range(1000)]
    best_count = actions.count(2)  # DOWN
    print(f"Epsilon = {epsilon:.1f}: Best action chosen {best_count/10:.0f}% of the time")

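print("\nInterpretation:")
print("- epsilon=0.0: Always exploit (never explore) - might miss better options")
print("- epsilon=0.1: Mostly exploit, sometimes explore - a common starting point")
print("- epsilon=1.0: Always explore (random) - doesn't use knowledge")
print("- A random pick can still land on the best action (1 in 4 here),")
print("  so even epsilon=1.0 shows roughly 25%, not 0%")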

---
## Real-World RL Success Stories

RL has achieved remarkable things! Here are some famous examples:

### AlphaGo (2016)
- **What:** AI that plays the board game Go
- **Achievement:** Defeated Lee Sedol, one of the world's top professional players, 4-1
- **Why it matters:** Go was long considered too complex for computers to play at a top human level

### OpenAI Five (2019)
- **What:** AI team that plays Dota 2
- **Achievement:** Beat OG, the reigning world champion esports team
- **Why it matters:** Complex teamwork and strategy

### ChatGPT (2022)
- **What:** Conversational AI
- **Achievement:** Tuned to give responses that are more helpful, harmless, and honest
- **How:** RLHF (Reinforcement Learning from Human Feedback) is one part of its training

### Self-Driving Cars
- **What:** Cars that drive themselves
- **Companies:** Tesla, Waymo, Cruise
- **How:** RL is one of several techniques explored for decision-making in complex traffic

### Robotics
- **What:** Robots learning to walk, grasp objects, etc.
- **Companies:** Boston Dynamics, OpenAI
- **How:** RL teaches robots through trial and error, often in simulation before moving to real hardware

---
## Summary: Key Takeaways

Let's recap what we learned:

### 1. What is RL?
- Learning through **trial and error**
- Agent takes actions, gets rewards, learns what works
- Like training a dog or playing a video game!

### 2. The RL Loop
```
State → Agent → Action → Environment → Reward + New State → (repeat)
```
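
The loop above can be sketched in a few lines of Python. This is a minimal, self-contained illustration: the `Corridor` class and the `always_right` policy are hypothetical stand-ins, chosen to be even simpler than the Grid World in this notebook.

```python
class Corridor:
    """Hypothetical 1-D world: cells 0..3, with the goal at cell 3."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.position = min(3, max(0, self.position + action))
        done = self.position == 3
        reward = 10 if done else -1
        return self.position, reward, done

def always_right(state):
    """A fixed policy: ignore the state and always move right."""
    return +1

corridor = Corridor()
state = corridor.reset()
done = False
total_reward = 0

while not done:
    action = always_right(state)                  # State -> Agent -> Action
    state, reward, done = corridor.step(action)   # Action -> Environment -> Reward
    total_reward += reward

print(total_reward)  # 8
```

Two steps at -1 plus the final +10 give a total of 8. A learning agent would run this same loop, but it would also update its policy based on the rewards it sees.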

### 3. Key Vocabulary
- **Agent:** The learner (makes decisions)
- **Environment:** The world (responds to actions)
- **State:** Current situation
- **Action:** What the agent does
- **Reward:** Feedback (positive or negative)
- **Policy:** The agent's strategy

### 4. Exploration vs Exploitation
- **Explore:** Try new things to find better options
- **Exploit:** Use what you know works
- Good agents **balance both**!

### 5. The Goal
**Find a policy that maximizes total reward over time!**

---
## Test Your Understanding

Try to answer these questions before clicking to reveal the answers!

**1. In the dog training analogy, what is the reward?**
<details>
<summary>Click to reveal answer</summary>
The treat! It's the positive feedback that encourages the dog (agent) to repeat the good behavior (action).
</details>

**2. What's the difference between a state and an action?**
<details>
<summary>Click to reveal answer</summary>
The state is the current situation (what the agent observes). The action is what the agent decides to do in response. For example: state = "ghost is nearby", action = "run away".
</details>

**3. Why is a random policy bad?**
<details>
<summary>Click to reveal answer</summary>
A random policy doesn't learn from experience - it just guesses. It takes many more steps to reach the goal (if it ever does!) and collects much less reward than a smart policy.
</details>

**4. What is epsilon in epsilon-greedy?**
<details>
<summary>Click to reveal answer</summary>
Epsilon is the probability of exploring (taking a random action) instead of exploiting (taking the best known action). An epsilon of 0.1 means 10% exploration, 90% exploitation.
</details>

**5. What technology uses RLHF?**
<details>
<summary>Click to reveal answer</summary>
ChatGPT! RLHF (Reinforcement Learning from Human Feedback) is used to make language models more helpful, harmless, and honest.
</details>

---
## What's Next?

Congratulations! You now understand the fundamentals of reinforcement learning!

In the next notebook, we'll learn about **Markov Decision Processes (MDPs)** - the mathematical framework that formalizes everything we discussed here.

Don't worry - we'll use the same intuitive approach with lots of examples and visualizations!

**Continue to:** [Notebook 2: Markov Decision Processes](02_markov_decision_processes.ipynb)

---

*Great job completing your first RL notebook! You're on your way to understanding how AI learns through experience!*