In [None]:
from IPython.display import Image

# AI Car Racer

## Introduction

Welcome to the AI Car Racer competition! You're about to teach a car to race around a track using reinforcement learning (RL). This is a competition: **complete a full lap in the fastest time to win**.


## What You're Building

You'll train an AI agent using **Deep Q-Learning (DQN)** to control a car racing around a track. The model "sees" an image of the track and learns to choose actions (turn left, turn right, accelerate, brake, or do nothing) that maximize its score.


## Competition Rules

- **Goal**: Complete a full lap around the track
- **Winner**: Fastest lap time
- **Track**: Fixed seed (everyone gets the same track)
- **Starting point**: Train from scratch
- **You can**: Tune any hyperparameters and modify wrappers


In [None]:
Image(url='https://gymnasium.farama.org/_images/car_racing.gif')

## Google Colab

### What is Colab?

Google Colab is like Google Docs for code: a free Jupyter notebook environment that runs in your browser.

### Key Concepts

- **Cells**: Blocks of code or text. Run them with `Shift + Enter` or click the ▶️ button
- **Code cells**: Contain Python code you can execute
- **Text cells**: Contain formatted text (like this README)
- **Runtime**: The virtual computer running your code
  - Go to `Runtime → Change runtime type` to select GPU
  - Free GPUs speed up training significantly
- **Session timeout**: After 12 hours or if idle, your runtime disconnects. Your code remains but variables reset

### Essential Shortcuts

- `Shift + Enter`: Run current cell and move to next
- `Ctrl + Enter`: Run current cell and stay on it
- `Ctrl + S`: Save notebook
- `Ctrl + /`: Comment/uncomment code

### Colab File System

- Files you create live in the `/content/` directory
- **Important**: Files are temporary! They disappear when runtime disconnects
- Download important files (models, logs) to your local machine
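
Because files under `/content/` disappear with the runtime, one option is to bundle a folder into a single zip before downloading it. The sketch below uses only the standard library. The helper name `archive_folder` and the archive name are illustrative and not part of this notebook.

```python
import shutil
from pathlib import Path


def archive_folder(folder: str, archive_name: str) -> str:
    """Zip `folder` into `<archive_name>.zip` and return the archive path.

    Hypothetical helper for illustration; adjust names to your own layout.
    """
    folder_path = Path(folder)
    if not folder_path.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    # make_archive appends the ".zip" extension itself and returns the full path.
    return shutil.make_archive(archive_name, "zip", root_dir=str(folder_path))


# Example (run after training has created the folder):
# archive_folder("./checkpoints", "checkpoints_backup")
```

In Colab, the resulting zip can then be downloaded from the file browser in the left sidebar.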



## Reinforcement Learning

In [None]:
Image(url='https://media.geeksforgeeks.org/wp-content/uploads/20220214110501/ImagefromiOS1-660x296.jpg')

### Environment

This is the game world and all of its components: the track, the car, the physics, and how they all interact.

### Observations (state)
This is what the model can "see": a 96×96 RGB image showing a top-down view of the car and the track.

See https://gymnasium.farama.org/environments/box2d/car_racing/#observation-space for details.


### Actions
Based on the observations, the model picks the action it expects to be best.

**Continuous**

This is like having a real steering wheel, with a gas and brake pedal.

| Index | Control  | Range        | Meaning                            |
|-------|----------|--------------|------------------------------------|
| 0     | Steering | [-1.0, +1.0] | -1 = full left, +1 = full right    |
| 1     | Gas      | [0.0, 1.0]   | 0 = no throttle, 1 = full throttle |
| 2     | Brake    | [0.0, 1.0]   | 0 = no brake, 1 = full brake       |


_Example action_:
[0.3, 0.8, 0.0] → "Turn slightly right and 80% throttle, no brake."

Note the `continuous=True`

```python
import gymnasium
env = gymnasium.make("CarRacing-v3", continuous=True)
```

**Discrete**


Discrete actions are like buttons where the model chooses a button to press at each timestep.

| Action | Meaning     |
|--------|-------------|
| 0      | Do nothing  |
| 1      | Steer right |
| 2      | Steer left  |
| 3      | Gas         |
| 4      | Brake       |

Note the `continuous=False`

```python
import gymnasium
env = gymnasium.make("CarRacing-v3", continuous=False)
```


**Choosing Between Continuous and Discrete**

| Mode           | Pros                              | Cons                   |
|----------------|-----------------------------------|------------------------|
| **Continuous** | Realistic, smooth control         | Takes longer to train  |
| **Discrete**   | Simple controls, quicker to learn | Less precise control   |


### Reward

The model receives a reward for each action it takes. The reward is feedback that tells the model whether it should take more actions like that one or fewer.
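
For CarRacing specifically, the Gymnasium documentation describes the reward as -0.1 for every frame plus 1000/N for every track tile visited, where N is the number of tiles in the track. The sketch below works through that arithmetic. The function name and the tile count of 300 are made up for illustration; real tracks vary in length.

```python
def lap_reward(frames_used: int, tiles_visited: int, total_tiles: int) -> float:
    """Approximate episode reward: tile bonuses minus the per-frame time penalty."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    time_penalty = 0.1 * frames_used
    return tile_bonus - time_penalty


# A full lap (every tile visited) finished in 732 frames.
# total_tiles=300 is an assumed value; it cancels out for a full lap.
print(f"{lap_reward(732, 300, 300):.1f}")  # prints 926.8
```

Faster laps lose fewer points to the time penalty, which is why a higher reward and a quicker lap tend to go together.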

## Hyperparameters Explained

These are the knobs you can turn to improve performance. **This is where you'll win the competition!**

In Python, `5e-4` is just a shorthand for writing `0.0005`. The `e-4` means 'shift the decimal 4 places left' or '5 times 10 to the power of -4'.
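
A quick check of the notation, which you can run in any code cell:

```python
learning_rate = 5e-4

# Scientific notation and the written-out decimal are the same number.
print(learning_rate == 0.0005)  # True

# Converting between the two forms for display:
print(f"{1e-4:.4f}")     # 0.0001
print(f"{0.00005:.0e}")  # 5e-05
```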

### Core Training Parameters

#### `total_timesteps` (Default: 500,000)
**What it does**: Total number of actions the agent takes during training. Start small (around 10k) and build up to see how your model reacts.  
**Think of it as**: How much practice time your car gets.  
- **Lower** (10k): Faster training but may not learn completely
- **Higher** (500k): Better final performance but takes longer

#### `learning_rate` (Default: 1e-4)
**What it does**: How big each update step is when learning.  
**Think of it as**: How quickly the car adjusts its strategy after each mistake.  
**Range to try**: 5e-5 to 5e-4
- **Lower** (5e-5): More stable, slower learning, less likely to "forget"
- **Higher** (5e-4): Faster learning but can be unstable, might overshoot
- **Sweet spot**: 1e-4 is a solid default
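
To get a feel for "overshooting", here is a toy sketch that is not DQN at all: plain gradient descent on `f(x) = x**2`, whose minimum is at 0. The step sizes 0.1 and 1.1 are chosen only to make the effect visible on this toy problem and are not suggested values for `learning_rate`.

```python
def minimise_square(learning_rate: float, steps: int = 10, start: float = 1.0) -> float:
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x               # derivative of x**2
        x -= learning_rate * gradient  # one update step
    return x


small = minimise_square(0.1)  # each step moves closer to the minimum at 0
large = minimise_square(1.1)  # each step jumps past 0 and lands further away
print(abs(small) < 1.0, abs(large) > 1.0)  # True True
```

Neural network training is far more complex, but the same trade-off applies: small steps are slow and steady, and steps that are too large can make things worse.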

#### `gamma` (Default: 0.98)
**What it does**: Discount factor for future rewards.  
**Think of it as**: How much the car values long-term success vs immediate rewards.  
**Range to try**: 0.95 - 0.995
- **Lower** (0.95): Car focuses on immediate rewards, more aggressive
- **Higher** (0.995): Car plans ahead more, smoother driving
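
A common rule of thumb is that a discount factor of `gamma` makes the agent care about roughly `1 / (1 - gamma)` future steps. The sketch below computes that horizon for the suggested range and shows how a discounted return is summed. Both function names are illustrative.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of future steps that matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma: float) -> float:
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


for gamma in (0.95, 0.98, 0.995):
    # Roughly 20, 50 and 200 steps respectively
    print(gamma, round(effective_horizon(gamma)))
```

So moving `gamma` from 0.95 to 0.995 stretches the agent's planning window by about a factor of ten.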





### Exploration Parameters

#### `exploration_fraction` (Default: 0.3)
**What it does**: The fraction of training over which random exploration is gradually reduced from its starting level to its final, low level.  
**Think of it as**: How long the car experiments before settling on a strategy.  
**Range to try**: 0.2 - 0.5
- **Lower** (0.2): Commits to learned strategy sooner
- **Higher** (0.5): Explores longer, might find better solutions
- **Sweet spot**: 0.3 for most cases
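
In Stable-Baselines3's DQN, the probability of taking a random action (epsilon) falls linearly from `exploration_initial_eps` to `exploration_final_eps` (1.0 and 0.05 by default) over the first `exploration_fraction` of training, then stays flat. A minimal sketch of that schedule, written from scratch rather than copied from the library:

```python
def epsilon_at(step: int, total_timesteps: int, exploration_fraction: float,
               initial_eps: float = 1.0, final_eps: float = 0.05) -> float:
    """Probability of a random action at a given training step (linear decay)."""
    decay_steps = exploration_fraction * total_timesteps
    if step >= decay_steps:
        return final_eps  # exploration phase is over
    progress = step / decay_steps
    return initial_eps + progress * (final_eps - initial_eps)


# With 500,000 timesteps and exploration_fraction=0.3, decay ends at step 150,000.
for step in (0, 75_000, 150_000, 400_000):
    print(step, round(epsilon_at(step, 500_000, 0.3), 3))
```

Halfway through the decay (step 75,000 here) the car still acts randomly about half the time, so very short test runs are dominated by exploration.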


## Experimentation Strategy

### Phase 1: Quick Iteration (45 min)
Try faster training runs to test ideas:
- Reduce `total_timesteps` to 50k for quick tests
- Try 2-3 different hyperparameter combinations
- Focus on `learning_rate`, `gamma`, and exploration parameters
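
One way to keep quick tests organised is to write the combinations down before running them and record the result of each. The trial names and the `best_trial` helper below are hypothetical, and the values are simply picked from the ranges suggested above.

```python
# Three candidate settings to try with total_timesteps = 50_000 each.
trials = [
    {"name": "baseline", "learning_rate": 1e-4, "gamma": 0.98, "exploration_fraction": 0.3},
    {"name": "fast_lr", "learning_rate": 5e-4, "gamma": 0.98, "exploration_fraction": 0.3},
    {"name": "long_view", "learning_rate": 1e-4, "gamma": 0.995, "exploration_fraction": 0.5},
]


def best_trial(scores: dict) -> str:
    """Return the trial name with the highest recorded mean reward."""
    return max(scores, key=scores.get)


# After each run, note the final ep_rew_mean yourself in a dict keyed by
# trial name, then call best_trial(scores) to pick the winner.
```

Change one or two settings per trial so you can tell which change made the difference.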

### Phase 2: Final Training (60 min)
Once you find promising settings, do a full training run



## Monitoring Training

### Watch the Logs

Key metrics to monitor in the training output:

```text
# After training starts, you'll see:
ep_rew_mean: -50 → -20 → 100 → 300 → 500+  # Getting better!
ep_len_mean: 50 → 100 → 200 → 500+          # Driving for longer!
```

**Good signs:**
- `ep_rew_mean` increasing over time
- `ep_len_mean` increasing (car survives longer)
- Fewer negative rewards

**Bad signs:**
- `ep_rew_mean` stuck or decreasing
- Very short episodes throughout training
- Loss values exploding (> 100)


# Configuration

Edit these variables to customise your training. All the main settings are here. Note that DQN and PPO use different hyperparameters, so some settings apply to only one algorithm.

## Algorithm Selection

Choose which reinforcement learning algorithm to use.

Options:
- `"DQN"`: Deep Q-Network. Supports discrete actions only (like pressing buttons). Good for simple action spaces and faster training on discrete problems.
- `"PPO"`: Proximal Policy Optimization. Supports both discrete and continuous actions. Good for more complex tasks, smoother learning, and continuous control.

For more algorithms, see: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html

In [None]:
ALGORITHM = "DQN"  # Options: "DQN" or "PPO"

## Action Type

This setting is ONLY relevant if you choose PPO above. DQN always uses discrete actions, so this setting is ignored for DQN.

Options:
- `"discrete"`: Button-like controls (left, right, gas, brake, nothing). Simpler for the model to learn, but less precise control.
- `"continuous"`: Analog controls (like a real steering wheel and pedals). More precise control, but takes longer to learn.

In [None]:
PPO_ACTION_SPACE = "discrete"  # Options: "discrete" or "continuous"

## Checkpoint settings

In [None]:
CHECKPOINT_FREQUENCY = 10000  # Save model every X steps
CHECKPOINT_DIR = "./checkpoints/"  # Folder to save checkpoints
RESUME_TRAINING = False  # False starts a new model (required on a first run); True continues from CHECKPOINT_PATH
CHECKPOINT_PATH = "./checkpoints/best_model.zip"  # Path to load checkpoint from

## Training Parameters

In [None]:
TOTAL_TIMESTEPS = 50_000   # Total training steps (start with 50k for testing)
LEARNING_RATE = 5e-4        # How fast the model learns (try: 5e-5 to 5e-4)
GAMMA = 0.98                # Future reward discount (try: 0.95 to 0.995)

# DQN-specific parameters (ignored when using PPO)
EXPLORATION_FRACTION = 0.3  # Fraction of training spent exploring randomly
BUFFER_SIZE = 100_000       # Size of replay buffer (memory for past experiences)
BATCH_SIZE = 64             # Number of samples per training update

# PPO-specific parameters (ignored when using DQN)
N_STEPS = 2048              # Steps to collect before each update
N_EPOCHS = 10               # Number of epochs when updating
CLIP_RANGE = 0.2            # PPO clipping parameter

## Logging and Evaluation

In [None]:
LOG_DIR = "./logs/"         # TensorBoard logs directory
EVAL_FREQ = 10000           # Evaluate model every X steps
N_EVAL_EPISODES = 5         # Number of episodes for each evaluation

print("Configuration loaded successfully!")
print(f"   Algorithm: {ALGORITHM}")
if ALGORITHM == "PPO":
    print(f"   Action Space: {PPO_ACTION_SPACE}")
else:
    print(f"   Action Space: discrete (DQN only supports discrete)")
print(f"   Resume Training: {RESUME_TRAINING}")
print(f"   Checkpoint Frequency: Every {CHECKPOINT_FREQUENCY:,} steps")

In [None]:
!pip install "swig>=4.3.1.post0"
!pip install "gymnasium[box2d]==1.2.0"
!pip install "stable-baselines3[extra]==2.7.0"
!pip install "pyvirtualdisplay"
!sudo apt-get install -y xvfb ffmpeg

In [None]:
# Standard library
import base64
import glob
import io
import os
from datetime import datetime

# Third-party
import gymnasium
import gymnasium as gym  # both names are used in later cells
import numpy as np
from gymnasium.wrappers import RecordVideo, ResizeObservation, TransformAction
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display
from stable_baselines3 import DQN, PPO
from stable_baselines3.common.callbacks import BaseCallback, CheckpointCallback, EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv  # Environment wrapper


track_seed = 69

In [None]:
if ALGORITHM == "DQN":
    use_continuous = False
    print("Using DISCRETE action space (required for DQN)")
    print("   Actions: [0] Nothing, [1] Right, [2] Left, [3] Gas, [4] Brake")
elif ALGORITHM == "PPO":
    use_continuous = (PPO_ACTION_SPACE == "continuous")
    if use_continuous:
        print("Using CONTINUOUS action space (PPO)")
        print("   Actions: Steering [-1,+1], Gas [0,1], Brake [0,1]")
    else:
        print("Using DISCRETE action space (PPO)")
        print("   Actions: [0] Nothing, [1] Right, [2] Left, [3] Gas, [4] Brake")
else:
    raise ValueError(f"Unknown algorithm: {ALGORITHM}. Use 'DQN' or 'PPO'.")

os.makedirs(LOG_DIR, exist_ok=True)
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
print(f"\nLog directory: {LOG_DIR}")
print(f"Checkpoint directory: {CHECKPOINT_DIR}")

def make_env(continuous: bool):
    """
    Create a Car Racing environment with proper wrappers.
    
    Args:
        continuous: If True, use continuous (analog) controls.
                   If False, use discrete (button) controls.
    
    Returns:
        A wrapped gymnasium environment ready for training.
    """
    def _init():
        # Create the base Car Racing environment
        # render_mode="rgb_array" means we get pixel data (for training)
        # Use render_mode="human" if you want to watch it train (slower!)
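        env = gym.make(
            "CarRacing-v3",
            continuous=continuous,  # use the function argument, not the global
            render_mode="rgb_array"
        )
        
        # Seed the environment's random generator so track generation is reproducible.
        # Note: resets without a seed keep using this generator, so later episodes
        # may get different (but reproducible) tracks.
        env.reset(seed=track_seed)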
        
        # Wrap with Monitor to record episode statistics (rewards, lengths)
        # This data is used for logging and evaluation
        env = Monitor(env, LOG_DIR)
        
        return env
    return _init

env = DummyVecEnv([make_env(use_continuous)])

print(f"\nEnvironment created successfully!")
print(f"   Environment: CarRacing-v3")
print(f"   Continuous: {use_continuous}")

## Create or Load Model

In [None]:
if RESUME_TRAINING:
    print(f"Loading model from checkpoint: {CHECKPOINT_PATH}")
    if not os.path.exists(CHECKPOINT_PATH) and not os.path.exists(CHECKPOINT_PATH + ".zip"):
        raise FileNotFoundError(
            f"Checkpoint not found: {CHECKPOINT_PATH}\n"
            f"   Make sure the file exists, or set RESUME_TRAINING = False to start fresh."
        )
    if ALGORITHM == "DQN":
        model = DQN.load(CHECKPOINT_PATH, env=env)
    elif ALGORITHM == "PPO":
        model = PPO.load(CHECKPOINT_PATH, env=env)
    
    print(f"Model loaded successfully!")
    print(f"   You can now continue training from where you left off.")
    
else:
    # ─────────────────────────────────────────────────────────────────────────────
    # CREATE A NEW MODEL
    # ─────────────────────────────────────────────────────────────────────────────
    # Starting fresh with a new, untrained model.
    # The model will learn from scratch.
    
    print(f"Creating a new {ALGORITHM} model...")
    
    if ALGORITHM == "DQN":
        
        model = DQN(
            policy="CnnPolicy",
            env=env,
            learning_rate=LEARNING_RATE,
            gamma=GAMMA,
            buffer_size=BUFFER_SIZE,
            batch_size=BATCH_SIZE,
            exploration_fraction=EXPLORATION_FRACTION,
            tensorboard_log=LOG_DIR,
            verbose=1
        )
        
    elif ALGORITHM == "PPO":       
        model = PPO(
            policy="CnnPolicy",
            env=env,
            learning_rate=LEARNING_RATE,
            gamma=GAMMA,
            n_steps=N_STEPS,
            n_epochs=N_EPOCHS,
            clip_range=CLIP_RANGE,
            tensorboard_log=LOG_DIR,
            verbose=1
        )
    
    print(f"\nNew {ALGORITHM} model created successfully!")
    print(f"   Policy: CnnPolicy (Convolutional Neural Network)")
    print(f"   Learning Rate: {LEARNING_RATE}")
    print(f"   Gamma: {GAMMA}")

## Set Up Callbacks

In [None]:
checkpoint_callback = CheckpointCallback(
    save_freq=CHECKPOINT_FREQUENCY,
    save_path=CHECKPOINT_DIR,
    name_prefix="rl_model",
    save_replay_buffer=True,
    save_vecnormalize=True
)

print(f"Checkpoint callback created")
print(f"   Saving every {CHECKPOINT_FREQUENCY:,} steps to {CHECKPOINT_DIR}")

eval_env = DummyVecEnv([make_env(use_continuous)])
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=CHECKPOINT_DIR,
    log_path=LOG_DIR,
    eval_freq=EVAL_FREQ,
    n_eval_episodes=N_EVAL_EPISODES,
    render=False,
    deterministic=True
)

print(f"Evaluation callback created")
print(f"   Evaluating every {EVAL_FREQ:,} steps")
print(f"   Running {N_EVAL_EPISODES} episodes per evaluation")
print(f"   Best model will be saved to {CHECKPOINT_DIR}best_model.zip")

# Combine all callbacks into a list
callbacks = [checkpoint_callback, eval_callback]

print(f"\nAll callbacks ready!")
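
With `name_prefix="rl_model"`, Stable-Baselines3's `CheckpointCallback` writes model files named like `rl_model_10000_steps.zip`. If you want to resume from the most recent periodic checkpoint rather than `best_model.zip`, a small helper along these lines could locate it. `latest_checkpoint` is a hypothetical name, and the sketch assumes that filename pattern.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str, prefix: str = "rl_model") -> Optional[str]:
    """Return the path of the checkpoint with the highest step count, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)_steps\.zip$")
    best_steps, best_path = -1, None
    for path in Path(checkpoint_dir).glob("*.zip"):
        match = pattern.match(path.name)
        # Compare step counts numerically so 200000 beats 50000.
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), str(path)
    return best_path


# Example: CHECKPOINT_PATH = latest_checkpoint("./checkpoints/")
```

The function returns `None` when no matching file exists, so check the result before passing it to `DQN.load` or `PPO.load`.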

## Train the Model

In [None]:
print("="*80)
print(f"STARTING TRAINING")
print("="*80)
print(f"Algorithm: {ALGORITHM}")
print(f"Total Timesteps: {TOTAL_TIMESTEPS:,}")
print(f"Checkpoints: Every {CHECKPOINT_FREQUENCY:,} steps")
print(f"Evaluation: Every {EVAL_FREQ:,} steps")
print("="*80)
print("\nTraining progress will be shown below...")
print("   Look for 'ep_rew_mean' to see if the car is improving!")
print("   Higher values = better driving\n")

# ─────────────────────────────────────────────────────────────────────────────────
# THE MAIN TRAINING CALL
# ─────────────────────────────────────────────────────────────────────────────────
# This single line does ALL the training. The model.learn() method:
#   1. Collects experiences by running the agent in the environment
#   2. Updates the neural network based on those experiences
#   3. Repeats for the specified number of timesteps
#
# The callbacks run periodically during training to save checkpoints and evaluate.

model.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=callbacks,
    progress_bar=True  # Show a nice progress bar
)

print("\n" + "="*80)
print("TRAINING COMPLETE!")
print("="*80)


# How good is my model?

This evaluates the model you have trained over 10 episodes and reports the mean reward.

In [None]:
check_env = gymnasium.make("CarRacing-v3",  render_mode='rgb_array', continuous=use_continuous)
check_env = Monitor(check_env)
mean_reward, std_reward = evaluate_policy(model, check_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

# Save the final model

In [None]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_name = f"{timestamp}_{ALGORITHM}_CarRacingv3"
model_path = f"./models/{model_name}"
os.makedirs("./models", exist_ok=True)
model.save(model_path)

print(f"Model saved successfully!")
print(f"   Path: {model_path}.zip")
print(f"\nFor competition submission:")
print(f"   1. Download the file: {model_path}.zip")
print(f"   2. Note your settings:")
print(f"      - Algorithm: {ALGORITHM}")
if ALGORITHM == "PPO":
    print(f"      - Action Space: {PPO_ACTION_SPACE}")
else:
    print(f"      - Action Space: discrete")
# Print the submission note last so it does not split the settings list
print("Your submission must consist of the ZIP file containing the model and the Jupyter notebook you used")

# See your model drive round the track



In [None]:
# Start a virtual display (Colab has no physical screen)
display = Display(visible=0, size=(1400, 900))
display.start()

# Build a fresh environment for recording so this cell can be re-run safely
video_env = gymnasium.make("CarRacing-v3", render_mode="rgb_array", continuous=use_continuous)
# Record every episode played in this environment
video_env = RecordVideo(video_env, video_folder="./videos", episode_trigger=lambda episode_id: True)
# If using PPO with a continuous action space, wrap the environment.
# The transform below passes actions through unchanged; edit it to rescale or clamp them.
if ALGORITHM == "PPO" and PPO_ACTION_SPACE == "continuous":
    video_env = TransformAction(
        video_env,
        lambda action: np.array([
            action[0],           # Steering: -1 to 1
            action[1],           # Gas: 0 to 1
            action[2]            # Brake: 0 to 1
        ]),
        video_env.action_space
    )
# Use the competition seed so the recording is made on the seeded track
obs, info = video_env.reset(seed=track_seed)

# Run one episode until it ends
terminated = False
truncated = False
while not (terminated or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = video_env.step(action)

video_env.close()

# Display the most recently written video
latest_video = max(glob.glob("videos/*.mp4"), key=os.path.getmtime)
with open(latest_video, "rb") as video_file:
    encoded = base64.b64encode(video_file.read())
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))