# Autonomous Driving Car (Trackmania)
- This project aims to develop an autonomous driving model that can navigate a racing game environment by predicting directional commands based on real-time screen captures. The approach combines data collection, supervised learning, and computer vision techniques to train a model that classifies actions (up, down, left, and right) and performs them based on the game's visual inputs.

**Key Project Components:**

- Data Collection: A script captures in-game frames with associated keypresses, creating a structured dataset by categorizing frames into folders for each action. This dataset forms the foundation for training the model.
- Model Development: Using the FastAI library, we explored different versions of ResNet (e.g., ResNet18, ResNet34, ResNet50) with pre-trained and non-pre-trained weights, dropout regularization, and advanced techniques like differential learning rates, mixed precision, and one-cycle learning.
- Real-Time Driving: The final script utilizes the trained model to predict actions in real time, capturing screen inputs and performing actions in response.
- Challenges: Throughout development, we faced several key challenges, particularly in improving turn predictions (left/right) despite high accuracy metrics. Extensive experimentation with various architectures and training techniques was required to optimize the model's performance.

# Capture and Log 
- This captures game frames in real-time and logs the key presses (up, down, left, right) with timestamps and simulated speed data. Each key-press frame is saved to a designated folder based on the action, creating a structured dataset for training.

**Challenges:**

- Ensuring data accuracy, as capturing precise, labeled game actions required balancing frame rate and keypress timing.
- Managing a large number of captured images and organizing them by keypress categories.
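
One way to approach the keypress timing problem is to track which keys are held at the moment a frame is captured, instead of remembering only the last key pressed. The sketch below is a minimal illustration, not the method used in this project: `HeldKeyTracker` and its rule that a held turn key wins over throttle or brake are assumptions made for the example. With `pynput`, `press` and `release` would be called from the listener's `on_press` and `on_release` callbacks.

```python
class HeldKeyTracker:
    """Tracks which arrow keys are currently held down (hypothetical helper)."""

    VALID_KEYS = ("up", "down", "left", "right")

    def __init__(self):
        self._held = []  # held keys, in the order they were pressed

    def press(self, name):
        # Ignore keys we do not label and key-repeat events for keys already held
        if name in self.VALID_KEYS and name not in self._held:
            self._held.append(name)

    def release(self, name):
        if name in self._held:
            self._held.remove(name)

    def current_label(self):
        """Return the label for the next frame, or None if no key is held."""
        # Assumption: a held turn key is a more useful label than up/down,
        # so the most recently pressed turn key wins.
        for name in reversed(self._held):
            if name in ("left", "right"):
                return name
        return self._held[-1] if self._held else None


tracker = HeldKeyTracker()
tracker.press("up")
tracker.press("left")
print(tracker.current_label())  # label while accelerating and steering left
tracker.release("left")
print(tracker.current_label())  # label once the turn key is released
```

Because the label is cleared when every key is released, frames captured while coasting are skipped rather than stored under a stale label.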

In [None]:
import cv2
import numpy as np
from PIL import ImageGrab
import os
import time
import csv
import random  # Simulating speed data, replace with actual speed retrieval method
from pynput import keyboard

In [None]:
# Paths to save data
BASE_FRAME_PATH = "./data/frames/"
LOG_PATH = "./data/dataset.csv"

# Ensure the frame directories exist for each key
for key in ['up', 'down', 'left', 'right']:
    os.makedirs(os.path.join(BASE_FRAME_PATH, key), exist_ok=True)

# Create CSV file for logging frames, key presses, and speed
if not os.path.exists(LOG_PATH):
    # newline='' prevents the csv module from writing blank rows on Windows
    with open(LOG_PATH, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["timestamp", "frame_path", "key", "speed"])  # Added speed to the header

# Global variable to store the latest key press
current_key = None

def on_press(key):
    """Callback to handle key press events."""
    global current_key
    try:
        current_key = key.char  # Alphanumeric keys
    except AttributeError:
        # Special keys
        if key == keyboard.Key.left:
            current_key = 'left'
        elif key == keyboard.Key.right:
            current_key = 'right'
        elif key == keyboard.Key.up:
            current_key = 'up'
        elif key == keyboard.Key.down:
            current_key = 'down'
        else:
            current_key = None  # Ignore other keys

    if current_key:
        print(f"Key Pressed: {current_key}")

def capture_and_log():
    """Capture game frames, log the key presses, and capture speed."""
    bbox = (0, 40, 1264, 720)  # Adjust to match your game window size

    while True:
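        # Capture the screen (ImageGrab returns pixels in RGB order)
        screen = np.array(ImageGrab.grab(bbox=bbox))
        # cv2.imwrite expects BGR order, so swap the channels before saving
        frame = cv2.cvtColor(screen, cv2.COLOR_RGB2BGR)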

        # Crop to focus only on the road section (tweak these coordinates based on the actual region of the road)
        cropped_frame = frame[300:450, 150:1150]  # Example y,x coordinates

        # Simulate speed data
        speed = random.uniform(0, 200)  # Replace with actual speed retrieval method

        # Save frame with unique timestamp
        timestamp = str(int(time.time() * 1000))  # Milliseconds timestamp
        
        # Check if the current key is one of the valid keys
        if current_key in ['up', 'down', 'left', 'right']:
            # Save the frame to the corresponding directory
            frame_dir = os.path.join(BASE_FRAME_PATH, current_key)
            frame_path = os.path.join(frame_dir, f"frame_{timestamp}.jpg")
            cv2.imwrite(frame_path, cropped_frame)

            # Log the frame, key press, and speed
            with open(LOG_PATH, mode='a', newline='') as file:
                writer = csv.writer(file)
                writer.writerow([timestamp, frame_path, current_key, speed])  # Log key press and speed

        # Control frame rate
        time.sleep(0.1)  # At most 10 frames per second; capture and save time add to each loop

if __name__ == "__main__":
    # Start listening to keyboard events
    listener = keyboard.Listener(on_press=on_press)
    listener.start()

    # Start capturing frames and logging data
    capture_and_log()
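
After a capture session it can help to check how many frames landed in each action folder. The sketch below is a minimal example under the folder layout used above; `count_frames` and `class_shares` are hypothetical helper names, not part of the original scripts.

```python
from collections import Counter
from pathlib import Path


def count_frames(base_dir, labels=("up", "down", "left", "right")):
    """Count the .jpg frames stored in each label folder under base_dir."""
    counts = Counter()
    for label in labels:
        folder = Path(base_dir) / label
        counts[label] = sum(1 for _ in folder.glob("*.jpg")) if folder.is_dir() else 0
    return counts


def class_shares(counts):
    """Convert raw counts into each label's fraction of the whole dataset."""
    total = sum(counts.values())
    return {label: (n / total if total else 0.0) for label, n in counts.items()}


if __name__ == "__main__":
    counts = count_frames("./data/frames/")
    for label, share in class_shares(counts).items():
        print(f"{label:>5}: {counts[label]:6d} frames ({share:.1%})")
```

A heavily skewed split, for example far more `up` frames than `left` or `right` frames, is worth knowing about before training.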

# Model Training
- This defines, trains, and evaluates a ResNet-based model for action classification using FastAI. We utilize data transformations for variation, apply mixed precision for efficiency, and use a one-cycle learning rate policy to fine-tune the model.

**Challenges:**

- Achieving correct predictions for turns (left/right): overall accuracy was high, yet the model still mispredicted corners.
- Iterative experiments with pre-trained vs. non-pre-trained models, dropout, resampling, and hyperparameter tuning were necessary to improve performance but did not fully resolve turn accuracy issues.
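
High overall accuracy and poor turn prediction can coexist when one class dominates the dataset. The sketch below works through the arithmetic with per-class recall. The confusion matrix counts are made up for illustration and are not results from this project. With FastAI, a real matrix could be obtained from `ClassificationInterpretation.confusion_matrix()`.

```python
def overall_accuracy(confusion):
    """Fraction of all frames that were classified correctly."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


def per_class_recall(confusion, labels):
    """confusion[i][j] = frames with true label i that were predicted as label j."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        recalls[label] = confusion[i][i] / row_total if row_total else 0.0
    return recalls


# Hypothetical counts: 'up' dominates, and most turn frames are predicted as 'up'.
labels = ["down", "left", "right", "up"]
confusion = [
    [10, 0, 0, 0],    # true: down
    [0, 20, 0, 30],   # true: left
    [0, 0, 20, 30],   # true: right
    [0, 5, 5, 880],   # true: up
]

print(f"accuracy: {overall_accuracy(confusion):.2f}")
for label, recall in per_class_recall(confusion, labels).items():
    print(f"recall[{label}]: {recall:.2f}")
```

In this made-up case accuracy is 0.93 while recall for `left` and `right` is only 0.40, which is why per-class metrics are more informative than accuracy for the turn problem.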

In [1]:
from fastai.vision.all import *
import os
from torch.nn import CrossEntropyLoss
from fastai.metrics import accuracy, Precision, Recall, F1Score
import warnings
import matplotlib.pyplot as plt

In [None]:
# Suppress all warnings
warnings.filterwarnings("ignore")
# Paths
BASE_FRAME_PATH = "./data/frames/"
LOG_PATH = "./data/dataset.csv"

# Define the function to extract data directly from folders
def get_image_data():
    """Loads image data from folders for each label."""
    # Create a list of all image paths and their corresponding labels based on folder names
    image_files = get_image_files(BASE_FRAME_PATH)
    data = pd.DataFrame({
        'frame_path': [str(f) for f in image_files],
        'key': [f.parent.name for f in image_files]  # Extract the folder name as the label
    })
    return data

# Load the data
data = get_image_data()

# Define `get_x` (image path) and `get_y` (classification labels: keyboard presses)
def get_x(row):
    return row['frame_path']

def get_y(row):
    return row['key'].strip()

# Create a DataBlock for classification task with data loaded from folders
block = DataBlock(
    blocks=(ImageBlock, CategoryBlock),  
    get_x=get_x,
    get_y=get_y,
    splitter=RandomSplitter(valid_pct=0.2),  
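    # RandomResizedCrop is an item transform (it works on individual images before batching)
    item_tfms=RandomResizedCrop(224, min_scale=0.8),  # Randomly crop and resize each image to 224x224
    # Batch augmentations; horizontal flips stay off because they would invalidate left/right labels
    batch_tfms=aug_transforms(do_flip=False, flip_vert=False)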
)

# Create the DataLoaders from the frame table (no class balancing happens in this cell)
dls = block.dataloaders(data, bs=16, num_workers=0)

# Define and train the model
learn = vision_learner(dls, resnet50, pretrained=True, metrics=[accuracy, Precision(average='weighted'), Recall(average='weighted'), F1Score(average='weighted')], ps=0.2)  # `ps` adds dropout

# Apply mixed precision before training so that it actually speeds up the fit (optional)
learn = learn.to_fp16()

# Plot the learning rate finder curve to sanity-check the learning rates used below
learn.lr_find()

# Unfreeze model for training the entire architecture
learn.unfreeze()

# Fine-tune the model with one-cycle learning and differential learning rates
learn.fit_one_cycle(50, lr_max=slice(1e-4, 1e-2))

# Visualize results after training
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(8, 8), dpi=100)
interp.plot_top_losses(9, figsize=(15, 10))
plt.show()

# Save the model (make sure the target folder exists before exporting)
os.makedirs('./models', exist_ok=True)
learn.export('./models/model_classification_fastai_v2.pkl')
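
Resampling, one of the experiments listed in the challenges for this section, can be sketched as plain random oversampling of the rarer classes. This is a minimal illustration and not necessarily the resampling that was tried here: `oversample` is a hypothetical helper, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=0):
    """Duplicate rows of rarer labels until every label matches the largest class.

    rows is a list of (frame_path, label) pairs; a new shuffled list is returned.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


rows = [(f"up_{i}.jpg", "up") for i in range(6)] + [("left_0.jpg", "left"), ("left_1.jpg", "left")]
balanced = oversample(rows)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied to the training split only. Duplicating rows before a random train/validation split would put copies of the same frame on both sides and inflate the validation metrics.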

# Capture and Drive
- This allows real-time model inference to control the car based on screen captures, predicting and executing actions in the game. It uses the trained model to make predictions and sends keypress signals to guide the car.

**Challenges:**

- Synchronizing frame capture with game speed to ensure predictions are made quickly enough for responsive driving.
- Consistent mispredictions on turns, often leading the car to drive straight instead of adjusting for corners, revealed limitations in training data and model adjustments.
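
One way to reason about the synchronization problem is to measure how long each capture-and-predict step takes and to sleep only for the remainder of a fixed period. The sketch below is a minimal illustration with hypothetical helper names (`remaining_sleep`, `run_at_fixed_rate`); the `step` argument stands in for the capture, prediction, and keypress work.

```python
import time


def remaining_sleep(loop_start, now, target_period):
    """Seconds left in the current period once the work is done (never negative)."""
    return max(0.0, target_period - (now - loop_start))


def run_at_fixed_rate(step, target_period, iterations):
    """Call step() repeatedly, aiming for one call every target_period seconds.

    Returns the measured duration of each step so slow iterations can be spotted.
    """
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        step()
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        # If the step overran the period, this sleeps for 0 seconds
        time.sleep(remaining_sleep(start, start + elapsed, target_period))
    return durations


if __name__ == "__main__":
    # Stand-in workload; in the driving script this would be capture + predict + keypress
    durations = run_at_fixed_rate(lambda: time.sleep(0.01), target_period=0.05, iterations=5)
    print(f"slowest step: {max(durations) * 1000:.1f} ms")
```

If the slowest step regularly exceeds the target period, the loop cannot keep up and the predictions lag behind the game no matter how short the sleep is.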

In [None]:
from fastai.vision.all import *
import time  # Used for the key hold and frame rate delays
import torch
from pynput.keyboard import Controller, Key
from PIL import ImageGrab
import numpy as np
import cv2

In [None]:
# Redefine custom functions used during training
def get_x(row):
    return row['frame_path']

def get_y(row):
    return row['key'].strip()

# Load the trained model
model_path = './models/model_classification_fastai_v2.pkl'
learn = load_learner(model_path)
# print(learn.dls.vocab)

# Initialize keyboard controller
keyboard = Controller()

# Map each predicted label to the key it should press
ACTION_KEYS = {'left': Key.left, 'right': Key.right, 'up': Key.up, 'down': Key.down}

# Function to capture frames and control the car
def capture_and_drive():
    bbox = (0, 40, 1264, 720)  # Same capture region as the data collection script

    while True:
        # Capture the game screen; ImageGrab already returns RGB, so no channel swap is needed
        screen = np.array(ImageGrab.grab(bbox=bbox).convert('RGB'))

        # Crop to the same road region that the training frames were cropped to
        frame = PILImage.create(screen[300:450, 150:1150])

        # learn.predict applies the same resizing and normalization used for validation
        # and runs the model in eval mode without tracking gradients
        action, _, _ = learn.predict(frame)
        print(f"Predicted action: {action}")

        # Perform the predicted action with a short key tap
        key = ACTION_KEYS.get(str(action))
        if key is not None:
            keyboard.press(key)
            time.sleep(0.1)
            keyboard.release(key)

        # Control frame rate
        time.sleep(0.2)

if __name__ == "__main__":
    capture_and_drive()