This project is a from-scratch implementation of the YOLOv1 (You Only Look Once) object detection paper using PyTorch. I implemented the entire pipeline (architecture, loss function, dataset parsing, and model training) to understand how YOLO works at its core.
- 🧩 ResNet-18 Backbone (for feature extraction)
- ⚙️ YOLOv1 Loss Function (custom-built using MSELoss)
- 📦 Custom Dataset Loader for PASCAL VOC 2007 + 2012
- 🕒 15+ Hours of Training with Checkpointing & Logging
- 🧮 Complete Web App using Flask for image detection demo
- Optimizer: Adam
- Learning Rate Scheduler: StepLR
- Loss Function: MSE with λ_coord = 5, λ_noobj = 0.5
- Dataset: PASCAL VOC 2007 + 2012 (XML annotations parsed)
- Framework: PyTorch
- Web Framework: Flask
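As a reference for how these pieces wire together in PyTorch, here is a minimal sketch of the training setup. The tiny placeholder network and the hyperparameter values (`lr`, `step_size`, `gamma`) are illustrative assumptions, not the exact ones used in this repo:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholder network standing in for the real ResNet-18 + YOLO head.
model = nn.Conv2d(3, 30, kernel_size=1)
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # decay the lr every 10 epochs

for epoch in range(2):  # tiny loop with random data, just to show the wiring
    images = torch.randn(4, 3, 64, 64)
    targets = torch.randn(4, 30, 64, 64)
    # Plain MSE here; the custom YOLOv1 loss (sketched further below)
    # replaces this in the real training loop.
    loss = nn.functional.mse_loss(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```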
The YOLOv1 loss function combines:
- Localization loss (for bounding box coordinates)
- Confidence loss (for objectness scores)
- Classification loss (for class probabilities)
I replicated the official YOLOv1 loss equation and implemented it using torch.nn.MSELoss with custom weighting factors:
λ_coord = 5.0
λ_noobj = 0.5
This weighting ensures that bounding box coordinate errors are penalized more heavily, while grid cells without objects contribute less to the loss.
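As a sketch of how this looks in code, here is a simplified version of the loss: it predicts one box per grid cell instead of the paper's two (so the IoU-based "responsible box" selection is omitted), and the tensor layout is an assumption about this repo's target encoding, not its actual definition:

```python
import torch
import torch.nn as nn

class YoloV1Loss(nn.Module):
    """Simplified YOLOv1 loss sketch (one box per cell, S x S grid, C classes).

    Assumed tensor layout: (N, S, S, C + 5), with the last dimension being
    [class probs..., x, y, w, h, confidence].
    """

    def __init__(self, S=7, C=20, lambda_coord=5.0, lambda_noobj=0.5):
        super().__init__()
        self.S, self.C = S, C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.mse = nn.MSELoss(reduction="sum")

    def forward(self, pred, target):
        C = self.C
        obj = target[..., C + 4:C + 5]   # 1 where the cell contains an object
        noobj = 1.0 - obj

        # Localization loss: x, y directly; sqrt of w, h as in the paper.
        # sign/abs/epsilon guard against negative or zero raw predictions.
        xy_loss = self.mse(obj * pred[..., C:C + 2], obj * target[..., C:C + 2])
        wh_loss = self.mse(
            obj * torch.sign(pred[..., C + 2:C + 4])
                * (pred[..., C + 2:C + 4].abs() + 1e-6).sqrt(),
            obj * target[..., C + 2:C + 4].sqrt(),
        )

        # Confidence loss, split into object / no-object cells.
        conf_pred = pred[..., C + 4:C + 5]
        conf_tgt = target[..., C + 4:C + 5]
        obj_conf_loss = self.mse(obj * conf_pred, obj * conf_tgt)
        noobj_conf_loss = self.mse(noobj * conf_pred, noobj * conf_tgt)

        # Classification loss, only on cells that contain an object.
        class_loss = self.mse(obj * pred[..., :C], obj * target[..., :C])

        return (self.lambda_coord * (xy_loss + wh_loss)
                + obj_conf_loss
                + self.lambda_noobj * noobj_conf_loss
                + class_loss)
```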
The YOLOv1 head is built on top of a ResNet-18 backbone pre-trained on ImageNet. It outputs a grid structure that predicts bounding boxes and class probabilities in a single forward pass, enabling real-time object detection without region proposals.
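The sketch below shows one way to attach a YOLO-style grid head to a pre-trained ResNet-18. S, B, and C follow the paper's defaults (7×7 grid, 2 boxes, 20 VOC classes); the exact head layers here are an assumption for illustration, not this repo's definition:

```python
import torch.nn as nn
from torchvision.models import resnet18

class YoloV1ResNet(nn.Module):
    """Sketch: YOLOv1-style grid head on an ImageNet-pretrained ResNet-18."""

    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Drop the average pool and FC layers; keep the conv feature maps.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d((S, S)),               # force an S x S spatial grid
            nn.Conv2d(512, B * 5 + C, kernel_size=1),   # per-cell box + class predictions
        )

    def forward(self, x):
        x = self.head(self.features(x))   # (N, B*5+C, S, S)
        return x.permute(0, 2, 3, 1)      # (N, S, S, B*5+C)
```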
🧪 Web App Demo: click here for the live app
The Flask web app allows users to upload images and view detection results instantly. Due to Hugging Face Spaces limitations, the live webcam detection feature is disabled — but you can find a recorded demo video of live detections on my LinkedIn.
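The upload-and-detect flow boils down to a small Flask route along these lines. This is a sketch only: `run_detection` is a hypothetical stand-in for the actual model inference, and the real app's routes and templates may differ:

```python
from flask import Flask, request, render_template_string
from PIL import Image

app = Flask(__name__)

def run_detection(image):
    # Placeholder for the actual model inference; returns (label, score) pairs.
    return [("person", 0.92)]

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        image = Image.open(request.files["image"].stream).convert("RGB")
        results = run_detection(image)
        return render_template_string(
            "{% for label, score in results %}{{ label }}: {{ score }}<br>{% endfor %}",
            results=results,
        )
    # Minimal upload form for the GET request.
    return ('<form method="post" enctype="multipart/form-data">'
            '<input type="file" name="image"><input type="submit"></form>')

if __name__ == "__main__":
    app.run()
```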
👨 Person 🚗 Car 🐱 Cat 🐶 Dog 🚌 Bus 🚲 Bicycle
This project became a cornerstone of my AI research journey. I now have a deep understanding of YOLO's architecture, loss design, and real-time detection principles, and it marks my growth from reading research papers to building real implementations.