ComputerVision-VideoUnderstanding

Overview

This project is a computer vision system that detects, tracks, and describes objects in video.

It combines:

YOLOv8 for object detection
DeepSORT for multi-object tracking
BLIP for image captioning
Motion analysis using object history

Track: Computer Vision

Key Idea

The system maintains object identity across frames and uses motion history to describe how objects move over time.
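One way to picture the motion history is a small per-track buffer of recent positions. The sketch below is illustrative only: `TrackHistory`, `max_len`, and the use of bounding-box centers are assumptions, not names or choices taken from tracker.py.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Point = Tuple[float, float]


class TrackHistory:
    """Keeps the most recent center points per track ID (hypothetical helper)."""

    def __init__(self, max_len: int = 30) -> None:
        # max_len is an assumed cap so memory stays bounded on long videos.
        self._max_len = max_len
        self._points: Dict[int, Deque[Point]] = defaultdict(
            lambda: deque(maxlen=self._max_len)
        )

    def update(self, track_id: int, bbox: Tuple[float, float, float, float]) -> None:
        # Assumes bbox is [x1, y1, x2, y2] in pixels, which is how the
        # sample JSON output in this README appears to be laid out.
        x1, y1, x2, y2 = bbox
        self._points[track_id].append(((x1 + x2) / 2.0, (y1 + y2) / 2.0))

    def get(self, track_id: int) -> List[Point]:
        # Oldest position first; unknown IDs return an empty list.
        return list(self._points.get(track_id, ()))
```

Keying the buffer by track ID is what makes the history meaningful: if the tracker reassigns an ID, the object starts a fresh history.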

Features

Detects common objects such as people, vehicles, and bicycles
Tracks objects across frames with consistent IDs
Generates natural language descriptions of scenes
Detects movement direction (left, right, up, down)

Pipeline

The system processes the input video frame by frame:

Video → YOLO Detection → DeepSORT Tracking → Motion Analysis → NLP → Output

This pipeline preserves object identity and motion across frames, enabling more meaningful scene understanding. The stages are:

Input (main.py)
- Reads video frames using OpenCV
- Controls the overall pipeline execution
Object Detection (detector.py)
- Uses YOLOv8 to perform object detection on each frame
- Outputs bounding boxes, class labels, and confidence scores
Object Tracking (tracker.py)
- Uses DeepSORT to assign consistent IDs to detected objects
- Maintains object identity across frames
- Stores position history for motion analysis
Motion Analysis (tracker.py + nlp.py)
- Uses stored object history to determine movement direction
- Example: moving left, right, up, or down
Natural Language Processing (nlp.py)
- Uses BLIP to generate a caption for selected frames
- Combines caption with detected objects and motion data
- Produces human-readable descriptions
Visualization (drawer.py)
- Draws bounding boxes and object IDs on frames
- Displays captions on the video
Output (utils.py + main.py)
- Saves annotated video
- Saves structured JSON data
- Saves text descriptions
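The motion analysis stage can be sketched as a comparison between the oldest and newest stored positions of a track. This is a minimal sketch under stated assumptions, not the project's actual code: `movement_direction` and the `min_shift` jitter threshold are hypothetical, and positions are assumed to be bounding-box centers in pixel coordinates.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def movement_direction(history: List[Point], min_shift: float = 5.0) -> Optional[str]:
    """Return 'left', 'right', 'up', 'down', or None if the object barely moved.

    history holds (x, y) centers in pixels, oldest first.
    min_shift is an assumed threshold (in pixels) that suppresses jitter.
    """
    if len(history) < 2:
        return None  # a single point carries no direction
    (x0, y0), (x1, y1) = history[0], history[-1]
    dx, dy = x1 - x0, y1 - y0
    if max(abs(dx), abs(dy)) < min_shift:
        return None
    # Report the dominant axis only, matching the four directions listed above.
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    # Image y grows downward, so a positive dy means the object moved down.
    return "down" if dy > 0 else "up"
```

For example, `movement_direction([(100, 300), (160, 305)])` returns `"right"`, because the horizontal shift dominates the vertical one.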

How to Run

python main.py --video input.mp4

Demo

Video 1

Video 2

The videos are used for educational purposes only.

Example Outputs

The system produces both structured data and natural language descriptions.

JSON Output (sample)

{
  "frame": 120,
  "timestamp": 4.0,
  "objects": [
    {"id": 1, "label": "car", "bbox": [100, 200, 300, 400]},
    {"id": 2, "label": "person", "bbox": [500, 250, 550, 400]}
  ]
}
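A record of this shape could be assembled roughly as follows. The function names and the rounding to one decimal place are assumptions made for illustration; the sample values (frame 120 at 4.0 s) are consistent with a 30 fps video, but the real frame rate would come from the input file.

```python
import json
from typing import Dict, Sequence


def frame_to_timestamp(frame_index: int, fps: float) -> float:
    """Convert a frame index to seconds (rounded to 0.1 s, an assumed precision)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(frame_index / fps, 1)


def build_frame_record(frame_index: int, fps: float, tracks: Sequence[Dict]) -> Dict:
    """Assemble one per-frame record in the shape of the sample above."""
    return {
        "frame": frame_index,
        "timestamp": frame_to_timestamp(frame_index, fps),
        "objects": [
            {"id": t["id"], "label": t["label"], "bbox": list(t["bbox"])}
            for t in tracks
        ],
    }


if __name__ == "__main__":
    tracks = [
        {"id": 1, "label": "car", "bbox": (100, 200, 300, 400)},
        {"id": 2, "label": "person", "bbox": (500, 250, 550, 400)},
    ]
    print(json.dumps(build_frame_record(120, 30.0, tracks), indent=2))
```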

Description Output (sample)

At 4.0s: person moving right. (2 cars, 1 person)
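A line like this can be built from a frame record plus a per-track direction. The sketch below is a guess at the formatting logic, not the code in nlp.py: `describe_frame`, `pluralize`, the irregular-plural table, and the "no clear motion" fallback are all hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping

# Assumed plural forms; any label not listed simply gets an "s" appended.
IRREGULAR_PLURALS = {"person": "people", "bus": "buses"}


def pluralize(label: str, count: int) -> str:
    if count == 1:
        return label
    return IRREGULAR_PLURALS.get(label, label + "s")


def describe_frame(record: Dict, directions: Mapping[int, str]) -> str:
    """Format one description line from a frame record.

    directions maps track ID -> 'left'/'right'/'up'/'down'; tracks without
    an entry are counted in the summary but not mentioned as moving.
    """
    counts = Counter(obj["label"] for obj in record["objects"])
    summary = ", ".join(
        f"{n} {pluralize(label, n)}" for label, n in counts.most_common()
    )
    moving = [
        f'{obj["label"]} moving {directions[obj["id"]]}'
        for obj in record["objects"]
        if directions.get(obj["id"])
    ]
    motion_text = "; ".join(moving) if moving else "no clear motion"
    return f'At {record["timestamp"]}s: {motion_text}. ({summary})'
```

Keeping the counts and the motion phrases separate means a stationary object still appears in the summary even when no direction is reported for it.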

All Options

Flag	Default	Description
`--video`	`input.mp4`	Path to input video file
`--weights`	`yolov8n.pt`	YOLO model weights
`--caption-every`	`30`	Generate caption every N frames
`--no-nlp`	off	Disable BLIP captioning (faster)
`--output-dir`	`output/`	Directory for output files
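For readers who want to see how these flags map onto code, here is a minimal `argparse` sketch that mirrors the table. The flag names, defaults, and help text come from the table; `build_parser`, `should_caption`, and the choice to caption frame 0 are assumptions and may differ from main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Detect, track, and describe objects in video."
    )
    parser.add_argument("--video", default="input.mp4",
                        help="Path to input video file")
    parser.add_argument("--weights", default="yolov8n.pt",
                        help="YOLO model weights")
    parser.add_argument("--caption-every", type=int, default=30,
                        help="Generate caption every N frames")
    parser.add_argument("--no-nlp", action="store_true",
                        help="Disable BLIP captioning (faster)")
    parser.add_argument("--output-dir", default="output/",
                        help="Directory for output files")
    return parser


def should_caption(frame_index: int, args: argparse.Namespace) -> bool:
    """True when this frame should be captioned (hypothetical helper)."""
    if args.no_nlp or args.caption_every <= 0:
        return False
    # Assumes captioning starts at frame 0 and repeats every N frames.
    return frame_index % args.caption_every == 0
```

With the defaults, frames 0, 30, 60, and so on would be captioned; passing `--no-nlp` skips captioning entirely.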

Files

├── main.py          # Pipeline orchestrator and entry point
├── detector.py      # YOLOv8 object detection
├── tracker.py       # DeepSORT multi-object tracking and motion history
├── nlp.py           # BLIP captioning and scene description generation
├── drawer.py        # Frame annotation and visualization
├── utils.py         # Logging, timestamp conversion, JSON utilities
├── requirements.txt # Python dependencies
└── demos/           # Sample outputs from two test videos

Requirements

Python 3.8+
CUDA-capable GPU strongly recommended (BLIP captioning is very slow on CPU)

Models are downloaded automatically on first run:
YOLOv8n (~6 MB, from Ultralytics)
BLIP image captioning base (~990 MB, from Hugging Face)

Limitations

Detection is limited to 8 object classes: person, car, bicycle, airplane, train, truck, motorcycle, bus
BLIP captioning sometimes misidentifies objects (e.g., labels a truck as a car)
DeepSORT can lose track of objects during fast movement or occlusion
No real-time processing — designed for offline video analysis
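The class restriction in the first limitation can be pictured as a simple label filter applied to detector output. This is only a sketch: `ALLOWED_LABELS` and `filter_detections` are hypothetical names, and the detection dictionaries are assumed to carry a `label` key as in the sample JSON.

```python
from typing import Dict, Iterable, List

# The eight classes listed under Limitations.
ALLOWED_LABELS = frozenset(
    {"person", "car", "bicycle", "airplane", "train", "truck", "motorcycle", "bus"}
)


def filter_detections(detections: Iterable[Dict]) -> List[Dict]:
    """Keep only detections whose label is one of the supported classes."""
    return [d for d in detections if d.get("label") in ALLOWED_LABELS]
```

Supporting another class would then mean extending this set, provided the detector weights can recognize it.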
