# **Demystifying Computer Vision**

Computer vision is a subfield of artificial intelligence (AI) that equips machines with the ability to process, analyze, and interpret visual inputs such as images and videos. It uses machine learning to help computers and other systems derive meaningful information from visual data ([IBM, 2024](https://www.ibm.com/think/topics/computer-vision)).

At its core, computer vision asks a deceptively hard question: 

> *Given raw pixels, how can a computer understand what is happening in the world?*

An image is a grid of pixels, and each pixel is just a few numbers, typically the intensities of its red, green, and blue channels. Computer vision is about transforming these numbers into meaning.
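
To make "pixels are just numbers" concrete, here is a minimal sketch that builds a tiny RGB image by hand as a NumPy array. The 2×2 size, the pixel values, and the plain channel average used for brightness are illustrative assumptions, not something this repository prescribes.

```python
import numpy as np

# A tiny 2x2 RGB image: shape is (height, width, channels), values are 0-255.
image = np.array(
    [
        [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
        [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
    ],
    dtype=np.uint8,
)

print(image.shape)  # (2, 2, 3)
print(image[0, 1])  # the three channel values of the pixel at row 0, column 1

# One simple step from raw numbers toward structure: collapse color into
# brightness. This is a plain channel average; practical pipelines usually
# use a weighted formula instead.
gray = image.mean(axis=2)
print(gray.shape)   # (2, 2)
```

Everything later in this material, from filters to neural networks, operates on arrays shaped like this one, only much larger.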

This transformation sits at the intersection of:

- **Linear algebra & geometry** — how images form, how 3D projects to 2D
- **Probability & optimization** — noise, uncertainty, estimation
- **Machine learning & deep learning** — learning visual representations
- **Signal processing** — filters, edges, frequency content

Modern computer vision is largely data-driven, but the field remains grounded in classical ideas about geometry, optics, and perception.

## Getting Started Path

Before diving into CV topics, complete these foundational modules:

### [02 — Environment Setup](../02_environment_setup/)
Set up Python, PyTorch, OpenCV, and your development environment.

### [03 — Python & NumPy Basics](../03_python_numpy_basics/)
Review essential Python and array manipulation skills.

### [04 — OpenCV Fundamentals](../04_opencv_fundamentals/)
Learn image loading, manipulation, and basic processing.

### [05 — Git Fundamentals](../05_git_fundamentals/)
Learn Git workflow and GitHub collaboration for research.

---

## Prerequisites

To get the most out of this material, you should have:

**Programming**
- Proficiency in Python (functions, classes, data structures)
- Familiarity with NumPy for array manipulation

**Mathematics**
- Linear algebra basics (vectors, matrices, transformations)
- Calculus fundamentals (derivatives, gradients)
- Basic probability and statistics

**Machine Learning**
- Understanding of supervised learning concepts
- Experience training simple models (optional but helpful)

> Don't worry if you're not an expert in all of these areas. The Getting Started modules will help you build these skills, and we'll review key concepts as they arise.

## What Makes Computer Vision So Difficult?

Humans solve difficult vision problems with almost no conscious effort.

> Machines are not so lucky.

A cat rotated 30 degrees, partially occluded, under poor lighting, is still obviously a cat to humans. For a machine, that's a distribution shift, a geometry problem, and a generalization test rolled into one.

**Core challenges include:**

- **High dimensionality** — millions of pixels per image
- **Ambiguity** — 2D images collapse 3D reality
- **Variability** — lighting, viewpoint, scale, occlusion
- **Bias** — a model only learns from what its training data contains, so what the dataset leaves out matters
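
The ambiguity point can be shown with a few lines of arithmetic. The sketch below assumes an idealized pinhole camera with points given in camera coordinates; the helper name `project` and the focal length of 1.0 are hypothetical choices made for illustration.

```python
import numpy as np

def project(point_3d, focal_length=1.0):
    """Pinhole projection of (X, Y, Z) to image coordinates (f*X/Z, f*Y/Z)."""
    x, y, z = point_3d
    return np.array([focal_length * x / z, focal_length * y / z])

near_point = np.array([1.0, 0.5, 2.0])  # a point 2 units in front of the camera
far_point = near_point * 5.0            # 5x farther away, with 5x the offset

p_near = project(near_point)
p_far = project(far_point)

# Dividing by depth cancels the common scale factor, so both 3D points
# land on the same 2D image location.
print(p_near, p_far)
print(np.allclose(p_near, p_far))
```

Because every point along the same ray through the camera center maps to one pixel, a single image cannot tell a small nearby object from a large distant one without extra cues.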

Vision is hard because the world is messy. 

> That's also why it's valuable.

## Why Should You Care About Computer Vision?

Computer vision is how data science escapes the spreadsheet and touches the physical world.

A huge amount of real-world data is visual: images, videos, medical scans, satellite imagery, sensor feeds, and more.

> *If you can't work with pixels, you're blind to some of the most valuable signals available.*

---

### Practical Applications

- Powers autonomous systems, robotics, and embodied AI 
- Underlies modern healthcare imaging and diagnostics
- Drives surveillance, mapping, and geospatial intelligence
- Enables generative models that create and edit visual content
- Sits at the core of many high-impact ML research problems

---

### Technical Value

Computer vision forces you to confront problems that don't show up in tabular data:

- **High-dimensional inputs** — images have millions of features
- **Spatial structure and geometry** — position and arrangement matter
- **Representation learning at scale** — learning what to look for
- **Robustness under real-world noise** — handling the unexpected

> If you can reason about vision computationally, you tend to reason better about machine learning in general. Vision models stress-test your understanding of optimization, inductive bias, data leakage, and failure modes.

**In short:** Computer vision is where theory meets reality. If you want to work on systems that interact with the world instead of just describing it, this is the place to start.

## Applications and Research Directions

Computer vision is not a single problem or technique. It's a collection of closely related problem areas, each with its own methods, datasets, and research questions. This repository covers a range of core vision topics:

- Classification & Localization
- Object Detection
- Segmentation
- Pose Estimation
- Generative Models
- Visual SLAM
- And more...

> Each of these areas is an active research space with open problems that matter in both academia and industry.

We encourage research projects across all of these directions. Some gravitate toward theory and representation learning, others toward geometry, systems, or applied work.

> **All paths are valid here.**

As you work through the material, read broadly and pay attention to what holds your interest. The goal is not to specialize immediately, but to develop enough exposure to recognize which problems you want to spend time thinking about.

## How to Use This Repository

This repository is designed to onboard new MSDS students into computer vision research quickly.

**Recommended path:**

1. **Complete Getting Started** (1-2 days)
   - [Environment Setup](../02_environment_setup/) — Get your tools ready
   - [Python & NumPy](../03_python_numpy_basics/) — Review array skills
   - [OpenCV Fundamentals](../04_opencv_fundamentals/) — Image basics
   - [Git Fundamentals](../05_git_fundamentals/) — Collaboration skills

2. **Work through foundational topics** — Start with Classification, then follow the prerequisites listed in each topic's README
3. **Go deep where curious** — Spend extra time on topics that resonate with you
4. **Run the code** — Every notebook contains executable examples; experiment with them
5. **Read the papers** — We link to foundational literature throughout
6. **Start your research project** — Apply what you've learned!

**Repository structure:**

```
topics/
├── 01_getting_started/
│   ├── 01_overview/        ← You are here
│   ├── 02_environment_setup/
│   ├── 03_python_numpy_basics/
│   ├── 04_opencv_fundamentals/
│   └── 05_git_fundamentals/
├── 02_classification/
├── 03_localization/
├── ...
├── 15_embeddings_similarity/
├── 16_vision_transformers/
└── 17_self_supervised_learning/
```

Each topic folder contains:
- `README.md` — Overview, prerequisites, and suggested path
- `tutorial.ipynb` — Hands-on exercises
- `project_template.ipynb` — Template for your own project

## Areas of Study

This repository is organized into the following topics. Each section builds on foundational concepts while introducing specialized techniques.

---

### [02 — Image Classification](../../02_classification/)
Assign a single label to an entire image.  
*The conceptual starting point for most of modern computer vision.*

---

### [03 — Localization](../../03_localization/)
Predict *where* an object is by drawing a bounding box.  
*Adds spatial reasoning on top of classification.*

---

### [04 — Object Detection](../../04_object_detection/)
Detect and localize multiple objects in a single image.  
*The backbone of many real-world vision systems.*

---

### [05 — Segmentation](../../05_segmentation/)
Assign a label to every pixel in the image.  
*Provides precise object boundaries instead of coarse boxes.*
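
As a rough illustration of "precise boundaries instead of coarse boxes", the sketch below builds a hypothetical 6×6 label map (0 for background, 1 for the object) and compares the object's pixel count with the area of the tightest box around it. The mask shape and contents are made up for the example.

```python
import numpy as np

# Hypothetical per-pixel labels: 0 = background, 1 = object.
# The object is a thin diagonal stroke, the worst case for a box.
mask = np.zeros((6, 6), dtype=np.uint8)
for i in range(1, 5):
    mask[i, i] = 1

# Tightest axis-aligned bounding box around the object pixels.
rows, cols = np.nonzero(mask)
top, bottom = rows.min(), rows.max()
left, right = cols.min(), cols.max()

box_area = int((bottom - top + 1) * (right - left + 1))
object_area = int(mask.sum())

print(box_area, object_area)
```

Here the box covers four times as many pixels as the object itself, which is the kind of slack a segmentation mask removes.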

---

### [06 — Pose Estimation](../../06_pose_estimation/)
Infer body or object configuration using keypoints.  
*Critical for motion analysis and human–computer interaction.*

---

### [07 — Action Recognition](../../07_action_recognition/)
Recognize what is happening over time in video.  
*Introduces temporal reasoning into vision models.*

---

### [08 — Generative Models](../../08_generative_models/)
Learn the data distribution well enough to generate new images.  
*Forces models to understand structure, not just labels.*

---

### [09 — Visual SLAM](../../09_visual_slam/)
Simultaneously localize a camera and build a map of the world.  
*A geometry-heavy area central to robotics and AR.*

---

### [10 — Scene Understanding](../../10_scene_understanding/)
Reason about objects, layout, and relationships in a scene.  
*Moves vision from perception toward meaning.*

---

### [11 — Facial Recognition](../../11_facial_recognition/)
Identify or verify individuals using facial features.  
*Technically powerful and ethically sensitive.*

---

### [12 — Geolocation](../../12_geolocation/)
Predict where an image was taken using visual cues alone.  
*Combines vision, retrieval, and large-scale data.*

---

### [13 — Anomaly Detection](../../13_anomaly_detection/)
Detect deviations from normal visual patterns.  
*Often trained without ever seeing anomalies.*

---

### [14 — Vision-Based Reinforcement Learning](../../14_vision_rl/)
Learn policies from visual input by interacting with an environment.  
*Closes the loop between perception and action.*

---

### [15 — Embeddings & Similarity Search](../../15_embeddings_similarity/)
Map images to vector representations for retrieval and matching.  
*Essential for search, deduplication, and metric learning.*
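
A minimal sketch of the retrieval idea, assuming cosine similarity as the matching score (one common choice among several). The three-dimensional vectors are invented for illustration, and `cosine_similarity` is a helper defined here, not a library function.

```python
import numpy as np

def cosine_similarity(query, gallery):
    """Cosine similarity between one query vector and each row of gallery."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

# Made-up embeddings; real models typically emit hundreds of dimensions.
gallery = np.array([
    [0.9, 0.1, 0.0],  # image A
    [0.0, 1.0, 0.2],  # image B
    [0.8, 0.2, 0.1],  # image C
])
query = np.array([1.0, 0.0, 0.0])

scores = cosine_similarity(query, gallery)
ranking = np.argsort(-scores)  # gallery indices from most to least similar

print(scores)
print(ranking)
```

Images A and C point in nearly the same direction as the query and rank ahead of B, which is orthogonal to it. Search and deduplication build on this same comparison, applied to much larger collections.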

---

### [16 — Vision Transformers](../../16_vision_transformers/)
Apply transformer architectures to visual data.  
*The architecture behind modern models such as ViT, CLIP, and SAM.*

---

### [17 — Self-Supervised Learning](../../17_self_supervised_learning/)
Learn visual representations without manual labels.  
*Enables training on massive unlabeled datasets.*

---

## References

1. IBM. (2024). *What is Computer Vision?* [ibm.com/think/topics/computer-vision](https://www.ibm.com/think/topics/computer-vision)

2. OpenAI. (2024). *ChatGPT.* [chatgpt.com](https://chatgpt.com)

3. Anthropic. (2025). *Claude Code.* [claude.ai/claude-code](https://claude.ai/claude-code)