<a href="https://colab.research.google.com/github/andreeo/computer-vision/blob/main/problems_in_computer_vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is the difference between **Image Processing** and **Computer Vision**? 🚀


In image processing, the input is an image and the output is also typically an image. Sometimes the output is an improved version of the input image, and sometimes it is just a processed version of it.

> Ex: noise reduction is an image processing operation (the input is an image and the output is an improved version of that image).

> Ex: edge detection is an image processing operation (the input is an image and the output is a processed version of that image).
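
The noise-reduction example can be sketched in a few lines. This is a minimal illustration rather than a production denoiser: it assumes a grayscale image stored as a list of rows, and the name `median_filter_3x3` is made up for this example. A median filter is only one of many possible denoising techniques.

```python
from statistics import median

def median_filter_3x3(image):
    """Return a denoised copy of a 2D grayscale image (a list of rows)."""
    height, width = len(image), len(image[0])
    output = [row[:] for row in image]  # border pixels are copied unchanged
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Collect the 3x3 neighbourhood around (x, y) and keep its median.
            window = [image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            output[y][x] = median(window)
    return output

# A flat image with a single noisy pixel (value 100).
noisy = [
    [10, 10, 10, 10],
    [10, 100, 10, 10],
    [10, 10, 10, 10],
]
denoised = median_filter_3x3(noisy)
print(denoised[1][1])  # 10: the outlier has been replaced by the local median
```

Both the input and the output are images of the same size, which is what makes this an image processing operation.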

Another big area of image processing is image and video compression.

In **Computer Vision**, on the other hand, the input is an image and the output is usually some information.

> Ex: face recognition is computer vision (the input is an image and the output is the identity of the person).

> Ex: object detection is computer vision (the input is an image and the output is the locations and labels of the objects in the image).

---
# Problems in computer vision


There are many general techniques that are used across many applications.

We typically solve one or more of these computer vision problems and package them together in one neat application.

Let's look at these problem domains.

*The first problem domain is image processing*. In image processing the input is an image and the output is typically a filtered version of that image. It consists of sub-problems like **image denoising**, **enhancement** and **restoration**.

Image processing also deals with image and video compression.

Other techniques involve **image binarization** and **binary image** processing. Sometimes we just filter the image based on the needs of our application, and operations like **edge detection** and **corner detection** can also be thought of as part of image processing.
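
Edge detection and binarization can be combined in a small sketch. The code below assumes a grayscale image stored as a list of rows; the helper names `sobel_magnitude` and `binarize` and the threshold value are choices made for this illustration only. It uses the standard Sobel kernels to estimate the gradient and then thresholds the result into a binary edge image.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Gradient magnitude at every interior pixel; border pixels stay 0."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = sum(SOBEL_X[j][i] * image[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            output[y][x] = (gx * gx + gy * gy) ** 0.5
    return output

def binarize(image, threshold):
    """Map every pixel to 255 if it reaches the threshold, otherwise to 0."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]

# Dark region on the left, bright region on the right: one vertical edge.
image = [[0, 0, 0, 100, 100] for _ in range(4)]
edges = binarize(sobel_magnitude(image), threshold=200)
print(edges[1])  # [0, 0, 255, 255, 0]
```

The output is again an image, here a binary one in which only the pixels next to the intensity step are switched on.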

*The second problem domain is 3D reconstruction using 2D images*. Extracting 3D information from 2D images is a huge part of computer vision.

There are several algorithms that are appropriate for different domains. The most common one is stereo vision. In this class of techniques, two images of the scene are taken from two different viewpoints using calibrated cameras. The depth at each pixel is estimated by finding which point in one image corresponds to which point in the other image, followed by triangulation.



> STEREO: Use two different images of the scene from slightly different viewpoints to extract 3D information.
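
For the common special case of a rectified stereo pair, triangulation reduces to a one-line formula: depth = focal length × baseline / disparity. Rectification is an assumption added here, not something stated above: it means both cameras share the same focal length and a matching point lies on the same image row in both views. The function name and the numbers below are illustrative.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth (in metres) of a point seen by a rectified stereo pair.

    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two camera centres, in metres
    disparity_px:    horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px

# A point that shifts by 50 pixels between the left and right image.
depth = depth_from_disparity(focal_length_px=800.0, baseline_m=0.125, disparity_px=50.0)
print(depth)  # 2.0
```

Nearby points produce a large disparity and distant points a small one, which is why depth accuracy degrades with distance.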


These algorithms, where we use 2D images for 3D reconstruction, are often referred to as **structure from motion** because we are generating the 3D structure by moving the camera. Even though we may not be physically moving a single camera, it is the motion between the two views of the scene that provides 3D information. Structure from motion is also closely related to another problem called **visual SLAM**.

SLAM stands for **Simultaneous Localization and Mapping**.

> Ex: a robotics application where a camera mounted on a robot is used to build a 3D structure of the scene around it. This is the same technology that is used by ARKit and ARCore to reconstruct the scene so you can virtually place objects in it.

> In many applications it is only necessary to recover the location and orientation of the camera from images. Virtual walkthroughs and Google Street View are excellent examples of such applications. In these applications full 3D reconstruction is not performed. Instead, the camera locations are obtained directly from the images and then some kind of image stitching is used to create virtual walkthroughs.

Until now we have seen how to reconstruct a scene using motion information. This raises the question: is it possible to obtain 3D information about a scene without moving the camera? The answer is yes.

In fact, humans routinely perceive depth and shape without necessarily using two eyes. When there is only one view of an object, one important cue about its shape is the shading information. The class of algorithms that recovers the shape of an object from just one view using shading information is called **shape from shading**.

Unfortunately, shape from shading is an ill-posed problem, which means that it can't be solved without making strong assumptions about the shape we are trying to recover. However, there is hope in shading. If, instead of using one image, we use three images of an object lit from three different directions, we can uniquely recover the shape of the object up to a scale. This technique is called **photometric stereo**.

> PHOTOMETRIC STEREO: Use 3 or more images of a scene with a static camera under different lighting conditions to obtain 3D shape information.
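
Why three images are enough can be seen in a small sketch. It assumes a Lambertian (perfectly matte) surface and known light directions, neither of which is stated above: under that model each observed intensity equals albedo × (light direction · surface normal), so three lights give three linear equations in the three unknown components of the scaled normal. All function names and numbers are illustrative.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(matrix, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(matrix)
    if abs(d) < 1e-12:
        raise ValueError("light directions must not be coplanar")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution

def normal_and_albedo(light_dirs, intensities):
    """Recover the unit surface normal and albedo at one pixel."""
    g = solve3(light_dirs, intensities)       # g = albedo * normal
    albedo = sum(c * c for c in g) ** 0.5
    if albedo == 0:
        raise ValueError("pixel is black under every light")
    return [c / albedo for c in g], albedo

# Simulate one pixel: known normal and albedo, three known unit light directions.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
true_normal, true_albedo = [0.0, 0.6, 0.8], 0.5
observed = [true_albedo * sum(l * n for l, n in zip(light, true_normal)) for light in lights]

normal, albedo = normal_and_albedo(lights, observed)
```

Repeating this at every pixel yields a field of surface normals, which can then be integrated into a surface.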

Unlike in stereo algorithms, here the light is moved instead of the camera. There are also algorithms for extracting shape from silhouettes and from defocus.

Detecting important features like edges and corners and matching them across multiple images is a very important first step in geometric computer vision.

> Ex: if you want to calibrate a camera automatically, you show it a checkerboard pattern from multiple viewpoints. A detection algorithm detects the corners of the checkerboard and matches them across multiple views to find the calibration parameters of the camera.

Feature detection and matching are also used in **image alignment**.

**Image alignment** is one of those fundamental building blocks in computer vision that you get to use in many different applications.

- If you want to compare satellite images of the same region, you need image alignment.
- If you want to register two scans of the brain to see the effect of treatment over time, you need image alignment.
- If you want to create panoramas, you need to understand image alignment.
- If you want document rectification, you need image alignment.
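
The link between feature matching and alignment can be sketched with the simplest possible motion model. The code below assumes that the two images differ only by a translation and that matched feature locations are already available; real alignment problems usually need richer models such as affine transforms or homographies. The function names and point coordinates are made up for illustration.

```python
def estimate_translation(points_a, points_b):
    """Least-squares translation that maps points_a onto points_b."""
    if not points_a or len(points_a) != len(points_b):
        raise ValueError("need the same non-zero number of points in both lists")
    count = len(points_a)
    # For a pure translation the least-squares solution is the mean offset.
    dx = sum(b[0] - a[0] for a, b in zip(points_a, points_b)) / count
    dy = sum(b[1] - a[1] for a, b in zip(points_a, points_b)) / count
    return dx, dy

def apply_translation(points, shift):
    """Move every (x, y) point by the given (dx, dy) shift."""
    return [(x + shift[0], y + shift[1]) for x, y in points]

# Matched feature locations in image A and image B (hypothetical values).
points_a = [(10, 10), (20, 15), (30, 40)]
points_b = [(13, 8), (23, 13), (33, 38)]

shift = estimate_translation(points_a, points_b)
aligned = apply_translation(points_a, shift)
print(shift)  # (3.0, -2.0)
```

Once the shift is known, every pixel of one image can be moved into the coordinate frame of the other, which is the core of applications like panorama stitching.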

A closely related problem to image alignment is **motion estimation**.

Usually when we are working with videos the motion between the frames is small.
In such cases we often try to find how each pixel in one frame has moved in the other frame. This is usually referred to as **motion estimation**.

Motion estimation has many applications like video compression and video stabilization.
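
A simple way to estimate motion is block matching: take a small block from one frame and search a neighbourhood of the next frame for the position where it fits best. The sketch below uses the sum of absolute differences (SAD) as the matching cost. It estimates one motion vector per block rather than per pixel, and the function names, frame size and search range are assumptions made for this example.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))

def best_motion(prev_frame, next_frame, x, y, size, search):
    """Find the (dx, dy) shift of the block at (x, y) with the lowest SAD."""
    height, width = len(next_frame), len(next_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + size > width or ny + size > height:
                continue  # candidate block would fall outside the frame
            cost = sad(prev_frame, next_frame, x, y, nx, ny, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]

def make_frame(block_x, block_y):
    """6x6 black frame containing one textured 2x2 block."""
    frame = [[0] * 6 for _ in range(6)]
    patch = [[50, 60], [70, 80]]
    for j in range(2):
        for i in range(2):
            frame[block_y + j][block_x + i] = patch[j][i]
    return frame

prev_frame = make_frame(1, 1)
next_frame = make_frame(2, 3)   # the block moved 1 pixel right and 2 pixels down
motion = best_motion(prev_frame, next_frame, x=1, y=1, size=2, search=2)
print(motion)  # (1, 2)
```

Because the motion between consecutive frames is small, the search window can be kept small, which keeps this exhaustive search affordable.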

Until now we have discussed problems of geometry and motion.

Now let's look at some problems in recognition.

The first such problem is **image classification**. This is one of the most widely studied problems in computer vision and artificial intelligence.

The goal of image classification is to label an input image with the class that describes the image.

> Ex: given an image of a cat, a classification algorithm would return the label cat. 

As you can imagine an image classification algorithm will work best if there is only one object in the scene and it is tightly cropped.

If there are multiple objects, we first need to find a bounding box around each of them before classifying. Algorithms that do this are called object detection algorithms.

Given an input image, they return an array of bounding boxes and a class label for every bounding box. Now let's take the idea even further.

Imagine you have multiple objects, but instead of an image you have a video sequence. You can do object detection on each frame, but you also need to know which bounding box in one frame corresponds to which one in the next frame.

Algorithms that track an object from one frame to the next are called **object tracking** algorithms.

In *object detection* you are searching for objects in the entire image.

In *tracking* you know the location of the object in the previous frame and that information can be used to reduce the search space and make tracking fast.

Tracking algorithms often learn the appearance of the object, and some of them can re-identify the object if it disappears and then reappears in the frame.
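
The question of which bounding box in one frame corresponds to which box in the next can be illustrated with a very simple association rule: match boxes that overlap the most. The sketch below uses intersection over union (IoU) and a greedy matching loop. It is a toy stand-in for a real tracker, which would typically also use appearance and motion models; the function names, box format `(x_min, y_min, x_max, y_max)` and threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def match_boxes(previous_boxes, current_boxes, min_iou=0.3):
    """Greedily pair each previous box with its best-overlapping current box."""
    matches, used = {}, set()
    for prev_index, prev_box in enumerate(previous_boxes):
        best_index, best_score = None, min_iou
        for cur_index, cur_box in enumerate(current_boxes):
            if cur_index in used:
                continue
            score = iou(prev_box, cur_box)
            if score >= best_score:
                best_index, best_score = cur_index, score
        if best_index is not None:
            matches[prev_index] = best_index
            used.add(best_index)
    return matches

# Detections in two consecutive frames; the detector returned them in a different order.
previous_boxes = [(0, 0, 10, 10), (50, 50, 70, 70)]
current_boxes = [(52, 51, 72, 71), (1, 0, 11, 10)]
matches = match_boxes(previous_boxes, current_boxes)
print(matches)  # {0: 1, 1: 0}
```

This works only because objects move little between frames, which is the same observation that lets trackers restrict their search space.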

The next important problem is called **image segmentation**. Sometimes it is not enough to put a bounding box around the object of interest. We want to find the group of pixels that belong to the object, not just a bounding box around it. This grouping of pixels into different classes is called image segmentation.

Another problem closely related to segmentation is called **natural image matting**, where the goal is to segment the image into two classes - background and foreground.

Unlike image segmentation, where a pixel can belong to only one class, in natural image matting boundary pixels belong partially to the background and partially to the foreground. So at the boundary the pixel value that we observe is a mixture of the background and the foreground. This is very common when we are looking at very fine structures like a strand of hair.
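
This mixture is usually written as the compositing equation: observed = alpha × foreground + (1 − alpha) × background, where alpha is between 0 and 1. The sketch below applies it to a single RGB pixel and then inverts it. The inversion assumes the pure foreground and background colours are known, which is not the case in real matting, where they must be estimated as well. Function names and colour values are illustrative.

```python
def composite(foreground, background, alpha):
    """Mix a foreground and background colour with opacity alpha."""
    return tuple(alpha * f + (1 - alpha) * b for f, b in zip(foreground, background))

def estimate_alpha(observed, foreground, background):
    """Recover alpha for one pixel when both pure colours are known (an assumption)."""
    diff = [f - b for f, b in zip(foreground, background)]
    denominator = sum(d * d for d in diff)
    if denominator == 0:
        raise ValueError("foreground and background colours must differ")
    # Project (observed - background) onto (foreground - background).
    numerator = sum((o - b) * d for o, b, d in zip(observed, background, diff))
    return min(1.0, max(0.0, numerator / denominator))

hair = (40, 30, 20)        # dark strand of hair
sky = (200, 220, 240)      # bright background
observed = composite(hair, sky, alpha=0.25)   # pixel only 25% covered by hair
print(observed)  # (160.0, 172.5, 185.0)
recovered = estimate_alpha(observed, hair, sky)
```

A segmentation algorithm would have to label this pixel as either hair or sky, while matting keeps the fractional coverage.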

And then there are many types of domain specific recognition algorithms.

For example many recognition algorithms have been specifically developed for biometrics. You may know about face recognition, fingerprint recognition, iris recognition and even gait recognition which means you can be identified in a video based on how you walk.

In document analysis we use text recognition which is used to convert an image of the document into text.

Image recognition techniques are also used in other places, for example to detect fraud or to find out whether merchandise is fake or real. In addition, sometimes we need more granular information about a person or object in the scene.

For example, in addition to detecting the faces, we may want to locate the landmarks on the face or estimate the head pose. In some other applications we may want to estimate the pose of the entire body.

Until now we have focused on problems where we are trying to extract information from the image or a group of images.

For example, we discussed 3D reconstruction, where a set of images was taken and 3D information was extracted. Then we looked at recognition problems like image classification and object detection, where we tried to figure out what is inside the images. Now let's go over a third category of problems, where we are not necessarily trying to extract information from the images but are trying to expand the capabilities of a camera. This class of problems falls under **computational photography**. The goal of computational photography is to computationally bypass the limitations of the camera.

For example, you can take multiple images of a scene with different exposure settings, merge the images together and produce a high dynamic range photo that would not have been possible using a standard camera.
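
The idea of merging exposures can be sketched as follows. The code assumes a linear camera response, so that pixel value divided by exposure time estimates scene radiance; real cameras are not linear, and practical HDR pipelines first recover the response curve. Samples near 0 or 255 are down-weighted because they are likely under- or over-exposed. All names and numbers are made up for illustration.

```python
def merge_pixel(pixel_values, exposure_times, max_value=255):
    """Weighted radiance estimate for one pixel seen at several exposures."""
    weighted_sum = 0.0
    weight_total = 0.0
    for value, seconds in zip(pixel_values, exposure_times):
        # "Hat" weight: trust mid-range values, ignore black or saturated ones.
        weight = min(value, max_value - value)
        weighted_sum += weight * (value / seconds)
        weight_total += weight
    if weight_total == 0:
        # Every sample is fully black or saturated: fall back to a plain average.
        return sum(v / t for v, t in zip(pixel_values, exposure_times)) / len(pixel_values)
    return weighted_sum / weight_total

def merge_exposures(images, exposure_times):
    """Merge same-sized grayscale images into one radiance map."""
    height, width = len(images[0]), len(images[0][0])
    return [[merge_pixel([img[y][x] for img in images], exposure_times)
             for x in range(width)] for y in range(height)]

# One bright and one dark scene point, captured at 1/8 s, 1/2 s and 2 s.
short, medium, long_exposure = [[50, 5]], [[200, 20]], [[255, 80]]
hdr = merge_exposures([short, medium, long_exposure], [0.125, 0.5, 2.0])
print(hdr)  # [[400.0, 40.0]]
```

The bright point is saturated in the long exposure, yet its radiance is still recovered from the shorter ones, which is the range extension that a single shot cannot provide.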

You could also take a video of a scene and create a super-high-resolution picture whose resolution is much higher than that of the individual video frames, or you could convert a grayscale image to color.

Sometimes additional optical hardware is used to create a new kind of camera. A fine example of this is the light field camera, which uses a special lens to capture not only the intensity of light but also its direction. With a light field camera you can change the focus of an image even after it has been acquired.

One of the crowning achievements of computational photography was the recent imaging of a black hole. Imaging a black hole would have required a telescope as large as the Earth itself. Instead, the Event Horizon Telescope project used a network of eight different telescopes from around the globe. These images were time-stamped using atomic clocks. The image recorded on a single telescope is extremely low resolution, but together the eight telescopes acted like one giant telescope to bring us the first high-resolution image of a black hole.

We can clearly see how computer vision is having a huge impact across multiple domains.

