# Object detection
1. Single shot detection
2. Multibox concept
3. Predicting object positions
4. The scale problem


## Single shot detection:
Rather than running a separate step to propose the boxes where an object could be and then classifying each one, a single shot detector (SSD) sends the image through the network only once. In that single pass, every one of the `boxes` is scored in order to determine where an object is most likely to be found.<br>
![Screenshot%20from%202018-08-25%2011-43-10.png](Screenshot%20from%202018-08-25%2011-43-10.png)

<b>Figure: The paper</b>
![Screenshot%20from%202018-08-25%2011-45-48.png](Screenshot%20from%202018-08-25%2011-45-48.png)

## Multibox concept:
To train an SSD we need `ground truth` boxes that mark where each object really is. The SSD divides the image into a grid of boxes:<b>fig</b>
![Screenshot%20from%202018-08-25%2012-24-32.png](Screenshot%20from%202018-08-25%2012-24-32.png)
![Screenshot%20from%202018-08-25%2012-24-58.png](Screenshot%20from%202018-08-25%2012-24-58.png)![Screenshot%20from%202018-08-25%2012-32-27.png](Screenshot%20from%202018-08-25%2012-32-27.png)


For every box, it tries to find an object of one of the classes it was trained on.<b>fig</b>

First, each box predicts whether an object is present, and that prediction is compared with the `ground truth`. The error is then calculated and backpropagated through the network to adjust the weights and reduce it. Each box can be thought of as a separate image in which the `CNN` tries to detect the object.<br>
<b>fig: SSD at work (multibox concept), above</b>
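
The comparison between a box and the `ground truth` can be sketched with intersection over union (IoU), the overlap measure commonly used for this kind of matching. This is a minimal illustration, not the notebook's training code: the function names, the 0.5 threshold, and the sample coordinates are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    # Corners of the overlapping rectangle, if there is one.
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(boxes, ground_truth, threshold=0.5):
    """Indices of the boxes whose IoU with the ground truth exceeds threshold."""
    return [i for i, box in enumerate(boxes) if iou(box, ground_truth) > threshold]


# Hypothetical data in normalized image coordinates (0 to 1).
ground_truth = (0.2, 0.2, 0.6, 0.6)
boxes = [
    (0.0, 0.0, 0.25, 0.25),    # barely touches the object
    (0.25, 0.25, 0.65, 0.65),  # covers most of the object
    (0.7, 0.7, 1.0, 1.0),      # no overlap at all
]
matched = match_boxes(boxes, ground_truth)
print(matched)
```

Only the boxes returned by `match_boxes` would be treated as containing the object; the others would count as background when the error is computed.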



## Predicting Object Positions:
Once an object is detected from matching features, its position has to be determined, even though no single box sees the whole image. The predicted rectangles are compared with the ground truth rectangle, and the errors are backpropagated to adjust the weights that produce the boxes, until the network arrives at a full detection of the object (the boat in the figure).<br>
<b>Fig</b>
![Screenshot%20from%202018-08-25%2012-36-06.png](Screenshot%20from%202018-08-25%2012-36-06.png)
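
The position error between a predicted rectangle and the ground truth rectangle can be illustrated with a smooth L1 loss, a loss commonly used for box regression. The sketch below is a simplification under stated assumptions: it compares raw corner coordinates directly (an actual SSD implementation typically regresses encoded offsets), and the function name and sample boxes are hypothetical.

```python
def smooth_l1(pred, target):
    """Smooth L1 error summed over the coordinates of two boxes."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


# Hypothetical boxes as (x0, y0, x1, y1) in normalized coordinates.
ground_truth = (0.2, 0.2, 0.6, 0.6)
far_box = (0.5, 0.5, 0.9, 0.9)
close_box = (0.25, 0.2, 0.65, 0.6)

loss_far = smooth_l1(far_box, ground_truth)
loss_close = smooth_l1(close_box, ground_truth)
print(loss_far, loss_close)
```

The closer a predicted rectangle is to the ground truth, the smaller this value, which is the signal that backpropagation uses to move the boxes.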


## Scale Problem:
Consider the following example <b>fig of horse detection example</b>:
![Screenshot%20from%202018-08-25%2012-44-28.png](Screenshot%20from%202018-08-25%2012-44-28.png)![Screenshot%20from%202018-08-25%2012-53-05.png](Screenshot%20from%202018-08-25%2012-53-05.png)

Here the horse in the front is missed by the algorithm because its `scale` is too large for the boxes to detect it.<br>

To deal with the scale problem, the image is progressively resized as it passes through the convolutions, so each convolution stage works on a smaller version of the image. After each resizing, the same algorithm runs again to detect objects, and every layer keeps the information needed to map its detections back to the original size and locate the object in the actual image.<br>

Through a continuous training process, it learns to find the objects with a certain accuracy.<br>
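
To make the resizing concrete, the sketch below shows how one grid cell at each scale maps back to pixel coordinates in the original image. The grid sizes are those reported for the SSD300 architecture (38, 19, 10, 5, 3 and 1 cells per side on a 300x300 input); the helper `cell_to_pixels` is a hypothetical name, and treating a box as exactly one grid cell is a simplification, since a real SSD places several default boxes with different aspect ratios at every cell.

```python
IMAGE_SIZE = 300                   # SSD300 input resolution
GRID_SIZES = [38, 19, 10, 5, 3, 1]  # feature map sizes, coarsest last


def cell_to_pixels(row, col, grid_size, image_size=IMAGE_SIZE):
    """Pixel rectangle (x0, y0, x1, y1) covered by one grid cell."""
    cell = image_size / grid_size
    return (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)


for grid_size in GRID_SIZES:
    x0, y0, x1, y1 = cell_to_pixels(0, 0, grid_size)
    print(f"{grid_size:>2}x{grid_size:<2} grid: one cell spans {x1 - x0:.1f} pixels")
```

A cell on the 38x38 grid covers only a few pixels and suits small objects, while the single cell of the 1x1 grid covers the whole image, which is what lets a very large object such as the front horse be picked up.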



In [None]:
# Homework Solution

# Importing the libraries
import torch
from torch.autograd import Variable
import cv2
from data import BaseTransform, VOC_CLASSES as labelmap
from ssd import build_ssd
import imageio

# Defining a function that will do the detections
def detect(frame, net, transform):
    height, width = frame.shape[:2]
    frame_t = transform(frame)[0]
    x = torch.from_numpy(frame_t).permute(2, 0, 1)
    x = Variable(x.unsqueeze(0))
    y = net(x)
    detections = y.data
    scale = torch.Tensor([width, height, width, height])
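    # detections = [batch, number of classes, number of occurrences, (score, x0, y0, x1, y1)]
    for i in range(detections.size(1)):
        j = 0
        # Stop at the end of the occurrence axis as well as at the score threshold,
        # so that j can never index past the last detection.
        while j < detections.size(2) and detections[0, i, j, 0] >= 0.6: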
            pt = (detections[0, i, j, 1:] * scale).numpy()
            cv2.rectangle(frame, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255, 0, 0), 2)
            cv2.putText(frame, labelmap[i - 1], (int(pt[0]), int(pt[1])), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 2, cv2.LINE_AA)
            j += 1
    return frame

# Creating the SSD neural network
net = build_ssd('test')
net.load_state_dict(torch.load('ssd300_mAP_77.43_v2.pth', map_location = lambda storage, loc: storage))

# Creating the transformation
transform = BaseTransform(net.size, (104/256.0, 117/256.0, 123/256.0))

# Doing some Object Detection on a video
reader = imageio.get_reader('epic_horses.mp4')
fps = reader.get_meta_data()['fps']
writer = imageio.get_writer('output.mp4', fps = fps)
for i, frame in enumerate(reader):
    frame = detect(frame, net.eval(), transform)
    writer.append_data(frame)
    print(i)
writer.close()
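
The thresholding and rescaling performed inside `detect` can be seen in isolation with plain Python values. This is only an illustrative sketch: `select_boxes` is a hypothetical helper, the sample detections are made up, and the rows are assumed to be sorted by descending score, which is what the `while` loop in `detect` relies on.

```python
def select_boxes(class_detections, width, height, threshold=0.6):
    """Keep detections scoring at least threshold and convert them to pixels.

    class_detections: rows of (score, x0, y0, x1, y1) with normalized
    coordinates, assumed sorted by descending score.
    """
    scale = (width, height, width, height)
    kept = []
    for score, *box in class_detections:
        if score < threshold:
            break  # every later row scores lower still
        kept.append(tuple(int(c * s) for c, s in zip(box, scale)))
    return kept


# Hypothetical detections for one class on a 640x480 frame.
detections_for_class = [
    (0.9, 0.25, 0.5, 0.75, 1.0),
    (0.7, 0.0, 0.25, 0.5, 0.5),
    (0.3, 0.0, 0.0, 0.25, 0.25),
]
pixel_boxes = select_boxes(detections_for_class, width=640, height=480)
print(pixel_boxes)
```

Multiplying by `(width, height, width, height)` mirrors the `scale` tensor in `detect`: x coordinates are scaled by the frame width and y coordinates by the frame height.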

![Screenshot%20%2858%29.png](Screenshot%20%2858%29.png)
![Screenshot%20%2857%29.png](Screenshot%20%2857%29.png)
![Screenshot%20%2859%29.png](Screenshot%20%2859%29.png)

### Above are screenshots from the output file.