# Object Detection for Autonomous Driving

<center>
<video width="800" height="400" src="nb_images/Result_BC.mp4" type="video/mp4" controls>
</video>
</center>

<caption><center> Video 1: Result of this "Object Detection for Autonomous Driving" project. The original video was recorded by Chaobin Yang with a handheld iPhone on a Boston College shuttle bus while it drove around Boston College.
</center></caption>

In [48]:
import os
import warnings
warnings.filterwarnings('ignore')
from matplotlib.pyplot import imshow
import cv2
import scipy.io
import scipy.misc
import tensorflow as tf
from keras import backend as K
from keras.models import load_model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners

## 1 Problem

The dataset is provided by [drive.ai](https://www.drive.ai/). Images were gathered from cameras mounted on the front of cars. The YOLO algorithm is used to recognize objects in the images, and each recognized object is labelled with a bounding box. In this notebook, I did the following:
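- Filter the predicted boxes by class score
- Apply non-max suppression to remove overlapping boxes
- Test a pretrained YOLO model on an image
- Apply the model to every frame of a video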

### Definition of a box
$b_x$ and $b_y$ define the center of the box, and $b_h$ and $b_w$ define its height and width. $p_c$ is the confidence that the box contains an object. If there are 80 categories to recognize, the category of the object can be represented in either of two ways:
- $i)$ a label $c$ stored as an integer from 1 to 80: 6 elements represent a box, $(p_c, b_x, b_y, b_h, b_w, c)$
- $ii)$ a one-hot vector whose $c$-th entry is 1 and all others are 0: 85 elements represent a box

<img src="nb_images/box_label.png" style="width:500px;height:250;">
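
The two representations can be compared with a small sketch. The helper names and the example values below are hypothetical and only illustrate the element counts given above.

```python
NUM_CLASSES = 80

def box_with_integer_label(p_c, b_x, b_y, b_h, b_w, c):
    """Return the 6-element form (p_c, b_x, b_y, b_h, b_w, c), with c in 1..80."""
    return [p_c, b_x, b_y, b_h, b_w, c]

def box_with_one_hot_label(p_c, b_x, b_y, b_h, b_w, c):
    """Return the 85-element form: 5 box numbers followed by an 80-element one-hot vector."""
    one_hot = [0.0] * NUM_CLASSES
    one_hot[c - 1] = 1.0  # classes are numbered from 1, list indices from 0
    return [p_c, b_x, b_y, b_h, b_w] + one_hot

# Hypothetical box: centered in the image, labelled as class 3.
compact = box_with_integer_label(1.0, 0.5, 0.5, 0.2, 0.3, c=3)
expanded = box_with_one_hot_label(1.0, 0.5, 0.5, 0.2, 0.3, c=3)
print(len(compact), len(expanded))
```

The one-hot form is longer, but it lets the network output one value per class, which is what the score computation in section 2.2 relies on.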

## 2 YOLO

YOLO ("you only look once") requires only one forward propagation pass through the network to make predictions. Thus it "only looks once" at the image.

### 2.1 Model details

- The **input** is a batch of m images, a tensor of shape (m, 608, 608, 3)
- The **output** is a list of boxes with their recognized classes, a tensor of shape (m, 19, 19, 5, 85). Each image is divided into a 19x19 grid of cells, and each cell predicts five boxes. Each bounding box is represented by 6 numbers $(p_c, b_x, b_y, b_h, b_w, c)$ as explained above. If $c$ is expanded into an 80-dimensional vector, each bounding box is represented by 85 numbers.

If the center (midpoint) of an object falls into a grid cell, that grid cell is responsible for detecting the object. Because each cell predicts 5 boxes, a cell can detect at most 5 objects centered inside it.

YOLO architecture: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

<img src="nb_images/architecture.png" style="width:700px;height:400;">
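
To make the encoding shape concrete, here is a minimal NumPy sketch that splits a tensor of shape (m, 19, 19, 5, 85) into confidence, box and class parts. The tensor is a random stand-in, and the ordering of the 85 numbers is assumed from the order used in the text ($p_c$ first). The layout that `yolo_head` actually uses may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2  # hypothetical batch size
encoding = rng.random((m, 19, 19, 5, 85))  # random stand-in for the network output

# Assumed layout of the last axis: [p_c, b_x, b_y, b_h, b_w, c_1 ... c_80]
box_confidence = encoding[..., 0:1]   # shape (m, 19, 19, 5, 1)
boxes = encoding[..., 1:5]            # shape (m, 19, 19, 5, 4)
box_class_probs = encoding[..., 5:]   # shape (m, 19, 19, 5, 80)

print(box_confidence.shape, boxes.shape, box_class_probs.shape)
```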

### 2.2 Filtering boxes with class scores

Each cell gives 5 boxes, so the model predicts 19x19x5 = 1805 boxes from a single look at the image. Most of them have to be discarded, in two steps:
- First, keep only boxes with a high class score (the model is more confident that they contain an object)
- Second, keep only one box when several overlapping boxes detect the same object
<img src="nb_images/anchor_map.png" style="width:200px;height:200;">

**yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold)** filters the boxes in three steps.

Step 1:
The score of every class is computed as $p_c$ * ($c_1$, $c_2$, ..., $c_{79}$, $c_{80}$)
- "box_confidence" is $p_c$, a tensor of shape (19, 19, 5, 1)
- "box_class_probs" is ($c_1$, $c_2$, ..., $c_{79}$, $c_{80}$), a tensor of shape (19, 19, 5, 80)
- "boxes" holds the position and size of every box, $(b_x, b_y, b_h, b_w)$, a tensor of shape (19, 19, 5, 4)

Step 2:
For every box, find the index and value of the class with the highest score. The index is saved as "box_classes" and the value as "box_class_scores". A filtering mask is then created by comparing "box_class_scores" with "threshold".

Step 3:
Apply the filtering mask to keep only the boxes whose score is at least the threshold.
- "scores" -- tensor of shape (number_selected_boxes,), containing the class score of each selected box
- "boxes" -- tensor of shape (number_selected_boxes, 4), containing the $(b_x, b_y, b_h, b_w)$ coordinates of the selected boxes
- "classes" -- tensor of shape (number_selected_boxes,), containing the index of the class detected by each selected box

In [49]:
def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
    # Step 1: Compute box scores.
    box_scores = box_confidence * box_class_probs
    
    # Step 2: find the index and value of class with max score.
    box_classes = K.argmax(box_scores, axis=-1)
    box_class_scores = K.max(box_scores, axis=-1, keepdims=False)
    # Create a filtering mask from "box_class_scores" and "threshold". The mask has the
    # same shape as box_class_scores and is True for the boxes to keep
    filtering_mask = box_class_scores>=threshold
    
    # Step 3: Apply the mask to scores, boxes and classes, select box with score higher than threshold
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)
    
    return scores, boxes, classes
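
The three steps can be traced by hand on a toy input. The sketch below is a NumPy analog of the function, not the TensorFlow version itself. The function name `filter_boxes_np` and the 1x1 grid with 2 boxes and 3 classes are made up for illustration.

```python
import numpy as np

def filter_boxes_np(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy analog of the score filtering: returns (scores, boxes, classes)."""
    box_scores = box_confidence * box_class_probs   # step 1: broadcast p_c over the classes
    box_classes = np.argmax(box_scores, axis=-1)    # step 2: best class index per box
    box_class_scores = np.max(box_scores, axis=-1)  #         and its score
    mask = box_class_scores >= threshold
    return box_class_scores[mask], boxes[mask], box_classes[mask]  # step 3

# Toy input: a 1x1 grid with 2 boxes per cell and 3 classes.
box_confidence = np.array([[[[0.9], [0.5]]]])                        # shape (1, 1, 2, 1)
boxes = np.array([[[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]]])   # shape (1, 1, 2, 4)
box_class_probs = np.array([[[[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]]])   # shape (1, 1, 2, 3)

scores, kept_boxes, classes = filter_boxes_np(box_confidence, boxes, box_class_probs)
print(scores, kept_boxes, classes)
```

The first box scores 0.9 * 0.8 = 0.72 for class index 1 and is kept. The second box reaches at most 0.5 * 0.6 = 0.30 and is dropped by the 0.6 threshold.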

### 2.3 Non-max suppression

After thresholding on the class scores, I still end up with a lot of overlapping boxes. A second filter, called non-maximum suppression (NMS), selects the right boxes.
<img src="nb_images/non-max-suppression.png" style="width:500px;height:400;">

Non-max suppression relies on a function called **"Intersection over Union"**, or IoU.
<img src="nb_images/iou.png" style="width:500px;height:400;">

**iou(box1, box2)** calculates the IoU shown above. **A large IoU means more overlap, so the lower-scoring box is deleted.** (x1, y1) is the upper-left corner and (x2, y2) the lower-right corner of a box.  
box1 -- first box, list object with coordinates (x1, y1, x2, y2)  
box2 -- second box, list object with coordinates (x1, y1, x2, y2)

In [50]:
def iou(box1, box2):
    
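    # Calculate the intersection area. Each side is clipped at 0 so that
    # boxes that do not overlap give an intersection of 0.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)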

    # Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)
    box1_area = (box1[3] - box1[1]) * (box1[2] - box1[0])
    box2_area = (box2[3] - box2[1]) * (box2[2] - box2[0])
    union_area = box1_area+box2_area-inter_area
    
    # compute the IoU
    iou = inter_area/union_area

    return iou
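
A worked example helps to check the arithmetic. The sketch below uses a separate helper, `iou_clipped` (a name made up here), and two hypothetical pairs of boxes. This version clips the overlap at zero, because for boxes that do not overlap the raw product of the side differences can come out positive.

```python
def iou_clipped(box1, box2):
    """IoU of two (x1, y1, x2, y2) boxes, with the overlap clipped at zero."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter_area / (box1_area + box2_area - inter_area)

# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 4 + 4 - 1 = 7.
overlapping = iou_clipped((2, 1, 4, 3), (1, 2, 3, 4))
# Two 1x1 boxes far apart: no intersection at all.
disjoint = iou_clipped((0, 0, 1, 1), (2, 2, 3, 3))
print(overlapping, disjoint)
```

The first pair gives 1/7, about 0.14, and the second pair gives 0. Without the clipping, the second pair would yield (-1) * (-1) = 1 as its intersection area.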

**yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5)**   
Implements non-max suppression. The key steps are:  
1. Select the box that has the highest score.
2. Compute its overlap with all other boxes, and remove boxes whose IoU with it is greater than "iou_threshold".
3. Go back to step 1 and repeat with the remaining boxes until every box has been either selected or removed.

Arguments:  
"scores", "boxes", "classes" are the outputs of yolo_filter_boxes(). "max_boxes" is the maximum number of predicted boxes to keep.

Returns:  
"scores", "boxes", "classes" of the boxes selected by non-max suppression

In [51]:
def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
    
    # tensor to be used in tf.image.non_max_suppression()
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')
    
    # initialize variable max_boxes_tensor
    K.get_session().run(tf.variables_initializer([max_boxes_tensor])) 
    
    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor, iou_threshold)
    
    # Use K.gather() to select only nms_indices from scores, boxes and classes
    scores = K.gather(scores,nms_indices)
    boxes = K.gather(boxes,nms_indices)
    classes = K.gather(classes,nms_indices)
    
    return scores, boxes, classes
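
The key steps listed above can also be written out directly. This is a minimal pure-Python sketch of greedy non-max suppression for illustration only. It is not how `tf.image.non_max_suppression` is implemented, and the helper names and toy boxes are hypothetical.

```python
def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, with the overlap clipped at zero."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(scores, boxes, max_boxes=10, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < max_boxes:
        best = order.pop(0)          # step 1: highest remaining score
        keep.append(best)
        # step 2: drop every remaining box that overlaps the selected one too much
        order = [i for i in order if iou_xyxy(boxes[best], boxes[i]) <= iou_threshold]
    return keep                      # step 3 is the loop itself

scores = [0.9, 0.8, 0.7]
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = greedy_nms(scores, boxes)
print(kept)
```

Boxes 0 and 1 overlap with an IoU of 81/119, about 0.68, so box 1 is suppressed. Box 2 does not overlap box 0 and is kept.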

### 2.4 Combining the two filters

**yolo_eval(yolo_outputs, image_shape, max_boxes, score_threshold, iou_threshold)** takes the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filters all the boxes using the functions just implemented.  

Arguments:  
"yolo_outputs" -- output of the deep CNN model (for image_shape of (608, 608, 3)), contains 4 tensors:
- box_confidence: tensor of shape (None, 19, 19, 5, 1)
- box_xy: tensor of shape (None, 19, 19, 5, 2)
- box_wh: tensor of shape (None, 19, 19, 5, 2)
- box_class_probs: tensor of shape (None, 19, 19, 5, 80)

"image_shape" -- tensor of shape (2,) containing the shape (height, width) of the original image  

Returns:  
"scores", "boxes", "classes" are the outputs of yolo_non_max_suppression

**Note: boxes = scale_boxes(boxes, image_shape)** rescales the boxes:  
YOLO's network was trained to run on 608x608 images. To run it on images of a different size, for example 720x1280, the predicted boxes need to be rescaled so that they can be drawn on the original image.
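
The rescaling can be pictured with a small sketch. The real `scale_boxes` lives in `yolo_utils` and is not shown in this notebook. The version below is an illustration that assumes the boxes are normalized to the range 0 to 1 and stored as (y1, x1, y2, x2). The function name and the example box are hypothetical.

```python
def scale_boxes_sketch(boxes, image_shape):
    """Scale normalized (y1, x1, y2, x2) boxes to pixel coordinates."""
    height, width = image_shape
    return [(y1 * height, x1 * width, y2 * height, x2 * width)
            for (y1, x1, y2, x2) in boxes]

# One box covering the lower-right part of a 720x1280 image.
normalized = [(0.25, 0.5, 0.75, 1.0)]
pixels = scale_boxes_sketch(normalized, (720.0, 1280.0))
print(pixels)  # [(180.0, 640.0, 540.0, 1280.0)]
```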


In [52]:
def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    
    # Retrieve outputs of the YOLO model 
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions 
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # perform Score-filtering with a threshold of score_threshold
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, score_threshold)
    
    # Scale boxes back to original image shape
    boxes = scale_boxes(boxes, image_shape)

    # perform Non-max suppression with a threshold of iou_threshold 
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes, max_boxes, iou_threshold)

    return scores, boxes, classes
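
The conversion step at the top of `yolo_eval` turns each box from center and size into corner coordinates, which is the form the IoU computation needs. Below is a minimal sketch of that idea for a single box. It uses the (x1, y1, x2, y2) order of `iou` above, while the actual `yolo_boxes_to_corners` from yad2k may order the coordinates differently. The function name and values are hypothetical.

```python
def center_to_corners(b_x, b_y, b_w, b_h):
    """Convert a center/size box to (x1, y1, x2, y2) corners."""
    return (b_x - b_w / 2, b_y - b_h / 2, b_x + b_w / 2, b_y + b_h / 2)

# A box centered at (0.5, 0.5), 0.25 wide and 0.5 high.
corners = center_to_corners(0.5, 0.5, 0.25, 0.5)
print(corners)  # (0.375, 0.25, 0.625, 0.75)
```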

## 3 Test the pretrained YOLO model on an image
Create a session and test the graph:
- Start a session
- Define classes, anchors and image shape. 80 classes and 5 anchors are loaded from files. image_shape needs to match the test image.
- Load a pretrained model from "yolo.h5" and print its summary. The model converts input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85)

In [53]:
#start a session
sess = K.get_session()

# define classes, anchors and image shape
class_names = read_classes("model_data/coco_classes.txt")
anchors = read_anchors("model_data/yolo_anchors.txt")
image_shape = (1080., 1920.)

#load pretrained model
yolo_model = load_model("model_data/yolo.h5")
#model summary
yolo_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 608, 608, 3)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 608, 608, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU)       (None, 608, 608, 32) 0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
max_poolin

__________________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU)      (None, 38, 38, 256)  0           batch_normalization_12[0][0]     
__________________________________________________________________________________________________
conv2d_13 (Conv2D)              (None, 38, 38, 512)  1179648     leaky_re_lu_12[0][0]             
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 38, 38, 512)  2048        conv2d_13[0][0]                  
__________________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU)      (None, 38, 38, 512)  0           batch_normalization_13[0][0]     
__________________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D)  (None, 19, 19, 512)  0           leaky_re_lu_13[0][0]             
__________

In [54]:

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

In [55]:
def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.
    
    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.
    
    Returns:
    out_scores -- tensor of shape (None, ), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None, ), class index of the predicted boxes
    
    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes. 
    """

    # Preprocess your image
    image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

    # Run the session on the output tensors built earlier (scores, boxes, classes), feeding the
    # preprocessed image. K.learning_phase(): 0 runs the model in inference mode.
    # Reusing the existing tensors avoids adding new nodes to the graph on every call.

    out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes], feed_dict={yolo_model.input: image_data, K.learning_phase(): 0})

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results in the notebook
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    
    #imshow(output_image)
    
    return out_scores, out_boxes, out_classes




<img src="images/test.jpg" style="width:640px;height:360;">

In [58]:
out_scores, out_boxes, out_classes = predict(sess, "test.jpg")

Found 5 boxes for test.jpg
car 0.61 (915, 517) (1066, 611)
stop sign 0.62 (1507, 428) (1555, 465)
car 0.62 (960, 509) (1093, 594)
car 0.65 (630, 553) (886, 678)
bus 0.81 (17, 363) (568, 935)


Below is the output image with the predictions. Five boxes are detected: a bus, three cars and a stop sign.
<img src="nb_images/test.jpg" style="width:640px;height:360;">

## 4 Video Visualization

Next, I recorded a video and tested the YOLO model on it.

<center>
<video width="640" height="360" src="images/BC.mov" type="video/mov" controls>
</video>
    
</center>
<caption><center> Video 2: Original video recorded by Chaobin Yang with a handheld iPhone on a Boston College shuttle bus while it drove around Boston College.
</center></caption>

### 4.1 Split the video into images

In [59]:
vidcap = cv2.VideoCapture('images/BC.mov')
success,image = vidcap.read()
count = 0
while success:
    cv2.imwrite("images/BC%d.jpg" % count, image)     # save frame as JPEG file      
    success,image = vidcap.read()
    count += 1
    if count>1197: break


### 4.2 Predict every image

In [46]:
images=list()
#for i in range(1198):
for i in range(30):
    images.append("BC"+str(i)+".jpg")
    out_scores, out_boxes, out_classes = predict(sess, "BC"+str(i)+".jpg")

Found 3 boxes for BC0.jpg
truck 0.61 (1729, 282) (1899, 581)
car 0.70 (446, 466) (605, 573)
car 0.79 (232, 492) (409, 559)
Found 3 boxes for BC1.jpg
truck 0.62 (1728, 282) (1900, 581)
car 0.67 (446, 466) (608, 573)
car 0.79 (231, 491) (411, 560)
Found 3 boxes for BC2.jpg
car 0.64 (443, 466) (607, 570)
truck 0.67 (1729, 285) (1900, 578)
car 0.78 (229, 492) (408, 559)
Found 3 boxes for BC3.jpg
car 0.64 (442, 466) (604, 569)
truck 0.64 (1727, 286) (1900, 577)
car 0.77 (229, 492) (410, 558)
Found 2 boxes for BC4.jpg
car 0.60 (443, 466) (607, 569)
car 0.76 (229, 492) (409, 558)
Found 3 boxes for BC5.jpg
truck 0.60 (1727, 284) (1899, 572)
car 0.61 (434, 473) (569, 568)
car 0.77 (230, 491) (412, 559)
Found 1 boxes for BC6.jpg
car 0.77 (229, 492) (411, 557)
Found 1 boxes for BC7.jpg
car 0.77 (229, 490) (408, 558)
Found 2 boxes for BC8.jpg
car 0.63 (430, 466) (636, 573)
car 0.78 (227, 491) (408, 558)
Found 2 boxes for BC9.jpg
car 0.65 (425, 466) (640, 575)
car 0.78 (228, 490) (409, 557)
Found 3

Found 2 boxes for BC65.jpg
car 0.78 (1657, 419) (1799, 521)
car 0.85 (297, 435) (567, 579)
Found 2 boxes for BC66.jpg
car 0.80 (1656, 417) (1798, 520)
car 0.83 (294, 433) (568, 579)
Found 2 boxes for BC67.jpg
car 0.81 (1656, 415) (1798, 521)
car 0.83 (302, 435) (565, 575)
Found 2 boxes for BC68.jpg
car 0.82 (1656, 415) (1797, 521)
car 0.83 (300, 438) (560, 573)
Found 2 boxes for BC69.jpg
car 0.80 (309, 436) (555, 574)
car 0.83 (1654, 415) (1796, 520)
Found 2 boxes for BC70.jpg
car 0.80 (301, 437) (548, 575)
car 0.83 (1654, 412) (1798, 519)
Found 2 boxes for BC71.jpg
car 0.79 (300, 438) (546, 576)
car 0.83 (1654, 411) (1795, 519)
Found 2 boxes for BC72.jpg
car 0.81 (298, 436) (546, 579)
car 0.83 (1654, 411) (1795, 518)
Found 2 boxes for BC73.jpg
car 0.78 (302, 436) (542, 580)
car 0.84 (1652, 411) (1794, 519)
Found 2 boxes for BC74.jpg
car 0.73 (306, 434) (537, 580)
car 0.84 (1650, 411) (1795, 519)
Found 3 boxes for BC75.jpg
traffic light 0.62 (805, 6) (927, 93)
car 0.69 (248, 452) (494,

### 4.3 Merge predicted images to one video

In [47]:
from moviepy.editor import ImageClip, concatenate_videoclips

# collect every predicted image for the video
clips = [ImageClip(os.path.join("out", m)).set_duration(0.03) for m in images]

#concatenate images to a video
concat_clip = concatenate_videoclips(clips, method="compose")

# output video to disk
concat_clip.write_videofile("Result_BC.mp4", fps=24)

[MoviePy] >>>> Building video Result_BC.mp4
[MoviePy] Writing video Result_BC.mp4


100%|██████████████████████████████████████████████████████████████████████████████████| 72/72 [00:01<00:00, 38.42it/s]


[MoviePy] Done.
[MoviePy] >>>> Video ready: Result_BC.mp4 
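
The relation between the per-image duration and the output frame rate is simple arithmetic. The sketch below assumes that the writer samples the concatenated clip once every 1/fps seconds. The helper name and the 30-image example are hypothetical.

```python
def output_frame_count(num_images, seconds_per_image, fps):
    """Approximate number of frames written for a sequence of still images."""
    return round(num_images * seconds_per_image * fps)

# 30 images shown for 0.03 s each, written at 24 fps.
frames_short = output_frame_count(30, 0.03, 24)
# The same images shown for exactly one output frame each.
frames_matched = output_frame_count(30, 1 / 24, 24)
print(frames_short, frames_matched)
```

With 0.03 s per image, fewer frames are written than there are images, so some predicted images never appear in the video. A duration of 1/fps keeps one output frame per image.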



The output video "Result_BC.mp4" is Video 1 shown at the beginning.  

<center>
<video width="640" height="360" src="nb_images/Result_BC.mp4" type="video/mp4" controls>
</video>
</center>



