<h1 style="color:orange;text-align:center;font-family:courier;font-size:280%">YOLOv1 from scratch</h1>
<p style="color:orange;text-align:center;font-family:courier">The objective is to understand how single-stage object detectors work, using the YOLOv1 algorithm as the example.</p>

### Objectives 
* Understand the theory and building blocks of the object detection problem.
* Build a basic understanding of why newer algorithms are needed to solve complex detection problems.
* Simplify the pedagogy of explaining computer vision topics.
<!-- * Though the code works there are significant drawbacks with yolov1 which has been addressed on YoloV2,YoloV3 -->


<p style="text-align:center"><img src="assets/yolo.jpeg" alt="yolov1" width="240"/></p>
<p style="text-align:center"><img src="assets/pipeline.png" alt="yolov1_od" width="640"/></p>
 

### Import dependencies

###### Essential links for following this tutorial
* <a href=https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/>Read about IoU (Intersection over Union)</a>
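
As a quick refresher, IoU is the area of overlap between two boxes divided by the area of their union. The sketch below is only an illustration for a single pair of boxes in `[xmin, ymin, xmax, ymax]` form; `iou_xyxy` is a hypothetical name and is not the `iou` helper imported from `detect_utils`, whose input format may differ.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax] (illustrative helper)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Negative extents mean the boxes do not overlap, so clamp them to zero.
    inter = max(ix_max - ix_min, 0.0) * max(iy_max - iy_min, 0.0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: overlap 1, union 4 + 4 - 1 = 7.
print(iou_xyxy([0, 0, 2, 2], [1, 1, 3, 3]))
```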

In [1]:
import cv2
import numpy as np # linear algebra
import tensorflow as tf # deep learning
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import Sequence
from tensorflow.keras import layers,models
from tensorflow.keras.callbacks import ModelCheckpoint,ReduceLROnPlateau
from detect_utils import (iou,xmin_ymin_centre_wh,
                          rescale_csv,prepare_csv,
                         xw_xmin_ymin,basic_NMS) #object detection utilities

from detect_utils import (get_predictions,
                          process_predictions,get_output)

#### Dataset Structure information
* In general, bounding-box coordinates are stored in normalized form; a sample of the CSV structure is shown below.
* All images were rescaled to (464,464,3) before training, and the box coordinates were rescaled accordingly.

<p style="text-align:center"><img src="assets/dataset_csv.png" alt="yolov1_od" width="640"/></p>
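
To make the idea of normalized coordinates concrete, here is a minimal sketch that converts a pixel-space corner box to normalized form and then to the centre format used later. The function names are hypothetical and the pixel values are made up for illustration; the real conversions live in `detect_utils` (`rescale_csv`, `xmin_ymin_centre_wh`) and may differ in detail.

```python
def normalize_box(xmin, ymin, xmax, ymax, img_w, img_h):
    """Scale pixel corner coordinates into the [0, 1] range."""
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)


def corners_to_centre(xmin, ymin, xmax, ymax):
    """Convert [xmin, ymin, xmax, ymax] to [x_centre, y_centre, width, height]."""
    w = xmax - xmin
    h = ymax - ymin
    return (xmin + w / 2, ymin + h / 2, w, h)


# Illustrative box on a 464x464 image.
norm = normalize_box(116, 232, 348, 464, img_w=464, img_h=464)
print(norm)                      # (0.25, 0.5, 0.75, 1.0)
print(corners_to_centre(*norm))  # (0.5, 0.75, 0.5, 0.5)
```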

#### Implementation of Custom DataLoader
* <a href=https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence>Read about creating a custom TensorFlow data loader with the Sequence API class</a>

In [2]:
class DataLoader(Sequence):
    def __init__(self,csv,img_dir,image_size,grid_size,classes,batch_size,aug=False,shuffle=True):
        self.csv   = csv
        self.img_dir    = img_dir
        self.imsize     = image_size
        self.batch_size = batch_size
        self.shuffle    = shuffle
        self.grid_size = grid_size
        self.classes = classes
        self.indices    = list(range(len(list(set(self.csv.image)))))
        self.aug = aug
        
    
    def __len__(self):
        # number of full batches per epoch; a trailing partial batch is dropped
        return len(self.indices)//self.batch_size
    
    def on_epoch_end(self):
        if self.shuffle == True:
            np.random.shuffle(self.indices)
    
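    def __process__(self,index):
        # target tensor: [x,y,w,h,objectness] followed by one-hot classes for every grid cell
        out_grid = np.zeros((self.grid_size,self.grid_size,5+self.classes))
        csvd = self.csv
        # sorted so that the index-to-file mapping is deterministic
        all_files = sorted(set(csvd.image))
        # all annotation rows that belong to the selected image
        selected  = csvd[csvd["image"]==all_files[index]]
        selected = selected.values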
        im_tensor = tf.io.decode_png(tf.io.read_file(self.img_dir+selected[0,0]),channels=3)
        im_tensor = tf.image.resize(im_tensor,self.imsize)
        if self.aug:
            p1 = np.random.randint(1,15,1)[0]
            p2 = np.random.randint(1,15,1)[0]
            if p1 >= 9:
                im_tensor=tf.image.random_contrast(im_tensor,0.6,1.6)
            if p2 >= 13:
                im_tensor=tf.image.random_brightness(im_tensor,0.3)
        im_tensor = im_tensor/255.
        boxes    = selected[:,2:-2]
        labels   = selected[:,1]
        labels = tf.cast([labels],tf.uint8)
        labels = tf.one_hot(labels,self.classes+1)[0][:,1:]
        #first convert from xmin,ymin to center
        centres = xmin_ymin_centre_wh(boxes) #[x_c,y_c,width,height]
        centres  =tf.concat((centres,labels),axis=1)

        #multiply by out_grid scale
        for bboxs in centres:
            bbox,cls_ = bboxs[:4],bboxs[4:]
            cx,cy,wid,hei = bbox 
            
            g_cx,g_cy   = cx*self.grid_size,cy*self.grid_size # centre in grid units, range [0, grid_size]
            i,j = int(g_cx),int(g_cy) # index of the grid cell that contains the centre
            g_cx,g_cy   = g_cx-int(g_cx),g_cy-int(g_cy) # offset of the centre inside that cell, range [0, 1)
            g_wid,g_hei = wid*self.grid_size,hei*self.grid_size # width and height in grid-cell units

            
            if out_grid[i,j,4] ==0:
                out_grid[i,j,4] = 1
                out_grid[i,j,:4] = [g_cx,g_cy,g_wid,g_hei]
                if self.classes == 1:
                    out_grid[i,j,5] = 1
                else:
                    out_grid[i,j,5:] = cls_
        return im_tensor,out_grid

    
    def __getitem__(self,idx):
        x_ = []
        y_ = []

        # take the (possibly shuffled) sample indices that belong to this batch
        batch_list = self.indices[idx * self.batch_size:(idx + 1)*self.batch_size]
        for idx_ in batch_list:
            x,y = self.__process__(idx_)
            x_.append(x)
            y_.append(y)

        return tf.cast(x_,tf.float32),tf.cast(y_,tf.float32)
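
The grid encoding inside the loader is easier to follow with concrete numbers. The sketch below repeats the same arithmetic for a single box outside the class; `encode_box` is a hypothetical helper, the class id and class count are illustrative, and the `[i, j]` indexing order simply mirrors the loader above.

```python
import numpy as np


def encode_box(cx, cy, w, h, grid_size=7):
    """Map a normalized centre-format box to (cell index, cell-relative box)."""
    g_cx, g_cy = cx * grid_size, cy * grid_size  # centre in grid units
    i, j = int(g_cx), int(g_cy)                  # cell that contains the centre
    # Offsets are relative to the cell; width and height are in grid-cell units.
    return i, j, (g_cx - i, g_cy - j, w * grid_size, h * grid_size)


# A box centred at (0.5, 0.75) that covers half of the image in each direction.
i, j, cell_box = encode_box(0.5, 0.75, 0.5, 0.5)

num_classes, class_id = 6, 2  # illustrative values
target = np.zeros((7, 7, 5 + num_classes), dtype=np.float32)
target[i, j, :4] = cell_box
target[i, j, 4] = 1.0             # objectness flag
target[i, j, 5 + class_id] = 1.0  # one-hot class

print(i, j, cell_box)  # 3 5 (0.5, 0.25, 3.5, 3.5)
```

Only one cell of the 7x7 target is non-zero for this box, which is why a second object whose centre lands in the same cell is ignored by the loader.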

### Loading Dataset 
##### Setting up paths and training configuration.

* We use a grid size of 7x7, as in the paper.
* We use the Pascal VOC dataset, keeping only 6 of its 20 labels to speed up the experiment, with a batch size of 4 (GPU constraints).
* The `exclude_classes` list contains the labels that are left out of training.

In [3]:
train_csv_path  = "pascal/464_train_scaled.csv"
train_image_dir = "pascal/464_train_scaled/"

val_csv_path   = "pascal/464_valid_scaled.csv"
val_image_dir  = "pascal/464_valid_scaled/"

exclude_classes=['person','train','horse','tvmonitor',
                 'diningtable','cow','sofa','chair', 
                 'cat', 'bird', 'pottedplant', 'boat', 
                 'sheep', 'bottle']

grid_size = 7
# classes = 6
batch_size = 4
image_height, image_width = 464,464

train_csv_file,val_csv_file,classes,label_map = prepare_csv(train_csv_path=train_csv_path,
                                                            val_csv_path=val_csv_path,exclude_label=exclude_classes)

train_data = DataLoader(train_csv_file,img_dir=train_image_dir,
                        image_size=(image_height, image_width),
                        grid_size=grid_size,classes=classes,
                        batch_size=batch_size,aug=True)


val_data   = DataLoader(val_csv_file,img_dir=val_image_dir,
                       image_size=(image_height, image_width),
                       grid_size=grid_size,classes=classes,
                       batch_size=batch_size,aug=False)

train_steps = len(train_data)
val_steps = len(val_data)

print(f"Labels : {label_map}")

Labels : {0: 'aeroplane', 1: 'motorbike', 2: 'dog', 3: 'bicycle', 4: 'car', 5: 'bus'}


### Model
* We use ResNet50 as a pretrained feature extractor.
* On top of it, we add a head that outputs the detection (box and confidence) and classification predictions.

<p style="text-align:center"><img src="assets/encode.png" alt="yolov1_od" width="1080"/></p>

#### Model Prediction  
<p style="text-align:center"><img src="assets/pred.png" alt="yolov1_od" width="350"/></p>


In [4]:
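def YOLO_r50(gsize=7,classes=None):
    # ResNet50 backbone with its default pretrained weights and no classification top
    base_model = tf.keras.applications.ResNet50(include_top=False,input_shape=(None,None,3))
    # fine-tune the whole backbone
    for l in base_model.layers[:]: 
        l.trainable = True
    # alternative: freeze the early layers instead
#     for l in base_model.layers[:143]: #143 81
#         l.trainable = False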

    features=base_model.output
    pool=layers.GlobalAveragePooling2D()(features)
    
    box_params  = 5 #[x,y,w,h,confidence]
    total_boxes = 2 #[x,y,w,h,confidence],[x,y,w,h,confidence]
        
    # Detection head
    out_dim  = (box_params*total_boxes)+classes
    det_map  = layers.Dense(gsize*gsize*out_dim)(pool)# detection map
    
    det_out = layers.Reshape((gsize,gsize,out_dim))(det_map)
    
    mod = models.Model(base_model.input,det_out)
    return mod
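
The layout of the last axis of the model output can be sketched as follows. This is an illustration based on how the loss function slices `y_pred`, not part of the model itself; `split_prediction` is a hypothetical helper and the zero tensor stands in for a real prediction.

```python
import numpy as np


def split_prediction(pred, num_classes):
    """Split the last axis of a (S, S, 10 + num_classes) prediction."""
    box1 = pred[..., 0:5]    # x, y, w, h, confidence of the first box
    box2 = pred[..., 5:10]   # x, y, w, h, confidence of the second box
    class_scores = pred[..., 10:10 + num_classes]
    return box1, box2, class_scores


# Stand-in for one image's output with a 7x7 grid, 2 boxes and 6 classes.
pred = np.zeros((7, 7, 2 * 5 + 6), dtype=np.float32)
box1, box2, class_scores = split_prediction(pred, num_classes=6)
print(box1.shape, box2.shape, class_scores.shape)  # (7, 7, 5) (7, 7, 5) (7, 7, 6)
```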

#### YOLOv1 Loss Function

* The loss function is one of the core parts of a detection problem. If it is poorly designed, training can become unstable and the model can diverge.
* The YOLOv1 loss can be split into three main parts, all penalized with a squared-error loss:
  * Bounding box loss
  * Object confidence loss
  * Class loss
  
  
     <p style="text-align:center"><img src="assets/loss.jpg" alt="yolov1_od" width="850"/></p>
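
For reference, the loss from the YOLOv1 paper can be written out as below. Here $S$ is the grid size, $B$ the number of boxes per cell, hatted symbols are predictions, $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ of cell $i$ is responsible for an object, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when cell $i$ contains an object. The paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, the same values used in the code that follows. Note that the paper uses the IoU with the ground-truth box as the confidence target $C_i$, whereas the implementation in this notebook uses the 0/1 objectness flag.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines form the bounding box loss, the next two the object confidence loss (for cells with and without an object), and the last line the class loss.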

In [5]:
def yolov1_loss(y_true,y_pred):
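    """YOLOv1 loss for a 7x7 grid, 2 boxes per cell and 6 classes.

    y_true last axis: [g_cx,g_cy,g_wid,g_hei,objectness] followed by 6 one-hot class values
    y_pred last axis: box 1 at [0,1,2,3,4], box 2 at [5,6,7,8,9], class scores at [10,...,15]
    """
#     y_true = tf.cast(y_true,tf.float32)
#     y_pred = tf.cast(y_pred,tf.float32)
    s = 7               # grid size, must match the DataLoader and the model
    classes_count  = 6  # number of classes, must match the dataset
    lambda_coord = 5.0  # weight of the box loss
    lambda_noobj = 0.5  # weight of the confidence loss for cells without an object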
    identity_obj = tf.expand_dims(y_true[...,4],axis=-1)
    identity_obj = tf.reshape(identity_obj,(-1,s*s,1))  #shape=(N, 49, 1)

    # box_loss
    true_box = tf.reshape(y_true[...,:4],(-1,s*s,4))
    predbox1 = tf.reshape(y_pred[...,:4],(-1,s*s,4))
    predbox2 = tf.reshape(y_pred[...,5:9],(-1,s*s,4))
    ious = tf.concat((iou(true_box,predbox1)[:,:,tf.newaxis],iou(true_box,predbox2)[:,:,tf.newaxis]),axis=-1)
    maxes = tf.expand_dims(tf.cast(tf.argmax(ious,axis=-1),tf.float32),axis=-1)
    
    true_box_xy = tf.reshape(y_true[...,:2],(-1,s*s,2))
    predbox1_xy = tf.reshape(y_pred[...,:2],(-1,s*s,2))
    predbox2_xy = tf.reshape(y_pred[...,5:7],(-1,s*s,2))
    
    box_pred_xy   = identity_obj*(((1-maxes)*predbox1_xy)+(maxes*predbox2_xy))
    box_target_xy = identity_obj*true_box_xy
    
    true_box_wh = tf.reshape(y_true[...,2:4],(-1,s*s,2))
    predbox1_wh = tf.reshape(y_pred[...,2:4],(-1,s*s,2))
    predbox2_wh = tf.reshape(y_pred[...,7:9],(-1,s*s,2))
    
    
    box_pred_wh   = identity_obj*(((1-maxes)*predbox1_wh)+(maxes*predbox2_wh))
    box_target_wh = identity_obj*true_box_wh
    
    box_pred_wh = tf.sqrt(tf.maximum(box_pred_wh, 1e-6))
    box_target_wh = tf.sqrt(tf.maximum(box_target_wh, 1e-6))
    
    box_wh_loss = tf.losses.mean_squared_error(box_target_wh,box_pred_wh)
    box_xy_loss = tf.losses.mean_squared_error(box_target_xy,box_pred_xy)
    box_loss = tf.reduce_sum(box_wh_loss+box_xy_loss)

    
    #object loss
    true_obj = tf.reshape(y_true[...,4],(-1,s*s,1))
    predobj1 = tf.reshape(y_pred[...,4],(-1,s*s,1))
    predobj2 = tf.reshape(y_pred[...,9],(-1,s*s,1))
    
    obj_box   = (((1-maxes)*predobj1)+(maxes*predobj2))
    obj_loss  = tf.keras.losses.mean_squared_error(identity_obj*obj_box,identity_obj*true_obj)
    obj_loss  = tf.reduce_sum(obj_loss)
    #no-object loss
    no_obj_loss1 = tf.keras.losses.mean_squared_error((1-identity_obj)*predobj1,(1-identity_obj)*true_obj)
    no_obj_loss2 = tf.keras.losses.mean_squared_error((1-identity_obj)*predobj2,(1-identity_obj)*true_obj)
    no_obj_loss  = tf.reduce_sum(no_obj_loss1+no_obj_loss2)

    
    #class loss
    true_class = identity_obj*tf.reshape(y_true[...,5:],(-1,s*s,classes_count))
    pred_class = identity_obj*tf.reshape(y_pred[...,10:],(-1,s*s,classes_count))
    class_loss = tf.reduce_sum(tf.keras.losses.mean_squared_error(true_class,pred_class))
    
    final_loss = (box_loss*lambda_coord)+tf.cast(obj_loss,tf.float32)+(lambda_noobj*tf.cast(no_obj_loss,tf.float32))+tf.cast(class_loss,tf.float32)

    return final_loss


### Training

In [6]:
yolo_r50 = YOLO_r50(gsize=grid_size,classes=classes)
# resume from previously saved weights
yolo_r50.load_weights("checkpoint.h5")
ckpt = ModelCheckpoint("checkpoint_001.h5",monitor="val_loss",save_weights_only=True,save_best_only=True)
yolo_r50.compile(optimizer=Adam(5e-4),loss=yolov1_loss)
# the DataLoader already yields batches, so batch_size is not passed to fit
yolo_r50.fit(train_data,steps_per_epoch=train_steps,epochs=10,
            validation_data=val_data,validation_steps=val_steps,callbacks=[ckpt,ReduceLROnPlateau(patience=3,cooldown=1)])

### Testing 

In [8]:
item = "test_images/6294178496.jpg" #"pascal/464_train_scaled/"+tuple(tfs.sample(1)["image"])[0]
output,copy_tensor   = get_predictions(item,model=yolo_r50)
# use a separate name so the global `classes` (number of classes) is not overwritten
boxes,pred_classes,scores = process_predictions(output,confidence=0.5)
output = get_output(copy_tensor,boxes,pred_classes,scores,label_map=label_map)
cv2.imwrite("assets/output.jpg",output)

True

<p style="text-align:center"><img src="assets/output.jpg" alt="yolov1_od" width="450"/></p>
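
The helpers `get_predictions` and `process_predictions` come from `detect_utils` and are not shown here. As a rough idea of what decoding a grid prediction involves, the sketch below inverts the encoding used by the `DataLoader`: it picks the more confident of the two boxes in each cell, converts it back to normalized corner coordinates, and keeps it if the confidence passes a threshold. The names `decode_cell` and `decode_grid` are hypothetical, and the actual utilities may work differently.

```python
import numpy as np


def decode_cell(i, j, cell_pred, grid_size=7):
    """Decode one cell into (xmin, ymin, xmax, ymax, confidence), normalized."""
    box1, box2 = cell_pred[0:5], cell_pred[5:10]
    best = box1 if box1[4] >= box2[4] else box2  # keep the more confident box
    off_x, off_y, g_w, g_h, conf = best
    cx = (i + off_x) / grid_size  # cell index + offset, back to image scale
    cy = (j + off_y) / grid_size
    w, h = g_w / grid_size, g_h / grid_size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, conf)


def decode_grid(pred, confidence=0.5):
    """Decode every cell of a (S, S, D) prediction and apply a threshold."""
    grid_size = pred.shape[0]
    kept = []
    for i in range(grid_size):
        for j in range(grid_size):
            box = decode_cell(i, j, pred[i, j], grid_size)
            if box[4] >= confidence:
                kept.append(box)
    return kept


# Fake prediction with a single confident box in cell (3, 5).
pred = np.zeros((7, 7, 16), dtype=np.float32)
pred[3, 5, 0:5] = [0.5, 0.25, 3.5, 3.5, 0.9]
print(decode_grid(pred))
```

In practice a non-maximum suppression step (such as the imported `basic_NMS`) would normally follow, to remove overlapping boxes that describe the same object.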

## Conclusion and limitations

* YOLOv1 is not great at detecting dense groups of objects, such as a flock of birds or a group of cars. Why?
  * The bounding-box encoding scheme is limiting: each grid cell predicts only two boxes and a single set of class scores, and the target encoding used here keeps only one object per cell, so densely packed objects whose centres fall in the same cell cannot all be detected.
* How can these flaws be overcome?
  * With anchor boxes. The technique is very similar to the method above, but brings in a bag of tricks to cover most aspect ratios and scales, so that a wide range of objects can be detected.
* How can object detection systems be broadly classified so far?
  * We can classify them as:
    * Single-stage detectors: they use pre-encoded box techniques, as in YOLO and SSD.
    * Multi-stage detectors: they use a model to guess and propose boxes for the task, as in Faster R-CNN, which has an RPN (Region Proposal Network) that replaces the brute-force box encoding technique.
    * End-to-end detectors: Transformer-based networks trained end-to-end with set-based loss functions, currently the most advanced and exciting research topic.
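
To give a feel for the anchor box idea mentioned above, here is a minimal sketch that builds a set of anchors at one location by combining a few scales and aspect ratios. The function name, the scale values, and the ratio values are all assumptions chosen for illustration; real detectors choose them differently, for example by tuning or by clustering the training boxes.

```python
import math


def make_anchors(cx, cy, scales=(0.1, 0.2, 0.4), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors for every scale and aspect-ratio pair."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r is width / height; the area stays at s * s for every ratio.
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


anchors = make_anchors(0.5, 0.5)
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
for a in anchors[:3]:
    print(a)
```

Each location then predicts small corrections relative to every anchor rather than a box from scratch, which is what lets one location cover several object shapes.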