# Architecture

For the object detection task, we used the approach known as YOLO. It splits the image into a grid and makes a prediction for each grid cell. The advantage of this approach is that it predicts the objects in the entire image in a single pass.

## Model

Our base model is a convolutional neural network (CNN). We chose this type of network because we work with images, and CNNs are designed to operate on them. As <em>Fig. 1</em> shows, the model consists of 4 convolutional layers followed by 2 fully connected layers. The convolutional layers extract features; the fully connected layers operate on these features and make the predictions. Every layer uses the ReLU activation except the last one, which uses a sigmoid because the outputs have to lie between 0.0 and 1.0.
<figure>
    <img src="images/model.svg"/>
    <figcaption><small>Fig.1: Model used for object detection. For readability, the last fully connected layer is omitted from the figure; it is very wide and would make the figure hard to read.</small></figcaption>
</figure>

Our base model performs poorly, as shown in the Experiments section. It is small and cannot even overfit 1,000 images. As next steps, we plan to extend the model with additional convolutional layers to get a better feature extractor. We also plan to try a fully convolutional architecture based on the Darknet-19 model [1].


[1] https://pjreddie.com/media/files/papers/YOLO9000.pdf

## Logging of results

We use TensorBoard for logging. To monitor long training runs, we log the loss and the metric on both the training and validation sets at the end of each epoch. At the end of training, we also log the hyperparameters (batch size, number of epochs, lambda coordinate, lambda no object and learning rate) together with the best values of the loss and the metric.

### Saving of models

To keep trained models, we save the weights of the best model during training, where the best model is the one with the lowest validation loss.
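
The selection logic can be illustrated with a minimal, framework-free sketch. The class name `BestWeightsKeeper`, the loss values and the placeholder weights are hypothetical and only show the idea; the project's actual saving code is not reproduced here.

```python
class BestWeightsKeeper:
    """Keeps a copy of the weights with the lowest validation loss seen so far."""

    def __init__(self):
        self.best_val_loss = float("inf")
        self.best_weights = None
        self.best_epoch = None

    def update(self, epoch, val_loss, weights):
        """Store `weights` if `val_loss` improves on the best value so far."""
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.best_weights = list(weights)  # shallow copy of the weight list
            self.best_epoch = epoch
            return True
        return False


keeper = BestWeightsKeeper()
# Hypothetical validation losses, one per epoch.
history = [0.90, 0.70, 0.75, 0.60, 0.65]
for epoch, val_loss in enumerate(history):
    weights = [epoch]  # placeholder standing in for real model weights
    keeper.update(epoch, val_loss, weights)

print(keeper.best_epoch, keeper.best_val_loss)
```

In Keras, the `ModelCheckpoint` callback with `monitor="val_loss"` and `save_best_only=True` provides comparable behavior; whether the project uses that callback is an assumption.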

## Evaluation

### Loss function

We adapted the YOLO loss function from https://github.com/ecaradec/humble-yolo. This loss function, itself adapted from the original YOLO paper (https://arxiv.org/pdf/1506.02640.pdf), is composed of multiple parts. The first part penalizes the difference between the predicted and the true bounding box center.
<figure>
    <img src="images/center-loss.png" width=600/>
    <figcaption><small>Fig.2: Loss on center of predicted bounding box.</small></figcaption>
</figure>

The second part penalizes the difference between the predicted and the true bounding box width and height.
<figure>
    <img src="images/width-height-loss.png" width=600/>
    <figcaption><small>Fig.3: Loss on width and height of predicted bounding box.</small></figcaption>
</figure>

The third part is the confidence loss. Its first term applies when there is an object inside the true bounding box, and its second term applies when there is no object.
<figure>
    <img src="images/confidence-loss.png" width=600/>
    <figcaption><small>Fig.4: Loss on confidence of predicted bounding box.</small></figcaption>
</figure>

The fourth part penalizes misclassification of the object inside the predicted bounding box. We have not implemented this part yet because we currently focus only on correct bounding box and confidence predictions.
<figure>
    <img src="images/class-loss.png" width=600/>
    <figcaption><small>Fig.5: Loss on predicted category of object.</small></figcaption>
</figure>
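
As a rough illustration of the first three parts, the following NumPy sketch follows the sum-of-squares formulation of the original YOLO paper. It is not the project's implementation. It assumes that each grid cell holds a vector `[confidence, x, y, w, h]`, that the true confidence is 1 for cells containing an object and 0 otherwise, and that `lambda_coord=5.0` and `lambda_noobj=0.5` (the defaults from the paper) are used.

```python
import numpy as np


def yolo_loss(y_true, y_pred, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-of-squares YOLO loss without the classification term.

    Both arrays have shape (..., 5); the last axis is [confidence, x, y, w, h].
    Widths and heights must be non-negative (e.g. sigmoid outputs).
    """
    obj = y_true[..., 0]   # 1 where a cell contains an object, else 0
    noobj = 1.0 - obj

    # Part 1: squared error of the box center, only for cells with an object.
    center = np.sum(obj * ((y_true[..., 1] - y_pred[..., 1]) ** 2
                           + (y_true[..., 2] - y_pred[..., 2]) ** 2))

    # Part 2: squared error of the square roots of width and height.
    size = np.sum(obj * ((np.sqrt(y_true[..., 3]) - np.sqrt(y_pred[..., 3])) ** 2
                         + (np.sqrt(y_true[..., 4]) - np.sqrt(y_pred[..., 4])) ** 2))

    # Part 3: confidence error, weighted differently for object / no-object cells.
    conf_err = (y_true[..., 0] - y_pred[..., 0]) ** 2
    conf_obj = np.sum(obj * conf_err)
    conf_noobj = np.sum(noobj * conf_err)

    return lambda_coord * (center + size) + conf_obj + lambda_noobj * conf_noobj


# Tiny 2 x 2 grid with one object in cell (0, 0).
y_true = np.zeros((2, 2, 5))
y_true[0, 0] = [1.0, 0.5, 0.5, 0.25, 0.25]
y_pred = y_true.copy()
y_pred[0, 0, 1] = 0.7      # shift the predicted center along x
print(yolo_loss(y_true, y_pred))
```

The square roots in the second part make an error of a given size count more for small boxes than for large ones, which is the motivation given in the YOLO paper.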

### Metrics

We use the **F1 score** to evaluate the quality of our model. To calculate it, we need to count the **true positives (TP)**, **false positives (FP)** and **false negatives (FN)** among the model's predictions.

Here are our definitions of **TP**, **FP** and **FN**:

 - **TP:** all predicted bounding boxes where the IoU (Intersection over Union) is greater than the threshold and there is an object in the predicted bounding box
 - **FN:** all predicted bounding boxes where the IoU (Intersection over Union) is lower than the threshold and there is an object in the predicted bounding box
 - **FP:** all predicted bounding boxes where the IoU (Intersection over Union) is greater than the threshold and there is no object in the predicted bounding box
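
The two building blocks of this metric, IoU and the F1 score itself, can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example and may differ from the project's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for boxes that do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp, fp, fn):
    """F1 = 2 * TP / (2 * TP + FP + FN); defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
print(f1_score(tp=8, fp=2, fn=2))
```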

We also plan to use the **Mean Average Precision** metric; however, our implementation does not work correctly yet.

# Data

We used data from the MS COCO 2017 challenge (http://cocodataset.org/#home). We used XXXX images for training, YYYY for validation and ZZZZ for testing.

## Data loading

To load images and their annotations, we used the COCO API available at https://github.com/cocodataset/cocoapi. After loading the images and annotations, we compressed them into a single .npz file using the function **compress_coco** from **scripts/compress_coco.py**. Whenever we need the images for analysis or training, we load them from this .npz file.

Compressed images and annotations are loaded with the function **load_dataset** from **src/data/load_data.py**. This function simply loads the .npz file and reads the image and annotation data from it.
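
The round trip through a .npz file can be sketched with NumPy alone. The array names `images` and `annotations`, the array shapes and the in-memory buffer (standing in for a file on disk) are assumptions for this example; the actual layout written by **compress_coco** may differ.

```python
import io

import numpy as np

# Dummy data: 2 RGB images of 4 x 4 pixels and one annotation row per image.
images = np.zeros((2, 4, 4, 3), dtype=np.uint8)
annotations = np.array([[0, 1.0, 1.0, 2.0, 2.0],
                        [1, 0.0, 0.0, 3.0, 3.0]], dtype=np.float32)

# Write both arrays into one compressed archive.
buffer = io.BytesIO()  # stands in for a path such as "dataset.npz"
np.savez_compressed(buffer, images=images, annotations=annotations)

# Read the arrays back by the names they were saved under.
buffer.seek(0)
with np.load(buffer) as data:
    loaded_images = data["images"]
    loaded_annotations = data["annotations"]

print(loaded_images.shape, loaded_annotations.shape)
```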

Before compression, it is possible to choose how many images will be compressed (and therefore used later) and which object categories they should contain.

## Image preprocessing

### Resizing of images

Images in the MS COCO dataset come in various sizes. To use them as input to a neural network, we had to resize all of them to a constant shape. We chose 256 × 256 pixels, a square shape, because the dataset contains both portrait and landscape images. To change the shape, we simply stretched each image dimension that was smaller than 256 pixels and shrank each one that was larger. We believe this way of resizing loses the least information from the original image. Resizing is done with the **tensorflow** function **resize**, called from **resize_images** in **src/data/preprocessing.py**.

We could also have padded or cropped the images. We chose resizing because padding introduces an edge between the image and the padded area, which the network could interpret as part of an object. Cropping could cut off parts of objects, or even entire objects, and therefore reduce the number of objects available for training.
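
To show what stretching to a fixed square shape does, here is a small NumPy sketch that uses nearest-neighbour sampling. It is a stand-in for illustration only: the project resizes with TensorFlow, and the function name `stretch_resize` is hypothetical.

```python
import numpy as np


def stretch_resize(image, size=256):
    """Resize an (H, W, C) image to (size, size, C) by nearest-neighbour sampling.

    Height and width are scaled independently, so the aspect ratio is not kept.
    """
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size  # source row for every output row
    cols = np.arange(size) * width // size   # source column for every output column
    return image[rows[:, None], cols]


landscape = np.zeros((480, 640, 3), dtype=np.uint8)
portrait = np.zeros((100, 50, 3), dtype=np.uint8)
print(stretch_resize(landscape).shape, stretch_resize(portrait).shape)
```

Both the landscape and the portrait image end up with the same shape, and no pixels are cut away or padded in.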

### Scaling of pixel values

Because neural networks are sensitive to the scale of their input, we scaled all pixel values to the interval [0.0, 1.0]. This reduces the absolute differences between pixel values, so no pixel dominates because of its large magnitude, which helps the network learn more reliably than with unscaled values.
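
A minimal sketch of this scaling, assuming the images are stored as 8-bit values in the range 0 to 255 (the function name `scale_pixels` is hypothetical):

```python
import numpy as np


def scale_pixels(images):
    """Map 8-bit pixel values 0..255 to float32 values between 0.0 and 1.0."""
    return images.astype(np.float32) / 255.0


batch = np.array([[[0, 128, 255]]], dtype=np.uint8)
scaled = scale_pixels(batch)
print(scaled.min(), scaled.max(), scaled.dtype)
```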

## Annotations preprocessing

The original MS COCO annotations contain various information, such as bounding boxes, image id, category id and annotation id. However, we use only the bounding boxes and the image id; later we will also use the category id. To use these annotations with the YOLO method for object detection, we convert them into YOLO vectors.

### YOLO vector

A YOLO vector consists of several values. The first value is the confidence that there is an object inside the bounding box. The second (**x**) and third (**y**) values are the coordinates of the bounding box center, expressed relative to the grid cell that contains the center. The fourth and fifth values are the width and height of the bounding box relative to the size of the image.
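
The encoding can be sketched as follows. This is a simplified illustration rather than the project's code. It assumes COCO-style boxes `[x_min, y_min, width, height]` in pixels of the resized 256 × 256 image, a hypothetical 8 × 8 grid, and at most one object per grid cell (a later box overwrites an earlier one in the same cell).

```python
import numpy as np


def encode_boxes(boxes, image_size=256, grid_size=8):
    """Turn pixel boxes into a (grid_size, grid_size, 5) array of YOLO vectors.

    Each vector is [confidence, x, y, w, h]; cells without an object stay zero.
    """
    target = np.zeros((grid_size, grid_size, 5), dtype=np.float32)
    cell_size = image_size / grid_size
    for x_min, y_min, width, height in boxes:
        cx = x_min + width / 2   # box center in pixels
        cy = y_min + height / 2
        col = min(int(cx // cell_size), grid_size - 1)  # cell containing the center
        row = min(int(cy // cell_size), grid_size - 1)
        target[row, col] = [
            1.0,                   # an object is present in this cell
            cx / cell_size - col,  # center x relative to the cell, in [0, 1)
            cy / cell_size - row,  # center y relative to the cell, in [0, 1)
            width / image_size,    # width relative to the image
            height / image_size,   # height relative to the image
        ]
    return target


target = encode_boxes([[100, 64, 64, 32]])
print(target[2, 4])
```

Here the box center lies at pixel (132, 80), which falls into the cell in row 2 and column 4 of the grid; all other cells remain zero.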

# Experiments

## Sanity check