# Object Detection  

This example program shows how you can use dlib to make a HOG-based object detector for things like faces, pedestrians, and any other semi-rigid object. In particular, we go through the steps to train the kind of sliding window object detector first published by Dalal and Triggs in 2005 in the paper Histograms of Oriented Gradients for Human Detection.

It uses the dlib library (conda install -c conda-forge dlib=19.4).

Note: I installed dlib in the "dlib" Python environment and not the tf_nn environment because I saw a warning that some packages would have been downgraded.



## Create the XML files
DLIB requires images and bounding boxes around the labelled objects. It has its own structure for the XML files:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="RunBib" color="#032585"/>
    </tags>
    <images>
        <image file="B:/DataSets/2016_USATF_Sprint_TrainingDataset/_hsp3997.jpg">
            <box top="945" left="887" width="85" height="53">
                <label>RunBib</label>
            </box>
            <box top="971" left="43" width="103" height="56">
                <label>RunBib</label>
            </box>
            <box top="919" left="533" width="100" height="56">
                <label>RunBib</label>
            </box>
        </image>
        <image file="B:/DataSets/2016_USATF_Sprint_TrainingDataset/_hsp3989.jpg">
            <box top="878" left="513" width="111" height="62">
                <label>my_label</label>
            </box>
        </image>     
   </images>
</dataset>

* top: y coordinate of the box's top-left corner
* left: x coordinate of the box's top-left corner
* width: width of the box (positive to the right)
* height: height of the box (positive down)
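
To make the coordinate convention concrete, here is a minimal sketch that reads boxes in this XML layout with Python's standard library and converts each one to corner coordinates. The `read_boxes` helper and the `example.jpg` file name are hypothetical and only for illustration; dlib itself does not require this step.

```python
import xml.etree.ElementTree as ET

# A tiny dataset in the same layout as above (the file name is made up).
SAMPLE_XML = """<dataset>
    <images>
        <image file="example.jpg">
            <box top="945" left="887" width="85" height="53">
                <label>RunBib</label>
            </box>
        </image>
    </images>
</dataset>"""


def read_boxes(xml_text):
    """Return a list of (file, label, left, top, right, bottom) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for image in root.iter("image"):
        for box in image.iter("box"):
            left = int(box.get("left"))
            top = int(box.get("top"))
            # x grows to the right and y grows downward, so the opposite
            # corner (treated here as an exclusive edge) is found by
            # adding the width and the height.
            right = left + int(box.get("width"))
            bottom = top + int(box.get("height"))
            label = box.findtext("label", default="")
            boxes.append((image.get("file"), label, left, top, right, bottom))
    return boxes


boxes = read_boxes(SAMPLE_XML)
print(boxes)
```

The same loop can be pointed at a real file by reading its text first, which makes it easy to sanity-check a dataset before training.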

To create your own XML files you can use the imglab tool which can be found in the tools/imglab folder.  It is a simple graphical tool for labeling objects in images with boxes.  To see how to use it read the tools/imglab/README.txt file.  But for this example, we just use the training.xml file included with dlib.

It's a two-part process to load the tagger.
1.) Create the dataset XML file from a folder of images by typing the following command:
#####    b:\HoodMachineLearning\dlib\tools\build\Release\imglab.exe -c mydataset.xml B:\HoodMachineLearning\datasets\MyImage
2.) Open the XML file in the imglab GUI to draw and label the boxes:
####     b:\HoodMachineLearning\dlib\tools\build\Release\imglab.exe mydataset.xml

## Image pyramids and sliding windows

The technique uses image pyramids and sliding windows to minimize the effect of object location and object size. The pyramid is a set of progressively subsampled copies of the image; the sliding window keeps a fixed size and moves from left to right and top to bottom across each scale of the image.

### Image Pyramids
<img src="ImagePyramid.jpg">

Note: Dalal and Triggs showed that performance is reduced if you apply Gaussian smoothing at each layer, so skip this step.

### Sliding Window

<img src="sliding_window_example.gif" loop=3>

* It is common to use a stepSize of 4 to 8 pixels
* windowSize is the size of the kernel. An object detector will work best if the aspect ratio of the kernel is close to that of the desired object. Note: The sliding window size is also important for the HOG filter. For the HOG filter two parameters are important: <b>pixels_per_cell</b> and <b>cells_per_block</b>
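
As a rough illustration of how the pyramid and the sliding window combine, the sketch below only generates layer sizes and window positions, so it needs no image library. The function names, the scale factor of 1.5, and the minimum layer size are assumptions made for this example, not values taken from dlib.

```python
def pyramid_sizes(width, height, scale=1.5, min_size=(32, 32)):
    """Yield the (width, height) of each pyramid layer, largest first."""
    w, h = width, height
    # Stop as soon as a layer would be smaller than the minimum size.
    while w >= min_size[0] and h >= min_size[1]:
        yield (w, h)
        w, h = int(w / scale), int(h / scale)


def sliding_window(width, height, step_size, window_size):
    """Yield the (x, y) top-left corner of every window that fits fully."""
    win_w, win_h = window_size
    for y in range(0, height - win_h + 1, step_size):
        for x in range(0, width - win_w + 1, step_size):
            yield (x, y)


# Count how many fixed-size windows are visited over all layers.
window = (96, 32)
total = 0
for layer_w, layer_h in pyramid_sizes(640, 480, min_size=window):
    total += sum(1 for _ in sliding_window(layer_w, layer_h, 8, window))
print("windows visited:", total)
```

Because the window stays the same size while the image shrinks, each smaller layer lets the same window cover a larger object in the original image.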

In order to avoid having to 'guess' at the best window size that will satisfy both the object detector requirements and the HOG requirements, an "explore_dims.py" script is used.

1.) Meet object detection requirements: load all the images and compute the average width, the average height, and the aspect ratio from those values.
2.) Meet HOG requirements: the PyImageSearch rule of thumb is to divide the above values by two (i.e., 1/4th the average area)
    * This reduces the size of the HOG feature vector
    * By dividing by two, a nice balance is struck between HOG feature vector size and reasonable window size.
    * Note: Our sliding_window dimension needs to be divisible by pixels_per_cell and cells_per_block so that the HOG descriptor will 'fit' into the window size
    * It's common for 'pixels_per_cell' to be a multiple of 4 and cells_per_block in the set (1,2,3)
    * Start with pixels_per_cell=(4,4) and cells_per_block=(2,2)
    * For example, in the PyImageSearch example, average W: 184 and average H: 62. Divide by 2 ==> 92,31
    * Find values close to 92,31 that are divisible by 4 (and 2): 96,32  (Easy)
    * OBSERVATION: When defining the bounding boxes, it is best if all are around the same size. This can be difficult.
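
The arithmetic above can be sketched in a few lines. This is not the actual explore_dims.py script; `suggest_window_size` is a hypothetical helper, and rounding to a multiple of pixels_per_cell * cells_per_block is an assumption chosen because it reproduces the 96,32 result from the example.

```python
def round_to_multiple(value, multiple):
    """Round value to the nearest positive multiple of `multiple`."""
    return max(multiple, int(value / multiple + 0.5) * multiple)


def suggest_window_size(box_sizes, pixels_per_cell=4, cells_per_block=2):
    """Suggest a (width, height) window from a list of (width, height) boxes."""
    avg_w = sum(w for w, _ in box_sizes) / len(box_sizes)
    avg_h = sum(h for _, h in box_sizes) / len(box_sizes)
    # Assumption: make each side a multiple of the block size in pixels
    # so that whole HOG blocks fit into the window.
    multiple = pixels_per_cell * cells_per_block
    # Rule of thumb from the text: use half the average width and height.
    return (round_to_multiple(avg_w / 2, multiple),
            round_to_multiple(avg_h / 2, multiple))


window_size = suggest_window_size([(180, 60), (188, 64)])
print(window_size)
```

With an average box of 184 by 62 the halved values are 92 and 31, which round to 96 and 32 under these assumptions.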

### The 6 Step Framework
1. Sample P positive samples from your training data of the objects you want to detect. Extract HOG features from these objects.
    * If given a general image containing the object, bounding boxes that indicate the location of the object will also need to be given
2. Sample N negative samples that do not contain the object and extract HOG features. In general N>>P. (I'd suggest images similar in size and aspect ratio to the P samples. I'd also avoid the bounding boxes and make the entire image the negative image. PyImageSearch recommends using the 13 Natural Scene Categories dataset from the vision.stanford.edu/resources_links.html page.)
3. Train a Linear Support Vector Machine (SVM) on the negative images (class 0) and positive images (class 1)
4. Hard Negative Mining: for the N negative images, apply the sliding window and test the classifier. Ideally, every window should return 0. If a window returns 1, indicating an incorrect classification, add it to the training set (for the next round of re-training)
5. Re-train the classifier with the added images from Hard Negative Mining (usually once is enough)
6. Apply the classifier against the test dataset and define a box around each region of high probability. When finished with the image, use "non-maximum suppression" to remove redundant and overlapping bounding boxes, keeping the boxed region with the highest probability as the final box.
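
Step 6 can be illustrated with a small greedy non-maximum suppression sketch. This is one common variant based on intersection over union, written in plain Python; the function names and the 0.3 overlap threshold are assumptions for the example and it is not the routine dlib uses internally.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    """Keep the highest-scoring box in each group of overlapping boxes."""
    # Visit boxes from the most to the least confident detection.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a box already kept.
        if all(iou(boxes[i], boxes[k]) <= overlap_thresh for k in keep):
            keep.append(i)
    return [boxes[i] for i in keep]


detections = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
confidences = [0.9, 0.8, 0.7]
final_boxes = non_max_suppression(detections, confidences)
print(final_boxes)
```

The first two detections overlap heavily, so only the more confident one survives, while the third box is far away and is kept.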

#### Note on DLIB library
* Similar to the 6 step framework but uses the entire training image to get the P's (indicated by bounding boxes) and the N's (the regions not containing bounding boxes). Note: It looks like it is important that all of the objects are identified in the image. For example, when doing running bibs, I may ignore some bibs for some reasons (too small, partially blocked, too many). My guess is that these images should simply be avoided. This technique eliminates steps 2, 4, and 5.
* non-maximum suppression is applied during the training phase, helping to reduce false positives
* dlib uses a highly accurate SVM engine to find the hyperplane separating the TWO classes.


#### Use a JSON file to hold the hyper-parameters
{

"faces_folder": "B:\\DataSets\\2016_USATF_Sprint_TrainingDataset",
"myTrainingFilename": "trainingset_small.xml",
"myTestingFilename": "trainingset_small.xml",
"myDetector": "detector.svm"
}
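
A minimal sketch of reading such a file with Python's json module follows. The `load_config` helper and the `config.json` file name are hypothetical; the demo writes a temporary copy of the settings first so that it runs anywhere.

```python
import json
import os
import tempfile

# Keys the rest of the notebook expects to find in the config file.
REQUIRED_KEYS = ("faces_folder", "myTrainingFilename",
                 "myTestingFilename", "myDetector")


def load_config(path):
    """Load the hyper-parameter file and check that no key is missing."""
    with open(path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    return config


example = {
    "faces_folder": "B:\\DataSets\\2016_USATF_Sprint_TrainingDataset",
    "myTrainingFilename": "trainingset_small.xml",
    "myTestingFilename": "trainingset_small.xml",
    "myDetector": "detector.svm",
}

# Write the example to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(example, handle, indent=4)
    config = load_config(config_path)

training_xml_path = os.path.join(config["faces_folder"],
                                 config["myTrainingFilename"])
print(training_xml_path)
```

Keeping the paths in one file means the training cells do not need to be edited when the dataset moves.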

#### Load and Dump hdf5 file
* hdf5 provides efficient data storage


In [8]:
import os
import glob
import dlib
import cv2

faces_folder: Path to the main photos folder
myTrainingFilename: Filename of the XML file describing the training set of photos
myTestingFilename: Filename of the XML file describing the testing set (not used for training, just for quantifying performance)
myDetector: I think: filename of the model that is created and then used

In [9]:
faces_folder ="B:\\DataSets\\2016_USATF_Sprint_TrainingDataset"
myTrainingFilename="trainingset_small.xml"
myTestingFilename="trainingset_small.xml"
myDetector="detector.svm"

### Set up the dlib simple object detector training options
The train_simple_object_detector() function has a bunch of options, all of which come with reasonable default values. The next few lines go over some of these options.
### Select the C Value
    # The trainer is a kind of support vector machine and therefore has the usual
    # SVM C parameter.  In general, a bigger C encourages it to fit the training
    # data better but might lead to overfitting.  You must find the best C value
    # empirically by checking how well the trained detector works on a test set of
    # images you haven't trained on.  Don't just leave the value set at 5.  Try a
    # few different C values and see what works best for your data.

In [10]:
#def train_object_detector(faces_folder,myTraining,myTesting,myDetector,myDetector2):
options = dlib.simple_object_detector_training_options()
options.add_left_right_image_flips = True
options.C = 5
# Tell the code how many CPU cores your computer has for the fastest training.
# Note: DLIB does not use the GPU
options.num_threads = 4 
options.be_verbose = True

In [11]:
training_xml_path = os.path.join(faces_folder, myTrainingFilename)
print(training_xml_path)
testing_xml_path = os.path.join(faces_folder, myTestingFilename)
print(testing_xml_path)

B:\DataSets\2016_USATF_Sprint_TrainingDataset\trainingset_small.xml
B:\DataSets\2016_USATF_Sprint_TrainingDataset\trainingset_small.xml


### Begin Training
This function does the actual training. It will save the final detector to the file specified by myDetector. The input is an XML file that lists the images in the training dataset and also contains the positions of the object boxes.


In [12]:
dlib.train_simple_object_detector(training_xml_path, myDetector, options)

RuntimeError: 
Error! An impossible set of object boxes was given for training. All the boxes 
need to have a similar aspect ratio and also not be smaller than about 400 
pixels in area. The following images contain invalid boxes: 
  B:/DataSets/2016_USATF_Sprint_TrainingDataset/dsc_4062.jpg
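
One way to hunt down the offending boxes before training is to scan the XML for boxes that are small or oddly shaped. The sketch below is only a rough pre-check: the 400 pixel area limit comes from the error message, but the median-based aspect ratio test, its 0.5 tolerance, and the `find_suspect_boxes` name are assumptions and do not reproduce dlib's exact rule.

```python
import statistics
import xml.etree.ElementTree as ET

# Made-up sample: three similar boxes and one tiny box.
SAMPLE_XML = """<dataset><images>
  <image file="a.jpg">
    <box top="945" left="887" width="85" height="53"/>
    <box top="971" left="43" width="103" height="56"/>
    <box top="919" left="533" width="100" height="56"/>
  </image>
  <image file="b.jpg">
    <box top="10" left="10" width="15" height="20"/>
  </image>
</images></dataset>"""


def find_suspect_boxes(xml_text, min_area=400, ratio_tolerance=0.5):
    """Return (file, width, height) for boxes likely to upset the trainer."""
    root = ET.fromstring(xml_text)
    records = []
    for image in root.iter("image"):
        for box in image.iter("box"):
            records.append((image.get("file"),
                            int(box.get("width")), int(box.get("height"))))
    valid = [(f, w, h) for f, w, h in records if w > 0 and h > 0]
    if not valid:
        return records
    # Compare every box against the median width/height ratio.
    median_ratio = statistics.median(w / h for _, w, h in valid)
    suspects = []
    for file_name, w, h in records:
        if w <= 0 or h <= 0 or w * h < min_area:
            suspects.append((file_name, w, h))
        elif abs(w / h - median_ratio) / median_ratio > ratio_tolerance:
            suspects.append((file_name, w, h))
    return suspects


suspects = find_suspect_boxes(SAMPLE_XML)
print(suspects)
```

Boxes reported this way can be redrawn in imglab or removed from the XML before calling the trainer again.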


In [None]:
print("")  # Print blank line to create gap from previous output

In [None]:
print("Training accuracy: {}".format(dlib.test_simple_object_detector(training_xml_path, myDetector)))

In [None]:
print("Testing accuracy: {}".format(dlib.test_simple_object_detector(testing_xml_path, myDetector)))