# Word detection system
## (THIS CODE IS NOT COMPLETE)

### This project uses a large amount of the code found [here](https://mlblr.com/includes/mlai/index.html#yolov2) which describes in detail how YOLOv2 works.

This notebook contains all the code necessary for creating and training a convolutional word detection network similar to the YOLO network used to detect physical objects. This notebook uses images as training data along with their associated annotation files which contain bounding box information. The tool I used to annotate my images can be found on my github [here](https://github.com/bradylowe/word-detection-with-conv-net.git).

## Load necessary packages

You will certainly want the GPU version of tensorflow running to train the model.

In [1]:
# Essentials
import math
import os
import numpy as np

# Image processing
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import misc

# Machine learning
import tensorflow as tf
from sklearn.metrics import confusion_matrix

# Gauge performance
import time
from datetime import timedelta

# Program configuration file
import config

In [2]:
tf.__version__

'1.7.0'

## Define some project configuration constants

Here, we make some choices about the overall project. These choices 
will determine the exact functionality of the computation graph 
we will create and train. 

First of all, we need to define the classes we will be trying to detect.

The image height and width set here are chosen because we are expecting
text documents of size 8.5" x 11". The ANCHORS variable stores the width
and height of the anchor boxes \[width1, height1, width2, height2, ...\].
The threshold variables are for determining which boxes have positive
detections.

The scale variables are for determining which predictions are the most important
to get right during training. These weights scale individual terms in the 
custom loss function defined later.

In [3]:
LABELS = config.class_names

#IMAGE_H, IMAGE_W = 800 / 2, 620 / 2
IMAGE_H, IMAGE_W =  416, 416
GRID_H,  GRID_W  = 13, 13
#GRID_H, GRID_W  = 13, 10
ANCHORS          = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
BOX              = len(ANCHORS) // 2
CLASS            = len(LABELS)
CLASS_WEIGHTS    = np.ones(CLASS, dtype='float32')
OBJ_THRESHOLD    = 0.3#0.5
NMS_THRESHOLD    = 0.3#0.45

NO_OBJECT_SCALE  = 1.0
OBJECT_SCALE     = 5.0
COORD_SCALE      = 1.0
CLASS_SCALE      = 1.0

BATCH_SIZE       = 16
WARM_UP_BATCHES  = 0
TRUE_BOX_BUFFER  = 500


## Load dataset

The dataset needed for training a detection network consists of:
 - Bare training images containing words to detect 
 - List of bounding boxes which label each object in the image
   * Need top-left corner (xmin, ymin)
   * Need bottom-right corner (xmax, ymax)
   * Need class contained in box
   * box ==>  \[xmin, ymin, xmax, ymax, cls0, cls1, cls2, cls3, cls4\] where 
     cls( j ) = 1 if this box bounds the j^th class, else 0
   
### (Note about pre-training)
Some pre-training will need to be performed on most of the network. 
This is because of the large number of predictions made in the network.
First, we will train many layers of the network on simple classification.
This means that we will need some images in the dataset that do not contain words.
Pre-training on classification will make detection training much faster and easier.

In [4]:
def get_one_hot_from_label(label):
    cur_class = LABELS.index(label)
    class_vec = np.zeros(CLASS, dtype="float")
    class_vec[cur_class] = 1.
    return class_vec

In [5]:
def load_boxes_for_image(im_path):
    """ 
    Load all the boxes defined in the xml_file sent in.
    A box consists of [class, top_left_corner, bottom_right_corner, parent_grid_cell].
    """
    
    # Calculate xml path from given path
    xml_path = os.path.splitext(im_path)[0] + '.xml'
    
    # Read lines from file
    with open(xml_path, 'r') as f:
        lines = f.readlines()
        
    # Grab image dimensions
    width = int(lines[4].strip()[7:-8])
    height = int(lines[5].strip()[8:-9])
    
    # Read info from boxes
    boxes = []
    for idx, line in enumerate(lines):
        
        if '<object>' in line:
            box = np.zeros(4 + CLASS, dtype="float")
            
            xmin = int(lines[idx + 7].strip()[6:-7])
            ymin = int(lines[idx + 8].strip()[6:-7])
            xmax = int(lines[idx + 9].strip()[6:-7])
            ymax = int(lines[idx + 10].strip()[6:-7])
            box[0:4] = [xmin, ymin, xmax, ymax]
            
            label = lines[idx + 1].strip()[6:-7]
            box[4:] = get_one_hot_from_label(label)
            
            boxes.append(box)
    
    return boxes

In [33]:
def load_all_images_and_boxes(annot_dir, im_dir, max_examples=None):
    
    images = []
    boxes = []
    
    for im_path in os.listdir(im_dir):
        
        if "_boxes" in im_path or '.png' not in im_path:
            continue
        
        # Make paths
        xml_path = os.path.splitext(im_path)[0] + '.xml'
        im_path = os.path.join(im_dir, im_path)
        xml_path = os.path.join(annot_dir, xml_path)
        
        if not os.path.exists(xml_path):
            continue
        
        # Load data
        cur_image = misc.imread(im_path)
        cur_image = misc.imresize(cur_image, (IMAGE_H, IMAGE_W))
        images.append(cur_image)
        boxes = load_boxes_for_image(xml_path)
        
        if max_examples is not None and len(images) >= max_examples:
            break
        
    return images, boxes

In [28]:
def get_random_batch(data, batch_size=64):
    max_index = min(batch_size, len(data))
    perm_idx = np.random.permutation(max_index)
    return (data[perm_idx[idx]] for idx in range(max_index))

## Helper functions for the network computations

Neural networks are complex things with many degrees of freedom. There are a few functions that we must perform repeatedly that take many inputs. With helper functions, we can hard-code some things and reduce the number of inputs to make the problem easier to manage.

In [7]:
def space_to_depth_2x(x):
    return tf.space_to_depth(x, 2)

In [8]:
def concatenate(inputs):
    return tf.concat(values=inputs, axis=-1)

In [9]:
def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

In [10]:
def new_fc_layer(layer, num_inputs, num_outputs, use_relu=True):
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)
    layer = tf.matmul(input, weights) + biases
    if use_relu:
        layer = tf.nn.leaky_relu(layer)
    return layer

In [11]:
def new_conv_layer(layer,              # The previous layer.
                   num_input_channels, # Num. channels in prev. layer.
                   filter_size,        # Width and height of each filter.
                   num_filters,        # Number of filters.
                   use_pooling=True,   # Use 2x2 max-pooling.
                   use_batch_norm=True):
    
    shape = [filter_size, filter_size, num_input_channels, num_filters]
    weights = new_weights(shape=shape)
    
    layer = tf.nn.conv2d(input=layer,
                         filter=weights,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    
    if use_pooling:
        layer = tf.nn.max_pool(value=layer,
                               ksize=[1, 2, 2, 1],
                               strides=[1, 2, 2, 1],
                               padding='SAME')
        
    if use_batch_norm:
        bn_means, bn_variances = tf.nn.moments(layer, [0])
        layer = tf.nn.batch_normalization(layer, bn_means, bn_variances, None, None, 1e-6)
        
    layer = tf.nn.leaky_relu(layer)
    
    return layer, weights

In [38]:
def flatten_layer(layer):
    layer_shape = layer.get_shape()
    num_features = layer_shape[1:4].num_elements()
    layer_flat = tf.reshape(layer, [-1, num_features])
    return layer_flat, num_features

## Create TensorFlow input and output (architecture)

We need to now create the Tensorflow variables and graph. This included placeholders for the inputs (images array and one-hot labels array), as well as a computation we want performed (the network). 

 - The input image placeholder is called x_image
 - The input labels placeholder is called y_true
 - The final output computation is called y_pred
    * y_pred is the arg max of the final output activations from the softmax layer.

In [12]:
input_image = tf.placeholder(tf.float32, shape=[None, IMAGE_H, IMAGE_W, 1], name='input_image')
true_boxes  = tf.placeholder(tf.float32, shape=[None, 1, 1, 1, TRUE_BOX_BUFFER, 4], name='true_boxes')

In [13]:
# Yolo model (greyscale)

# 1 - 5
x, _ = new_conv_layer(input_image, 1, 3, 32)
x, _ = new_conv_layer(x, 32, 3, 64)
x, _ = new_conv_layer(x, 64, 3, 128, use_pooling=False)
x, _ = new_conv_layer(x, 128, 1, 64, use_pooling=False)
x, _ = new_conv_layer(x, 64, 3, 128)

# 6 - 10
x, _ = new_conv_layer(x, 128, 3, 256, use_pooling=False)
x, _ = new_conv_layer(x, 256, 1, 128, use_pooling=False)
x, _ = new_conv_layer(x, 128, 3, 256)
x, _ = new_conv_layer(x, 256, 3, 512, use_pooling=False)
x, _ = new_conv_layer(x, 512, 1, 256, use_pooling=False)

# 11 - 15
x, _ = new_conv_layer(x, 256, 3, 512, use_pooling=False)
x, _ = new_conv_layer(x, 512, 1, 256, use_pooling=False)
x, _ = new_conv_layer(x, 256, 3, 512, use_pooling=False)
skip_con = x
x, _ = new_conv_layer(x, 512, 3, 1024, use_pooling=False)
x, _ = new_conv_layer(x, 1024, 1, 512, use_pooling=False)

# 16 - 20
x, _ = new_conv_layer(x, 512, 3, 1024, use_pooling=False)
x, _ = new_conv_layer(x, 1024, 1, 512, use_pooling=False)
x, _ = new_conv_layer(x, 512, 3, 1024, use_pooling=False)
x, _ = new_conv_layer(x, 1024, 3, 1024, use_pooling=False)
x, _ = new_conv_layer(x, 1024, 3, 1024, use_pooling=False)

# 21
skip_con, _= new_conv_layer(skip_con, 512, 1, 64, use_pooling=False)
#skip_con = space_to_depth_2x(skip_con)

x = concatenate([skip_con, x])


# 22
#x, _ = new_conv_layer(x, 1280, 3, 1024, use_pooling=False)
x, _ = new_conv_layer(x, 1088, 3, 1024, use_pooling=False)

### Predict boxes with confidences and classes

In [14]:
# 23
x, _ = new_conv_layer(x, 1024, 1, BOX * (5 + CLASS), use_pooling=False)

y_pred = tf.reshape(x, (GRID_H, GRID_W, BOX, 4 + 1 + CLASS))

### Predict classes only

In [None]:
# 23
x, _ = new_conv_layer(x, 1024, 1, BOX * CLASS, use_pooling=False)

y_pred = Reshape((GRID_H, GRID_W, BOX, CLASS))(x)

## Define loss 

In [15]:
def custom_loss(y_true, y_pred):
    mask_shape = tf.shape(y_true)[:4]

    cell_x = tf.to_float(tf.reshape(tf.tile(tf.range(GRID_W), [GRID_H]), (1, GRID_H, GRID_W, 1, 1)))
    cell_y = tf.transpose(cell_x, (0,2,1,3,4))

    cell_grid = tf.tile(tf.concat([cell_x,cell_y], -1), [BATCH_SIZE, 1, 1, 5, 1])

    coord_mask = tf.zeros(mask_shape)
    conf_mask  = tf.zeros(mask_shape)
    class_mask = tf.zeros(mask_shape)

    seen = tf.Variable(0.)
    total_recall = tf.Variable(0.)

    """
    Adjust prediction
    """
    ### adjust x and y      
    pred_box_xy = tf.sigmoid(y_pred[..., :2]) + cell_grid

    ### adjust w and h
    pred_box_wh = tf.exp(y_pred[..., 2:4]) * np.reshape(ANCHORS, [1,1,1,BOX,2])

    ### adjust confidence
    pred_box_conf = tf.sigmoid(y_pred[..., 4])

    ### adjust class probabilities
    pred_box_class = y_pred[..., 5:]

    """
    Adjust ground truth
    """
    ### adjust x and y
    true_box_xy = y_true[..., 0:2] # relative position to the containing cell

    ### adjust w and h
    true_box_wh = y_true[..., 2:4] # number of cells accross, horizontally and vertically

    ### adjust confidence
    true_wh_half = true_box_wh / 2.
    true_mins    = true_box_xy - true_wh_half
    true_maxes   = true_box_xy + true_wh_half

    pred_wh_half = pred_box_wh / 2.
    pred_mins    = pred_box_xy - pred_wh_half
    pred_maxes   = pred_box_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_box_wh[..., 0] * true_box_wh[..., 1]
    pred_areas = pred_box_wh[..., 0] * pred_box_wh[..., 1]
    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    true_box_conf = iou_scores * y_true[..., 4]

    ### adjust class probabilities
    true_box_class = tf.argmax(y_true[..., 5:], -1)

    """
    Determine the masks
    """
    ### coordinate mask: simply the position of the ground truth boxes (the predictors)
    coord_mask = tf.expand_dims(y_true[..., 4], axis=-1) * COORD_SCALE

    ### confidence mask: penelize predictors + penalize boxes with low IOU
    # penalize the confidence of the boxes, which have IOU with some ground truth box < 0.6
    true_xy = true_boxes[..., 0:2]
    true_wh = true_boxes[..., 2:4]

    true_wh_half = true_wh / 2.
    true_mins    = true_xy - true_wh_half
    true_maxes   = true_xy + true_wh_half

    pred_xy = tf.expand_dims(pred_box_xy, 4)
    pred_wh = tf.expand_dims(pred_box_wh, 4)

    pred_wh_half = pred_wh / 2.
    pred_mins    = pred_xy - pred_wh_half
    pred_maxes   = pred_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_wh[..., 0] * true_wh[..., 1]
    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    best_ious = tf.reduce_max(iou_scores, axis=4)
    conf_mask = conf_mask + tf.to_float(best_ious < 0.6) * (1 - y_true[..., 4]) * NO_OBJECT_SCALE

    # penalize the confidence of the boxes, which are reponsible for corresponding ground truth box
    conf_mask = conf_mask + y_true[..., 4] * OBJECT_SCALE

    ### class mask: simply the position of the ground truth boxes (the predictors)
    class_mask = y_true[..., 4] * tf.gather(CLASS_WEIGHTS, true_box_class) * CLASS_SCALE

    """
    Warm-up training
    """
    no_boxes_mask = tf.to_float(coord_mask < COORD_SCALE/2.)
    seen = tf.assign_add(seen, 1.)
    true_box_xy, true_box_wh, coord_mask = tf.cond(tf.less(seen, WARM_UP_BATCHES),
                          lambda: [true_box_xy + (0.5 + cell_grid) * no_boxes_mask,
                                   true_box_wh + tf.ones_like(true_box_wh) * np.reshape(ANCHORS, [1,1,1,BOX,2]) * no_boxes_mask,
                                   tf.ones_like(coord_mask)],
                          lambda: [true_box_xy,
                                   true_box_wh,
                                   coord_mask])

    """
    Finalize the loss
    """
    nb_coord_box = tf.reduce_sum(tf.to_float(coord_mask > 0.0))
    nb_conf_box  = tf.reduce_sum(tf.to_float(conf_mask  > 0.0))
    nb_class_box = tf.reduce_sum(tf.to_float(class_mask > 0.0))

    loss_xy    = tf.reduce_sum(tf.square(true_box_xy-pred_box_xy)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_wh    = tf.reduce_sum(tf.square(true_box_wh-pred_box_wh)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_conf  = tf.reduce_sum(tf.square(true_box_conf-pred_box_conf) * conf_mask)  / (nb_conf_box  + 1e-6) / 2.
    loss_class = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class)
    loss_class = tf.reduce_sum(loss_class * class_mask) / (nb_class_box + 1e-6)

    loss = loss_xy + loss_wh + loss_conf + loss_class

    nb_true_box = tf.reduce_sum(y_true[..., 4])
    nb_pred_box = tf.reduce_sum(tf.to_float(true_box_conf > 0.5) * tf.to_float(pred_box_conf > 0.3))

    """
    Debugging code
    """
    current_recall = nb_pred_box/(nb_true_box + 1e-6)
    total_recall = tf.assign_add(total_recall, current_recall)

    loss = tf.Print(loss, [tf.zeros((1))], message='Dummy Line \t', summarize=1000)
    loss = tf.Print(loss, [loss_xy], message='Loss XY \t', summarize=1000)
    loss = tf.Print(loss, [loss_wh], message='Loss WH \t', summarize=1000)
    loss = tf.Print(loss, [loss_conf], message='Loss Conf \t', summarize=1000)
    loss = tf.Print(loss, [loss_class], message='Loss Class \t', summarize=1000)
    loss = tf.Print(loss, [loss], message='Total Loss \t', summarize=1000)
    loss = tf.Print(loss, [current_recall], message='Current Recall \t', summarize=1000)
    loss = tf.Print(loss, [total_recall/seen], message='Average Recall \t', summarize=1000)

    return loss




In [26]:
generator_config = {
    'IMAGE_H'         : IMAGE_H, 
    'IMAGE_W'         : IMAGE_W,
    'GRID_H'          : GRID_H,  
    'GRID_W'          : GRID_W,
    'BOX'             : BOX,
    'LABELS'          : LABELS,
    'CLASS'           : len(LABELS),
    'ANCHORS'         : ANCHORS,
    'BATCH_SIZE'      : BATCH_SIZE,
    'TRUE_BOX_BUFFER' : 50,
}

train_image_folder = config.im_dir
train_annot_folder = train_image_folder

train_images, train_labels = load_all_images_and_boxes(train_annot_folder, train_image_folder)

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
`imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.3.0.
Use Pillow instead: ``numpy.array(Image.fromarray(arr).resize())``.


## Create a Tensorflow session

Tensorflow is not doing anything yet. We have created some Tensorflow structures, but no computuation has been performed. We were just at the drawing board having a discussion.

Now, let's fire up the Tensorflow computation engine, initialize all the variables we had "discussed" before, and perform the planned computations.

In [17]:
session = tf.Session()

In [18]:
session.run(tf.global_variables_initializer())

In [35]:
cost = custom_loss(train_labels, y_pred)

optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999).minimize(cost)

ValueError: Dimensions must be equal, but are 9 and 5 for 'add_13' (op: 'Add') with input shapes: [9], [16,13,13,5].

### More helper functions for optimization, evaluation, and printing

Now that we have a "session" defined and running, we will write a couple more helper functions that use a reference to that session. These helper functions will actually call on tensorflow to compute predictions given input (and sometimes, perform optimization)

In [29]:
train_batch_size = 64
total_iterations = 0

def optimize(num_iterations):
    
    global total_iterations
    start_time = time.time()

    for i in range(total_iterations,
                   total_iterations + num_iterations):
        x_batch, y_true_batch = \
            get_random_batch((train_images, train_labels), train_batch_size)
        feed_dict_train = {input_image: train_images,
                           true_boxes: train_labels}
        session.run(optimizer, feed_dict=feed_dict_train)
        
        if i % 100 == 0:
            acc = session.run(accuracy, feed_dict=feed_dict_train)
            msg = "Optimization Iteration: {0:>6}, Training Accuracy: {1:>6.1%}"
            print(msg.format(i + 1, acc))
            
            
    total_iterations += num_iterations
    end_time = time.time()
    time_dif = end_time - start_time
    print("Time usage: " + str(timedelta(seconds=int(round(time_dif)))))

## Update the weights using the optimizer

The job of the optimizer is to optimize the value of the weights in each layer so that the computation graph results in the correct label for the given input.

Run the optimizer for some number of iteration. The model learns very quickly (under 100 iterations) how to correctly classify images at least 80% of the time. It takes a bit longer (over 1,000 iterations) to drive that accuracy up to 98%. 

Optimizing the hyper-parameters (changing model architecture, optimiziation algorithm, using regularization, changing data input, etc.) can help get the accuracy above 99%.

In [34]:
optimize(num_iterations=1)

NameError: name 'optimizer' is not defined

In [37]:
def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the preditions.
    
    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.
    
    Returns:
    out_scores -- tensor of shape (None, ), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None, ), class index of the predicted boxes
    
    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes. 
    """

    # Preprocess your image
    image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

    # Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
    # You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})
    ### START CODE HERE ### (≈ 1 line)
    out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes], feed_dict={yolo_model.input: image_data, K.learning_phase(): 0})
    ### END CODE HERE ###

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results in the notebook
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    imshow(output_image)
    
    return out_scores, out_boxes, out_classes

In [3]:
session.close()

NameError: name 'session' is not defined