
## Multi-Digit Number Recognition

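Background: to add buildings to Google Maps, Street View cars photograph buildings and record the GPS coordinates associated with the building in each photo. Recognizing the house number in each image automatically lets the Google Maps database attach the building's address to the correct location.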

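Multi-digit number recognition is a special kind of sequence recognition. What makes it special is that the sequence length is bounded: house numbers rarely contain more than 5 digits. This example assumes a house number has at most 5 digits.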
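Because the length is bounded, a house number can be stored as a fixed-size label: its length plus five digit slots. The sketch below illustrates one such encoding. The helper name `encode_house_number` and the padding class `10` for unused slots are assumptions made for illustration (the 11-way digit outputs used later in this example leave room for such an extra class).

```python
MAX_DIGITS = 5  # longest house number assumed in this example
BLANK = 10      # assumed padding class for unused digit slots


def encode_house_number(number):
    """Return (length, digits), with digits padded to MAX_DIGITS entries."""
    if not number.isdigit():
        raise ValueError('expected a string of decimal digits')
    if len(number) > MAX_DIGITS:
        raise ValueError('house number longer than %d digits' % MAX_DIGITS)
    digits = [int(ch) for ch in number]
    digits += [BLANK] * (MAX_DIGITS - len(number))
    return len(number), digits


print(encode_house_number('258'))  # (3, [2, 5, 8, 10, 10])
```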

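Traditional approach: the task is split into three stages: localization, segmentation, and recognition. [Goodfellow, 2014](https://arxiv.org/abs/1312.6082) proposed an approach that unifies these three stages in a single deep convolutional neural network, and that approach is the main guide for this example. The code here draws mainly on [potterhsu](https://github.com/potterhsu/SVHNClassifier) and [thomalm](https://github.com/thomalm/svhn-multi-digit).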

In [1]:
import os
import numpy as np
import tensorflow as tf
from PIL import Image
from datetime import datetime
import json
import time

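For data processing, some digits in the SVHN dataset are cropped so tightly that parts of the address digits are cut off. The crop region therefore needs to be enlarged: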
> expand this bounding box by 30% in both the x and the y direction

In [2]:
def preProcess(image, bbox_left, bbox_top, bbox_width, bbox_height):
    # Grow the box by 30% in each direction (15% on every side), keeping it centered
    cropped_left, cropped_top, cropped_width, cropped_height = (int(round(bbox_left - 0.15 * bbox_width)),
                                                                int(round(bbox_top - 0.15 * bbox_height)),
                                                                int(round(bbox_width * 1.3)),
                                                                int(round(bbox_height * 1.3)))
    image = image.crop([cropped_left, cropped_top, cropped_left + cropped_width, cropped_top + cropped_height])
    # Resize the expanded crop to the fixed 64x64 input size
    image = image.resize([64, 64])
    return image
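To make the arithmetic of the expansion concrete, here is a small sketch that computes only the expanded box, without touching an image. The helper name `expand_bbox` and the sample box are hypothetical.

```python
def expand_bbox(left, top, width, height):
    """Expand a bounding box by 30% in x and y, keeping it centered."""
    new_left = int(round(left - 0.15 * width))
    new_top = int(round(top - 0.15 * height))
    new_width = int(round(width * 1.3))
    new_height = int(round(height * 1.3))
    return new_left, new_top, new_width, new_height


# A 40x20 box at (100, 50) grows to 52x26 and shifts up and to the left.
print(expand_bbox(100, 50, 40, 20))  # (94, 47, 52, 26)
```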

To increase the number of training samples, several 54$*$54 pixel crops are taken at random locations from each original 64$*$64 pixel image:
> crop a 54$*$54 pixel image from a random location within the 64$*$64 pixel image

In [3]:
class Donkey(object):
    @staticmethod
    def _preprocess(image):
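        # Scale uint8 pixels to [0, 1], then shift to [-1, 1]
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
        image = tf.multiply(tf.subtract(image, 0.5), 2)
        # decode_raw yields a flat vector, so restore the 64x64 RGB shape before cropping
        image = tf.reshape(image, [64, 64, 3])
        image = tf.image.random_crop(image, [54, 54, 3])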
        return image

    @staticmethod
    def _parse_function(proto):
        features = {
            'image': tf.io.FixedLenFeature([], tf.string),
            'length': tf.io.FixedLenFeature([], tf.int64),
            'digits': tf.io.FixedLenFeature([5], tf.int64)
        }
        parsed_features = tf.io.parse_single_example(proto, features)
        image = Donkey._preprocess(tf.io.decode_raw(parsed_features['image'], tf.uint8))
        length = tf.cast(parsed_features['length'], tf.int32)
        digits = tf.cast(parsed_features['digits'], tf.int32)
        return image, length, digits

    @staticmethod
    def build_batch(path_to_tfrecords_file, num_examples, batch_size, shuffled):
        assert tf.io.gfile.exists(path_to_tfrecords_file), '%s not found' % path_to_tfrecords_file

        dataset = tf.data.TFRecordDataset(filenames=[path_to_tfrecords_file])
        dataset = dataset.map(Donkey._parse_function, num_parallel_calls=tf.data.AUTOTUNE)

        if shuffled:
            dataset = dataset.shuffle(buffer_size=int(0.4 * num_examples))

        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
        
        return dataset
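As a framework-free illustration of the random-crop augmentation, the following NumPy sketch picks a random top-left corner and slices out a 54×54 window. The function name `random_crop` and the blank test image are assumptions for illustration only. A 64×64 image allows 11 × 11 = 121 distinct crop positions.

```python
import numpy as np


def random_crop(image, size=54, rng=None):
    """Return a size x size crop at a random location, plus its (top, left) offset."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    # Valid offsets run from 0 to (dimension - size), inclusive
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    return image[top:top + size, left:left + size], (top, left)


image = np.zeros((64, 64, 3), dtype=np.uint8)
crop, (top, left) = random_crop(image, rng=np.random.default_rng(0))
print(crop.shape)  # (54, 54, 3)
```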



The model structure and its parameters follow Goodfellow's design:
> - Our best architecture consists of eight convolutional hidden layers, one locally connected hidden layer, and two densely connected hidden layers. All connections are feedforward and go from one layer to the next (no skip connections).
> - The number of units at each spatial location in each layer is [48, 64, 128, 160] for the first four layers and 192 for all other locally connected layers. The fully connected layers contain 3,072 units each. 
> - Each convolutional layer includes max pooling and subtractive normalization. The max pooling window size is 2$*$2. 
> - All convolutions use zero padding on the input to preserve representation size.
> - All convolution kernels were of size 5$*$5. We trained with dropout applied to all hidden layers but not the input.

The model structure is shown below:
![model-graph](../img/model-graph.PNG)
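A quick way to check the architecture is to trace the spatial size of the feature maps. The sketch below assumes the setup used in this example's model code: a 54×54 input, convolutions that preserve size, and 'same'-padded max pooling whose strides alternate 2, 1, 2, 1 and so on, so each pooling step maps a size `n` to `ceil(n / stride)`. The names `POOL_STRIDES` and `feature_map_sizes` are hypothetical.

```python
import math

POOL_STRIDES = [2, 1, 2, 1, 2, 1, 2, 1]  # pooling stride of each of the 8 conv blocks


def feature_map_sizes(input_size=54, strides=POOL_STRIDES):
    """Spatial size after each block, for 'same'-padded convolution and pooling."""
    sizes = []
    size = input_size
    for stride in strides:
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes


sizes = feature_map_sizes()
flattened = sizes[-1] ** 2 * 192  # the last block has 192 channels
print(sizes, flattened)  # [27, 27, 14, 14, 7, 7, 4, 4] 3072
```

Under these assumptions the flattened feature vector has 3,072 entries, the same width as the fully connected layers.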

In [4]:
class Model(tf.keras.Model):
    def __init__(self, drop_rate):
        super().__init__()
        self.drop_rate = drop_rate

        # Layers are created once here so their weights persist across calls.
        # Each entry is (filters, pooling stride) for one of the 8 conv blocks.
        block_specs = [(48, 2), (64, 1), (128, 2), (160, 1),
                       (192, 2), (192, 1), (192, 2), (192, 1)]
        layers = []
        for filters, pool_stride in block_specs:
            layers += [
                tf.keras.layers.Conv2D(filters=filters, kernel_size=(5, 5), padding='same'),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.ReLU(),
                tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=pool_stride, padding='same'),
                tf.keras.layers.Dropout(rate=drop_rate),
            ]
        self.features = tf.keras.Sequential(layers)
        self.flatten = tf.keras.layers.Flatten()

        # Two densely connected hidden layers
        self.hidden9 = tf.keras.layers.Dense(units=3072, activation='relu')
        self.hidden10 = tf.keras.layers.Dense(units=3072, activation='relu')

        # Output heads: one for the sequence length, one per digit position
        self.length_head = tf.keras.layers.Dense(units=7)
        self.digit_heads = [tf.keras.layers.Dense(units=11) for _ in range(5)]

    def call(self, x, training=False):
        x = self.features(x, training=training)
        x = self.flatten(x)
        x = self.hidden9(x)
        x = self.hidden10(x)

        length_logits = self.length_head(x)
        # Shape [batch, 5, 11]: one row of logits per digit position
        digits_logits = tf.stack([head(x) for head in self.digit_heads], axis=1)
        return length_logits, digits_logits

    @staticmethod
    def loss(length_logits, digits_logits, length_labels, digits_labels):
        # The output heads produce raw logits, so the softmax is applied inside the loss
        length_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(length_labels, length_logits, from_logits=True))
        digit1_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(digits_labels[:, 0], digits_logits[:, 0, :], from_logits=True))
        digit2_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(digits_labels[:, 1], digits_logits[:, 1, :], from_logits=True))
        digit3_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(digits_labels[:, 2], digits_logits[:, 2, :], from_logits=True))
        digit4_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(digits_labels[:, 3], digits_logits[:, 3, :], from_logits=True))
        digit5_cross_entropy = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(digits_labels[:, 4], digits_logits[:, 4, :], from_logits=True))

        loss = (length_cross_entropy + digit1_cross_entropy +
                digit2_cross_entropy + digit3_cross_entropy +
                digit4_cross_entropy + digit5_cross_entropy)
        return loss
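The total loss is the sum of six softmax cross-entropies: one for the length head and one for each digit position. The NumPy sketch below reproduces that sum outside TensorFlow. The helper names are hypothetical, and the all-zero logits are chosen only so the result can be checked by hand: uniform predictions give ln 7 + 5 · ln 11.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between integer labels and softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def multi_head_loss(length_logits, digits_logits, length_labels, digits_labels):
    """Sum of the length loss and the per-position digit losses."""
    total = softmax_cross_entropy(length_logits, length_labels)
    for i in range(digits_logits.shape[1]):
        total += softmax_cross_entropy(digits_logits[:, i, :], digits_labels[:, i])
    return total


batch = 4
loss = multi_head_loss(np.zeros((batch, 7)), np.zeros((batch, 5, 11)),
                       np.zeros(batch, dtype=int), np.zeros((batch, 5), dtype=int))
print(loss, np.log(7) + 5 * np.log(11))
```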

Load the number of examples in each dataset split.

In [5]:
class Meta(object):
    def __init__(self):
        self.num_train_examples = None
        self.num_val_examples = None
        self.num_test_examples = None

    def save(self, path_to_json_file):
        with open(path_to_json_file, 'w') as f:
            content = {
                'num_examples': {
                    'train': self.num_train_examples,
                    'val': self.num_val_examples,
                    'test': self.num_test_examples
                }
            }
            json.dump(content, f)

    def load(self, path_to_json_file):
        with open(path_to_json_file, 'r') as f:
            content = json.load(f)
            self.num_train_examples = content['num_examples']['train']
            self.num_val_examples = content['num_examples']['val']
            self.num_test_examples = content['num_examples']['test']
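For reference, the JSON layout that `save` writes and `load` reads can be reproduced with the standard library alone. The counts below are made-up placeholders, not the real SVHN split sizes, and the temporary file path is only for illustration.

```python
import json
import os
import tempfile

# Hypothetical example counts; real values come from the dataset conversion step
content = {'num_examples': {'train': 1000, 'val': 100, 'test': 200}}

path = os.path.join(tempfile.mkdtemp(), 'meta.json')
with open(path, 'w') as f:
    json.dump(content, f)

with open(path, 'r') as f:
    loaded = json.load(f)

print(loaded['num_examples']['train'])  # 1000
```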


Compute the model's accuracy and related evaluation summaries.
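The accuracy used here is a whole-sequence accuracy: a sample counts as correct only when every digit position matches its label. A minimal NumPy sketch of that metric, with a hypothetical helper name, with made-up labels, and assuming class `10` marks an unused position:

```python
import numpy as np


def sequence_accuracy(labels, predictions):
    """Fraction of rows in which all positions are predicted correctly."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    return float(np.mean(np.all(labels == predictions, axis=1)))


labels = [[2, 5, 8, 10, 10], [1, 9, 10, 10, 10]]
predictions = [[2, 5, 8, 10, 10], [1, 4, 10, 10, 10]]  # second row has one wrong digit
accuracy = sequence_accuracy(labels, predictions)
print(accuracy)  # 0.5
```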

In [6]:
class Evaluator(object):
    def __init__(self, path_to_eval_log_dir):
        self.summary_writer = tf.summary.create_file_writer(path_to_eval_log_dir)

    def evaluate(self, path_to_checkpoint, path_to_tfrecords_file, num_examples, global_step):
        batch_size = 128
        needs_include_length = False

        # Evaluate with a dropout-free model whose weights come from the checkpoint
        model = Model(0.0)
        tf.train.Checkpoint(model=model).restore(path_to_checkpoint).expect_partial()

        dataset = Donkey.build_batch(path_to_tfrecords_file,
                                     num_examples=num_examples,
                                     batch_size=batch_size,
                                     shuffled=False)

        num_correct = 0
        num_seen = 0
        for image_batch, length_batch, digits_batch in dataset:
            length_logits, digits_logits = model(image_batch, training=False)
            length_predictions = tf.argmax(length_logits, axis=1, output_type=tf.int32)
            digits_predictions = tf.argmax(digits_logits, axis=2, output_type=tf.int32)

            if needs_include_length:
                labels = tf.concat([tf.reshape(length_batch, [-1, 1]), digits_batch], axis=1)
                predictions = tf.concat([tf.reshape(length_predictions, [-1, 1]), digits_predictions], axis=1)
            else:
                labels = digits_batch
                predictions = digits_predictions

            # A sample is correct only if every position matches its label
            matches = tf.reduce_all(tf.equal(labels, predictions), axis=1)
            num_correct += int(tf.reduce_sum(tf.cast(matches, tf.int32)))
            num_seen += int(tf.shape(matches)[0])

        accuracy_val = num_correct / max(num_seen, 1)

        with self.summary_writer.as_default():
            # Log the last evaluated batch of images alongside the metrics
            tf.summary.image('image', image_batch, step=global_step)
            tf.summary.scalar('accuracy', accuracy_val, step=global_step)
            tf.summary.histogram('variables',
                                 tf.concat([tf.reshape(var, [-1]) for var in model.trainable_variables], axis=0),
                                 step=global_step)

        return accuracy_val

Train the model.

In [7]:
def _train(path_to_train_tfrecords_file, num_train_examples, path_to_val_tfrecords_file, num_val_examples,
           path_to_train_log_dir, path_to_restore_checkpoint_file, training_options):
    batch_size = training_options['batch_size']
    initial_patience = training_options['patience']
    num_steps_to_show_loss = 100
    num_steps_to_check = 1000

    # build_batch already prefetches, so no extra prefetch is needed here
    train_dataset = Donkey.build_batch(path_to_train_tfrecords_file, num_examples=num_train_examples,
                                       batch_size=batch_size, shuffled=True)

    model = Model(0.2)
    # Use the 'decay_steps' and 'decay_rate' options (staircase decay is an assumption)
    learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=training_options['learning_rate'],
        decay_steps=training_options['decay_steps'],
        decay_rate=training_options['decay_rate'],
        staircase=True)
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

    summary_writer = tf.summary.create_file_writer(path_to_train_log_dir)
    evaluator = Evaluator(os.path.join(path_to_train_log_dir, 'eval/val'))

    # The checkpoint tracks both the model weights and the optimizer state
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
    if path_to_restore_checkpoint_file:
        checkpoint.restore(path_to_restore_checkpoint_file).expect_partial()
        print('Model restored from file: %s' % path_to_restore_checkpoint_file)

    print('Start training')
    patience = initial_patience
    best_accuracy = 0.0
    steps = 0

    # Keep cycling through the training set until patience runs out
    while patience > 0:
        for image_batch, length_batch, digits_batch in train_dataset:
            with tf.GradientTape() as tape:
                length_logits, digits_logits = model(image_batch, training=True)
                loss = Model.loss(length_logits, digits_logits, length_batch, digits_batch)

            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

            steps += 1

            if steps % num_steps_to_show_loss == 0:
                print(f'=> {datetime.now()}: step {steps}, loss = {loss.numpy()}')
                with summary_writer.as_default():
                    tf.summary.scalar('loss', loss, step=steps)

            if steps % num_steps_to_check == 0:
                print('=> Evaluating on validation dataset...')
                # Write the current weights so the evaluator can restore them
                path_to_latest_checkpoint_file = checkpoint.write(
                    os.path.join(path_to_train_log_dir, 'latest.ckpt'))
                accuracy = evaluator.evaluate(path_to_latest_checkpoint_file, path_to_val_tfrecords_file,
                                              num_val_examples, steps)
                print('==> accuracy = %f, best accuracy %f' % (accuracy, best_accuracy))

                if accuracy > best_accuracy:
                    path_to_checkpoint_file = os.path.join(path_to_train_log_dir, 'model.ckpt')
                    checkpoint.save(file_prefix=path_to_checkpoint_file)
                    print('=> Model saved to file: %s' % path_to_checkpoint_file)
                    patience = initial_patience
                    best_accuracy = accuracy
                else:
                    patience -= 1

                print('=> patience = %d' % patience)
                if patience == 0:
                    break  # leave the batch loop; the while condition then ends training

    print('Finished')
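The stopping rule in the training loop is patience-based early stopping: each validation check that fails to beat the best accuracy costs one unit of patience, and any improvement resets it. The sketch below replays that rule on a made-up list of validation accuracies. The function name `checks_until_stop` is hypothetical.

```python
def checks_until_stop(accuracies, initial_patience):
    """Return (number of checks run, best accuracy) under patience-based stopping."""
    patience = initial_patience
    best_accuracy = 0.0
    for check, accuracy in enumerate(accuracies, start=1):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            patience = initial_patience  # improvement resets the budget
        else:
            patience -= 1
        if patience == 0:
            return check, best_accuracy
    return None, best_accuracy  # patience never ran out


result = checks_until_stop([0.5, 0.6, 0.6, 0.55, 0.58], initial_patience=3)
print(result)  # (5, 0.6)
```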

The main function, which sets the model's hyperparameters.

In [8]:
def main():
    data_dir = './data'
    path_to_train_tfrecords_file = os.path.join(data_dir, 'train.tfrecords')
    path_to_val_tfrecords_file = os.path.join(data_dir, 'val.tfrecords')
    path_to_tfrecords_meta_file = os.path.join(data_dir, 'meta.json')
    path_to_train_log_dir = './logs/train'
    path_to_restore_checkpoint_file = None
    training_options = {
        'batch_size': 32,
        'learning_rate': 1e-2,
        'patience': 100,
        'decay_steps': 10000,
        'decay_rate': 0.9
    }
    meta = Meta()
    meta.load(path_to_tfrecords_meta_file)
    
    _train(path_to_train_tfrecords_file, meta.num_train_examples,
           path_to_val_tfrecords_file, meta.num_val_examples,
           path_to_train_log_dir, path_to_restore_checkpoint_file,
           training_options)
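The `decay_steps` and `decay_rate` options suggest an exponentially decaying learning rate. Assuming a staircase schedule (the rate drops once every `decay_steps` steps), the effective learning rate at a given step could be sketched as follows. The function name `decayed_learning_rate` is hypothetical, and the staircase behavior is an assumption rather than something this section states.

```python
def decayed_learning_rate(step, initial_lr=1e-2, decay_steps=10000, decay_rate=0.9):
    """Staircase exponential decay: multiply by decay_rate every decay_steps steps."""
    return initial_lr * decay_rate ** (step // decay_steps)


for step in (0, 9999, 10000, 25000):
    print(step, decayed_learning_rate(step))
```

With the values above, the rate stays at 0.01 for the first 10,000 steps and is then multiplied by 0.9 at each further multiple of 10,000.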


In [None]:
main()

Start training
