# Chinese Name OCR by CRNN

I used a CRNN model to recognize traditional Chinese names rendered in the MingLiu font in computer-generated images of a fixed size (the code below uses 32 pixels high by 128 pixels wide). I chose CRNN for these reasons:

- A CNN automatically extracts image features with many different filters.
- An RNN captures the sequential relationships among the image features.
- CTC transcribes the sequence of image features into Chinese characters.

For prototyping purposes, I am using Keras to code the CRNN. Because the CNN learns its features directly from the pixels, no image preprocessing with other libraries is needed.

For Chinese names, I sourced data from http://taiwan.chtsai.org/2006/01/10/taiwan_baijiaxing/

The model code here is based on the CRNN paper by Baoguang Shi, Xiang Bai, and Cong Yao (see Reference).

## Reference

1) CRNN model
- @article{ShiBY15, author = {Baoguang Shi and Xiang Bai and Cong Yao}, title = {An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition}, journal = {CoRR}, volume = {abs/1507.05717}, year = {2015} }


2) Names in Taiwan
- http://taiwan.chtsai.org/2006/01/10/taiwan_baijiaxing/

In [1]:
from keras.layers.convolutional import Conv2D,MaxPooling2D,ZeroPadding2D
from keras.layers.normalization import BatchNormalization
from keras.layers.core import Reshape,Masking,Lambda,Permute
from keras.layers import Input,Dense,Flatten
from keras.preprocessing.sequence import pad_sequences
from keras.layers.recurrent import GRU,LSTM
from keras.layers.wrappers import Bidirectional
from keras.models import Model
from keras import backend as K
from keras.preprocessing import image
from keras.optimizers import Adam,SGD,Adadelta
from keras import losses
from keras.layers.wrappers import TimeDistributed
from keras.callbacks import EarlyStopping,ModelCheckpoint,TensorBoard
from keras.utils import plot_model
from matplotlib import pyplot as plt

import numpy as np 
import os
from PIL import Image,ImageDraw,ImageFont 
import json
import threading
import pandas as pd
from opencc import OpenCC 

import tensorflow as tf  
import keras.backend.tensorflow_backend as K  


Using TensorFlow backend.


In [2]:
name_corpus = pd.read_csv("./input/tw_names.csv")

In [3]:
name_corpus.head(10)

Unnamed: 0,name_tra_chi
0,丁一中
1,丁一平
2,丁一民
3,丁一洪
4,丁一玲
5,丁一展
6,丁一軒
7,丁一釗
8,丁一婷
9,丁一琦


In [4]:
name_corpus['len']=name_corpus['name_tra_chi'].apply(lambda x:len(x))
name_corpus.to_csv("./input/tw_names_big5.csv",index=False)
name_corpus.drop_duplicates(inplace=True)

In total, there are 1,203,132 Chinese names with a maximum name length of 3. There are 2,361 distinct characters.
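Statistics of this kind can be computed with a few lines of plain Python. The sketch below runs on a hypothetical four-name list (`names` and `corpus_stats` are illustrative and not part of the notebook) and returns the number of names, the longest name length, and the number of distinct characters.

```python
# Toy stand-in for the name_tra_chi column; the real corpus is far larger.
names = ["丁一中", "丁一平", "陳怡君", "歐陽菲菲"]

def corpus_stats(names):
    """Return (number of names, longest name length, number of distinct characters)."""
    distinct_chars = set()
    for name in names:
        distinct_chars.update(name)  # a str is iterated character by character
    return len(names), max(len(name) for name in names), len(distinct_chars)

total, max_len, n_chars = corpus_stats(names)
print(total, max_len, n_chars)
```

On the real data the same function could be called with `name_corpus['name_tra_chi'].tolist()`.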

In [5]:
name_corpus_filtered = name_corpus[(name_corpus['len']<=5) &((name_corpus['len']>1))].copy()

In [6]:
print("total name is {}".format(len(name_corpus_filtered)))

#print("possible name length is ")
max_label_length=np.max(name_corpus_filtered['len'].unique())

total name is 726461


It is interesting to look at how often each character appears in the names; the most frequent characters are common Chinese surnames.

In [7]:
name_set=list(set(name_corpus_filtered['name_tra_chi']))

char_list=[' ']

for name in name_set:
    char_list.extend(list(name))
    
char_to_id = {j:i for i,j in enumerate(char_list)}
id_to_char = {i:j for i,j in enumerate(char_list)}

char_df = pd.DataFrame(data=char_list)
char_df.columns=['char']
char_df['count']=1
char_stat = char_df.groupby('char').sum().sort_values(by='count',ascending=False)
char_stat

Unnamed: 0_level_0,count
char,Unnamed: 1_level_1
陳,51609
林,43742
黃,35808
張,30581
李,29155
王,27796
吳,24018
劉,21035
蔡,18167
楊,18090
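To isolate surnames from given-name characters, one option is to tally only the first character of each name. This is a rough sketch on a hypothetical list (`names` and `surname_counts` are illustrative); it assumes single-character surnames, so a compound surname such as 歐陽 would be counted under its first character.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would come from
# name_corpus_filtered['name_tra_chi'].
names = ["陳怡君", "陳志明", "林淑芬", "黃俊傑", "陳美玲", "林雅婷"]

def surname_counts(names):
    """Count first characters, treated here as surnames."""
    return Counter(name[0] for name in names if name)

# Print surnames from most to least frequent.
for surname, count in surname_counts(names).most_common():
    print(surname, count)
```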


In [8]:
## initialize global variables
max_label_length = max_label_length
img_h = 32
img_w = 128

rnnunit=256
batch_size =32

Copy the MingLiu font from Windows/fonts to /usr/share/fonts on Ubuntu and update the system font cache:

- sudo mkfontscale (if the package is missing, run sudo apt-get install ttf-mscorefonts-installer)
- sudo mkfontdir (if the package is missing, run sudo apt-get install fontconfig)
- sudo fc-cache -fv (refresh the system font cache)

In [9]:
font=ImageFont.truetype('/usr/share/fonts/truetype/windows/mingliu0.ttf',24) 

Generate images of Chinese names in the MingLiu font:
- training images from randomly sampled names
- validation images from randomly sampled names

In [10]:
def render_name_image(name):
    # Draw the name, with one space between characters, as black text on a white grayscale canvas.
    img = Image.new('L',(img_w,img_h),255)
    draw = ImageDraw.Draw(img)
    draw.text((0,5)," ".join(name).strip(),fill=0,font=font)
    return img

def generate_image_sample(n,train_image_path,train_label_path,valid_image_path,valid_label_path):

    sample_name=name_corpus_filtered.sample(int(n))
    train_name=sample_name.sample(int(n*0.75))
    # Use the remaining rows for validation so the two sets never share a name.
    valid_name=sample_name.drop(train_name.index)

    for index, row in train_name.iterrows():
        render_name_image(row['name_tra_chi']).save(train_image_path+str(index)+'.png')
    train_name.reset_index().to_csv(train_label_path,index=False)

    for index, row in valid_name.iterrows():
        render_name_image(row['name_tra_chi']).save(valid_image_path+str(index)+'.png')
    valid_name.reset_index().to_csv(valid_label_path,index=False)

    return sample_name

In [11]:
sample_image = generate_image_sample(5120,'./tw_train/train_','./tw_train/train_label.csv','./tw_validate/valid_','./tw_validate/valid_label.csv')

In [12]:
sample_image[sample_image['name_tra_chi'].str.contains(u'\ue0ea')]

Unnamed: 0,name_tra_chi,len


In [13]:
name_set=list(set(sample_image['name_tra_chi']))

char_list=[' ']

for name in name_set:
    char_list.extend(list(name))
    
char_list=sorted(set(char_list))  # sorted so the char-to-id mapping is the same on every run
char_to_id = {j:i for i,j in enumerate(char_list)}
id_to_char = {i:j for i,j in enumerate(char_list)}

char_df = pd.DataFrame(data=char_list)
char_df.columns=['char']
char_df['count']=1
char_stat = char_df.groupby('char').sum().sort_values(by='count',ascending=False)
print(len(char_list))
char_stat

1520


Unnamed: 0_level_0,count
char,Unnamed: 1_level_1
,1
羿,1
翡,1
翠,1
翟,1
翕,1
翔,1
翎,1
翌,1
翊,1


Design of the deep learning model: VGG-style CNN + bidirectional GRU + CTC (a bidirectional LSTM variant is left commented out)

In [14]:
nclass = len(char_list) + 1  # one extra class for the CTC blank, which the Keras CTC loss expects at the last index
input = Input(shape=(img_h,None,1),name='the_input')

m = Conv2D(64,kernel_size=(3,3),activation='relu',padding='same',name='conv1')(input)
m = MaxPooling2D(pool_size=(2,2),strides=(2,2),name='pool1')(m)
m = Conv2D(128,kernel_size=(3,3),activation='relu',padding='same',name='conv2')(m)
m = MaxPooling2D(pool_size=(2,2),strides=(2,2),name='pool2')(m)
m = Conv2D(256,kernel_size=(3,3),activation='relu',padding='same',name='conv3')(m)
m = BatchNormalization(axis=3)(m)
m = Conv2D(256,kernel_size=(3,3),activation='relu',padding='same',name='conv4')(m)

m = ZeroPadding2D(padding=(0,1))(m)
m = MaxPooling2D(pool_size=(2,2),strides=(2,1),padding='valid',name='pool3')(m)

m = Conv2D(512,kernel_size=(3,3),activation='relu',padding='same',name='conv5')(m)
m = BatchNormalization(axis=3)(m)
m = Conv2D(512,kernel_size=(3,3),activation='relu',padding='same',name='conv6')(m)

m = ZeroPadding2D(padding=(0,1))(m)
m = MaxPooling2D(pool_size=(2,2),strides=(2,1),padding='valid',name='pool4')(m)
m = Conv2D(512,kernel_size=(2,2),activation='relu',padding='valid',name='conv7')(m)

m = BatchNormalization(axis=3)(m)
m = Permute((2,1,3),name='permute')(m)
m = TimeDistributed(Flatten(),name='timedistrib')(m)

m = Bidirectional(GRU(rnnunit,return_sequences=True,implementation=2),name='blstm1')(m)
#m = Bidirectional(LSTM(rnnunit,return_sequences=True),name='blstm1')(m)
m = Dense(rnnunit,name='blstm1_out',activation='linear',)(m)
#m = Bidirectional(LSTM(rnnunit,return_sequences=True),name='blstm2')(m)
m = Bidirectional(GRU(rnnunit,return_sequences=True,implementation=2),name='blstm2')(m)
y_pred = Dense(nclass,name='blstm2_out',activation='softmax')(m)

basemodel = Model(inputs=input,outputs=y_pred)
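The number of time steps that reaches the recurrent layers can be checked by hand. The sketch below traces the feature-map height and width through the pooling, padding, and final convolution layers defined above; the helper names (`pool`, `pad_width`, `crnn_feature_shape`) are hypothetical and only mirror the layer settings in this notebook. The `padding='same'` convolutions are skipped because they leave the spatial size unchanged.

```python
def pool(h, w, stride_h, stride_w, size=2):
    # 'valid' pooling: floor((dim - size) / stride) + 1 along each axis
    return (h - size) // stride_h + 1, (w - size) // stride_w + 1

def pad_width(h, w, p=1):
    # ZeroPadding2D(padding=(0, 1)) pads only the width, by p on each side
    return h, w + 2 * p

def crnn_feature_shape(img_h, img_w):
    """Return (height, width) of the feature map after conv7."""
    h, w = img_h, img_w
    h, w = pool(h, w, 2, 2)    # pool1
    h, w = pool(h, w, 2, 2)    # pool2
    h, w = pad_width(h, w)     # ZeroPadding2D before pool3
    h, w = pool(h, w, 2, 1)    # pool3: halves the height, stride 1 along the width
    h, w = pad_width(h, w)     # ZeroPadding2D before pool4
    h, w = pool(h, w, 2, 1)    # pool4
    h, w = h - 1, w - 1        # conv7: 2x2 kernel with 'valid' padding
    return h, w

print(crnn_feature_shape(32, 128))
```

For a 32 x 128 input the height collapses to 1 and the width is 33, which matches the `img_w//4+1` value used for `input_length` in the data generator.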

In [15]:
def ctc_lambda_func(args):
    y_pred,labels,input_length,label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

In [16]:
labels = Input(name='the_labels',shape=[max_label_length],dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length]) 

model = Model(inputs=[input, labels, input_length, label_length], outputs=loss_out)

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
the_input (InputLayer)           (None, 32, None, 1)   0                                            
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, 32, None, 64)  640         the_input[0][0]                  
____________________________________________________________________________________________________
pool1 (MaxPooling2D)             (None, 16, None, 64)  0           conv1[0][0]                      
____________________________________________________________________________________________________
conv2 (Conv2D)                   (None, 16, None, 128) 73856       pool1[0][0]                      
___________________________________________________________________________________________

In [17]:
adadelta = Adadelta()
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=adadelta,metrics=['accuracy'])
#checkpoint = ModelCheckpoint(r'weights-{epoch:02d}.hdf5',save_weights_only=True)
earlystop = EarlyStopping(patience=10)
tensorboard = TensorBoard(r'crnn/logs',write_graph=True)

In [None]:
def image_name(index,train):
    # Build the path of the image that generate_image_sample wrote for this row index.
    if train:
        return "./tw_train/train_"+str(index)+".png"
    else:
        return "./tw_validate/valid_"+str(index)+".png"

def generate_image_from_file(path,batch_size=32,maxlabellength=5,train=True):
    
    images = pd.read_csv(path)
    images['file_name']=images['index'].apply(lambda x:image_name(x,train))

    while True:
        # Draw a fresh random batch and allocate fresh arrays on every iteration,
        # so successive batches differ and earlier batches are not overwritten.
        samples = images.sample(batch_size,replace=True).reset_index(drop=True)
        x = np.zeros((batch_size, img_h, img_w, 1), dtype=np.float32)
        labels = np.ones([batch_size,maxlabellength])
        input_length = np.zeros([batch_size,1])
        label_length = np.zeros([batch_size,1])
       
        for i,row in samples.iterrows():

            img1 = Image.open(row['file_name'])
            # Scale pixel values to the range [-0.5, 0.5].
            img = np.array(img1,'f')/255.0-0.5
           
            x[i] = np.expand_dims(img,axis=2)
            
            name = row['name_tra_chi'].strip(' \t\r\n\0')
            # Pad the name with spaces up to maxlabellength.
            name = name.ljust(int(maxlabellength))

            label_length[i] = len(name)        
            # Width of the CNN feature map, i.e. the number of RNN time steps.
            input_length[i] = img_w//4+1
            labels[i,:len(name)] = [char_to_id[c] for c in name]
        
        inputs = {'the_input': x,
                 'the_labels': labels,
                 'input_length': input_length,
                 'label_length': label_length,
                }
        outputs = {'ctc': np.zeros([batch_size])} 
        yield (inputs,outputs)            
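The label-encoding step inside the generator can be seen in isolation. This minimal sketch uses a hypothetical helper `encode_label` and a toy `char_to_id` table (the notebook builds the real table from the sampled names); it pads a name with spaces to a fixed length and maps each character to its integer id.

```python
def encode_label(name, char_to_id, max_len, pad_char=' '):
    """Map a name to a fixed-length list of integer ids, padding with pad_char."""
    if len(name) > max_len:
        raise ValueError("name is longer than max_len")
    padded = name.ljust(max_len, pad_char)
    return [char_to_id[c] for c in padded]

# Toy vocabulary for illustration only.
char_to_id = {' ': 0, '丁': 1, '一': 2, '中': 3}

print(encode_label("丁一中", char_to_id, max_len=5))
```

A character that is missing from the table raises a `KeyError`, which is a useful early warning that the vocabulary and the label file are out of sync.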


In [None]:
# steps_per_epoch and validation_steps count batches, so divide the sample counts by batch_size.
model.fit_generator(generate_image_from_file('./tw_train/train_label.csv',batch_size=batch_size,maxlabellength=max_label_length),
                    steps_per_epoch=int(len(sample_image)*0.75)//batch_size,
                    validation_data=generate_image_from_file('./tw_validate/valid_label.csv',batch_size=batch_size,maxlabellength=max_label_length,train=False),
                    validation_steps=int(len(sample_image)*0.25)//batch_size,
                    epochs=10,
                    verbose=1,
                    callbacks=[earlystop])

Epoch 1/10

In [None]:
model.save_weights('./crnn/crnn_tw_model_weights_5120.h5')
model_file=model.to_json()

### model.predict(x, batch_size=None, verbose=0, steps=None)
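For inference, `predict` would presumably be called on `basemodel`, since that model takes only the image and returns the per-time-step softmax output. Turning that output into text needs a CTC decoding step. The sketch below shows greedy decoding in plain Python: take the best class at each time step, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode`, the toy `id_to_char` table, and the score matrix are all hypothetical; the blank is assumed to be the last class index, which is the convention of the Keras CTC loss.

```python
def ctc_greedy_decode(probs, id_to_char, blank_id):
    """Collapse a (time_steps x n_classes) score matrix into a string."""
    # Best class index at every time step.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    chars = []
    prev = None
    for idx in best:
        # Keep a class only when it differs from the previous step and is not blank.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    # Remove the space padding that the training labels carry.
    return "".join(chars).strip()

# Toy vocabulary of three characters plus the blank at the last index.
id_to_char = {0: ' ', 1: '丁', 2: '一'}
blank_id = 3

probs = [
    [0.1, 0.7, 0.1, 0.1],  # 丁
    [0.1, 0.7, 0.1, 0.1],  # 丁 again, merged with the previous step
    [0.1, 0.1, 0.1, 0.7],  # blank
    [0.1, 0.1, 0.7, 0.1],  # 一
    [0.1, 0.1, 0.1, 0.7],  # blank separates two genuine occurrences
    [0.1, 0.1, 0.7, 0.1],  # 一
]

print(ctc_greedy_decode(probs, id_to_char, blank_id))
```

With a real prediction, one row of the output could be passed as `y_pred[0].tolist()`. Greedy decoding is the simplest choice; beam search can give better results at a higher cost.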