# Training Sample Generator

![image](https://raw.githubusercontent.com/crackml/EE-GL9123-Intro2ML-Project/master/captcha010.png)

On close inspection of some verification codes, we find that all Chinese characters are set in 粗楷简体 (Bold Kai Style Simplified), a font that can be downloaded from the Internet. There are over 60,000 Chinese characters in total, but because a verification code must be readable by a real human, it is unlikely to use uncommon characters. We therefore generate training samples only from the [3500 commonly used Chinese characters](http://hanyu.iciba.com/zt/3500.html) catalogued by the Ministry of Education of China. The Training Sample Generator takes them as a dictionary, randomly picks one character from it, and generates a picture of that character.
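
As a minimal sketch of the dictionary lookup described above, the snippet below draws one character uniformly at random from a string of characters. The names `COMMON_CHARS` and `pick_random_char` are hypothetical, and the four-character string is only a stand-in for the 3500-character dictionary; the project's own helper lives in `Tools` and may work differently.

```python
import random

# Stand-in for the 3500-character dictionary (hypothetical, shortened for illustration).
COMMON_CHARS = "女绒恩的"

def pick_random_char(dictionary=COMMON_CHARS, rng=random):
    """Return one character drawn uniformly at random from `dictionary`."""
    return rng.choice(dictionary)

print(pick_random_char())
```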

`Tools` is a package which contains some functions related to this project. 

In [1]:
from Tools import *
# import all helper functions from the Tools package

In [2]:
from PIL import Image, ImageFont, ImageDraw
import numpy as np

from random import randint, choice
from math import sin, cos, radians, fabs

import os,sys
dir_path = os.path.abspath('.')
print(dir_path)
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

G:\Python_Lab\Proj


The original verification code measures 128×400. The characters, however, do not fill the whole area; each one is bounded within 80×80. To speed up training while avoiding a loss of identification accuracy, we use 40×40 as the input size of our model. The Training Sample Generator therefore produces samples of size 40×40.

The characters on the original verification code are rotated: the rotation angle lies within about [-30°, 30°] for an upright character and about [150°, 210°] for a headstand (upside-down) character. When generating a sample, the Training Sample Generator draws a uniformly random angle from the range selected by the status input and rotates the character rendered from the original font. In addition, the spacing between characters varies and can be positive or negative, so a single character cut out by Scissor may include part of a neighboring character, may be missing part of itself, or, with luck, may contain exactly itself. The Training Sample Generator therefore zooms the rendered character by a scale factor in [0.95, 1.05] and then keeps only the centered 40×40 part.
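
The sampling described above can be sketched with the standard library alone. The helper names `sample_angle`, `sample_scale`, and `center_crop_box` are hypothetical and only illustrate the ranges and the centered crop; they are not the functions used in this project.

```python
import random

def sample_angle(status, rng=random):
    """Rotation angle in degrees: [-30, 30] if upright (status 0), else [150, 210]."""
    if status == 0:
        return rng.uniform(-30, 30)
    return rng.uniform(150, 210)

def sample_scale(rng=random):
    """Zoom factor drawn uniformly from [0.95, 1.05]."""
    return rng.uniform(0.95, 1.05)

def center_crop_box(canvas_size, half_width):
    """(left, top, right, bottom) of the centered square with side 2 * half_width."""
    c = canvas_size // 2
    return (c - half_width, c - half_width, c + half_width, c + half_width)

print(sample_angle(0), sample_angle(1), sample_scale())
print(center_crop_box(160, 20))  # (60, 60, 100, 100)
```

With a 160×160 canvas and a half-width of 20, the box covers the central 40×40 pixels, which matches the sample size stated above.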

The following are three examples of characters cut out by Scissor.

![image](https://raw.githubusercontent.com/crackml/EE-GL9123-Intro2ML-Project/master/pic1.png)

"女": This cut happens to contain the complete character and nothing else.

![image](https://raw.githubusercontent.com/crackml/EE-GL9123-Intro2ML-Project/master/pic2.png)

"绒": This cut contains most of the character, but some parts have been cut off.

![image](https://raw.githubusercontent.com/crackml/EE-GL9123-Intro2ML-Project/master/pic3.png)

"恩": This cut contains a leftover part of the neighboring character.

Our sample generator must cover all three cases.

`GenerateOneChar` generates a picture of a single character with rotation and zooming.

In [3]:
def GenerateOneChar(status=None, character=None, corp_r=20, ran_bias=5):
    """
    status == 0: upright character
    status == 1: headstand (upside-down) character
    character: Chinese character to draw, as a string; picked at random if None
    corp_r: crop radius (half the side of the cropped square), generally < 50
    ran_bias: maximum random offset, in pixels, applied when pasting the character
    """
    bwh = 160 # width and height of background
    iwh = (81, 90) # width and height of image
    fontsize = 72
    
    if status is None:
        status = 0
    # np.random is used because only randint and choice are imported from random
    if status == 0:
        angle = np.random.uniform(-30, 30)
    else:
        angle = np.random.uniform(150, 210)
    
    if character is None:
        character = GetA_GB2312_Char()
    
    im = Image.new("RGBA", iwh, (0,0,0,0)) # fully transparent canvas for the glyph
    
    global dir_path 
    font = ImageFont.truetype(dir_path + "\\Kai-Simplified-Bold.ttf", fontsize)
    
    ch = ImageDraw.Draw(im)
    ch.fontmode = "1"
    ch.text((0, 0), character, font=font, fill="#000000") # fill with black
    im = CutOffRegion(im) # reserve only character region
    
    # rotation
    fore = im.rotate(angle, resample=Image.BICUBIC,expand=1)
    width, height = fore.size
    # zooming (Image.LANCZOS is the current name of the former Image.ANTIALIAS filter)
    scale = np.random.uniform(0.95, 1.05)
    fore = fore.resize((int(width * scale), int(height * scale)), Image.LANCZOS)
    width, height = fore.size
    
    background = Image.new("RGB", (bwh, bwh), (255,255,255,255)) #white background
    background.paste(fore, (bwh//2 - width // 2 + randint(-ran_bias, ran_bias),\
                            bwh//2 - height // 2 + randint(-ran_bias, ran_bias)), fore)
    
    return character,background.crop((bwh//2 - corp_r, bwh//2 - corp_r,\
                                      bwh//2 + corp_r, bwh//2 + corp_r))

The function `Generate` uses `GenerateOneChar` to randomly generate a large number of character pictures.

In [18]:
import sys,time
def Generate(nums=10,corp_r=20,ran_bias=5,status=-1,path=None,random_c=False,outC=False):
    """
    nums: the number of characters to produce
    corp_r: crop radius
    ran_bias: random bias used in generating a single character
    status: 0 for upright, 1 for headstand; any other value picks one at random per sample
    path: output directory used when outC is True (defaults to dir_path)
    random_c: True to use a random crop radius for each character
    outC: True to write the character images to files under path
    """
    
    X = np.zeros((nums,40,40,1))
    if status == 0:
        y = np.zeros(nums)
    elif status == 1:
        y = np.ones(nums)
    else:
        y = np.zeros(nums)
        for i in range(nums):
            #randomly generate statuses
            y[i] = np.random.binomial(1, 0.5)
    if path is None:
        path = dir_path
        
    for i in range(nums):
        #cited and modified from http://hovertree.com/h/bjaf/opmktjog.htm
        #print the progress
        p = int(((i+1)/nums)*100)
        String = '>'*(p//2)+' '*((100-p-1)//2)
        sys.stdout.write('\rPocessing: '+String+'[%s%%]'%(p))
        sys.stdout.flush()
        #############
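        # pass ran_bias through so that the argument of Generate takes effect
        if not random_c:
            Char,CharImg = GenerateOneChar(status=y[i],corp_r=corp_r,ran_bias=ran_bias)
        else:
            Char,CharImg = GenerateOneChar(status=y[i],corp_r=randint(20, 50),ran_bias=ran_bias)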
        
        CharImg = CharImg.resize((40,40))
        #CharImg2 = np.asarray(CharImg.convert('RGB').convert('L'),dtype='float64')
        #CharImg2[CharImg2 <= 180] = 0
        #CharImg2[CharImg2 > 180] = 255
        #plt.imshow(Image.fromarray(CharImg2))
        #plt.show()
        CharImg = np.asarray(CharImg.convert('RGB').convert('L'),dtype='float64')
        CharImg[CharImg <= 180] = 0
        CharImg[CharImg > 180] = 255
        
        if outC:
            # file name: 6-digit index + character + status, e.g. 000000女0.bmp
            filename = os.path.join(path, "%06d"%(i)+Char+str(int(y[i]))+'.bmp')
            #print(filename)
            Image.fromarray(CharImg).convert('RGB').save(filename)
        
        X[i,:,:,:] = CharImg[:,:,None]
        
    return X, y
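
The thresholding step inside `Generate` turns each grayscale image into a pure black-and-white one. A minimal standalone sketch of that step is shown below; the helper name `binarize` and the demo array are hypothetical, while the threshold of 180 is the value used in the code above.

```python
import numpy as np

def binarize(gray, threshold=180):
    """Map pixels <= threshold to 0 (ink) and pixels > threshold to 255 (background)."""
    gray = np.asarray(gray, dtype="float64")
    return np.where(gray <= threshold, 0.0, 255.0)

# Tiny made-up grayscale patch, for illustration only.
demo = np.array([[0, 120, 180], [181, 200, 255]])
print(binarize(demo))
```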

To train our model sufficiently, we randomly generate 131072 samples, so that each character appears in approximately 37 samples on average, about 19 for each status. In addition, 2048 samples are generated for validation.
The samples are saved to the files `Xr.npy`, `yr.npy`, `Xs.npy`, and `ys.npy`.
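
The per-character figures can be checked with a short calculation. This sketch assumes that characters are drawn uniformly from a 3500-character dictionary and that the two statuses are equally likely; the variable names are illustrative.

```python
n_samples = 131072
n_chars = 3500      # size of the common-character dictionary
n_statuses = 2      # upright and headstand

per_char = n_samples / n_chars
per_char_status = per_char / n_statuses
print(f"{per_char:.1f} samples per character, {per_char_status:.1f} per character and status")
```

These are expected values only; the actual count for any single character varies because the draws are random.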

In [19]:
Xr,yr=Generate(nums=131072,corp_r=35,ran_bias=10,status=-1,path='./tr',random_c=False,outC=True)

Pocessing: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>[100%]

In [22]:
np.save("Xr.npy", Xr)
np.save("yr.npy", yr)

In [23]:
Xs,ys=Generate(nums=2048,corp_r=35,ran_bias=10,status=-1,path='./ts',random_c=False,outC=True)

Pocessing: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>[100%]

In [24]:
np.save("Xs.npy", Xs)
np.save("ys.npy", ys)
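
The saved arrays can be read back later with `np.load`. The sketch below shows the round trip on small dummy arrays with the same layout as the generator output; the file names `X_demo.npy` and `y_demo.npy` and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

# Small dummy arrays with the same layout as the generator output (hypothetical data).
X = np.zeros((4, 40, 40, 1))
y = np.array([0.0, 1.0, 1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "X_demo.npy"), X)
    np.save(os.path.join(tmp, "y_demo.npy"), y)
    X_loaded = np.load(os.path.join(tmp, "X_demo.npy"))
    y_loaded = np.load(os.path.join(tmp, "y_demo.npy"))

print(X_loaded.shape, y_loaded.shape)  # (4, 40, 40, 1) (4,)
```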