# ZYu Data Helper v3

1.	The notebook reads `lines_copy.txt` into a dataframe (`df`), which holds the metadata for every image in the IAM lines dataset.

2.	The notebook builds a character-level tokenizer (using the keras `Tokenizer` class) from the `CHAR_LIST` file. The tokenizer maps each character to a number from 0 to 189.

3.	The notebook then uses the tokenizer to convert each text into a string of numbers. For example, `" and he is to be backed by Mr. Will "` becomes `"0 62 75 … 73 73 0"`.

4.	The tokenized strings are added back to `df`.

5.	You choose a subset of `df`. This new dataframe is called `df_data` and defines the data you are going to use.

6.	You split `df_data` into `df_train`, `df_val`, and `df_test`.

7.	Based on `df_train`, `df_val`, and `df_test`, the notebook first cleans up the destination folders. It then copies the image files (original and deslanted) from the source folders (i.e. HRS/data/lines and HRS/data/lines_deslanted) to the destination folders (i.e. HRS/data/train etc.) and creates the label, text, and list files in these folders.


In [1]:
import warnings
warnings.filterwarnings('ignore')

import re
import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# from tqdm import tnrange, tqdm_notebook

In [2]:
# take a look at lines_copy.txt
!head lines/lines_copy.txt

a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour
a01-000u-03 err 156 23 430 1290 1883 70 M Ps|tomorrow|.|Mr.|Michael|Foot|has
a01-000u-04 ok 157 20 395 1474 1830 94 put|down|a|resolution|on|the|subject
a01-000u-05 err 156 21 379 1643 1854 88 and|he|is|to|be|backed|by|Mr.|Will
a01-000u-06 ok 159 20 363 1825 2051 87 Griffiths|,|M P|for|Manchester|Exchange|.
a01-000x-00 ok 182 30 375 748 1561 148 A|MOVE|to|stop|Mr.|Gaitskell|from|nominating
a01-000x-01 ok 181 23 382 924 1595 148 any|more|Labour|life|Peers|is|to|be|made|at|a
a01-000x-02 ok 181 30 386 1110 1637 140 meeting|of|Labour|0M Ps|tomorrow|.|Mr.|Michael


# 1. Read meta-data into a dataframe


### NOTE: ADDED SPACE TO BEGINNING AND END OF TEXT

A space is added to the beginning and end of each text before tokenizing, because in "HandwritingRecognitionSystem" the ground truth starts and ends with "0", which is the code for a space.

### 1.1 Read information about the IAM lines from `lines_copy.txt` into dataframe `df`

In [3]:
file_in = 'lines/lines_copy.txt'

data_list = []

with open(file_in) as fhand:     # the file is closed automatically at the end of the block
    for line in fhand:
        item = line.split(" ")   # split the line
        num = len(item)          # number of fields if line delimited by " "

        if num == 9:             # good lines
            label = item[8]      # last field is the label if there are only 9 fields
        else:                    # bad lines with spaces in "label"
            label = " ".join(item[8:])    # join the remaining fields - this is the label

        new_data = {'id': item[0],
                    'wseg_status': item[1],
                    'graylevel': int(item[2]),
                    'num_components': int(item[3]),
                    'x': int(item[4]),
                    'y': int(item[5]),
                    'w': int(item[6]),
                    'h': int(item[7]),
                    'label': " " + label.rstrip().replace("|", " ") + " "}   # remove "\n", replace "|" with " ", pad with a space on each side

        data_list.append(new_data)

df = pd.DataFrame(data_list, 
                  columns=['id', 'wseg_status', 'graylevel', 'num_components', 'x', 'y', 'w', 'h', 'label'])

df.head()

Unnamed: 0,id,wseg_status,graylevel,num_components,x,y,w,h,label
0,a01-000u-00,ok,154,19,408,746,1661,89,A MOVE to stop Mr. Gaitskell from
1,a01-000u-01,ok,156,19,395,932,1850,105,nominating any more Labour life Peers
2,a01-000u-02,ok,157,16,408,1106,1986,105,is to be made at a meeting of Labour
3,a01-000u-03,err,156,23,430,1290,1883,70,M Ps tomorrow . Mr. Michael Foot has
4,a01-000u-04,ok,157,20,395,1474,1830,94,put down a resolution on the subject
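
The field handling above can also be expressed with the `maxsplit` argument of `str.split`, which keeps a label that contains spaces (such as `M Ps|tomorrow|...`) in one piece. The following is a minimal sketch of that idea, not the notebook's own code; `parse_iam_line` is a hypothetical helper name.

```python
def parse_iam_line(line):
    """Parse one line of lines_copy.txt into a dict (hypothetical helper)."""
    # Split on the first 8 spaces only, so the 9th field keeps any spaces in the label
    fields = line.split(" ", 8)
    keys = ['id', 'wseg_status', 'graylevel', 'num_components', 'x', 'y', 'w', 'h']
    record = dict(zip(keys, fields[:8]))
    # The six numeric fields are stored as integers, as in the loop above
    for key in keys[2:]:
        record[key] = int(record[key])
    # Drop the trailing newline, replace "|" with " ", pad with one space on each side
    record['label'] = " " + fields[8].rstrip().replace("|", " ") + " "
    return record


sample = "a01-000u-03 err 156 23 430 1290 1883 70 M Ps|tomorrow|.|Mr.|Michael|Foot|has\n"
print(repr(parse_iam_line(sample)['label']))
```

With this approach the "good line" and "bad line" cases go through the same code path.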


# 2. Build a tokenizer using `CHAR_LIST`

NOTE: because of the spaces added in step 1, every tokenized string starts and ends with 0, following the convention in "HandwritingRecognitionSystem".

In [4]:
# There are 190 characters in CHAR_LIST
!wc CHAR_LIST

190 190 511 CHAR_LIST


In [5]:
# take a look at CHAR_LIST
!head CHAR_LIST

<SPACE>
<UNK>
!
"
#
$
%
&
'
(


### Build a tokenizer that converts characters to numbers

Here we borrow the "Tokenizer" class from keras.

In [6]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=190, char_level=True, lower=False)  # do NOT convert to lower case

file_in = 'CHAR_LIST'

fhand = open(file_in)

chars = {}

num = 0
for line in fhand:
    char = line.rstrip()
    chars[char]=num
    print(line.rstrip(), end="")
    num += 1
fhand.close()

print()
print(num)

chars[' ']=0   # encode <space> as 0
tokenizer.word_index = chars    # set word_map for tokenizer
tokenizer.oov_token = '<UNK>'   # unknown characters map to <UNK> (index 1); keras looks this token up in word_index

reverse_word_map = dict(map(reversed, chars.items()))   

def sequence_to_string(list_of_numbers):
    list_of_num_strings = list(map(str, list_of_numbers))  # Turn list of numbers to list of strings
    string_of_numbers = ' '.join(list_of_num_strings)      # Join into one string
    return string_of_numbers

def sequence_to_text(list_of_indices):
    # Looking up words in dictionary
    words = [reverse_word_map.get(letter) for letter in list_of_indices]
    text = ''.join(words)    
    return text

def tokenized_string_to_text(tokenized_string):
    token_strings = tokenized_string.split(" ")
    tokens = [int(token_string) for token_string in token_strings]
    words = [reverse_word_map.get(letter) for letter in tokens]
    text = ''.join(words)    
    return text

Using TensorFlow backend.


<SPACE><UNK>!"#$%&'()*+,-./0123456789:;=>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz|~£§¨«¬­°²´·º»¼½¾ÀÂÄÇÈÉÊÔÖÜßàáâäæçèéêëìîïñòóôöøùúûüÿłŒœΓΖΤάήαδεηικλμνξοπρτυχψωόώІ–—†‡‰‹›₂₤℔⅓⅔⅕⅗⅘⅛∫≠□✓ｆ
190
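
To make the mapping concrete, here is a minimal pure-Python sketch of the same character-to-number encoding, without keras. It assumes only the ASCII prefix of `CHAR_LIST` in the order printed above (the non-ASCII characters after `z` are left out), and the names `encode` and `decode` are hypothetical helpers for illustration.

```python
import string

# ASCII prefix of CHAR_LIST, in the order printed above
# (assumption: the characters after "z" are omitted in this sketch)
char_list = ["<SPACE>", "<UNK>"] + list(
    "!\"#$%&'()*+,-./0123456789:;=>?"
    + string.ascii_uppercase
    + "[]_`"
    + string.ascii_lowercase
)

char_to_index = {char: index for index, char in enumerate(char_list)}
char_to_index[" "] = 0          # a literal space is encoded as 0
# Later entries overwrite earlier ones, so index 0 decodes to " " rather than "<SPACE>"
index_to_char = {index: char for char, index in char_to_index.items()}
UNK = char_to_index["<UNK>"]    # index 1, used for characters not in the list


def encode(text):
    """Turn a text into a tokenized string such as '0 32 0 44 ...'."""
    return " ".join(str(char_to_index.get(char, UNK)) for char in text)


def decode(tokenized_string):
    """Turn a tokenized string back into text."""
    return "".join(index_to_char[int(token)] for token in tokenized_string.split(" "))


text = " A MOVE to stop Mr. Gaitskell from "
print(encode(text))
print(decode(encode(text)) == text)
```

Because each padded label starts and ends with a space, each encoded string starts and ends with `0`.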


# 3. Turn texts into tokenized strings

In [7]:
texts = list(df['label'])
sequences_of_texts = tokenizer.texts_to_sequences(texts) 
tokenized_strings = list(map(sequence_to_string, sequences_of_texts))

texts_back = list(map(tokenized_string_to_text, tokenized_strings))

In [8]:
# take a look at text
texts[:5]

[' A MOVE to stop Mr. Gaitskell from ',
 ' nominating any more Labour life Peers ',
 ' is to be made at a meeting of Labour ',
 ' M Ps tomorrow . Mr. Michael Foot has ',
 ' put down a resolution on the subject ']

In [9]:
# take a look at tokens
print(sequences_of_texts[:5])

[[0, 32, 0, 44, 46, 53, 36, 0, 81, 76, 0, 80, 81, 76, 77, 0, 44, 79, 15, 0, 38, 62, 70, 81, 80, 72, 66, 73, 73, 0, 67, 79, 76, 74, 0], [0, 75, 76, 74, 70, 75, 62, 81, 70, 75, 68, 0, 62, 75, 86, 0, 74, 76, 79, 66, 0, 43, 62, 63, 76, 82, 79, 0, 73, 70, 67, 66, 0, 47, 66, 66, 79, 80, 0], [0, 70, 80, 0, 81, 76, 0, 63, 66, 0, 74, 62, 65, 66, 0, 62, 81, 0, 62, 0, 74, 66, 66, 81, 70, 75, 68, 0, 76, 67, 0, 43, 62, 63, 76, 82, 79, 0], [0, 44, 0, 47, 80, 0, 81, 76, 74, 76, 79, 79, 76, 84, 0, 15, 0, 44, 79, 15, 0, 44, 70, 64, 69, 62, 66, 73, 0, 37, 76, 76, 81, 0, 69, 62, 80, 0], [0, 77, 82, 81, 0, 65, 76, 84, 75, 0, 62, 0, 79, 66, 80, 76, 73, 82, 81, 70, 76, 75, 0, 76, 75, 0, 81, 69, 66, 0, 80, 82, 63, 71, 66, 64, 81, 0]]


In [10]:
# take a look at tokenized strings, this is the "ground truth" used in HRS
tokenized_strings[:5]

['0 32 0 44 46 53 36 0 81 76 0 80 81 76 77 0 44 79 15 0 38 62 70 81 80 72 66 73 73 0 67 79 76 74 0',
 '0 75 76 74 70 75 62 81 70 75 68 0 62 75 86 0 74 76 79 66 0 43 62 63 76 82 79 0 73 70 67 66 0 47 66 66 79 80 0',
 '0 70 80 0 81 76 0 63 66 0 74 62 65 66 0 62 81 0 62 0 74 66 66 81 70 75 68 0 76 67 0 43 62 63 76 82 79 0',
 '0 44 0 47 80 0 81 76 74 76 79 79 76 84 0 15 0 44 79 15 0 44 70 64 69 62 66 73 0 37 76 76 81 0 69 62 80 0',
 '0 77 82 81 0 65 76 84 75 0 62 0 79 66 80 76 73 82 81 70 76 75 0 76 75 0 81 69 66 0 80 82 63 71 66 64 81 0']

In [11]:
# make sure we can convert the tokenized strings back to the original text
texts_back[:5]

[' A MOVE to stop Mr. Gaitskell from ',
 ' nominating any more Labour life Peers ',
 ' is to be made at a meeting of Labour ',
 ' M Ps tomorrow . Mr. Michael Foot has ',
 ' put down a resolution on the subject ']

# 4. Add tokenized strings back to the dataframe

There are **13353** images in the `lines` dataset.

In [12]:
df['truth'] = tokenized_strings
df.head(10)

Unnamed: 0,id,wseg_status,graylevel,num_components,x,y,w,h,label,truth
0,a01-000u-00,ok,154,19,408,746,1661,89,A MOVE to stop Mr. Gaitskell from,0 32 0 44 46 53 36 0 81 76 0 80 81 76 77 0 44 ...
1,a01-000u-01,ok,156,19,395,932,1850,105,nominating any more Labour life Peers,0 75 76 74 70 75 62 81 70 75 68 0 62 75 86 0 7...
2,a01-000u-02,ok,157,16,408,1106,1986,105,is to be made at a meeting of Labour,0 70 80 0 81 76 0 63 66 0 74 62 65 66 0 62 81 ...
3,a01-000u-03,err,156,23,430,1290,1883,70,M Ps tomorrow . Mr. Michael Foot has,0 44 0 47 80 0 81 76 74 76 79 79 76 84 0 15 0 ...
4,a01-000u-04,ok,157,20,395,1474,1830,94,put down a resolution on the subject,0 77 82 81 0 65 76 84 75 0 62 0 79 66 80 76 73...
5,a01-000u-05,err,156,21,379,1643,1854,88,and he is to be backed by Mr. Will,0 62 75 65 0 69 66 0 70 80 0 81 76 0 63 66 0 6...
6,a01-000u-06,ok,159,20,363,1825,2051,87,"Griffiths , M P for Manchester Exchange .",0 38 79 70 67 67 70 81 69 80 0 13 0 44 0 47 0 ...
7,a01-000x-00,ok,182,30,375,748,1561,148,A MOVE to stop Mr. Gaitskell from nominating,0 32 0 44 46 53 36 0 81 76 0 80 81 76 77 0 44 ...
8,a01-000x-01,ok,181,23,382,924,1595,148,any more Labour life Peers is to be made at a,0 62 75 86 0 74 76 79 66 0 43 62 63 76 82 79 0...
9,a01-000x-02,ok,181,30,386,1110,1637,140,meeting of Labour 0M Ps tomorrow . Mr. Michael,0 74 66 66 81 70 75 68 0 76 67 0 43 62 63 76 8...


# 5. Select a subset of files

You can choose a subset of `df` as `df_data` for training/testing your model.

Here we set **df_data = df**, which means all files are going to be copied.

We also split the data into train/val/test.

### select data to be used

In [13]:
## If you only want to use a randomly selected subset of data ####################
#data_size=10000
#df_data = df.sample(n=data_size, replace=False, axis=0, random_state=8)

## If you want to copy all the data ##############################################
df_data = df

### split the selected data into train/val/test

In [14]:
def train_val_test_split(df, val_size=0.1, test_size=0.1, random_seed=8):
    # create shuffled index
    # the seed is reset every time the function is called, so an index of the same length is always cut the same way
    indices = np.arange(df.shape[0])
    np.random.seed(random_seed)
    np.random.shuffle(indices)
    # shuffle the data
    df = df.iloc[indices,:]
    
    # cut position for train and validation
    train_cut = round(df.shape[0] * (1 - test_size - val_size))     
    val_cut = round(df.shape[0] * (1 - test_size))  
    
    # split the data into train, val, test
    train_df = df.iloc[:train_cut,:]
    val_df = df.iloc[train_cut:val_cut,:]
    test_df = df.iloc[val_cut:,:]
    
    return train_df, val_df, test_df    

In [15]:
# call the function to split the data
df_train, df_val, df_test = train_val_test_split(df_data, val_size=0.1, test_size=0.1, random_seed=8)

print(df_train.shape[0])
print(df_val.shape[0])
print(df_test.shape[0])

10682
1336
1335
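
The three counts follow directly from the two cut positions. As a small worked check, the sketch below reproduces the arithmetic of `train_val_test_split` without any dataframe; `split_sizes` is a hypothetical helper written only for this illustration.

```python
def split_sizes(n_rows, val_size=0.1, test_size=0.1):
    """Return (train, val, test) row counts from the same cut positions as train_val_test_split."""
    train_cut = round(n_rows * (1 - test_size - val_size))   # end of the training rows
    val_cut = round(n_rows * (1 - test_size))                # end of the validation rows
    return train_cut, val_cut - train_cut, n_rows - val_cut


print(split_sizes(13353))   # (10682, 1336, 1335)
```

For 13353 rows the cuts fall at round(10682.4) = 10682 and round(12017.7) = 12018, which is why the validation set ends up one row larger than the test set.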


In [16]:
# take a look
df_train.head()

Unnamed: 0,id,wseg_status,graylevel,num_components,x,y,w,h,label,truth
3633,c04-056-01,ok,167,31,333,933,1873,113,comics bolted in and out of holes so often .,0 64 76 74 70 64 80 0 63 76 73 81 66 65 0 70 7...
6489,g02-062-05,ok,152,48,304,1841,1687,108,"the dissections of living animals , Harvey no...",0 81 69 66 0 65 70 80 80 66 64 81 70 76 75 80 ...
2031,b01-118-05,ok,162,18,327,1631,1800,124,Hitler in the thirties . It was Dr. Verwoerd ...,0 39 70 81 73 66 79 0 70 75 0 81 69 66 0 81 69...
3918,c06-091-04,err,181,32,320,1403,1867,205,"Gina , walking out on the man who has so far",0 38 70 75 62 0 13 0 84 62 73 72 70 75 68 0 76...
8728,h04-082-07,err,179,49,386,1659,1768,100,"poultry , eggs , canned vegetables , fresh fr...",0 77 76 82 73 81 79 86 0 13 0 66 68 68 80 0 13...


# 6. Create folders for train, val, test

Also add folders for deslanted data.

In [17]:
!pwd

/home/ubuntu/HRS/data


In [20]:
# For original IAM data
!mkdir -p train/Images train/Labels train/Text
!mkdir -p val/Images val/Labels val/Text
!mkdir -p test/Images test/Labels test/Text

# For deslanted IAM data
!mkdir -p train_deslanted/Images train_deslanted/Labels train_deslanted/Text
!mkdir -p val_deslanted/Images val_deslanted/Labels val_deslanted/Text
!mkdir -p test_deslanted/Images test_deslanted/Labels test_deslanted/Text

In [18]:
!ls

CHAR_LIST	 test		 train_deslanted  zyu_data_helper_v3.ipynb
lines		 test_deslanted  val
lines_deslanted  train		 val_deslanted


# 7. Copy selected files to destination

For example, for the line with id `a03-066-01`:

- source: `./lines/a03/a03-066/a03-066-01.png`

- destination: `./train/Images/a03-066-01.png`


Workflow:

- initialize `lines_list = []`

- loop through each row in the dataframe

- use the `id` field to build the `source` and `destination` paths and the label and text file names

- use `label` to create the `*.txt` file

- use `truth` to create the `*.tru` file

- copy `source` to `destination`

- add the `id` to `lines_list`

- write `lines_list` to `./train/list`
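
The path construction in this workflow can be sketched as follows for the original (non-deslanted) images, where the source tree is nested by the first two parts of the id. This is a minimal illustration; `build_paths` is a hypothetical helper and uses `os.path.join` instead of string concatenation.

```python
import os


def build_paths(img_id, source_dir, destination_dir):
    """Return the source and destination paths for one line id (hypothetical helper)."""
    first, second = img_id.split("-")[:2]       # "a03-066-01" -> "a03", "066"
    return {
        # nested source layout: <source_dir>/a03/a03-066/a03-066-01.png
        "source": os.path.join(source_dir, first, first + "-" + second, img_id + ".png"),
        "image": os.path.join(destination_dir, "Images", img_id + ".png"),
        "label": os.path.join(destination_dir, "Labels", img_id + ".tru"),
        "text": os.path.join(destination_dir, "Text", img_id + ".txt"),
    }


paths = build_paths("a03-066-01", "./lines", "./train")
print(paths["source"])
print(paths["image"])
```

On a POSIX system the two printed paths match the source and destination shown in the example above.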


### define helper functions 

In [22]:
def clean_folder(destination_dir):
    """
    Delete the files directly inside destination_dir (paths are built with os.path.join()).
    Subdirectories are left untouched; to remove them as well, uncomment the elif statement.
    """
    
    import os, shutil
    folder = destination_dir
    
    count = 0
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
                count += 1
            #elif os.path.isdir(file_path): shutil.rmtree(file_path)
        except Exception as e:
            print(e)
    print("Successfully removed", count, "files from", destination_dir)  
    
    
def prepare_data_files(df, source_dir, destination_dir):
    """Copy original png to destination and create label, txt, list files"""
    destination_images_dir = destination_dir + '/Images'
    destination_labels_dir = destination_dir + '/Labels'
    destination_text_dir = destination_dir + '/Text'
    destination_list_file = destination_dir + '/list'
    
    clean_folder(destination_dir)
    clean_folder(destination_images_dir)
    clean_folder(destination_labels_dir)
    clean_folder(destination_text_dir)
    
    from shutil import copy2

    lines_list = []

    for i in range(df.shape[0]):    

        # read data from each row
        img_data = df.iloc[i]            # each line in dataframe

        img_id = img_data['id']         # data for file
        img_file = img_id + '.png'      # img file_name
        label_file = img_id + '.tru'    # label file_name
        text_file = img_id + '.txt'     # text file_name

        img_text = img_data['label']    # img text
        img_truth = img_data['truth']   # img ground truth (tokenized string) 

        # create source and destination file names
        id_pc = img_id.split("-")
        source_img_file = source_dir + "/" +id_pc[0] + "/" + id_pc[0] + "-" + id_pc[1] + "/" + img_file
        destination_img_file = destination_images_dir  + "/" + img_file

        destination_label_file = destination_labels_dir + "/" + label_file
        destination_text_file = destination_text_dir + "/" + text_file

        lines_list.append(img_id)

        # copy source to destination
        copy2(source_img_file, destination_img_file)

        # write label file (ground truth)
        with open(destination_label_file, "w") as f_label:
            f_label.write(img_truth)

        # write text file (img text)
        with open(destination_text_file, "w") as f_text:
            f_text.write(img_text)

    # write list file (any old list file was already removed by clean_folder)
    with open(destination_list_file, 'w') as fout:
        for line in lines_list:
            fout.write(line + "\n")

def prepare_deslanted_data_files(df, source_dir, destination_dir):
    """Copy deslanted png to destination and create label, txt, list files"""
    destination_images_dir = destination_dir + '/Images'
    destination_labels_dir = destination_dir + '/Labels'
    destination_text_dir = destination_dir + '/Text'
    destination_list_file = destination_dir + '/list'
    
    clean_folder(destination_dir)
    clean_folder(destination_images_dir)
    clean_folder(destination_labels_dir)
    clean_folder(destination_text_dir)
    
    from shutil import copy2

    lines_list = []

    for i in range(df.shape[0]):    

        # read data from each row
        img_data = df.iloc[i]            # each line in dataframe

        img_id = img_data['id']         # data for file
        img_file = img_id + '.png'      # img file_name
        label_file = img_id + '.tru'    # label file_name
        text_file = img_id + '.txt'     # text file_name

        img_text = img_data['label']    # img text
        img_truth = img_data['truth']   # img ground truth (tokenized string) 

        # create source and destination file names
        # (the code expects the deslanted images directly inside source_dir, without sub-folders)
        source_img_file = source_dir + "/" + img_file
        destination_img_file = destination_images_dir  + "/" + img_file

        destination_label_file = destination_labels_dir + "/" + label_file
        destination_text_file = destination_text_dir + "/" + text_file

        lines_list.append(img_id)

        # copy source to destination
        copy2(source_img_file, destination_img_file)

        # write label file (ground truth)
        with open(destination_label_file, "w") as f_label:
            f_label.write(img_truth)

        # write text file (img text)
        with open(destination_text_file, "w") as f_text:
            f_text.write(img_text)

    # write list file (any old list file was already removed by clean_folder)
    with open(destination_list_file, 'w') as fout:
        for line in lines_list:
            fout.write(line + "\n")

### Define source and destination folders

In [23]:
# directories for the original IAM data
source_dir = './lines'
train_destination_dir = './train'
val_destination_dir = './val'
test_destination_dir = './test'

# directories for the deslanted IAM data
source_dir_deslanted = './lines_deslanted'
train_destination_dir_deslanted = './train_deslanted'
val_destination_dir_deslanted = './val_deslanted'
test_destination_dir_deslanted = './test_deslanted'

### Copy the "original" IAM lines data to destination

In [24]:
prepare_data_files(df_train, source_dir, train_destination_dir)
prepare_data_files(df_val, source_dir, val_destination_dir)
prepare_data_files(df_test, source_dir, test_destination_dir)

Successfully removed 0 files from ./train
Successfully removed 0 files from ./train/Images
Successfully removed 0 files from ./train/Labels
Successfully removed 0 files from ./train/Text
Successfully removed 0 files from ./val
Successfully removed 0 files from ./val/Images
Successfully removed 0 files from ./val/Labels
Successfully removed 0 files from ./val/Text
Successfully removed 0 files from ./test
Successfully removed 0 files from ./test/Images
Successfully removed 0 files from ./test/Labels
Successfully removed 0 files from ./test/Text


### Copy the "deslanted" IAM lines data to destination

In [25]:
prepare_deslanted_data_files(df_train, source_dir_deslanted, train_destination_dir_deslanted)
prepare_deslanted_data_files(df_val, source_dir_deslanted, val_destination_dir_deslanted)
prepare_deslanted_data_files(df_test, source_dir_deslanted, test_destination_dir_deslanted)

Successfully removed 0 files from ./train_deslanted
Successfully removed 0 files from ./train_deslanted/Images
Successfully removed 0 files from ./train_deslanted/Labels
Successfully removed 0 files from ./train_deslanted/Text
Successfully removed 0 files from ./val_deslanted
Successfully removed 0 files from ./val_deslanted/Images
Successfully removed 0 files from ./val_deslanted/Labels
Successfully removed 0 files from ./val_deslanted/Text
Successfully removed 0 files from ./test_deslanted
Successfully removed 0 files from ./test_deslanted/Images
Successfully removed 0 files from ./test_deslanted/Labels
Successfully removed 0 files from ./test_deslanted/Text
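
After copying, it can be useful to confirm that every id in a `list` file has its image, label, and text file. The sketch below is one possible check and is not part of the original notebook; `check_split_folder` is a hypothetical helper that assumes the folder layout created in step 6.

```python
import os


def check_split_folder(destination_dir):
    """Return (number of ids in `list`, [(id, subfolder) pairs whose file is missing])."""
    with open(os.path.join(destination_dir, "list")) as fin:
        ids = [line.strip() for line in fin if line.strip()]

    # Expected file extension in each sub-folder
    expected = {"Images": ".png", "Labels": ".tru", "Text": ".txt"}
    missing = []
    for img_id in ids:
        for subdir, ext in expected.items():
            if not os.path.isfile(os.path.join(destination_dir, subdir, img_id + ext)):
                missing.append((img_id, subdir))
    return len(ids), missing
```

For example, `check_split_folder('./train')` should report as many ids as there are rows in `df_train` and an empty list of missing files if the copy completed.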
