In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from os import listdir
import unicodedata
import random

## Example: adding characters

Christine Roughan (2019)

The text I want to OCR uses several characters that are missing from, or rare in, the sample text I will use to generate artificial training data. This notebook therefore adds the following characters:

- alef with hamza above and below
- commas
- colons
- semicolons
- dashes
- quotation marks
- brackets
- Arabic-Indic numerals (٠-٩)
- asterisks (\*)
- parenthetical mathematical labels (ا ب ج)
- triple dot (∴, the "therefore" sign)
- plus
- minus

The first cell below lists the characters originally present in the sample text, along with their counts. The second cell randomly inserts the characters listed above into the sample text. The third cell prints the new character list and counts.

After the sample text goes through this notebook it will no longer make sense, since new material appears in it at random. This is not a problem: OCR training data does not need to make grammatical or logical sense.
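
One way to confirm which of the characters listed above are actually missing is to compare them against the set of characters in the sample text. The sketch below is only an illustration and not part of the original notebook: the helper name `missing_characters`, the short `sample` string, and the `required` list are assumptions made for demonstration. In practice `sample` would be the contents of `arabic.txt`.

```python
import unicodedata

def missing_characters(text, required):
    """Return the required characters that never occur in text, in order."""
    present = set(text)
    return [ch for ch in required if ch not in present]

# Hypothetical stand-in for the sample text
sample = "قال ابن سينا [في كتابه]."
# Hypothetical subset of the characters this notebook wants to add
required = ['أ', 'إ', '،', ':', '؛', '—', '"', '٠', '*', '+', '-', '∴']

# Print each missing character with its Unicode name
for ch in missing_characters(sample, required):
    print(ch, unicodedata.name(ch))
```

Running the same check on the text produced by the second cell should return an empty list once every desired character has been inserted at least once.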

In [2]:
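# read the sample text; the explicit encoding avoids platform-dependent defaults
with open('arabic.txt', encoding='utf-8') as f:
    arabic = f.read()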

chars = {}
for char in arabic:
    if char not in chars:
        chars[char] = 1
    else:
        chars[char] +=1
        
keys = list(chars.keys())
keys.sort()
for char in keys:
    if char != '\n':
        print(chars[char], '\t', char, unicodedata.name(char))

15264 	   SPACE
9 	 ( LEFT PARENTHESIS
9 	 ) RIGHT PARENTHESIS
98 	 . FULL STOP
59 	 [ LEFT SQUARE BRACKET
59 	 ] RIGHT SQUARE BRACKET
1 	 ؟ ARABIC QUESTION MARK
7 	 ء ARABIC LETTER HAMZA
1 	 آ ARABIC LETTER ALEF WITH MADDA ABOVE
3 	 ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
829 	 ئ ARABIC LETTER YEH WITH HAMZA ABOVE
10335 	 ا ARABIC LETTER ALEF
2406 	 ب ARABIC LETTER BEH
1968 	 ة ARABIC LETTER TEH MARBUTA
1341 	 ت ARABIC LETTER TEH
396 	 ث ARABIC LETTER THEH
1490 	 ج ARABIC LETTER JEEM
875 	 ح ARABIC LETTER HAH
1698 	 خ ARABIC LETTER KHAH
1996 	 د ARABIC LETTER DAL
608 	 ذ ARABIC LETTER THAL
2173 	 ر ARABIC LETTER REH
1677 	 ز ARABIC LETTER ZAIN
1216 	 س ARABIC LETTER SEEN
182 	 ش ARABIC LETTER SHEEN
257 	 ص ARABIC LETTER SAD
282 	 ض ARABIC LETTER DAD
2642 	 ط ARABIC LETTER TAH
202 	 ظ ARABIC LETTER ZAH
1954 	 ع ARABIC LETTER AIN
107 	 غ ARABIC LETTER GHAIN
814 	 ـ ARABIC TATWEEL
1362 	 ف ARABIC LETTER FEH
1744 	 ق ARABIC LETTER QAF
1469 	 ك ARABIC LETTER KAF
4936 	 ل ARABIC LETTER LAM
436

In [3]:
###################################################################
# This cell splits the text into separate words (tokens) and then #
# iterates over the list of tokens. I use the random module to    #
# make the script insert my desired characters at random points   #
# while it iterates through the tokens.                           #
###################################################################

tokens = arabic.split()
new_arabic = ""
punctuation = ['؛','،',':',' —','؛','،',':',' —','.','؛','،',':',' —','؟']
numerals = ['٠','١','٢','٣','٤','٥','٦','٧','٨','٩']
alphabet = ['ا','ب','ت','ث','ج','ح','خ','د','ذ','ر','ز','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ك','ل','م','ن','ه','و','ي']

for token in tokens:
    # add punctuation
    if token[0] == 'و':
        if random.randint(0,100) > 50:
            new_arabic = new_arabic + random.choice(punctuation) + " " + token
        else:
            new_arabic = new_arabic + " " + token
    
    # add hamza
    elif token[0] == 'ا':
        if random.randint(0,100) > 34:
            new_arabic = new_arabic + " " + random.choice(['أ','إ']) + token[1:]
        else:
            new_arabic = new_arabic + " " + token
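    elif token[-1] == 'ا':
        if random.randint(0,100) > 34:
            new_arabic = new_arabic + " " + token + "ء"
        else:
            # keep the token unchanged; without this branch it would be dropped
            new_arabic = new_arabic + " " + token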
    
    # add quotes / brackets
    elif random.randint(0,100) < 10:
        if random.randint(0,1) == 0:
            new_arabic = new_arabic + ' "' + token + '"'
        else:
            new_arabic = new_arabic + " [" + token + "]"
            
    # add parentheticals
    elif random.randint(0,100) < 25:
        choice = random.randint(0,2)
        if choice == 0:
            new_arabic = new_arabic + " (*) " + token
        elif choice == 1:
            new_arabic = new_arabic + " (" + random.choice(numerals) + ") " + token
        elif choice == 2:
            n = random.randint(1,4)
            # draw n labels; the set removes any duplicates
            labels = list({random.choice(alphabet) for _ in range(n)})
            new_arabic = new_arabic + " (" + " ".join(labels) + ") " + token 
    
    # add + - ∴
    elif random.randint(0,100) < 10:
        new_arabic = new_arabic + " " + random.choice(['+','-','∴']) + " " + token
    
    # default
    else:
        new_arabic = new_arabic + " " + token
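
A note on the `punctuation` list in the cell above: its repeated entries act as weights. `random.choice` picks uniformly from the list, so a mark listed three times is three times as likely to be drawn as one listed once. The sketch below shows an equivalent formulation with explicit weights via `random.choices`. The names `marks`, `weights`, and `pick_punctuation` are illustrative assumptions, not part of the original cell. It also uses a seeded generator so that a run can be reproduced, which the original notebook does not do.

```python
import random

# Each distinct mark, paired with how many times it appears in the original list
marks = ['؛', '،', ':', ' —', '.', '؟']
weights = [3, 3, 3, 3, 1, 1]

def pick_punctuation(rng):
    """Draw one punctuation mark with the same odds as the repeated list."""
    return rng.choices(marks, weights=weights, k=1)[0]

rng = random.Random(2019)  # arbitrary seed, chosen only for reproducibility
drawn = [pick_punctuation(rng) for _ in range(10)]
print(drawn)
```

Explicit weights make the intended frequencies easier to read and adjust than duplicated list entries, at the cost of keeping two lists in step.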

In [4]:
chars = {}
for char in new_arabic:
    if char not in chars:
        chars[char] = 1
    else:
        chars[char] +=1
        
keys = list(chars.keys())
keys.sort()
for char in keys:
    try:
        print(chars[char], '\t', char, unicodedata.name(char))
    except ValueError:
        # unicodedata.name() raises ValueError for characters without a name
        pass

19123 	   SPACE
1022 	 " QUOTATION MARK
2338 	 ( LEFT PARENTHESIS
2338 	 ) RIGHT PARENTHESIS
797 	 * ASTERISK
251 	 + PLUS SIGN
211 	 - HYPHEN-MINUS
128 	 . FULL STOP
109 	 : COLON
571 	 [ LEFT SQUARE BRACKET
571 	 ] RIGHT SQUARE BRACKET
121 	 ، ARABIC COMMA
119 	 ؛ ARABIC SEMICOLON
31 	 ؟ ARABIC QUESTION MARK
706 	 ء ARABIC LETTER HAMZA
1 	 آ ARABIC LETTER ALEF WITH MADDA ABOVE
1345 	 أ ARABIC LETTER ALEF WITH HAMZA ABOVE
1 	 ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
1379 	 إ ARABIC LETTER ALEF WITH HAMZA BELOW
826 	 ئ ARABIC LETTER YEH WITH HAMZA ABOVE
7178 	 ا ARABIC LETTER ALEF
2434 	 ب ARABIC LETTER BEH
1968 	 ة ARABIC LETTER TEH MARBUTA
1386 	 ت ARABIC LETTER TEH
465 	 ث ARABIC LETTER THEH
1541 	 ج ARABIC LETTER JEEM
943 	 ح ARABIC LETTER HAH
1723 	 خ ARABIC LETTER KHAH
2031 	 د ARABIC LETTER DAL
637 	 ذ ARABIC LETTER THAL
2217 	 ر ARABIC LETTER REH
1720 	 ز ARABIC LETTER ZAIN
1264 	 س ARABIC LETTER SEEN
247 	 ش ARABIC LETTER SHEEN
324 	 ص ARABIC LETTER SAD
340 	 ض ARABIC LETTER DAD
2