# MakeGroundTruth

This notebook creates a set of PDFs and a corresponding CSV file. The CSV contains exactly the text that was used to generate the PDFs.

Due to space considerations, the output of this notebook will not be tracked in Git or GitHub. It will, however, be available via Google Drive.

## Selection of Languages

The languages used here are the top eleven languages when ordered by estimated literate population. Languages were ranked by the estimated number of people who can write them rather than the number who can speak them, because we are interested in analyzing _writing_, not speech. These languages are:
1. Chinese (Mandarin)
2. English
3. Spanish
4. Hindi
5. Arabic
6. French
7. Russian
8. Portuguese
9. Japanese
10. Bengali
11. German

This covers a variety of scripts: Chinese, Latin, Devanagari, Arabic, Cyrillic, Hiragana/Katakana, and Bengali. Furthermore, multiple variations of Latin, one of the most important scripts, will appear, including varied diacritics.

Hebrew is not among these languages. However, given the methods that we are using, we can infer that if they handle the aforementioned scripts correctly, they are likely to handle Hebrew correctly as well.

Although Urdu appears in the source of this list, it is excluded here: it is written in a Perso-Arabic script rather than Devanagari, so it is treated as a language distinct from Hindi, and it is less widely used than Hindi.


The above list is sourced from [this PDF](https://journal.lib.uoguelph.ca/index.php/perj/article/view/826/1358). This might not be a fully reliable source, and the numbers it provides are known to be only approximate. However, it is transparent about its sources (primarily the CIA) and about the assumptions and approximations used in synthesizing its estimates. I propose that we regard the data as sufficiently trustworthy and accurate for the purpose of deciding which languages to include in our benchmark.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/My Drive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/benchmark
!ls

/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/benchmark
Examples.ipynb	      MakeGroundTruth.ipynb  RandomRenderer.py
fonts		      pdf_make.py	     text_chunk.py
ground_truth_scratch  __pycache__	     web_walk.py


Full credit to [this article](https://towardsdatascience.com/introduction-to-googles-compact-language-detector-v3-in-python-b6887101ae47) for the following cell.

In [3]:
!apt-get install -y --no-install-recommends g++ protobuf-compiler libprotobuf-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libprotobuf-dev is already the newest version (3.0.0-9.1ubuntu1).
protobuf-compiler is already the newest version (3.0.0-9.1ubuntu1).
g++ is already the newest version (4:7.4.0-1ubuntu2.3).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.


The following "pip install" cells are needed for Google Colab. On a local machine, it is better to install these packages from the command line inside a virtual environment.

In [4]:
!pip install gcld3



In [5]:
!pip install fonttools



In [29]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.18.17-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.18.17


In [30]:
import web_walk, text_chunk, pdf_make
from RandomRenderer import RandomRenderer
from numpy.random import default_rng
import os
import re
import time
import pandas as pd
import collections

In [7]:
OUTPUT_DIR = os.path.join('.', 'ground_truth0')
PAGES_DIR  = os.path.join(OUTPUT_DIR, 'pages')
PAGES_CSV  = os.path.join(OUTPUT_DIR, 'page.csv')

## Collection of Text

I aim to collect about 1000 pages of text from Wikipedia in each of these 11 languages, which is about 3 million characters per language. This will give on the order of 10<sup>4</sup> pages of text in total.

It is worth noting that this is not necessarily an even split in terms of the amount of information from each language, or the number of tokens. For instance, about four or five times as many tokens of Chinese will be collected as tokens of English, since Chinese provides roughly one token per character whereas English provides about one-fifth of a token per character. Furthermore, each character of Chinese carries more information because it could take many different values, whereas each character of English typically could take one of only a few dozen values.
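The rough arithmetic behind this comparison can be sketched as follows. The per-character token rates are the approximations stated above, not measured values, and `estimated_tokens` is a hypothetical helper introduced only for illustration.

```python
# Approximate tokens per character, taken from the estimates in the text above.
TOKENS_PER_CHAR = {'Chinese': 1.0, 'English': 1 / 5}
CHARS_PER_LANGUAGE = 3_000_000  # target number of characters per language


def estimated_tokens(language):
    """Return a rough token count for the text collected in `language`."""
    return CHARS_PER_LANGUAGE * TOKENS_PER_CHAR[language]


ratio = estimated_tokens('Chinese') / estimated_tokens('English')
print('Chinese: ~{:.0f} tokens'.format(estimated_tokens('Chinese')))
print('English: ~{:.0f} tokens'.format(estimated_tokens('English')))
print('Ratio: ~{:.0f}x'.format(ratio))
```

Under these assumptions the ratio sits at the upper end of the "four or five times" range quoted above.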

The ISO 639-1 language codes associated with each of these languages are as follows:

In [8]:
language_codes = {
  'Chinese': 'zh',
  'English': 'en',
  'Spanish': 'es',
  'Hindi': 'hi',
  'Arabic': 'ar',
  'French': 'fr',
  'Russian': 'ru',
  'Portuguese': 'pt',
  'Japanese': 'ja',
  'Bengali': 'bn',
  'German': 'de'
}

Fortunately, the BCP-47-style language codes are consistent with the ISO language codes. (The documentation for [Google's language detector](https://github.com/google/cld3) refers to Bengali as Bangla, but that means the same as Bengali.)

In [9]:
def get_text_for_language(langcode):
  return web_walk.web_walk(
    start=web_walk.wikipedia_about_page(langcode),
    desired_text_len=int(3e6), # As explained above, we desire 3 million characters.
    rng=default_rng(1539), # 1539 is a random seed chosen based on the time of
                           # day. It has no special significance.
    language=langcode,
    websites={web_walk.wikipedia(langcode)},
    fringe_size=5, # A smaller value will allow more variation in the subject
                   # matter of the sample text, but it will also increase the
                   # risk of failing to collect the desired quantity of text.
    url_resolver=web_walk.get_query_string_remover(web_walk.get_prefixer(
        web_walk.wikipedia(langcode)
    )),
    verbose=True
  )
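The `url_resolver` argument above composes two helpers from `web_walk` whose implementations are not shown in this notebook. As a rough illustration only, the following sketch shows one way a resolver with similar intent could be built from the standard library. The name `make_resolver` is hypothetical and is not part of `web_walk`.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def make_resolver(base):
    """Return a function that makes links absolute and drops query strings."""
    def resolve(link):
        # urljoin handles relative paths, absolute URLs, and protocol-relative
        # links such as '//zh.wikipedia.org/wiki/...'.
        absolute = urljoin(base, link)
        scheme, netloc, path, _query, _fragment = urlsplit(absolute)
        # Discard the query string and fragment so that variants of the same
        # page map to a single URL.
        return urlunsplit((scheme, netloc, path, '', ''))
    return resolve


resolve = make_resolver('https://zh.wikipedia.org')
print(resolve('/wiki/Wikipedia:About?action=edit'))
print(resolve('//zh.wikipedia.org/wiki/Wikipedia:About'))
```

A resolver built on `urljoin` treats `//host/path` links as protocol-relative, which may help avoid doubled hosts such as the `https://zh.wikipedia.org//zh.wikipedia.org/...` URL seen in the crawl log further down.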

Based on some feedback on an initial version of this workflow, we want text that is representative of natural language, in the strict sense of the word. That means that extremely short lines of text, such as single words or phrases as might appear in a table of contents, probably are not desired. Obviously, CSS and JavaScript should be omitted as well.

By default, `web_walk` will filter out the CSS and JavaScript. Extremely short lines will also be removed as part of the default behavior of the text chunker.
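As an illustration of the short-line filtering described above, here is a minimal sketch. The function `drop_short_lines` is hypothetical; the threshold of 30 mirrors the `min_line_len` default reported by `help(text_chunk.get_chunker)`, but the actual logic in `text_chunk` may differ.

```python
def drop_short_lines(text, min_line_len=30):
    """Keep only lines whose stripped length is at least `min_line_len`."""
    kept = [
        line for line in text.split('\n')
        if len(line.strip()) >= min_line_len
    ]
    return '\n'.join(kept)


sample = (
    'Contents\n'
    'History\n'
    'Miletus was founded by Greeks at the start of the 11th century BC.\n'
    'See also'
)
# Only the full sentence survives; the heading-like fragments are dropped.
print(drop_short_lines(sample))
```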

Warning: The following cell takes about 15 minutes to execute.

In [10]:
all_text = ''
for language, langcode in language_codes.items():
  print("Searching for {} language text...".format(language))
  new_text = get_text_for_language(langcode)
  print("Collected {} characters of {} text.\n".format(len(new_text), language))
  all_text += new_text

Searching for Chinese language text...
Visiting https://zh.wikipedia.org/wiki/Wikipedia:About...
Visiting https://zh.wikipedia.org/wiki/Category:%E8%A2%AB%E5%8D%8A%E4%BF%9D%E6%8A%A4%E7%9A%84%E9%A1%B5%E9%9D%A2...
Visiting https://zh.wikipedia.org/wiki/Special:%E7%BB%9F%E8%AE%A1...
Visiting https://zh.wikipedia.org/wiki/%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91...
Visiting https://zh.wikipedia.org/wiki/Wikipedia:%E8%81%94%E7%BB%9C%E6%88%91%E4%BB%AC...
Visiting https://zh.wikipedia.org/wiki/Wikipedia:CC_BY-SA_3.0%E5%8D%8F%E8%AE%AE%E6%96%87%E6%9C%AC...
Visiting https://zh.wikipedia.org/wiki/%E4%BC%8A%E4%B8%B9%E7%B7%9A...
Visiting https://zh.wikipedia.org/wiki/%E7%91%9E%E5%85%B8%E8%AF%AD...
Visiting https://zh.wikipedia.org/wiki/Wikipedia:%E5%88%97%E6%98%8E%E6%9D%A5%E6%BA%90...
Visiting https://zh.wikipedia.org/wiki/File:Wikipedia_old_logo2.png...
Visiting https://zh.wikipedia.org/wiki/Bunga_Citra_Lestari...
Visiting https://zh.wikipedia.org//zh.wikipedia.org/wiki/Wikipedia:CC-

The output indicates that we have indeed collected an appropriate number of characters (just over three million) for each language.

In [11]:
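# Collapse each run of newlines, plus any whitespace that follows it, into a
# single newline. This removes blank lines and leading indentation.
all_text = re.sub(r'\n+\s*', '\n', all_text)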
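To make the effect of this substitution concrete, here is the same pattern applied to a small made-up sample string.

```python
import re

sample = 'First line\n\n\n   Second line\n\t\nThird line'
cleaned = re.sub(r'\n+\s*', '\n', sample)
print(repr(cleaned))  # 'First line\nSecond line\nThird line'
```

Note that the pattern only consumes whitespace that comes after a newline, so trailing spaces at the end of a line are left in place.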

## Segmentation of Text

The next task is to segment text. This serves two purposes:
* It makes the text fit onto pages. This will not happen perfectly. The font size is normally distributed so that in principle it can be arbitrarily large, and some languages (Chinese, for example) may have wider characters than others, so there may be pages in which not all of the characters fit. This is not ideal, but it is okay: the important thing is to ensure that no bias is created toward one OCR system over another.
* It permutes the text so that occasionally, multiple different languages will appear on the same page.
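The two purposes above can be sketched in simplified form as follows. This is not the implementation in `text_chunk`: the function names are hypothetical, sections here have a fixed length (whereas the real chunker is parameterized by a mean section length), and the standard library's `random` module stands in for the NumPy generator.

```python
import random


def permute_sections(text, section_len, seed):
    """Cut `text` into fixed-length sections and shuffle them reproducibly."""
    sections = [text[i:i + section_len] for i in range(0, len(text), section_len)]
    random.Random(seed).shuffle(sections)
    return ''.join(sections)


def wrap_into_pages(text, max_line_len=60, max_page_len=30):
    """Hard-wrap `text` and group the resulting lines into pages."""
    lines = []
    for paragraph in text.split('\n'):
        # Break long lines at a fixed width, even in the middle of a word.
        for start in range(0, len(paragraph), max_line_len):
            lines.append(paragraph[start:start + max_line_len])
    return [
        '\n'.join(lines[start:start + max_page_len])
        for start in range(0, len(lines), max_page_len)
    ]


pages_demo = wrap_into_pages('x' * 150 + '\n' + 'y' * 10, max_line_len=60, max_page_len=2)
print(len(pages_demo))
```

Shuffling sections before wrapping is what lets text from different languages occasionally land on the same page.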

In [12]:
help(text_chunk.get_chunker)

Help on function get_chunker in module text_chunk:

get_chunker(rng: numpy.random._generator.Generator, min_line_len: int = 30, max_line_len: int = 60, max_page_len: int = 30, mean_section_len: float = 15000) -> Callable[[str], str]
    Returns a text chunker that takes in text and produces pages of
    chunked and permuted text.
    :param rng: The random number generator that determines the behavior
        of the returned chunker
    :param max_line_len: The maximum number of characters per line
    :param mean_section_len: The mean number of characters per section of
        contiguous text



The initial version of the section breaker used all available RAM because its implementation had the wrong space complexity. It was rewritten because having the wrong space complexity in an application that is supposed to scale is a bug, not an imperfection.

In [13]:
pages = text_chunk.get_chunker(
    default_rng(1020),
    max_line_len=60,
    max_page_len=30,
    mean_section_len=15000
)(all_text)
print('Split text into {} pages.'.format(len(pages)))
for page in pages[:10]:
  print(page)

Split text into 20010 pages.
recueillis par le roi Arion, qui élève l'enfant comme le sie
n[2].
Naissance de la cité[modifier | modifier le code]
Milet est fondée par des Grecs au début du XIe siècle av. J.
-C. (entre 1086 et 1085 av. J.-C. selon la Chronique de Paro
s). Cela en fait l'une des plus vieilles cité-État grecque d
'Ionie avec Clazomènes, fondée à peu près à la même époque. 
On sait très peu de choses sur cette période de l'histoire d
e la cité. À cette époque, elle est vraisemblablement gouver
née par un roi. 
Liste des tyrans de Milet[modifier | modifier le code]
La royauté finit par faire place à un régime oligarchique. L
a tyrannie arrive au VIIe siècle av. J.-C.. 
610 av. J.-C.-… : Thrasybule de Milet.
…-514 av. J.-C. : Histiée († 493 av. J.-C.). Il règne sous l
a suzeraineté de l'empire achéménide.
En -514, Histiée accompagne son suzerain Darius Ier lors d'u
ne expédition en Thrace contre les Scythes. Lors du retour, 
Histiée est emmené à Suse comme conseiller du gran

## Generation of Images

Each image is saved to the drive as soon as it is rendered and is then left to be garbage collected, so that thousands of images are never held in memory at once.

In [14]:
renderer = RandomRenderer(
    default_rng(1142),
    size=(800, 800),
    top_left=(20, 20),
    fonts_dir='./fonts',
    orientation_dist=(1/4, 1/4, 1/4, 1/4),
    fontsize_mean=14,
    fontsize_std=3,
    background_color_means=(230, 230, 230),
    background_color_stds=(1.5, 1.5, 1.5),
    foreground_color_means=(25, 25, 25),
    foreground_color_stds=(1.5, 1.5, 1.5)
)
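To give a feel for what these parameters mean, the sketch below draws one set of rendering choices from distributions like those configured above. It is an assumption-laden stand-in, not the code in `RandomRenderer`: the function `sample_render_params` is hypothetical, the mapping of `orientation_dist` to rotations of 0, 90, 180, and 270 degrees is a guess, and the standard library's `random` module replaces the NumPy generator.

```python
import random


def sample_render_params(rng, fontsize_mean=14, fontsize_std=3,
                         color_means=(230, 230, 230),
                         color_stds=(1.5, 1.5, 1.5)):
    """Draw one hypothetical set of rendering parameters for a page."""
    # Four orientations with equal probability (assumed to be rotations).
    orientation = rng.choices([0, 90, 180, 270], weights=[1/4, 1/4, 1/4, 1/4])[0]
    # Font size is normally distributed; clamp to at least 1 so it stays valid.
    fontsize = max(1, round(rng.gauss(fontsize_mean, fontsize_std)))
    # Each color channel is drawn independently and clipped to the 0-255 range.
    background = tuple(
        min(255, max(0, round(rng.gauss(mean, std))))
        for mean, std in zip(color_means, color_stds)
    )
    return {'orientation': orientation, 'fontsize': fontsize, 'background': background}


params = sample_render_params(random.Random(1142))
print(params)
```

Because the font size is unbounded above, an occasional page will use a font too large for all of its text to fit, as noted in the segmentation section.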

Warning: The following takes several hours, and it produces over a gigabyte of images.

In [15]:
os.makedirs(PAGES_DIR, exist_ok=True)
t0 = time.time()
for i, page in enumerate(pages, start=1):
  renderer.render(page).save(os.path.join(PAGES_DIR, 'page{}.png'.format(i)))
  if i % 100 == 0:
    print('Rendered {} out of {} pages after {:.2f} hours. {:.1f}% complete.'.format(
      i, len(pages), (time.time() - t0) / 3600, (i / len(pages) * 100)
    ))

Rendered 100 out of 20010 pages after 0.05 hours. 0.5% complete.
Rendered 200 out of 20010 pages after 0.07 hours. 1.0% complete.
Rendered 300 out of 20010 pages after 0.10 hours. 1.5% complete.
Rendered 400 out of 20010 pages after 0.11 hours. 2.0% complete.
Rendered 500 out of 20010 pages after 0.13 hours. 2.5% complete.
Rendered 600 out of 20010 pages after 0.14 hours. 3.0% complete.
Rendered 700 out of 20010 pages after 0.15 hours. 3.5% complete.
Rendered 800 out of 20010 pages after 0.17 hours. 4.0% complete.
Rendered 900 out of 20010 pages after 0.19 hours. 4.5% complete.
Rendered 1000 out of 20010 pages after 0.20 hours. 5.0% complete.
Rendered 1100 out of 20010 pages after 0.21 hours. 5.5% complete.
Rendered 1200 out of 20010 pages after 0.22 hours. 6.0% complete.
Rendered 1300 out of 20010 pages after 0.24 hours. 6.5% complete.
Rendered 1400 out of 20010 pages after 0.26 hours. 7.0% complete.
Rendered 1500 out of 20010 pages after 0.27 hours. 7.5% complete.
Rendered 1600 out o

In [16]:
for page in default_rng(0).choice(pages, 100):
  print(page)

ly and Transport (BEST)। ২০০৬-০৭-১৮ তারিখে মূল থেকে আর্কাইভ 
করা। সংগ্রহের তারিখ ২০০৬-১০-১২। 
↑ "সংরক্ষণাগারভুক্ত অনুলিপি"। ১০ জুন ২০১০ তারিখে মূল থেকে আর
্কাইভ করা। সংগ্রহের তারিখ ২০ মার্চ ২০১০। 
↑ "Bus Transport Profile"। Brihanmumbai Electric Supply and 
Transport (BEST)। ২০০২-০৬-২৮ তারিখে মূল থেকে আর্কাইভ করা। সং
গ্রহের তারিখ ২০০৯-০৮-২৮। 
↑ Tembhekar, Chittaranjan (২০০৮-০৮-০৪)। "MSRTC to make long 
distance travel easier"। The Times of India। সংগ্রহের তারিখ 
২০০৯-০৬-১৪। 
↑ "MSRTC adds Volvo luxury to Mumbai trip"। The Times of Ind
ia। ২০০২-১২-২৯। সংগ্রহের তারিখ ২০০৯-০৬-১৪। 
↑ Seth, Urvashi (২০০৯-০৩-৩১)। "Traffic claims Mumbai darshan
 hot spots"। MiD DAY। সংগ্রহের তারিখ ২০০৯-০৬-১৪। 
↑ "Bus Routes Under Bus Rapid Transit System" (PDF)। Brihanm
umbai Electric Supply and Transport (BEST)। পৃষ্ঠা 5। ২০০৯-০
১-২৬ তারিখে মূল (PDF) থেকে আর্কাইভ করা। সংগ্রহের তারিখ ২০০৯-
০৩-২৩। 
↑ Khanna, Gaurav। "7 Questions You Wanted to Ask About the M
umbai Metro"। Businessworld। ২০০৯-০৬-২৫ তারিখে মূল থ

Regrettably, not all of the images rendered perfectly. For example, Japanese pages might have some characters that did not render, and lines of text may extend slightly past the edge of the page. This might not hinder us from making unbiased comparisons between OCR methods, however -- it could merely exaggerate their deficiencies when handling a language such as Japanese.

## Saving of Pages

In [26]:
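# Note: 'page' is 0-based here, whereas the image files are named starting
# from page1.png, so row k corresponds to page{k + 1}.png.
data = {'page': list(range(len(pages))), 'text': pages}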
for choice in renderer.choices:
  data[choice] = renderer.choices[choice]
pd.DataFrame(data=data).to_csv(PAGES_CSV, index=False)
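When reading the CSV back, keep in mind that the `page` column is built from `range(len(pages))` and is therefore 0-based, while the image files were named starting from `page1.png`. The sketch below shows one way to pair each row with its image. The helper `image_name_for_row` is hypothetical, and a tiny in-memory CSV stands in for the real `page.csv`.

```python
import csv
import io


def image_name_for_row(page_index):
    """Map a 0-based CSV page index to the 1-based image file name."""
    return 'page{}.png'.format(page_index + 1)


# A tiny in-memory stand-in for page.csv with the 'page' and 'text' columns.
demo_csv = io.StringIO('page,text\n0,"first page\nsecond line"\n1,second page\n')
ground_truth = {
    image_name_for_row(int(row['page'])): row['text']
    for row in csv.DictReader(demo_csv)
}
print(ground_truth['page1.png'])
```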

## Conversion of Images to PDFs

To keep the PDFs to a manageable size, I have chosen to make 20 of them instead of creating a single 20,010-page PDF. They will have 1,001 pages each, except for the last one, which will have 991 pages.

Warning: The following cell may take roughly 15 minutes to execute.

If interrupted, it can be rerun and will resume where it left off, because images that have already been moved and PDFs that already exist are skipped.

In [35]:
max_len = len(pages) // 20 + 1
for i, (start, end) in enumerate(zip(
    range(0, len(pages), max_len),
    range(max_len, len(pages) + max_len, max_len)
)):
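  if end > len(pages):
    # `end` is exclusive in range(start, end), so clamp it to len(pages);
    # clamping to len(pages) - 1 would leave the final page out of the last PDF.
    end = len(pages)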
  group_dir = os.path.join(PAGES_DIR, 'group{}'.format(i))
  os.makedirs(group_dir, exist_ok=True)
  for file in ('page{}.png'.format(j + 1) for j in range(start, end)):
    target = os.path.join(group_dir, file)
    if not os.path.exists(target):
      os.rename(os.path.join(PAGES_DIR, file), target)
  target = os.path.join(PAGES_DIR, 'group{}.pdf'.format(i))
  if not os.path.exists(target):
    pdf_make.pdf_from_images(group_dir, target)

Saving PDF as ./ground_truth0/pages/group19.pdf...
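The page counts quoted above (1,001 pages per PDF, 991 for the last) can be checked with a short calculation. The helper `group_bounds` is hypothetical and treats each `end` index as exclusive.

```python
def group_bounds(n_pages, n_groups=20):
    """Return (start, end) index pairs, end exclusive, for each PDF group."""
    max_len = n_pages // n_groups + 1
    return [
        (start, min(start + max_len, n_pages))
        for start in range(0, n_pages, max_len)
    ]


bounds = group_bounds(20010)
sizes = [end - start for start, end in bounds]
print(len(bounds), sizes[0], sizes[-1])  # 20 1001 991
```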


## Conclusion

We now have two things that can seldom be found together: large PDFs consisting of images, and text that is known with absolute certainty to be the text from which those images were generated. All that remains is to OCR these PDFs and compare the results with the generated text. These results may not be fully realistic due to the synthetic nature of this dataset, but the dataset is accurate, large, and cheap to acquire, and it should serve as an indicator of the relative performance of the different OCR methods available to us.
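As one possible way to carry out that comparison, the sketch below computes a character error rate from the Levenshtein edit distance. This metric is an assumption on my part, not something this notebook prescribes, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth, ocr_output):
    """Edit distance normalized by the length of the ground-truth text."""
    if not truth:
        return 0.0 if not ocr_output else 1.0
    return edit_distance(truth, ocr_output) / len(truth)


# One wrong character out of sixteen.
print(character_error_rate('Milet est fondée', 'Milet est fondee'))  # 0.0625
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.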