# Testing `Tesseract` OCR for Comics
> Accuracy Enhancements for OCR in `PanelCleaner`


## Settings for Google Colab

When `DEV_INSTALL` is `True`, PanelCleaner is installed in editable mode from a copy on Google Drive; otherwise the `testbed` branch is installed from GitHub. This setting only affects Colab notebooks.

In [1]:
DEV_INSTALL = True

The easiest way to access the source images for the experiments is to mount your Google Drive.


In [2]:
MOUNT_DRIVE = DEV_INSTALL
GDRIVE_MOUNT_POINT = 'drive'


# install (Colab)

In [3]:
import fastcore.all as FC


In [4]:
if FC.IN_COLAB:
  !pip install -q pyngrok


Mount Google Drive

In [5]:
import os
import re
from pathlib import Path

from rich import print as cprint
from rich.text import Text

def info(msg: str):
    text = Text(msg)
    text.stylize("bold red", 0, 6)
    cprint("_" * 10, text, "_" * 10)


if FC.IN_COLAB:
    if MOUNT_DRIVE:
        mnt_point = f"/content/{GDRIVE_MOUNT_POINT}"
        if not Path(mnt_point).exists():
            info("Mounting Google Drive")
            from google.colab import drive

            drive.mount(mnt_point, force_remount=True)


Install **PanelCleaner**

In [6]:
if FC.IN_COLAB:
    info('Installing PanelCleaner')
    if DEV_INSTALL:
        assert MOUNT_DRIVE, "DEV_INSTALL needs a mounted Google Drive"
        info('Installing PanelCleaner from Google Drive')
        os.chdir(f"/content/{GDRIVE_MOUNT_POINT}/MyDrive/Shared/PanelCleaner/")
        !pip install -e .
    else:
        info('Installing PanelCleaner from GitHub')
        !pip install -q git+https://github.com/civvic/PanelCleaner.git@testbed


**PanelCleaner** is a heavyweight package, and **Colab** sometimes refuses (*silently*) to install it. If the cell below raises an error, re-run the cell above; that usually fixes the problem.

In [7]:
import importlib.resources
package_path = importlib.resources.files('pcleaner')
assert package_path.name == 'pcleaner'

os.chdir(package_path/'_testbed')
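
Before re-running the install cell, it can help to confirm whether the package is visible to the interpreter at all. The following is a minimal sketch, not part of PanelCleaner: `is_installed` is a hypothetical helper, and the standard-library package `json` stands in for `pcleaner` so the example runs anywhere.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


# "json" ships with Python, so it stands in for "pcleaner" in this sketch.
print(is_installed("json"))
```

In the notebook you would call `is_installed("pcleaner")` and re-run the install cell when it returns `False`.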

In [8]:
from pcleaner._testbed.testbed.experiments import ExperimentsVisor, CropMethod, OCRExperimentContext


## Tesseract setup

Get current version of Tesseract

In [9]:
out = !tesseract --version  # type: ignore
cprint(out)
if 'tesseract 5.' not in out[0]:
    if 'tesseract 4.' in out[0]:
        cprint('Old Tesseract 4.x is installed. You should uninstall it and install Tesseract 5.x')
    else:
        cprint('You should install Tesseract 5.x')
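
The substring checks above can also be written as a small parser that returns the major version as a number. This is only a sketch: `tesseract_major_version` is a hypothetical helper, and it assumes the first line of `tesseract --version` looks like `tesseract 5.3.0` (some builds print a `v` before the number).

```python
import re
from typing import Optional


def tesseract_major_version(first_line: str) -> Optional[int]:
    """Extract the major version from the first line of `tesseract --version`."""
    match = re.match(r"\s*tesseract\s+v?(\d+)\.", first_line)
    return int(match.group(1)) if match else None


# Sample lines are assumptions about typical output, not captured logs.
for line in ("tesseract 5.3.0", "tesseract 4.1.1", "command not found"):
    print(repr(line), "->", tesseract_major_version(line))
```

A result of `None` means the line was not recognized, which usually indicates that Tesseract is not installed or not on the `PATH`.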


> **NOTE: In the cells below, lines starting with an exclamation mark `!` (`bang`) are commented out; uncomment them if you want to execute the shell commands.**


### Remove Tesseract installation
> If you have the old 4.x version, consider removing it with the following commands.


#### Mac (TBD)

#### Windows (TBD)

#### Ubuntu

In [10]:
# !sudo apt-get remove tesseract-ocr


### Tesseract installation

#### Mac (TBD)

#### Windows (TBD)

#### Ubuntu

The **5.x** release series is available in [another PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5) for Ubuntu **18.04**, **20.04**, and **22.04**.


In [11]:
# !sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5

Refresh the system package cache in case you're still running the old Ubuntu 18.04:

In [12]:
# !sudo apt update

Install the engine:

In [13]:
# !sudo apt install -y tesseract-ocr

and check version:

In [14]:
out = !tesseract --version  # type: ignore
cprint(out)

### Install Tesseract languages

In [15]:
out = !tesseract --list-langs  # type: ignore
tessdata = Path(out[0].split('"')[1])
cprint(f"tessdata path: {tessdata}")
cprint("Installed languages:", [', '.join(sub) for sub in [out[i:i + 15] for i in range(1, len(out), 15)]])
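
The cell above depends on IPython's `!` syntax and on the header line containing a quoted path. As a sketch under those assumptions, the same parsing can be isolated in a plain function that also tolerates a header without a path; `parse_list_langs` is a hypothetical helper, and the sample header is an assumed example of the output format.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_list_langs(lines: List[str]) -> Tuple[Optional[Path], List[str]]:
    """Split `tesseract --list-langs` output into (tessdata path, language codes)."""
    if not lines:
        return None, []
    header, *rest = lines
    parts = header.split('"')
    # The path sits between the first pair of double quotes, when present.
    tessdata = Path(parts[1]) if len(parts) >= 3 else None
    langs = [line.strip() for line in rest if line.strip()]
    return tessdata, langs


# Assumed sample output; the real path depends on the installation.
sample = [
    'List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):',
    "eng",
    "jpn",
    "osd",
]
print(parse_list_langs(sample))
```

Returning `None` for the path, instead of raising `IndexError`, makes it easier to report a clear message when the header format differs.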

#### Install **best** languages and the **jpn_vert** Tesseract language
> These give much better results than the default languages and the default `jpn` language model.


Download from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best).  
Download a model trained for vertical Japanese text, as found in manga, from [here](https://groups.google.com/g/tesseract-ocr/c/FwjSZzoVgeg/m/u-zyFYQiBgAJ).

See [here](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) for the language codes.

> Note: I haven't played much with `jpn` or `jpn_vert`. `manga-ocr` is surely a much better fit, but the comparison can be educational.

Uncomment and execute to download the best language models:


In [16]:
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/osd.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/eng.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/jpn.traineddata

# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/jpn_vert.traineddata
# or
# (use the /raw/ URL; the /blob/ URL returns the GitHub HTML page, not the model file)
# !wget -O jpn_vert.traineddata https://github.com/zodiac3539/jpn_vert/raw/master/jpn_ver5.traineddata

# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/spa.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/fra.traineddata
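
If `wget` is not available, the same downloads can be sketched with the standard library. The URL pattern is the one used in the commands above; `model_url` and `download_model` are hypothetical helper names, and the download call is left commented out because the model files are large.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main"


def model_url(lang: str) -> str:
    """Build the tessdata_best download URL for a language code."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_model(lang: str, dest_dir: Path = Path(".")) -> Path:
    """Download one model into dest_dir and return the local path."""
    target = dest_dir / f"{lang}.traineddata"
    urllib.request.urlretrieve(model_url(lang), str(target))
    return target


for lang in ("osd", "eng", "jpn", "jpn_vert", "spa", "fra"):
    print(model_url(lang))
# download_model("eng")  # uncomment to fetch the file
```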

Move the downloaded models to the tessdata folder (double-check that the `tessdata` variable points to the right folder):


In [17]:
cprint(f"tessdata path: {tessdata}")

In [18]:
# !sudo mv *.traineddata $tessdata
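
For reference, the move can also be sketched in Python. `install_models` is a hypothetical helper; unlike the `sudo mv` command, it only works if the current user can write to the tessdata folder, which is often not the case for system-wide installations.

```python
import shutil
from pathlib import Path
from typing import List


def install_models(src_dir: Path, tessdata_dir: Path) -> List[str]:
    """Move every *.traineddata file from src_dir into tessdata_dir."""
    if not tessdata_dir.is_dir():
        raise NotADirectoryError(f"tessdata folder not found: {tessdata_dir}")
    moved = []
    for model in sorted(src_dir.glob("*.traineddata")):
        shutil.move(str(model), str(tessdata_dir / model.name))
        moved.append(model.name)
    return moved
```

In the notebook, the call would be `install_models(Path("."), tessdata)`. Checking that the target folder exists first guards against moving the models into a mistyped path.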

and remove any models left in the working directory:


In [19]:
# !rm *.traineddata

Check installed languages


In [20]:
cprint(list(filter(lambda x: re.match(r'eng|jpn|jpn_vert|fra|spa', x.name), tessdata.ls())))  # type: ignore
# cprint(pytesseract.get_languages())
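
The regular expression above matches by prefix, so `jpn` also matches `jpn_vert.traineddata`. When the goal is to find out which models are still missing, an exact comparison of file stems is less ambiguous. A minimal sketch, with `missing_languages` as a hypothetical helper:

```python
from pathlib import Path
from typing import Iterable, List


def missing_languages(tessdata_dir: Path, wanted: Iterable[str]) -> List[str]:
    """Return the wanted language codes that have no .traineddata file."""
    installed = {p.stem for p in tessdata_dir.glob("*.traineddata")}
    return [lang for lang in wanted if lang not in installed]
```

In the notebook, `missing_languages(tessdata, ["eng", "jpn", "jpn_vert", "fra", "spa"])` would return an empty list once all the models are in place.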


----
# Tesseract experiments

## Experiment directory

This is the directory where the source images reside (`EXP_DIR/source/`), where the auxiliary images are cached (`EXP_DIR/cache/`), and where the experiment results are saved. You can change the default location here.


In [21]:
EXP_DIR = Path('./experiment')
cprint(f"{'Working dir':>15}: {Path('.').resolve()}\nExperiments dir: {EXP_DIR.resolve()}")
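
The layout described above can be created ahead of time with a few lines of Python. This is a sketch only: `ensure_layout` is a hypothetical helper, and the PanelCleaner testbed may well create these folders on its own.

```python
from pathlib import Path
from typing import Dict


def ensure_layout(exp_dir: Path) -> Dict[str, Path]:
    """Create the source/ and cache/ subfolders of an experiment directory."""
    layout = {name: exp_dir / name for name in ("source", "cache")}
    for folder in layout.values():
        folder.mkdir(parents=True, exist_ok=True)
    return layout
```

Calling `ensure_layout(EXP_DIR)` is safe to repeat, since existing folders are left untouched.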


# Set up ngrok (Colab)

The experiments can generate hundreds of images, and maintaining the **PIL** images in memory is not efficient. All the generated images are cached and visualized on demand through a URL pointing to the local cache. This approach prevents the kernel from being overloaded with **PIL** images, with the front-end responsible for fetching the image and the backend web server (not the kernel) for serving the image in another process. This method is quick and efficient. As an added bonus, the saved notebook remains lean and fit; it doesn't store the Base64 versions of all the output cell images.
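
The idea of serving cached images by URL can be illustrated with the standard library alone. This is a simplified sketch, not the testbed's web server: it runs the server in a background thread of the same process rather than in a separate process, and `serve_cache` and `image_url` are hypothetical names.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def serve_cache(cache_dir: Path, port: int = 0) -> ThreadingHTTPServer:
    """Serve cache_dir on localhost; port 0 lets the OS pick a free port."""
    handler = functools.partial(QuietHandler, directory=str(cache_dir))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def image_url(server: ThreadingHTTPServer, name: str) -> str:
    """Build the URL a front-end would use to fetch a cached image."""
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/{name}"
```

A notebook cell can then display `<img src="...">` HTML pointing at `image_url(server, "page.png")`, so the image bytes never pass through the cell output. Call `server.shutdown()` followed by `server.server_close()` when finished.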

Unfortunately, this approach does not work as is in **Colab**. Google Colab runs on an older Ubuntu 18.04 VM, so all the usual networking challenges with Docker, or whatever VMs Google is using, apply. Google also goes to great lengths to avoid exposing its internal architecture. We have two options:
- Let the Jupyter kernel serve the images itself, which is slow and memory-consuming.
- Use a tunnel to map localhost (server) to whatever IP and port the front-end (the browser you're currently using) is running on. We can use **ngrok** for this, but *ngrok* is a commercial service that has been abused and now requires confirmation the first time the tunnel connects, which can be inconvenient for the user. It also requires the user to open a free account and obtain an auth token.

You choose.

If the notebook is running in Colab and ngrok has been successfully installed and the tunnel has been created, the default setting is `USE_PIL=False`. You can set the environment variable `USE_PIL=True` to force the use of PIL images, but note that in certain circumstances, Colab will complain because the free tiers are usually memory constrained.

If you don't change the default settings and
- the notebook is running locally, it'll serve the images directly without any additional setup.
- the notebook is running in Colab, it'll serve the images through a web server and ngrok.


In [22]:
os.environ['USE_TUNNEL'] = 'True' if FC.IN_COLAB else 'False'
os.environ['USE_PIL'] = 'True' if FC.IN_COLAB and os.environ['USE_TUNNEL'] == 'False' else 'False'


In [23]:
SERVER = None
if os.environ['USE_PIL'].lower() == 'false' and os.environ['USE_TUNNEL'].lower() == 'true':
    import pcleaner._testbed.testbed.web_server as web_server

    SERVER = web_server.setup_ngrok(web_server.WebServerBottle, Path(EXP_DIR))
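
The flag logic of the two cells above can be restated as pure functions, which makes the combinations easier to inspect. This sketch mirrors the defaults shown in the notebook but uses booleans instead of environment strings; `default_flags` and `needs_tunnel_server` are hypothetical names.

```python
from typing import Dict, Optional


def default_flags(in_colab: bool, use_tunnel: Optional[bool] = None) -> Dict[str, bool]:
    """Mirror the notebook defaults: tunnel in Colab, PIL only as a fallback."""
    if use_tunnel is None:
        use_tunnel = in_colab
    use_pil = in_colab and not use_tunnel
    return {"USE_TUNNEL": use_tunnel, "USE_PIL": use_pil}


def needs_tunnel_server(flags: Dict[str, bool]) -> bool:
    """True when images are served by URL through a tunnelled web server."""
    return flags["USE_TUNNEL"] and not flags["USE_PIL"]


for in_colab in (False, True):
    flags = default_flags(in_colab)
    print(in_colab, flags, needs_tunnel_server(flags))
```

With these defaults, PIL images are only used in Colab when the tunnel is explicitly disabled.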


Create the `OCRExperimentContext` object we'll use to manage the experiments.


In [24]:
CONTEXT = OCRExperimentContext('Tesseract', EXP_DIR, server=SERVER)
CONTEXT.show()

Current Configuration:

Locale: System default
Default Profile: Built-in
Saved Profiles:
- victess: /Users/vic/dev/repo/DL-mac/cleaned/victess.conf
- vicmang: /Users/vic/dev/repo/DL-mac/cleaned/vicmang.conf

Profile Editor: cursor
Cache Directory: System default
Default Torch Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt
Default CV2 Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt.onnx
GUI Theme: System default

--------------------

Config file located at: /Users/vic/Library/Application Support/pcleaner/pcleanerconfig.ini
System default cache directory: /Users/vic/Library/Caches/pcleaner


# Test images


Copy your images to the source directory:


In [None]:
cprint((EXP_DIR/'source').resolve())

or download the standard set:


In [None]:
# !gdown --id 18TSXLCYAPxAlUsdHmgAe6FZM5d8K6gcT -O experiment.zip

In [None]:
# !unzip -qn experiment.zip -d .

Check the images are in place

In [25]:
[f"{i:02}: {_.name}" for i,_ in enumerate(CONTEXT.image_paths)]


['00: Action_Comics_1960-01-00_(262).JPG',
 '01: Adolf_Cap_01_008.jpg',
 '02: Barnaby_v1-028.png',
 '03: Barnaby_v1-029.png',
 '04: Buck_Danny_-_12_-_Avions_Sans_Pilotes_-_013.jpg',
 '05: Cannon-292.jpg',
 '06: Contrato_con_Dios_028.jpg',
 '07: Erase_una_vez_en_Francia_02_88.jpg',
 '08: FOX_CHILLINTALES_T17_012.jpg',
 '09: Furari_-_Jiro_Taniguchi_selma_056.jpg',
 '10: Galactus_12.jpg',
 '11: INOUE_KYOUMEN_002.png',
 '12: MCCALL_ROBINHOOD_T31_010.jpg',
 '13: MCCAY_LITTLENEMO_090.jpg',
 '14: Mary_Perkins_On_Stage_v2006_1_-_P00068.jpg',
 '15: PIKE_BOYLOVEGIRLS_T41_012.jpg',
 '16: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1.png',
 '17: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1_K.png',
 '18: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_2.png',
 '19: Spirou_Et_Fantasio_Integrale_06_1958_1959_0025_0024.jpg',
 '20: Strange_Tales_172005.jpg',
 '21: Strange_Tales_172021.jpg',
 '22: Tarzan_014-21.JPG',
 '23: Tintin_21_Les_Bijoux_de_la_Castafiore_page_39.jp

----

In [26]:
tesseract_experiment = ExperimentsVisor(
    CONTEXT,
    image_idx='Strange_Tales_172005.jpg',
)
tesseract_experiment.display()


----

In [None]:
CONTEXT.cleanup_model()

if SERVER is not None:
    SERVER.stop()
    SERVER = None
    os.environ['USE_TUNNEL'] = 'False'
