# Testing `Tesseract` OCR for comics
> Accuracy Enhancements for OCR in `PanelCleaner`


In this notebook, you can test how Tesseract performs with texts from a diverse array of comics, manga, languages, and styles. You can run this notebook locally using Jupyter Lab/Notebook or on any Jupyter-compatible platform like Google Colab or VSCode.

We'll begin by setting up [PanelCleaner](https://github.com/VoxelCubes/PanelCleaner) and the [testbed](https://github.com/civvic/PanelCleaner/tree/testbed) in Colab, though the instructions are applicable to other platforms such as [Kaggle](https://www.kaggle.com/code). We will then verify the Tesseract installation, prime an `ExperimentContext`, and create a visor to experiment with different parameters and configurations.  

**New to Jupyter Notebooks?** If you are not familiar with Jupyter environments, consider exploring the [Introduction to Colab](https://colab.research.google.com/notebooks/intro.ipynb) and the other introductory notebooks provided by Google, which offer a quick guide to using notebooks effectively. The Jupyter project site is also a good place to learn about the notebook interface and the [Jupyter ecosystem](https://jupyter.org/).

# Settings for Google Colab

To efficiently manage the image sources for our experiments, we recommend mounting your Google Drive and storing the experiment files there. If you are not familiar with Colab or Jupyter environments, it's best to leave these settings at their default values to ensure smooth operation.

- Set `MOUNT_DRIVE` to `True` to enable mounting Google Drive in the Colab environment.
This allows the notebook to access files stored in your Google Drive.

- `GDRIVE_MOUNT_POINT` specifies the local directory in Colab where your Google Drive will be mounted.
This acts as the root directory for accessing any files within your Google Drive from the notebook.

- `PANELCLEANER_IN_GDRIVE` specifies the path within your Google Drive where the PanelCleaner project is located.
This path is used to access or store any files related to the PanelCleaner project directly from Google Drive.


In [None]:
MOUNT_DRIVE = True
GDRIVE_MOUNT_POINT = 'drive'
PANELCLEANER_IN_GDRIVE = 'MyDrive/Shared/PanelCleaner'

## Install (Colab)


In [None]:
import fastcore.all as FC
import os
import re
import sys
from pathlib import Path

from rich import print as cprint
from rich.text import Text

def info(msg: str):
    (t := Text(msg)).stylize("bold red", 0, 6)
    cprint("_" * 10, t, "_" * 10)


Mount Google Drive

In [None]:
mnt_point = Path(f"/content/{GDRIVE_MOUNT_POINT}")
if FC.IN_COLAB:
    if MOUNT_DRIVE:
        if not mnt_point.exists():
            info("Mounting Google Drive")
            from google.colab import drive
            drive.mount(str(mnt_point), force_remount=True)


### Install **PanelCleaner**

> We will attempt to use the version of **PanelCleaner** stored in your Google Drive. If it's not available, we'll install it from GitHub.

Note that we specifically require the `testbed` branch of the **PanelCleaner** repository, not the main trunk. This branch contains necessary configurations and experimental features that are crucial for the tests conducted in this notebook.

In [None]:
if FC.IN_COLAB:
    pc_path = mnt_point/PANELCLEANER_IN_GDRIVE
    tb_path = pc_path/'pcleaner/_testbed'
    if tb_path.exists():
        info('Installing PanelCleaner from your Google Drive')
    else:
        info('Installing PanelCleaner from GitHub')
        !rm -rf PanelCleaner
        !git clone -b testbed https://github.com/civvic/PanelCleaner.git
        pc_path = Path('PanelCleaner').absolute()
        tb_path = pc_path/'pcleaner/_testbed'
    assert tb_path.exists(), "PanelCleaner not found"
    os.chdir(tb_path)
    sys.path.insert(0, f"{tb_path}")
    sys.path.insert(0, f"{pc_path}")
    !pip install -q -r requirements-colab.txt


# Prologue

In this section, we import essential components from the `PanelCleaner` testbed. `ExperimentsVisor` is used to manage and visualize the experiments, `CropMethod` defines the cropping strategies for image preprocessing, and `OCRExperimentContext` sets up the context for OCR experiments.

If you're curious about the inner workings of these components, you can explore the notebooks that develop them in the `nbs` folder, or check out the source code they generate in the `testbed` directory. For instance, see [experiments.ipynb](nbs/experiments.ipynb) and [`_testbed/testbed/experiments.py`](testbed/experiments.py) for more details.

In [None]:
from testbed.experiments import ExperimentsVisor, CropMethod, OCRExperimentContext


## Tesseract setup
> This section ensures that Tesseract OCR is correctly installed and configured for our experiments. We require Tesseract version 5.x due to its improved accuracy and features.


> **NOTE:** In the following cells, lines starting with an exclamation mark `!` (also known as a "bang") are shell commands. Uncomment these lines if you wish to execute the commands directly from this notebook.

### Check Current Tesseract Version

In [None]:
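import subprocess

def check_tesseract_version():
    try:
        result = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not on PATH at all.
        cprint("Tesseract is not installed. Please install Tesseract 5.x.")
        return
    # Some older releases print the version banner to stderr, so check both streams.
    banner = result.stdout + result.stderr
    if 'tesseract 5.' in banner or 'tesseract v5.' in banner:
        cprint("Correct version of Tesseract is installed.")
    else:
        cprint("An incorrect version of Tesseract is installed. Please install Tesseract 5.x.")

check_tesseract_version()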
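
The substring check is enough for this notebook. If you need to compare versions more precisely, the first line of the `tesseract --version` banner can be parsed into a tuple. This is a minimal sketch: `parse_tesseract_version` is a hypothetical helper, not part of PanelCleaner, and the sample banner is illustrative rather than captured from a real run.

```python
import re
from typing import Optional, Tuple

def parse_tesseract_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract the numeric version from the output of `tesseract --version`.

    Returns None when no version number is found.
    """
    # The banner usually starts with a line such as "tesseract 5.3.0";
    # some builds prefix the number with "v" (e.g. "tesseract v5.0.0-alpha").
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

# Illustrative banner, not captured from a real run.
sample = "tesseract 5.3.0\n leptonica-1.82.0\n"
version = parse_tesseract_version(sample)
is_supported = version is not None and version >= (5,)
```

Comparing tuples avoids false positives that a plain substring test could produce on unusual banners.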

### Remove Tesseract installation
> If you have the old 4.x version, consider removing it with the following commands.


#### Mac (TBD)

#### Windows (TBD)

#### Ubuntu

In [None]:
# !sudo apt-get remove tesseract-ocr


### Install Tesseract 5.x (if necessary)

#### Mac (TBD)

#### Windows (TBD)

#### Linux (Ubuntu)

The **5.x** release series is available in [a dedicated PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5) for Ubuntu **18.04**, **20.04**, and **22.04**.


In [None]:
# !sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5

Refresh the system package cache (needed if you are still running the old Ubuntu 18.04):

In [None]:
# !sudo apt update

Install the Tesseract engine:

In [None]:
# !sudo apt install -y tesseract-ocr


### Re-check version after installation

In [None]:
check_tesseract_version()


### Install Tesseract languages

In [None]:
out = !tesseract --list-langs  # type: ignore
tessdata = Path(out[0].split('"')[1])
cprint(f"tessdata path: {tessdata}")
cprint("Installed languages:", [', '.join(sub) for sub in [out[i:i + 15] for i in range(1, len(out), 15)]])
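
The cell above relies on the header line of `tesseract --list-langs` containing the tessdata path in double quotes. A slightly more defensive way to parse the same output is sketched below. `parse_list_langs` is a hypothetical helper, and the sample lines only illustrate the expected shape of the output; the path and languages on your system will differ.

```python
from pathlib import Path
from typing import List, Optional, Tuple

def parse_list_langs(lines: List[str]) -> Tuple[Optional[Path], List[str]]:
    """Split `tesseract --list-langs` output into (tessdata path, language codes)."""
    if not lines:
        return None, []
    header, *rest = lines
    # If the header carries no quoted path, report None instead of raising.
    parts = header.split('"')
    tessdata_dir = Path(parts[1]) if len(parts) >= 3 else None
    langs = [line.strip() for line in rest if line.strip()]
    return tessdata_dir, langs

# Illustrative output lines.
sample = [
    'List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):',
    "eng",
    "jpn_vert",
    "osd",
]
tessdata_dir, langs = parse_list_langs(sample)
# Which of the models used in this notebook are still missing?
missing = sorted({"eng", "jpn", "jpn_vert"} - set(langs))
```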

#### Install **best** languages and the **jpn_vert** Tesseract language
> To get better results than with the default languages and the `jpn` language model.


Download from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best).  
A model trained for vertical Japanese text, as found in manga, can be downloaded from [here](https://groups.google.com/g/tesseract-ocr/c/FwjSZzoVgeg/m/u-zyFYQiBgAJ).

See [here](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) for the language codes.

**Note:** While the `jpn` and `jpn_vert` language models are available, the `manga-ocr` model used by `PanelCleaner` is generally better suited to manga text recognition. Still, comparing these models is a useful way to learn their respective strengths and limitations.

Uncomment and execute to download the best language models:


In [None]:
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/osd.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/eng.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/jpn.traineddata

# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/jpn_vert.traineddata
# or
# !wget -O jpn_vert.traineddata https://github.com/zodiac3539/jpn_vert/raw/master/jpn_ver5.traineddata

# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/spa.traineddata
# !wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/fra.traineddata

Move the downloaded models to the tessdata folder (double-check that the `tessdata` variable points to the right folder):


In [None]:
cprint(f"tessdata path: {tessdata}")

In [None]:
# !sudo mv *.traineddata $tessdata

Then remove any downloaded models left in the working directory:


In [None]:
# !rm *.traineddata



Check installed languages


In [None]:
cprint(list(filter(lambda x: re.match(r'eng|jpn|jpn_vert|fra|spa', x.name), tessdata.ls())))  # type: ignore
# cprint(pytesseract.get_languages())
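
To try a newly installed model outside the testbed, you can run the Tesseract command line on a single image. The sketch below only builds the argument list. It uses the standard CLI options `-l` (language), `--psm` (page segmentation mode; 5 assumes a single block of vertically aligned text, 6 a single uniform block of text) and `--tessdata-dir`. The helper name and the `crop.png` file name are placeholders, and this is not necessarily how PanelCleaner or the testbed invoke Tesseract.

```python
from pathlib import Path
from typing import List, Optional

def tesseract_command(image: Path, lang: str = "eng", psm: int = 6,
                      tessdata_dir: Optional[Path] = None) -> List[str]:
    """Build the argument list for a one-off Tesseract CLI run that prints to stdout."""
    cmd = ["tesseract", str(image), "stdout", "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific model folder instead of the default one.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd

# "crop.png" is a placeholder file name.
cmd = tesseract_command(Path("crop.png"), lang="jpn_vert", psm=5)

# To actually run it:
# import subprocess
# text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```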


----
# Tesseract Experiments

In this notebook, we focus on applying Tesseract OCR to a variety of comic book images to evaluate its performance across different text styles, languages, and image qualities.  

The experiments are specifically designed to explore how different cropping methods affect Tesseract's ability to recognize text in complex visual contexts typical of comic panels. By experimenting with various cropping strategies, we want to determine whether feeding Tesseract single cropped boxes, as opposed to whole pages, can enhance OCR accuracy.

## Objectives

- **Evaluate basic OCR performance:** Assess how well Tesseract recognizes text across a diverse set of comic book images.
- **Test different cropping methods:** Systematically vary the way images are cropped to isolate text boxes and see if this improves the accuracy of text recognition.
- **Optimize OCR settings (TBD):** Adjust Tesseract's configuration settings based on the results of the cropping experiments to optimize performance for comic texts.
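
To make the second objective concrete, here is a minimal sketch of one possible cropping strategy: growing a detected text box by a fixed padding and clamping it to the page bounds. `padded_box` and its parameters are illustrative assumptions; they do not reproduce the options of the testbed's `CropMethod`.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def padded_box(box: Box, page_size: Tuple[int, int], padding: int = 0) -> Box:
    """Grow `box` by `padding` pixels on every side, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(width, right + padding),
        min(height, bottom + padding),
    )

# Hypothetical box near the top-left corner of a 1000x1500 page.
tight = padded_box((5, 10, 200, 120), (1000, 1500))
loose = padded_box((5, 10, 200, 120), (1000, 1500), padding=8)
```

The resulting tuple has the `(left, upper, right, lower)` layout that Pillow's `Image.crop` expects, so the same page can be cropped with several paddings and each crop sent to the OCR engine.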


## Experiment directory

Defines the directory structure for storing images, caching auxiliary data, and saving experiment results.

- **Source Directory (`EXP_DIR/source/`):** This is where the original images for the experiments are stored.
- **Cache Directory (`EXP_DIR/cache/`):** This directory is used for caching processed images or other auxiliary files that are generated during the experiments.

You can modify the default locations of these directories as needed. The default setup assumes that you are working within the `PanelCleaner/pcleaner/_testbed` directory. Use the following code to verify your current working directory and to set up the experiment directory:


In [None]:
EXP_DIR = Path('./experiment')


In [None]:
cprint(f"{'Working dir':>15}: {Path('.').resolve()}\nExperiments dir: {EXP_DIR.resolve()}")
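
If the `source` and `cache` folders do not exist yet, they can be created up front. This is a minimal sketch that assumes the two-folder layout described above; `ensure_experiment_dirs` is a hypothetical helper, and the testbed may well create these folders itself. The example runs in a temporary directory so that it does not touch your files.

```python
import tempfile
from pathlib import Path
from typing import Dict

def ensure_experiment_dirs(exp_dir: Path) -> Dict[str, Path]:
    """Create `source/` and `cache/` under `exp_dir` if missing and return their paths."""
    dirs = {name: exp_dir / name for name in ("source", "cache")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_experiment_dirs(Path(tmp) / "experiment")
    created = sorted(path.name for path in dirs.values() if path.is_dir())
```

In the notebook you would call `ensure_experiment_dirs(EXP_DIR)` instead.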


# Test images

Prepare and manage the comic book images for OCR testing.

If you have specific comic book images you want to test, upload them to the `EXP_DIR/source/` directory. Ensure that each image file is accompanied by a corresponding text file containing the ground truth data. The text file should have the same name as the image but with a `.txt` extension. Each line in the text file should represent one text box as detected and processed by PanelCleaner.
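
Before running experiments on your own pages, it can help to check that every image has its ground-truth file. The sketch below assumes the naming convention just described (same name, `.txt` extension, one line per text box). The helper names and the set of image extensions are assumptions, and the demonstration files are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: adjust to your image formats

def load_ground_truth(source_dir: Path) -> Dict[str, List[str]]:
    """Map each image name to its ground-truth lines (empty list if the .txt is missing)."""
    truth = {}
    for image in sorted(source_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        txt = image.with_suffix(".txt")
        # One line per text box.
        truth[image.name] = txt.read_text(encoding="utf-8").splitlines() if txt.exists() else []
    return truth

def missing_ground_truth(source_dir: Path) -> List[str]:
    """Names of images that have no (or an empty) ground-truth file."""
    return [name for name, lines in load_ground_truth(source_dir).items() if not lines]

# Demonstration with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "page_01.jpg").touch()
    (src / "page_01.txt").write_text("FIRST BOX\nSECOND BOX\n", encoding="utf-8")
    (src / "page_02.PNG").touch()  # no ground truth yet
    truth = load_ground_truth(src)
    missing = missing_ground_truth(src)
```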

For those who prefer to use a standardized set of images for comparison purposes, we provide a link to download a pre-selected set of comic book images. After downloading, make sure to place these images in the `EXP_DIR/source/` directory.

Optionally, you can include a `.json` file for each image, specifying the language of the text on the page. This file should have the same name as the image and a `.json` extension. Here is an example of the content for a language specification file:

```json
{
"lang": "Spanish"
}
```
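
Reading such a file takes only a few lines. The sketch below returns the `lang` value when the sidecar file exists and a caller-supplied default otherwise. `page_language` is a hypothetical helper; how the testbed itself maps a language name such as `"Spanish"` to a Tesseract language code is not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def page_language(image: Path, default: Optional[str] = None) -> Optional[str]:
    """Return the "lang" value from the image's sidecar .json file, or `default`."""
    sidecar = image.with_suffix(".json")
    if not sidecar.exists():
        return default
    data = json.loads(sidecar.read_text(encoding="utf-8"))
    return data.get("lang", default)

# Demonstration with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "page_01.json").write_text('{"lang": "Spanish"}', encoding="utf-8")
    lang = page_language(src / "page_01.jpg")
    fallback = page_language(src / "page_02.jpg", default="English")
```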

In [None]:
cprint((EXP_DIR/'source').resolve())

or download the standard set:


In [None]:
# !gdown --id 1MCqUImwFS5iQ271CD9_t2FSugJXdYj0a -O experiment.zip

In [None]:
# !unzip -qn experiment.zip -d .

# Setup ngrok (Colab)

The experiments can generate hundreds of images, and maintaining the **PIL** images in memory is not efficient. All the generated images are cached and visualized on demand through a URL pointing to the local cache. This approach prevents the kernel from being overloaded with **PIL** images, with the front-end responsible for fetching the image and the backend web server (not the kernel) for serving the image in another process. This method is quick and efficient. As an added bonus, the saved notebook remains lean and fit; it doesn't store the Base64 versions of all the output cell images.
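
The idea of serving cached images by URL can be sketched with the standard library alone. This is not the testbed's `web_server` module: for brevity, the server below runs in a background thread of the same process rather than in a separate process, and the file name and contents are placeholders.

```python
import functools
import tempfile
import threading
from http.client import HTTPConnection
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""
    def log_message(self, format, *args):
        pass

def serve_directory(root: Path, port: int = 0) -> ThreadingHTTPServer:
    """Serve the files under `root` on localhost from a background thread."""
    handler = functools.partial(QuietHandler, directory=str(root))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

with tempfile.TemporaryDirectory() as cache_dir:
    (Path(cache_dir) / "crop_0.png").write_bytes(b"placeholder bytes")
    server = serve_directory(Path(cache_dir))
    port = server.server_address[1]
    url = f"http://127.0.0.1:{port}/crop_0.png"  # what the front-end would request
    # Fetch the file the way a browser would, to show the round trip.
    conn = HTTPConnection("127.0.0.1", port)
    conn.request("GET", "/crop_0.png")
    payload = conn.getresponse().read()
    conn.close()
    server.shutdown()
    server.server_close()
```

In a notebook, the URL would go into an `<img>` tag in the cell output, so the image bytes never pass through the kernel or get stored in the saved notebook.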

Unfortunately, this approach does not work as is in **Colab**. Google Colab runs on an older Ubuntu 18.04 VM, so all the usual networking challenges with Docker, or whatever VMs Google is using, apply. Google also goes to great lengths to avoid exposing its internal architecture. We have two options:
- Let the Jupyter kernel serve the images itself, which is slow and memory-consuming.
- Use a tunnel to map localhost (server) to whatever IP and port the front-end (the browser you're currently using) is running on. We can use **ngrok** for this, but *ngrok* is a commercial service that has been abused and now requires confirmation the first time the tunnel connects, which can be inconvenient for the user. It also requires the user to open a free account and obtain an auth token.

You choose.

If the notebook is running in Colab and ngrok has been successfully installed and the tunnel has been created, the default setting is `USE_PIL=False`. You can set the environment variable `USE_PIL=True` to force the use of PIL images, but note that in certain circumstances, Colab will complain because the free tiers are usually memory constrained.

If you keep the default settings:
- when the notebook runs locally, it serves the images directly without any additional setup.
- when the notebook runs in Colab, it serves the images through a web server and ngrok.


In [None]:
if FC.IN_COLAB:
    os.environ['USE_TUNNEL'] = 'True'
    os.environ['USE_PIL'] = 'False'


In [None]:
SERVER = None
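# Use .get() so this cell also runs locally, where these variables may be unset.
if os.environ.get('USE_PIL', 'false').lower() == 'false' and os.environ.get('USE_TUNNEL', 'false').lower() == 'true':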
    import testbed.web_server as web_server
    SERVER = web_server.setup_ngrok(web_server.WebServerBottle, Path(EXP_DIR))
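
Boolean settings stored in environment variables are easy to misread, because every value is a string and the variable may be missing altogether. A small helper along these lines makes the intent explicit. `env_flag` is a hypothetical function, not part of the testbed, and `DEMO_FLAG` is a throwaway variable name used only for the demonstration.

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment, tolerating a missing variable."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}

# Demonstration with a throwaway variable.
os.environ["DEMO_FLAG"] = "True"
enabled = env_flag("DEMO_FLAG")
os.environ.pop("DEMO_FLAG")
fallback = env_flag("DEMO_FLAG", default=False)
```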


# CONTEXT
> Creates the `OCRExperimentContext` object we'll use to manage the experiments and visualize the configuration.


In [None]:
CONTEXT = OCRExperimentContext('Tesseract', EXP_DIR, server=SERVER, load=True)
CONTEXT.show()

Current Configuration:

Locale: System default
Default Profile: Built-in
Saved Profiles:
- victess: /Users/vic/dev/repo/DL-mac/cleaned/victess.conf
- vicmang: /Users/vic/dev/repo/DL-mac/cleaned/vicmang.conf

Profile Editor: cursor
Cache Directory: System default
Default Torch Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt
Default CV2 Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt.onnx
GUI Theme: System default

--------------------

Config file located at: /Users/vic/Library/Application Support/pcleaner/pcleanerconfig.ini
System default cache directory: /Users/vic/Library/Caches/pcleaner


## Verify images setup

Before visualizing the experiments, verify that all images are correctly recognized and accessible.

In [None]:
[f"{i:02}: {_.name}" for i,_ in enumerate(CONTEXT.image_paths)]


['00: Action_Comics_1960-01-00_(262).JPG',
 '01: Adolf_Cap_01_008.jpg',
 '02: Barnaby_v1-028.png',
 '03: Barnaby_v1-029.png',
 '04: Buck_Danny_-_12_-_Avions_Sans_Pilotes_-_013.jpg',
 '05: Cannon-292.jpg',
 '06: Contrato_con_Dios_028.jpg',
 '07: Erase_una_vez_en_Francia_02_88.jpg',
 '08: FOX_CHILLINTALES_T17_012.jpg',
 '09: Furari_-_Jiro_Taniguchi_selma_056.jpg',
 '10: Galactus_12.jpg',
 '11: INOUE_KYOUMEN_002.png',
 '12: MCCALL_ROBINHOOD_T31_010.jpg',
 '13: MCCAY_LITTLENEMO_090.jpg',
 '14: Mary_Perkins_On_Stage_v2006_1_-_P00068.jpg',
 '15: PIKE_BOYLOVEGIRLS_T41_012.jpg',
 '16: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1.png',
 '17: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1_K.png',
 '18: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_2.png',
 '19: Spirou_Et_Fantasio_Integrale_06_1958_1959_0025_0024.jpg',
 '20: Strange_Tales_172005.jpg',
 '21: Strange_Tales_172021.jpg',
 '22: Tarzan_014-21.JPG',
 '23: Tintin_21_Les_Bijoux_de_la_Castafiore_page_39.jp

----
# Running an experiment

Conduct an OCR experiment using the established context and tools. You will select an image, choose a cropping method, and decide which text box to analyze. The results will be visualized so you can assess the effectiveness of the OCR process.

### Selecting and configuring the experiment

1. **Choose an image:** Start by selecting an image from the loaded dataset.
2. **Specify cropping method:** Choose how the image should be cropped. Different cropping methods can affect OCR accuracy, as they change how the text is presented to the OCR engine.
3. **Select text box:** Select the specific text box within the image to focus the OCR process.

### Visualizing results

The results are visualized immediately. Here, it is crucial to have accurate **ground truth** data to effectively compare and assess the OCR results.

You can assess the accuracy of OCR results at various levels: box by box, method by method, and overall. Currently, we use a simplified version of the `edit distance` metric to calculate accuracy. However, we plan to adopt more standardized metrics, such as the `Levenshtein distance`, in future updates.
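
For reference, the textbook Levenshtein distance and a normalized accuracy derived from it can be written as follows. This is a sketch of the standard algorithm, not the simplified metric the testbed currently uses; dividing by the longer string's length is one common normalization among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

def accuracy(ocr: str, truth: str) -> float:
    """1.0 for a perfect match, decreasing towards 0.0 as the edit distance grows."""
    if not ocr and not truth:
        return 1.0
    return 1.0 - levenshtein(ocr, truth) / max(len(ocr), len(truth))

# A single misread character (the digit 0 for the letter O) in a five-letter word.
score = accuracy("HELL0", "HELLO")
```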

Additionally, we should probably develop a metric specifically tailored to the unique characteristics of comic texts, such as the prevalence of all-caps and handwritten text, to provide more relevant evaluations. OCR models are typically trained on typeset text, synthetic or real-world, drawn from business documents, forms, news, or literary sources, and they usually perform poorly on handwritten text. We haven't found any OCR dataset that incorporates comics-style data.
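
One modest step in that direction is to normalize both the OCR output and the ground truth before comparing them, so that purely stylistic differences are not counted as errors. The function below is only an illustration of the idea, not an existing part of the testbed. Its rules are assumptions, and the hyphen rule in particular will also merge genuine compounds that happen to break across lines.

```python
import re
import unicodedata

def normalize_comic_text(text: str) -> str:
    """Reduce differences that come from lettering style rather than OCR errors."""
    text = unicodedata.normalize("NFKC", text)   # unify compatibility forms (e.g. full-width letters)
    text = re.sub(r"-\s*\n\s*", "", text)        # rejoin words hyphenated across lines
    text = re.sub(r"\s+", " ", text).strip()     # collapse line breaks and runs of spaces
    return text.casefold()                       # ignore case: much lettering is all-caps

balloon = normalize_comic_text("WHAT A MAR-\nVELOUS  DAY!")
reference = normalize_comic_text("What a marvelous day!")
same = balloon == reference
```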


### Managing experiment data

- **Save results:** You have the option to save the results of the experiment, useful for documenting performance and changes over time. However, be cautious with this option as it will overwrite existing results without confirmation.


> **Note:** This visor functionality is currently a work in progress (WIP). The interface and options are being refined to enhance the experience and provide more robust data management. This section gives you a preliminary look at what we are aiming for with the `testbed` project.


In [None]:
tesseract_experiment = ExperimentsVisor(
    CONTEXT,
    image_idx='Strange_Tales_172005.jpg',
)
tesseract_experiment


----

In [None]:
CONTEXT.cleanup_model()


In [None]:
if SERVER is not None:
    SERVER.stop()
    SERVER = None
    os.environ['USE_TUNNEL'] = 'False'
