# Testing `Idefics` OCR for Comics
> Accuracy Enhancements for OCR in `PanelCleaner`


## Settings for Google Colab

To efficiently manage the image sources for our experiments, we recommend mounting your Google Drive and storing the experiment files there. If you are not familiar with Colab or Jupyter environments, it's best to leave these settings at their default values to ensure smooth operation.

- Set `MOUNT_DRIVE` to `True` to enable mounting Google Drive in the Colab environment.
This allows the notebook to access files stored in your Google Drive.

- `GDRIVE_MOUNT_POINT` specifies the local directory in Colab where your Google Drive will be mounted.
This acts as the root directory for accessing any files within your Google Drive from the notebook.

- `PANELCLEANER_IN_GDRIVE` specifies the path within your Google Drive where the PanelCleaner project is located.
This path is used to access or store any files related to the PanelCleaner project directly from Google Drive.


In [1]:
MOUNT_DRIVE = True
GDRIVE_MOUNT_POINT = 'drive'
PANELCLEANER_IN_GDRIVE = 'MyDrive/Shared/PanelCleaner'

# install (Colab)


In [2]:
import fastcore.all as FC
import os
import re
import sys
from pathlib import Path

from rich import print as cprint
from rich.text import Text

def info(msg: str):
    (t := Text(msg)).stylize("bold red", 0, 6)
    cprint("_" * 10, t, "_" * 10)


Mount Google Drive

In [3]:
mnt_point = Path(f"/content/{GDRIVE_MOUNT_POINT}")
if FC.IN_COLAB:
    if MOUNT_DRIVE:
        if not mnt_point.exists():
            info("Mounting Google Drive")
            from google.colab import drive
            drive.mount(str(mnt_point), force_remount=True)


### Install **PanelCleaner**

> We will attempt to use the version of **PanelCleaner** stored in your Google Drive. If it's not available, we'll install it from GitHub.

Note that we specifically require the `testbed` branch of the **PanelCleaner** repository, not the main trunk. This branch contains necessary configurations and experimental features that are crucial for the tests conducted in this notebook.

In [4]:
if FC.IN_COLAB:
    pc_path = mnt_point/PANELCLEANER_IN_GDRIVE
    tb_path = pc_path/'pcleaner/_testbed'
    if tb_path.exists():
        info('Installing PanelCleaner from your Google Drive')
    else:
        info('Installing PanelCleaner from GitHub')
        !git clone -b testbed https://github.com/civvic/PanelCleaner.git
        # Point both paths at the fresh clone; resolve them so they
        # remain valid after the chdir below.
        pc_path = Path('PanelCleaner').resolve()
        tb_path = pc_path/'pcleaner/_testbed'
    assert tb_path.exists(), "PanelCleaner not found"
    os.chdir(tb_path)
    sys.path.append(f"{pc_path}")
    sys.path.append(f"{tb_path}")
    !pip install -q -r requirements-colab.txt


In [5]:
if FC.IN_COLAB:
    !pip install -q flash-attn --no-build-isolation
    !pip install -q transformers accelerate datasets peft bitsandbytes

import transformers
transformers.__version__

'4.42.0.dev0'

# Prologue

In [6]:
from testbed.experiments import ExperimentsVisor, CropMethod, OCRExperimentContext
from testbed.ocr_idefics import IdeficsOCR, get_gpu_vram
from testbed.helpers import IN_MAC, IN_LINUX


In [7]:
if IN_MAC:
    !pip install -q mlx_vlm

    import mlx.core as mx


# GPU

In [8]:
if IN_MAC:
    gpu_name = mx.metal.device_info()['architecture']
    cprint(
        f"{'metal.is_available()':>30}: {mx.metal.is_available()}\n"
        f"{'metal.device_info()':>30}: {mx.metal.device_info()}\n"
        f"{'metal.get_active_memory()':>30}: {mx.metal.get_active_memory()//1024//1024}\n"
        f"{'metal.get_peak_memory()':>30}: {mx.metal.get_peak_memory()//1024//1024}\n"
        f"{'metal.get_cache_memory()':>30}: {mx.metal.get_cache_memory()//1024//1024}\n"
    )
else:
    !nvidia-smi
    import subprocess
    gpu_name = subprocess.check_output(
            "nvidia-smi --query-gpu=gpu_name --format=csv,noheader", shell=True
        ).decode('utf-8').strip()
    

cprint( f"{'GPU':>15}: {gpu_name}\n"
        f"{'total VRAM':>15}: {get_gpu_vram()} MiB\n"
        f"{'active VRAM':>15}: {get_gpu_vram(False)} MiB")



----
# Idefics experiments

## Experiment directory

This is the directory where the source images reside (`EXP_DIR/source/`), where the auxiliary images are cached (`EXP_DIR/cache/`), and where the experiment results are saved. You can change the default location here.

NOTE: the default value assumes the current working directory is `PanelCleaner/pcleaner/_testbed`. You can check that this is the case with `Path('.').resolve()`.
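
As a small sketch of that check, the following compares the last two components of the working directory against the expected ones. The function name `in_testbed_dir` is hypothetical and only illustrates the idea.

```python
from pathlib import Path

def in_testbed_dir(cwd: Path) -> bool:
    """Hypothetical check: True if `cwd` ends in pcleaner/_testbed."""
    return cwd.parts[-2:] == ('pcleaner', '_testbed')

if not in_testbed_dir(Path('.').resolve()):
    print("Warning: not inside pcleaner/_testbed; adjust EXP_DIR or change directory.")
```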

In [9]:
EXP_DIR = Path('./experiment')
cprint(f"{'Working dir':>15}: {Path('.').resolve()}\nExperiments dir: {EXP_DIR.resolve()}")


# Test images


Copy your images to the source directory:


In [None]:
cprint((EXP_DIR/'source').resolve())

or download the standard set:


In [None]:
# !gdown --id 1MCqUImwFS5iQ271CD9_t2FSugJXdYj0a -O experiment.zip

In [None]:
# !unzip -qn experiment.zip -d .

# Setup ngrok (Colab)

The experiments can generate hundreds of images, and maintaining the **PIL** images in memory is not efficient. All the generated images are cached and visualized on demand through a URL pointing to the local cache. This approach prevents the kernel from being overloaded with **PIL** images, with the front-end responsible for fetching the image and the backend web server (not the kernel) for serving the image in another process. This method is quick and efficient. As an added bonus, the saved notebook remains lean and fit; it doesn't store the Base64 versions of all the output cell images.
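
To illustrate why the notebook stays small, the sketch below compares an image tag that references a cached file by URL with one that embeds the same image as Base64. The URL, port, and file name are made up for the example, and the zero-filled byte string only stands in for real image data.

```python
import base64

def img_tag_from_url(url: str) -> str:
    # The notebook stores only this short tag; the image stays on disk.
    return f'<img src="{url}">'

def img_tag_inline(data: bytes, mime: str = 'image/png') -> str:
    # The whole image is embedded in the notebook as a Base64 data URI.
    b64 = base64.b64encode(data).decode('ascii')
    return f'<img src="data:{mime};base64,{b64}">'

fake_image = bytes(300_000)  # stand-in for roughly 300 KB of image data
by_url = img_tag_from_url('http://localhost:8000/cache/page_01.png')  # hypothetical URL
inline = img_tag_inline(fake_image)

print(f"URL tag: {len(by_url)} characters, inline tag: {len(inline)} characters")
```

Base64 encodes every 3 bytes as 4 characters, so each embedded image costs about a third more than its file size, multiplied by the number of output cells that show it.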

Unfortunately, this approach does not work as is in **Colab**. Colab notebooks run on an older Ubuntu 18.04 VM, so the usual networking obstacles of Docker (or whatever virtualization Google is using) apply, and Google goes to great lengths to avoid exposing its internal architecture. We have two options:
- Let the Jupyter kernel serve the images itself, which is slow and memory-hungry.
- Use a tunnel to expose the local web server to the front-end (the browser you're currently using). We can use **ngrok** for this, but ngrok is a commercial service that has been abused, so it now asks for confirmation the first time the tunnel connects, which can be inconvenient. It also requires the user to open a free account and obtain an auth token.

You choose.

If the notebook is running in Colab, ngrok has been installed successfully, and the tunnel has been created, the default setting is `USE_PIL=False`. You can set the environment variable `USE_PIL=True` to force the use of PIL images, but note that in certain circumstances Colab will complain, because the free tiers are usually memory constrained.

If you don't change the default settings and
- the notebook is running locally, it'll serve the images directly without any additional setup.
- the notebook is running in Colab, it'll serve the images through a web server and ngrok.
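
The rules above can be summarized in a small decision function. This is only a sketch of the logic as described here, not the testbed's actual implementation: the function name, the returned labels, and the fallback of `'False'` for unset variables are all assumptions.

```python
import os
from typing import Mapping

def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Environment variables are strings, so compare case-insensitively.
    return env.get(name, default).strip().lower() == 'true'

def image_delivery_mode(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical summary of how images reach the front-end."""
    if _flag(env, 'USE_PIL', 'False'):      # assumed default when unset
        return 'pil'      # the kernel keeps and sends PIL images
    if _flag(env, 'USE_TUNNEL', 'False'):   # assumed default when unset
        return 'tunnel'   # web server exposed through ngrok (Colab)
    return 'local'        # web server reachable directly (local run)

print(image_delivery_mode({'USE_PIL': 'False', 'USE_TUNNEL': 'True'}))
```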


In [10]:
if FC.IN_COLAB:
    os.environ['USE_TUNNEL'] = 'True'
    os.environ['USE_PIL'] = 'False'


In [11]:
SERVER = None
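# Use .get() with defaults so the cell also runs when the variables are unset (e.g. locally).
if os.environ.get('USE_PIL', 'False').lower() == 'false' and os.environ.get('USE_TUNNEL', 'False').lower() == 'true':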
    import testbed.web_server as web_server
    SERVER = web_server.setup_ngrok(web_server.WebServerBottle, Path(EXP_DIR))


# CONTEXT

| quant, attn \ platform | Mem   | Mac  | Linux | Windows | Colab T4 | Colab L4/H100 |
| ---                    | ---   | ---  | ---   | ---     | ---      | ---           |
| **float16**            | 17 GB | ✅   | ✅    | ?       | ❌       | ✅            |
| **float16 + attn**     | 17 GB | ❌   | ✅    | ?       | ❌       | ✅            |
| **8bit**               | 10 GB | ✅   | ✅    | ?       | ✅       | ✅            |
| **8bit + attn**        | 10 GB | ❌   | ✅    | ?       | ❌       | ✅            |
| **4bit**               |  6 GB | ✅   | ✅    | ?       | ✅       | ✅            |
| **4bit + attn**        |  6 GB | ❌   | ✅    | ?       | ❌       | ✅            |


Create the `OCRExperimentContext` object we'll use to manage the experiments.


In [15]:
quant = '4bit' if IN_MAC or FC.IN_COLAB else 'float16'
flashattn = not FC.IN_COLAB
CONTEXT = OCRExperimentContext('Idefics', EXP_DIR, 
                            quant=quant, flashattn=flashattn, 
                            server=SERVER, run_name='Idefics-crop-post', load=True)
CONTEXT.show()


Current Configuration:

Locale: System default
Default Profile: Built-in
Saved Profiles:
- victess: /Users/vic/dev/repo/DL-mac/cleaned/victess.conf
- vicmang: /Users/vic/dev/repo/DL-mac/cleaned/vicmang.conf

Profile Editor: cursor
Cache Directory: System default
Default Torch Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt
Default CV2 Model Path: /Users/vic/Library/Caches/pcleaner/model/comictextdetector.pt.onnx
GUI Theme: System default

--------------------

Config file located at: /Users/vic/Library/Application Support/pcleaner/pcleanerconfig.ini
System default cache directory: /Users/vic/Library/Caches/pcleaner


In [16]:
ocr_model = CONTEXT.setup_ocr_model(False)
ocr_model.show_info()

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

Check the images are in place

In [17]:
[f"{i:02}: {_.name}" for i,_ in enumerate(CONTEXT.image_paths)]


['00: Action_Comics_1960-01-00_(262).JPG',
 '01: Adolf_Cap_01_008.jpg',
 '02: Barnaby_v1-028.png',
 '03: Barnaby_v1-029.png',
 '04: Buck_Danny_-_12_-_Avions_Sans_Pilotes_-_013.jpg',
 '05: Cannon-292.jpg',
 '06: Contrato_con_Dios_028.jpg',
 '07: Erase_una_vez_en_Francia_02_88.jpg',
 '08: FOX_CHILLINTALES_T17_012.jpg',
 '09: Furari_-_Jiro_Taniguchi_selma_056.jpg',
 '10: Galactus_12.jpg',
 '11: INOUE_KYOUMEN_002.png',
 '12: MCCALL_ROBINHOOD_T31_010.jpg',
 '13: MCCAY_LITTLENEMO_090.jpg',
 '14: Mary_Perkins_On_Stage_v2006_1_-_P00068.jpg',
 '15: PIKE_BOYLOVEGIRLS_T41_012.jpg',
 '16: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1.png',
 '17: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_1_K.png',
 '18: Sal_Buscema_Spaceknights_&_Superheroes_Ocular_Edition_1_2.png',
 '19: Spirou_Et_Fantasio_Integrale_06_1958_1959_0025_0024.jpg',
 '20: Strange_Tales_172005.jpg',
 '21: Strange_Tales_172021.jpg',
 '22: Tarzan_014-21.JPG',
 '23: Tintin_21_Les_Bijoux_de_la_Castafiore_page_39.jp

In [20]:
idefics_experiment = ExperimentsVisor(
                        CONTEXT,
                        'Idefics',
                        image_idx='Strange_Tales_172005.jpg',
                        box_idx=0,
                        method=CropMethod.DEFAULT_GREY_PAD
                    )
idefics_experiment



----

In [21]:
CONTEXT.cleanup_model()

In [22]:
if SERVER is not None:
    SERVER.stop()
    SERVER = None
    os.environ['USE_TUNNEL'] = 'False'
