


High-Level Summary
Overall, EasyOCR is the better-performing model in this test, but neither model is usable for this task in its default configuration.

EasyOCR is significantly better at Text Detection (finding the text).

Both models performed exceptionally poorly at Text Recognition (reading the text correctly).

EasyOCR is approximately twice as fast as Tesseract.

Detailed Analysis Breakdown
1. Text Detection (Finding the Text)
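This measures how well each model drew bounding boxes around the text. A predicted box counts as correct when its IoU with a ground-truth box is at least 0.5 (the IOU_THRESHOLD used in the evaluation). The F1-Score, the harmonic mean of Precision and Recall, is the best single metric to judge this.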

Tesseract:

F1-Score: 0.0018 (Effectively 0.2%)

Analysis: This is a near-total failure. Tesseract was almost completely unable to correctly identify the locations of text on these forms. The extremely low Precision (0.0020) and Recall (0.0016) mean that almost none of its predicted boxes matched a ground-truth box and that almost no ground-truth boxes were found.

EasyOCR:

F1-Score: 0.0253 (Effectively 2.5%)

Analysis: While still a very poor score, EasyOCR's F1-score is 14 times higher than Tesseract's. It shows a limited but demonstrably better ability to find text within the complex layout of a form.

Conclusion for Detection: EasyOCR is the decisive winner, but its performance is still not good enough for a practical application.
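
As a sanity check on the detection numbers, the F1-Score can be recomputed from the reported Precision and Recall. The sketch below is only illustrative: `f1_from_pr` is a hypothetical helper name, not part of the notebook, and the inputs are the rounded values from the evaluation output.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed in the evaluation output (already rounded).
reported = {
    "Tesseract": (0.0020, 0.0016),
    "EasyOCR": (0.0432, 0.0179),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from_pr(p, r):.4f}")
# Tesseract: F1 = 0.0018
# EasyOCR: F1 = 0.0253
```

Because F1 is a harmonic mean, it stays close to the smaller of the two inputs, which is why EasyOCR's higher Precision (0.0432) does not lift its F1 much above its Recall (0.0179).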

2. Text Recognition (Reading the Text)
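This evaluates the text transcribed inside the boxes that were matched to ground truth. Because so few boxes were matched (see the detection scores above), these figures rest on a very small sample.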

Tesseract:

Word Error Rate (WER): 1.0000 (100%)

Analysis: A WER of 100% means the number of word-level errors equalled the number of ground-truth words, and the Exact Match Accuracy of 0.0000 confirms that no matched box was transcribed correctly. The Character Error Rate (CER) of 1.0114, just over 100%, points the same way: the output was essentially gibberish.

EasyOCR:

Word Error Rate (WER): 2.9236 (292%)

Analysis: A WER above 100% is possible because insertions count as errors: the model not only got words wrong but also produced many extra words that have no counterpart in the ground truth. For every 10 ground-truth words, EasyOCR accumulated about 29 word-level errors (substitutions, deletions, and insertions combined).

Exact Match Accuracy: 0.0449 (about 4.5%). On a small fraction of the boxes that were matched to ground truth, the transcription was exactly right. This likely happened on clean, isolated, simple words within the forms.

Conclusion for Recognition: Both models failed completely. Tesseract produced nonsense of roughly the correct length, while EasyOCR produced a much larger volume of nonsense but had rare moments of perfect accuracy on simple words.
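
To see how a WER can exceed 100%, it helps to compute it by hand on a tiny case. The sketch below is a plain word-level edit distance, not jiwer's implementation; `word_error_rate` is a hypothetical helper and the example strings are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# One reference word, two extra predicted words: 2 insertions / 1 word.
print(word_error_rate("TOTAL", "TOTAL AMOUNT DUE"))  # 2.0
# One wrong word out of three: 1 substitution / 3 words.
print(round(word_error_rate("DATE OF ISSUE", "DATA OF ISSUE"), 4))  # 0.3333
```

The first case shows the pattern behind a score like 2.9236: when a predicted text contains several words but the matched ground-truth entry contains one, every extra word is an insertion and the error count outgrows the reference length.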

3. Performance and Speed
From the progress bars, we can see the processing time for 50 images:

Tesseract: 01:48 (108 seconds) -> ~2.16 seconds per image

EasyOCR: 00:55 (55 seconds) -> ~1.10 seconds per image

Conclusion for Speed: EasyOCR, leveraging the Colab GPU, was approximately twice as fast as the CPU-based Tesseract.

Why Were the Results So Poor? The "Model vs. Dataset" Mismatch
The primary reason for these terrible scores is a fundamental mismatch between the chosen models and the dataset.

The Dataset (FUNSD): This is not a dataset of simple paragraphs or street signs. It contains noisy, scanned forms with complex layouts, tables, checkboxes, lines, key-value pairs, and varied fonts.

The Models:

Tesseract (with default settings): Tesseract's default page segmentation (--psm 3) is designed to find blocks of text and paragraphs. It gets deeply confused by the lines and columns of a form, leading to its failure in detection.

EasyOCR: It is primarily designed for "text in the wild" (signs, posters, etc.). While it's more flexible than Tesseract, it also struggles to interpret the dense, structured information on a form and may hallucinate text from visual noise, explaining its high insertion rate (WER > 100%).
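
Before attributing the detection scores entirely to the models, it may be worth checking that predicted and ground-truth boxes share a coordinate system. The sketch below assumes, without confirmation from this notebook, that the ground-truth boxes are normalized to a 0-1000 grid (the LayoutLM convention), whereas Tesseract and EasyOCR return pixel coordinates. If that assumption holds, IoU against unscaled boxes would be near zero regardless of model quality. `denormalize_box` is a hypothetical helper, and the image size in the example is made up.

```python
def denormalize_box(box, width, height, scale=1000):
    """Map an [x0, y0, x1, y1] box from a 0..scale grid back to pixel coordinates.

    Assumption: the input box was normalized as x * scale / width and
    y * scale / height. Verify this against the dataset before relying on it.
    """
    x0, y0, x1, y1 = box
    return [
        round(x0 * width / scale),
        round(y0 * height / scale),
        round(x1 * width / scale),
        round(y1 * height / scale),
    ]


# Hypothetical normalized box on a 762 x 1000 pixel page.
print(denormalize_box([100, 200, 300, 400], width=762, height=1000))
# [76, 200, 229, 400]
```

In the evaluation loop, this would be applied to each entry of `item['bboxes']` using the width and height from `image.size` before calling `match_boxes`. A quick way to test the assumption is to print the maximum ground-truth coordinate next to the image size for a few samples.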

Final Verdict and Recommendations
Winner: EasyOCR is the better general-purpose OCR engine based on this test. It was faster and significantly better at finding text in a challenging layout.

Next Steps: To get good results on the FUNSD dataset, you cannot use a general-purpose OCR engine out-of-the-box. You need to move to more advanced models designed for this specific task:

Document AI / Layout-Aware Models: Explore models like LayoutLM, Donut, or similar architectures. LayoutLM-style models combine the words produced by an OCR step with the position of each word on the page, while Donut is OCR-free and maps the page image directly to structured output. Models of this kind are built for form and document understanding and are likely to produce much better results than a general-purpose OCR engine used alone.

In [None]:
# Test Case 1: Verify Authentication and API Connectivity
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

print("--- Running Test Case 1: Verifying Authentication ---")

try:
    api = HfApi()
    user_info = api.whoami()
    if user_info:
        print("✅ SUCCESS: Successfully authenticated with the Hugging Face Hub.")
        print(f"   - Logged in as: {user_info.get('name')}")
        print(f"   - User Type: {user_info.get('type')}")
    else:
        print("❌ FAILURE: Could not retrieve user information. The token might be invalid or the login failed.")

except HfHubHTTPError as e:
    print(f"❌ FAILURE: An HTTP error occurred. This could be a network issue or an invalid token.")
    print(f"   - Error details: {e}")
except Exception as e:
    print(f"❌ FAILURE: An unexpected error occurred.")
    print(f"   - Error details: {e}")

--- Running Test Case 1: Verifying Authentication ---
❌ FAILURE: An unexpected error occurred.
   - Error details: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `hf auth login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
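
The error above indicates that no token was available when `whoami()` was called. One way to make the check degrade gracefully is to look for a token first and attempt a login only if one is found. The sketch below assumes the token is exposed through the `HF_TOKEN` environment variable; `find_hf_token` and `login_if_token_available` are hypothetical helper names, and `huggingface_hub.login` is imported lazily so the lookup itself has no dependency on the library.

```python
import os


def find_hf_token(environ=None):
    """Return the token from HF_TOKEN, or None when it is missing or blank."""
    env = os.environ if environ is None else environ
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


def login_if_token_available():
    """Log in to the Hub only when a token is present; return True on login."""
    token = find_hf_token()
    if token is None:
        print("No HF_TOKEN found; continuing unauthenticated (public repos only).")
        return False
    from huggingface_hub import login  # lazy import: only needed when a token exists
    login(token=token)
    return True
```

Reading the token from the environment or from Colab secrets, rather than pasting it into a cell, also keeps it out of the notebook's saved output.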


In [None]:
# Test Case 2: Check if the Dataset Repository is Discoverable
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

print("\n--- Running Test Case 2: Checking Dataset Discoverability ---")

DATASET_NAME_TO_CHECK = "gagan3012/coco-text-v2"

try:
    api = HfApi()
    dataset_info = api.dataset_info(repo_id=DATASET_NAME_TO_CHECK)
    print(f"✅ SUCCESS: The dataset repository '{DATASET_NAME_TO_CHECK}' was found on the Hub.")
    print(f"   - Author: {dataset_info.author}")
    print(f"   - Last Modified: {dataset_info.last_modified}")

except RepositoryNotFoundError:
    print(f"❌ FAILURE: The repository '{DATASET_NAME_TO_CHECK}' could not be found by the API.")
    print("   - This is very strange if Test 1 passed. It might indicate a temporary Hugging Face Hub issue or a problem with the specific repository.")
except Exception as e:
    print(f"❌ FAILURE: An unexpected error occurred while trying to fetch dataset info.")
    print(f"   - Error details: {e}")


--- Running Test Case 2: Checking Dataset Discoverability ---
❌ FAILURE: The repository 'gagan3012/coco-text-v2' could not be found by the API.
   - This is very strange if Test 1 passed. It might indicate a temporary Hugging Face Hub issue or a problem with the specific repository.


In [None]:
# Test Case 2.1: Check Discoverability of a different dataset (FUNSD)
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

print("\n--- Running Test Case 2.1: Checking a control dataset (FUNSD) ---")

# We use the FUNSD dataset name as our control
DATASET_NAME_TO_CHECK = "nielsr/funsd-layoutlmv3"

try:
    api = HfApi()
    dataset_info = api.dataset_info(repo_id=DATASET_NAME_TO_CHECK)
    print(f"✅ SUCCESS: The dataset repository '{DATASET_NAME_TO_CHECK}' was found on the Hub.")
    print(f"   - Author: {dataset_info.author}")
    print(f"   - Last Modified: {dataset_info.last_modified}")

except RepositoryNotFoundError:
    print(f"❌ FAILURE: The repository '{DATASET_NAME_TO_CHECK}' could not be found by the API.")
    print("   - If this fails too, there is a larger issue with the API or your connection.")
except Exception as e:
    print(f"❌ FAILURE: An unexpected error occurred.")
    print(f"   - Error details: {e}")


--- Running Test Case 2.1: Checking a control dataset (FUNSD) ---
✅ SUCCESS: The dataset repository 'nielsr/funsd-layoutlmv3' was found on the Hub.
   - Author: nielsr
   - Last Modified: 2025-06-20 06:58:31+00:00


In [None]:
# ==============================================================================
# STEP 1: SETUP AND INSTALLATIONS (No changes needed)
# ==============================================================================
print("STEP 1: Installing all required libraries...")
!sudo apt install tesseract-ocr
!pip install pytesseract easyocr torch torchvision torchaudio --progress-bar off
!pip install datasets evaluate jiwer --progress-bar off
print("Installation complete.\n")


# ==============================================================================
# STEP 2: IMPORTS AND MODEL INITIALIZATION (No changes needed)
# ==============================================================================
import pytesseract
import easyocr
from datasets import load_dataset
from PIL import Image, ImageDraw
import numpy as np
import jiwer # For WER and CER calculation
from tqdm.notebook import tqdm # For progress bars

print("STEP 2: Importing libraries and initializing models...")

# User-defined parameters for the FUNSD dataset
DATASET_NAME = "nielsr/funsd-layoutlmv3"
DATASET_SPLIT = "test"
NUM_SAMPLES = 200
IOU_THRESHOLD = 0.5
TESSERACT_CONFIG = "--psm 3"
EASYOCR_LANG = ['en']

print("Initializing EasyOCR... (This may take a minute on first run)")
easyocr_reader = easyocr.Reader(EASYOCR_LANG, gpu=True)
print("Initialization complete.\n")


# ==============================================================================
# STEP 3: DATA LOADING (No changes needed)
# ==============================================================================
print(f"STEP 3: Loading samples from the {DATASET_NAME} dataset...")
dataset = load_dataset(DATASET_NAME, split=DATASET_SPLIT)

actual_num_samples = min(NUM_SAMPLES, len(dataset))
dataset_subset = dataset.select(range(actual_num_samples))

print(f"Loaded {len(dataset_subset)} samples (requested {NUM_SAMPLES}, but only {len(dataset)} were available).\n")


# ==============================================================================
# STEP 4: HELPER FUNCTIONS FOR METRICS CALCULATION (No changes needed)
# ==============================================================================
print("STEP 4: Defining helper functions for evaluation...")

def calculate_iou(boxA, boxB):
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    interArea = max(0, xB - xA) * max(0, yB - yA)
    if interArea == 0:
        return 0.0
    boxAArea = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    boxBArea = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    iou = interArea / float(boxAArea + boxBArea - interArea)
    return iou

def match_boxes(gt_boxes, pred_boxes, iou_threshold):
    matches = []
    if not gt_boxes or not pred_boxes:
        return [], list(range(len(gt_boxes))), list(range(len(pred_boxes)))
    gt_matched = [False] * len(gt_boxes)
    pred_matched = [False] * len(pred_boxes)
    iou_matrix = np.zeros((len(gt_boxes), len(pred_boxes)))
    for i, gt_box in enumerate(gt_boxes):
        for j, pred_box in enumerate(pred_boxes):
            iou_matrix[i, j] = calculate_iou(gt_box, pred_box)
    while np.sum(iou_matrix) > 0:
        gt_idx, pred_idx = np.unravel_index(np.argmax(iou_matrix, axis=None), iou_matrix.shape)
        max_iou = iou_matrix[gt_idx, pred_idx]
        if max_iou >= iou_threshold:
            if not gt_matched[gt_idx] and not pred_matched[pred_idx]:
                matches.append((gt_idx, pred_idx))
                gt_matched[gt_idx] = True
                pred_matched[pred_idx] = True
        iou_matrix[gt_idx, :] = 0
        iou_matrix[:, pred_idx] = 0
    unmatched_gts = [i for i, matched in enumerate(gt_matched) if not matched]
    unmatched_preds = [i for i, matched in enumerate(pred_matched) if not matched]
    return matches, unmatched_gts, unmatched_preds

print("Helper functions defined.\n")


# ==============================================================================
# STEP 5: DEFINE THE MAIN EVALUATION FUNCTION (Updated with the fix)
# ==============================================================================
print("STEP 5: Defining the main evaluation loop...")

def evaluate_model(dataset, ocr_function, model_name):
    print(f"\n--- Evaluating {model_name} ---")
    total_tp, total_fp, total_fn = 0, 0, 0
    all_gt_texts, all_pred_texts = [], []
    exact_matches, total_comparisons = 0, 0

    for item in tqdm(dataset, desc=f"Processing images with {model_name}"):
        image = item['image'].convert("RGB")

        gt_boxes = item['bboxes']
        gt_texts = [text.strip() for text in item['tokens']]

        pred_boxes, pred_texts = ocr_function(image)
        matches, unmatched_gts, unmatched_preds = match_boxes(gt_boxes, pred_boxes, IOU_THRESHOLD)

        total_tp += len(matches)
        total_fn += len(unmatched_gts)
        total_fp += len(unmatched_preds)

        for gt_idx, pred_idx in matches:
            gt_text = gt_texts[gt_idx]
            pred_text = pred_texts[pred_idx]
            if gt_text:
                all_gt_texts.append(gt_text)
                all_pred_texts.append(pred_text)
                total_comparisons += 1
                if gt_text == pred_text:
                    exact_matches += 1

    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    full_gt_text = " ".join(all_gt_texts)
    full_pred_text = " ".join(all_pred_texts)

    # --- FIX: Use the new jiwer API ---
    wer = jiwer.wer(full_gt_text, full_pred_text)
    cer = jiwer.cer(full_gt_text, full_pred_text)
    # --- END FIX ---

    exact_match_acc = exact_matches / total_comparisons if total_comparisons > 0 else 0

    print(f"\n--- Results for {model_name} ---")
    print(f"[Detection Metrics]")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-Score:  {f1_score:.4f}")
    print(f"\n[Recognition Metrics]")
    print(f"  Character Error Rate (CER): {cer:.4f}")
    print(f"  Word Error Rate (WER):      {wer:.4f}")
    print(f"  Exact Match Accuracy:       {exact_match_acc:.4f}")
    print("-" * 30)

print("Evaluation function defined.\n")

# ==============================================================================
# STEP 6: CREATE OCR FUNCTIONS FOR TESSERACT AND EASYOCR (No changes needed)
# ==============================================================================
print("STEP 6: Creating standardized OCR functions...")

def run_tesseract(image):
    data = pytesseract.image_to_data(image, config=TESSERACT_CONFIG, output_type=pytesseract.Output.DICT)
    boxes, texts = [], []
    for i in range(len(data['text'])):
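        # conf may arrive as an int, a float, or a numeric string depending on the
        # pytesseract version; float() accepts all of them.
        if float(data['conf'][i]) > 0 and data['text'][i].strip() != '':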
            x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
            boxes.append([x, y, x + w, y + h])
            texts.append(data['text'][i].strip())
    return boxes, texts

def run_easyocr(image):
    image_np = np.array(image)
    results = easyocr_reader.readtext(image_np)
    boxes, texts = [], []
    for (bbox, text, prob) in results:
        x_min = int(min([p[0] for p in bbox]))
        y_min = int(min([p[1] for p in bbox]))
        x_max = int(max([p[0] for p in bbox]))
        y_max = int(max([p[1] for p in bbox]))
        boxes.append([x_min, y_min, x_max, y_max])
        texts.append(text.strip())
    return boxes, texts

print("OCR functions are ready.\n")

# ==============================================================================
# STEP 7: RUN THE EVALUATION
# ==============================================================================
print("STEP 7: Starting the evaluation process. This may take some time...")

evaluate_model(dataset_subset, run_tesseract, "Tesseract")
evaluate_model(dataset_subset, run_easyocr, "EasyOCR")

print("\nEvaluation complete.")

STEP 1: Installing all required libraries...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting easyocr
  Downloading easyocr-1.7.2-py3-none-any.whl.metadata (10 kB)
Collecting python-bidi (from easyocr)
  Downloading python_bidi-0.6.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Collecting pyclipper (from easyocr)
  Downloading pyclipper-1.3.0.post6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.0 kB)
Collecting ninja (from easyocr)
  Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Downloading easyocr-1.7.2-py3-none-any.whl (2.9 MB)
Downloading ninja-1.13.



STEP 2: Importing libraries and initializing models...
Initializing EasyOCR... (This may take a minute on first run)
Progress: |██████████████████████████████████████████████████| 100.0% Complete




README.md:   0%|          | 0.00/770 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/26.3M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/9.54M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/149 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50 [00:00<?, ? examples/s]

Loaded 50 samples (requested 200, but only 50 were available).

STEP 4: Defining helper functions for evaluation...
Helper functions defined.

STEP 5: Defining the main evaluation loop...
Evaluation function defined.

STEP 6: Creating standardized OCR functions...
OCR functions are ready.

STEP 7: Starting the evaluation process. This may take some time...

--- Evaluating Tesseract ---


Processing images with Tesseract:   0%|          | 0/50 [00:00<?, ?it/s]


--- Results for Tesseract ---
[Detection Metrics]
  Precision: 0.0020
  Recall:    0.0016
  F1-Score:  0.0018

[Recognition Metrics]
  Character Error Rate (CER): 1.0114
  Word Error Rate (WER):      1.0000
  Exact Match Accuracy:       0.0000
------------------------------

--- Evaluating EasyOCR ---


Processing images with EasyOCR:   0%|          | 0/50 [00:00<?, ?it/s]


--- Results for EasyOCR ---
[Detection Metrics]
  Precision: 0.0432
  Recall:    0.0179
  F1-Score:  0.0253

[Recognition Metrics]
  Character Error Rate (CER): 2.9125
  Word Error Rate (WER):      2.9236
  Exact Match Accuracy:       0.0449
------------------------------

Evaluation complete.


In [None]:
# ==============================================================================
# STEP 1: SETUP AND INSTALLATIONS
# ==============================================================================
# We are adding new libraries for the transformer models, especially Nougat.
# ==============================================================================
print("STEP 1: Installing all required libraries...")
!sudo apt install tesseract-ocr
!pip install pytesseract easyocr torch torchvision torchaudio --progress-bar off
!pip install datasets transformers accelerate sentencepiece sentence-splitter protobuf --progress-bar off
print("Installation complete.\n")


# ==============================================================================
# STEP 2: IMPORTS AND MODEL INITIALIZATION
# ==============================================================================
# This step will now take longer as we are loading two large transformer models.
# ==============================================================================
import pytesseract
import easyocr
from datasets import load_dataset
from transformers import pipeline
from PIL import Image
import numpy as np  # needed for np.array(image) in the EasyOCR step below
import warnings

# Suppress a known warning in the Nougat model
warnings.filterwarnings("ignore", category=UserWarning, message="Setting `pad_token_id` to `eos_token_id`")

print("STEP 2: Importing libraries and initializing all four models...")

# --- Initialize Traditional OCR ---
print("Initializing EasyOCR...")
easyocr_reader = easyocr.Reader(['en'], gpu=True)
TESSERACT_CONFIG = "--psm 3"

# --- Initialize Transformer Models ---
print("Initializing Donut model...")
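# Note: donut-base is the pre-trained checkpoint and is not fine-tuned for question answering;
# a DocVQA-tuned checkpoint such as naver-clova-ix/donut-base-finetuned-docvqa may suit this pipeline better.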
donut_pipeline = pipeline("document-question-answering", model="naver-clova-ix/donut-base")

print("Initializing Nougat model...")
# Nougat was trained on academic document pages; how well it transfers to RVL-CDIP scans is untested here
nougat_pipeline = pipeline("image-to-text", model="facebook/nougat-base")

print("All models initialized.\n")


# ==============================================================================
# STEP 3: DATA LOADING
# ==============================================================================
# We will load a few samples from the RVL-CDIP test set.
# ==============================================================================
print("STEP 3: Loading samples from the RVL-CDIP dataset...")
DATASET_NAME = "rvl_cdip"
DATASET_SPLIT = "test"
NUM_SAMPLES = 3 # Let's start with 3 images for a side-by-side comparison

dataset = load_dataset(DATASET_NAME, split=DATASET_SPLIT)
dataset_subset = dataset.select(range(NUM_SAMPLES))

print(f"Loaded {len(dataset_subset)} samples.\n")


# ==============================================================================
# STEP 4: RUN QUALITATIVE COMPARISON
# ==============================================================================
# We will loop through each sample image and get the output from each model.
# ==============================================================================
print("STEP 4: Starting qualitative comparison...")

for i, item in enumerate(dataset_subset):
    image = item['image'].convert("RGB")
    doc_category = dataset.features['label'].int2str(item['label'])

    print("="*80)
    print(f"\nProcessing Sample {i+1} | Document Category: '{doc_category}'")
    print("="*80)

    # --- Display the image we are processing ---
    display(image.resize((600, 800))) # Resize for better display in Colab

    # --- 1. Tesseract ---
    print("\n--- 1. Tesseract Output ---")
    try:
        tesseract_text = pytesseract.image_to_string(image, config=TESSERACT_CONFIG)
        print(tesseract_text if tesseract_text.strip() else "[No text detected]")
    except Exception as e:
        print(f"Tesseract failed with error: {e}")

    # --- 2. EasyOCR ---
    print("\n--- 2. EasyOCR Output ---")
    try:
        easyocr_results = easyocr_reader.readtext(np.array(image), paragraph=True)
        # We use paragraph=True to group text, then join it.
        easyocr_text = "\n".join([res[1] for res in easyocr_results])
        print(easyocr_text if easyocr_text.strip() else "[No text detected]")
    except Exception as e:
        print(f"EasyOCR failed with error: {e}")

    # --- 3. Donut ---
    print("\n--- 3. Donut Output ---")
    try:
        # For a full page, a broad question works best
        question = "What is the content of this document?"
        donut_results = donut_pipeline(image=image, question=question)
        # Donut returns a list of answers, we'll join them
        donut_text = " ".join([ans['answer'] for ans in donut_results])
        print(donut_text if donut_text.strip() else "[No text detected]")
    except Exception as e:
        print(f"Donut failed with error: {e}")

    # --- 4. Nougat ---
    print("\n--- 4. Nougat Output ---")
    try:
        # Nougat's pipeline is direct image-to-text
        nougat_results = nougat_pipeline(image)
        # Nougat often returns Markdown formatted text in a list
        nougat_text = nougat_results[0]['generated_text']
        print(nougat_text if nougat_text.strip() else "[No text detected]")
    except Exception as e:
        print(f"Nougat failed with error: {e}")

print("\n\n" + "="*80)
print("Qualitative comparison complete.")


STEP 1: Installing all required libraries...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting sentence-splitter
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl.metadata (2.8 kB)
Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Installing collected packages: sentence-splitter
Successfully installed sentence-splitter-1.4
Installation complete.

STEP 2: Importing libraries and initializing all four models...
Initializing EasyOCR...
Initializing Donut model...


config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/809M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/809M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/518 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/1.30M [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/71.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/355 [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

preprocessor_config.json:   0%|          | 0.00/362 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Device set to use cuda:0


Initializing Nougat model...


config.json: 0.00B [00:00, ?B/s]



model.safetensors:   0%|          | 0.00/1.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

preprocessor_config.json:   0%|          | 0.00/479 [00:00<?, ?B/s]

Device set to use cuda:0


All models initialized.

STEP 3: Loading samples from the RVL-CDIP dataset...


RuntimeError: Dataset scripts are no longer supported, but found rvl_cdip.py

In [None]:
# Cell 1: Hugging Face Login
from huggingface_hub import login

# This will prompt you for your access token.
login()


In [None]:
# ==============================================================================
# STEP 1: SETUP AND INSTALLATIONS
# ==============================================================================
print("STEP 1: Installing all required libraries...")
!sudo apt install tesseract-ocr
!pip install pytesseract easyocr torch torchvision torchaudio --progress-bar off
!pip install datasets transformers accelerate sentencepiece sentence-splitter protobuf lxml jiwer --progress-bar off --quiet
print("Installation complete.\n")


# ==============================================================================
# STEP 2: IMPORTS AND MODEL INITIALIZATION
# ==============================================================================
import pytesseract
import easyocr
from datasets import load_dataset
from transformers import pipeline
from PIL import Image
import numpy as np
import jiwer
import json
from tqdm.notebook import tqdm
import warnings

warnings.filterwarnings("ignore", category=UserWarning)

print("STEP 2: Importing libraries and initializing models...")

# --- Log in to Hugging Face (Crucial Step) ---
from huggingface_hub import login
print("Please log in to your Hugging Face account.")
login()

models = {}
# --- Initialize Models ---
print("Initializing EasyOCR...")
models['EasyOCR'] = easyocr.Reader(['en'], gpu=True)
TESSERACT_CONFIG = "--psm 3"
models['Tesseract'] = True

print("Initializing Donut model...")
models['Donut'] = pipeline("document-question-answering", model="naver-clova-ix/donut-base")

print("Initializing Nougat model...")
models['Nougat'] = pipeline("image-to-text", model="facebook/nougat-base")

print("\nModel initialization phase complete.\n")


# ==============================================================================
# STEP 3: DATA LOADING AND GROUND TRUTH PARSING
# ==============================================================================
print("STEP 3: Loading CORD-v2 dataset and defining ground truth parser...")

def parse_cord_ground_truth(ground_truth_json_string):
    """Parses the JSON string from CORD-v2 to extract all text."""
    try:
        gt_data = json.loads(ground_truth_json_string)
        all_text = []
        # The data is in a list of dictionaries under the 'valid_line' key
        for line in gt_data['valid_line']:
            words = line['words']
            for word in words:
                all_text.append(word['text'])
        return " ".join(all_text)
    except Exception as e:
        print(f"Warning: Could not parse ground truth JSON. Error: {e}")
        return ""

DATASET_NAME = "naver-clova-ix/cord-v2"
DATASET_SPLIT = "test"
NUM_SAMPLES = 500 # Upper bound; capped below at the size of the test split

try:
    dataset = load_dataset(DATASET_NAME, split=DATASET_SPLIT)
    actual_num_samples = min(NUM_SAMPLES, len(dataset))
    dataset_subset = dataset.select(range(actual_num_samples))
    print(f"[✅ SUCCESS] Loaded {len(dataset_subset)} samples from {DATASET_NAME}.\n")
except Exception as e:
    print(f"[❌ FAIL] Could not load dataset {DATASET_NAME}. Halting. Error: {e}")
    raise SystemExit(1)  # sys is not imported in this cell, so raise SystemExit directly


# ==============================================================================
# STEP 4: STANDARDIZED OCR FUNCTIONS
# ==============================================================================
print("STEP 4: Defining robust, standardized OCR functions...")

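def run_tesseract(image):
    """Runs Tesseract and returns the raw transcription ("" on failure)."""
    try:
        return pytesseract.image_to_string(image, config=TESSERACT_CONFIG)
    except Exception as e:
        # Report the failure: an empty prediction is scored as all deletions.
        print(f"Warning: Tesseract failed on an image. Error: {e}")
        return ""

def run_easyocr(image):
    """Runs EasyOCR and joins the recognized text of every detected box."""
    try:
        results = models['EasyOCR'].readtext(np.array(image), paragraph=False)
        # Each result is (bounding_box, text, confidence); keep only the text.
        return " ".join([res[1] for res in results])
    except Exception as e:
        print(f"Warning: EasyOCR failed on an image. Error: {e}")
        return ""

def run_donut(image):
    """Queries the Donut pipeline with a fixed question and joins its answers."""
    try:
        question = "What is the content of this receipt?"
        results = models['Donut'](image=image, question=question)
        return " ".join([ans['answer'] for ans in results])
    except Exception as e:
        print(f"Warning: Donut failed on an image. Error: {e}")
        return ""

def run_nougat(image):
    """Runs the Nougat pipeline and returns its generated text."""
    try:
        results = models['Nougat'](image)
        return results[0]['generated_text']
    except Exception as e:
        print(f"Warning: Nougat failed on an image. Error: {e}")
        return ""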

print("OCR functions are ready.\n")


# ==============================================================================
# STEP 5: MAIN EVALUATION FUNCTION
# ==============================================================================
print("STEP 5: Defining the main quantitative evaluation loop...")

def evaluate_model_quantitative(dataset, ocr_function, model_name):
    print(f"\n--- Evaluating {model_name} ---")
    if not models.get(model_name):
        print(f"Skipping {model_name} as it failed to initialize.")
        return

    all_gt_texts, all_pred_texts = [], []
    for item in tqdm(dataset, desc=f"Processing images with {model_name}"):
        image = item['image'].convert("RGB")
        ground_truth_text = parse_cord_ground_truth(item['ground_truth'])
        predicted_text = ocr_function(image)
        if ground_truth_text:
            all_gt_texts.append(ground_truth_text)
            all_pred_texts.append(predicted_text)

    corpus_gt = "\n".join(all_gt_texts)
    corpus_pred = "\n".join(all_pred_texts)

    if not corpus_gt:
        print("Could not parse any ground truth text for this dataset subset. Cannot calculate metrics.")
        return

    wer = jiwer.wer(corpus_gt, corpus_pred)
    cer = jiwer.cer(corpus_gt, corpus_pred)

    print(f"\n--- Quantitative Results for {model_name} ---")
    print(f"  Word Error Rate (WER):      {wer:.4f}")
    print(f"  Character Error Rate (CER): {cer:.4f}")
    print("-" * 40)

print("Evaluation function defined.\n")


# ==============================================================================
# STEP 6: RUN THE FULL EVALUATION
# ==============================================================================
print("STEP 6: Starting the full quantitative evaluation...")

models_to_run = {
    "Tesseract": run_tesseract,
    "EasyOCR": run_easyocr,
    "Donut": run_donut,
    "Nougat": run_nougat,
}

for model_name, ocr_function in models_to_run.items():
    evaluate_model_quantitative(dataset_subset, ocr_function, model_name)

print("\n\n" + "="*80)
print("Quantitative comparison complete.")

STEP 1: Installing all required libraries...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.
Installation complete.

STEP 2: Importing libraries and initializing models...
Please log in to your Hugging Face account.


Initializing EasyOCR...
Initializing Donut model...
Device set to use cuda:0
Initializing Nougat model...
Device set to use cuda:0



Model initialization phase complete.

STEP 3: Loading CORD-v2 dataset and defining ground truth parser...
[✅ SUCCESS] Loaded 100 samples from naver-clova-ix/cord-v2.

STEP 4: Defining robust, standardized OCR functions...
OCR functions are ready.

STEP 5: Defining the main quantitative evaluation loop...
Evaluation function defined.

STEP 6: Starting the full quantitative evaluation...

--- Evaluating Tesseract ---


--- Quantitative Results for Tesseract ---
  Word Error Rate (WER):      0.8873
  Character Error Rate (CER): 0.6849
----------------------------------------

--- Evaluating EasyOCR ---


--- Quantitative Results for EasyOCR ---
  Word Error Rate (WER):      0.7216
  Character Error Rate (CER): 0.3227
----------------------------------------

--- Evaluating Donut ---


--- Quantitative Results for Donut ---
  Word Error Rate (WER):      1.0000
  Character Error Rate (CER): 1.0000
----------------------------------------

--- Evaluating Nougat ---


--- Quantitative Results for Nougat ---
  Word Error Rate (WER):      1.5698
  Character Error Rate (CER): 1.5657
----------------------------------------


Quantitative comparison complete.
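
The Donut and Nougat scores are easier to interpret with the WER definition in hand: WER = (substitutions + deletions + insertions) / (number of reference words). An empty prediction deletes every reference word, which gives exactly 1.0, and a prediction with many extra words pushes the score above 1.0. The sketch below is a plain-Python word-level edit distance written for illustration; it is not jiwer's implementation, and the names `edit_distance` and `word_error_rate` as well as the sample receipt text are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; every edit costs 1."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp (i deletions)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "NASI GORENG 25.000"  # made-up receipt line, 3 words

empty_wer = word_error_rate(reference, "")
exact_wer = word_error_rate(reference, "NASI GORENG 25.000")
# 9 unrelated words against 3 reference words: 3 substitutions + 6 insertions.
verbose_wer = word_error_rate(
    reference, "the receipt shows a table of items and prices"
)

print(empty_wer, exact_wer, verbose_wer)  # 1.0 0.0 3.0
```

Read this way, Donut's 1.0000 for both WER and CER is consistent with `run_donut` returning an empty string for every image, although the log alone does not prove that. Nougat's scores above 1.0 are consistent with output much longer than the receipt text.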


In [None]:
# ==============================================================================
# STEP 1: SETUP AND INSTALLATIONS (No changes needed)
# ==============================================================================
print("STEP 1: Installing all required libraries for docTR benchmark...")
!pip install "python-doctr[torch]" --progress-bar off --quiet
!pip install datasets jiwer --progress-bar off --quiet
print("Installation complete.\n")


# ==============================================================================
# STEP 2: IMPORTS AND MODEL INITIALIZATION (No changes needed)
# ==============================================================================
from doctr.models import ocr_predictor
from datasets import load_dataset
from PIL import Image
import torch
import numpy as np
import jiwer
import json
from tqdm.notebook import tqdm
import sys
import warnings

warnings.filterwarnings("ignore")

print("STEP 2: Importing libraries and initializing docTR model...")

# --- Log in to Hugging Face (Required to download the dataset reliably) ---
from huggingface_hub import login
print("Please log in to your Hugging Face account to access the dataset.")
login()

# --- Initialize docTR ---
try:
    print("Initializing docTR predictor...")
    predictor = ocr_predictor(pretrained=True)
    if torch.cuda.is_available():
        print("CUDA is available. Moving model to GPU.")
        predictor = predictor.to('cuda')
    else:
        print("CUDA not available. Model will run on CPU.")

    models = {'docTR': predictor}
    print("[✅ SUCCESS] docTR initialized.")
except Exception as e:
    print(f"[❌ FAIL] Could not initialize docTR. Halting. Error: {e}")
    sys.exit()

print("\nModel initialization phase complete.\n")


# ==============================================================================
# STEP 3: DATA LOADING AND GROUND TRUTH PARSING (No changes needed)
# ==============================================================================
print("STEP 3: Loading CORD-v2 dataset and defining ground truth parser...")

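def parse_cord_ground_truth(ground_truth_json_string):
    """Parses the JSON string from CORD-v2 to extract all text."""
    try:
        gt_data = json.loads(ground_truth_json_string)
        all_text = []
        # The data is in a list of dictionaries under the 'valid_line' key
        for line in gt_data['valid_line']:
            words = line['words']
            for word in words:
                all_text.append(word['text'])
        return " ".join(all_text)
    except Exception as e:
        print(f"Warning: Could not parse ground truth JSON. Error: {e}")
        return ""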

DATASET_NAME = "naver-clova-ix/cord-v2"
DATASET_SPLIT = "test"
NUM_SAMPLES = 500

try:
    dataset = load_dataset(DATASET_NAME, split=DATASET_SPLIT)
    actual_num_samples = min(NUM_SAMPLES, len(dataset))
    dataset_subset = dataset.select(range(actual_num_samples))
    print(f"[✅ SUCCESS] Loaded {len(dataset_subset)} samples from {DATASET_NAME}.\n")
except Exception as e:
    print(f"[❌ FAIL] Could not load dataset {DATASET_NAME}. Halting. Error: {e}")
    sys.exit()


# ==============================================================================
# STEP 4: STANDARDIZED OCR FUNCTION FOR docTR (FIXED)
# ==============================================================================
print("STEP 4: Defining the standardized OCR function for docTR...")

def run_doctr(image):
    """Runs docTR on a PIL image and returns the page text ("" on failure)."""
    try:
        # The predictor takes a list of NumPy arrays, so convert the PIL image first.
        image_np = np.array(image)
        result = models['docTR']([image_np])

        # render() returns the recognized text of the page as a single string.
        return result.pages[0].render()
    except Exception as e:
        print(f"Warning: docTR failed on an image. Error: {e}")
        return ""

print("OCR function is ready.\n")


# ==============================================================================
# STEP 5: MAIN EVALUATION FUNCTION (No changes needed)
# ==============================================================================
print("STEP 5: Defining the main quantitative evaluation loop...")

def evaluate_model_quantitative(dataset, ocr_function, model_name):
    print(f"\n--- Evaluating {model_name} ---")

    all_gt_texts, all_pred_texts = [], []
    for item in tqdm(dataset, desc=f"Processing images with {model_name}"):
        image = item['image'].convert("RGB")
        ground_truth_text = parse_cord_ground_truth(item['ground_truth'])
        predicted_text = ocr_function(image)
        if ground_truth_text:
            all_gt_texts.append(ground_truth_text)
            all_pred_texts.append(predicted_text)

    corpus_gt = "\n".join(all_gt_texts)
    corpus_pred = "\n".join(all_pred_texts)

    wer = jiwer.wer(corpus_gt, corpus_pred)
    cer = jiwer.cer(corpus_gt, corpus_pred)

    print(f"\n--- Quantitative Results for {model_name} ---")
    print(f"  Word Error Rate (WER):      {wer:.4f}")
    print(f"  Character Error Rate (CER): {cer:.4f}")
    print("-" * 40)
    return wer, cer

print("Evaluation function defined.\n")


# ==============================================================================
# STEP 6: RUN THE EVALUATION AND COMPARE
# ==============================================================================
print("STEP 6: Starting the docTR quantitative evaluation...")

doctr_wer, doctr_cer = evaluate_model_quantitative(dataset_subset, run_doctr, "docTR")

print("\n\n" + "="*80)
print("Bake-Off Stage 1: docTR Benchmark Complete.")
print("="*80)
print("\n--- Final Score Summary ---")
print(f"\ndocTR:")
print(f"  - Word Error Rate (WER): {doctr_wer:.4f}")
print(f"  - Character Error Rate (CER): {doctr_cer:.4f}")

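# NOTE: These EasyOCR scores are hard-coded from an earlier run, not computed in this cell.
# They differ from the EasyOCR scores in the four-model log above (WER 0.7216, CER 0.3227).
print("\nFor comparison, here are the previous results for our champion:")
print("EasyOCR (Previous Champion):")
print("  - Word Error Rate (WER): 0.7937")
print("  - Character Error Rate (CER): 0.3540")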

print("\n(Note: PaddleOCR failed to initialize and is not part of this comparison.)")
print("\nLower scores are better.")

STEP 1: Installing all required libraries for docTR benchmark...
Installation complete.

STEP 2: Importing libraries and initializing docTR model...
Please log in to your Hugging Face account to access the dataset.


Initializing docTR predictor...
CUDA is available. Moving model to GPU.
[✅ SUCCESS] docTR initialized.

Model initialization phase complete.

STEP 3: Loading CORD-v2 dataset and defining ground truth parser...
[✅ SUCCESS] Loaded 100 samples from naver-clova-ix/cord-v2.

STEP 4: Defining the standardized OCR function for docTR...
OCR function is ready.

STEP 5: Defining the main quantitative evaluation loop...
Evaluation function defined.

STEP 6: Starting the docTR quantitative evaluation...

--- Evaluating docTR ---


--- Quantitative Results for docTR ---
  Word Error Rate (WER):      0.8604
  Character Error Rate (CER): 0.2519
----------------------------------------


Bake-Off Stage 1: docTR Benchmark Complete.

--- Final Score Summary ---

docTR:
  - Word Error Rate (WER): 0.8604
  - Character Error Rate (CER): 0.2519

For comparison, here are the previous results for our champion:
EasyOCR (Previous Champion):
  - Word Error Rate (WER): 0.7937
  - Character Error Rate (CER): 0.3540

(Note: PaddleOCR failed to initialize and is not part of this comparison.)

Lower scores are better.
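
To compare all five engines side by side, the scores can be collected in one structure instead of being hard-coded into print statements. The sketch below copies the numbers from the two logs above (100 CORD-v2 test images each). The `results` dictionary and the `best_by` helper are hypothetical names, and the EasyOCR row uses the scores from the four-model log (0.7216 / 0.3227), not the hard-coded "previous champion" values.

```python
# Scores copied from the two run logs above; lower is better for both metrics.
results = {
    "Tesseract": {"wer": 0.8873, "cer": 0.6849},
    "EasyOCR":   {"wer": 0.7216, "cer": 0.3227},
    "Donut":     {"wer": 1.0000, "cer": 1.0000},
    "Nougat":    {"wer": 1.5698, "cer": 1.5657},
    "docTR":     {"wer": 0.8604, "cer": 0.2519},
}


def best_by(metric):
    """Returns the model name with the lowest value for the given metric."""
    return min(results, key=lambda name: results[name][metric])


# Print a small table ordered by WER, best first.
print(f"{'Model':<10} {'WER':>8} {'CER':>8}")
for name in sorted(results, key=lambda n: results[n]["wer"]):
    scores = results[name]
    print(f"{name:<10} {scores['wer']:>8.4f} {scores['cer']:>8.4f}")

print(f"Lowest WER: {best_by('wer')}")
print(f"Lowest CER: {best_by('cer')}")
```

On these numbers the two metrics disagree: EasyOCR has the lowest WER and docTR the lowest CER, so the choice of "champion" depends on which metric matters for the task.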

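
docTR's WER (0.8604) is high relative to its CER (0.2519), a pattern that can come from token boundaries or word order as much as from misread characters. One thing worth checking is whitespace: the ground truth is joined with single spaces, while OCR output may separate lines with newlines (to our knowledge docTR's `render()` does). Whether jiwer's default transform treats a newline as a word boundary depends on the jiwer version and is an assumption to verify, not something this notebook confirms. The sketch below only shows the tokenization difference in plain Python; `normalize_whitespace` is a hypothetical helper and the sample string is made up.

```python
def normalize_whitespace(text):
    """Collapses every run of spaces, tabs, and newlines into a single space."""
    return " ".join(text.split())


# Made-up OCR output with line breaks between receipt lines.
ocr_output = "TOTAL\n25.000\n\nCASH 50.000"

# Splitting on a single space keeps newline-separated words glued together.
raw_tokens = ocr_output.split(" ")
# After normalization, each word becomes its own token.
clean_tokens = normalize_whitespace(ocr_output).split(" ")

print(raw_tokens)    # ['TOTAL\n25.000\n\nCASH', '50.000']
print(clean_tokens)  # ['TOTAL', '25.000', 'CASH', '50.000']
```

If this turns out to matter, one option is to apply the same normalization to both the ground truth and the prediction before they are appended to `all_gt_texts` and `all_pred_texts`, then rerun every engine so the scores stay comparable.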