## README

### Overview
This notebook performs OCR (Optical Character Recognition) on every PDF file in the `input` folder and saves the processed, searchable PDFs to the `output` folder. It uses [ocrmypdf](https://ocrmypdf.readthedocs.io/) and supports multiple languages; the default is Portuguese and English (`por+eng`).

### Features
- Batch processing of PDFs
- Automatic deskew and page rotation
- Multi-language OCR support (customizable via the `LANGUAGES` variable)
- Output files are saved with `_iter` appended to the original file stem (e.g., `input.pdf` becomes `input_iter.pdf`)

### Usage
1. Install `ocrmypdf` and its external dependencies (Tesseract with the language data you need, and Ghostscript).
2. Place your PDF files in the `input` directory.
3. Run all cells in the notebook.
4. Find the OCR-processed PDFs in the `output` directory.
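
The mapping from input files to output files in the steps above can be sketched as follows. The helper names `output_path_for` and `pending_pdfs` are hypothetical and not part of the notebook; the sketch assumes only the `_iter` suffix listed under Features, and it additionally skips files whose output already exists, which the notebook itself does not do.

```python
from pathlib import Path

SUFFIX = "_iter"  # suffix the notebook appends to each output filename


def output_path_for(pdf_file: Path, output_folder: Path) -> Path:
    """Return the output path for one input PDF (hypothetical helper)."""
    return output_folder / f"{pdf_file.stem}{SUFFIX}.pdf"


def pending_pdfs(input_folder: Path, output_folder: Path) -> list[Path]:
    """List input PDFs that have no OCR output yet (hypothetical helper)."""
    return sorted(
        pdf
        for pdf in input_folder.glob("*.pdf")
        if not output_path_for(pdf, output_folder).exists()
    )


if __name__ == "__main__":
    # Show what a run would process, without calling ocrmypdf.
    for pdf in pending_pdfs(Path("input"), Path("output")):
        print(f"{pdf} -> {output_path_for(pdf, Path('output'))}")
```

Listing the pending files first is a cheap way to confirm the folder layout before starting a long OCR run.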

### Customization
- To change the OCR languages, modify the `LANGUAGES` variable (e.g., `'por+eng+spa'` for Portuguese, English, and Spanish).
- To use different folders, change the `INPUT_FOLDER` and `OUTPUT_FOLDER` paths.
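
As a small illustration of the language setting, the sketch below builds the `+`-separated string from a list of Tesseract language codes. The helper `build_language_string` is hypothetical and not part of the notebook. Each code only works if the matching Tesseract language data is installed; `tesseract --list-langs` shows what is available on a given machine.

```python
def build_language_string(codes):
    """Join Tesseract language codes into the 'a+b+c' form used by LANGUAGES."""
    cleaned = []
    for code in codes:
        code = code.strip()
        if not code:
            raise ValueError("empty language code")
        if code not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


LANGUAGES = build_language_string(["por", "eng", "spa"])
print(LANGUAGES)  # por+eng+spa
```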

### Credits
Created by Felipe Bertoglio @ INF-UFRGS 2025.

In [2]:
LANGUAGES = "por+eng" # Portuguese and English
# Optional: por+eng+deu+fra+ita+spa+rus+chi_sim+jpn+kor+ara+heb+tha+vie

In [3]:
import ocrmypdf as ocr
from pathlib import Path

INPUT_FOLDER = Path("input")
OUTPUT_FOLDER = Path("output")


OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)


for pdf_file in INPUT_FOLDER.glob("*.pdf"):
    input_pdf = pdf_file
    output_pdf = OUTPUT_FOLDER / (pdf_file.stem + "_iter.pdf")

    print(f"Processing {input_pdf}...")

    ocr.ocr(
        input_pdf,
        output_pdf,
        language=LANGUAGES,
        deskew=True,
        rotate_pages=True,
    )

    print(f"OCR completed. Output saved to {output_pdf}.")

Processing input\Contrato raw-scan.pdf...


OCR completed. Output saved to output\Contrato raw-scan_iter.pdf.
Processing input\input.pdf...


The output file size is 3.96× larger than the input file.
Possible reasons for this include:
--deskew was issued, causing transcoding.
The optional dependency 'jbig2' was not found, so some image optimizations could not be attempted.
The optional dependency 'pngquant' was not found, so some image optimizations could not be attempted.
PDF/A conversion was enabled. (Try `--output-type pdf`.)



OCR completed. Output saved to output\input_iter.pdf.
Processing input\sv600_g_automatic.pdf...


OCR completed. Output saved to output\sv600_g_automatic_iter.pdf.
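
The size warning logged for `input.pdf` names deskewing and PDF/A conversion among the possible causes. The sketch below shows one way those two settings could be switched through keyword arguments to `ocrmypdf.ocr`. The helpers `ocr_options` and `run_ocr` are hypothetical; `output_type="pdf"` is the API counterpart of the `--output-type pdf` flag the warning mentions. Whether this actually shrinks a given file depends on the input and was not measured here.

```python
def ocr_options(languages, smaller_output=False):
    """Build keyword arguments for ocrmypdf.ocr (hypothetical helper)."""
    options = {
        "language": languages,
        "deskew": True,
        "rotate_pages": True,
    }
    if smaller_output:
        # Skip PDF/A conversion, as the warning suggests (`--output-type pdf`).
        options["output_type"] = "pdf"
        # Deskew causes images to be transcoded; turning it off trades
        # straightened pages for a potentially smaller file.
        options["deskew"] = False
    return options


def run_ocr(input_pdf, output_pdf, **options):
    """Run OCR on one file; ocrmypdf is imported lazily, only when called."""
    import ocrmypdf

    ocrmypdf.ocr(input_pdf, output_pdf, **options)


print(ocr_options("por+eng", smaller_output=True))
```

Installing the optional `jbig2` and `pngquant` tools named in the warning is the other route, and it needs no change to the code.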
