ocr-data

A multilingual OCR (Optical Character Recognition) dataset repository.

This dataset is specifically designed for fine-grained word-level OCR tasks, providing precise word-level bounding box annotations for each image.

Each word is annotated with pixel-accurate localization, enabling tasks such as text detection, text recognition, and end-to-end OCR.

This repository currently provides an Arabic OCR dataset. Datasets for English, German, Italian, and Spanish will be released soon.

📦 Available Datasets

The technical specifications of each dataset are listed in the table below.

Language	Version	Link	Pages Count	Unique Words	Fonts Count	Full Dataset Link
Arabic	v1.0	Gdrive (only 25k pages)	~521K	3012869	1	To obtain the full dataset, please contact us at: craneset[at]outlook.com [not free]
Arabic	v2.0	Gdrive (only 13k pages)	~534K	2502545	5	To obtain the full dataset, please contact us at: craneset[at]outlook.com [not free]

*The full Arabic dataset contains over 1M pages in 6 different fonts.
*See the sample folder for examples of each font.
*Unique-word counts were calculated after removing numbers and punctuation.
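
A unique-word count of this kind could be computed roughly as follows. This is a minimal sketch, not the script used to build the table: the helper name `count_unique_words` is hypothetical, and it assumes that "removing numbers and punctuation" means replacing every character in a Unicode number (N*) or punctuation (P*) category with a space before splitting on whitespace.

```python
import unicodedata


def count_unique_words(text: str) -> int:
    """Count distinct whitespace-separated tokens after dropping digits and punctuation.

    Assumption: characters in Unicode categories N* (numbers) and P* (punctuation)
    are replaced by spaces; the exact rule used for the dataset is not documented.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in ("N", "P") else ch
        for ch in text
    )
    return len(set(cleaned.split()))
```

Because removed characters become spaces, a token such as "a-b" is counted as two words; a different convention would give slightly different totals.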

📁 Dataset Structure

The dataset is organized into three main directories at the root level:

ocr-data/
├── images/
├── labels/
└── texts/

📷 images/

Contains the OCR page images in PNG format.
Each image has a corresponding JSON annotation file in labels/ with the same base filename.
The JSON file defines the location of each word in the image.

🏷 labels/

Contains JSON annotation files corresponding to the images.
Each JSON file shares the same base filename as its related image.
These files define word-level annotations with exact bounding box coordinates.
The annotation structure follows the JSON Annotation Format shown in this section.
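
Since images and labels are matched by base filename, pairing them can be sketched as below. The function name `pair_images_with_labels` is hypothetical, and the sketch assumes flat `images/` and `labels/` directories with lowercase `.png` and `.json` extensions.

```python
from pathlib import Path


def pair_images_with_labels(root: str) -> list[tuple[Path, Path]]:
    """Return (image, label) path pairs that share a base filename."""
    root_dir = Path(root)
    pairs = []
    for image_path in sorted((root_dir / "images").glob("*.png")):
        # Assumed layout: labels/<same base name>.json
        label_path = root_dir / "labels" / f"{image_path.stem}.json"
        if label_path.exists():  # skip images without an annotation file
            pairs.append((image_path, label_path))
    return pairs
```

Calling `pair_images_with_labels("ocr-data")` would return one tuple per image that has a matching annotation file.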

JSON Annotation Format (per image)

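import json

# Path to one annotation file in labels/ (hypothetical example filename)
path_image_label = "labels/example.json"
with open(path_image_label, 'r', encoding='utf-8') as f:
    data = json.load(f)  # dict keyed by word index: "0", "1", ...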

{
  "0": {
    "word": "كلمة",
    "location": {
      "x": 3927,
      "y": 481,
      "w": 397,
      "h": 170
    }
  },
  "1": {
    "word": "عربية",
    "location": {
      "x": 3544,
      "y": 481,
      "w": 355,
      "h": 170
    }
  }
}

word: The recognized word in the image.
location: Bounding box of the word:
- x, y: Top-left corner coordinates
- w, h: Width and height of the bounding box
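
Many detection toolkits expect corner coordinates rather than x, y, w, h. The conversion can be sketched as follows, assuming the keys are integer-like strings as in the example above; the helper name `to_corner_boxes` is hypothetical.

```python
import json


def to_corner_boxes(annotation: dict) -> list[tuple[str, int, int, int, int]]:
    """Convert x/y/w/h boxes to (word, x_min, y_min, x_max, y_max) tuples."""
    boxes = []
    # Keys are string indices ("0", "1", ...); sort numerically to keep word order.
    for key in sorted(annotation, key=int):
        entry = annotation[key]
        loc = entry["location"]
        boxes.append((
            entry["word"],
            loc["x"],
            loc["y"],
            loc["x"] + loc["w"],  # right edge
            loc["y"] + loc["h"],  # bottom edge
        ))
    return boxes


sample = json.loads(
    '{"0": {"word": "كلمة", "location": {"x": 3927, "y": 481, "w": 397, "h": 170}}}'
)
print(to_corner_boxes(sample))
```

Whether the right and bottom edges should be treated as inclusive or exclusive depends on the downstream tool and is not specified by the dataset.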

📝 texts/

Contains TXT files, one per image.
Each file stores the full, continuous text of the corresponding image.
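
Reading the text for a given image might look like the sketch below. It assumes, by analogy with labels/, that each text file shares the image's base filename with a `.txt` extension and is UTF-8 encoded; neither detail is stated here, and the helper name `read_page_text` is hypothetical.

```python
from pathlib import Path


def read_page_text(root: str, image_name: str) -> str:
    """Read the full text for an image, assuming texts/<base name>.txt in UTF-8."""
    text_path = Path(root) / "texts" / f"{Path(image_name).stem}.txt"
    return text_path.read_text(encoding="utf-8")
```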

🚀 Roadmap

📜 License

Please check the LICENSE file for usage terms and conditions.

🤝 Contributing

Contributions, issues, and feature requests are welcome. Feel free to open an issue or submit a pull request.

📬 Contact

For access requests or questions, please open an issue in this repository.
