A multilingual OCR (Optical Character Recognition) dataset repository.
This dataset is specifically designed for fine-grained word-level OCR tasks, providing precise word-level bounding box annotations for each image.
Each word is annotated with pixel-accurate localization, enabling tasks such as text detection, text recognition, and end-to-end OCR.
This repository currently provides an Arabic OCR dataset. Datasets for English, German, Italian, and Spanish will be released soon.
The technical specifications of each dataset are listed in the table below.
| Language | Version | Sample Link | Page Count | Unique Words | Font Count | Full Dataset Access |
|---|---|---|---|---|---|---|
| Arabic | v1.0 | Gdrive (only 25k pages) | ~521K | 3012869 | 1 | To obtain the full dataset, please contact us at: craneset[at]outlook.com [not free] |
| Arabic | v2.0 | Gdrive (only 13k pages) | ~534K | 2502545 | 5 | To obtain the full dataset, please contact us at: craneset[at]outlook.com [not free] |
*The full Arabic dataset contains over 1M pages across 6 distinct fonts.
*See the sample folder for examples of each font.
*Unique word counts were calculated after removing numbers and punctuation.
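As a rough illustration of how a unique word count of this kind might be reproduced, the sketch below drops digits and punctuation before collecting distinct tokens. The exact normalization behind the table is not documented here, so the function name `count_unique_words`, the whitespace tokenization, and the Unicode-category rule are all assumptions, and the sample strings are made up.

```python
import unicodedata


def count_unique_words(texts):
    """Count distinct whitespace-separated tokens after dropping digits and punctuation.

    Assumption: a character is removed when its Unicode category starts with
    "N" (numbers) or "P" (punctuation). The dataset authors may have used a
    different rule.
    """
    unique = set()
    for text in texts:
        for token in text.split():
            cleaned = "".join(
                ch for ch in token
                if unicodedata.category(ch)[0] not in ("N", "P")
            )
            if cleaned:  # skip tokens that consisted only of digits/punctuation
                unique.add(cleaned)
    return len(unique)


# Made-up sample lines, not taken from the dataset.
sample = ["كلمة عربية، 2024", "كلمة أخرى."]
print(count_unique_words(sample))  # prints 3
```

Applied to the files in texts/, a function like this would give a figure comparable to the Unique Words column only if the same cleaning rule was used.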
The dataset is organized into three main directories at the root level:

    ocr-data/
    ├── images/
    ├── labels/
    └── texts/

images/
- Contains OCR images in PNG format.
- Each image has a corresponding JSON annotation file with the same base filename.
- The JSON file precisely defines the location of each word in the image.
labels/
- Contains JSON annotation files corresponding to the images.
- Each JSON file shares the same base filename as its related image.
- These files define word-level annotations with exact bounding box coordinates.
- The annotation structure is the one described under JSON Annotation Format (per image).
JSON Annotation Format (per image)

    import json

    # path_image_label is a placeholder for the path to one JSON file in labels/
    with open(path_image_label, 'r', encoding='utf-8') as f:
        data = json.load(f)

The loaded data has the following structure:

    {
      "0": {
        "word": "كلمة",
        "location": {
          "x": 3927,
          "y": 481,
          "w": 397,
          "h": 170
        }
      },
      "1": {
        "word": "عربية",
        "location": {
          "x": 3544,
          "y": 481,
          "w": 355,
          "h": 170
        }
      }
    }

- word: The recognized word in the image.
- location: Bounding box of the word:
  - x, y: Top-left corner coordinates
  - w, h: Width and height of the bounding box
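To make the format concrete, here is a minimal sketch that converts one annotation dictionary into corner-style boxes, which many detection tools expect. It assumes the structure documented above, with keys that are integer strings; the helper name `to_corner_boxes` and the placeholder words in the inline sample are illustrative only.

```python
import json


def to_corner_boxes(annotation):
    """Turn {"i": {"word": ..., "location": {x, y, w, h}}} into a list of
    (word, x_min, y_min, x_max, y_max) tuples, ordered by numeric key."""
    boxes = []
    # Assumption: keys are integer strings ("0", "1", ...), so sort numerically.
    for key in sorted(annotation, key=int):
        entry = annotation[key]
        loc = entry["location"]
        x, y, w, h = loc["x"], loc["y"], loc["w"], loc["h"]
        boxes.append((entry["word"], x, y, x + w, y + h))
    return boxes


# Inline sample mirroring the documented structure (placeholder words).
sample = json.loads("""
{
  "0": {"word": "word_a", "location": {"x": 3927, "y": 481, "w": 397, "h": 170}},
  "1": {"word": "word_b", "location": {"x": 3544, "y": 481, "w": 355, "h": 170}}
}
""")

for box in to_corner_boxes(sample):
    print(box)
# ('word_a', 3927, 481, 4324, 651)
# ('word_b', 3544, 481, 3899, 651)
```

In practice the dictionary would come from `json.load` on a file in labels/ rather than from an inline string.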
texts/
- Contains TXT files.
- Each text file corresponds to an image.
- Stores the full, continuous text of the image content.
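Because images, labels, and texts are matched by base filename, the three directories can be joined on that name. The sketch below is one way to do this with the standard library. The function name `collect_samples`, the lowercase `.png`, `.json`, and `.txt` extensions, and the `ocr-data` root path are assumptions based on the layout described here, not a loader shipped with the dataset.

```python
from pathlib import Path


def collect_samples(root):
    """Return (complete, incomplete) for the dataset rooted at `root`.

    complete:   list of (image_path, label_path, text_path) tuples
    incomplete: base filenames whose label or text file is missing
    """
    root = Path(root)
    complete, incomplete = [], []
    for image_path in sorted((root / "images").glob("*.png")):
        stem = image_path.stem  # base filename shared across the three folders
        label_path = root / "labels" / f"{stem}.json"
        text_path = root / "texts" / f"{stem}.txt"
        if label_path.is_file() and text_path.is_file():
            complete.append((image_path, label_path, text_path))
        else:
            incomplete.append(stem)
    return complete, incomplete


if __name__ == "__main__":
    samples, missing = collect_samples("ocr-data")
    print(f"{len(samples)} complete samples, {len(missing)} incomplete")
```

Reporting incomplete entries separately makes it easy to spot a partial download before training starts.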
- Arabic OCR Dataset
- English OCR Dataset
- German OCR Dataset
- Italian OCR Dataset
- Spanish OCR Dataset
Please check the LICENSE file for usage terms and conditions.
Contributions, issues, and feature requests are welcome. Feel free to open an issue or submit a pull request.
For access requests or questions, please open an issue in this repository.
