
MMTIT-Bench

A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation

Chinese README | Paper | GitHub | HuggingFace

Overview

MMTIT-Bench is a human-verified benchmark for end-to-end Text-Image Machine Translation (TIMT). It contains 1,400 images spanning 14 non-English and non-Chinese languages across diverse real-world scenarios, with bilingual (Chinese & English) translation annotations.

We also propose CPR-Trans (Cognition–Perception–Reasoning for Translation), a reasoning-oriented data paradigm that unifies scene cognition, text perception, and translation reasoning within a structured chain-of-thought framework.

MMTIT-Bench Overview

Benchmark Statistics

| Item | Details |
| --- | --- |
| Total Images | 1,400 |
| Languages | 14 (AR, DE, ES, FR, ID, IT, JA, KO, MS, PT, RU, TH, TR, VI) |
| Translation Directions | Other→Chinese, Other→English |
| Scenarios | Documents, Menus, Books, Attractions, Posters, Commodities, etc. |
| Annotation | Human-verified OCR + bilingual translations |

Data Format

Directory Structure

```
MMTIT-Bench/
├── README.md
├── README_ZH.md
├── annotation.jsonl        # Benchmark annotations
├── images.zip              # Benchmark images
├── eval_comet_demo.py      # COMET evaluation script
└── prediction_demo.jsonl   # Example prediction file
```

Annotation (annotation.jsonl)

Each line is a JSON object:

```json
{
    "image_id": "Korea_Menu_20843.jpg",
    "parsing_anno": "멜로우스트리트\n\n위치: 서울특별시 관악구...",
    "translation_zh": "梅尔街\n\n位置:首尔特别市 冠岳区...",
    "translation_en": "Mellow Street\n\nLocation: 1st Floor, 104 Gwanak-ro..."
}
```
| Field | Description |
| --- | --- |
| `image_id` | Image filename, formatted as `{Language}_{Scenario}_{ID}.jpg` |
| `parsing_anno` | OCR text parsing annotation (source language) |
| `translation_zh` | Chinese translation |
| `translation_en` | English translation |
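The `{Language}_{Scenario}_{ID}.jpg` naming convention can be unpacked with a small helper. This is a sketch; the function name `parse_image_id` is ours, not part of the benchmark tooling:

```python
def parse_image_id(image_id: str) -> dict:
    """Split an image_id like 'Korea_Menu_20843.jpg' into its parts."""
    stem = image_id.rsplit(".", 1)[0]                  # drop the extension
    language, scenario, sample_id = stem.split("_", 2)
    return {"language": language, "scenario": scenario, "id": sample_id}

print(parse_image_id("Korea_Menu_20843.jpg"))
# → {'language': 'Korea', 'scenario': 'Menu', 'id': '20843'}
```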

Prediction File

Your prediction file should be a JSONL file in which each line contains the following fields:

```json
{"image_id": "Korea_Menu_20843.jpg", "pred": "Your model's translation output"}
```
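A file in this format can be produced with a few lines of Python. Here `translate` is a placeholder for your own model call (not part of the benchmark); only the `image_id` and `pred` keys are required:

```python
import json

def write_predictions(examples, translate, path="your_prediction.jsonl"):
    """Write one {"image_id", "pred"} record per example to a JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {"image_id": ex["image_id"], "pred": translate(ex)}
            # ensure_ascii=False keeps Chinese/Korean text human-readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```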

Evaluation

We use COMET (Unbabel/wmt22-comet-da), a learned neural metric, for evaluation.

Install

```bash
pip install unbabel-comet
```

Run

```bash
# Other → Chinese
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2zh \
    --batch_size 16 --gpus 0

# Other → English
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2en \
    --batch_size 16 --gpus 1
```

Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--prediction` | (required) | Path to your prediction JSONL |
| `--annotation` | `annotation.jsonl` | Path to benchmark annotations |
| `--direction` | (required) | `other2zh` or `other2en` |
| `--batch_size` | 16 | Batch size for inference |
| `--gpus` | 0 | Number of GPUs (0 = CPU) |
| `--output` | `comet_results_{direction}.jsonl` | Output path for per-sample scores |
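For reference, here is a sketch of how predictions can be joined with the annotations into COMET's `src`/`mt`/`ref` input format. The field names follow the annotation spec above; the join logic and function name are our assumptions about what `eval_comet_demo.py` does internally:

```python
import json

def build_comet_inputs(annotation_path, prediction_path, direction="other2zh"):
    """Pair each prediction with its source text and reference translation."""
    # The reference field depends on the translation direction.
    ref_key = "translation_zh" if direction == "other2zh" else "translation_en"

    preds = {}
    with open(prediction_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            preds[rec["image_id"]] = rec["pred"]

    data = []
    with open(annotation_path, encoding="utf-8") as f:
        for line in f:
            anno = json.loads(line)
            if anno["image_id"] in preds:       # join on image_id
                data.append({
                    "src": anno["parsing_anno"],
                    "mt": preds[anno["image_id"]],
                    "ref": anno[ref_key],
                })
    return data
```

The resulting list of `{"src", "mt", "ref"}` dicts is the input format COMET's `predict` method expects.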

Citation

```bibtex
@misc{li2026mmtitbench,
      title={MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation},
      author={Gengluo Li and Chengquan Zhang and Yupu Liang and Huawen Shen and Yaping Zhang and Pengyuan Lyu and Weinong Wang and Xingyu Wan and Gangyan Zeng and Han Hu and Can Ma and Yu Zhou},
      year={2026},
      eprint={2603.23896},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2603.23896}
}
```

License

This benchmark is released for research purposes only.
