A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
MMTIT-Bench is a human-verified benchmark for end-to-end Text-Image Machine Translation (TIMT). It contains 1,400 images spanning 14 non-English and non-Chinese languages across diverse real-world scenarios, with bilingual (Chinese & English) translation annotations.
We also propose CPR-Trans (Cognition–Perception–Reasoning for Translation), a reasoning-oriented data paradigm that unifies scene cognition, text perception, and translation reasoning within a structured chain-of-thought framework.
| Item | Details |
|---|---|
| Total Images | 1,400 |
| Languages | 14 (AR, DE, ES, FR, ID, IT, JA, KO, MS, PT, RU, TH, TR, VI) |
| Translation Directions | Other→Chinese, Other→English |
| Scenarios | Documents, Menus, Books, Attractions, Posters, Commodities, etc. |
| Annotation | Human-verified OCR + Bilingual translations |
```
MMTIT-Bench/
├── README.md
├── README_ZH.md
├── annotation.jsonl        # Benchmark annotations
├── images.zip              # Benchmark images
├── eval_comet_demo.py      # COMET evaluation script
└── prediction_demo.jsonl   # Example prediction file
```
Each line of `annotation.jsonl` is a JSON object:
```json
{
  "image_id": "Korea_Menu_20843.jpg",
  "parsing_anno": "멜로우스트리트\n\n위치: 서울특별시 관악구...",
  "translation_zh": "梅尔街\n\n位置:首尔特别市 冠岳区...",
  "translation_en": "Mellow Street\n\nLocation: 1st Floor, 104 Gwanak-ro..."
}
```

| Field | Description |
|---|---|
| `image_id` | Image filename, formatted as `{Language}_{Scenario}_{ID}.jpg` |
| `parsing_anno` | OCR text parsing annotation (source language) |
| `translation_zh` | Chinese translation |
| `translation_en` | English translation |
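The field layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code: the helper names `parse_image_id` and `read_jsonl` are hypothetical, and the parser assumes the language name contains no underscore and the ID is the last underscore-separated token.

```python
import json
from pathlib import Path
from typing import Iterator


def parse_image_id(image_id: str) -> dict:
    """Split '{Language}_{Scenario}_{ID}.jpg' into its three parts."""
    stem = Path(image_id).stem                 # drop the '.jpg' extension
    language, rest = stem.split("_", 1)        # first underscore ends the language
    scenario, sample_id = rest.rsplit("_", 1)  # last underscore starts the ID
    return {"language": language, "scenario": scenario, "id": sample_id}


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


print(parse_image_id("Korea_Menu_20843.jpg"))
# {'language': 'Korea', 'scenario': 'Menu', 'id': '20843'}
```

Calling `read_jsonl("annotation.jsonl")` and passing each record's `image_id` through `parse_image_id` is one way to group samples by language or scenario before reporting per-group scores.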
Your prediction file should be a JSONL file in which each line has the following fields:

```json
{"image_id": "Korea_Menu_20843.jpg", "pred": "Your model's translation output"}
```

We use COMET (`Unbabel/wmt22-comet-da`) as the reference-based automatic evaluation metric.
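Before scoring, predictions have to be written in the JSONL format described above. A minimal sketch of that step follows; `write_predictions` and `dummy_translate` are hypothetical names, and `dummy_translate` stands in for whatever model call produces the translation of an image.

```python
import json
from typing import Callable, Iterable


def write_predictions(image_ids: Iterable[str],
                      translate: Callable[[str], str],
                      out_path: str) -> int:
    """Write one {"image_id", "pred"} JSON object per line; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for image_id in image_ids:
            record = {"image_id": image_id, "pred": translate(image_id)}
            # ensure_ascii=False keeps Chinese output readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def dummy_translate(image_id: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"translation for {image_id}"


if __name__ == "__main__":
    n = write_predictions(["Korea_Menu_20843.jpg"], dummy_translate,
                          "your_prediction.jsonl")
    print(f"wrote {n} prediction(s)")
```

Writing one object per line with `json.dumps` avoids hand-escaping the newlines that multi-line translations contain.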
```bash
pip install unbabel-comet
```

```bash
# Other → Chinese
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2zh \
    --batch_size 16 --gpus 0

# Other → English
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2en \
    --batch_size 16 --gpus 1
```

| Argument | Default | Description |
|---|---|---|
| `--prediction` | (required) | Path to your prediction JSONL |
| `--annotation` | `annotation.jsonl` | Path to benchmark annotations |
| `--direction` | (required) | `other2zh` or `other2en` |
| `--batch_size` | `16` | Batch size for inference |
| `--gpus` | `0` | Number of GPUs (0 = CPU) |
| `--output` | `comet_results_{direction}.jsonl` | Output path for per-sample scores |
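For readers who want to see roughly what such a script has to do, here is a sketch of pairing predictions with annotations and scoring them with the `unbabel-comet` Python API. It is not the code of `eval_comet_demo.py`, which is not reproduced here. Using `parsing_anno` as the COMET source text and skipping images without a prediction are both assumptions, and the function names are hypothetical.

```python
from typing import Dict, List


def build_comet_inputs(annotations: List[dict],
                       predictions: List[dict],
                       direction: str) -> List[Dict[str, str]]:
    """Pair each prediction with its source text and reference translation."""
    ref_field = {"other2zh": "translation_zh",
                 "other2en": "translation_en"}[direction]
    pred_by_id = {p["image_id"]: p["pred"] for p in predictions}
    samples = []
    for anno in annotations:
        if anno["image_id"] not in pred_by_id:
            continue  # assumption: unmatched images are skipped, not scored as empty
        samples.append({
            "src": anno["parsing_anno"],  # assumption: OCR annotation serves as source
            "mt": pred_by_id[anno["image_id"]],
            "ref": anno[ref_field],
        })
    return samples


def score_with_comet(samples, batch_size=16, gpus=0):
    """Score samples with Unbabel/wmt22-comet-da (downloads the checkpoint)."""
    # Imported lazily so the pairing logic can be used without COMET installed.
    from comet import download_model, load_from_checkpoint
    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    output = model.predict(samples, batch_size=batch_size, gpus=gpus)
    return output.system_score, output.scores
```

Keeping the pairing step separate from the model call makes it easy to check coverage (how many of the 1,400 images received a prediction) before spending time on inference.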
```bibtex
@misc{li2026mmtitbench,
  title={MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation},
  author={Gengluo Li and Chengquan Zhang and Yupu Liang and Huawen Shen and Yaping Zhang and Pengyuan Lyu and Weinong Wang and Xingyu Wan and Gangyan Zeng and Han Hu and Can Ma and Yu Zhou},
  year={2026},
  journal={arXiv preprint arXiv:2603.23896},
  url={https://arxiv.org/abs/2603.23896},
}
```

This benchmark is released for research purposes only.
