This repository contains data and scripts for reproducing the evaluation of the WMT23 Sign Language Translation Task.
The files related to the human evaluation reside in the directory `human_evaluation`.
Scripts:
- `generate-batches.sh` - produces batches for Appraise
- `generate-snippets.sh` - splits documents into 10-segment long chunks
- `generate-ranking.sh` - computes ranking from scores exported from Appraise
- `scripts/iaa.py` - for generating intra-annotator agreements
- `scripts/create_histogram.py` - for generating the histogram that appears in the paper
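For orientation, here is a minimal sketch of the order in which these scripts are presumably run. Argument lists are omitted and may well be required; check each script's header or usage message before running.

```bash
# Hypothetical end-to-end order, run from inside human_evaluation/;
# the exact arguments of each script are not shown here.
./generate-snippets.sh               # split documents into 10-segment chunks
./generate-batches.sh                # build Appraise batches (needs Appraise, see below)
# ... run the campaign in Appraise and export scores/*.csv ...
./generate-ranking.sh                # compute the system ranking from the scores
python scripts/iaa.py                # intra-annotator agreement
python scripts/create_histogram.py   # histogram that appears in the paper
```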
Data:
- `slttest2023.dsgs-de.all.xml` - the official test set
- `submissions/*.xml` - the official submissions to the shared task
- `submissions/slttest23-doc-snippets.tsv` - document chunks
- `batches/*.json` - JSON batches for creating a campaign in Appraise
- `scores/*.csv` - scores exported from Appraise
- `ranking.log` - output of the Appraise script for computing system rankings
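A few hedged shell one-liners for a first look at these files; the `<seg` element name assumes the usual WMT-style XML test set format, so verify against the actual file.

```bash
# Quick sanity checks on the data files listed above.
grep -c "<seg" slttest2023.dsgs-de.all.xml         # rough segment count (assumes WMT-style XML)
head -n 3 submissions/slttest23-doc-snippets.tsv   # peek at the document chunks
head -n 3 scores/*.csv                             # inspect the exported Appraise scores
tail -n 20 ranking.log                             # end of the system-ranking output
```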
The files related to the automatic evaluation reside in the directory `automatic_evaluation`.
Scripts (directory `tools`):
- `automaticEval.py` - automatic evaluation with BLEU, chrF++ and BLEURT for WMT-SLT 2023; confidence intervals obtained via bootstrap resampling
- `corrMetricsHuman.py` - Pearson and Spearman correlations for the automatic metrics
- `plotMetrics.py` - 3D plot of the correlation between the automatic metrics
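As with the human-evaluation scripts, a minimal sketch of the presumable invocation order, run from inside `automatic_evaluation/`; real argument lists may differ, so consult each script first.

```bash
# Hypothetical invocation order; arguments are omitted.
python tools/automaticEval.py      # BLEU, chrF++ and BLEURT with bootstrap confidence intervals
python tools/corrMetricsHuman.py   # Pearson/Spearman correlation with the human scores
python tools/plotMetrics.py        # 3D plot of the metric correlations
```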
To run most scripts, you need to create a Python virtual environment and install the requirements:
```bash
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
(The BLEURT requirements include TensorFlow, which is heavy to download. If you don't need BLEURT, feel free to comment out its entry in `requirements.txt` to save time and disk space.)
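One minimal way to do that without editing the file by hand, assuming the BLEURT entry's line mentions "bleurt" (verify against the actual `requirements.txt`):

```bash
# Sketch: install everything except the BLEURT entry to avoid the
# TensorFlow download (assumes the entry's line contains "bleurt").
grep -vi bleurt requirements.txt > requirements-light.txt
pip install -r requirements-light.txt
```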
As an exception, the script `generate-batches.sh` requires installing Appraise from this repository:
```bash
git clone https://github.com/AppraiseDev/Appraise.git
cd Appraise
git checkout 147865c284d340085d1333e1b7ed2a40d52bd703
```