WCE-SLT-LIG

Corpus for the evaluation of Word Confidence Estimation (WCE) in Spoken Language Translation

======

This corpus contains 6693 speech utterances (DEV: 2643; TST: 4050). For each utterance, a quintuplet is made available: ASR output (src-asr), verbatim transcript (src-ref), text translation output (tgt-mt), speech translation output (tgt-slt), and post-edited translation (tgt-pe).

If you use this corpus, please cite the following paper (a PDF is provided in the root directory of this repository):

    @InProceedings{besacier14,
      Title = {Word Confidence Estimation for Speech Translation},
      Author = {Laurent Besacier and Benjamin Lecouteux and Ngoc Quang Luong and Kaing Hour and Marwa Hadjsalah},
      Booktitle = {Proceedings of The International Workshop on Spoken Language Translation (IWSLT)},
      Year = {2014},
      Address = {Lake Tahoe, USA},
      Month = {December},
      Date-added = {2014-10-01 07:42:11 +0000},
      Date-modified = {2014-10-01 07:44:40 +0000}
    }

The paper above describes V1 of this corpus, which is available in the CORPUS-V1.old directory (2683 utterances only). The other folders described below belong to the new, full version of the corpus (6693 utterances).

========================================================================

The folder WAV_TRANSCRIPTION contains the 6693 recorded speech signals and their transcriptions (DEV: 2643; TST: 4050).

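The folder TXT contains the following files. Each entry gives the line, word, and byte counts (in the order printed by wc), followed by the file path and a short description: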

Source Language (French)

./SRC/

 10881     268285    1761184 SRC/src-ref-all.fr              => 10881 sentences in French (all)
      
   881      20315     132195 SRC/src-ref-dev.fr              => 881 sentences in French (REF Output for dev part)
  2643      65964     397461 SRC/ref-asr-dev-3times.fr.pre   => 881*3 sentences in French (ASR Reference for dev part)
  
  1350      33784     225462 SRC/src-ref-tst.fr              => 1350 sentences in French (REF Output for tst part)
  4050     109212     681789 SRC/ref-asr-tst-3times.fr.pre   => 1350*3 sentences in French (ASR Reference for tst part)

./SRC/ASR1/

  2643      66435     395274 SRC/ASR1/scr-asr-dev-3times.fr  => 881*3 sentences in French (ASR1 Hypothesis for dev part)
  4050     108333     671792 SRC/ASR1/scr-asr-tst-3times.fr  => 1350*3 sentences in French (ASR1 Hypothesis for tst part)

./SRC/ASR2/

  2643      66837     396883 SRC/ASR2/scr-asr-dev-3times.fr  => 881*3 sentences in French (ASR2 Hypothesis for dev part)
  4050     108600     674572 SRC/ASR2/scr-asr-tst-3times.fr  => 1350*3 sentences in French (ASR2 Hypothesis for tst part)
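
The line counts in the listing above can be used as a quick integrity check after downloading the corpus. The following is a minimal sketch, not a script shipped with this repository: the function names `count_lines` and `check_line_counts` are hypothetical, and the expected values are simply copied from the first column of the SRC listing.

```python
from pathlib import Path

# Expected line counts, copied from the first column of the SRC listing.
EXPECTED_SRC_LINES = {
    "SRC/src-ref-all.fr": 10881,
    "SRC/src-ref-dev.fr": 881,
    "SRC/ref-asr-dev-3times.fr.pre": 2643,
    "SRC/src-ref-tst.fr": 1350,
    "SRC/ref-asr-tst-3times.fr.pre": 4050,
    "SRC/ASR1/scr-asr-dev-3times.fr": 2643,
    "SRC/ASR1/scr-asr-tst-3times.fr": 4050,
    "SRC/ASR2/scr-asr-dev-3times.fr": 2643,
    "SRC/ASR2/scr-asr-tst-3times.fr": 4050,
}


def count_lines(path):
    """Count newline characters in a file, like `wc -l`."""
    with open(path, "rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 16), b""))


def check_line_counts(txt_root, expected=EXPECTED_SRC_LINES):
    """Return {relative_path: (expected, found)} for every mismatch.

    `found` is None when the file does not exist under `txt_root`.
    """
    mismatches = {}
    for rel_path, want in expected.items():
        path = Path(txt_root) / rel_path
        found = count_lines(path) if path.is_file() else None
        if found != want:
            mismatches[rel_path] = (want, found)
    return mismatches


if __name__ == "__main__":
    # "TXT" is the folder name used in this README; adjust it to your checkout.
    for name, (want, found) in check_line_counts("TXT").items():
        print(f"{name}: expected {want} lines, found {found}")
```

An empty result means every listed SRC file has the documented number of lines. The same dictionary also encodes the 3times relation: 2643 = 3 * 881 and 4050 = 3 * 1350.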

Target Language (English)

./TGT/

 10881     251410    1522348 TGT/tgt-pe-all.en               => 10881 post-edited MT sentences in English (all)
   881      19606     117857 TGT/tgt-pe-dev.en               => 881 post-edited MT sentences in English (dev part)
  1350      31396     193176 TGT/tgt-pe-tst.en               => 1350 post-edited MT sentences in English (tst part)
  
 10881     238190    1473247 TGT/tgt-ref-all.en              => 10881 manually translated sentences in English (all)
   881      18490     112884 TGT/tgt-ref-dev.en              => 881 manually translated sentences in English (dev part)      
  1350      28886     183741 TGT/tgt-ref-tst.en              => 1350 manually translated sentences in English (tst part)      

 10881     281868    1542560 TGT/tgt-mt-all.en               => 10881 automatically translated sentences in English (all)
   881      22340     120183 TGT/tgt-mt-dev.en               => 881 automatically translated sentences in English (dev part)       
  1350      35213     197574 TGT/tgt-mt-tst.en               => 1350 automatically translated sentences in English (tst part)

./TGT/ASR1/

  2643      61787     352660 TGT/ASR1/tgt-slt-dev-3times.en  => 881*3 SLT (ASR+MT) sentences in English (ASR+MT hypothesis for dev part)
  4050      97977     581357 TGT/ASR1/tgt-slt-tst-3times.en  => 1350*3 SLT (ASR+MT) sentences in English (ASR+MT hypothesis for tst part)

./TGT/ASR2/

  2643      62213     354786 TGT/ASR2/tgt-slt-dev-3times.en  => 881*3 SLT (ASR+MT) sentences in English (ASR+MT hypothesis for dev part)
  4050      97804     581655 TGT/ASR2/tgt-slt-tst-3times.en  => 1350*3 SLT (ASR+MT) sentences in English (ASR+MT hypothesis for tst part)
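
The files listed so far can be combined into the per-utterance quintuplets (src-asr, src-ref, tgt-mt, tgt-slt, tgt-pe). The sketch below rests on several assumptions that this README does not confirm: the files are read as UTF-8, `ref-asr-*-3times.fr.pre` is taken as the per-utterance transcript, and the 881-line (or 1350-line) files are expanded to the 3times length with a layout you must verify yourself. The function names and the `layout` parameter are hypothetical.

```python
from pathlib import Path


def read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines without trailing newlines.

    ASSUMPTION: UTF-8 encoding; the README does not state the encoding.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def expand_3times(lines, layout="tile"):
    """Expand N sentence-level lines to 3*N utterance-level lines.

    ASSUMPTION: the README does not say how the 3 recordings are ordered.
      layout="tile":       s1..sN, s1..sN, s1..sN
      layout="interleave": s1, s1, s1, s2, s2, s2, ...
    """
    if layout == "tile":
        return list(lines) * 3
    if layout == "interleave":
        return [line for line in lines for _ in range(3)]
    raise ValueError(f"unknown layout: {layout!r}")


def load_quintuplets(txt_root, part="dev", asr="ASR1", layout="tile"):
    """Return a list of (src_asr, src_ref, tgt_mt, tgt_slt, tgt_pe) tuples."""
    root = Path(txt_root)
    src_asr = read_lines(root / "SRC" / asr / f"scr-asr-{part}-3times.fr")
    src_ref = read_lines(root / "SRC" / f"ref-asr-{part}-3times.fr.pre")
    tgt_slt = read_lines(root / "TGT" / asr / f"tgt-slt-{part}-3times.en")
    # tgt-mt and tgt-pe exist only at sentence level, so they are expanded.
    tgt_mt = expand_3times(read_lines(root / "TGT" / f"tgt-mt-{part}.en"), layout)
    tgt_pe = expand_3times(read_lines(root / "TGT" / f"tgt-pe-{part}.en"), layout)
    columns = [src_asr, src_ref, tgt_mt, tgt_slt, tgt_pe]
    if len({len(column) for column in columns}) != 1:
        raise ValueError(f"files are not parallel: {[len(c) for c in columns]}")
    return list(zip(*columns))
```

For example, `load_quintuplets("TXT", part="tst", asr="ASR2")` would be expected to return 4050 tuples. Before relying on the result, compare a few src-ref lines against the expanded tgt-pe lines to confirm which layout matches the data.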

Labels computed with WER or TERp-A

./Labels/

 15897     187714    1441827 Labels/Labels-ASR1-dev.pra      => obtained from WER(scr-asr-dev-3times.fr of ASR1, ref-asr-dev-3times.fr.pre)
 24363     297838    2402596 Labels/Labels-ASR1-tst.pra      => obtained from WER(scr-asr-tst-3times.fr of ASR1, ref-asr-tst-3times.fr.pre)
 15897     184286    1432164 Labels/Labels-ASR2-dev.pra      => obtained from WER(scr-asr-dev-3times.fr of ASR2, ref-asr-dev-3times.fr.pre)
 24363     292032    2388547 Labels/Labels-ASR2-tst.pra      => obtained from WER(scr-asr-tst-3times.fr of ASR2, ref-asr-tst-3times.fr.pre)
 
 15182     193887    1115346 Labels/Labels-MT-dev.pra        => obtained from TERp-A(tgt-mt-dev.en, tgt-pe-dev.en)
 23092     301782    1780907 Labels/Labels-MT-tst.pra        => obtained from TERp-A(tgt-mt-tst.en, tgt-pe-tst.en)
 
 45556     556446    3286938 Labels/Labels-SLT-ASR1-dev.pra  => obtained from TERp-A(tgt-slt-dev-3times.en of ASR1, tgt-pe-dev-3times.en)
 69842     874435    5287617 Labels/Labels-SLT-ASR1-tst.pra  => obtained from TERp-A(tgt-slt-tst-3times.en of ASR1, tgt-pe-tst-3times.en)
 45472     558097    3290231 Labels/Labels-SLT-ASR2-dev.pra  => obtained from TERp-A(tgt-slt-dev-3times.en of ASR2, tgt-pe-dev-3times.en)
 69626     871641    5272400 Labels/Labels-SLT-ASR2-tst.pra  => obtained from TERp-A(tgt-slt-tst-3times.en of ASR2, tgt-pe-tst-3times.en) 
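
To illustrate how word-level labels can be derived from a WER-style alignment between an ASR hypothesis and its reference, here is a minimal sketch based on plain Levenshtein alignment. It is not the tool used to produce the .pra files and does not reproduce their format; the function name `wer_labels` and the label symbols "G" (good) and "B" (bad) are assumptions made for this example. The TERp-A labels come from a different alignment procedure and are not covered here.

```python
def wer_labels(hypothesis, reference):
    """Label each hypothesis word via Levenshtein alignment to the reference.

    Returns (labels, edits):
      labels: one entry per hypothesis word, "G" if it is aligned to an
              identical reference word, "B" if it is a substitution or insertion.
      edits:  total substitutions + insertions + deletions, so that
              WER = edits / number of reference words.
    """
    hyp, ref = hypothesis.split(), reference.split()
    rows, cols = len(hyp) + 1, len(ref) + 1

    # cost[i][j] = edit distance between hyp[:i] and ref[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            match_or_sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(match_or_sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, labelling hypothesis words.
    labels = []
    i, j = len(hyp), len(ref)
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            labels.append("G" if hyp[i - 1] == ref[j - 1] else "B")
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            labels.append("B")  # hypothesis word with no reference counterpart
            i -= 1
        else:
            j -= 1  # reference word missing from the hypothesis; nothing to label
    labels.reverse()
    return labels, cost[-1][-1]


if __name__ == "__main__":
    labels, edits = wer_labels("le chien noir dort", "le chat noir")
    print(labels, edits)
```

When several alignments have the same minimal cost, the labels depend on the tie-breaking order in the backtrace (here: match or substitution first, then insertion, then deletion), so they may differ from those of another alignment tool.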
