WCE LIG: an open-source toolkit for Word Confidence Estimation V1.5

This toolkit, written in Python (python3), lets you estimate the quality of an automatic translation at the word level. It outputs a good (G) or bad (B) label for each word of the translation hypothesis.

For instance:

Source: give me some pills
Translation hypothesis: me donner des pilules
WCE: B B G G
Human post-edition: donnes moi des pilules
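
The WCE line carries exactly one label per token of the translation hypothesis. The following is a minimal sketch of reading that alignment, assuming whitespace tokenization; the helper name `pair_labels` is hypothetical and is not part of the toolkit.

```python
def pair_labels(hypothesis, labels):
    """Pair each hypothesis token with its G/B label (hypothetical helper)."""
    words = hypothesis.split()
    tags = labels.split()
    if len(words) != len(tags):
        raise ValueError("expected exactly one label per word")
    if any(tag not in {"G", "B"} for tag in tags):
        raise ValueError("labels must be G or B")
    return list(zip(words, tags))


# The example above: two words flagged bad, two flagged good.
pairs = pair_labels("me donner des pilules", "B B G G")
bad_words = [word for word, tag in pairs if tag == "B"]
print(pairs)
print(bad_words)
```

In this example the two words labelled B ("me" and "donner") are the ones the human post-edition changes.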

What does the toolkit do?

First, the toolkit pre-processes the data; then it extracts internal and external features. Finally, it outputs a good (G) or bad (B) label for each word of the translation hypothesis based on those features. Internal features come from the translation system itself, while external features rely on external toolkits to extract linguistic or probabilistic information.

+ New: DBnary has been added as a feature component. More details are in the tools directory.

Which features are extracted?

Here is the list of all the features used in the toolkit.

1 Proper Name	14 Target POS	27 Max
2 Unknown Stemming	15 Target Word	28 Min
3 Number of Word Occurrences	16 Target Stem	29 Nodes
4 Number of Stemming Occurrences	17 Left Target POS	30 Constituent Label
5 Source POS	18 Left Target Word	31 Distance To Root
6 Source Word	19 Left Target Stem	32 Numeric
7 Source Stem	20 Right Target POS	33 Punctuation
8 Left Source POS	21 Right Target Word	34 Stop Word
9 Left Source Word	22 Right Target Stem	35 Occur in Google Translate
10 Left Source Stem	23 Longest Target N-gram Length	36 Occur in Bing Translator
11 Right Source POS	24 Longest Source N-gram Length	37 Polysemy Count -- Target
12 Right Source Word	25 WPP Exact	38 Backoff Behaviour -- Target
13 Right Source Stem	26 WPP Any

A detailed description can be found in the paper directory.
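
To give a feel for what a per-word feature vector might look like, here is a rough sketch that computes a handful of the simpler surface features named in the list (Target Word, Left/Right Target Word, Numeric, Punctuation, Stop Word). It is illustrative only: the function name, the dictionary keys, the boundary markers and the tiny stop-word list are all assumptions, and the toolkit's actual definitions may differ.

```python
import string

# Hypothetical stop-word list for illustration only; the toolkit's own list may differ.
STOP_WORDS = {"de", "des", "le", "la", "les", "un", "une", "et"}


def surface_features(tokens, i):
    """Return a few surface features for the target word at position i."""
    word = tokens[i]
    return {
        "target_word": word,
        # Boundary markers "<s>" and "</s>" are an assumption, not the toolkit's convention.
        "left_target_word": tokens[i - 1] if i > 0 else "<s>",
        "right_target_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        # Crude check: true only for tokens made entirely of digits.
        "numeric": word.isdigit(),
        "punctuation": all(ch in string.punctuation for ch in word),
        "stop_word": word.lower() in STOP_WORDS,
    }


tokens = "me donner des pilules".split()
features = [surface_features(tokens, i) for i in range(len(tokens))]
print(features[2])
```

Features such as POS tags, stems, WPP scores or parse-tree depth need external tools or the translation system's output and are not covered by this sketch.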

How far can we go?

You can achieve state-of-the-art WCE results on the WMT shared task (http://www.statmt.org/wmt14/quality-estimation-task.html). For the English-French quality estimation task:

B	Pr=0.4831	Rc=0.3615	F1=0.4135
G	Pr=0.8417	Rc=0.8978	F1=0.8688

The metrics used are Precision (Pr), Recall (Rc) and F-measure (F1).
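
As a sanity check on the table above, F1 is the harmonic mean of precision and recall, F1 = 2 * Pr * Rc / (Pr + Rc). The sketch below recomputes F1 from the reported Pr and Rc values and also shows how per-label scores could be derived from two label sequences. The function names are hypothetical and this is not the toolkit's evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def label_metrics(reference, predicted, label):
    """Precision, recall and F1 for one label (G or B) over aligned sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    true_pos = sum(r == label and p == label for r, p in zip(reference, predicted))
    predicted_count = sum(p == label for p in predicted)
    reference_count = sum(r == label for r in reference)
    precision = true_pos / predicted_count if predicted_count else 0.0
    recall = true_pos / reference_count if reference_count else 0.0
    return precision, recall, f1_score(precision, recall)


# Recompute F1 from the reported precision and recall.
print(round(f1_score(0.4831, 0.3615), 4))  # B -> 0.4135
print(round(f1_score(0.8417, 0.8978), 4))  # G -> 0.8688
```

Both values agree with the F1 column reported above.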

What is needed?

Set the WCE_ROOT environment variable (see Readme file)
python3
PyYAML-3.11
NLTK for python 3
tools: see tools directory
7zip to decompress data in the input_data directory
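
Before running anything, it can help to check the Python-side prerequisites. The sketch below only tests that WCE_ROOT is set and that the PyYAML and NLTK modules can be found; it does not check package versions, the external tools, or 7zip. The function name `missing_requirements` is hypothetical and not part of the toolkit.

```python
import importlib.util
import os


def missing_requirements(env=os.environ):
    """List the prerequisites from the section above that look unmet."""
    missing = []
    if not env.get("WCE_ROOT"):
        missing.append("WCE_ROOT environment variable")
    # Import names: PyYAML is imported as "yaml", NLTK as "nltk".
    for package, module in (("PyYAML", "yaml"), ("NLTK", "nltk")):
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


for item in missing_requirements():
    print("missing:", item)
```

An empty output means the checks covered here passed; the value WCE_ROOT should point to is described in the Readme file.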

Repository description

wce_system: contains the core of the system
input_data: contains the data used to train your WCE system
tools: contains all the tools needed to make full use of the toolkit
docs: contains the documentation and the scientific papers related to this toolkit

Acknowledgement

This toolkit is part of the project KEHATH (https://kehath.imag.fr/) funded by the French National Research Agency.

Citation

When using this software, please cite:

Christophe Servan, Ngoc Tien Le, Ngoc Quang Luong, Benjamin Lecouteux and Laurent Besacier, 
“An Open Source Toolkit for Word-level Confidence Estimation in Machine Translation”, 
in The Proceedings of The 12th International Workshop on Spoken Language Translation (IWSLT 2015), 
Da Nang, Vietnam, Dec. 2015.
