This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Ground-truth bilingual dictionaries do not contain exactly 5000 / 1500 unique words #8

Closed
kimdwkimdw opened this issue Dec 28, 2017 · 1 comment

Comments

@kimdwkimdw

Except for the main language pairs such as 'en-es', the ground-truth bilingual dictionaries contain fewer unique words than the 5000 and 1500 that their file names suggest.

For example, I counted the unique words in the first column of each file with a simple shell command:

cat no-en.0-5000.txt | awk -F' ' '{print $1}' | uniq | wc -l

Here is a sample of the resulting counts:

/en-ko.0-5000.txt 4870
/en-tr.0-5000.txt 4998
/en-vi.0-5000.txt 4993
/ko-en.0-5000.txt 4685
/ms-en.0-5000.txt 4998
/no-en.0-5000.txt 4999
/tr-en.0-5000.txt 4943
/vi-en.0-5000.txt 4998

/en-ko.5000-6500.txt 1465
/ko-en.5000-6500.txt 1461
/ms-en.5000-6500.txt 1499
/tr-en.5000-6500.txt 1499
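
A note on the command above: `uniq` only collapses adjacent duplicate lines, so it yields the number of distinct words only when all entries for a given word are consecutive. Below is a minimal sketch of a count that does not depend on ordering, assuming each line is a whitespace-separated source/target pair. The function name `count_unique_sources` and the file globs are illustrative assumptions, not part of the repository.

```shell
# Count distinct source words (first whitespace-separated column) in one file.
# sort -u drops duplicates even when they are not adjacent, unlike plain uniq.
# LC_ALL=C forces byte-wise comparison so that locale collation rules cannot
# merge distinct non-ASCII words.
count_unique_sources() {
    awk '{print $1}' "$1" | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Assumed layout: dictionary files sit in the current directory.
for f in *.0-5000.txt *.5000-6500.txt; do
    [ -e "$f" ] || continue   # skip a glob pattern that matched nothing
    printf '%s %s\n' "$f" "$(count_unique_sources "$f")"
done
```

If the files are already grouped by source word, this should report the same numbers as the `uniq` version.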
@aconneau
Contributor

@kimdwkimdw We will add a note to the README saying that the dictionaries contain approximately 5k and 1.5k unique words. Thanks for pointing this out.
