on the Validator #11

Closed
wenjie-p opened this issue Sep 7, 2021 · 5 comments

@wenjie-p
Contributor

wenjie-p commented Sep 7, 2021

Hi, thanks for the practical toolkit for CV data preprocessing!

I recently used this toolkit to validate data in several languages, but found that the Validator failed to initialize for some of them, e.g. it (Italian). After checking the code, I found that initializing the Validator requires data/$lang/validate.tsv to be present.

So my questions are: 1) Will the missing data be added soon? and 2) How can I prepare the data/$lang/validate.tsv file from scratch?

Thanks in advance!

@ftyers
Owner

ftyers commented Sep 7, 2021

Dear @wenjie-p,

Yep, some of the parts are missing. You can find out which ones using covo missing:

$ covo missing
Missing: 
 Alphabets: 
 Validators: bas bg ca da gl gn ha hy-AM ia it kmr mk mr myv nl rw sk sv-SE ug uz vot
 Phonemisers: ar az ca cnh da de en eo fa fr fy-NL ga-IE hsb mk mr myv or pa-IN pt rm-sursilv rm-vallader ru vot
 Segmenters: ab ar as az ba bas be bg ca cnh cs cy da el en eo es et eu fa fi fr ga-IE gl gn ha hi hu hy-AM ia id it ka kab kk kmr lg lt lv mk mr myv nl pa-IN pl pt ro ru rw sk sl sr sv-SE sw ta th ug uk ur uz vot

The validate.tsv file is language-dependent. To make it, I start with a symbol list built from the validated part of the transcripts:

$ cat validated.tsv | cut -f3 | sed 's/./&\n/g' | sort -f | uniq -c | sort -gr 

1718502  
1126608 e
1103376 a
1091936 i
 912285 o
 733084 n
 692549 t
 649041 r
 622639 l
 508112 s
 407269 c
 351004 d
 313187 u
 261561 p
 243335 m
 194733 
 180632 .
 168332 g
 157350 v
 131152 "
 105937 f
  89786 z
  83428 b
  81091 h
  69627 ,
  38359 L
  37739 I
  34061 '
  33000 q
  32320 S
  31289 è
  25796 C
  25536 A
  21411 P
  20173 M
  18394 à
  16505 ò
  16401 D
  15365 N
  13421 T
  12977 G
  12823 F
  12220 B
  12165 E
  11882 R
  11233 y
   9070 V
   8676 k
   7948 ù
   7407 H
   7354 È
   7350 Q
   6635 U
   6466 -
   5833 O
   5683 w
   5589 ì
   4231 J
   3664 é
   3461 W
   2787 K
   2361 x
   2277 !
   2188 ?
   2130 :
   2125 ’
   1158 Y
   1140 j
    937 Z
    475 ”
    446 ;
    431 “
    333 í
    320 á
    244 X
    237 ó
    231 )
    219 (
    140 –
    116 ō
    103 ú
     99 …
     84 š
     81 č
     71 É
     68 ū
     64 ä
     60 ñ
     59 ʿ
     52 /
     49 ã
     45 ī
     43 ø
     40 ć
     39 Á
     38 ‘
     34 ë
     29 ï
     28 ô
     24 å
     22 ê
     20 ž
     20 ß
     20 °
     19 Ō
     16 Š
     16 ń
     15 ł
     14 æ
     13 î
     13 Č
     10 Ø
     10 Ö
     10 ḥ
      9 Ž
      9 `
      8 ş
      8 Ḥ
      7 ṣ
      7 ř
      7 =
      7 +
      6 Î
      6 ė
      6 đ
      5 ő
      5 ň
      5 ı
      5 ə
      5 ě
      5 ー
      5 ′
      5 „
      5 ´
      5 _
      5 ]
      5 [
      4 ё
      4 Ú
      4 ś
      4 œ
      4 Ł
      4 ę
      4 ð
      4 ː
      4 »
      4 «
      4 ¡
      4 음
      3 б
      3 а
      3 ʾ
      3 þ
      3 û
      3 ț
      3 ṭ
      3 Ș
      3 Ó
      3 ʻ
      3 Đ
      3 $
      3 >
      3 <
      2 張
      2 三
      2 ン
      2 リ
      2 フ
      2 ザ
      2 ة
      2 ד
      2 с
      2 е
      2 µ
      2 ʼ
      2 ż
      2 ź
      2 Ľ
      2 ğ
      2 Ā
      2 À
      2 ・
      2 —
      2 #
      1 禅
      1 旅
      1 峰
      1 家
      1 多
      1 古
      1 丰
      1 万
      1 ノ
      1 サ
      1 キ
      1 カ
      1 ア
      1 あ
      1 ي
      1 ل
      1 غ
      1 ص
      1 س
      1 ر
      1 ו
      1 ה
      1 Ъ
      1 Ц
      1 У
      1 С
      1 О
      1 о
      1 ң
      1 н
      1 љ
      1 л
      1 Д
      1 ꞌ
      1 ÿ
      1 Ü
      1 ŭ
      1 Ṣ
      1 Ş
      1 Ś
      1 Ṛ
      1 º
      1 İ
      1 Ħ
      1 Ė
      1 ễ
      1 Æ
      1 ą
      1 ̨
      1 ☆
      1 ‑
      1 ʹ
      1 ~
      1 }
      1 {

Then I make an ALLOW list from the alphabet.txt:

ALLOW	a	_	0061	_
ALLOW	à	_	00e0	_
ALLOW	b	_	0062	_
ALLOW	c	_	0063	_
ALLOW	d	_	0064	_
ALLOW	e	_	0065	_
ALLOW	é	_	00e9	_
ALLOW	è	_	00e8	_
ALLOW	f	_	0066	_
ALLOW	g	_	0067	_
ALLOW	h	_	0068	_
ALLOW	i	_	0069	_
ALLOW	í	_	00ed	_
ALLOW	ì	_	00ec	_
ALLOW	j	_	006a	_
ALLOW	k	_	006b	_
ALLOW	l	_	006c	_
ALLOW	m	_	006d	_
ALLOW	n	_	006e	_
ALLOW	o	_	006f	_
ALLOW	ó	_	00f3	_
ALLOW	ò	_	00f2	_
ALLOW	p	_	0070	_
ALLOW	q	_	0071	_
ALLOW	r	_	0072	_
ALLOW	s	_	0073	_
ALLOW	t	_	0074	_
ALLOW	u	_	0075	_
ALLOW	ú	_	00fa	_
ALLOW	ù	_	00f9	_
ALLOW	v	_	0076	_
ALLOW	w	_	0077	_
ALLOW	x	_	0078	_
ALLOW	y	_	0079	_
ALLOW	z	_	007a	_
ALLOW	'	_	0027	_
ALLOW	_	_	0020	_
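
The ALLOW rows above follow a regular pattern: the symbol, a placeholder, the four-digit lowercase hex code point, and another placeholder, separated by tabs. As a sketch only, the rows could be generated from a string of alphabet characters like this. The helper names are hypothetical, the layout of alphabet.txt is not shown in this thread, and writing the space as _ is inferred from the last row of the listing.

```python
def allow_line(ch):
    """Build one ALLOW row for a single character.

    Assumption: the space character is displayed as "_" in the symbol
    column, as in the last row of the listing above.
    """
    shown = "_" if ch == " " else ch
    return f"ALLOW\t{shown}\t_\t{ord(ch):04x}\t_"


def allow_list(alphabet):
    """Build ALLOW rows for every character in `alphabet`, in order."""
    return [allow_line(ch) for ch in alphabet]


if __name__ == "__main__":
    print("\n".join(allow_list("aàb' ")))
```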

Then I add in the punctuation, replacing by nothing:

REPL	"	_	_	_
REPL	”	_	_	_
REPL	“	_	_	_
REPL	'	_	_	_
REPL	,	_	_	_
REPL	-	_	_	_
REPL	–	_	_	_
REPL	.	_	_	_
REPL	:	_	_	_
REPL	;	_	_	_
REPL	?	_	_	_

I leave out punctuation symbols that are less frequent or whose removal might result in non-words, e.g. /.

Then I look for symbols which should be replaced with other symbols, e.g. normalising different kinds of apostrophes:

NORM	’	'	_	_
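
To make the three rule types concrete, here is a rough sketch of how such a file might be read and applied. This is not the toolkit's actual Validator. The semantics are assumptions: NORM is applied first, REPL symbols are then dropped, and a sentence containing any symbol that is still not in the ALLOW set is rejected. The function names are hypothetical, and no lowercasing is done.

```python
def parse_rules(lines):
    """Split validate.tsv-style rows into ALLOW, REPL and NORM rules."""
    allow, repl, norm = set(), set(), {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        kind, symbol, target = fields[0], fields[1], fields[2]
        if kind == "ALLOW":
            # Prefer the hex code point column, so the space row
            # (shown as "_", code point 0020) is read correctly.
            if len(fields) > 3 and fields[3] != "_":
                symbol = chr(int(fields[3], 16))
            allow.add(symbol)
        elif kind == "REPL":
            repl.add(symbol)
        elif kind == "NORM":
            norm[symbol] = target
    return allow, repl, norm


def apply_rules(text, allow, repl, norm):
    """Return the cleaned text, or None if a symbol is not allowed."""
    out = []
    for ch in text:
        ch = norm.get(ch, ch)   # assumed order: normalise first
        if ch in repl:          # then drop replaced punctuation
            continue
        if ch not in allow:     # anything else must be allowed
            return None
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rows = [
        "ALLOW\ta\t_\t0061\t_",
        "ALLOW\tl\t_\t006c\t_",
        "ALLOW\t'\t_\t0027\t_",
        "ALLOW\t_\t_\t0020\t_",
        "REPL\t.\t_\t_\t_",
        "NORM\t\u2019\t'\t_\t_",
    ]
    allow, repl, norm = parse_rules(rows)
    print(apply_rules("l\u2019a la.", allow, repl, norm))
    print(apply_rules("a/l", allow, repl, norm))
```

The toy rule set deliberately omits the REPL row for ' shown above, because how the real Validator resolves a symbol that appears in both the ALLOW and REPL lists is not stated in this thread.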

I added Italian in 9e7da6a. If you would like other languages from the missing list, please feel free to open separate issues so that we can discuss any complicated points.

@ftyers
Owner

ftyers commented Sep 7, 2021

Closed in 9e7da6a.

@ftyers ftyers closed this as completed Sep 7, 2021
@wenjie-p
Contributor Author

wenjie-p commented Sep 8, 2021

Dear @ftyers

Thanks for updating data/it/validate.tsv so quickly! However, I found that the update won't take effect unless we add the line
include cvutils/data/it/validate.tsv to MANIFEST.in.

btw, I am happy to add more missing data for this toolkit : )

@ftyers
Owner

ftyers commented Sep 8, 2021

I have a Makefile locally that makes that addition:

all: MANIFEST.in
	rm -f dist/*
	python setup.py sdist bdist_wheel

MANIFEST.in:
	find cvutils/data/ | sed 's/^/include /g' > MANIFEST.in
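
For anyone without make, the MANIFEST.in rule can be approximated in Python. This is a sketch under assumptions, not the maintainer's build script: manifest_lines is a hypothetical helper, and unlike the plain find call above it lists only regular files, not the directories themselves.

```python
from pathlib import Path


def manifest_lines(data_dir):
    """Return one "include <path>" line per regular file under data_dir."""
    root = Path(data_dir)
    return [
        f"include {path.as_posix()}"
        for path in sorted(root.rglob("*"))
        if path.is_file()  # skip directories, unlike a bare `find`
    ]


if __name__ == "__main__":
    # Run from the repository root so the paths stay relative.
    lines = manifest_lines("cvutils/data")
    Path("MANIFEST.in").write_text("\n".join(lines) + "\n", encoding="utf-8")
```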

And then I can upload it to PyPI with twine upload dist/commonvoice-utils-0.2.9.tar.gz. I just did.

And thanks for the offer, I am happy to accept PRs! :)

@ftyers
Owner

ftyers commented Sep 10, 2021

Update: as of 8dd63d9, we are only missing validators for languages that have yet to be released, plus the one in PR #12:

 Validators: da mk mr myv sv-SE

The existing ones need to be checked with the new dataset though.
