Corpora

Corpus	Description
CNA	Chinese Gigaword 5, CNA (Central News Agency) part
Wiki	Wikipedia, Chinese part, 2019-05-20 pages-articles dump
ASBC	Sinica corpus 4.0
OntoNotes	OntoNotes 5.0, Chinese part

Unicode normalization (NFKD):
import unicodedata
normalized_string = unicodedata.normalize("NFKD", raw_string)
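As a rough illustration of what this normalization does to typical Chinese text, the sketch below wraps the call in a helper and applies it to two small inputs. The helper name `normalize_text` and the sample strings are made up for this example and are not part of the released resources.

```python
import unicodedata


def normalize_text(raw_string: str) -> str:
    """Apply the same NFKD normalization as the one-liner above."""
    return unicodedata.normalize("NFKD", raw_string)


# Full-width Latin letters and digits become their ASCII counterparts.
fullwidth = "ＡＢＣ１２３"
# The ideographic space (U+3000) becomes an ordinary space (U+0020).
spaced = "中文\u3000測試"

print(normalize_text(fullwidth))
print(repr(normalize_text(spaced)))
```

Ordinary CJK ideographs are left unchanged; mainly compatibility forms such as full-width characters are folded, so a lookup key normalized this way should match keys built from text normalized the same way.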

Corpus	#sents	#words	#characters	#words/sent	#chars/sent	"sent" Type
CNA	13,366,581	632,289,913	1,098,546,752	47.3	82.2	Paragraph
Wiki	5,557,141	247,714,633	461,862,002	44.6	83.1	Paragraph
ASBC	1,297,793	10,409,751	16,331,383	8.0	12.6	Clause
OntoNotes	46,905	958,345	1,515,151	20.4	32.3	Sentence
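The per-"sent" averages in the table above can be reproduced from the raw counts. The following is only a small consistency check; the dictionary name `stats` and the helper `averages` are invented for this sketch, and the numbers are copied from the table.

```python
# (number of "sents", number of words, number of characters) per corpus,
# copied from the statistics table.
stats = {
    "CNA": (13_366_581, 632_289_913, 1_098_546_752),
    "Wiki": (5_557_141, 247_714_633, 461_862_002),
    "ASBC": (1_297_793, 10_409_751, 16_331_383),
    "OntoNotes": (46_905, 958_345, 1_515_151),
}


def averages(sents: int, words: int, chars: int) -> tuple[float, float]:
    """Return (words per sent, characters per sent), rounded to one decimal."""
    return round(words / sents, 1), round(chars / sents, 1)


for corpus, counts in stats.items():
    words_per_sent, chars_per_sent = averages(*counts)
    print(f"{corpus}\t{words_per_sent}\t{chars_per_sent}")
```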

Embedding	Corpora	Corpora size (#characters / #words)	Final embedding size	Dimension
Character	CNA, Wiki	1,560,408,754	13,136	300
Word	CNA, Wiki, ASBC-train	890,414,297	1,355,791	300

Call unicodedata.normalize (see above) on your input text before using the embeddings in custom models.
The corpora for the word embedding are segmented with CkipTagger WS (word segmentation).
Words that are neither among the 30,000 most frequent nor of length <= 20 are removed from the final embedding.
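The vocabulary filter described in the last note can be sketched as follows. This is one reading of the rule, not the original preprocessing script: a word is kept if it is among the most frequent `top_k` words or if its length is at most `max_len`, and it is dropped only when it fails both tests. The function name `filter_vocab`, the assumption that length is counted in characters, and the toy counts are all hypothetical.

```python
from collections import Counter


def filter_vocab(counts: dict[str, int], top_k: int = 30000, max_len: int = 20) -> set[str]:
    """Keep a word if it is among the top_k most frequent OR has length <= max_len."""
    frequent = {word for word, _ in Counter(counts).most_common(top_k)}
    return {word for word in counts if word in frequent or len(word) <= max_len}


# Toy example with small thresholds so the effect is visible.
toy_counts = {"的": 100, "中央通訊社新聞": 50, "abcdefghij": 1, "xy": 1}
kept = filter_vocab(toy_counts, top_k=2, max_len=5)
print(sorted(kept))
```

In the toy run, the long but frequent word survives because of its frequency, the short rare word survives because of its length, and only the word that is both rare and long is removed.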
