Tokenization error on Vietnamese alignment #287

Open
monzug opened this issue Mar 15, 2021 · 0 comments
Labels: question (Further information is requested)

monzug commented Mar 15, 2021

This is an interesting scenario. If Vietnamese is set as the Origin language and any other language as the Target, and the text contains several segments, tokenization fails with "The tokenization process was cancelled because origin and target texts don't have the same amount of segments." It works when both languages are set to Vietnamese, and it also works when the text is a single block with no segments. I am fairly sure the error message itself is correct, but why does tokenization produce a different number of lines? Are spaces handled differently in Vietnamese than in other languages? @irina060981, any idea? This is SUPER low priority, since I do not expect many users to align Vietnamese texts, but I am still curious why it fails.
Related to #145

Sample of Vietnamese text:

1882年,法國殖民者占領河內,在對越南進行殖民統治的初期,法國人便開始了根除越南知識階層中的傳統思想的工作。當時的越南,統治階級以外的普通民衆大多數是文盲,趁此機會,法國殖民者故意將人們對漢字的印象複雜化。他們聲稱漢字是特權階級用來統治平民的遺物般的工具,幷煽動、離間民衆對傳統文字的態度,讓人們必須使用所謂簡單易學的平民文字,至此,廢除漢字、專用拉丁字的運動成功開展。這場運動得到了越南的左派的擁護,其成爲了法國殖民者的越南語拉丁化運動成功的莫大助力。

西元939年,吳權建立了越南的獨立王朝以後,歷代政權都沒有盲目地排斥中國文化。尤其是11世紀初越南文化蓬勃發展的李朝時期,越南引進了大量中國文化。除了科舉制度以外,也導入了各種中國的法制度,構築了一個古代中國式的國家形態。位于東亞文化圈的越南至今仍保留了許多具有東亞特色的冠婚喪祭等傳統。

20世紀中葉以前的越南文書主要由兩種語言寫成。一種是被稱爲「漢文」的自古代中國傳入的古漢語文言文,其爲完全意義上的書面語言,與人們在日常交談中使用的語言不同;另一種是漢字與民族文字喃字混合書寫的「漢喃文」,其爲用來記錄日常生活的口語的越南語的書寫系統。歷史上,除了胡朝和西山朝,其他各個時代的統治階層和一部分知識階層大多用漢文書寫正式場合的文書;而普通民衆和一部分知識階層則常使用漢喃文。在有著這樣的歷史的越南語中,60%以上的詞彙都是以越南漢字音「漢越音」發音的漢源詞(漢越詞)。根據法國漢學家昂利·馬伯樂(Henri Maspero)的研究,漢越音爲基于9世紀前後唐代長安方言的漢字音。
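
One way to narrow this down is to count segments under several candidate splitting rules and see which rule makes origin and target disagree. The sketch below is only a diagnostic aid, not the tokenizer's actual code: `count_segments` and the three patterns are hypothetical, and the short strings are stand-ins for real input. It relies on one observation from the sample above, which ends its sentences with the ideographic full stop `。` rather than a Latin period.

```python
import re


def count_segments(text: str, pattern: str) -> int:
    """Split text on the regex pattern and count the non-empty pieces."""
    return sum(1 for piece in re.split(pattern, text) if piece.strip())


# Hypothetical splitting rules; the real tokenizer may use none of these.
RULES = {
    "line_breaks": r"\n+",           # one segment per block of text
    "latin_only": r"[.!?]+",         # sentence ends, Latin punctuation only
    "latin_and_cjk": r"[.!?。！？]+",  # sentence ends, Latin plus CJK punctuation
}

# Stand-in texts: two blocks, three sentences each.
origin = "第一句。第二句。\n第三句。"
target = "First sentence. Second sentence.\nThird sentence."

# Map each rule name to (origin count, target count).
counts = {
    name: (count_segments(origin, pattern), count_segments(target, pattern))
    for name, pattern in RULES.items()
}

for name, (origin_count, target_count) in counts.items():
    status = "match" if origin_count == target_count else "MISMATCH"
    print(f"{name}: origin={origin_count} target={target_count} {status}")
```

With these stand-in strings, only the rule that ignores CJK punctuation yields different counts for the two texts. If the real texts show the same pattern, that would point to sentence-level splitting being applied differently per language, but that remains an assumption until checked against the tokenizer itself.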

monzug added the "question" label (Further information is requested) on Mar 15, 2021
2 participants