
downsize does not seem to handle Asian languages #15

Open
liushuping opened this issue Apr 27, 2014 · 15 comments

@liushuping

For character-based Asian languages, "word" and "character" are essentially the same concept, and words are not separated by spaces.
For example, the English sentence "The quick brown fox jumps over the lazy dog" is "敏捷的棕毛狐狸从懒狗身上跃过" in Chinese. Downsizing that sentence to 2 words, we expect the result to be "敏捷", but the actual result is not, because the whole sentence is treated as a single word.
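
For reference, a minimal reproduction of what is described above, assuming downsize's usual `{ words: n }` option:

```js
var downsize = require("downsize");

// Expected (per this report): "敏捷", i.e. the first two characters/words.
// Actual: the whole sentence comes back, because it contains no spaces and
// is therefore counted as a single word.
downsize("敏捷的棕毛狐狸从懒狗身上跃过", { words: 2 });
```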

@cgiffard
Owner

Japanese is even harder, since it has a mixture of single- and multi-character words, and words with both ideographic and phonetic components. I think it would be possible to implement a solution correcting this problem for Hanzi and Hangul that increments the word counter on every character in that range, but as far as I can ascertain, that would actually make it harder to provide accurate word counts in Japanese. This isn't a straightforward solution by any stretch, and I might need to implement a technical standard for doing the word breaking for CJK. Other languages such as Arabic and Thai are also problematic — and unlike East Asian languages, I've got absolutely no idea where to start with those.

The long and short of it is that I cannot, and do not want to, include language dictionaries in order to do the word count (there are copyright issues there as well). If I can come up with a solution that gets close enough, I think that's good enough — because the reality is that word-breaking in many non-Latin languages is insanely hard.

I would welcome any help on this from people who know their i18n!

#discuss :)

@cgiffard
Owner

I figure we split the word counting out into a dedicated function, separate from the counting block, and put the i18n logic in there. It's possible that the CJK counting might require lookahead, which is not available via the streaming parser... which would mean the entire architecture needs a review.
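
Something like the following, perhaps: a hypothetical sketch of that separation (the function name and locale values are illustrative, not downsize's actual internals):

```js
// The streaming parser hands each text chunk to a counter chosen up front,
// so language-specific rules live in one place rather than inside the
// parsing loop.
function makeWordCounter(locale) {
  if (locale === "zh") {
    // Character-wise: each CJK ideograph counts as one word; latin runs
    // still count as whitespace-delimited words.
    return function (chunk) {
      return (chunk.match(/[\u4e00-\u9fa5]|[^\s\u4e00-\u9fa5]+/g) || []).length;
    };
  }
  // Default: whitespace-delimited words.
  return function (chunk) {
    return (chunk.match(/\S+/g) || []).length;
  };
}

makeWordCounter("zh")("我住在北京");   // 5
makeWordCounter("en")("the lazy dog"); // 3
```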

@cgiffard
Owner

It's actually really annoying that Han Unification happened — otherwise we'd be able to very easily tell whether text was Traditional Chinese, Simplified Chinese, Korean, or Japanese, just by looking at character ranges. As it is I think we might have to add a significant lookahead to the parser to try and guess the language before truncating. :-/

We could add a flag that allows shortcutting that by letting the user specify a language manually, eliminating false guesses and improving performance by removing the requirement for lookahead.
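
For example (the `language` option shown here is hypothetical, not part of downsize's current API):

```js
var downsize = require("downsize");
var html = "<p>北京に住んでいます</p>";

// Explicit language flag: skips the lookahead/guessing step entirely.
downsize(html, { words: 2, language: "ja" }); // use Japanese counting rules
downsize(html, { words: 2, language: "zh" }); // count every hanzi as a word
```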

In the event that the lookahead sampler guessed wrongly — it would only likely guess Chinese where Japanese was the language (rather than the other way around) — the outcome would be that the Japanese snippets would be very short. I think that's manageable!

I would set the initial lookahead buffer size by determining the longest kanji-only word in Japanese and adding a bit of padding for HTML, etc. Hangul is easy to detect, so that'd be an immediate shortcut. If a Japanese user writes an all-kanji word that's longer than our buffer — well, that's a crazy edge case we probably shouldn't stress about.

I am concerned about mixed language posts. Mixing Chinese and English is relatively straightforward, but Japanese and English could be a bit of a headache... I need to research this more.

@cgiffard
Owner

@yangl1996 @liushuping What's your expectation for a multi-character word like '北京'? Do you consider that one word or two?

@yangl1996

It is two words. Actually every single Chinese character is a word.
Thanks ;)

@yangl1996

If there is anything I can help with as a native Chinese speaker, I am more than glad to help. :P

@adam-zethraeus
Contributor

Since there would still be clear issues when quoting text in a language other than one's base language, I think taking user input to define which language the text should be treated as makes the most sense. Perhaps even the ability to specify a language per segment?

Actually doing language detection in Downsize seems like a huge can of worms to me.
Perhaps there's a good project somewhere that can identify the segments of an article that are in different languages and provide that metadata? In terms of modularization, I'd strongly advocate for making or using an external project rather than baking the functionality into Downsize.

@cgiffard
Owner

Reading the Unicode text-segmentation report/whitepaper, I think that building a proper solution is not just hard — it might actually be computationally impossible. I think I could probably build a very naive set of rules for Japanese, which would provide a somewhat ugly but workable solution for those users.

If Chinese users are happy for text to be truncated character-wise, we could add an option { chinese: true } which turns split-by-character on, but then any English would be broken that way too. Alternatively, we could increment the word counter on every hanzi character using a simple range check.

Either way, this is probably going to be the hardest bug to fix... I've ever hit... in my life. If the Unicode body thinks it's impossible...

@liushuping
Author

@cgiffard Character-wise truncation is expected for Chinese. However, if the content mixes languages (for instance English + Chinese), we expect English to be truncated word-wise while Chinese is truncated character-wise.

For your information, Chinese characters are normally in the range \u4e00 to \u9fa5. You may find more information at http://www.unicode.org/charts/PDF/U4E00.pdf
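
A rough sketch of that mixed word-wise/character-wise counting, using the range given above (illustrative only, not downsize's code; the function name is hypothetical):

```js
// Count every character in the CJK Unified Ideographs range as a word of
// its own, and count everything else as whitespace-delimited words.
var HANZI = /[\u4e00-\u9fa5]/;

function countMixed(text) {
  var words = 0, inWord = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (HANZI.test(ch)) {
      words++;        // each hanzi is its own word
      inWord = false;
    } else if (/\s/.test(ch)) {
      inWord = false; // whitespace ends a non-hanzi word
    } else if (!inWord) {
      words++;        // start of an English (or other) word
      inWord = true;
    }
  }
  return words;
}

countMixed("我住在北京");     // 5
countMixed("I live in 北京"); // 5 (3 English words + 2 hanzi)
```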

@cgiffard
Owner

I think the killer is still going to be Chinese-vs-Japanese word counting. It's easy to see the following text and split it by character:

我住在北京 = 5 words

But when many of the exact same kanji/hanzi appear alongside kana characters, the expectation is totally different.

北京に住んでいます → 北京に | 住んでいます = 2 words

So there've got to be at least three different counting rules. We can have them set by a flag, but it isn't going to be straightforward even then.

@adam-zethraeus
Contributor

It's pretty clear that making a solution that works for all languages is out of scope (i.e. language-guessing heuristics (ergh, look at the size of the Java projects that do this), many rule sets, and months of work).

However, is it the case that, as a simple heuristic, one Chinese character === one word for text on the internet?

If adding just a third counting type, say 'simple-chinese', that allows mixing anglo words and Chinese characters-as-words would make downsize usable for the majority of casual online Chinese writing (i.e. blogs), maybe it's a useful (if technically wrong) heuristic we could use.

@yangl1996 @liushuping:
Is it the case that this heuristic would be an improvement for the majority of Chinese blogs?

@cgiffard
Owner

How about an option like { breakHanzi: true/false } which toggles between Japanese and Chinese Hanzi/Kanji breaking modes? Then for the Chinese mode we could:

  • Break on every character in the hanzi-kanji range
  • English breaking would be unaffected

And then in Japanese:

  • Consider strings of Kanji to be single words
  • Break on any Japanese non-word characters
  • Break on the switch from Hiragana to Kanji (but not the other way around, to allow for verb conjugation. Rough, since it chops off the respectful お/ご prefix, but it should mostly work. What this doesn't account for is long strings of Hiragana without Kanji, but we're already stuffed with those. We'd have to embed a dictionary — not happening.)
  • Break on switches to or from Katakana to any other script
  • Break on strings of numbers

That leaves Arabic, Thai, Hindi... and dozens of other non-Latin scripts unaccounted for. But with only a rough set of rules we can cover approximately an additional 1.5 billion citizens of Earth. I figure it's a good enough start. :)
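
A rough sketch of how those Japanese rules might look as a heuristic, using simple Unicode range checks (the function names are illustrative; this is a sketch, not a definitive implementation):

```js
// Classify each character by script, then count a word boundary whenever
// one of the rules listed above fires. Consecutive kanji stay together.
function scriptOf(ch) {
  if (/[\u4e00-\u9fa5]/.test(ch)) return "kanji";
  if (/[\u3040-\u309f]/.test(ch)) return "hiragana";
  if (/[\u30a0-\u30ff]/.test(ch)) return "katakana";
  if (/[0-9０-９]/.test(ch))      return "digit";
  if (/[\s。、！？「」]/.test(ch)) return "break";  // Japanese non-word characters
  return "other";
}

function countJapaneseWords(text) {
  var words = 0, prev = "break";
  for (var i = 0; i < text.length; i++) {
    var cur = scriptOf(text[i]);
    if (cur === "break") { prev = "break"; continue; }
    var boundary =
      prev === "break" ||
      (prev === "hiragana" && cur === "kanji") ||    // hiragana -> kanji
      (prev !== "katakana" && cur === "katakana") || // switch into katakana
      (prev === "katakana" && cur !== "katakana") || // switch out of katakana
      (prev !== "digit" && cur === "digit") ||       // start of a number run
      (prev === "digit" && cur !== "digit");         // end of a number run
    if (boundary) words++;
    prev = cur;
  }
  return words;
}

countJapaneseWords("北京に住んでいます"); // 2: 北京に | 住んでいます
```

Under these rules the hiragana-to-kanji transition is what separates 北京に from 住んでいます, while the kanji-to-hiragana transition is ignored so that conjugated verbs stay attached to their stems.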

@adam-zethraeus
Contributor

Doing it extensibly is going to be tricky, but I like the 1.5-billion thing. :)

@cgiffard
Owner

Once you're happy with #16, I might attack this. Sorry for my absence over here!

@Arch1tect

Wow, I never thought this would be so complex... Let me know if you need any help. I have a blog written in both Chinese and English here: Lifeislikeaboat.com, and I'm using Ghost, which uses Downsize.
