downsize does not seem to handle Asian languages #15
Japanese is even harder, since it has a mixture of single- and multi-character words, and words with both ideographic and phonetic components. I think it would be possible to implement a solution to this problem for Hanzi and Hangul that increments the word counter on every character in those ranges, but as far as I'm able to ascertain, that would actually make it harder to provide accurate word counts in Japanese. This isn't a straightforward solution by any stretch, and I might need to implement a technical standard for doing the word breaking for CJK. Other languages such as Arabic and Thai are also problematic — and unlike East Asian languages, I've got absolutely no idea where to start with those. The long and short of it is that I cannot, and do not want to, include language dictionaries in order to do the word count (there are copyright issues there as well). If I can come up with a solution that gets close enough, I think that's good enough — because the reality is that word-breaking many non-Latin languages is insanely hard. I would welcome any help on this from people who know their i18n! #discuss :)
I figure we should split the word counting out into a special function, separate from the counting block, and put the i18n logic in there. It's possible that the CJK counting might require lookahead which isn't available via the streaming parser... which means the entire architecture will need a review.
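For what it's worth, a minimal sketch of what that split-out counter could look like, assuming a standalone `countWords` helper (hypothetical, not downsize's actual internals) that bumps the counter on every Han or Hangul character:

```js
// Hypothetical helper, not downsize's implementation: counts latin-style
// words by whitespace, and counts every Han (U+4E00-U+9FA5) or Hangul
// syllable (U+AC00-U+D7A3) character as a word of its own.
function countWords(text) {
  let count = 0;
  let inLatinWord = false;

  for (const char of text) {
    if (/[\u4e00-\u9fa5\uac00-\ud7a3]/.test(char)) {
      count++;               // each CJK/Hangul character is its own "word"
      inLatinWord = false;
    } else if (/\s/.test(char)) {
      inLatinWord = false;   // whitespace ends a latin-style word
    } else if (!inLatinWord) {
      count++;               // first character of a latin-style word
      inLatinWord = true;
    }
  }

  return count;
}

console.log(countWords("敏捷的棕毛狐狸"));          // 7
console.log(countWords("Chinese 北京 blog post")); // 5
```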
It's actually really annoying that Han Unification happened — otherwise we'd be able to tell very easily whether text was Traditional Chinese, Simplified Chinese, Korean, or Japanese, just by looking at character ranges. As it is, I think we might have to add a significant lookahead to the parser to try and guess the language before truncating. :-/ We could add a flag that lets the user specify a language manually, shortcutting the guess, eliminating false guesses, and improving performance by removing the need for lookahead. In the event that the lookahead sampler guessed wrongly — it would most likely guess Chinese where Japanese was the actual language (rather than the other way around) — the outcome would be that the Japanese snippets came out very short. I think that's manageable! I would set the initial lookahead buffer size by determining the longest kanji-only word in Japanese, and adding a bit of padding for HTML, etc. Hangul is easy to detect, so that'd be an immediate shortcut. If a Japanese user writes an all-kanji word that's longer than our buffer — well, that's a crazy edge case we probably shouldn't stress about. I am concerned about mixed-language posts. Mixing Chinese and English is relatively straightforward, but Japanese and English could be a bit of a headache... I need to research this more.
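As a rough illustration of that shortcut (an assumption-laden sketch, not anything in downsize today), the ranges that escaped Han Unification are enough to separate Korean and most Japanese text from Chinese:

```js
// Illustrative only: guess the script of a lookahead sample by checking the
// ranges that Han Unification left distinct.
function guessScript(sample) {
  if (/[\uac00-\ud7a3]/.test(sample)) return "korean";   // Hangul syllables
  if (/[\u3040-\u30ff]/.test(sample)) return "japanese"; // hiragana / katakana
  if (/[\u4e00-\u9fa5]/.test(sample)) return "chinese";  // unified Han; could
                                                         // still be an all-kanji
                                                         // Japanese run
  return "latin";
}
```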
@yangl1996 @liushuping What's your expectation for a multi-character word like '北京'? Do you consider that one word or two?
It is two words. Actually every single Chinese character is a word. |
If there is anything I could help with as a native Chinese speaker, I am more than glad to help. :P
Since there'd still be clear issues when someone quotes text in a language different from their base language, I think taking user input to define what language the text should be treated as makes the most sense. Perhaps even the ability to specify a language per segment? Actually doing language detection in Downsize seems like a huge can of worms to me.
Reading the Unicode text-segmentation report/whitepaper, I think building a proper solution is not just hard — it might actually be computationally impossible. I could probably build a very naive set of rules for Japanese, which would provide a somewhat ugly but workable solution for those users. If Chinese users are happy for text to be truncated character-wise, we could create an option for that. Either way, this is probably going to be the hardest bug to fix I've ever hit... in my life. If the Unicode body thinks it's impossible...
@cgiffard Character-wise truncation for Chinese is expected. However, if content is mixed-language (for instance English + Chinese), we expect English to be truncated word-wise while Chinese is truncated character-wise. For your information, Chinese characters are normally in the range \u4e00 to \u9fa5. You may find more information at http://www.unicode.org/charts/PDF/U4E00.pdf
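A rough sketch of that mixed-language expectation (a hypothetical plain-text helper, ignoring the HTML handling downsize would also need, and using the \u4e00-\u9fa5 range mentioned above):

```js
// Hypothetical: truncate plain text to `limit` words, where every Han
// character (U+4E00-U+9FA5) is one word and each latin-style word is one word.
function truncateWords(text, limit) {
  // Tokens are: a single Han character, a run of non-space non-Han
  // characters (a latin-style word), or a run of whitespace.
  const tokens = text.match(/[\u4e00-\u9fa5]|[^\s\u4e00-\u9fa5]+|\s+/g) || [];
  let count = 0;
  let out = "";

  for (const token of tokens) {
    if (/^\s+$/.test(token)) { out += token; continue; }
    if (count === limit) break;
    out += token;
    count++;
  }

  return out.trim();
}

console.log(truncateWords("Downsize 处理 Chinese 文本", 4));
// "Downsize 处理 Chinese"
```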
I think the killer is still going to be Chinese-vs-Japanese text counting. It's easy to look at a run of purely Chinese text and split it by character, but when you hit many of the exact same kanji/hanzi characters mixed with kana, the expectation is totally different.
So there've got to be at least three different counting rules. We can have them set by a flag, but it isn't going to be straightforward even then.
It's pretty clear that making a solution that works for all languages is out of scope (i.e. language-guessing heuristics (ergh, look at the size of the Java projects that do this), many rule sets, and months of work). However, as a simple heuristic, is it the case that one-Chinese-character === one-word on the internet? If adding just a third counting type, say 'simple-chinese', that allows mixing anglo words and Chinese characters counted as words would make downsize usable for the majority of casual online Chinese writing (i.e. blogs), maybe it's a useful (if totally wrong) heuristic we could use. @yangl1996 @liushuping:
How about an option along those lines, like the 'simple-chinese' counting type suggested above, where each Chinese character counts as one word? And then a separate rule in Japanese. (A rough sketch of what such an option could look like follows at the end of this comment.)
That leaves Arabic, Thai, Hindi... and dozens of other non-Latin scripts unaccounted for. But with only a rough set of rules we can cover an additional 1.5 billion citizens of Earth, approximately. I figure it's a good enough start. :)
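For illustration, a sketch of how such an option might look from the caller's side; the `wordCountMode` flag and its values are hypothetical, not an existing downsize option:

```js
// Hypothetical API sketch: the wordCountMode flag does not exist in downsize.
const downsize = require("downsize");

// Current behaviour: whitespace-delimited word counting.
downsize("<p>The quick brown fox</p>", { words: 2 });

// Proposed 'simple-chinese' mode: every Han character counts as one word,
// latin words still count as one each. Hoped-for result: "<p>敏捷</p>".
downsize("<p>敏捷的棕毛狐狸从懒狗身上跃过</p>", { words: 2, wordCountMode: "simple-chinese" });

// Japanese would need its own rule, since kanji-only runs can be
// multi-character words; a manual language hint could select it.
downsize("<p>日本語のテキスト</p>", { words: 2, wordCountMode: "japanese" });
```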
Doing it extensibly is going to be tricky, but I like the 1.5 billion thing. :)
Once you're happy with #16, I might attack this. Sorry for my absence over here!
Wow, I never thought this would be so complex... Let me know if you need any help. I have a blog written in both Chinese and English here: Lifeislikeaboat.com, and I'm using Ghost, which uses Downsize.
For the character-based Asian languages, "word" and "character" are actually the same concept, and words are not separated by spaces.
For example, the English sentence "The quick brown fox jumps over the lazy dog" in Chinese is "敏捷的棕毛狐狸从懒狗身上跃过". Downsizing that sentence to 2 words, we expect the result to be "敏捷", but the actual result is not (the whole sentence is treated as a single word).
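A minimal reproduction of that report using downsize's `words` option (the failing behaviour in the comments is taken from the description above, not from a verified run):

```js
// Reproduction sketch for the behaviour described above.
const downsize = require("downsize");

const zh = "敏捷的棕毛狐狸从懒狗身上跃过";

console.log(downsize(zh, { words: 2 }));
// Expected by Chinese users: "敏捷" (the first two Han characters).
// Reported behaviour: the whole run counts as a single word, so nothing is cut.
```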