FIX: a temporary fix when CJK user tries to add a long title #7045

erickguan · 2019-02-21T07:54:52Z

Discourse doesn't analyze the sentence components. So it counts the whole sentence as a word for CJK.

https://meta.discoursecn.org/t/topic/3033
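
To illustrate the behavior described above, here is a minimal sketch. It is not Discourse's actual code: it assumes the word-length check splits a title on whitespace and looks at the longest piece. The helper name `longest_word_length` is hypothetical.

```ruby
# Hypothetical helper: length of the longest whitespace-separated piece.
# Returns 0 for an empty title.
def longest_word_length(title)
  title.split(/\s+/).map(&:length).max || 0
end

english = "How do I configure the word length limit for topic titles"
chinese = "如何在论坛中为主题标题配置单词长度限制以及为什么中文标题总是被系统拒绝呢"

# The English title is measured word by word.
puts longest_word_length(english)

# The Chinese title contains no whitespace, so the whole sentence
# is measured as a single "word".
puts longest_word_length(chinese) == chinese.length
```

Any limit tuned for space-separated languages is therefore likely to reject an ordinary Chinese or Japanese title.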

discoursebot · 2019-02-21T07:54:57Z

You've signed the CLA, fantasticfears. Thank you! This pull request is ready for review.

gschlager · 2019-02-22T13:38:57Z

Wouldn't it make more sense to improve the word counting in TextSentinel? Something like this looks promising: https://stackoverflow.com/questions/12488565/how-to-count-words-in-a-multi-language-text-using-ruby-javascript/12488887

erickguan · 2019-02-22T22:13:51Z

It's also on the client side, right? The Unicode regexp implementations in JS and Ruby are different. Plus, Chinese words run together without spaces.

ZogStriP · 2019-02-27T09:40:49Z

I agree with @gschlager here. We should instead improve TextSentinel to support CJK. It's only used on the server-side.

ZogStriP

We should instead improve how TextSentinel works for CJK locales.

erickguan · 2019-02-27T10:04:23Z

OK. TextSentinel can get complicated then. Let's start with the design. In a word, how much semantics do we want to achieve? (How smart should our system be?) The core question is whether we focus on the character level or on the word level. This can evolve the problem from a string problem to an NLP task.

Some related information would be:

CJK doesn't have the concept of uppercase.
Korean has spaces between words, but Japanese and Chinese do not.
The user might put a space between English and their own language. However, most of the time they probably don't.

I think the team would prefer an implementation at the character level. In that case, title_max_word_length doesn't make much sense for CJK users anyway. And seems_* checks have been confusing for the admins.

ZogStriP · 2019-02-27T10:50:02Z

> This can evolve the problem from a string problem to an NLP task.

Yeah, let's not do that please. KISS

> CJK doesn't have the concept of uppercase.

I'm fine bypassing the check for CJK.

> And seems_* checks have been confusing for the admins

I'm fine defining new seems_* rules for CJK. What would you recommend?

erickguan · 2019-02-27T11:05:05Z

entropy check works. seems_pronounceable? doesn't work for Unicode but it doesn't hurt.
seems_unpretentious? and seems_quiet? can't split words. So shall we disable them for Chinese and Japanese?
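
A locale-based bypass along these lines could be sketched as follows. This is an illustration only, not the code that was merged: the constant `LOCALES_WITHOUT_WORD_BREAKS`, its contents, and the method `word_lengths_ok?` are assumptions standing in for a word-based check such as seems_unpretentious?.

```ruby
# Assumed list of locales whose text cannot be split into words on whitespace.
LOCALES_WITHOUT_WORD_BREAKS = %w[zh_CN zh_TW ja].freeze

# Hypothetical stand-in for a word-based check.
# The check is skipped (treated as passing) for the locales listed above.
def word_lengths_ok?(text, max_word_length:, locale:)
  return true if LOCALES_WITHOUT_WORD_BREAKS.include?(locale)

  text.split(/\s+/).all? { |word| word.length <= max_word_length }
end

puts word_lengths_ok?("這是一個很長的標題", max_word_length: 3, locale: "zh_TW")
puts word_lengths_ok?("這是一個很長的標題", max_word_length: 3, locale: "en")
```

Keying the bypass on the site locale keeps the check unchanged for every other language, at the cost of not helping CJK text posted on a site that uses a different locale.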

ZogStriP · 2019-02-27T11:15:39Z

> seems_pronounceable? doesn't work for Unicode but it doesn't hurt.

Happy to improve it to work with Unicode provided we don't need a huge regexp :)

> seems_unpretentious? and seems_quiet? can't split words. So shall we disable them for Chinese and Japanese?

I guess we'll have to. Are there any ways text can be badly written in CJK and we can identify it?

erickguan · 2019-02-27T19:13:15Z

I see. I'll find some time to get this rolling.

seems_quiet? doesn't apply to CJK. For seems_unpretentious?, it is simply hard to reach 100% accuracy. CJK search in Discourse works in a similar way: we can split sentences into words, but not accurately enough. It works, with some complaints. So any kind of improvement towards CJK requires a redesign of the modules.

SamSaffron · 2019-03-13T06:22:44Z

I think to put out the immediate fire we are ok to merge this, but I would bypass all word length tests in CJK titles as they make little sense.
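
A content-based variant of that bypass might look like the sketch below. It assumes a title counts as CJK when it contains at least one Han, Hiragana, Katakana, or Hangul character. Both method names are hypothetical and this is not the merged change.

```ruby
# Hypothetical predicate: true when the title contains any Han, Hiragana,
# Katakana, or Hangul character.
def contains_cjk?(title)
  title.match?(/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/)
end

# Hypothetical word-length test that is bypassed for CJK titles.
def title_word_length_ok?(title, max_word_length)
  return true if contains_cjk?(title)

  title.split(/\s+/).all? { |word| word.length <= max_word_length }
end

puts title_word_length_ok?("日本語のタイトル", 3)
puts title_word_length_ok?("supercalifragilistic", 10)
```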

erickguan · 2019-03-13T06:29:15Z

I haven't had time for the implementation, but I was thinking of a different mechanism. If we can improve the control flow, such methods could return a probability (or confidence) for their decision. Then we can ask an admin to check whether a post meets the requirements when the confidence is low. Otherwise, we could just block the post as we do now.
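
The idea could be sketched roughly as below. Everything here is hypothetical: the `CheckResult` struct, the `decide` method, and the 0.8 threshold are made up for illustration and do not exist in Discourse.

```ruby
# Hypothetical result type: a verdict plus how sure the check is (0.0 to 1.0).
CheckResult = Struct.new(:passed, :confidence)

# Decide what to do with a post given one check result.
# The default threshold of 0.8 is an arbitrary value for illustration.
def decide(result, threshold: 0.8)
  # Low confidence: hand the post to an admin instead of deciding automatically.
  return :needs_review if result.confidence < threshold

  # High confidence: accept, or block the post as happens today.
  result.passed ? :accept : :reject
end

puts decide(CheckResult.new(false, 0.95))
puts decide(CheckResult.new(false, 0.4))
```

With this shape, a check that cannot reason about CJK text could report low confidence rather than rejecting the post outright.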

FIX: a temporary fix when CJK user tries to add a long title

b62e4ae

eviltrout approved these changes Feb 21, 2019

ZogStriP requested changes Feb 27, 2019

ZogStriP added the Changes Requested label Feb 27, 2019

SamSaffron merged commit bd2edbb into discourse:master Mar 13, 2019

erickguan mentioned this pull request Apr 4, 2019

FIX: skip some checks for CJK locale in TextSentinel #7322

Merged
