Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: a temporary fix when CJK user tries to add a long title #7045

Merged
merged 1 commit into from
Mar 13, 2019

Conversation

erickguan
Copy link
Contributor

Discourse doesn't analyze the sentence components. So it counts the whole sentence as a word for CJK.

https://meta.discoursecn.org/t/topic/3033

Discourse doesn't analyze the sentence components. So it counts the whole sentence as a word for CJK.

https://meta.discoursecn.org/t/topic/3033
@discoursebot
Copy link

You've signed the CLA, fantasticfears. Thank you! This pull request is ready for review.

@gschlager
Copy link
Member

Wouldn't it make more sense to improve the word counting in TextSentinel? Something like this looks promising: https://stackoverflow.com/questions/12488565/how-to-count-words-in-a-multi-language-text-using-ruby-javascript/12488887

@erickguan
Copy link
Contributor Author

It's also in client side right? Unicode regexp implementation in JS and Ruby are different. Plus Chinese words are connected without blanks.

@ZogStriP
Copy link
Member

I agree with @gschlager here. We should instead improve TextSentinel to support CJK. It's only used on the server-side.

Copy link
Member

@ZogStriP ZogStriP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should instead improve how TextSentinel works for CJK locales.

@erickguan
Copy link
Contributor Author

Ok. TextSentinel can be complicated then. Let's start with the design. In a word, how much semantic we want to achieve? (How smart our system should be?) The core question is that if we focus on the character level? Or on the word level? This can evolve the problem from a string problem to an NLP task.

Some related information would be:

  1. CJK doesn't have the concept of uppercase.
  2. Korean has blanks between words but not Japanese nor Chinese.
  3. The user might put a blank between English and their own languages. However, maybe most of the time, they don't.

I think the team would prefer an implementation on character level. In that case, title_max_word_length doesn't make much sense for CJK users anyway. And seems_* checks have been confusing for the admins.

@ZogStriP
Copy link
Member

This can evolve the problem from a string problem to an NLP task.

Yeah, let's not do that please. KISS

CJK doesn't have the concept of uppercase.

I'm fine bypassing the check for CJK.

And seems_* checks have been confusing for the admins

I'm fine defining new seem_* rules for CJK. What would you recommend?

@erickguan
Copy link
Contributor Author

entropy check works. seems_pronounceable? doesn't work for Unicode but it doesn't hurt.
seems_unpretentious? and seems_quiet? can't split words. So shall we disable them for Chinese and Japanese?

@ZogStriP
Copy link
Member

seems_pronounceable? doesn't work for Unicode but it doesn't hurt.

Happy to improve it to work with Unicode provided we don't need a huge regexp :)

seems_unpretentious? and seems_quiet? can't split words. So shall we disable them for Chinese and Japanese?

I guess we'll have to. Are there any ways text can be badly written in CJK and we can identify it?

@erickguan
Copy link
Contributor Author

I see. I'll find some time to get this rolling.

seems_quiet? doesn't apply to CJK. seems_unpretentious? is just hard to reach 100% accuracy. CJK search in Discourse works in a similar way. We can split the sentences into words but it's not accurate enough. It works with some complains. So any kinds of improvement towords CJK require a redesign of modules.

@SamSaffron
Copy link
Member

I think to put out the immediate fire we are ok to merge this, but I would bypass all word length tests in CJK titles as they make little sense.

@SamSaffron SamSaffron merged commit bd2edbb into discourse:master Mar 13, 2019
@erickguan
Copy link
Contributor Author

I haven't got the time for implementation but I was thinking a different mechanism. If we can make some improvement on the control flow, such methods could return probability (or confidence) on their decision. Then we can ask admin to check if posts met the requirement when the confidence is low. Otherwise, we could just stop the posting as now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants