Better support for non-latin characters #277

Open
blackforestboi opened this issue Feb 1, 2018 · 3 comments


blackforestboi commented Feb 1, 2018

As described in this community post, non-Latin characters (e.g. Chinese, Japanese) are not parsed correctly and are therefore not searchable.

Potential solution: detect the language of the website and, if it is non-Latin, split the text into individual characters before indexing?
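
A minimal sketch of the character-splitting idea, in TypeScript. It skips page-level language detection and instead checks each character against a few CJK code-point ranges. The function name `splitCjkForIndexing` and the chosen ranges are assumptions for illustration, not code from this project.

```typescript
// Hypothetical helper: the name and the ranges are illustrative assumptions.
// Ranges: Hiragana + Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
// Characters outside the Basic Multilingual Plane are not covered.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

function splitCjkForIndexing(text: string): string[] {
  return text
    // Surround every CJK character with spaces so it becomes its own token.
    .replace(CJK_CHAR, " $& ")
    // Other text is still split on whitespace only.
    .split(/\s+/)
    // Drop the empty strings produced by leading or trailing spaces.
    .filter((token) => token.length > 0);
}
```

With this sketch, `splitCjkForIndexing("hello 世界")` returns `["hello", "世", "界"]`, so a query for a single character can match. Multi-character words are not kept together, so queries would need the same splitting applied.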


bluesun commented Feb 2, 2018

Hi @oliversauter, this is not enough, as words in Chinese-like languages often consist of several UTF-8 characters. We should look to the field of natural language processing for existing, proven approaches that cover the main non-Latin-based languages.
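
One existing approach that could be evaluated is the built-in `Intl.Segmenter` API, which does locale-aware word segmentation. The sketch below is an illustration only: it assumes a runtime with `Intl.Segmenter` and full ICU data (e.g. a recent Node.js or browser), and the function name `segmentWords` is hypothetical. How well the resulting words suit search indexing would need to be checked per language.

```typescript
// Hypothetical helper: not part of this project's codebase.
// Assumes Intl.Segmenter is available at runtime.
function segmentWords(text: string, locale: string): string[] {
  // Cast because older TypeScript lib typings do not declare Intl.Segmenter.
  const Segmenter = (Intl as any).Segmenter;
  const segmenter = new Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}
```

A call such as `segmentWords("Hello, 世界!", "zh")` returns only the word-like segments. The exact boundaries for Chinese or Japanese text depend on the dictionary data shipped with the runtime.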

blackforestboi added this to To Do in Up Next on Feb 8, 2018
blackforestboi changed the title from "Better support for non-latin characters" to "MTNI-44 ⁃ Better support for non-latin characters" on Apr 19, 2018
blackforestboi changed the title from "MTNI-44 ⁃ Better support for non-latin characters" to "Better support for non-latin characters" on Apr 19, 2018

kehao95 commented Jan 19, 2019

In case you need it: I'd suggest adopting this tool for Chinese word segmentation.
https://github.com/yanyiwu/nodejieba


mmqmzk commented Jul 29, 2020

Is this in progress? It seems Chinese is still not searchable.
