Better support for non-latin characters #277

blackforestboi · 2018-02-01T00:27:28Z

As described in this community post, non-Latin characters (e.g. Chinese, Japanese) are not parsed correctly and are therefore not searchable.

Potential solution: detect the language of the website and, if it is non-Latin, split the text into individual characters before indexing?
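
A minimal sketch of the splitting idea, for illustration only. It assumes "non-Latin" can be narrowed to a few common CJK Unicode blocks, and the name `split_for_index` is hypothetical rather than part of the project's indexing code. Each CJK character becomes its own token, while Latin words and digits stay whole:

```python
import re
from typing import List

# Assumption: only these Unicode blocks are treated as "non-Latin" here.
CJK_CHARS = (
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul syllables
)

# Either a single CJK character, or a run of word characters that are not CJK.
TOKEN_RE = re.compile(rf"[{CJK_CHARS}]|[^\W{CJK_CHARS}]+")


def split_for_index(text: str) -> List[str]:
    """Hypothetical helper: return lowercased tokens ready for indexing."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]
```

Splitting per character makes every character findable, but a query for a multi-character word then also matches pages where those characters appear far apart, so the search side would need to check adjacency or ranking.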

bluesun · 2018-02-02T13:17:57Z

Hi @oliversauter, this is not enough, as Chinese-like languages often have several UTF-8 characters. We should look to the field of natural language processing for existing, proven approaches that cover the main languages that are not Latin-based.
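
One commonly used middle ground between per-character splitting and full word segmentation is to index overlapping character bigrams; Apache Lucene's CJK analysis does something along these lines. The sketch below is an illustration under assumptions, not a recommendation from this thread: the function name `cjk_bigrams` and the chosen Unicode ranges are hypothetical.

```python
import re
from typing import List

# Assumption: runs of Hiragana, Katakana, and CJK ideographs are bigrammed.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")


def cjk_bigrams(text: str) -> List[str]:
    """Hypothetical helper: overlapping 2-character tokens for each CJK run."""
    tokens: List[str] = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            # A lone character cannot form a bigram, so keep it as-is.
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

Applying the same function to the query means a two-character word matches only where both characters are adjacent, without needing a dictionary. A dictionary-based segmenter would give cleaner tokens at the cost of a language-specific dependency.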

kehao95 · 2019-01-19T00:49:08Z

In case you need it, I'd suggest adopting this tool for Chinese word segmentation:
https://github.com/yanyiwu/nodejieba

mmqmzk · 2020-07-29T01:50:41Z

Is this in progress? It seems Chinese is still not searchable.

blackforestboi added bug Newcomer Task GSoC 2018 Projects labels Feb 1, 2018

blackforestboi added this to To Do in Up Next Feb 8, 2018

blackforestboi removed the GSoC 2018 Projects label Feb 13, 2018

blackforestboi changed the title ~~Better support for non-latin characters~~ MTNI-44 ⁃ Better support for non-latin characters Apr 19, 2018

blackforestboi changed the title ~~MTNI-44 ⁃ Better support for non-latin characters~~ Better support for non-latin characters Apr 19, 2018

blackforestboi removed Newcomer Task labels Dec 13, 2018
