Search function multi-language support #80

Closed
netcmcc opened this issue Sep 16, 2019 · 9 comments
Labels
enhancement New feature or request

Comments

@netcmcc

netcmcc commented Sep 16, 2019

Other aspects of this theme work well with multiple languages, but the search function currently supports only English. I would like search to be extended to other languages, such as Chinese.

@netcmcc
Author

netcmcc commented Sep 16, 2019

Screen Shot 2019-09-16 at 12 51 32 PM

The title and table of contents support Chinese, but searching for Chinese text does not work properly.

Screen Shot 2019-09-16 at 12 51 14 PM

@alex-shpak
Owner

Hi!
It looks like lunr.js, which is used for search, does not support Chinese: olivernn/lunr.js#173

I will need to check what can be done.
This might be an alternative: https://github.com/nextapps-de/flexsearch

@alex-shpak alex-shpak added the enhancement New feature or request label Oct 3, 2019
alex-shpak added a commit that referenced this issue Oct 20, 2019
@alex-shpak
Owner

I tried flexsearch, and it looks like it works with this configuration:
https://github.com/nextapps-de/flexsearch#cjk-word-break-chinese-japanese-korean

Unfortunately, it's either Chinese or English (or another language), not both at the same time.
Or I haven't found the correct way to have both.

alex-shpak added a commit that referenced this issue Oct 23, 2019
@oshliaer
Contributor

Russian search is not supported either.

@alex-shpak Any recommendations? How can I help?

@alex-shpak
Owner

Hi!
I pushed changes to master. They introduce https://github.com/nextapps-de/flexsearch as a replacement for lunr.js. FlexSearch has more configuration options for multi-language support.

There is now a BookSearchConfig parameter, which is the flexsearch configuration object.
So, for example, for Chinese, according to https://github.com/nextapps-de/flexsearch#cjk-word-break-chinese-japanese-korean

BookSearchConfig = '''{
  encode: false,
  tokenize: function(str){
    return str.replace(/[\x00-\x7F]/g, "").split("");
  }
}'''
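
To see what the tokenizer in this configuration produces, the function can be run on its own. This is only a minimal sketch: `cjkTokenize` is a hypothetical name given to the same function body, it runs outside FlexSearch, and the sample string is made up.

```javascript
// Same function body as the `tokenize` option above, under a hypothetical name.
function cjkTokenize(str) {
  // Drop every ASCII character (spaces and Latin letters included),
  // then split what remains into single characters.
  return str.replace(/[\x00-\x7F]/g, "").split("");
}

console.log(cjkTokenize("搜索 search 功能"));
// [ '搜', '索', '功', '能' ]
```

The ASCII word "search" is dropped entirely, which is consistent with the earlier observation that this setup handles either Chinese or English, not both.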

For Russian, I think a stemmer needs to be set:
https://github.com/nextapps-de/flexsearch#add-language-specific-stemmer-andor-filter

Future work will include integration with multi-lang mode, with a different indexing configuration per language.
Unfortunately, there is no easy way to support multiple languages in the same index.

@alex-shpak
Owner

alex-shpak commented Nov 11, 2019

I think this config should work for Russian; the filter is optional.

BookSearchConfig = '''{
  split: /[^a-zа-яё0-9]/gi,
  filter: [ 
    "в", "на", "и", "не", "о", "от", "с"
  ]
}'''
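
The effect of these two options can be approximated in plain JavaScript. This is a rough sketch, not FlexSearch's internal pipeline: `ruTokenize` is a hypothetical helper, the lowercasing step is an assumption added for the illustration, and the sample sentence is made up.

```javascript
// The `split` pattern and `filter` list from the config above.
const split = /[^a-zа-яё0-9]/gi;
const stopWords = new Set(["в", "на", "и", "не", "о", "от", "с"]);

// Hypothetical helper: split text on anything that is not a Latin letter,
// a Cyrillic letter, or a digit, then drop empty strings and filtered words.
function ruTokenize(text) {
  return text
    .toLowerCase() // assumption: normalize case before splitting
    .split(split)
    .filter((word) => word !== "" && !stopWords.has(word));
}

console.log(ruTokenize("Поиск в документации и не только"));
// [ 'поиск', 'документации', 'только' ]
```

Leaving out the filter would keep short words such as "в" and "и" in the token list, which is why it is described as optional.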

@alex-shpak
Owner

Changes have been merged to master.

@kevinclcn
I worked around this issue with the config below:

    {
      encode: false,
      tokenize: function(str) {
        return str.split(/\W+/).concat(str.replace(/[\x00-\x7F]/g, '').split('')).filter(e => !!e)
      }
    }
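
Run outside FlexSearch, the tokenizer from this workaround behaves as sketched below. `mixedTokenize` is a hypothetical name for the same function body, and the sample strings are made up.

```javascript
// Same function body as the `tokenize` option above, under a hypothetical name.
function mixedTokenize(str) {
  return str
    .split(/\W+/) // ASCII words: runs of letters, digits and underscores
    .concat(str.replace(/[\x00-\x7F]/g, "").split("")) // each non-ASCII character
    .filter((e) => !!e); // drop empty strings left by leading separators
}

console.log(mixedTokenize("Hugo 搜索 test"));
// [ 'Hugo', 'test', '搜', '索' ]
console.log(mixedTokenize("搜索Hugo"));
// [ 'Hugo', '搜', '索' ]
```

The function does not change case, so "Hugo" and "hugo" come out as different tokens.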

@peter-liu

Never mind, it's wrong (it only works when you search in upper case).
