[new language support] Apply a PR for Chinese language support #88
Comments
Hi @SoTosorrow,
thanks for your reply.
It's not easy to support Chinese (or any other language that writes consecutive words with no separator) by appending a simple regular expression in pure Lyra. If I want to retrieve Chinese text, I need to break the text down into words before "insert" and "search".
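To illustrate the problem, here is a small sketch (the whitespace-based split and the sample strings are illustrative assumptions, not Lyra's actual tokenizer): a space-delimited split returns a whole Chinese sentence as a single token, so individual words can never be matched at search time.

// illustrative only – naive whitespace tokenization vs. Chinese text
const naiveTokenize = (raw: string) => raw.split(/\s+/).filter(Boolean);

naiveTokenize("hello search world"); // ["hello", "search", "world"]
naiveTokenize("今天天气很好");        // ["今天天气很好"] – one giant token, so "天气" can never match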
@SoTosorrow we could make rules for languages such as Chinese where we operate on tokens differently. But we need examples and documentation to understand how to operate properly; here we might need your help 🙂
I'd love to help with examples and documentation. I will share the relevant information once I've sorted it out.
Let's open a discussion for that; it will act as future documentation.
Copy that! Thanks.
I wonder if this feature could benefit from Intl.Segmenter (requires a polyfill for Firefox). Segmenter can take the locale and automatically determine where the word boundaries should be, potentially reducing library size and improving tokenization performance. It works on the server side too.
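For reference, a minimal sketch of what Intl.Segmenter does with a Chinese string (the sample text and the exact output are illustrative; segmentation can vary with the ICU data shipped by the runtime):

// locale-aware word segmentation, available in modern browsers and Node 16+
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = Array.from(segmenter.segment("今天天气很好"), (s) => s.segment);
// e.g. ["今天", "天气", "很", "好"]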
It seems to work. I will do more tests, thanks for your guidance!
@SoTosorrow Did you manage to get Chinese working? If so, could you provide an example?
Based on the help provided by the comments above, I implemented a Chinese tokenizer by overriding the default English tokenizer:
// override the default English tokenizer
const chineseTokenizer = {
  language: "english",
  normalizationCache: new Map(),
  tokenize: (raw: string) => {
    // Intl.Segmenter splits Chinese text at locale-aware word boundaries
    const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
    const _iterator = segmenter.segment(raw)[Symbol.iterator]();
    return Array.from(_iterator).map((i) => i.segment);
  },
};

const db: Orama<typeof schema> = await create({
  schema,
  components: {
    tokenizer: chineseTokenizer,
  },
});
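For completeness, a minimal usage sketch building on the snippet above (the schema field, the sample document, and the query term are assumptions for illustration; insert and search are the regular Orama functions):

import { insert, search } from "@orama/orama";

// assuming a schema like { title: "string" } was used when creating db
await insert(db, { title: "今天天气很好" });
const results = await search(db, { term: "天气" });
// the document is found because "天气" was indexed as its own token by the segmenter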
I have also tried Intl.Segmenter-based segmentation following the comments above, but the results on Chinese are not always good, and there may be some dependency issues.
@SoTosorrow What search engine did you choose in the end? I'm going to try Algolia.
I didn't use a JS search service in the end, so I'm afraid I can't give you more suggestions.
Hello! First of all, thanks for Lyra.
I'm a new Chinese user, and I added Lyra to my code within five minutes of discovering it; it is easy to understand and use.
Unfortunately, it does not support Chinese. I found that tokenizer/index.ts is loosely coupled, so language support can be added conveniently. May I have a chance to submit a PR for Chinese language support?
I'm asking for your permission first (the guidelines say that I need to apply before committing a PR).
Thanks.
Is your feature request related to a problem? Please describe.
No Chinese language support.
Describe the solution you'd like
Add a regular expression in tokenizer/index.ts, along the lines of the sketch below.
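The exact regular expression from the original post isn't shown here; a plausible sketch, assuming the CJK Unified Ideographs range (U+4E00–U+9FFF) is what gets matched:

// hypothetical example – a pattern matching runs of CJK Unified Ideographs
const chineseRegex = /[\u4e00-\u9fff]+/g;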
It is easy to test in Node.js, for example as shown below.
(I'll do more tests on the regular expression.)
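A quick Node.js check of that kind of pattern might look like this (a sketch with made-up sample strings, not the author's original test):

// save as e.g. test-regex.mjs and run with: node test-regex.mjs
const chineseRegex = /[\u4e00-\u9fff]+/g;

console.log("你好 world 世界".match(chineseRegex)); // [ '你好', '世界' ]
console.log(/[\u4e00-\u9fff]/.test("hello"));       // false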