
[new language support] Apply a PR for chinese language support #88

Closed
SoTosorrow opened this issue Aug 8, 2022 · 14 comments
@SoTosorrow

Hello! First of all, thanks for Lyra.
I'm a new Chinese user and I added it to my code within 5 minutes of discovering Lyra; it is easy to understand and use.
Unfortunately, it does not support Chinese. I found that tokenizer/index.ts is loosely coupled, so language support can be added conveniently. May I open a PR for Chinese language support?
I'm asking for your permission first (the guidelines say that I need to apply before committing a PR).
Thanks.

Is your feature request related to a problem? Please describe.
No Chinese language support.

Describe the solution you'd like
Add a regular expression to tokenizer/index.ts, for example:

chinese: /[^a-z0-9_\u4e00-\u9fa5-]+/gim

It is easy to test in Node.js:

"chinese support test 中文 支持 测试".match(/[a-z0-9_\u4e00-\u9fa5-]+/gim)
>"[ 'chinese', 'support', 'test', '中文', '支持', '测试' ]"

(I'll do more testing on the regex.)
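
Purely as an illustration of the idea (the map name and the English entry below are placeholders, not the actual contents of tokenizer/index.ts), the new splitter could sit next to the existing per-language entries like this:

// Illustrative sketch only: per-language splitter regular expressions.
const splitRegexByLanguage: Record<string, RegExp> = {
  english: /[^a-z0-9_-]+/gim,
  chinese: /[^a-z0-9_\u4e00-\u9fa5-]+/gim, // proposed entry
};

// tokens are whatever remains after splitting on the language's regex
const tokenize = (text: string, language: string): string[] =>
  text.toLowerCase().split(splitRegexByLanguage[language]).filter(Boolean);

tokenize("chinese support test 中文 支持 测试", "chinese");
// [ 'chinese', 'support', 'test', '中文', '支持', '测试' ]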

@micheleriva
Member

Hi @SoTosorrow ,
absolutely, any PR is very much appreciated 🙂

@SoTosorrow
Author

Thanks for your reply.
I just realized that Lyra's index only matches from the beginning of a split word.
For example, searching "lov" matches "i love her", but searching "ove" does not (with the exact option).
This means that for languages written without word separators, such as Chinese and Japanese, the same rules cannot simply be applied.
A Chinese sentence looks like "ABC,EF" ("iloveher,ofcourse"), so I cannot find the sentence by searching for "B" ("love") or "C" ("her"); I can only find it by searching for "A.." ("ilove").
It seems that I can't deliver my PR as easily as I thought.
hhhhhhh
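
To make the prefix behaviour concrete, here is a minimal sketch using the library's create/insert/search functions (imported from the current @orama/orama package for illustration; the original Lyra package name differs, and the field name is made up):

import { create, insert, search } from "@orama/orama";

const db = await create({ schema: { text: "string" } as const });
await insert(db, { text: "i love her" });

// "lov" is a prefix of the indexed token "love", so it matches
console.log((await search(db, { term: "lov" })).hits.length); // 1

// "ove" is not a prefix of any token, so nothing matches
console.log((await search(db, { term: "ove" })).hits.length); // 0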

@SoTosorrow
Author

It's not easy to support Chinese (or any other language written in consecutive characters with no word separators) by appending a simple regular expression to pure Lyra. To retrieve Chinese text, words need to be segmented before both "insert" and "search" (see the sketch below).
Should I add the regular expression and document that Chinese sentences need to be segmented first, or give up on this approach?
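
For illustration only, a hypothetical pre-processing step could look like the sketch below (the segmentChinese helper and its naive per-character split are placeholders, not part of Lyra):

// Hypothetical: split Chinese text into tokens before indexing and searching.
// A real implementation would use a proper word segmenter; keeping each
// Han character as its own token is only a crude placeholder here.
const segmentChinese = (text: string): string[] =>
  Array.from(text).filter((ch) => /[\u4e00-\u9fa5]/.test(ch));

// the same pre-processing has to be applied on both sides:
const docText = segmentChinese("我爱她，当然").join(" ");  // "我 爱 她 当 然"
const queryText = segmentChinese("爱她").join(" ");        // "爱 她"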

@micheleriva
Member

@SoTosorrow we could make rules for languages such as Chinese where we operate on tokens differently. But we need examples and documentation to understand how to operate properly, here we might need your help 🙂

@SoTosorrow
Author

I'd love to help with examples and documentation. I will provide the relevant information after sorting it out.
Should I open a discussion for the examples and documentation, or continue in this issue?

@micheleriva
Member

Let's open a discussion for that; it will act as future documentation.

@SoTosorrow
Author

Copy that! Thanks.

@chuanqisun

I wonder if this feature could benefit from Intl.Segmenter (requires a polyfill for Firefox). The segmenter takes a locale and automatically determines where the word boundaries should be, potentially reducing library size and improving tokenization performance. It works on the server side too.
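
For example, a quick sketch of what Intl.Segmenter produces for a Chinese string (exact word boundaries depend on the runtime's ICU data):

// word-granularity segmentation with the built-in Intl.Segmenter
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = Array.from(segmenter.segment("中文支持测试"))
  .filter((s) => s.isWordLike) // drop whitespace/punctuation segments
  .map((s) => s.segment);
console.log(words); // e.g. [ '中文', '支持', '测试' ]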

@SoTosorrow
Author

It seems to work; I will do more testing. Thanks for your guidance!

@OultimoCoder

@SoTosorrow Did you manage to get Chinese working? If so, could you provide an example?

@group900-3

group900-3 commented Jan 9, 2024

Based on the help provided in the comments above, I implemented a Chinese tokenizer using Intl.Segmenter, which may help you.
Intl.Segmenter works great in Chrome and Cloudflare Workers.

import { create } from "@orama/orama";
import type { Orama } from "@orama/orama";

// `schema` is assumed to be defined elsewhere, e.g.:
// const schema = { title: "string" } as const;

// override default english tokenizer
const chineseTokenizer = {
  language: "english", // keep the built-in language setting; only tokenize() changes
  normalizationCache: new Map(),
  tokenize: (raw: string) => {
    // let the zh locale rules decide where the word boundaries are
    const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
    return Array.from(segmenter.segment(raw)).map((s) => s.segment);
  },
};

const db: Orama<typeof schema> = await create({
  schema,
  components: {
    tokenizer: chineseTokenizer,
  },
});
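
A hypothetical usage sketch on top of the db created above (the title field and the sample strings are made up for illustration):

import { insert, search } from "@orama/orama";

// index a Chinese document; the custom tokenizer splits it into words
await insert(db, { title: "中文支持测试" });

// the same tokenizer is applied to the search term
const results = await search(db, { term: "支持" });
console.log(results.hits.length);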

Update: although no errors were reported with this approach, most of the time I couldn't get the search results I wanted, and I think further adaptation is needed somewhere. That is beyond what I can do, so for now I will choose another engine for my project.

@SoTosorrow
Author

SoTosorrow commented Jan 10, 2024

I have also tried Intl.Segmenter based on the comments above, but the results for Chinese are not always good, and there may be some dependency issues.
I have also tried other word-segmentation libraries such as "jieba"; some of them give good results, but they introduce additional third-party packages and (at the time) required modifying the core word-segmentation function to adapt it to Chinese.
Considering the possible impact, I stopped there.
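
For completeness, a hedged sketch of how a jieba-based tokenizer might plug into the same components.tokenizer hook shown above (this assumes the third-party nodejieba package and its cut() function, and is not something that was actually shipped):

// Hypothetical: a dictionary-based Chinese tokenizer built on nodejieba.
// nodejieba.cut(text) is assumed to return an array of word strings.
import nodejieba from "nodejieba";

const jiebaTokenizer = {
  language: "english",
  normalizationCache: new Map(),
  tokenize: (raw: string) =>
    nodejieba.cut(raw).filter((t: string) => t.trim().length > 0),
};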

@group900-3

@SoTosorrow What search engine did you choose in the end? I'm going to try Algolia.

@SoTosorrow
Author

I didn't end up using a JS search service, so I'm afraid I can't give you more suggestions.
