Palomar: Issue with Over-Normalization of Japanese Characters in Search Results #628
Comments
Hi @t2hnd! Thanks for raising this issue and for your detailed report. Including examples and links to the specific analyzers is very helpful. We are sorry that the current search experience is not very good for Japanese queries. We will probably need to do more research, but off the top of my head, I can imagine two approaches to this issue:
Thanks, @bnewbold, for reviewing this and for your suggestions! I agree that most users are familiar with the concept of using double quotes for an exact match, so enabling this feature should definitely be helpful.
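For reference, a minimal sketch of what quote handling could look like on the query side, assuming an Elasticsearch-style backend. This is illustrative Python (Palomar itself is written in Go), and the index and field names (`posts`, `text`) are hypothetical, not Palomar's actual schema:

```python
# Sketch: route double-quoted queries to a phrase query instead of a
# plain full-text match. Index/field names are hypothetical.
import requests

def build_query(user_input: str) -> dict:
    if len(user_input) >= 2 and user_input.startswith('"') and user_input.endswith('"'):
        # Quoted input: treat as an exact phrase.
        return {"match_phrase": {"text": user_input.strip('"')}}
    return {"match": {"text": user_input}}

resp = requests.post(
    "http://localhost:9200/posts/_search",
    json={"query": build_query('"パリ"'), "size": 25},
)
print(resp.json()["hits"]["total"])
```

Note that a phrase query still runs the input through the field's analyzer, so quoting alone would not undo the icu_folding behaviour described in the issue below; it helps with word order and adjacency, not character folding.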
I don't have anything for you right now, but we are hoping to do a push and improve search overall (for all languages/users), and specifically improve Japanese indexing, in the next couple of weeks.
Understood, I'm looking forward to further updates.
@t2hnd the current plan is to add an additional text field just for Japanese text content, and use the Kuromoji analyzer for that field. We will special-case Japanese for now rather than add fields for every language. Do you think that will be sufficient on its own? This article also discusses using N-gram queries, which would be a larger increase in complexity:
This is just for posts right now, not profiles (descriptions, display names, etc.). I'm somewhat confident in the indexing approach (separate duplicate fields, gated by text detection), and it seems to work OK for simple cases. I'm less confident about all-kanji text, and about mixes of Japanese with other non-English character sets, for example Japanese and Korean (CJK), or Japanese and Thai (non-CJK). One positive thing is that everything is still being indexed in the regular text fields using the existing analysis pipeline, so we can revert the query changes if needed, or improve some corner cases using query-time-only techniques. Closes: #628
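As a rough illustration of the "separate duplicate field, gated by text detection" shape described above — not Palomar's actual schema; the index and field names are made up, and the `kuromoji` analyzer requires the analysis-kuromoji plugin to be installed:

```python
# Sketch: a mapping with a dedicated Japanese duplicate field analyzed
# by kuromoji, alongside the existing folded text field.
import requests

ES = "http://localhost:9200"

mapping = {
    "mappings": {
        "properties": {
            "text":    {"type": "text"},                          # existing analysis pipeline
            "text_ja": {"type": "text", "analyzer": "kuromoji"},  # Japanese-only duplicate
        }
    }
}
requests.put(f"{ES}/posts_demo", json=mapping)

def index_post(doc_id: str, text: str, looks_japanese: bool) -> None:
    """Write the post; populate text_ja only when detection says Japanese."""
    doc = {"text": text}
    if looks_japanese:  # gate the duplicate field on text detection
        doc["text_ja"] = text
    requests.put(f"{ES}/posts_demo/_doc/{doc_id}", json=doc)
```

At query time, a Japanese-detected query could then search both fields (e.g. a `multi_match` over `text` and `text_ja`), which keeps the existing behaviour as a fallback.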
@bnewbold Thanks for the update! I believe adding the Kuromoji analyzer will greatly improve Japanese search. N-grams usually help recall in Japanese search, but I think we can first see how well the current field configuration works.
Hi! I've noticed an issue where the precision of search results for Japanese queries on bsky.app is degraded by over-normalization of Japanese characters, which hurts the accuracy and relevance of results for native speakers.
An example of such a query is "パリ" (Paris in Japanese characters), which returns hits containing "ハリー・ポッター" (Harry Potter) or "バリ" (Bali). As a native speaker, I don't expect the query "パリ" to match these words, because they don't share the character "パ".
This result can be explained by the behaviour of the icu_folding token filter added here. This filter normalizes away the diacritics that distinguish Japanese characters such as "パ", "バ", and "ハ" in the example. These characters represent different sounds and are used in different words, so normalizing them usually lowers the precision of Japanese search results.
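The conflation is easy to see with Elasticsearch's _analyze API. A small sketch, assuming a local node with the analysis-icu plugin installed:

```python
# Demonstration: run パリ, バリ, ハリ through a standard tokenizer plus
# icu_folding, and compare the resulting tokens.
import requests

for word in ["パリ", "バリ", "ハリ"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"tokenizer": "standard", "filter": ["icu_folding"], "text": word},
    )
    tokens = [t["token"] for t in resp.json()["tokens"]]
    print(word, "->", tokens)
# All three words typically fold to the same token once the voicing
# marks (dakuten/handakuten) are stripped, which is why they match
# each other at search time.
```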
Steps to reproduce
Search for the word "パリ" on bsky.app, or try
https://bsky.social/xrpc/app.bsky.feed.searchPosts?q=%E3%83%91%E3%83%AA&limit=25
(The decoded query is パリ, meaning Paris.)
As a side note, the same behaviour is observed for other characters, such as "ガン" (gun) and "カン" (can).
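The same reproduction can be scripted against the public endpoint linked above (the response shape is abbreviated here, and the endpoint may require authentication depending on deployment):

```python
# Reproduce via searchPosts and list which hits actually contain the
# query string verbatim.
import requests

resp = requests.get(
    "https://bsky.social/xrpc/app.bsky.feed.searchPosts",
    params={"q": "パリ", "limit": 25},
)
for post in resp.json().get("posts", []):
    text = post.get("record", {}).get("text", "")
    print("パリ" in text, text[:60])
```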
Result
The query returns hits containing "パリ" along with hits containing only "ハリ" or "バリ". These extra hits lower the precision of the search results.
Expected results
The query returns only hits containing "パリ", which reduces noise and makes it easier to find relevant information.
Possible solution
Replace the icu_folding filter with the asciifolding filter, which doesn't normalize these Japanese characters. This may, however, affect search results for other languages, such as Greek, that benefit from the icu_folding filter.
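A quick way to check the difference between the two filters side by side (asciifolding is built in; icu_folding needs the analysis-icu plugin):

```python
# Compare asciifolding vs icu_folding on the same input: asciifolding
# leaves kana voicing marks alone, while icu_folding strips them.
import requests

for filt in ["asciifolding", "icu_folding"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"tokenizer": "standard", "filter": [filt], "text": "パリ"},
    )
    print(filt, "->", [t["token"] for t in resp.json()["tokens"]])
```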
Another solution would be to use the Kuromoji analyzer, which is specialized for the Japanese language, but that probably requires per-language search configuration.
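With the Kuromoji analyzer the distinction survives analysis. Another small _analyze sketch, assuming the analysis-kuromoji plugin is installed:

```python
# kuromoji tokenizes Japanese morphologically and does not strip
# voicing marks, so パリ and バリ remain distinct terms.
import requests

for word in ["パリ", "バリ"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"analyzer": "kuromoji", "text": word},
    )
    print(word, "->", [t["token"] for t in resp.json()["tokens"]])
```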
I believe addressing this issue will significantly improve the user experience for Japanese speakers. I appreciate your attention to this issue.