
Palomar: Issue with Over-Normalization of Japanese Characters in Search Results #628

Closed
t2hnd opened this issue Mar 25, 2024 · 6 comments · Fixed by #640

Comments

@t2hnd

t2hnd commented Mar 25, 2024

Hi! I've noticed an issue where the precision of search results for Japanese queries in bsky.app is compromised due to the over-normalization of Japanese characters, affecting the accuracy and relevance of search results for native speakers.

An example of such a query is "パリ" (Paris in Japanese characters), which receives hits containing "ハリー・ポッター" (Harry Potter) or "バリ" (Bali). As a native speaker I don't expect the query "パリ" to match these words because they don't share the character "パ".
This result can be explained by the behaviour of the icu_folding token filter added here. This filter normalizes Japanese characters with different diacritics, like "パ", "バ", and "ハ" in the example. They represent different sounds and are used in different words, so normalizing these characters usually lowers the precision of Japanese search results.
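For illustration, the folding behaviour can be approximated with Python's unicodedata module (a rough stand-in for the icu_folding filter, not the actual Elasticsearch pipeline): NFD decomposition splits "パ" and "バ" into the base kana "ハ" plus a combining (semi-)voiced sound mark, and dropping the combining marks then collapses all three characters into one.

```python
import unicodedata

def fold_kana(text: str) -> str:
    # Rough approximation of icu_folding's effect on kana:
    # NFD-decompose, then drop combining marks (U+3099 / U+309A,
    # the voiced and semi-voiced sound marks).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_kana("パリ"))  # ハリ  (Paris)
print(fold_kana("バリ"))  # ハリ  (Bali)
print(fold_kana("ハリ"))  # ハリ
```

All three inputs fold to the same token, which is why a search for "パリ" also matches posts about "バリ" or "ハリー・ポッター".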

Steps to reproduce

Search for the word "パリ" in bsky.app or try https://bsky.social/xrpc/app.bsky.feed.searchPosts?q=%E3%83%91%E3%83%AA&limit=25 (the decoded query is パリ, meaning Paris)

As a side note, the same behaviour is observed for other characters such as "ガン" (gun) and "カン" (can).

Result

The query receives hits containing "パリ" along with hits containing "ハリ" or "バリ". These extra hits lower the precision of the search results.

Expected results

The query receives hits containing "パリ" only, which has less noise and helps users find related information.

Possible solution

Replace the icu_folding filter with the asciifolding filter, which doesn't normalize these Japanese characters. This may, however, affect search results for other languages, like Greek, that benefit from the icu_folding filter.
Another solution would be to use the Kuromoji analyzer specialized for the Japanese language, but it probably requires per-language search configuration.
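For reference, the first option might look something like this in the index settings (a hypothetical sketch, not Palomar's actual configuration; the analyzer name is made up). Note that asciifolding folds Latin diacritics to ASCII but leaves both kana diacritics and Greek diacritics untouched:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```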

I believe addressing this issue will significantly enhance the user experience for Japanese speakers. I appreciate your attention on this issue.

@bnewbold
Collaborator

Hi @t2hnd! Thanks for raising this issue, and your detailed report. Including examples and links to the specific analyzers is very helpful.

We are sorry that the current search experience is not very good for Japanese queries. We will probably need to do more research, but off the top of my head, I can imagine two approaches to this issue:

  • allow querying for more exact strings using quoting. This would involve analysing and indexing all full-text fields twice: once with stemming, ICU normalization, etc., and again as a separate field without those transforms. Strings and phrases in quotes would be matched "exactly" against the less-normalized field. I'm not sure how many cases this would help with, and it could be confusing in the case of phrase search (sometimes one wants to search a phrase, but with normalization). What do you think about this path?
  • maintain differently-configured indices for different languages: either separate document fields in the same schema, or entirely separate schemas. This would unfortunately be complex, expensive, and difficult to scale to more and more languages in the future, unless we can find clever ways to keep it simple.
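The first approach could be sketched as an Elasticsearch multi-field, where the same text is indexed once with the existing normalizing analyzer and once with a minimally-normalizing one; quoted query strings would then be run as a match_phrase against the exact sub-field. (The field and analyzer names below are hypothetical, not Palomar's actual schema.)

```json
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "normalized_text",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```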

@t2hnd
Author

t2hnd commented Apr 3, 2024

Thanks, @bnewbold, for reviewing this and for your suggestions! I agree that most users are familiar with the concept of using double quotes for an exact match, so enabling this feature should definitely be helpful.
Regarding solution 1, could you share a sample configuration you have in mind? I'm curious to see what the index settings look like (and would like to test them if possible).

@bnewbold
Collaborator

bnewbold commented Apr 3, 2024

I don't have anything for you right now, but we are hoping to do a push and improve search overall (for all languages/users), and specifically improve Japanese indexing, in the next couple weeks.

@t2hnd
Author

t2hnd commented Apr 4, 2024

Understood, I'm looking forward to further updates.

@bnewbold
Collaborator

bnewbold commented Apr 9, 2024

@t2hnd the current plan is to add an additional text field just for Japanese text content, and use the Kuromoji analyser for that field. We will just special-case Japanese for now, not have additional language-specific fields.

Do you think that will be sufficient on its own? This article also discusses using N-gram queries, which would be a larger increase in complexity:
https://www.elastic.co/blog/how-to-implement-japanese-full-text-search-in-elasticsearch
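A minimal version of the planned Japanese-specific field might use the prebuilt kuromoji analyzer from the analysis-kuromoji plugin (a sketch under assumptions; the field name text_ja is hypothetical):

```json
{
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji"
      }
    }
  }
}
```

Unlike the folding filters discussed above, kuromoji performs morphological analysis, so it segments Japanese text into words without collapsing the (semi-)voiced kana variants into one another.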

bnewbold added a commit that referenced this issue Apr 10, 2024
This is just for posts right now, not profiles (descriptions, display
name, etc).

I'm somewhat confident in the indexing approach (separate duplicate
fields, gated by text detection). And this seems to work ok for simple
cases.

I'm not very confident about all-kanji text and indexing, and mixes of
Japanese and non-English character sets. For example, Japanese and
Korean (CJK), or Japanese and Thai (non-CJK).

One positive thing is that everything is still being indexed in the
regular text fields, using the existing analysis pipeline. So we can
revert the query changes if needed, or improve some corner cases using
query-time-only techniques.

Closes: #628
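The "gated by text detection" step could be approximated like this (a hypothetical sketch, not the actual Palomar code). It only checks for kana, which hints at why all-kanji text remains the hard case: kanji code points are shared with Chinese.

```python
def contains_kana(text: str) -> bool:
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) occur only
    # in Japanese, so their presence is a cheap signal to also index
    # the Japanese-analyzed field. Kanji alone is ambiguous with Chinese.
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

print(contains_kana("パリが好き"))    # True  (contains katakana/hiragana)
print(contains_kana("巴里"))          # False (kanji only)
print(contains_kana("I like Paris"))  # False
```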
@t2hnd
Author

t2hnd commented Apr 10, 2024

@bnewbold Thanks for the update! I believe having the kuromoji analyzer will greatly improve Japanese search. N-grams usually help recall in Japanese search, but I think we can see how well the current field configuration works first.
