
Palomar: Issue with Over-Normalization of Japanese Characters in Search Results #628

Closed
t2hnd opened this issue Mar 25, 2024 · 6 comments · Fixed by #640

Comments

@t2hnd

t2hnd commented Mar 25, 2024

Hi! I've noticed an issue where the precision of search results for Japanese queries in bsky.app is compromised due to the over-normalization of Japanese characters, affecting the accuracy and relevance of search results for native speakers.

An example of such a query is "パリ" (Paris in Japanese characters), which receives hits containing "ハリー・ポッター" (Harry Potter) or "バリ" (Bali). As a native speaker I don't expect the query "パリ" to match these words because they don't share the character "パ".
This result can be explained by the behaviour of the icu_folding token filter added here. This filter normalizes Japanese characters with different diacritics, like "パ", "バ", and "ハ" in the example. They represent different sounds and are used in different words, so normalizing these characters usually lowers the precision of Japanese search results.
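For illustration, the folding behaviour can be approximated with Python's unicodedata module (a rough stand-in for the icu_folding filter, not the actual Elasticsearch pipeline): NFD decomposition splits "パ" and "バ" into the base kana "ハ" plus a combining (semi-)voiced sound mark, and dropping the combining marks then collapses all three characters into one.

```python
import unicodedata

def fold_kana(text: str) -> str:
    # Rough approximation of icu_folding's effect on kana:
    # NFD-decompose, then drop combining marks (U+3099 / U+309A,
    # the voiced and semi-voiced sound marks).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_kana("パリ"))  # ハリ  (Paris)
print(fold_kana("バリ"))  # ハリ  (Bali)
print(fold_kana("ハリ"))  # ハリ
```

All three inputs fold to the same token, which is why a search for "パリ" also matches posts about "バリ" or "ハリー・ポッター".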

Steps to reproduce

Search for the word "パリ" in bsky.app or try https://bsky.social/xrpc/app.bsky.feed.searchPosts?q=%E3%83%91%E3%83%AA&limit=25 (the decoded query is パリ, meaning Paris)

As a side note, the same behaviour is observed for other characters such as "ガン" (gun) and "カン" (can).

Result

The query receives hits containing "パリ" along with hits containing "ハリ" or "バリ". These extra hits lower the precision of the search results.

Expected results

The query receives hits containing "パリ" only, which has less noise and helps users find related information.

Possible solution

Replace the icu_folding filter with the asciifolding filter, which doesn't normalize these Japanese characters. This may, however, affect search results for other languages, like Greek, that benefit from the icu_folding filter.
Another solution would be to use the Kuromoji analyzer specialized for the Japanese language, but it probably requires per-language search configuration.
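For reference, the first option might look something like this in the index settings (a hypothetical sketch, not Palomar's actual configuration; the analyzer name is made up). Note that asciifolding folds Latin diacritics to ASCII but leaves both kana diacritics and Greek diacritics untouched:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```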

I believe addressing this issue will significantly enhance the user experience for Japanese speakers. I appreciate your attention on this issue.

@bnewbold
Collaborator

Hi @t2hnd! Thanks for raising this issue, and your detailed report. Including examples and links to the specific analyzers is very helpful.

We are sorry that the current search experience is not very good for Japanese queries. We will probably need to do more research, but off the top of my head, I can imagine two approaches to this issue:

  • allow querying for more exact strings using quoting. This would involve analysing and indexing all full-text fields twice: once with stemming, ICU normalization, etc., and again as a separate field without those transforms. Strings and phrases in quotes would be matched "exactly" against the less-normalized field. I'm not sure how many cases this would help with, and it could be confusing in the case of phrase search (sometimes one wants to search a phrase, but with normalization). What do you think about this path?
  • maintain differently-configured indices for different languages: either separate document fields in the same schema, or entirely separate schemas. This would unfortunately be complex, expensive, and difficult to scale to more and more languages in the future, unless we can find clever ways to keep it simple.
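The first approach could be sketched as an Elasticsearch multi-field, where the same text is indexed once with the existing normalizing analyzer and once with a minimally-normalizing one; quoted query strings would then be run as a match_phrase against the exact sub-field. (The field and analyzer names below are hypothetical, not Palomar's actual schema.)

```json
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "normalized_text",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```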

@t2hnd
Author

t2hnd commented Apr 3, 2024

Thanks, @bnewbold, for reviewing this and for your suggestions! I agree that most users are familiar with the concept of using double quotes for an exact match, so enabling this feature should definitely be helpful.
Regarding solution 1, could you share a sample configuration you have in mind? I'm curious to see what the index settings look like (and would like to test them if possible).

@bnewbold
Collaborator

bnewbold commented Apr 3, 2024

I don't have anything for you right now, but we are hoping to do a push and improve search overall (for all languages/users), and specifically improve Japanese indexing, in the next couple weeks.

@t2hnd
Author

t2hnd commented Apr 4, 2024

Understood, I'm looking forward to further updates.

@bnewbold
Collaborator

bnewbold commented Apr 9, 2024

@t2hnd the current plan is to add an additional text field just for Japanese text content, and use the Kuromoji analyser for that field. We will just special-case Japanese for now, not have additional language-specific fields.

Do you think that will be sufficient on its own? This article also discusses using N-gram queries, which would be a larger increase in complexity:
https://www.elastic.co/blog/how-to-implement-japanese-full-text-search-in-elasticsearch
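A minimal version of the planned Japanese-specific field might use the prebuilt kuromoji analyzer from the analysis-kuromoji plugin (a sketch under assumptions; the field name text_ja is hypothetical):

```json
{
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji"
      }
    }
  }
}
```

Unlike the folding filters discussed above, kuromoji performs morphological analysis, so it segments Japanese text into words without collapsing the (semi-)voiced kana variants into one another.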

bnewbold added a commit that referenced this issue Apr 10, 2024
This is just for posts right now, not profiles (descriptions, display
name, etc).

I'm somewhat confident in the indexing approach (separate duplicate
fields, gated by text detection). And this seems to work ok for simple
cases.

I'm not very confident about all-kanji text and indexing, and mixes of
Japanese and non-English character sets. For example, Japanese and
Korean (CJK), or Japanese and Thai (non-CJK).

One positive thing is that everything is still being indexed in the
regular text fields, using the existing analysis pipeline. So we can
revert the query changes if needed, or improve some corner cases using
query-time-only techniques.

Closes: #628
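The "gated by text detection" step could be approximated like this (a hypothetical sketch, not the actual Palomar code). It only checks for kana, which hints at why all-kanji text remains the hard case: kanji code points are shared with Chinese.

```python
def contains_kana(text: str) -> bool:
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) occur only
    # in Japanese, so their presence is a cheap signal to also index
    # the Japanese-analyzed field. Kanji alone is ambiguous with Chinese.
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

print(contains_kana("パリが好き"))    # True  (contains katakana/hiragana)
print(contains_kana("巴里"))          # False (kanji only)
print(contains_kana("I like Paris"))  # False
```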
@t2hnd
Author

t2hnd commented Apr 10, 2024

@bnewbold Thanks for the update! I believe having the kuromoji analyzer will greatly improve Japanese search. N-grams usually help recall in Japanese search, but I think we can see how well the current field configuration works first.
