Improve the search quality especially for CJK queries #302

hkurokawa · 2023-09-08T06:20:14Z

Hello,

I am not sure if this is the right place to report an issue about the search quality of bsky.app. Please feel free to point me to the right point of contact if not.

I found that the full-text search quality is poor, especially for CJK (Chinese, Japanese and Korean) queries. I guess this is because Elasticsearch is using its default analyzer, so tokenization is done in a uni-gram-ish way. CJK languages do not use white space to separate words, so full-text search on them needs a dedicated tokenization step. Uni-gram tokenization (one token per character) is the most naive approach and is not very useful most of the time.

Steps to reproduce

1. Run curl 'https://search.bsky.social/search/posts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6' (the decoded query is 熱力学, which means "thermodynamics" in Japanese).
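
For readers who want to check the encoding, here is a small sketch of how the query string in the URL relates to 熱力学. It uses only the Python standard library; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

query = "熱力学"

# quote() percent-encodes the UTF-8 bytes of each character.
encoded = quote(query)
print(encoded)

# unquote() reverses the encoding and recovers the original text.
print(unquote("%E7%86%B1%E5%8A%9B%E5%AD%A6"))
```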

Result

At step 1, a post not containing "熱力学" but containing "熱" (heat) and "力" (power) and "学" (study) separately is returned.

For example, a post like "学校から帰って熱いお風呂に入ったら力一杯がんばる" (this means "I will do my best after coming back from school and taking a hot bath") would be included in the response.
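
The behavior described above can be reproduced with a toy model. The sketch below assumes the search splits text into single characters and matches a post when every query character appears somewhere in it; that matching rule is an assumption for illustration, not the actual bsky.app search implementation, and the function names are made up.

```python
def unigram_tokens(text: str) -> set[str]:
    # Toy uni-gram tokenizer: every non-space character is its own token.
    return {ch for ch in text if not ch.isspace()}


def unigram_match(query: str, post: str) -> bool:
    # Assumed matching rule: all query tokens must occur in the post,
    # regardless of their order or adjacency.
    return unigram_tokens(query) <= unigram_tokens(post)


post = "学校から帰って熱いお風呂に入ったら力一杯がんばる"
query = "熱力学"

print(unigram_match(query, post))  # the toy uni-gram search matches the post
print(query in post)               # a literal substring check does not
```

The post contains 熱, 力 and 学 in separate words, so the uni-gram model matches even though the word 熱力学 never appears.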

Expected result

At step 1, posts that do not contain "熱力学" are not included in the response.

Remarks

Someone may think a phrase search (e.g., a phrase surrounded by double quotes) would solve the issue. That may work to some extent, but it would not solve the issue entirely. For example, "東京都" (Tokyo) and "京都" (Kyoto) are completely separate words in Japanese, and if a text search returns a post containing "東京都" for the query "京都", the user would think that the search is just useless.

I would suggest configuring and using better Elasticsearch analyzers for certain languages, such as CJK. Please feel free to ask me if you have any questions. Thanks!
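
To illustrate why phrase search alone does not help, here is a minimal sketch of phrase matching over token positions. The word-level token list is segmented by hand for this example and is an assumption; it is not the output of any real analyzer.

```python
def phrase_match(query_tokens: list[str], doc_tokens: list[str]) -> bool:
    # A phrase matches when the query tokens occur consecutively in the document.
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = "東京都に住んでいます"

# Uni-gram tokens: "京" and "都" are adjacent inside "東京都", so the phrase matches.
print(phrase_match(list("京都"), list(doc)))

# Hand-segmented word tokens (assumption): "東京都" is a single token,
# so a query for the word "京都" no longer matches.
word_tokens = ["東京都", "に", "住ん", "で", "い", "ます"]
print(phrase_match(["京都"], word_tokens))
```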


bnewbold · 2023-09-12T01:28:49Z

Hi @hkurokawa, thanks for the detailed ticket!

We have a work-in-progress branch that changes how we use OpenSearch (a fork of Elasticsearch, based on Lucene). It includes using the ICU plugin's tokenization, normalization, and folding rules:

https://github.com/bluesky-social/indigo/pull/263/files#diff-a7cd828df6438861fe3ec63c63ca68be113cd0f7d670d0f52d371c27f3bea81e

I'm not positive this will resolve your specific issues, but it may, and we can do some testing with the examples you give.
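
For readers unfamiliar with the ICU plugin, the sketch below shows roughly what index settings using it could look like. It is not the configuration from the linked PR: the analyzer name `icu_text` and the field name `text` are hypothetical, while `icu_tokenizer` and `icu_folding` are components provided by the `analysis-icu` plugin.

```python
import json

# Hypothetical index body; only the ICU component names come from the plugin.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "icu_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",  # ICU word-boundary tokenization
                    "filter": ["icu_folding"],     # ICU case and accent folding
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical field name for the post body
            "text": {"type": "text", "analyzer": "icu_text"}
        }
    },
}

print(json.dumps(index_body, indent=2, ensure_ascii=False))
```

A body like this would be sent when creating the index; whether it fixes a given CJK query still needs testing against real examples.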

hkurokawa · 2023-09-12T06:54:27Z

Thank you for the update, @bnewbold. Sure, let's revisit this after your change lands. Please feel free to close the issue or keep it around if you want to track this somewhere. Up to you. Thanks!

bnewbold · 2023-09-14T07:11:44Z

I haven't tested deeply, but for the specific examples you give I think the new index config should work (#263).

Larger refactors in this branch:

- [x] local docker dev env documented
- [x] specify mappings (schemas) for post and profile indices
- [x] transform raw records in to the index schemas
- [x] different doc _id syntax
- [x] skip read+deserialization of records other than profile and post, for efficiency
- [x] don't store records in database; database only used for firehose cursor state
- [x] switch to informal /xrpc/app.bsky.unspecced.search*Skeleton endpoints
- [x] return only skeleton responses (eg, AT-URI or DID lists)
- [x] handle non-success OpenSearch responses as errors
- [x] auto-create indices with schema when in indexing mode (not READONLY) (with `go:embed` schemas)
- [x] switch logging to `log/slog`, including echo integration
- [x] use `atproto/identity` package for identity caching and handling, not `User` database record
- [x] merged in backfill worker code
- [x] use `analysis-icu` plugin for (hopefully) better internationalized search
- [x] special typeahead indexing and query parameter
- [x] basic/simple query string parsing, which should be safe, supports quoted phrases, and `from:` filtering

This branch includes a couple small commits to SDK code, which i've cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo: bluesky-social/atproto#1594

This is not compatible with the previous version of `palomar` at the HTTP API, opensearch index, or database schema levels. The config vars should be backwards compatible.

The operational plan for staging and prod is to deploy this as an entirely new environment (eg, "prod2", "staging2"), get everything backfilled, and then flip over the AppView and then client app to use the lexicons/endpoints instead of the older version.

----

I think this is ready for review, merge, and deploy to staging. Some things to check before prod:

- [ ] compare index size and performance to existing version/schema
- [ ] real-world testing of profile typeahead (eg, do we need fuzzy?)
- [ ] real-world search relevancy checks
- [ ] real-world CJK text analysis checks (#302)

Out of scope for this PR:

- [ ] deal with `created_at` timestamp not being reliable, by adding a `sort_at` hybrid field, for future "sort by date"
- [ ] instrumentation and metrics (Jaz to implement on top of this branch)
- [ ] better bulk indexing performance, especially during backfill: disable refresh during backfill? longer refresh window? bulk (batch) indexing would be best
- [x] integrate a better identity service/cache; current is probably Ok in context of backfill. or perhaps just bump the cache size to ~50k or ~100k identities in prod?

hkurokawa · 2023-12-05T15:39:06Z

Hello,

Really sorry for the delay in my response. I somehow missed the notification email. I ran the curl command today and I am afraid the issue is not fixed yet.

Please try running the command and check whether all the returned posts contain "熱力学" in their text. I understand that it would be hard to recognize the Japanese text. Maybe you can just grep the text for the term.

Please feel free to let me know if there is anything I can do to help address the issue. Thanks!

bnewbold · 2023-12-05T18:13:28Z

The new search endpoint is:

http get 'https://api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6'

Unfortunately, this new version of search still hasn't arrived in the app for post search (it is already being used for profile search).
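
As a way to do the "grep" check suggested earlier in the thread, here is a minimal sketch that lists returned posts whose text does not contain the search term. It assumes the response is a JSON object with a `posts` array whose items carry a `uri` and a `record.text` field; that shape and the sample data (including the made-up AT-URIs) are assumptions for illustration, and the network call itself is left out.

```python
from urllib.parse import urlencode


def posts_missing_term(response: dict, term: str) -> list[str]:
    # Collect the URIs of posts whose text does not contain the term verbatim.
    missing = []
    for post in response.get("posts", []):
        text = post.get("record", {}).get("text", "")
        if term not in text:
            missing.append(post.get("uri", "<no uri>"))
    return missing


term = "熱力学"
url = "https://api.bsky.app/xrpc/app.bsky.feed.searchPosts?" + urlencode({"q": term})
print(url)

# Made-up sample response in the assumed shape; the URIs are not real posts.
sample = {
    "posts": [
        {"uri": "at://did:example:alice/app.bsky.feed.post/1",
         "record": {"text": "熱力学の第二法則について"}},
        {"uri": "at://did:example:bob/app.bsky.feed.post/2",
         "record": {"text": "学校から帰って熱いお風呂に入ったら力一杯がんばる"}},
    ]
}
print(posts_missing_term(sample, term))
```

An empty list from `posts_missing_term` would mean every returned post contains the term as a literal substring.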

hkurokawa · 2023-12-05T22:43:28Z

Thanks for the prompt update! Got it. I confirmed that the new endpoint returns much better results. Great job!

bnewbold · 2023-12-11T19:42:11Z

Ok, we finally shipped these changes in the app. You may need to refresh the web app, or wait a bit for mobile app updates.

I did a bit of testing and I don't think we have "solved" this issue yet. I hope it is at least a bit better? But curious for your feedback.

hkurokawa · 2023-12-12T10:57:55Z

Thank you so much for your hard work. I tried some queries and it seemed to me much better than before. Specifically, my original issue was resolved so I am going to close the issue.

I will test other queries going forward and will file another issue if I find anything. So far, it works really well. Great job and much appreciated. Thanks!

bnewbold · 2023-12-12T12:50:21Z

Thank you for your excellent original report, patience, and kind words!

bnewbold mentioned this issue Sep 14, 2023

palomar (search) iteration #263

Merged

24 tasks

hkurokawa closed this as completed Dec 12, 2023
