Improve the search quality especially for CJK queries #302
Comments
Hi @hkurokawa, thanks for the detailed ticket! We have a work-in-progress branch which changes how we use OpenSearch (based on Lucene, a fork of Elasticsearch), including using the ICU plugin's tokenization, normalization, and folding rules. I'm not positive this will resolve your specific issues, but it may, and we can do some testing with the examples you give.
Thank you for the update, @bnewbold. Sure, let's revisit this after your change lands. Please feel free to close the issue or keep it around if you want to track this somewhere. Up to you. Thanks!
I haven't tested deeply, but for the specific examples you give I think the new index config should work (#263).
Larger refactors in this branch:

- [x] local docker dev env documented
- [x] specify mappings (schemas) for post and profile indices
- [x] transform raw records in to the index schemas
- [x] different doc _id syntax
- [x] skip read+deserialization of records other than profile and post, for efficiency
- [x] don't store records in database; database only used for firehose cursor state
- [x] switch to informal /xrpc/app.bsky.unspecced.search*Skeleton endpoints
- [x] return only skeleton responses (eg, AT-URI or DID lists)
- [x] handle non-success OpenSearch responses as errors
- [x] auto-create indices with schema when in indexing mode (not READONLY) (with `go:embed` schemas)
- [x] switch logging to `log/slog`, including echo integration
- [x] use `atproto/identity` package for identity caching and handling, not `User` database record
- [x] merged in backfill worker code
- [x] use `analysis-icu` plugin for (hopefully) better internationalized search
- [x] special typeahead indexing and query parameter
- [x] basic/simple query string parsing, which should be safe, supports quoted phrases, and `from:` filtering

This branch includes a couple small commits to SDK code, which i've cherry-picked out as separate PRs for easier review. See also Lexicon PR in atproto repo: bluesky-social/atproto#1594

This is not compatible with the previous version of `palomar` at the HTTP API, opensearch index, or database schema levels. The config vars should be backwards compatible.

The operational plan for staging and prod is to deploy this as an entirely new environment (eg, "prod2", "staging2"), get everything backfilled, and then flip over the AppView and then client app to use the lexicons/endpoints instead of the older version.

----

I think this is ready for review, merge, and deploy to staging.

Some things to check before prod:

- [ ] compare index size and performance to existing version/schema
- [ ] real-world testing of profile typeahead (eg, do we need fuzzy?)
- [ ] real-world search relevancy checks
- [ ] real-world CJK text analysis checks (#302)

Out of scope for this PR:

- [ ] deal with `created_at` timestamp not being reliable, by adding a `sort_at` hybrid field, for future "sort by date"
- [ ] instrumentation and metrics (Jaz to implement on top of this branch)
- [ ] better bulk indexing performance, especially during backfill: disable refresh during backfill? longer refresh window? bulk (batch) indexing would be best
- [x] integrate a better identity service/cache; current is probably Ok in context of backfill. or perhaps just bump the cache size to ~50k or ~100k identities in prod?
Hello, really sorry for the delay in my response. I somehow missed the notification email. I ran the curl command today and I am afraid the issue is not fixed yet. Please try running the command and check whether all the returned posts contain "熱力学" in their text. I understand that it may be hard to recognize the Japanese text. Maybe you can just grep the text for the term. Please feel free to let me know if there is anything I can do to help address the issue. Thanks!
The new search endpoint is:
http get 'https://api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6'
Unfortunately, this new version of search still hasn't arrived in the app for post search (it is being used for profile search).
Thanks for the prompt update! Got it. I confirmed that the new endpoint seemed to return a much better result. Great job!
Ok, we finally shipped these changes in the app. You may need to refresh the web app, or wait a bit for mobile app updates. I did a bit of testing and I don't think we have "solved" this issue yet. I hope it is at least a bit better? But curious for your feedback.
Thank you so much for your hard work. I tried some queries and it seemed to me much better than before. Specifically, my original issue was resolved so I am going to close the issue. I will test other queries going forward and will file another issue if I find anything. So far, it works really well. Great job and much appreciated. Thanks!
Thank you for your excellent original report, patience, and kind words!
Hello,
I am not sure if this is the right place to report an issue about the search quality of bsky.app. Please feel free to point me to the right PoC if not.
I found that the full-text search quality was not great, especially for CJK (Chinese, Japanese and Korean) queries. I guess this is because the analyzer used in Elasticsearch is the default one and the tokenization is done in a uni-gram-ish way. CJK languages do not use white space as a separator between words, so we need some tokenization to do full-text search on them. Uni-gram tokenization, which treats every single character as its own token, is the most naive approach and is not very useful most of the time.
Steps to reproduce
1. curl 'https://search.bsky.social/search/posts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6'
   (The decoded query is 熱力学, which means "thermodynamics" in Japanese.)
Result
At step 1, a post not containing "熱力学" but containing "熱" (heat) and "力" (power) and "学" (study) separately is returned.
For example, a post something like "学校から帰って熱いお風呂に入ったら力一杯がんばる" (This means "I will do my best after coming back from school and taking a hot bath") would be included in the response.
Expected result
At step 1, posts that do not contain "熱力学" are not included in the response.
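To make the reported mismatch concrete, here is a minimal Python sketch that contrasts unigram-style matching with a plain substring check on the example post above. The helper names (`unigram_tokens`, `unigram_match`, `substring_match`) are hypothetical, and the matching rule is only an approximation of what a uni-gram-ish analyzer might do. It is not the search service's actual query logic.

```python
def unigram_tokens(text: str) -> set[str]:
    """Split text into single-character tokens, dropping whitespace."""
    return {ch for ch in text if not ch.isspace()}


def unigram_match(query: str, post: str) -> bool:
    """Assumed unigram behavior: every query character appears somewhere in the post."""
    return unigram_tokens(query) <= unigram_tokens(post)


def substring_match(query: str, post: str) -> bool:
    """The expected behavior from the report: the term appears contiguously."""
    return query in post


query = "熱力学"
post = "学校から帰って熱いお風呂に入ったら力一杯がんばる"

print(unigram_match(query, post))    # True: 熱, 力 and 学 each occur separately
print(substring_match(query, post))  # False: "熱力学" never occurs as a unit
```

Under this assumed model, the post is returned even though the word itself never appears, which is the behavior described in the Result section.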
Remarks
Someone may think a phrase search (e.g., a phrase surrounded by double quotes) would solve the issue. That may work to some extent but it would not solve the issue entirely. For example, "東京都" (Tokyo) and "京都" (Kyoto) are completely separate words in Japanese and if a text search returns a post containing "東京都" for a query "京都", the user would think that the search is just useless.
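The "東京都" versus "京都" case can be sketched in Python as follows. This is a toy illustration only: `TOY_DICTIONARY` and `segment` are hypothetical, and greedy longest-match segmentation is a stand-in for the dictionary-based or statistical word segmentation that a real analyzer would perform.

```python
# Toy dictionary; a real analyzer would rely on a full lexicon.
TOY_DICTIONARY = {"東京都", "東京", "京都", "に", "住む"}


def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    max_len = max(len(word) for word in dictionary)
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


post = "東京都に住む"

print("京都" in post)                           # True: phrase/substring match hits
print(segment(post, TOY_DICTIONARY))            # ['東京都', 'に', '住む']
print("京都" in segment(post, TOY_DICTIONARY))  # False: no "京都" token exists
```

A phrase match over characters still finds "京都" inside "東京都", while matching on word-level tokens does not, which is why a phrase search alone would not solve the issue.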
I would suggest configuring better Elasticsearch analyzers for certain languages such as CJK and using them. Please feel free to ask me if you have any questions. Thanks!
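As a rough sketch of what such a configuration could look like, the Python snippet below builds an index body that uses the `icu_tokenizer` and `icu_folding` components provided by the `analysis-icu` plugin mentioned in the thread. The analyzer name `text_icu` and the field name `text` are assumptions made for illustration. This is not the project's actual index schema.

```python
import json

# Hypothetical index body: the analyzer name "text_icu" and the field "text"
# are illustrative assumptions, not taken from the project's schema.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "text_icu": {
                    "type": "custom",
                    # Word-boundary tokenizer from the analysis-icu plugin.
                    "tokenizer": "icu_tokenizer",
                    # Case and diacritic folding from the same plugin.
                    "filter": ["icu_folding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "text_icu"}
        }
    },
}

# Serialize to JSON, e.g. to send as the body of an index-creation request.
print(json.dumps(INDEX_BODY, indent=2, ensure_ascii=False))
```

Whether ICU tokenization alone is enough for Japanese, or whether a language-specific analyzer is needed, would have to be checked against real queries such as the ones in this report.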