Improve the search quality especially for CJK queries #302

Closed
hkurokawa opened this issue Sep 8, 2023 · 9 comments

@hkurokawa

Hello,

I am not sure if this is the right place to report an issue about the search quality of bsky.app. If not, please feel free to point me to the right point of contact.

I found that full-text search quality is poor, especially for CJK (Chinese, Japanese, and Korean) queries. I guess this is because Elasticsearch is using its default analyzer, which tokenizes CJK text in a roughly uni-gram way (one token per character). CJK languages do not use whitespace to separate words, so full-text search on them needs language-aware tokenization. Uni-gram tokenization is the most naive approach and is not very useful most of the time.
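To illustrate why uni-gram tokenization misbehaves, here is a minimal Python sketch. It is only a simplified model, not how Elasticsearch is actually implemented: the helper names `unigram_tokens` and `unigram_match` are made up for this example, and the matching rule (every query character appears somewhere in the document) is an assumption about the observed behavior.

```python
def unigram_tokens(text: str) -> set[str]:
    """Split text into single-character tokens, ignoring whitespace."""
    return {ch for ch in text if not ch.isspace()}


def unigram_match(query: str, document: str) -> bool:
    """Return True when every query character occurs somewhere in the document."""
    return unigram_tokens(query) <= unigram_tokens(document)


query = "熱力学"
post = "学校から帰って熱いお風呂に入ったら力一杯がんばる"

print(unigram_match(query, post))  # True, although the word never appears
print(query in post)               # False: the post does not contain the word
```

The post matches because "熱", "力", and "学" each occur somewhere in it, even though they never form the word "熱力学".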

Steps to reproduce

  1. Run curl 'https://search.bsky.social/search/posts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6' (the decoded query is 熱力学, which means "thermodynamics" in Japanese)
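For readers who want to check the encoded query parameter, this short Python sketch decodes it with the standard library's `urllib.parse`. It only inspects the query string and does not call the search service.

```python
from urllib.parse import quote, unquote

encoded = "%E7%86%B1%E5%8A%9B%E5%AD%A6"

# Decode the percent-encoded UTF-8 bytes back into text.
print(unquote(encoded))            # 熱力学

# Encoding the term again gives the same query parameter.
print(quote("熱力学") == encoded)  # True
```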

Result

At step 1, the response includes posts that do not contain "熱力学" but contain "熱" (heat), "力" (power), and "学" (study) separately.

For example, a post such as "学校から帰って熱いお風呂に入ったら力一杯がんばる" ("I will do my best after coming back from school and taking a hot bath") would be included in the response.

Expected result

At step 1, posts that do not contain "熱力学" are not included in the response.

Remarks

Someone may think a phrase search (e.g., a phrase surrounded by double quotes) would solve the issue. That may work to some extent, but it would not solve the issue entirely. For example, "東京都" (Tokyo) and "京都" (Kyoto) are completely separate words in Japanese, and if a search for "京都" returns a post containing "東京都", the user would think that the search is useless.
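A small Python sketch of the phrase-search problem follows. It models a phrase query over single-character tokens as "the query characters appear consecutively and in order", which is an assumption made for illustration. The function name `phrase_match`, the sample sentences, and the word-level token list are all hand-written for this example and do not come from a real analyzer.

```python
def phrase_match(query: str, document: str) -> bool:
    """Model a phrase query over single-character tokens: the query
    characters must appear consecutively and in the same order."""
    q = list(query)
    d = list(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


tokyo_post = "東京都に住んでいます"  # "I live in Tokyo"
kyoto_post = "京都に行きたい"        # "I want to go to Kyoto"

print(phrase_match("京都", kyoto_post))  # True: the intended hit
print(phrase_match("京都", tokyo_post))  # True: a false positive on 東京都

# With word-level tokens (segmented by hand here), "京都" is not a token
# of the Tokyo post, so an exact token match avoids the false positive.
tokyo_words = ["東京都", "に", "住んで", "います"]
print("京都" in tokyo_words)  # False
```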

I would suggest configuring better Elasticsearch analyzers for certain languages, such as CJK, and using them. Please feel free to ask me if you have any questions. Thanks!

@bnewbold
Collaborator

bnewbold commented Sep 12, 2023

Hi @hkurokawa, thanks for the detailed ticket!

We have a work-in-progress branch that changes how we use OpenSearch (a fork of Elasticsearch, based on Lucene), including using the ICU plugin's tokenization, normalization, and folding rules:

https://github.com/bluesky-social/indigo/pull/263/files#diff-a7cd828df6438861fe3ec63c63ca68be113cd0f7d670d0f52d371c27f3bea81e
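For context, an index configuration using the `analysis-icu` plugin might look roughly like the sketch below, written as a Python dict that serializes to the JSON body of an index-creation request. This is not the configuration from the linked branch. The analyzer name `icu_text` and the field name `text` are hypothetical. `icu_tokenizer`, `icu_normalizer`, and `icu_folding` are component names provided by the plugin.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "icu_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],  # Unicode normalization
                    "tokenizer": "icu_tokenizer",       # ICU word segmentation
                    "filter": ["icu_folding"],          # case and accent folding
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical field name; the real schema may differ
            "text": {"type": "text", "analyzer": "icu_text"}
        }
    },
}

print(json.dumps(index_settings, indent=2))
```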

I'm not positive this will resolve your specific issues, but it may, and we can do some testing with the examples you give.

@hkurokawa
Author

Thank you for the update, @bnewbold. Sure, let's revisit this after your change lands. Please feel free to close the issue or keep it around if you want to track this somewhere. Up to you. Thanks!

@bnewbold bnewbold mentioned this issue Sep 14, 2023
24 tasks
@bnewbold
Collaborator

I haven't tested deeply, but for the specific examples you give I think the new index config should work (#263).

bnewbold added a commit that referenced this issue Sep 15, 2023
Larger refactors in this branch:

- [x] local docker dev env documented
- [x] specify mappings (schemas) for post and profile indices
- [x] transform raw records into the index schemas
- [x] different doc _id syntax
- [x] skip read+deserialization of records other than profile and post,
for efficiency
- [x] don't store records in database; database only used for firehose
cursor state
- [x] switch to informal /xrpc/app.bsky.unspecced.search*Skeleton
endpoints
- [x] return only skeleton responses (eg, AT-URI or DID lists)
- [x] handle non-success OpenSearch responses as errors
- [x] auto-create indices with schema when in indexing mode (not
READONLY) (with `go:embed` schemas)
- [x] switch logging to `log/slog`, including echo integration
- [x] use `atproto/identity` package for identity caching and handling,
not `User` database record
- [x] merged in backfill worker code
- [x] use `analysis-icu` plugin for (hopefully) better internationalized
search
- [x] special typeahead indexing and query parameter
- [x] basic/simple query string parsing, which should be safe, supports
quoted phrases, and `from:` filtering

This branch includes a couple small commits to SDK code, which I've
cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo:
bluesky-social/atproto#1594

This is not compatible with the previous version of `palomar` at the
HTTP API, opensearch index, or database schema levels. The config vars
should be backwards compatible. The operational plan for staging and
prod is to deploy this as an entirely new environment (eg, "prod2",
"staging2"), get everything backfilled, and then flip over the AppView
and then client app to use the lexicons/endpoints instead of the older
version.

----

I think this is ready for review, merge, and deploy to staging. Some
things to check before prod:

- [ ] compare index size and performance to existing version/schema
- [ ] real-world testing of profile typeahead (eg, do we need fuzzy?)
- [ ] real-world search relevancy checks
- [ ] real-world CJK text analysis checks (#302)

Out of scope for this PR:

- [ ] deal with `created_at` timestamp not being reliable, by adding a
`sort_at` hybrid field, for future "sort by date"
- [ ] instrumentation and metrics (Jaz to implement on top of this
branch)
- [ ] better bulk indexing performance, especially during backfill:
disable refresh during backfill? longer refresh window? bulk (batch)
indexing would be best
- [x] integrate a better identity service/cache; current is probably Ok
in context of backfill. or perhaps just bump the cache size to ~50k or
~100k identities in prod?
@hkurokawa
Author

Hello,

Really sorry for the delay in my response. I somehow missed the notification email. I ran the curl command today and I am afraid the issue is not fixed yet.

Please try running the command and check whether all the returned posts contain "熱力学" in their text. I understand that it may be hard to recognize the Japanese text, so you could simply grep the text for the term.
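A rough Python equivalent of that grep check is sketched below. It assumes the post texts have already been extracted from the response into a list of strings, because the response format is not described here. The function name `posts_missing_term` and the sample texts are made up for illustration.

```python
def posts_missing_term(texts: list[str], term: str) -> list[str]:
    """Return the post texts that do not contain the search term."""
    return [text for text in texts if term not in text]


# Illustrative sample data, not real search results.
sample_texts = [
    "熱力学の第二法則について",
    "学校から帰って熱いお風呂に入ったら力一杯がんばる",
]

for text in posts_missing_term(sample_texts, "熱力学"):
    print("unexpected result:", text)
```

If the search works as expected, the function returns an empty list for real results.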

Please let me know if there is anything I can do to help address the issue. Thanks!

@bnewbold
Collaborator

bnewbold commented Dec 5, 2023

The new search endpoint is:

http get 'https://api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6'

Unfortunately, this new version of search still hasn't arrived in the app for post search (it is being used for profile search).

@hkurokawa
Author

Thanks for the prompt update! Got it. I confirmed that the new endpoint returns much better results. Great job!

@bnewbold
Collaborator

Ok, we finally shipped these changes in the app. You may need to refresh the web app, or wait a bit for mobile app updates.

I did a bit of testing and I don't think we have "solved" this issue yet. I hope it is at least a bit better? But curious for your feedback.

@hkurokawa
Author

Thank you so much for your hard work. I tried some queries and it seemed to me much better than before. Specifically, my original issue was resolved so I am going to close the issue.

I will test other queries going forward and will file another issue if I find anything. So far, it works really well. Great job and much appreciated. Thanks!

@bnewbold
Collaborator

Thank you for your excellent original report, patience, and kind words!
