-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Search query supports from:user
#252
Search query supports from:user
#252
Conversation
@whyrusleeping @ericvolp12 Would love to know if there's anything I can do to help this get merged! |
I haven't had the chance to deploy Palomar yet, the code looks okay to me but I'd wanna test the queries against the staging indexer first to make sure there's no regression. Why is out of town at the moment but when he gets back I'll get his help testing this. |
@ericvolp12 fair enough. Thank you! |
Right now it looks like the schema in Staging and Prod for search are actually pretty different so I can't validate the query against staging. Staging looks something like: {
"tid": "app.bsky.feed.post/3jsb4rbgsd22e",
"cid": "bafyreiaf6frbvdb6cnlm2koi5lylyakkgnqjckgdnn4edxdhtrsmzuxdm4",
"user": {
"did": "did:plc:6wpnzwmt3hid374jl3jktqiv",
"handle": ""
},
"post": {
"$type": "app.bsky.feed.post",
"createdAt": "2023-03-31T21:13:30.138Z",
"reply": {
"parent": {
"cid": "bafyreicnaatqhvn7iey3pch3cj4fdrhfozjwowadf5b6o2eexj4bydeili",
"uri": "at://did:plc:ycg2blb5xnaz42y7cwrgzzwt/app.bsky.feed.post/3jsb37ndvak2x"
},
"root": {
"cid": "bafyreicnaatqhvn7iey3pch3cj4fdrhfozjwowadf5b6o2eexj4bydeili",
"uri": "at://did:plc:ycg2blb5xnaz42y7cwrgzzwt/app.bsky.feed.post/3jsb37ndvak2x"
}
},
"text": "fala com a Jay ou com o Why por dm no Twitter"
}
} Prod looks like: {
"tid": "app.bsky.feed.post/3k3wsvhsr5b2z",
"cid": "bafyreicpdkuxylacqf6a4tan6uasetlfzh7ymp7zrwqptoakkjpzbukpuy",
"user": {
"did": "did:plc:muv2t3u2d7oywuzsxqspj6co",
"handle": "zelkin.xyz"
},
"post": {
"createdAt": 1690938133664000000,
"text": "Kinda wanna petition for the official Bsky mascot to be a dragon lol\n\nThough it'll most likely be a blue jay.",
"user": "zelkin.xyz"
}
} Bryan mentioned he's got a plan for a refactor of the search schema laying about in a WIP branch somewhere so I'm gonna check with him tomorrow and see what we can do. |
Whichever way we go, there's probably gonna be a schema change and we'll have to re-index things somehow to get everything up to date. I'll let you know how that goes here. |
Sounds great. If there are multiple places to look, then one could actually use a Should bool match with a minimum match of 1, effectively equivalent to a SQL OR. Also I figured at some point the server could do a resolveHandle equivalent lookup to convert from:abc.tld to did:plc:suchandsuch, though there is value in having the handle as a separate field, if not for ease of debugging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few code comments.
This PR should work "as is" for faceted search of just handles (though it needs some testing). I think there is enthusiasm for this feature both as a general usability thing (eg, finding one's own old posts), and potentially for investigating abuse.
I have a couple high-level concerns about hacking in facet support for just this one field, and would prefer to hold off until we have time to do a coherent (if small) iteration of the entire search index and query syntax, but i'm not a hard block on this PR.
High-level notes:
- the handle field on ES post documents doesn't currently get updated when the account handle changes, only the profile document does. this would result in handle-based searches working for only some posts, or worse returning incorrect docs (if the same handle has been used for multiple accounts). so, this is going to be sort of half-broken. this is my biggest concern. the mitigation would be to do a query-based update of all old posts as well, but that could be an expensive operation and should be implemented and tested a bit more carefully.
- this feature makes it easier to dig through old posts for individual accounts. that's something we intend to support, but at a minimum it requires some comms around roll-out
- what is the query syntax going to look like long-term? probably Lucene-style, and probably
from:<handle>
would be supported, in addition to booleans, range queries, quoted phrases, etc. should we just pull in the elasticsearch/opensearchquery_string
interface for now, with some fallback for parse errors?
I had some stale code lying around for what I was directionaly thinking for a small iteration of the palomar search indices, you can see those here: https://github.com/bluesky-social/indigo/pull/263/files
search/query.go
Outdated
offset int, | ||
size int, | ||
) (*EsSearchResponse, error) { | ||
var musts []map[string]interface{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a fall-through case here with q = ""
and fromHandle = ""
, where the musts
is empty. I'm not sure if ES would be happy about that. We could remove the conditional on the q
length; pretty sure ES is fine with empty string, and will probably optimize that branch away.
Or, if you know that an empty bool
"must" list is fine, then it's a non-issue.
search/handlers.go
Outdated
@@ -12,6 +13,8 @@ import ( | |||
otel "go.opentelemetry.io/otel" | |||
) | |||
|
|||
var /* const */ FromOperatorRegexp = regexp.MustCompile(`\bfrom:([\w\.]+)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should a bit of test coverage for this regex pattern. AKA, a set of post strings, and whether the pattern matches.
hmm, yeah, maybe doing handle resolution in the query parser is a better option than updating the index, at least for posts. folks probably will want to just put the handle in the query string without a "from:" facet prefix; I guess we can try to parse and handle that query pattern as a special case |
Great feedback, thanks!
This needs more tests, especially the blank args as you mentioned; that makes sense. Comms rollout around additional functionality also makes sense. I'll have a look this week! |
FWIW, I would expect |
@ericvolp12 I see a different (flatter) schema after running {
"_index": "posts",
"_id": "GMYDQMZ2MFYHALTCONVXSLTGMVSWILTQN5ZXILZTNMZWMN3OOVZHU53RGJZQ====",
"_score": 1,
"_source": {
"createdAt": 1690333363220000000,
"text": "text text",
"user": "xyz"
}
} |
@ericvolp12 It occurs to me that maybe you were sharing the server query results, not the |
From the staging ES, the results look like this in the {
"$type": "app.bsky.feed.post",
"createdAt": "2023-03-13T21:50:17.125Z",
"reply": {
"parent": {
"cid": "bafyreiemnroin7ilp5rheiwbspmkyluji3um5ym6nbpy2gtq5fccjv4e3q",
"uri": "at://did:plc:ragtjsm2j2vknwkz3zp4oxrd/app.bsky.feed.post/3jqtwbua3wu2a"
},
"root": {
"cid": "bafyreiemnroin7ilp5rheiwbspmkyluji3um5ym6nbpy2gtq5fccjv4e3q",
"uri": "at://did:plc:ragtjsm2j2vknwkz3zp4oxrd/app.bsky.feed.post/3jqtwbua3wu2a"
}
},
"text": "A big benefit of being a dev for this app is any time I post something stupid I can just say I was testing a feature and that was a test post. As beta testers y’all can also use this"
} I also see some really old results that look like this: {
"createdAt": 1690930579311000000,
"text": "test",
"user": "dholms.dev"
} Not sure what the prod ES responses look like yet cause we haven't done a deploy in a while so my creds are not yet on the VPN config for those boxes. |
Thanks for sharing. That does trouble me a bit since there's no user/did field in that document. Does a root/non-reply post look the same? |
Once again, hitting |
I think the staging ES is just full of junk at the moment and has a bunch of different schemas thrown into it. I wouldn't be surprised if prod is in the same place right now with a bunch of competing schemas in the same index. We need to pick a proper schema and then do a full reindexing of the posts on the network for that schema into search both in staging and prod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bigger refactor but cleanly splits the transitions from:
- bunch of query string params -> well-structured
SearchQuery
for posts SearchQuery
-> ES search query
However the team decides to change the schema, this should be a good foundation for maintaining functionality during and after the reindex.
@@ -92,7 +89,6 @@ func doSearch(ctx context.Context, escli *es.Client, index string, query interfa | |||
escli.Search.WithIndex(index), | |||
escli.Search.WithBody(&buf), | |||
escli.Search.WithTrackTotalHits(true), | |||
escli.Search.WithSize(30), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing on the ES I have (8.4.0), this parameter overrides anything in the query body, so I've pushed this into the posts or profile query generators.
} | ||
} | ||
|
||
if len(musts) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty musts results in matching all, so good looking out @bnewbold.
) | ||
|
||
// DID is not currently being used presently | ||
type DidHandle struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This slightly futureproofs things when the user DID gets added into the posts
index mapping, so we can do a lookup from handle to DID and hold them in a logical struct instead of two separate string fields.
var /* const */ SearchQueryCountDefault = 30 | ||
var /* const */ SearchQueryCountMax = 100 | ||
|
||
func paramsToPostsSearchQuery(queryString string, offsetParam string, countParam string) (*SearchQuery, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One stop shop for searching posts: parses out from:
if present, ensure valid non-empty queries, handle offsets and counts.
@@ -75,14 +71,15 @@ func doSearchProfiles(ctx context.Context, escli *es.Client, q string) (*EsSearc | |||
"operator": "or", | |||
}, | |||
}, | |||
"size": SearchDefaultSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of escli.Search.WithSize(30)
surfaces back here for profile search.
I havent done a good job keeping up with pull requests around here, but one thing to keep in mind is that we want to search by DID and not by handle. If the query contains a handle we should map it to a did for them. |
I think this would be a good thing to get merged—perfect is the enemy of good and whatnot—even if the schema will undergo a huge overhaul. |
Gonna close this since |
This extracts
from:$user
and adds a term query foruser
field =$user
in the ESposts
collection.There's just a tiny bit of cleanup that seems to overlap with #184, apologies in advance to @ericvolp12.
Note this implementation does allow for
from:$user
only, andfrom:alice from:bob
will query that string against the text field as usual.