Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search query supports from:user #252

Conversation

jcsalterego
Copy link
Contributor

This extracts from:$user and adds a term query for user field = $user in the ES posts collection.

There's just a tiny bit of cleanup that seems to overlap with #184, apologies in advance to @ericvolp12.

Note this implementation does allow for from:$user only, and from:alice from:bob will query that string against the text field as usual.

@jcsalterego
Copy link
Contributor Author

@whyrusleeping @ericvolp12 Would love to know if there's anything I can do to help this get merged!

@ericvolp12
Copy link
Collaborator

I haven't had the chance to deploy Palomar yet, the code looks okay to me but I'd wanna test the queries against the staging indexer first to make sure there's no regression. Why is out of town at the moment but when he gets back I'll get his help testing this.

@jcsalterego
Copy link
Contributor Author

@ericvolp12 fair enough. Thank you!

@ericvolp12
Copy link
Collaborator

Right now it looks like the schema in Staging and Prod for search are actually pretty different so I can't validate the query against staging.

Staging looks something like:

{
  "tid": "app.bsky.feed.post/3jsb4rbgsd22e",
  "cid": "bafyreiaf6frbvdb6cnlm2koi5lylyakkgnqjckgdnn4edxdhtrsmzuxdm4",
  "user": {
    "did": "did:plc:6wpnzwmt3hid374jl3jktqiv",
    "handle": ""
  },
  "post": {
    "$type": "app.bsky.feed.post",
    "createdAt": "2023-03-31T21:13:30.138Z",
    "reply": {
      "parent": {
        "cid": "bafyreicnaatqhvn7iey3pch3cj4fdrhfozjwowadf5b6o2eexj4bydeili",
        "uri": "at://did:plc:ycg2blb5xnaz42y7cwrgzzwt/app.bsky.feed.post/3jsb37ndvak2x"
      },
      "root": {
        "cid": "bafyreicnaatqhvn7iey3pch3cj4fdrhfozjwowadf5b6o2eexj4bydeili",
        "uri": "at://did:plc:ycg2blb5xnaz42y7cwrgzzwt/app.bsky.feed.post/3jsb37ndvak2x"
      }
    },
    "text": "fala com a Jay ou com o Why por dm no Twitter"
  }
}

Prod looks like:

{
  "tid": "app.bsky.feed.post/3k3wsvhsr5b2z",
  "cid": "bafyreicpdkuxylacqf6a4tan6uasetlfzh7ymp7zrwqptoakkjpzbukpuy",
  "user": {
    "did": "did:plc:muv2t3u2d7oywuzsxqspj6co",
    "handle": "zelkin.xyz"
  },
  "post": {
    "createdAt": 1690938133664000000,
    "text": "Kinda wanna petition for the official Bsky mascot to be a dragon lol\n\nThough it'll most likely be a blue jay.",
    "user": "zelkin.xyz"
  }
}

Bryan mentioned he's got a plan for a refactor of the search schema laying about in a WIP branch somewhere so I'm gonna check with him tomorrow and see what we can do.

@ericvolp12
Copy link
Collaborator

Whichever way we go, there's probably gonna be a schema change and we'll have to re-index things somehow to get everything up to date. I'll let you know how that goes here.

@jcsalterego
Copy link
Contributor Author

Sounds great. If there are multiple places to look, then one could actually use a Should bool match with a minimum match of 1, effectively equivalent to a SQL OR.

Also I figured at some point the server could do a resolveHandle equivalent lookup to convert from:abc.tld to did:plc:suchandsuch, though there is value in having the handle as a separate field, if not for ease of debugging.

Copy link
Collaborator

@bnewbold bnewbold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few code comments.

This PR should work "as is" for faceted search of just handles (though it needs some testing). I think there is enthusiasm for this feature both as a general usability thing (eg, finding one's own old posts), and potentially for investigating abuse.

I have a couple high-level concerns about hacking in facet support for just this one field, and would prefer to hold off until we have time to do a coherent (if small) iteration of the entire search index and query syntax, but i'm not a hard block on this PR.

High-level notes:

  • the handle field on ES post documents doesn't currently get updated when the account handle changes, only the profile document does. this would result in handle-based searches working for only some posts, or worse returning incorrect docs (if the same handle has been used for multiple accounts). so, this is going to be sort of half-broken. this is my biggest concern. the mitigation would be to do a query-based update of all old posts as well, but that could be an expensive operation and should be implemented and tested a bit more carefully.
  • this feature makes it easier to dig through old posts for individual accounts. that's something we intend to support, but at a minimum it requires some comms around roll-out
  • what is the query syntax going to look like long-term? probably Lucene-style, and probably from:<handle> would be supported, in addition to booleans, range queries, quoted phrases, etc. should we just pull in the elasticsearch/opensearch query_string interface for now, with some fallback for parse errors?

I had some stale code lying around for what I was directionaly thinking for a small iteration of the palomar search indices, you can see those here: https://github.com/bluesky-social/indigo/pull/263/files

search/query.go Outdated
offset int,
size int,
) (*EsSearchResponse, error) {
var musts []map[string]interface{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a fall-through case here with q = "" and fromHandle = "", where the musts is empty. I'm not sure if ES would be happy about that. We could remove the conditional on the q length; pretty sure ES is fine with empty string, and will probably optimize that branch away.

Or, if you know that an empty bool "must" list is fine, then it's a non-issue.

@@ -12,6 +13,8 @@ import (
otel "go.opentelemetry.io/otel"
)

var /* const */ FromOperatorRegexp = regexp.MustCompile(`\bfrom:([\w\.]+)`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should a bit of test coverage for this regex pattern. AKA, a set of post strings, and whether the pattern matches.

@bnewbold
Copy link
Collaborator

bnewbold commented Aug 2, 2023

Sounds great. If there are multiple places to look, then one could actually use a Should bool match with a minimum match of 1, effectively equivalent to a SQL OR.

Also I figured at some point the server could do a resolveHandle equivalent lookup to convert from:abc.tld to did:plc:suchandsuch, though there is value in having the handle as a separate field, if not for ease of debugging.

hmm, yeah, maybe doing handle resolution in the query parser is a better option than updating the index, at least for posts. folks probably will want to just put the handle in the query string without a "from:" facet prefix; I guess we can try to parse and handle that query pattern as a special case

@jcsalterego
Copy link
Contributor Author

Great feedback, thanks!

from: extraction is definitely a stopgap and I didn't want to rock the boat too much by introducing an entirely different query method.

This needs more tests, especially the blank args as you mentioned; that makes sense.

Comms rollout around additional functionality also makes sense.

I'll have a look this week!

@jcsalterego
Copy link
Contributor Author

folks probably will want to just put the handle in the query string without a "from:" facet prefix

FWIW, I would expect @user.tld for mentions & replies, and from:user.tld (or from:@user.tld just to be nice) to return posts from said user. from: isn't too much of a leap for those with gmail search experience, or google site:example.com experience.

@jcsalterego
Copy link
Contributor Author

Right now it looks like the schema in Staging and Prod for search are actually pretty different so I can't validate the query against staging.

@ericvolp12 I see a different (flatter) schema after running palomar run. Am I indexing posts/profiles incorrectly?

{
  "_index": "posts",
  "_id": "GMYDQMZ2MFYHALTCONVXSLTGMVSWILTQN5ZXILZTNMZWMN3OOVZHU53RGJZQ====",
  "_score": 1,
  "_source": {
    "createdAt": 1690333363220000000,
    "text": "text text",
    "user": "xyz"
  }
}

@jcsalterego
Copy link
Contributor Author

@ericvolp12 It occurs to me that maybe you were sharing the server query results, not the posts index as-is from ES.

@ericvolp12
Copy link
Collaborator

ericvolp12 commented Aug 3, 2023

From the staging ES, the results look like this in the _source:

{
  "$type": "app.bsky.feed.post",
  "createdAt": "2023-03-13T21:50:17.125Z",
  "reply": {
    "parent": {
      "cid": "bafyreiemnroin7ilp5rheiwbspmkyluji3um5ym6nbpy2gtq5fccjv4e3q",
      "uri": "at://did:plc:ragtjsm2j2vknwkz3zp4oxrd/app.bsky.feed.post/3jqtwbua3wu2a"
    },
    "root": {
      "cid": "bafyreiemnroin7ilp5rheiwbspmkyluji3um5ym6nbpy2gtq5fccjv4e3q",
      "uri": "at://did:plc:ragtjsm2j2vknwkz3zp4oxrd/app.bsky.feed.post/3jqtwbua3wu2a"
    }
  },
  "text": "A big benefit of being a dev for this app is any time I post something stupid I can just say I was testing a feature and that was a test post. As beta testers y’all can also use this"
}

I also see some really old results that look like this:

{
  "createdAt": 1690930579311000000,
  "text": "test",
  "user": "dholms.dev"
}

Not sure what the prod ES responses look like yet cause we haven't done a deploy in a while so my creds are not yet on the VPN config for those boxes.

@jcsalterego
Copy link
Contributor Author

Thanks for sharing. That does trouble me a bit since there's no user/did field in that document. Does a root/non-reply post look the same?

@jcsalterego
Copy link
Contributor Author

Once again, hitting Comment before my brain finishes thinking. That may be fine if the user handle is being indexed but not stored :)

@ericvolp12
Copy link
Collaborator

I think the staging ES is just full of junk at the moment and has a bunch of different schemas thrown into it. I wouldn't be surprised if prod is in the same place right now with a bunch of competing schemas in the same index. We need to pick a proper schema and then do a full reindexing of the posts on the network for that schema into search both in staging and prod.

Copy link
Contributor Author

@jcsalterego jcsalterego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bigger refactor but cleanly splits the transitions from:

  • bunch of query string params -> well-structured SearchQuery for posts
  • SearchQuery -> ES search query

However the team decides to change the schema, this should be a good foundation for maintaining functionality during and after the reindex.

@@ -92,7 +89,6 @@ func doSearch(ctx context.Context, escli *es.Client, index string, query interfa
escli.Search.WithIndex(index),
escli.Search.WithBody(&buf),
escli.Search.WithTrackTotalHits(true),
escli.Search.WithSize(30),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing on the ES I have (8.4.0), this parameter overrides anything in the query body, so I've pushed this into the posts or profile query generators.

}
}

if len(musts) == 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty musts results in matching all, so good looking out @bnewbold.

)

// DID is not currently being used presently
type DidHandle struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slightly futureproofs things when the user DID gets added into the posts index mapping, so we can do a lookup from handle to DID and hold them in a logical struct instead of two separate string fields.

var /* const */ SearchQueryCountDefault = 30
var /* const */ SearchQueryCountMax = 100

func paramsToPostsSearchQuery(queryString string, offsetParam string, countParam string) (*SearchQuery, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One stop shop for searching posts: parses out from: if present, ensure valid non-empty queries, handle offsets and counts.

@@ -75,14 +71,15 @@ func doSearchProfiles(ctx context.Context, escli *es.Client, q string) (*EsSearc
"operator": "or",
},
},
"size": SearchDefaultSize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of escli.Search.WithSize(30) surfaces back here for profile search.

@jcsalterego jcsalterego requested a review from bnewbold August 3, 2023 17:38
@whyrusleeping
Copy link
Collaborator

I havent done a good job keeping up with pull requests around here, but one thing to keep in mind is that we want to search by DID and not by handle. If the query contains a handle we should map it to a did for them.

@jcsalterego
Copy link
Contributor Author

I think this would be a good thing to get merged—perfect is the enemy of good and whatnot—even if the schema will undergo a huge overhaul.

@ericvolp12
Copy link
Collaborator

ericvolp12 commented Sep 29, 2023

Gonna close this since from:{handle} queries are now supported in Palomar and should be added to the posts search once that gets proxied through AppView properly. We're leaning away from search being a publicly addressable external service and are gonna wrap it with the AppView. The internally hosted version of it now supports from though so thanks for the push!

@ericvolp12 ericvolp12 closed this Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants