feat: Snuba tsdb implementation.#7834
Conversation
| # If nothing actually matches the requested range, just return the | ||
| # lowest resolution interval. | ||
| return list(self.rollups)[-1] | ||
| def get_optimal_rollup(self, start): |
There was a problem hiding this comment.
Will need @tkaemming to chime in on this, I'm pretty clueless on TSDB.
There was a problem hiding this comment.
Yeah this was a bit of a drive-by attempt at fixing the bug mentioned in the comments and simplifying the code. but doesn't actually need to be part of this PR
There was a problem hiding this comment.
Looking back at history, I think this is OK to do (unless anything has changed since then, but I can't think of what would have): #3860 (comment)
I personally decided to defer it to a later change (which obviously never happened) so that we could make sure that it was working correctly without too many other changes to muddy the water in case anything went awry, your risk tolerance may differ.
| col = 'tags[{}]'.format(tag) | ||
| if val == ANY: | ||
| conditions.append((col, 'IS NOT NULL', None)) | ||
| elif val == EMPTY: |
There was a problem hiding this comment.
I'm struggling to find/remember what EMPTY and ANY really mean again, but the other tagstore implementations seem to do something very different than checking whether the tag is null?
for k, v in tag_lookups:
if v is EMPTY:
return None
😕
There was a problem hiding this comment.
Probably could use @tkaemming here too because looking at the existing impls I'm more confused than not.
There was a problem hiding this comment.
IIRC, EMPTY is basically the equivalent of calling queryset.none() and only appears to be generated as part of this function:
sentry/src/sentry/search/utils.py
Lines 20 to 33 in 1244936
This basically precludes the necessity for doing any of the rest of the search (since we only AND conditions together, hitting this branch means there can't possibly be any results) so in my opinion we should should remove this from tagstore if it's not too much trouble.
There was a problem hiding this comment.
we should should remove this from tagstore
Do you just mean something at a higher level up should return None, so tagstore never has to see EMPTY? Or something different?
There was a problem hiding this comment.
In the meantime I've just put in an early return if there's any EMPTY tags, but open to considering a refactor where tagstore doesn't see EMPTY at all.
There was a problem hiding this comment.
Do you just mean something at a higher level up should return
None, so tagstore never has to seeEMPTY?
Yep.
In the meantime I've just put in an early return if there's any
EMPTYtags, but open to considering a refactor where tagstore doesn't seeEMPTYat all.
Makes sense to me. 👍
bretthoerner
left a comment
There was a problem hiding this comment.
I didn't review the TSDB side. Mostly small stuff/questions.
| # passed-in keys for project_id, or indrectly (eg the set of projects | ||
| # related to the queried set of issues or releases) | ||
| project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)] | ||
| if all(not ids for ids in project_ids): |
There was a problem hiding this comment.
I think this is more easily stated/read as if not any(project_ids):
There was a problem hiding this comment.
lol, yeah thats a whole lot better
| conditions.append((col, 'IN', keys)) | ||
|
|
||
| # project_ids will be the set of projects either referenced directly as | ||
| # passed-in keys for project_id, or indrectly (eg the set of projects |
| project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)] | ||
| if all(not ids for ids in project_ids): | ||
| raise Exception("No project_id filter, or none could be inferred from other filters.") | ||
| project_ids = list(set.intersection(*[set(ids) for ids in project_ids if ids])) |
There was a problem hiding this comment.
Is it intersection and not union? If it doesn't go with the largest possible selection of project_ids then why not just have the user pass in project_ids themselves? I guess I'm not sure why we don't require them as a top level param anyway?
There was a problem hiding this comment.
for something like get_group_tag_value_count(project_id, group_id, environment_id) in tagstore; if you passed in project_id = 1, and an environment_id from project 2, and a group_id from project 3 then:
- I guess you have a bug anyway, because why are you sending these unrelated things?
- Its pointless to do a query across the union of all 3 projects, (assuming all conditions are
ANDed together) because there will be no matches.
There was a problem hiding this comment.
Oh, to answer the other question. We don't use user-passed in project_ids exclusively because sometimes they don't exist. eg with TSDBModel.frequent_releases_by_group and others, we only have the group_id, release_id, environment_id etc.
So the main purpose of this whole block is just to infer the project_id for snuba purposes, when we are not passed a project id. In the cases where we are passed a project id, it doesn't really do much yeah,
There was a problem hiding this comment.
OK, I definitely buy it not being required. But I wonder if it should still be either,
- toplevel and not required, but if provided it is strictly used as the
project_idfilter - keep what you have an assert that the projects match, because I think you nailed my confusion that it just feels weird to magically pull out
project_ids and some of them vanish viaintersection
There was a problem hiding this comment.
@bretthoerner cleaned this up so that we use the project_ids directly if they are passed, and the union of any projects we can infer from other models if not.
| def get_project_issues(project_ids): | ||
| """ | ||
| Get a list of issues and associated fingerprint hashes for a project. | ||
| """ |
There was a problem hiding this comment.
This is returning every hash a project has? That definitely won't fit in RAM or in a query, and what's the difference between doing primary_hash IN (every_hash_in_the_project) and project_id = FOO?
Doesn't this method need to take a group_id argument and only return those hashes?
There was a problem hiding this comment.
Hmm, this could be a problem. In the case where you are only looking at a single group, then we should reduce the list here for sure. The snuba code already does reduce the issue expression to only the ones that are needed to filter/group by, but I guess its better to filter here so we don;t have to send as many hashes to snuba.
The thing is though, with things like TSDBModel.frequent_issues_by_project which would require sending all the issue->hash mappings, so that we could compute the most frequently seen issues. If that list is not going to fit in memory I guess we have a problem
There was a problem hiding this comment.
Just to repeat what I said in #snuba, my only idea is to select primary_hash, count() as c from sentry_dist where project_id = X group by primary_hash order by c desc limit 100 and then on Sentry side we'd have to reconcile the (hash, count) results into the biggest groups.
I definitely think "send all project hashes in the query" won't work, though. I could be wrong!
| equivalent ones in snuba. | ||
| """ | ||
| mappings = { | ||
| 'environment': (Environment, 'name'), |
There was a problem hiding this comment.
nit: I'd prefer to call it an environment_id (and so on) when it is an ID, as we do elsewhere. I guess that'd need another level of translation to Snuba columns though, bleh.
7828853 to
a5a3748
Compare
|
Nice, it's looking good to me except I didn't check TSDB and I'm not sure if you want to do something about Thanks again for putting this up so early, having the query util in a shareable state will be useful to break off more tagstore pieces. |
|
@bretthoerner I fixed get_project_issues to scope down to only the referenced set of issue_ids if they are present. So that should at least make this not terrible in that case that we know which issue we are looking for. I will defer solving the case where we need to solve the "group by issue for all issues" problem for later. |
bretthoerner
left a comment
There was a problem hiding this comment.
Nice, LGTM. But still need 👀 on TSDB before merge.
de6ff49 to
41b6039
Compare
tkaemming
left a comment
There was a problem hiding this comment.
Just a few notes inline from looking at the TSDB bits. I'm assuming that the plan here is just to work out any serious issues from doing parity testing with real data so I didn't review super closely, and I'm also assuming Brett covered anything in utils.snuba.
| def get_optimal_rollup(self, start): | ||
| """ | ||
| Return the size (in seconds) of the finest-granularity available rollup | ||
| that will have data available in the given time range. |
There was a problem hiding this comment.
This is a much better explanation, good clarification. Though, it's not a time range any longer, I guess, so it might make sense to say "available between the provided start time and the current timestamp"?
| # If nothing actually matches the requested range, just return the | ||
| # lowest resolution interval. | ||
| return list(self.rollups)[-1] | ||
| def get_optimal_rollup(self, start): |
There was a problem hiding this comment.
Looking back at history, I think this is OK to do (unless anything has changed since then, but I can't think of what would have): #3860 (comment)
I personally decided to defer it to a later change (which obviously never happened) so that we could make sure that it was working correctly without too many other changes to muddy the water in case anything went awry, your risk tolerance may differ.
| model_columns = self.model_columns(model) | ||
|
|
||
| if model_columns is None: | ||
| return None |
There was a problem hiding this comment.
This seems like it should raise an exception here if the backend can't handle the request, rather then returning None. Otherwise this is probably going to manifest itself as a weird combination of {Attribute,Type,Value}Error.
| # into | ||
| # {group: [(top1, score), ...]} | ||
| for k in result: | ||
| item_scores = [(v, float(i + 1)) for i, v in enumerate(reversed(result[k]))] |
There was a problem hiding this comment.
Is the enumeration ordinal just a placeholder since due to topK not returning the counts? I think that'd at least be worth a comment here clarifying this isn't a direct translation.
There was a problem hiding this comment.
Correct. I'll leave a comment.
| # into | ||
| # {group: [(timestamp, {top1: score, ...}), ...]} | ||
| for k in result: | ||
| result[k] = sorted([ |
There was a problem hiding this comment.
Same comment as above with regard to score.
| return result | ||
|
|
||
| def get_frequency_series(self, model, items, start, end=None, | ||
| rollup=None, environment_id=None, limit=10): |
There was a problem hiding this comment.
I don't think this method actually takes a limit parameter.
Came up with a reasonable helper method that works for all range, distinct, and frequency queries. The results of this helper sometimes have to be reformatted slightly to conform to the expected (historical) TSDB result formats
get_frequency_totals, get_distinct_union now supported. Some more cleanup and refactoring of the core get_data code. Added a test that the result shape was correct for every test.
For environment and release, the values stored in snuba are the name strings, not the db ids, when querying we need to dereference the ids to get the right values to query for in snuba.
- Mock a snuba response in SnubaTSDB tests and verify that the request and response formats are as expected. - Translate model keys to snuba fields, and then translate the result keys back to model ids in SnubaTSDB. This only applies to environment and release at the moment.
Return the correct response format from the search function, and test it.
If the query has a particular issue or set of issues its looking at, only expand the issue-> hash map for those issues.
46a6bf0 to
2252366
Compare
This introduces a utility to make queries to the snuba service from a set
of filter/group/aggregation parameters. The utility does the work of translating
local / postgres ids to the values stored in snuba.
Included is a TSDB backend implementation that uses this utility to construct
time series queries and return the correctly formatted results that TSDB clients expect.
Also included is a (mostly-stubbed) snuba backend for TagStore that currently only has the tag search method implemented.