Skip to content

feat: Snuba tsdb implementation.#7834

Merged
alex-hofsteede merged 20 commits into
masterfrom
hoff/snuba-tsdb
Mar 30, 2018
Merged

feat: Snuba tsdb implementation.#7834
alex-hofsteede merged 20 commits into
masterfrom
hoff/snuba-tsdb

Conversation

@alex-hofsteede

@alex-hofsteede alex-hofsteede commented Mar 28, 2018

Copy link
Copy Markdown
Contributor

This introduces a utility to make queries to the snuba service from a set
of filter/group/aggregation parameters. The utility does the work of translating
local / postgres ids to the values stored in snuba.

Included is a TSDB backend implementation that uses this utility to construct
time series queries and return the correctly formatted results that TSDB clients expect.

Also included is a (mostly-stubbed) snuba backend for TagStore that currently only has the tag search method implemented.

Comment thread src/sentry/tsdb/base.py Outdated
# If nothing actually matches the requested range, just return the
# lowest resolution interval.
return list(self.rollups)[-1]
def get_optimal_rollup(self, start):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need @tkaemming to chime in on this, I'm pretty clueless on TSDB.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was a bit of a drive-by attempt at fixing the bug mentioned in the comments and simplifying the code. but doesn't actually need to be part of this PR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at history, I think this is OK to do (unless anything has changed since then, but I can't think of what would have): #3860 (comment)

I personally decided to defer it to a later change (which obviously never happened) so that we could make sure that it was working correctly without too many other changes to muddy the water in case anything went awry, your risk tolerance may differ.

Comment thread src/sentry/tagstore/snuba/backend.py Outdated
col = 'tags[{}]'.format(tag)
if val == ANY:
conditions.append((col, 'IS NOT NULL', None))
elif val == EMPTY:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to find/remember what EMPTY and ANY really mean again, but the other tagstore implementations seem to do something very different than checking whether the tag is null?

        for k, v in tag_lookups:
            if v is EMPTY:
                return None

😕

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably could use @tkaemming here too because looking at the existing impls I'm more confused than not.

@tkaemming tkaemming Mar 29, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, EMPTY is basically the equivalent of calling queryset.none() and only appears to be generated as part of this function:

def parse_release(project, value):
# TODO(dcramer): add environment support
if value == 'latest':
value = Release.objects.filter(
organization_id=project.organization_id,
projects=project,
).extra(select={
'sort': 'COALESCE(date_released, date_added)',
}).order_by('-sort').values_list(
'version', flat=True
).first()
if value is None:
return EMPTY
return value

This basically precludes the necessity for doing any of the rest of the search (since we only AND conditions together, hitting this branch means there can't possibly be any results) so in my opinion we should should remove this from tagstore if it's not too much trouble.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should should remove this from tagstore

Do you just mean something at a higher level up should return None, so tagstore never has to see EMPTY? Or something different?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meantime I've just put in an early return if there's any EMPTY tags, but open to considering a refactor where tagstore doesn't see EMPTY at all.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you just mean something at a higher level up should return None, so tagstore never has to see EMPTY?

Yep.

In the meantime I've just put in an early return if there's any EMPTY tags, but open to considering a refactor where tagstore doesn't see EMPTY at all.

Makes sense to me. 👍

@bretthoerner bretthoerner left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the TSDB side. Mostly small stuff/questions.

Comment thread src/sentry/utils/snuba.py Outdated
# passed-in keys for project_id, or indrectly (eg the set of projects
# related to the queried set of issues or releases)
project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)]
if all(not ids for ids in project_ids):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more easily stated/read as if not any(project_ids):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, yeah thats a whole lot better

Comment thread src/sentry/utils/snuba.py Outdated
conditions.append((col, 'IN', keys))

# project_ids will be the set of projects either referenced directly as
# passed-in keys for project_id, or indrectly (eg the set of projects

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indirectly

Comment thread src/sentry/utils/snuba.py Outdated
project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)]
if all(not ids for ids in project_ids):
raise Exception("No project_id filter, or none could be inferred from other filters.")
project_ids = list(set.intersection(*[set(ids) for ids in project_ids if ids]))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intersection and not union? If it doesn't go with the largest possible selection of project_ids then why not just have the user pass in project_ids themselves? I guess I'm not sure why we don't require them as a top level param anyway?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for something like get_group_tag_value_count(project_id, group_id, environment_id) in tagstore; if you passed in project_id = 1, and an environment_id from project 2, and a group_id from project 3 then:

  1. I guess you have a bug anyway, because why are you sending these unrelated things?
  2. Its pointless to do a query across the union of all 3 projects, (assuming all conditions are ANDed together) because there will be no matches.

@alex-hofsteede alex-hofsteede Mar 28, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, to answer the other question. We don't use user-passed in project_ids exclusively because sometimes they don't exist. eg with TSDBModel.frequent_releases_by_group and others, we only have the group_id, release_id, environment_id etc.

So the main purpose of this whole block is just to infer the project_id for snuba purposes, when we are not passed a project id. In the cases where we are passed a project id, it doesn't really do much yeah,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I definitely buy it not being required. But I wonder if it should still be either,

  1. toplevel and not required, but if provided it is strictly used as the project_id filter
  2. keep what you have an assert that the projects match, because I think you nailed my confusion that it just feels weird to magically pull out project_ids and some of them vanish via intersection

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bretthoerner cleaned this up so that we use the project_ids directly if they are passed, and the union of any projects we can infer from other models if not.

Comment thread src/sentry/utils/snuba.py
def get_project_issues(project_ids):
"""
Get a list of issues and associated fingerprint hashes for a project.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is returning every hash a project has? That definitely won't fit in RAM or in a query, and what's the difference between doing primary_hash IN (every_hash_in_the_project) and project_id = FOO?

Doesn't this method need to take a group_id argument and only return those hashes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this could be a problem. In the case where you are only looking at a single group, then we should reduce the list here for sure. The snuba code already does reduce the issue expression to only the ones that are needed to filter/group by, but I guess its better to filter here so we don;t have to send as many hashes to snuba.

The thing is though, with things like TSDBModel.frequent_issues_by_project which would require sending all the issue->hash mappings, so that we could compute the most frequently seen issues. If that list is not going to fit in memory I guess we have a problem

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to repeat what I said in #snuba, my only idea is to select primary_hash, count() as c from sentry_dist where project_id = X group by primary_hash order by c desc limit 100 and then on Sentry side we'd have to reconcile the (hash, count) results into the biggest groups.

I definitely think "send all project hashes in the query" won't work, though. I could be wrong!

Comment thread src/sentry/utils/snuba.py
equivalent ones in snuba.
"""
mappings = {
'environment': (Environment, 'name'),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer to call it an environment_id (and so on) when it is an ID, as we do elsewhere. I guess that'd need another level of translation to Snuba columns though, bleh.

@bretthoerner

Copy link
Copy Markdown
Contributor

Nice, it's looking good to me except I didn't check TSDB and I'm not sure if you want to do something about get_project_issues now or later?

Thanks again for putting this up so early, having the query util in a shareable state will be useful to break off more tagstore pieces.

@alex-hofsteede

Copy link
Copy Markdown
Contributor Author

@bretthoerner I fixed get_project_issues to scope down to only the referenced set of issue_ids if they are present. So that should at least make this not terrible in that case that we know which issue we are looking for.

I will defer solving the case where we need to solve the "group by issue for all issues" problem for later.

@bretthoerner bretthoerner left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM. But still need 👀 on TSDB before merge.

@tkaemming tkaemming left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few notes inline from looking at the TSDB bits. I'm assuming that the plan here is just to work out any serious issues from doing parity testing with real data so I didn't review super closely, and I'm also assuming Brett covered anything in utils.snuba.

Comment thread src/sentry/tsdb/base.py Outdated
def get_optimal_rollup(self, start):
"""
Return the size (in seconds) of the finest-granularity available rollup
that will have data available in the given time range.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a much better explanation, good clarification. Though, it's not a time range any longer, I guess, so it might make sense to say "available between the provided start time and the current timestamp"?

Comment thread src/sentry/tsdb/base.py Outdated
# If nothing actually matches the requested range, just return the
# lowest resolution interval.
return list(self.rollups)[-1]
def get_optimal_rollup(self, start):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at history, I think this is OK to do (unless anything has changed since then, but I can't think of what would have): #3860 (comment)

I personally decided to defer it to a later change (which obviously never happened) so that we could make sure that it was working correctly without too many other changes to muddy the water in case anything went awry, your risk tolerance may differ.

Comment thread src/sentry/tsdb/snuba.py Outdated
model_columns = self.model_columns(model)

if model_columns is None:
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should raise an exception here if the backend can't handle the request, rather then returning None. Otherwise this is probably going to manifest itself as a weird combination of {Attribute,Type,Value}Error.

Comment thread src/sentry/tsdb/snuba.py
# into
# {group: [(top1, score), ...]}
for k in result:
item_scores = [(v, float(i + 1)) for i, v in enumerate(reversed(result[k]))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the enumeration ordinal just a placeholder since due to topK not returning the counts? I think that'd at least be worth a comment here clarifying this isn't a direct translation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I'll leave a comment.

Comment thread src/sentry/tsdb/snuba.py
# into
# {group: [(timestamp, {top1: score, ...}), ...]}
for k in result:
result[k] = sorted([

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above with regard to score.

Comment thread src/sentry/tsdb/snuba.py Outdated
return result

def get_frequency_series(self, model, items, start, end=None,
rollup=None, environment_id=None, limit=10):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this method actually takes a limit parameter.

Came up with a reasonable helper method that works for all range,
distinct, and frequency queries. The results of this helper sometimes
have to be reformatted slightly to conform to the expected (historical)
TSDB result formats
get_frequency_totals, get_distinct_union now supported. Some more
cleanup and refactoring of the core get_data code.

Added a test that the result shape was correct for every test.
For environment and release, the values stored in snuba are the name
strings, not the db ids, when querying we need to dereference the ids
to get the right values to query for in snuba.
- Mock a snuba response in SnubaTSDB tests and verify that the request
and response formats are as expected.
- Translate model keys to snuba fields, and then translate the result
keys back to model ids in SnubaTSDB. This only applies to environment
and release at the moment.
Return the correct response format from the search function, and test it.
If the query has a particular issue or set of issues its looking at,
only expand the issue-> hash map for those issues.
@alex-hofsteede alex-hofsteede merged commit 468ffba into master Mar 30, 2018
@github-actions github-actions Bot locked and limited conversation to collaborators Dec 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants