feat: Snuba tsdb implementation. by alex-hofsteede · Pull Request #7834 · getsentry/sentry

alex-hofsteede · 2018-03-28T20:35:40Z

This introduces a utility to make queries to the snuba service from a set
of filter/group/aggregation parameters. The utility does the work of translating
local / postgres ids to the values stored in snuba.

Included is a TSDB backend implementation that uses this utility to construct
time series queries and return the correctly formatted results that TSDB clients expect.

Also included is a (mostly-stubbed) snuba backend for TagStore that currently only has the tag search method implemented.

bretthoerner · 2018-03-28T20:41:42Z

-        # If nothing actually matches the requested range, just return the
-        # lowest resolution interval.
-        return list(self.rollups)[-1]
+    def get_optimal_rollup(self, start):


Will need @tkaemming to chime in on this, I'm pretty clueless on TSDB.

Yeah, this was a bit of a drive-by attempt at fixing the bug mentioned in the comments and simplifying the code, but it doesn't actually need to be part of this PR.

Looking back at history, I think this is OK to do (unless anything has changed since then, but I can't think of what would have): #3860 (comment)

I personally decided to defer it to a later change (which obviously never happened) so that we could make sure that it was working correctly without too many other changes to muddy the water in case anything went awry. Your risk tolerance may differ.

bretthoerner · 2018-03-28T20:54:17Z

+            col = 'tags[{}]'.format(tag)
+            if val == ANY:
+                conditions.append((col, 'IS NOT NULL', None))
+            elif val == EMPTY:


I'm struggling to find/remember what EMPTY and ANY really mean again, but the other tagstore implementations seem to do something very different than checking whether the tag is null?

for k, v in tag_lookups:
    if v is EMPTY:
        return None

😕

Probably could use @tkaemming here too because looking at the existing impls I'm more confused than not.

IIRC, EMPTY is basically the equivalent of calling queryset.none() and only appears to be generated as part of this function:

sentry/src/sentry/search/utils.py

Lines 20 to 33 in 1244936

def parse_release(project, value):
    # TODO(dcramer): add environment support
    if value == 'latest':
        value = Release.objects.filter(
            organization_id=project.organization_id,
            projects=project,
        ).extra(select={
            'sort': 'COALESCE(date_released, date_added)',
        }).order_by('-sort').values_list(
            'version', flat=True
        ).first()
        if value is None:
            return EMPTY
    return value

This basically precludes the necessity for doing any of the rest of the search (since we only AND conditions together, hitting this branch means there can't possibly be any results), so in my opinion we should remove this from tagstore if it's not too much trouble.

we should remove this from tagstore

Do you just mean something at a higher level up should return None, so tagstore never has to see EMPTY? Or something different?

In the meantime I've just put in an early return if there's any EMPTY tags, but open to considering a refactor where tagstore doesn't see EMPTY at all.

Do you just mean something at a higher level up should return None, so tagstore never has to see EMPTY?

Yep.

In the meantime I've just put in an early return if there's any EMPTY tags, but open to considering a refactor where tagstore doesn't see EMPTY at all.

Makes sense to me. 👍
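For reference, the early return agreed on in this thread could look roughly like the following. This is a minimal sketch, not the code in this PR: `build_tag_conditions` is a hypothetical helper name, `ANY` and `EMPTY` are stand-in sentinel objects, and the `'='` operator for ordinary values is an assumption about the condition format.

```python
# Hypothetical sentinels standing in for sentry's ANY / EMPTY markers.
ANY = object()
EMPTY = object()


def build_tag_conditions(tags):
    """Return a list of (column, operator, value) conditions, or None when
    the search cannot possibly match anything."""
    # Conditions are ANDed together, so a single EMPTY tag means no results.
    if any(val is EMPTY for val in tags.values()):
        return None

    conditions = []
    for tag, val in sorted(tags.items()):
        col = 'tags[{}]'.format(tag)
        if val is ANY:
            # "Any value" is expressed as the tag being present at all.
            conditions.append((col, 'IS NOT NULL', None))
        else:
            # Assumed operator for a plain equality match.
            conditions.append((col, '=', val))
    return conditions
```

Returning `None` before any condition is built keeps `EMPTY` from ever reaching the query layer, which is the behavior both reviewers converge on above.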

bretthoerner

I didn't review the TSDB side. Mostly small stuff/questions.

bretthoerner · 2018-03-28T21:20:48Z

+    # passed-in keys for project_id, or indrectly (eg the set of projects
+    # related to the queried set of issues or releases)
+    project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)]
+    if all(not ids for ids in project_ids):


I think this is more easily stated/read as if not any(project_ids):

lol, yeah that's a whole lot better
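A quick check that the two spellings agree, assuming `project_ids` is a list of (possibly empty) lists as in the hunk above. The function names are only for illustration.

```python
def no_projects_verbose(project_ids):
    # Original form from the diff: every entry is empty/falsy.
    return all(not ids for ids in project_ids)


def no_projects_concise(project_ids):
    # Suggested form: no entry is truthy.
    return not any(project_ids)


# A few representative inputs, including the empty outer list.
SAMPLES = [[], [[]], [[], []], [[1], []], [[], [2, 3]], [[1], [2]]]
```

Both forms rely only on the truthiness of each inner list, so they return the same result for every input of this shape.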

bretthoerner · 2018-03-28T21:21:51Z

+        conditions.append((col, 'IN', keys))
+
+    # project_ids will be the set of projects either referenced directly as
+    # passed-in keys for project_id, or indrectly (eg the set of projects


nit: indirectly

bretthoerner · 2018-03-28T21:23:24Z

+    project_ids = [get_project_ids(k, ids) for k, ids in six.iteritems(filter_keys)]
+    if all(not ids for ids in project_ids):
+        raise Exception("No project_id filter, or none could be inferred from other filters.")
+    project_ids = list(set.intersection(*[set(ids) for ids in project_ids if ids]))


Is it intersection and not union? If it doesn't go with the largest possible selection of project_ids then why not just have the user pass in project_ids themselves? I guess I'm not sure why we don't require them as a top level param anyway?

For something like get_group_tag_value_count(project_id, group_id, environment_id) in tagstore, if you passed in project_id = 1, an environment_id from project 2, and a group_id from project 3, then:

- I guess you have a bug anyway, because why are you sending these unrelated things?

- It's pointless to do a query across the union of all 3 projects (assuming all conditions are ANDed together), because there will be no matches.

Oh, to answer the other question: we don't use user-passed project_ids exclusively because sometimes they don't exist. E.g. with TSDBModel.frequent_releases_by_group and others, we only have the group_id, release_id, environment_id, etc.

So the main purpose of this whole block is just to infer the project_id for snuba purposes when we are not passed a project id. In the cases where we are passed a project id, it doesn't really do much, yeah.

OK, I definitely buy it not being required. But I wonder if it should still be either:

- top-level and not required, but if provided it is strictly used as the project_id filter

- keep what you have and assert that the projects match, because I think you nailed my confusion: it just feels weird to magically pull out project_ids and have some of them vanish via intersection

@bretthoerner I cleaned this up so that we use the project_ids directly if they are passed, and the union of any projects we can infer from other models if not.
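The resolution rule described in that last comment can be sketched as below. This is an illustration under assumptions, not the merged code: `resolve_project_ids` and the `infer_project_ids` callback are hypothetical names, and `ValueError` stands in for the bare `Exception` raised in the diff.

```python
def resolve_project_ids(filter_keys, infer_project_ids):
    """Pick the project ids to query.

    filter_keys: dict of column name -> list of ids.
    infer_project_ids: hypothetical callable (column, ids) -> list of
        project ids related to those ids (e.g. via a database lookup).
    """
    # Explicit project ids win and are used as-is.
    direct = filter_keys.get('project_id')
    if direct:
        return sorted(set(direct))

    # Otherwise take the union of everything inferable from other filters.
    inferred = set()
    for col, ids in filter_keys.items():
        inferred.update(infer_project_ids(col, ids))

    if not inferred:
        raise ValueError(
            "No project_id filter, or none could be inferred from other filters.")
    return sorted(inferred)
```

Using the union only as a fallback avoids the surprise raised earlier in the thread, where explicitly passed project ids could silently vanish through an intersection.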

bretthoerner · 2018-03-28T21:25:39Z

+def get_project_issues(project_ids):
+    """
+    Get a list of issues and associated fingerprint hashes for a project.
+    """


This is returning every hash a project has? That definitely won't fit in RAM or in a query, and what's the difference between doing primary_hash IN (every_hash_in_the_project) and project_id = FOO?

Doesn't this method need to take a group_id argument and only return those hashes?

Hmm, this could be a problem. In the case where you are only looking at a single group, we should reduce the list here for sure. The snuba code already reduces the issue expression to only the ones that are needed to filter/group by, but I guess it's better to filter here so we don't have to send as many hashes to snuba.

The thing is, though, that things like TSDBModel.frequent_issues_by_project would require sending all the issue->hash mappings so that we could compute the most frequently seen issues. If that list is not going to fit in memory, I guess we have a problem.

Just to repeat what I said in #snuba, my only idea is to select primary_hash, count() as c from sentry_dist where project_id = X group by primary_hash order by c desc limit 100 and then on Sentry side we'd have to reconcile the (hash, count) results into the biggest groups.

I definitely think "send all project hashes in the query" won't work, though. I could be wrong!
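The Sentry-side reconciliation step in that idea could be sketched as follows. Everything here is hypothetical: `top_groups` is an illustrative name, `hash_counts` stands for the `(primary_hash, c)` rows returned by the query above, and `hash_to_group` for a hash -> group id lookup that would come from the database.

```python
from collections import defaultdict


def top_groups(hash_counts, hash_to_group, limit=10):
    """Fold per-hash counts into per-group totals and return the largest.

    hash_counts: iterable of (primary_hash, count) rows.
    hash_to_group: dict mapping primary_hash -> group id.
    """
    totals = defaultdict(int)
    for primary_hash, count in hash_counts:
        group_id = hash_to_group.get(primary_hash)
        if group_id is None:
            # Hash with no known group: skip rather than guess.
            continue
        totals[group_id] += count
    # Largest total first; group id breaks ties for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Because the query only returns the top hashes, a group whose events are spread across many low-count hashes can be undercounted, so the result is an approximation of the true ranking.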

bretthoerner · 2018-03-28T21:29:30Z

+    equivalent ones in snuba.
+    """
+    mappings = {
+        'environment': (Environment, 'name'),


nit: I'd prefer to call it an environment_id (and so on) when it is an ID, as we do elsewhere. I guess that'd need another level of translation to Snuba columns though, bleh.
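To make the translation step concrete, here is a rough sketch of mapping database ids to the values stored in snuba and keeping a reverse map for the results. `translate_ids` and the `lookups` dict are hypothetical; in the PR the lookup is driven by model classes such as `Environment`, which are replaced here by plain dicts so the example is self-contained.

```python
def translate_ids(filter_keys, lookups):
    """Translate database ids to snuba values.

    filter_keys: dict of column -> list of database ids.
    lookups: hypothetical dict of column -> {database id: snuba value}.
    Returns (forward, reverse): the translated filters, plus per-column
    maps from snuba value back to the database id.
    """
    forward, reverse = {}, {}
    for col, ids in filter_keys.items():
        mapping = lookups.get(col)
        if mapping is None:
            # Column stores ids directly; nothing to translate.
            forward[col] = list(ids)
            continue
        forward[col] = [mapping[i] for i in ids if i in mapping]
        reverse[col] = {mapping[i]: i for i in ids if i in mapping}
    return forward, reverse
```

The reverse map is what lets a caller rewrite result keys from names back to ids, so clients keep seeing the ids they passed in.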

bretthoerner · 2018-03-29T17:42:36Z

Nice, it's looking good to me except I didn't check TSDB and I'm not sure if you want to do something about get_project_issues now or later?

Thanks again for putting this up so early, having the query util in a shareable state will be useful to break off more tagstore pieces.

alex-hofsteede · 2018-03-29T20:35:25Z

@bretthoerner I fixed get_project_issues to scope down to only the referenced set of issue_ids if they are present. That should at least make this not terrible in the case where we know which issue we are looking for.

I will defer the "group by issue for all issues" problem for later.

bretthoerner

Nice, LGTM. But still need 👀 on TSDB before merge.

tkaemming

Just a few notes inline from looking at the TSDB bits. I'm assuming that the plan here is just to work out any serious issues from doing parity testing with real data so I didn't review super closely, and I'm also assuming Brett covered anything in utils.snuba.

tkaemming · 2018-03-29T22:04:10Z

+    def get_optimal_rollup(self, start):
+        """
+        Return the size (in seconds) of the finest-granularity available rollup
+        that will have data available in the given time range.


This is a much better explanation, good clarification. Though, it's not a time range any longer, I guess, so it might make sense to say "available between the provided start time and the current timestamp"?
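As an illustration of the docstring under discussion, a standalone version of the lookup might look like this. It is a sketch under assumptions: rollups are passed in as `(rollup_seconds, samples_retained)` pairs ordered finest first, timestamps are plain Unix seconds, and falling back to the coarsest rollup when nothing covers the range is assumed rather than taken from this PR.

```python
def get_optimal_rollup(rollups, start, now):
    """Return the size (in seconds) of the finest rollup that still has
    data available between `start` and `now`.

    rollups: list of (rollup_seconds, samples_retained), finest first.
    start, now: Unix timestamps in seconds.
    """
    age = now - start
    for rollup, samples in rollups:
        # A rollup retains rollup * samples seconds of history.
        if rollup * samples >= age:
            return rollup
    # Assumed fallback: nothing covers the range, use the coarsest rollup.
    return rollups[-1][0]
```

With rollups of `(10, 360)`, `(3600, 24)` and `(86400, 90)`, a start time 30 minutes ago resolves to the 10-second rollup, while a start time two hours ago falls through to the hourly one.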

tkaemming · 2018-03-29T22:24:02Z

+        model_columns = self.model_columns(model)
+
+        if model_columns is None:
+            return None


This seems like it should raise an exception here if the backend can't handle the request, rather than returning None. Otherwise this is probably going to manifest itself as a weird combination of {Attribute,Type,Value}Error.
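A minimal sketch of the suggested behavior, with hypothetical names throughout: `UnsupportedModelError`, `MODEL_COLUMNS` and `get_columns_or_raise` are illustrative and are not identifiers from this PR.

```python
class UnsupportedModelError(Exception):
    """Raised when the backend has no column mapping for a model."""


# Hypothetical mapping of model name -> (group column, aggregate column).
MODEL_COLUMNS = {
    'group': ('issue', None),
    'users_affected_by_group': ('issue', 'user_id'),
}


def get_columns_or_raise(model):
    columns = MODEL_COLUMNS.get(model)
    if columns is None:
        # Fail loudly at the point where the problem is known, instead of
        # returning None and letting callers trip over it later.
        raise UnsupportedModelError(
            '{!r} is not supported by this backend'.format(model))
    return columns
```

Raising at the lookup keeps the failure close to its cause, rather than surfacing later as an unrelated `AttributeError` or `TypeError` in whatever code consumes the result.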

tkaemming · 2018-03-29T22:31:25Z

+        # into
+        #    {group: [(top1, score), ...]}
+        for k in result:
+            item_scores = [(v, float(i + 1)) for i, v in enumerate(reversed(result[k]))]


Is the enumeration ordinal just a placeholder, since topK doesn't return the counts? I think that'd at least be worth a comment here clarifying this isn't a direct translation.

Correct. I'll leave a comment.
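To show what the placeholder scores look like, here is the list comprehension from the hunk wrapped in a small function. `rank_to_scores` is a hypothetical name, and the example assumes each list arrives ordered most frequent first.

```python
def rank_to_scores(result):
    """Turn {group: [top1, top2, ...]} into {group: [(item, score), ...]}.

    The scores are ordinal placeholders derived from list position, not
    real counts: the last item gets 1.0 and the first gets the highest.
    """
    return {
        k: [(v, float(i + 1)) for i, v in enumerate(reversed(items))]
        for k, items in result.items()
    }
```

For `{1: ['a', 'b', 'c']}` this yields `{1: [('c', 1.0), ('b', 2.0), ('a', 3.0)]}`, so the relative order is preserved but the magnitudes carry no meaning.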

tkaemming · 2018-03-29T22:33:08Z

+        # into
+        #    {group: [(timestamp, {top1: score, ...}), ...]}
+        for k in result:
+            result[k] = sorted([


Same comment as above with regard to score.

tkaemming · 2018-03-29T22:44:54Z

+        return result
+
+    def get_frequency_series(self, model, items, start, end=None,
+                             rollup=None, environment_id=None, limit=10):


I don't think this method actually takes a limit parameter.

- Mock a snuba response in SnubaTSDB tests and verify that the request and response formats are as expected.
- Translate model keys to snuba fields, and then translate the result keys back to model ids in SnubaTSDB. This only applies to environment and release at the moment.

alex-hofsteede requested review from bretthoerner and tkaemming March 28, 2018 20:36

bretthoerner reviewed Mar 28, 2018

bretthoerner suggested changes Mar 28, 2018

alex-hofsteede force-pushed the hoff/snuba-tsdb branch from 7828853 to a5a3748 Compare March 29, 2018 17:20

bretthoerner approved these changes Mar 29, 2018

alex-hofsteede force-pushed the hoff/snuba-tsdb branch from de6ff49 to 41b6039 Compare March 29, 2018 21:39

tkaemming reviewed Mar 29, 2018

alex-hofsteede added 19 commits March 30, 2018 10:59

feat(snuba): Add basic TSDB backend for snuba service

9150dd8

Finish implementing most TSDB methods

9292fe4

Came up with a reasonable helper method that works for all range, distinct, and frequency queries. The results of this helper sometimes have to be reformatted slightly to conform to the expected (historical) TSDB result formats

Support more methods

58c0a65

get_frequency_totals, get_distinct_union now supported. Some more cleanup and refactoring of the core get_data code. Added a test that the result shape was correct for every test.

Translate columns

e32e2c5

For environment and release, the values stored in snuba are the name strings, not the db ids. When querying, we need to dereference the ids to get the right values to query for in snuba.

Move common snuba query logic to util

dce4ffd

Start on Snuba TagStore

6401e5f

Handle ANY/EMPTY tag searches using IS (NOT) NULL

210abfa

Test tag search a bit better.

2bc5c5a

Return the correct response format from the search function, and test it.

Clean up some tests and pass through arrayjoin param

48c9e03

Clarify comments

8067cec

Mock snuba response for basic get_range test

abb9b0f

No reason to bump this

876e9df

review feedback

10815ad

Clarify how we get a set of project_ids to work with

9af034e

Just return empty if there is a tag = EMPTY clause

2983a41

fix test for EMPTY as behavior has changed

436e329

Only get referenced issues.

438356d

If the query has a particular issue or set of issues it's looking at, only expand the issue -> hash map for those issues.

review feedback

325606e

revert base tsdb changes

2252366

alex-hofsteede force-pushed the hoff/snuba-tsdb branch from 46a6bf0 to 2252366 Compare March 30, 2018 18:00

alex-hofsteede merged commit 468ffba into master Mar 30, 2018

github-actions Bot locked and limited conversation to collaborators Dec 22, 2020
