fix: Batch requests to snuba for TSDB queries #14379
Conversation
Force-pushed from 47abd33 to 23df9c0.
This looks good to me. Why not start with something closer to your estimated max though? Feels like it would be easier to detect whether batch size is too high (we get a lot of errors) rather than too low.
Force-pushed from 4a57bec to cc27f19.
I guess I was being conservative so that emails definitely get sent on Monday. How about raising it to 10,000 and taking it from there?
The only really blocking issue I have is with the `BATCH_SIZE`. There are some other comments inline that I think are worth considering, but I'm not going to block on them.
Otherwise, a few general comments about this:
While it has only manifested itself in this `prepare_project_issue_summaries` call so far, this doesn't seem like a problem unique to this call (or event reports, potentially.) Is this the right level of abstraction to implement this? Would it make more sense to implement this in `tsdb.get_sums` itself to avoid this becoming a problem elsewhere? Should we eventually support breaking up long queries in Snuba instead of Sentry (though we'll still have HTTP POST body limits to deal with there.)
For example, I would anticipate this call exceeding size limits at some point as well:
sentry/src/sentry/tasks/reports.py
Lines 169 to 181 in dfafa26
```python
tsdb.get_range(
    tsdb.models.group,
    list(
        project.group_set.filter(
            status=GroupStatus.RESOLVED,
            resolved_at__gte=start,
            resolved_at__lt=stop,
        ).values_list("id", flat=True)
    ),
    start,
    stop,
    rollup=rollup,
).values(),
```
I'm not sure that there's anything actionable that should come out of that right now, but it's worth considering.
> Size of one long: 8 bytes
From my understanding, we're sending the SQL statement as UTF-8 text — there is no processing on the driver side to convert the string representation of integers to C longs with the native driver. (I know that there for sure won't be one with the HTTP driver.)
In this case, the maximum size of a UInt64 (the type of the column in ClickHouse) as a UTF-8 string is going to be `len(str(2 ** 64))`, which is 20 bytes of text. PostgreSQL doesn't have unsigned integers, so our maximum value would be `len(str(2 ** (64 - 1)))` (19 bytes of text.) The actual `max(group_id)` that we have in ClickHouse right now is in the low billions, so `len(str(10 ** 9))` is still going to be ~10 bytes per ID, with an extra 2 bytes for the separator (`, `), so 12 bytes total. That would work out as `(500,000 - 290) / 12 = 41642.5`.
If we're exceeding that number (yikes) that means that the organizations that are currently sending one request that fails will instead send at least `ceil(41,642 / 1000) = 42` requests each. If this math checks out on the query size, we should make this batch size quite a bit higher than 1000 to avoid repeatedly scanning the same ClickHouse column data.
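The arithmetic above can be sketched as a few lines of Python. The 500 kB limit and the ~290 bytes of fixed query overhead are figures from this thread, not measured values:

```python
import math

# Assumptions taken from the discussion above, not measured:
MAX_QUERY_BYTES = 500_000       # Snuba's max SQL query size
QUERY_OVERHEAD_BYTES = 290      # fixed query text around the IN (...) list

bytes_per_id = len(str(10 ** 9)) + 2  # ~10 digits per group ID plus ", " separator
max_ids = (MAX_QUERY_BYTES - QUERY_OVERHEAD_BYTES) // bytes_per_id
requests_per_org = math.ceil(max_ids / 1000)  # at the current BATCH_SIZE of 1000

print(max_ids, requests_per_org)  # 41642 42
```

So an organization right at the size limit would go from one failing request to roughly 42 successful ones at a batch size of 1000.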
src/sentry/tasks/reports.py (Outdated)
```python
def chunks(arr, chunk_size):
    for i in range(0, len(arr), chunk_size):
        yield arr[i : i + chunk_size]
```
If this input set is really as big as the math suggests, it's probably worth using `sentry.utils.iterators.chunked` instead. It should be a bit better with memory utilization because it won't require copying the entire input set into a new list first. (More on this below.)
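For illustration, a generator-based chunker along these lines avoids the list copy, since it only ever materializes one chunk at a time (the actual `sentry.utils.iterators.chunked` implementation may differ):

```python
from itertools import islice

def chunked(iterable, size):
    # Consume the input lazily: unlike slicing a list, this never holds the
    # full input in memory, only the current chunk of `size` items.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Works on any iterable, including a lazy queryset iterator:
for batch in chunked(range(7), 3):
    print(batch)  # [0, 1, 2] then [3, 4, 5] then [6]
```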
src/sentry/tasks/reports.py (Outdated)
```python
@chunk_event_counts
def get_event_counts(issue_ids, start, stop, rollup):
    return tsdb.get_sums(tsdb.models.group, issue_ids, start, stop, rollup=rollup)
```
The implementation of the `chunk_event_counts` decorator basically couples it to the `get_event_counts` function signature and reduces its utility as a general-purpose, reusable decorator. (In my opinion, single-purpose decorators are a bit of a smell — the `snuba` codebase is a good — or bad, depending on how you look at it — example of this, with things like `split_query` that only actually work on one method in practice.)
I think it might be worth reconsidering the structure of this function so that it's more apparent how the result is being constructed.
```python
def get_event_counts(issue_ids, start, stop, rollup):
    combined = {}
    for chunk in chunked(issue_ids, BATCH_SIZE):
        combined.update(tsdb.get_sums(tsdb.models.group, chunk, start, stop, rollup=rollup))
    return combined
```
Going to amend this to say that I don't think there is anything that should come out of that train of thought for this PR, but we might want to consider whether or not follow-up changes to generalize this are worth it, or worth writing up and deferring to a later time.
I thought a bunch about the abstraction bit while implementing this solution because I agree that it doesn't feel like it should belong here; it feels like it should belong in Snuba. However, what we're trying to do is a map-reduce. The map bit is easy: find the longest conditional and split it up. I feel it becomes far more complicated at the reduce step, because now you can have all sorts of aggregations. Still doable, but much easier to solve when one has a specific aggregation in mind. So, I felt it was just more practical to do it on a case-by-case basis because it doesn't happen very often.
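The map-reduce shape being described can be sketched as follows. The names here are illustrative only: splitting the ID list is generic, but the reduce step depends on the aggregation. For per-issue sums over disjoint chunks, merging the result dicts is enough, whereas an average or percentile could not be recombined this way:

```python
def batched_query(query_fn, issue_ids, batch_size):
    # Map: issue each batch of IDs as its own query.
    # Reduce: merge the per-issue results. This only works because the
    # chunks are disjoint and the aggregation is per-key (e.g. sums).
    combined = {}
    for i in range(0, len(issue_ids), batch_size):
        combined.update(query_fn(issue_ids[i : i + batch_size]))
    return combined

# Hypothetical stand-in for a tsdb.get_sums call:
fake_counts = {1: 10, 2: 20, 3: 30}
result = batched_query(lambda ids: {i: fake_counts[i] for i in ids}, [1, 2, 3], 2)
print(result)  # {1: 10, 2: 20, 3: 30}
```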
> So, I felt it was just more practical to do it on a case-by-case basis cause it doesn't happen very often.
Makes sense — I think we can readdress if it does actually become an issue.
Thanks for the updates!
Some accounts are not getting their weekly reports.
This is because when the report generation is scheduled, Sentry makes a call to Snuba with all the new issue IDs and reopened issue IDs. Snuba has a max query size of 500 kB (i.e. the SQL cannot be more than 500 kB). The max query size is not a problem in most cases but does become a problem for larger organizations that open a lot of issues in one week.
Here's the sentry error.
This change batches requests to Snuba in batches of `1000` and reassembles the results. `1000` was an arbitrary number and can be increased. Back-of-the-napkin math for how many we can have:
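One way to sketch that napkin math, using the figures from the review thread (a ~500 kB SQL limit, ~290 bytes of fixed query text, and ~12 bytes per comma-separated group ID — all assumptions from the discussion, not measured):

```python
# Roughly how many group IDs fit in a single Snuba query before
# hitting the 500 kB SQL size limit.
ids_per_query = (500_000 - 290) // 12
print(ids_per_query)  # 41642
```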