
fix: Batch requests to snuba for TSDB queries #14379

Merged: 6 commits into master on Aug 17, 2019

Conversation

@manuzope (Contributor) commented Aug 13, 2019

Some accounts are not getting their weekly reports.

This happens because, when report generation is scheduled, sentry makes a call to snuba with all of the new issue IDs and reopened issue IDs. Snuba has a max query size of 500 kB (i.e. the SQL cannot be more than 500 kB). The max query size is not a problem in most cases, but it does become a problem for larger organizations that open a lot of issues in one week.

Here's the sentry error.

This change sends requests to snuba in batches of 1,000 issue IDs and reassembles the results. 1,000 was an arbitrary number and can be increased. Back-of-the-napkin math for how many IDs we can fit in one request:

Total allowed size: 500 * 1000 = 500,000 bytes
SQL size (without the issue IDs): 290 bytes
Size of one long: 8 bytes

So, the number of longs we can fit in one request: (500,000 - 290) / 8 ≈ 62K
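A quick sketch of that arithmetic in Python (the 500 kB limit, 290-byte base query, and 8 bytes per ID are the figures quoted above, not measured values):

# Back-of-the-napkin estimate of how many issue IDs fit in one snuba query,
# using the figures quoted in this description.
MAX_QUERY_BYTES = 500 * 1000   # snuba's max SQL size (500 kB)
BASE_SQL_BYTES = 290           # size of the SQL without any issue IDs
BYTES_PER_ID = 8               # assumed size of one long

max_ids_per_request = (MAX_QUERY_BYTES - BASE_SQL_BYTES) // BYTES_PER_ID
print(max_ids_per_request)     # 62463, i.e. roughly 62K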

@manuzope changed the title from "fix: Batch requests to snuba" to "fix: Batch requests to snuba for TSDB queries" on Aug 13, 2019
@manuzope force-pushed the SNS-189-paginate-group-scoped-tsdb-queries branch from 47abd33 to 23df9c0 on August 14, 2019 00:07
@manuzope requested a review from a team on August 14, 2019 02:02
@manuzope marked this pull request as ready for review on August 14, 2019 02:02
@lynnagara (Member) commented:

This looks good to me. Why not start with something closer to your estimated max though? Feels like it would be easier to detect whether batch size is too high (we get a lot of errors) rather than too low.

@manuzope force-pushed the SNS-189-paginate-group-scoped-tsdb-queries branch from 4a57bec to cc27f19 on August 16, 2019 16:48
@manuzope (Contributor, Author) commented:

I guess I was being conservative so that emails definitely get sent on Monday. How about raising it to 10,000 and taking it from there?

@tkaemming (Contributor) left a comment


The only real blocking issue I have is with the BATCH_SIZE. There are some other comments inline that I think are worth considering, but I'm not going to block on them.

Otherwise, a few general comments about this:

While it has only manifested itself in this prepare_project_issue_summaries call so far, this doesn't seem like a problem unique to this call (or event reports, potentially). Is this the right level of abstraction to implement this? Would it make more sense to implement this in tsdb.get_sums itself, to avoid this becoming a problem elsewhere? Should we eventually support breaking up long queries in Snuba instead of Sentry (though we'll still have HTTP POST body limits to deal with there)?

For example, I would anticipate this call exceeding size limits at some point as well:

tsdb.get_range(
    tsdb.models.group,
    list(
        project.group_set.filter(
            status=GroupStatus.RESOLVED,
            resolved_at__gte=start,
            resolved_at__lt=stop,
        ).values_list("id", flat=True)
    ),
    start,
    stop,
    rollup=rollup,
).values(),

I'm not sure that there's anything actionable that should come out of that right now, but it's worth considering.

Size of one long: 8 bytes

From my understanding, we're sending the SQL statement as UTF-8 text — there is no processing on the driver side to convert the string representation of integers to C longs with the native driver. (I know for sure that there won't be one with the HTTP driver.)

In this case, the maximum size of a UInt64 (the type of the column in ClickHouse) as a UTF-8 string is going to be len(str(2 ** 64)), which is 20 bytes of text. PostgreSQL doesn't have unsigned integers, so our maximum value would be len(str(2 ** (64 - 1))) (19 bytes of text). The actual max(group_id) that we have in ClickHouse right now is in the low billions, so len(str(10 ** 9)) is still going to be ~10 bytes per ID, plus an extra 2 bytes for the separator (", "), so 12 bytes total. That would work out to (500,000 - 290) / 12 = 41,642.5.

If we're exceeding that number (yikes), that means the organizations that currently send one request that fails will instead send at least ceil(41,642 / 1,000) = 42 requests each. If this math checks out on the query size, we should make this batch size quite a bit higher than 1,000 to avoid repeatedly scanning the same ClickHouse column data.
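A quick sanity check of that arithmetic (the 290-byte base query and 500 kB limit are the figures from the PR description; everything else is plain Python):

import math

# Text sizes of the relevant integer representations.
print(len(str(2 ** 64)))          # 20 -> max UInt64 as UTF-8 text
print(len(str(2 ** (64 - 1))))    # 19 -> max signed 64-bit value as text
print(len(str(10 ** 9)))          # 10 -> a group_id in the low billions

bytes_per_id = 10 + 2             # ~10 digits plus the ", " separator
print((500000 - 290) / bytes_per_id)  # 41642.5 IDs per 500 kB query
print(math.ceil(41642 / 1000))    # 42 requests at a batch size of 1,000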


def chunks(arr, chunk_size):
for i in range(0, len(arr), chunk_size):
yield arr[i : i + chunk_size]
Inline review comment:

If this input set is really as big as the math suggests, it's probably worth using sentry.utils.iterators.chunked instead. It should be a bit better with memory utilization because it won't require copying the entire input set into a new list first. (More on this below.)
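For illustration, a generator-based chunker along those lines might look something like the sketch below (a hypothetical stand-in, not the actual sentry.utils.iterators.chunked implementation):

from itertools import islice

# Sketch of an iterator-friendly chunker: it pulls `size` items at a time
# from any iterable (including a lazily-evaluated queryset iterator)
# without materializing the whole input as a list up front.
def chunked_sketch(iterable, size):
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk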


@chunk_event_counts
def get_event_counts(issue_ids, start, stop, rollup):
    return tsdb.get_sums(tsdb.models.group, issue_ids, start, stop, rollup=rollup)
Inline review comment:

The implementation of the chunk_event_counts decorator basically couples it to the get_event_counts function signature and reduces its utility as a general-purpose, reusable decorator. (In my opinion, single-purpose decorators are a bit of a smell — the snuba codebase is a good — or bad, depending on how you look at it — example of this, with things like split_query that only actually work on one method in practice.)

I think it might be worth reconsidering the structure of this function so that it is more apparent how the result is being constructed:

def get_event_counts(issue_ids, start, stop, rollup):
    combined = {}
    for chunk in chunked(issue_ids, BATCH_SIZE):
        combined.update(tsdb.get_sums(tsdb.models.group, chunk, start, stop, rollup=rollup))
    return combined

@tkaemming (Contributor) commented:

I'm not sure that there's anything actionable that should come out of that right now, but worth considering.

Going to amend this to say that I don't think there is anything that should come out of that train of thought for this PR, but we might want to consider whether or not follow-up changes to generalize this are worth it, or worth writing up and deferring to a later time.

@manuzope (Contributor, Author) commented Aug 17, 2019

I thought a bunch about the abstraction bit while implementing this solution, because I agree that it doesn't feel like it should belong here. It feels like it should belong in Snuba.

However, what we're trying to do is a map-reduce. The map bit is easy: find the longest conditional and split it up. You need to deal with ORs and ANDs, but that syntax isn't that complicated.

I feel it becomes far more complicated at the reduce step, because now you can have all sorts of aggregations. It's still doable, but much easier to solve when one has a specific aggregation in mind.

So, I felt it was just more practical to do it on a case-by-case basis, because it doesn't happen very often.
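To make the reduce-step point concrete: sums from separate batches can be merged by simple addition, while something like an average cannot be rebuilt from per-batch averages alone. A rough sketch (hypothetical helpers, not code from this PR):

# Merging per-batch sums is trivial: add the values for each key.
def merge_sums(batches):
    merged = {}
    for batch in batches:
        for key, value in batch.items():
            merged[key] = merged.get(key, 0) + value
    return merged

# Merging averages is not: each batch has to return (sum, count) pairs so
# the true average can be recomputed at the end.
def merge_averages(batches):
    totals = {}
    for batch in batches:
        for key, (total, count) in batch.items():
            prev_total, prev_count = totals.get(key, (0, 0))
            totals[key] = (prev_total + total, prev_count + count)
    return {key: total / count for key, (total, count) in totals.items()}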

@tkaemming (Contributor) left a comment

So, I felt it was just more practical to do it on a case-by-case basis, because it doesn't happen very often.

Makes sense — I think we can readdress if it does actually become an issue.

Thanks for the updates!

@manuzope merged commit 5c03215 into master on Aug 17, 2019
@manuzope deleted the SNS-189-paginate-group-scoped-tsdb-queries branch on August 17, 2019 00:55
@github-actions (bot) locked and limited conversation to collaborators on Dec 20, 2020