fix: Batch requests to snuba for TSDB queries #14379
Conversation
Force-pushed from 47abd33 to 23df9c0.
This looks good to me. Why not start with something closer to your estimated max though? Feels like it would be easier to detect whether batch size is too high (we get a lot of errors) rather than too low.
Force-pushed from 4a57bec to cc27f19.
I guess I was being conservative so that emails definitely get sent on Monday. How about raising it to 10,000 and taking it from there?
The only really blocking issue I have is with the `BATCH_SIZE`. There are some other comments inline that I think are worth considering, but I'm not going to block on them.
Otherwise, a few general comments about this:
While it has only manifested itself in this `prepare_project_issue_summaries` call so far, this doesn't seem like a problem unique to this call (or event reports, potentially.) Is this the right level of abstraction to implement this? Would it make more sense to implement this in `tsdb.get_sums` itself to avoid this becoming a problem elsewhere? Should we eventually support breaking up long queries in Snuba instead of Sentry (though we'll still have HTTP POST body limits to deal with there.)
For example, I would anticipate this call exceeding size limits at some point as well:
sentry/src/sentry/tasks/reports.py
Lines 169 to 181 in dfafa26
```python
tsdb.get_range(
    tsdb.models.group,
    list(
        project.group_set.filter(
            status=GroupStatus.RESOLVED,
            resolved_at__gte=start,
            resolved_at__lt=stop,
        ).values_list("id", flat=True)
    ),
    start,
    stop,
    rollup=rollup,
).values(),
```
I'm not sure that there's anything actionable that should come out of that right now, but it's worth considering.
> Size of one long: 8 bytes
From my understanding, we're sending the SQL statement as UTF-8 text — there is no processing on the driver side to convert the string representation of integers to C longs with the native driver. (I know that there for sure won't be one with the HTTP driver.)
In this case, the maximum size of a UInt64 (the type of the column in ClickHouse) as a UTF-8 string is going to be `len(str(2 ** 64))`, which is 20 bytes of text. PostgreSQL doesn't have unsigned integers, so our maximum value would be `len(str(2 ** (64 - 1)))` (19 bytes of text.) The actual `max(group_id)` that we have in ClickHouse right now is in the low billions, so `len(str(10 ** 9))` is still going to be ~10 bytes per ID, with an extra 2 bytes for the separator (`, `), so 12 bytes total. That would work out as `(500,000 - 290) / 12 = 41642.5`.
If we're exceeding that number (yikes) that means that the organizations that are currently sending one request that fails will instead send at least `ceil(41,642 / 1000) = 42` requests each. If this math checks out on the query size, we should make this batch size quite a bit higher than 1000 to avoid repeatedly scanning the same ClickHouse column data.
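The arithmetic above can be sketched as a few lines of Python. The 500 kB limit and the ~290 bytes of fixed query overhead are figures from this thread, not measured values:

```python
import math

# Assumptions taken from the discussion above, not measured:
MAX_QUERY_BYTES = 500_000       # Snuba's max SQL query size
QUERY_OVERHEAD_BYTES = 290      # fixed query text around the IN (...) list

bytes_per_id = len(str(10 ** 9)) + 2  # ~10 digits per group ID plus ", " separator
max_ids = (MAX_QUERY_BYTES - QUERY_OVERHEAD_BYTES) // bytes_per_id
requests_per_org = math.ceil(max_ids / 1000)  # at the current BATCH_SIZE of 1000

print(max_ids, requests_per_org)  # 41642 42
```

So an organization right at the size limit would go from one failing request to roughly 42 successful ones at a batch size of 1000.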
src/sentry/tasks/reports.py (Outdated)
```python
def chunks(arr, chunk_size):
    for i in range(0, len(arr), chunk_size):
        yield arr[i : i + chunk_size]
```
If this input set is really as big as the math suggests, it's probably worth using `sentry.utils.iterators.chunked` instead. It should be a bit better with memory utilization because it won't require copying the entire input set into a new list first. (More on this below.)
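For illustration, a generator-based chunker along these lines avoids the list copy, since it only ever materializes one chunk at a time (the actual `sentry.utils.iterators.chunked` implementation may differ):

```python
from itertools import islice

def chunked(iterable, size):
    # Consume the input lazily: unlike slicing a list, this never holds the
    # full input in memory, only the current chunk of `size` items.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Works on any iterable, including a lazy queryset iterator:
for batch in chunked(range(7), 3):
    print(batch)  # [0, 1, 2] then [3, 4, 5] then [6]
```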
src/sentry/tasks/reports.py (Outdated)
```python
@chunk_event_counts
def get_event_counts(issue_ids, start, stop, rollup):
    return tsdb.get_sums(tsdb.models.group, issue_ids, start, stop, rollup=rollup)
```
The implementation of the `chunk_event_counts` decorator basically couples it to the `get_event_counts` function signature and reduces its utility as a general-purpose, reusable decorator. (In my opinion, single-purpose decorators are a bit of a smell — the `snuba` codebase is a good — or bad, depending on how you look at it — example of this, with things like `split_query` that only actually work on one method in practice.)
I think it might be worth reconsidering the structure of this function so that it's more apparent how the result is being constructed.
```python
def get_event_counts(issue_ids, start, stop, rollup):
    combined = {}
    for chunk in chunked(issue_ids, BATCH_SIZE):
        combined.update(tsdb.get_sums(tsdb.models.group, chunk, start, stop, rollup=rollup))
    return combined
```
Going to amend this to say that I don't think there is anything that should come out of that train of thought for this PR, but we might want to consider whether or not follow-up changes to generalize this are worth it, or worth writing up and deferring to a later time.
I thought a bunch about the abstraction bit while implementing this solution because I agree that it doesn't feel like it should belong here; it feels like it should belong in Snuba. However, what we're trying to do is a map-reduce. The map bit is easy: find the longest conditional and split it up. I feel it becomes far more complicated at the reduce step, because now you can have all sorts of aggregations. Still doable, but much easier to solve when one has a specific aggregation in mind. So, I felt it was just more practical to do it on a case-by-case basis because it doesn't happen very often.
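The map-reduce shape being described can be sketched as follows. The names here are illustrative only: splitting the ID list is generic, but the reduce step depends on the aggregation. For per-issue sums over disjoint chunks, merging the result dicts is enough, whereas an average or percentile could not be recombined this way:

```python
def batched_query(query_fn, issue_ids, batch_size):
    # Map: issue each batch of IDs as its own query.
    # Reduce: merge the per-issue results. This only works because the
    # chunks are disjoint and the aggregation is per-key (e.g. sums).
    combined = {}
    for i in range(0, len(issue_ids), batch_size):
        combined.update(query_fn(issue_ids[i : i + batch_size]))
    return combined

# Hypothetical stand-in for a tsdb.get_sums call:
fake_counts = {1: 10, 2: 20, 3: 30}
result = batched_query(lambda ids: {i: fake_counts[i] for i in ids}, [1, 2, 3], 2)
print(result)  # {1: 10, 2: 20, 3: 30}
```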
> So, I felt it was just more practical to do it on a case-by-case basis cause it doesn't happen very often.
Makes sense — I think we can readdress if it does actually become an issue.
Thanks for the updates!
Some accounts are not getting their weekly reports.
This is because when the report generation is scheduled, Sentry makes a call to Snuba with all the new issue IDs and reopened issue IDs. Snuba has a max query size of 500 kB (i.e. the SQL cannot be more than 500 kB). The max query size is not a problem in most cases but does become a problem for larger organizations that open a lot of issues in one week.
Here's the sentry error.
This change batches requests to Snuba in batches of `1000` and reassembles the results. `1000` was an arbitrary number and can be increased. Back-of-the-napkin math for how many we can have:
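One way to sketch that napkin math, using the figures from the review thread (a ~500 kB SQL limit, ~290 bytes of fixed query text, and ~12 bytes per comma-separated group ID — all assumptions from the discussion, not measured):

```python
# Roughly how many group IDs fit in a single Snuba query before
# hitting the 500 kB SQL size limit.
ids_per_query = (500_000 - 290) // 12
print(ids_per_query)  # 41642
```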