
Conversation

@JoshFerge
Member

Using our base query, we found that removing the string conversion / dash stripping on the replay_id column reduces memory usage by ~26% and improves speed by ~21%. The queries are pasted in at the bottom.

This PR creates a new field, UUIDField, because we have to validate / convert values before sending our query to ClickHouse, and our minimum ClickHouse version lacks helpful UUID functions, so we have to work around that.
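As a rough illustration of the kind of validation / conversion such a field performs (this is a hypothetical sketch, not the actual Snuba implementation; the class and method names are invented), the standard library's `uuid` module can do the heavy lifting:

```python
import uuid


class UUIDField:
    """Hypothetical sketch of a query field that validates a user-supplied
    value and normalizes it to the dashed form a ClickHouse UUID column
    expects, before the query is sent."""

    def validate(self, value: str) -> str:
        # uuid.UUID accepts both dashed and undashed 32-char hex input
        # and raises ValueError for anything that is not a valid UUID.
        return str(uuid.UUID(value))
```

Pushing validation to query-build time means malformed IDs fail fast with a clear error instead of silently matching nothing in ClickHouse.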

We also now have to strip dashes from replay_ids in the post-processing of our queries. We could do the same for error_ids and trace_ids in future PRs if we want that optimization on those fields as well.
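The post-processing step might look roughly like this (a minimal sketch with a hypothetical helper name; the real code lives in Snuba's result-processing layer):

```python
def strip_uuid_dashes(rows):
    """Hypothetical sketch: ClickHouse now returns replay_id as a dashed
    UUID string, but downstream consumers expect the undashed hex form,
    so strip dashes from each result row after the query returns."""
    for row in rows:
        if "replay_id" in row:
            row["replay_id"] = row["replay_id"].replace("-", "")
    return rows
```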

SET send_logs_level = 'trace'
SELECT
    project_id AS _snuba_project_id,
    replaceAll(toString(replay_id), '-', '') AS _snuba_replay_id,
    max(timestamp AS _snuba_timestamp) AS _snuba_finished_at,
    sum(length(error_ids AS _snuba_error_ids)) AS _snuba_count_errors,
    groupArray(1)(environment AS _snuba_environment)[1] AS _snuba_agg_environment,
    _snuba_replay_id,
    notEmpty(groupArray(is_archived AS _snuba_is_archived)) AS _snuba_isArchived
FROM replays_local
WHERE (_snuba_project_id IN [11276])
    AND (_snuba_timestamp < toDateTime('2023-03-11T00:30:34', 'Universal'))
    AND (_snuba_timestamp >= toDateTime('2022-03-10T00:30:34', 'Universal'))
GROUP BY _snuba_project_id, _snuba_replay_id
HAVING (min(segment_id AS _snuba_segment_id) = 0)
    AND (_snuba_finished_at < toDateTime('2023-03-11T00:30:34', 'Universal'))
    AND (_snuba_isArchived = 0)
ORDER BY _snuba_count_errors ASC
LIMIT 0, 10

Peak memory usage (for query): 612.65 MiB.
10 rows in set. Elapsed: 2.912 sec. Processed 7.94 million rows, 341.42 MB (2.73 million rows/s., 117.25 MB/s.)
SELECT
    project_id AS _snuba_project_id,
    replay_id,
    max(timestamp AS _snuba_timestamp) AS _snuba_finished_at,
    sum(length(error_ids AS _snuba_error_ids)) AS _snuba_count_errors,
    groupArray(1)(environment AS _snuba_environment)[1] AS _snuba_agg_environment,
    notEmpty(groupArray(is_archived AS _snuba_is_archived)) AS _snuba_isArchived
FROM replays_local
WHERE (_snuba_project_id IN [11276])
    AND (_snuba_timestamp < toDateTime('2023-03-11T00:30:34', 'Universal'))
    AND (_snuba_timestamp >= toDateTime('2022-03-10T00:30:34', 'Universal'))
GROUP BY _snuba_project_id, replay_id
HAVING (min(segment_id AS _snuba_segment_id) = 0)
    AND (_snuba_finished_at < toDateTime('2023-03-11T00:30:34', 'Universal'))
    AND (_snuba_isArchived = 0)
ORDER BY _snuba_count_errors ASC
LIMIT 0, 10

MemoryTracker: Peak memory usage (for query): 448.65 MiB
10 rows in set. Elapsed: 1.711 sec. Processed 7.94 million rows, 341.42 MB (4.64 million rows/s., 199.60 MB/s.)

@JoshFerge JoshFerge requested a review from a team as a code owner March 13, 2023 18:25
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 13, 2023
@JoshFerge
Member Author

needs https://github.com/getsentry/getsentry/pull/9792 to get tests to pass

Member

@cmanallen cmanallen left a comment


Great job on this! This will be a great improvement.

@JoshFerge
Member Author

JoshFerge commented Mar 20, 2023

This is currently blocked: on ClickHouse 20.8, our minimum supported version, it seems we cannot supply UUIDs to an IN operator. We'd like to use the primary key since we have an index on it, so I'd like to avoid a materialized column.

I'm wondering if we should only make this type of search available via an environment variable, or perhaps just disallow this type of query?

Does anyone have ideas for a way to query on a list of UUIDs? You can see the error in the failed test.

@JoshFerge
Member Author

Was able to fix this after playing around on a ClickHouse 20.3 instance for a second -- I didn't need to use the toUUID function; I just had to ensure the UUID strings had dashes in them.
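In other words, the fix amounts to re-inserting dashes into the query-side UUID literals before they reach the IN clause. A hedged sketch of that normalization (hypothetical helper name; the real query building happens inside Snuba's query pipeline):

```python
import uuid


def build_uuid_in_clause(values):
    """Hypothetical sketch: normalize possibly-undashed hex UUIDs to the
    dashed form so ClickHouse can compare them against a UUID column in
    an IN clause, then render the clause as SQL text."""
    dashed = [str(uuid.UUID(v)) for v in values]
    return "replay_id IN (" + ", ".join(f"'{d}'" for d in dashed) + ")"
```

The point of the dashed form is that ClickHouse parses it directly as a UUID literal, so no toUUID() call (unavailable for this use on the minimum supported version) is needed.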

@JoshFerge JoshFerge requested a review from cmanallen March 21, 2023 16:31
@JoshFerge JoshFerge merged commit c90915b into master Mar 21, 2023
@JoshFerge JoshFerge deleted the jferg/optimize-replay-id-dash branch March 21, 2023 17:41
@github-actions github-actions bot locked and limited conversation to collaborators Apr 6, 2023
