ref(ch-upgrades): update query_comparer #5584
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
##           master    #5584      +/-   ##
==========================================
+ Coverage   89.94%   90.02%   +0.07%
==========================================
  Files         900      900
  Lines       43624    43809     +185
  Branches      288      299      +11
==========================================
+ Hits        39236    39437     +201
+ Misses       4346     4330      -16
  Partials       42       42

View full report in Codecov by Sentry.
Force-pushed from 6f640b7 to e245df4
for v1_row, v2_row in itertools.zip_longest(base_reader, upgrade_reader):
    if v1_row[0] == "query_id":
        # csv header row

def get_matched_pairs() -> Sequence[Tuple[str, str]]:
Can you document what this function does and what it returns? Similar to what you have in the PR description.
Nit: Still missing a small description of what this does :-)
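A docstring along the lines the reviewers are asking for might look like this sketch. The pairing logic shown here (matching base and upgrade listings by filename) is a hypothetical stand-in, not the PR's actual implementation:

```python
from typing import Sequence, Tuple


def get_matched_pairs(
    base_files: Sequence[str], upgrade_files: Sequence[str]
) -> Sequence[Tuple[str, str]]:
    """Pair up base/upgrade result files that are ready to be compared.

    Returns a sequence of (base_file, upgrade_file) tuples, one per
    result file present in both listings; files without a counterpart
    on the other ClickHouse version are skipped.
    """
    upgrade_set = set(upgrade_files)
    return [(f, f) for f in base_files if f in upgrade_set]
```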
snuba/cli/query_comparer.py (Outdated)
PERF_THRESHOLDS = {
    "query_duration_ms": 0,
    "read_rows": 0,
Why have `read_rows` and `read_bytes` in the perf thresholds?
Do we care to track whether a query is reading more data than before? That would probably be reflected in the duration, I guess.
For query performance, they shouldn't really matter as the first point of finding mismatches; you should only use query duration for that. If you want to drill down into mismatches, those additional fields may give us hints.
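The suggestion above (treat query duration as the primary mismatch signal and keep the other fields only as drill-down hints) could be sketched as follows. The function name, the dict-shaped rows, and the drill-down handling are illustrative assumptions, not the PR's code:

```python
# Only query_duration_ms drives mismatch detection in this sketch;
# read_rows / read_bytes are reported purely as drill-down hints.
PERF_THRESHOLDS = {
    "query_duration_ms": 0,
}
DRILL_DOWN_FIELDS = ("read_rows", "read_bytes")


def perf_mismatch(v1_row: dict, v2_row: dict) -> dict:
    """Return {} if the pair is within thresholds, otherwise the deltas
    for the primary metric plus the drill-down fields as hints."""
    deltas = {
        field: int(v2_row[field]) - int(v1_row[field])
        for field in (*PERF_THRESHOLDS, *DRILL_DOWN_FIELDS)
    }
    primary_ok = all(
        abs(deltas[metric]) <= threshold
        for metric, threshold in PERF_THRESHOLDS.items()
    )
    return {} if primary_ok else deltas
```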
Updating the comparer to use the file_manager and grab the files to compare from GCS. Corresponding ops PR: https://github.com/getsentry/ops/pull/9471

What the comparer does:
- Checks the `results-` directories to see if we have pairs of results that are ready to be compared, and puts them together as `matched_pairs`.
- Skips pairs that have already been compared (unless `--override` is used); we can check this in the `compared-data/` and `compared-perf/` directories.
- Compares the `data` and `perf` categories based on the result data. Thresholds for what should be considered mismatching are in the corresponding `_THRESHOLDS` dicts.