
ref(ch-upgrades): update query_comparer #5584

Merged: 4 commits into master on Mar 5, 2024

Conversation

MeredithAnya (Member) commented Feb 23, 2024

Updating the comparer to use the file_manager and grab the files to compare from GCS. Corresponding ops PR https://github.com/getsentry/ops/pull/9471

What the comparer does:

  1. Looks at the results-* directories (e.g. results-21-8, results-22-8) to see whether we have pairs of results ready to be compared, and groups them into matched_pairs, e.g.
     [('results-21-8/2024_02_15/meredith_test_1.csv', 'results-22-8/2024_02_15/meredith_test_1.csv')]
  2. Iterates through those pairs to compare the results from those files. Before doing so, it checks the compared-data/ and compared-perf/ directories to make sure we haven't already done the comparison (this check is skipped when --override is used).
  3. Compares the results and generates mismatches for the data and perf categories based on the result data. The thresholds for what counts as a mismatch live in the corresponding _THRESHOLDS dicts.
  4. Saves the comparison results for each category separately, and sends the Slack overview plus a CSV file per category. Mismatch percentages are calculated against the total rows of the files, so adding the percentages of both categories can exceed 100%: the same query_id can mismatch in both performance and data consistency.
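The pairing in step 1 can be sketched as follows. This is a hypothetical illustration, not the actual snuba code: match_pairs and the path-splitting logic are assumptions, and only the example paths come from the description above.

```python
# Hypothetical sketch of step 1: a file is "ready to compare" only when
# the same relative path (date/name.csv) exists under both version
# directories. The helper name and splitting logic are illustrative.
from typing import Dict, Sequence, Tuple


def match_pairs(
    base_files: Sequence[str], upgrade_files: Sequence[str]
) -> Sequence[Tuple[str, str]]:
    # Key each file by its path relative to the results-* directory.
    base: Dict[str, str] = {f.split("/", 1)[1]: f for f in base_files}
    upgrade: Dict[str, str] = {f.split("/", 1)[1]: f for f in upgrade_files}
    # Only relative paths present on both sides form a matched pair.
    return [(base[k], upgrade[k]) for k in sorted(base.keys() & upgrade.keys())]


pairs = match_pairs(
    ["results-21-8/2024_02_15/meredith_test_1.csv"],
    ["results-22-8/2024_02_15/meredith_test_1.csv"],
)
# pairs == [('results-21-8/2024_02_15/meredith_test_1.csv',
#            'results-22-8/2024_02_15/meredith_test_1.csv')]
```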

@MeredithAnya MeredithAnya requested a review from a team as a code owner February 23, 2024 17:24
codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 51.79487%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 90.02%. Comparing base (9693ca2) to head (b135959).
Report is 22 commits behind head on master.

Files                                       Patch %   Lines
snuba/cli/query_comparer.py                 33.88%    80 Missing ⚠️
snuba/clickhouse/upgrades/comparisons.py    68.18%    14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5584      +/-   ##
==========================================
+ Coverage   89.94%   90.02%   +0.07%     
==========================================
  Files         900      900              
  Lines       43624    43809     +185     
  Branches      288      299      +11     
==========================================
+ Hits        39236    39437     +201     
+ Misses       4346     4330      -16     
  Partials       42       42              


snuba/cli/query_comparer.py: outdated review comments (resolved)
for v1_row, v2_row in itertools.zip_longest(base_reader, upgrade_reader):
    if v1_row[0] == "query_id":
        # csv header row

def get_matched_pairs() -> Sequence[Tuple[str, str]]:
Reviewer (Member):
Can you document what this function does and what it returns? Similar to what you have in the PR description.

Reviewer (Member):
Nit: Still missing a small description of what this does :-)
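The zip_longest loop in the review context above walks both result CSVs in lockstep. A self-contained sketch of that pattern, with hypothetical file contents and a hypothetical count_row_mismatches helper (the real comparer applies per-column thresholds rather than whole-row equality):

```python
import csv
import itertools


def count_row_mismatches(base_path: str, upgrade_path: str) -> int:
    """Count rows that differ between two result CSVs (illustrative sketch)."""
    mismatches = 0
    with open(base_path) as f1, open(upgrade_path) as f2:
        base_reader = csv.reader(f1)
        upgrade_reader = csv.reader(f2)
        # zip_longest pads the shorter file with None, so a row present in
        # only one file counts as a mismatch instead of being skipped.
        for v1_row, v2_row in itertools.zip_longest(base_reader, upgrade_reader):
            if v1_row is not None and v1_row[0] == "query_id":
                # csv header row
                continue
            if v1_row != v2_row:
                mismatches += 1
    return mismatches
```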


PERF_THRESHOLDS = {
    "query_duration_ms": 0,
    "read_rows": 0,
Reviewer (Member):
Why have read_rows and read_bytes in perf thresholds?

MeredithAnya (Member, Author):

Do we care to track whether a query is reading more data than before? That would probably be reflected in the duration, I guess.

Reviewer (Member):
For query performance, those shouldn't really matter as the first pass for finding mismatches; you should only use query duration. If you want to drill down into mismatches, those additional fields may give us hints.
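The thresholds being discussed gate which deltas count as mismatches. A minimal sketch of how a _THRESHOLDS dict might be applied per column; the perf_mismatches helper, its signature, and the read_bytes entry are assumptions, not the actual comparer code:

```python
from typing import Dict

# Columns and threshold values are illustrative (taken from or modeled on
# the snippet under review); a delta must exceed its threshold to count.
PERF_THRESHOLDS: Dict[str, float] = {
    "query_duration_ms": 0,
    "read_rows": 0,
    "read_bytes": 0,  # hypothetical entry, per the discussion above
}


def perf_mismatches(
    base: Dict[str, float], upgrade: Dict[str, float]
) -> Dict[str, float]:
    """Return the columns whose absolute delta exceeds the threshold,
    mapped to the signed delta (upgrade minus base)."""
    return {
        col: upgrade[col] - base[col]
        for col, threshold in PERF_THRESHOLDS.items()
        if abs(upgrade[col] - base[col]) > threshold
    }


perf_mismatches(
    {"query_duration_ms": 100, "read_rows": 50, "read_bytes": 1024},
    {"query_duration_ms": 120, "read_rows": 50, "read_bytes": 1024},
)
# -> {"query_duration_ms": 20}
```

With a threshold of 0, any nonzero delta is flagged; raising a column's threshold filters out noise for that metric, which is why the review suggests leaning on query_duration_ms alone for the first pass.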

MeredithAnya merged commit 0a1f79b into master on Mar 5, 2024 (31 of 32 checks passed).
MeredithAnya deleted the meredith/update-comparer branch on March 5, 2024 at 23:51.