Skip to content

feat(MEP): Implement the run_query method#31652

Merged
wmak merged 12 commits into
masterfrom
wmak/feat/run-metrics-query
Feb 18, 2022
Merged

feat(MEP): Implement the run_query method#31652
wmak merged 12 commits into
masterfrom
wmak/feat/run-metrics-query

Conversation

@wmak

@wmak wmak commented Feb 8, 2022

Copy link
Copy Markdown
Member
  • This checks the orderby for each entity, and orders the queries so the entity with the most orderbys are first.
    • If more than one orderby contains functions throw an error since the query is now impossible
  • This then uses the groupby to group the results of the up to 3 queries back together
  • This also implements the count_unique function so that we can have a test across 2 entities
  • Part 1 here

wmak and others added 5 commits February 7, 2022 17:48
- This adds a MetricsQueryBuilder, which works very similarily to our
  QueryBuilder, but with specific handlers for how metrics construct
  queries
- This MetricsQueryBuilder does not yet construct snql queries, and will
  not because table queries will require multiple queries to construct
  similar table data
  - that is, if we want [transaction, p95, count_unique(user)], we need
    a query against distributions with [transaction, p95] followed by a
    second query for [transaction, count_unique(user)] against the sets
    table
    - This is so we can maintain a sortby
- This checks the orderby for each entity, and orders the queries so the
  entity with the most orderbys are first.
  - If more than one orderby contains functions throw an error since the
    query is now impossible
- This then uses the groupby to group the results of the up to 3 queries
  back together
Comment thread src/sentry/search/events/builder.py
Base automatically changed from wmak/feat/metrics-query-builder to master February 10, 2022 18:57
@visual-snapshot visual-snapshot Bot requested a review from a team as a code owner February 10, 2022 18:57
@wmak wmak marked this pull request as draft February 10, 2022 19:02
@wmak

wmak commented Feb 10, 2022

Copy link
Copy Markdown
Member Author

Moving to draft, need to rework the conditions for 2nd+ queries

- This also adds better tests since the previous ones were not
  sufficient

@k-fish k-fish left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but some of the code should be split into functions for clarity either now or in the future, and it could use more tests. Should also get SnS review for how the query is being built if they have a chance.

Comment thread tests/sentry/search/events/test_builder.py Outdated
Comment thread src/sentry/search/events/builder.py Outdated
Comment thread src/sentry/search/events/builder.py
Comment thread src/sentry/search/events/builder.py
Comment thread tests/sentry/search/events/test_builder.py

@evanh evanh left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This idea makes sense. Couple questions. Also I don't know if you planned to add this now or were going to do it later, but I would think about how to instrument this, particularly the results aggregation.

Comment thread tests/sentry/search/events/test_builder.py Outdated
],
),
Op.IN,
Function("tuple", groupby_values),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like groupby_values is unbounded. Is there a limit to how big this list can be?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, We probably don't want to allow doing the 10k limit on these table queries. I'll enforce a 51 limit on this builder for now since we only plan to use this for the performance table 👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also keep in mind there is a maximum size of the actual query you can build. So if you have 100k conditions, the SQL string will be too large for Clickhouse to process.

groupby_values.append(groupby_key)
value_map[value_map_key].update(row)
result["meta"] += current_result["meta"]
result["data"] = list(value_map.values())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I thought of: metrics makes no guarantees that there will be data for every aggregation for every group by. This is a contrived example, but there could be a p95 for transaction x but not unique users. Is your code supposed to handle that case?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes 👍 I'll add an explicit test for this, but what should happen is that we'll get p95 for x, and nothing for count_unique.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very good point, i just realized for this scenario if we sort by unique users we'll get an unexpected result (x just won't show up at all), and I don't think there's a workaround either 🤔 Going to document this as a known deficiency for now.

@wmak

wmak commented Feb 14, 2022

Copy link
Copy Markdown
Member Author

re: instrumenting, I'll tackle that when its hooked up to an endpoint. I find it easier to do at that point since I'll be able to see the spans in context, thanks for the reminder! 🙏

- Adding a max limit to the metric query builder since we don't ever
  want to do the group by filtering on more than a reasonable number
  - Picking 51 for now since that aligns with tables in the UI (+1 for
    pagination)
- Adding a test & skipped test with note for the case where the data is
  sparse between the tables (ie. only in distribution but not set)
result = query.run_query("test_query")
assert len(result["data"]) == 2
assert result["data"][0] == {
"transaction": indexer.resolve("foo_transaction"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result here is an integer, right? Should it be mapped back to the string for the response?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, still haven't decided if that makes more sense in a post-processing step, or here in the run_query function. Going to leave this for now and revisit when I connect this to an endpoint.

Comment thread src/sentry/search/events/builder.py Outdated
)
current_result = raw_snql_query(
query,
referrer,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this and I think I would do a separate referrer to delineate between the "primary" and secondary queries, since the conditions will change between those two types of queries.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good idea

@wmak wmak merged commit 28405b1 into master Feb 18, 2022
@wmak wmak deleted the wmak/feat/run-metrics-query branch February 18, 2022 19:12
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants