rfc: propose storing queries as a first-class concept #27
Conversation
Signed-off-by: Dorian Johnson <2020@dorianj.net>
## Summary

This proposes to ingest user-run queries directly into Amundsen. The information included would be the raw query string (typically SQL), as well as the objects referenced by it (columns, tables, etc.). Amundsen would not parse SQL natively; rather, it would provide the plumbing for users who have SQL parsing capabilities.
Have we done any tests on how long it takes to ingest user SQL logs? A typical production usage log is tens of millions of SQL queries per day. Are we planning to surface all the queries on the table page?
If not during ingestion, when will the SQL parsing happen? Are we planning to expose the usage log for lineage consumption through the API?
## Motivation

This allows for considerably more dynamism in how Amundsen surfaces usage-based information. Rather than popular tables being determined primarily by querying usage statistics from dashboarding systems, Amundsen could determine popularity at runtime by querying the metastore. The same goes for frequent table users. It also lays the groundwork for features like commonly joined tables, popular queries, and audit logging. Additionally, if DDL/DML changes are observed in the log, they can be associated with a table.
The popular-tables feature currently gathers usage info from the table systems.
Could you clarify the audit logging piece? Also, will the usage log ingestion be refreshed every day?
## Guide-level Explanation (aka Product Details)

In Amundsen, query logs from your DWH, RDBMS, or BI tool may be ingested raw. It's up to you to extract the information about which subjects a particular query touches (e.g. which tables it queries, which user ran it), but once ingested, Amundsen can surface interesting insights about query patterns.
The typical approach is to store the query log in a data warehouse table (Delta, Hive, etc.); the ingestion could then fetch the usage log and persist the relevant information into the metadata store.
I wonder what the usage query log ingestion is going to look like. Do you consume directly from the source and persist directly into the metadata store?
### databuilder

We would create a new `class Query(GraphSerializable)`. It would contain these properties:
There is already a https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/models/dashboard/dashboard_query.py; we should at least make a base query class and make two subclasses (dashboard, table).
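A minimal sketch of the refactor this comment suggests, with a shared base class and dashboard/table subclasses. The class and property names here are illustrative assumptions, not the actual databuilder API; the real models would extend `GraphSerializable`.

```python
# Hypothetical sketch only: names and fields are assumptions, not the
# real databuilder models.
from abc import ABC


class BaseQuery(ABC):
    """Fields shared by every kind of query model."""

    def __init__(self, query_text: str) -> None:
        self.query_text = query_text


class DashboardQuery(BaseQuery):
    """A query backing a dashboard (cf. databuilder's dashboard_query model)."""

    def __init__(self, query_text: str, dashboard_id: str) -> None:
        super().__init__(query_text)
        self.dashboard_id = dashboard_id


class TableQuery(BaseQuery):
    """An ad-hoc or scheduled query run against one or more tables."""

    def __init__(self, query_text: str, table_keys: list) -> None:
        super().__init__(query_text)
        self.table_keys = table_keys
```

The point of the shared base class is that serialization and search indexing logic can be written once against `BaseQuery` and reused by both subclasses.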
- Time that the query was run (required)
- At least one of these (required):
  - Raw string representation of the query (for query types like SQL)
  - Name of the job run (for things like Python notebook runs)
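The invariants in the property list above can be sketched as follows. This is a hypothetical illustration, not the actual model: the real class would extend `GraphSerializable`, and the field names here are assumptions.

```python
# Sketch of the proposed Query model's invariants: run time is required,
# and at least one of raw_query or job_name must be present.
# Names are hypothetical, not the real databuilder API.
from datetime import datetime
from typing import Optional


class Query:
    def __init__(
        self,
        run_at: datetime,                  # time the query was run (required)
        raw_query: Optional[str] = None,   # raw string form (e.g. SQL)
        job_name: Optional[str] = None,    # name of the job run (e.g. a notebook)
    ) -> None:
        if raw_query is None and job_name is None:
            raise ValueError("Query needs at least one of raw_query or job_name")
        self.run_at = run_at
        self.raw_query = raw_query
        self.job_name = job_name
```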
We have an Application model; this should be a relationship between the Query and Application models.
## Unresolved questions

- Is there other prior art or considerations about this? Surely it's been considered before.
I think having a table-query relationship makes sense; however, I am not sure about using Amundsen as the source of truth for all queries, since that could easily mean storing many nonsense queries (e.g. typos).
- There is no SLA for query log emission from the computation framework to this store (the typical destination is a Hive table).
- Without GC of the query nodes, are you going to retain PII (queries contain all kinds of PII) for the lifetime of the deployment?
I think we could have the query model, but if I were going to use it, I would only surface the most relevant/popular queries for the table (top 5, or top X).
Yeah, I think in our use cases we would probably lean toward only surfacing queries that were run recently, or queries that are associated with dashboards; we'd probably only want to store recent ones in Amundsen too. I've run into a lot of situations where someone references a very old query, and the business logic has shifted enough since then that they end up making incorrect assumptions.
> We would create a new `class Query(GraphSerializable)`. It would contain these properties:
>
> - Time that the query was run (required)
This is somewhat of a meta point, but most of this discussion seems to rest on the assumption that we are recording every execution of a query as a unique instance in the database, so the idea of "the time that the query was run" makes sense in that context. However, I could see wanting to use this instead to record every unique query, with some information (like how often it is run) pre-computed before storing it in Amundsen. For those cases, I don't know if this requirement really makes sense.
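The alternative described here, collapsing an execution log into unique queries with a pre-computed run count, could be sketched like this. The whitespace/case normalization is a deliberately naive stand-in for real SQL fingerprinting; all names are hypothetical.

```python
# Sketch: record each unique query once, with how often it ran, instead of
# one node per execution. Normalization here is intentionally simplistic.
import hashlib
import re
from collections import defaultdict


def fingerprint(sql: str) -> str:
    """Hash a crudely normalized query text (lowercase, collapsed whitespace)."""
    normalized = re.sub(r"\s+", " ", sql.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()


def aggregate(execution_log: list) -> dict:
    """Collapse a log of raw SQL strings into {fingerprint: {"sql", "runs"}}."""
    unique = defaultdict(lambda: {"sql": None, "runs": 0})
    for sql in execution_log:
        entry = unique[fingerprint(sql)]
        entry["sql"] = sql
        entry["runs"] += 1
    return dict(unique)
```

With this shape, only the aggregated entries (and perhaps only the top-N by `runs`) would ever be persisted into Amundsen, sidestepping per-execution timestamps entirely.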
I'll keep my comments high-level, since I haven't actively worked with Amundsen very recently. As the author of amundsen-io/amundsen#547, I believe the original intent (which in retrospect is not as clear as it should have been) was to point out the lack of relationship between […].

If I understand correctly, the scope of this RFC is more far-reaching: creating a more general first-class query object that could potentially capture all queries against DBs (where, as @feng-tao mentions, […]).

Having said that, I do imagine there could be some very good use cases for ingesting non-dashboard-related queries into Amundsen. Perhaps a few quick bullet points of use cases for this RFC would be helpful to clarify what exactly the aim is here?
@feng-tao @dorianj @jdavidheiser @jonhehir I wanted to pick up the conversation on this and possibly refine the scope of the RFC. I think this feature set would be a wonderful addition, but the current state of the RFC is a little too broad, with many unknowns about how exactly it will enable future features. I would like to propose that the RFC be rescoped to the following:
Furthermore, I would propose that the following are not in scope, and why:
Closing this in favor of #37 |
This proposes a new fundamental data type in Amundsen. I've left breadcrumbs for possible features on top of this, but I haven't specced those out. I think discussing those possibilities before implementing the architecture would be healthy, though.
I think a likely implementation path would be to: