New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

rfc: propose storing queries as a first-class concept #27

Closed

dorianj wants to merge 1 commit into amundsen-io:master from dorianj:dj-first-class-queries

Contributor

dorianj commented Mar 12, 2021

This proposes a new fundamental data type in Amundsen. I've left breadcrumbs for possible features on top of this, but I haven't specced that out. I think discussing those possibilities before implementing the architecture would be healthy, though.

I think a likely implementation path would be to:

Implement this on its own and get it landed as a beta
Build a first feature using this data. It is expected that we will learn new things and change the underlying data shape slightly (hence the beta designation)
Announce this new data structure as supported in conjunction with the new feature.


          rfc: propose storing queries as a first-class concept

02db751

Signed-off-by: Dorian Johnson <2020@dorianj.net>

dorianj requested a review from a team as a code owner

March 12, 2021 19:42

dorianj added the Status: Draft label

This was referenced Mar 12, 2021

feature request - sample queries for tables amundsen-io/amundsen#31

Closed

feat: Store relationships between queries and tables amundsen-io/amundsen#547

Closed

feng-tao reviewed

View reviewed changes

rfcs/027-first-class-queries.md


		## Summary

		This proposes to ingest user-run queries directly into Amundsen. The information included would be the raw string query (typically SQL), as well as the objects referenced by it (Columns, Tables, etc). Amundsen would not parse SQL natively, rather it would provide the plumbing for users who have SQL parsing capabilities.

Member

feng-tao Mar 12, 2021

have we done any tests on how long it takes to ingest user log sql. Typically production usage log is 10s of millions of sql per day. Are we planning to surface all the queries in the table page?

Member

feng-tao Mar 12, 2021

if not during ingest, when will the sql parsing happen? are we planning to expose the usagelog for lineage consumption through API?

rfcs/027-first-class-queries.md


		## Motivation

		This allows for considerably more dynamism in how Amundsen surfaces usage-based information. Rather than popular tables being determined primarily by querying usage statistics from dashboarding systems, Amundsen could determine it at runtime by querying the metastore. Same goes for frequent table users. It also lays the groundwork for features like: commonly joined tables, popular queries, audit logging, etc. Additionally, if DDL/DML changes are observed in the log, they can be associated with a table.

Member

feng-tao Mar 12, 2021

the popular table is gathering usage info for table systems.

Member

feng-tao Mar 12, 2021

could you clarify the audit logging piece? also will the usage log ingestion refreshed everyday?

rfcs/027-first-class-queries.md


		## Guide-level Explanation (aka Product Details)

		In Amundsen, query logs from your DWH, RDBMS or BI tool may be ingested raw. It's up to you to extract the information about which subjects a particular query touches (e.g. which tables it queries, which user ran it, etc), but once ingested, Amundsen can surface interesting insights about query patterns.

Member

feng-tao Mar 12, 2021

typically approach is to store the query log through a data warehouse table (delta, hive etc) and it could fetch the usage log and persist the relevant information into the metadata store.

I wonder how. the usage query log ingestion is going to look like? are you directly consume from the source and directly persist into metastore store?

rfcs/027-first-class-queries.md


		### databuilder

		We would create a new `class Query(GraphSerializable)` . It would contain these properties:

Member

feng-tao Mar 12, 2021

there is already a https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/models/dashboard/dashboard_query.py, we should at least make a base query class and make two subclasses (dashboard, table)

rfcs/027-first-class-queries.md

+              - Time that the query was run (required)
+              - At least one of required:
+                  - Raw string representation of the query (for query types like SQL)
+                  - Name of job run (for things like Python notebooks run)

Member

feng-tao Mar 12, 2021

we have application model, it should be a relationship between query and application model.

rfcs/027-first-class-queries.md


		## Unresolved questions

		- Is there other prior art or considerations about this? Surely it's been considered before.

Member

feng-tao Mar 13, 2021

I think having a table query relationship makes sense, however, I am not sure about using amundsen as the source of truth for all the queries which could easily store many non-sense queries(e.g typo etc).

SLA of query log emit from computation framework to this store is NA (typical is to hive table)
Non GC of the query nodes, are you going to retain PII data(query has all kinds PII) for the full deployment?

I think we could have the query model, but if I am going to use it, I will only surface the most relevant / popular queries to the table (top 5, or top X)

jdavidheiser Mar 14, 2021

Yeah I think in our use cases, we would probably lean toward only surfacing either queries that were run recently, or queries that are associated with dashboards - and we'd probably only want to store recent ones in Amundsen too. I've run into a lot of situations where someone references a very old query, and the business logic has shifted enough since then that they end up making some incorrect assumptions

jdavidheiser reviewed

View reviewed changes

rfcs/027-first-class-queries.md


		We would create a new `class Query(GraphSerializable)` . It would contain these properties:

		- Time that the query was run (required)

jdavidheiser Mar 14, 2021

This is somewhat of a meta point, but most of this discussion seems to be around an assumption that we are recording every execution of a query as a unique instance in the database - so this idea of the time that the query was run makes sense in that context. However, I could see wanting to use this to instead record every unique query, with some information (like how often it is run) pre-computed before storing it in Amundsen. For those cases, I don't know if this requirement really makes sense.

jonhehir commented Mar 22, 2021 •

edited

Loading

I'll keep my comments high-level, since I haven't actively worked with Amundsen very recently.

As the author of amundsen-io/amundsen#547, I believe the original intent (which in retrospect is not as clear as it should have been) was to point out the lack of relationship between DashboardQuery objects and Tables in the current model. As I recall, we had relationships between Dashboard and Table objects, but no such relationship between DashboardQuery and Table objects. But in most cases, the relationship between Dashboard and Table objects probably takes place through these queries. Adding this relationship to the data model seemed like it should be easy to accommodate, both in the Amundsen data model and in any dashboard extractors.

If I understand correctly, the scope of this RFC is more far-reaching: creating a more general first-class query object that could potentially capture all queries against DBs (where, as @feng-tao mentions, DashboardQuery might become a subclass). I'll leave the discussion of this to the folks who are more active on the project. I guess it's not exactly clear to me how wide a net is being cast here in terms of queries to ingest into Amundsen, but I would imagine that in many cases, casting too wide a net of queries to ingest would be undesirable, for reasons of volume, signal-to-noise ratio, and potential for sensitive content in raw queries (e.g., PII).

Having said that, I do imagine there could be some very good use-cases for ingesting non-dashboard-related queries into Amundsen. Perhaps a quick few bullet points of use cases for this RFC would be helpful to clarify what exactly the aim is here?

Contributor

sewardgw commented Apr 22, 2021

@feng-tao @dorianj @jdavidheiser @jonhehir

I wanted to pick up the conversation on this and possibly refine the scope of the RFC. Think this feature set would be a wonderful addition but the current state of the RFC is a little too broad with many unknowns for how exactly this will enable future features. I would like to propose that the RFC be rescoped to the following:

Enable creation of a generic Query object
Enable creation of a generic Query Relationship (to table, to dashboard, etc. - maybe having specific relationship types based on the nodes being connected?)
Enable association of query to user
Enable association of queries to a job and / or source
Update metadata + FE services to serve a tab for queries on the table details page (like dashboards)
An implementation of a generic extractor (CSV?) to load queries + relationships

Furthermore, I would propose the following are not in scope and why:

Any form of processing / reading query logs and SQL parsing (the RFC is aligned here, but the comments focus heavily on this point)
- @feng-tao brought up good points about how to consume queries from the source, I would propose that Amundsen does not enforce a design choice here and to leave this up to the implementer (in the same way that implementers need to figure out how to do auth and get users into Amundsen)
- This will likely be the most customized aspect of the implementation. In this discussion there are questions such as how to handle performance, how to expire queries, and whether every query execution should be captured. I think Amundsen should enable all of these, but not go so far as to make a declaration for how to do it (at least not yet).
Associating queries to columns
- It is not overtly obvious what the value is to associate a query to a column given that Amundsen would be associating queries to tables. I recognize you can get most frequently used columns and similar features but I question the incremental value users would get and how much different the UX would be for a typical user going through this flow. I propose delaying this until there is a more well-defined use-case.
Associating python notebooks
- I think having a way to view notebooks is a great idea but notebooks have the opportunity for a greater scope such as multiple transformations / ELT, association to models & features as well as metrics and analysis. I would also reduce the scope of this RFC in this regard until there is a better understanding of both notebooks and these additional capabilities within Amundsen.
- A short-term workaround to solve this is to link to the notebook using the GitHub source

sewardgw mentioned this pull request

Index queries and query compositions #37

Merged

Contributor Author

dorianj commented May 13, 2021

Closing this in favor of #37

dorianj closed this

Golodhros added Status: Rejected and removed Status: Draft labels

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Rejected