Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: propose storing queries as a first-class concept #27

Closed
wants to merge 1 commit into from

Conversation

dorianj
Copy link
Contributor

@dorianj dorianj commented Mar 12, 2021

This proposes a new fundamental data type in Amundsen. I've left breadcrumbs for possible features on top of this, but I haven't specced that out. I think discussing those possibilities before implementing the architecture would be healthy, though.

I think a likely implementation path would be to:

  1. Implement this on its own and get it landed as a beta
  2. Build a first feature using this data. It is expected that we will learn new things and change the underlying data shape slightly (hence the beta designation)
  3. Announce this new data structure as supported in conjunction with the new feature.

Signed-off-by: Dorian Johnson <2020@dorianj.net>

## Summary

This proposes to ingest user-run queries directly into Amundsen. The information included would be the raw string query (typically SQL), as well as the objects referenced by it (Columns, Tables, etc). Amundsen would not parse SQL natively, rather it would provide the plumbing for users who have SQL parsing capabilities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we done any tests on how long it takes to ingest user log sql. Typically production usage log is 10s of millions of sql per day. Are we planning to surface all the queries in the table page?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not during ingest, when will the sql parsing happen? are we planning to expose the usagelog for lineage consumption through API?


## Motivation

This allows for considerably more dynamism in how Amundsen surfaces usage-based information. Rather than popular tables being determined primarily by querying usage statistics from dashboarding systems, Amundsen could determine it at runtime by querying the metastore. Same goes for frequent table users. It also lays the groundwork for features like: commonly joined tables, popular queries, audit logging, etc. Additionally, if DDL/DML changes are observed in the log, they can be associated with a table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the popular table is gathering usage info for table systems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you clarify the audit logging piece? also will the usage log ingestion refreshed everyday?


## Guide-level Explanation (aka Product Details)

In Amundsen, query logs from your DWH, RDBMS or BI tool may be ingested raw. It's up to you to extract the information about which subjects a particular query touches (e.g. which tables it queries, which user ran it, etc), but once ingested, Amundsen can surface interesting insights about query patterns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically approach is to store the query log through a data warehouse table (delta, hive etc) and it could fetch the usage log and persist the relevant information into the metadata store.

I wonder how. the usage query log ingestion is going to look like? are you directly consume from the source and directly persist into metastore store?


### databuilder

We would create a new `class Query(GraphSerializable)` . It would contain these properties:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is already a https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/models/dashboard/dashboard_query.py, we should at least make a base query class and make two subclasses (dashboard, table)

- Time that the query was run (required)
- At least one of required:
- Raw string representation of the query (for query types like SQL)
- Name of job run (for things like Python notebooks run)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have application model, it should be a relationship between query and application model.


## Unresolved questions

- Is there other prior art or considerations about this? Surely it's been considered before.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a table query relationship makes sense, however, I am not sure about using amundsen as the source of truth for all the queries which could easily store many non-sense queries(e.g typo etc).

  1. SLA of query log emit from computation framework to this store is NA (typical is to hive table)
  2. Non GC of the query nodes, are you going to retain PII data(query has all kinds PII) for the full deployment?

I think we could have the query model, but if I am going to use it, I will only surface the most relevant / popular queries to the table (top 5, or top X)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think in our use cases, we would probably lean toward only surfacing either queries that were run recently, or queries that are associated with dashboards - and we'd probably only want to store recent ones in Amundsen too. I've run into a lot of situations where someone references a very old query, and the business logic has shifted enough since then that they end up making some incorrect assumptions


We would create a new `class Query(GraphSerializable)` . It would contain these properties:

- Time that the query was run (required)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat of a meta point, but most of this discussion seems to be around an assumption that we are recording every execution of a query as a unique instance in the database - so this idea of the time that the query was run makes sense in that context. However, I could see wanting to use this to instead record every unique query, with some information (like how often it is run) pre-computed before storing it in Amundsen. For those cases, I don't know if this requirement really makes sense.

@jonhehir
Copy link

jonhehir commented Mar 22, 2021

I'll keep my comments high-level, since I haven't actively worked with Amundsen very recently.

As the author of amundsen-io/amundsen#547, I believe the original intent (which in retrospect is not as clear as it should have been) was to point out the lack of relationship between DashboardQuery objects and Tables in the current model. As I recall, we had relationships between Dashboard and Table objects, but no such relationship between DashboardQuery and Table objects. But in most cases, the relationship between Dashboard and Table objects probably takes place through these queries. Adding this relationship to the data model seemed like it should be easy to accommodate, both in the Amundsen data model and in any dashboard extractors.

If I understand correctly, the scope of this RFC is more far-reaching: creating a more general first-class query object that could potentially capture all queries against DBs (where, as @feng-tao mentions, DashboardQuery might become a subclass). I'll leave the discussion of this to the folks who are more active on the project. I guess it's not exactly clear to me how wide a net is being cast here in terms of queries to ingest into Amundsen, but I would imagine that in many cases, casting too wide a net of queries to ingest would be undesirable, for reasons of volume, signal-to-noise ratio, and potential for sensitive content in raw queries (e.g., PII).

Having said that, I do imagine there could be some very good use-cases for ingesting non-dashboard-related queries into Amundsen. Perhaps a quick few bullet points of use cases for this RFC would be helpful to clarify what exactly the aim is here?

@sewardgw
Copy link
Contributor

@feng-tao @dorianj @jdavidheiser @jonhehir

I wanted to pick up the conversation on this and possibly refine the scope of the RFC. Think this feature set would be a wonderful addition but the current state of the RFC is a little too broad with many unknowns for how exactly this will enable future features. I would like to propose that the RFC be rescoped to the following:

  • Enable creation of a generic Query object
  • Enable creation of a generic Query Relationship (to table, to dashboard, etc. - maybe having specific relationship types based on the nodes being connected?)
  • Enable association of query to user
  • Enable association of queries to a job and / or source
  • Update metadata + FE services to serve a tab for queries on the table details page (like dashboards)
  • An implementation of a generic extractor (CSV?) to load queries + relationships

Furthermore, I would propose the following are not in scope and why:

  • Any form of processing / reading query logs and SQL parsing (the RFC is aligned here, but the comments focus heavily on this point)
    • @feng-tao brought up good points about how to consume queries from the source, I would propose that Amundsen does not enforce a design choice here and to leave this up to the implementer (in the same way that implementers need to figure out how to do auth and get users into Amundsen)
    • This will likely be the most customized aspect of the implementation. In this discussion there are questions such as how to handle performance, how to expire queries, and whether every query execution should be captured. I think Amundsen should enable all of these, but not go so far as to make a declaration for how to do it (at least not yet).
  • Associating queries to columns
    • It is not overtly obvious what the value is to associate a query to a column given that Amundsen would be associating queries to tables. I recognize you can get most frequently used columns and similar features but I question the incremental value users would get and how much different the UX would be for a typical user going through this flow. I propose delaying this until there is a more well-defined use-case.
  • Associating python notebooks
    • I think having a way to view notebooks is a great idea but notebooks have the opportunity for a greater scope such as multiple transformations / ELT, association to models & features as well as metrics and analysis. I would also reduce the scope of this RFC in this regard until there is a better understanding of both notebooks and these additional capabilities within Amundsen.
    • A short-term workaround to solve this is to link to the notebook using the GitHub source

@dorianj
Copy link
Contributor Author

dorianj commented May 13, 2021

Closing this in favor of #37

@dorianj dorianj closed this May 13, 2021
@Golodhros Golodhros added Status: Rejected closed without being merged and removed Status: Draft labels May 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Rejected closed without being merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants