-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Table checks frontend component #41
Conversation
Signed-off-by: Daniel Won <dwon@lyft.com>
Signed-off-by: Daniel Won <dwon@lyft.com>
lgtm |
## Product Details | ||
Users will see a new section on the left panel of the Table Details page titled `Data Quality Checks`. There will be | ||
a message and icon for the current status of these checks, if they exist. A passing table might say | ||
`✅ 10 of 10 checks passed` and a failing table might say `❌ 3 of 10 checks failed`. Additionally, the user may see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we are talking about 3 of 10 checks passed
which one do we mean:
- out of last 10 unit tests 3 of them failed
or - the last unit test on this table consisted of 10 checks and 3 of them failed?
I am pretty sure you mean the latter one but wanted to be 200% sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I interpret it is: Checks run on various cadences. Some tests run hourly, some might run daily or even weekly. For each individual check, how many of the last run passed or failed.
This interpretation is left up to the specific implementation. We define an API with fields like num_checks_passed
, num_checks_total
, and last_run_timestamp
. These can mean slightly different things if you wish.
## Terms | ||
- Data Quality Check - Like unit tests for code, data quality checks are assertions that try to guarantee certain | ||
characteristics of data. For example a data quality check could be as simple as asserting that a column is not null, | ||
or implement some complex custom logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you already have in mind some tools from which these quality checks could have been collected? I can think of Great Expectations and dbt tests but would be useful to have broader list. It's also something @sewardgw had good point about the interface to be generic enough we are not forced to do backwards incompatible changes later on.
From what I can see the mockups you introduced somewhat adhere to GE format but I am not sure about dbt (cause I don't have exp with it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not building this with a particular tool in mind. At the 1M-mile view, we think our proposed definition is so generic that it could crudely fit anything. The idea is to start here and expand once people start plugging in particular systems.
At Lyft we do have a DQ system we're planning to plug in, but it exposes a much more complicated API and semantics which we are not taking into account here. Instead, we asked ourselves what would be the most generic first step.
@mgorsk1 @sewardgw Thanks for your thoughts above. Your question around comparing the API vs offline databuilder approaches makes a lot of sense. At this time, we are not ready to add full first class data quality checks to Amundsen. We aren't even ready to put forward a good generic definition of a check, as it might have a lot of dimensions and we're not ready to pin it down (condition, time range, columns validated, correctness/failure percentage, severity, owner, etc). This would be a great direction, but not what we were planning to tackle. Instead, we're proposing a simple "Table Checks" frontend component to display limited pass/fail state. Originally, we were going to power this component with a customizable client (as this follows some prior art in Amundsen) but given your feedback we have an alternate suggestion:
This has the advantage of moving in the general correct direction (a pluggable "checks" frontend component) without making premature decisions, since we'd be starting with the the most bare-bones definition possible. |
What I am missing about this proposal is some example/guidance on what DQ systems you have in mind to be used with current approach @dkunitsk Is it something custom you have in Lyft? Having such list enables us to expand It is also important to note that quality checks might be on column level, not table level and most probably these are even more important. For example Great Expectation tests are defined both on I agree with @sewardgw on SLA and push/pull semantics and that either way it's possible to have data in Amundsen as soon as it's possible. |
@mgorsk1 unfortunately we don't have most of the answers at this time. You're very right that a big portion of data quality will be at the column level, and we will be working in that direction in the near future. At the moment we're proposing only:
In some sense, this is un-opinionated enough that one could crudely plug in any DQ system (even column-level systems). And we believe it's a generic starting point for all generalizations discussed - databuilders, push/pull questions, check definition, etc. Given the good feedback above, we have a much better grasp of what the community is looking for in a generic DQ check feature and we'll be tailoring future proposals according to this feedback. In the meantime, we'll adjust this RFC to make our scope more clear so we don't get stuck in the deeper questions just yet. |
Signed-off-by: Daniel Won <dwon@lyft.com>
bad5e17
to
b459ede
Compare
|
||
## Alternatives | ||
There are several alternatives: | ||
- Ingest data quality checks as part of the databuilder and display the status as part of the table metadata API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is not so much an alternative but rather an addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. After the pared down scope, this leaves room for multiple directions on the unsolved questions (push/pull, specific integrations, etc).
Adds an RFC to propose:
-a frontend component for displaying 3-valued table status: number of checks, number passing, and an optional timestamp
-stubbing of the backend for this component