Table checks frontend component #41

danwom · 2021-07-19T18:49:11Z

Adds an RFC to propose:

- a frontend component for displaying a 3-valued table status: number of checks, number passing, and an optional timestamp
- stubbing of the backend for this component

Signed-off-by: Daniel Won <dwon@lyft.com>

rfcs/039-table-data-quality-checks.md

feng-tao · 2021-07-29T05:37:05Z

lgtm

mgorsk1 · 2021-07-29T10:51:55Z

rfcs/039-table-data-quality-checks.md

+## Product Details
+Users will see a new section on the left panel of the Table Details page titled `Data Quality Checks`. There will be
+a message and icon for the current status of these checks, if they exist. A passing table might say 
+`✅ 10 of 10 checks passed` and a failing table might say `❌ 3 of 10 checks failed`. Additionally, the user may see 


When we are talking about `3 of 10 checks failed`, which one do we mean:

- out of the last 10 unit tests, 3 of them failed, or
- the last unit test on this table consisted of 10 checks and 3 of them failed?

I am pretty sure you mean the latter, but wanted to be 200% sure.

danwom · 2021-08-02

The way I interpret it: checks run on various cadences. Some run hourly, some might run daily or even weekly. The status counts, across all individual checks, how many of their most recent runs passed or failed.

This interpretation is left up to the specific implementation. We define an API with fields like `num_checks_passed`, `num_checks_total`, and `last_run_timestamp`. These can mean slightly different things if you wish.
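As a rough illustration of that interpretation, the sketch below keeps only the most recent run of each check and rolls the results up into the three fields. The `CheckRun` record, the `summarize_latest_runs` helper, and the sample data are hypothetical; only the field names `num_checks_passed`, `num_checks_total`, and `last_run_timestamp` come from this thread.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CheckRun:
    """One execution of a named check (hypothetical record shape)."""
    check_name: str
    ran_at: datetime
    passed: bool


@dataclass(frozen=True)
class TableChecksSummary:
    """The three values discussed in this thread."""
    num_checks_passed: int
    num_checks_total: int
    last_run_timestamp: Optional[datetime]


def summarize_latest_runs(runs: Iterable[CheckRun]) -> TableChecksSummary:
    """Keep only the most recent run of each check, then count passes."""
    latest: Dict[str, CheckRun] = {}
    for run in runs:
        current = latest.get(run.check_name)
        if current is None or run.ran_at > current.ran_at:
            latest[run.check_name] = run
    return TableChecksSummary(
        num_checks_passed=sum(1 for run in latest.values() if run.passed),
        num_checks_total=len(latest),
        last_run_timestamp=max((run.ran_at for run in latest.values()), default=None),
    )


# An hourly check that failed earlier but passed on its latest run,
# plus a daily check whose latest run failed.
runs = [
    CheckRun("row_count_positive", datetime(2021, 8, 2, 9), passed=False),
    CheckRun("row_count_positive", datetime(2021, 8, 2, 10), passed=True),
    CheckRun("id_not_null", datetime(2021, 8, 2, 0), passed=False),
]
summary = summarize_latest_runs(runs)
```

A different implementation could just as well count the checks in a single batch run; the three fields do not force either reading.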

mgorsk1 · 2021-07-29T11:03:29Z

rfcs/039-table-data-quality-checks.md

+## Terms
+- Data Quality Check - Like unit tests for code, data quality checks are assertions that try to guarantee certain 
+  characteristics of data. For example a data quality check could be as simple as asserting that a column is not null,
+  or implement some complex custom logic.


Do you already have in mind some tools from which these quality checks could be collected? I can think of Great Expectations and dbt tests, but it would be useful to have a broader list. @sewardgw also made a good point that the interface should be generic enough that we are not forced into backwards-incompatible changes later on.

From what I can see, the mockups you introduced somewhat adhere to the GE format, but I am not sure about dbt (because I don't have experience with it).

dkunitsk · 2021-08-03

We are not building this with a particular tool in mind. At the 1M-mile view, we think our proposed definition is so generic that it could crudely fit anything. The idea is to start here and expand once people start plugging in particular systems.

At Lyft we do have a DQ system we're planning to plug in, but it exposes a much more complicated API and semantics which we are not taking into account here. Instead, we asked ourselves what would be the most generic first step.

rfcs/039-table-data-quality-checks.md

dkunitsk · 2021-08-03T04:08:20Z

@mgorsk1 @sewardgw Thanks for your thoughts above. Your question around comparing the API vs offline databuilder approaches makes a lot of sense.

At this time, we are not ready to add full first class data quality checks to Amundsen. We aren't even ready to put forward a good generic definition of a check, as it might have a lot of dimensions and we're not ready to pin it down (condition, time range, columns validated, correctness/failure percentage, severity, owner, etc). This would be a great direction, but not what we were planning to tackle.

Instead, we're proposing a simple "Table Checks" frontend component to display limited pass/fail state. Originally, we were going to power this component with a customizable client (as this follows some prior art in Amundsen) but given your feedback we have an alternate suggestion:

Add a placeholder endpoint (/api/quality/v0) in the frontend service which returns a dummy response (e.g. "0 out of 0 checks successful"). Companies who have an API to hook up to this frontend component could override the endpoint implementation with a call to their API. Otherwise, this state would be a clean starting point for adding backend ingestion down the line.
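A minimal sketch of the override idea, written as plain Python rather than against the real frontend service: a handler serves a dummy payload by default and accepts a replacement provider. The `QualityEndpoint` class, the provider signature, the payload keys, and the sample table key are all assumptions for illustration; the thread only fixes the endpoint path and the "0 out of 0 checks" dummy response.

```python
import json
from typing import Callable, Dict, Optional

# Payload shape is an assumption; the thread only fixes the idea of
# "0 out of 0 checks successful" as the dummy response.
ChecksPayload = Dict[str, Optional[object]]


def default_table_checks(table_key: str) -> ChecksPayload:
    """Placeholder implementation: no checks are known for any table."""
    return {
        "num_checks_passed": 0,
        "num_checks_total": 0,
        "last_run_timestamp": None,
    }


class QualityEndpoint:
    """Hypothetical handler behind /api/quality/v0 with a swappable provider."""

    def __init__(self, provider: Callable[[str], ChecksPayload] = default_table_checks) -> None:
        self._provider = provider

    def handle(self, table_key: str) -> str:
        """Return the JSON body for one table."""
        return json.dumps(self._provider(table_key))


# Out of the box, the endpoint serves the dummy response.
stub_body = QualityEndpoint().handle("hive://gold.test_schema/test_table")


# A company with its own DQ API would override the provider. The numbers
# here are hard-coded stand-ins for a real API call.
def company_table_checks(table_key: str) -> ChecksPayload:
    return {
        "num_checks_passed": 7,
        "num_checks_total": 10,
        "last_run_timestamp": "2021-08-03T00:00:00Z",
    }


custom_body = QualityEndpoint(company_table_checks).handle("hive://gold.test_schema/test_table")
```

Keeping the provider as a single callable means the frontend component never needs to know whether the numbers come from the stub, a company API, or later backend ingestion.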

This has the advantage of moving in the generally correct direction (a pluggable "checks" frontend component) without making premature decisions, since we'd be starting with the most bare-bones definition possible.

mgorsk1 · 2021-08-03T17:27:21Z

@dkunitsk What I am missing in this proposal is some example or guidance on which DQ systems you have in mind for the current approach. Is it something custom you have at Lyft? Having such a list would let us expand the Future possibilities section with a more guided way to build on top of this.

It is also important to note that quality checks might be at the column level, not the table level, and those are most probably even more important. For example, Great Expectations tests are defined separately at the table level (I expect this table to have at least N rows) and at the column level (I expect this column to have only those 5 unique values).

I agree with @sewardgw on SLA and push/pull semantics, and that either way it's possible to get data into Amundsen as soon as possible.

dkunitsk · 2021-08-03T18:22:37Z

@mgorsk1 unfortunately we don't have most of the answers at this time. You're very right that a big portion of data quality will be at the column level, and we will be working in that direction in the near future.

At the moment we're proposing only:

- a frontend component for displaying a 3-valued table status: number of checks, number passing, and an optional timestamp
- stubbing of the backend for this component

In some sense, this is un-opinionated enough that one could crudely plug in any DQ system (even column-level systems). And we believe it's a generic starting point for all generalizations discussed - databuilders, push/pull questions, check definition, etc.
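To make the "crudely plug in" point concrete, here is a small sketch that flattens table-level and column-level outcomes into a single pass count and total, then renders the message style quoted from the RFC. The `roll_up` and `status_message` helpers and the sample checks are hypothetical, and returning `None` when no checks exist is an assumption.

```python
from typing import List, Mapping, Optional, Tuple


def roll_up(table_results: List[bool], column_results: Mapping[str, List[bool]]) -> Tuple[int, int]:
    """Flatten table-level and column-level outcomes into (passed, total)."""
    outcomes = list(table_results)
    for results in column_results.values():
        outcomes.extend(results)
    return sum(outcomes), len(outcomes)


def status_message(num_passed: int, num_total: int) -> Optional[str]:
    """Render the RFC-style message; None when no checks exist (an assumption)."""
    if num_total == 0:
        return None
    if num_passed == num_total:
        return f"✅ {num_passed} of {num_total} checks passed"
    return f"❌ {num_total - num_passed} of {num_total} checks failed"


passed, total = roll_up(
    table_results=[True],                      # e.g. "table has at least N rows"
    column_results={"status": [True, False]},  # e.g. "column has only 5 unique values"
)
message = status_message(passed, total)
```

The roll-up loses which column failed, which is exactly the kind of detail a later column-level proposal would need to carry.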

Given the good feedback above, we have a much better grasp of what the community is looking for in a generic DQ check feature and we'll be tailoring future proposals according to this feedback.

In the meantime, we'll adjust this RFC to make our scope more clear so we don't get stuck in the deeper questions just yet.

Signed-off-by: Daniel Won <dwon@lyft.com>

dkunitsk · 2021-08-04T23:09:00Z

rfcs/039-table-checks-frontend-component.md

+
+## Alternatives
+There are several alternatives: 
+  - Ingest data quality checks as part of the databuilder and display the status as part of the table metadata API. 


Note: this is not so much an alternative but rather an addition.

dkunitsk

+1. After the pared down scope, this leaves room for multiple directions on the unsolved questions (push/pull, specific integrations, etc).

Daniel Won added 2 commits July 16, 2021 17:06

Add Data Quality Check RFC

becfb84

Signed-off-by: Daniel Won <dwon@lyft.com>

Added additional detail

ca1f985

Signed-off-by: Daniel Won <dwon@lyft.com>

danwom requested a review from a team as a code owner July 19, 2021 18:49

dorianj reviewed Jul 19, 2021

rfcs/039-table-data-quality-checks.md

sewardgw reviewed Jul 20, 2021

rfcs/039-table-data-quality-checks.md

feng-tao added the Status: Draft label Jul 20, 2021

mgorsk1 reviewed Jul 29, 2021

rfcs/039-table-data-quality-checks.md

dkunitsk changed the title ~~Table Data Quality Checks~~ Table checks frontend component Aug 3, 2021

Amend title and edit content

b459ede

Signed-off-by: Daniel Won <dwon@lyft.com>

danwom force-pushed the dwon-data-quality-checks branch from bad5e17 to b459ede Compare August 4, 2021 16:34

dkunitsk reviewed Aug 4, 2021

dkunitsk approved these changes Aug 4, 2021

feng-tao approved these changes Aug 5, 2021

feng-tao added the Status: Final Comment Period (FCP) label (on final comment period, seven days) and removed the Status: Draft label Aug 5, 2021

danwom merged commit f92b1fe into master Aug 5, 2021


