Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table checks frontend component #41

Merged
merged 3 commits into from
Aug 5, 2021
Merged

Table checks frontend component #41

merged 3 commits into from
Aug 5, 2021

Conversation

danwom
Copy link

@danwom danwom commented Jul 19, 2021

Adds an RFC to propose:

-a frontend component for displaying 3-valued table status: number of checks, number passing, and an optional timestamp
-stubbing of the backend for this component

Daniel Won added 2 commits July 16, 2021 17:06
Signed-off-by: Daniel Won <dwon@lyft.com>
Signed-off-by: Daniel Won <dwon@lyft.com>
@danwom danwom requested a review from a team as a code owner July 19, 2021 18:49
@feng-tao
Copy link
Member

lgtm

## Product Details
Users will see a new section on the left panel of the Table Details page titled `Data Quality Checks`. There will be
a message and icon for the current status of these checks, if they exist. A passing table might say
`✅ 10 of 10 checks passed` and a failing table might say `❌ 3 of 10 checks failed`. Additionally, the user may see
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we are talking about 3 of 10 checks passed which one do we mean:

  • out of last 10 unit tests 3 of them failed
    or
  • the last unit test on this table consisted of 10 checks and 3 of them failed?

I am pretty sure you mean the latter one but wanted to be 200% sure.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I interpret it is: Checks run on various cadences. Some tests run hourly, some might run daily or even weekly. For each individual check, how many of the last run passed or failed.

This interpretation is left up to the specific implementation. We define an API with fields like num_checks_passed, num_checks_total, and last_run_timestamp. These can mean slightly different things if you wish.

## Terms
- Data Quality Check - Like unit tests for code, data quality checks are assertions that try to guarantee certain
characteristics of data. For example a data quality check could be as simple as asserting that a column is not null,
or implement some complex custom logic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you already have in mind some tools from which these quality checks could have been collected? I can think of Great Expectations and dbt tests but would be useful to have broader list. It's also something @sewardgw had good point about the interface to be generic enough we are not forced to do backwards incompatible changes later on.

From what I can see the mockups you introduced somewhat adhere to GE format but I am not sure about dbt (cause I don't have exp with it).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not building this with a particular tool in mind. At the 1M-mile view, we think our proposed definition is so generic that it could crudely fit anything. The idea is to start here and expand once people start plugging in particular systems.

At Lyft we do have a DQ system we're planning to plug in, but it exposes a much more complicated API and semantics which we are not taking into account here. Instead, we asked ourselves what would be the most generic first step.

@dkunitsk
Copy link

dkunitsk commented Aug 3, 2021

@mgorsk1 @sewardgw Thanks for your thoughts above. Your question around comparing the API vs offline databuilder approaches makes a lot of sense.

At this time, we are not ready to add full first class data quality checks to Amundsen. We aren't even ready to put forward a good generic definition of a check, as it might have a lot of dimensions and we're not ready to pin it down (condition, time range, columns validated, correctness/failure percentage, severity, owner, etc). This would be a great direction, but not what we were planning to tackle.

Instead, we're proposing a simple "Table Checks" frontend component to display limited pass/fail state. Originally, we were going to power this component with a customizable client (as this follows some prior art in Amundsen) but given your feedback we have an alternate suggestion:

Add a placeholder endpoint (/api/quality/v0) in the frontend service which returns a dummy response (e.g. "0 out of 0 checks successful"). Companies who have an API to hook up to this frontend component could override the endpoint implementation with a call to their API. Otherwise, this state would be a clean starting point for adding backend ingestion down the line.

This has the advantage of moving in the general correct direction (a pluggable "checks" frontend component) without making premature decisions, since we'd be starting with the the most bare-bones definition possible.

@mgorsk1
Copy link
Contributor

mgorsk1 commented Aug 3, 2021

What I am missing about this proposal is some example/guidance on what DQ systems you have in mind to be used with current approach @dkunitsk Is it something custom you have in Lyft? Having such list enables us to expand Future possibilities section with more guided way to build on top of this.

It is also important to note that quality checks might be on column level, not table level and most probably these are even more important. For example Great Expectation tests are defined both on table (I expect this table to have at least N rows) and column level (I expect this column to have those 5 unique values only) separately.

I agree with @sewardgw on SLA and push/pull semantics and that either way it's possible to have data in Amundsen as soon as it's possible.

@dkunitsk
Copy link

dkunitsk commented Aug 3, 2021

@mgorsk1 unfortunately we don't have most of the answers at this time. You're very right that a big portion of data quality will be at the column level, and we will be working in that direction in the near future.

At the moment we're proposing only:

  • a frontend component for displaying 3-valued table status: number of checks, number passing, and an optional timestamp
  • stubbing of the backend for this component

In some sense, this is un-opinionated enough that one could crudely plug in any DQ system (even column-level systems). And we believe it's a generic starting point for all generalizations discussed - databuilders, push/pull questions, check definition, etc.

Given the good feedback above, we have a much better grasp of what the community is looking for in a generic DQ check feature and we'll be tailoring future proposals according to this feedback.

In the meantime, we'll adjust this RFC to make our scope more clear so we don't get stuck in the deeper questions just yet.

@dkunitsk dkunitsk changed the title Table Data Quality Checks Table checks frontend component Aug 3, 2021
Signed-off-by: Daniel Won <dwon@lyft.com>

## Alternatives
There are several alternatives:
- Ingest data quality checks as part of the databuilder and display the status as part of the table metadata API.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is not so much an alternative but rather an addition.

Copy link

@dkunitsk dkunitsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. After the pared down scope, this leaves room for multiple directions on the unsolved questions (push/pull, specific integrations, etc).

@feng-tao feng-tao added Status: Final Comment Period (FCP) On final comment period (seven days) and removed Status: Draft labels Aug 5, 2021
@danwom danwom merged commit f92b1fe into master Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Final Comment Period (FCP) On final comment period (seven days)
Projects
None yet
6 participants