Portable data integrity/invariant checks #8

Open
anjsimmo opened this issue Nov 17, 2021 · 0 comments

anjsimmo commented Nov 17, 2021

This idea was brought up in discussions #4 "Data quality issues should be easy to validate/verify" and #7 "Have a means to validate data to identify correlations / mistakes in the data".

Motivation:
Data issues that should, in principle, be easy to detect occur often. E.g., Google's data panel for COVID-19 deaths (which in turn was sourced from Wikipedia) was off by a factor of 10 for Australia, and the incorrect figure even found its way into some news articles. It should have been obvious that something was wrong from the sudden jump and from the fact that the number of deaths at country level did not match the sum of deaths across the states and territories.

We think these issues are common because every company that uses data has to reimplement the checks itself, and few have time to do so. What we need is a portable format for data integrity/invariant checks, so that sharing data validation checks is as easy as sharing the data itself. E.g., if one system implements an integrity check that the number of cases in a country should equal the sum of the number of cases in the states/territories of that country, there needs to be a portable way to share this check with other systems.
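To make the country/state example concrete, here is a minimal sketch of what a shareable check could look like: the check is described as plain JSON, and a small evaluator applies it to a record. The JSON field names (`type`, `total_key`, `parts_key`, `tolerance`), the `run_check` function, and the sample numbers are all hypothetical, chosen only for illustration; this is not an existing standard or a proposed final format.

```python
import json

# Hypothetical portable check description. The field names are illustrative
# assumptions, not part of any existing standard.
CHECK_JSON = """
{
  "name": "country_total_equals_sum_of_regions",
  "type": "sum_equals",
  "total_key": "country_total",
  "parts_key": "regions",
  "tolerance": 0
}
"""


def run_check(check, record):
    """Return (passed, message) for a 'sum_equals' check against one record."""
    if check["type"] != "sum_equals":
        raise ValueError(f"unsupported check type: {check['type']}")
    total = record[check["total_key"]]
    parts_sum = sum(record[check["parts_key"]].values())
    passed = abs(total - parts_sum) <= check["tolerance"]
    message = f"{check['name']}: total={total}, sum of parts={parts_sum}"
    return passed, message


check = json.loads(CHECK_JSON)

# Made-up numbers for illustration only (not real case or death counts).
consistent = {"country_total": 30, "regions": {"A": 10, "B": 20}}
inflated = {"country_total": 300, "regions": {"A": 10, "B": 20}}

print(run_check(check, consistent))
print(run_check(check, inflated))
```

Because the check itself is just data, it could in principle be published alongside a dataset and evaluated by any system that understands the `sum_equals` check type, which is the kind of portability described above.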

Specific problem:
While standards for representing data integrity checks already exist (e.g. the SQL CHECK constraint), we need to better understand the practical barriers to reusing data integrity checks and to propose solutions. If successful, this research should have widespread practical impact in improving data quality and preventing misinformation.
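For reference, a small sketch of the existing SQL CHECK mechanism, run through Python's built-in `sqlite3` module. The table name `region_counts`, its columns, and the values are hypothetical. Note that the constraint is declared as part of one table's definition in one database, so a system that receives only an exported copy of the data does not automatically receive the check; that may be one of the barriers to reuse worth investigating.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE region_counts (
        country TEXT NOT NULL,
        region  TEXT NOT NULL,
        deaths  INTEGER NOT NULL CHECK (deaths >= 0)
    )
    """
)

# A valid row is accepted.
conn.execute("INSERT INTO region_counts VALUES ('X', 'A', 10)")

# A row that violates the CHECK constraint is rejected by the database.
try:
    conn.execute("INSERT INTO region_counts VALUES ('X', 'B', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM region_counts").fetchone()[0]
print(rejected, row_count)
```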
