Material for Data Quality in a Data Engineering perspective

Table of Content (ToC)

Created by gh-md-toc

Overview

This project intends to document requirements and reference material to improve data quality through the implementation of data validation rules, from a data engineering perspective, on a modern data stack (MDS).

Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent these companies.

References

Use cases

Geonames

Geonames is a kind of geographic wiki, crowd-sourcing metadata for points of interest (PoI)/points of reference (PoR) around the world. The corresponding data are stored in a geospatial PostgreSQL database and published every day under the Creative Commons (CC) BY 4.0 license; the corresponding snapshots are available at http://download.geonames.org/export/dump/ .

As the data are crowd-sourced, the daily snapshots may contain errors, intentional or not. Geonames therefore offer a premium, curated, monthly data feed service, which guarantees that the data snapshots are free of the above-mentioned errors.

For that purpose, Geonames maintain a Quality Assurance (QA) framework, where the data sets are checked every day against an extensive list of validation rules.

Most of those validation rules take the shape of SQL queries and yield the problematic records (a sketch of such a check follows the list below):

  • The lower the number of returned records, the better the quality. A score of 0 means perfection (a success in Geonames parlance).
  • If some records are retrieved (which are, by design, problematic), they are exposed so that some data curator/steward may fix them. Example of such a list of problematic records: http://qa.geonames.org/qa/2023-05-30/chkPPLAWithAdminCode1.html
  • Each record count feeds the corresponding time-series, so that users can see how the quality has evolved over time. See, for instance, the Geonames QA time-series for the "PPLA have valid admincode1" validation rule.
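As an illustration, a validation rule of that kind boils down to a script counting the records returned by a SQL query. The following sketch is only an assumption of how a check like chkPPLAWithAdminCode1 could look; the table and column names do not necessarily match the actual Geonames schema.

```python
# Hypothetical sketch of a Geonames-style validation rule: run a SQL check
# against a PostgreSQL snapshot and count the problematic records (0 means success).
import psycopg2

CHECK_SQL = """
SELECT geonameid, name, country_code
FROM geoname
WHERE feature_code = 'PPLA'                        -- administrative capitals...
  AND (admin1_code IS NULL OR admin1_code = '')    -- ...missing their admin level 1 code
"""

with psycopg2.connect("dbname=geonames user=qa") as conn:
    with conn.cursor() as cur:
        cur.execute(CHECK_SQL)
        problematic_records = cur.fetchall()

# The score is simply the number of returned records; it can then be published
# and tracked over time, as the Geonames QA pages do
print(f"score={len(problematic_records)} (0 means success)")
for record in problematic_records:
    print(record)
```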

Note: of course, as a wiki, Geonames track the full history of changes. For instance, Dakhla is retrieved as a problematic record for the above-mentioned data quality validation rule and, at some point, the user named aiddata altered that PoR. Keeping track of the changes makes it possible to find the root cause of bad data quality.

The Geonames QA framework makes it possible to vet the data sets for the monthly premium data feed, while keeping track of the problematic records so that data stewards/curators may fix them later on.

OpenTravelData (OPTD)

OpenTravelData (OPTD) curates, under the Creative Commons (CC) BY 4.0 license, a few reference data sets relevant to industries such as travel, transport and logistics, among others. Among many other original data sources, OPTD rely heavily on Geonames (see above) for the geographical data sets.

OPTD maintain a GitHub repository dedicated to Quality Assurance (QA), featuring data validation rule checkers. Each data validation rule checker yields a CSV file with all the problematic records. As with the Geonames QA framework, a score of 0 means perfection, while a high score means poor quality for the corresponding data set.

The resulting CSV files are published on the Transport Search data QA page, for instance for the 2 June 2021 snapshots.

The data validation rule checkers are mere scripts. Most of them are written in Python, but any other programming language may be used.
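For instance, a checker may be sketched as follows; the snapshot file name, delimiter and column names are assumptions for illustration purposes, not necessarily the actual OPTD layout.

```python
# Hypothetical sketch of an OPTD-style data validation rule checker: read a data
# snapshot, select the problematic records and dump them into a CSV file.
import csv

SNAPSHOT_FILE = "optd_por_public.csv"
REPORT_FILE = "por-missing-geo-coordinates.csv"

with open(SNAPSHOT_FILE, newline="", encoding="utf-8") as snapshot:
    reader = csv.DictReader(snapshot, delimiter="^")
    # A record is problematic when it lacks geographical coordinates
    problematic = [row for row in reader
                   if not row.get("latitude") or not row.get("longitude")]

with open(REPORT_FILE, "w", newline="", encoding="utf-8") as report:
    writer = csv.DictWriter(report, fieldnames=["iata_code", "latitude", "longitude"])
    writer.writeheader()
    for row in problematic:
        writer.writerow({key: row.get(key, "") for key in writer.fieldnames})

# As with the Geonames QA framework, the score is the number of problematic records
print(f"score={len(problematic)} (0 means perfection)")
```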

Articles

Writing data to production

Data Quality with Great Expectations

Data Quality 101: Ensuring accurate data in your pipelines

Why Data Reliability Should Be the Top Priority

Data validation, documentation, and profiling

Preventing data quality issues with unit testing

Data Quality in Python Pipelines

Using data linter to streamline data quality checks

Concepts and practices to ensure data quality

Stop Firefighting Data Quality Issues

The four pillars of data observability

Data Quality Validation for Python Dataframes

Frameworks and tools

Deequ

Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms.

Deequ works on tabular data, e.g., Parquet and CSV files, database tables, logs, flattened JSON files, basically anything that you can fit into a Spark dataframe.
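A minimal sketch with PyDeequ (the native Python version mentioned in the Hooqu section below) may look as follows; the input path and column names are hypothetical.

```python
# PyDeequ sketch: declare a few constraints on a Spark dataframe and collect
# the verification results. Adapt the path and column names to the actual data set.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Any tabular source fitting into a Spark dataframe works (Parquet, CSV, tables, ...)
df = spark.read.parquet("s3://some-bucket/orders/")

check = Check(spark, CheckLevel.Error, "orders checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("order_id")      # no null order identifiers
                    .isUnique("order_id")        # no duplicate orders
                    .isNonNegative("amount"))    # amounts must not be negative
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```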

Great Expectations (GX)

Great Expectations (GX) helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
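As an illustration, a few expectations may be declared on a Pandas dataframe; the sketch below relies on the legacy Pandas-based GX API (the API differs across GX versions) and uses hypothetical column names.

```python
# Great Expectations sketch with the legacy Pandas-based API: wrap a dataframe,
# declare expectations, then validate and inspect the overall outcome.
import pandas as pd
import great_expectations as ge

df = pd.DataFrame({"user_id": [1, 2, None], "age": [25, -3, 40]})
ge_df = ge.from_pandas(df)

ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

results = ge_df.validate()
print(results.success)  # overall status; per-expectation details are in results.results
```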

DuckDQ

DuckDQ is an embeddable Data Quality (DQ) validation tool for Python.

Hooqu

Hooqu is a library built on top of Pandas dataframes for defining "unit tests for data", which measure data quality in datasets. Hooqu is a "spiritual" Python port of Apache Deequ and has been in an experimental state since its creation.

Note: now that Deequ has a native Python version, there should be no further need for Hooqu.

Data Diff

See how every change to dbt code affects the data produced in the modified model and downstream.

Elementary

Monitor your data warehouse in minutes, directly from dbt. An analytics-engineer-first solution for monitoring data quality and operations.

SODA

Soda SQL is an open-source command-line interface (CLI) tool. It uses user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.

Google Common Expression Language

The Common Expression Language (CEL) implements common semantics for expression evaluation, enabling different applications to more easily interoperate.

Pydantic
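Pydantic is a Python library for data validation driven by type annotations. A minimal sketch, with a hypothetical record schema, may look as follows.

```python
# Pydantic sketch: declare the expected schema of a record with type annotations
# and let Pydantic report the validation errors. The PointOfReference model and
# its fields are hypothetical, for illustration only.
from pydantic import BaseModel, ValidationError

class PointOfReference(BaseModel):
    geoname_id: int
    name: str
    latitude: float
    longitude: float

try:
    por = PointOfReference(geoname_id="not-an-int", name="Dakhla",
                           latitude=23.68, longitude=-15.95)
except ValidationError as error:
    # Each reported error pinpoints the offending field and the reason
    print(error)
```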

Glue Data Quality
