Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing some elementary data quality tests #3

Merged
merged 30 commits into from
Nov 2, 2023

Conversation

da115115
Copy link
Contributor

@da115115 da115115 marked this pull request as draft September 28, 2023 14:52
@stefannegele
Copy link
Contributor

Nice to see you getting started! Just write, if I can help with anything!

@da115115
Copy link
Contributor Author

Nice to see you getting started! Just write, if I can help with anything!

Thanks!

I would like to implement a simple use case/example:

There is a simple thing where you could maybe help, @stefannegele : writing the part (in https://github.com/data-engineering-helpers/datacontract-cli/blob/main/quality.go) where the specification is cross-referenced from another YAML file; which is the case in my example: https://github.com/data-engineering-helpers/data-contracts/blob/main/datacontract.com/contracts/data-contract-flight-route.yaml#L59
Right now, https://github.com/data-engineering-helpers/datacontract-cli/blob/main/quality.go parses only the data contract YAML file itself, not the cross-referenced YAML file. I do not know whether you already implemented that part. For https://github.com/data-engineering-helpers/datacontract-cli/blob/main/quality.go , I took inspiration from https://github.com/data-engineering-helpers/datacontract-cli/blob/main/schema.go and https://github.com/data-engineering-helpers/datacontract-cli/blob/main/lint.go

@da115115
Copy link
Contributor Author

If you need contributor access to the https://github.com/data-engineering-helpers GitHub organization, do not hesitate. Just let me know and I'll add you to one of the teams (https://github.com/orgs/data-engineering-helpers/teams).

@stefannegele
Copy link
Contributor

Hi @da115115 , thanks for the insights. I also think integrating SODA via a command line execution is fine for now. We will figure out a good solution, while integrating other tools.

I will put the resolution of referenced resources in my to-do list!

@stefannegele
Copy link
Contributor

@da115115
with d14490d remote and local references are resolved when using GetValue, e.g. Is that what you needed?

@da115115
Copy link
Contributor Author

da115115 commented Oct 2, 2023

@da115115 with d14490d remote and local references are resolved when using GetValue, e.g. Is that what you needed?

From what I can see in the Go code, yes, that is exactly what I meant. I have yet to test it, especially the part where a remote reference is downloaded locally. As a matter of fact, the same way there is a default file-name for local data contract files (that is datacontract.yaml), we could have default fine-names for cross-referenced schema files (eg, datacontract-schema.yaml) et quality files (eg, datacontract-quality.yaml). That way, when a remote data contract (specified by a URL) itself cross-references schema and/or quality files, those could be downloaded locally into those local files.

@stefannegele
Copy link
Contributor

We have just released version 0.3.0. The new inline command might also be helpful for you ;)

Tell me, if I can help to merge my latest changes.

@da115115
Copy link
Contributor Author

da115115 commented Oct 3, 2023

We have just released version 0.3.0. The new inline command might also be helpful for you ;)

Tell me, if I can help to merge my latest changes.

Thanks!

I've just tried the new inline method, but it did not work that much.

Here's what I've tried:

  1. Setup
mkdir -p ~/dev/infra/data-contracts && \
  git clone https://github.com/data-engineering-helpers/data-contracts.git ~/dev/infra/data-contracts/data-contracts && \
  cd ~/dev/infra/data-contracts/data-contracts/datacontract.com
  1. inline command, not working as expected (from my point of view; hopefully I'm wrong)
  datacontract quality --file contracts/data-contract-flight-route.yaml
  cat contracts/data-contract-flight-route.yaml # shows that the $ref are still there, nothing has been inlined

2.2. Revert local changes

git checkout -- .
  1. Quality checks
datacontract quality --file contracts/data-contract-flight-route.yaml 
$ref contracts/data-contract-flight-route-quality.yaml

@da115115
Copy link
Contributor Author

da115115 commented Oct 3, 2023

And on this branch/pull request, I've merged the last changes, up until 0.3.0, and there are now a few compilation issues:

go build -o datacontract
# github.com/datacontract/cli
./quality.go:9:27: undefined: ReadLocalDataContract
./quality.go:10:23: undefined: ParseDataContract
./quality.go:30:1: missing return

I guess the ReadLocalDataContract() and ParseDataContract() functions have been renamed. I can of course investigate on my own; but since you made the change, you may see what's wrong much quicker than what I'll be able to for now.

Thanks!

@stefannegele
Copy link
Contributor

stefannegele commented Oct 3, 2023

I've just tried the new inline method, but it did not work that much.

I tried the command on this file. Your "$ref ..." should be "$ref: ..." (adding colons). Then it works.

I guess the ReadLocalDataContract() and ParseDataContract() functions have been renamed.

Correct. You now can just use GetDataContract(). This will give you a DataContract object and takes a local path or an URL. Therefor usage of a specific ParseDataContract() is not needed any longer.

@da115115
Copy link
Contributor Author

da115115 commented Oct 3, 2023

I've just tried the new inline method, but it did not work that much.

I tried the command on this file. Your "$ref ..." should be "$ref: ..." (adding colons). Then it works.

Thanks, indeed the inlining now seems to work better. However, if I follow up with a lint command, there is an internal error:

datacontract inline --file contracts/data-contract-flight-route.yaml 
datacontract lint --file contracts/data-contract-flight-route.yaml 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x137e157]

Moreover, it may be better, when executing the inline command, to produce another file, rather than overwriting the current data contract (specified by the --file parameter). We could maybe have more specific parameters like --from-file and/or --to-file, à-la-Unix, where every command can take as input a file (which may be stdin) and produce as output a file (which may be stdout).
The init and inline commands would then work like curl -o and/or wget -O (they fetch some files on internet and produces a local file as a result).

Moreover, if we want to keep the cross-reference pattern for local files, we could have specific parameters like --schema-file, --quality-file, with default values as respectively datacontract-schema.yaml and datacontract-quality.yaml. That way, when the init command fetches some data contract from a URL, and stumbles upon cross-references, the cross-referenced YAML files may be fetched as well and downloaded as those respective local files.

@stefannegele
Copy link
Contributor

stefannegele commented Oct 3, 2023

I just added some tickets. Does not mean I will implement it right now, but it's open for discussion.

The bug with lint, I will check when I have time - probably end of the week. I wanted to do some cleanup and add better testing anyways.

edit:
you can work around the inline output parameter, by just copying your contract before using the inline command.

@da115115
Copy link
Contributor Author

da115115 commented Oct 3, 2023

edit: you can work around the inline output parameter, by just copying your contract before using the inline command.

Since my data contracts are in Git (https://github.com/data-engineering-helpers/data-contracts/blob/main/datacontract.com/contracts/data-contract-flight-route.yaml), it is easy enough to do git checkout -- . when wanting to revert the changes. And of course, if I like the changes, I can copy them into a new inlined file.

@stefannegele
Copy link
Contributor

@da115115 can you please give more info on #5 (add it to the ticket, please)? I can't replicate it.

@da115115
Copy link
Contributor Author

da115115 commented Oct 12, 2023

@stefannegele , the integration of SodaCL is now working. It is still simple and not that much robust, but it is a good start (IMHO).
An example of how to use it is given in https://github.com/data-engineering-helpers/data-contracts/tree/main/datacontract.com#check-the-data-quality-of-a-table-thanks-to-a-data-contract
Basically:

$ python -mpip install -U duckdb
$ mkdir -p ~/dev/infra/data-contracts && \
  git clone https://github.com/data-engineering-helpers/data-contracts.git ~/dev/infra/data-contracts/data-contracts && \
  cd ~/dev/infra/data-contracts/data-contracts/datacontract.com
$ datacontract quality-init --file contracts/data-contract-flight-route.yaml --quality-file contracts/data-contract-flight-route-quality.yaml
$ duckdb quality/db.duckdb < sql/duckdb-ddl-create-view-from-csv.sql
$ datacontract quality-check --file contracts/data-contract-flight-route.yaml --quality-file contracts/data-contract-flight-route-quality.yaml

Basically:

  • The CLI commands around quality checks are now:
    • --quality-print (formerly --quality): print the specification part related to quality checks
    • --quality-init: initialize, if needed, the environment/file-system for the quality checks. For instance, with Soda, it creates a quality/ directory and the YAML file for the Soda configuration. Soda is configured to use DuckDB. The DuckDB database file has to be setup with the data sets to be checked against the data contract
    • --quality-check: performs the quality checks on the data sets against the data contract. The control is handed over to the Soda CLI (when the type of quality checks is set to SodaCL)
  • Only Soda is supported for now; other types are not supported yet. The Soda quality checks are delegated within a dedicated Golang source file, namely sodaQuality.go. Adding support for other tools should be quite straightforward by creating similar Golang source files like for instance gxQuality.go (for Great Expectations)
  • The current implementation expects a YAML specification dedicated to the quality checks. For the example detailed here, we use https://github.com/data-engineering-helpers/data-contracts/blob/main/datacontract.com/contracts/data-contract-flight-route-quality.yaml . A robust way has still to be designed and implemented to support various methods to inline and/or to cross-reference the quality check part in the data contract. I have a few ideas, and we could open a dedicated issue for that

Copy link
Contributor

@stefannegele stefannegele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @da115115, nice work!
I had a brief look at your changes and added some comments. Please tell me if I can help or if I misunderstood something.

file.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
@da115115 da115115 marked this pull request as ready for review October 13, 2023 10:20
@da115115 da115115 changed the title [WIP] Implementing some elementary data quality tests Implementing some elementary data quality tests Nov 1, 2023
@stefannegele
Copy link
Contributor

I plan to merge tomorrow when I have time for my adjustments. Thank you!

@stefannegele stefannegele merged commit a6e78d5 into datacontract:main Nov 2, 2023
4 checks passed
@stefannegele
Copy link
Contributor

stefannegele commented Nov 2, 2023

Hi @da115115
Just merged your PR and added some tickets. If you plan to work on one of them, leve a comment there, please.
#12
#13
#14

Thanks for your work!

@da115115
Copy link
Contributor Author

da115115 commented Nov 2, 2023

Hi @da115115 Just merged your PR and added some tickets. If you plan to work on one of them, leve a comment there, please. #12 #13 #14

Thanks for your work!

Sure, makes sense, will do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants