Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow upload of invalid resource data #85

Open
fulior opened this issue Feb 20, 2023 · 6 comments
Open

Allow upload of invalid resource data #85

fulior opened this issue Feb 20, 2023 · 6 comments

Comments

@fulior
Copy link

fulior commented Feb 20, 2023

Currently ckanext-validation doesn't allow uploading resources which don’t pass validation. If the validation fails the extension tries to remove a failed resource file. But there are cases where CKAN should be allowed to store invalid data. In cases like asynchronous bulk upload or large tabular data compiled by multiple parties the validation report is important for data curators to know what’s going on in the system.

This could be solved by adding extra configuration option:

ckanext.validation.allow_invalid_data = False

This would let us to modify logic in ckanext/validation/logic.py (https://github.com/frictionlessdata/ckanext-validation/blob/1073c80dace453a404df3d5f1ac1ae86a88b5029/ckanext/validation/logic.py#L665) to preserve data file in case of validation error e.g:

allow_invalid_data = bool(t.config.get(
'ckanext.validation.allow_invalid_data'
))

if not report['valid'] and not allow_invalid_data:
...

I’ll gladly provide a PR for this if you think this is something useful. It’s a requirement of our users, so we’re currently using it in a forked version of ckanext-validation

@ThrawnCA
Copy link
Contributor

I was under the impression that if only asynchronous validation is used, then the desired behaviour is already the case; you can upload an invalid resource and just get a report on the failures.

It's when you enable synchronous validation that uploads will be rejected.

@fulior
Copy link
Author

fulior commented Feb 21, 2023

You're right, I've missed this, I was testing v2.0.0 in synchronous mode. Wouldn't it make more sense to have the same behavior for both async and sync mode?

  1. Either a configurable option like ckanext.validation.allow_invalid_data_in_sync_mode.
  2. Or just enable invalid data upload in sync mode permanently, so it's consistent with the asynchronous mode?

@ThrawnCA
Copy link
Contributor

ThrawnCA commented Feb 21, 2023

Wouldn't it make more sense to have the same behavior for both async and sync mode?

I don't think so. They're not just performance differences, they actually give you different feedback. Synchronous validation can actually give you an error, whereas async just gives you a report.

If you have chosen not to validate resources until after upload is complete, then clearly you are okay with potentially invalid resources existing in your system.

If you have chosen to take the performance hit to validate potentially large resources on the spot, then clearly you care a lot about ensuring that they're valid, and it makes sense to deny invalid uploads.

It's like the difference between having metal detectors and X-raying luggage, vs just having security cameras and doing background checks. One has the capability and expectation to actually deny entry to those who fail, the other doesn't.

@fulior
Copy link
Author

fulior commented Feb 22, 2023

In our case we have users that are collaborating on the uploaded data. Some files can be incomplete (e.g. incomplete patient surveys etc.) and the gaps will be filled by other users. I understand this can be achieved with asynchronous mode, but I think this behavior should be available in both modes.

@tomeksabala
Copy link

Hello @ThrawnCA . I'm working together with @fulior on the issue. Thanks a lot for your valuable feedback and pointing out the differences between sync and async validation regarding data file deletion. This resolves one of our use cases.
However, I still find the option to keep invalid data uploaded to the system with sync validation sensible. Two big reasons behind it (I wonder what you think?):

  1. synchronous validation is easier to work with on local dev env, and it would be nice to reproduce the production behaviour of keeping invalid data in the system with the validation report
  2. a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).

ckanext.validation.allow_invalid_data could be False by default and hence preserve old behaviour. It is not a big logic change and would make our configuration and workflow much more consistent.

@ThrawnCA
Copy link
Contributor

ThrawnCA commented Mar 1, 2023

synchronous validation is easier to work with on local dev env

How so? Wouldn't it be more important to replicate production behaviour? With a local database, I would expect that validation would complete in seconds at most, so there shouldn't be any particular difficulties in testing it.

a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).

Why couldn't this use asynchronous validation? All you need is the report to tell you when you've successfully fixed it, right? So asynchronous validation works fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants