[WIP] Added collect_errors processor #144

Closed
wants to merge 1 commit into from

Conversation

roll
Contributor

@roll roll commented Aug 5, 2020


Hi @cschloer,

I've checked the options for goodtables integration and it doesn't seem doable before the end of this stage. So I dove into the initial issue - https://github.com/BCODMO/laminar_web/issues/801 - and tried to implement something simpler. For example, we can have a collect_errors processor that works like this:

def test_collect_errors():
    from dataflows import load, collect_errors, Flow
    res, dp, stats = Flow(
        load('data/cities.csv', name='cities'),
        # it accepts an "Extra Schema" that can be partial
        collect_errors({'fields': [{'name': 'city', 'constraints': {'maxLength': 4}}]}),
    ).results()
    assert dp.get_resource('cities').descriptor.get('errors') == [
        'Field "city" has constraint "maxLength" which is not satisfied for value "london"',
        'Field "city" has constraint "maxLength" which is not satisfied for value "paris"'
    ]

On the other hand, all the checks @adyork asked for can be set in the main schema, e.g. using constraints.pattern; then the validation will fail in dataflows without this processor.

@adyork

adyork commented Aug 5, 2020

Thanks for looking into this @roll. I'm not sure this gets us to the use case we laid out in https://github.com/BCODMO/laminar_web/issues/801, since it seems able to check the values of fields but not the field names.

For example, in your above example it says:

collect_errors({'fields': [{'name': 'city', 'constraints': {'maxLength': 4}}]})
'Field "city" has constraint "maxLength" which is not satisfied for value "london"'

But we are looking for something that can check the field name itself ('city' in your example) against constraints, such as allowing only [_0-9A-Za-z] and not beginning with a number.
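To make the distinction concrete, here is a minimal sketch of a name-level check, written in plain Python rather than as a dataflows processor. The function name `check_field_names`, the message wording, and the sample schema are all hypothetical; only the rule itself (characters limited to [_0-9A-Za-z], no leading digit) comes from the description above.

```python
import re

# Assumed rule: only [_0-9A-Za-z] is allowed and the name must not start with a digit.
FIELD_NAME_RE = re.compile(r'[_A-Za-z][_0-9A-Za-z]*')


def check_field_names(schema):
    """Return one message per field whose *name* (not its values) breaks the rule."""
    errors = []
    for field in schema.get('fields', []):
        name = field.get('name', '')
        # fullmatch requires the whole name to fit the pattern
        if not FIELD_NAME_RE.fullmatch(name):
            errors.append(f'Field name "{name}" is not valid')
    return errors


# Hypothetical schema descriptor used only for illustration
schema = {'fields': [{'name': 'city'}, {'name': '2nd_city'}, {'name': 'lat-lon'}]}
print(check_field_names(schema))
```

Unlike a constraint on the field, this never looks at row values, so it could run on the schema alone before any data is read.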

As for using constraints.pattern in the schema, I think that has the same problem of looking at the values, not the field names? See e.g. https://specs.frictionlessdata.io/table-schema/#constraints. Is there somewhere else in the documentation I should be looking for that?

Let me know if I am missing something here.

@roll
Contributor Author

roll commented Aug 5, 2020

Oh I see now - what if we just create a processor that validates a field (metadata) against JSON Schema?

@roll
Contributor Author

roll commented Aug 6, 2020

cc @cschloer, WDYT?

@lwinfree

lwinfree commented Aug 10, 2020

@adyork, 😄 hey! @roll and I were talking a bit about this today. Would Evgeny's solution (#144 (comment)) work for the use case?

@akariv
Member

akariv commented Aug 11, 2020

@roll - why not use the onerror for set_type or validate? It seems to accomplish the same purpose, no?

@roll
Contributor Author

roll commented Aug 12, 2020

I think we need a declarative way, but this PR is not what we need anyway, so I'm closing. Thanks

@roll roll closed this Aug 12, 2020
Development

Successfully merging this pull request may close these issues.

An ability to add goodtables checks to the validate processor
4 participants