Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column-to-Field mapping strategy #123

Closed
1 task
roll opened this issue Oct 9, 2016 · 2 comments
Closed
1 task

Column-to-Field mapping strategy #123

roll opened this issue Oct 9, 2016 · 2 comments
Assignees
Labels
general General improvements

Comments

@roll
Copy link
Member

roll commented Oct 9, 2016

Overview

We have a list of columns and list of fields. There are two ways to link them:

  • based on order (I suppose it's how JTS says we should do)
  • based on name

There is a check for it - https://github.com/frictionlessdata/goodtables-py/blob/next/goodtables/checks/head/non_matching_header.py - any logic could be implemented.

For now it uses current goodtables approach checking names using order but removing column from inspection if there is non matching name. Also there is not implemented order_fields option. But this approach could lead to problems described below. It could be solved removing whitespaces from names.

But also may be instead of order_fileds to use option to set mapping strategy:

  • map fields to columns using order
  • map fields to columns using names

Problems

@rgrp has wrote:

e.g. column named "Number" and JTS has " Number" (edt: see space!) then number column is ignored.

Qu: Why do we even use column names vs JTS names - can't we use JTS order in checking or is the idea we could validate arbitrary JSON data?

Tasks

  • For now non_matching_headers(order_fields=True) raises NotImplementedError but it should re-order fields if order_fields option is True. Needed some tricky algorithm because headers and fields could be duplicated etc. [IF NEEDED]
@roll roll added bug Something isn't working branch-next labels Oct 9, 2016
@roll roll added this to the goodtables-v1 milestone Oct 9, 2016
@roll roll self-assigned this Oct 9, 2016
@roll roll changed the title Implement non-matching-header with order_fields=True Implement non-matching-header check with order_fields=True Oct 9, 2016
@roll roll changed the title Implement non-matching-header check with order_fields=True Implement non-matching-header with order_fields=True Oct 9, 2016
@roll roll added the priority label Oct 9, 2016
@roll roll changed the title Implement non-matching-header with order_fields=True Column-to-Field mapping strategy Oct 14, 2016
@rufuspollock
Copy link
Contributor

@roll i get this is tricky. For example, we'd generally like to be able to validate a stream of rows where each row is either an array or a dict (basically either what you get from python csv.reader or csv.DictReader ...). My point in the original issue is not that we must accomodate order stuff but just that we should not ignore something if not there - if we expected a column called name and not there i think we should warn or error.

@roll roll added priority general General improvements and removed priority bug Something isn't working labels Oct 18, 2016
@roll roll added review and removed current labels Oct 31, 2016
@roll
Copy link
Member Author

roll commented Oct 31, 2016

So goodtables.next is going to do the next thing based on raised problem:

  • if there is any difference between header and field_name it will add non-matching-header error
  • but if slugified names are the same ("Number" and "/space/Number" -> "number") it will NOT stop the future inspection of the field (for other errors)

Also order_fields options will force the system to re-order fields as much as possible close to headers before check above.

roll pushed a commit that referenced this issue Oct 31, 2016
* fixed column producing for body context

* removed column from schema checks only if name slugs are different

* implemented order_fields algo

* improved comments
@roll roll closed this as completed Oct 31, 2016
@roll roll removed the review label Oct 31, 2016
roll pushed a commit that referenced this issue Nov 2, 2016
* Rebased on goodtables.next codebase (#118)

* removed current codebase

* added updated codebase

* fixed linting

* updated readme

* updated readme

* updated readme

* fixed source checks

* added dataset checks

* min style change

* removed ecode filter from filter_checks

* renamed cells back to columns + row_number

* added dataset check stubs

* implemented dataset checks

* fixed linting

* moved __inspect_table next to inspect for better reading

* fixed list.clear for python2

* added error limit to dataset errors

* updated readme

* updated readme

* added breaking note to readme

* added custom checks support

* implemented custom profiles

* fixed linting

* added options order_fields and infer_fields

* fixed extra_header

* added comments

* updated spec

* renamed unordered_headers to non_matching_header

* renamed col-number to column-number

* min

* updated added dataset errors to readme

* splitted error and check concpets

* fixed linting

* min

* fixed readme

* updated readme

* updated readme

* fixed readme

* fixed readme

* added guard assertion to checks

* updated custom checks API

* fixed linting, readme

* fixed linting, readme

* updated readme

* updated readme

* typo

* moved table errors to Inspector, deleted checks

* added ability to profilies to return errors

* fixed head checks not columns break

* rebased check on in-place erorrs update

* rebased profiles on in-place errors

* fixed readme

* fixed readme

* fixed readme

* fixed readme

* fixed readme

* no extra-header error if infer_fields is True

* implemented proper non-matching-header without ordering

* added custom_profiles, custom_checks arguments instead of global
registry

* moved default args to signatures

* removed make release (use github releases instead!)

* added return code to cli

* improved cli error formatting

* improved error messages, tests

* updated examples

* added custom examples

* added inspector tests

* added limit tests

* fixed tests

* fixed spec link

* added checks options to cli

* fixed profiles

* added description to setup

* added entry_points, keywords

* moved ckan profile to examples

* removed report from spec

* updated version to v1

* updated install instruction for now

* Fixed jsontableschema-error message (#133)

* Rebased on granular tabulator exceptions (#115)

* updated dependencies

* rebases on new tabulator exceptions

* Renamed profile to preset with simplified API (#124)

* renamed profile to preset

* removed errors, tables arguments from preset

* Added infer_schema option, updated preset API (#128)

* minor improvements

* added infer_schema option false by default

* Added support for schema constraints (#55)

* updated jsontableschema version

* implemented all constraints except unique

* implemented unique constraint check

* updated to jsontableschema-v0.8.2

* fixed linting

* Added tables preset (#125)

* added tables preset

* fixed linting

* added tables test

* Implemented order_fields option (#123)

* fixed column producing for body context

* removed column from schema checks only if name slugs are different

* implemented order_fields algo

* improved comments

* Rebased on external spec (#131)

Rebased on external spec

* Improved tests (#127)

* added prev version data examples

* implemented feature tests

* moved files to data

* updated verson

* added features

* marked spec test as xfail

* Rebased on spec-v1.0.0-alpha1 (#131)

* updated spec, added spec to API

* added config with checks order

* rebased in inspector on updated spec

* updated @check API

* rebased on spec message templates

* fixed cutom checks

* fixed linting

* Updated readme note
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
general General improvements
Projects
Archived in project
Development

No branches or pull requests

2 participants