Skip to content
This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Added support for regex patterns in skip_rows #290

Merged
merged 6 commits into from Jan 30, 2020
Merged

Added support for regex patterns in skip_rows #290

merged 6 commits into from Jan 30, 2020

Conversation

roll
Copy link
Member

@roll roll commented Dec 19, 2019


@akariv
@cschloer
Here is regex support for skip_rows:

skip_rows=[1, '# comment', '^# (regex|comment)']

What do you think :

  • it's OK to have one argument for it OR
  • it's better to use skip_rows and skip_rows_regex

strings to skip. If a string, it'll skip rows that begin with it
(e.g. '#' and '//').
List of row numbers, strings and regex patterns to skip.
If a string, it'll skip rows that begin with it e.g. '#' and '//'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'rows that begin with' is a bit ambiguous - perhaps 'rows that their first cells begin with the string or match the regex'

@akariv
Copy link
Member

akariv commented Dec 24, 2019

This looks good @roll , the API is better the way you implemented it (i.e. no need for separate property).
This might be a breaking change, though, in case someone is using the 'skip by string' feature and is using '^' as a first character...

@cschloer
Copy link
Contributor

cschloer commented Jan 6, 2020

For our usecase totally changing skip_rows would be okay, but it's probably better for backward compatibility to have skip_rows_regex.

@roll
Copy link
Member Author

roll commented Jan 6, 2020

Honestly, I don't think that it's a real risk that ^ will be breaking for real existent software. Probably with a proper changelog note and a new x.N.x version should be fine. But yea it's a tricky situation.

The reason I don't like skip_rows_regex is that we have skip_rows accepting string and numbers. So the current argument is already mixed.

Actually, the truly proper way could be having something like:

skip_rows # with deprecated strings as a parameter
skip_rows_text
skip_rows_regex

But it will mean changes (at least documentation) for a lot of libraries (datapackage/pipeline/etc). And our current specs/software style is more dynamic than strictly typed so I'm not sure that it's worth the trouble

@roll
Copy link
Member Author

roll commented Jan 14, 2020

@akariv
@cschloer
I've figured out another option which will definitely not be breaking and maybe more obvious because it's a well-known JavaScript notation for RegExp:

skip_rows=[1, '# comment', '/# (regex|comment)/'] # new idea
skip_rows=[1, '# comment', '^# (regex|comment)'] # initial idea

Which one do you think is better?

I have a feeling that this concept (string/regex coming from a text source like DPP) can be used in other parts of the stack so I want to ensure that the solution is good enough

@akariv
Copy link
Member

akariv commented Jan 14, 2020 via email

@cschloer
Copy link
Contributor

cschloer commented Jan 14, 2020

Using RegExp objects would mean that the regular expression option couldn't be used from a pipeline-spec.yaml (since it's only text). If we want to keep skip rows, we could also make it into a dictionary object, which can be passed in the yaml. Something like :
skip_rows=[1, '# comment', { 'type': 'regex', 'value': '^#'}]

That would keep support for simple strings and numbers but would open regular expression support.

@akariv
Copy link
Member

akariv commented Jan 23, 2020

@cschloer this is a good solution I think

@roll
Copy link
Member Author

roll commented Jan 30, 2020

Hopefully, at some point, we will find a less verbose syntax but I agree that with @cschloer's version we can't go wrong

@roll roll merged commit a6606ad into master Jan 30, 2020
@roll roll deleted the skip_rows_regex branch January 30, 2020 12:56
@roll roll moved this from In progress to Done in Pilot with BCO-DMO Jan 30, 2020
@roll
Copy link
Member Author

roll commented Jan 30, 2020

Will be tabulator@1.33

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Feature request: regular expression skip rows
3 participants