Added support for regex patterns in skip_rows #290
tabulator/stream.py
Outdated
strings to skip. If a string, it'll skip rows that begin with it
(e.g. '#' and '//').
List of row numbers, strings and regex patterns to skip.
If a string, it'll skip rows that begin with it e.g. '#' and '//'.
'rows that begin with' is a bit ambiguous; perhaps 'rows whose first cell begins with the string or matches the regex'.
This looks good @roll, the API is better the way you implemented it (i.e. no need for a separate property).
For our use case, totally changing skip_rows would be okay, but it's probably better for backward compatibility to have skip_rows_regex.
Honestly, I don't think that it's a real risk that The reason I don't like Actually, the truly proper way could be having something like:
But it will mean changes (at least to documentation) for a lot of libraries (datapackage/pipeline/etc). And our current specs/software style is more dynamic than strictly typed, so I'm not sure that it's worth the trouble.
@akariv
Which one do you think is better? I have a feeling that this concept (string/regex coming from a text source like DPP) can be used in other parts of the stack, so I want to ensure that the solution is good enough.
It has the same issues as before, as e.g. `/* comment */` will be treated
as a regexp.
What if we used RegExp objects when we wanted to specify regular
expressions? i.e. the result of `re.compile()`
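A minimal sketch of that suggestion, assuming a hypothetical helper `should_skip` (not part of tabulator) that tests a row's first cell: plain strings stay literal prefixes, and only the result of `re.compile()` is treated as a regular expression, so `/* comment */` is never misread as a pattern.

```python
import re

# Type of compiled patterns, obtained portably across Python 3 versions.
_PATTERN_TYPE = type(re.compile(''))


def should_skip(row_number, row, skip_rows):
    """Hypothetical helper: True if the row matches any skip_rows entry.

    Integers match the row number, strings are literal prefixes of the
    first cell, and compiled patterns are searched in the first cell.
    """
    first_cell = str(row[0]) if row else ''
    for rule in skip_rows:
        if isinstance(rule, _PATTERN_TYPE):
            if rule.search(first_cell):
                return True
        elif isinstance(rule, str):
            if first_cell.startswith(rule):
                return True
        elif isinstance(rule, int):
            if rule == row_number:
                return True
    return False


rules = [1, '/* comment */', re.compile(r'^# (regex|comment)')]
print(should_skip(2, ['/* comment */ note', 'x'], rules))
print(should_skip(3, ['# regex', 'x'], rules))
print(should_skip(4, ['data', 'x'], rules))
```

The trade-off raised later in the thread still applies: compiled pattern objects can only be created in Python code, not in a text-only spec.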
…On Tue, Jan 14, 2020, 15:51 roll ***@***.***> wrote:
@akariv <https://github.com/akariv>
@cschloer <https://github.com/cschloer>
I've figured out another option which will definitely not be breaking and
maybe more obvious because it's a well-known JavaScript notation for RegExp:
skip_rows=[1, '# comment', '/# (regex|comment)/'] # new idea
skip_rows=[1, '# comment', '^# (regex|comment)'] # initial idea
Which one do you think is better?
I have a feeling that this concept (string/regex coming from a text source) can be used in other parts of the stack so I want to ensure that the solution is good enough
Using RegExp objects would mean that the regular expression option couldn't be used from a pipeline-spec.yaml (since it's only text). If we want to keep skip rows, we could also make it into a dictionary object, which can be passed in the YAML. Something like: That would keep support for simple strings and numbers but would open regular expression support.
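The exact dictionary shape proposed in this comment is not preserved here, so the following is only a sketch under assumptions: the keys `type` and `value` and the helper name `compile_skip_rule` are hypothetical. It shows how a YAML-friendly dictionary entry could be turned into a compiled pattern while numbers and plain strings pass through unchanged.

```python
import re


def compile_skip_rule(rule):
    """Hypothetical helper: normalize one skip_rows entry.

    Assumed dictionary shape: {'type': 'regex', 'value': '<pattern>'}.
    Integers and plain strings are returned as-is.
    """
    if isinstance(rule, dict):
        if rule.get('type') == 'regex':
            return re.compile(rule['value'])
        raise ValueError('Unsupported skip_rows entry: %r' % (rule,))
    return rule


# Entries as they might arrive after parsing a text-only spec.
raw = [1, '/* comment */', {'type': 'regex', 'value': '^# (regex|comment)'}]
rules = [compile_skip_rule(rule) for rule in raw]
print(rules[:2])
print(rules[2].pattern)
```

Because a regex is only ever created from an explicit dictionary, existing string entries keep their literal meaning, which is the backward-compatibility point made in the thread.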
@cschloer this is a good solution, I think.
Hopefully, at some point, we will find a less verbose syntax, but I agree that with @cschloer's version we can't go wrong.
Will be |
@akariv @cschloer
Here is regex support for `skip_rows`:
What do you think: `skip_rows` and `skip_rows_regex`