adding handling for multi-line and sparse header structures, per recent team discussions#181
Conversation
jayckaiser
left a comment
There was a problem hiding this comment.
As a whole, this feature makes sense. This is some pretty complicated code, so I want us to simplify it where possible before we merge it into main.
| _header_rows = config.get('header_rows', 1) | ||
| _fill_sparse_headers = config.get('fill_sparse_headers', False) | ||
| if type(_header_rows) is list: | ||
| pass |
There was a problem hiding this comment.
I don't love us throwing a pass here, but I don't have a good alternative.
| ] for row in flattened_columns ] # 1. iterate over header levels | ||
| # flatten multi-line header tuples to single string col names | ||
| flattened_columns = [ | ||
| "__".join(x).removeprefix("__").removesuffix("__") # 2. join across levels, trimming |
There was a problem hiding this comment.
Is this not the same as the following:
"__".join(x).strip("__")
Or is the issue that some column names may have a single underscore as its prefix or suffix?
There was a problem hiding this comment.
Right, and .strip("__") (which is technically the same as .strip("_")) would remove single leading/trailing underscores, not just doubles.
I think you're gesturing at the fact that this could have hidden effects, which is true. We're joining tuples of values - that could include empty strings - like ("", "x", "_") → __x___, and then removing the prefix/suffix → x_ (which is correct in this case). A case where it wouldn't be correct is ("__", "x", "__") → ____x____ and after prefix/suffix removal x (it should be __x__). However, .strip("__") would get the first case wrong too. Personally I think the fact that we're using double-underscores to join & trim, which should hopefully be very uncommon in column names, makes this ok. But I'm open to push-back.
|
Thanks for the excellent and helpful review, @jayckaiser. I've incorporated the changes you suggested, re-tagging you for review, let me know if there's anything else. |
This PR adds support for multi-line and sparse header structures in
csv,tsv, andexcelsources. It does so byheader_rows)fill_sparse_headers: True) filling sparse column names to the rightAdditionally, this PR adds documentation explaining how to configure such sources in
earthmover.yaml, and anexample_project/for testing the new functionality.Per discussion at the last team meeting, this PR will supersede #179 .