adding handling for multi-line and sparse header structures, per recent team discussions #181

Merged
tomreitz merged 5 commits into main from feature/multiline_sparse_headers on Jan 9, 2026

@tomreitz tomreitz commented Dec 15, 2025

This PR adds support for multi-line and sparse header structures in csv, tsv, and excel sources. It does so by

  1. reading the first few lines of an input file (those specified by a list of header_rows)
  2. (if fill_sparse_headers: True) filling sparse column names to the right
  3. flattening multi-index headers by concatenating levels with double-underscores
  4. specifying the result as the column names to use when reading the file's data
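The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the code merged in this PR: the helper name `flatten_headers`, the use of the `csv` module, the sample data, and the reading of `header_rows` as zero-based row indices are all assumptions made for the example.

```python
import csv
import io


def flatten_headers(rows, fill_sparse=False, sep="__"):
    """Collapse several header rows into one list of column names.

    Hypothetical helper for illustration only; not earthmover's API.
    """
    if fill_sparse:
        # Step 2: forward-fill empty header cells from the cell to their left.
        filled = []
        for row in rows:
            last = ""
            new_row = []
            for cell in row:
                if cell.strip():
                    last = cell
                new_row.append(last)
            filled.append(new_row)
        rows = filled
    # Step 3: join the levels of each column, trimming one leading and one
    # trailing separator left behind by empty levels.
    return [
        sep.join(levels).removeprefix(sep).removesuffix(sep)
        for levels in zip(*rows)  # assumes all header rows have equal length
    ]


raw = "student,,scores,\nid,name,math,reading\n1,Ana,90,85\n2,Ben,70,95\n"
lines = list(csv.reader(io.StringIO(raw)))

header_rows = [0, 1]  # assumption: zero-based indices of the header lines
header = [lines[i] for i in header_rows]              # step 1
columns = flatten_headers(header, fill_sparse=True)   # steps 2 and 3
data = [dict(zip(columns, r)) for r in lines[max(header_rows) + 1:]]  # step 4

print(columns)
# ['student__id', 'student__name', 'scores__math', 'scores__reading']
```

With `fill_sparse=False`, the empty cells in the first header row are left as they are, so the second and fourth columns would come out as plain `name` and `reading`.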

Additionally, this PR adds documentation explaining how to configure such sources in earthmover.yaml, and an example_project/ for testing the new functionality.

Per discussion at the last team meeting, this PR will supersede #179.

@jayckaiser jayckaiser left a comment

As a whole, this feature makes sense. This is some pretty complicated code, so I want us to simplify it where possible before we merge it into main.

Comment thread earthmover/nodes/source.py Outdated
_header_rows = config.get('header_rows', 1)
_fill_sparse_headers = config.get('fill_sparse_headers', False)
if type(_header_rows) is list:
    pass

I don't love us throwing a pass here, but I don't have a good alternative.

Comment thread earthmover/nodes/source.py Outdated
] for row in flattened_columns ] # 1. iterate over header levels
# flatten multi-line header tuples to single string col names
flattened_columns = [
"__".join(x).removeprefix("__").removesuffix("__") # 2. join across levels, trimming

Is this not the same as the following:

"__".join(x).strip("__")

Or is the issue that some column names may have a single underscore as their prefix or suffix?

Collaborator Author

Right, and .strip("__") (which is technically the same as .strip("_")) would remove single leading/trailing underscores, not just doubles.

I think you're gesturing at the fact that this could have hidden effects, which is true. We're joining tuples of values - that could include empty strings - like ("", "x", "_") → __x___, and then removing the prefix/suffix → x_ (which is correct in this case). A case where the trimming does change a value is ("__", "x", "__") → ____x____, which after prefix/suffix removal becomes __x__, since removeprefix/removesuffix each remove only one occurrence. However, .strip("__") would reduce both cases to x, getting the first case wrong too. Personally I think the fact that we're using double-underscores to join & trim, which should hopefully be very uncommon in column names, makes this ok. But I'm open to push-back.
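The difference between the two approaches in this thread can be checked directly. The sketch below is only an illustration: the function names `join_and_trim` and `join_and_strip` and the third test tuple are made up for the comparison. `str.strip` treats its argument as a set of characters and removes all of them from both ends, while `str.removeprefix` and `str.removesuffix` remove at most one exact occurrence.

```python
def join_and_trim(levels, sep="__"):
    # Approach from the diff: drop one leading and one trailing separator.
    return sep.join(levels).removeprefix(sep).removesuffix(sep)


def join_and_strip(levels, sep="__"):
    # Suggested alternative: strip() removes every leading/trailing "_".
    return sep.join(levels).strip(sep)


cases = [
    ("", "x", "_"),     # empty first level, single-underscore last level
    ("__", "x", "__"),  # levels that are themselves the separator
    ("_a", ""),         # single leading underscore, empty last level
]

for levels in cases:
    print(levels, "->", join_and_trim(levels), "|", join_and_strip(levels))
# ('', 'x', '_') -> x_ | x
# ('__', 'x', '__') -> __x__ | x
# ('_a', '') -> _a | a
```

In all three cases `strip` discards underscores that belong to the column names themselves, which is the behavior the reply above is trying to avoid.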

@tomreitz

Thanks for the excellent and helpful review, @jayckaiser. I've incorporated the changes you suggested and am re-tagging you for review; let me know if there's anything else.

@tomreitz tomreitz merged commit 8791949 into main Jan 9, 2026
@tomreitz tomreitz deleted the feature/multiline_sparse_headers branch January 9, 2026 19:46