adding handling for multi-line and sparse header structures, per recent team discussions by tomreitz · Pull Request #181 · edanalytics/earthmover

tomreitz · 2025-12-15T18:21:57Z

This PR adds support for multi-line and sparse header structures in csv, tsv, and excel sources. It does so by:

1. reading the first few lines of an input file (those specified by a list of header_rows)
2. (if fill_sparse_headers: True) filling sparse column names to the right
3. flattening multi-index headers by concatenating levels with double-underscores
4. specifying the result as the column names to use when reading the file's data
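The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: the helper names `fill_right` and `flatten_headers` are hypothetical, `header_rows` is assumed to hold zero-based row indexes, and filling is assumed to apply to every header level. Only the `header_rows` and `fill_sparse_headers` option names and the double-underscore join come from the PR itself.

```python
import csv
import io


def fill_right(row):
    """Fill empty cells with the nearest non-empty value to their left."""
    filled, last = [], ""
    for cell in row:
        last = cell if cell != "" else last
        filled.append(last)
    return filled


def flatten_headers(text, header_rows, fill_sparse_headers=False, delimiter=","):
    """Return (column_names, data_rows) for delimited text with multi-line headers.

    Assumption: header_rows is a list of zero-based row indexes.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Step 1: pick out the header lines.
    levels = [rows[i] for i in header_rows]
    # Step 2: optionally fill sparse names to the right.
    if fill_sparse_headers:
        levels = [fill_right(level) for level in levels]
    # Step 3: join the levels of each column with "__", trimming the
    # separator left behind by an empty first or last level.
    columns = [
        "__".join(parts).removeprefix("__").removesuffix("__")
        for parts in zip(*levels)
    ]
    # Step 4: everything after the last header row is data.
    data = rows[max(header_rows) + 1:]
    return columns, data


sample = (
    "student,scores,,attendance\n"
    "id,math,reading,days\n"
    "1001,90,85,170\n"
)
columns, data = flatten_headers(sample, header_rows=[0, 1], fill_sparse_headers=True)
print(columns)  # ['student__id', 'scores__math', 'scores__reading', 'attendance__days']
```

Without filling, the third column in this sample would flatten to `reading`, because its first-level cell is empty and the leading separator is trimmed. The sketch needs Python 3.9 or later for `str.removeprefix` and `str.removesuffix`.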

Additionally, this PR adds documentation explaining how to configure such sources in earthmover.yaml, and an example_project/ for testing the new functionality.

Per discussion at the last team meeting, this PR will supersede #179.

jayckaiser

As a whole, this feature makes sense. This is some pretty complicated code, so I want us to simplify it where possible before we merge it into main.

jayckaiser · 2025-12-15T23:19:11Z

+            _header_rows = config.get('header_rows', 1)
+            _fill_sparse_headers = config.get('fill_sparse_headers', False)
+            if type(_header_rows) is list:
+                pass


I don't love us throwing a pass here, but I don't have a good alternative.

jayckaiser · 2025-12-15T23:21:48Z

+                    ] for row in flattened_columns ]                # 1. iterate over header levels
+                # flatten multi-line header tuples to single string col names
+                flattened_columns = [
+                    "__".join(x).removeprefix("__").removesuffix("__") # 2. join across levels, trimming


Is this not the same as the following:

"__".join(x).strip("__")

Or is the issue that some column names may have a single underscore as their prefix or suffix?

Right, and .strip("__") (which is technically the same as .strip("_")) would remove single leading/trailing underscores, not just doubles.

I think you're gesturing at the fact that this could have hidden effects, which is true. We're joining tuples of values, which could include empty strings, like ("", "x", "_") → __x___, and then removing the prefix/suffix → x_ (which is correct in this case). A case that looks risky is ("__", "x", "__") → ____x____, but .removeprefix() and .removesuffix() each remove at most one occurrence, so the result is __x__ rather than x. By contrast, .strip("__") would reduce that case to x, and it would get the first case wrong too. Personally I think the fact that we're using double-underscores to join & trim, which should hopefully be very uncommon in column names, makes this ok. But I'm open to push-back.
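The two trimming approaches discussed here can be compared directly. The sketch below uses hypothetical helper names (`trim_affixes`, `trim_strip`) and made-up header tuples. It relies only on standard `str` behavior: `strip` treats its argument as a set of characters, while `removeprefix` and `removesuffix` (Python 3.9+) each remove at most one exact occurrence.

```python
def trim_affixes(levels):
    """Join header levels, then drop one leading and one trailing "__"."""
    return "__".join(levels).removeprefix("__").removesuffix("__")


def trim_strip(levels):
    """Join header levels, then strip every leading/trailing underscore."""
    return "__".join(levels).strip("__")


cases = [
    ("", "x", "_"),     # empty first level, last level is a single underscore
    ("__", "x", "__"),  # levels that are themselves double underscores
    ("", "_id", ""),    # column name with a single leading underscore
]

for levels in cases:
    print(levels, repr(trim_affixes(levels)), repr(trim_strip(levels)))

# ('', 'x', '_')    'x_'     'x'
# ('__', 'x', '__') '__x__'  'x'
# ('', '_id', '')   '_id'    'id'
```

In each case `strip` also removes single underscores that belong to the column name, which is the concern raised above.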

tomreitz · 2025-12-16T15:52:34Z

Thanks for the excellent and helpful review, @jayckaiser. I've incorporated the changes you suggested and am re-tagging you for review. Let me know if there's anything else.

Tom Reitz added 3 commits December 15, 2025 12:06

adding handling for multi-line and sparse header structures, per rece…

b5458c3

…nt team discussions

remove unneeded str test

16a5db1

improve var naming

75f21c7

tomreitz requested review from jayckaiser and johncmerfeld December 15, 2025 18:27

jayckaiser reviewed Dec 15, 2025

updates per review from Jay

b4a11aa

tomreitz requested a review from jayckaiser December 16, 2025 15:52

tomreitz mentioned this pull request Dec 17, 2025

CIRCLE fix for no-match student ID file header edanalytics/earthmover_edfi_bundles#256

Draft

jayckaiser approved these changes Dec 18, 2025

johncmerfeld mentioned this pull request Jan 7, 2026

Remove preprocessor script from TX KEA edanalytics/earthmover_edfi_bundles#234

Closed

prep for earthmover 0.4.8 release

1bddb6d

tomreitz merged commit 8791949 into main Jan 9, 2026

tomreitz deleted the feature/multiline_sparse_headers branch January 9, 2026 19:46

tomreitz merged 5 commits into main from feature/multiline_sparse_headers
