Add optional colspecs parameter to fwf #84
90866c8 to a63778d
|
@susanxiong We're not ignoring your PR. There's a broader discussion to be had about how to improve user-flexibility when adding keyword arguments to file sources without having to update Earthmover with every possible argument. |
Yeah that makes sense! I haven't marked as ready yet because I also don't know if it's even worth it vs just preprocessing to a csv. |
update FW branch with latest main, to prep for testing
|
I've tested this locally with staar: I used staar_summative_fwf_xwalk_2024.csv to construct an earthmover YAML like
version: 2
sources:
  staar_test:
    file: ./staar_sample.txt
    columns:
      - administration_date
      - grade_level_tested
      - esc_region_number
      - county_district_campus_number
      - district_name
      - campus_name
      ...
    colspecs:
      - [0,4]
      - [4,6]
      - [6,8]
      - [8,17]
      - [17,32]
      - [32,47]
      ...
destinations:
  my_destination:
    source: $sources.staar_test
    extension: jsonl
    linearize: True
and then I ran this on sample_anonymized_file.txt. The run completed with no errors, and the JSONL produced looked fine. I've added a FWF file to |
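For readers unfamiliar with `colspecs`, here is a minimal sketch of how the first four column/offset pairs from the YAML above behave when handed to `pandas.read_fwf`. It assumes earthmover forwards `columns` and `colspecs` to pandas more or less unchanged, which this thread does not confirm, and the two sample records are invented for illustration rather than taken from any STAAR file.

```python
from io import StringIO

import pandas as pd

# First four column names and offsets, copied from the YAML in this comment.
columns = [
    "administration_date",
    "grade_level_tested",
    "esc_region_number",
    "county_district_campus_number",
]
# Each pair is a half-open interval [start, end) of character positions.
colspecs = [(0, 4), (4, 6), (6, 8), (8, 17)]

# Made-up records (NOT real data); each line is exactly 17 characters wide.
sample = "05240310101912001\n05240804057905002\n"

df = pd.read_fwf(
    StringIO(sample),
    colspecs=colspecs,
    names=columns,
    header=None,  # the file has no header row
    dtype=str,    # keep leading zeros instead of inferring integers
)
print(df)
```

Because the intervals are half-open, each `end` equals the next column's `start`, which matches the `[0,4]`, `[4,6]`, `[6,8]` pattern in the YAML.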
|
Not to be a broken record but I wanted to draw attention to 3 issues we ran into while creating the CogAT bundle and make sure they're tested for here. I don't know whether these are always problems, but I think the solutions are harmless in any case. The potential issues are:
raw = Path(filepath).read_text(encoding="utf-8-sig")  # <- this is the part I'm least sure is generalizable; I don't remember why I had to do it
raw_no_blank_lines = os.linesep.join([s for s in raw.splitlines() if s.strip()])
df = pd.read_fwf(
    StringIO(raw_no_blank_lines),
    ...
)

df = pd.read_fwf(...,
    delimiter="\n\t",
    ...
)

df = pd.read_fwf(...,
    converters={c: str.rstrip for c in colnames},
    ...
) |
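The three snippets above can be combined into one self-contained sketch. The column names, offsets, and sample bytes below are hypothetical and are not taken from the CogAT bundle. The comments on what each setting does reflect a reading of the pandas documentation and are not something stated in this thread: `utf-8-sig` drops a leading byte-order mark, and `delimiter` names the filler characters that pandas strips from each field.

```python
import codecs
from io import StringIO

import pandas as pd

# Hypothetical two-column layout, invented for illustration.
colnames = ["student_id", "last_name"]
colspecs = [(0, 6), (6, 16)]

# Simulated file contents: a UTF-8 byte-order mark, a blank line between
# records, and space padding around some field values.
raw_bytes = codecs.BOM_UTF8 + b"000123  SMITH   \n\n000456JONES     \n"

# 1. Decoding as utf-8-sig removes a leading BOM if one is present.
raw = raw_bytes.decode("utf-8-sig")

# 2. Drop blank (or whitespace-only) lines before parsing.
raw_no_blank_lines = "\n".join(s for s in raw.splitlines() if s.strip())

df = pd.read_fwf(
    StringIO(raw_no_blank_lines),
    colspecs=colspecs,
    names=colnames,
    header=None,
    # 3. Treat only newline and tab as filler, so pandas leaves spaces alone.
    delimiter="\n\t",
    # 4. Strip trailing spaces explicitly; this also keeps values as strings.
    converters={c: str.rstrip for c in colnames},
)
print(df)
```

With these settings, trailing padding is removed while leading spaces inside a field survive, which may or may not be what a given assessment file needs.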
|
Great comments, thanks @johncmerfeld. (I was not aware of the specific issues with CogAT.) Do you expect them to hold universally true for other FW assessment files, or should we consider making some/all of these settings configurable in earthmover? I can make revisions to this PR to handle the situations you mention, and then post an update back here. |
|
It's possible that not every case will require them, but I don't think they would ever cause an issue. In my opinion they're additional guardrails around the data, so we might as well make them universal. |
|
I have concerns regarding these three points.
I have no experience with Fixed-Width files, so maybe these are silly questions. Please let me know! |
|
The level of configurability of this type of |
|
On the above three points, and per our conversation yesterday:
@jayckaiser would you mind giving this one more glance and a 👍 if you're good to merge? Thanks! |
johncmerfeld left a comment
Still would love to hear Jay's take but FWIW this seems good to me. As you say, we can always revisit if we find weird edge cases.
Closes #63
needs to be tested