Add optional colspecs parameter to fwf by susanxiong · Pull Request #84 · edanalytics/earthmover

susanxiong · 2024-04-19T16:02:10Z

Closes #63

needs to be tested

jayckaiser · 2024-04-26T19:02:40Z

@susanxiong We're not ignoring your PR. There's a broader discussion to be had about how to improve user-flexibility when adding keyword arguments to file sources, but without having to update Earthmover with every possible argument.

susanxiong · 2024-04-26T19:15:33Z

Yeah that makes sense! I haven't marked as ready yet because I also don't know if it's even worth it vs just preprocessing to a csv.

tomreitz · 2024-10-18T15:18:42Z

I've tested this locally with staar: I used staar_summative_fwf_xwalk_2024.csv to construct an earthmover YAML like

version: 2

sources:
  staar_test:
    file: ./staar_sample.txt
    columns:
      - administration_date
      - grade_level_tested
      - esc_region_number
      - county_district_campus_number
      - district_name
      - campus_name
      ...
    colspecs:
      - [0,4]
      - [4,6]
      - [6,8]
      - [8,17]
      - [17,32]
      - [32,47]
      ...

destinations:
  my_destination:
    source: $sources.staar_test
    extension: jsonl
    linearize: True

and then I ran this on sample_anonymized_file.txt. The run completed with no errors, and the JSONL produced looked fine.
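For readers less familiar with fixed-width files, here is a minimal sketch of what a configuration like the one above amounts to in pandas. It assumes the source's columns and colspecs are handed to pd.read_fwf more or less directly, which this thread does not spell out; the sample rows are made up and dtype=str is an assumption.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample rows; the widths follow the first three colspecs above.
sample = "04240310\n04240511\n"

columns = ["administration_date", "grade_level_tested", "esc_region_number"]
colspecs = [[0, 4], [4, 6], [6, 8]]  # as they would arrive from the YAML

df = pd.read_fwf(
    StringIO(sample),
    colspecs=[tuple(spec) for spec in colspecs],  # half-open [start, end) offsets
    names=columns,
    header=None,  # the file has no header row
    dtype=str,    # assumption: keep every field as text
)
print(df)
```

Each colspec is a half-open character range, so [4,6] covers the fifth and sixth characters of every line.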

I've added a FWF file to example_project/07_filetypes/ and updated the test suite. Re-tagging you @jayckaiser for a quick glance - if it looks ok, I'll merge.

johncmerfeld · 2024-10-22T20:39:31Z

Not to be a broken record but I wanted to draw attention to 3 issues we ran into while creating the CogAT bundle and make sure they're tested for here. I don't know whether these are always problems, but I think the solutions are harmless in any case. The potential issues are:

1. Corrupted first line – Pandas didn't process the first line of the file correctly and the data came through misaligned. The fix was to read the entire file into a string and then have Pandas read from that:

import os
from io import StringIO
from pathlib import Path
import pandas as pd

raw = Path(filepath).read_text(encoding="utf-8-sig") # <- this is the part I'm least sure is generalizable; I don't remember why I had to do it
raw_no_blank_lines = os.linesep.join([s for s in raw.splitlines() if s.strip()])
df = pd.read_fwf(
    StringIO(raw_no_blank_lines),
    ...
)

2. Pandas strips leading whitespace – see my full note in the linked file, but the fix boils down to

df = pd.read_fwf(...,
    delimiter="\n\t",
    ...
)

3. Pandas removes leading 0s from numbers – see my full note in the linked file, but the fix boils down to

df = pd.read_fwf(...,
    converters={c: str.rstrip for c in colnames},
    ...
)
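The second and third behaviors can be reproduced on a tiny in-memory example. This is a sketch with made-up data and column names, not code from the CogAT bundle. It compares the pandas defaults with the delimiter and converters settings quoted above.

```python
from io import StringIO

import pandas as pd

# Two hypothetical 13-character records: a 7-character id and a 6-character label.
data = "0000101  abc \n0000007   de \n"
colspecs = [(0, 7), (7, 13)]
names = ["id", "label"]

# Defaults: spaces are treated as filler and numeric-looking fields are inferred as numbers.
default = pd.read_fwf(StringIO(data), colspecs=colspecs, names=names, header=None)

# With the two settings from the comment above: spaces are no longer filler,
# and each field passes through str.rstrip, so it stays a string.
guarded = pd.read_fwf(
    StringIO(data),
    colspecs=colspecs,
    names=names,
    header=None,
    delimiter="\n\t",
    converters={c: str.rstrip for c in names},
)

print(default)
print(guarded)
```

In this example the default read turns 0000101 into the integer 101 and trims the label, while the second read keeps the leading zeros and the leading spaces. Once spaces stop being filler, trailing padding stays in each field unless something like str.rstrip removes it, which may be one reason the right-strip was there, though that is a guess.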

tomreitz · 2024-10-23T14:59:42Z

Great comments, thanks @johncmerfeld. (I was not aware of the specific issues with CogAT.) Do you expect them to hold universally true for other FW assessment files, or should we consider making some/all of these settings configurable in earthmover?

I can make revisions to this PR to handle these situations you mention, and then post an update back here.

johncmerfeld · 2024-10-23T15:02:34Z

It's possible that not every case will require them, but I don't think they would ever cause an issue. In my opinion they're additional guardrails around the data, so we might as well make them universal.

jayckaiser · 2024-10-24T22:36:15Z

I have concerns regarding these three points.

Corrupted first line – Pandas didn't process the first line of the file correctly and the data came through misaligned. The fix was to read the entire file in a string and then have Pandas read from that.

Is this a universal feature where Pandas corrupts the file? By reading the entire thing into memory, we introduce a memory bottleneck that can cause server crashes.

Pandas strips leading whitespace – see my full note in the linked file, but the fix boils down to [updating the delimiter].

I want to make sure this is universal, and if not to default it to this but give the user an option to customize this field.

Pandas removes leading 0s from numbers – see my full note in the linked file, but the fix boils down to [right-stripping columns].

Is this a bug or a feature? I'm confused how right-stripping the column resolves this.

I have no experience with Fixed-Width files, so maybe these are silly questions. Please let me know!

johncmerfeld · 2024-10-24T22:52:45Z

1. @susanxiong and I never determined the root cause. I haven't used read_fwf enough to know whether this is common - I doubt it's universal since I haven't found references to the behavior anywhere on the web. Point taken about memory. If we really don't want to risk that, then this behavior should be toggleable and not on by default. But on the other hand there is the risk that data down the line will get corrupted in a way that might be hard to catch.
2. Fair point. Upon further reflection I think there are cases where the user would not want this behavior.
3. Not just right-stripping but converting to str. Without this, if you have a value like 0000101 it gets read in as 101. If the leading zeroes weren't needed, they'd be whitespace instead. To be honest I can't remember why the right-stripping was necessary, but I think casting to string is always essential for FWF.
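On the memory concern, one option that avoids reading the whole file into a string is to let pandas decode and chunk the file itself. The sketch below is hypothetical: the helper name iter_fwf_chunks, the sample layout, and the chunk size are illustrative, and it assumes the misalignment came from a byte-order mark, which was never confirmed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def iter_fwf_chunks(path, colspecs, names, chunksize=50_000):
    """Yield DataFrames of up to `chunksize` rows from a fixed-width file."""
    with pd.read_fwf(
        path,
        colspecs=colspecs,
        names=names,
        header=None,
        dtype=str,
        encoding="utf-8-sig",  # decodes UTF-8 and drops a leading BOM if present
        chunksize=chunksize,   # returns an iterator instead of one big DataFrame
    ) as reader:
        yield from reader


# Tiny demo file written with a BOM; the layout is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("0001AB\n0002CD\n0003EF\n", encoding="utf-8-sig")
    chunks = list(
        iter_fwf_chunks(sample_path, [(0, 4), (4, 6)], ["id", "code"], chunksize=2)
    )

streamed = pd.concat(chunks, ignore_index=True)
print(streamed)
```

If the root cause turns out to be something other than a BOM, the encoding argument would not help, but the chunked read still keeps memory bounded.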

jayckaiser · 2024-10-25T14:20:12Z

1. I agree that this should be a toggleable setting that defaults to False. Something like force_string would be fine.
2. Yep, let's make this customizable as well, with the delimiter config.
3. Let's default to casting column names as strings, and we can investigate more about the rstrip() functionality. I don't mind these being default if it prevents unwanted side-effects from Pandas.

The level of configurability of this type of FileSource makes me want us to reprioritize refactoring the FileSource object into separate subclasses based on filetype. We are overloading a lot of configs into FileSource, and this shows just how complex some filetypes can be.

tomreitz · 2024-11-08T18:58:05Z

On the above three points, and per our conversation yesterday:

1. I tested this branch against the CogAT sample files John shared. I'm not sure if the "corrupted first line" issue was present in these files, but they all loaded fine.
2. & 3. I followed this solution to ensure the data is loaded as strings. From there, it's up to the earthmover transformation instructions to cast, trim, lpad or rpad, etc. I've left off the delimiter for now; we can always add that later if it turns out to be needed.
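To illustrate the load-as-strings-then-transform approach, here is a small sketch. It uses dtype=str as one way to keep fields as text, which is an assumption since the linked solution is not reproduced here. Plain pandas string methods stand in for the downstream earthmover transformations, and the column names and data are made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical records: a 7-character id followed by a 3-character, right-aligned score.
data = "0000101 42\n0000007  7\n"

raw = pd.read_fwf(
    StringIO(data),
    colspecs=[(0, 7), (7, 10)],
    names=["student_id", "score"],
    header=None,
    dtype=str,  # everything arrives as text, so leading zeros survive
)

# Downstream steps decide how each field should be treated.
score_as_int = raw["score"].astype(int)        # cast
score_padded = raw["score"].str.zfill(3)       # left-pad with zeros
id_without_zeros = raw["student_id"].astype(int)

print(raw)
```

Keeping the reader type-agnostic leaves decisions such as whether 0000101 is an identifier or a number to the transformation layer.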

@jayckaiser would you mind giving this one more glance and a 👍 if you're good to merge? Thanks!

johncmerfeld

Still would love to hear Jay's take but FWIW this seems good to me. As you say, we can always revisit if we find weird edge cases.

susanxiong requested review from jayckaiser and rlittle08 April 19, 2024 16:07

Add optional colspecs parameter to fwf

a63778d

susanxiong force-pushed the feature/fwf_colspecs branch from 90866c8 to a63778d April 19, 2024 16:29

Add header and colnames to fwf params

26b41e5

jayckaiser marked this pull request as ready for review October 16, 2024 21:15

jayckaiser approved these changes Oct 16, 2024

tomreitz and others added 2 commits October 17, 2024 14:27

Merge pull request #133 from edanalytics/main

5ac827e

update FW branch with latest main, to prep for testing

adding fwf file extension mapping, fixed-width test to 07_filetypes example

96066e3

tomreitz requested a review from jayckaiser October 18, 2024 15:18

tweak to maintain string data type

8f91783

johncmerfeld approved these changes Nov 11, 2024

jayckaiser approved these changes Nov 15, 2024

Merge branch 'main' into feature/fwf_colspecs

83cce79

tomreitz merged commit c0be2e2 into main Nov 15, 2024

tomreitz deleted the feature/fwf_colspecs branch November 15, 2024 20:27

tomreitz merged 6 commits into main from feature/fwf_colspecs

4 participants