Skip to content

Add optional colspecs parameter to fwf#84

Merged
tomreitz merged 6 commits into
mainfrom
feature/fwf_colspecs
Nov 15, 2024
Merged

Add optional colspecs parameter to fwf#84
tomreitz merged 6 commits into
mainfrom
feature/fwf_colspecs

Conversation

@susanxiong
Copy link
Copy Markdown
Contributor

Closes #63

needs to be tested

@susanxiong susanxiong force-pushed the feature/fwf_colspecs branch from 90866c8 to a63778d Compare April 19, 2024 16:29
@jayckaiser
Copy link
Copy Markdown
Collaborator

@susanxiong We're not ignoring your PR. There's a broader discussion to be had about how to improve user-flexibility when adding keyword arguments to file sources, but without having to update Earthmover with every possible argument.

@susanxiong
Copy link
Copy Markdown
Contributor Author

@susanxiong We're not ignoring your PR. There's a broader discussion to be had about how to improve user-flexibility when adding keyword arguments to file sources, but without having to update Earthmover with every possible argument.

Yeah that makes sense! I haven't marked as ready yet because I also don't know if it's even worth it vs just preprocessing to a csv.

@jayckaiser jayckaiser marked this pull request as ready for review October 16, 2024 21:15
tomreitz and others added 2 commits October 17, 2024 14:27
@tomreitz
Copy link
Copy Markdown
Collaborator

I've tested this locally with staar: I used staar_summative_fwf_xwalk_2024.csv to construct an earthmover YAML like

version: 2

sources:
  staar_test:
    file: ./staar_sample.txt
    columns:
      - administration_date
      - grade_level_tested
      - esc_region_number
      - county_district_campus_number
      - district_name
      - campus_name
      ...
    colspecs:
      - [0,4]
      - [4,6]
      - [6,8]
      - [8,17]
      - [17,32]
      - [32,47]
      ...

destinations:
  my_destination:
    source: $sources.staar_test
    extension: jsonl
    linearize: True

and then I ran this on sample_anonymized_file.txt. The run complete with no errors, and the JSONL produced looked fine.

I've added a FWF file to example_project/07_filetypes/ and updated the test suite. Re-tagging you @jayckaiser for a quick glance - if it looks ok, I'll merge.

@tomreitz tomreitz requested a review from jayckaiser October 18, 2024 15:18
@johncmerfeld
Copy link
Copy Markdown
Contributor

Not to be a broken record but I wanted to draw attention to 3 issues we ran into while creating the CogAT bundle and make sure they're tested for here. I don't know whether these are always problems, but I think the solutions are harmless in any case. The potential issues are:

  1. Corrupted first line – Pandas didn't process the first line of the file correctly and the data came through misaligned. The fix was to read the entire file in a string and then have Pandas read from that:
raw = Path(filepath).read_text(encoding="utf-8-sig") # <- this is the part I'm least sure is generalizable; I don't remember why I had to do it
raw_no_blank_lines = os.linesep.join([s for s in raw.splitlines() if s.strip()])
df = pd.read_fwf(
    StringIO(raw_no_blank_lines),
    ...
)
  1. Pandas strips leading whitespace – see my full note in the linked file, but the fix boils down to
df = pd.read_fwf(...,
    delimiter="\n\t",
    ...
)
  1. Pandas removes leading 0s from numbers – see my full note in the linked file, but the fix boils down to
df = pd.read_fwf(...,
     converters={c: str.rstrip for c in colnames},
    ...
)

@tomreitz
Copy link
Copy Markdown
Collaborator

Great comments, thanks @johncmerfeld. (I was not aware of the specific issues with CogAT.) Do you expect them to hold universally true for other FW assessment files, or should we consider making some/all of these settings configurable in earthmover?

I can make revisions to this PR to handle thes situations you mention, and then post an update back here.

@johncmerfeld
Copy link
Copy Markdown
Contributor

It's possible that not every case will require them, but I don't think they would ever cause an issue. In my opinion they're additional guardrails around the data, so we might as well make them universal.

@jayckaiser
Copy link
Copy Markdown
Collaborator

I have concerns regarding these three points.

  1. Corrupted first line – Pandas didn't process the first line of the file correctly and the data came through misaligned. The fix was to read the entire file in a string and then have Pandas read from that.
  • Is this a universal feature where Pandas corrupts the file? By reading the entire thing into memory, we introduce a memory bottleneck that can cause server crashes.
  1. Pandas strips leading whitespace – see my full note in the linked file, but the fix boils down to [updating the delimiter].
  • I want to make sure this is universal, and if not to default it to this but give the user an option to customize this field.
  1. Pandas removes leading 0s from numbers – see my full note in the linked file, but the fix boils down to [right-stripping columns].
  • Is this a bug or a feature? I'm confused how right-stripping the column resolves this.

I have no experience with Fixed-Width files, so maybe these are silly questions. Please let me know!

@johncmerfeld
Copy link
Copy Markdown
Contributor

I have concerns regarding these three points.

  1. Corrupted first line – Pandas didn't process the first line of the file correctly and the data came through misaligned. The fix was to read the entire file in a string and then have Pandas read from that.
  • Is this a universal feature where Pandas corrupts the file? By reading the entire thing into memory, we introduce a memory bottleneck that can cause server crashes.
  1. Pandas strips leading whitespace – see my full note in the linked file, but the fix boils down to [updating the delimiter].
  • I want to make sure this is universal, and if not to default it to this but give the user an option to customize this field.
  1. Pandas removes leading 0s from numbers – see my full note in the linked file, but the fix boils down to [right-stripping columns].
  • Is this a bug or a feature? I'm confused how right-stripping the column resolves this.

I have no experience with Fixed-Width files, so maybe these are silly questions. Please let me know!

  1. @susanxiong and I never determined the root cause. I haven't used read_fwf enough to know whether this is common - I doubt it's universal since I haven't found references to the behavior anywhere on the web. Point taken about memory. If we really don't want to risk that, then this behavior should be toggleable and not on by default. But on the other hand there is the risk that data down the line will get corrupted in a way that might be hard to catch.

  2. Fair point. Upon further reflection I think there are cases where the user would not want this behavior.

  3. Not just right-stripping but converting to str. Without this, if you have a value like 0000101 it gets read in as 101. If the leading zeroes weren't needed, they'd be whitespace instead. To be honest I can't remember why the right-stripping was necessary, but I think casting to string is always essential for FWF.

@jayckaiser
Copy link
Copy Markdown
Collaborator

  1. @susanxiong and I never determined the root cause. I haven't used read_fwf enough to know whether this is common - I doubt it's universal since I haven't found references to the behavior anywhere on the web. Point taken about memory. If we really don't want to risk that, then this behavior should be toggleable and not on by default. But on the other hand there is the risk that data down the line will get corrupted in a way that might be hard to catch.
  2. Fair point. Upon further reflection I think there are cases where the user would not want this behavior.
  3. Not just right-stripping but converting to str. Without this, if you have a value like 0000101 it gets read in as 101. If the leading zeroes weren't needed, they'd be whitespace instead. To be honest I can't remember why the right-stripping was necessary, but I think casting to string is always essential for FWF.
  1. I agree that this should be a toggleable-setting that defaults to False. Something like force_string would be fine.
  2. Yep, let's make this customizable as well, with the delimiter config.
  3. Let's default to casting column names as strings, and we can investigate more about the rstrip() functionality. I don't mind these being default if it prevents unwanted side-effects from Pandas.

The level of configurability of this type of FileSource makes me want us to reprioritize refactoring the FileSource object into separate subclasses based on filetype. We are overloading a lot of configs into FileSource, and this shows just how complex some filetypes can be.

@tomreitz
Copy link
Copy Markdown
Collaborator

tomreitz commented Nov 8, 2024

On the above three points, and per our conversation yesterday:

  1. I tested this branch against the CogAT sample files John shared. I'm not sure if the "corrupted first line" issue was present in these files, but they all loaded fine.
  2. & 3. I followed this solution to ensure the data is loaded as strings. From there, it's up to the earthmover transformation instructions to cast, trim, lpad or rpad, etc. I've left off the delimiter for now, we can always add that later if it turns out to be needed.

@jayckaiser would you mind giving this one more glance and a 👍 if you're good to merge? Thanks!

Copy link
Copy Markdown
Contributor

@johncmerfeld johncmerfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still would love to hear Jay's take but FWIW this seems good to me. As you say, we can always revisit if we find weird edge cases.

@tomreitz tomreitz merged commit c0be2e2 into main Nov 15, 2024
@tomreitz tomreitz deleted the feature/fwf_colspecs branch November 15, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle fixed-width files with colspecs

4 participants