Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: invalid header lines should be ignored, not parsed. #78

Merged
merged 8 commits into from
Feb 1, 2020

Conversation

benprew
Copy link
Collaborator

@benprew benprew commented Jan 31, 2020

No description provided.

@benprew benprew force-pushed the ignore-invalid-header-lines branch 2 times, most recently from 5281afe to 077fccb Compare January 31, 2020 22:37
@cantino
Copy link
Owner

cantino commented Jan 31, 2020

Let me know if you want me to take a look at this when you're done.

@benprew
Copy link
Collaborator Author

benprew commented Jan 31, 2020

@cantino That would be great! I finally got all the tests to pass, I'm ready for you to take a look at it.

Thanks!

@benprew
Copy link
Collaborator Author

benprew commented Jan 31, 2020

The change ended up being a little bigger than I anticipated, I had some in-place changes in my own version of reckon (stripping out the BOM marker) that I wanted to get added here.

The changes to app_spec are because sort isn't stable, so sorting by date in each_with_backwards meant that the Book Store transaction wasn't always row 7, so I changed it to look for the string, instead of by index.

The date_column change to also sort by index was a similar issue, sort isn't stable, so either date field in the Broker Canada example could've been returned. Adding an index to use the column that came first seemed like it would be correct more often (at least it holds true for the 3-4 csv files I process from my financial institutions).

Using rchardet is so we can correctly parse non-utf8 files that don't specify an encoding using the encoding option. In the provided for the extractofake.csv file, we should be able to represent all those characters with diacritics correctly, rather than converting them to ?.

I think that about covers my changes. Also, now that I'm writing out what changed, I should rework my commits and commit messages, some of those commits cover more than a single change.

Maybe hold off on your review until I do that, hopefully it will be easier to understand that way.

@cantino
Copy link
Owner

cantino commented Jan 31, 2020 via email

Sort isn't stable, so sorting by date in each_with_backwards meant that the "Book Store"
transaction wasn't always row 7, so look for the string, instead of by index.
…n first

Sorting by date_score isn't stable, so either date field for Broker Canada data could've
been returned. Added index to the sort key to use the column that came first.  This
behavior matches the 3-4 csv files I process from my financial institutions.
High Sierra installs 2.0, so it's unlikely that someone would have a ruby < 2.0
installed. High Sierra is 2 versions behind the current OSx version (Catalina).
Since we throw them away anyway, we should just skip them
If the user doesn't pass an encoding option, we try to determine the encoding of the
file using CharDet, then convert it to UTF-8 before parsing it as CSV.  Also, strip the
BOM, if it exists.  Fall back to BINARY as a last resort
@benprew
Copy link
Collaborator Author

benprew commented Feb 1, 2020

Ok @cantino, cleaned up and ready for you to take a look.

Thanks

@cantino
Copy link
Owner

cantino commented Feb 1, 2020

This looks great @benprew. I haven't tried running it, but the refactors are nice. Go for it.

@benprew benprew merged commit af3d029 into master Feb 1, 2020
@benprew benprew deleted the ignore-invalid-header-lines branch February 1, 2020 01:31
@benprew
Copy link
Collaborator Author

benprew commented Feb 1, 2020

Great, thanks. I ran it against a couple of my files and didn't have any problems. I'll merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants