-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: invalid header lines should be ignored, not parsed. #78
Conversation
5281afe
to
077fccb
Compare
Let me know if you want me to take a look at this when you're done. |
@cantino That would be great! I finally got all the tests to pass, I'm ready for you to take a look at it. Thanks! |
The change ended up being a little bigger than I anticipated, I had some in-place changes in my own version of reckon (stripping out the BOM marker) that I wanted to get added here. The changes to app_spec are because sort isn't stable, so sorting by date in each_with_backwards meant that the Book Store transaction wasn't always row 7, so I changed it to look for the string, instead of by index. The date_column change to also sort by index was a similar issue, sort isn't stable, so either date field in the Broker Canada example could've been returned. Adding an index to use the column that came first seemed like it would be correct more often (at least it holds true for the 3-4 csv files I process from my financial institutions). Using rchardet is so we can correctly parse non-utf8 files that don't specify an encoding using the encoding option. In the provided for the extractofake.csv file, we should be able to represent all those characters with diacritics correctly, rather than converting them to ?. I think that about covers my changes. Also, now that I'm writing out what changed, I should rework my commits and commit messages, some of those commits cover more than a single change. Maybe hold off on your review until I do that, hopefully it will be easier to understand that way. |
Sure, no rush.
…On Fri, Jan 31, 2020 at 3:07 PM Ben Prew ***@***.***> wrote:
The change ended up being a little bigger than I anticipated, I had some
in-place changes in my own version of reckon (stripping out the BOM marker)
that I wanted to get added here.
The changes to app_spec are because sort isn't stable, so sorting by date
in each_with_backwards meant that the Book Store transaction wasn't always
row 7, so I changed it to look for the string, instead of by index.
The date_column change to also sort by index was a similar issue, sort
isn't stable, so either date field in the Broker Canada example could've
been returned. Adding an index to use the column that came first seemed
like it would be correct more often (at least it holds true for the 3-4 csv
files I process from my financial institutions).
Using rchardet is so we can correctly parse non-utf8 files that don't
specify an encoding using the encoding option. In the provided for the
extractofake.csv file, we should be able to represent all those characters
with diacritics correctly, rather than converting them to ?.
I think that about covers my changes. Also, now that I'm writing out what
changed, I should rework my commits and commit messages, some of those
commits cover more than a single change.
Maybe hold off on your review until I do that, hopefully it will be easier
to understand that way.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#78?email_source=notifications&email_token=AAAUO6Y7CL4LVFKPRZXKGFDRASVL3A5CNFSM4KOCUCL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKQJQYA#issuecomment-580950112>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUO6YADQXJMV72ALEOPCTRASVL3ANCNFSM4KOCUCLQ>
.
|
Sort isn't stable, so sorting by date in each_with_backwards meant that the "Book Store" transaction wasn't always row 7, so look for the string, instead of by index.
…n first Sorting by date_score isn't stable, so either date field for Broker Canada data could've been returned. Added index to the sort key to use the column that came first. This behavior matches the 3-4 csv files I process from my financial institutions.
High Sierra installs 2.0, so it's unlikely that someone would have a ruby < 2.0 installed. High Sierra is 2 versions behind the current OSx version (Catalina).
Since we throw them away anyway, we should just skip them
If the user doesn't pass an encoding option, we try to determine the encoding of the file using CharDet, then convert it to UTF-8 before parsing it as CSV. Also, strip the BOM, if it exists. Fall back to BINARY as a last resort
f5673d0
to
0e9e977
Compare
Ok @cantino, cleaned up and ready for you to take a look. Thanks |
This looks great @benprew. I haven't tried running it, but the refactors are nice. Go for it. |
Great, thanks. I ran it against a couple of my files and didn't have any problems. I'll merge this. |
No description provided.