bug: invalid header lines should be ignored, not parsed. #78

benprew · 2020-01-31T07:24:10Z

No description provided.

cantino · 2020-01-31T22:52:15Z

Let me know if you want me to take a look at this when you're done.

benprew · 2020-01-31T22:54:38Z

@cantino That would be great! I finally got all the tests to pass, so I'm ready for you to take a look at it.

Thanks!

benprew · 2020-01-31T23:07:41Z

The change ended up being a little bigger than I anticipated. I had some in-place changes in my own version of reckon (stripping out the BOM) that I wanted to get added here.

The changes to app_spec are because sort isn't stable: sorting by date in each_with_backwards meant that the Book Store transaction wasn't always row 7, so I changed the test to look for the string instead of relying on the index.
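A minimal sketch of the idea, not the actual spec: the `Row` struct and the sample data are hypothetical. When two rows share a date, their relative order after sorting is not guaranteed, so the lookup matches on the description rather than on a fixed position.

```ruby
# Hypothetical row type and data, for illustration only.
Row = Struct.new(:date, :description, :amount)

rows = [
  Row.new("2020-01-03", "Salary", 1000.0),
  Row.new("2020-01-02", "Book Store #42", -12.5),
  Row.new("2020-01-02", "Coffee", -3.0)
]

# Two rows share a date, so their relative order after sorting is unspecified.
sorted = rows.sort_by(&:date)

# Fragile: sorted[0] could be either of the 2020-01-02 rows.
# Robust: find the row by its content instead of its position.
book_store = sorted.find { |row| row.description.include?("Book Store") }
```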

The date_column change to also sort by index addresses a similar issue. Because sort isn't stable, either date field in the Broker Canada example could have been returned. Adding the index to the sort key, so that the column that came first wins, seemed like it would be correct more often (at least it holds true for the 3-4 CSV files I process from my financial institutions).
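The tie-break can be sketched roughly as follows. The `Column` struct, the column names, and the scores are assumptions made up for the example; only the technique (adding the original index to the sort key) comes from the description above.

```ruby
# Hypothetical column type: date_score is how "date-like" the column looks.
Column = Struct.new(:name, :date_score)

columns = [
  Column.new("Trade Date", 10),
  Column.new("Description", 0),
  Column.new("Settle Date", 10)
]

# Sort by descending score, then by original position, so that ties
# always resolve to the column that appears first in the file.
best = columns.each_with_index
              .sort_by { |column, index| [-column.date_score, index] }
              .first   # => [column, index] pair with the best key
              .first   # => the column itself
```

Because the index is unique per column, the sort key is never tied, and the result no longer depends on how the underlying sort orders equal elements.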

Using rchardet lets us correctly parse non-UTF-8 files when the user doesn't specify an encoding via the encoding option. For the provided extractofake.csv file, we should be able to represent all the characters with diacritics correctly, rather than converting them to ?.
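To illustrate why knowing the source encoding matters, here is a small sketch using only Ruby's standard library. The sample bytes and the `transcode_to_utf8` helper are hypothetical; in the PR the encoding name would come from the detector rather than being passed in by hand.

```ruby
# "Transação" as ISO-8859-1 bytes: ç is 0xE7 and ã is 0xE3.
raw = "Transa\xE7\xE3o".b

# Hypothetical helper: source_encoding stands in for the detected encoding name.
def transcode_to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# With the right source encoding, the diacritics survive the conversion.
text = transcode_to_utf8(raw, "ISO-8859-1")

# Without it, the high bytes can only be replaced with a placeholder.
lossy = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
```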

I think that about covers my changes. Also, now that I'm writing out what changed, I should rework my commits and commit messages, since some of those commits cover more than a single change.

Maybe hold off on your review until I do that; hopefully it will be easier to understand that way.

cantino · 2020-01-31T23:08:42Z

Sure, no rush.



…n first Sorting by date_score isn't stable, so either date field for Broker Canada data could've been returned. Added index to the sort key to use the column that came first. This behavior matches the 3-4 csv files I process from my financial institutions.


benprew · 2020-02-01T00:01:58Z

Ok @cantino, cleaned up and ready for you to take a look.

Thanks

cantino · 2020-02-01T00:32:04Z

This looks great @benprew. I haven't tried running it, but the refactors are nice. Go for it.

benprew · 2020-02-01T02:34:56Z

Great, thanks. I ran it against a couple of my files and didn't have any problems. I'll merge this.

This was referenced Jan 31, 2020

problem of importing file #59

Closed

Problem with file in which every column is quoted. #58

Closed

benprew force-pushed the ignore-invalid-header-lines branch 2 times, most recently from 5281afe to 077fccb Compare January 31, 2020 22:37

benprew added 8 commits January 31, 2020 15:10

Pin rubies to OS default versions

3f93860

Update gems to highest version supported by Ruby 2.0. Add pry to devel…

7e15a61

… gems

bug: fix order-dependent test

5ce43ae

Sort isn't stable, so sorting by date in each_with_backwards meant that the "Book Store" transaction wasn't always row 7, so look for the string, instead of by index.

Remove fastercsv, ruby 2.0 is our minimum version.

9c95364

High Sierra installs 2.0, so it's unlikely that someone would have a Ruby < 2.0 installed. High Sierra is 2 versions behind the current macOS version (Catalina).

bug: don't try to parse rows that the user considers header rows

e775392

Since we throw them away anyway, we should just skip them
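A rough sketch of skipping header lines before parsing, rather than parsing everything and discarding rows afterwards. The `parse_skipping_headers` helper and the sample data are hypothetical and not taken from reckon itself.

```ruby
require "csv"

# Hypothetical helper: drop the first header_lines lines as raw text,
# so the CSV parser never sees them.
def parse_skipping_headers(text, header_lines)
  body = text.lines.drop(header_lines).join
  CSV.parse(body)
end

# The first line is a free-form banner, not valid CSV: the stray quotes
# inside an unquoted field would normally be rejected by a strict parser.
text = <<~DATA
  Statement for "Checking" account
  2020-01-02,Book Store,-12.50
  2020-01-03,Salary,1000.00
DATA

rows = parse_skipping_headers(text, 1)
```

Dropping the lines as plain text means a malformed banner cannot cause a parse error, which would not be true if the rows were parsed first and thrown away later.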

Use CharDet to detect char encoding, strip BOM from file

deba42b

If the user doesn't pass an encoding option, we try to determine the encoding of the file using CharDet, then convert it to UTF-8 before parsing it as CSV. Also, strip the BOM, if it exists. Fall back to BINARY as a last resort.
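The steps in this commit message might look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `normalize` is a hypothetical helper, and `detected` stands in for whatever encoding name the detector reports (or nil when detection fails).

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: strip a leading UTF-8 BOM, then convert to UTF-8.
# When no encoding was detected, treat the input as BINARY and replace
# any bytes that cannot be converted.
def normalize(raw, detected = nil)
  bytes = raw.b
  bytes = bytes.byteslice(UTF8_BOM.bytesize..-1) if bytes.start_with?(UTF8_BOM)
  source = detected || "BINARY"
  bytes.force_encoding(source).encode("UTF-8", invalid: :replace, undef: :replace)
end

with_bom = normalize("\xEF\xBB\xBFdate,amount\n", "UTF-8")
fallback = normalize("caf\xE9")
```

In the fallback case the result is valid UTF-8, but non-ASCII bytes are replaced rather than recovered, which is why it is only a last resort.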

Minor cleanup, use require_relative where appropriate

0e9e977

benprew force-pushed the ignore-invalid-header-lines branch from f5673d0 to 0e9e977 Compare January 31, 2020 23:59

benprew merged commit af3d029 into master Feb 1, 2020

benprew deleted the ignore-invalid-header-lines branch February 1, 2020 01:31
