
Credit card CSV extra headers not auto stripped #9

Closed · cjhowald opened this issue Jan 8, 2017 · 2 comments

cjhowald (Owner) commented Jan 8, 2017

Example headers:

Credit card transactions;;Customer number: xxxx
MasterCard xxxxx
Outstanding/current transactions till: 01/08/2017
Billing date: 01/25/2017
Voucher date;Date of receipt;Reason for payment;Foreign currency;Amount;Exchange rate;Amount;Currency

Expect all but the last row to be automatically removed on import.

ohthehugemanatee commented

Really? How will you figure out which row is the real header?

I just posted an issue in the parent project (halloffame#23) asking for a setting where we can specify the number of header rows. That allows for files with zero header rows (just show the first row as example data for choosing columns) as well as files with several. With n header rows, we know that row n-1 (zero-indexed) is the real column header.

Automatic detection would be awesome, but I can't imagine how to do it without big assumptions.
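
For what it's worth, here's a rough sketch of the header-row-count idea; the function name and signature are made up for illustration, not anything from halloffame#23:

```js
// Hypothetical sketch: strip everything above the real header, given a
// user-specified count of header rows. With n header rows, line n-1
// (zero-indexed) carries the column names; n === 0 means no header at all.
function applyHeaderRowCount(csvText, headerRows) {
  const lines = csvText.split(/\r?\n/);
  // Keep the last header row (the real column header) plus all data rows.
  return lines.slice(Math.max(0, headerRows - 1)).join('\n');
}
```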

I wanted to submit a patch, but the project doesn't build on Linux (unsupported platform for fsevents, apparently). Thanks for all your work!

@cjhowald
Copy link
Owner Author

cjhowald commented Jan 22, 2017

@ohthehugemanatee I made one assumption: the CSV will not parse properly if an incorrect header row is used. Then the trick is just defining what "properly" means and extracting that info from the parse result. It turns out Papaparse does a nice job of recording errors it encounters while parsing. For our purposes, I'm defining parse failure as the first row of data throwing a FieldMismatch error (too many or too few values). This could be too strict for some, but I've found DB always sets the right number of values for data rows. Extraneous header rows, on the other hand, have far fewer "fields", so the parsing will definitely fail according to my criteria. Then it's just a matter of detecting the failure, stripping off the first row (the bad header) and trying again. This is how I fixed #3.

For this issue specifically, it appears my failure criteria are actually not strict enough. If 2 consecutive "extra" header rows have the same number of "fields", then it will actually register as a success and continue parsing. Rows 2 and 3 in my sample data each have just one field (no delimiters), which causes this to happen. I've just solved this by also imposing a rule that there must be at least 3 fields (date, amount, payee). There are still cases where a false positive success is possible (e.g. 2 "extra" header rows have 4 fields each), but this is sufficiently unlikely for me and does not appear in any of my imports in any case.
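
To make that concrete, here's a minimal sketch of the detect-and-retry logic. The function name and the three-field threshold are illustrative rather than the exact code, but it relies only on Papaparse's documented FieldMismatch errors and meta.fields:

```js
const Papa = require('papaparse');

const MIN_FIELDS = 3; // at least date, amount, payee

// Repeatedly parse the CSV; if the first data row throws a FieldMismatch
// error, or the detected header has fewer than MIN_FIELDS columns, drop the
// first line (a bad header) and try again.
function stripExtraHeaders(csvText) {
  let lines = csvText.split(/\r?\n/);
  while (lines.length > 1) {
    const result = Papa.parse(lines.join('\n'), {
      header: true,
      skipEmptyLines: true,
    });
    const firstRowMismatch = result.errors.some(
      (e) => e.type === 'FieldMismatch' && e.row === 0
    );
    const enoughFields =
      result.meta.fields && result.meta.fields.length >= MIN_FIELDS;
    if (!firstRowMismatch && enoughFields) {
      return lines.join('\n'); // first line parses as a plausible header
    }
    lines = lines.slice(1); // strip the bad header row and retry
  }
  return lines.join('\n');
}
```

On the sample above (assuming Papaparse detects the semicolon delimiter), each extra line fails one of the two checks, either a FieldMismatch on the first data row or fewer than three header fields, so they're stripped one by one until the "Voucher date;…" row is reached.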

Perhaps this approach feels too fragile to you (or the parent maintainer), and I agree a user-specified header count is generally more robust. However, it works well for me and saves the user an extra step.

I hope this answers your questions.
