Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusing exception when a cell contains a newline #63

Closed
jpownby opened this issue Nov 4, 2020 · 6 comments
Closed

Confusing exception when a cell contains a newline #63

jpownby opened this issue Nov 4, 2020 · 6 comments
Assignees

Comments

@jpownby
Copy link

jpownby commented Nov 4, 2020

A CSV file gave threw an exception ("invalid vector subscript") when I called:

document.GetColumn<std::string>(someIndex);

This exception was confusing to me. someIndex was less than the result returned by document.GetColumnCount(), so I didn't understand what the problem was and had to debug the code to figure it out.

It turns out that the CSV file has a newline \n character in the middle of a quoted cell. So, if I set pQuotedLinebreaks to true in my SeparatorParameters it fixes the problem.

But, this was really non-obvious to me, and it seems strange that rapidcsv doesn't do any validation when parsing to catch that a row has the wrong number of cells and then assumes in GetColumn() that the number will be correct. The way the behavior currently works makes it seem like there is a problem with GetColumn(), when really the problem is with the source data.

I would suggest, in ParseCsv(std::istream& pStream, std::streamsize p_FileLength), some kind of check whenever mData.push_back(row) is about to be called to verify that row.size() == GetColumnCount() (or similar), and if it doesn't then an exception could be thrown. That would help identify what the problem really is (whether it's the result of a newline or just bad data) rather than having parsing apparently succeed but then unexpected errors happen when the results are used.

@d99kris
Copy link
Owner

d99kris commented Nov 5, 2020

Hi @jpownby - thanks for reporting this issue and providing a very clear issue description!

It's definitely a valid feedback, and I'll see what I can do. My main concern is that rapidcsv will need to make assumptions about the use-case in order to provide more detailed error messages, but perhaps there's some way around it.

I'll update here again once I've had some time to look at this.

@d99kris d99kris self-assigned this Nov 5, 2020
@jpownby
Copy link
Author

jpownby commented Nov 5, 2020

Here's an example of the kind of error message that I think would be helpful:

if (!mData.empty() && (row.size() != GetColumnCount()))
{
	std::ostringstream errorMessage;
	errorMessage << "Row #" << mData.size() << " has " << row.size() << " cells (instead of " << GetColumnCount() << ")";
	throw std::invalid_argument(errorMessage.str());
}
mData.push_back(row);

I don't know what use-case assumptions you mean, but an exception like that would have helped me in my situation because I could have gone and looked at the file to see what the problem was. It also would have been helpful to have the problem identified immediately rather than having an exception get thrown later after I had assumed parsing was successful, regardless of what the exception message was.

Hope that helps!

@d99kris
Copy link
Owner

d99kris commented Nov 6, 2020

Thanks!

I don't know what use-case assumptions you mean

Yeah I should've been a bit more specific. I think there could be CSV files that does not necessarily have same number of columns on all rows. Definitely not a typical use-case, but I think it could exist. So I wouldn't want to restrict this at parser-level. But the type of check you suggested could of course be added in relevant Get-functions.

I'll play around with it a little and check performance impact, and update here again.

Thanks again for the feedback, it is a good idea.

@jpownby
Copy link
Author

jpownby commented Nov 6, 2020

Oh, ok, I see. Well, in that case, maybe the bug would be that an exception is thrown in GetColumn() if the index given is bigger than a particular row has :)

@d99kris
Copy link
Owner

d99kris commented Aug 15, 2021

Yeah, I think that's a reasonable and good idea for rapidcsv to support. It will also has minimal performance impact.

The above commit implements support for this, with exception message on the format requested column index 2 >= 2 (number of columns on row index 1).

@homer6
Copy link

homer6 commented Jun 19, 2022

Thanks to you both, @d99kris and @jpownby. With this discussion, I was able to sidestep this error with this code:

auto separator_param = rapidcsv::SeparatorParams();
separator_param.mQuotedLinebreaks = true;
rapidcsv::Document csv_doc( input_file, rapidcsv::LabelParams(), separator_param, rapidcsv::ConverterParams(true) );

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants