Parsing issue reading from a stream >64KB on Windows #104

Closed
RedBreadcat opened this issue Mar 9, 2022 · 4 comments
RedBreadcat commented Mar 9, 2022

Hello,
I'm encountering a parsing issue with files larger than 64 KB on Windows. This has occurred across several different machines. GetRowCount (and rapidcsv::Document's internal mData member) reports more rows than are present in the file.
I have seen this issue on the latest master ee33419, as well as release 8.53.

I believe this is related to the way the MSVC standard library implementation silently removes \r characters (translating \r\n to \n) when reading a file opened in text mode.
In void ReadCsv(std::istream& pStream), the file size calculated from std::streamsize length = pStream.tellg(); includes \r characters as one would expect.

However, in void ParseCsv(std::istream& pStream, std::streamsize p_FileLength), calls to pStream.read(buffer.data(), readLength); fill the buffer with data that does not contain any \r characters. In short, the total number of bytes returned by read is smaller than the size of the file.
Because of this mismatch, more of the buffer is parsed than was actually read, leading to multiple additional rows being appended to mData.

Below is some self-contained code that should reproduce the issue:

#include <fstream>
#include <iostream>

#include "rapidcsv.h"

int dataRows = 5000;
const auto filename = "testfile.csv";

void CreateFile()
{
   std::ofstream out(filename);
   out << "Foo,Bar,Baz" << std::endl;
   for (int i = 0; i < dataRows; i++)
   {
      out << i << "," << i * 2 << "," << i * 3 << std::endl;
   }
}

void ReadFile()
{
   std::ifstream file(filename);
   
   rapidcsv::Document doc(file);
   std::cout << doc.GetRowCount() << std::endl; // ==5353 in my case
}

int main()
{
   CreateFile();
   ReadFile();

   return 0;
}
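
The mismatch described above can also be illustrated without rapidcsv or Windows. The sketch below is a simplified, hypothetical model, not rapidcsv's actual code: CountLinesTrustingLength, CountLinesUsingGcount and MakeRows are names invented for this example. The first function parses readLength bytes per chunk based on a reported length, the second only parses what gcount() says was delivered. Passing a reported length larger than the stream's real content simulates a text-mode stream that drops \r characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds `rows` lines of "x,y\n", i.e. the bytes a
// text-mode stream would deliver after \r\n has been translated to \n.
std::string MakeRows(std::size_t rows)
{
  std::string data;
  for (std::size_t i = 0; i < rows; ++i) data += "x,y\n";
  return data;
}

// Simplified model of a chunked reader that trusts a precomputed length.
// It never checks how many bytes read() actually delivered.
std::size_t CountLinesTrustingLength(std::istream& stream,
                                     std::streamsize reportedLength,
                                     std::streamsize chunkSize)
{
  assert(chunkSize > 0);
  std::vector<char> buffer(static_cast<std::size_t>(chunkSize));
  std::size_t lines = 0;
  std::streamsize remaining = reportedLength;
  while (remaining > 0)
  {
    const std::streamsize readLength = std::min(remaining, chunkSize);
    stream.read(buffer.data(), readLength);
    // Problem: scans readLength bytes even when fewer were read, so stale
    // bytes left over from an earlier chunk are parsed again.
    for (std::streamsize i = 0; i < readLength; ++i)
    {
      if (buffer[static_cast<std::size_t>(i)] == '\n') ++lines;
    }
    remaining -= readLength;
  }
  return lines;
}

// Same loop, but only scans the bytes that gcount() reports as delivered.
std::size_t CountLinesUsingGcount(std::istream& stream, std::streamsize chunkSize)
{
  assert(chunkSize > 0);
  std::vector<char> buffer(static_cast<std::size_t>(chunkSize));
  std::size_t lines = 0;
  while (stream.read(buffer.data(), chunkSize) || stream.gcount() > 0)
  {
    const std::streamsize got = stream.gcount();
    for (std::streamsize i = 0; i < got; ++i)
    {
      if (buffer[static_cast<std::size_t>(i)] == '\n') ++lines;
    }
  }
  return lines;
}
```

For example, with MakeRows(100) the stream delivers 400 bytes; calling CountLinesTrustingLength with a reported length of 500 (as if each line had ended in \r\n on disk) and a chunk size of 64 over-counts the rows, while CountLinesUsingGcount returns 100.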
@d99kris d99kris self-assigned this Mar 9, 2022
@d99kris
Owner

d99kris commented Mar 9, 2022

Hello,
Thanks for reporting the issue and providing the detailed analysis! It's a pretty bad bug and I'm surprised it's not been encountered (or detected) before.

I can confirm your test case reproduces the issue on MSVC for me as well. I'm preparing a fix in rapidcsv (incl. updates to documentation and test cases) and will likely get it in this weekend.

Meanwhile, if you need to work around it, I think there are two options:

  1. Open the file using rapidcsv, i.e. change ReadFile() to:
void ReadFile()
{
   rapidcsv::Document doc(filename);
   std::cout << doc.GetRowCount() << std::endl;
}
  2. Open the ifstream in binary mode, i.e. change ReadFile() to:
void ReadFile()
{
   std::ifstream file(filename, std::ios::binary);
   rapidcsv::Document doc(file);
   std::cout << doc.GetRowCount() << std::endl;
}
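
As a rough illustration of why the binary-mode workaround helps, the sketch below compares the size obtained via seekg/tellg with the number of bytes the stream actually delivers. It is only a sketch under stated assumptions: SizeViaTellg, BytesDelivered and BinaryModeSizesAgree are hypothetical helper names, and the file path is whatever the caller passes in. In binary mode no \r\n translation takes place, so the two numbers should agree on any platform; with a text-mode stream on MSVC the delivered count would be smaller, as described in the report.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Size as computed by seeking to the end and calling tellg().
std::streamsize SizeViaTellg(std::istream& stream)
{
  stream.seekg(0, std::ios::end);
  const std::streamsize length = stream.tellg();
  stream.seekg(0, std::ios::beg);
  return length;
}

// Number of bytes the stream actually delivers, summed from gcount().
std::streamsize BytesDelivered(std::istream& stream)
{
  std::streamsize total = 0;
  char buffer[4096];
  while (stream.read(buffer, sizeof(buffer)) || stream.gcount() > 0)
  {
    total += stream.gcount();
  }
  return total;
}

// Writes a small CSV with explicit \r\n line endings in binary mode, then
// checks that a binary-mode reader sees exactly as many bytes as tellg reports.
bool BinaryModeSizesAgree(const std::string& path)
{
  {
    std::ofstream out(path, std::ios::binary);
    for (int i = 0; i < 1000; ++i)
    {
      out << i << "," << i * 2 << "\r\n";
    }
  }
  std::ifstream in(path, std::ios::binary);
  assert(in.is_open());
  const std::streamsize size = SizeViaTellg(in);
  const std::streamsize delivered = BytesDelivered(in);
  return size > 0 && size == delivered;
}
```

Calling BinaryModeSizesAgree("binary_mode_demo.csv") (a made-up file name) should return true, since nothing is stripped from the stream in binary mode.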

@d99kris d99kris changed the title from "Parsing issue on Windows files >64KB" to "Parsing issue reading from a stream >64KB on Windows" Mar 9, 2022
@RedBreadcat
Author

Great, thank you for the ergonomic suggestions to work around the issue in the meantime!

@d99kris
Owner

d99kris commented Mar 12, 2022

Hello,
The reported issue has been addressed in the above commit. Please let me know if you encounter any issues. Thanks!

@ezemskov

Thanks Kris,
rapidcsv 8.61 fixes our issue.
