Performance Issues #42
Comments
Interesting, thanks for raising this! I currently see these possible reasons for why this could be happening:
Is the file publicly available? If so, I could take some time to look at this.
I looked into it a little more, and the first issue I ran into was that I didn't open the file using the right mode. If I do the following I get a little further:
This, however, only gets me through 600 rows. On row 601 it hangs, but I don't see what's different about this row. I've uploaded the data to my S3 bucket if you want to see for yourself. The data is a subset of what you can get from http://www.gbif.org/
I had a look at this, there is this field in the file on line 686:
There are other fields with stray quotes, but this one is read as an unfinished escape sequence because it contains an odd number of quotes that start in the middle of the field. Fields with quotes should start with a quote, and quotes within the field should be doubled to escape them, so a properly escaped and terminated field always contains an even number of quotes. Otherwise the library keeps collecting lines as part of the escaped field, which leads to the behavior you're seeing. Eventually you will see an error, but since the file is quite big that can take a while.
In RFC 4180 this is marked as SHOULD, so files may deviate from it for good reasons. I do not see a reason why this field could not be properly encoded, though. There are a few things the lib can improve to address these issues:
I will probably get to work on these some time next week. Thanks again for raising this!
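In the meantime, the even-number-of-quotes rule described above can be used to look for suspicious rows without running the parser at all. What follows is only a rough sketch using the Elixir standard library: the module name QuoteCheck and the path data.tsv are made up for illustration and are not part of the CSV library. It is a heuristic, since a correctly quoted field that spans several lines would also be reported.

```elixir
defmodule QuoteCheck do
  # Hypothetical helper, not part of the CSV library.
  # Returns the 1-based numbers of all lines that contain an odd
  # number of double-quote characters.
  def odd_quote_lines(lines) do
    lines
    |> Stream.with_index(1)
    |> Stream.filter(fn {line, _number} -> odd_quotes?(line) end)
    |> Enum.map(fn {_line, number} -> number end)
  end

  # Splitting on `"` yields one more part than there are quotes.
  defp odd_quotes?(line) do
    quote_count = length(String.split(line, "\"")) - 1
    rem(quote_count, 2) == 1
  end
end

# Usage, assuming the file lives at the hypothetical path "data.tsv":
#
#   "data.tsv" |> File.stream!() |> QuoteCheck.odd_quote_lines()
```

Because File.stream! reads the file lazily, line by line, this should stay cheap even on a multi-gigabyte file. Any line number it reports is a candidate for an unterminated escape sequence.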
@beatrichartz Thanks for looking into this in such detail! I really appreciate it.
I have a tab delimited file that's ~2.6GB. I'm attempting to do the following in iex but it never completes:
I thought this was because of the number of rows, but even if I do Enum.take(700) it doesn't complete. If I only take 500, however, it completes almost immediately. Any idea what's going on, or what I could do to debug this? I'm using Elixir 1.3.0.