
Performance Issues #42

Closed
kyledecot opened this issue Jun 21, 2016 · 4 comments
Comments

@kyledecot

kyledecot commented Jun 21, 2016

I have a tab delimited file that's ~2.6GB. I'm attempting to do the following in iex but it never completes:

file = File.open!("output.csv", [:write])
File.stream!("input.csv")
  |> CSV.Decoder.decode(separator: ?\t)
  |> CSV.encode 
  |> Enum.each(&(IO.write(file, &1)))

I thought this was because of the number of rows, but even if I do Enum.take(700) it doesn't complete. If I only take 500, however, it completes almost immediately. Any idea what's going on, or what I could do to debug this? I'm using Elixir 1.3.0.
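One way to narrow this down (a hypothetical debugging sketch, reusing the same CSV.Decoder call as above) is to tag each decoded row with its index and print progress; the last index printed before the hang points at the offending region of the input:

```elixir
# Decode lazily, attach a 1-based index to each row, and report progress
# every 100 rows. If decoding stalls, the last printed index brackets the
# problematic part of the file.
File.stream!("input.csv")
|> CSV.Decoder.decode(separator: ?\t)
|> Stream.with_index(1)
|> Stream.each(fn {_row, i} -> if rem(i, 100) == 0, do: IO.puts("row #{i}") end)
|> Stream.run()
```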

@beatrichartz
Owner

beatrichartz commented Jun 22, 2016

Interesting, thanks for raising this! I currently see these possible reasons why this could be happening:

  • An escaped but unterminated line somewhere between rows 500 and 700 (in which case it should not just stop, but raise an error at some point) - essentially a CSV field that begins with a stray quote.
  • A non-UTF-8 encoded file, or a file with broken encoding (but that should also raise an error at some point)
  • A bug in the parser / decoder

Is the file publicly available? If yes I could take some time looking at this.

@kyledecot
Author

I looked into it a little more, and the first issue I ran into was that I hadn't opened the output file using the right mode. If I do the following I get a little further:

file = File.open!("output.csv", [:write, :utf8])

This however only gets me through 600 rows. On row 601 it hangs, but I don't see what's different about this row. I've uploaded the data to my S3 bucket if you want to see for yourself. The data is a subset of what you can get from http://www.gbif.org/
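To see what's different about that row, it can help to look at the raw input around it without going through the CSV decoder at all (a sketch; line numbers here are 1-based, and a decoded "row" may span several raw lines if a quote is left open):

```elixir
# Skip the first 600 raw lines of the input and print the next few
# verbatim, so any stray or unbalanced quotes are visible as-is.
File.stream!("input.csv")
|> Stream.drop(600)
|> Enum.take(5)
|> Enum.each(&IO.inspect/1)
```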

@beatrichartz
Owner

I had a look at this, there is this field in the file on line 686:

ecatalogue.LocCollectionEventLocal: "Australia, Western Australia, 10.9 Km S of The Turnoff To Athleen Valley" Along The Wiluna - Agnew Rd (27° 32' S, 120° 33' E) 23/09/1981 - 23/09/1981, A Greer, R Sadlier Et Al(Collector), Field Collected - Terrestrial"

There are other fields with stray quotes, but this one is seen as an unfinished escape sequence because it contains an odd number of quotes, starting in the middle of the field. Fields containing quotes should themselves start with a quote, and quotes within the field should be doubled to escape them, so a properly escaped and terminated field always contains an even number of quotes. Otherwise the library keeps collecting subsequent lines as part of the escaped field, which leads to the behavior you're seeing. Eventually you will see an error, but since the file is quite big that can take a while.
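To illustrate the escaping rule described above, here is a minimal sketch (plain string manipulation, independent of any CSV library) of how a field containing quotes would be properly escaped per RFC 4180; the `escape` helper and the sample field are illustrative, not part of this library's API:

```elixir
# A field containing quotes (or the delimiter, or a newline) must itself
# be wrapped in quotes, with each embedded quote doubled. This keeps the
# total number of quotes in the field even, so the parser knows where it ends.
escape = fn field ->
  if String.contains?(field, ["\"", "\t", "\n"]) do
    "\"" <> String.replace(field, "\"", "\"\"") <> "\""
  else
    field
  end
end

IO.puts(escape.(~s(Australia, "Athleen Valley" Along The Road)))
# prints: "Australia, ""Athleen Valley"" Along The Road"
```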

In RFC 4180 this rule is marked as SHOULD, so there may be deviations in files for good reasons. I don't see a reason why this field could not have been properly escaped, though.

There are a few things the lib can improve to address these issues:

I will probably get to work on these some time next week. Thanks again for raising this!

@kyledecot
Author

@beatrichartz Thanks for looking into this in such detail! I really appreciate it.
