
Performance Issues #42

Closed
kyledecot opened this issue Jun 21, 2016 · 4 comments
Comments

@kyledecot

kyledecot commented Jun 21, 2016

I have a tab delimited file that's ~2.6GB. I'm attempting to do the following in iex but it never completes:

file = File.open!("output.csv", [:write])
File.stream!("input.csv")
  |> CSV.Decoder.decode(separator: ?\t)
  |> CSV.encode 
  |> Enum.each(&(IO.write(file, &1)))

I thought this was because of the number of rows, but even if I do Enum.take(700) it doesn't complete. If I only take 500, however, it completes almost immediately. Any idea what's going on, or what I could do to debug this? I'm using Elixir 1.3.0.
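One way to narrow this down (a hypothetical debugging sketch, reusing the same CSV.Decoder call as above) is to tag each decoded row with its index and print progress; the last index printed before the hang points at the offending region of the input:

```elixir
# Decode lazily, attach a 1-based index to each row, and report progress
# every 100 rows. If decoding stalls, the last printed index brackets the
# problematic part of the file.
File.stream!("input.csv")
|> CSV.Decoder.decode(separator: ?\t)
|> Stream.with_index(1)
|> Stream.each(fn {_row, i} -> if rem(i, 100) == 0, do: IO.puts("row #{i}") end)
|> Stream.run()
```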

@beatrichartz
Owner

beatrichartz commented Jun 22, 2016

Interesting, thanks for raising this! I currently see these possible reasons why this could be happening:

  • An escaped but unterminated line somewhere between rows 500 and 700 (in which case it should not just stop, but raise an error at some point) - essentially a CSV field that begins with a stray quote.
  • A non-UTF-8 encoded file, or a file with broken encoding (but that should also raise an error at some point)
  • A bug in the parser / decoder

Is the file publicly available? If yes I could take some time looking at this.

@kyledecot
Author

I looked into it a little more, and the first issue I ran into was that I hadn't opened the output file using the right mode. If I do the following I get a little further:

file = File.open!("output.csv", [:write, :utf8])

This however only gets me through 600 rows. On row 601 it hangs, but I don't see what's different about this row. I've uploaded the data to my S3 bucket if you want to see for yourself. The data is a subset of what you can get from http://www.gbif.org/
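To see what's different about that row, it can help to look at the raw input around it without going through the CSV decoder at all (a sketch; line numbers here are 1-based, and a decoded "row" may span several raw lines if a quote is left open):

```elixir
# Skip the first 600 raw lines of the input and print the next few
# verbatim, so any stray or unbalanced quotes are visible as-is.
File.stream!("input.csv")
|> Stream.drop(600)
|> Enum.take(5)
|> Enum.each(&IO.inspect/1)
```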

@beatrichartz
Owner

I had a look at this, there is this field in the file on line 686:

ecatalogue.LocCollectionEventLocal: "Australia, Western Australia, 10.9 Km S of The Turnoff To Athleen Valley" Along The Wiluna - Agnew Rd (27° 32' S, 120° 33' E) 23/09/1981 - 23/09/1981, A Greer, R Sadlier Et Al(Collector), Field Collected - Terrestrial"

There are other fields with stray quotes, but this one is seen as an unfinished escape sequence because it contains an odd number of quotes, starting in the middle of the field. Fields containing quotes should themselves start with a quote, and quotes within the field should be doubled to escape them, so a properly escaped and terminated field always contains an even number of quotes. Otherwise the library keeps collecting subsequent lines as part of the escaped field, which leads to the behavior you're seeing. Eventually you will see an error, but since the file is quite big that can take a while.
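To illustrate the escaping rule described above, here is a minimal sketch (plain string manipulation, independent of any CSV library) of how a field containing quotes would be properly escaped per RFC 4180; the `escape` helper and the sample field are illustrative, not part of this library's API:

```elixir
# A field containing quotes (or the delimiter, or a newline) must itself
# be wrapped in quotes, with each embedded quote doubled. This keeps the
# total number of quotes in the field even, so the parser knows where it ends.
escape = fn field ->
  if String.contains?(field, ["\"", "\t", "\n"]) do
    "\"" <> String.replace(field, "\"", "\"\"") <> "\""
  else
    field
  end
end

IO.puts(escape.(~s(Australia, "Athleen Valley" Along The Road)))
# prints: "Australia, ""Athleen Valley"" Along The Road"
```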

In RFC 4180 this rule is marked as SHOULD, so there may be deviations in files for good reasons. I don't see a reason why this field could not have been properly escaped, though.

There are a few things the lib can improve to address these issues:

I will probably get to work on these some time next week. Thanks again for raising this!

@kyledecot
Author

@beatrichartz Thanks for looking into this in such detail! I really appreciate it.
