Lazy reading from file & test fixes. #6

Merged
1 commit merged on Feb 11, 2011

Conversation

Contributor

i0cus commented Feb 9, 2011

Hi David,

I have added some code that lets me evaluate CSV without reading the entire file into memory at once (a multimethod for java.io.Reader), and I wrote some tests for that.

I also fixed your two tests regarding exceptions thrown during strict parsing.
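
(For the record, a plausible shape for such a fix -- hypothetical, since the test name is mine and the *strict* var spelling is taken from the commit message below -- is to force the lazy seq so the exception actually surfaces:)

(deftest strict-parse-throws
  ;; Sketch only; assumes clojure.test and a *strict* dynamic var.
  ;; With a lazy parser the exception is only thrown on realization,
  ;; hence the doall inside the assertion.
  (binding [*strict* true]
    (is (thrown? Exception
                 (doall (parse-csv "a,\"b"))))))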

I'm not saying that you should merge this commit as is, but having a way to work with files lazily (read, and maybe even write) would be awesome, and this is one way to do it -- without changing any of your parsing logic and mechanisms, just a small add-on. Also -- please take a look at the fixed tests.

Regards,
Slawek.

@i0cus Source coercion with char-seq.
Added char-seq coercion for Strings, Readers, and byte-, int-, and char-arrays. You can now process CSV lazily, without reading it all into memory at once, and the only thing needed is something like:

(with-open [r (reader "/home/memyselfandi/test.csv")]
  (count (parse-csv (char-seq r))))

Also fixed some test cases regarding Exceptions thrown during *strict* parsing & added new cases for the char-seq section.
9fac699
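
(For reference, the core of such a char-seq over a java.io.Reader can be sketched as follows; this is a reconstruction for illustration, not necessarily the committed code:)

(defn char-seq
  "Lazy seq of characters read one at a time from rdr.
  Reader.read returns an int, or -1 at end of stream."
  [^java.io.Reader rdr]
  (lazy-seq
    (let [c (.read rdr)]
      (when-not (neg? c)
        (cons (char c) (char-seq rdr))))))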
Owner

davidsantiago commented Feb 9, 2011

This looks like good code. Thanks for the test fixes; I forgot those tests were even in there, commented out.

I'm very ambivalent about the lazy file reader, though, for exactly the reason that test shows. If you parse a file and return a lazy value based on a lazy file reader, then when you go to read it, it will die, since the file will no longer be open. There's currently some work on fixing this problem in future Clojure releases, it seems, but for now I worry that this might just be more confusing. Can you convince me that this feature is useful enough overall to make the danger worth it? Do you have a use case in mind?
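
(Concretely, the trap is roughly this shape:)

;; The lazy seq escapes with-open unrealized...
(def rows
  (with-open [r (reader "test.csv")]
    (parse-csv (char-seq r))))

;; ...so realizing it later fails, since the Reader is already closed:
(first rows)   ; => IOException: Stream closed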

- David 
Contributor

i0cus commented Feb 10, 2011

No worries, I've been tinkering with it since I put the issue out -- I did not have time to follow through because of my day job. ;-)

As I said before -- I am actually proposing this for your consideration as "one way of doing things".

Well, for me, the thing I put in the test is actually a feature -- you don't leak the file descriptor while processing; it's closed by "with-open", so processing must be finished within it. I don't see an issue with it if it's documented to work as advertised, i.e.

  • "with-open" is closing the source, processing must finalise within;
  • source can be managed manually, but this may leave files opened.
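
(Something like this sketch, which forces all the work before with-open closes the Reader; doall is just one way to force realization:)

(with-open [r (reader "big.csv")]
  ;; Fully realize the result inside the scope of with-open.
  (doall (map count (parse-csv (char-seq r)))))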

One use case I have in mind is "files that do not fit in memory" -- currently you have to use another library for that, since clojure-csv only works with an in-memory String as both input and output.

This also gives you the ability to parse files that you do not want to read all at once, because it uses the Reader's .read method to get input character by character, and your processing function does not keep the head of the resulting seq. If this could be paired with lazy writing, you would have a nice solution for processing CSV in and out with a low memory footprint on the IO part. Caveat: this is not getting JITed and is kind of slow; it needs to be benchmarked. I have no ideas for speed improvements, though.
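
(A hypothetical sketch of such a lazy in-out pipeline -- assuming write-csv renders a seq of rows to a String, so each row can be written out as soon as it is parsed:)

;; Low memory footprint on the IO part: read rows lazily, write row by row.
(with-open [r (reader "in.csv")
            w (writer "out.csv")]
  (doseq [row (parse-csv (char-seq r))]
    (.write w (write-csv [row]))))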

Bonus: my proposal does not actually change the process for people who don't need this -- slurping files and feeding the resulting string into parse-csv is still the preferred method.

Owner

davidsantiago commented Feb 10, 2011

Fair enough, I'm convinced. I suppose it's possible there are really big CSVs out there, and we should definitely not require a gigabyte string or whatever to work through them. I'll see about merging in your changes tomorrow.

This issue was closed.
