
Test simple parquet scheme#1810

Merged
mrocklin merged 3 commits into dask:master from mrocklin:parquet-pandas
Dec 1, 2016

Conversation

@mrocklin
Member

Took a quick stab at supporting single-file parquet datasets from dask.dataframe. Added a test here. Also tried making the following changes to fastparquet:

--- a/fastparquet/api.py
+++ b/fastparquet/api.py
@@ -45,14 +45,16 @@ class ParquetFile(object):
         try:
             fn2 = sep.join([fn, '_metadata'])
             f = open_with(fn2, 'rb')
+            with f as f:
+                self._parse_header(f, verify)
             fn = fn2
         except (IOError, OSError):
             f = open_with(fn, 'rb')
+            with f as f:
+                self._parse_header(f, verify)
         self.open = open_with
         self.fn = fn
         self.sep = sep
-        with f as f:
-            self._parse_header(f, verify)
         self._read_partitions()
 
     def _parse_header(self, f, verify=True):
@@ -106,8 +108,11 @@ class ParquetFile(object):
         self.cats = {key: list(v) for key, v in cats.items()}
 
     def row_group_filename(self, rg):
-        return self.sep.join([os.path.dirname(self.fn),
-                              rg.columns[0].file_path])
+        if rg.columns[0].file_path:
+            return self.sep.join([os.path.dirname(self.fn),
+                                  rg.columns[0].file_path])
+        else:
+            return self.fn
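The row_group_filename hunk above can be sketched in isolation. This is a hedged illustration, not fastparquet code: Column, RowGroup, and ParquetFileSketch are hypothetical stand-ins for the real metadata objects. The point is the fallback: in a single-file dataset a row group has no file_path, so the reader should return self.fn instead of joining a nonexistent sibling path.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    file_path: Optional[str]  # empty/None for single-file datasets


@dataclass
class RowGroup:
    columns: List[Column]


@dataclass
class ParquetFileSketch:
    fn: str
    sep: str = "/"

    def row_group_filename(self, rg):
        if rg.columns[0].file_path:
            # Multi-file dataset: the data file sits next to _metadata.
            return self.sep.join([os.path.dirname(self.fn),
                                  rg.columns[0].file_path])
        # Single-file dataset: the data is in self.fn itself.
        return self.fn


pf = ParquetFileSketch(fn="data/_metadata")
multi = RowGroup([Column("part.0.parquet")])
single = RowGroup([Column(None)])
print(pf.row_group_filename(multi))   # data/part.0.parquet
print(pf.row_group_filename(single))  # data/_metadata
```

Before the patch, the single-file case would have joined the directory with an empty file_path and produced a wrong filename.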

Though now I have to step out for the day. cc @martindurant

import pandas as pd
import fastparquet
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3]})
fastparquet.write(fn, df)
ddf = dd.io.parquet.read_parquet(fn)
import pdb; pdb.set_trace()
Member

This is a leftover?

@martindurant
Member

I guess this will then be a PR in fastparquet. May as well have with open_with(fn, 'rb') in one go. Otherwise, I see no problem - if loading from a single file within fastparquet, row_group_filename is never called.
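The "in one go" suggestion can be sketched as follows. This is a minimal illustration under stated assumptions: plain open stands in for fastparquet's open_with, and parse_header_stub for ParquetFile._parse_header; both are hypothetical simplifications, not the real code.

```python
import os
import tempfile


def parse_header_stub(f):
    # Stand-in for ParquetFile._parse_header: just read a few bytes.
    return f.read(4)


def load_metadata(fn, open_with=open, sep="/"):
    try:
        fn2 = sep.join([fn, "_metadata"])
        # Open, parse, and close in a single `with` statement,
        # so no file object is left dangling across branches.
        with open_with(fn2, "rb") as f:
            header = parse_header_stub(f)
        fn = fn2
    except (IOError, OSError):
        # No _metadata file: fn is itself a single parquet file.
        with open_with(fn, "rb") as f:
            header = parse_header_stub(f)
    return fn, header


# Exercise both branches against a throwaway directory.
root = tempfile.mkdtemp()
dataset = os.path.join(root, "dataset")
os.mkdir(dataset)
with open(os.path.join(dataset, "_metadata"), "wb") as f:
    f.write(b"PAR1xxxx")
fn, header = load_metadata(dataset)      # finds dataset/_metadata

single = os.path.join(root, "single.parquet")
with open(single, "wb") as f:
    f.write(b"PAR1yyyy")
fn2, header2 = load_metadata(single)     # falls back to the file itself
```

The structural difference from the patched diff above is only that the try/except wraps the whole open-and-parse step, rather than assigning the file object first and entering the `with` block afterwards.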

To be sure: I don't think that to_parquet should support single-file mode on write, as that would only be possible for a one-division dataframe. I was, however, thinking of the possibility of a simple function that could collect several isolated parquet files into a logical collection, if they had compatible schemas, of course.

@martindurant
Member

Were you planning on merging this here, and the changes above into fastparquet?

@mrocklin
Member Author

Honestly I had forgotten about it. I've been swamped with a few other things. I'll try to get back to it in a couple of days. Feel free to steal it from me if you have time.

@martindurant
Member

NB: failure fixed in dask/fastparquet#34 - is this test skipped on AppVeyor?

@martindurant
Member

This now passes following the merge in fastparquet.

@mrocklin
Member Author

mrocklin commented Dec 1, 2016

Should we merge?

@martindurant
Member

I believe yes - adding a test that passes has to be a good thing!

@mrocklin mrocklin merged commit 9685631 into dask:master Dec 1, 2016
@mrocklin mrocklin deleted the parquet-pandas branch December 1, 2016 22:08
@sinhrks sinhrks added this to the 0.13.0 milestone Jan 4, 2017