
Test simple parquet scheme#1810

Merged
mrocklin merged 3 commits into dask:master from mrocklin:parquet-pandas
Dec 1, 2016

Conversation

@mrocklin
Member

Took a quick stab at supporting single-file parquet datasets from dask.dataframe. Added a test here. Also tried making the following changes to fastparquet:

--- a/fastparquet/api.py
+++ b/fastparquet/api.py
@@ -45,14 +45,16 @@ class ParquetFile(object):
         try:
             fn2 = sep.join([fn, '_metadata'])
             f = open_with(fn2, 'rb')
+            with f as f:
+                self._parse_header(f, verify)
             fn = fn2
         except (IOError, OSError):
             f = open_with(fn, 'rb')
+            with f as f:
+                self._parse_header(f, verify)
         self.open = open_with
         self.fn = fn
         self.sep = sep
-        with f as f:
-            self._parse_header(f, verify)
         self._read_partitions()
 
     def _parse_header(self, f, verify=True):
@@ -106,8 +108,11 @@ class ParquetFile(object):
         self.cats = {key: list(v) for key, v in cats.items()}
 
     def row_group_filename(self, rg):
-        return self.sep.join([os.path.dirname(self.fn),
-                              rg.columns[0].file_path])
+        if rg.columns[0].file_path:
+            return self.sep.join([os.path.dirname(self.fn),
+                                  rg.columns[0].file_path])
+        else:
+            return self.fn
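The row_group_filename hunk above can be sketched in isolation. This is a hedged illustration, not fastparquet code: Column, RowGroup, and ParquetFileSketch are hypothetical stand-ins for the real metadata objects. The point is the fallback: in a single-file dataset a row group has no file_path, so the reader should return self.fn instead of joining a nonexistent sibling path.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    file_path: Optional[str]  # empty/None for single-file datasets


@dataclass
class RowGroup:
    columns: List[Column]


@dataclass
class ParquetFileSketch:
    fn: str
    sep: str = "/"

    def row_group_filename(self, rg):
        if rg.columns[0].file_path:
            # Multi-file dataset: the data file sits next to _metadata.
            return self.sep.join([os.path.dirname(self.fn),
                                  rg.columns[0].file_path])
        # Single-file dataset: the data is in self.fn itself.
        return self.fn


pf = ParquetFileSketch(fn="data/_metadata")
multi = RowGroup([Column("part.0.parquet")])
single = RowGroup([Column(None)])
print(pf.row_group_filename(multi))   # data/part.0.parquet
print(pf.row_group_filename(single))  # data/_metadata
```

Before the patch, the single-file case would have joined the directory with an empty file_path and produced a wrong filename.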

Though now I have to step out for the day. cc @martindurant

import pandas as pd
import fastparquet
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3]})
fastparquet.write(fn, df)
ddf = dd.io.parquet.read_parquet(fn)
import pdb; pdb.set_trace()
Member

This is a leftover?

@martindurant
Member

I guess this will then be a PR in fastparquet. May as well have with open_with(fn, 'rb') in one go. Otherwise, I see no problem - if loading from a single file within fastparquet, row_group_filename is never called.
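The "in one go" suggestion can be sketched as follows. This is a minimal illustration under stated assumptions: plain open stands in for fastparquet's open_with, and parse_header_stub for ParquetFile._parse_header; both are hypothetical simplifications, not the real code.

```python
import os
import tempfile


def parse_header_stub(f):
    # Stand-in for ParquetFile._parse_header: just read a few bytes.
    return f.read(4)


def load_metadata(fn, open_with=open, sep="/"):
    try:
        fn2 = sep.join([fn, "_metadata"])
        # Open, parse, and close in a single `with` statement,
        # so no file object is left dangling across branches.
        with open_with(fn2, "rb") as f:
            header = parse_header_stub(f)
        fn = fn2
    except (IOError, OSError):
        # No _metadata file: fn is itself a single parquet file.
        with open_with(fn, "rb") as f:
            header = parse_header_stub(f)
    return fn, header


# Exercise both branches against a throwaway directory.
root = tempfile.mkdtemp()
dataset = os.path.join(root, "dataset")
os.mkdir(dataset)
with open(os.path.join(dataset, "_metadata"), "wb") as f:
    f.write(b"PAR1xxxx")
fn, header = load_metadata(dataset)      # finds dataset/_metadata

single = os.path.join(root, "single.parquet")
with open(single, "wb") as f:
    f.write(b"PAR1yyyy")
fn2, header2 = load_metadata(single)     # falls back to the file itself
```

The structural difference from the patched diff above is only that the try/except wraps the whole open-and-parse step, rather than assigning the file object first and entering the `with` block afterwards.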

To be sure: I don't think that to_parquet should support single-file mode on write, as that would only be possible for a one-division dataframe. I was, however, thinking of the possibility of a simple function that could collect several isolated parquet files into a logical collection, if they had compatible schemas, of course.

@martindurant
Member

Were you planning on merging this here, and the changes above into fastparquet?

@mrocklin
Member Author

Honestly I had forgotten about it. I've been swamped with a few other things. I'll try to get back to it in a couple of days. Feel free to steal it from me if you have time.

@martindurant
Member

NB: failure fixed in dask/fastparquet#34 - is this test skipped on AppVeyor?

@martindurant
Member

This now passes following the merge in fastparquet.

@mrocklin
Member Author

mrocklin commented Dec 1, 2016

Should we merge?

@martindurant
Member

I believe yes - adding a test that passes has to be a good thing!

@mrocklin mrocklin merged commit 9685631 into dask:master Dec 1, 2016
@mrocklin mrocklin deleted the parquet-pandas branch December 1, 2016 22:08
@sinhrks sinhrks added this to the 0.13.0 milestone Jan 4, 2017