ARROW-1400: [Python] Adding parquet.write_to_dataset() method for writing partitioned .parquet files #991
Conversation
wesm left a comment:
Cool! This will be very useful. I made some comments about the API and about particular details of the implementation that will affect memory use and performance.
@jreback and @cpcloud, could you double-check my reasoning on the performance considerations and say whether you agree? We may want to add some ASV benchmarks so that we can ascertain the exact performance implications.
python/pyarrow/parquet.py (outdated)
    ----------
    table : pyarrow.Table
    where: string,
        Name of the parquet file for data saved in each partition
In these cases the data file paths are usually randomly generated. Any reason not to use a UUID value like $UUID.parquet? (See the guid() function in pyarrow.compat.)
You could make the file path an optional parameter, so that a UUID is used when it's None.
Sounds good, I'll take a look at guid(). I was also looking at uuid4() from the uuid module, but glad to know there's already a native function.
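A minimal sketch of that default, assuming the guid() helper in pyarrow.compat mentioned above (the function name default_outfile is made up for illustration):

from pyarrow import compat

def default_outfile(where=None):
    # Fall back to a random <uuid>.parquet name when no file name is given,
    # as suggested in the comment above.
    return where if where is not None else compat.guid() + '.parquet'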
python/pyarrow/parquet.py (outdated)
    table : pyarrow.Table
    where: string,
        Name of the parquet file for data saved in each partition
    parition_cols : list,
Typo in the parameter name.
This should also be optional, so that the function works for an unpartitioned dataset.
re Typo: oof!
re Optional parameter: if no parameter is specified, would it be best to just call the original parquet.write_table?
No, because you're appending data to a dataset consisting of one or more Parquet files, so you might have:
root_dir/
    file1.parquet
    file2.parquet
    file3.parquet
    file4.parquet
So if partition_cols is not specified, then it will write another file to the folder. This special case should be handled in the implementation of this function to avoid unnecessary data copying, and unit tested, too
Oh I see! Giving the file names UUIDs (as suggested above) will make this simpler and follow the Spark implementation more closely: root_dir is specified, but each of the files appended under it will be <uuid>.parquet.
re "Unnecessary data copying": Are you saying I should check if the data being appended to the dataset has duplicates within the existing dataset? If so, this seems like it could get hairy: a) I might want duplicates in my dataset (not sure why I would) b) I'm not sure how to do it other than reading back in the existing data, appending new data, deduping, then overwriting the directory with the complete dataset.
I'm hoping users will know not to accidentally write the same data twice. Right now, my implementation for this is:
if partition_cols is not None:
    ...  # (modified original function)
else:
    outfile = compat.guid() + ".parquet"
    full_path = "/".join([root_path, outfile])
    write_table(table, full_path, **kwargs)
python/pyarrow/parquet.py (outdated)
        Column names by which to partition the dataset
        Columns are partitioned in the order they are given
    root_path: string,
        The root directory of the table
IMHO this should be the second parameter:
pq.write_to_dataset(table, '/path/to/mydata')
pq.write_to_dataset(table, '/path/to/mydata', partition_cols=[k1, k2])
python/pyarrow/parquet.py (outdated)
    df = table.to_pandas()
    groups = df.groupby(partition_cols)
    data_cols = [col for col in df.columns.tolist()
Use df.columns.drop(partition_cols) here
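A quick illustration of that suggestion (df and partition_cols here stand in for the variables in the diff):

import pandas as pd

df = pd.DataFrame({'year': [2017, 2017], 'country': ['US', 'CA'],
                   'value': [1.0, 2.0]})
partition_cols = ['year', 'country']

# Index.drop removes the partition columns in one call, replacing the
# list comprehension in the diff.
data_cols = df.columns.drop(partition_cols)  # Index(['value'], dtype='object')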
python/pyarrow/parquet.py (outdated)
                 if col not in partition_cols]
    for partition in partition_cols:
        try:
            df[partition] = df[partition].astype(str)
This is quite expensive. It would be faster to coerce the keys to string only when writing to the dataset.
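For example (a sketch of the idea, not the patch itself), the coercion can happen while the directory name is built, since str.format already stringifies each key; the helper name partition_subdir is made up:

def partition_subdir(partition_cols, keys):
    # Build e.g. 'year=2017/country=US'; each key is converted to a string
    # here rather than casting entire DataFrame columns up front.
    return '/'.join('{0}={1}'.format(name, val)
                    for name, val in zip(partition_cols, keys))

print(partition_subdir(['year', 'country'], (2017, 'US')))  # year=2017/country=US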
python/pyarrow/parquet.py (outdated)
        except ValueError:
            raise ValueError("Partition column must be coercible to string")

    if not data_cols:
len(data_cols) == 0
+1
python/pyarrow/parquet.py (outdated)
    schema = {}
    for subgroup in groups.indices:
        sub_df = groups.get_group(subgroup)[data_cols]
Instead, do for keys, subgroup in df.groupby(...) and omit the .indices and .get_group calls.
I think it would be more efficient to do:
partition_keys = [df[col] for col in partition_cols]
data_df = df.drop(partition_cols, axis='columns')
for part_keys, data_group in data_df.groupby(partition_keys):
    ...
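Putting these suggestions together, the write loop could look roughly like the sketch below (local filesystem and Python 3 only, with names taken from the diff and the compat.guid() helper referenced above; this is an illustration, not the final patch):

import os

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import compat


def write_partitions(table, root_path, partition_cols, preserve_index=True,
                     **kwargs):
    # Sketch: group the data once, drop the partition columns from the
    # payload, and write each group eagerly (rather than collecting all
    # groups first) as <root_path>/<col>=<val>/.../<uuid>.parquet.
    df = table.to_pandas()
    partition_keys = [df[col] for col in partition_cols]
    data_df = df.drop(partition_cols, axis='columns')
    for keys, subgroup in data_df.groupby(partition_keys):
        if not isinstance(keys, tuple):
            keys = (keys,)
        subdir = '/'.join('{0}={1}'.format(name, val)
                          for name, val in zip(partition_cols, keys))
        prefix = '/'.join([root_path, subdir])
        os.makedirs(prefix, exist_ok=True)  # see the filesystem discussion below
        subtable = pa.Table.from_pandas(subgroup,
                                        preserve_index=preserve_index)
        outfile = compat.guid() + '.parquet'
        pq.write_table(subtable, '/'.join([prefix, outfile]), **kwargs)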
python/pyarrow/parquet.py (outdated)
        subgroup = (subgroup,)
    subdir = "/".join(
        ["{colname}={value}".format(colname=name, value=val)
         for name, val in zip(partition_cols, subgroup)])
Should the coercion to string fail, it will fail here anyway, so the cast above is likely not needed.
python/pyarrow/parquet.py (outdated)
    prefix = "/".join([root_path, path])
    os.makedirs(prefix, exist_ok=True)
    full_path = "/".join([prefix, where])
    write_table(data, full_path, **kwargs)
Maybe write the chunks eagerly rather than waiting until the end? Otherwise you have a minimum of a memory tripling in this function:
- first copy in table.to_pandas()
- second copy when iterating through the groups
Awesome! Thanks for the feedback, I'll start on these tonight.
@saffrydaffry can you also change the PR title to explain the content of the patch?

@wesm, I've never changed the title on a PR before. Is it possible to change it while merging commits for these fixes? Thanks!

Right now it's "ARROW-1400 Hotfix". There's an Edit button at the top of the page. The commit messages aren't important, because those all get squashed. You can change it to something like

Great, thank you!
python/pyarrow/parquet.py (outdated)
    )

    if not os.path.isdir(root_path):
        os.mkdir(root_path)
This needs to work with other kinds of filesystems (see the filesystem argument in ParquetDataset)
I'll take a look, thanks!
python/pyarrow/parquet.py (outdated)
    os.makedirs(prefix, exist_ok=True)
    outfile = compat.guid() + ".parquet"
    full_path = "/".join([prefix, outfile])
    write_table(subtable, full_path, **kwargs)
This will need to be:
with filesystem.open(full_path, 'wb') as f:
    write_table(subtable, f, **kwargs)
python/pyarrow/parquet.py (outdated)
    else:
        outfile = compat.guid() + ".parquet"
        full_path = "/".join([root_path, outfile])
        write_table(table, full_path, **kwargs)
Same here
python/pyarrow/parquet.py (outdated)
             for name, val in zip(partition_cols, keys)])
    subtable = Table.from_pandas(subgroup, preserve_index=preserve_index)
    prefix = "/".join([root_path, subdir])
    os.makedirs(prefix, exist_ok=True)
If you use the filesystem.mkdir function (e.g. in LocalFileSystem or HadoopFileSystem) then it will resolve this Python 2 incompatibility (the exist_ok argument is Py3-only)
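A sketch of how that filesystem-based variant might look (the LocalFileSystem fallback and the helper name write_file are assumptions; exact method signatures may differ between filesystem implementations):

import pyarrow.parquet as pq
from pyarrow.filesystem import LocalFileSystem


def write_file(subtable, prefix, outfile, filesystem=None, **kwargs):
    # Route directory creation and the output stream through the filesystem
    # object rather than the os module, so the same code path can serve both
    # local disk and HDFS and avoids the Python 3-only exist_ok argument.
    fs = filesystem if filesystem is not None else LocalFileSystem()
    if not fs.exists(prefix):
        fs.mkdir(prefix)
    with fs.open('/'.join([prefix, outfile]), 'wb') as f:
        pq.write_table(subtable, f, **kwargs)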
If you give me write permission (add me as a collaborator) on your fork, I can try to help too if you don't have time or aren't too familiar with this portion of the code.

@wesm Sorry it took a bit for me to understand, but I made the changes per your suggestions and it works (according to the unit tests), for LocalFileSystem at least. Should I be worried that I keep failing the C++ check? I'm not sure whether it's something I'm doing or whether it's been failing all along. Happy to add you as a collaborator.

No problem, I will take a look.
Alright, I think I have this in ship shape. I will wait for the build to run.

+1. Thanks @saffrydaffry!

@wesm My pleasure! Thank you for your suggestions and polishing!
Add write_to_dataset() in pyarrow.parquet, which writes tables to Parquet files given partitioning columns.
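For reference, usage of the resulting function looks roughly like this (the paths and column names below are illustrative only):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'year': [2016, 2017, 2017],
                   'country': ['US', 'US', 'CA'],
                   'value': [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(df)

# Without partition_cols: a single <uuid>.parquet file under the root.
pq.write_to_dataset(table, root_path='/tmp/mydata')

# With partition_cols: one <uuid>.parquet per year=<y>/country=<c> directory.
pq.write_to_dataset(table, root_path='/tmp/mydata',
                    partition_cols=['year', 'country'])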