[BEAM-10036] More flexible dataframes partitioning. #11766

robertwb · 2020-05-20T21:58:46Z

Also adds (naive) dataframe.agg() that uses this.

TheNeuralBit

I haven't dug into the logic much yet, but I have some bikesheddy comments I wanted to go ahead and send out since answers to them could help clarify things.

TheNeuralBit · 2020-06-01T23:07:53Z

sdks/python/apache_beam/dataframe/frames_test.py

@@ -23,6 +23,7 @@

 from apache_beam.dataframe import expressions
 from apache_beam.dataframe import frame_base
+from apache_beam.dataframe import frames  # pylint: disable=unused-import


What is this for?

It makes sure the wrapper code is populated with the various types of frames.
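
A minimal sketch of why an apparently unused import can matter, assuming the wrapper lookup is filled in as a side effect of defining the frame classes. The names `_WRAPPERS`, `register_wrapper`, `ToySeries` and `wrap` are hypothetical and are not taken from `frame_base` or `frames`.

```python
# Hypothetical registry; not the actual apache_beam.dataframe code.
_WRAPPERS = {}


def register_wrapper(proxy_type):
    """Class decorator that records which wrapper class handles `proxy_type`."""
    def decorator(cls):
        _WRAPPERS[proxy_type] = cls
        return cls
    return decorator


def wrap(value):
    """Wrap `value` in the wrapper class registered for its exact type."""
    try:
        wrapper_cls = _WRAPPERS[type(value)]
    except KeyError:
        raise TypeError("no wrapper registered for %s" % type(value).__name__)
    return wrapper_cls(value)


# Before the defining code runs, the registry is empty and wrap() would fail.
registry_size_before = len(_WRAPPERS)


# In a real package this class would live in its own module, and importing
# that module is what would trigger the registration.
@register_wrapper(list)
class ToySeries(object):
    def __init__(self, value):
        self.value = value


wrapped = wrap([1, 2, 3])
```

In this sketch, dropping the class definition (the stand-in for the import) leaves `_WRAPPERS` empty, so `wrap` raises `TypeError` even though nothing refers to `ToySeries` by name.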

TheNeuralBit · 2020-06-01T23:09:58Z

sdks/python/apache_beam/dataframe/partitionings.py

+      Nothing() < Index([i]) < Index([i, j]) < ... < Index() < Singleton()
+  """
+
+  _INDEX_PARTITIONS = 100


Previously this was 10, right (in partitioned_by_index)? I'm assuming this is intentional, but I just wanted to double-check it's not a typo.

Oh, I was just testing things. I'll change it back. (It would be great to get rid of this altogether, as it limits parallelism, but that's not part of this change.)

TheNeuralBit · 2020-06-01T23:17:58Z

sdks/python/apache_beam/dataframe/partitionings.py

+class Partitioning(object):
+  """A class representing a (consistent) partitioning of dataframe objects.
+  """
+  def is_subpartition_of(self, other):


nit: I think I'd prefer is_subpartitioning_of

TheNeuralBit · 2020-06-01T23:31:17Z

sdks/python/apache_beam/dataframe/expressions.py

-  def preserves_partition_by_index(self):  # type: () -> bool
-    """Whether the result of this expression will be partitioned by index
-    whenever all of its inputs are partitioned by index."""
+  def preserves_partition_by(self):  # type: () -> Partitioning


The meaning of this function is a little confusing now, since it implies some connection to the input partitioning but also has its own partitioning. Would renaming it to outputs_.. or produces_.. still be accurate, or is the output partitioning actually a function of both "preserves" and the input?

I also think we should consider changing .._partition_by to .._partitioning for clarity.

Yes, it's a function of both the input and the operation. E.g. an elementwise operation preserves all existing partitioning, but does not guarantee any.

Ah makes sense. So perhaps "preserves" could be thought of as an upper bound on the partitioning of the output (similar to how "requires" is a lower bound on the partitioning of the input).

It looks like every current expression has preserves set to either Nothing or Singleton. Wouldn't it be simpler to just keep preserves as a boolean? Or maybe you have some other expression in mind where a boolean won't be sufficient?

There are operations, such as setting a column to be an additional level of the index, that would do partial preservation. But perhaps that's not worth the additional complexity. I can change this to a boolean if you'd rather.

Ah that makes sense. And I guess the name "preserves" is actually intuitive now that I understand it's setting an upper bound on the output partitioning.

I think the complexity is worth it, unless there's a chance those operations will never materialize. Can you just add a docstring indicating that "preserves" sets an upper bound on the output partitioning (or any other language to make sure readers can grok it)? A similar comment about requires would be good too.

Docstring comments added.

TheNeuralBit

LGTM. I have a couple minor suggestions and questions (in addition to the one about docstrings for preserves and requires above).

TheNeuralBit · 2020-06-04T17:27:43Z

sdks/python/apache_beam/dataframe/partitionings.py

+  def __bool__(self):
+    return False
+
+  __nonzero__ = __bool__


I think that making Nothing falsy and relying on that in logic elsewhere harms readability. What do you think about dropping this and just explicitly checking for Nothing when needed?
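
A sketch of the two styles being compared, using toy classes rather than the PR's code. The helper names `imposes_constraint_implicit` and `imposes_constraint_explicit` are hypothetical.

```python
class Partitioning(object):
    pass


class Nothing(Partitioning):
    # The behaviour under discussion: Nothing instances are falsy.
    def __bool__(self):
        return False

    __nonzero__ = __bool__  # Python 2 spelling


class Singleton(Partitioning):
    pass


def imposes_constraint_implicit(partitioning):
    # Relies on truthiness; the reader has to know that Nothing is falsy.
    return bool(partitioning)


def imposes_constraint_explicit(partitioning):
    # States the intent directly.
    return not isinstance(partitioning, Nothing)
```

Both helpers agree on `Partitioning` instances. The implicit form also treats `None`, `0`, or an empty container as "no partitioning", which is the kind of ambiguity an explicit check avoids.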

TheNeuralBit · 2020-06-04T17:31:26Z

sdks/python/apache_beam/dataframe/partitionings.py

+
+
+class Singleton(Partitioning):
+  """A partitioning co-locating all data to a singleton partition.


Isn't Singleton completely partitioning the data into one element per partition? This description doesn't seem consistent to me; maybe I'm misunderstanding.

TheNeuralBit · 2020-06-04T18:03:53Z

sdks/python/apache_beam/dataframe/transforms.py

+      elif expr in stage.inputs:
+        return stage.partitioning.is_subpartitioning_of(partitioning)
+      elif expr.preserves_partition_by().is_subpartitioning_of(partitioning):
+        if expr.requires_partition_by().is_subpartitioning_of(partitioning):


Had trouble justifying this logic to myself, ended up writing a little proof which I'm a bit embarrassed to share with a Math PhD:

output partitioning of expr = min(expr.preserves, input partitioning)
input partitioning >= expr.requires

thus if expr.requires >= required output AND expr.preserves >= required output, then output partitioning of expr >= required output.

Otherwise we need to go up the tree of inputs to figure out their partitionings.

This may be the least concise way to express this so I don't know if it's worth putting in a comment verbatim, but something to that effect would be helpful (assuming I've got it right)

Yeah, fleshing this out with more comments.
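
The argument above can be checked exhaustively on a toy model. This is a sketch under stated assumptions: the order is simplified to a three-element chain Nothing < Index < Singleton, `is_subpartitioning_of` is read as ">=" in that chain (as the comment above reads it), and `ToyPartitioning`, `output_partitioning` and `shortcut_holds` are hypothetical names, not the code in transforms.py.

```python
import itertools


class ToyPartitioning(object):
    """A partitioning on the simplified chain Nothing < Index < Singleton."""
    def __init__(self, name, rank):
        self.name = name
        self.rank = rank

    def is_subpartitioning_of(self, other):
        # Read as "self >= other" in the chain above.
        return self.rank >= other.rank


NOTHING = ToyPartitioning("Nothing", 0)
INDEX = ToyPartitioning("Index", 1)
SINGLETON = ToyPartitioning("Singleton", 2)
ALL = [NOTHING, INDEX, SINGLETON]


def output_partitioning(preserves, input_partitioning):
    """min(preserves, input partitioning) in the chain."""
    if input_partitioning.is_subpartitioning_of(preserves):
        return preserves
    return input_partitioning


def shortcut_holds(preserves, requires, required):
    """The check from the comment: both bounds are >= the required output."""
    return (preserves.is_subpartitioning_of(required) and
            requires.is_subpartitioning_of(required))


counterexamples = []
for preserves, requires, required, actual_input in itertools.product(
        ALL, repeat=4):
    if not actual_input.is_subpartitioning_of(requires):
        continue  # an expression's input always satisfies what it requires
    if shortcut_holds(preserves, requires, required):
        out = output_partitioning(preserves, actual_input)
        if not out.is_subpartitioning_of(required):
            counterexamples.append(
                (preserves.name, requires.name, required.name,
                 actual_input.name))
```

In this model an elementwise operation corresponds to `preserves=SINGLETON` and `requires=NOTHING`: the output is exactly as partitioned as the input, so nothing is guaranteed beyond what the input already provides.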

robertwb

Thanks for the feedback.

Also add simple dataframe.agg() which uses these features.

probot-autolabeler bot added the python label May 20, 2020

robertwb requested a review from TheNeuralBit May 20, 2020 21:58

robertwb force-pushed the dataframe-partitioning branch from 06be3ba to 68e4ddc Compare May 21, 2020 23:44

robertwb added 2 commits May 21, 2020 17:13

[BEAM-10036] More flexible dataframes partitioning.

0a1c576

[BEAM-9496] Add dataframe.agg()

a435925

robertwb force-pushed the dataframe-partitioning branch from 68e4ddc to a435925 Compare May 22, 2020 00:13

lint

8378136

TheNeuralBit reviewed Jun 1, 2020

TheNeuralBit self-requested a review June 2, 2020 00:06

robertwb added 2 commits June 2, 2020 15:46

rename is_subpartitioning_of

aa22f6f

reviewer comments

46ac950

TheNeuralBit approved these changes Jun 4, 2020

robertwb commented Jun 5, 2020

robertwb added 2 commits June 4, 2020 18:04

reviewer comments

f793096

lint, typing, py2

c5cb113

robertwb merged commit a74374d into apache:master Jun 5, 2020

yirutang pushed a commit to yirutang/beam that referenced this pull request Jul 23, 2020

[BEAM-10036] More flexible dataframes partitioning. (apache#11766)

abcbe7a

Also add simple dataframe.agg() which uses these features.
