[BEAM-9546] DataframeTransform can now consume a schema-aware PCollection #11980
Conversation
CC: @robertwb
ff822ac to 55c4920
R: @robertwb
retest this please
55c4920 to f658cb2
@@ -36,7 +37,7 @@
 # TODO: Or should this be called as_dataframe?
 def to_dataframe(
     pcoll,  # type: pvalue.PCollection
-    proxy,  # type: pandas.core.generic.NDFrame
+    proxy=None,  # type: pandas.core.generic.NDFrame
Woo hoo!
    self._batch_elements_transform = BatchElements(*args, **kwargs)

  def expand(self, pcoll):
    return super(BatchRowsAsDataFrame, self).expand(pcoll) | ParDo(
Rather than subclassing, it would probably be cleaner to make this just a PTransform whose expand method returns
`pcoll | BatchElements(...) | ParDo(...)`.
If you want to accept all the parameters from BatchElements, you could construct the BatchElements instance in your constructor.
Done! Looks like I actually started to do it that way with the unused `self._batch_elements_transform`, but then changed my mind.
        _RowBatchToDataFrameDoFn(pcoll.element_type))


class _RowBatchToDataFrameDoFn(DoFn):
Rather than letting this be a full DoFn, you could just let columns be a local variable in the map above, and then write
... | Map(lambda batch: pd.DataFrame.from_records(batch, columns))
Done
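The suggested Map can be sketched in plain pandas. One caveat worth noting: `columns` must be passed to `from_records` by keyword, since its second positional parameter is `index`. The batch and column names below are hypothetical stand-ins for what would come from the PCollection's schema.

```python
import pandas as pd

# A batch of rows, as BatchElements would hand them to the Map.
batch = [('Aardvark', 5), ('Ant', 2), ('Elephant', 35)]

# In the real transform these names come from the PCollection's
# schema; they are hard-coded here for illustration.
columns = ['animal', 'speed']

# `columns` must be a keyword argument: the second positional
# argument of from_records is `index`, not `columns`.
to_df = lambda batch: pd.DataFrame.from_records(batch, columns=columns)
df = to_df(batch)
```

The lambda is exactly what would be wrapped in `Map(...)` inside the pipeline.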
import pandas as pd

from apache_beam import typehints
from apache_beam.transforms.core import DoFn
Nit: we typically have the style of importing modules and then using qualified names (which results in less churn and makes it a bit easier to figure out where things come from). Instead of importing from core, it's typical to do `import apache_beam as beam` and use `beam.DoFn`, etc.
Done
          | schemas.BatchRowsAsDataFrame()
          | transforms.DataframeTransform(
              lambda df: df.groupby('animal').mean(),
              proxy=schemas.generate_proxy(Animal)))
Can we get rid of this one too? (Or at least drop a TODO to do it in a subsequent PR?)
Added a TODO for now. I guess we'd need to store some type information on a PCollection[DataFrame], should we just store a proxy object when we know it?
    self.assertTrue(schemas.generate_proxy(Animal).equals(expected))

  def test_batch_with_df_transform(self):
Maybe do a test using to_dataframe?
I added a test using to_dataframe in transforms_test: test_batching_beam_row_to_dataframe.
I intended for these tests to just test schemas.py, while transforms_test verifies the integration with DataframeTransform. The one below was a stepping stone to integrating, we could even remove it now.
          AnimalSpeed('Elephant', 35),
          AnimalSpeed('Zebra', 40)
      ]).with_output_types(AnimalSpeed)
      | transforms.DataframeTransform(lambda df: df.filter(regex='A.*')))
This threw me because I was expecting the result to be ['Aardvark', 'Ant']. I see now that it's filtering down to the column names that start with A, but perhaps the filter could be written a bit differently to make it more obvious (e.g. filter on the values, let the regex be 'Anim*', or use another operation).
You and me both :) I just reused the operation from test_filter above. Changed it to Anim.* in both places.
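For anyone tripped up the same way: `df.filter` matches axis labels (column names by default), never cell values. A small pandas sketch of the distinction, using hypothetical data in the spirit of the test:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Aardvark', 'Ant', 'Elephant'],
    'Speed': [5, 2, 35],
})

# filter(regex=...) selects *columns* whose names match the regex,
# so this keeps the 'Animal' column rather than rows starting with 'A'.
by_label = df.filter(regex='Anim.*')

# Selecting rows by value takes a boolean mask instead.
by_value = df[df['Animal'].str.startswith('A')]
```

`by_label` is a one-column frame ('Animal'); `by_value` holds the Aardvark and Ant rows.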
sdks/python/apache_beam/pvalue.py (Outdated)

@@ -88,7 +88,7 @@ class PValue(object):
   def __init__(self,
                pipeline,  # type: Pipeline
                tag=None,  # type: Optional[str]
-               element_type=None,  # type: Optional[object]
+               element_type=None,  # type: Optional[type]
Or a type constraint. (Not all our type hints are types.)
Done
    Returns schema as a list of (name, python_type) tuples"""
  if isinstance(element_type, row_type.RowTypeConstraint):
    # TODO: Make sure beam.Row generated schemas are registered and de-duped
Is this worth a JIRA?
Filed BEAM-10722 for this
Addressed all the comments, PTAL. Also pushed 66d258d updating the docstring.
      sorted_df = df.sort_values(by=list(df.columns))
    else:
      sorted_df = df.sort_values()
    return sorted_df.reset_index(drop=True)
Note there's actually a diff here from the original check_correct: it sorts by value and resets the index rather than sorting by index. I had to do this because the concatenated indices of the batches (e.g. [0, 1, 2, 0, 1, 0]) wouldn't match the index in my expected df (e.g. [0, 1, 2, 3, 4, 5]).
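The index mismatch is easy to reproduce in plain pandas; `normalize` below is a hypothetical stand-in for the modified check_correct:

```python
import pandas as pd

# Concatenating per-batch frames repeats each batch's 0-based index,
# e.g. [0, 1, 0] instead of [0, 1, 2].
actual = pd.concat([
    pd.DataFrame({'animal': ['Zebra', 'Ant']}),
    pd.DataFrame({'animal': ['Aardvark']}),
])

def normalize(df):
    # Sort by value and drop the index so frames built from
    # differently-batched inputs compare equal.
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

expected = pd.DataFrame({'animal': ['Aardvark', 'Ant', 'Zebra']})
```

Without the sort, even `reset_index(drop=True)` wouldn't help, since batch order is not deterministic.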
LGTM
Run Python PreCommit
Run Python2_PVR_Flink PreCommit
Run PythonDocker PreCommit
Run Python PreCommit
…tion (apache#11980)
* Add BatchRowsAsDataframe and generate_proxy, integrated into DataFrameTransform
* lint
* fix ci failures
* yapf?
* Address PR comments
* Update DataframeTransform docstring
* lint
Adds batching of schema'd PCollections into dataframes, based on the BatchElements transform.
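The core idea, stripped of the Beam plumbing: group a stream of rows into batches, then turn each batch into a DataFrame. This sketch uses fixed-size batches as a stand-in for BatchElements' adaptive sizing, and the rows and column names are illustrative:

```python
from itertools import islice

import pandas as pd

def batch_rows(rows, batch_size):
    # Greedily group an iterable of rows into lists of at most
    # batch_size elements. (BatchElements sizes batches adaptively;
    # a fixed size keeps the sketch simple.)
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

rows = [('Aardvark', 5), ('Ant', 2), ('Elephant', 35), ('Zebra', 40)]
frames = [
    pd.DataFrame.from_records(batch, columns=['animal', 'speed'])
    for batch in batch_rows(rows, 3)
]
```

Each element of `frames` is what the downstream DataframeTransform would then operate on.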
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.