
Add BatchConverter implementations for pandas types, use Batched DoFns in DataFrame convert utilities #22575

Merged: 10 commits, Aug 31, 2022

Conversation

@TheNeuralBit (Member) commented Aug 3, 2022

Fixes #22678

This PR moves the pandas-Beam type mapping from apache_beam.dataframe.schemas to apache_beam.typehints.pandas_type_compatibility, and modifies apache_beam.dataframe.convert to leverage that mapping by defining batch-producing and batch-consuming DoFns.

The new module now provides BatchConverter implementations that can be re-used in other DoFns that wish to process structured data using the pandas API. It also makes one slight modification in the type mapping: we now have a special field option, beam:dataframe:index:v1. This option is used to indicate that a Beam schema field should map to an index in the pandas DataFrame type system. If a Beam schema has no fields identified as an index, then we assume the user does not care about the index, and a "meaningless" one will be generated when mapping to DataFrames. Similarly when mapping a DataFrame back to the Beam type system, the index will be dropped.

Note apache_beam.dataframe.schemas still exists, for two purposes:

  • To maintain backwards compatibility, we still define BatchRowsAsDataFrame and UnbatchPandas transforms. These transforms are no longer used in apache_beam.dataframe.convert though.
  • To handle proxy generation and consumption (generate_proxy, element_type_from_dataframe). These functions are still used in apache_beam.dataframe.convert.

All of the logic in apache_beam.dataframe.schemas defers to apache_beam.typehints.pandas_type_compatibility as much as possible.
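For readers unfamiliar with the new module, here is a rough usage sketch (not code from this PR; the numpy/pandas typehints and the apache_beam.typehints.batch import path are assumptions, and importing pandas_type_compatibility is assumed to be what registers its converters):

import numpy as np
import pandas as pd

from apache_beam.typehints import pandas_type_compatibility  # assumed to register the pandas converters
from apache_beam.typehints.batch import BatchConverter  # assumed import path

# Ask the registry for a converter between np.int64 elements and pd.Series batches.
converter = BatchConverter.from_typehints(element_type=np.int64, batch_type=pd.Series)

batch = converter.produce_batch([1, 2, 3])       # pd.Series([1, 2, 3])
elements = list(converter.explode_batch(batch))  # [1, 2, 3]
assert converter.get_length(batch) == 3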

The following PRs were separated from this one to ease review:


@TheNeuralBit (Member Author):

Run Python 3.8 PostCommit

@github-actions bot added the python label on Aug 3, 2022
@TheNeuralBit (Member Author):

Run Python 3.7 PostCommit

@codecov bot commented Aug 4, 2022

Codecov Report

Merging #22575 (c088431) into master (63ba9c7) will decrease coverage by 0.01%.
The diff coverage is 93.36%.

@@            Coverage Diff             @@
##           master   #22575      +/-   ##
==========================================
- Coverage   74.19%   74.17%   -0.02%     
==========================================
  Files         709      712       +3     
  Lines       93499    93802     +303     
==========================================
+ Hits        69367    69582     +215     
- Misses      22855    22943      +88     
  Partials     1277     1277              
Flag Coverage Δ
python 83.53% <93.36%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdks/python/apache_beam/typehints/__init__.py 77.77% <66.66%> (-22.23%) ⬇️
sdks/python/apache_beam/dataframe/schemas.py 96.62% <92.30%> (-1.05%) ⬇️
sdks/python/apache_beam/dataframe/convert.py 91.20% <93.47%> (+0.83%) ⬆️
...apache_beam/typehints/pandas_type_compatibility.py 94.95% <94.95%> (ø)
sdks/python/apache_beam/typehints/batch.py 90.38% <100.00%> (+1.99%) ⬆️
...examples/inference/sklearn_mnist_classification.py 43.75% <0.00%> (-3.75%) ⬇️
sdks/python/apache_beam/internal/metrics/metric.py 93.00% <0.00%> (-1.00%) ⬇️
sdks/python/apache_beam/io/localfilesystem.py 90.97% <0.00%> (-0.76%) ⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py 93.28% <0.00%> (-0.75%) ⬇️
sdks/python/apache_beam/typehints/schemas.py 93.84% <0.00%> (-0.48%) ⬇️
... and 25 more


@TheNeuralBit (Member Author):

Clarify separation of concerns between pandas_type_compatibility and dataframe.schemas

dataframe.schemas:

  • Maintain its current public API (possibly with deprecation notices)
  • Responsible for making proxies for the DataFrame API

typehints.pandas_type_compatibility:

  • pandas-Beam type mapping
  • BatchConverter implementations

@TheNeuralBit changed the title from "WIP: Use Batched DoFns in DataFrame convert utilities" to "WIP: Add BatchConverter implementations for pandas types, use Batched DoFns in DataFrame convert utilities" on Aug 12, 2022
@TheNeuralBit changed the title from "WIP: Add BatchConverter implementations for pandas types, use Batched DoFns in DataFrame convert utilities" to "Add BatchConverter implementations for pandas types, use Batched DoFns in DataFrame convert utilities" on Aug 12, 2022
@TheNeuralBit (Member Author):

CC: @robertwb

@TheNeuralBit (Member Author):

Run Python 3.8 PostCommit

@github-actions (Contributor):

Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @y1chi for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@github-actions (Contributor):

Reminder, please take a look at this PR: @y1chi

@TheNeuralBit (Member Author):

@y1chi do you have time to review this?

@TheNeuralBit (Member Author):

R: @yeandy

@github-actions (Contributor):

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@yeandy (Contributor) left a comment

Overall, LGTM. Left a few questions/comments

Comment on lines +141 to +142
  else:
    raise TypeError(f"Encountered unexpected type, left is a {type(left)!r}")
Contributor:

Should we also be checking against the type of right?

Member Author:

assert_series_equal or assert_frame_equal will raise if right isn't the appropriate type.
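(For context, a sketch of what such an assertion helper could look like; only the else branch quoted above is from the PR, while the helper name and the isinstance dispatch are assumptions:)

import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal

def assert_batches_equal(left, right):  # hypothetical helper name
  # Dispatch on the type of `left`; assert_series_equal / assert_frame_equal
  # will themselves raise if `right` is not the matching type.
  if isinstance(left, pd.Series):
    assert_series_equal(left, right)
  elif isinstance(left, pd.DataFrame):
    assert_frame_equal(left, right)
  else:
    raise TypeError(f"Encountered unexpected type, left is a {type(left)!r}")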

    typehints.validate_composite_type_param(self.batch_typehint, '')
    typehints.validate_composite_type_param(self.element_typehint, '')

  def test_type_check(self):
Contributor:

Suggested change:
-  def test_type_check(self):
+  def test_type_check_batch(self):

Member Author:

Done!

    raise NotImplementedError

  def explode_batch(self, batch: pd.DataFrame):
    # TODO: Only do null checks for nullable types
Contributor:

Is there an issue for this?

Member Author:

There is now :) (#22948)


  def produce_batch(self, elements):
    # Note from_records has an index= parameter
    batch = pd.DataFrame.from_records(elements, columns=self._columns)
Contributor:

Why don't we use index= parameter here? Is it so it's easier to set the data type in the next 2 lines?

Member Author:

Yeah that's right, I think the above comment was just a note to self as I was iterating on this. I dropped the comment.
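(A standalone sketch of the approach being described, under the assumption that the lines following from_records cast each column to the dtype implied by the Beam schema; the dtypes mapping here is a hypothetical illustration, not an attribute from the PR:)

import pandas as pd

def produce_batch(elements, columns, dtypes):
  # Build the frame from plain records first (no index=), then coerce each
  # column to its schema-derived dtype, which is simpler before any index is set.
  batch = pd.DataFrame.from_records(elements, columns=columns)
  for column, dtype in dtypes.items():
    batch[column] = batch[column].astype(dtype)
  return batch

# Example:
# produce_batch([(1, 'a'), (2, 'b')], columns=['x', 'y'],
#               dtypes={'x': 'int64', 'y': 'object'})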

Comment on lines +202 to +206
  def estimate_byte_size(self, batch: pd.DataFrame):
    return batch.memory_usage().sum()

  def get_length(self, batch: pd.DataFrame):
    return len(batch)
Contributor:

Can we add tests for these? And also for SeriesBatchConverter?

Member Author:

Done, thank you! I also filed #22950 - we should have a standard test suite to test all the BatchConverter implementations.

Contributor:

Thanks!
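(A minimal example of the kind of test being asked for; the typehints and import path mirror the sketch in the PR description above and are assumptions, not the PR's actual test code:)

import unittest

import numpy as np
import pandas as pd

from apache_beam.typehints import pandas_type_compatibility  # assumed to register the pandas converters
from apache_beam.typehints.batch import BatchConverter  # assumed import path

class SeriesConverterSmokeTest(unittest.TestCase):
  def test_get_length_and_estimate_byte_size(self):
    converter = BatchConverter.from_typehints(
        element_type=np.int64, batch_type=pd.Series)
    batch = converter.produce_batch([1, 2, 3])
    self.assertEqual(converter.get_length(batch), 3)
    # Exact byte counts vary by platform, so just check for a positive estimate.
    self.assertGreater(converter.estimate_byte_size(batch), 0)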

Comment on lines +288 to +290
  def explode_batch(self, batch: pd.Series):
    raise NotImplementedError(
        "explode_batch should be generated in SeriesBatchConverter.__init__")
Contributor:

Why should this be generated in __init__?

Member Author:

We branch on is_nullable once in __init__ and assign explode_batch to either a null-checking or a non-null-checking implementation.
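(A toy illustration of that pattern with simplified names; not the PR's actual code:)

import pandas as pd

class SeriesConverterSketch:
  """Generates explode_batch once in __init__ instead of branching per call."""

  def __init__(self, nullable: bool):
    if nullable:
      def explode_batch(series: pd.Series):
        for isnull, value in zip(pd.isnull(series), series):
          yield None if isnull else value
    else:
      def explode_batch(series: pd.Series):
        yield from series
    # Binding the generated function as an instance attribute shadows the
    # class-level placeholder that raises NotImplementedError.
    self.explode_batch = explode_batch

# list(SeriesConverterSketch(nullable=True).explode_batch(pd.Series([1, None])))
# -> [1.0, None]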

    all_series = self._get_series(batch)
    iterators = [make_null_checking_generator(series) for series in all_series]

    for values in zip(*iterators):
Contributor:

Could we zip the self._columns along with the iterators? Might make it harder to read though

Member Author:

I'm not quite sure what this would look like, could you clarify?

Related: It would be good to add a microbenchmark for produce_batch and explode_batch so we can easily evaluate alternative implementations. But I'd prefer to leave that as future work. For now this just preserves the implementation from apache_beam.dataframe.schemas.

Contributor:

I was originally thinking of

    for values, columns in zip(*iterators, self._columns):
       ...

But I had to take a look again to wrap my head around it. Looks like you're zipping to create the rows first, and then in the second zip, you line them up with the column names. The length of an individual iterator in iterators isn't necessarily the same as the length of self._columns. Plus, we'd probably get a "too many values to unpack" error if we had values, columns.
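(A tiny standalone illustration of that two-stage zip, with made-up data:)

columns = ['a', 'b']
iterators = [iter([1, 2, 3]), iter([10, 20, 30])]  # one iterator per column

# First zip: transpose the per-column iterators into per-row value tuples.
# Second zip: pair each row's values with the column names.
for values in zip(*iterators):         # (1, 10), (2, 20), (3, 30)
  print(dict(zip(columns, values)))    # {'a': 1, 'b': 10}, ...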

    return SeriesBatchConverter.from_typehints(
        element_type=element_type, batch_type=batch_type)

  return None
Contributor:

What happens if we return None? Do we have checks in other places to detect a None BatchConverter?

Member Author:

Yep, this is handled when we try constructing all the registered implementations:

def from_typehints(*, element_type, batch_type) -> 'BatchConverter':
  element_type = typehints.normalize(element_type)
  batch_type = typehints.normalize(batch_type)
  for constructor in BATCH_CONVERTER_REGISTRY:
    result = constructor(element_type, batch_type)
    if result is not None:
      return result
  # TODO(https://github.com/apache/beam/issues/21654): Aggregate error
  # information from the failed BatchConverter matches instead of this
  # generic error.
  raise TypeError(
      f"Unable to find BatchConverter for element_type {element_type!r} and "
      f"batch_type {batch_type!r}")

Note this is very naive right now. In the future this should include helpful debug information to handle cases where one or more implementations almost matches. Tracked in #21654

Contributor:

Got it, thanks!

      yield element

  def infer_output_type(self, input_element_type):
    # Raise a TypeError if proxy has an unknown type
Contributor:

I may have missed this, but where does the error get raised?

Member Author:

Oops, this comment references behavior that was removed in 2b0597e

Now we will just shunt to Any in this case. I removed the comment. Thanks for raising this!

Comment on lines +183 to +184
    self.assertTrue(self.converter == self.create_batch_converter())
    self.assertTrue(self.create_batch_converter() == self.converter)
Contributor:

Can you explain the purpose of checking the equality both ways?

Member Author:

This is just being overly cautious - in theory the instances on either side could be a different type and could have a different __eq__ implementation.
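(A contrived example of how the two directions can disagree when the operands have different types and __eq__ implementations:)

class Strict:
  def __eq__(self, other):
    return isinstance(other, Strict)

class Lax:
  def __eq__(self, other):
    return True

print(Strict() == Lax())  # False: Strict.__eq__ rejects non-Strict objects
print(Lax() == Strict())  # True: Lax.__eq__ accepts anything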

Contributor:

👍

Comment on lines +263 to +271
    if is_nullable(element_type):

      def unbatch(series):
        for isnull, value in zip(pd.isnull(series), series):
          yield None if isnull else value
    else:

      def unbatch(series):
        yield from series
Contributor:

Nit. I actually don't mind the extra lines, especially since we're defining functions here, so it's easier to read. I'll leave it up to you.

Suggested change (the same code with the blank lines after the if/else dropped):

    if is_nullable(element_type):
      def unbatch(series):
        for isnull, value in zip(pd.isnull(series), series):
          yield None if isnull else value
    else:
      def unbatch(series):
        yield from series

      (3, ),
      (10, ),
  ])
  def test_get_lenth(self, N):
Contributor:

Suggested change:
-  def test_get_lenth(self, N):
+  def test_get_length(self, N):

Comment on lines +202 to +206
  def estimate_byte_size(self, batch: pd.DataFrame):
    return batch.memory_usage().sum()

  def get_length(self, batch: pd.DataFrame):
    return len(batch)
Contributor:

Thanks!

@TheNeuralBit (Member Author):

Run Python 3.8 PostCommit

@TheNeuralBit (Member Author):

Run Python Examples_Direct

@TheNeuralBit (Member Author):

Run Python Examples_Dataflow

@TheNeuralBit (Member Author):

retest this please

1 similar comment
@TheNeuralBit (Member Author):

retest this please

@TheNeuralBit (Member Author):

Run Python Examples_Direct

@TheNeuralBit (Member Author):

Run Python Examples_Dataflow

@TheNeuralBit (Member Author):

Run Python 3.8 PostCommit

@TheNeuralBit (Member Author):

PythonDocs PreCommit has passed (https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Commit/9575/), merging

@TheNeuralBit merged commit a6329a5 into apache:master on Aug 31, 2022
Successfully merging this pull request may close these issues:

[Task]: Use Batched DoFn API in the DataFrame API