[BEAM-7746] Add python type hints (part 2) #10367

chadrik · 2019-12-12T17:39:08Z

This is part 2 of #9056.

Unlike part 1 (#9915) this goes beyond simple type comments. It introduces changes that could affect runtime behavior, though I was careful to avoid doing so, unless it's noted as a bug.

chadrik · 2019-12-12T17:45:04Z

R: @robertwb
R: @udim

Some of these were already reviewed in the overarching PR (#9056), but I did not include them in part 1 because I was trying to focus on type comments in part 1.

I tried to give some context for each change in the commit description. It may be worth considering not squashing history when merging this.

pabloem · 2019-12-12T19:29:09Z

Thanks Chad. I decided to squash the previous PR because I saw some commits containing 'fixup' type comments. Sorry if that was not the intention (guess I should have checked..). Thanks for explicitly requesting the 'merge' option this time : )

chadrik · 2019-12-12T19:36:05Z

No worries. I should have said something earlier. I got lazy at the very end.

chadrik · 2020-01-05T19:00:58Z

Hi, now that the holidays are over, I'd like to bump this to the top of people's review queues!

udim

partial review, about half way done

sdks/python/apache_beam/coders/coder_impl.py

sdks/python/apache_beam/transforms/sideinputs.py

sdks/python/apache_beam/portability/__init__.py

sdks/python/gen_protos.py

sdks/python/apache_beam/pvalue.py

sdks/python/apache_beam/utils/profiler.py

sdks/python/apache_beam/runners/common.py

chadrik · 2020-01-09T02:23:48Z

@udim thanks for the review! very good questions.

chadrik · 2020-01-09T02:26:29Z

btw, I made some edits to my answers to clarify them, so you should continue the review via github rather than email.

sdks/python/apache_beam/io/vcfio.py

robertwb · 2020-01-08T20:46:58Z

sdks/python/apache_beam/pipeline.py

@@ -808,7 +810,7 @@ class AppliedPTransform(object):

  def __init__(self,
               parent,
-               transform,  # type: ptransform.PTransform
+               transform,  # type: Optional[ptransform.PTransform]


Where is this optional?

in Pipeline.__init__:

    # Stack of transforms generated by nested apply() calls. The stack will
    # contain a root node as an enclosing (parent) node for top transforms.
    self.transforms_stack = [AppliedPTransform(None, None, '', None)]

Best way to deal with this may be a special RootAppliedTransform subclass.

It can also possibly be None in AppliedPTransform.from_runner_api():

    transform = ptransform.PTransform.from_runner_api(proto.spec, context)
    result = AppliedPTransform(
        parent=None,
        transform=transform,
        full_label=proto.unique_name,
        inputs=main_inputs)

This is because PTransform.from_runner_api() returns Optional[PTransform]

  @classmethod
  def from_runner_api(cls,
                      proto,  # type: Optional[beam_runner_api_pb2.FunctionSpec]
                      context  # type: PipelineContext
                     ):
    # type: (...) -> Optional[PTransform]
    if proto is None or not proto.urn:
      return None
    parameter_type, constructor = cls._known_urns[proto.urn]

On the second issue, I should point out that mypy knows that proto.spec is not None when we call PTransform.from_runner_api(proto.spec, context) (because proto.spec is always non-None), and we could almost use that knowledge to solve this problem with @overloads of PTransform.from_runner_api(), like this:

  @overload
  @classmethod
  def from_runner_api(cls,
                      proto,  # type: None
                      context  # type: PipelineContext
                     ):
    # type: (...) -> None
    pass

  @overload
  @classmethod
  def from_runner_api(cls,
                      proto,  # type: beam_runner_api_pb2.FunctionSpec
                      context  # type: PipelineContext
                     ):
    # type: (...) -> PTransform
    pass

  @classmethod
  def from_runner_api(cls, proto, context):
    if proto is None or not proto.urn:
      return None

Unfortunately, typing can't track whether the value of proto.urn is an empty string, which means that the above overload strategy doesn't actually work. Is there any chance that this could be changed to if proto is None or proto.urn is None?

I don't think it'd ever be None, given that an unset proto field is the default value of that field.
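
For illustration, here is a minimal sketch of how the overload approach behaves when the `is None` check is the only branch that returns None. `FunctionSpec`, `Transform`, and `from_spec` are hypothetical stand-ins rather than the Beam classes, and the sketch uses Python 3 annotations instead of type comments.

```python
from typing import Optional, overload


class FunctionSpec:
    """Hypothetical stand-in for beam_runner_api_pb2.FunctionSpec."""

    def __init__(self, urn: str = "") -> None:
        # Like an unset proto string field, the default urn is "" and not None.
        self.urn = urn


class Transform:
    """Hypothetical stand-in for PTransform."""

    def __init__(self, urn: str) -> None:
        self.urn = urn


@overload
def from_spec(proto: None) -> None: ...
@overload
def from_spec(proto: FunctionSpec) -> Transform: ...


def from_spec(proto: Optional[FunctionSpec]) -> Optional[Transform]:
    # Only the type of the argument decides the result, so a checker can
    # pick the matching overload at each call site.
    if proto is None:
        return None
    return Transform(proto.urn)


# A checker infers Transform (not Optional[Transform]) for this call.
t = from_spec(FunctionSpec("example:urn"))
```

If the condition also tested `not proto.urn`, the second overload would no longer hold at runtime for a spec whose urn is the empty string, which is the obstacle described above.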

sdks/python/apache_beam/pvalue.py

sdks/python/apache_beam/runners/common.py

robertwb · 2020-01-09T19:00:13Z

sdks/python/apache_beam/runners/common.py

@@ -437,13 +439,15 @@ def invoke_start_bundle(self):
    # type: () -> None
    """Invokes the DoFn.start_bundle() method.
    """
+    assert self.output_processor is not None


These asserts should be fast, but have you verified this doesn't impact performance (given this is called for every element of every transform)? Or is there another way to declare this for critical parts of the code? (In particular, isinstance can be slow. It's not common convention to disable asserts in Python.)

I have not done performance testing on this. Is there a way that we can invoke the perf suite on Jenkins from github?

The underlying issue in this case is that it's possible to instantiate a DoFnInvoker without an output_processor and that's considered ok (by us) as long as you don't call any of the methods that use the output_processor. If you do, then it would raise an exception. The asserts obviously also raise an exception, but it serves as a way to communicate to mypy that you're aware of it, and it can adjust its type analysis within that scope.

Solutions

Easy: Judiciously add type: ignore comments. I say "judiciously" because ignoring an error does not change the type analysis, so similar errors can pop up nearby in the same scope. In this particular case the methods are brief, so a type: ignore comment would suffice.

Not Easy: Rework the code so there's a subclass of DoFnInvoker that always has a non-optional output_processor, and only this class possesses these "safe if you're careful" methods, which would, under that design, always be safe.

The easy solution is fine for this particular case (we're aware there could be an error, but we accept it), but it's not a general solution to this problem. Choosing the right solution for each case takes some consideration.
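
To make the trade-off concrete, here is a rough sketch of the assert and the type: ignore options side by side. `Invoker` and `OutputProcessor` are hypothetical simplified classes, not the actual DoFnInvoker code.

```python
from typing import List, Optional


class OutputProcessor:
    """Hypothetical stand-in for the real output processor."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def start_bundle_outputs(self, results: List[str]) -> None:
        self.seen.extend(results)


class Invoker:
    """Hypothetical invoker whose output_processor may be absent."""

    def __init__(
        self, output_processor: Optional[OutputProcessor] = None
    ) -> None:
        self.output_processor = output_processor

    def start_with_assert(self, results: List[str]) -> None:
        # The assert narrows Optional[OutputProcessor] to OutputProcessor
        # for the rest of this scope.
        assert self.output_processor is not None
        self.output_processor.start_bundle_outputs(results)

    def start_with_ignore(self, results: List[str]) -> None:
        # The error on this line is silenced, but the attribute is still
        # Optional as far as the checker is concerned.
        self.output_processor.start_bundle_outputs(results)  # type: ignore
```

Both methods still fail at runtime when no output_processor was supplied: the first with an AssertionError, the second with an AttributeError.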

I think we can get rid of output_processor altogether in this code... (It's a bit ugly ever since it was introduced, this typing confirms that.) May not be the case everywhere though.

Ok, how about I change these to type: ignore and you create a ticket to remove output_processor? I would make the ticket but I don't think I have enough context to explain it properly.

This is done. I ended up leaving one assert in invoke_process because it was conditional, and by using one assert I avoided 3 ignores that cluttered up the code. Let me know what you think.

sdks/python/apache_beam/runners/pipeline_context.py

robertwb

OK, I've finished going through all the files. Overall it looks good, just a couple more comments.

robertwb · 2020-01-09T21:56:14Z

sdks/python/apache_beam/runners/worker/bundle_processor.py

@@ -265,7 +265,7 @@ def finish(self):

 class _StateBackedIterable(object):
  def __init__(self,
-               state_handler,
+               state_handler,  # type: sdk_worker.CachingStateHandler


Is the 'Caching' part necessary here (even if it always is right now)?

CachingStateHandler does not inherit from StateHandler nor does it implement its abstract methods

Yeah, maybe worth creating another issue for this. Could be a nice entry-level task.

robertwb · 2020-01-09T21:57:54Z

sdks/python/apache_beam/runners/worker/bundle_processor.py

@@ -1130,19 +1132,26 @@ def process(self, windowed_value):

 @BeamTransformFactory.register_urn(
    DATA_INPUT_URN, beam_fn_api_pb2.RemoteGrpcPort)
-def create(factory, transform_id, transform_proto, grpc_port, consumers):
+def create_source_runner(


These registered constructors (necessarily) all have the same signature. Is there a way to declare that in a common place? (The return type is always Operation; which subtype it is never gets introspected.)

I think the best way to reduce the noise here would be to make the registration more object oriented.

Quick sketch:

class OpCreator(Generic[OperatorT]):
  def __init__(
      self,
      factory,  # type: BeamTransformFactory
      transform_id,  # type: str
      transform_proto,  # type: beam_runner_api_pb2.PTransform
      consumers  # type: Dict[str, List[operations.Operation]]
  ):
    self.factory = factory
    self.transform_id = transform_id
    self.transform_proto = transform_proto
    self.consumers = consumers

  def create(self, parameter):
    # type: (Any) -> OperatorT
    raise NotImplementedError

That's an idea. It would still add a level of indirection and boilerplate...

sdks/python/apache_beam/runners/worker/sdk_worker.py

sdks/python/apache_beam/runners/worker/statesampler_slow.py

robertwb · 2020-01-09T23:16:22Z

sdks/python/apache_beam/transforms/environments.py

+  @overload
+  def register_urn(cls,
+                   urn,  # type: str
+                   parameter_type,  # type: None


Idea: could we unify these and update all callers that currently pass None to pass bytes?

that would be nice from a typing simplicity standpoint. not sure about the implications of that though.

sdks/python/apache_beam/transforms/environments.py

robertwb

If you can make the changes ignore OutputProcessor and update the code to reflect that pipeline=None is transient, this looks good to go in, to me. Thanks.

robertwb · 2020-01-13T20:23:09Z

Run all tests

robertwb · 2020-01-13T21:48:57Z

Everything looks good except a little bit of lint: https://builds.apache.org/job/beam_PreCommit_PythonLint_Commit/1894/

Minor adjustments to runtime code required to silence certain errors. Two common patterns:
- explicitly return None from functions that also return non-None
- assert that optional attributes are non-None before using them, if there are no other conditionals present to ensure this.

…uild process

This fixes numerous errors generated throughout the code because mypy cannot track the dynamic setattr binding that was done by common_urns. The change also necessitated updating a few docstrings to prevent this error:

  more than one target found for cross-reference u'DisplayData': apache_beam.transforms.display.DisplayData apache_beam.portability.api.beam_runner_api_pb2_urns.DisplayData

chadrik · 2020-01-14T04:31:35Z

Run all tests

chadrik · 2020-01-14T04:33:56Z

I'm a bit confused: I don't see the Jenkins jobs listed here any more. Why would that be?

robertwb · 2020-01-14T19:24:05Z

Run all tests

robertwb · 2020-01-14T19:24:48Z

(As per the discussion on the dev list, Apache Infra made a change to block all tests from being run by non-committers.)

chadrik · 2020-01-14T19:36:16Z

Ah, I missed that conversation but that seems like a reasonable way to conserve resources. I see the other PR that @udim made to resolve this for me, thanks

chadrik · 2020-01-14T20:19:15Z

@robertwb @udim I think everything is addressed. I melded the review notes into the commits, so I'll leave it up to you whether you want to squash or not.

chadrik · 2020-01-14T22:32:44Z

@robertwb @udim would you rather I make 6 new PRs for 6 typing changes (each covers a distinct topic), or lump them together into one PR? The lumped PR is smaller than this one.

robertwb · 2020-01-14T22:41:45Z

If they're orthogonal enough, let's make 6.

chadrik force-pushed the python-static-typing-part2 branch from d25cf73 to d7514b2 Compare December 12, 2019 18:14

chadrik mentioned this pull request Jan 5, 2020

[BEAM-7746] Add python type hints #9056

Closed

robertwb self-requested a review January 8, 2020 20:41

chadrik mentioned this pull request Jan 8, 2020

typehint fixes to DoOutputsTuple #10494

Merged

udim requested changes Jan 9, 2020

chadrik force-pushed the python-static-typing-part2 branch from d7514b2 to f15424d Compare January 9, 2020 06:37

robertwb reviewed Jan 9, 2020

robertwb reviewed Jan 10, 2020

robertwb approved these changes Jan 10, 2020

chadrik force-pushed the python-static-typing-part2 branch from f15424d to 9d81cce Compare January 13, 2020 01:14

robertwb mentioned this pull request Jan 13, 2020

Always initalize output processor on construction. #10570

Merged

chadrik added 11 commits January 13, 2020 17:50

[BEAM-7746] Address changes in code since annotations were introduced

2c00922

[BEAM-7746] Avoid creating attributes dynamically, so that they can be statically analyzed

bfc6e55

[BEAM-7746] Bugfix: coder id is expected to be str in python3

8353a28

[BEAM-7746] Explicitly unpack tuple to avoid inferring unbounded tuple (Tuple[str, ...])

29e243e

[BEAM-7746] Move name and coder to base StateSpec class

bc21b25

There are several places in the code where it is assumed that these are part of the abstract StateSpec.

[BEAM-7746] Remove reference to missing attribute in statesampler_slow.StateSampler.reset()

80e2c2e

statesampler_slow.StateSampler does not have _states_by_name attribute. Only its fast counterpart does.

[BEAM-7746] Non-Optional arguments cannot default to None

9f14ce9

[BEAM-7746] Avoid reusing variables with different data types

ee13d08

[BEAM-7746] Add StateHandler abstract base class

83866ec

This gives us a type that we can use to ensure all handlers meet the same protocol

chadrik added 3 commits January 13, 2020 17:50

[BEAM-7746] Add TODO about fixing assignment to BundleManager._skip_registration

6189010

[BEAM-7746] Fix functions that were defined twice

3ff01eb

[BEAM-7746] Fix tests that have the same name

f7f8792

Note that this means that tests that were previously being masked by other tests with the same name will now be run. There is a fix included for one such test.
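
As a small standalone illustration of the masking (a sketch with a hypothetical test class, not one of the Beam tests): when a class body defines two methods with the same name, the later definition rebinds the name, so unittest only ever discovers and runs one of them.

```python
import unittest


class MaskedTests(unittest.TestCase):
    def test_value(self):
        # Never runs: the second definition below replaces this one.
        self.fail("masked test was executed")

    def test_value(self):  # same name, so this rebinding wins
        self.assertEqual(1 + 1, 2)


loader = unittest.TestLoader()
# Only one test name is discovered for the class.
names = loader.getTestCaseNames(MaskedTests)
result = unittest.TestResult()
loader.loadTestsFromTestCase(MaskedTests).run(result)
```

Renaming one of the two methods makes both run, which is how a previously hidden failure can surface.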

chadrik force-pushed the python-static-typing-part2 branch from 9d81cce to f7f8792 Compare January 14, 2020 01:50

robertwb merged commit d03404d into apache:master Jan 14, 2020
