Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7746] Fix type errors and enable checks for apache_beam.dataframe.* #11632

Merged
merged 9 commits into from May 29, 2020

Conversation

robertwb
Copy link
Contributor

@robertwb robertwb commented May 7, 2020


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@robertwb
Copy link
Contributor Author

robertwb commented May 7, 2020

R: @chadrik

from typing import Any
from typing import Dict
from typing import Tuple
from typing import Union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scoping these imports is not a bad idea since it keeps the module namespace cleaner, but there are a couple of issues with this:

  • once we get to python 3.x, and move from type comments to annotations, this will fail, as we'll be referencing non-existent objects in our annotations.
  • it's not consistent with how we've done this throughout the rest of the code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should mention that a solution to the first issue is that we can refer to these types as strings, such as 'Any', but that's certainly a lot more awkward, and developers are likely to forget to do so and get confused/frustrated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer unconditionally importing them, but was just trying to avoid lint issues (and did see this pattern elsewhere). Changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So lint complains about unguarded imports, so I put them back. We'll just to a massive sweep to fix these when we change to use type annotations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the lint error? Is it because of the unused typing import?

I'm confused because unguarded typing imports are used all over the beam codebase without any lint errors. Check pipeline, pipeline_context, pipeline_options, for starters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note hasn't been addressed:

I had a look at the lint errors, and they are legitimate, but scoping the imports is not the right solution.

see above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes. Thanks for catching this. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still needs to be fixed for dataframe.convert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sigh] It still didn't like PCollection. apache_beam/dataframe/transforms.py:30:0: W0611: Unused PCollection imported from apache_beam.pvalue (unused-import). But the rest are OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And pandas needs to be guarded. Hopefully that should be it. PTAL.

@@ -248,7 +261,12 @@ def _dict_union(dicts):
return result


def _flatten(valueish, root=()):
def _flatten(
valueish, # type: Union[T, Tuple[T, ...], Dict[Any, T]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to cover the List[T] case for valueish?

Alternately, do we wan to make this more generic:

def _flatten(
    valueish,  # type: Union[T, Iterable[T], Mapping[Any, T]]
    root=(),  # type: Tuple[Any, ...]
    ):
  # type: (...) -> Dict[Tuple[Any, ...], T]
  """Given a nested structure of dicts, tuples, and lists, return a flat
  dictionary where the values are the leafs and the keys are the "paths" to
  these leaves.

  For example `{a: x, b: (y, z)}` becomes `{(a,): x, (b, 0): y, (b, 1): c}`.
  """
  if isinstance(valueish, typing.Mapping):
    return _dict_union(_flatten(v, root + (k, )) for k, v in valueish.items())
  elif isinstance(valueish, typing.Iterable):
    return _dict_union(
        _flatten(v, root + (ix, )) for ix, v in enumerate(valueish))
  else:
    return {root: valueish}

Another thought, is it valid / worthwhile to create a relationship between valueish, root, and the result key?:

def _flatten(
    valueish,  # type: Union[T, Iterable[T], Mapping[U, T]]
    root=(),  # type: Tuple[U, ...]
    ):
  # type: (...) -> Dict[Tuple[U, ...], T]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should mention that typing provides a lot of opportunity to go down rabbit-holes, so the right answer is often "it would be more accurate, but it's not that valuable". A common motivator behind investing in accurately typing utility functions is when it allows you to avoid adding manual / explicit types or casts elsewhere in the code. Imagine a scenario where mypy knows key of the valueish map, but after it passes through _flatten, you need to re-type the result because it becomes Any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'd agree that Iterable/Mapping would be preferable, but here I want to be restrictive about the kinds of values I decompose.

The keys types of the mapping would be Union[None, int, U], so you'd have to cast anyway, so I think it's simpler to leave as is.

Still, good food for thought.

@robertwb robertwb changed the title [BEAM-7746] Fix type errors and enable checks for apache_beam.dataflow. [BEAM-7746] Fix type errors and enable checks for apache_beam.dataframe.* May 8, 2020
@robertwb
Copy link
Contributor Author

robertwb commented May 8, 2020

PTAL

@robertwb
Copy link
Contributor Author

Ping.

@robertwb
Copy link
Contributor Author

Run PythonLint PreCommit

@robertwb
Copy link
Contributor Author

robertwb commented May 27, 2020 via email

Copy link
Contributor

@chadrik chadrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to import TYPE_CHECKING

sdks/python/apache_beam/dataframe/convert.py Show resolved Hide resolved
sdks/python/apache_beam/dataframe/convert.py Outdated Show resolved Hide resolved
import pandas as pd

import apache_beam as beam
from apache_beam import transforms
from apache_beam.dataframe import expressions
from apache_beam.dataframe import frames # pylint: disable=unused-import

if typing.TYPE_CHECKING:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prevailing style for TYPE_CHECKING is to import it as from typing import TYPE_CHECKING. I think we should stay consistent. If we want to change that, it's fine by me, but we can do that in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for consistency. Changed.

@chadrik
Copy link
Contributor

chadrik commented May 28, 2020

LGTM!

@robertwb
Copy link
Contributor Author

Thanks!

@robertwb robertwb merged commit d87fc52 into apache:master May 29, 2020
yirutang pushed a commit to yirutang/beam that referenced this pull request Jul 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants