
Add Series[T] and DataFrame[T, ...] type hint #453

Merged
merged 2 commits into from
Jul 3, 2019

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jun 10, 2019

This PR proposes an alternative take on #437. The main differences are:

  • In Python 3.7, it hacks `__class_getitem__` from `Generic` at `DataFrame`. It just wraps the given types into a tuple type, which is an existing variadic generic.

  • In Python 3.6 and 3.5, it similarly wraps them with a tuple type; however, the logic there lives in a metaclass, so it has to be wrapped in a different way than in Python 3.7.
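The Python 3.7 approach described above can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual PR code: `__class_getitem__` intercepts the subscription and delegates to `Tuple`, reusing its existing variadic machinery.

```python
from typing import Tuple

# Simplified sketch (not the actual PR implementation): in Python 3.7+,
# subscription like DataFrame[int, str] calls __class_getitem__, so we can
# intercept it and wrap the parameters in Tuple, an existing variadic generic.
class DataFrame:
    def __class_getitem__(cls, params):
        if not isinstance(params, tuple):
            params = (params,)
        return Tuple[params]

hint = DataFrame[int, str]
print(hint)           # typing.Tuple[int, str]
print(hint.__args__)  # (<class 'int'>, <class 'str'>)
```

Since an annotation is just an object stored on the function, such a hint can then be read back at runtime via `__args__` to build the UDF's return schema.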




It looks like we need a way to specify type hints so that they can be set when we run Python native functions via, for instance, apply(func). For DataFrame, groupby(..).apply(func) will need it if we implement it. Consider this example:

def func(pdf) -> DataFrame[...]:
    return pdf

df.groupby(...).apply(func)

We already have a way to do this for Series (previously `Col`); I renamed and refactored it to `Series`.

The new type hints are used as below:

def func(...) -> Series[np.float]:
    ...

def func(...) -> DataFrame[np.float, int, str]:
    ...

Note that it seems we cannot specify field names:

def func(...) -> DataFrame[np.float, int, str]:
    ...

so I currently assign default names c0, c1, ..., cn.
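Because field names cannot be passed through the subscription, the default names have to be generated from the positional types. A sketch of how that pairing could be derived, with a hypothetical helper name (not the PR's actual function):

```python
from typing import Tuple

# Hypothetical helper: given the Tuple type that the DataFrame[...] hack
# produces, pair each parameter type with a default column name c0..cn.
def default_fields(tpe):
    return [("c%d" % i, t) for i, t in enumerate(tpe.__args__)]

print(default_fields(Tuple[float, int, str]))
# [('c0', <class 'float'>), ('c1', <class 'int'>), ('c2', <class 'str'>)]
```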

@codecov-io

codecov-io commented Jun 10, 2019

Codecov Report

Merging #453 into master will increase coverage by 0.13%.
The diff coverage is 85.36%.


@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
+ Coverage   93.64%   93.78%   +0.13%     
==========================================
  Files          30       31       +1     
  Lines        4108     4198      +90     
==========================================
+ Hits         3847     3937      +90     
  Misses        261      261
Impacted Files Coverage Δ
databricks/koalas/series.py 93.2% <100%> (-0.03%) ⬇️
databricks/koalas/__init__.py 84.21% <100%> (ø) ⬆️
databricks/koalas/namespace.py 89.91% <100%> (ø) ⬆️
databricks/koalas/frame.py 94.1% <100%> (-0.28%) ⬇️
databricks/koalas/typedef.py 83% <76%> (+2.5%) ⬆️
databricks/koalas/generic.py 94.2% <0%> (-0.09%) ⬇️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
databricks/koalas/missing/__init__.py 100% <0%> (ø) ⬆️
... and 6 more

Last update c6bd969...06e3390.

@HyukjinKwon HyukjinKwon force-pushed the series-dataframe-type-2 branch 3 times, most recently from 0ea61b6 to e43a17f Compare June 10, 2019 06:18
@HyukjinKwon HyukjinKwon changed the title [WIP] Add Series[T] and DataFrame[T, ...] type hint Add Series[T] and DataFrame[T, ...] type hint Jun 10, 2019
@HyukjinKwon
Member Author

Okay, @rxin, @floscha, @icexelloss, @tahasyeddb, @ueshin, I managed to do it with a bunch of hacks.

The downsides of the current PR are:

  • We have to add a hack for each version of Python until variadic generics are officially supported. Note that at some point it might become impossible for us to keep hacking.
  • This kind of hack might be tied to corner cases (like the namedtuple hack in PySpark) forever.

Nevertheless, Koalas can support a neat return hint like:

def test(...) -> ks.DataFrame[int, str]:
    pass

Which way do you guys prefer? I tentatively prefer this one, considering Koalas is still premature.

@rxin
Contributor

rxin commented Jun 10, 2019

The thing is, for the vast majority of functions this won't apply, right? Because the output type depends on the input type, which is not known statically.

@HyukjinKwon
Member Author

HyukjinKwon commented Jun 10, 2019

Yup, in the case of DataFrame this only applies to APIs that would be implemented via Grouped Map Pandas UDFs for now, where the types have to be specified.

In the case of Series, it applies to other APIs that use Pandas UDFs, for instance DataFrame.transform, DataFrame.apply, and pandas_wraps.

Additionally, users might use this to note the return types of a function, for instance:

def read_csv() -> DataFrame[str, int]:
    return ks.DataFrame(spark.read.schema("a string, b int").csv(...))

For this case, it's just something optional and pretty.

@HyukjinKwon
Member Author

BTW, we're encouraging the use of type hints. Providing a proper way to specify types for our DataFrame and Series might be a good idea.

@thunterdb
Contributor

@HyukjinKwon thank you for fixing this; I wanted to unify Col and Series but could not find a way to do it properly. A couple of thoughts:

Do not attempt to recreate a full annotation system like in Scala, because in practice it is not that useful and there are workarounds:

workarounds: Python is flexible enough that I think you can do something like:

MyReturnType = StructType([StructField(...), ...])

def f() -> DataFrame[MyReturnType]: raise

(not tested)

not that useful: in practice, people want to be able to type functions and Series, not DataFrames. DataFrames in ETL keep having columns added and deleted. All the business logic is done by taking a Series and transforming it into another Series or a structure.

Regarding variadic types: I think we can resolve them based on the inputs. If you write something like this:

T = TypeVar('T')
def f(x: Series[T]) -> Series[T]: pass

you should be able to recover the annotations of x and then infer the final type. This is the main case I can think of. That improvement is fine for another PR.
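The resolution step described here could look roughly like this. A sketch under the assumption that the input's concrete element type has already been inferred at runtime; `Series` below is a stand-in `Generic` for `ks.Series`:

```python
from typing import Generic, TypeVar, get_type_hints

T = TypeVar('T')

class Series(Generic[T]):  # stand-in for ks.Series
    pass

def f(x: Series[T]) -> Series[T]:
    return x

hints = get_type_hints(f)
ret = hints['return']       # Series[~T], still parameterized by T
# Suppose we inferred at runtime that the actual input holds ints; then
# substituting T in the return annotation yields the final concrete type:
resolved = ret[int]
assert resolved == Series[int]
```

Subscripting a generic alias that still contains a `TypeVar` substitutes it, which is what makes this kind of inference possible without any extra machinery.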

@HyukjinKwon
Member Author

HyukjinKwon commented Jun 11, 2019

MyReturnType = StructType([StructField(...), ...])

def f() -> DataFrame[MyReturnType]: raise

I wonder if this is possible. I quickly tested it:

    Traceback (most recent call last):
      File "/Applications/PyCharm.app/Contents/helpers/pycharm/docrunner.py", line 140, in __run
        compileflags, 1), test.globs)
      File "<doctest _infer_return_type[2]>", line 1, in <module>
        def func() -> ks.Series[1]:
      File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 251, in inner
        return func(*args, **kwds)
      File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 824, in __class_getitem__
        params = tuple(_type_check(p, msg) for p in params)
      File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 824, in <genexpr>
        params = tuple(_type_check(p, msg) for p in params)
      File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 139, in _type_check
        raise TypeError(f"{msg} Got {arg!r:.100}.")
    TypeError: Parameters to generic types must be types. Got StructType(List(StructField(a,StringType,true))).

Let me take another look and see whether I can do something similar for that, but from a cursory look I am pretty sure we will need another hack like this one, because it seems the typing module requires this to be a type, TypeVar, etc.:

    if arg is None:
        return type(None)
    if isinstance(arg, str):
        return ForwardRef(arg)
    if (isinstance(arg, _GenericAlias) and
            arg.__origin__ in invalid_generic_forms):
        raise TypeError(f"{arg} is not valid as type argument")
    if (isinstance(arg, _SpecialForm) and arg not in (Any, NoReturn) or
            arg in (Generic, _Protocol)):
        raise TypeError(f"Plain {arg} is not valid as type argument")
    if isinstance(arg, (type, TypeVar, ForwardRef)):
        return arg
    if not callable(arg):
        raise TypeError(f"{msg} Got {arg!r:.100}.")
    return arg

To make this work, we would need to produce a type from StructType([StructField(...), ...]); for instance, we would have to create a class dynamically somehow to allow this case.
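That dynamic-class idea can be sketched as follows, using stand-ins for PySpark's `StructType`/`StructField` to keep the snippet self-contained, and a hypothetical helper name:

```python
from typing import Tuple

# Stand-ins for pyspark.sql.types, just to make the sketch self-contained.
class StructField:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class StructType:
    def __init__(self, fields):
        self.fields = fields

# Hypothetical helper: wrap a schema instance in a freshly created class so
# that typing's _type_check sees a real type and accepts it as a parameter.
def as_annotation(schema):
    return type("DataFrameType", (), {"schema": schema})

schema = StructType([StructField("a", str)])
tpe = as_annotation(schema)
assert isinstance(tpe, type)   # passes typing's isinstance(arg, type) check
hint = Tuple[tpe]              # accepted, unlike Tuple[schema] which raises
```

The schema instance rides along as a class attribute, so it can still be recovered from the annotation later.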

DataFrame[np.float, int, str]

This annotation isn't completely new, because Tuple[str, int, ...] is already supported in this way and is a pretty similar construct to DataFrame. Considering we tried to represent Row as a tuple before on the Spark side (https://github.com/apache/spark/blob/eec1a3c2862955bf2620d4cc5116fbd86e29952e/python/pyspark/sql/types.py#L1405), this wouldn't be that crazy.

FWIW, I found similar attempt here https://github.com/python/typing/issues/193#issuecomment-479484916

@HyukjinKwon
Member Author

Ping: what do you guys think about this?

@HyukjinKwon HyukjinKwon force-pushed the series-dataframe-type-2 branch 3 times, most recently from 32c9916 to f53ff44 Compare June 26, 2019 10:44
Collaborator

@ueshin ueshin left a comment


LGTM.

@softagram-bot

Softagram Impact Report for pull/453 (head commit: 06e3390)


@HyukjinKwon HyukjinKwon merged commit 4cac41b into databricks:master Jul 3, 2019
@HyukjinKwon HyukjinKwon deleted the series-dataframe-type-2 branch November 6, 2019 02:23
@HyukjinKwon HyukjinKwon mentioned this pull request Mar 12, 2020