Add Series[T] and DataFrame[T, ...] type hint #453
Codecov Report
@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
+ Coverage   93.64%   93.78%   +0.13%
==========================================
  Files          30       31       +1
  Lines        4108     4198      +90
==========================================
+ Hits         3847     3937      +90
  Misses        261      261
Okay, @rxin, @floscha, @icexelloss, @tahasyeddb, @ueshin, I managed to do it with a bunch of hacks. The downside of the current PR is:
Nevertheless, Koalas can support a neat return hint like:
Which way do you guys prefer? I prefer this way as an experiment, considering Koalas is still immature.
The thing is, for the vast majority of functions this won't apply, right? The output type depends on the input type, which is not known statically.
Yup. In the case of DataFrame, this only applies, for now, to APIs that should be implemented via Grouped Map Pandas UDFs, where the types must be specified. In the case of Series, it applies to other APIs that should use Pandas UDFs. Additionally, users might use this to note the return types of a function, for instance:
    def read_csv() -> DataFrame[str, int]:
        ks.DataFrame(spark.schema("a string, b int").csv())
For this case, it's just something optional and pretty.
BTW, we're encouraging the use of type hints. Providing a proper way to specify types for our DataFrame and Series might be a good idea.
@HyukjinKwon thank you for fixing this, I wanted to unify
Do not attempt to recreate a full annotation system like in Scala, because in practice it is not that useful and there are workarounds:
workarounds: Python is flexible enough that I think you can do something like:
    MyReturnType = StructType([StructField(...), ...])
    def f() -> DataFrame[MyReturnType]: raise
(not tested)
not that useful: in practice, people want to be able to type functions and Series, not DataFrames. DataFrames in ETL keep having columns added and deleted. All the business logic is done by taking a Series and transforming it into another Series, or a structure.
Regarding variadic types: I think that we can resolve them based on the inputs. If you write something like this:
    T = type()
    def f(x: Series[T]) -> Series[T]: pass
You should be able to recover the annotations of
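The idea of resolving a generic return type from the input can be sketched as follows. This is only an illustration, not code from the PR: it assumes `T` is meant to be a `typing.TypeVar`, uses a hypothetical stand-in `Series` class rather than the Koalas one, and introduces a hypothetical helper `resolve_return_element_type`.

```python
import typing

T = typing.TypeVar("T")


class Series(typing.Generic[T]):
    """Hypothetical stand-in for the Koalas Series, for illustration only."""


def f(x: Series[T]) -> Series[T]:
    return x


def resolve_return_element_type(func, arg_name, actual_type):
    """Resolve the element type of func's return hint.

    If the return hint reuses the TypeVar of the named argument, the
    element type observed on the input (actual_type) is returned;
    otherwise the declared element type is returned unchanged.
    """
    hints = typing.get_type_hints(func)
    (param_var,) = typing.get_args(hints[arg_name])
    (return_var,) = typing.get_args(hints["return"])
    if isinstance(return_var, typing.TypeVar) and return_var is param_var:
        return actual_type
    return return_var


# The input is known at call time to hold ints, so the output is int too.
resolved = resolve_return_element_type(f, "x", int)
```

The annotations themselves stay purely declarative; the concrete type only becomes known once an actual input is available, which matches the concern raised earlier that the output type is not known statically.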
    MyReturnType = StructType([StructField(...), ...])
    def f() -> DataFrame[MyReturnType]: raise
I wonder if this is possible ... I quickly tested:
Let me take another look and see if I can do something similar for that, but from a cursory look I am pretty sure that we will need another hack like this one, because it seems
    if arg is None:
        return type(None)
    if isinstance(arg, str):
        return ForwardRef(arg)
    if (isinstance(arg, _GenericAlias) and
            arg.__origin__ in invalid_generic_forms):
        raise TypeError(f"{arg} is not valid as type argument")
    if (isinstance(arg, _SpecialForm) and arg not in (Any, NoReturn) or
            arg in (Generic, _Protocol)):
        raise TypeError(f"Plain {arg} is not valid as type argument")
    if isinstance(arg, (type, TypeVar, ForwardRef)):
        return arg
    if not callable(arg):
        raise TypeError(f"{msg} Got {arg!r:.100}.")
    return arg
To make it work, we should produce a
DataFrame[np.float, int, str]
This annotation isn't completely new because
FWIW, I found a similar attempt here
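One way to picture the kind of hack being discussed is a class that handles subscription itself instead of going through `typing.Generic`, so the argument check quoted above is never involved. The sketch below is an assumption for illustration only, not the approach taken in the PR: `SchemaHint` is a hypothetical container, `DataFrame` is a stand-in class, and a plain list stands in for a `StructType` instance.

```python
class SchemaHint:
    """Hypothetical container recording what was written between the brackets."""

    def __init__(self, origin, schema):
        self.origin = origin
        self.schema = schema

    def __repr__(self):
        return "%s[%r]" % (self.origin.__name__, self.schema)


class DataFrame:
    """Hypothetical stand-in for the Koalas DataFrame, for illustration only."""

    def __class_getitem__(cls, schema):
        # typing.Generic is not used here, so no type-argument check runs
        # and any object, including a non-callable instance, is accepted.
        return SchemaHint(cls, schema)


# A plain, non-callable object standing in for a StructType instance.
MyReturnType = [("a", str), ("b", int)]


def f() -> DataFrame[MyReturnType]:
    raise NotImplementedError


# The annotation is evaluated when f is defined and can be read back later.
hint = f.__annotations__["return"]
```

The trade-off is that such an annotation is opaque to static type checkers, since it is no longer a `typing` construct.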
ping, what do you guys think about this?
LGTM.
This PR proposes an alternative take for #437. The main diff is that:
In Python 3.7, it hacks __class_getitem__ from Generic at DataFrame. Here, it just wraps the given types into a tuple type, which is an existing variadic generic.
In Python 3.6 and Python 3.5, it similarly wraps them in a tuple type too; however, the logic is defined in the metaclass. So, it should be wrapped in a different way from Python 3.7's
It looks like we need a way to specify type hints so that they can be set when we run Python native functions via, for instance, apply(func). For DataFrame, groupby(..).apply(func) needs it if we implement it. Consider this example:
We already have a way in the case of Series (previously Col). I renamed and refactored it to Series.
The new type hints are used as below:
Note that:
It seems we cannot specify field names. I currently gave some default names c0, c1, ... cn.
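As a rough illustration of the Python 3.7 path described above, the sketch below wraps the subscripted types in `typing.Tuple` and then derives default column names from the return hint. It is a minimal sketch under stated assumptions, not the PR's implementation: `DataFrame` is a stand-in class, and `column_schema` and `pandas_func` are hypothetical names introduced here.

```python
import typing


class DataFrame:
    """Hypothetical stand-in for the Koalas DataFrame, for illustration only."""

    def __class_getitem__(cls, params):
        # DataFrame[int] passes a single type; DataFrame[int, str] passes a tuple.
        if not isinstance(params, tuple):
            params = (params,)
        # Wrap the column types in Tuple, which is already a variadic generic.
        return typing.Tuple[params]


def column_schema(func):
    """Return (name, type) pairs from func's return hint.

    Field names cannot be given in the hint, so default names
    c0, c1, ... cn are assigned by position.
    """
    hint = typing.get_type_hints(func)["return"]
    return [("c%d" % i, tpe) for i, tpe in enumerate(hint.__args__)]


def pandas_func(pdf) -> DataFrame[float, int, str]:
    return pdf


schema = column_schema(pandas_func)
```

Here `schema` pairs each declared type with a positional default name, which is the information a caller such as `groupby(..).apply(func)` would need to build an output schema.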