New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25659][PYTHON][TEST] Test type inference specification for createDataFrame in PySpark #22653
Conversation
Test build #97042 has finished for PR 22653 at commit
|
@BryanCutler and @viirya would you mind if I ask to take a look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, would you mind creating JIRAs for fixing inference from None and datetime.time?
Decimal(1), | ||
Row(a=1), | ||
Row("a")(1), | ||
A(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't this was possible - does it just look at the variable attributes in the object to get the fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it uses __dict__
attribute.. looks that's not possible in UDFs .. possibly an issue.
Let me open a JIRA after taking some more further looks. I checked some code places and looks it's not easy to support |
Merged to master. Thank you @BryanCutler. Let me try to make sure I take further actions for related items. |
Sorry for late. @HyukjinKwon Yes, this LGTM. It is great we can follow up other issues like None and datetime.time in other JIRA. |
True, | ||
1, | ||
"a", | ||
u"a", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since this is for unicode string, how about using a non-ascii string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. Will change when I fix some codes around here.
…ateDataFrame in PySpark ## What changes were proposed in this pull request? This PR proposes to specify type inference and simple e2e tests. Looks we are not cleanly testing those logics. For instance, see https://github.com/apache/spark/blob/08c76b5d39127ae207d9d1fff99c2551e6ce2581/python/pyspark/sql/types.py#L894-L905 Looks we intended to support datetime.time and None for type inference too but it does not work: ``` >>> spark.createDataFrame([[datetime.time()]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal data = [schema.toInternal(row) for row in data] File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr> for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal return self.dataType.toInternal(obj) File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal else time.mktime(dt.timetuple())) AttributeError: 'datetime.time' object has no attribute 'timetuple' >>> spark.createDataFrame([[None]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal struct = self._inferSchemaFromList(data, names=schema) File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList raise ValueError("Some of types cannot be determined after inferring") ValueError: Some of types cannot be determined after inferring ``` ## How was this patch tested? Manual tests and unit tests were added. Closes apache#22653 from HyukjinKwon/SPARK-25659. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR proposes to specify type inference and simple e2e tests. Looks we are not cleanly testing those logics.
For instance, see
spark/python/pyspark/sql/types.py
Lines 894 to 905 in 08c76b5
Looks we intended to support datetime.time and None for type inference too but it does not work:
How was this patch tested?
Manual tests and unit tests were added.