[SPARK-28454][PYTHON] Validate LongType in createDataFrame(verifySchema=True) #25117

Closed · wants to merge 5 commits
2 changes: 2 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -149,6 +149,8 @@ license: |

- Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during `REFRESH TABLE`), not during query execution: the net change is that `spark.sql.files.ignoreMissingFiles` is now obeyed during table file listing / query planning, not only at query execution time.

- Since Spark 3.0, `createDataFrame(..., verifySchema=True)` validates `LongType` as well in PySpark. Previously, `LongType` was not verified, and values that overflow resulted in `None`. To restore this behavior, set `verifySchema` to `False` to disable the validation.

- Since Spark 3.0, substitution order of nested WITH clauses is changed and an inner CTE definition takes precedence over an outer. In version 2.4 and earlier, `WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2` returns `1` while in version 3.0 it returns `2`. The previous behaviour can be restored by setting `spark.sql.legacy.ctePrecedence.enabled` to `true`.

- Since Spark 3.0, the `add_months` function does not adjust the resulting date to the last day of the month if the original date is the last day of a month. For example, `select add_months(DATE'2019-02-28', 1)` results in `2019-03-28`. In Spark version 2.4 and earlier, the resulting date is adjusted when the original date is the last day of a month. For example, adding a month to `2019-02-28` results in `2019-03-31`.
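To illustrate the migration note above, a minimal sketch of the new behavior, assuming an active `SparkSession` bound to `spark` (the schema and variable names here are illustrative, not taken from the patch):

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([StructField("id", LongType())])

# Spark 3.0 with the default verifySchema=True: the overflowing value is rejected
spark.createDataFrame([(2**63,)], schema)   # raises ValueError

# verifySchema=False restores the pre-3.0 behavior: the value silently becomes None
spark.createDataFrame([(2**63,)], schema, verifySchema=False).first()   # Row(id=None)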
3 changes: 2 additions & 1 deletion python/pyspark/sql/tests/test_types.py
@@ -830,7 +830,8 @@ def __init__(self, **kwargs):
            (2**31 - 1, IntegerType()),

            # Long
-           (2**64, LongType()),
+           (-(2**63), LongType()),
+           (2**63 - 1, LongType()),

            # Float & Double
            (1.0, FloatType()),
14 changes: 14 additions & 0 deletions python/pyspark/sql/types.py
@@ -1211,6 +1211,10 @@ def _make_type_verifier(dataType, nullable=True, name=None):
>>> _make_type_verifier(StructType([]))(None)
>>> _make_type_verifier(StringType())("")
>>> _make_type_verifier(LongType())(0)
>>> _make_type_verifier(LongType())(1 << 64) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ValueError:...
>>> _make_type_verifier(ArrayType(ShortType()))(list(range(3)))
>>> _make_type_verifier(ArrayType(StringType()))(set()) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
@@ -1319,6 +1323,16 @@ def verify_integer(obj):

        verify_value = verify_integer

    elif isinstance(dataType, LongType):
        def verify_long(obj):
            assert_acceptable_types(obj)
            verify_acceptable_types(obj)
            if obj < -9223372036854775808 or obj > 9223372036854775807:
                raise ValueError(
                    new_msg("object of LongType out of range, got: %s" % obj))

        verify_value = verify_long

    elif isinstance(dataType, ArrayType):
        element_verifier = _make_type_verifier(
            dataType.elementType, dataType.containsNull, name="element in array %s" % name)
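The hardcoded bounds in `verify_long` are -(2**63) and 2**63 - 1, i.e. the range of a 64-bit signed long, matching the boundary values exercised in the updated test. A quick boundary check against the helper (a sketch only; `_make_type_verifier` is an internal API and may change):

from pyspark.sql.types import LongType, _make_type_verifier

verify = _make_type_verifier(LongType())
verify(-(2**63))     # min long: passes silently
verify(2**63 - 1)    # max long: passes silently
verify(2**63)        # one past max: raises ValueError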