[SPARK-43545][SQL][PYTHON] Support nested timestamp type
### What changes were proposed in this pull request?

Supports nested timestamp type in `spark.createDataFrame()` with pandas DataFrame and `df.toPandas()`, and makes them return correct results.

For the following schema and pandas DataFrame:

```py
schema = (
    StructType()
    .add("ts", TimestampType())
    .add("ts_ntz", TimestampNTZType())
    .add(
        "struct", StructType().add("ts", TimestampType()).add("ts_ntz", TimestampNTZType())
    )
    .add("array", ArrayType(TimestampType()))
    .add("array_ntz", ArrayType(TimestampNTZType()))
    .add("map", MapType(StringType(), TimestampType()))
    .add("map_ntz", MapType(StringType(), TimestampNTZType()))
)
data = [
    Row(
        datetime.datetime(2023, 1, 1, 0, 0, 0),
        datetime.datetime(2023, 1, 1, 0, 0, 0),
        Row(
            datetime.datetime(2023, 1, 1, 0, 0, 0),
            datetime.datetime(2023, 1, 1, 0, 0, 0),
        ),
        [datetime.datetime(2023, 1, 1, 0, 0, 0)],
        [datetime.datetime(2023, 1, 1, 0, 0, 0)],
        dict(ts=datetime.datetime(2023, 1, 1, 0, 0, 0)),
        dict(ts_ntz=datetime.datetime(2023, 1, 1, 0, 0, 0)),
    )
]
pdf = pd.DataFrame.from_records(data, columns=schema.names)
```

##### `spark.createDataFrame()`

All of them return the same results:

```py
>>> spark.conf.set("spark.sql.session.timeZone", "America/New_York")
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
+-------------------+-------------------+------------------------------------------+---------------------+---------------------+---------------------------+-------------------------------+
|ts                 |ts_ntz             |struct                                    |array                |array_ntz            |map                        |map_ntz                        |
+-------------------+-------------------+------------------------------------------+---------------------+---------------------+---------------------------+-------------------------------+
|2023-01-01 00:00:00|2023-01-01 00:00:00|{2023-01-01 00:00:00, 2023-01-01 00:00:00}|[2023-01-01 00:00:00]|[2023-01-01 00:00:00]|{ts -> 2023-01-01 00:00:00}|{ts_ntz -> 2023-01-01 00:00:00}|
+-------------------+-------------------+------------------------------------------+---------------------+---------------------+---------------------------+-------------------------------+
```

##### `df.toPandas()`

```py
>>> spark.conf.set("spark.sql.session.timeZone", "America/New_York")
>>> df.toPandas()
                   ts     ts_ntz                                      struct                  array              array_ntz                          map                          map_ntz
0 2023-01-01 03:00:00 2023-01-01  (2023-01-01 03:00:00, 2023-01-01 00:00:00)  [2023-01-01 03:00:00]  [2023-01-01 00:00:00]  {'ts': 2023-01-01 03:00:00}  {'ts_ntz': 2023-01-01 00:00:00}
```
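As a usage note (not part of the original PR text): a minimal, self-contained sketch of how one might exercise both conversion paths on a classic (non-Connect) PySpark session. The small schema and data here are illustrative stand-ins for the example above, and the existing `spark.sql.execution.arrow.pyspark.enabled` conf is only used to toggle between the Arrow and non-Arrow paths; it is not something this change requires.

```py
import datetime

import pandas as pd

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import ArrayType, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

# Illustrative schema with nested timestamps: a struct field and an array element.
schema = (
    StructType()
    .add("struct", StructType().add("ts", TimestampType()))
    .add("array", ArrayType(TimestampType()))
)
ts = datetime.datetime(2023, 1, 1, 0, 0, 0)
pdf = pd.DataFrame.from_records([Row(Row(ts), [ts])], columns=schema.names)

# Run the same conversion with and without Arrow; with this change, nested
# timestamps are expected to be handled like top-level ones on both paths.
for arrow_enabled in ("false", "true"):
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", arrow_enabled)
    df = spark.createDataFrame(pdf, schema)
    df.show(truncate=False)
    print(df.toPandas())
```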
### Why are the changes needed?

Currently, nested timestamps in `spark.createDataFrame()` with pandas DataFrame and `df.toPandas()` are not supported for `ArrayType` and `MapType`, or return results inconsistent with the top-level timestamps for `StructType`.

For the following schema and pandas DataFrame:

```py
schema = (
    StructType()
    .add("ts", TimestampType())
    .add("ts_ntz", TimestampNTZType())
    .add(
        "struct", StructType().add("ts", TimestampType()).add("ts_ntz", TimestampNTZType())
    )
)
data = [
    Row(
        datetime.datetime(2023, 1, 1, 0, 0, 0),
        datetime.datetime(2023, 1, 1, 0, 0, 0),
        Row(
            datetime.datetime(2023, 1, 1, 0, 0, 0),
            datetime.datetime(2023, 1, 1, 0, 0, 0),
        ),
    )
]
pdf = pd.DataFrame.from_records(data, columns=schema.names)
```

##### `spark.createDataFrame()`

- Without Arrow

```py
>>> spark.conf.set("spark.sql.session.timeZone", "America/New_York")
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
+-------------------+-------------------+------------------------------------------+
|ts                 |ts_ntz             |struct                                    |
+-------------------+-------------------+------------------------------------------+
|2023-01-01 00:00:00|2023-01-01 00:00:00|{2023-01-01 03:00:00, 2023-01-01 00:00:00}|
+-------------------+-------------------+------------------------------------------+
```

- With Arrow or Spark Connect:

```py
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
+-------------------+-------------------+------------------------------------------+
|ts                 |ts_ntz             |struct                                    |
+-------------------+-------------------+------------------------------------------+
|2023-01-01 00:00:00|2023-01-01 00:00:00|{2022-12-31 19:00:00, 2023-01-01 00:00:00}|
+-------------------+-------------------+------------------------------------------+
```

##### `df.toPandas()`

For the following DataFrame:

```py
>>> spark.conf.unset("spark.sql.session.timeZone")
>>> df = spark.createDataFrame(data, schema)
>>>
>>> df.show(truncate=False)
+-------------------+-------------------+------------------------------------------+
|ts                 |ts_ntz             |struct                                    |
+-------------------+-------------------+------------------------------------------+
|2023-01-01 00:00:00|2023-01-01 00:00:00|{2023-01-01 00:00:00, 2023-01-01 00:00:00}|
+-------------------+-------------------+------------------------------------------+

>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
```

- Without Arrow

```py
>>> spark.conf.set("spark.sql.session.timeZone", "America/New_York")
>>> df.toPandas()
                   ts     ts_ntz                                      struct
0 2023-01-01 03:00:00 2023-01-01  (2023-01-01 00:00:00, 2023-01-01 00:00:00)
```

- With Arrow or Spark Connect:

```py
>>> df.toPandas()
                   ts     ts_ntz                                      struct
0 2023-01-01 03:00:00 2023-01-01  (2023-01-01 08:00:00, 2023-01-01 00:00:00)
```

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use nested timestamps.

### How was this patch tested?

Added/updated the related tests.
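Purely as an illustration of the intended behavior (this is not the test code added by the PR): one could check locally that the Arrow and non-Arrow `toPandas()` paths now agree on nested timestamps. `roundtrip_matches` is a hypothetical helper name, and an active session `spark` plus a DataFrame `df` built as in the examples above are assumed to be in scope.

```py
from pandas.testing import assert_frame_equal


def roundtrip_matches(df):
    """Check that toPandas() gives the same result with and without Arrow."""
    # Render structs the same way on both paths, as in the example above.
    spark.conf.set("spark.sql.execution.pandas.structHandlingMode", "row")

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
    without_arrow = df.toPandas()

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    with_arrow = df.toPandas()

    # check_dtype=False: the two paths may pick slightly different pandas dtypes.
    assert_frame_equal(without_arrow, with_arrow, check_dtype=False)
    return True
```

For example, `roundtrip_matches(spark.createDataFrame(data, schema))` with the `data` and `schema` from the example above would be expected to pass after this change.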
Closes #41240 from ueshin/issues/SPARK-43545/ts.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>