From 7aa4c114386d409df415af8b1765c408b2db86e4 Mon Sep 17 00:00:00 2001
From: Dilip Biswal
Date: Thu, 8 Feb 2018 10:46:18 -0800
Subject: [PATCH] [SPARK-23271][DOC] Document the empty dataframe write semantics

---
 docs/sql-programming-guide.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index a0e221b39cc34..db38bb4071078 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1930,6 +1930,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
  - Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them.
  - The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, i.e. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible.
+ - Since Spark 2.3, writing an empty dataframe (a dataframe with 0 partitions) in parquet or orc format creates a format-specific, metadata-only file. In prior versions the metadata-only file was not created. As a result, a subsequent attempt to read from this directory failed with `AnalysisException` while inferring the schema of the file. For example, `df.write.format("parquet").save("outDir")`
+followed by `df.read.format("parquet").load("outDir")` resulted in `AnalysisException` in prior versions.
+

 ## Upgrading From Spark SQL 2.1 to 2.2

  - Spark 2.1.1 introduced a new configuration key: `spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of `NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 changes this setting's default value to `INFER_AND_SAVE` to restore compatibility with reading Hive metastore tables whose underlying file schema have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on first access Spark will perform schema inference on any Hive metastore table for which it has not already saved an inferred schema. Note that schema inference can be a very time-consuming operation for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to avoid the initial overhead of schema inference. Note that with the new default `INFER_AND_SAVE` setting, the results of the schema inference are saved as a metastore key for future use. Therefore, the initial schema inference occurs only at a table's first access.