[SPARK-18510] Fix data corruption from inferred partition column dataTypes #15951
Conversation
Test build #68909 has finished for PR 15951 at commit
Test build #68910 has finished for PR 15951 at commit
Test build #68913 has finished for PR 15951 at commit
.partitionBy("part", "id")
.mode("overwrite")
.parquet(src.toString)
// make sure to specify the schema in the wrong order. Partition column in the middle, etc.
Without the fix, the order of columns does not matter, right? As long as the types are right, it should work?
With the fix, it still does not matter. The comment is outdated; I thought something else was the problem. In terms of schema, though, the output from Spark is always consistent, i.e. partition columns go last.
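For readers unfamiliar with that convention, here is a minimal sketch (assuming a spark-shell session; the path and values are illustrative, not from this PR) showing that Spark places the partition column after the data columns when the table is read back:

```scala
// Illustrative only: write a partitioned table, read it back, and note that the
// partition column comes last in the resulting schema, with its type inferred.
spark.range(4)
  .selectExpr("id", "id % 2 as part")
  .write.mode("overwrite").partitionBy("part").parquet("/tmp/partition-order-demo")

spark.read.parquet("/tmp/partition-order-demo").printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- part: integer (nullable = true)   <-- partition column goes last, type inferred
```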
dataSchema = dataSchema,
bucketSpec = None,
format,
caseInsensitiveOptions)(sparkSession)

// This is a non-streaming file based datasource.
case (format: FileFormat, _) =>
val (schema, inferredPartitionColumns) = inferFileFormatSchema(format)
`inferFileFormatSchema` is expensive, right?
Not if you already specified the schema. I'm going to rename it to `getOrInfer` so that people don't think it's expensive all the time.
Sounds good
When users specify schemas, we do not want to infer the schemas because of the potentially expensive cost. Based on my understanding, data corruption issues are common when user-specified schemas do not provide the correct types.
For data source tables, the partition columns are part of the data schema. Users do not need to know which columns are used for partitioning. If they can provide the right types, they should be able to see the expected data. In the test cases, we can get the correct result with the following changes:

```scala
spark.range(4).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part).coalesce(1)

val schema = new StructType()
  .add("part", LongType)
  .add("ex", ArrayType(StringType))
  .add("id", LongType)
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString).show()
```

The order of columns in the specified schema will not affect result correctness. It only affects the column order of the final result set.
True. But there's no reason "part" and "id" can't be strings, right?
Your concern is right. I just did another try, using string-type columns as partition columns. See the following code:

```scala
val rowRdd: RDD[Row] = sparkContext.parallelize(1 to 10).map(i => Row(i, i.toString))
val inputSchema = StructType(Seq(
  StructField("intCol", IntegerType),
  StructField("stringCol", StringType)
))
spark.createDataFrame(rowRdd, inputSchema)
  .write.partitionBy("stringCol").mode("overwrite").parquet(src.toString)

val schema = new StructType()
  .add("intCol", IntegerType)
  .add("stringCol", IntegerType)
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString).show()
```

Users have to use
}

HadoopFsRelation(
fileCatalog,
partitionSchema = fileCatalog.partitionSchema,
This line causes the problem, right? We always ignore the user-specified column types and use the inferred partition columns.
exactly
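For readers following the thread, a rough sketch of the direction the fix takes (as described in the PR description): prefer the user-specified type for each partition column and only fall back to the inferred one. This is a simplified sketch, not the actual patch; it ignores case sensitivity and the smallest-common-type fallback:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Simplified sketch (not the merged code): for every inferred partition column,
// use the field from the user-specified schema when the user mentioned it,
// otherwise keep the inferred field.
def resolvePartitionSchema(
    userSpecifiedSchema: Option[StructType],
    inferredPartitionSchema: StructType): StructType = {
  StructType(inferredPartitionSchema.map { inferred =>
    userSpecifiedSchema
      .flatMap(_.find(_.name == inferred.name)) // user-provided field wins
      .getOrElse(inferred)                      // fall back to the inferred field
  })
}
```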
Test build #68947 has finished for PR 15951 at commit
Test build #68946 has finished for PR 15951 at commit
if (justPartitioning) {
  partitionSchema -> partitionSchema.map(_.name)
}
val tableSchema = userSpecifiedSchema.map { schema =>
Are you missing an `else` here, if you want an early exit?
Good catch. Missed the `return`.
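For context, a Scala `if` without an `else` evaluates to `Unit`, so the pair built inside the `justPartitioning` branch is computed and then silently discarded unless there is an explicit `return` (or an `else`). A toy illustration with made-up values, not Spark code:

```scala
// Toy illustration of the bug pattern discussed above.
def pickBuggy(justPartitioning: Boolean): (Int, String) = {
  if (justPartitioning) {
    (1, "partition-only") // computed, then silently discarded: no return, no else
  }
  (2, "full")             // always the result
}

def pickFixed(justPartitioning: Boolean): (Int, String) = {
  if (justPartitioning) {
    return (1, "partition-only") // the early exit actually happens now
  }
  (2, "full")
}

// pickBuggy(true) == (2, "full")             -- surprising
// pickFixed(true) == (1, "partition-only")
```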
}.getOrElse {
throw new AnalysisException("Unable to infer schema. It must be specified manually.")
val exampleFiles = tempFileCatalog.allFiles().take(2).mkString(",")
This might generate some unwanted files. For example, in JSON, `inferSchema` filters out the unwanted files.
Why would it generate files? At most it may print a few file names that the JSON format actually ignores while inferring the schema.
Nonetheless, to avoid this, could you not just print `allPaths`?
I tried to keep it consistent with before. I can remove it.
// backwards compatibility before SPARK-18510. Return the schema of catalog tables as is
return userSpecifiedSchema.get -> partitionSchema.map(_.name)
}
val tableSchema = userSpecifiedSchema.map { schema =>
`tableSchema` -> `dataSchema`. Be consistent with the other code?
val dataSchema = if (isStreaming) {
  schema
} else {
  StructType(schema.dropRight(inferredPartitionColumns.length))
This is a little hacky. How about changing the return of `getOrInferFileFormatSchema` to (`dataSchema`, `partitionSchema`)?
It looks elegant, but it opens up the possibility of a new kind of bug where the types of the partitioning columns in the returned partitionSchema and dataSchema are different.
Nvm. With the justPartitioning param, it's more confusing as it is right now, so I like this suggestion.
good idea
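To make the suggestion concrete, the before/after signatures would look roughly like this (stubs only, with the `FileFormat` parameter omitted so the sketch stays self-contained):

```scala
import org.apache.spark.sql.types.StructType

// Before: a combined schema plus only the partition column *names*.
def getOrInferFileFormatSchemaOld(justPartitioning: Boolean = false): (StructType, Seq[String]) = ???

// After (as suggested): the data schema and the partition schema, kept separate.
def getOrInferFileFormatSchemaNew(justPartitioning: Boolean = false): (StructType, StructType) = ???
```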
Also cc @ericl @cloud-fan, who recently changed the related code.
This is round 1 of reviewing. I am still looking to understand the code.
private def inferFileFormatSchema(format: FileFormat): (StructType, Seq[String]) = {
userSpecifiedSchema.map(_ -> partitionColumns).orElse {
val allPaths = caseInsensitiveOptions.get("path")
private def getOrInferFileFormatSchema(
Can you add param docs to define what `justPartitioning` means?
This ain't right. You can't return something incorrect. Rather, return null for the first schema. Also, the docs are confusing this way. Please add `@param` and `@return` after the params to clarify what gets returned in both cases.
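For instance, the doc comment could look roughly like this (wording is illustrative, not the comment that was eventually merged):

```scala
/**
 * Returns the schema of the table and the names of its partition columns, using the
 * user-specified schema if present and inferring them from the files otherwise.
 *
 * @param format the file format of the data source
 * @param justPartitioning if true, only the partitioning information is needed and the
 *                         first element of the returned tuple should not be used
 * @return a pair of (table schema, partition column names)
 */
```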
private def getOrInferFileFormatSchema(
    format: FileFormat,
    justPartitioning: Boolean = false): (StructType, Seq[String]) = {
  lazy val tempFileCatalog = {
Please add docs on why this is lazy. It took me half a minute to trace down why this should be lazy.
@@ -126,7 +126,6 @@ case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPl
normalizeColumnName(tableDesc.identifier, schema, colName, "partition")
}
checkDuplication(normalizedPartitionCols, "partition")
undo this change.
@@ -274,7 +274,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
pathToPartitionedTable,
userSpecifiedSchema = Option("num int, str string"),
userSpecifiedPartitionCols = partitionCols,
expectedSchema = new StructType().add("num", IntegerType).add("str", StringType),
expectedSchema = new StructType().add("str", StringType).add("num", IntegerType),
So in this PR, for some cases, the order of fields in the schema created after resolveRelation is changing?
I believe the original test case was incorrect. Although the schema check passes, if you really read rows out of the Dataset, you'll hit an exception, as shown in the following Spark shell session:

```scala
import org.apache.spark.sql.types._

val df0 = spark.range(10).select(
  ('id % 4) cast StringType as "part",
  'id cast StringType as "data"
)
val path = "/tmp/part.parquet"
df0.write.mode("overwrite").partitionBy("part").parquet(path)

val df1 = spark.read.schema(
  new StructType()
    .add("part", StringType, nullable = true)
    .add("data", StringType, nullable = true)
).parquet(path)

df1.printSchema()
// root
// |-- part: string (nullable = true)
// |-- data: string (nullable = true)

df1.show()
// 16/11/22 22:52:21 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 34)
// java.lang.NullPointerException
// at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getArrayLength(OnHeapColumnVector.java:375)
// at org.apache.spark.sql.execution.vectorized.ColumnVector.getArray(ColumnVector.java:554)
// at org.apache.spark.sql.execution.vectorized.ColumnVector.getByteArray(ColumnVector.java:576)
// [...]
```
@@ -532,4 +532,50 @@ class DataStreamReaderWriterSuite extends StreamTest with BeforeAndAfter {
assert(e.getMessage.contains("does not support recovering"))
assert(e.getMessage.contains("checkpoint location"))
}

test("SPARK-18510: Data corruption from user specified partition column schemas")
I would say rename the test to something that explains what this test actually tests. For example, "use user-specified schema for partitioning columns in file sources"
@@ -573,4 +573,40 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
}
}
}

test("SPARK-18510: Data corruption from user specified partition column schemas")
same comment as above.
Test build #68956 has finished for PR 15951 at commit
Round 2
return partitionSchema -> partitionSchema.map(_.name)
}
if (catalogTable.isDefined && userSpecifiedSchema.isDefined) {
// backwards compatibility before SPARK-18510. Return the schema of catalog tables as is
It's returning the user-specified schema, not the catalog table schema.
StructType(schema.dropRight(inferredPartitionColumns.length))
}

val partitionSchema = if (inferredPartitionColumns.isEmpty) {
This is the one I don't understand. Why should the behavior be different depending on whether this non-streaming source (HadoopFsRelation) is created by the user directly or by the file streaming source?
Test build #68964 has finished for PR 15951 at commit
This might change the behavior, but how about just raising an error if the partition types differ from those provided by the user, or if the user failed to provide a partitioning schema? It seems confusing to partially infer a schema when the user does not provide it.
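A minimal sketch of that stricter behavior, assuming a hypothetical helper (this validation is not part of the PR as merged):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical strict check: fail loudly instead of silently reconciling the
// user-specified partition column types with the inferred ones.
def assertPartitionTypesMatch(userSchema: StructType, inferredPartitions: StructType): Unit = {
  inferredPartitions.foreach { inferred =>
    userSchema.find(_.name == inferred.name) match {
      case Some(user) if user.dataType != inferred.dataType =>
        sys.error(s"Partition column '${inferred.name}' was declared as ${user.dataType} " +
          s"but inferred as ${inferred.dataType}")
      case None =>
        sys.error(s"Partition column '${inferred.name}' is missing from the user-specified schema")
      case _ => // types agree; nothing to do
    }
  }
}
```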
@ericl I feel that would probably break 90% of production Spark jobs out there, therefore I'm a bit scared of doing something radical. I agree, it's confusing and annoying.
Test build #68965 has finished for PR 15951 at commit
Thanks @tejasapatil for the review. Addressed your comments.
Test build #68975 has finished for PR 15951 at commit
I don't think this is a valid use case; I think the real problem is that it silently drops the partition schema if the partition column names are duplicated in the data schema. I think the best solution is to add
This LGTM.
Test build #68976 has finished for PR 15951 at commit
Test build #68979 has finished for PR 15951 at commit
Shall we update the document of
Test build #69010 has finished for PR 15951 at commit
LGTM. Just a few questions that need clarification.
val equality = sparkSession.sessionState.conf.resolver
StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
}.orElse {
format.inferSchema(
Just to confirm, `inferSchema` returns the schema without the partition columns?
Yes.
@@ -144,8 +224,8 @@ case class DataSource(
"you may be able to create a static DataFrame on that directory with " +
"'spark.read.load(directory)' and infer schema from it.")
}
val (schema, partCols) = inferFileFormatSchema(format)
SourceInfo(s"FileSource[$path]", schema, partCols)
val (schema, partCols) = getOrInferFileFormatSchema(format)
rename to dataSchema and partitionSchema
.getOrElse(throw new AnalysisException(s"Invalid partition column '$c'"))
})
}
val (dataSchema, inferredPartitionSchema) = getOrInferFileFormatSchema(format)
Nit: this may not be inferred, right? So just `partitionSchema` would be a better name.
val globbedPaths = allPaths.flatMap { path =>
val hdfsPath = new Path(path)
val fs = hdfsPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
val fs = hdfsPath.getFileSystem(hadoopConf)
Any need for this change? `hadoopConf` is not reused anywhere else.
> Any need for this change? hadoopConf is not reused anywhere else.

This is to avoid creating a new hadoopConf for each path.
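In other words, something like the following toy version (paths and app name are made up; it also grabs the configuration from `sparkContext.hadoopConfiguration` since `sessionState.newHadoopConf()` is internal to Spark):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Toy version of the pattern: obtain the Hadoop configuration once and reuse it
// for every path, instead of building a fresh one inside the loop.
val spark = SparkSession.builder().master("local[*]").appName("conf-reuse-demo").getOrCreate()
val allPaths = Seq("/tmp/a", "/tmp/b") // hypothetical input paths

val hadoopConf = spark.sparkContext.hadoopConfiguration // obtained once, outside the loop
val fileSystems = allPaths.map { p =>
  new Path(p).getFileSystem(hadoopConf) // reused for every path
}
```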
@@ -281,33 +361,25 @@ case class DataSource(
// This is a non-streaming file based datasource.
case (format: FileFormat, _) =>
val allPaths = caseInsensitiveOptions.get("path") ++ paths
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val globbedPaths = allPaths.flatMap { path =>
Just to confirm, for paths with globs, this would expand here ONCE, and then expand them AGAIN in `getOrInferFileFormatSchema`, right? If so, we don't have to fix it in this PR, but we should document this in a JIRA or something for fixing later.
Would be good if @rxin can take a look.
Mostly LGTM except for a few minor issues.
format: FileFormat,
justPartitioning: Boolean = false): (StructType, StructType) = {
// the operations below are expensive therefore try not to do them if we don't need to
lazy val tempFileCatalog = {
Nit: tempFileIndex
inferredPartitions
} else {
val partitionFields = partitionColumns.map { partitionColumn =>
userSpecifiedSchema.flatMap(_.find(_.name == partitionColumn)).orElse {
Also need to use the resolver to handle case sensitivity here.
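For reference, the session's resolver is just a `(String, String) => Boolean` (typically `sparkSession.sessionState.conf.resolver`), so the lookup could be written roughly like this (a sketch, not the merged code):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: honor spark.sql.caseSensitive by comparing column names through the
// resolver rather than with ==.
def findPartitionField(
    userSpecifiedSchema: Option[StructType],
    partitionColumn: String,
    resolver: (String, String) => Boolean): Option[StructField] = {
  userSpecifiedSchema.flatMap(_.find(f => resolver(f.name, partitionColumn)))
}
```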
|Falling back to inferred dataType if it exists.
""".stripMargin)
}
inferredPartitions.find(_.name == partitionColumn)
Duplicated code?
private def inferFileFormatSchema(format: FileFormat): (StructType, Seq[String]) = {
userSpecifiedSchema.map(_ -> partitionColumns).orElse {
val allPaths = caseInsensitiveOptions.get("path")
private def getOrInferFileFormatSchema(
I think it would be clearer if we could split this method into two: one for the partition schema and the other for the data schema. That way, we could also remove the `justPartitioning` argument by calling the method you need at the right place.
Well, I just realized that it might be hard to split because of the temporary `InMemoryFileIndex`.
Thanks @liancheng for your comments. Since these are mostly nits, I am going to merge this PR (since it fixes a critical bug for 2.1) and address the final comments in a separate PR.
Author: Burak Yavuz <brkyvz@gmail.com>. Closes #15951 from brkyvz/partition-corruption. (cherry picked from commit 0d1bf2b; Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>)

Follow-up (#15997, zsxwing/SPARK-18510-follow-up): addressed the rest of the comments in #15951. Tested via Jenkins. Author: Shixiong Zhu <shixiong@databricks.com>. (cherry picked from commit 223fa21; Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>)
What changes were proposed in this pull request?

The Issue

If I specify my schema when doing

```scala
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```

but the partition inference can infer it as IntegerType or, I assume, LongType or DoubleType (basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.

Proposed solution

The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.

The real issue is that a user who uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides and fall back to our inferred data types. What happens in the end is data corruption.

My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user-specified schema and use the dataType provided there, or fall back to the smallest common data type.

We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.

A side effect of this PR is that we won't need #15942 if this PR goes in.

How was this patch tested?

Regression tests
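For illustration, a regression-style check along these lines (paths, values, and the standalone session setup are illustrative; the actual tests added by this PR live in the reader/writer suites touched above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Illustrative end-to-end check: the user-specified StringType for the partition
// column must win over the type that partition inference would pick (integer).
val spark = SparkSession.builder().master("local[*]").appName("spark-18510-demo").getOrCreate()
import spark.implicits._

val path = "/tmp/spark-18510-demo" // hypothetical path
spark.range(4)
  .selectExpr("id", "cast(id % 2 as string) as part")
  .write.mode("overwrite").partitionBy("part").parquet(path)

val userSchema = new StructType()
  .add("id", LongType)
  .add("part", StringType) // partition column declared as a string

val df = spark.read.schema(userSchema).parquet(path)
assert(df.schema("part").dataType == StringType)
assert(df.select("part").as[String].collect().toSet == Set("0", "1"))
```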