
[SPARK-5498][SQL]fix query exception when partition schema does not match table schema #4289

Closed
wants to merge 16 commits

Conversation

@jeanlyn (Contributor) commented Jan 30, 2015

In Hive, the schema of a partition may differ from the table schema. When we use Spark SQL to query data from a partition whose schema differs from the table schema, we get the exception described in the JIRA. For example:

  • First, take a look at the schema of the partition and of the table:
DESCRIBE partition_test PARTITION (dt='1');
id                      int                 None                
name                    string                  None                
dt                      string                  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None     
DESCRIBE partition_test;
OK
id                      bigint                  None                
name                    string                  None   
dt                      string                  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None 
  • Then run the query:
SELECT * FROM partition_test where dt='1';

We get the cast exception: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
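
For reference, a minimal Scala reproduction in the spirit of the regression test added by this PR; `sql` is assumed to be a HiveContext, `testData` an existing (key: Int, value: String) table, and the table and column names follow the example above:

// Sketch of a reproduction; names are illustrative.
sql("CREATE TABLE partition_test (id INT, name STRING) PARTITIONED BY (dt STRING)")
sql("INSERT OVERWRITE TABLE partition_test PARTITION (dt='1') SELECT key, value FROM testData")
// Widen the table column type; the existing partition keeps its old int schema.
sql("ALTER TABLE partition_test CHANGE COLUMN id id BIGINT")
// Before this fix, scanning the old partition threw the ClassCastException above.
sql("SELECT * FROM partition_test WHERE dt='1'").collect()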

@jeanlyn jeanlyn changed the title [SPARK-5498][SPARK-SQL]fix bug when query the data when partition schema does not match table schema [SPARK-5498][SQL]fix bug when query the data when partition schema does not match table schema Jan 31, 2015
@marmbrus (Contributor) commented Feb 2, 2015

ok to test

@marmbrus (Contributor) commented Feb 2, 2015

/cc @chenghao-intel

@SparkQA commented Feb 2, 2015

Test build #26479 has finished for PR 4289 at commit adfc7de.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26481 has finished for PR 4289 at commit 10744ca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26484 has finished for PR 4289 at commit b1527d5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26489 has finished for PR 4289 at commit afc7da5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)
    • class StandardScalerModel (

@SparkQA commented Feb 3, 2015

Test build #26592 has finished for PR 4289 at commit 7470901.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 3, 2015

Test build #26602 has finished for PR 4289 at commit 63d170a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Column extends DataFrame with ExpressionApi
    • class ColumnName(name: String) extends IncomputableColumn(name)
    • trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]
    • class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])
    • protected[sql] class QueryExecution(val logical: LogicalPlan)

@SparkQA commented Feb 3, 2015

Test build #26611 has finished for PR 4289 at commit 12d800d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val tmpDir = Files.createTempDir()
sql(s"CREATE TABLE table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ")
sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData")
sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT")
Contributor

I just checked the Hive documentation. It says:
The CASCADE|RESTRICT clause is available in Hive 0.15.0. ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.
I guess that in Hive 0.13.1, when the table schema is changed via ALTER TABLE, only the table metadata is updated; can you double-check whether the query above works in Hive 0.13.1?
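
For reference, the CASCADE form described in that quote would look like the sketch below against the test table above (illustrative only; as quoted, the clause is not available in the Hive versions discussed here):

// Sketch only: CASCADE would push the column change into every partition's metadata
// as well, while RESTRICT (the default) changes only the table metadata, which is
// exactly the mismatch this PR works around.
sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT CASCADE")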

Contributor Author

I checked this query in Hive 0.11 and Hive 0.12 and it is OK; I will check it in Hive 0.13.1 later.

@chenghao-intel (Contributor)

Sorry for the late reply, @jeanlyn!
I think this is a bug in Hive DDL, which was probably resolved in Hive 0.14 / 0.15, and I am not sure whether we really want to fix it in Spark SQL. @yhuai, do you have any comment on this?
However, in this particular case, there is another workaround for your production environment (a sketch follows the list):

  1. Rename the existing table.
  2. Create a new table with the schema you altered, and also its partitions.
  3. Manually move the data from the old table folder into the new table folder on HDFS.
  4. Drop the old table.
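
A sketch of those four steps in Spark SQL terms; the table, column, and partition names are taken from the example above, and the HDFS paths are assumptions that depend on the warehouse location:

// Hedged sketch of the workaround; names and paths are illustrative.
sql("ALTER TABLE partition_test RENAME TO partition_test_old")  // 1. rename the existing table
sql("CREATE TABLE partition_test (id BIGINT, name STRING) PARTITIONED BY (dt STRING)")  // 2. new table with the altered schema
sql("ALTER TABLE partition_test ADD PARTITION (dt='1')")  //    ...and its partitions
// 3. manually move the data on HDFS, e.g.
//    hdfs dfs -mv /user/hive/warehouse/partition_test_old/dt=1/* /user/hive/warehouse/partition_test/dt=1/
sql("DROP TABLE partition_test_old")  // 4. drop the old table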

@jeanlyn (Contributor Author) commented Feb 3, 2015

Thanks @chenghao-intel for the review and suggestions! We want to replace some Hive SQL with Spark SQL in our production environment, so I used some SQL from our production environment (running on Hive 0.12) to test Spark SQL, and that is how I found this issue. I think making Spark SQL more compatible helps adoption, and I will test the points @chenghao-intel listed in both Hive and Spark SQL.

@chenghao-intel (Contributor)

Oh, @jeanlyn, I've also tested this in Hive 0.13, and it seems to work.
Hive does the data type conversion if it realizes the partition schema is not the same as the table schema. Your change seems reasonable. I will review the rest of the code; hopefully we can catch the 0.13 release.

   * @return An `Iterator[Row]` transformed from `iterator`
   */
  def fillObject(
      iterator: Iterator[Writable],
      deserializer: Deserializer,
      nonPartitionKeyAttrs: Seq[(Attribute, Int)],
-     mutableRow: MutableRow): Iterator[Row] = {
+     mutableRow: MutableRow,
+     convertdeserializer: Option[Deserializer] = None): Iterator[Row] = {
Contributor

Instead of passing the deserializer, how about taking the converter as the argument? By the way, I think Hive provides an IdentityConverter, which means we can make the parameter an ObjectInspectorConverters.Converter, not necessarily wrapped in an Option.

Contributor Author

But the val soi also needs the converted deserializer when the schema doesn't match.

Contributor

OK, you're right, forget about my comment above. :)

Contributor

Change the convertdeserializer to outputStructObjectInspector?

Contributor

The variable name should be in camel case: convertdeserializer => convertDeserializer? Or change it to a better name?

@chenghao-intel (Contributor)

In general the change looks reasonable to me, but we'd better use Hive's ObjectInspectorConverters directly, and some of the code could be cleaner.
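
To make the suggestion concrete, here is a minimal sketch (illustrative names, not the code in this PR) of converting partition rows to the table schema with Hive's ObjectInspectorConverters:

import org.apache.hadoop.hive.serde2.Deserializer
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters
import org.apache.hadoop.io.Writable

// Hedged sketch of the converter idea: build a Converter from the partition (input)
// object inspector to the table (output) object inspector and run every deserialized
// value through it. When the two schemas already match, Hive returns an
// IdentityConverter, so the extra call is effectively free.
def convertRawRows(
    rawRows: Iterator[Writable],
    partitionDeserializer: Deserializer,
    tableDeserializer: Deserializer): Iterator[AnyRef] = {
  val converter = ObjectInspectorConverters.getConverter(
    partitionDeserializer.getObjectInspector,
    tableDeserializer.getObjectInspector)
  rawRows.map(w => converter.convert(partitionDeserializer.deserialize(w)))
}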

@SparkQA commented Feb 5, 2015

Test build #26820 has finished for PR 4289 at commit 2a91a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 6, 2015

Hi @chenghao-intel, @marmbrus, any suggestions?

@SparkQA commented Feb 7, 2015

Test build #27010 has finished for PR 4289 at commit 1e8b30c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 9, 2015

Test build #27066 has finished for PR 4289 at commit d6c93c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 9, 2015

Thanks @chenghao-intel for the review and suggestions! I took some of your advice to simplify the code.

@jeanlyn jeanlyn changed the title [SPARK-5498][SQL]fix bug when query the data when partition schema does not match table schema [SPARK-5498][SQL]fix query exception when partition schema does not match table schema Feb 9, 2015
@SparkQA commented Feb 11, 2015

Test build #27267 has finished for PR 4289 at commit 535b0b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 11, 2015

Retest this please

@jeanlyn (Contributor Author) commented Feb 11, 2015

Hi @marmbrus, @chenghao-intel, I have no idea why the SPARK-4407 regression: Complex type support test failed after I resolved the merge conflicts. It does not seem to be caused by my changes, because this unit test passed before.

@chenghao-intel (Contributor)

@jeanlyn The HiveThriftServer unit test was disabled before #4486 was merged. From the log it's hard to tell why it failed; can you try it locally?

build/sbt -Phive-0.13.1 -Phive-thriftserver assembly
build/sbt -Phive-0.13.1 -Phive-thriftserver 'test-only org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite'

@jeanlyn (Contributor Author) commented Feb 11, 2015

@chenghao-intel, I passed all the unit tests locally. But I think the thrift-server unit test is unstable; it depends on the state of the machine, and when the machine is busy it may time out.

@jeanlyn (Contributor Author) commented Feb 11, 2015

/cc @marmbrus

@marmbrus (Contributor)

test this please

@SparkQA commented Mar 12, 2015

Test build #28532 has finished for PR 4289 at commit 535b0b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2015

Test build #28561 has finished for PR 4289 at commit b41d6b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Mar 17, 2015

Updated, @marmbrus @chenghao-intel. We have tested this patch in our environment over the past few days. Are there any remaining problems with this patch?

@@ -244,6 +244,11 @@ private[hive] object HiveShim {
}
}

def getConvertedOI(inputOI: ObjectInspector,
outputOI: ObjectInspector): ObjectInspector = {
Contributor

Nit: wrapped parameters should all be on a new line indented 4 chars.
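
A sketch of the layout the nit asks for (placeholder body; only the parameter indentation matters here):

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

// Each wrapped parameter on its own line, indented four spaces; placeholder body only.
def getConvertedOI(
    inputOI: ObjectInspector,
    outputOI: ObjectInspector): ObjectInspector = {
  outputOI  // the real HiveShim method returns a converted inspector here
}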

@marmbrus (Contributor)

Minor comments, otherwise LGTM.

@SparkQA commented Mar 18, 2015

Test build #28768 has finished for PR 4289 at commit 9c8da74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Mar 18, 2015

Hi @marmbrus, I have updated the code as you mentioned above.

@marmbrus (Contributor)

Thanks! Merged to master.
