
[SPARK-5498][SQL]fix query exception when partition schema does not match table schema #4289

Closed
wants to merge 16 commits

Conversation

@jeanlyn (Contributor) commented Jan 30, 2015

In Hive, the schema of a partition may differ from the table schema. When we use Spark SQL to query data from a partition whose schema differs from the table schema, we get the exception described in the JIRA. For example:

  • First, take a look at the schema of the partition and of the table:
DESCRIBE partition_test PARTITION (dt='1');
id                      int                 None                
name                    string                  None                
dt                      string                  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None     
DESCRIBE partition_test;
OK
id                      bigint                  None                
name                    string                  None   
dt                      string                  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None 
  • Then run the query:
SELECT * FROM partition_test where dt='1';

We get the cast exception: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
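
For reference, a minimal Scala reproduction in the spirit of the regression test added by this PR; `sql` is assumed to be a HiveContext, `testData` an existing (key: Int, value: String) table, and the table and column names follow the example above:

// Sketch of a reproduction; names are illustrative.
sql("CREATE TABLE partition_test (id INT, name STRING) PARTITIONED BY (dt STRING)")
sql("INSERT OVERWRITE TABLE partition_test PARTITION (dt='1') SELECT key, value FROM testData")
// Widen the table column type; the existing partition keeps its old int schema.
sql("ALTER TABLE partition_test CHANGE COLUMN id id BIGINT")
// Before this fix, scanning the old partition threw the ClassCastException above.
sql("SELECT * FROM partition_test WHERE dt='1'").collect()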

@jeanlyn jeanlyn changed the title [SPARK-5498][SPARK-SQL]fix bug when query the data when partition schema does not match table schema [SPARK-5498][SQL]fix bug when query the data when partition schema does not match table schema Jan 31, 2015
@marmbrus (Contributor) commented Feb 2, 2015

ok to test

@marmbrus (Contributor) commented Feb 2, 2015

/cc @chenghao-intel

@SparkQA commented Feb 2, 2015

Test build #26479 has finished for PR 4289 at commit adfc7de.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26481 has finished for PR 4289 at commit 10744ca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26484 has finished for PR 4289 at commit b1527d5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2015

Test build #26489 has finished for PR 4289 at commit afc7da5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)
    • class StandardScalerModel (

@SparkQA commented Feb 3, 2015

Test build #26592 has finished for PR 4289 at commit 7470901.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 3, 2015

Test build #26602 has finished for PR 4289 at commit 63d170a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Column extends DataFrame with ExpressionApi
    • class ColumnName(name: String) extends IncomputableColumn(name)
    • trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]
    • class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])
    • protected[sql] class QueryExecution(val logical: LogicalPlan)

@SparkQA commented Feb 3, 2015

Test build #26611 has finished for PR 4289 at commit 12d800d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val tmpDir = Files.createTempDir()
sql(s"CREATE TABLE table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ")
sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData")
sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT")
Contributor

I just checked the Hive documentation. It says:
The CASCADE|RESTRICT clause is available in Hive 0.15.0. ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.
I guess that in Hive 0.13.1, when the table schema is changed via ALTER TABLE, only the table metadata is updated; can you double-check whether the query above works in Hive 0.13.1?
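
For reference, the CASCADE form described in that quote would look like the sketch below against the test table above (illustrative only; as quoted, the clause is not available in the Hive versions discussed here):

// Sketch only: CASCADE would push the column change into every partition's metadata
// as well, while RESTRICT (the default) changes only the table metadata, which is
// exactly the mismatch this PR works around.
sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT CASCADE")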

Contributor Author

I checked this query in Hive 0.11 and Hive 0.12 and it is OK; I will check it in Hive 0.13.1 later.

@chenghao-intel (Contributor)

Sorry for the late reply, @jeanlyn!
I think this is a bug in Hive DDL, which was probably resolved in Hive 0.14 / 0.15, and I am not sure whether we really want to fix it in Spark SQL. @yhuai, do you have any comment on this?
However, in this particular case, there is another workaround for your production environment (a sketch follows the list):

  1. Rename the existing table.
  2. Create a new table with the schema you altered, and also its partitions.
  3. Manually move the data from the old table folder into the new table folder on HDFS.
  4. Drop the old table.
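
A sketch of those four steps in Spark SQL terms; the table, column, and partition names are taken from the example above, and the HDFS paths are assumptions that depend on the warehouse location:

// Hedged sketch of the workaround; names and paths are illustrative.
sql("ALTER TABLE partition_test RENAME TO partition_test_old")  // 1. rename the existing table
sql("CREATE TABLE partition_test (id BIGINT, name STRING) PARTITIONED BY (dt STRING)")  // 2. new table with the altered schema
sql("ALTER TABLE partition_test ADD PARTITION (dt='1')")  //    ...and its partitions
// 3. manually move the data on HDFS, e.g.
//    hdfs dfs -mv /user/hive/warehouse/partition_test_old/dt=1/* /user/hive/warehouse/partition_test/dt=1/
sql("DROP TABLE partition_test_old")  // 4. drop the old table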

@jeanlyn (Contributor Author) commented Feb 3, 2015

Thanks @chenghao-intel for the review and suggestions! We want to replace some Hive SQL with Spark SQL in our production environment, so I used some SQL from our production environment (running on Hive 0.12) to test Spark SQL, and that is how I found this issue. I think making Spark SQL more compatible helps adoption, and I will test the points @chenghao-intel listed in both Hive and Spark SQL.

@chenghao-intel (Contributor)

Oh, @jeanlyn, I've also tested this in Hive 0.13, and it seems to work.
Hive does the data type conversion if it realizes the partition schema is not the same as the table schema. Your change seems reasonable. I will review the rest of the code; hopefully we can catch the 0.13 release.

   * @return An `Iterator[Row]` transformed from `iterator`
   */
  def fillObject(
      iterator: Iterator[Writable],
      deserializer: Deserializer,
      nonPartitionKeyAttrs: Seq[(Attribute, Int)],
-     mutableRow: MutableRow): Iterator[Row] = {
+     mutableRow: MutableRow,
+     convertdeserializer: Option[Deserializer] = None): Iterator[Row] = {
Contributor

Instead of passing the deserializer, how about taking the converter as the argument? By the way, I think Hive provides an IdentityConverter, which means we can make the parameter an ObjectInspectorConverters.Converter, not necessarily wrapped in an Option.

Contributor Author

But the val soi also needs the converted deserializer when the schema doesn't match.

Contributor

OK, you're right, forget about my comment above. :)

Contributor

Change the convertdeserializer to outputStructObjectInspector?

Contributor

The variable name should be in camel case: convertdeserializer => convertDeserializer? Or change it to a better name?

@chenghao-intel (Contributor)

In general the change looks reasonable to me, but we'd better use Hive's ObjectInspectorConverters directly, and some of the code could be cleaner.
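
To make the suggestion concrete, here is a minimal sketch (illustrative names, not the code in this PR) of converting partition rows to the table schema with Hive's ObjectInspectorConverters:

import org.apache.hadoop.hive.serde2.Deserializer
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters
import org.apache.hadoop.io.Writable

// Hedged sketch of the converter idea: build a Converter from the partition (input)
// object inspector to the table (output) object inspector and run every deserialized
// value through it. When the two schemas already match, Hive returns an
// IdentityConverter, so the extra call is effectively free.
def convertRawRows(
    rawRows: Iterator[Writable],
    partitionDeserializer: Deserializer,
    tableDeserializer: Deserializer): Iterator[AnyRef] = {
  val converter = ObjectInspectorConverters.getConverter(
    partitionDeserializer.getObjectInspector,
    tableDeserializer.getObjectInspector)
  rawRows.map(w => converter.convert(partitionDeserializer.deserialize(w)))
}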

@SparkQA commented Feb 5, 2015

Test build #26820 has finished for PR 4289 at commit 2a91a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 6, 2015

Hi @chenghao-intel, @marmbrus, any suggestions?

@SparkQA commented Feb 7, 2015

Test build #27010 has finished for PR 4289 at commit 1e8b30c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 9, 2015

Test build #27066 has finished for PR 4289 at commit d6c93c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 9, 2015

Thanks @chenghao-intel for the review and suggestions! I took some of your advice to simplify the code.

@jeanlyn jeanlyn changed the title [SPARK-5498][SQL]fix bug when query the data when partition schema does not match table schema [SPARK-5498][SQL]fix query exception when partition schema does not match table schema Feb 9, 2015
@SparkQA commented Feb 11, 2015

Test build #27267 has finished for PR 4289 at commit 535b0b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Feb 11, 2015

Retest this please

@jeanlyn (Contributor Author) commented Feb 11, 2015

Hi @marmbrus, @chenghao-intel, I have no idea why the SPARK-4407 regression: Complex type support test failed after I resolved the merge conflicts. It does not seem to be caused by my changes, because this unit test passed before.

@chenghao-intel (Contributor)

@jeanlyn The HiveThriftServer unit test was disabled before #4486 was merged. From the log it's hard to tell why it failed; can you try it locally?

build/sbt -Phive-0.13.1 -Phive-thriftserver assembly
build/sbt -Phive-0.13.1 -Phive-thriftserver 'test-only org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite'

@jeanlyn (Contributor Author) commented Feb 11, 2015

@chenghao-intel, I passed all the unit tests locally. But I think the thrift-server unit test is unstable; it depends on the state of the machine, and when the machine is busy it may time out.

@jeanlyn (Contributor Author) commented Feb 11, 2015

/cc @marmbrus

@marmbrus (Contributor)

test this please

@SparkQA commented Mar 12, 2015

Test build #28532 has finished for PR 4289 at commit 535b0b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2015

Test build #28561 has finished for PR 4289 at commit b41d6b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Mar 17, 2015

Updated, @marmbrus @chenghao-intel. We have tested this patch in our environment over the past few days. Are there any remaining problems with this patch?

@@ -244,6 +244,11 @@ private[hive] object HiveShim {
}
}

def getConvertedOI(inputOI: ObjectInspector,
outputOI: ObjectInspector): ObjectInspector = {
Contributor

Nit: wrapped parameters should all be on a new line indented 4 chars.
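
A sketch of the layout the nit asks for (placeholder body; only the parameter indentation matters here):

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

// Each wrapped parameter on its own line, indented four spaces; placeholder body only.
def getConvertedOI(
    inputOI: ObjectInspector,
    outputOI: ObjectInspector): ObjectInspector = {
  outputOI  // the real HiveShim method returns a converted inspector here
}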

@marmbrus (Contributor)

Minor comments, otherwise LGTM.

@SparkQA commented Mar 18, 2015

Test build #28768 has finished for PR 4289 at commit 9c8da74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jeanlyn (Contributor Author) commented Mar 18, 2015

Hi @marmbrus, I have updated the code as you mentioned above.

@marmbrus (Contributor)

Thanks! Merged to master.
