
[SPARK-32192][SQL] Print column name when throws ClassCastException #29010

Closed
wants to merge 4 commits

Conversation

@StefanXiepj commented Jul 6, 2020

What changes were proposed in this pull request?

When somebody changes the type of a partition's field, Spark throws a ClassCastException. For example, suppose we have a table like this:

drop table if exists cast_exception_test;
create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;
insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');

If you change the field's type in Hive and then query the old partition, Spark throws a ClassCastException, but Hive does not:

-- change the field's type using hive
alter table cast_exception_test change column c1 c1 string;
-- hive succeeds, but spark throws ClassCastException
select * from cast_exception_test where dt='2020-04-08';

Why are the changes needed?

When the table has many fields, we don't know which field has been changed. If we log this exception together with the field name, it will be very helpful for troubleshooting.

Does this PR introduce any user-facing change?

When the ClassCastException is caused by a changed field type, you can find which field has the problem by searching the executor logs:

20/04/09 17:22:05 ERROR hive.HadoopTableReader: Exception thrown in field <c1>
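Under the hood, the change wraps the per-field unwrapping in `HadoopTableReader`'s fill loop so the failing column can be named before the exception propagates. A minimal sketch of that shape (identifiers such as `unwrappers`, `fieldValue`, `mutableRow`, and `fieldOrdinals` appear in the diff below; `fieldRefs` and the exact log call are assumptions):

```scala
// Sketch: name the field that failed to unwrap, then rethrow so the
// query still fails as before; only the diagnostics improve.
try {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
} catch {
  case ex: Throwable =>
    logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
    throw ex
}
```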

How was this patch tested?

First, prepare the test data; the table is partitioned and stored as ORC:

drop table if exists cast_exception_test;
create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;
insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');

Then, change the field's type in Hive:

alter table cast_exception_test change column c1 c1 string;

Now the table's metadata has been modified, but the partition's metadata, which is stored in the ORC file and in the Hive metastore's MySQL database, is still the old one. So the query throws a ClassCastException in Spark, because Spark uses the table's metadata, which no longer matches the ORC file's metadata; Hive uses the partition's metadata, which still matches it.

If you query the old partition, Spark throws a ClassCastException, but Hive does not:

select * from cast_exception_test where dt='2020-04-08';
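For reference, the failing query can be triggered from spark-shell as well (a sketch; it assumes the table above exists in the Hive metastore and the session has Hive support enabled):

```scala
// The scan reads the old ORC partition whose file schema (c1: int) no
// longer matches the table schema (c1: string), so unwrapping fails.
spark.sql("select * from cast_exception_test where dt='2020-04-08'").show()
```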

@maropu (Member) commented Jul 6, 2020

Thanks for your contribution, @StefanXiepj! By the way, which Spark version did you use? I think the current master does not accept the alter command:

scala> sql("""alter table cast_exception_test change column c1 c1 string""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'c1' with type 'IntegerType' to 'c1' with type 'StringType';

@StefanXiepj (Author):

Thanks for your contribution, @StefanXiepj! By the way, which Spark version did you use? I think the current master does not accept the alter command:

scala> sql("""alter table cast_exception_test change column c1 c1 string""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'c1' with type 'IntegerType' to 'c1' with type 'StringType';

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

  mutableRow.setNullAt(fieldOrdinals(i))
} else {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
  Try {
Member:
I don't feel strongly about it, but I prefer simple try-catch myself. I think it avoids having to handle the 'return value' here, when there isn't one within the while loop?
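For context, the `Try`-based draft under discussion looks roughly like this (a sketch reconstructed from the truncated diff above; the `Failure` handling shown is an assumption):

```scala
// The Try yields a Try[Unit] that must still be consumed inside the
// while loop even though the loop itself returns nothing; a plain
// try-catch avoids producing that value at all.
Try {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
} match {
  case Success(_) => // the row was already mutated in place
  case Failure(ex) =>
    logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
    throw ex
}
```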

Member:
+1

Author:
get it & done

@maropu (Member) commented Jul 7, 2020

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

Could you update the PR description and add some tests for the issue?

@@ -47,6 +48,7 @@ import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.{SerializableConfiguration, Utils}


Member:
nit: please remove the unnecessary change.

Author:
done

@StefanXiepj (Author):

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

Could you update the PR description and add some tests for the issue?

done

@@ -20,6 +20,7 @@ package org.apache.spark.sql.hive
import java.util.Properties

import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}
Member:

Shall we remove this, @StefanXiepj?

Author:
Thanks for your review, I have removed it.

@dongjoon-hyun (Member):

ok to test

@SparkQA commented Jul 8, 2020

Test build #125281 has finished for PR 29010 at commit 018c981.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun added a commit that referenced this pull request Jul 8, 2020
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it is flaky there and is not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment)  (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@srowen (Member) commented Jul 8, 2020

Jenkins retest this please

@SparkQA commented Jul 9, 2020

Test build #125400 has finished for PR 29010 at commit 018c981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 9, 2020

Merged to master

@srowen closed this in 523e238 on Jul 9, 2020