
[SPARK-32192][SQL] Print column name when throws ClassCastException #29010

Closed
wants to merge 4 commits

Conversation

@StefanXiepj commented Jul 6, 2020

What changes were proposed in this pull request?

When somebody changes the type of a partition's field, Spark throws a ClassCastException. For example, suppose we have a table like this:

drop table if exists cast_exception_test;
create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;
insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');

If you change the field's type in Hive and then query the old partition, Spark throws a ClassCastException, but Hive does not:

-- change the field's type using hive
alter table cast_exception_test change column c1 c1 string;
-- hive succeeds, but spark throws ClassCastException
select * from cast_exception_test where dt='2020-04-08';

Why are the changes needed?

When the table has many fields, we don't know which field has been changed. If we log this exception together with the field name, it will be very helpful for troubleshooting.

Does this PR introduce any user-facing change?

When the ClassCastException is caused by a changed field type, you can find which field has the problem by searching the executor logs:

20/04/09 17:22:05 ERROR hive.HadoopTableReader: Exception thrown in field <c1>
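Under the hood, the change wraps the per-field unwrapping in `HadoopTableReader`'s fill loop so the failing column can be named before the exception propagates. A minimal sketch of that shape (identifiers such as `unwrappers`, `fieldValue`, `mutableRow`, and `fieldOrdinals` appear in the diff below; `fieldRefs` and the exact log call are assumptions):

```scala
// Sketch: name the field that failed to unwrap, then rethrow so the
// query still fails as before; only the diagnostics improve.
try {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
} catch {
  case ex: Throwable =>
    logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
    throw ex
}
```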

How was this patch tested?

First, prepare the test data; the table is partitioned and stored as ORC:

drop table if exists cast_exception_test;
create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;
insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');

Then, change the field's type in Hive:

alter table cast_exception_test change column c1 c1 string;

Now the table's metadata has been modified, but the partition's metadata, which is stored in the ORC file and in the Hive metastore's MySQL database, is still the old one. So the query throws a ClassCastException in Spark, because Spark uses the table's metadata, which no longer matches the ORC file's metadata; Hive uses the partition's metadata, which still matches it.

If you query the old partition, Spark throws a ClassCastException, but Hive does not:

select * from cast_exception_test where dt='2020-04-08';
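For reference, the failing query can be triggered from spark-shell as well (a sketch; it assumes the table above exists in the Hive metastore and the session has Hive support enabled):

```scala
// The scan reads the old ORC partition whose file schema (c1: int) no
// longer matches the table schema (c1: string), so unwrapping fails.
spark.sql("select * from cast_exception_test where dt='2020-04-08'").show()
```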

@maropu (Member) commented Jul 6, 2020

Thanks for your contribution, @StefanXiepj! By the way, which Spark version did you use? I think the current master does not accept the alter command:

scala> sql("""alter table cast_exception_test change column c1 c1 string""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'c1' with type 'IntegerType' to 'c1' with type 'StringType';

@StefanXiepj (Author):

Thanks for your contribution, @StefanXiepj! By the way, which Spark version did you use? I think the current master does not accept the alter command:

scala> sql("""alter table cast_exception_test change column c1 c1 string""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'c1' with type 'IntegerType' to 'c1' with type 'StringType';

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

  mutableRow.setNullAt(fieldOrdinals(i))
} else {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
  Try {
Member:
I don't feel strongly about it, but I prefer simple try-catch myself. I think it avoids having to handle the 'return value' here, when there isn't one within the while loop?
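For context, the `Try`-based draft under discussion looks roughly like this (a sketch reconstructed from the truncated diff above; the `Failure` handling shown is an assumption):

```scala
// The Try yields a Try[Unit] that must still be consumed inside the
// while loop even though the loop itself returns nothing; a plain
// try-catch avoids producing that value at all.
Try {
  unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
} match {
  case Success(_) => // the row was already mutated in place
  case Failure(ex) =>
    logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
    throw ex
}
```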

Member:
+1

Author:
get it & done

@maropu (Member) commented Jul 7, 2020

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

Could you update the PR description and add some tests for the issue?

@@ -47,6 +48,7 @@ import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.{SerializableConfiguration, Utils}


Member:
nit: please remove the unnecessary change.

Author:
done

@StefanXiepj (Author):

Sorry, I forgot to mention: the alter command was executed in Hive, and the query command was executed in Spark.

Could you update the PR description and add some tests for the issue?

done

@@ -20,6 +20,7 @@ package org.apache.spark.sql.hive
import java.util.Properties

import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}
Member:

Shall we remove this, @StefanXiepj?

Author:
Thanks for your review, I have removed it.

@dongjoon-hyun (Member):

ok to test

@SparkQA commented Jul 8, 2020

Test build #125281 has finished for PR 29010 at commit 018c981.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun added a commit that referenced this pull request Jul 8, 2020
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it is flaky there and is not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment)  (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@srowen (Member) commented Jul 8, 2020

Jenkins retest this please

@SparkQA commented Jul 9, 2020

Test build #125400 has finished for PR 29010 at commit 018c981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 9, 2020

Merged to master

@srowen closed this in 523e238 on Jul 9, 2020