[SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog #19148
Conversation
cc @gatorsmile
Test build #81465 has finished for PR 19148 at commit
The test cases pass in my local environment when I run this test suite individually. It took 4 mins, which is not too bad.
spark = session

testingVersions.indices.foreach { index =>
  checkAnswer(session.sql(s"select * from t$index"), Row(1))
We should also cover the other basic SQL operations, e.g. insert, describe, drop table.
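For illustration, a minimal sketch of what those extra checks might look like, reusing the `session` and `t$index` naming from the diff above; the exact assertions are assumptions, not the PR's final code:

```scala
// Sketch of the suggested extra coverage: insert, describe, drop.
testingVersions.indices.foreach { index =>
  session.sql(s"insert into t$index select 2")                           // insert
  checkAnswer(session.sql(s"select * from t$index"), Row(1) :: Row(2) :: Nil)
  assert(session.sql(s"describe t$index").count() > 0)                   // describe
  session.sql(s"drop table t$index")                                     // drop
}
```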
    .getOrCreate()

val index = session.conf.get("spark.sql.test.version.index")
session.sql(s"create table t$index using parquet as select 1 a")
Also create a view.
import scala.sys.process._

val url =
  s"http://mirrors.hust.edu.cn/apache/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
It's great to have this test suite. BTW, is it okay to rely on this single mirror site?
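One hedged way to address the single-mirror concern is trying several mirrors in order. A sketch assuming the download uses `scala.sys.process` as in the diff above; the mirror list and the helper name `tryDownloadSpark` are illustrative, not the PR's actual implementation:

```scala
import scala.sys.process._

// Hypothetical mirror fallback; the list and helper name are assumptions.
def tryDownloadSpark(version: String, destDir: String): Unit = {
  val mirrors = Seq(
    "https://archive.apache.org/dist",     // canonical Apache archive
    "http://mirrors.hust.edu.cn/apache")   // the mirror used in the diff above
  val path = s"spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
  val downloaded = mirrors.exists { base =>
    s"wget $base/$path -P $destDir".! == 0 // `!` returns the process exit code
  }
  assert(downloaded, s"Unable to download Spark $version from any mirror")
}
```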
 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts {
  private val wareHousePath = Utils.createTempDir(namePrefix = "warehouse")
This single warehouse seems to cause a failure. Maybe Spark 2.0.2 tries to read the metastore created by 2.2.0?
build/sbt -Phive "project hive" "test-only *.HiveExternalCatalogVersionsSuite"
...
[info] - backward compatibility *** FAILED *** (17 seconds, 712 milliseconds)
[info] spark-submit returned with exit code 1.
...
[info] 2017-09-06 16:07:41.744 - stderr> Caused by: java.sql.SQLException: Database at /Users/dongjoon/PR-19148/target/tmp/warehouse-d2818ad2-f141-4fc7-bc68-e7f67c89f3f4/metastore_db has an incompatible format with the current version of the software. The database was created by or upgraded by version 10.12.
Test build #81512 has finished for PR 19148 at commit

Test build #81520 has finished for PR 19148 at commit
@@ -177,6 +177,10 @@
      <artifactId>libfb303</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.derby</groupId>
      <artifactId>derby</artifactId>
Hive metastore depends on derby 10.10.2, and we package derby 10.12.1 when building Spark, so in the end we are using derby 10.12.1 when Spark uses the local hive metastore.
However, this is a tricky approach; e.g. it doesn't work for SBT. When you build Spark with SBT, you are still using derby 10.10.2, which is probably why the test failed on jenkins.
Here I explicitly add the derby dependency to the hive module, to override the default derby 10.10.2 dependency.
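For readers more familiar with sbt, the same idea expressed as a minimal build.sbt sketch; this assumes a plain sbt project (Spark's real build derives its dependencies from the Maven POMs, so the actual fix is the `<dependency>` addition shown in the diff above), and the exact derby version is taken from the comment and may differ:

```scala
// build.sbt sketch: pin derby explicitly so the 10.10.2 pulled in
// transitively by Hive is overridden. Version number is illustrative.
dependencyOverrides += "org.apache.derby" % "derby" % "10.12.1.1"
```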
I see. Thank you!
import org.apache.spark.util.Utils


class HiveExternalCatalogBackwardCompatibilitySuite extends QueryTest
This is covered by the new test suite.
@@ -1354,31 +1354,4 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
    sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
  }
}

test("SPARK-18464: support old table which doesn't store schema in table properties") {
This is covered by the new test suite.
|spark.sql("create table external_table_without_schema_" + version_index + \\ | ||
| " using json options (path '{}')".format(json_file2)) | ||
| | ||
|spark.sql("create view v_{} as select 1 i".format(version_index)) |
Hive serde tables are excluded because they are always compatible. Partitioned tables are excluded because previous compatibility bugs were all about non-partitioned tables.
In the future, we can add more. So far, it sounds good enough.
Test build #81524 has finished for PR 19148 at commit
Less than 2 mins to finish the suite. It looks pretty good!
private val unusedJar = TestUtils.createJarWithClasses(Seq.empty)

override def afterAll(): Unit = {
  Utils.deleteRecursively(wareHousePath)
Also delete tmpDataDir and sparkTestingDir?
I wanna keep the sparkTestingDir, so we don't need to download spark again if this jenkins machine has already run this suite before.
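A sketch of the cleanup being discussed, assuming the field names from the diffs above (`wareHousePath`, `tmpDataDir`, `sparkTestingDir`) and deliberately leaving `sparkTestingDir` in place as a download cache:

```scala
override def afterAll(): Unit = {
  try {
    // Remove per-run state...
    Utils.deleteRecursively(wareHousePath)
    Utils.deleteRecursively(tmpDataDir)
    // ...but intentionally keep sparkTestingDir so downloaded Spark
    // binaries are reused on later runs of this suite.
  } finally {
    super.afterAll()
  }
}
```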
  }
}

object READ_TABLES extends QueryTest with SQLTestUtils {
-> PROCESS_TABLE
LGTM except two minor comments.
 * expected version under this local directory, e.g. `/tmp/spark-test/spark-2.0.3`, we will skip the
 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
I ran this test locally and encountered a failure like:
2017-09-07 19:28:07.595 - stderr> Caused by: java.sql.SQLException: Database at /root/repos/spark-1/target/tmp/warehouse-66dad501-c743-4ac3-83cc-51451c6d697a/metastore_db has an incompatible format with the current version of the software. The database was created by or upgraded by version 10.12.
Can you print org.apache.derby.tools.sysinfo.getVersionString in IsolatedClientLoader.createClient to see what your actual derby version is?
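The suggestion amounts to a one-line diagnostic; a sketch of what it could look like — the placement inside IsolatedClientLoader.createClient is the reviewer's suggestion, the exact spot within that method is an assumption:

```scala
// Print which derby ended up on the classpath before the metastore
// client is built. sysinfo.getVersionString() is derby's own API.
println(s"derby version: ${org.apache.derby.tools.sysinfo.getVersionString()}")
```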
After removing the added derby dependency, this test can work.
Did you try a clean clone? I added the derby dependency to make the test work on jenkins...
Let me do build clean and try again.
Ok. After a build clean it works now.
import org.apache.spark.sql.test.ProcessTestUtils.ProcessOutputCapturer
import org.apache.spark.util.Utils

trait SparkSubmitTestUtils extends SparkFunSuite with Timeouts {
nit: Let's use TimeLimits instead of Timeouts. Timeouts is deprecated now.
Test build #81532 has finished for PR 19148 at commit
|spark = SparkSession.builder.enableHiveSupport().getOrCreate()
|version_index = spark.conf.get("spark.sql.test.version.index", None)
|
|spark.sql("create table data_source_tbl_{} using json as select 1 i".format(version_index))
Instead of only using lowercase column names, should we use a mixed-case Hive schema for those tables?
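For concreteness, a hedged sketch of a mixed-case variant of the table-creation step; the table and column names are illustrative, and `index` is assumed to come from the surrounding version loop:

```scala
// Mixed-case schema variant of the create-table statement above.
session.sql(
  s"""create table mixed_case_tbl_$index (Id int, camelCaseCol string)
     |using json""".stripMargin)
// Query by mixed-case names to exercise case handling; table is empty here.
checkAnswer(session.sql(s"select Id, camelCaseCol from mixed_case_tbl_$index"), Nil)
```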
// make sure we can insert and query these tables.
session.sql(s"insert into $tbl select 2")
checkAnswer(session.sql(s"select * from $tbl"), Row(1) :: Row(2) :: Nil)
checkAnswer(session.sql(s"select i from $tbl where i > 1"), Row(2))
As all the tests are wrapped in a single backward compatibility test, I'm not sure we can easily identify the problematic version here if any check fails?
You can tell the version from the table name. I agree it's a little tricky, but I don't have a better idea...
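One way to surface the failing version more directly would be ScalaTest's withClue; a sketch assuming the loop shape from the earlier diff (withClue is available because suites mix in org.scalatest.Assertions):

```scala
// Tag each assertion with the Spark version under test so a failure
// message names the problematic version instead of just the table.
testingVersions.indices.foreach { index =>
  withClue(s"table written by Spark ${testingVersions(index)}: ") {
    checkAnswer(session.sql(s"select * from t$index"), Row(1))
  }
}
```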
Test build #81535 has finished for PR 19148 at commit
Thanks! Merged to master.
Could you send a PR to 2.2 branch?
yea will do. @viirya feel free to send PRs to improve it, like using mixed-case column names, thanks!
…rnalCatalog

`HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to hive table format and saves it into the hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it really easy to miss bugs. This PR introduces a test framework to automatically test `HiveExternalCatalog` backward compatibility, by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading these tables with the current Spark version.

test-only change

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19148 from cloud-fan/test.
@cloud-fan Thanks. I'll do it. |
…eExternalCatalog

backport apache#19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19163 from cloud-fan/test.
What changes were proposed in this pull request?

HiveExternalCatalog is a semi-public interface. When creating tables, HiveExternalCatalog converts the table metadata to hive table format and saves it into the hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it really easy to miss bugs. This PR introduces a test framework to automatically test HiveExternalCatalog backward compatibility, by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading these tables with the current Spark version.

How was this patch tested?

test-only change