
[SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog #19148

Closed
wants to merge 3 commits

Conversation

@cloud-fan (Contributor) commented Sep 6, 2017

What changes were proposed in this pull request?

HiveExternalCatalog is a semi-public interface. When creating tables, HiveExternalCatalog converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework that automatically tests HiveExternalCatalog backward compatibility by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading the tables back with the current Spark version.
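At a high level, the suite downloads each old release, uses its spark-submit to run a script that creates tables in a shared warehouse, and then reads everything back with the current build. A minimal sketch of that loop (helper names such as tryDownloadSpark, runSparkSubmit, and prepareTablesScript are illustrative, not the exact code in this PR):

```scala
// Sketch of the version-compatibility loop; names are illustrative.
val testingVersions = Seq("2.0.2", "2.1.1", "2.2.0")

testingVersions.zipWithIndex.foreach { case (version, index) =>
  val sparkHome = new java.io.File(sparkTestingDir, s"spark-$version")
  // Download and unpack this release unless a cached copy already exists.
  if (!sparkHome.exists()) tryDownloadSpark(version, sparkTestingDir)

  // Use the old release to create tables in the shared warehouse/metastore.
  runSparkSubmit(
    Seq("--conf", s"spark.sql.test.version.index=$index", prepareTablesScript),
    Some(sparkHome.getCanonicalPath))
}
// Afterwards, the current Spark version reads the tables back and checks them.
```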

How was this patch tested?

test-only change

@cloud-fan (Contributor Author)

cc @gatorsmile

@SparkQA commented Sep 6, 2017

Test build #81465 has finished for PR 19148 at commit 283bc45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@gatorsmile (Member)

The test cases pass in my local environment when I run this test suite individually. It took 4 mins. It is not too bad.

[info] Run completed in 4 minutes, 25 seconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

spark = session

testingVersions.indices.foreach { index =>
  checkAnswer(session.sql(s"select * from t$index"), Row(1))
Member

We should also do the other basic SQL. For example, insert, describe, drop table
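A hedged sketch of those extra checks, in the suite's checkAnswer style (the exact statements are illustrative):

```scala
// Illustrative follow-up checks per table: insert, describe, and drop.
session.sql(s"insert into t$index values (2)")
checkAnswer(session.sql(s"select * from t$index"), Row(1) :: Row(2) :: Nil)
assert(session.sql(s"describe table t$index").collect().nonEmpty)
session.sql(s"drop table t$index")
```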

.getOrCreate()

val index = session.conf.get("spark.sql.test.version.index")
session.sql(s"create table t$index using parquet as select 1 a")
Member

Also create a view.

import scala.sys.process._

val url =
  s"http://mirrors.hust.edu.cn/apache/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
@dongjoon-hyun (Member) commented Sep 6, 2017

It's great to have this testsuite. BTW, is it okay to use this single site?
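For reference, the download step can be driven through scala.sys.process; a minimal sketch under the assumption that wget and tar are available on the test machine (the URL and helper name are illustrative, and a real run would want mirror fallback):

```scala
import scala.sys.process._

// Illustrative download-and-unpack helper for one Spark release.
def tryDownloadSpark(version: String, path: String): Unit = {
  val url = s"https://archive.apache.org/dist/spark/spark-$version/" +
    s"spark-$version-bin-hadoop2.7.tgz"
  Seq("wget", url, "-q", "-P", path).!                                   // fetch the tarball
  Seq("tar", "-xzf", s"$path/spark-$version-bin-hadoop2.7.tgz", "-C", path).!  // unpack it
}
```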

 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts {
  private val wareHousePath = Utils.createTempDir(namePrefix = "warehouse")
Member

This single warehouse seems to cause a failure. Maybe Spark 2.0.2 tries to read the metastore created by 2.2.0?

build/sbt -Phive "project hive" "test-only *.HiveExternalCatalogVersionsSuite"
...
[info] - backward compatibility *** FAILED *** (17 seconds, 712 milliseconds)
[info]   spark-submit returned with exit code 1.
...
[info]   2017-09-06 16:07:41.744 - stderr> Caused by: java.sql.SQLException: Database at /Users/dongjoon/PR-19148/target/tmp/warehouse-d2818ad2-f141-4fc7-bc68-e7f67c89f3f4/metastore_db has an incompatible format with the current version of the software.  The database was created by or upgraded by version 10.12.

@SparkQA commented Sep 7, 2017

Test build #81512 has finished for PR 19148 at commit 850536e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@SparkQA commented Sep 7, 2017

Test build #81520 has finished for PR 19148 at commit 3d827f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@@ -177,6 +177,10 @@
       <artifactId>libfb303</artifactId>
     </dependency>
     <dependency>
       <groupId>org.apache.derby</groupId>
       <artifactId>derby</artifactId>
@cloud-fan (Contributor Author)

Hive metastore depends on derby 10.10.2, but we package derby 10.12.1 when building Spark, so in the end we use derby 10.12.1 when Spark runs a local Hive metastore.

However, this is a fragile approach; for example, it doesn't work for SBT. When you build Spark with SBT, you are still using derby 10.10.2, which is probably why the test failed on Jenkins.

Here I explicitly add the derby dependency to the hive module, to override the default derby 10.10.2 dependency.

Member

I see. Thank you!

import org.apache.spark.util.Utils


class HiveExternalCatalogBackwardCompatibilitySuite extends QueryTest
@cloud-fan (Contributor Author)

This is covered by the new test suite.

@@ -1354,31 +1354,4 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
      sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
    }
  }

test("SPARK-18464: support old table which doesn't store schema in table properties") {
@cloud-fan (Contributor Author)

This is covered by the new test suite.

|spark.sql("create table external_table_without_schema_" + version_index + \\
| " using json options (path '{}')".format(json_file2))
|
|spark.sql("create view v_{} as select 1 i".format(version_index))
@cloud-fan (Contributor Author)

Hive serde tables are excluded because they are always compatible. Partitioned tables are excluded because the previous compatibility bugs were all about non-partitioned tables.

Member

In the future, we can add more. So far, it sounds good enough.

@cloud-fan cloud-fan changed the title [SPARK-21936][SQL][WIP] backward compatibility test framework for HiveExternalCatalog [SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog Sep 7, 2017
@SparkQA commented Sep 7, 2017

Test build #81524 has finished for PR 19148 at commit 08dcf22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils
  • trait SparkSubmitTestUtils extends SparkFunSuite with Timeouts

@gatorsmile (Member)

Less than 2 mins to finish the suite. It looks pretty good!

  private val unusedJar = TestUtils.createJarWithClasses(Seq.empty)

  override def afterAll(): Unit = {
    Utils.deleteRecursively(wareHousePath)
Member

Also delete tmpDataDir and sparkTestingDir ?

@cloud-fan (Contributor Author)

I want to keep sparkTestingDir, so we don't need to download Spark again if this Jenkins machine has already run this suite before.
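A sketch of the resulting afterAll under that policy (tmpDataDir is the suite's temporary data directory; keeping sparkTestingDir as a download cache is the deliberate choice discussed above):

```scala
override def afterAll(): Unit = {
  try {
    // Clean per-run state, but keep sparkTestingDir as a download cache.
    Utils.deleteRecursively(wareHousePath)
    Utils.deleteRecursively(tmpDataDir)
  } finally {
    super.afterAll()
  }
}
```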

  }
}

object READ_TABLES extends QueryTest with SQLTestUtils {
Member

-> PROCESS_TABLE

@gatorsmile (Member)

LGTM except two minor comments.

 * expected version under this local directory, e.g. `/tmp/spark-test/spark-2.0.3`, we will skip the
 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
Member

I ran this test locally and encountered a failure like:

2017-09-07 19:28:07.595 - stderr> Caused by: java.sql.SQLException: Database at
/root/repos/spark-1/target/tmp/warehouse-66dad501-c743-4ac3-83cc-51451c6d697a/metastore_db
has an incompatible format with the current version of the software.  The database was created by or
upgraded by version 10.12.

@cloud-fan (Contributor Author)

Can you print org.apache.derby.tools.sysinfo.getVersionString in IsolatedClientLoader.createClient to see what your actual derby version is?
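For example, a throwaway debug line (sysinfo.getVersionString is Derby's own API; the placement inside createClient is just for this experiment):

```scala
// Temporary debug output inside IsolatedClientLoader.createClient:
println(s"derby version: ${org.apache.derby.tools.sysinfo.getVersionString}")
```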

Member

After removing the added derby dependency, this test can work.

@cloud-fan (Contributor Author)

Did you try a clean clone? I added the derby dependency to make the test work on Jenkins...

Member

Let me do a clean build and try again.

Member

OK. After a clean build, it works now.

import org.apache.spark.sql.test.ProcessTestUtils.ProcessOutputCapturer
import org.apache.spark.util.Utils

trait SparkSubmitTestUtils extends SparkFunSuite with Timeouts {
Member

nit. Let's use TimeLimits instead of Timeouts. Timeouts is deprecated now.
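That should be a one-line change (sketch; TimeLimits lives in org.scalatest.concurrent in recent ScalaTest releases, and failAfter keeps working unchanged):

```scala
import org.scalatest.concurrent.TimeLimits

// TimeLimits replaces the deprecated Timeouts trait.
trait SparkSubmitTestUtils extends SparkFunSuite with TimeLimits {
  // ... existing failAfter(...) based helpers stay as they are ...
}
```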

@SparkQA commented Sep 8, 2017

Test build #81532 has finished for PR 19148 at commit 00cdd0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|spark = SparkSession.builder.enableHiveSupport().getOrCreate()
|version_index = spark.conf.get("spark.sql.test.version.index", None)
|
|spark.sql("create table data_source_tbl_{} using json as select 1 i".format(version_index))
Member

Instead of only using lowercase column names, should we use a mixed-case Hive schema for those tables?
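For example, a mixed-case variant of the table-creation step (illustrative only; the table and column names are made up):

```scala
// Illustrative mixed-case schema to exercise case preservation in the metastore.
session.sql(s"create table mixed_case_tbl_$index (Id int, camelCaseCol string) using parquet")
```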

// make sure we can insert and query these tables.
session.sql(s"insert into $tbl select 2")
checkAnswer(session.sql(s"select * from $tbl"), Row(1) :: Row(2) :: Nil)
checkAnswer(session.sql(s"select i from $tbl where i > 1"), Row(2))
@viirya (Member) commented Sep 8, 2017

As all the tests are wrapped in a single backward compatibility test, I'm not sure we can easily identify the problematic version if any check fails.

@cloud-fan (Contributor Author)

You can tell the version from the table name. I agree it's a little tricky, but I don't have a better idea...

@SparkQA commented Sep 8, 2017

Test build #81535 has finished for PR 19148 at commit 62369e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class NettyMemoryMetrics implements MetricSet

@gatorsmile (Member)

Thanks! Merged to master.

@gatorsmile (Member)

Could you send a PR to 2.2 branch?

@asfgit asfgit closed this in dbb8241 Sep 8, 2017
@cloud-fan (Contributor Author)

Yea, will do. @viirya feel free to send PRs to improve it, e.g. using mixed-case column names, thanks!

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Sep 8, 2017
…rnalCatalog

`HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework that automatically tests `HiveExternalCatalog` backward compatibility by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading the tables back with the current Spark version.

test-only change

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19148 from cloud-fan/test.
@viirya (Member) commented Sep 8, 2017

@cloud-fan Thanks. I'll do it.

asfgit pushed a commit that referenced this pull request Sep 8, 2017
…eExternalCatalog

backport #19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19163 from cloud-fan/test.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…eExternalCatalog

backport apache#19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19163 from cloud-fan/test.