
[SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog #19148

Closed
wants to merge 3 commits

Conversation

@cloud-fan (Contributor) commented Sep 6, 2017

What changes were proposed in this pull request?

HiveExternalCatalog is a semi-public interface. When creating tables, HiveExternalCatalog converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework that automatically tests HiveExternalCatalog backward compatibility by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading the tables back with the current Spark version.
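At a high level, the suite downloads each old release, uses its spark-submit to run a script that creates tables in a shared warehouse, and then reads everything back with the current build. A minimal sketch of that loop (helper names such as tryDownloadSpark, runSparkSubmit, and prepareTablesScript are illustrative, not the exact code in this PR):

```scala
// Sketch of the version-compatibility loop; names are illustrative.
val testingVersions = Seq("2.0.2", "2.1.1", "2.2.0")

testingVersions.zipWithIndex.foreach { case (version, index) =>
  val sparkHome = new java.io.File(sparkTestingDir, s"spark-$version")
  // Download and unpack this release unless a cached copy already exists.
  if (!sparkHome.exists()) tryDownloadSpark(version, sparkTestingDir)

  // Use the old release to create tables in the shared warehouse/metastore.
  runSparkSubmit(
    Seq("--conf", s"spark.sql.test.version.index=$index", prepareTablesScript),
    Some(sparkHome.getCanonicalPath))
}
// Afterwards, the current Spark version reads the tables back and checks them.
```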

How was this patch tested?

test-only change

@cloud-fan (Contributor Author)

cc @gatorsmile

@SparkQA commented Sep 6, 2017

Test build #81465 has finished for PR 19148 at commit 283bc45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@gatorsmile (Member)

The test cases pass in my local environment when I run this test suite individually. It took 4 mins. It is not too bad.

[info] Run completed in 4 minutes, 25 seconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

spark = session

testingVersions.indices.foreach { index =>
  checkAnswer(session.sql(s"select * from t$index"), Row(1))
Member

We should also do the other basic SQL. For example, insert, describe, drop table
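A hedged sketch of those extra checks, in the suite's checkAnswer style (the exact statements are illustrative):

```scala
// Illustrative follow-up checks per table: insert, describe, and drop.
session.sql(s"insert into t$index values (2)")
checkAnswer(session.sql(s"select * from t$index"), Row(1) :: Row(2) :: Nil)
assert(session.sql(s"describe table t$index").collect().nonEmpty)
session.sql(s"drop table t$index")
```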

.getOrCreate()

val index = session.conf.get("spark.sql.test.version.index")
session.sql(s"create table t$index using parquet as select 1 a")
Member

Also create a view.

import scala.sys.process._

val url =
  s"http://mirrors.hust.edu.cn/apache/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
@dongjoon-hyun (Member) commented Sep 6, 2017

It's great to have this testsuite. BTW, is it okay to use this single site?
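For reference, the download step can be driven through scala.sys.process; a minimal sketch under the assumption that wget and tar are available on the test machine (the URL and helper name are illustrative, and a real run would want mirror fallback):

```scala
import scala.sys.process._

// Illustrative download-and-unpack helper for one Spark release.
def tryDownloadSpark(version: String, path: String): Unit = {
  val url = s"https://archive.apache.org/dist/spark/spark-$version/" +
    s"spark-$version-bin-hadoop2.7.tgz"
  Seq("wget", url, "-q", "-P", path).!                                   // fetch the tarball
  Seq("tar", "-xzf", s"$path/spark-$version-bin-hadoop2.7.tgz", "-C", path).!  // unpack it
}
```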

 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts {
  private val wareHousePath = Utils.createTempDir(namePrefix = "warehouse")
Member

This single warehouse seems to cause a failure. Maybe Spark 2.0.2 tries to read the metastore created by 2.2.0?

build/sbt -Phive "project hive" "test-only *.HiveExternalCatalogVersionsSuite"
...
[info] - backward compatibility *** FAILED *** (17 seconds, 712 milliseconds)
[info]   spark-submit returned with exit code 1.
...
[info]   2017-09-06 16:07:41.744 - stderr> Caused by: java.sql.SQLException: Database at /Users/dongjoon/PR-19148/target/tmp/warehouse-d2818ad2-f141-4fc7-bc68-e7f67c89f3f4/metastore_db has an incompatible format with the current version of the software.  The database was created by or upgraded by version 10.12.

@SparkQA commented Sep 7, 2017

Test build #81512 has finished for PR 19148 at commit 850536e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@SparkQA commented Sep 7, 2017

Test build #81520 has finished for PR 19148 at commit 3d827f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts

@@ -177,6 +177,10 @@
       <artifactId>libfb303</artifactId>
     </dependency>
     <dependency>
       <groupId>org.apache.derby</groupId>
       <artifactId>derby</artifactId>
@cloud-fan (Contributor Author)

Hive metastore depends on derby 10.10.2, but we package derby 10.12.1 when building Spark, so in the end we use derby 10.12.1 when Spark runs a local Hive metastore.

However, this is a fragile approach; for example, it doesn't work for SBT. When you build Spark with SBT, you are still using derby 10.10.2, which is probably why the test failed on Jenkins.

Here I explicitly add the derby dependency to the hive module, to override the default derby 10.10.2 dependency.

Member

I see. Thank you!

import org.apache.spark.util.Utils


class HiveExternalCatalogBackwardCompatibilitySuite extends QueryTest
@cloud-fan (Contributor Author)

This is covered by the new test suite.

@@ -1354,31 +1354,4 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
      sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
    }
  }

test("SPARK-18464: support old table which doesn't store schema in table properties") {
@cloud-fan (Contributor Author)

This is covered by the new test suite.

|spark.sql("create table external_table_without_schema_" + version_index + \\
| " using json options (path '{}')".format(json_file2))
|
|spark.sql("create view v_{} as select 1 i".format(version_index))
@cloud-fan (Contributor Author)

Hive serde tables are excluded because they are always compatible. Partitioned tables are excluded because the previous compatibility bugs were all about non-partitioned tables.

Member

In the future, we can add more. So far, it sounds good enough.

@cloud-fan cloud-fan changed the title [SPARK-21936][SQL][WIP] backward compatibility test framework for HiveExternalCatalog [SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog Sep 7, 2017
@SparkQA commented Sep 7, 2017

Test build #81524 has finished for PR 19148 at commit 08dcf22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils
  • trait SparkSubmitTestUtils extends SparkFunSuite with Timeouts

@gatorsmile (Member)

Less than 2 mins to finish the suite. It looks pretty good!

  private val unusedJar = TestUtils.createJarWithClasses(Seq.empty)

  override def afterAll(): Unit = {
    Utils.deleteRecursively(wareHousePath)
Member

Also delete tmpDataDir and sparkTestingDir ?

@cloud-fan (Contributor Author)

I want to keep sparkTestingDir, so we don't need to download Spark again if this Jenkins machine has already run this suite before.
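A sketch of the resulting afterAll under that policy (tmpDataDir is the suite's temporary data directory; keeping sparkTestingDir as a download cache is the deliberate choice discussed above):

```scala
override def afterAll(): Unit = {
  try {
    // Clean per-run state, but keep sparkTestingDir as a download cache.
    Utils.deleteRecursively(wareHousePath)
    Utils.deleteRecursively(tmpDataDir)
  } finally {
    super.afterAll()
  }
}
```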

  }
}

object READ_TABLES extends QueryTest with SQLTestUtils {
Member

-> PROCESS_TABLE

@gatorsmile (Member)

LGTM except two minor comments.

 * expected version under this local directory, e.g. `/tmp/spark-test/spark-2.0.3`, we will skip the
 * downloading for this spark version.
 */
class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
Member

I ran this test locally and encountered a failure like:

2017-09-07 19:28:07.595 - stderr> Caused by: java.sql.SQLException: Database at
/root/repos/spark-1/target/tmp/warehouse-66dad501-c743-4ac3-83cc-51451c6d697a/metastore_db
has an incompatible format with the current version of the software.  The database was created by or
upgraded by version 10.12.

@cloud-fan (Contributor Author)

Can you print org.apache.derby.tools.sysinfo.getVersionString in IsolatedClientLoader.createClient to see what your actual derby version is?
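For example, a throwaway debug line (sysinfo.getVersionString is Derby's own API; the placement inside createClient is just for this experiment):

```scala
// Temporary debug output inside IsolatedClientLoader.createClient:
println(s"derby version: ${org.apache.derby.tools.sysinfo.getVersionString}")
```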

Member

After removing the added derby dependency, this test can work.

@cloud-fan (Contributor Author)

Did you try a clean clone? I added the derby dependency to make the test work on Jenkins...

Member

Let me do a clean build and try again.

Member

OK. After a clean build, it works now.

import org.apache.spark.sql.test.ProcessTestUtils.ProcessOutputCapturer
import org.apache.spark.util.Utils

trait SparkSubmitTestUtils extends SparkFunSuite with Timeouts {
Member

nit. Let's use TimeLimits instead of Timeouts. Timeouts is deprecated now.
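That should be a one-line change (sketch; TimeLimits lives in org.scalatest.concurrent in recent ScalaTest releases, and failAfter keeps working unchanged):

```scala
import org.scalatest.concurrent.TimeLimits

// TimeLimits replaces the deprecated Timeouts trait.
trait SparkSubmitTestUtils extends SparkFunSuite with TimeLimits {
  // ... existing failAfter(...) based helpers stay as they are ...
}
```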

@SparkQA commented Sep 8, 2017

Test build #81532 has finished for PR 19148 at commit 00cdd0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|spark = SparkSession.builder.enableHiveSupport().getOrCreate()
|version_index = spark.conf.get("spark.sql.test.version.index", None)
|
|spark.sql("create table data_source_tbl_{} using json as select 1 i".format(version_index))
Member

Instead of only using lowercase column names, should we use a mixed-case Hive schema for those tables?
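For example, a mixed-case variant of the table-creation step (illustrative only; the table and column names are made up):

```scala
// Illustrative mixed-case schema to exercise case preservation in the metastore.
session.sql(s"create table mixed_case_tbl_$index (Id int, camelCaseCol string) using parquet")
```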

// make sure we can insert and query these tables.
session.sql(s"insert into $tbl select 2")
checkAnswer(session.sql(s"select * from $tbl"), Row(1) :: Row(2) :: Nil)
checkAnswer(session.sql(s"select i from $tbl where i > 1"), Row(2))
@viirya (Member) commented Sep 8, 2017

As all the tests are wrapped in a single backward compatibility test, I'm not sure we can easily identify the problematic version if any check fails.

@cloud-fan (Contributor Author)

You can tell the version from the table name. I agree it's a little tricky, but I don't have a better idea...

@SparkQA commented Sep 8, 2017

Test build #81535 has finished for PR 19148 at commit 62369e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class NettyMemoryMetrics implements MetricSet

@gatorsmile (Member)

Thanks! Merged to master.

@gatorsmile (Member)

Could you send a PR to 2.2 branch?

@asfgit asfgit closed this in dbb8241 Sep 8, 2017
@cloud-fan (Contributor Author)

Yea, will do. @viirya feel free to send PRs to improve it, e.g. using mixed-case column names, thanks!

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Sep 8, 2017
…rnalCatalog

`HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework that automatically tests `HiveExternalCatalog` backward compatibility by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading the tables back with the current Spark version.

test-only change

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19148 from cloud-fan/test.
@viirya (Member) commented Sep 8, 2017

@cloud-fan Thanks. I'll do it.

asfgit pushed a commit that referenced this pull request Sep 8, 2017
…eExternalCatalog

backport #19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19163 from cloud-fan/test.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…eExternalCatalog

backport apache#19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19163 from cloud-fan/test.