
Conversation

@hililiwei
Contributor

@hililiwei hililiwei commented Nov 14, 2021

#3499

1. A new enum named SparkCatalogType with three test Spark catalogs: 'testhive', 'testhadoop', and 'spark_catalog':
public enum SparkCatalogType {
  TEST_HIVE("testhive", SparkCatalog.class.getName(), ImmutableMap.of(
      "type", "hive",
      "default-namespace", "default"
  )),
  TEST_HADOOP("testhadoop", SparkCatalog.class.getName(), ImmutableMap.of(
      "type", "hadoop"
  )),
  SPARK_CATALOG("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of(
      "type", "hive",
      "default-namespace", "default",
      "parquet-enabled", "true",
      "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
  ));
  2. A new abstract base class SparkSpecifyCatalogTestBase with three constructors (a sketch of how the pieces fit together follows below):
  public SparkSpecifyCatalogTestBase() {
    this(SparkCatalogType.TEST_HADOOP, null);
  }

  public SparkSpecifyCatalogTestBase(SparkCatalogType sparkCatalogType) {
    this(sparkCatalogType, null);
  }

  public SparkSpecifyCatalogTestBase(SparkCatalogType sparkCatalogType, Map<String, String> config) {
 .........
 }
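Roughly, the enum carries the fields and accessors that SparkSpecifyCatalogTestBase reads, and a test opts in through its constructor. A sketch only, continuing the enum snippet above; the accessor names match the ones used later in this thread, and the sample test class is illustrative:

public enum SparkCatalogType {
  // ... constants as above ...

  private final String catalogName;
  private final String implementation;
  private final Map<String, String> config;

  SparkCatalogType(String catalogName, String implementation, Map<String, String> config) {
    this.catalogName = catalogName;
    this.implementation = implementation;
    this.config = config;
  }

  public String catalogName() {
    return catalogName;
  }

  public String getImplementation() {
    return implementation;
  }

  public Map<String, String> getConfig() {
    return config;
  }
}

// Illustrative test class: picks a catalog by passing the enum to the base class
public class TestHadoopCatalogExample extends SparkSpecifyCatalogTestBase {
  public TestHadoopCatalogExample() {
    super(SparkCatalogType.TEST_HADOOP);
  }
}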

However, I'm not sure which test cases can be migrated to use it.

Thanks.

@github-actions github-actions bot added the spark label Nov 14, 2021
@hililiwei hililiwei force-pushed the ICEBERG-3499 branch 3 times, most recently from d6f996c to adacd56 on November 14, 2021 15:50
Contributor

@kbendick kbendick left a comment

Thanks for working on this @hililiwei! I just have one piece of input that I'd like to discuss (and hear others' opinions on if they care to chime in). I left a comment in the code on where that change might be made.

In some of the tests that extend SparkCatalogTestBase, the catalogs are overridden by a new @Parameters section with only one or more catalogs. Currently, this is often done to allow overriding the catalog config, but everything has to be redeclared.

Here's an example where we overrode the @Parameters to run a different set of catalogs (though still used all 3). The main reason for the override here was to allow changing the catalog configuration with additional entries:

@Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
public static Object[][] parameters() {
  return new Object[][] {
      { "testhive", SparkCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hive",
              "default-namespace", "default",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) },
      { "testhadoop", SparkCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hadoop",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) },
      { "spark_catalog", SparkSessionCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hive",
              "default-namespace", "default",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) }
  };
}

public TestSparkCatalogHadoopOverrides(String catalogName,
    String implementation,
    Map<String, String> config) {
  super(catalogName, implementation, config);
}

If we could make it so that the config in the enum can be overridden without redeclaring everything (or whatever it winds up being if enums don't support that), that would be a big win in my view and really useful!

It would be great if the passed-in config could override the base configuration key by key, so things like type=hive don't need to be set again.

This could be a task for a follow-up PR, but I'm very interested to hear others' opinions on this 🙂
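To illustrate, a hypothetical rewrite of the constructor above, assuming the base class merges the extra entries on top of the enum's base config (the constructor signature is the one proposed in this PR):

// Hypothetical: the test declares only the extra entries; "type", etc. come
// from the enum's base config and are merged key by key in the base class.
public TestSparkCatalogHadoopOverrides() {
  super(SparkCatalogType.TEST_HADOOP,
      ImmutableMap.of(hadoopPrefixedConfigToOverride, configOverrideValue));
}

(Shown for a single catalog; running all three would still need a parameterized variant of the same idea.)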

Comment on lines 30 to 38
  TEST_HADOOP("testhadoop", SparkCatalog.class.getName(), ImmutableMap.of(
      "type", "hadoop"
  )),
  SPARK_CATALOG("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of(
      "type", "hive",
      "default-namespace", "default",
      "parquet-enabled", "true",
      "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
  ));
Contributor

If we had constructors that provided a copy or something, so we had SparkCatalogType.testHive(ImmutableMap.of("additionalConfig1", "additionalValue1")), this would have much more benefit for the tests that need to override the Parameters to specify different catalog configuration.

Though that could be a v2 task.
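For example, a hypothetical helper like this (since enum constants can't be copied, it returns the merged config map rather than a new constant; the name is illustrative only):

// Hypothetical helper on SparkCatalogType, not part of this PR:
// returns TEST_HIVE's base config with the extra entries layered on top.
public static Map<String, String> testHive(Map<String, String> extraConfig) {
  Map<String, String> merged = new HashMap<>(TEST_HIVE.getConfig());
  merged.putAll(extraConfig);
  return ImmutableMap.copyOf(merged);
}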

Contributor Author

Thanks for your reply.
Do you mean that developers should specify only the additional config, without having to redeclare the entire catalog repeatedly?

SparkSpecifyCatalogTestBase provides the constructor shown below; it can be used to add additional configuration items.

public SparkSpecifyCatalogTestBase(SparkCatalogType sparkCatalogType, Map<String, String> config) {
  this.implementation = sparkCatalogType.getImplementation();
  this.catalogConfig = new HashMap<>(sparkCatalogType.getConfig());
  if (config != null && !config.isEmpty()) {
    config.forEach((key, value) -> catalogConfig.merge(key, value, (oldValue, newValue) -> newValue));
  }
}

Contributor Author

super(SparkCatalogType.SPARK_CATALOG, ImmutableMap.of("additionalConfig1", "additionalValue1"));
Is it possible to use this method to achieve the desired results?

@hililiwei hililiwei force-pushed the ICEBERG-3499 branch 2 times, most recently from 02a32a6 to 603808b on November 15, 2021 06:26
import java.util.Map;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

public enum SparkCatalogType {
Contributor

@rdblue rdblue Nov 16, 2021

I wouldn't say this is a catalog type; it's a catalog configuration. We could add more that have slightly different configs.
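For example, something along these lines (the name and the second constant are illustrative only):

// Hypothetical rename: named for what each constant actually holds (a catalog
// configuration), since two constants can share a catalog type and differ only in config.
public enum SparkCatalogConfig {
  HIVE("testhive", SparkCatalog.class.getName(), ImmutableMap.of(
      "type", "hive",
      "default-namespace", "default"
  )),
  HIVE_NO_DEFAULT_NS("testhive_no_default_ns", SparkCatalog.class.getName(), ImmutableMap.of(
      "type", "hive"
  ));

  // fields, constructor, and accessors as in SparkCatalogType above
}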

Contributor Author

@hililiwei hililiwei Nov 17, 2021

We could add more that have slightly different configs.

Could you provide more detailed guidance? I'm not familiar with this part. Thank you.

public TestRuntimeFiltering(String catalogName, String implementation, Map<String, String> config) {
  super(catalogName, implementation, config);
}

public TestRuntimeFiltering() {
  super(SparkCatalogType.SPARK_CATALOG);
}
Contributor

@rdblue rdblue Nov 16, 2021

I think we should use Hadoop for tests that only use one catalog. Also, if we're using Hadoop then we don't need to use the test base that sets up the metastore. Or at least we should signal to that test base that it should not set up the metastore if it won't be used by a catalog.

Contributor Author

@BeforeClass
public static void startMetastoreAndSpark() {
  SparkTestBase.metastore = new TestHiveMetastore();
  metastore.start();
  SparkTestBase.hiveConf = metastore.hiveConf();

  SparkTestBase.spark = SparkSession.builder()
      .master("local[2]")
      .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
      .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
      .enableHiveSupport()
      .getOrCreate();

  SparkTestBase.catalog = (HiveCatalog)
      CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);

  try {
    catalog.createNamespace(Namespace.of("default"));
  } catch (AlreadyExistsException ignored) {
    // the default namespace already exists. ignore the create error
  }
}

Can this be achieved by modifying the code as follows?


  @Before
  public void checkMetastoreAndSpark() {
    // lazily initialize Spark once for all tests; double-checked locking guards
    // against concurrent initialization across test classes
    if (SparkTestBase.spark == null) {
      synchronized (SparkTestBase.class) {
        if (SparkTestBase.spark == null) {
          // the Hadoop catalog doesn't need a Hive metastore, so skip starting one
          if (StringUtils.equals(catalogName, SPARK_CATALOG_HADOOP.catalogName())) {
            startSpark();
          } else {
            startMetastoreAndSpark();
          }
        }
      }
    }
  }

  public static void startSpark() {
    SparkTestBase.spark = SparkSession.builder()
        .master("local[2]")
        .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
        .getOrCreate();
  }

  public static void startMetastoreAndSpark() {
    SparkTestBase.metastore = new TestHiveMetastore();
    metastore.start();
    SparkTestBase.hiveConf = metastore.hiveConf();

    SparkTestBase.spark = SparkSession.builder()
        .master("local[2]")
        .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
        .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
        .enableHiveSupport()
        .getOrCreate();

    SparkTestBase.catalog = (HiveCatalog)
        CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);

    try {
      catalog.createNamespace(Namespace.of("default"));
    } catch (AlreadyExistsException ignored) {
      // the default namespace already exists. ignore the create error
    }
  }

import org.junit.Rule;
import org.junit.rules.TemporaryFolder;

public abstract class SparkSpecifyCatalogTestBase extends SparkTestBase {
Contributor

@rdblue rdblue Nov 17, 2021

How about SparkTestBaseWithCatalog? I think that makes it clear what this is for. Then SparkCatalogTestBase can be for testing catalogs with Spark.
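Roughly, the hierarchy suggested here (a sketch only; class bodies elided):

// Base for tests that run against a single, preconfigured catalog
public abstract class SparkTestBaseWithCatalog extends SparkTestBase {
  // ...
}

// Parameterized base for testing catalogs with Spark, running each test
// against several catalog configurations
public abstract class SparkCatalogTestBase extends SparkTestBaseWithCatalog {
  // ...
}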

"parquet-enabled", "true",
"cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
)),
SPARK_SESSION_CATALOG_HADOOP("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of(
Contributor

We only need to test Hive with the Spark session catalog. We don't recommend using other catalogs with the session catalog unless you really know what you're doing.
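That would leave a single session-catalog constant, along these lines (the constant name is illustrative; the config mirrors the SPARK_CATALOG entry earlier in this thread):

// Only a Hive-backed SparkSessionCatalog is tested; other catalog types behind
// the session catalog are not recommended.
SPARK_SESSION_CATALOG_HIVE("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of(
    "type", "hive",
    "default-namespace", "default",
    "parquet-enabled", "true",
    "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
));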

@rdblue
Contributor

rdblue commented Nov 17, 2021

@hililiwei, I opened a PR against your branch to make this a bit more generic and implement my suggestions.

@hililiwei
Contributor Author

@hililiwei, I opened a PR against your branch to make this a bit more generic and implement my suggestions.

Thanks for your guidance. Merged.

@hililiwei hililiwei requested a review from rdblue November 18, 2021 09:27
@rdblue rdblue merged commit 419684a into apache:master Nov 18, 2021
@rdblue
Contributor

rdblue commented Nov 18, 2021

Thanks, @hililiwei! This should really help us reduce test runtime in CI.
