Spark: Split SparkCatalogTestBase and allow running tests with one catalog #3549
Conversation
Force-pushed d6f996c to adacd56
...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
Thanks for working on this @hililiwei! I just have one piece of input that I'd like to discuss (and hear others' opinions on if anyone cares to chime in). I left a comment in the code on where that change might be made.
In some of the tests that extend SparkCatalogTestBase, the catalogs are overridden by a new @Parameters section with only one or a few catalogs. Currently this is often done to allow overriding the catalog config, but everything has to be redeclared.
Here's an example where we overrode the @Parameters to run a different set of catalogs (though still used all 3). The main reason for the override here was to change the catalog configuration with additional entries:
Lines 50 to 77 in f5a7537
@Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
public static Object[][] parameters() {
  return new Object[][] {
      { "testhive", SparkCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hive",
              "default-namespace", "default",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) },
      { "testhadoop", SparkCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hadoop",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) },
      { "spark_catalog", SparkSessionCatalog.class.getName(),
          ImmutableMap.of(
              "type", "hive",
              "default-namespace", "default",
              hadoopPrefixedConfigToOverride, configOverrideValue
          ) }
  };
}

public TestSparkCatalogHadoopOverrides(String catalogName,
                                       String implementation,
                                       Map<String, String> config) {
  super(catalogName, implementation, config);
}
If we could make it so that the config in the enum can be overridden without redeclaring everything (or whatever it winds up being if enums don't support that), that would be a big win in my view and really useful!
It would be great if the passed-in config could override the base configuration key by key, so things like type=hive don't need to be set again.
This could be a task for a follow-up PR, but I'm very interested to hear others' opinions on this 🙂
TEST_HADOOP("testhadoop", SparkCatalog.class.getName(), ImmutableMap.of(
    "type", "hadoop"
)),
SPARK_CATALOG("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of(
    "type", "hive",
    "default-namespace", "default",
    "parquet-enabled", "true",
    "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
));
If we had constructors that provided a copy or something, so we could write SparkCatalogType.testHive(ImmutableMap.of("additionalConfig1", "additionalValue1")), this would have much more benefit for the tests that need to override the Parameters to specify a different catalog configuration.
Though that could be a v2 task.
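To make that concrete, here is a minimal sketch of what such a copy-style helper could look like. The enum name SparkCatalogTypeSketch, the configWithOverrides method, and the accessor layout are assumptions for illustration only, not part of this PR:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.iceberg.spark.SparkCatalog;

// Hypothetical sketch: lets a test declare only the extra entries instead of the whole config.
public enum SparkCatalogTypeSketch {
  TEST_HADOOP("testhadoop", SparkCatalog.class.getName(), ImmutableMap.of("type", "hadoop"));

  private final String catalogName;
  private final String implementation;
  private final Map<String, String> config;

  SparkCatalogTypeSketch(String catalogName, String implementation, Map<String, String> config) {
    this.catalogName = catalogName;
    this.implementation = implementation;
    this.config = config;
  }

  // Returns the base config with the given entries layered on top, key by key.
  public Map<String, String> configWithOverrides(Map<String, String> overrides) {
    Map<String, String> merged = new HashMap<>(config);
    merged.putAll(overrides);
    return ImmutableMap.copyOf(merged);
  }
}

A test's @Parameters block could then call something like TEST_HADOOP.configWithOverrides(ImmutableMap.of(hadoopPrefixedConfigToOverride, configOverrideValue)) instead of redeclaring type, default-namespace, and the rest.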
Thanks for your reply.
Do you mean that developers should specify only the additional config, without having to redeclare the entire catalog configuration repeatedly?
SparkSpecifyCatalogTestBase provides the constructor shown below, which can be used to add additional configuration items.
iceberg/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkSpecifyCatalogTestBase.java
Lines 72 to 79 in 9d431e8
public SparkSpecifyCatalogTestBase(SparkCatalogType sparkCatalogType, Map<String, String> config) {
  this.implementation = sparkCatalogType.getImplementation();
  this.catalogConfig = new HashMap<>(sparkCatalogType.getConfig());
  if (config != null && !config.isEmpty()) {
    config.forEach((key, value) -> catalogConfig.merge(key, value, (oldValue, newValue) -> newValue));
  }
super(SparkCatalogType.SPARK_CATALOG, ImmutableMap.of("additionalConfig1", "additionalValue1"));
Is it possible to use this method to achieve the desired results?
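For context, and purely as an illustration (the subclass name here is made up), that call would sit in a test class roughly like this:

// Hypothetical subclass showing where the super(...) call above would live:
public class TestWithExtraConfig extends SparkSpecifyCatalogTestBase {
  public TestWithExtraConfig() {
    // start from the SPARK_CATALOG base config and merge in one extra entry, key by key
    super(SparkCatalogType.SPARK_CATALOG, ImmutableMap.of("additionalConfig1", "additionalValue1"));
  }
}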
Force-pushed 02a32a6 to 603808b
Force-pushed 603808b to 9d431e8
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkCatalogType.java
import java.util.Map;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

public enum SparkCatalogType {
I wouldn't say this is a catalog type; it's a catalog configuration. We could add more that have slightly different configs.
We could add more that have slightly different configs.
Could you provide more detailed guidance? I'm not familiar with this part. Thank you.
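One way to read that suggestion, purely as a hypothetical illustration (the constant name and property values below are made up, not from this PR): the enum could grow additional constants that reuse the same catalog implementation but carry slightly different properties, for example a Hive catalog with caching disabled:

// Hypothetical extra constant that could sit alongside TEST_HADOOP and SPARK_CATALOG in the quoted enum:
TEST_HIVE_NO_CACHE("testhive_nocache", SparkCatalog.class.getName(), ImmutableMap.of(
    "type", "hive",
    "default-namespace", "default",
    "cache-enabled", "false"
)),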
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkSpecifyCatalogTestBase.java
- public TestRuntimeFiltering(String catalogName, String implementation, Map<String, String> config) {
-   super(catalogName, implementation, config);
+ public TestRuntimeFiltering() {
+   super(SparkCatalogType.SPARK_CATALOG);
I think we should use Hadoop for tests that only use one catalog. Also, if we're using Hadoop then we don't need to use the test base that sets up the metastore. Or at least we should signal to that test base that it should not set up the metastore if it won't be used by a catalog.
iceberg/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java
Lines 54 to 76 in f6fdeb0
@BeforeClass
public static void startMetastoreAndSpark() {
  SparkTestBase.metastore = new TestHiveMetastore();
  metastore.start();
  SparkTestBase.hiveConf = metastore.hiveConf();

  SparkTestBase.spark = SparkSession.builder()
      .master("local[2]")
      .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
      .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
      .enableHiveSupport()
      .getOrCreate();

  SparkTestBase.catalog = (HiveCatalog)
      CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);

  try {
    catalog.createNamespace(Namespace.of("default"));
  } catch (AlreadyExistsException ignored) {
    // the default namespace already exists. ignore the create error
  }
}
Can this be achieved by modifying the code as follows?
@Before
public void checkMetastoreAndSpark() {
  // lazily initialize Spark once, using double-checked locking
  if (SparkTestBase.spark == null) {
    synchronized (SparkTestBase.class) {
      if (SparkTestBase.spark == null) {
        // skip the Hive metastore when the catalog under test doesn't need it
        if (StringUtils.equals(catalogName, SPARK_CATALOG_HADOOP.catalogName())) {
          startSpark();
        } else {
          startMetastoreAndSpark();
        }
      }
    }
  }
}

public static void startSpark() {
  SparkTestBase.spark = SparkSession.builder()
      .master("local[2]")
      .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
      .getOrCreate();
}

public static void startMetastoreAndSpark() {
  SparkTestBase.metastore = new TestHiveMetastore();
  metastore.start();
  SparkTestBase.hiveConf = metastore.hiveConf();

  SparkTestBase.spark = SparkSession.builder()
      .master("local[2]")
      .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
      .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
      .enableHiveSupport()
      .getOrCreate();

  SparkTestBase.catalog = (HiveCatalog)
      CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);

  try {
    catalog.createNamespace(Namespace.of("default"));
  } catch (AlreadyExistsException ignored) {
    // the default namespace already exists. ignore the create error
  }
}
Force-pushed 7f9c82f to 50b189a
import org.junit.Rule;
import org.junit.rules.TemporaryFolder;

public abstract class SparkSpecifyCatalogTestBase extends SparkTestBase {
How about SparkTestBaseWithCatalog? I think that name makes it clear what this is for. Then SparkCatalogTestBase can be for testing catalogs with Spark.
| "parquet-enabled", "true", | ||
| "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync | ||
| )), | ||
| SPARK_SESSION_CATALOG_HADOOP("spark_catalog", SparkSessionCatalog.class.getName(), ImmutableMap.of( |
We only need to test Hive with the Spark session catalog. We don't recommend using other catalogs with the session catalog unless you really know what you're doing.
@hililiwei, I opened a PR against your branch to make this a bit more generic and implement my suggestions.
Update to use SparkCatalogConfig in SparkCatalogTestBase
Thanks for your guidance. Merged.
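For readers following the thread, the rough shape of that change is sketched below; the enum constant names (HIVE, HADOOP, SPARK) and accessor names (catalogName(), implementation(), properties()) are assumptions here, not quoted from the merged code:

// Sketch: SparkCatalogTestBase parameters built from a SparkCatalogConfig enum instead of inline maps.
@Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
public static Object[][] parameters() {
  return new Object[][] {
      { SparkCatalogConfig.HIVE.catalogName(), SparkCatalogConfig.HIVE.implementation(),
          SparkCatalogConfig.HIVE.properties() },
      { SparkCatalogConfig.HADOOP.catalogName(), SparkCatalogConfig.HADOOP.implementation(),
          SparkCatalogConfig.HADOOP.properties() },
      { SparkCatalogConfig.SPARK.catalogName(), SparkCatalogConfig.SPARK.implementation(),
          SparkCatalogConfig.SPARK.properties() }
  };
}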
Thanks, @hililiwei! This should really help us reduce test runtime in CI.
#3499
However, I'm not sure which test cases can be migrated to use it.
Thanks.