
Enable Hive3 builds for iceberg-mr and iceberg-hive-metastore #1478

Merged (9 commits) on Oct 7, 2020

Conversation

@marton-bod (Collaborator) commented Sep 18, 2020

This is a second iteration after an earlier draft PR: #1455

The goal of this PR is to run the iceberg-mr tests with Hive3/Hadoop3 and enable the hive-runtime module to work with Hive3 dependencies in addition to Hive2.

@rdsr (Contributor) commented Sep 20, 2020

Hi @marton-bod thank you for working on this. I have a few questions

  1. I think the hive-metastore module should be unaffected by whether we are running with hive2 or hive3, as hive-metastore only uses the MetastoreClient API, and I've seen that Metastore upgrades are always backward compatible. I believe Hive metastore 3 should be backward compatible with Hive metastore client 2. In that respect, we could keep using the hive-metastore module as is. @pvary, @rdblue thoughts?
  2. For the hive code in the mr module which requires changes because of hive2/hive3 artifacts: is it possible to move the hive-specific code out of the mr module and have 3 modules [similar to spark2/spark3]: hive-exec for common code, hive-exec2 and hive-exec3 for incompatible code? This way our mr module is simplified, and it is also aligned with our discussion on the mailing list to have a hive-exec module. This is easier said than done, since it will mean we might have to ship different runtime jars. So it's best to also get agreement from folks like @rdblue, @massdosage and @pvary.

@marton-bod (Collaborator, Author) commented:

Thanks @rdsr for sharing your thoughts!
1.) I agree with you, hive-metastore should be left as is. I haven't changed the hive-metastore module other than fixing the teardown problem in the unit tests, which occurs when using hive 3 dependencies (discussed in #1455).
2.) I think I've done what you just described, only under different module names. Right now, common code resides in iceberg-mr, and incompatible code has been moved out to iceberg-mr-hive2 and iceberg-mr-hive3. I'm open to renaming the modules to exec, which sounds like a more suitable name. As for shipping different runtime jars, we might want to create separate runtime modules too, just like for spark2 and spark3.

@rdsr (Contributor) commented Sep 20, 2020

Thanks @marton-bod for your comment. Is it possible, then, to only have hive3 dependencies under iceberg-mr-hive3 instead of building other modules too with hive3 and hadoop3?

@marton-bod (Collaborator, Author) commented Sep 20, 2020

In order to build iceberg-mr/iceberg-mr-hive3, we need to build the other iceberg subprojects that mr depends on, such as hive-metastore, core, parquet, etc., with Hive3/Hadoop3 as well. For example, there is an HMSHandler API change between Hive2 and Hive3, so if we didn't build the hive-metastore module with Hive3, mr (built with Hive3) would not be able to communicate with it.

However, just to reiterate, the idea is that for the 'normal' gradle build, all modules will continue to be built using Hive2/Hadoop2 just as before, with no changes. Only when the -Dhive3 flag is passed to the build will Hive3/Hadoop3 dependencies be used, and when this flag is enabled, only those modules are included that are ready for the Hive3 build (i.e. Spark and Flink are excluded for now, although they can be added by the community whenever suitable).

@massdosage (Contributor) left a review:


I'm not a Gradle expert, so I can't comment on whether there is a better way to add Hive 3 support to the build. I do prefer this solution to the one from the previous PR, and I like the idea of initially adding it as an optional feature for those who want to try out Hive 3, while keeping Hive 2 as the default for everything else.

FWIW I took the hive-runtime jar produced by a default build (i.e. with the Hive 2 default) and tried this out on a real Hive cluster in distributed mode and my set of test queries all completed successfully so I can confirm that this PR doesn't appear to break anything.

@marton-bod (Collaborator, Author) commented:

@rdblue @massdosage @rdsr
Do you guys think we can go ahead with this change to enable Hive 3 builds?
Let me know if you have any suggestions, I'm open to discussion and changes. Thank you!

@massdosage (Contributor) commented:

> @rdblue @massdosage @rdsr
> Do you guys think we can go ahead with this change to enable Hive 3 builds?
> Let me know if you have any suggestions, I'm open to discussion and changes. Thank you!

Since this is optional and, from what I can tell, works fine with Hive 2 by default, I'm OK with it. As I said above, I don't know enough about Gradle to really say whether there is a better way of achieving this, so I'll defer to the others.

@rdblue (Contributor) commented Sep 24, 2020

Thanks for working on this, @marton-bod! And sorry for the delay in replying on this. I was initially focused on trying to avoid the problem of needing both hive 2 and hive 3 modules, but I don't see a way around it because the OI interfaces now specify incompatible objects. So I agree that we will need additional modules to handle this.

But, I think there are some ways to simplify the changes this introduces. Because the changes needed between 2 and 3 are minor, I think the goal should be to produce a single iceberg-hive-runtime Jar that works in both versions. To do that, we need to build a DateObjectInspector for Hive 2 and one for Hive 3, but return the correct one at runtime using reflection. That way, we detect whether to use Hive 2 or 3 (e.g., by checking if DateWritableV2 exists) and return an OI that matches. The other class would not be loaded, so it would not cause any issues.

I think that we can achieve this using just one new module, iceberg-hive3, that adds the new object inspectors. The other module could continue to depend on Hive 2.

I'd like to avoid selecting versions.props based on flags, so we would ideally just embed the Hive 3 version in the iceberg-hive3 dependency. That module should also depend on the iceberg-mr module so that it can run the same tests (maybe it can set the test source directory to share with hive2?). This module would have both hive 2 and hive 3 classes in its test classpath, so it would validate that having both doesn't break Hive. And, since Hadoop is always pulled in as a test dependency, it can use Hadoop 3.

Finally, the iceberg-hive-runtime module would pull in both iceberg-hive and iceberg-hive3 so that all of the classes are in our runtime module.

I think this would greatly simplify the support:

  1. It would add only one new module
  2. It would avoid using build flags

What do you think?
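The runtime detection described above can be sketched as follows. This is an illustration only: the marker class is Hive 3's DateWritableV2 as mentioned in the comment, but the inspector class names are hypothetical, not the actual Iceberg classes.

```java
// Sketch of reflection-based Hive version detection (assumed names; the
// inspector classes returned below are hypothetical, not real Iceberg classes).
public class HiveVersionUtil {
  // DateWritableV2 was introduced in Hive 3's serde2 package.
  private static final String HIVE3_MARKER =
      "org.apache.hadoop.hive.serde2.io.DateWritableV2";

  public static boolean classExists(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static boolean isHive3() {
    return classExists(HIVE3_MARKER);
  }

  // Only the name of the matching inspector is returned; the other class is
  // never loaded, so its incompatible signatures cause no linkage errors.
  public static String dateInspectorClass() {
    return isHive3()
        ? "org.apache.iceberg.mr.hive3.Hive3DateObjectInspector" // hypothetical
        : "org.apache.iceberg.mr.hive.Hive2DateObjectInspector"; // hypothetical
  }
}
```

The key property is that the losing branch is only a string constant, so the class it names is never class-loaded on an incompatible Hive version.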

@rdsr (Contributor) commented Sep 24, 2020

> (quoting @rdblue's proposal above in full)

+1. This seems like a much cleaner approach if we can get this working!

@marton-bod (Collaborator, Author) commented Sep 24, 2020

Thanks @rdblue for your comment. I will look into refactoring the solution to use your suggested approach, which I like.

My only concern is that because there is a breaking change in the metastore API between Hive2 and 3, there will have to be two iceberg-hive-metastore module versions: one for Hive3 (iceberg-mr-hive3 would use this for the HiveCatalog tests) and one for Hive2 (the rest of the modules would use this). I'm not sure, but hopefully it's possible to create a second metastore module in Gradle pointing to the same source files but using different dependency versions.

The other thing I'm thinking of is that if the hive2-specific parts are not factored out of iceberg-mr, then when iceberg-mr-hive3 pulls that in as a dependency but builds with Hive3, those couple of ObjectInspector classes would not compile.

@rdblue (Contributor) commented Sep 24, 2020

> there is a breaking change in the metastore API between Hive2 and 3

What was the incompatibility? Ideally, we will handle it with reflection to avoid needing a different module.

> if the hive2 specific parts are not factored out from iceberg-mr, when iceberg-mr-hive3 pulls that in as a dependency, but using Hive3, those couple of ObjectInspector classes would not compile

The classes are already compiled. We just have to avoid loading them. So we would use different class names for the inspectors between 2 and 3 and load the correct one using reflection, depending on whether the detected Hive version is 2 or 3.

@marton-bod (Collaborator, Author) commented:

@rdblue @rdsr @massdosage
Thanks a lot for your very valuable comments from last week!
I have reworked the solution according to the proposal by @rdblue, and I think it came out simpler and better. I would greatly appreciate it if you could take a look at it once again.
Thank you!

@rdblue (Contributor) commented Sep 29, 2020

Nice work, @marton-bod! This looks a lot better and I think we will be able to merge it soon. We should clean it up further by using the reflection helpers in iceberg-common first, though.

@rdblue (Contributor) commented Sep 29, 2020

Thank you, @marton-bod! Looking good but there are a few more issues. Mostly, I'd prefer not to parse values in tests.

And, if tests pass in Java 11, I think we should go ahead and keep the module in Java 11 as well.

@marton-bod (Collaborator, Author) commented Sep 30, 2020

Thanks @rdblue once again for the improvement suggestions; I've addressed them.
As for JDK11, we unfortunately do have test failures (#1478 (comment)), so we'll need to go with JDK8 only for Hive3.

@rdblue (Contributor) commented Sep 30, 2020

Looks like there are some test failures:

org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithHadoopCatalog > testScanTable FAILED
    java.lang.RuntimeException: Cannot start TestHiveMetastore
        Caused by:
        java.lang.RuntimeException: MetaException(message:Persistence Manager has been closed)
            Caused by:
            MetaException(message:Persistence Manager has been closed)
                Caused by:
                MetaException(message:Persistence Manager has been closed)
                    Caused by:
                    javax.jdo.JDOFatalUserException: Persistence Manager has been closed

@rdblue (Contributor) commented Sep 30, 2020

+1, we just need to get the tests working.

@marton-bod (Collaborator, Author) commented:

Thanks. I'm looking into the flaky test; it still occurs intermittently.

@rdblue (Contributor) commented Oct 1, 2020

@marton-bod, we were just talking about test metastores on #1495: #1495 (comment)

I think part of the problem is that this is creating a new metastore instance for each test case. That's going to take longer and doesn't catch connection leaks. That's probably also causing the issue here, where something isn't cleaned up properly. I recommend moving metastore setup to a @BeforeClass method, like in SparkTestBase.

I think that would address the issue here.
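As a minimal illustration of the suggestion (not Iceberg's actual test code), per-class setup amounts to starting the expensive resource once and reusing it across test cases. FakeMetastore below is a stand-in for TestHiveMetastore, and beforeClass() plays the role of a JUnit @BeforeClass method:

```java
// Illustration of per-class vs per-test setup (not Iceberg code):
// the expensive service starts once, no matter how many test cases run.
public class PerClassSetupDemo {
  // Stand-in for TestHiveMetastore; counts how many times it is started.
  static class FakeMetastore {
    static int startCount = 0;
    void start() { startCount++; }
    void stop() { }
  }

  static FakeMetastore metastore;

  // Analogous to a JUnit @BeforeClass method: runs once for the whole class.
  static void beforeClass() {
    metastore = new FakeMetastore();
    metastore.start();
  }

  // Each "test case" reuses the shared instance instead of starting its own.
  static void testCase() {
    if (metastore == null) {
      throw new IllegalStateException("metastore not started");
    }
  }

  // Runs a suite of n test cases and reports how often the service started.
  public static int runSuite(int n) {
    beforeClass();
    for (int i = 0; i < n; i++) {
      testCase();
    }
    metastore.stop();
    return FakeMetastore.startCount;
  }
}
```

Besides being faster, a single long-lived instance surfaces connection leaks that per-test instances would mask.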

@marton-bod (Collaborator, Author) commented:

@rdblue Thanks for the suggestion! As per that, I've changed the code to create a metastore instance only once per test class to make things faster.

The intermittent test failure was due to the fact that, in Hive3, the hive conf properties you set on the HiveRunner shell were not passed down from HiveRunner to all the threads spawned inside it (e.g. HiveMaterializedViewsRegistry) during HS2 initialization. This meant that these threads were calling ObjectStore#setConf with a different set of jdo properties than the rest, leading to the closure and replacement of the global ObjectStore#PersistenceManagerFactory instance whenever a config change was detected (resulting in intermittent "Persistence Manager has been closed" errors for those threads that were still using their cached, but now closed, PMs). Fortunately, during HiveConf construction, all the system properties are surveyed and included, so we can ensure that the metastore_uris property is properly propagated and get the desired behaviour.
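A small sketch of why routing the property through JVM system properties works (illustrative only; the URI value is a placeholder): system properties are global to the JVM, so a value set on one thread is visible to any other thread that later reads it, e.g. while constructing its own HiveConf:

```java
// Demonstration that a JVM system property set on one thread is visible on
// every other thread, which is why spawned threads building a HiveConf can
// pick up a property such as hive.metastore.uris. The URI is a placeholder.
public class SysPropAcrossThreads {
  public static String readOnAnotherThread(String key) {
    final String[] seen = new String[1];
    // Read the property from a freshly spawned thread, not the setter thread.
    Thread worker = new Thread(() -> seen[0] = System.getProperty(key));
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen[0];
  }

  public static void main(String[] args) {
    System.setProperty("hive.metastore.uris", "thrift://localhost:9083");
    System.out.println(readOnAnotherThread("hive.metastore.uris"));
    // prints thrift://localhost:9083
  }
}
```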

rdblue merged commit ac5df85 into apache:master on Oct 7, 2020
@rdblue (Contributor) commented Oct 7, 2020

Thanks for all your hard work on this, @marton-bod! Great to have support for Hive 3 now.
