[SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 #10590
Thanks for raising this, @johnwilsonabartys. You are correct: with 0.14.1 we see this issue during an incremental CDC read, while with 0.14.0 we do not. Raised a JIRA to track it: https://issues.apache.org/jira/browse/HUDI-7360
The same happens with the streaming source; these two places are essentially the only ones where this module is used.
Is there a way to manually add the class after importing the Spark bundle?
I got it to compile and bootstrapped the Spark bundle, Hive sync, and AWS bundle to EMR. Now getting java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found
Following these steps should almost certainly fix the issue for you:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Namely, specify the value for hive.metastore.client.factory.class using the spark-hive-site classification, as shown in the following example:
```json
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```
This is needed because the class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory is bundled inside AWS EMR and is not available in any public repository. An old version of it lives at https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore, but AWS no longer publishes it.
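If you cannot (or would rather not) change cluster classifications, the same factory class can typically also be injected per job through Spark's Hadoop configuration passthrough. A minimal sketch; relying on the `spark.hadoop.*` prefix route is an assumption about your setup, not something the thread above confirms:

```python
# Sketch: per-job equivalent of the spark-hive-site classification above.
# Spark forwards any "spark.hadoop.*" key into the Hadoop/Hive configuration,
# so the Glue factory class can be set at submit time.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

spark_confs = {
    "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
}

# Rendered as spark-submit arguments: ["--conf", "key=value", ...]
submit_args = [arg for k, v in spark_confs.items() for arg in ("--conf", f"{k}={v}")]
```

The classification route is still preferable when you control the cluster, since it applies to every job without per-submit flags.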
Final note, and apologies for the number of posts, but this may help EMR users with Glue as their Hive service. Make sure to build Hudi using Java 8; if you are on ARM, use something like Azul OpenJDK and export $JAVA_HOME as the provided path, e.g., /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home. Once you upload your jars to S3, bootstrap them onto the cluster and point your spark-submit command at those jar paths. Note that this overwrites your extraClassPath, so append the custom jar entries to both the driver and executor configurations.
While I can kick off backfills, they eventually fail alongside streams, per #5053.
This is fixed in
why is it required to set these?
You can try without it.
On the Spark config side: any dependencies off the default Spark classpath require setting the extra class path in two places, so that both the driver and executor containers can see them. You can use glob patterns to make this more concise if there are no conflicts. My comments on this thread are based on the EMR environment, so the long class path mentioned earlier is also required if you are using the Glue metastore.
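As a concrete sketch of the two-place rule above (all paths here are hypothetical; the existing classpath is whatever the Spark UI shows for your cluster):

```python
# Prepend the custom jar location to BOTH the driver and executor class paths,
# keeping the cluster's existing entries intact. A glob keeps the value concise.
existing_classpath = "/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"  # example only
custom_jars = "/usr/lib/hudi-custom/*"  # hypothetical location for the S3-bootstrapped jars

spark_confs = {
    "spark.driver.extraClassPath": f"{custom_jars}:{existing_classpath}",
    "spark.executor.extraClassPath": f"{custom_jars}:{existing_classpath}",
}
```

Setting only one of the two keys is a common pitfall: the job may start fine on the driver and then fail with the same ClassNotFoundException once tasks hit the executors.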
Thank you very much for your comment. I am on an Amazon EMR on EKS environment as well. What I am curious about is the following. Thank you
Thank you for your response. I've been using it already; that's why I am curious whether setting "spark.driver.extraClassPath" or "spark.executor.extraClassPath" is required along with it. Thank you
EMR on EKS gave me issues and I switched to EMR on EC2 about a year ago; I probably needed to do the same thing done here. I'm planning to use something like Kubeflow in the future.

When you spin up a job and look at the environment variables in the Spark UI, you can Ctrl+F for 'classpath' and see what the key-value options are. If the location of the jar is not on the classpath, it must be specified in addition to the initial classpath shown in that variable. In this case I was first using --packages to get the jars directly from Maven, then I tried --jars and ran into version issues, hence the custom build above. That led to using the custom jars from S3. A more robust bootstrap script might have mitigated the issue. I also wanted to make sure the jar selected for Hudi was not the preinstalled version in this particular case.
Thank you very much for the detailed explanation. |
Describe the problem you faced
When doing an incremental CDC query ('hoodie.datasource.query.incremental.format': "cdc"), the job crashes with the error in the title. This only happens in 0.14.1, not in 0.13.1, for the same dataset.
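For reference, a minimal sketch of the read options that exercise this code path. The begin instant and table path are placeholders, not values from the original attachment:

```python
# Options for an incremental CDC query against a Hudi table. The "cdc"
# incremental format is what triggers the NoClassDefFoundError in 0.14.1.
cdc_read_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.query.incremental.format": "cdc",
    "hoodie.datasource.read.begin.instanttime": "20240101000000000",  # placeholder instant
}

# With a live SparkSession this would be used as:
# df = spark.read.format("hudi").options(**cdc_read_options).load("s3://bucket/table")  # hypothetical path
```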
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Should return the same DataFrame and not crash in 0.14.1.
Environment Description
Hudi version : 0.13.1 and 0.14.1
Spark version : 3.3
Hive version : N/A
Hadoop version : 3.3.2
Storage (HDFS/S3/GCS..) : S3 and Local MacOS File System
Running on Docker? (yes/no) : No
Additional context
This originally came from us trying to run Hudi in AWS Glue. We were running Hudi 0.13.1 on Glue 4.0 without any problems, but a new QA feature to test that deletes are captured was deployed, and it started causing "There should be a CDC Log File" errors. By reading through the repo we found this was a known bug, reported here: https://github.com/apache/hudi/issues/9987
It was fixed in 0.14.1. Since Glue 4.0 ships with Hudi 0.13.1 and Spark 3.3, we went ahead and downloaded and manually ran the hudi-spark3.3-bundle for 0.14.1, and that's when we got the mentioned missing-class error. To start narrowing down the problem, I attempted to recreate it locally with a smaller test sample using the Hudi quickstart utils, and was able to reproduce it exactly.
I've attached the exact code that was run.
Some Notes:
As long as you have Spark 3.3 installed, you should be able to reproduce/re-run this with no issues.
Just replace org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 with org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1, and vice versa.
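The version swap above can be expressed as a tiny helper for building the --packages coordinate (the helper name is ours, not part of the original repro code):

```python
def hudi_bundle(version: str) -> str:
    """Build the --packages coordinate for the Spark 3.3 / Scala 2.12 Hudi bundle."""
    return f"org.apache.hudi:hudi-spark3.3-bundle_2.12:{version}"

# Working version vs. the one that crashes on incremental CDC reads:
good = hudi_bundle("0.13.1")
bad = hudi_bundle("0.14.1")
```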
Stacktrace