
[SUPPORT] Apache Hudi 0.14.0 with AWS Glue (to use Custom Jar ) #10358

Closed
soumilshah1995 opened this issue Dec 19, 2023 · 13 comments
Labels: aws-support, on-call-triaged, priority:minor

Comments

@soumilshah1995

soumilshah1995 commented Dec 19, 2023

Hudi 0.14 on AWS Glue

Overview

This project aims to run Hudi 0.14 on AWS Glue, leveraging Glue 4. The default Glue 4 setup supports Hudi, but with an older version; the objective is to use Hudi 0.14 on Glue 4 by supplying custom JARs.

Prerequisites

  • AWS Glue environment
  • Hudi 0.14
  • Glue 4

Setup Instructions

  1. Download the required Hudi and Avro JARs and upload them to S3:
s3://jXX/jar/hudi-spark3-bundle_2.12-0.14.0.jar
s3://jXX/jar/spark-avro_2.12-3.4.0.jar
s3://jXX/jar/spark-avro_2.12-3.5.0.jar
s3://jXX/jar/hudi-spark3-bundle_2.12-1.0.0-beta1.jar
  2. Set up the Glue job configuration (a rough boto3 equivalent is sketched after this list).
  • Dependent JARs path: s3://XXX/jar/hudi-spark3-bundle_2.12-0.14.0.jar,s3://XXX/jar/spark-avro_2.12-3.5.0.jar

  • Job parameters:

    • --extra-jars: s3://jXXX/jar/hudi-spark3-bundle_2.12-0.14.0.jar,s3://jt-dev-datateam/jar/spark-avro_2.12-3.5.0.jar
    • --additional-python-modules: faker==11.3.0
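
For reference, here is a rough boto3 sketch of the same job configuration. The job name, role ARN, and script location are placeholders (not from this thread); the JAR paths mirror the ones listed above.

import boto3

glue = boto3.client("glue")

# Hypothetical job definition mirroring the Glue console configuration described above.
glue.create_job(
    Name="hudi-0-14-custom-jar-demo",                        # placeholder job name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",      # placeholder IAM role
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://XXX/scripts/hudi_demo.py",    # placeholder script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Custom Hudi/Avro bundles instead of the Hudi version bundled with Glue
        "--extra-jars": "s3://XXX/jar/hudi-spark3-bundle_2.12-0.14.0.jar,"
                        "s3://XXX/jar/spark-avro_2.12-3.5.0.jar",
        "--additional-python-modules": "faker==11.3.0",
    },
)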

(Screenshot: Glue job configuration, 2023-12-18)

Test code

try:
    import os, sys, random, uuid
    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
    from pyspark.sql.functions import *
    from awsglue.utils import getResolvedOptions
    from pyspark.sql.types import *
    from datetime import datetime, date
    from functools import reduce
    from pyspark.sql import Row
    from faker import Faker
except Exception as e:
    print("Modules are missing : {} ".format(e))

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

args = getResolvedOptions(sys.argv, options=[])

global faker
faker = Faker()


class DataGenerator(object):

    @staticmethod
    def get_data():
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(10)
        ]



spark = (SparkSession.builder.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
         .config('spark.sql.hive.convertMetastoreParquet', 'false').getOrCreate())

sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
logger = glueContext.get_logger()


table_name = "employees"

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.partitionpath.field': 'state',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
}

# ====================================================
"""Create Spark Data Frame """
# ====================================================
bucket = "jXXX"


data = DataGenerator.get_data()
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
df = spark.createDataFrame(data=data, schema=columns)
print(df.show())

try:
    print("Upserting Hudi ")
    hudi_path = f"s3://{bucket}/tm1/{table_name}"
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(hudi_path)
except Exception as e:
    print("Failed 1")
    print(e)


try:
    print("Upserting Hudi ")
    hudi_path = f"s3a://{bucket}/tm2/{table_name}"
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(hudi_path)
except Exception as e:
    print("Failed 1")
    print(e)
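
(Optional) A quick read-back check can confirm whether a write actually landed. This is a generic Hudi snapshot read and assumes the same tm1 path used in the first write above.

try:
    print("Reading Hudi table back for verification")
    read_df = spark.read.format("hudi").load(f"s3://{bucket}/tm1/{table_name}")
    read_df.select("emp_id", "employee_name", "state", "ts").show()
except Exception as e:
    print("Read back failed")
    print(e)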


Error


An error occurred while calling o128.save.
: org.apache.hudi.internal.schema.HoodieSchemaException: Failed to convert struct type to avro schema: StructType(StructField(emp_id,LongType,true), StructField(employee_name,StringType,true), StructField(department,StringType,true), StructField(state,StringType,true), StructField(salary,LongType,true), StructField(age,LongType,true), StructField(bonus,LongType,true), StructField(ts,LongType,true))
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:147)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:349)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:196)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_1Adapter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter$lzycompute(SparkAdapterSupport.scala:49)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter(SparkAdapterSupport.scala:35)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter(SparkAdapterSupport.scala:29)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter$(SparkAdapterSupport.scala:29)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter$lzycompute(HoodieSparkUtils.scala:66)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter(HoodieSparkUtils.scala:66)
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:143)
	... 42 more

@ad1happy2go
Collaborator

@soumilshah1995 You are downloading the wrong Hudi bundle jar. Glue 4 runs Spark 3.3, so you should download the one from here:
https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.3-bundle_2.12/0.14.0
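
For anyone following along, the corrected --extra-jars value would look roughly like this (bucket path kept as the placeholder from the original post; the spark-avro jar is assumed to be the Spark 3.3 match, e.g. spark-avro_2.12-3.3.0, rather than 3.5.0):

--extra-jars: s3://XXX/jar/hudi-spark3.3-bundle_2.12-0.14.0.jar,s3://XXX/jar/spark-avro_2.12-3.3.0.jar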

@codope added the aws-support, priority:minor, and on-call-triaged labels on Dec 19, 2023
@soumilshah1995
Author

I will try this out

@soumilshah1995
Author

soumilshah1995 commented Dec 19, 2023

(Screenshot: updated Glue job configuration, 2023-12-19)

I'm still encountering the same issue. I downloaded the JAR files and uploaded them to S3 as instructed in the ticket. I also utilized the Glue job mentioned earlier, but the error persists.

: org.apache.hudi.internal.schema.HoodieSchemaException: Failed to convert struct type to avro schema: StructType(StructField(emp_id,LongType,true), StructField(employee_name,StringType,true), StructField(department,StringType,true), StructField(state,StringType,true), StructField(salary,LongType,true), StructField(age,LongType,true), StructField(bonus,LongType,true), StructField(ts,LongType,true))
at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:147)
at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:276)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apac

@ad1happy2go
Collaborator

Can you paste the entire stack trace you are getting now? Last time it was clearly saying that the Spark 3_1 adapter class was not found, since we were using the spark3 bundle jar, which targets Spark 3.1, while Glue 4 has Spark 3.3.
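
A quick way to confirm which Spark version the Glue job is actually running on (and therefore which Hudi bundle to pick) is to print it from inside the job:

# Glue 4.0 should report a 3.3.x version here, which points to the hudi-spark3.3 bundle
print("Spark version:", spark.version)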

@soumilshah1995
Author

Sure


	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.logging.log4j.util.LoaderUtil.loadClass(LoaderUtil.java:173)
	at org.apache.logging.log4j.util.LoaderUtil.newInstanceOf(LoaderUtil.java:211)
	at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:232)
	at org.apache.logging.log4j.core.util.Loader.newCheckedInstanceOf(Loader.java:301)
	at org.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:95)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:114)
	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:55)
	at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:430)
	at org.apache.logging.log4j.core.layout.PatternLayout.createDefaultLayout(PatternLayout.java:324)
	at org.apache.logging.log4j.core.appender.ConsoleAppender$Builder.<init>(ConsoleAppender.java:121)
	at org.apache.logging.log4j.core.appender.ConsoleAppender.newBuilder(ConsoleAppender.java:111)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.createBuilder(PluginBuilder.java:158)
	at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:119)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:813)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:753)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:745)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:389)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:169)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:181)
	at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:446)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:520)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:536)
	at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:214)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:146)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:597)
	at org.apache.spark.metrics.sink.MetricsConfigUtils.<clinit>(MetricsConfigUtils.java:12)
	at org.apache.spark.metrics.sink.MetricsProxyInfo.fromConfig(MetricsProxyInfo.java:17)
	at com.amazonaws.services.glue.cloudwatch.CloudWatchLogsAppenderCommon.<init>(CloudWatchLogsAppenderCommon.java:62)
	at com.amazonaws.services.glue.cloudwatch.CloudWatchLogsAppenderCommon$CloudWatchLogsAppenderCommonBuilder.build(CloudWatchLogsAppenderCommon.java:79)
	at com.amazonaws.services.glue.cloudwatch.CloudWatchAppender.activateOptions(CloudWatchAppender.java:73)
	at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
	at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
	at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
	at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
	at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
	at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
	at org.slf4j.impl.Log4jLoggerFactory.<init>(Log4jLoggerFactory.java:66)
	at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
	at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
	at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
	at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
	at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
	at org.apache.spark.network.util.JavaUtils.<clinit>(JavaUtils.java:41)
	at org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:67)
	at org.apache.spark.internal.config.ConfigBuilder.$anonfun$bytesConf$1(ConfigBuilder.scala:259)
	at org.apache.spark.internal.config.ConfigBuilder.$anonfun$bytesConf$1$adapted(ConfigBuilder.scala:259)
	at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$transform$1(ConfigBuilder.scala:101)
	at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:144)
	at org.apache.spark.internal.config.package$.<init>(package.scala:345)
	at org.apache.spark.internal.config.package$.<clinit>(package.scala)
	at org.apache.spark.SparkConf$.<init>(SparkConf.scala:654)
	at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
	at org.apache.spark.SparkConf.set(SparkConf.scala:94)
	at org.apache.spark.SparkConf.$anonfun$loadFromSystemProperties$3(SparkConf.scala:76)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:788)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:230)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:461)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:461)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:787)
	at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.getSparkConf(ProcessLauncher.scala:45)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.getSparkConf$(ProcessLauncher.scala:44)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.getSparkConf(ProcessLauncher.scala:200)
	at com.amazonaws.services.glue.ProcessLauncher.<init>(ProcessLauncher.scala:209)
	at com.amazonaws.services.glue.ProcessLauncher.<init>(ProcessLauncher.scala:201)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:33)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

2023-12-19 14:12:00,004 main INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
+------+------------------+----------+-----+------+---+-----+----------+
|emp_id|     employee_name|department|state|salary|age|bonus|        ts|
+------+------------------+----------+-----+------+---+-----+----------+
|     0| Shannon Frederick|        HR|   TX|105023| 40|49326| 458792232|
|     1|   Kelly Christian|        IT|   IL| 60827| 34|30899|1438189473|
|     2|    Nancy Thompson| Marketing|   RJ|112698| 45|15104| 366953525|
|     3|      Xavier Crane|     Sales|   RJ|146419| 32|95363| 902407760|
|     4|   Amanda Castillo|        HR|   NY| 60163| 29|51515| 473773911|
|     5|     Ashley Wright|        IT|   TX|141370| 32|44223| 615176188|
|     6|Stephanie Sandoval|     Sales|   CA|128303| 34|74014|   4721797|
|     7| Danielle Mccarthy|        IT|   CA| 22041| 49|36731|1405027762|
|     8|    Hannah Pearson| Marketing|   NY| 85896| 21|90073|1357045062|
|     9|     Michael Lynch|        HR|   NY| 38922| 21| 7710|1466663385|
+------+------------------+----------+-----+------+---+-----+----------+

None
Upserting Hudi 
Failed 1
An error occurred while calling o128.save.
: org.apache.hudi.internal.schema.HoodieSchemaException: Failed to convert struct type to avro schema: StructType(StructField(emp_id,LongType,true), StructField(employee_name,StringType,true), StructField(department,StringType,true), StructField(state,StringType,true), StructField(salary,LongType,true), StructField(age,LongType,true), StructField(bonus,LongType,true), StructField(ts,LongType,true))
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:147)
	at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:276)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_1Adapter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter$lzycompute(SparkAdapterSupport.scala:49)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter(SparkAdapterSupport.scala:35)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter(SparkAdapterSupport.scala:29)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter$(SparkAdapterSupport.scala:29)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter$lzycompute(HoodieSparkUtils.scala:66)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter(HoodieSparkUtils.scala:66)
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:143)
	... 41 more

Upserting Hudi 
Failed 1
An error occurred while calling o141.save.
: org.apache.hudi.internal.schema.HoodieSchemaException: Failed to convert struct type to avro schema: StructType(StructField(emp_id,LongType,true), StructField(employee_name,StringType,true), StructField(department,StringType,true), StructField(state,StringType,true), StructField(salary,LongType,true), StructField(age,LongType,true), StructField(bonus,LongType,true), StructField(ts,LongType,true))
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:147)
	at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:276)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_1Adapter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter$lzycompute(SparkAdapterSupport.scala:49)
	at org.apache.hudi.SparkAdapterSupport$.sparkAdapter(SparkAdapterSupport.scala:35)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter(SparkAdapterSupport.scala:29)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter$(SparkAdapterSupport.scala:29)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter$lzycompute(HoodieSparkUtils.scala:66)
	at org.apache.hudi.HoodieSparkUtils$.sparkAdapter(HoodieSparkUtils.scala:66)
	at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:143)
	... 41 more

@soumilshah1995
Author

I will be making a video for the community on using Hudi 0.14 on Glue :D
Cheers, happy holidays!

@ad1happy2go
Collaborator

Thanks @soumilshah1995

@soumilshah1995
Author

Anytime @ad1happy2go

You guys are real champs; I'm just trying to learn more from you all :D

@kamaagar

kamaagar commented Mar 21, 2024

Any pointers on how to provide these jars when provisioning with CloudFormation templates?
The CloudFormation template doesn't have an explicit option for "dependent jars", and I'm not sure whether the "--extra-jars" parameter can be used for the same purpose.
Also, the Glue docs state that when providing jars for a different Hudi version, "--datalake-formats: hudi" should not be provided (mentioning this in case it's helpful).

@soumilshah1995
Author

I use serverless; that makes it very easy.

@bibhu107
Contributor

@kamaagar I have been following up with AWS teams, and no one suggests bringing your own JAR. EMR 6+ versions ship with their own Hudi jar, so my suggestion is not to waste time bringing an extra jar. I am facing the same problem and checking how to fix it.
