[HUDI-6638] Upgrade AWS Java SDK to V2 #9347
Conversation
Manually tested the s3a path using an EMR cluster:

```bash
spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars /usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar
```

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.HiveSyncConfig
import org.apache.hudi.sync.common.HoodieSyncConfig

// Create a DataFrame
var tableName = "mansi_s3a_hudi_test"
var tablePath = "s3a://<myBucket>/tables/" + tableName

val inputDF = Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
  ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
  ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
  ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")

// Specify common DataSourceWriteOptions in the single hudiOptions variable
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> tableName,
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)

// Write the DataFrame as a Hudi dataset
(inputDF.write
  .format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(tablePath))
```
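For completeness, reading the dataset back in the same spark-shell session verifies the write. A minimal sketch, assuming the `spark` session and the `tablePath` variable from above:

```scala
// Read the Hudi dataset back from the same s3a path.
val readDF = spark.read
  .format("org.apache.hudi")
  .load(tablePath)

readDF.select("id", "creation_date", "last_update_time").show(false)
```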
```xml
<aws.sdk.httpclient.version>4.5.13</aws.sdk.httpclient.version>
<aws.sdk.httpcore.version>4.4.13</aws.sdk.httpcore.version>
```
Any reason we pick different versions here?
```java
if (CollectionUtils.nonEmpty(response.errors())) {
  if (response.errors().stream()
      .allMatch(
          (error) -> "AlreadyExistsException".equals(error.errorDetail().errorCode()))) {
```
I assume the error name is the same here and this should still work. Is the error case tested?
We haven't changed anything here other than making it compatible with SDK v2, and we haven't run a test specifically for this error case.
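For reference, a minimal sketch (not part of this PR, and using the unshaded SDK v2 coordinates) of how the predicate above behaves against a hand-built Glue response that simulates every failed partition already existing:

```scala
import software.amazon.awssdk.services.glue.model.{BatchCreatePartitionResponse, ErrorDetail, PartitionError}

// Simulate Glue reporting that each failed partition already exists.
val response = BatchCreatePartitionResponse.builder()
  .errors(
    PartitionError.builder()
      .errorDetail(ErrorDetail.builder().errorCode("AlreadyExistsException").build())
      .build())
  .build()

// Same predicate as in the diff above: the batch is treated as successful
// only if every reported error is an AlreadyExistsException.
val onlyAlreadyExists = response.errors().stream()
  .allMatch(error => "AlreadyExistsException".equals(error.errorDetail().errorCode()))
assert(onlyAlreadyExists)
```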
```java
? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
: DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString();

if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) {
```
Could the endpoint URL start without the HTTP prefix?
We get this error if we don't specify `http://` or `https://` with SDK v2. Per https://docs.aws.amazon.com/general/latest/gr/ddb.html, DynamoDB supports both HTTP and HTTPS, so we added a check for both prefixes:
```
Caused by: java.lang.NullPointerException: The URI scheme of endpointOverride must not be null.
  at org.apache.hudi.software.amazon.awssdk.utils.Validate.paramNotNull(Validate.java:156)
  at org.apache.hudi.software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.endpointOverride(SdkDefaultClientBuilder.java:445)
  at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.getDynamoDBClient(DynamoDBBasedLockProvider.java:163)
  at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:87)
  at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:77)
  ... 65 more
```
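In SDK v2, `endpointOverride` takes a `java.net.URI` and requires an explicit scheme, which is what the NPE above reports. A minimal sketch of the scheme check (note it rejects a URL only when both prefixes are missing, hence the `&&`), with a hypothetical endpoint value standing in for the configured one and using the unshaded SDK v2 coordinates:

```scala
import java.net.URI
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.dynamodb.DynamoDbClient

// Hypothetical endpoint; in the PR this comes from DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL.
val endpointURL = "https://dynamodb.us-east-1.amazonaws.com"

// SDK v2 throws "The URI scheme of endpointOverride must not be null"
// when the endpoint has no scheme, so require http:// or https:// up front.
if (!endpointURL.startsWith("https://") && !endpointURL.startsWith("http://")) {
  throw new IllegalArgumentException(s"DynamoDB endpoint must include http:// or https://: $endpointURL")
}

val dynamoDb = DynamoDbClient.builder()
  .region(Region.US_EAST_1)
  .endpointOverride(URI.create(endpointURL))
  .build()
```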
```xml
<!-- AWS Services -->
<!-- https://mvnrepository.com/artifact/software.amazon.awssdk/aws-java-sdk-sqs -->
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>sqs</artifactId>
  <version>${aws.sdk.version}</version>
</dependency>
```
Same question as before (https://github.com/apache/hudi/pull/8441/files#r1164644285): should this be moved to the hudi-aws module? The hudi-utilities module already relies on hudi-aws.
We have added these two dependencies back to the hudi-utilities pom because of the TestS3EventsSource failure `java.lang.NoClassDefFoundError: org/apache/http/impl/client/DefaultClientConnectionReuseStrategy`. We still need to identify why these dependencies have to be declared directly in hudi-utilities even though hudi-aws already pulls them into hudi-utilities:
```xml
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>${aws.sdk.httpclient.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpcore</artifactId>
  <version>${aws.sdk.httpcore.version}</version>
</dependency>
```
@mansipp could you also check the Azure CI failure?
Force-pushed from 5c13915 to c248258.
@hudi-bot run azure
The Azure CI failure is due to a flaky test:

Change Logs
Upgrade AWS Java SDK to v2.
Impact
None
Risk level (write none, low, medium or high below)
None
Documentation Update
None
Contributor's checklist