Description
Tips before filing an issue
- Have you gone through our FAQs? Yes
Describe the problem you faced
How can I change the location of a Hudi table to a new location? I have a Customer table saved at s3://aws-amazon-com/Customer/ that I want to move to s3://aws-amazon-com/CustomerUpdated/ .
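For context, a hedged sketch of one common relocation approach (this is an assumption, not something stated in this issue; verify against your Hudi version): copy every object, including the .hoodie/ metadata directory, to the new prefix before touching the catalog. Printed here as a dry run; remove the `echo` to execute for real.

```shell
# Hypothetical sketch: relocate the table files first. `aws s3 sync` copies
# the data files and the .hoodie/ metadata directory. Paths are the ones
# from this issue.
OLD_PATH="s3://aws-amazon-com/Customer"
NEW_PATH="s3://aws-amazon-com/CustomerUpdated"

# Dry run: print the command instead of running it.
echo aws s3 sync "${OLD_PATH}/" "${NEW_PATH}/"
```

After the copy, the Glue catalog entry still points at the old path and needs to be updated separately.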
To Reproduce
val partitionColumnName: String = "year"
val hudiTableName: String = "Customer"
val preCombineKey: String = "id"
val recordKey = "id"
val tablePath = "s3://aws-amazon-com/Customer/"
val databaseName="consumer_bureau"
val hudiCommonOptions: Map[String, String] = Map(
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> "bulk_insert",
//"hoodie.datasource.write.operation" -> "upsert",
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
// "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
// "hoodie.upsert.shuffle.parallelism" -> "400",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
)
val df = Seq((1, "Mark", 1990), (2, "Martin", 2009)).toDF("id", "name", "year")
df.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablePath) // original table location
val tablelocationUpdated = "s3://eec-aws-uk-ukidcibatchanalytics-prod-hudi-replication/consumer_bureau/production/CustomerUpdated/"
df.write.format("org.apache.hudi") // writing to the new location
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablelocationUpdated)
I can change the location of the table manually in AWS Glue and then use the following code: I read the data from the new location and upsert one of its rows to the updated location.
val dfupdated=spark.read.format("hudi").load(tablelocationUpdated).limit(1)
val hudiCommonOptions: Map[String, String] = Map(
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> "upsert", // changed to upsert
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
// "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
// "hoodie.upsert.shuffle.parallelism" -> "400",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
)
dfupdated.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablelocationUpdated) // upsert the row to the updated location
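Once the files exist at the new prefix, the catalog entry has to be repointed. A hedged sketch of doing this with a Spark SQL DDL statement (an assumption on my part, not something confirmed in this issue; whether a given Hudi/Glue version honors SET LOCATION for Hudi tables is worth verifying). Printed as a dry run; remove the `echo` to run it.

```shell
# Hypothetical sketch: repoint the Glue Data Catalog entry at the new S3 path.
# Database and table names are the ones from this issue.
DATABASE="consumer_bureau"
TABLE="customer"
NEW_PATH="s3://aws-amazon-com/CustomerUpdated"

# Dry run: print the spark-sql invocation instead of running it.
echo spark-sql -e "ALTER TABLE ${DATABASE}.${TABLE} SET LOCATION '${NEW_PATH}/'"
```

An alternative route is the AWS CLI's `aws glue update-table`, which takes a full `--table-input` JSON with the new `StorageDescriptor.Location`.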
Expected behavior
The Customer table, currently stored at s3://aws-amazon-com/Customer/, should be readable and writable from its new location s3://aws-amazon-com/CustomerUpdated/ .
Environment Description
Glue 4, with hudi-spark3-bundle_2.12-0.12.1.jar, calcite-core-1.16.0.jar, and libfb303-0.9.3.jar.
- Hudi version : 0.12
- Spark version : Spark 3.3
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no