
[SUPPORT] Change the location of a hudi table in AWS ? #8922

@MathurCodes1

Description

Tips before filing an issue

  • Have you gone through our FAQs? Yes

Describe the problem you faced

How can we change the location of a Hudi table to a new location? I have a Customer table saved at s3://aws-amazon-com/Customer/ that I want to move to s3://aws-amazon-com/CustomerUpdated/ .
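One common alternative to rewriting the data is to copy the files and re-point the catalog. The sketch below is an assumption to verify against your own setup: it presumes the table files have already been copied to the new prefix (e.g. with `aws s3 sync`) and that Spark is backed by the Glue catalog; the database and table names are taken from the snippet further down.

```scala
// Sketch only: relocating the table by (1) copying the files and
// (2) re-pointing the Glue catalog entry at the new base path.
// Names mirror the snippet in this issue; verify against your catalog.
val databaseName = "consumer_bureau"
val hudiTableName = "Customer"
val newLocation = "s3://aws-amazon-com/CustomerUpdated/"

// Statement that re-points the catalog entry at the new base path.
val alterStmt =
  s"ALTER TABLE $databaseName.$hudiTableName SET LOCATION '$newLocation'"

// In a live Glue-backed Spark session this would be executed as:
//   spark.sql(alterStmt)
```

Whether `ALTER TABLE ... SET LOCATION` alone is sufficient for a Hudi table depends on the engine reading it, so reading the table back from the new location afterwards is a sensible sanity check.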

To Reproduce

import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode
import spark.implicits._

val partitionColumnName: String = "year"
val hudiTableName: String = "Customer"
val preCombineKey: String = "id"
val recordKey: String = "id"
val tablePath: String = "s3://aws-amazon-com/Customer/"
val databaseName: String = "consumer_bureau"
    val hudiCommonOptions: Map[String, String] = Map(
        "hoodie.table.name" -> hudiTableName,
        "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.precombine.field" -> preCombineKey,
        "hoodie.datasource.write.recordkey.field" -> recordKey,
        "hoodie.datasource.write.operation" -> "bulk_insert",
        //"hoodie.datasource.write.operation" -> "upsert",
        "hoodie.datasource.write.row.writer.enable" -> "true",
        "hoodie.datasource.write.reconcile.schema" -> "true",
        "hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
        "hoodie.datasource.write.hive_style_partitioning" -> "true",
        // "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
        //  "hoodie.upsert.shuffle.parallelism" -> "400",
        "hoodie.datasource.hive_sync.enable" -> "true",
        "hoodie.datasource.hive_sync.table" -> hudiTableName,
        "hoodie.datasource.hive_sync.database" -> databaseName,
        "hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
        "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc" -> "false",
        "hoodie.combine.before.upsert" -> "true",
        "hoodie.index.type" -> "BLOOM",
        "spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
        DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
      )
      
      
val df = Seq((1, "Mark", 1990), (2, "Martin", 2009)).toDF("id", "name", "year")
      
      
df.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablePath)
        
val tablelocationUpdated = "s3://eec-aws-uk-ukidcibatchanalytics-prod-hudi-replication/consumer_bureau/production/CustomerUpdated/"

// writing to the new location
df.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablelocationUpdated)

I can use the following code to change the location of the table manually in AWS Glue: I read the data from the new location and upsert one of its rows back to the updated location, which re-runs hive sync against the new base path.

val dfupdated=spark.read.format("hudi").load(tablelocationUpdated).limit(1)
   
val hudiCommonOptions: Map[String, String] = Map(
        "hoodie.table.name" -> hudiTableName,
        "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.precombine.field" -> preCombineKey,
        "hoodie.datasource.write.recordkey.field" -> recordKey,
      
        "hoodie.datasource.write.operation" -> "upsert",  // Chnged to upsert
        "hoodie.datasource.write.row.writer.enable" -> "true",
        "hoodie.datasource.write.reconcile.schema" -> "true",
        "hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
        "hoodie.datasource.write.hive_style_partitioning" -> "true",
        // "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
        //  "hoodie.upsert.shuffle.parallelism" -> "400",
        "hoodie.datasource.hive_sync.enable" -> "true",
        "hoodie.datasource.hive_sync.table" -> hudiTableName,
        "hoodie.datasource.hive_sync.database" -> databaseName,
        "hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
        "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc" -> "false",
        "hoodie.combine.before.upsert" -> "true",
        "hoodie.index.type" -> "BLOOM",
        "spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
        DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
      )
   
dfupdated.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablelocationUpdated)
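Before dropping the old prefix, it is worth checking that the relocated copy holds exactly the original records. A minimal sketch of that check, with hypothetical rows standing in for what `spark.read.format("hudi").load(...).collect()` would return in a live job:

```scala
// Hypothetical verification sketch: in practice the two sequences would be
// collected from the old and new base paths; row order may differ between
// the two reads, so compare as sets rather than element-by-element.
case class Customer(id: Int, name: String, year: Int)

val fromOldPath = Seq(Customer(1, "Mark", 1990), Customer(2, "Martin", 2009))
val fromNewPath = Seq(Customer(2, "Martin", 2009), Customer(1, "Mark", 1990))

// true when the relocated copy holds exactly the original records
val matches = fromOldPath.toSet == fromNewPath.toSet
```

For large tables, comparing counts (and perhaps a checksum of the record keys) is cheaper than collecting full rows.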


Expected behavior

The Customer table, currently saved at s3://aws-amazon-com/Customer/ , is available at s3://aws-amazon-com/CustomerUpdated/ , with the Glue catalog pointing at the new location.

Environment Description

Glue 4

hudi-spark3-bundle_2.12-0.12.1.jar
calcite-core-1.16.0.jar
libfb303-0.9.3.jar

  • Hudi version : 0.12

  • Spark version : Spark3.3

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Metadata

Assignees: No one assigned
Labels: area:aws (AWS ecosystem support)
Status: ✅ Done
Milestone: No milestone