
[SUPPORT] Spark Fails to Process 300Gb Of Data #2003

@rubenssoto

Description


Hi Guys,

I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors.

Data size: 350 GB
Spark master: 4 CPUs, 16 GB RAM
Core nodes: 8 × r5.4xlarge = 16 vCPUs, 122 GB RAM each

My spark-submit command:

spark-submit --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memory=33g \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
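For context, a rough executor-sizing sanity check for these settings (a minimal Python sketch; it assumes YARN packs executors purely by vCPUs and memory, using only the numbers above):

# Rough executor-packing check for the spark-submit settings above.
# Assumes YARN places executors purely by vCPUs and memory (numbers from this issue).
node_vcpus, node_mem_gb = 16, 122               # r5.4xlarge core node
exec_cores, exec_mem_gb = 5, 33 + 2             # executor.memory (33g) + ~2 GB memoryOverhead
num_nodes = 8

execs_by_cpu = node_vcpus // exec_cores         # 3 executors fit by cores
execs_by_mem = int(node_mem_gb // exec_mem_gb)  # 3 executors fit by memory
execs_per_node = min(execs_by_cpu, execs_by_mem)

print(execs_per_node * num_nodes)               # roughly 24 executors across the 8 core nodes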

My Hudi options:

{
  "hoodie.datasource.write.recordkey.field": "id",
  "hoodie.table.name": "stockout",
  "hoodie.datasource.write.table.name": "stockout",
  "hoodie.datasource.write.operation": "bulk_insert",
  "hoodie.datasource.write.partitionpath.field": "created_date_brt",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.combine.before.insert": "true",
  "hoodie.combine.before.upsert": "false",
  "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.parquet.small.file.limit": 996147200,
  "hoodie.parquet.max.file.size": 1073741824,
  "hoodie.parquet.block.size": 1073741824,
  "hoodie.copyonwrite.record.size.estimate": 512,
  "hoodie.cleaner.commits.retained": 10,
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.database": "datalake_raw",
  "hoodie.datasource.hive_sync.table": "stockout",
  "hoodie.datasource.hive_sync.partition_fields": "created_date_brt",
  "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://ip-10-0-21-127.us-west-2.compute.internal:10000",
  "hoodie.insert.shuffle.parallelism": 1500,
  "hoodie.bulkinsert.shuffle.parallelism": 700,
  "hoodie.upsert.shuffle.parallelism": 1500
}
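For completeness, this is roughly how I pass these options to the writer (a minimal sketch; `df`, the save mode, and the target path are placeholders rather than the exact job code):

# Minimal sketch of the Hudi datasource write using the options above.
# `df` is the source DataFrame; the output path is a placeholder.
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.table.name": "stockout",
    "hoodie.datasource.write.operation": "bulk_insert",
    # ... remaining options exactly as listed above ...
}

(df.write
   .format("hudi")
   .options(**hudi_options)
   .mode("overwrite")                               # assumed for the initial migration
   .save("s3://my-bucket/datalake_raw/stockout"))   # placeholder path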

[Screenshot 2020-08-20 at 16:15:10]

[Screenshot 2020-08-20 at 16:14:38]

[Screenshot 2020-08-20 at 16:14:10]

[Screenshot 2020-08-20 at 16:13:46]

I tried a bulk_insert parallelism of 4000, but it didn't work. I really don't know what to do...

Thank you.
