
[SUPPORT] Slow S3 file listing degrades Hudi read performance #1829

Closed
zuyanton opened this issue Jul 14, 2020 · 15 comments

Comments

@zuyanton
Hudi MoR read performance gets slower on tables with many (1000+) partitions stored in S3. When running a simple spark.sql("select * from table_ro").count command, we observe in the Spark UI that no Spark jobs are scheduled for the first 2.5 minutes, and the load on the cluster during that period is almost nonexistent.
(Spark UI screenshot: select * on the _ro table)

Looking into the logs to figure out what is going on during that period, we see that for the first two and a half minutes Hudi is busy running HoodieParquetInputFormat.listStatus (code link). I placed timer log lines around various parts of that function and was able to narrow it down to this line:

FileStatus[] fileStatuses = super.listStatus(job);
Executing this line takes over 2/3 of the total time.
If I understand correctly, this line lists all files in a single partition.
This "overhead" appears to depend linearly on the number of partitions: increasing the number of partitions to 2000 almost doubles the overhead, and the cluster just runs HoodieParquetInputFormat.listStatus before starting to execute any Spark jobs.

To Reproduce
see the code snippet below

  • Hudi version : master branch

  • Spark version : 2.4.4

  • Hive version : 2.3.6

  • Hadoop version : 2.8.5

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

    import org.apache.spark.sql.functions._
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.hudi.QuickstartUtils._
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.SaveMode
    import org.apache.hudi.DataSourceReadOptions._
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.keygen.ComplexKeyGenerator
    import org.apache.hadoop.hive.conf.HiveConf
    val hiveConf = new HiveConf()
    val hiveMetastoreURI = hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
    val hiveServer2URI = hiveMetastoreURI.substring(0, hiveMetastoreURI.lastIndexOf(":"))
    var hudiOptions = Map[String,String](
      HoodieWriteConfig.TABLE_NAME → "testTable1",
      "hoodie.consistency.check.enabled"->"true",
      "hoodie.compact.inline.max.delta.commits"->"100",
      "hoodie.compact.inline"->"true",
      DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ",
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY ->"partition",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
      DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY → "true",
      DataSourceWriteOptions.HIVE_TABLE_OPT_KEY → "testTable1",
      DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY → "partition",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY → classOf[MultiPartKeysValueExtractor].getName,
      DataSourceWriteOptions.HIVE_URL_OPT_KEY ->s"jdbc:hive2://$hiveServer2URI:10000"
    )

    spark.sql("drop table if exists testTable1_ro")
    spark.sql("drop table if exists testTable1_rt")
    var seq = Seq((1, 2, 3))
    for (i<- 2 to 1000) {
      seq = seq :+ (i, i , 1)
    }
    var df = seq.toDF("pk", "partition", "sort_key")
    //create table
    df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
    //update table couple times
    df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
    df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
    
    //read table
    spark.sql("select * from testTable1_ro").count
@vinothchandar
Member

vinothchandar commented Jul 15, 2020

@zuyanton this seems like a general issue with FileInputFormat

 int numThreads = job
        .getInt(
            org.apache.hadoop.mapreduce.lib.input.FileInputFormat.LIST_STATUS_NUM_THREADS,
            org.apache.hadoop.mapreduce.lib.input.FileInputFormat.DEFAULT_LIST_STATUS_NUM_THREADS);

can you try adding spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=8 or something to the SparkConf and see if it helps? (the default inside Hadoop is 1)
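
A minimal spark-shell sketch of that suggestion (the conf key is the one mentioned above; the table name is reused from the repro and is otherwise illustrative):

    // Bump FileInputFormat's listing parallelism via the Hadoop conf that Spark forwards.
    spark.sparkContext.hadoopConfiguration.setInt(
      "mapreduce.input.fileinputformat.list-status.num-threads", 8)

    // Equivalently, at launch time:
    //   --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=8
    spark.sql("select * from testTable1_ro").count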

cc @n3nash IIRC you mentioned a similar approach done at Uber?

@zuyanton
Author

@vinothchandar it didn't have any effect, and I believe it shouldn't, since from what I can tell that parameter only helps if you are listing the statuses of multiple dirs (https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L216), whereas in our case it is always one dir: the root location of a single partition.

@umehrot2
Contributor

I think the finding by @zuyanton is correct. Increasing the num-threads will not help because we just set the base path of the table as the input path of the JobConf. I believe we would get a good speedup if, instead of the basePath, we set all the partition paths as the input paths of the JobConf and then increased the num-threads.

Another thing we can potentially explore is using Spark to perform this listing in parallel on the cluster. But this seems like something we should target for the 0.6.0 release with Blocker priority.
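
An illustrative sketch of the first idea (this is not Hudi code; partitionPaths is a hypothetical list of the table's partition directories):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

    // Hand FileInputFormat every partition directory instead of just the table base path,
    // so that list-status.num-threads can actually fan out across directories.
    def setPartitionInputPaths(jobConf: JobConf, partitionPaths: Seq[String]): Unit = {
      FileInputFormat.setInputPaths(jobConf, partitionPaths.map(new Path(_)): _*)
      jobConf.setInt("mapreduce.input.fileinputformat.list-status.num-threads", 8)
    }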

@bvaradar
Contributor

@zuyanton : This sounds like a general Spark/HMS query integration issue. Are we seeing similar behavior when running the same query over a non-Hudi table?

@zuyanton
Author

@bvaradar we don't see a similar issue with regular non-Hudi tables saved to S3 in Parquet format. For regular tables the "overhead" is the same and under one minute regardless of the number of partitions: regular tables with 20k partitions as well as 100 partitions take the same time to "load" before Spark starts running its jobs, whereas a Hudi table on S3 becomes slow with 5k+ partitions. That said, we use EMR 5.28, which comes with the EMRFS S3-optimized committer enabled in Spark by default, so I assumed whatever bottlenecks S3 has are addressed by the committer.

@bvaradar
Contributor

Thanks @zuyanton for the updates. IIUC, the S3-optimized committer was for optimizing writes by reducing the renames done. I might be wrong, but I am generally curious about EMR optimizations for Spark. @umehrot2 : we can look at the option you mentioned about setting the partition paths and then increasing the num-threads. Is this one of the optimizations done internally within EMR Spark?

@umehrot2
Contributor

umehrot2 commented Jul 18, 2020

@zuyanton In your test with regular Parquet tables you are probably not setting the following property in the Spark config: spark.sql.hive.convertMetastoreParquet=false. Only when you set this property to `false` will Spark use the Parquet InputFormat as well as its listing code. Otherwise, by default Spark uses its native listing (parallelized over the cluster) and Parquet readers, which are supposed to be faster.

However, the way Hudi works is that it uses an InputFormat implementation. Thus, for a fair comparison, when you test regular Parquet with Spark you should set spark.sql.hive.convertMetastoreParquet=false, and I think you will then observe behavior quite similar to what you are seeing. Would you mind trying that out once?
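
A minimal spark-shell sketch of that fair-comparison setup (the conf key is the one above; the table name is hypothetical):

    // Force Spark SQL to read the Hive-registered Parquet table through the Hive
    // InputFormat path instead of Spark's native Parquet reader and listing.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
    spark.sql("select * from regular_parquet_table").count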

But @bvaradar, irrespective of that, I think for Hudi we should always compare our performance against standard Spark performance (native listing and reading) and not the performance of Spark when it is made to go through an InputFormat. So we need to get this fixed either way if we want to be comparable to Spark Parquet performance, which uses parallelized listing over the cluster.

@umehrot2
Contributor

@bvaradar @zuyanton The EMR S3-optimized committer only helps avoid renames. Again, that does not come into effect for Hudi because of the way the Hudi datasource is implemented: the Hudi datasource is not an extension of Spark's FileFormat datasource. It has its own commit mechanism and writing logic and does not use Spark's commit/write process. So the EMR-optimized committer unfortunately does not come into effect for Hudi workloads.

Irrespective of that, the committer would not have any effect on this listing performance.

@zuyanton
Author

@umehrot2 you are right: with convertMetastoreParquet set to false, when querying a regular Parquet table with 20k partitions I see similar behavior, with Spark not running any jobs for the first 4 minutes.

@rubenssoto

rubenssoto commented Jan 25, 2021

@umehrot2 @bvaradar

Do you know if this problem will be solved in 0.7.0? I'm querying some big datasets with more than 500 partitions and I had the same problem.

2 Minutes doing nothing.
(Spark UI screenshot)

Thank you

@vinothchandar
Member

@rubenssoto for some code paths, it will be. If you turn on hoodie.metadata.enable=true on the write side, you should see improvements. Hive queries should see an improvement; Spark SQL with --conf spark.sql.hive.convertMetastoreParquet=false and --conf "spark.hadoop.hoodie.metadata.enable=true" should see an improvement. The Spark datasource path will see modest gains for now, with full integration coming in 0.8.0. Will include it in the release highlights.
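
A minimal sketch of setting those two confs when building the session instead of passing --conf flags (the app name is illustrative; the conf keys and values are the ones quoted above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hudi-ro-query")                                   // illustrative name
      .config("spark.sql.hive.convertMetastoreParquet", "false")  // route Hive-synced tables through Hudi's InputFormat
      .config("spark.hadoop.hoodie.metadata.enable", "true")      // let readers use the metadata table for file listings
      .enableHiveSupport()
      .getOrCreate()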

@rubenssoto

rubenssoto commented Jan 25, 2021

@vinothchandar

Thank you so much for your answer.
When do you plan to release this version? I will try to make some workarounds until then.

Is this configuration right?

{ "conf": {
            "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.jars": "s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
            "spark.sql.hive.convertMetastoreParquet": "false",
            "spark.hadoop.hoodie.metadata.enable": "true"}
}

I made these 2 queries:

spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()

%%sql
select count('*') from raw_courier_api.order_test

On the PySpark query, Spark creates a job with 143 tasks; after 10 seconds of listing, the count was fast. But on the Spark SQL query, Spark creates a job with 2000 tasks and it was very slow. Is it a Hudi or a Spark issue?

SPARK SQL
(Spark UI screenshot)

PYSPARK
(Spark UI screenshot)

Another problem I ran into: my table has 36 million rows, but with that config it shows only 4 million.
Thank you so much!

@vinothchandar
Member

0.7.0 is being voted on right now. Hopefully today.

The spark.read.format('hudi') route (Spark datasource path) does not go through Hive, so those configs may not help at all. Between PySpark and the Spark datasource in Scala, there should be no difference, so not sure what's going on :/

@vinothchandar
Member

0.7.0 is out!

@n3nash
Contributor

n3nash commented Jun 16, 2021

With 0.7.0, one can set hoodie.metadata.enable to true to eliminate issues due to file listings. Closing this ticket now.
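
For reference, a minimal sketch of enabling it on the write side, reusing the options map and path from the repro above (a sketch under those assumptions, not the exact commands from the thread):

    // Enable the internal metadata table (Hudi 0.7.0+) so readers can obtain file
    // listings from it instead of listing S3 directly.
    df.write.format("org.apache.hudi")
      .options(hudiOptions)
      .option("hoodie.metadata.enable", "true")
      .mode(SaveMode.Append)
      .save("s3://testBucket/test/hudi/zuyanton/1/testTable1")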

@n3nash n3nash closed this as completed Jun 16, 2021
GI Tracker Board automation moved this from Blocked On User to Done Jun 16, 2021
@n3nash n3nash added the archive label Jun 16, 2021