
[SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3? #2586

@Rap70r

Description


Hello,

We have a setup where we process data incrementally against large Hudi tables in S3, using Hudi and Spark. When a separate Spark process reads a large table, or when long-running queries are executed against the resulting DataFrame, the reading process crashes if another process concurrently updates that table incrementally. I assume this is because the underlying Parquet files are modified while the DataFrame is still being queried.
How can we isolate the table during reads and queries against that DataFrame in Spark so they are not affected by writers?

  • Sample Code
import org.apache.spark.sql.SparkSession
import org.apache.hudi._

val ss = SparkSession.builder().getOrCreate()

// QUERY_TYPE_OPT_KEY and QUERY_TYPE_SNAPSHOT_OPT_VAL are vals in Hudi 0.7.0,
// so they are referenced without parentheses.
val df = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://path/to/hudi/table/*")

df.createOrReplaceTempView("hudi_table")
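One way to decouple a long-running read from concurrent writers, assuming the table's timeline still retains the relevant commits, is to pin the query to a fixed commit instant (a point-in-time read using Hudi's incremental query type instead of the snapshot type). A minimal sketch, using the string forms of the `DataSourceReadOptions` keys from Hudi 0.7.0; the end instant below is a placeholder, not a value from this report:

```scala
// Hypothetical point-in-time read: fix the file set the query sees by bounding
// it to a known completed commit, so concurrent writers cannot change it
// underneath a running query. In practice the end instant would be taken from
// the table's timeline (e.g. the latest completed commit before the read).
val pointInTimeOpts = Map(
  "hoodie.datasource.query.type"             -> "incremental",
  "hoodie.datasource.read.begin.instanttime" -> "000",           // from the start of the timeline
  "hoodie.datasource.read.end.instanttime"   -> "20210218090000" // placeholder commit instant
)

// The read itself would then be (requires a SparkSession `ss` and Hudi on the classpath):
// val df = ss.read.format("org.apache.hudi").options(pointInTimeOpts).load("s3://path/to/hudi/table/*")

println(pointInTimeOpts("hoodie.datasource.query.type"))
```

The trade-off is that the reader sees data only up to the chosen instant, and the writer's cleaner must retain file versions long enough for the pinned commit to remain readable.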

While queries are running against 'hudi_table', if any process updates the table under that S3 path, the query crashes.
How can we guarantee snapshot isolation for readers so they are not affected by writers?
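A writer-side knob that is commonly tuned for this situation (an assumption on my part, not something stated in this report) is the cleaner's retention: keeping more commits' worth of file versions gives in-flight readers a longer window before the files they are scanning can be deleted. A sketch of the relevant Hudi write options:

```scala
// Assumed writer-side retention settings: the cleaner keeps file versions for
// the latest N commits, so a reader whose query started within that window
// does not see its Parquet files removed mid-query. The value 20 is a
// placeholder; it should be sized to the longest expected reader runtime.
val writerRetentionOpts = Map(
  "hoodie.cleaner.policy"           -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained" -> "20"
)

// These would be passed alongside the usual write options, e.g.:
// df.write.format("org.apache.hudi").options(writerRetentionOpts)...

println(writerRetentionOpts("hoodie.cleaner.commits.retained"))
```

This does not provide true isolation by itself, but it widens the window in which a concurrent reader can finish without hitting deleted files.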

Environment Description

  • Hudi version: 0.7.0
  • Spark version: 3.0.1
  • Hadoop version: 3.2.1
  • Storage: S3
  • Running on Docker: No

Thank you

Metadata

Labels

priority:high (Significant impact; potential bugs)
