Description
Hello,
We have a setup where we process data incrementally against large Hudi tables in S3, using Hudi and Spark. When reading a large table from a separate Spark process, or when running time-consuming queries against the resulting DataFrame, the reading process crashes if another process updates that table incrementally in the meantime. I assume this happens because the underlying Parquet files are modified while the DataFrame is still being queried.
How can we isolate the table when reading and performing queries against that dataframe in Spark without being affected by the writers?
- Sample Code
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Snapshot query: reads the latest committed state of the table.
// Note: the option keys/values are vals in DataSourceReadOptions, not methods,
// so they must be referenced without parentheses.
val df = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://path/to/hudi/table/*")

df.createOrReplaceTempView("hudi_table")
While running queries against 'hudi_table', if any other process updates the table at that S3 path, the query crashes.
How can we guarantee snapshot isolation when reading without being affected by writers?
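One approach that may help (a sketch, assuming Hudi 0.7.0's incremental-query API; the commit timestamp below is a placeholder you would look up from the table's timeline, not a real value) is to pin the read to a specific completed commit instant, so the query plan only references file versions belonging to that commit rather than whatever is latest:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Point-in-time read: an incremental query bounded to a single commit instant.
// "000" as the begin instant means "from the start of the timeline";
// the end instant pins the view to one completed commit.
// NOTE: "20210301120000" is a placeholder -- substitute an actual commit
// timestamp from the table's .hoodie timeline.
val pinned = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, "20210301120000")
  .load("s3://path/to/hudi/table") // incremental queries take the base path, no glob

pinned.createOrReplaceTempView("hudi_table_pinned")
```

Even with a pinned instant, the cleaner on the writer side can eventually delete the file slices the reader depends on, so it may also be necessary to raise hoodie.cleaner.commits.retained on the writer so that older file versions survive long enough for slow readers.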
Environment Description
- Hudi version: 0.7.0
- Spark version: 3.0.1
- Hadoop version: 3.2.1
- Storage: S3
- Running on Docker: No
Thank you