Description
Hello,
We have a setup where we process data incrementally against large Hudi tables in S3, using Hudi and Spark. When reading a large table from a separate Spark process, or when running time-consuming queries against the resulting DataFrame, the reading process crashes if another process updates that table incrementally in the meantime. I assume this happens because the underlying Parquet files are modified while the DataFrame is still being queried.
How can we isolate the table when reading and performing queries against that dataframe in Spark without being affected by the writers?
- Sample Code
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Snapshot query: reads the latest committed state of the table.
// Note: the option keys/values are vals in DataSourceReadOptions, not methods,
// so they must be referenced without parentheses.
val df = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://path/to/hudi/table/*")

df.createOrReplaceTempView("hudi_table")
While running queries against 'hudi_table', if any other process updates the table at that S3 path, the query crashes.
How can we guarantee snapshot isolation when reading without being affected by writers?
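One approach that may help (a sketch, assuming Hudi 0.7.0's incremental-query API; the commit timestamp below is a placeholder you would look up from the table's timeline, not a real value) is to pin the read to a specific completed commit instant, so the query plan only references file versions belonging to that commit rather than whatever is latest:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Point-in-time read: an incremental query bounded to a single commit instant.
// "000" as the begin instant means "from the start of the timeline";
// the end instant pins the view to one completed commit.
// NOTE: "20210301120000" is a placeholder -- substitute an actual commit
// timestamp from the table's .hoodie timeline.
val pinned = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, "20210301120000")
  .load("s3://path/to/hudi/table") // incremental queries take the base path, no glob

pinned.createOrReplaceTempView("hudi_table_pinned")
```

Even with a pinned instant, the cleaner on the writer side can eventually delete the file slices the reader depends on, so it may also be necessary to raise hoodie.cleaner.commits.retained on the writer so that older file versions survive long enough for slow readers.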
Environment Description
- Hudi version: 0.7.0
- Spark version: 3.0.1
- Hadoop version: 3.2.1
- Storage: S3
- Running on Docker: No
Thank you