[SPARK-46992] Fix cache consistency #45181
Conversation
Could you enable GitHub Actions on your repository, @doki23? The Apache Spark community uses the contributor's GitHub Actions resources.
BTW, the original code came from #4686 (at least) and seems to have been the default Apache Spark behavior for a long time.
IIUC, Apache Spark's persisted data (the underlying RDD) can always be recomputed, can't it? Persistence is only a best-effort approach to reduce re-computation. Do we guarantee the cached data's immutability?
Hmm... Sorry, I do not understand your concern. This change does not affect the immutability of cached data.
cc @cloud-fan and @HyukjinKwon because this seems to be claimed as a long-standing correctness issue.
It's not an ideal behavior but should be easy to work around.
Ok, I've enabled it. @dongjoon-hyun
nchammas left a comment
In its current state, I don't think this PR fixes the issue described on Jira.
Here is the Python test I am running, which is simplified from the original reproduction that Denis posted:
data = spark.createDataFrame([(0,), (1,)], "val int")
data = data.orderBy("val").sample(fraction=0.5, seed=-1)
data.collect()
data.cache()
assert len(data.collect()) == data.count(), f"{data.collect()}, {data.count()}"
The assertion fails as follows:
AssertionError: [Row(val=0), Row(val=1)], 1
Changing data.cache() to data = data.cache() doesn't change the result.
Are you sure this is a valid test? Because this particular check passes for me on master.
It makes sure that the cached data of the new Dataset instance is as expected. I'll also add one more case that proves the results of cached.count() and cached.collect() are consistent.
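A minimal sketch of what such a test could look like (illustrative only, not the PR's actual test; it assumes a QueryTest-style Scala suite with a `spark` session in scope):
test("SPARK-46992: cached count() and collect() are consistent") {
  // Non-deterministic sample on top of a sort, mirroring the Python repro above.
  val df = spark.range(0, 100).toDF("val").orderBy("val").sample(0.5, -1L)
  val cached = df.cache()
  try {
    val rows = cached.collect()
    // count() must agree with the number of rows collect() returns.
    assert(rows.length.toLong === cached.count())
    // A second collect() must return exactly the same rows once the data is cached.
    assert(cached.collect().toSeq === rows.toSeq)
  } finally {
    cached.unpersist()
  }
}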
Just to be clear, do you not consider it a correctness issue? To me, it's a correctness issue: the existing behavior suggests that caching a DataFrame should not change its results, but this is not always true, as the repro documented in SPARK-46992 shows. I also posted my own repro just above.
All children have to be considered for changes in their persistence state. Currently it only checks the first child it finds.
So, we need a cache state signature for queryExecution.
def queryExecution: QueryExecution = {
  val cacheStatesSign = queryUnpersisted.computeCacheStateSignature()
  // If all children aren't cached, directly return the queryUnpersisted
  if (cacheStatesSign.forall(b => !b)) {
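For reference, a rough sketch of how this check might continue (illustrative only; `queryPersisted` and `persistedCacheSign` are hypothetical names, and the re-analysis step is simplified, so the PR's actual code may differ):
private var persistedCacheSign: Array[Boolean] = Array.empty
private var queryPersisted: QueryExecution = _

def queryExecution: QueryExecution = {
  val cacheStatesSign = queryUnpersisted.computeCacheStateSignature()
  // If all children aren't cached, directly return the queryUnpersisted.
  if (cacheStatesSign.forall(b => !b)) {
    queryUnpersisted
  } else {
    // Re-analyze only when the cache-state signature changed since the last call,
    // so the executedPlan reflects the current contents of the cache.
    if (queryPersisted == null || !persistedCacheSign.sameElements(cacheStatesSign)) {
      persistedCacheSign = cacheStatesSign
      queryPersisted = new QueryExecution(
        queryUnpersisted.sparkSession, queryUnpersisted.logical)
    }
    queryPersisted
  }
}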
- It doesn't look like it's necessary to distinguish between persisted and unpersisted anymore. If we wanted, we could have a cache Map[State, QueryExecution] for different states, but I think it'd add unjustified complexity.
- We cannot use var - it's not thread-safe.
class Dataset[T] private[sql](
    @Unstable @transient val queryExecutionRef: AtomicReference[(Array[Boolean], QueryExecution)],
    @DeveloperApi @Unstable @transient val encoder: Encoder[T])
  extends Serializable {

  @DeveloperApi
  def queryExecution: QueryExecution = {
    val (state, queryExecution) = queryExecutionRef.get()
    val newState = queryExecution.computeCacheStateSignature()
    if (state.sameElements(newState)) queryExecution
    else {
      val newQueryExecution = new QueryExecution(queryExecution)
      queryExecutionRef.set((newState, newQueryExecution))
      newQueryExecution
    }
  }
  ...
I don't think wrapping queryExecution in an AtomicReference makes it thread-safe. For example, if we unpersist one of its children in another thread while calling ds.count() at the same time, the cache consistency may still be incorrect.
Using two queryExecution variables may help reduce the number of analysis passes.
> I don't think wrapping queryExecution in an AtomicReference makes it thread-safe. For example, if we unpersist one of its children in another thread while calling ds.count() at the same time, the cache consistency may still be incorrect.
Yes, consistency of results cannot be guaranteed when the persistence state changes concurrently in different threads, but this is not what I was pointing to. Thread safety is a basic concept, not related to business logic: when we change a var in one thread, other threads might not see the updated reference. To avoid that, the reference needs to be marked volatile. In the example above I used AtomicReference's set for simplicity, but it might make sense to implement it using compareAndSet to get additional guarantees (see the sketch after this comment).
> Using two queryExecution variables may help reduce the number of analysis passes.
I doubt that the additional complexity is worth it. It's not a big deal... Let's see what the reviewers think.
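For reference, a compareAndSet-based variant of the queryExecution method sketched earlier might look like this (a sketch only; it reuses the queryExecutionRef field from that snippet, and the QueryExecution constructor call stands in for whatever re-analysis is actually needed):
def queryExecution: QueryExecution = {
  val current = queryExecutionRef.get()
  val (state, qe) = current
  val newState = qe.computeCacheStateSignature()
  if (state.sameElements(newState)) {
    qe
  } else {
    // Re-analyze against the current cache state.
    val fresh = new QueryExecution(qe.sparkSession, qe.logical)
    // Only the thread that wins the CAS installs the new pair; losers reuse the winner's.
    if (queryExecutionRef.compareAndSet(current, (newState, fresh))) fresh
    else queryExecutionRef.get()._2
  }
}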
 * This method performs a pre-order traversal and returns a boolean Array
 * representing which nodes of the logical tree are persisted.
 */
def computeCacheStateSignature(): Array[Boolean] = {
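For context, the Array[Boolean] version discussed here could look roughly like this (a sketch reconstructed from the fragments in this thread, not the PR's exact code; `normalized`, `cacheManager`, and `IgnoreCachedData` are taken from the snippet below):
def computeCacheStateSignature(): Array[Boolean] = {
  val states = Array.newBuilder[Boolean]
  normalized.foreach { fragment =>
    val cached = fragment match {
      // Fragments marked to ignore cached data never count as persisted.
      case _: IgnoreCachedData => false
      case _ => cacheManager.lookupCachedData(fragment).isDefined
    }
    states += cached
  }
  states.result()
}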
How about using BitSet for persistence state representation?
It'll be easier to work with and it's more efficient.
BitSet requires a numBits parameter. I cannot know the number of children in advance. Although the current implementation is less efficient, it's still acceptable.
It's not necessary to know the number of bits during construction.
It's up to you, but just FYI, here is how it would look:
val builder = BitSet.newBuilder
var index = 0
normalized.foreach { fragment =>
  val cached = fragment match {
    case _: IgnoreCachedData => false
    case _ => cacheManager.lookupCachedData(fragment).isDefined
  }
  if (cached) builder += index
  index += 1
}
builder.result()
It seems that Scala's BitSet doesn't record the highest index, so I can't tell the number of fragments from it.
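One way around that limitation (a sketch only, not something proposed in this PR) would be to carry the fragment count alongside the bit set, so trailing uncached fragments are still accounted for:
import scala.collection.immutable.BitSet

def computeCacheStateSignature(): (Int, BitSet) = {
  val builder = BitSet.newBuilder
  var index = 0
  normalized.foreach { fragment =>
    val cached = fragment match {
      case _: IgnoreCachedData => false
      case _ => cacheManager.lookupCachedData(fragment).isDefined
    }
    if (cached) builder += index
    index += 1
  }
  // `index` is the total number of fragments visited, so two signatures are equal
  // only if both the counts and the cached positions match.
  (index, builder.result())
}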
This PR may be ready. All tests pass.
@dongjoon-hyun @cloud-fan @nchammas Hi, would you please take a look?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.


What changes were proposed in this pull request?
This PR fixes SPARK-46992.
Whenever any child plan of the root plan is cached or uncached, the root plan will always get the newest consistent executedPlan.
Why are the changes needed?
See the comments on SPARK-46992.
df.collect()
df.cache()
df.collect() // won't read cached data because the variable `executedPlan` of df's `QueryExecution` is already initialized
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The test case checks the consistency of cached.count() and cached.collect(), and makes sure that the size of the collected result is the same as the count.
Was this patch authored or co-authored using generative AI tooling?
No.