Make resolved broadcast tensors non-persistent #2563
base: devel
Conversation
// Take all projectable persistent buffers, and move them to the inputs. This
// function creates dummy outputs which should be used in later stages of the
// scheduling.
TORCH_CUDA_CU_API std::vector<TensorView*> projectPersistentBuffers(
Moved
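As a rough sketch of how this declaration might be used (an assumption based on the comment above, not code from this PR; the caller name and the add/remove-output pattern are hypothetical):

// Hypothetical caller. The dummy outputs returned by projectPersistentBuffers
// are kept alive as fusion outputs so later scheduling stages can transform
// the recomputed tensors, then dropped again once scheduling is done.
void scheduleWithProjectedBuffers(Fusion* fusion) {
  std::vector<TensorView*> dummy_outputs = projectPersistentBuffers(fusion);
  for (TensorView* dummy : dummy_outputs) {
    fusion->addOutput(dummy);
  }
  // ... propagate transformations and parallelization ...
  for (TensorView* dummy : dummy_outputs) {
    fusion->removeOutput(dummy);
  }
}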
@@ -656,111 +656,5 @@ TensorView* sortAndRFactor(TensorView* reference_tv) {
  return ir_utils::rfactorHelper(reference_tv, rfactor_axes);
}

std::vector<TensorView*> projectPersistentBuffers(Fusion* fusion) {
Moved
// where the persistent buffer may no longer be needed (one point could be
// after another, and the buffer would be needed until the last resolution
// point)
std::vector<TensorView*> getPersistentUseOfBuffer(
Extracted from projectPersistentBuffers with no change.
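To illustrate what a resolution point is, here is a toy normalization-like fusion (for illustration only, not the repro from #2559; makeSymbolicTensor is assumed to be the usual C++ test helper):

Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeSymbolicTensor(2);
fusion.addInput(tv0);

auto tv1 = exp(tv0);                       // persistent buffer candidate
auto tv2 = sum(tv1, {1});                  // reduction branch
auto tv3 = broadcast(tv2, {false, true});
auto tv4 = div(tv1, tv3);                  // first resolution point of tv1
auto tv5 = sum(tv4, {1});
auto tv6 = broadcast(tv5, {false, true});
auto tv7 = sub(tv1, tv6);                  // a later resolution point of tv1
fusion.addOutput(tv7);

// tv1 must remain available (persistent or recomputed) until tv7, the last
// point where a residual path through tv1 rejoins a reduction result.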
} // namespace

std::vector<TensorView*> projectPersistentBuffers(Fusion* fusion) {
Moved from reduction_utils.
Just a couple of questions; I'm not quite understanding the approach.
for (const auto i : c10::irange(
         buffer->getComputeAtPosition(),
         buffer->domain()->domain().size())) {
Couldn't the axes be parallelized consistently even though they're outside the inline point? I guess it's a tendency that if we parallelize consistently we put that parallel dim within the compute-at point, but this check seems to assume that?
Ah, you're right. What I thought was that if a broadcast domain appears left of a computeAt position, it's effectively expanded to the mapped consumer domain, so I thought that would make the producer indexed consistently. However, the mapped consumer domain itself may also be a broadcast domain, which effectively means we need something like the RAW sync logic... Will think about it.
Yeah, or I need to finish my damn indexing PR, which would have this information 😭
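For context, a minimal sketch of the kind of per-axis check under discussion, assuming it looks for parallelized broadcast axes outside the compute-at position (the exact condition in the PR may differ):

// Sketch only: walk the axes to the right of the compute-at position and flag
// any broadcast axis that is also bound to a thread or block dimension.
bool parallelized_broadcast_outside_ca = false;
for (const auto i : c10::irange(
         buffer->getComputeAtPosition(),
         buffer->domain()->domain().size())) {
  IterDomain* id = buffer->axis(i);
  if (id->isBroadcast() && id->isThread()) {
    // Such an axis may not be indexed consistently with the consumer that
    // resolves the broadcast, so the buffer cannot safely stay persistent.
    parallelized_broadcast_outside_ca = true;
  }
}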
const auto persistent_use_of_buffer =
    getPersistentUseOfBuffer(buffer, resolution_points);

for (auto use : persistent_use_of_buffer) {
So this is just adding a new expression for the first use of the persistent buffer?
This just follows the same logic as the buffer projection. It replicates all the dependent expressions from inputs to the first use of the persistent buffer.
That's what I thought; I'm just forgetting how, as it looks like it's only recomputing one use.
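A hedged sketch of the recompute-and-replace step being described, assuming helpers along the lines of RecomputeTv::recompute and ir_utils::replaceValInExpr (names recalled from the codebase, not verified against this PR):

// For each persistent use, replicate the chain of expressions from the fusion
// inputs up to the buffer, then redirect that use to the replica so the
// original buffer no longer has to stay live across the reduction.
for (auto use : persistent_use_of_buffer) {
  TensorView* buffer_replica = RecomputeTv::recompute(buffer);
  ir_utils::replaceValInExpr(use->definition(), buffer, buffer_replica);
}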
Fixes #2559
See #2559 for a C++ repro. In that fusion there are two persistent buffers, T3 and T4, which are the producers of T5. Since T3 and T4 are not inlined at all and are on Local, they must be exactly mapped with T5 with the same parallelization. That is not the case with T4, so the parallel validation gives an error.

The workaround of this PR is to give up making T4 persistent by recomputing it, much like the projection (most of that logic is reused). Alternatively, if a buffer is only parallelized by TID, saving it on Shared would work too, but that is not considered here.

None of the existing C++ tests and benchmarks are affected by this PR (except for a slight change to the buffer projection).
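For completeness, a sketch of the Shared-memory alternative mentioned above (explicitly not implemented in this PR); the predicate name is hypothetical:

// If the buffer's mismatched axes are only TID-parallelized, placing it in
// shared memory makes it visible across the block, so an exact per-thread
// index mapping with the consumer would no longer be required.
if (isOnlyTidParallelized(buffer)) {  // hypothetical predicate
  buffer->setMemoryType(MemoryType::Shared);
}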