[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974)

* [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data

* Get ssd_offload unit tests passing

* [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on

* Update changelog
another-pjohnson committed May 2, 2022
1 parent 58ccb16 commit 51b53dd
Showing 6 changed files with 770 additions and 104 deletions.
35 changes: 34 additions & 1 deletion CHANGELOG.md
@@ -12,7 +12,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- FSDP: Add pickle/unpickle support for SsdTensorHandle (and derived classes),
verified that FSDP models w/ ssd_offload enabled can correctly call model.state_dict()
and model.load_state_dict(...) and thus successfully checkpoint and recover parameters
stored as SsdFlatParameters. [#964]
- FSDP Ssd Offload: [#974]
* Add a __setattr__ function to SsdTensorHandle/SsdFlatParameter to capture when .data is
overridden and perform the necessary checks/updates before letting the tensor metadata be
updated (see the first sketch after this list).
* There was a bug in pytorch core that caused re-assigning a TensorSubclass's .data to
disable the __torch_dispatch__ mechanism and effectively revert the subclass back to a
normal tensor. Since the ssd_offload feature uses __torch_dispatch__ extensively and FSDP
overrides .data, ssd_offload can now only be used with pytorch version 1.12.0 or later
(currently the pytorch-nightly release). On earlier versions the pytest tests are disabled,
and trying to import ssd_offload or use FSDP with ssd_offload enabled raises an ImportError.
* Split the storage_state ON_CPU into ON_CPU_CLEAN and ON_CPU_DIRTY. ON_CPU_CLEAN
indicates that the .tensor value and the value stored on disk are identical, so no
write is needed when calling .to_file(). ON_CPU_DIRTY indicates that .tensor does not
match the value stored on disk, so a write is necessary. This is disabled in FSDP, as
FSDP does not currently flush variables to disk as soon as the optimizer modifies
values. (The first sketch after this list also illustrates this state split.)
* Fix detection in SsdTensorHandle's __torch_dispatch__ of cases where the handle is used
in an in-place operator or as an output, calling mark_dirty() when that happens.
* Fix grad_fn on SsdFlatParameterViews so that each ssd_flat_param.views[i].grad_fn points
to a split and view of the parent ssd_flat_parameter; when X.backward() is called,
ssd_flat_param.grad is properly updated in the backward pass (see the second sketch after
this list).
* Add unit tests to verify that grad_fn and backward hooks are called appropriately on
SsdFlatParameters/SsdFlatParameterViews.
* Implement the SsdFlatParameter.from_tensor() direct_to_file option. This allows creating
an SsdFlatParameter and writing it directly to disk, rather than creating a new tensor
first. That avoids instantiating another copy of the tensors and can save memory when
creating very large SsdFlatParameters.
* Add SsdFlatParameterViewProperty for overriding a parameter in an nn.Module, similar to
pytorch core's parametrization code path. This allows SsdFlatParameterViews to be updated
as SsdFlatParameter.data is overridden, by replacing Layer.weight with a property whose
getter returns ssd_flat_param.views[view_id] (see the third sketch after this list).
* Add an example, SsdFlatParameterViewParameterization, showing how the existing
parametrization method could be used instead of SsdFlatParameterViewProperty. This
method, however, results in memory inefficiencies because parametrization always keeps
a copy of the original parameter internally.

### Fixed

