Skip to content

Fix IDSelector memory leak with SWIG, add unit tests (#3704)#3810

Closed
mnorris11 wants to merge 1 commit into
facebookresearch:mainfrom
mnorris11:export-D61992599
Closed

Fix IDSelector memory leak with SWIG, add unit tests (#3704)#3810
mnorris11 wants to merge 1 commit into
facebookresearch:mainfrom
mnorris11:export-D61992599

Conversation

@mnorris11
Copy link
Copy Markdown
Contributor

Summary:

Background

Issue: #2996

Prior attempted fix: https://github.com/facebookresearch/faiss/pull/3007/files#diff-d704f33c46a4ef2936c8cf2a66b05f1993e25e79ee5c19d4b63c4e0cf46b0a42

Partially copied from #3139 with an additional unit test.

It is a confirmed and reproducible memory leak every time. There is a workaround. See the comments on #2996.

Relevant SWIG docs: https://www.swig.org/Doc4.1/SWIGDocumentation.html#Typemaps_nn2

Current status

buck test faiss/tests:test_search_params -- test_ownership_2

Test prints:

without SWIG fix:
[0.49, 0.82, 1.15, 1.39, 1.67, 1.88, 2.16, 2.36, 2.65, 2.85, 3.13, 3.34, 3.62, 3.82, 4.11, 4.31, 4.6, 4.8, 5.08, 5.28]

with SWIG fix:
[0.52, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]

Note: This test is not ideal. Ideally we could enable ASAN to catch these memory leaks. But ASAN does not seem to catch these Python memory leaks. Example:

def test_ownership_3(self):
        d = 32
        quantizer = faiss.IndexFlatL2(d)
        quantizer.this.own(False)

The above test produces no ASAN error, even though the quantizer object leaks.

Differential Revision: D61992599

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D61992599

mnorris11 pushed a commit to mnorris11/faiss that referenced this pull request Aug 29, 2024
…h#3810)

Summary:
Pull Request resolved: facebookresearch#3810

Pull Request resolved: facebookresearch#3704

Background
--
Issue: facebookresearch#2996

Prior attempted fix: https://github.com/facebookresearch/faiss/pull/3007/files#diff-d704f33c46a4ef2936c8cf2a66b05f1993e25e79ee5c19d4b63c4e0cf46b0a42

Partially copied from facebookresearch#3139 with an additional unit test.

It is a confirmed and reproducible memory leak every time. There is a workaround. See the comments on facebookresearch#2996.

Relevant SWIG docs: https://www.swig.org/Doc4.1/SWIGDocumentation.html#Typemaps_nn2

Current status
--

`buck test faiss/tests:test_search_params -- test_ownership_2`

Test prints:

without SWIG fix:
`[0.49, 0.82, 1.15, 1.39, 1.67, 1.88, 2.16, 2.36, 2.65, 2.85, 3.13, 3.34, 3.62, 3.82, 4.11, 4.31, 4.6, 4.8, 5.08, 5.28]`

with SWIG fix:
`[0.52, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]`

Note: This test is not ideal. Ideally we could enable ASAN to catch these memory leaks. But ASAN does not seem to catch these Python memory leaks. Example:
```
def test_ownership_3(self):
        d = 32
        quantizer = faiss.IndexFlatL2(d)
        quantizer.this.own(False)
```
The above test produces no ASAN error, even though the `quantizer` object leaks.

Differential Revision: D61992599
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D61992599

@mnorris11
Copy link
Copy Markdown
Contributor Author

Looking into errors.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D61992599

mnorris11 pushed a commit to mnorris11/faiss that referenced this pull request Sep 12, 2024
…h#3810)

Summary:
Pull Request resolved: facebookresearch#3810

Pull Request resolved: facebookresearch#3704

Background
--
Issue: facebookresearch#2996

Prior attempted fix: https://github.com/facebookresearch/faiss/pull/3007/files#diff-d704f33c46a4ef2936c8cf2a66b05f1993e25e79ee5c19d4b63c4e0cf46b0a42

Partially copied from facebookresearch#3139 with an additional unit test.

It is a confirmed and reproducible memory leak every time. There is a workaround. See the comments on facebookresearch#2996.

Relevant SWIG docs: https://www.swig.org/Doc4.1/SWIGDocumentation.html#Typemaps_nn2

Current status
--

`buck test faiss/tests:test_search_params -- test_ownership_2`

Test prints:

without SWIG fix:
`[0.49, 0.82, 1.15, 1.39, 1.67, 1.88, 2.16, 2.36, 2.65, 2.85, 3.13, 3.34, 3.62, 3.82, 4.11, 4.31, 4.6, 4.8, 5.08, 5.28]`

with SWIG fix:
`[0.52, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]`

Note: This test is not ideal. Ideally we could enable ASAN to catch these memory leaks. But ASAN does not seem to catch these Python memory leaks. Example:
```
def test_ownership_3(self):
        d = 32
        quantizer = faiss.IndexFlatL2(d)
        quantizer.this.own(False)
```
The above test produces no ASAN error, even though the `quantizer` object leaks.

Why change HNSW test?
--
This fix causes the HNSW test to fail with heap-use-after-free. This is because the index.storage.get_distance_computer() somehow gets freed during clone_index, but only when reassigning to the same variable like `index.storage = clone(index.storage)`. I checked in https://fburl.com/code/qw6fznjt, and it is non-null before returning on the CPP side.

After adding the temp variable, I also had to set `index.own_fields = False`, otherwise we get a heap-use-after-free again due to it being deleted already.

Differential Revision: D61992599
…h#3810)

Summary:
Pull Request resolved: facebookresearch#3810

Pull Request resolved: facebookresearch#3704

Background
--
Issue: facebookresearch#2996

Prior attempted fix: https://github.com/facebookresearch/faiss/pull/3007/files#diff-d704f33c46a4ef2936c8cf2a66b05f1993e25e79ee5c19d4b63c4e0cf46b0a42

Partially copied from facebookresearch#3139 with an additional unit test.

It is a confirmed and reproducible memory leak every time. There is a workaround. See the comments on facebookresearch#2996.

Relevant SWIG docs: https://www.swig.org/Doc4.1/SWIGDocumentation.html#Typemaps_nn2

Current status
--

`buck test faiss/tests:test_search_params -- test_ownership_2`

Test prints:

without SWIG fix:
`[0.49, 0.82, 1.15, 1.39, 1.67, 1.88, 2.16, 2.36, 2.65, 2.85, 3.13, 3.34, 3.62, 3.82, 4.11, 4.31, 4.6, 4.8, 5.08, 5.28]`

with SWIG fix:
`[0.52, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]`

Note: This test is not ideal. Ideally we could enable ASAN to catch these memory leaks. But ASAN does not seem to catch these Python memory leaks. Example:
```
def test_ownership_3(self):
        d = 32
        quantizer = faiss.IndexFlatL2(d)
        quantizer.this.own(False)
```
The above test produces no ASAN error, even though the `quantizer` object leaks.

Why change HNSW test?
--
This fix causes the HNSW test to fail with heap-use-after-free. This is because the index.storage.get_distance_computer() somehow gets freed during clone_index, but only when reassigning to the same variable like `index.storage = clone(index.storage)`. I checked in https://fburl.com/code/qw6fznjt, and it is non-null before returning on the CPP side.

After adding the temp variable, I also had to set `index.own_fields = False`, otherwise we get a heap-use-after-free again due to it being deleted already.

Differential Revision: D61992599
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D61992599

alibeklfc added a commit to alibeklfc/faiss that referenced this pull request May 12, 2026
Summary:
A user-reported leak (facebookresearch#2996 follow-up) reproduces on faiss 1.14.1 when an `IDSelectorBatch` is assigned to a `SearchParameters` field after construction. Each iteration of the reporter's loop leaks ~160 MB; over a long-running serving process this grows unbounded.

# Root cause

The Python wrapper for `IDSelectorBatch` carries a SWIG `thisown` flag that controls who calls C++ `delete`. The SWIG-generated property setter for `SearchParameters.sel` flips `thisown` to 0 on the assumption that the enclosing C++ object will take responsibility, but `SearchParameters`'s C++ destructor does not free `sel` (the field is borrowed by design). Result: the C++ `IDSelectorBatch` (a 40 MB sorted `std::vector<idx_t>` for the reporter's case) plus the retained numpy buffer are orphaned forever.

`handle_SearchParameters` in `class_wrappers.py` already protected the kwargs construction path (`SearchParameters(sel=x)`) by wrapping the assignment in `RememberSwigOwnership` and `add_to_referenced_objects`. The bare-attribute path was unprotected.

# Fix

Override `__setattr__` on the base `SearchParameters` class so the same ownership protection runs for both `SearchParameters(sel=x)` and `params.sel = x`. The override filters via `hasattr(v, "thisown")` so only SWIG-wrapped C++ objects go through the dance; SWIG internals like `self.this` (a `SwigPyObject` without `thisown`), bookkeeping lists, and plain Python values bypass it.

`replacement_init` is slimmed to a pure delegator: the protection moves into `replacement_setattr` so a single source of truth handles both code paths. Keeping both layers would double-wrap on kwargs construction.

The filter changes from a denylist (`type(v) not in (int, float, bool, str)`) to a duck-type check (`hasattr(v, "thisown")`). The new filter is stricter: numpy arrays, lists, and `None` no longer get appended. Safe in practice because `SearchParameters` SWIG fields are typed as C++ pointers and only legitimately accept SWIG wrappers; numpy buffers passed indirectly (e.g., to `IDSelectorBatch`) are kept alive by the inner wrapper's own `add_to_referenced_objects`.

The duck-type filter is field-name-agnostic, so other `SearchParameters` subclass fields with the same C++ shape (`Foo* field = nullptr` with no destructor freeing `field`) benefit from the same protection automatically. Examples include `SearchParametersPreTransform.index_params` and `SearchParametersIVF.quantizer_params`.

# Per-field ref tracking

A naive implementation that uses the existing `add_to_referenced_objects` helper (which appends to a list) leaks under a different access pattern: `params = SearchParameters(); for ...: params.sel = ...` accumulates one entry per reassignment because nothing drops the prior ref. This is a common pattern in long-lived servers that initialize a `SearchParameters` once and rebind `params.sel` per request.

Fixed by using a per-field dict `self._sp_field_refs` keyed by attribute name. Reassigning the same field replaces the entry in the dict, so the prior wrapper loses its Python ref and the C++ object is freed. The else branch of `replacement_setattr` also drops any prior ref when the field is set to None or to a non-SWIG value, so explicit clear (`params.sel = None`) releases the C++ object immediately.

# Subclass guard

`__setattr__` is installed only once per class hierarchy. SWIG generates per-class `__init__` slots but does NOT generate per-class `__setattr__`, so subclasses (`SearchParametersIVF`, `SearchParametersHNSW`, ...) inherit the override via MRO. Without the `_protected_setattr` sentinel, calling `handle_SearchParameters(SearchParametersIVF)` would capture the inherited `replacement_setattr` as `parent_setattr`, then install a second `replacement_setattr` on top - chained calls would fire field tracking twice. The guard uses `getattr` (which walks the MRO) rather than `__dict__` lookup (own-attribute only) so subclasses correctly see the base's marker.

`replacement_init` does not need this guard because each SWIG class generates its own `__init__` in its own `__dict__`; wrapping always captures SWIG's fresh init, never an inherited replacement.

# Prior attempts and why this approach is targeted

This bug has been attempted twice before via the SWIG layer:

- facebookresearch#3139 (Apr 2024, abandoned)
- facebookresearch#3810 (Sept 2024, changes requested)

Both added a `%typemap(in) SWIGTYPE *` in `swigfaiss.swig` that strips the `$disown` flag globally, so SWIG never transfers ownership for any pointer field assignment anywhere in the bindings. The minimal change is one typemap.

The trade-off proved untenable. PR 3810 surfaced the smoking gun while testing: `test_graph_based.test_io_no_storage` started crashing with heap-use-after-free at `index.storage = faiss.clone_index(index.storage)` with `own_fields = True`. The author worked around it by introducing a temp variable and flipping `own_fields = False` - but that flip is itself a behavior change to a documented FAISS contract that downstream users build on.

The right framing is not "SWIG's default ownership transfer is always wrong" (the typemap assumption) but "SWIG's default ownership transfer is wrong specifically for `SearchParameters` because that class lacks a destructor for its borrowed pointer fields." This diff scopes the change to that class hierarchy; every other SWIG-managed field keeps its existing semantics. No `swigfaiss.swig` change, no surprising downstream breakage.

The memory regression test pattern (`np.arange(5_000_000)` × N iterations, measure `faiss.get_mem_usage_kb`) was independently validated by PR 3810's `test_ownership_2` and is reused here. ASAN cannot catch this leak because the C++ object remains reachable from a Python wrapper that simply has no Python-side references - it is a Python-level reference leak, not a C++-level dangling pointer.

Differential Revision: D104750178
meta-codesync Bot pushed a commit that referenced this pull request May 13, 2026
Summary:
Pull Request resolved: #5208

A user-reported leak (#2996 follow-up) reproduces on faiss 1.14.1 when an `IDSelectorBatch` is assigned to a `SearchParameters` field after construction. Each iteration of the reporter's loop leaks ~160 MB; over a long-running serving process this grows unbounded.

# Root cause

The Python wrapper for `IDSelectorBatch` carries a SWIG `thisown` flag that controls who calls C++ `delete`. The SWIG-generated property setter for `SearchParameters.sel` flips `thisown` to 0 on the assumption that the enclosing C++ object will take responsibility, but `SearchParameters`'s C++ destructor does not free `sel` (the field is borrowed by design). Result: the C++ `IDSelectorBatch` (a 40 MB sorted `std::vector<idx_t>` for the reporter's case) plus the retained numpy buffer are orphaned forever.

`handle_SearchParameters` in `class_wrappers.py` already protected the kwargs construction path (`SearchParameters(sel=x)`) by wrapping the assignment in `RememberSwigOwnership` and `add_to_referenced_objects`. The bare-attribute path was unprotected.

# Fix

Override `__setattr__` on the base `SearchParameters` class so the same ownership protection runs for both `SearchParameters(sel=x)` and `params.sel = x`. The override filters via `hasattr(v, "thisown")` so only SWIG-wrapped C++ objects go through the dance; SWIG internals like `self.this` (a `SwigPyObject` without `thisown`), bookkeeping lists, and plain Python values bypass it.

`replacement_init` is slimmed to a pure delegator: the protection moves into `replacement_setattr` so a single source of truth handles both code paths. Keeping both layers would double-wrap on kwargs construction.

The filter changes from a denylist (`type(v) not in (int, float, bool, str)`) to a duck-type check (`hasattr(v, "thisown")`). The new filter is stricter: numpy arrays, lists, and `None` no longer get appended. Safe in practice because `SearchParameters` SWIG fields are typed as C++ pointers and only legitimately accept SWIG wrappers; numpy buffers passed indirectly (e.g., to `IDSelectorBatch`) are kept alive by the inner wrapper's own `add_to_referenced_objects`.

The duck-type filter is field-name-agnostic, so other `SearchParameters` subclass fields with the same C++ shape (`Foo* field = nullptr` with no destructor freeing `field`) benefit from the same protection automatically. Examples include `SearchParametersPreTransform.index_params` and `SearchParametersIVF.quantizer_params`.

# Per-field ref tracking

A naive implementation that uses the existing `add_to_referenced_objects` helper (which appends to a list) leaks under a different access pattern: `params = SearchParameters(); for ...: params.sel = ...` accumulates one entry per reassignment because nothing drops the prior ref. This is a common pattern in long-lived servers that initialize a `SearchParameters` once and rebind `params.sel` per request.

Fixed by using a per-field dict `self._sp_field_refs` keyed by attribute name. Reassigning the same field replaces the entry in the dict, so the prior wrapper loses its Python ref and the C++ object is freed. The else branch of `replacement_setattr` also drops any prior ref when the field is set to None or to a non-SWIG value, so explicit clear (`params.sel = None`) releases the C++ object immediately.

# Subclass guard

`__setattr__` is installed only once per class hierarchy. SWIG generates per-class `__init__` slots but does NOT generate per-class `__setattr__`, so subclasses (`SearchParametersIVF`, `SearchParametersHNSW`, ...) inherit the override via MRO. Without the `_protected_setattr` sentinel, calling `handle_SearchParameters(SearchParametersIVF)` would capture the inherited `replacement_setattr` as `parent_setattr`, then install a second `replacement_setattr` on top - chained calls would fire field tracking twice. The guard uses `getattr` (which walks the MRO) rather than `__dict__` lookup (own-attribute only) so subclasses correctly see the base's marker.

`replacement_init` does not need this guard because each SWIG class generates its own `__init__` in its own `__dict__`; wrapping always captures SWIG's fresh init, never an inherited replacement.

# Prior attempts and why this approach is targeted

This bug has been attempted twice before via the SWIG layer:

- #3139 (Apr 2024, abandoned)
- #3810 (Sept 2024, changes requested)

Both added a `%typemap(in) SWIGTYPE *` in `swigfaiss.swig` that strips the `$disown` flag globally, so SWIG never transfers ownership for any pointer field assignment anywhere in the bindings. The minimal change is one typemap.

The trade-off proved untenable. PR 3810 surfaced the smoking gun while testing: `test_graph_based.test_io_no_storage` started crashing with heap-use-after-free at `index.storage = faiss.clone_index(index.storage)` with `own_fields = True`. The author worked around it by introducing a temp variable and flipping `own_fields = False` - but that flip is itself a behavior change to a documented FAISS contract that downstream users build on.

The right framing is not "SWIG's default ownership transfer is always wrong" (the typemap assumption) but "SWIG's default ownership transfer is wrong specifically for `SearchParameters` because that class lacks a destructor for its borrowed pointer fields." This diff scopes the change to that class hierarchy; every other SWIG-managed field keeps its existing semantics. No `swigfaiss.swig` change, no surprising downstream breakage.

The memory regression test pattern (`np.arange(5_000_000)` × N iterations, measure `faiss.get_mem_usage_kb`) was independently validated by PR 3810's `test_ownership_2` and is reused here. ASAN cannot catch this leak because the C++ object remains reachable from a Python wrapper that simply has no Python-side references - it is a Python-level reference leak, not a C++-level dangling pointer.

Reviewed By: mnorris11

Differential Revision: D104750178

fbshipit-source-id: 41f865028d0dc863fe560d1fd1bbe979dc8e8e8f
@alibeklfc alibeklfc closed this May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

4 participants