Description
The relevant code:
Lines 128 to 139 in 1fbbbe3
static void
hashindex_free_buckets(HashIndex *index)
{
#ifndef BORG_NO_PYTHON
    if(index->buckets_buffer.buf) {
        PyBuffer_Release(&index->buckets_buffer);
    } else
#endif
    {
        free(index->buckets);
    }
}
When an index is read from a file, the bucket memory is backed by a Py_buffer. This is embedded in the HashIndex struct as the buckets_buffer field. (If the index is made by hashindex_init instead, buckets_buffer.buf is set to NULL and the rest of the struct is ignored.)
When the index has to resize, it copies the entries into regular calloc'd memory and releases the Py_buffer. Any resize after that callocs a new memory region, copies the entries over, and is then supposed to free the old calloc'd region. The hashindex_free_buckets function reflects that: if there's an active Py_buffer it releases that, else it frees what it got from calloc.
The problem is that PyBuffer_Release does not zero out the Py_buffer.buf field, so if an index was read from a file, hashindex_free_buckets will always try to call PyBuffer_Release. That is not itself an error: PyBuffer_Release checks whether the buffer is still active via the Py_buffer.obj field and bails out early if that's NULL, and it sets obj to NULL after successfully releasing the buffer, so it can be called any number of times with no side effects. However, it means hashindex_free_buckets will never free the calloc'd memory if the Py_buffer was at some point active.
A demo (note that the memory use function only works on Linux, if you want to test elsewhere you should run the parts one by one and just use whatever system monitor you have at hand to see how much it uses):
Python 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct, resource, pathlib, gc
>>> import borg.hashindex
>>>
>>> print(borg.__version__)
1.4.4rc2.dev15+g289364b7d
>>>
>>> def print_memory_use(marker):
... gc.collect()
... # This is platform specific.
... print(f'\nMemory use {marker}: {(int(pathlib.Path('/proc/self/statm').read_text().split()[0]) * resource.getpagesize()):,} bytes.\n')
...
>>> def fill_index(index):
... for n in range(10_000_000):
... index[struct.pack('<i', n) + b'\x00' * 28] = (n, n)
...
>>> index = borg.hashindex.NSIndex()
>>> index.write('fresh-index-file')
>>> print_memory_use('at the start')
Memory use at the start: 23,171,072 bytes.
>>>
>>> fill_index(index) ; print_memory_use('with 10M entries')
Memory use with 10M entries: 682,754,048 bytes.
>>> del index ; print_memory_use('after deleting the index')
Memory use after deleting the index: 23,330,816 bytes.
>>>
>>> index = borg.hashindex.NSIndex.read('fresh-index-file')
>>> print_memory_use('after loading a fresh index from disk')
Memory use after loading a fresh index from disk: 23,330,816 bytes.
>>> fill_index(index); print_memory_use('with 10M entries')
Memory use with 10M entries: 2,232,836,096 bytes.
>>> del index ; print_memory_use('after deleting the index')
Memory use after deleting the index: 2,232,836,096 bytes.

Copy-paste code:
import struct, resource, pathlib, gc
import borg.hashindex
print(borg.__version__)
def print_memory_use(marker):
gc.collect()
# This is platform specific.
print(f'\nMemory use {marker}: {(int(pathlib.Path('/proc/self/statm').read_text().split()[0]) * resource.getpagesize()):,} bytes.\n')
def fill_index(index):
for n in range(10_000_000):
index[struct.pack('<i', n) + b'\x00' * 28] = (n, n)
index = borg.hashindex.NSIndex()
index.write('fresh-index-file')
print_memory_use('at the start')
fill_index(index) ; print_memory_use('with 10M entries')
del index ; print_memory_use('after deleting the index')
index = borg.hashindex.NSIndex.read('fresh-index-file')
print_memory_use('after loading a fresh index from disk')
fill_index(index); print_memory_use('with 10M entries')
del index ; print_memory_use('after deleting the index')

The memory use after the first fill is pretty much as expected. For 10M elements it will have picked a bucket count of 16485527, and at a bucket size of 40 bytes that's about 659 MB on top of the initial 23 MB for Python itself. And after deleting the index, it properly gets rid of all of it.
The memory use after the second fill matches closely what you get if you allocate a memory region of every size in hash_sizes up to 16485527 and keep all of them (except for the first one, which would be the Python buffer, but its size is negligible). And deleting the index does not get rid of it.
For a fix, just add index->buckets_buffer.buf = NULL; after line 133 (the PyBuffer_Release call).