
7.4.1 Memory Leak #981

Open
wpirkl opened this issue Feb 1, 2024 · 7 comments

wpirkl commented Feb 1, 2024

Python version 3.11.7

When opening and closing multiple MF4 files there is a memory leak: asammdf allocates a lot of memory that is never freed.
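For context, a minimal sketch of the access pattern that triggers this (the file name is hypothetical; the report itself gives no reproduction script):

# Sketch of the reported pattern: repeatedly opening and closing MF4 files;
# memory usage keeps growing even though nothing is retained between iterations.
from asammdf import MDF

for _ in range(100):
    with MDF('measurement.mf4') as mdf:  # hypothetical file name
        pass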

Here is a tracemalloc output (tracing adds a lot of overhead, so only a few traces are shown).

2024-02-01 09:34:18,284 :: INFO :: [ Top 10 Memory Stats!!!]
2024-02-01 09:34:18,284 :: INFO :: 0: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:7525: size=50.1 MiB, count=1402870, average=37 B
2024-02-01 09:34:18,284 :: INFO :: 1: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:1365: size=2231 KiB, count=1, average=2231 KiB
2024-02-01 09:34:18,285 :: INFO :: 2: D:\Python311\Lib\site-packages\canmatrix\formats\dbc.py:573: size=1020 KiB, count=6528, average=160 B
2024-02-01 09:34:18,287 :: INFO :: 3: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:1179: size=839 KiB, count=1, average=839 KiB
2024-02-01 09:34:18,290 :: INFO :: 4: D:\Python311\Lib\site-packages\asammdf\blocks\v4_blocks.py:3139: size=638 KiB, count=3, average=213 KiB
2024-02-01 09:34:18,290 :: INFO :: 5: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:7314: size=637 KiB, count=2, average=319 KiB
2024-02-01 09:34:18,290 :: INFO :: 6: <attrs generated init canmatrix.canmatrix.Signal>:7: size=332 KiB, count=3270, average=104 B
2024-02-01 09:34:18,291 :: INFO :: 7: D:\Python311\Lib\site-packages\canmatrix\canmatrix.py:204: size=332 KiB, count=3265, average=104 B
2024-02-01 09:34:18,291 :: INFO :: 8: D:\Python311\Lib\site-packages\canmatrix\formats\dbc.py:964: size=332 KiB, count=3264, average=104 B
2024-02-01 09:34:18,291 :: INFO :: 9: D:\Python311\Lib\site-packages\canmatrix\canmatrix.py:193: size=332 KiB, count=3264, average=104 B
2024-02-01 09:35:35,793 :: INFO :: [ Top 10 Memory Stats!!!]
2024-02-01 09:35:35,794 :: INFO :: 0: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:7525: size=109 MiB, count=3125168, average=37 B
2024-02-01 09:35:35,794 :: INFO :: 1: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:1365: size=2231 KiB, count=1, average=2231 KiB
2024-02-01 09:35:35,795 :: INFO :: 2: D:\Python311\Lib\site-packages\asammdf\mdf.py:4834: size=1205 KiB, count=2, average=602 KiB
2024-02-01 09:35:35,795 :: INFO :: 3: D:\Python311\Lib\site-packages\asammdf\mdf.py:4831: size=1205 KiB, count=2, average=602 KiB
2024-02-01 09:35:35,796 :: INFO :: 4: D:\Python311\Lib\site-packages\canmatrix\formats\dbc.py:573: size=1020 KiB, count=6528, average=160 B
2024-02-01 09:35:35,796 :: INFO :: 5: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:1179: size=839 KiB, count=1, average=839 KiB
2024-02-01 09:35:35,797 :: INFO :: 6: D:\Python311\Lib\site-packages\asammdf\blocks\v4_blocks.py:3139: size=638 KiB, count=3, average=213 KiB
2024-02-01 09:35:35,797 :: INFO :: 7: D:\Python311\Lib\site-packages\asammdf\blocks\mdf_v4.py:7314: size=637 KiB, count=2, average=319 KiB
2024-02-01 09:35:35,798 :: INFO :: 8: D:\Python311\Lib\site-packages\asammdf\mdf.py:4840: size=603 KiB, count=4, average=151 KiB
2024-02-01 09:35:35,799 :: INFO :: 9: D:\Python311\Lib\site-packages\asammdf\mdf.py:4841: size=602 KiB, count=2, average=301 KiB
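For reference, a report like the one above can be produced with the standard-library tracemalloc module; a minimal sketch (logger setup assumed, not from the original report):

# Sketch of how the "Top 10 Memory Stats" report above can be generated.
import logging
import tracemalloc

logging.basicConfig(level=logging.INFO)
tracemalloc.start()

# ... open and close the MF4 files here ...

snapshot = tracemalloc.take_snapshot()
logging.info('[ Top 10 Memory Stats!!!]')
for i, stat in enumerate(snapshot.statistics('lineno')[:10]):
    logging.info('%d: %s', i, stat)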
@paweller

Hello all,
it seems like the memory leak issue has still not been fixed.

Python setup

'os = Windows-10-10.0.19045-SP0'
'python = 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)]'
'asammdf = 7.4.2'
'numpy = 1.26.4'

Dummy code

import numpy as np
from asammdf import MDF


class Container:
    def __init__(self):
        self.data = np.full((1,), np.nan)

    # not a @staticmethod as in the real code some other actions are performed
    def get_data(self, fd):
        # data_base and can_channel are defined elsewhere in the real code
        with MDF(name=fd, memory='minimal') as file:
            data = file.extract_bus_logging(
                database_files={'CAN': [(data_base, can_channel)]}
            ).to_dataframe()[['sig_1', 'sig_2', 'sig_n']].to_numpy(dtype=float)
        return data


fds = ['file_0.mf4', 'file_1.mf4', 'file_n.mf4']
obj = Container()

for f in fds:
    if np.isnan(obj.data).any():
        obj.data = obj.get_data(f)
    else:
        # np.append returns a new array, so the result must be reassigned
        obj.data = np.append(
            arr=obj.data,
            values=obj.get_data(f),
            axis=0
        )

Description

Each of the .mf4 files is about 900 MiB in size. The data extracted from each file and stored in obj is about 5 MiB, so I would expect the RAM usage to grow by roughly that amount per loop iteration. Monitoring the RAM, however, shows that each iteration adds about 600 MiB.
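For reference, the per-iteration growth can be measured with something like the following sketch (psutil is my addition, not part of the original code):

# Sketch for monitoring RAM growth per iteration; psutil is an assumption.
import os

import psutil

proc = psutil.Process(os.getpid())

for f in fds:
    obj.data = obj.get_data(f)  # simplified: first-iteration case only
    print(f'{f}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB')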

Workaround attempt: Threading/Multiprocessing

Since it is known that Python does not necessarily release all of its memory back to the OS once variables are cleared, I tried off-loading get_data() into threads. It is reported here that this releases the memory back to the OS once the thread terminates. The threading is implemented such that every time the code reaches get_data(), a thread is spawned, the data is read in, the thread is terminated, and the extracted data is stored in obj.

import concurrent.futures

for f in fds:
    with concurrent.futures.ThreadPoolExecutor() as executor:
        tmp_data = executor.map(obj.get_data, [f])
    if np.isnan(obj.data).any():
        obj.data = next(tmp_data)
    else:
        obj.data = np.append(
            arr=obj.data,
            values=next(tmp_data),
            axis=0
        )
    del tmp_data

Unfortunately, even that did not do the trick; the memory leak is unchanged. Trying a ProcessPoolExecutor() instead of the ThreadPoolExecutor() does not work at all and raises an error saying that a process in the pool was terminated unexpectedly.
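A process-per-file variant might still be worth a try (an editor's sketch, not from the thread): on Windows a broken pool is often caused by a missing if __name__ == '__main__' guard or by non-picklable arguments such as bound methods of objects holding open file handles. With a module-level worker function and maxtasksperchild=1, each file is handled in a fresh process whose memory the OS reclaims when the worker exits:

# Sketch only: module-level worker plus one-task-per-child pool, so each
# worker process exits (and releases its memory) after a single file.
import multiprocessing as mp

import numpy as np
from asammdf import MDF


def get_data(fd):
    # same extraction as Container.get_data above;
    # data_base / can_channel as in the earlier examples
    with MDF(name=fd) as file:
        df = file.extract_bus_logging(
            database_files={'CAN': [(data_base, can_channel)]}
        ).to_dataframe()
    return df[['sig_1', 'sig_2', 'sig_n']].to_numpy(dtype=float)


if __name__ == '__main__':
    fds = ['file_0.mf4', 'file_1.mf4', 'file_n.mf4']
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        data = np.vstack(pool.map(get_data, fds))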

Workaround attempt: copy.deepcopy()

The intention behind utilizing deepcopy() is to break any references to the return value of MDF.extract_bus_logging() that might linger somewhere in the background. Unfortunately, that ends in AttributeError: 'MDF4' object has no attribute '_closed'. Is the MDF4 object indeed not properly closed? Or is the _closed attribute simply not implemented?

from copy import deepcopy

with MDF(name=fd, memory='minimal') as file:
    data = deepcopy(
        file.extract_bus_logging(
            database_files={'CAN': [(data_base, can_channel)]}
        )
    )

Side note: executing deepcopy() on the data in np.ndarray format works. That, however, does not fix the memory leak.

with MDF(name=fd, memory='minimal') as file:
    data = deepcopy(
        file.extract_bus_logging(
            database_files={'CAN': [(data_base, can_channel)]}
        ).to_dataframe()[['sig_1', 'sig_2', 'sig_n']].to_numpy(dtype=float)
    )

@danielhrisca (Owner)

Is this better?

import numpy as np
from asammdf import MDF


class Container:
    def __init__(self):
        self.data = np.full((1,), np.nan)

    # not a @staticmethod as in the real code some other actions are performed
    def get_data(self, fd):
        with MDF(name=fd, memory='minimal') as file:
            data_mdf = file.extract_bus_logging(
                database_files={'CAN': [(data_base, can_channel)]}
            )
            data = data_mdf.to_dataframe()[['sig_1', 'sig_2', 'sig_n']].to_numpy(dtype=float)
            # explicitly close the MDF returned by extract_bus_logging()
            data_mdf.close()
        return data


fds = ['file_0.mf4', 'file_1.mf4', 'file_n.mf4']
obj = Container()

for f in fds:
    if np.isnan(obj.data).any():
        obj.data = obj.get_data(f)
    else:
        obj.data = np.append(
            arr=obj.data,
            values=obj.get_data(f),
            axis=0
        )

@paweller

Unfortunately, assigning the numpy data to a new variable and closing the data_mdf does not make a difference.


paweller commented May 3, 2024

In a desperate attempt to bypass the memory leak I tried wrapping the extract_bus_logging() call into a standalone .exe. Unfortunately, that throws the error shown in the third code block. Any idea why that might be?

# extract_bus_logging_wrapper.py
# Turned into a standalone executable with: pyinstaller --onefile extract_bus_logging_wrapper.py
# (pyinstaller v6.6.0)
import argparse
import pickle
import sys

from asammdf import MDF


def arg_parser():
    parser = argparse.ArgumentParser()
    for arg in ['fp', 'bus', 'db', 'ch']:
        parser.add_argument(arg, type=str)
    args = parser.parse_args()
    return args.fp, args.bus, args.db, args.ch


def extract_bus_logging_wrapper(_fp, _bus, _db, _ch):
    _db_dict = {_bus: [(_db, int(_ch))]}
    return pickle.dumps(MDF(_fp).extract_bus_logging(_db_dict).to_dataframe())


if __name__ == '__main__':
    fp, bus, db, ch = arg_parser()
    sys.stdout.buffer.write(extract_bus_logging_wrapper(fp, bus, db, ch))
# main.py
import pickle
import subprocess

exe_path = 'extract_bus_logging_wrapper.exe'
try:
    result = pickle.loads(
        subprocess.check_output(
            args=[exe_path, 'log.mf4', 'CAN', 'data_base.dbc', '2']
        )
    )
except subprocess.CalledProcessError as e:
    print(f'Error running {exe_path}: {e}')
Traceback (most recent call last):
  File "extract_bus_logging_wrapper.py", line 45, in <module>
    sys.stdout.buffer.write(extract_bus_logging_wrapper(fp, bus, db, ch))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "extract_bus_logging_wrapper.py", line 38, in extract_bus_logging_wrapper
    MDF(_fp).extract_bus_logging(_db_dict).to_dataframe()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "asammdf\mdf.py", line 4680, in extract_bus_logging
  File "asammdf\mdf.py", line 4723, in _extract_can_logging
  File "asammdf\blocks\utils.py", line 1763, in load_can_database
  File "canmatrix\formats\__init__.py", line 71, in loadp
  File "canmatrix\formats\__init__.py", line 86, in load
KeyError: 'canmatrix.formats.dbc'

Note: Calling the extract_bus_logging_wrapper() from within Python works just fine.

result = pickle.loads(extract_bus_logging_wrapper('log.mf4', 'CAN', 'data_base.dbc', '2'))
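For what it's worth, the KeyError suggests a plausible cause (an editor's guess, not confirmed in the thread): canmatrix loads format handlers such as canmatrix.formats.dbc dynamically via importlib, which PyInstaller's static analysis cannot see, so the module never makes it into the bundle. Two hedged fixes to try:

# Option 1 (hypothetical fix): force the dynamically loaded format module
# into the bundle by importing it explicitly in extract_bus_logging_wrapper.py
import canmatrix.formats.dbc  # noqa: F401

# Option 2 (hypothetical fix): declare it as a hidden import when building:
#   pyinstaller --onefile --hidden-import canmatrix.formats.dbc extract_bus_logging_wrapper.py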

@danielhrisca (Owner)

Why do you think you would get a different result with PyInstaller? It's still Python running.


paweller commented May 3, 2024

I hoped that the .exe gets terminated for good once it is done executing, just like manually terminating the execution in the IDE, which does release the blocked memory. I am not that deep into the whole PyInstaller game, though.


eXezor commented Aug 2, 2024

I am running into the same issue, even when trying to explicitly close the files after being done with them. Each iteration uses about 200 MB more RAM when using mdf.extract_bus_logging(). tracemalloc points at line 7652 in mdf_v4.py, which is this snippet that calls directly into cutils:

vals = extract(signal_data, 1, vals - vals[0])

I tried deleting that variable after it is returned and no longer used, but the memory was still not freed.
When using mdf.select() and working with only a single channel of the same file, it uses about 30 MB of RAM per iteration.
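For comparison, a minimal sketch of the select()-based access the comment refers to (file and channel names hypothetical):

# Sketch of single-channel access via MDF.select(); select() returns a list
# of Signal objects for the requested channel names.
from asammdf import MDF

with MDF('log.mf4') as mdf:
    (signal,) = mdf.select(['sig_1'])
    samples = signal.samples        # numpy array of values
    timestamps = signal.timestamps  # numpy array of time stamps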
