
Memory (leak) aggregation after multiple runs with .compute() #2464

Closed
den4uk opened this issue Jan 15, 2019 · 4 comments

Comments

@den4uk

den4uk commented Jan 15, 2019

When dask.compute() or client.compute() is executed, memory does not get released afterwards. I have dask running inside a Celery worker that keeps executing tasks, and the worker's memory consumption keeps growing; dask does not seem to release some of the memory it has used.

I tried using pympler, tracemalloc and memory_profiler, but none of them points to any lingering data objects. The growth is only visible in process-level monitoring tools such as htop.
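For context, this is roughly the kind of tracemalloc comparison I tried against dask_task() from the repro below (a simplified sketch, not my exact code):

# Simplified sketch of the tracemalloc comparison I tried:
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
print(dask_task())   # dask_task() is defined in the repro below
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)      # nothing here accounts for the growth seen in htop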

To reproduce the memory leak, here is an example I compiled:

import gc
import os.path
import dask
import dask.distributed
import dask.dataframe

gc.enable()
INPUT = 'test.csv'

def make_input(count=1000000):
    if not os.path.exists(INPUT):
        import pandas as pd
        import numpy as np
        pdf = pd.DataFrame(np.random.rand(count, 2), columns=['a', 'b'])
        pdf.to_csv(INPUT)
        del pd, np

# Setup dask instance
cluster = dask.distributed.LocalCluster(processes=False, silence_logs=False)
client = dask.distributed.Client(cluster)

def dask_task():
    df = dask.dataframe.read_csv(INPUT)
    x = df['a'].sum()
    y = df['b'].mean()
    res = client.compute([x, y])
    return [r.result() for r in res]

def clean_dask():
    client.retire_workers()
    client.close()
    cluster.close()
    gc.collect()


make_input()
times = input('>>> Run a dask task (n=1) [ENTER]: ')
for _ in range(int(times) if times else 1):
    print(dask_task())
input('>>> Clean up dask client/cluster [ENTER]: ')
clean_dask()
input('>>> Exit [ENTER]')  # Do not close the execution, review memory usage

# Python 3.6.7
# requirements:
# numpy==1.14.3 pandas==0.22.0 dask[dataframe,distributed]==1.0.0
# Also, same behaviour observed with these:
# numpy==1.16.0 pandas==0.23.4 dask[dataframe,distributed]==1.0.0

To break this down:

  • make_input() creates the input file for you;
  • I suggest running dask_task() several times (5-10); memory use increases with each iteration (see the sketch after this list for one way to log it);
  • clean_dask() closes the workers, the client and the local scheduler, and tries to free the memory with gc.collect().
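One way to log memory per iteration while driving the repro (a rough sketch; it assumes psutil is available — the numbers below were read from htop, not from this snippet):

import os
import psutil  # assumption: psutil is installed; used only to read the process RSS

proc = psutil.Process(os.getpid())

for i in range(10):
    print(dask_task())
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f'RSS after run {i + 1}: {rss_mb:.0f}MB')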

Here are my observations of memory usage:

Dask client and cluster set up: 91MB

dask_task() executed 1 time: 122MB
clean_dask() ran: 122MB

dask_task() executed 5 times: 198MB
clean_dask() ran: 198MB

Re-set, dask_task() executed 20 times: 237MB
clean_dask() ran: 237MB

Re-set, dask_task() executed 50 times: 332MB
clean_dask() ran: 233MB

Re-set, dask_task() executed 100 times: 492MB
clean_dask() ran: 492MB

Re-set, dask_task() executed 200 times: 497MB
clean_dask() ran: 497MB

Re-set, dask_task() executed 500 times: 610MB
clean_dask() ran: 574MB

Re-set, dask_task() executed 1,000 times: 624MB
clean_dask() ran: 570MB

Closing the client/cluster only cleared the memory on some occasions, but not in all cases. The overall trend is that once a certain level is reached, memory does not keep growing as much.

Desired behaviour

Once a dask execution has completed, I would like the memory it used to be released so it is available to other processes. Dask runs tasks inside a Celery worker, inside a Docker container, and restarting workers/containers is obviously an undesirable way of managing memory consumption.

I do not strictly need a LocalCluster or a Client. Setting them up does, however, give me the option to call .close() on them explicitly, which frees up somewhat more memory than running dask without them (sketch below).
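For clarity, this is the explicit set-up/tear-down pattern I mean (a sketch only; both LocalCluster and Client can be used as context managers, which calls .close() for us, but spinning a cluster up and down per task is not what I want long term):

import dask.dataframe
import dask.distributed

def run_once(path='test.csv'):
    # Hypothetical helper: create and tear down the cluster/client per call.
    with dask.distributed.LocalCluster(processes=False) as cluster, \
         dask.distributed.Client(cluster) as client:
        df = dask.dataframe.read_csv(path)
        x = df['a'].sum()
        y = df['b'].mean()
        return client.gather(client.compute([x, y]))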

I would much appreciate your assistance or suggestions. Thank you.

@den4uk den4uk changed the title Memory (leak) aggregation when running Memory (leak) aggregation after multiple runs with .compute() Jan 15, 2019
@TomAugspurger
Member

Do you observe the memory leak only with read_csv? pandas-dev/pandas#19941 sounds somewhat similar.
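Something along these lines (an untested sketch) would tell us whether plain pandas.read_csv grows on its own, without dask involved:

# Untested sketch: repeat the same read with plain pandas (no dask) and
# watch in htop whether the process memory still grows.
import pandas as pd

def pandas_only_task(path='test.csv'):
    df = pd.read_csv(path)
    return df['a'].sum(), df['b'].mean()

for _ in range(100):
    pandas_only_task()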

@mrocklin
Member

I recommend trying this same workflow not with dask.dataframe, but with concurrent.futures. That would help us determine whether the problem is a dask problem or a pandas/numpy-with-threads problem.
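For example, something roughly like this (an untested sketch):

# Untested sketch: the same per-file work, scheduled with
# concurrent.futures threads instead of dask.dataframe.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def chunk_stats(path='test.csv'):
    df = pd.read_csv(path)
    return df['a'].sum(), df['b'].mean()

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(chunk_stats) for _ in range(100)]
    results = [f.result() for f in futures]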

@den4uk
Author

den4uk commented Jan 23, 2019

@TomAugspurger @mrocklin Thank you for the suggestions. I tried replicating the workflow with pure pandas, and also with concurrent.futures. When using concurrent.futures, memory also grew with repeated executions of the same task, as described in other related posts.

The fix for read_csv appears to have been merged, and I can't wait to try it with pandas 0.24!
I will come back and report.

@den4uk
Author

den4uk commented Jan 26, 2019

Just tried re-running this with pandas v0.24, and the memory leak is gone! Many thanks for the help.

@den4uk den4uk closed this as completed Jan 26, 2019