azure-storage-blob >=v12 causes slow, high memory dd.read_csv #57

Closed
mgsnuno opened this issue May 20, 2020 · 19 comments

@mgsnuno

mgsnuno commented May 20, 2020

dd.read_csv(path, storage_options) on a 15Gb csv file

  • with adlfs==0.2.4: normal speed and low memory usage
  • with adlfs==0.3: much slower and high memory usage

I think it is related to azure-storage-blob >=v12, and I thought it would be important for you to be aware of it. With read_parquet I didn't find issues.

Any idea? What is the best way to report this upstream (to Azure)?
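
For reference, a minimal sketch of the call in question (the container path, account name, and key are placeholders, not the actual setup):

import dask.dataframe as dd

storage_options = {
    "account_name": "myaccount",  # placeholder
    "account_key": "mykey",  # placeholder
}

# reads the CSV from Azure Blob Storage through adlfs; reported as fast and
# memory-friendly with adlfs==0.2.4, slow and memory-hungry with adlfs>=0.3
df = dd.read_csv("abfs://mycontainer/big_file.csv", storage_options=storage_options)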

@hayesgb
Collaborator

hayesgb commented May 20, 2020

Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?

@mgsnuno
Author

mgsnuno commented May 20, 2020

Yes sure:

  • create a big csv file in azure storage blob
  • use dask.distributed/dask_kubernetes to create a cluster
  • dd.read_csv(path, storage_options).persist() that file

Looking at the dask.distributed dashboard, you should be able to see how much longer it takes to read the file. Going through the dask.distributed Profile tab, you can see that an unusual amount of time is spent in an operation that is part of Python's core ssl module.

If you want me to describe how I created a dask_kubernetes cluster in Azure, I can expand on that.
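
A hedged sketch of those steps, substituting a local dask.distributed cluster for the dask_kubernetes one and using placeholder paths and credentials:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local stand-in for the dask_kubernetes cluster on Azure

storage_options = {"account_name": "myaccount", "account_key": "mykey"}  # placeholders
df = dd.read_csv("abfs://mycontainer/big_file.csv", storage_options=storage_options)
df = df.persist()  # watch task durations and worker memory in the dashboard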

@andersbogsnes
Contributor

Seems like it is related to this issue:
Azure/azure-sdk-for-python#9596

@mgsnuno
Author

mgsnuno commented Jul 23, 2020

Closing for now; I stopped having the issue using the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.

@mgsnuno mgsnuno closed this as completed Jul 23, 2020
@mgsnuno
Author

mgsnuno commented Jul 30, 2020

The issue is still there; I was just getting lucky with one specific file.

@mgsnuno
Author

mgsnuno commented May 4, 2021

@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage: writing a 200 MB dataframe with multiple partitions can take upwards of 20 GB of RAM.
I'm not sure why azure-storage-blob >=12 causes memory usage to increase so much.
Any idea?

@mgsnuno
Author

mgsnuno commented Jun 3, 2021

@hayesgb what can I do to help you debug this issue?
It still exists with the latest version of adlfs: with adlfs>0.3.3, writing to an Azure Blob container with adlfs incurs much higher memory usage (50-100% more).

@anders-kiaer
Contributor

@mgsnuno I have seen a similar problem to yours (although in my case it was only a read operation, and of a .parquet file; the performance regression when going to >=0.3 was 10x+ in runtime).

what can I do to help you debug this issue?

In my case, I got the same performance also for adlfs>=0.3 (and therefore also with recent azure-storage-blob versions) by simply changing the length argument in this line https://github.com/dask/adlfs/blob/2bfdb02d13d14c0787e769c6686fecd2e3861a4b/adlfs/spec.py#L1785 from end to (end - start). See also #247.
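
To make the effect of that one-line change concrete, here is a paraphrase of the ranged-read logic (not the exact adlfs source; container_client is assumed to be an azure.storage.blob.aio.ContainerClient created elsewhere):

async def fetch_range(container_client, blob_name: str, start: int, end: int) -> bytes:
    blob_client = container_client.get_blob_client(blob_name)
    # download_blob takes an offset and a *length*, not an end position, so
    # passing `end` (the pre-#247 behaviour) over-reads by `start` bytes on
    # every ranged request; `end - start` fetches exactly the requested range.
    downloader = await blob_client.download_blob(offset=start, length=end - start)
    return await downloader.readall()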

@hayesgb
Collaborator

hayesgb commented Jun 14, 2021

@mgsnuno -- #247 has been merged into master and is included in release 0.7.7. It would be great to get your feedback on whether it fixes this issue.

@mgsnuno
Author

mgsnuno commented Jun 15, 2021

@hayesgb I tested 0.7.7 and it gives me the same issue. With something like the code below, adlfs>=0.3.3 uses 2x-3x more memory, although it is much faster (it looks like it now runs in parallel, where before it did not). Maybe that parallelism explains the high memory usage.
I'm not being overly picky about the memory; the problem is that on a 32 GB RAM VM, where I run dask to read, parse and write multiple parquet files to abfs in parallel, I previously had no problems (15 GB would have been more than enough), but now I get killed-worker errors as memory quickly reaches the 32 GB limit.
Is there a way to limit the parallelism of adlfs with asyncio, making it slower but less memory hungry? (See the sketch after the snippet below.)

import dask.datasets
from adlfs import AzureBlobFileSystem

# create a dummy dataset and write it locally as CSV partitions
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload the data; get_credentials() is a project-specific helper that
# returns the storage account name and key
credentials = get_credentials()
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
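
On limiting parallelism, a hedged sketch only: fsspec's async filesystems accept a batch_size option that caps how many coroutines run at once; whether AzureBlobFileSystem forwards it to the put upload path is an assumption here, not a confirmed adlfs feature.

from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(
    account_name="myaccount",  # placeholder
    account_key="mykey",  # placeholder
    batch_size=8,  # assumption: passed through to fsspec's AsyncFileSystem to cap concurrency
)
abfs.put("../dataset/", "testadlfs/", recursive=True)  # slower, but fewer uploads in flight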

@mgsnuno
Author

mgsnuno commented Jun 22, 2021

@hayesgb any pointers you can share? Thank you

@hayesgb
Collaborator

hayesgb commented Jun 23, 2021

@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put method, but not when writing with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options).

Is this consistent with your issue?

@mgsnuno
Author

mgsnuno commented Jun 28, 2021

@hayesgb very good point, and yes, that is consistent with what I see: put has high memory usage, to_csv(remote_folder...) does not.
But if you do to_csv(local_folder...), it takes time for dask to write those text files; could that time be masking the issue that the put method highlights on its own?

@hayesgb
Collaborator

hayesgb commented Jun 28, 2021

When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly.

Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens asynchronously, which may be the cause of the high memory usage.
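
As an illustration only, a workaround sketch (not the fix under investigation), assuming the peak scales with how many files the recursive put() streams concurrently; paths and credentials are placeholders:

import os

from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="myaccount", account_key="mykey")  # placeholders

local_dir = "../dataset"
for name in sorted(os.listdir(local_dir)):
    # upload one file at a time instead of abfs.put(local_dir, ..., recursive=True),
    # so only one file's buffer is in flight at any moment
    abfs.put(os.path.join(local_dir, name), f"testadlfs/{name}")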

@mgsnuno
Author

mgsnuno commented Jul 14, 2021

@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://..."), with the latest adlfs/fsspec versus adlfs==0.3.3/fsspec==0.7.4 the memory usage is much higher; I get KilledWorker errors that I never had with those pinned versions.

Unfortunately I cannot share those tables, but maybe you can run some tests with the dummy data you usually use and replicate it.

I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options); is az the same as abfs?

@hayesgb
Collaborator

hayesgb commented Jul 14, 2021 via email

@mgsnuno
Author

mgsnuno commented Aug 2, 2021

@hayesgb did you manage to reproduce the high memory usage with to_parquet("abfs://...") as well?

@hayesgb
Collaborator

hayesgb commented Aug 3, 2021

No, I did not. We actually use to_parquet regularly in our workloads without issue.

@mgsnuno
Author

mgsnuno commented Apr 12, 2022

All good now with:

  • adlfs 2022.2.0
  • fsspec 2022.3.0
  • dask 2022.4.0
  • azure-storage-blob 12.11.0

@mgsnuno mgsnuno closed this as completed Apr 12, 2022