azure-storage-blob >=v12 causes slow, high memory dd.read_csv #57

Closed
mgsnuno opened this issue May 20, 2020 · 19 comments

@mgsnuno

mgsnuno commented May 20, 2020

dd.read_csv(path, storage_options) on a 15Gb csv file

  • with adlfs==0.2.4: normal speed and low memory usage
  • with adlfs==0.3: much slower and high memory usage

I think it is related to azure-storage-blob >=v12, and I thought it would be important for you to be aware of it. With read_parquet I didn't find issues.

Any idea? What is the best way to report this upstream (to Azure)?
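
For reference, a minimal sketch of the call in question (the container path, account name, and key are placeholders, not the actual setup):

import dask.dataframe as dd

storage_options = {
    "account_name": "myaccount",  # placeholder
    "account_key": "mykey",  # placeholder
}

# reads the CSV from Azure Blob Storage through adlfs; reported as fast and
# memory-friendly with adlfs==0.2.4, slow and memory-hungry with adlfs>=0.3
df = dd.read_csv("abfs://mycontainer/big_file.csv", storage_options=storage_options)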

@hayesgb
Collaborator

hayesgb commented May 20, 2020

Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?

@mgsnuno
Author

mgsnuno commented May 20, 2020

Yes sure:

  • create a big csv file in azure storage blob
  • use dask.distributed/dask_kubernetes to create a cluster
  • dd.read_csv(path, storage_options).persist() that file

Looking at the dask.distributed dashboard, you should be able to see how much longer it takes to read the file. Going through the dask.distributed Profile tab, you can see that an unusual amount of time is spent in an operation that is part of Python's core ssl module.

If you want me to describe how I created a dask_kubernetes cluster in Azure, I can expand on that.
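
A hedged sketch of those steps, substituting a local dask.distributed cluster for the dask_kubernetes one and using placeholder paths and credentials:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local stand-in for the dask_kubernetes cluster on Azure

storage_options = {"account_name": "myaccount", "account_key": "mykey"}  # placeholders
df = dd.read_csv("abfs://mycontainer/big_file.csv", storage_options=storage_options)
df = df.persist()  # watch task durations and worker memory in the dashboard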

@andersbogsnes
Contributor

Seems like it is related to this issue:
Azure/azure-sdk-for-python#9596

@mgsnuno
Author

mgsnuno commented Jul 23, 2020

Closing for now; I stopped having the issue using the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.

@mgsnuno mgsnuno closed this as completed Jul 23, 2020
@mgsnuno
Author

mgsnuno commented Jul 30, 2020

The issue is still there; I was just getting lucky with one specific file.

@mgsnuno
Author

mgsnuno commented May 4, 2021

@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage: writing a 200 MB dataframe with multiple partitions can take upwards of 20 GB of RAM.
I'm not sure why azure-storage-blob >=12 causes memory usage to increase so much.
Any idea?

@mgsnuno
Author

mgsnuno commented Jun 3, 2021

@hayesgb what can I do to help you debug this issue?
It still exists with the latest version of adlfs: with adlfs>0.3.3, writing to an Azure Blob container with adlfs incurs much higher memory usage (50-100% more).

@anders-kiaer
Contributor

@mgsnuno I have seen a similar problem to yours (although in my case it was only a read operation, and of a .parquet file; the performance regression when going to >=0.3 was 10x+ in runtime).

what can I do to help you debug this issue?

In my case, I got the same performance also for adlfs>=0.3 (and therefore also with recent azure-storage-blob versions) by simply changing the length argument in this line https://github.com/dask/adlfs/blob/2bfdb02d13d14c0787e769c6686fecd2e3861a4b/adlfs/spec.py#L1785 from end to (end - start). See also #247.
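
To make the effect of that one-line change concrete, here is a paraphrase of the ranged-read logic (not the exact adlfs source; container_client is assumed to be an azure.storage.blob.aio.ContainerClient created elsewhere):

async def fetch_range(container_client, blob_name: str, start: int, end: int) -> bytes:
    blob_client = container_client.get_blob_client(blob_name)
    # download_blob takes an offset and a *length*, not an end position, so
    # passing `end` (the pre-#247 behaviour) over-reads by `start` bytes on
    # every ranged request; `end - start` fetches exactly the requested range.
    downloader = await blob_client.download_blob(offset=start, length=end - start)
    return await downloader.readall()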

@hayesgb
Collaborator

hayesgb commented Jun 14, 2021

@mgsnuno -- #247 has been merged into master and is included in release 0.7.7. It would be great to get your feedback on whether it fixes this issue.

@mgsnuno
Author

mgsnuno commented Jun 15, 2021

@hayesgb I tested 0.7.7 and it gives me the same issue. With something like the code below, adlfs>=0.3.3 uses 2x-3x more memory, although it is much faster (it looks like it now runs in parallel, where before it did not). Maybe that parallelism explains the high memory usage.
I'm not being overly picky about the memory; the problem is that on a 32 GB RAM VM, where I run dask to read, parse and write multiple parquet files to abfs in parallel, I previously had no problems (15 GB would have been more than enough), but now I get killed-worker errors as memory quickly reaches the 32 GB limit.
Is there a way to limit the parallelism of adlfs with asyncio, making it slower but less memory hungry? (See the sketch after the snippet below.)

import dask.datasets
from adlfs import AzureBlobFileSystem

# create a dummy dataset and write it locally as CSV partitions
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload the data; get_credentials() is a project-specific helper that
# returns the storage account name and key
credentials = get_credentials()
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
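
On limiting parallelism, a hedged sketch only: fsspec's async filesystems accept a batch_size option that caps how many coroutines run at once; whether AzureBlobFileSystem forwards it to the put upload path is an assumption here, not a confirmed adlfs feature.

from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(
    account_name="myaccount",  # placeholder
    account_key="mykey",  # placeholder
    batch_size=8,  # assumption: passed through to fsspec's AsyncFileSystem to cap concurrency
)
abfs.put("../dataset/", "testadlfs/", recursive=True)  # slower, but fewer uploads in flight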

@mgsnuno
Author

mgsnuno commented Jun 22, 2021

@hayesgb any pointers you can share? Thank you

@hayesgb
Collaborator

hayesgb commented Jun 23, 2021

@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put method, but not when writing with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options).

Is this consistent with your issue?

@mgsnuno
Author

mgsnuno commented Jun 28, 2021

@hayesgb very good point, and yes, that is consistent with what I see: put has high memory usage, to_csv(remote_folder...) does not.
But if you do to_csv(local_folder...), it takes time for dask to write those text files; could that time be masking the issue that the put method highlights on its own?

@hayesgb
Collaborator

hayesgb commented Jun 28, 2021

When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly.

Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens asynchronously, which may be the cause of the high memory usage.
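
As an illustration only, a workaround sketch (not the fix under investigation), assuming the peak scales with how many files the recursive put() streams concurrently; paths and credentials are placeholders:

import os

from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="myaccount", account_key="mykey")  # placeholders

local_dir = "../dataset"
for name in sorted(os.listdir(local_dir)):
    # upload one file at a time instead of abfs.put(local_dir, ..., recursive=True),
    # so only one file's buffer is in flight at any moment
    abfs.put(os.path.join(local_dir, name), f"testadlfs/{name}")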

@mgsnuno
Author

mgsnuno commented Jul 14, 2021

@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://..."), with the latest adlfs/fsspec versus adlfs==0.3.3/fsspec==0.7.4 the memory usage is much higher; I get KilledWorker errors that I never had with those pinned versions.

Unfortunately I cannot share those tables, but maybe you can run some tests with the dummy data you usually use and replicate it.

I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options); is az the same as abfs?

@hayesgb
Collaborator

hayesgb commented Jul 14, 2021 via email

@mgsnuno
Author

mgsnuno commented Aug 2, 2021

@hayesgb did you manage to reproduce the high memory usage with to_parquet("abfs://...") as well?

@hayesgb
Collaborator

hayesgb commented Aug 3, 2021

No, I did not. We actually use to_parquet regularly in our workloads without issue.

@mgsnuno
Author

mgsnuno commented Apr 12, 2022

All good now with:

  • adlfs 2022.2.0
  • fsspec 2022.3.0
  • dask 2022.4.0
  • azure-storage-blob 12.11.0

@mgsnuno mgsnuno closed this as completed Apr 12, 2022