azure-storage-blob >=v12 causes slow, high memory dd.read_csv #57
Comments
Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?
Yes, sure:
Looking at the dask.distributed dashboard you should be able to see how much longer it takes to read the file. If you want me to explain how I created a dask_kubernetes cluster in Azure, I can expand on it.
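For reference, a minimal sketch of the kind of read being timed here, assuming a placeholder container/path and credentials rather than the actual setup; the dashboard link printed by the client is where the slowdown shows up:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()             # local cluster stand-in for the dask_kubernetes cluster
print(client.dashboard_link)  # open this to watch task durations and worker memory

storage_options = {
    "account_name": "<account-name>",  # placeholder credentials
    "account_key": "<account-key>",
}

# placeholder path; the report involves a ~15 GB CSV in Azure Blob storage
df = dd.read_csv("abfs://mycontainer/big.csv", storage_options=storage_options)
df = df.persist()  # triggers the read; compare wall time and worker memory across adlfs versions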
Seems like it is related to this issue:
Closing for now; I stopped having the issue using the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.
The issue is still there; I was just lucky with one specific file.
@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage, high meaning that writing a 200 MB dataframe, with multiple partitions, can take upwards of 20 GB of RAM.
@hayesgb what can I do to help you debug this issue?
@mgsnuno I have seen a similar problem to yours (however, in my case it was only a read operation, and of a
In my case, I got the same performance also for
@hayesgb I tested 0.7.7 and it gives me the same issue. With something like the code below, adlfs>=0.3.3 uses 2x-3x more memory but is much faster (it looks like it runs in parallel, where before it did not). Maybe the parallelism explains the high memory usage.

import dask
from adlfs import AzureBlobFileSystem
from pipelines.core import base as corebase

# create dummy dataset
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload data
credentials = get_credentials()
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
@hayesgb any pointers you can share? Thank you
@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put() method, but not when writing directly with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options). Is this consistent with your issue?
@hayesgb very good point, and yes, it is consistent with what I see.
When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly. Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens asynchronously, and may be the cause of the high memory usage.
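To make the comparison concrete, here is a hedged sketch of the two write paths under discussion (account, key, and container names are placeholders): writing partitions directly to Blob storage with to_csv versus writing locally and uploading the directory with put(), which is where the memory climb is reported.

import dask
from adlfs import AzureBlobFileSystem

storage_options = {"account_name": "<account-name>", "account_key": "<account-key>"}
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")

# Path 1: write partitions straight to Blob storage (low memory in the tests above)
df.to_csv("abfs://testadlfs/my_files/*.csv", storage_options=storage_options)

# Path 2: write locally, then upload the whole directory with put() (where memory rises)
df.to_csv("../dataset/")
abfs = AzureBlobFileSystem(**storage_options)
abfs.put("../dataset/", "testadlfs/", recursive=True)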
@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://..."), with the latest adlfs/fsspec vs adlfs==0.3.3/fsspec==0.7.4 the memory usage is much higher; I get KilledWorker errors that I never had with those pinned versions.
Unfortunately I cannot share those tables, but maybe you can run some tests with the dummy data you usually use and replicate it.
I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options); is az the same as abfs?
Yes, the “az” and “abfs” protocols are identical.
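A quick way to confirm this, assuming adlfs is installed, is to ask fsspec which class each protocol resolves to:

import fsspec

print(fsspec.get_filesystem_class("az"))    # adlfs.AzureBlobFileSystem
print(fsspec.get_filesystem_class("abfs"))  # adlfs.AzureBlobFileSystem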
@hayesgb did you manage to reproduce high memory usage with to_parquet?
No, I did not. We actually use to_parquet regularly in our workloads without issue.
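For comparison, a minimal sketch of the parquet path that reportedly behaves well (credentials and container are placeholders; requires pyarrow or fastparquet for the parquet engine):

import dask

storage_options = {"account_name": "<account-name>", "account_key": "<account-key>"}
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_parquet("abfs://testadlfs/my_table/", storage_options=storage_options)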
All good now with:
Running dd.read_csv(path, storage_options=storage_options) on a 15 GB CSV file:
with adlfs==0.2.4: normal speed and low memory usage
with adlfs==0.3: much slower and high memory usage
I think it is related to azure-storage-blob >=v12 and I thought it would be important for you to be aware of it. With read_parquet I didn't find issues.
Any ideas? How best to report this upstream (Azure)?
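One thing that helps with any upstream report, wherever it ends up, is attaching the exact library versions in play; a small way to collect them:

from importlib.metadata import version

for pkg in ("adlfs", "fsspec", "dask", "azure-storage-blob"):
    print(pkg, version(pkg))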