abfs produces corrupted files when data is big enough #33
Comments
Looking at the implementation differences between …
I'm not sure. I've moved to my own historic storage lib since then, so I can't really test whether the bug is still present. I'm okay with closing the issue.
I integrated @AlbertDeFusco’s PR into master, then attempted to integrate #37, which caused a failure of the second-to-last test (`fs.rm`), so I reverted the change with the intent to follow up, but I haven’t been able to circle back yet.
I just created a new PR #38 to resolve the issues. Tests should pass now.
Integrated into the 0.2.0 release.
Hi,
Thanks again for working on the Azure integration.
I've been hitting a tricky `adlfs`/`abfs` bug in a Dask data-extraction pipeline. I have a bag containing JSON data split into several partitions spread over several workers in a Dask cluster hosted on Kubernetes via dask-kubernetes. The whole dataset contains roughly 1-10 million items. When the partition size exceeds a certain threshold (in my case more than `150_000` items), the export to text files produces corrupted `jsonl.gz` compressed files: the resulting gzipped files are invalid and can't be read.
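To check a suspect blob, something along these lines can be used (a minimal sketch; the account, key, container, and path below are placeholders, not values from this issue):

```python
import gzip

import fsspec

# Placeholder credentials and path -- not the real values from this issue.
storage_options = {"account_name": "myaccount", "account_key": "..."}

# Open the blob through adlfs (registered under the "abfs" protocol) and try to
# decompress it; a corrupted or truncated upload raises BadGzipFile or EOFError.
with fsspec.open("abfs://mycontainer/export/2.jsonl.gz", "rb", **storage_options) as f:
    try:
        payload = gzip.decompress(f.read())
        print(f"valid gzip, {len(payload)} bytes after decompression")
    except (OSError, EOFError) as exc:  # gzip.BadGzipFile is a subclass of OSError
        print(f"corrupted gzip stream: {exc}")
```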
I've been unable to reproduce this issue on a local machine using the distributed client, but I'm quite sure the issue comes from `adlfs`, because exporting to Amazon S3 using `s3fs` produces clean files. I've been trying to reproduce the issue on a distributed cluster in a simple, reproducible way that I could share with you, but I couldn't. I think it's my specific pipeline's steps that result in a state where `adlfs` fails to export the data properly.
Each item in the bag looks like this:
The partitions are not evenly sized; here is an example of their sizes:
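The listing itself isn't reproduced here, but per-partition item counts can be obtained like this (a sketch on a toy bag, since the real data isn't shown):

```python
import dask.bag as db

# Toy bag standing in for the real one; the actual data isn't shown in this issue.
bag = db.from_sequence(range(10), npartitions=3)

# Wrap each partition's count in a single-element list so compute() returns
# one item count per partition.
partition_sizes = bag.map_partitions(lambda part: [len(list(part))]).compute()
print(partition_sizes)  # one count per partition
```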
In this example, partitions `2, 5, 8` result in corrupted files. I export the data using:
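The actual snippet isn't reproduced in this excerpt, but the export is roughly of this shape (a sketch; the account, container, and records below are placeholders):

```python
import json

import dask.bag as db

# Placeholders: the real account, container, and records are not part of this issue.
storage_options = {"account_name": "myaccount", "account_key": "..."}
records = [{"id": i, "payload": "..."} for i in range(200_000)]

bag = db.from_sequence(records, npartitions=4)

# Serialize each item to one JSON line and write gzip-compressed text files to
# Azure Blob Storage through adlfs; dask infers gzip compression from the .gz suffix.
bag.map(json.dumps).to_textfiles(
    "abfs://mycontainer/export/*.jsonl.gz",
    storage_options=storage_options,
)
```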
For the moment I'm working around the problem by using more, smaller partitions (see the sketch below), but because getting balanced partitions efficiently is tricky in Dask, the situation is not really ideal.
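The workaround amounts to something like this (again a sketch on a toy bag; the target partition count is illustrative):

```python
import dask.bag as db

# Toy bag standing in for the real one.
bag = db.from_sequence(range(1_000_000), npartitions=4)

# Split into more, smaller partitions before writing, so that no partition
# exceeds the size at which the corrupted uploads start (~150_000 items here).
bag = bag.repartition(npartitions=10)
```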
I'm using:
Each worker in the Dask cluster has: