
abfs produces corrupted files when data is big enough #33

Closed
Timost opened this issue Jan 10, 2020 · 6 comments

Timost commented Jan 10, 2020

Hi,
Thanks again for working on the Azure integration.

I've been hitting a tricky adlfs/abfs bug in a dask data-extraction pipeline.

I have a bag containing JSON data, split into several partitions spread over several workers in a dask cluster hosted on Kubernetes using dask-kubernetes. The whole dataset contains ~1-10 million items. When the partition size exceeds a certain threshold (in my case more than 150_000 items), exporting to text files produces corrupted files:

  • If I export to plain jsonl files, the files contain only a subset of the partition's data and they start with an empty line.
  • If I export to jsonl.gz compressed files, the resulting gzipped files are invalid and can't be read (see the sketch just below this list).
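
For reference, here is a minimal sketch of one way I check this; the container path and credentials are placeholders, not my real ones:

    import gzip
    import zlib
    import fsspec  # adlfs registers the "abfs" protocol with fsspec

    STORAGE_OPTIONS = {"account_name": "<account>", "account_key": "<key>"}  # placeholder

    # Read one exported part back and try to decompress it; a truncated
    # or otherwise corrupted gzip stream raises an error here.
    with fsspec.open("abfs://storage/data/data_part_2.jsonl.gz", mode="rb",
                     **STORAGE_OPTIONS) as f:
        try:
            lines = gzip.decompress(f.read()).splitlines()
            print("decompressed %d lines" % len(lines))
        except (OSError, EOFError, zlib.error) as exc:
            print("corrupted gzip stream:", exc)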

I've been unable to reproduce this issue on a local machine using the distributed client, but I'm quite sure the issue comes from adlfs, because exporting to Amazon S3 using s3fs produces clean files.

I've been trying to reproduce the issue on a distributed cluster in a simple, shareable way, but I couldn't trigger the problem. I think it's my specific pipeline's steps that put the bag in a state where adlfs fails to export the data properly.
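
For what it's worth, here is a stripped-down sketch of the kind of reproduction I attempted; the record contents, sizes and paths are illustrative, not my actual pipeline:

    import json
    import dask.bag as db

    STORAGE_OPTIONS = {"account_name": "<account>", "account_key": "<key>"}  # placeholder

    # A single oversized partition (> 150_000 items) of small JSON records,
    # exported the same way as in the real pipeline.
    records = [{"_id": i, "value": "x" * 100} for i in range(200_000)]
    bag = db.from_sequence(records, npartitions=1)

    bag.map(json.dumps).to_textfiles(
        "abfs://storage/repro/part_*.jsonl.gz",
        storage_options=STORAGE_OPTIONS,
        last_endline=True,
    )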

Each item in the bag looks like this:

{
    "geohash_aggregation_precision": 8,
    "location": {
        "type": "Polygon",
        "coordinates": [[
            [129.12368774414062, 35.201568603515625],
            [129.12403106689453, 35.201568603515625],
            [129.12403106689453, 35.20174026489258],
            [129.12368774414062, 35.20174026489258],
            [129.12368774414062, 35.201568603515625]
        ]]
    },
    "location_centroid": {"lat": 35.2016544342041, "lon": 129.12385940551758},
    "short_events_count": 1,
    "long_events_count": 0,
    "total_events_count": 0,
    "utc_datetime": "2019-11-01 00:00:00.000000",
    "local_datetime": "2019-11-01 09:00:00.000000",
    "_id": "KR-wy7b6230-2019-11-01 00:00:00.000000"
}

The partitions are not evenly sized; here is an example of their sizes (a sketch for inspecting them follows the list):

[(0, 82302),
 (1, 59934),
 (2, 230304),
 (3, 46914),
 (4, 114548),
 (5, 140497),
 (6, 69740),
 (7, 80690),
 (8, 134518),
 (9, 99581),
(10, 59099),
(11, 106069)]
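
For reference, a rough sketch of one way to get such per-partition counts, assuming the bag is named final_bag as in the export snippet below:

    # Rough sketch: count the items in each partition of the bag.
    sizes = final_bag.map_partitions(lambda part: [sum(1 for _ in part)]).compute()
    print(list(enumerate(sizes)))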

In this example, partitions 2, 5, and 8 result in corrupted files.
I export the data using:

    final_bag.map(json.dumps).to_textfiles(
        'abfs://storage/data/data_part_*.jsonl.gz',
        storage_options=STORAGE_OPTIONS,
        last_endline=True
    )

For the moment I work around the problem by using more, smaller partitions, but because getting balanced partitions efficiently is tricky in dask, the situation is not ideal.
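
One way to do this is Bag.repartition; the target partition count below is illustrative:

    # Split the bag into more, smaller partitions before exporting, so no
    # single partition exceeds the ~150_000-item threshold.
    smaller_bag = final_bag.repartition(npartitions=64)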

I'm using:

  • python 3.6.9
  • dask 2.9.1
  • dask_kubernetes master
  • adlfs 0.1.5

Each worker in the dask cluster has:

  • 1 vcpu
  • 7 GiB of RAM
Timost changed the title from "abfs produces corrupted gzipped files when data is big enough" to "abfs produces corrupted files when data is big enough" on Jan 10, 2020
Timost (Author) commented Jan 13, 2020

Looking at the implementation differences between s3fs and adlfs, I suspect this is due to mishandling of data that is written in several chunks. I'll try to dig deeper later.
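
To illustrate what I mean (this is a toy example, not adlfs code): fsspec-style buffered writers accumulate data in memory and append it to the remote object one block at a time, so if a block is skipped, written out of order, or the final partial buffer isn't flushed on close, the output comes out truncated, which would explain both the missing lines and the unreadable gzip streams.

    import io

    class BufferedWriter:
        """Toy stand-in for an fsspec-style buffered remote file."""

        def __init__(self, sink, blocksize=4 * 2**20):
            self.sink = sink            # stands in for the remote append target
            self.blocksize = blocksize
            self.buffer = io.BytesIO()

        def write(self, data):
            self.buffer.write(data)
            if self.buffer.tell() >= self.blocksize:
                self._flush_block()

        def _flush_block(self):
            # Every buffered block must be appended to the sink, in order;
            # skipping or reordering one truncates/corrupts the output.
            self.sink.extend(self.buffer.getvalue())
            self.buffer = io.BytesIO()

        def close(self):
            # The final, partial block must be flushed too.
            self._flush_block()

    sink = bytearray()
    w = BufferedWriter(sink, blocksize=10)
    w.write(b"0123456789")   # fills one block -> appended immediately
    w.write(b"abcdef")       # remainder stays in the buffer...
    w.close()                # ...and must be flushed on close
    assert bytes(sink) == b"0123456789abcdef"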

cjalmeida (Contributor) commented Feb 7, 2020

Update on this issue: it was fixed by PR #37; however, that was reverted in commit 7bca626 by @hayesgb.

Any particular issue with that PR that needs fixing?

Edit: I see it's not playing well with the changes from #35. I'll fix it and recreate the PR shortly.

Timost (Author) commented Feb 7, 2020

I'm not sure. I've since moved to my own, pre-existing storage lib, so I can't really test whether the bug is still present. I'm okay with closing the issue.

hayesgb (Collaborator) commented Feb 7, 2020 via email

cjalmeida (Contributor) commented

I just created a new PR #38 to resolve the issues. Tests should pass now.

hayesgb (Collaborator) commented Feb 15, 2020

Integrated into the 0.2.0 release.

hayesgb closed this as completed on Feb 15, 2020.