
abfs produces corrupted files when data is big enough #33

Closed
Timost opened this issue Jan 10, 2020 · 6 comments

Timost commented Jan 10, 2020

Hi,
Thanks again for working on the Azure integration.

I've been hitting a tricky adlfs/abfs bug in a dask data-extraction pipeline.

I have a bag containing JSON data, split into several partitions spread over several workers in a dask cluster hosted on Kubernetes using dask-kubernetes. The whole dataset contains ~1-10 million items. When the partition size exceeds a certain threshold (in my case more than 150_000 items), exporting to text files produces corrupted files:

  • If I export to plain jsonl files, the files contain only a subset of the partition's data and they start with an empty line.
  • If I export to jsonl.gz compressed files, the resulting gzipped files are invalid and can't be read (see the sketch just below this list).
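
For reference, here is a minimal sketch of one way I check this; the container path and credentials are placeholders, not my real ones:

    import gzip
    import zlib
    import fsspec  # adlfs registers the "abfs" protocol with fsspec

    STORAGE_OPTIONS = {"account_name": "<account>", "account_key": "<key>"}  # placeholder

    # Read one exported part back and try to decompress it; a truncated
    # or otherwise corrupted gzip stream raises an error here.
    with fsspec.open("abfs://storage/data/data_part_2.jsonl.gz", mode="rb",
                     **STORAGE_OPTIONS) as f:
        try:
            lines = gzip.decompress(f.read()).splitlines()
            print("decompressed %d lines" % len(lines))
        except (OSError, EOFError, zlib.error) as exc:
            print("corrupted gzip stream:", exc)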

I've been unable to reproduce this issue on a local machine using the distributed client, but I'm quite sure the issue comes from adlfs, because exporting to Amazon S3 using s3fs produces clean files.

I've been trying to reproduce the issue on a distributed cluster in a simple, shareable way, but I couldn't trigger the problem. I think it's my specific pipeline's steps that put the bag in a state where adlfs fails to export the data properly.
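
For what it's worth, here is a stripped-down sketch of the kind of reproduction I attempted; the record contents, sizes and paths are illustrative, not my actual pipeline:

    import json
    import dask.bag as db

    STORAGE_OPTIONS = {"account_name": "<account>", "account_key": "<key>"}  # placeholder

    # A single oversized partition (> 150_000 items) of small JSON records,
    # exported the same way as in the real pipeline.
    records = [{"_id": i, "value": "x" * 100} for i in range(200_000)]
    bag = db.from_sequence(records, npartitions=1)

    bag.map(json.dumps).to_textfiles(
        "abfs://storage/repro/part_*.jsonl.gz",
        storage_options=STORAGE_OPTIONS,
        last_endline=True,
    )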

Each item in the bag looks like this:

{
    "geohash_aggregation_precision": 8,
    "location": {
        "type": "Polygon",
        "coordinates": [[
            [129.12368774414062, 35.201568603515625],
            [129.12403106689453, 35.201568603515625],
            [129.12403106689453, 35.20174026489258],
            [129.12368774414062, 35.20174026489258],
            [129.12368774414062, 35.201568603515625]
        ]]
    },
    "location_centroid": {"lat": 35.2016544342041, "lon": 129.12385940551758},
    "short_events_count": 1,
    "long_events_count": 0,
    "total_events_count": 0,
    "utc_datetime": "2019-11-01 00:00:00.000000",
    "local_datetime": "2019-11-01 09:00:00.000000",
    "_id": "KR-wy7b6230-2019-11-01 00:00:00.000000"
}

The partitions are not evenly sized; here is an example of their sizes (a sketch for inspecting them follows the list):

[(0, 82302),
 (1, 59934),
 (2, 230304),
 (3, 46914),
 (4, 114548),
 (5, 140497),
 (6, 69740),
 (7, 80690),
 (8, 134518),
 (9, 99581),
(10, 59099),
(11, 106069)]
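
For reference, a rough sketch of one way to get such per-partition counts, assuming the bag is named final_bag as in the export snippet below:

    # Rough sketch: count the items in each partition of the bag.
    sizes = final_bag.map_partitions(lambda part: [sum(1 for _ in part)]).compute()
    print(list(enumerate(sizes)))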

In this example, partitions 2, 5, and 8 result in corrupted files.
I export the data using:

    final_bag.map(json.dumps).to_textfiles(
        'abfs://storage/data/data_part_*.jsonl.gz',
        storage_options=STORAGE_OPTIONS,
        last_endline=True
    )

For the moment I work around the problem by using more, smaller partitions, but because getting balanced partitions efficiently is tricky in dask, the situation is not ideal.
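
One way to do this is Bag.repartition; the target partition count below is illustrative:

    # Split the bag into more, smaller partitions before exporting, so no
    # single partition exceeds the ~150_000-item threshold.
    smaller_bag = final_bag.repartition(npartitions=64)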

I'm using:

  • python 3.6.9
  • dask 2.9.1
  • dask_kubernetes master
  • adlfs 0.1.5

Each worker in the dask cluster has:

  • 1 vcpu
  • 7 GiB of RAM
Timost changed the title from "abfs produces corrupted gzipped files when data is big enough" to "abfs produces corrupted files when data is big enough" on Jan 10, 2020
Timost (Author) commented Jan 13, 2020

Looking at the implementation differences between s3fs and adlfs, I suspect this is due to mishandling of data that is written in several chunks. I'll try to dig deeper later.
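
To illustrate what I mean (this is a toy example, not adlfs code): fsspec-style buffered writers accumulate data in memory and append it to the remote object one block at a time, so if a block is skipped, written out of order, or the final partial buffer isn't flushed on close, the output comes out truncated, which would explain both the missing lines and the unreadable gzip streams.

    import io

    class BufferedWriter:
        """Toy stand-in for an fsspec-style buffered remote file."""

        def __init__(self, sink, blocksize=4 * 2**20):
            self.sink = sink            # stands in for the remote append target
            self.blocksize = blocksize
            self.buffer = io.BytesIO()

        def write(self, data):
            self.buffer.write(data)
            if self.buffer.tell() >= self.blocksize:
                self._flush_block()

        def _flush_block(self):
            # Every buffered block must be appended to the sink, in order;
            # skipping or reordering one truncates/corrupts the output.
            self.sink.extend(self.buffer.getvalue())
            self.buffer = io.BytesIO()

        def close(self):
            # The final, partial block must be flushed too.
            self._flush_block()

    sink = bytearray()
    w = BufferedWriter(sink, blocksize=10)
    w.write(b"0123456789")   # fills one block -> appended immediately
    w.write(b"abcdef")       # remainder stays in the buffer...
    w.close()                # ...and must be flushed on close
    assert bytes(sink) == b"0123456789abcdef"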

cjalmeida (Contributor) commented Feb 7, 2020

Update on this issue: it was fixed by PR #37; however, that was reverted in commit 7bca626 by @hayesgb.

Any particular issue with that PR that needs fixing?

Edit: I see it's not playing well with the changes from #35. I'll fix it and recreate the PR shortly.

Timost (Author) commented Feb 7, 2020

I'm not sure. I've since moved to my own, pre-existing storage lib, so I can't really test whether the bug is still present. I'm okay with closing the issue.

hayesgb (Collaborator) commented Feb 7, 2020 via email

cjalmeida (Contributor) commented

I just created a new PR #38 to resolve the issues. Tests should pass now.

hayesgb (Collaborator) commented Feb 15, 2020

Integrated into the 0.2.0 release.

hayesgb closed this as completed on Feb 15, 2020.