Index_parallel task fails because of error in opening zip file (running on indexers) #11478

Closed
vpeack opened this issue Jul 21, 2021 · 8 comments

vpeack commented Jul 21, 2021

Hi everyone,

Following a post on the ASF Slack, I am opening a new issue here on the advice of someone from Imply.
We are running compaction tasks on indexers, and they randomly fail in phase 3 (partial_index_generic_merge) with the following error message (more details below): "error in opening zip file"

The reply we got on Slack:
As to the specific error, I'm not sure if it's exactly the same as what's going on in #9993, but that issue does point out an important thing, which is that if the shuffle server returns an error, the shuffle client will not actually log out that error, but it will just log this sort of obtuse zip decompression error. (Because it's trying to unzip the error message.) This isn't good error behavior, so we should adjust that to log the actual server error instead of trying to unzip the error message. Which is silly!
This seems to be an indexer bug. Could you please create a bug report in the Druid GitHub project with all the details?
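
The behavior described in the reply could be guarded against by checking what was actually downloaded before handing it to the unzipper. The following is a minimal sketch, not Druid's code: the class `FetchedFileCheck` and its method names are hypothetical. It only assumes that a ZIP archive normally starts with the bytes `PK` followed by 0x03 0x04 (or 0x05 0x06 for an empty archive). When the file is not a ZIP, the first bytes can be read as text, so a server error body could be logged instead of a ZipException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Druid: inspects a fetched file before unzipping it.
final class FetchedFileCheck {

    // True if the file starts with a ZIP signature: "PK" 0x03 0x04 (local file header)
    // or "PK" 0x05 0x06 (end-of-central-directory record of an empty archive).
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = readHead(file, 4);
        if (head.length < 4 || head[0] != 'P' || head[1] != 'K') {
            return false;
        }
        return (head[2] == 3 && head[3] == 4) || (head[2] == 5 && head[3] == 6);
    }

    // Returns up to maxBytes from the start of the file, decoded as UTF-8,
    // so that a non-ZIP payload (for example an error message) can be logged.
    static String previewAsText(Path file, int maxBytes) throws IOException {
        return new String(readHead(file, maxBytes), StandardCharsets.UTF_8);
    }

    // Reads at most maxBytes from the beginning of the file.
    private static byte[] readHead(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[maxBytes];
            int total = 0;
            int n;
            while (total < maxBytes && (n = in.read(buf, total, maxBytes - total)) > 0) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }
}
```

A caller would test `looksLikeZip(file)` first and, if it returns false, log `previewAsText(file, 512)` rather than attempting to unzip. A file that passes this check can still be truncated or corrupt, so this only distinguishes "not a ZIP at all" from "possibly a ZIP".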

Affected Version

0.21.0

Description

  • Cluster size
    1 master (coordinator/overlord)
    2 routers/brokers
    ~10 historicals
    ~20 indexers (dedicated to these tasks) + ~5 indexers for realtime ingestion (kafka)
    ~30TB data

  • Configurations in use
    Spec object we are using:
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "druid",
            "dataSource": "events",
            "interval": "2021-07-13T00:00:00/2021-07-14T00:00:00"
          }
        },
        "tuningConfig": {
          "type": "index_parallel",
          "partitionsSpec": {
            "type": "hashed",
            "maxRowsPerSegment": 800000
          },
          "forceGuaranteedRollup": true,
          "maxNumConcurrentSubTasks": 40,
          "totalNumMergeTasks": 20,
          "maxRetry": 10,
          "maxPendingPersists": 1,
          "maxRowsPerSegment": 800000
        },
        "dataSchema": {
          "dataSource": "events",
          "granularitySpec": {
            "type": "uniform",
            "queryGranularity": "HOUR",
            "segmentGranularity": "HOUR",
            "rollup": true
          },
          "timestampSpec": {
            "column": "__time",
            "format": "iso"
          },
          "dimensionsSpec": {},
          "metricsSpec": []
        }
      }
    }

  • Steps to reproduce the problem
    Happens randomly

  • The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.

{"severity": "INFO", "message": "[[partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z]-threading-task-runner-executor-0] org.apache.druid.utils.CompressionUtils - Unzipping file[/opt/druid-data/task/partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z/work/indexing-tmp/2021-07-20T08:00:00.000Z/2021-07-20T09:00:00.000Z/10/temp_partial_index_generate_events_ooikmkan_2021-07-21T11:00:25.016Z] to [/opt/druid-data/task/partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z/work/indexing-tmp/2021-07-20T08:00:00.000Z/2021-07-20T09:00:00.000Z/10/unzipped_partial_index_generate_events_ooikmkan_2021-07-21T11:00:25.016Z]"}
{"severity": "ERROR", "message": "[[partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z]-threading-task-runner-executor-0] org.apache.druid.indexing.overlord.ThreadingTaskRunner - Exception caught while running the task."}
java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_292]
    at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_292]
    at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_292]
    at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_292]
    at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:235) ~[druid-core-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:224) ~[druid-indexing-service-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:162) ~[druid-indexing-service-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialGenericSegmentMergeTask.runTask(PartialGenericSegmentMergeTask.java:41) ~[druid-indexing-service-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152) ~[druid-indexing-service-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:211) [druid-indexing-service-0.21.0.jar:0.21.0]
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:151) [druid-indexing-service-0.21.0.jar:0.21.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

  • Any debugging that you have already done
    N/A
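
The stack trace above shows the exception surfacing from `java.util.zip.ZipFile` when the merge task opens the fetched file. As a rough illustration of why a non-ZIP payload produces this kind of error, the sketch below writes a plain-text message to a temporary file and tries to open it as a ZIP. The class name `NonZipRepro`, the temp-file prefix, and the message text are made up for illustration. The exact exception message varies by JDK version, so the code prints it rather than assuming it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical reproduction, not Druid code.
final class NonZipRepro {

    // Tries to open the file as a ZIP archive.
    // Returns the ZipException message on failure, or null if the file opened cleanly.
    static String tryOpenAsZip(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a fetched partition file that actually contains an error body.
        Path fake = Files.createTempFile("fetched_partition", ".bin");
        try {
            Files.write(fake, "made-up server error body".getBytes(StandardCharsets.UTF_8));
            System.out.println("ZipException message: " + tryOpenAsZip(fake));
        } finally {
            Files.deleteIfExists(fake);
        }
    }
}
```

The point of the sketch is that the exception says nothing about the file's contents, which matches the Slack reply: whatever the server actually returned is lost unless the client inspects or logs it separately.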

Any ideas on how we can resolve this?
Feel free to ask if you need anything else.

Thanks a lot

vpeack commented Oct 14, 2021

up

@ritvik-statsig

I am running into this issue as well - posted in the druid forum here https://www.druidforum.org/t/error-in-opening-zip-file-during-ingestion/7429

@ThomasBarach

Hello,
FYI, we're not seeing this error anymore. We were using GCP preemptible instances back then. Once we switched to non-preemptible instances, everything was fine.

@ritvik-statsig

Interesting. So your indexer nodes were getting preempted and that is what was causing this? So the zip file error is just a misleading message for some other underlying issue.

@ThomasBarach

Yep, I guess so.
Are you using Spot/Preemptible cloud instances as well?

@ritvik-statsig

I am not, and this also reproduces consistently for me. It must be something like an OOM.

github-actions bot commented Nov 9, 2023

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 9, 2023

github-actions bot commented Dec 8, 2023

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Dec 8, 2023