-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue 11340] Fix concurrency issues in NarUnpacker #11343
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM
I left a few suggestions, especially I am not sure it is worth to add a test that spawns 10 other Java processes
pulsar-common/src/main/java/org/apache/pulsar/common/nar/NarUnpacker.java
Show resolved
Hide resolved
pulsar-common/src/test/java/org/apache/pulsar/common/nar/NarUnpackerTest.java
Outdated
Show resolved
Hide resolved
pulsar-common/src/test/java/org/apache/pulsar/common/nar/NarUnpackerTest.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
pulsar-common/src/main/java/org/apache/pulsar/common/nar/NarUnpacker.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhotari Can you provide a case that there would be a concurrency issue? I looked through the code and we always use tmp files downloading the archive for each instance. This should prevent any concurrency issues. For example:
12:29:37.588 INFO [worker-scheduler-0] o.a.p.f.w.FunctionActioner@241 - Function package file is linked from /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar.0.d65f1667-d711-4fbb-a6bd-d8b1b5c804ad to /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me where the concurrent issue is from.
Also why do you combine two fixes in one PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me where the concurrent issue is from.
I have explained this in the issue report #11340 . Does that information help clarify the issue?
Also why do you combine two fixes in one PR?
The reason to change form MD5 to SHA-256 is to reduce the chances for any hash collisions. I think that's relevant in making the solution complete and therefore it belongs together.
I have explained the issue in #11340 description. The problem isn't in downloading the function jar. It's the NarUnpacker that contains the issues explained in #11340 description. I didn't duplicate this information in this PR description. The problem can be reproduced when using the thread or process runtime and when running a function with parallelism > brokers/function workers. The problem cannot be reproduced on kubernetes runtime since the different function instance processes won't be sharing the same /tmp/pulsar-nar directory in that case. Please check the #11340 description and provide some follow-up questions so that I can fill the possible missing information to the issue description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I approve this PR since I am facing the same issue when having a single broker+function-worker with any functions that parallelism>1, the tmp directory will facing the race condition, and it can be reproduced.
Although I agree that MD5 is outdated, but I would like to see this PR only focus on the race condition issue, and maybe another PR to addressing the switch of MD5 -> SHA-256. As you mentioned that switch is to reduce chances of hash collisions, so I would consider the switching as an improvement but not a BUG fix.
Also, PTAL of my comment, thanks.
pulsar-common/src/main/java/org/apache/pulsar/common/nar/NarUnpacker.java
Show resolved
Hide resolved
@freeznet Thanks for confirming that the issue exists for others.
I'm fine with doing that change if it's a showstopper for accepting the PR. |
@lhotari @freeznet
Each instance has its own temp directory for downloading artifacts, why would we have concurrency issues here? |
@jerrypeng I have created an issue #11379 with my observation, and I have provided steps to reproduce, PTAL, thanks. |
I think we should at least separate the PR. |
I'll do that. |
@sijie @jerrypeng I have reverted the md5->sha256 changes. PTAL |
@jerrypeng There's a repro now in #11379 , thanks to @freeznet . There's also short repro instructions and one possible error message in the original issue description, #11340 .
Downloading of the artifacts isn't the issue. The NAR classloader will extract the downloaded jar file to a directory called /tmp/pulsar-nar by default. This is a fixed location currently. For e2e Pulsar Function tests, there's a unique narExtractionDirectory Please take a closer look to the solution in this PR since it solves the problem in a fairly simple way. |
Related to #6054 |
dd4bf35
to
6a10ef4
Compare
@jerrypeng @sijie @freeznet I have addressed the review feedback. Please take another look at this PR. Thanks! |
Why not? Even with your solution, it does not prevent the scenario when two different functions are started by submitting distinct NARs/JARs that is named the same. |
For the thread runtime I meant that it would have to be for each function instance also in that case and a JVM level unique directory isn't sufficient. It would be possible to have a unique directory for every parallel instance of the function, but the problem that it brings is the excessive disk space usage. Each parallel instance would have to have it's own unique directory. Some NAR/JAR files are very large. 50MB isn't uncommon. for example, pulsar-io-debezium-*.nar files are over 80MB (there are many others in this range). With a high parallelism value such as 8, several of hundreds of additional MBs of disk space might be required if the solution would be based on a unique directory per function instance.
This PR addresses the case where the NARs or JARs use the same name. The content hash will be used as part of the directory name. Is there some problem with the changes proposed in the PR? I'd like to complete work on this PR before my vacation which starts tomorrow. |
@jerrypeng @sijie @freeznet This patch adds unit tests that will fail without this patch |
I agree that we want to avoid the additional disk costs of having a
directory for each function instance if it means a copy of the jar would be
saved for every function instance. That would introduce scaling issues on
clusters that have high function parallelism, and it would be a significant
behavioral change for operations since many teams might not anticipate such
a change in disk utilization and may not have enough disk allocated for the
upgrade.
…--
Devin G. Bost
On Thu, Jul 22, 2021, 4:50 AM Enrico Olivelli ***@***.***> wrote:
@jerrypeng <https://github.com/jerrypeng> @sijie
<https://github.com/sijie> @freeznet <https://github.com/freeznet>
IMHO I believe that if this patch is solving one problem and this is not
adding regressions or potentials problems in production it is fine to
commit it.
we could add more enhancements in the future if you have some additional
case that is still not covered by this fix.
This patch adds unit tests that will fail without this patch
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#11343 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABYTBL6VOHZYUX3B6JHPSWLTY7ZY5ANCNFSM5APZSMLA>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhotari thanks for working on this and answering my questions!
Fixes #11340 Fixes #11379 NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues. When multiple processes start at once, the files in the extracted directory might get corrupted. There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues. - Calculate MD5 hash of the file content and use the Base64 encoded hash as the directory name - Prevent concurrent extraction of the file content across multiple JVMs on the same host machine by using a OS level file lock. This solution is necessary since Nar file extraction happens in separate JVMs, the Java function instance processes. - Prevent concurrent extraction of the file content in the same JVM with a object monitor based on the file path. - Java's file locking solution throws OverlappingFileLockException if the same JVM acquires the file lock twice. Therefore it's necessary to have 2 solutions: one for preventing concurrent access within the JVM and the other to prevent concurrent access across JVMs. (cherry picked from commit dbf3138)
Fixes apache#11340 Fixes apache#11379 ### Motivation NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues. When multiple processes start at once, the files in the extracted directory might get corrupted. There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues. ### Modifications - Calculate MD5 hash of the file content and use the Base64 encoded hash as the directory name - Prevent concurrent extraction of the file content across multiple JVMs on the same host machine by using a OS level file lock. This solution is necessary since Nar file extraction happens in separate JVMs, the Java function instance processes. - Prevent concurrent extraction of the file content in the same JVM with a object monitor based on the file path. - Java's file locking solution throws OverlappingFileLockException if the same JVM acquires the file lock twice. Therefore it's necessary to have 2 solutions: one for preventing concurrent access within the JVM and the other to prevent concurrent access across JVMs.
Fixes #11340
Fixes #11379
Motivation
NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues.
When multiple processes start at once, the files in the extracted directory might get corrupted.
There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues.
Modifications