Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 11340] Fix concurrency issues in NarUnpacker #11343

Merged
merged 4 commits into from
Jul 22, 2021

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Jul 16, 2021

Fixes #11340
Fixes #11379

Motivation

NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues.

When multiple processes start at once, the files in the extracted directory might get corrupted.
There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues.

Modifications

  • Calculate MD5 hash of the file content and use the Base64 encoded hash as the directory name
  • Prevent concurrent extraction of the file content across multiple JVMs on the same host machine by using a OS level file lock. This solution is necessary since Nar file extraction happens in separate JVMs, the Java function instance processes.
  • Prevent concurrent extraction of the file content in the same JVM with a object monitor based on the file path.
    • Java's file locking solution throws OverlappingFileLockException if the same JVM acquires the file lock twice. Therefore it's necessary to have 2 solutions: one for preventing concurrent access within the JVM and the other to prevent concurrent access across JVMs.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM

I left a few suggestions, especially I am not sure it is worth to add a test that spawns 10 other Java processes

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jerrypeng jerrypeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari Can you provide a case that there would be a concurrency issue? I looked through the code and we always use tmp files downloading the archive for each instance. This should prevent any concurrency issues. For example:

12:29:37.588 INFO [worker-scheduler-0] o.a.p.f.w.FunctionActioner@241 - Function package file is linked from /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar.0.d65f1667-d711-4fbb-a6bd-d8b1b5c804ad to /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar

Copy link
Member

@sijie sijie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear to me where the concurrent issue is from.

Also why do you combine two fixes in one PR?

Copy link
Member Author

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear to me where the concurrent issue is from.

I have explained this in the issue report #11340 . Does that information help clarify the issue?

Also why do you combine two fixes in one PR?

The reason to change form MD5 to SHA-256 is to reduce the chances for any hash collisions. I think that's relevant in making the solution complete and therefore it belongs together.

@lhotari
Copy link
Member Author

lhotari commented Jul 19, 2021

@lhotari Can you provide a case that there would be a concurrency issue? I looked through the code and we always use tmp files downloading the archive for each instance. This should prevent any concurrency issues. For example:

12:29:37.588 INFO [worker-scheduler-0] o.a.p.f.w.FunctionActioner@241 - Function package file is linked from /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar.0.d65f1667-d711-4fbb-a6bd-d8b1b5c804ad to /var/folders/ds/3dnmfjkd6gnfqgvbjgwwbm640000gn/T/PulsarSourceE2ETest3142222813321573090/downloadDirectory/external-repl-prop/io/PulsarSource-test/0/pulsar-io-data-generator-2.9.0-SNAPSHOT.nar

I have explained the issue in #11340 description. The problem isn't in downloading the function jar. It's the NarUnpacker that contains the issues explained in #11340 description. I didn't duplicate this information in this PR description.

The problem can be reproduced when using the thread or process runtime and when running a function with parallelism > brokers/function workers. The problem cannot be reproduced on kubernetes runtime since the different function instance processes won't be sharing the same /tmp/pulsar-nar directory in that case.

Please check the #11340 description and provide some follow-up questions so that I can fill the possible missing information to the issue description.

Copy link
Contributor

@freeznet freeznet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I approve this PR since I am facing the same issue when having a single broker+function-worker with any functions that parallelism>1, the tmp directory will facing the race condition, and it can be reproduced.
Although I agree that MD5 is outdated, but I would like to see this PR only focus on the race condition issue, and maybe another PR to addressing the switch of MD5 -> SHA-256. As you mentioned that switch is to reduce chances of hash collisions, so I would consider the switching as an improvement but not a BUG fix.
Also, PTAL of my comment, thanks.

@lhotari
Copy link
Member Author

lhotari commented Jul 19, 2021

Overall I approve this PR since I am facing the same issue when having a single broker+function-worker with any functions that parallelism>1, the tmp directory will facing the race condition, and it can be reproduced.

@freeznet Thanks for confirming that the issue exists for others.

Although I agree that MD5 is outdated, but I would like to see this PR only focus on the race condition issue, and maybe another PR to addressing the switch of MD5 -> SHA-256. As you mentioned that switch is to reduce chances of hash collisions, so I would consider the switching as an improvement but not a BUG fix.

I'm fine with doing that change if it's a showstopper for accepting the PR.

@jerrypeng
Copy link
Contributor

@lhotari @freeznet
If the error is easily reproducible, can you provide steps to reproduce? What are the errors you see when this issue happens?

the tmp directory will facing the race condition,

Each instance has its own temp directory for downloading artifacts, why would we have concurrency issues here?

@freeznet
Copy link
Contributor

@jerrypeng I have created an issue #11379 with my observation, and I have provided steps to reproduce, PTAL, thanks.

@sijie
Copy link
Member

sijie commented Jul 20, 2021

I'm fine with doing that change if it's a showstopper for accepting the PR.

I think we should at least separate the PR.

@lhotari
Copy link
Member Author

lhotari commented Jul 20, 2021

I'm fine with doing that change if it's a showstopper for accepting the PR.

I think we should at least separate the PR.

I'll do that.

@lhotari
Copy link
Member Author

lhotari commented Jul 20, 2021

It is unclear to me where the concurrent issue is from.

Also why do you combine two fixes in one PR?

@sijie @jerrypeng I have reverted the md5->sha256 changes. PTAL

@lhotari
Copy link
Member Author

lhotari commented Jul 20, 2021

If the error is easily reproducible, can you provide steps to reproduce? What are the errors you see when this issue happens?

@jerrypeng There's a repro now in #11379 , thanks to @freeznet . There's also short repro instructions and one possible error message in the original issue description, #11340 .

the tmp directory will facing the race condition,

Each instance has its own temp directory for downloading artifacts, why would we have concurrency issues here?

Downloading of the artifacts isn't the issue. The NAR classloader will extract the downloaded jar file to a directory called /tmp/pulsar-nar by default. This is a fixed location currently.

For e2e Pulsar Function tests, there's a unique narExtractionDirectory
in use
, so perhaps that is causing the confusion when running e2e tests. A unique narExtractionDirectory isn't a suitable solution for Pulsar Functions in production mode for the reason that it doesn't solve the problem for thread runtime. It would also consume more disk space since each instance would have it's own copy of the files.

Please take a closer look to the solution in this PR since it solves the problem in a fairly simple way.

@lhotari lhotari self-assigned this Jul 20, 2021
@lhotari lhotari added the type/bug The PR fixed a bug or issue reported a bug label Jul 20, 2021
@lhotari lhotari added this to the 2.9.0 milestone Jul 20, 2021
@lhotari lhotari requested review from jerrypeng and sijie July 20, 2021 08:35
@lhotari lhotari added the doc-not-needed Your PR changes do not impact docs label Jul 20, 2021
@devinbost
Copy link
Contributor

Related to #6054

@lhotari
Copy link
Member Author

lhotari commented Jul 22, 2021

@jerrypeng @sijie @freeznet I have addressed the review feedback. Please take another look at this PR. Thanks!

@jerrypeng
Copy link
Contributor

@lhotari

A unique narExtractionDirectory isn't a suitable solution for Pulsar Functions in production mode for the reason that it doesn't solve the problem for thread runtime.

Why not? Even with your solution, it does not prevent the scenario when two different functions are started by submitting distinct NARs/JARs that is named the same.

@lhotari
Copy link
Member Author

lhotari commented Jul 22, 2021

@jerrypeng

Why not?

For the thread runtime I meant that it would have to be for each function instance also in that case and a JVM level unique directory isn't sufficient.

It would be possible to have a unique directory for every parallel instance of the function, but the problem that it brings is the excessive disk space usage. Each parallel instance would have to have it's own unique directory. Some NAR/JAR files are very large. 50MB isn't uncommon. for example, pulsar-io-debezium-*.nar files are over 80MB (there are many others in this range). With a high parallelism value such as 8, several of hundreds of additional MBs of disk space might be required if the solution would be based on a unique directory per function instance.

Even with your solution, it does not prevent the scenario when two different functions are started by submitting distinct NARs/JARs that is named the same.

This PR addresses the case where the NARs or JARs use the same name. The content hash will be used as part of the directory name.
The file locking solution prevents race conditions which could happen when parallel functions start up at the same time (very common scenario when parallelism > function workers and using process runtime).

Is there some problem with the changes proposed in the PR? I'd like to complete work on this PR before my vacation which starts tomorrow.

@eolivelli
Copy link
Contributor

@jerrypeng @sijie @freeznet
IMHO I believe that if this patch is solving one problem and this is not adding regressions or potentials problems in production it is fine to commit it.
we could add more enhancements in the future if you have some additional case that is still not covered by this fix.

This patch adds unit tests that will fail without this patch

@devinbost
Copy link
Contributor

devinbost commented Jul 22, 2021 via email

Copy link
Contributor

@jerrypeng jerrypeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari thanks for working on this and answering my questions!

@sijie sijie merged commit dbf3138 into apache:master Jul 22, 2021
codelipenghui pushed a commit that referenced this pull request Jul 23, 2021
Fixes #11340
Fixes #11379

NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues.

When multiple processes start at once, the files in the extracted directory might get corrupted.
There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues.

- Calculate MD5 hash of the file content and use the Base64 encoded hash as the directory name
- Prevent concurrent extraction of the file content across multiple JVMs on the same host machine by using a OS level file lock. This solution is necessary since Nar file extraction happens in separate JVMs, the Java function instance processes.
- Prevent concurrent extraction of the file content in the same JVM with a object monitor based on the file path.
  - Java's file locking solution throws OverlappingFileLockException if the same JVM acquires the file lock twice. Therefore it's necessary to have 2 solutions: one for preventing concurrent access within the JVM and the other to prevent concurrent access across JVMs.

(cherry picked from commit dbf3138)
@codelipenghui codelipenghui added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Jul 23, 2021
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
Fixes apache#11340
Fixes apache#11379 

### Motivation

NarUnpacker used to extract Pulsar Functions jar files and Pulsar IO connector nar files has concurrency and file corruption issues.

When multiple processes start at once, the files in the extracted directory might get corrupted.
There are also cases where the same extraction directory is used for different versions of the functions jar file therefore creating file corruption issues.

### Modifications

- Calculate MD5 hash of the file content and use the Base64 encoded hash as the directory name
- Prevent concurrent extraction of the file content across multiple JVMs on the same host machine by using a OS level file lock. This solution is necessary since Nar file extraction happens in separate JVMs, the Java function instance processes.
- Prevent concurrent extraction of the file content in the same JVM with a object monitor based on the file path. 
  - Java's file locking solution throws OverlappingFileLockException if the same JVM acquires the file lock twice. Therefore it's necessary to have 2 solutions: one for preventing concurrent access within the JVM and the other to prevent concurrent access across JVMs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
7 participants