Fix Race in jar upload during hadoop indexing #1815

Merged
merged 1 commit into from
Oct 23, 2015

nishantmonu51
Member

Fixes a race while uploading jar files.

  • SNAPSHOT jar files are always uploaded to a job-specific, non-shared directory and are cleaned up with the job cleanup.
  • Non-SNAPSHOT jar files are first uploaded to a temporary intermediate path and then renamed into the shared directory without overwriting any existing file. This fixes the race in which multiple jobs see the jar file as not present and try to upload it simultaneously.

Fixes - #582
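
For readers who want the idea in isolation, here is a minimal sketch of "write to a job-private intermediate path, then move into the shared location without overwriting". It is an analogue on the local file system using `java.nio.file`, not the Hadoop `FileSystem` code from this PR; the class name `PublishOnceSketch`, the method names, and the file names are all hypothetical. One caveat to keep in mind: whether the exists-check and the move happen as a single atomic step depends on the platform and file system, so treat this as an illustration of the control flow only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue; not the Hadoop FileSystem code from this PR.
final class PublishOnceSketch {

  /**
   * Writes content to a job-private intermediate file, then moves it to sharedPath.
   * Returns true if this call published the file, false if sharedPath already existed.
   */
  static boolean publish(byte[] content, Path jobDir, Path sharedPath) throws IOException {
    Files.createDirectories(jobDir);
    Path intermediate = Files.createTempFile(jobDir, "upload-", ".tmp");
    try {
      Files.write(intermediate, content);
      // No REPLACE_EXISTING option: the move fails instead of overwriting.
      Files.move(intermediate, sharedPath);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false; // another job published first; keep its copy
    } finally {
      // No-op after a successful move; removes the leftover after a lost race.
      Files.deleteIfExists(intermediate);
    }
  }

  /** Runs two publishers against one shared path and checks that the first copy survives. */
  static boolean demo() {
    try {
      Path root = Files.createTempDirectory("publish-once");
      Path shared = root.resolve("lib-1.0.jar");
      Path job2Dir = root.resolve("job-2");
      byte[] first = "first".getBytes(StandardCharsets.UTF_8);
      byte[] second = "second".getBytes(StandardCharsets.UTF_8);
      boolean firstWon = publish(first, root.resolve("job-1"), shared);
      boolean secondWon = publish(second, job2Dir, shared);
      boolean kept = Arrays.equals(first, Files.readAllBytes(shared));
      boolean noLeftover;
      try (Stream<Path> leftovers = Files.list(job2Dir)) {
        noLeftover = leftovers.count() == 0;
      }
      return firstWon && !secondWon && kept && noLeftover;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("first copy kept: " + demo());
  }
}
```

The sketch deliberately skips the "does the shared file already exist?" fast path so that the second publisher exercises the lost-race branch; a real implementation would probably check first to avoid a needless upload.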

@nishantmonu51 nishantmonu51 changed the title Fix Race in jar upload during hadoop indexing - https://github.com/dr… Fix Race in jar upload during hadoop indexing Oct 9, 2015
uploadJar(jarFile, intermediateHdfsPath, fs);
try {
log.info("Renaming jar to path[%s]", hdfsPath);
fs.rename(intermediateHdfsPath, hdfsPath);
Contributor

If this races can it leave intermediateHdfsPath stale in the distributed FS?

Member Author

The intermediate path is in the job working directory, which gets cleaned up during job cleanup.
Also, fixed it to clean up the intermediate jar file here.

@xvrl xvrl added this to the 0.8.2 milestone Oct 13, 2015
throw e;
}
} finally {
if (fs.exists(intermediateHdfsPath)) {
Contributor

Exists and delete can throw IOException, so you'll want to catch those and at least log them. This is a bit tricky.

Contributor

ideally the exceptions would bubble up as suppressed exceptions, but that may not be possible to do in a clean way.

Member Author

fixed.

@nishantmonu51 nishantmonu51 force-pushed the jar-upload-race branch 2 times, most recently from f86b5f6 to 31bf270 Compare October 13, 2015 18:54
}
}
DistributedCache.addFileToClassPath(hdfsPath, conf, fs);
Contributor

DistributedCache is deprecated, can you use job.addFileToClassPath to add the files?

Yes, right now that seems to do exactly this under the hood, but if hadoop ever gets their crap together they can change this without warning.

Member Author

done.

@xvrl
Member

xvrl commented Oct 14, 2015

@nishantmonu51 @drcrallen looks like this change is a little less trivial than originally thought. Should we push it back to 0.8.3 / 0.9.0?

public static void cleanup(Job job) throws IOException
{
final Path jobDir = getJobPath(job.getJobID(), job.getWorkingDirectory());
final FileSystem fs = jobDir.getFileSystem(job.getConfiguration());
fs.delete(jobDir, true);
fs.delete(getJobClassPathDir(job.getJobName(), job.getWorkingDirectory()), true);
Contributor

This method could use some extra assurances / retries as well. (non-blocking)

Member Author

I think we need to do this for the cleanup of other Hadoop jobs as well; I would like to do it in a separate PR. Opening a GitHub issue.

@drcrallen
Contributor

@xvrl I'm actually ok with it right now. I would love to have some more improvements on it and Unit tests, which can come in 0.9.x but at this point I think it is in a much better place than it was, and I'm 👍 if @nishantmonu51 is willing to add some more assurances and unit tests for 0.9

@nishantmonu51
Member Author

I tried to write a unit test for the above, but JobHelper.setupClasspath skips the jar copying if the file system is LocalFileSystem, so I tested it manually with a local HDFS setup.
As for adding retry logic for task cleanups, I think we can add that in 0.9; I will create an issue for doing that in a separate PR.

@nishantmonu51
Member Author

btw, I am fine with moving this to 0.9/0.8.3 also.

@xvrl xvrl modified the milestones: 0.8.3, 0.8.2 Oct 14, 2015
@drcrallen
Contributor

@nishantmonu51 io.druid.segment.loading.HdfsFileTimestampVersionFinderTest#setupStatic sets up a mini hdfs cluster for actual hdfs testing.

@nishantmonu51
Member Author

@drcrallen thanks, I will use that and add a Test.

/**
* Uploads jar files to hdfs and configures the classpath.
* Snapshot jar files are uploaded to intermediateClasspath and not shared across multiple jobs.
* Non-Snapshot
Contributor

Non-Snapshot? Is this an incomplete sentence?

Member Author

fixed.

@nishantmonu51 nishantmonu51 force-pushed the jar-upload-race branch 2 times, most recently from fa3753f to fca3250 Compare October 15, 2015 17:59
@nishantmonu51
Member Author

@drcrallen added test.

}
return apply(input.getCause());
Contributor

addJarToClassPath has

throw new ISE("File does not exist even after moving from[%s] to [%s]", intermediateHdfsPath, hdfsPath);

that means it won't be retried. Is that intentional?

why not retry here irrespective of what the exception type is, in the worst case there would be 3 retries instead of having to do a recursion?

Member Author

will modify it to IOException though this should ideally never happen in any case.
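
To make the distinction in this exchange concrete, here is a minimal sketch of a retry helper that retries only on `IOException` and lets every other exception type (such as an `IllegalStateException`) propagate on the first failure. The helper `retryOnIOException` and the class `RetrySketch` are hypothetical and are not the retry utility used in the PR; the try count of 8 in the demo simply mirrors the `NUM_RETRIES` constant visible in the diff context.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; the PR uses its own retry utility, which is not shown in this thread.
final class RetrySketch {

  /** Calls task up to maxTries times, sleeping between attempts, retrying only IOException. */
  static <T> T retryOnIOException(Callable<T> task, int maxTries, long sleepMillis)
      throws Exception {
    if (maxTries < 1) {
      throw new IllegalArgumentException("maxTries must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      } catch (IOException e) {
        if (attempt >= maxTries) {
          throw e; // out of attempts: surface the last failure
        }
        Thread.sleep(sleepMillis);
      }
      // Any other exception type is not caught, so it propagates without a retry.
    }
  }

  /** Checks that IOException is retried and IllegalStateException is not. */
  static boolean demo() {
    try {
      AtomicInteger calls = new AtomicInteger();
      String result = retryOnIOException(() -> {
        if (calls.incrementAndGet() < 3) {
          throw new IOException("transient failure");
        }
        return "uploaded";
      }, 8, 1L);
      boolean retried = "uploaded".equals(result) && calls.get() == 3;

      AtomicInteger iseCalls = new AtomicInteger();
      boolean notRetried = false;
      try {
        retryOnIOException(() -> {
          iseCalls.incrementAndGet();
          throw new IllegalStateException("not an IOException");
        }, 8, 1L);
      } catch (IllegalStateException e) {
        notRetried = iseCalls.get() == 1;
      }
      return retried && notRetried;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("retry behaviour as expected: " + demo());
  }
}
```

Under a predicate like this, switching the "file does not exist even after moving" failure from an unchecked exception to an `IOException` is what makes it eligible for a retry.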

Member Author

done.

@@ -79,6 +79,8 @@
private static final int NUM_RETRIES = 8;
private static final int SECONDS_BETWEEN_RETRIES = 2;
private static final int DEFAULT_FS_BUFFER_SIZE = 1 << 18; // 256KB
private static final Pattern SNAPSHOT_JAR = Pattern.compile(".*SNAPSHOT(-selfcontained)?\\.jar$");
Contributor

nit: this should be

Pattern.compile(".*\\-SNAPSHOT(-selfcontained)?\\.jar$")

just in case someone was crazy enough to have the artifact name be something like iAmNotSNAPSHOT.jar, which is not really a "snapshot" jar.
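
A small sketch comparing the two patterns quoted above on a few made-up jar names. The two regular expressions are copied from this thread; the class name `SnapshotJarPatternSketch`, the constant names, and the sample file names are hypothetical.

```java
import java.util.regex.Pattern;

final class SnapshotJarPatternSketch {

  // Original pattern from the diff: matches any name that ends in "SNAPSHOT.jar".
  static final Pattern LOOSE = Pattern.compile(".*SNAPSHOT(-selfcontained)?\\.jar$");

  // Suggested pattern: additionally requires a hyphen directly before "SNAPSHOT".
  static final Pattern STRICT = Pattern.compile(".*\\-SNAPSHOT(-selfcontained)?\\.jar$");

  static boolean matches(Pattern pattern, String jarName) {
    return pattern.matcher(jarName).matches();
  }

  /** Checks the sample names below against both patterns. */
  static boolean demo() {
    String snapshot = "example-1.0-SNAPSHOT.jar";
    String selfContained = "example-1.0-SNAPSHOT-selfcontained.jar";
    String release = "example-1.0.jar";
    String oddName = "iAmNotSNAPSHOT.jar";
    return matches(LOOSE, snapshot) && matches(STRICT, snapshot)
        && matches(LOOSE, selfContained) && matches(STRICT, selfContained)
        && !matches(LOOSE, release) && !matches(STRICT, release)
        && matches(LOOSE, oddName) && !matches(STRICT, oddName);
  }

  public static void main(String[] args) {
    for (String name : new String[] {"example-1.0-SNAPSHOT.jar", "iAmNotSNAPSHOT.jar"}) {
      System.out.println(name + " loose=" + matches(LOOSE, name)
          + " strict=" + matches(STRICT, name));
    }
  }
}
```

The only name on which the two patterns disagree here is `iAmNotSNAPSHOT.jar`, which is the case the reviewer raised.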

Member Author

done.

@himanshug
Contributor

LGTM

if (exception != null) {
exception.addSuppressed(e);
} else {
exception = e;
Contributor

Is this just making sure we don't have a jar left over at intermediateHdfsPath?

Member Author

yeah.

Contributor

I see :) Can we also invert the condition in the "if else"? I actually feel it would take a bit less mental gymnastics to understand if it were

if (exception == null) {
  exception = e;
} else {
  exception.addSuppressed(e);
}
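
For context, the pattern being discussed (keep the primary failure, attach a cleanup failure to it as a suppressed exception) can be sketched on its own as follows. This is an illustration under assumptions, not the PR's code: `SuppressedCleanupSketch`, `IoAction`, and `runWithCleanup` are hypothetical names, and only `IOException` is tracked, so a runtime exception thrown by the work step would propagate without carrying the cleanup failure.

```java
import java.io.IOException;

final class SuppressedCleanupSketch {

  /** Hypothetical functional interface for an action that may fail with IOException. */
  interface IoAction {
    void run() throws IOException;
  }

  /** Runs work, always runs cleanup, and reports both failures through one exception. */
  static void runWithCleanup(IoAction work, IoAction cleanup) throws IOException {
    IOException exception = null;
    try {
      work.run();
    } catch (IOException e) {
      exception = e;
    } finally {
      try {
        cleanup.run();
      } catch (IOException e) {
        if (exception == null) {
          exception = e; // cleanup is the only failure
        } else {
          exception.addSuppressed(e); // keep the primary failure, attach cleanup failure
        }
      }
    }
    if (exception != null) {
      throw exception;
    }
  }

  /** Checks that the work failure is primary and the cleanup failure is suppressed. */
  static boolean demo() {
    try {
      runWithCleanup(
          () -> { throw new IOException("rename failed"); },
          () -> { throw new IOException("delete failed"); });
      return false;
    } catch (IOException e) {
      Throwable[] suppressed = e.getSuppressed();
      return "rename failed".equals(e.getMessage())
          && suppressed.length == 1
          && "delete failed".equals(suppressed[0].getMessage());
    }
  }

  public static void main(String[] args) {
    System.out.println("cleanup failure attached as suppressed: " + demo());
  }
}
```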

@nishantmonu51 nishantmonu51 modified the milestones: 0.9.0, 0.8.3 Oct 20, 2015
{
Job job = Job.getInstance(conf, "test-job");
DistributedFileSystem fs = miniCluster.getFileSystem();
Path intermediatePath = new Path("/tmp/classpath");
Contributor

Intermediate path should be unique per task, right? can this be randomized a bit so one test cannot pollute another?

Member Author

Same here: this is on HDFS, and different runs will have different HDFS directories.

@drcrallen
Contributor

@nishantmonu51 is there a way to add tests for failure cases?

int id = barrier.await();
Job job = Job.getInstance(conf, "test-job-" + id);
Path intermediatePathForJob = new Path(intermediatePath, "job-" + id);
barrier.await();
Contributor

not sure why we need to await() for the second time?
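
The thread does not record an answer to this question. One plausible reading, offered here as an assumption: `CyclicBarrier.await()` returns the caller's arrival index, so the first call hands each thread a distinct id, and the second call lines all threads up again after their per-thread setup so that the contended step starts at roughly the same time. A minimal stand-alone sketch of that two-phase use follows; `BarrierIdSketch` and `runWorkers` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BarrierIdSketch {

  /** Starts one worker per party and returns the ids handed out by the first await. */
  static Set<Integer> runWorkers(int parties) throws Exception {
    CyclicBarrier barrier = new CyclicBarrier(parties);
    ExecutorService pool = Executors.newFixedThreadPool(parties);
    Set<Integer> ids = ConcurrentHashMap.newKeySet();
    Callable<Object> worker = () -> {
      // First rendezvous: the return value is this thread's arrival index,
      // from parties - 1 (first to arrive) down to 0 (last to arrive).
      int id = barrier.await(10, TimeUnit.SECONDS);
      ids.add(id); // stands in for per-thread setup such as building a job name
      // Second rendezvous: wait until every thread has finished its setup.
      barrier.await(10, TimeUnit.SECONDS);
      // The contended operation would start here in all threads.
      return null;
    };
    try {
      List<Future<Object>> futures = new ArrayList<>();
      for (int i = 0; i < parties; i++) {
        futures.add(pool.submit(worker));
      }
      for (Future<Object> future : futures) {
        future.get(30, TimeUnit.SECONDS);
      }
    } finally {
      pool.shutdownNow();
    }
    return ids;
  }

  /** Checks that four workers receive the four distinct ids 0 through 3. */
  static boolean demo() {
    try {
      Set<Integer> ids = runWorkers(4);
      return ids.size() == 4 && ids.contains(0) && ids.contains(1)
          && ids.contains(2) && ids.contains(3);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("distinct ids handed out: " + demo());
  }
}
```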

@nishantmonu51 nishantmonu51 force-pushed the jar-upload-race branch 2 times, most recently from 9ba4f83 to 41cc078 Compare October 21, 2015 20:05
@nishantmonu51
Member Author

I added a test for concurrent upload which tries to simulate the failure case which this PR is intended to fix.

);
}

for (Future<Boolean> future : futures) {
Contributor

suggest Futures.allAsList(futures).get(....)

Contributor

Then you don't really need to check the return value because it should bubble up any exceptions in the execution.

Member Author

nice suggestion, done.
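
The suggestion above refers to Guava's `Futures.allAsList`. As a standard-library stand-in (a swapped-in technique, not what the PR uses), the same "wait for everything and let any failure bubble up" behaviour can be sketched with `CompletableFuture.allOf`. The names `AwaitAllSketch` and `awaitAll` are hypothetical. Note one difference: `allOf` completes only after every future has completed, so it may report a failure later than a fail-fast combinator would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Standard-library analogue of the reviewer's suggestion; not Guava and not the PR's code.
final class AwaitAllSketch {

  /** Waits for all futures; throws ExecutionException if any of them failed. */
  static void awaitAll(List<CompletableFuture<?>> futures)
      throws InterruptedException, ExecutionException, TimeoutException {
    CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
        .get(30, TimeUnit.SECONDS);
  }

  /** Checks that a failure in one task surfaces without inspecting return values. */
  static boolean demo() {
    CompletableFuture<?> ok = CompletableFuture.supplyAsync(() -> "uploaded");
    CompletableFuture<?> failed = CompletableFuture.supplyAsync(() -> {
      throw new IllegalStateException("upload failed");
    });
    List<CompletableFuture<?>> futures = new ArrayList<>();
    futures.add(ok);
    futures.add(failed);
    try {
      awaitAll(futures);
      return false; // should not get here: one task failed
    } catch (ExecutionException e) {
      return e.getCause() instanceof IllegalStateException;
    } catch (InterruptedException | TimeoutException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("failure bubbled up: " + demo());
  }
}
```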

@nishantmonu51 nishantmonu51 force-pushed the jar-upload-race branch 2 times, most recently from d457699 to 31097f5 Compare October 22, 2015 06:12
@drcrallen
Contributor

👍

{
hdfsTmpDir = File.createTempFile("hdfsClasspathSetupTest", "dir");
hdfsTmpDir.deleteOnExit();
if (!hdfsTmpDir.delete()) {
Contributor

Why is hdfsTmpDir.deleteOnExit() needed when hdfsTmpDir.delete() is ensured?
Also, it is safer to use a @Rule TemporaryFolder instead, as that will do cleanup as soon as the test is done instead of at JVM exit.

Member Author

I tried using TemporaryFolder, but we can't use it as a static field.

Contributor

hmmm, if this is to be done only once for all the tests here then we can probably make things non static and use a no argument constructor to do the setup? Is there a reason for hdfsTmpDir to be static?

Member Author

I can use a no-arg constructor, but I will still need to make the fields static since I will have to shut it down in a static AfterClass method. I think it would be weird to initialize static fields of a class in a constructor rather than in a BeforeClass method.

Contributor

ok, i see, junit unfortunately does not have anything to setup/teardown "instance" .

few fixes

delete intermediate file early

better exception handling

use static pattern instead of compiling it every time

Add retry for transient exceptions

remove usage of deprecated method.

Add test

fix imports

fix javadoc

review comment.

review comment: handle crazy snapshot naming

review comments

remove default retry count in favour of already present constant

review comment

make random intermediate and final paths.

review comment, use temporaryFolder where possible
himanshug added a commit that referenced this pull request Oct 23, 2015
Fix Race in jar upload during hadoop indexing
@himanshug himanshug merged commit 2c3753c into apache:master Oct 23, 2015
@drcrallen drcrallen deleted the jar-upload-race branch October 26, 2015 22:29
@gianm gianm modified the milestones: 0.8.3, 0.9.0 Dec 1, 2015
This was referenced Dec 1, 2015
6 participants