[BEAM-4778] add option to flink job server to clean staged artifacts per-job #5958

Merged
merged 15 commits into from Aug 2, 2018

@ryan-williams commented Jul 16, 2018
Contributor

when set, the InMemoryJobService subscribes to invoked jobs' state-changes, and when it sees them complete, removes all associated artifacts

A few incidental moves/changes:

  • expose BeamFileSystemArtifactRetrievalService.loadManifest publicly
  • moved StagingSessionToken JSON serde into methods

R: @angoenka
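
For readers skimming the thread, here is a minimal sketch of the mechanism described above: subscribe to a job's state changes and run a cleanup callback once the job reaches a terminal state. All types here (`SketchJobState`, `SketchJobInvocation`, `SketchJobService`) are hypothetical stand-ins written for illustration, not Beam's actual classes, and the real wiring in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration; these are not Beam's real classes.
enum SketchJobState { RUNNING, DONE, FAILED }

class SketchJobInvocation {
  private final List<Consumer<SketchJobState>> listeners = new ArrayList<>();

  void addStateListener(Consumer<SketchJobState> listener) {
    listeners.add(listener);
  }

  // Notifies every registered listener of the new state.
  void setState(SketchJobState state) {
    for (Consumer<SketchJobState> listener : listeners) {
      listener.accept(state);
    }
  }
}

class SketchJobService {
  private final Consumer<String> cleanupJobFn;

  SketchJobService(Consumer<String> cleanupJobFn) {
    this.cleanupJobFn = cleanupJobFn;
  }

  // Subscribes to the job's state changes and cleans up once it terminates.
  void run(SketchJobInvocation invocation, String stagingSessionToken) {
    invocation.addStateListener(state -> {
      if (state == SketchJobState.DONE || state == SketchJobState.FAILED) {
        cleanupJobFn.accept(stagingSessionToken);
      }
    });
  }
}
```

In this sketch the callback receives the staging session token, mirroring the first revision of the PR; the review below discusses what the callback should actually be given.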

@ryan-williams
Contributor Author

Looking over this again, I'm going to revert the bits that make StagingSessionToken public. I thought it was nicer to deal with the structured object in a few places instead of an opaque String, but the test usages in particular make it seem like keeping it as an opaque String is deliberate / has advantages.

@angoenka left a comment
Contributor

Thanks for taking this up.
Added a few comments.
Please add some test cases for this.

private final JobInvoker invoker;

private InMemoryJobService(
Endpoints.ApiServiceDescriptor stagingServiceDescriptor,
Function<String, String> stagingServiceTokenProvider,
ThrowingConsumer<String> cleanupJobFn,
Contributor

Let's make it a jobTerminationListener

Contributor Author

done

state.equals(JobState.Enum.DONE) || state.equals(JobState.Enum.FAILED)) {
String stagingSessionToken = stagingSessionTokens.get(preparationId);
try {
cleanupJobFn.accept(stagingSessionToken);
Contributor

Given it's a job termination/cleanup listener, we should be passing the job id or job object.

Contributor Author

done

@@ -70,26 +71,43 @@
public static InMemoryJobService create(
Endpoints.ApiServiceDescriptor stagingServiceDescriptor,
Function<String, String> stagingServiceTokenProvider,
ThrowingConsumer<String> cleanupJobFn,
Boolean cleanArtifactsPerJob,
@angoenka commented Jul 16, 2018
Contributor

We don't need the boolean. We can simply set the jobTerminationListener in previous line or not set it.
Alternatively the jobTerminationListener can have an internal check for this boolean.

Contributor Author

done

}

private final ConcurrentMap<String, JobPreparation> preparations;
private final ConcurrentMap<String, JobInvocation> invocations;
private final ConcurrentMap<String, String> stagingSessionTokens;
Contributor

We should also clean up the stagingSessionTokens on job completion.

Contributor Author

ok; I've made it so that we always clean them up on job termination, regardless of whether a cleanup hook is set.

my understanding is that we have no plans to use them beyond a job's life-span, so this falls under "not leaking memory from this map".
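
A minimal sketch of the "always drop the token on termination" behaviour described in this reply, assuming a hypothetical `onJobTerminated` method (the PR's actual method names are not shown in this thread). `ConcurrentMap.remove` returns the previous value, so one call both fetches the token and prevents the entry from leaking, whether or not a cleanup hook is configured.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;

// Hypothetical class for illustration; not the PR's actual code.
class TokenCleanupSketch {
  // preparationId -> staging session token
  final ConcurrentMap<String, String> stagingSessionTokens = new ConcurrentHashMap<>();

  // cleanupJobFn may be null when per-job artifact cleanup is disabled.
  void onJobTerminated(String preparationId, Consumer<String> cleanupJobFn) {
    // remove() drops the entry before the hook runs, so the map entry
    // is gone even if the hook is absent or throws.
    String token = stagingSessionTokens.remove(preparationId);
    if (token != null && cleanupJobFn != null) {
      cleanupJobFn.accept(token);
    }
  }
}
```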

@@ -305,7 +311,7 @@ public void onCompleted() {
* Serializable StagingSessionToken used to stage files with {@link
* BeamFileSystemArtifactStagingService}.
*/
-private static class StagingSessionToken implements Serializable {
+static class StagingSessionToken implements Serializable {
@angoenka commented Jul 16, 2018
Contributor

We should keep the StagingSessionToken private as no one else should care about its structure outside this class.

Contributor Author

done

return loadManifest(manifestResourceId);
}

public static ProxyManifest loadManifest(ResourceId manifestResourceId) throws IOException {
@angoenka commented Jul 16, 2018
Contributor

Let's keep it package-private, as BeamFileSystemArtifactRetrievalService is the only other user.

Contributor Author

done

LOG.info("Removing manifest: {}", manifestResourceId);
FileSystems.delete(Collections.singletonList(manifestResourceId));
LOG.info("Removing empty dir: {}", dir);
FileSystems.delete(Collections.singletonList(dir));
Contributor

We should check if the directory is empty or not.

Contributor Author

If it's not, currently we'll get a DirectoryNotEmptyException, which seems basically desirable?

afaict FileSystems offers no way to recursively delete, cf. BEAM-4843

for now, it seems like throwing an exception if the structure has changed from what is assumed here is the correct behavior, to me

Contributor

DirectoryNotEmptyException is good enough.
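
To illustrate the behaviour agreed on here, a small sketch using plain `java.nio.file.Files` rather than Beam's `FileSystems` (an assumption made so the example runs on its own; how each Beam filesystem reports a non-empty directory may vary). On the default JDK filesystem providers, a non-recursive delete of a non-empty directory is expected to throw `DirectoryNotEmptyException`, which surfaces any unexpected change in the staging layout. The class and file names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class for illustration; not the PR's actual code.
class EmptyDirDeleteSketch {
  // Removes the manifest and then its directory, non-recursively.
  // If anything else is still in the directory, the second delete fails.
  static void removeManifestAndDir(Path manifest, Path dir) throws IOException {
    Files.delete(manifest);
    Files.delete(dir);
  }

  // Demo: returns true if an unexpected extra file makes the removal fail loudly.
  static boolean failsOnUnexpectedFile() {
    try {
      Path dir = Files.createTempDirectory("staging-sketch");
      Path manifest = Files.createFile(dir.resolve("MANIFEST"));
      Path stray = Files.createFile(dir.resolve("unexpected-file"));
      try {
        removeManifestAndDir(manifest, dir);
        return false;
      } catch (DirectoryNotEmptyException e) {
        return true;
      } finally {
        // Tidy up the temporary files regardless of the outcome.
        Files.deleteIfExists(stray);
        Files.deleteIfExists(dir);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Note that in this sketch the manifest is already gone by the time the directory delete fails, so the failure leaves a partially cleaned directory behind.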

@ryan-williams left a comment
Contributor Author

I think I responded to everything.

I added test logic to BeamFileSystemArtifactServicesTest and InMemoryJobServiceTest that covers most of what's new here.

However, some of the plumbing for the directory-removal callback happens in FlinkJobServerDriver which doesn't have its own test atm.

I looked at adding one but it feels like it gets into IT / VR test territory a bit, or would want a TestPortableRunner, so I held off on that for now, but if you'd like me to forge ahead with one of those approaches lmk!

@ryan-williams
Contributor Author

Run Java PreCommit

@ryan-williams
Contributor Author

First Jenkins build failed with a lot of messages like:

Expiring Daemon because JVM Tenured space is exhausted

and eventually:

Build timed out (after 90 minutes). Marking the build as aborted.

Now I'm seeing the first message in the latest build as well.

Any ideas what that's about are welcome!

@ryan-williams
Contributor Author

Run Java PreCommit

@ryan-williams
Contributor Author

Same thing 3x here, not sure how to triage.

Seeing a timed out build on #6018 with a lot of OOMs but not the Tenured space is exhausted msg.

@ryan-williams
Contributor Author

Looks like this is tracked at BEAM-4847, per @lukecwik on Slack (thanks!)

Not sure if I should keep trying here or wait until it seems like there's some movement on that issue?

@angoenka left a comment
Contributor

Thanks!

configuration.cleanArtifactsPerJob ?
(String stagingSessionToken) ->
artifactStagingService.getService().removeArtifacts(stagingSessionToken)
: null,
Contributor

This will throw NullPointer on accept call if configuration.cleanArtifactsPerJob is false.

I was suggesting:

(String stagingSessionToken) -> {
  if (configuration.cleanArtifactsPerJob) {
    artifactStagingService.getService().removeArtifacts(stagingSessionToken);
  }
}

Contributor Author

> This will throw NullPointer on accept call if configuration.cleanArtifactsPerJob is false.

I don't think that's true, as written? I check cleanupJobFn != null in the only place it's referenced in InMemoryJobService.

Or maybe I'm missing what you mean?

Anyway, if you think this way is cleaner I'm almost fine with it, lmk!

Contributor Author

I changed it to what you suggested, anyway
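
The two options debated in this thread can be contrasted in a short sketch. The class and method names are hypothetical, and a plain `List` stands in for the artifact staging service. With a nullable hook every call site must null-check before calling `accept`; with a guarded hook the callback is always safe to invoke and the flag check lives inside it.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class for illustration; `removed` stands in for the staging service.
class CleanupHookSketch {
  // Option 1: null when cleanup is disabled; callers must null-check before accept().
  static Consumer<String> nullableHook(boolean cleanArtifactsPerJob, List<String> removed) {
    if (!cleanArtifactsPerJob) {
      return null;
    }
    return token -> removed.add(token);
  }

  // Option 2: never null; the flag is checked inside the callback.
  static Consumer<String> guardedHook(boolean cleanArtifactsPerJob, List<String> removed) {
    return token -> {
      if (cleanArtifactsPerJob) {
        removed.add(token);
      }
    };
  }
}
```

Both variants behave the same when the single call site null-checks, as the author notes; the guarded form mainly removes the risk of a future call site forgetting that check.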

@ryan-williams
Contributor Author

Merged in HEAD, let's see if the Java PreCommit issue has been resolved 🤞🏻

@ryan-williams
Contributor Author

whew, finally passed precommit. lmk if you want me to squash+rebase @angoenka (or if there are any other comments I missed! I think I got everything)

@angoenka
Contributor

The PR looks good.
We can simplify it a bit by reusing FlinkJobInvocation#addStateListener to listen for the termination instead of adding a separate listener.
But we can take it up later.
Otherwise, the PR looks good.

@angoenka commented Aug 1, 2018
Contributor

@ryan-williams Is the PR good for merge?
cc: @tweise

@ryan-williams
Contributor Author

I removed the terminationListener per your suggestion @angoenka; it should be ready to go!

try {
return MAPPER.writeValueAsString(this);
} catch (JsonProcessingException e) {
LOG.error("Error {} occurred while serializing {}.", e.getMessage(), this);
Contributor

Why logging vs. using StatusRuntimeException as shown below?

Contributor Author

good catch, done
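
A rough sketch of the change requested here: propagate the serialization failure instead of only logging it. To stay self-contained it uses `UncheckedIOException` as a stand-in for gRPC's `StatusRuntimeException`, and a hypothetical `Serializer` interface in place of the Jackson mapper, so it shows the pattern rather than the PR's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; not the PR's actual code.
class SerializeOrThrowSketch {
  // Stand-in for a JSON mapper whose write call throws a checked exception.
  interface Serializer {
    String write(Object value) throws IOException;
  }

  // Wraps the checked failure in an unchecked exception that keeps the cause,
  // so callers see the error instead of it being swallowed by a log call.
  static String encode(Object value, Serializer serializer) {
    try {
      return serializer.write(value);
    } catch (IOException e) {
      throw new UncheckedIOException("Unable to serialize " + value, e);
    }
  }
}
```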

} catch (JsonProcessingException e) {
LOG.error("Error {} occurred while serializing {}.", e.getMessage(), stagingSessionToken);
throw e;
LOG.info("Removing dir {}", dir);
@tweise commented Aug 2, 2018
Contributor

Do we really need this and the following logging at info level? Why not debug?

Contributor Author

good point; I made just the last one info and the rest debug, lmk if you want them all as debug

@ryan-williams left a comment
Contributor Author

thanks @tweise, I think I addressed both comments (and hopefully the CI issue)

@ryan-williams
Contributor Author

Thanks for the spotless fix; I was mistakenly only running it on beam-runners-java-fn-execution!

I see another spotless nit locally, will push it if this build fails (which it presumably will?)

@tweise commented Aug 2, 2018
Contributor

FYI this can be avoided by running ./gradlew spotlessApply prior to opening the PR.

@tweise commented Aug 2, 2018
Contributor

Run Java Precommit
