Make start of OpenShift machines parallel #8836
Conversation
@@ -28,6 +28,7 @@
 * @author Sergii Leshchenko
 */
public abstract class AbstractBootstrapper {

  private final String machineName;
  private final int bootstrappingTimeoutMinutes;
Since it's used only by one bootstrapping method (#bootstrap but not #bootstrapAsync), I would move it to the parameters of the #bootstrap method. That would make it clear that bootstrapping asynchronously is not limited by a timeout out of the box.
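For illustration, the refactoring could look roughly like this (a hypothetical sketch: the constructor and exception handling are assumed, and a plain Exception stands in for the project's InfrastructureException to keep the snippet self-contained):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public abstract class AbstractBootstrapper {

  private final String machineName;

  protected AbstractBootstrapper(String machineName) {
    this.machineName = machineName;
  }

  // The timeout now travels with the only method that enforces it, making it
  // explicit that bootstrapAsync() is not limited by a timeout out of the box.
  public void bootstrap(int bootstrappingTimeoutMinutes) throws Exception {
    try {
      bootstrapAsync().get(bootstrappingTimeoutMinutes, TimeUnit.MINUTES);
    } catch (TimeoutException e) {
      throw new Exception("Bootstrapping of machine '" + machineName + "' timed out", e);
    } catch (ExecutionException e) {
      throw new Exception(e.getCause().getMessage(), e.getCause());
    }
  }

  public abstract CompletableFuture<Void> bootstrapAsync();
}
```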
final CompletableFuture<Void> allDone =
    CompletableFuture.allOf(
        machinesFutures.toArray(new CompletableFuture[machinesFutures.size()]));
CompletableFuture.anyOf(allDone, failure).get(machineStartTimeoutMin, TimeUnit.MINUTES);
As far as I understand, this was previously used as the timeout for waiting for a single machine to become running (the Pod should be running, but installers and servers are not launched yet). Now, within this time, all machines must become running, installers must be launched, and servers must reach the RUNNING status. Please consider renaming the configuration parameter; it now effectively acts as a Kubernetes workspace start timeout.
Looks very interesting, but requires careful review. Good job!
@@ -113,4 +114,11 @@ public void waitRunning(int timeoutMin) throws InfrastructureException {
      timeoutMin,
      p -> (KUBERNETES_POD_STATUS_RUNNING.equals(p.getStatus().getPhase())));
}

public CompletableFuture<Void> watchReadinessAsync() {
Do we need to ensure that we won't have connection leaks because of that? @akorneta @sleshchenko WDYT?
I'll write a doc comment for this method describing how it should be used, to prevent leaks.
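For example, the usage pattern such a doc could prescribe, mirroring the toCancelFutures list quoted below (a sketch with placeholder names, not the PR's actual code):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class CancelAbandonedFuturesSketch {

  static void awaitAll(List<CompletableFuture<Void>> toCancelFutures) {
    try {
      CompletableFuture.allOf(toCancelFutures.toArray(new CompletableFuture[0])).join();
    } catch (RuntimeException e) {
      // Cancel whatever has not reached a final state so the underlying
      // watches are released; cancel() is a no-op for completed futures.
      for (CompletableFuture<Void> future : toCancelFutures) {
        future.cancel(true);
      }
      throw e;
    }
  }
}
```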
toCancelFutures.add(machineRunningFuture);
final CompletableFuture<Void> machineBootChain =
    machineRunningFuture
        .thenComposeAsync(checkFailure(failure), executor)
The executor used here creates non-daemon threads, but in this case it seems to me that daemon threads would be a better fit. WDYT?
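For instance (a sketch assuming Guava's ThreadFactoryBuilder is available; the pool size and thread-name format are placeholders):

```java
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DaemonExecutorSketch {
  // Daemon threads won't keep the JVM alive if a boot chain is still blocked.
  static final ExecutorService EXECUTOR =
      Executors.newFixedThreadPool(
          10, // placeholder pool size
          new ThreadFactoryBuilder()
              .setNameFormat("MachineBoot-%d") // placeholder name format
              .setDaemon(true)
              .build());
}
```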
Makes sense, I'll change that.
try {
  final PodResource<Pod, DoneablePod> podResource =
      clientFactory.create().pods().inNamespace(namespace).withName(name);
  final Watch watch =
Do we need to ensure that we won't have connection leaks because of that? @akorneta @sleshchenko WDYT?
I think there should not be any problems with WebSocket connection leaks, because the method's documentation indicates that the future must be explicitly cancelled if it is no longer needed and has not reached a final state.
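One way to back that contract up structurally (a sketch, not the PR's exact code) is to tie the watch lifecycle to the returned future, so that completion, failure, and explicit cancellation all close the connection:

```java
import java.util.concurrent.CompletableFuture;

final class WatchLifecycleSketch {

  // Minimal stand-in for the Kubernetes client's Watch resource.
  interface Watch extends AutoCloseable {
    @Override
    void close();
  }

  static CompletableFuture<Void> watchReadinessAsync(Watch watch) {
    CompletableFuture<Void> readiness = new CompletableFuture<>();
    // whenComplete fires on normal completion, exceptional completion, and
    // cancellation alike, so a cancelled future still closes the watch.
    readiness.whenComplete((ok, err) -> watch.close());
    return readiness;
  }
}
```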
@@ -33,8 +32,7 @@ public MachineTokenProviderImpl(MachineTokenRegistry tokenRegistry) {
}

@Override
public String getToken(String workspaceId) {
I would rather leave this method here and call getToken() with two params within it, instead of adding EnvironmentContext.getCurrent()... in 15 other places.
Yes, it makes sense. The method was removed to force developers to explicitly call EnvironmentContext.getCurrent() in cases where they are sure the user won't be anonymous, but I think it might be reworked in a better way, so I'll consider bringing this method back and adding a check (that the user is not anonymous) before token retrieval.
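A sketch of what the restored method might look like (the two-parameter order, the anonymity check, and the exception type are all assumptions, not the final code):

```java
import org.eclipse.che.commons.env.EnvironmentContext;
import org.eclipse.che.commons.subject.Subject;

abstract class MachineTokenProviderSketch {

  // Existing two-parameter variant (assumed parameter order).
  abstract String getToken(String userId, String workspaceId);

  // Restored convenience overload: resolves the user once here instead of
  // making 15 call sites fetch the current subject themselves.
  String getToken(String workspaceId) {
    Subject subject = EnvironmentContext.getCurrent().getSubject();
    if (subject == Subject.ANONYMOUS) { // assumed anonymity check
      throw new IllegalStateException(
          "Machine token cannot be retrieved for an anonymous user");
    }
    return getToken(subject.getUserId(), workspaceId);
  }
}
```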
Good job! 👍
It's a mega cool feature, especially for agents that are launched as parallel machines.
Cool PR! Looks good.
What does this PR do?
Parallelizes the start of OpenShift machines.
Schematically, the machine boot process is described by the steps s1–s7 below.
The failure context is a piece of state shared across all machine threads in the scope of one workspace; it is used as an indicator of an error that might appear during the start. This context holds only the first exception that was thrown (and it is visible to all the threads).
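In CompletableFuture terms, "holds only the first exception" can be modelled with a single shared future, since completeExceptionally() succeeds only for the first caller (a minimal illustration, not the PR's exact code):

```java
import java.util.concurrent.CompletableFuture;

final class FailureContextDemo {
  public static void main(String[] args) {
    CompletableFuture<Void> failure = new CompletableFuture<>();
    failure.completeExceptionally(new RuntimeException("machine A failed")); // recorded
    failure.completeExceptionally(new RuntimeException("machine B failed")); // ignored
    // Every machine thread observing the context sees the first exception.
    failure.whenComplete((ok, err) -> System.out.println(err.getMessage())); // machine A failed
  }
}
```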
s1. Creates an OpenShift pod watcher that listens to pod events; if the state satisfies the predicate, this step is considered finished. If the connection is closed or the listening is cancelled, this step is considered failed;
s2, s4, s6. These steps break the chain of invocation if there is an exception in the failure context (so if one machine fails, we prevent the launching of the other machines);
s3. Sets the machine's running status and propagates the machine-running event through the event service;
s5. Asynchronously starts bootstrapping of installers and listens to bootstrapper status events; when an event with status done is received, this step is considered finished; if the received event has status failed, or listening is cancelled, this step is considered failed;
s7. Asynchronously starts checking the readiness of the machine's servers; when the servers are available, this step is considered finished; when this step is cancelled, it is considered failed.
A sketch of how these steps compose into one machine's boot chain is shown below.
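In this sketch, watchRunningAsync, bootstrapAsync, and checkServersReadiness are hypothetical placeholders for the real infrastructure calls quoted in the diffs above; only the composition follows the description:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class MachineBootChainSketch {

  static CompletableFuture<Void> bootMachine(CompletableFuture<Void> failure, Executor executor) {
    return watchRunningAsync()                                // s1: pod becomes RUNNING
        .thenComposeAsync(checkFailure(failure), executor)    // s2: stop if another machine failed
        .thenRun(MachineBootChainSketch::publishRunningEvent) // s3: status + event
        .thenComposeAsync(checkFailure(failure), executor)    // s4
        .thenCompose(v -> bootstrapAsync())                   // s5: installers bootstrap
        .thenComposeAsync(checkFailure(failure), executor)    // s6
        .thenCompose(v -> checkServersReadiness());           // s7: servers readiness
  }

  // Breaks the chain once the shared failure context holds an exception.
  static Function<Void, CompletableFuture<Void>> checkFailure(CompletableFuture<Void> failure) {
    return ignored ->
        failure.isCompletedExceptionally()
            ? failure // already failed: propagate the recorded exception
            : CompletableFuture.completedFuture(null);
  }

  // Placeholder steps; the real ones watch pods, run installers, ping servers.
  static CompletableFuture<Void> watchRunningAsync() {
    return CompletableFuture.completedFuture(null);
  }

  static CompletableFuture<Void> bootstrapAsync() {
    return CompletableFuture.completedFuture(null);
  }

  static CompletableFuture<Void> checkServersReadiness() {
    return CompletableFuture.completedFuture(null);
  }

  static void publishRunningEvent() {}

  public static void main(String[] args) throws Exception {
    CompletableFuture<Void> failure = new CompletableFuture<>();
    List<CompletableFuture<Void>> machines = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      machines.add(bootMachine(failure, Runnable::run));
    }
    CompletableFuture<Void> allDone =
        CompletableFuture.allOf(machines.toArray(new CompletableFuture[0]));
    // Fail-fast overall wait, mirroring the anyOf(allDone, failure) fragment above.
    CompletableFuture.anyOf(allDone, failure).get(8, TimeUnit.MINUTES); // placeholder timeout
  }
}
```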
What issues does this PR fix or reference?
fixes #7067
Release Notes
n/a