Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make start of OpenShift machines parallel #8836

Merged
merged 1 commit into from
Feb 26, 2018

Conversation

akorneta
Copy link
Contributor

@akorneta akorneta commented Feb 19, 2018

What does this PR do?

Parallelizes the start of OpenShift machines.

Schematically the process of machine boot described in the following picture:

parallel start of machines_

failure context its a state that shared overall machine threads in the scope of one workspace, it's used as an indicator of error that might appear in process of the start. This context holds only one exception that was first thrown(and its is visible for all the threads).

s1. Creates OpenShift pod watcher that listens to pod events and if the state satisfies predicate then this step is considered as finished. If the connection is closed or the listening is cancelled then this step is considered as failed;
s2, s4, s6 These steps are used to break the chain of invocation if there is an exception in the failure context(so if one of machine is failed we need to prevent the launching of other machines);
s3. Sets machine running status and propagates machine running event through event service.
s5. Asynchronously starts bootstrapping of installers and listen to bootstrapper status events, when an event with status done received then this step is considered as finished if the received event has status failed or listening is cancelled then this step is considered as failed.
s7. Asynchronously starts checking the readiness of servers of a machine, when servers are available then this step is considered as finished when this step is cancelled then it is considered as failed.

par_boot

What issues does this PR fix or reference?

fixes #7067

Release Notes

n/a

@akorneta akorneta added status/in-progress This issue has been taken by an engineer and is under active development. kind/task Internal things, technical debt, and to-do tasks to be performed. labels Feb 19, 2018
@akorneta akorneta self-assigned this Feb 19, 2018
@akorneta
Copy link
Contributor Author

ci-test-ocp

@akorneta akorneta changed the title Make start of OpenShift machines parallel [WIP] - Make start of OpenShift machines parallel Feb 19, 2018
@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@@ -28,6 +28,7 @@
* @author Sergii Leshchenko
*/
public abstract class AbstractBootstrapper {

private final String machineName;
private final int bootstrappingTimeoutMinutes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's used only by one bootstrapping method (#bootstrap but not #bootstrapAsync), I would move it to parameters of the #bootstrap method. It'd make clear that fact that bootstrap async is not limited by time out of the box.

final CompletableFuture<Void> allDone =
CompletableFuture.allOf(
machinesFutures.toArray(new CompletableFuture[machinesFutures.size()]));
CompletableFuture.anyOf(allDone, failure).get(machineStartTimeoutMin, TimeUnit.MINUTES);
Copy link
Member

@sleshchenko sleshchenko Feb 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand previously it was used for setting a timeout of waiting for one machine to become running (Pod should be running but installers and servers are not launched yet). Now within this time all machines must become running, installers must be launched and servers must become RUNNING. Please consider renaming configuration parameter. Now it looks like Kubernetes workspace start timeout.

Copy link

@garagatyi garagatyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very interesting, but requires careful review. Good job!

@@ -113,4 +114,11 @@ public void waitRunning(int timeoutMin) throws InfrastructureException {
timeoutMin,
p -> (KUBERNETES_POD_STATUS_RUNNING.equals(p.getStatus().getPhase())));
}

public CompletableFuture<Void> watchReadinessAsync() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to ensure that we won't have connection leaks because of that? @akorneta @sleshchenko WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write doc for this method and will describe how this method might be used, to prevent leaks.

toCancelFutures.add(machineRunningFuture);
final CompletableFuture<Void> machineBootChain =
machineRunningFuture
.thenComposeAsync(checkFailure(failure), executor)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used executor uses non-daemon threads. But in this case, it seems to me that daemon threads would be a better fit, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, I'll change that.

try {
final PodResource<Pod, DoneablePod> podResource =
clientFactory.create().pods().inNamespace(namespace).withName(name);
final Watch watch =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to ensure that we won't have connection leaks because of that? @akorneta @sleshchenko WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there should not be any problems with WebSocket connection leaks because the method documents indicate that the future must be explicitly cancelled if it is not needed and has not reached the final state.

@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta akorneta force-pushed the CHE-7067 branch 2 times, most recently from c0af867 to 6de9531 Compare February 20, 2018 16:39
@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta
Copy link
Contributor Author

ci-test-ocp

@akorneta akorneta force-pushed the CHE-7067 branch 2 times, most recently from ce789d8 to 9c72a6e Compare February 22, 2018 11:49
@riuvshin
Copy link
Contributor

"ПУШКА"

@akorneta akorneta changed the title [WIP] - Make start of OpenShift machines parallel Make start of OpenShift machines parallel Feb 22, 2018
@akorneta akorneta added status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. and removed status/in-progress This issue has been taken by an engineer and is under active development. labels Feb 22, 2018
@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta akorneta force-pushed the CHE-7067 branch 2 times, most recently from ffacfaf to e71d4a0 Compare February 23, 2018 08:32
@@ -33,8 +32,7 @@ public MachineTokenProviderImpl(MachineTokenRegistry tokenRegistry) {
}

@Override
public String getToken(String workspaceId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather left this method here and call getToken() with two params within it instead of adding
EnvironmentContext.getCurrent()... in 15 other places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it makes sense. The reason why the method was removed is to force developers to explicitly call EnvironmentContext.getCurrent() in cases where they sure that user won't be anonymous, but I think it might be reworked in a better way, so I'll consider to back this method and adding a check(that user is not anonymous) before token retrieval.

@akorneta
Copy link
Contributor Author

ci-test

Copy link
Member

@sleshchenko sleshchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! 👍
It's mega cool feature, especially for agents that are launched as parallel machines.

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta
Copy link
Contributor Author

ci-test

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@akorneta
Copy link
Contributor Author

ci-test-ocp

@codenvy-ci
Copy link

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:8836
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

Copy link

@garagatyi garagatyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool PR! Looks good.

@akorneta akorneta merged commit 806a6da into eclipse-che:master Feb 26, 2018
@akorneta akorneta deleted the CHE-7067 branch February 26, 2018 09:53
@akorneta akorneta removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Feb 26, 2018
@benoitf benoitf added this to the 6.2.0 milestone Feb 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make bootstrapping of OpenShift machines parallel
8 participants