Added an ability to interrupt the start of a Kubernetes/OpenShift runtime #9816

Merged
merged 4 commits into from
Jun 7, 2018

Conversation

@sleshchenko sleshchenko commented May 25, 2018

What does this PR do?

Adds the ability to interrupt the start of a Kubernetes/OpenShift runtime.

This is done by adding a utility class, StartSynchronizer, designed to synchronize start/stop operations. Synchronization is performed by registering all start/stop events (start is launched, an error occurs during start, start is interrupted, etc.).
KubernetesInternalRuntime registers when the runtime is starting, then checks for a failure at each phase of the start and interrupts the start if a failure is registered.
In turn, when the runtime is being stopped, KubernetesInternalRuntime checks whether a start is in progress; if so, a start failure is registered. The thread performing the start then receives the registered exception and interrupts the start.
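
The register-and-check flow described above can be sketched roughly as follows. This is a simplified illustration, not the StartSynchronizer from this PR: the names `StartSyncSketch`, `StartInterruptedSketchException`, `StartSyncDemo` and all method names are assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical exception type used only in this sketch. */
class StartInterruptedSketchException extends Exception {
  StartInterruptedSketchException(String message) {
    super(message);
  }
}

/** Simplified, hypothetical stand-in for the synchronizer idea; not the PR's class. */
class StartSyncSketch {
  private boolean starting;
  private Exception failure;

  /** Registers that a start is in progress. */
  synchronized void start() {
    starting = true;
    failure = null;
  }

  /** Called by the starting thread between phases; rethrows a registered failure. */
  synchronized void checkFailure() throws Exception {
    if (failure != null) {
      throw failure;
    }
  }

  /** Called on stop; registers a failure only if a start is in progress. */
  synchronized boolean interrupt() {
    if (!starting) {
      return false;
    }
    if (failure == null) {
      failure = new StartInterruptedSketchException("Runtime start was interrupted");
    }
    return true;
  }

  /** Registers that the start has finished, successfully or not. */
  synchronized void complete() {
    starting = false;
  }
}

class StartSyncDemo {
  /** Runs a two-phase start that a "stop" interrupts between the phases. */
  static String runInterruptedStart() {
    StartSyncSketch sync = new StartSyncSketch();
    CountDownLatch firstPhaseDone = new CountDownLatch(1);
    CountDownLatch stopRegistered = new CountDownLatch(1);
    String[] outcome = new String[1];

    Thread starter = new Thread(() -> {
      sync.start();
      try {
        sync.checkFailure(); // phase 1: nothing registered yet
        firstPhaseDone.countDown();
        stopRegistered.await(); // wait until the stopping side has registered a failure
        sync.checkFailure(); // phase 2: throws the registered exception
        outcome[0] = "started";
      } catch (Exception e) {
        outcome[0] = "interrupted: " + e.getMessage();
      } finally {
        sync.complete();
      }
    });
    starter.start();
    try {
      firstPhaseDone.await();
      sync.interrupt(); // the stopping thread sees a start in progress
      stopRegistered.countDown();
      starter.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return outcome[0];
  }

  public static void main(String[] args) {
    System.out.println(runInterruptedStart());
  }
}
```

Running `StartSyncDemo` prints `interrupted: Runtime start was interrupted`: the second phase check picks up the failure that the stopping side registered.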

It also works in cluster mode, where two Che Servers can run at the same time during a Rolling Update. Synchronization between Che Servers is done using JGroups Replicated Cache listeners: each Che Server receives the corresponding event when a runtime stop is launched or completed.
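
As a rough picture of the listener idea only, the sketch below uses a plain in-memory stand-in instead of a JGroups replicated cache. Everything in it (`ClusterEventsSketch`, `ReplicatedStatuses`, `StatusListener`, the `Status` enum, the workspace id `ws1`) is hypothetical and does not reflect the JGroups API or the PR's classes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ClusterEventsSketch {
  enum Status { STARTING, RUNNING, STOPPING, STOPPED }

  /** Hypothetical listener notified when any server changes a runtime's status. */
  interface StatusListener {
    void onStatusChanged(String workspaceId, Status status);
  }

  /** In-memory stand-in for a replicated cache: every listener sees every update. */
  static class ReplicatedStatuses {
    private final List<StatusListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(StatusListener listener) {
      listeners.add(listener);
    }

    void put(String workspaceId, Status status) {
      for (StatusListener listener : listeners) {
        listener.onStatusChanged(workspaceId, status);
      }
    }
  }

  /** Server A is starting "ws1"; server B launches a stop of the same runtime. */
  static boolean demo() {
    ReplicatedStatuses statuses = new ReplicatedStatuses();
    boolean[] startInterrupted = {false};

    // Server A: react to a stop launched elsewhere by marking its own start as interrupted.
    statuses.addListener((workspaceId, status) -> {
      if (workspaceId.equals("ws1") && status == Status.STOPPING) {
        startInterrupted[0] = true;
      }
    });

    // Server B: launching the stop updates the shared status.
    statuses.put("ws1", Status.STOPPING);
    return startInterrupted[0];
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```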

There are 3 more changes:

  1. Rework UnrecoverableEventHandler to fail the workspace start instead of interrupting it (earlier ), since that ability now exists.

  2. Improve command execution in the KubernetesPods class so that it throws if an error occurs while a WebSocket connection is being established.

  3. KubernetesBootstrapper now checks for start interruption before executing each command. Without that, commands are much more likely to fail with the following error:

2018-06-01 15:33:21,204[20.200:8443/...]  [ERROR] [.k.c.d.i.ExecWebSocketListener 213]  - Exec Failure: HTTP:403. Message:pods "non-existing" is forbidden: pods "non-existing" not found
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This in turn leads to a "connection was leaked" warning:

Jun 01, 2018 3:01:42 PM okhttp3.internal.platform.Platform log
WARNING: A connection to https://172.19.20.200:8443/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);
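
The check described in item 3 can be pictured with a small sketch. It is only an illustration under assumptions: `BootstrapSketch`, `runCommands` and the flag-based interruption check are hypothetical, and the actual KubernetesBootstrapper may signal interruption differently (for example by throwing).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class BootstrapSketch {
  /**
   * Runs commands in order, checking for start interruption before each one.
   * Returns how many commands were actually executed.
   */
  static int runCommands(List<Runnable> commands, BooleanSupplier startInterrupted) {
    int executed = 0;
    for (Runnable command : commands) {
      if (startInterrupted.getAsBoolean()) {
        break; // the pod may already be gone, so do not attempt another exec
      }
      command.run();
      executed++;
    }
    return executed;
  }

  /** The second command simulates a stop arriving mid-bootstrap; the third must not run. */
  static int demo() {
    AtomicBoolean interrupted = new AtomicBoolean(false);
    List<Runnable> commands = new ArrayList<>();
    commands.add(() -> {});
    commands.add(() -> interrupted.set(true));
    commands.add(() -> {
      throw new IllegalStateException("must not run after interruption");
    });
    return runCommands(commands, interrupted::get);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Here `demo()` returns 2: the third command is skipped because the interruption is noticed before it is executed.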

What issues does this PR fix or reference?

#5918

It also fixes #9905

Release Notes

Added an ability to interrupt the start of a Kubernetes/OpenShift runtime.

Docs PR

@sleshchenko sleshchenko added kind/enhancement A feature request - must adhere to the feature request template. status/in-progress This issue has been taken by an engineer and is under active development. labels May 25, 2018
@sleshchenko sleshchenko self-assigned this May 25, 2018
@sleshchenko sleshchenko force-pushed the k8sStartInterruption branch 4 times, most recently from a7f0170 to c00881b Compare May 31, 2018 15:03
@sleshchenko
Member Author

ci-test

@sleshchenko sleshchenko force-pushed the k8sStartInterruption branch 2 times, most recently from afbca11 to ebfda55 Compare May 31, 2018 15:41
@codenvy-ci

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:9816
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@sleshchenko sleshchenko added status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. and removed status/in-progress This issue has been taken by an engineer and is under active development. labels Jun 1, 2018
@sleshchenko sleshchenko changed the title [WIP] Added an ability to interrupt the start of a Kubernetes/OpenShift runtime Added an ability to interrupt the start of a Kubernetes/OpenShift runtime Jun 1, 2018
@@ -139,22 +140,30 @@ public KubernetesInternalRuntime(
this.executor = sharedPool.getExecutor();
this.runtimeStates = runtimeStates;
this.machines = machines;
this.startSynchronizer = new StartSynchronizer(context.getIdentity(), eventService);

Side note: looks like event service is just a dependency of StartSynchronizer and KubernetesInternalRuntime doesn't have to have this dependency too. So it might be cleaner to create some factory for this class and hide its dependencies.

Member Author

I thought about it, but while there is only one injected field, creating one more class looks like overkill. I would prefer to leave it as it is for now.

@garagatyi garagatyi left a comment

Overall looks good.
I have concerns about the complexity of the KubernetesInternalRuntime and WorkspaceRuntimes classes, but that is a general concern about their design rather than about this PR. I encourage you to simplify them in future PRs, and I will try to do the same.
I also have some questions/comments about this PR, which I have added inline. Please elaborate.

import org.eclipse.che.api.core.model.workspace.WorkspaceStatus;

/**
* Listener interface for being notified about workspace status changes.

Can you specify in javadocs why it might be used instead of subscribing to events from EventsService?

Member Author

Added note about clustering.

} catch (ExecutionException e) {
throw new InfrastructureException(
"Error occured while executing command in pod: " + e.getMessage(), e);
} catch (TimeoutException ignored) {

This exception is not ignored, so can you fix the variable name?

Member Author

Not ignored, but not used =) Renamed to e.
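
For context, the catch blocks under discussion follow the usual pattern for waiting on a future with a timeout. The sketch below is an assumption-laden illustration, not the KubernetesPods code: `ExecWaitSketch`, `InfraException` (a stand-in for InfrastructureException), `await` and the timeout message are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ExecWaitSketch {
  /** Hypothetical stand-in for InfrastructureException. */
  static class InfraException extends Exception {
    InfraException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Waits for an exec to finish, converting every failure mode into InfraException. */
  static void await(CompletableFuture<Void> execFuture, long timeoutMillis)
      throws InfraException {
    try {
      execFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (ExecutionException e) {
      // The future was completed exceptionally, e.g. the WebSocket handshake failed.
      throw new InfraException(
          "Error occurred while executing command in pod: " + e.getCause().getMessage(), e);
    } catch (TimeoutException e) {
      // The exception is used as the cause, so it is named rather than "ignored".
      throw new InfraException("Timeout reached while executing command in pod", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
      throw new InfraException("Interrupted while executing command in pod", e);
    }
  }

  /** Small helper for trying the three outcomes. */
  static String describe(CompletableFuture<Void> execFuture, long timeoutMillis) {
    try {
      await(execFuture, timeoutMillis);
      return "ok";
    } catch (InfraException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Void> failed = new CompletableFuture<>();
    failed.completeExceptionally(new java.net.ProtocolException("Expected HTTP 101 response"));
    System.out.println(describe(failed, 100));
    System.out.println(describe(new CompletableFuture<>(), 10));
  }
}
```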

*
* @throws InternalInfrastructureException when {@link #startThread} is already set.
*/
public synchronized void setStartThread() throws InternalInfrastructureException {

You could fold this operation into the start method.

Member Author

I'm afraid I can't, because the start method is invoked in two places, #internalStart and #markStopping, while #setStartThread is invoked only by the #internalStart method.


/** Registers a runtime start. */
public synchronized void start() {
if (!isStarting) {

Can you elaborate on what happens if two servers try to start a runtime simultaneously? Is there protection against that at this or some other level?

Member Author

The WorkspaceRuntimes class should synchronize invocations of InternalRuntime#start.
So two servers must not start a runtime simultaneously.

It is possible, though, that two threads invoke the StartSynchronizer#start method on start interruption, because the start method is invoked in two places: #internalStart and #markStopping.
In that case the first call sets the state to STARTING and performs the corresponding actions, and the second call does nothing.
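
The "first call acts, second call does nothing" behaviour can be sketched like this. It is a minimal illustration with hypothetical names (`IdempotentStartSketch`, `startActionsPerformed`), modelled loosely on the `if (!isStarting)` guard quoted in the diff above rather than copied from the PR.

```java
class IdempotentStartSketch {
  private boolean isStarting;
  private int startActionsPerformed;

  /** Registers a start; only the first caller performs the start actions. */
  synchronized boolean start() {
    if (isStarting) {
      return false; // a start is already registered, nothing to do
    }
    isStarting = true;
    startActionsPerformed++; // stands in for "the corresponding actions"
    return true;
  }

  synchronized int startActionsPerformed() {
    return startActionsPerformed;
  }

  /** Two threads race to register the start, as #internalStart and #markStopping might. */
  static int demo() {
    IdempotentStartSketch sync = new IdempotentStartSketch();
    Thread first = new Thread(sync::start);
    Thread second = new Thread(sync::start);
    first.start();
    second.start();
    try {
      first.join();
      second.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sync.startActionsPerformed();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Whichever thread wins the race, `demo()` returns 1, because the guard and the state change happen under the same monitor.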

* @throws RuntimeStartInterruptedException when start was interrupted
* @throws InfrastructureException when any other exception occurs during start
*/
public synchronized void complete() throws InfrastructureException {

I see that the methods are synchronized, but the fields might not be synced between threads' caches. WDYT?

Member Author

As far as I understand, using synchronized methods ensures that shared data is synchronized between threads. Am I wrong?

But those fields are not shared between threads. They can be cached by each thread separately.
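
On the visibility question, the Java Memory Model specifies that releasing a monitor happens-before every later acquisition of the same monitor, so fields that are read and written only inside synchronized methods of one object are visible across threads without being volatile. A minimal sketch of that guarantee, with hypothetical names (`VisibilitySketch`, `publish`, `isDone`):

```java
class VisibilitySketch {
  private boolean done; // deliberately not volatile
  private int value; // deliberately not volatile

  /** Writes both fields while holding the monitor of this object. */
  synchronized void publish(int newValue) {
    value = newValue;
    done = true;
  }

  /** Reads under the same monitor, so it observes writes made before the last unlock. */
  synchronized boolean isDone() {
    return done;
  }

  synchronized int value() {
    return value;
  }

  /** A reader spins on isDone() until the writer thread has published. */
  static int demo() {
    VisibilitySketch shared = new VisibilitySketch();
    Thread writer = new Thread(() -> shared.publish(42));
    writer.start();
    while (!shared.isDone()) {
      Thread.yield();
    }
    return shared.value();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The loop terminates and `demo()` returns 42 because every read goes through a synchronized method. If `isDone()` were not synchronized and `done` were not volatile, the reader would have no such guarantee.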

@@ -106,6 +105,7 @@
private final Executor executor;
private final KubernetesRuntimeStateCache runtimeStates;
private final KubernetesMachineCache machines;
private final StartSynchronizer startSynchronizer;

@Inject
public KubernetesInternalRuntime(

Seems OK but, to be honest, I can no longer validate the logic of this class as it has become too complicated. I encourage you, and myself, to refactor this class as soon as possible to simplify it drastically.

* @author Sergii Leshchenko
*/
@Listeners(MockitoTestNGListener.class)
public class StartSynchronizerTest {

As far as I understand these tests don't check multi-threaded use cases that are covered by the StartSynchronizer design.

Member Author

You're right

@garagatyi garagatyi left a comment

LGTM

@sleshchenko sleshchenko force-pushed the k8sStartInterruption branch 3 times, most recently from 5956e1f to a1bbeb4 Compare June 6, 2018 09:32
@sleshchenko
Member Author

ci-test

Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
This is needed to avoid '403 Pod doesn't exist' errors.
They happen when start is interrupted while any of the machines is in the bootstrapping phase.
As a result, a connection leak happens. TODO: Create an issue for fabric8-client
@codenvy-ci

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:9816
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@sleshchenko
Member Author

sleshchenko commented Jun 7, 2018

@eclipse/eclipse-che-qa Please check test report.

@vkuznyetsov vkuznyetsov mentioned this pull request Jun 7, 2018
33 tasks
@Ohrimenko1988
Contributor

No new regressions were found in the functionality covered by the selenium tests.

@sleshchenko sleshchenko merged commit 8549d95 into eclipse-che:master Jun 7, 2018
@sleshchenko sleshchenko deleted the k8sStartInterruption branch June 7, 2018 12:03
@benoitf benoitf removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Jun 7, 2018
@benoitf benoitf added this to the 6.7.0 milestone Jun 7, 2018
Labels
kind/enhancement A feature request - must adhere to the feature request template.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workspace can not be deleted after using not accessible image
7 participants