Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recreate http client on resume() #4185

Merged
merged 3 commits into from Jan 21, 2019

Conversation

tysonnorris
Copy link
Contributor

In testing concurrent actions (multiple activations in same container) we noticed that when a container is used after Container.suspend(), the http connection may be re-created multiple times (and may have side affects of losing connections each time a new one is created.

Description

This PR addresses this by:

  • using a separate function for creating the connection - this will be invoked first during /init
  • same function will be invoked again during Container.resume() (which means that subclasses must invoke super.resume())

This way the subsequent + multiple calls to /run will arrive after connection has already been recreated.
I extended the ContainerProxyTests to assert connection state (exists or not) after suspend() and after resume().

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-io
Copy link

codecov-io commented Dec 18, 2018

Codecov Report

Merging #4185 into master will decrease coverage by 4.85%.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4185      +/-   ##
==========================================
- Coverage   85.93%   81.08%   -4.86%     
==========================================
  Files         152      152              
  Lines        7304     7312       +8     
  Branches      484      480       -4     
==========================================
- Hits         6277     5929     -348     
- Misses       1027     1383     +356
Impacted Files Coverage Δ
...la/org/apache/openwhisk/core/mesos/MesosTask.scala 73.8% <0%> (ø) ⬆️
...sk/core/containerpool/docker/DockerContainer.scala 96.2% <100%> (ø) ⬆️
...containerpool/kubernetes/KubernetesContainer.scala 94.73% <100%> (ø) ⬆️
...pache/openwhisk/core/containerpool/Container.scala 88.46% <91.66%> (+6.77%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.54%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.6%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-83.34%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 0% <0%> (-62.5%) ⬇️
...in/scala/org/apache/openwhisk/common/Counter.scala 40% <0%> (-20%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb96c21...80c3519. Read the comment docs.

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general

Left 2 nits. We should explicitly document that this trait is not thread-safe.

@@ -1078,12 +1089,20 @@ class ContainerProxyTests

def runCount = atomicRunCount.get()
override def suspend()(implicit transid: TransactionId): Future[Unit] = {
println("suspending!!!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove that extranous logline.

}
def resume()(implicit transid: TransactionId): Future[Unit] = {
override def resume()(implicit transid: TransactionId): Future[Unit] = {
println("resuming!!!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -73,6 +73,10 @@ trait Container {
/** HTTP connection to the container, will be lazily established by callContainer */
protected var httpConnection: Option[ContainerClient] = None

/** maxConcurrent+timeout are cached during first init, so that resuming connections can reference */
protected var containerHttpMaxConcurrent: Int = 1
protected var containerHttpTimeout: FiniteDuration = 60.seconds
Copy link
Contributor

@markusthoemmes markusthoemmes Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was the case before, but we should add a comment to the trait stating that it's not thread-safe and the caller MUST ensure synchronization (which we do by using an actor)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markusthoemmes Hello. I'm a newbie at scala and having a hard time reading the source code. Can you please explain what thread-safe means above, and why resume() (which I'm guessing that it somehow calls "docker container resume") is not thread safe??

@rabbah rabbah added invoker review Review for this PR has been requested and yet needs to be done. and removed invoker labels Jan 5, 2019
@tysonnorris
Copy link
Contributor Author

@markusthoemmes any other comments?

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo some nits.

@@ -207,6 +207,7 @@ class MesosTask(override protected val id: ContainerId,
override def resume()(implicit transid: TransactionId): Future[Unit] = {
// resume not supported
Future.successful(Unit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code, you can just omit the Future.successful call.

@@ -170,6 +173,7 @@ class ContainerProxyTests
activation.annotations.get("limits") shouldBe Some(a.limits.toJson)
activation.annotations.get("path") shouldBe Some(a.fullyQualifiedName(false).toString.toJson)
activation.annotations.get("kind") shouldBe Some(a.exec.kind.toJson)
system.log.info(s"acking ${activation.activationId}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these loglines needed? (Same above).

@markusthoemmes markusthoemmes added the reviewed Review for this PR is finished. It is mergeable from a review's perspective. label Jan 8, 2019
@tysonnorris
Copy link
Contributor Author

@markusthoemmes I think this is ready now

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment on the test, then it's LGTM.

Future.successful(())
val r = super.resume()
//verify that httpconn is recreated
httpConnection should be('defined)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please await the futures of either resume and suspend before checking that the client is there. Right now this is not an issue but once the implementation of the trait changes we're prone to heisenbugs.

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I only caught this now... If I get time I'll throw a commit on there myself changing these bits to flatMaps.

@@ -104,7 +104,8 @@ class KubernetesContainer(protected[core] val id: ContainerId,
super.suspend().flatMap(_ => kubernetes.suspend(this))
}

def resume()(implicit transid: TransactionId): Future[Unit] = kubernetes.resume(this)
override def resume()(implicit transid: TransactionId): Future[Unit] =
kubernetes.resume(this).map(_ => super.resume())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a flatMap to also "wait" for the future returned by super.resume.

def resume()(implicit transid: TransactionId): Future[Unit] =
if (useRunc) { runc.resume(id) } else { docker.unpause(id) }
override def resume()(implicit transid: TransactionId): Future[Unit] = {
if (useRunc) { runc.resume(id) } else { docker.unpause(id) }.map(_ => super.resume())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flatMap

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the map() really run on the full if expression or just the else block?

Copy link
Member

@rabbah rabbah Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when in doubt, refactor...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question @sven-lange-last. Better safe than sorry and make an interim value or put parentheses around the if expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's needed, but agree it will at least be more clear. Also changed to flatMap

@tysonnorris
Copy link
Contributor Author

Thanks for the feedback, it is always welcome.

@@ -204,6 +209,17 @@ trait Container {
RunResult(Interval(started, finished), response)
}
}
private def openConnections(timeout: FiniteDuration, maxConcurrent: Int) = {
if (Container.config.akkaClient) {
Copy link
Contributor

@ddragosd ddragosd Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot why are we still maintaining the Akka Client and the Apache Client, since the latter was found in #3812 to perform poorly ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we ready to move to the akka client exclusively? there were some reservations about making sure we didn't get bit by some akka issues as in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ready 😄
FWIW, we have been using akka client exclusively in prod, with concurrency enabled, since December.
Tests are already using it as well.
I'm not sure if there are others that are interested in testing it more, but it would be great to retire the apache client IMHO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's confirm that via dev-list.

Copy link
Contributor

@ddragosd ddragosd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm approving based on previous reviews and successful tests.

@markusthoemmes markusthoemmes merged commit 2312b7d into apache:master Jan 21, 2019
@ddragosd ddragosd deleted the resume-http-client branch January 22, 2019 07:04
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
reopen connections only once, during Container.resume()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invoker review Review for this PR has been requested and yet needs to be done. reviewed Review for this PR is finished. It is mergeable from a review's perspective.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants