Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge branch 'container-ordering-feature' into dev #1904

Merged
merged 37 commits into from Feb 28, 2019

Conversation

Projects
None yet
5 participants
@ubhattacharjya
Copy link
Contributor

ubhattacharjya commented Feb 28, 2019

Summary

This introduces container ordering as a feature in ECS agent.

Testing

New integration and functional tests have been written to cover the features.

Description for the changelog

  • Feature - Explicit ordering of containers via 'dependsOn' field
  • Feature - Container-level docker stop timeouts
  • Enhancement - Shutdown order is now observed for 'dependsOn', 'links', and 'volumes'

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ubhattacharjya and others added some commits Jan 28, 2019

Merge pull request #1853 from ubhattacharjya/mergeBranch
Merge 'dev' branch into 'container-ordering' branch
engine,dockerclient: per container start/stop timeouts
* start timeout is governed by the agent and is the context timeout for
the StartContainer api call

* stop timeout is the parameter passed to StopContainer and is time the
docker daemon waits after a StopContainer call to issue a SIGKILL to the
container
Add Dependency conditions of 'Complete', 'Success' and 'Healthy'
'Success' condition on a dependency container will allow a target container to start only  when the dependency container has exitted successfully with exitcode 0.
'Complete' condition on a dependency container will allow a target container to start only  when the dependency container has exitted.
'Healthy' condition on a dependency container will allow a target container to start only  when the dependency container is reported to be healthy.
engine: adding poll function during progressTask
Prior to this commit, we only tracked state we explicitly tried to
change when the task was starting. We did not respond to the event
stream or any other source of information from Docker. This means
that when we are waiting for certain dependency conditions
("SUCCESS", "COMPLETE", or "HEALTHY") the task progression logic
does not update the agent-internal model of container state.

Since we rely on that state for determining when conditions are
met, tasks would get stuck in infinite startup loops. This change
adds a call to engine.checkTaskState(), which explicity updates
any changed container state. We only call this function if we know
that we are waiting on the aforementioned subset of dependency
conditions.

Co-authored-by: Utsa Bhattacharjya <utsa@amazon.com>
Merge pull request #1876 from petderek/container-ordering-task-sync
engine: adding poll function during progressTask
dependencygraph: Enforce shutdown order
We now apply shutdown order in any dependency case, including
dependsOn directives, links, or volumes. What this means is that
agent will now make a best effort attempt to shutdown containers
in the inverse order they were created.

For example, a container using a link for communication will wait
until the linked container has terminated before terminating
itself. Likewise, a container named in another container's
dependsOn directive will wait until that dependent container
terminates.

One note about the current implementation is that the dependencies
aren't assumed to be transitive, so if a chain exists such that:

A -> B -> C

Container "C" will shutdown before "B", but it won't check status
against container "A" explicity. If A depends on C, we expect:

A -> B -> C
     A -> C

The lack of transitive dependency logic applies is consistent with
startup order as of this commit.
functional tests: increase timeout on link/volume
The link / volume dependency tests are now affected by shutdown
order, so the tests now take longer. Previously, it would take a
max of 30s (the default docker stop timeout for agent). Now, since
the containers stop in order, it will take a max of 30s * n, where
n is the number of containers.

Increasing the test timeout is a short term fix until we have
granular start/stop timeouts plumbed through the ecs service.
dependencygraph: simplify container start logic
Instead of explicitly checking against many conditions, we now
validate that the expected condition has progressed beyond started

This mirrors prior behavior in the codebase, and reduces cyclo
complexity.
Merge pull request #1866 from petderek/container-ordering-feature
dependencygraph: Enforce shutdown order
Remove the functionality of StartTimeout as Docker API Start Timeout
The ‘StartTimeout’ now will only serve as the the time duration after which if a container has a dependency on another container and the conditions are ‘SUCCESS’, ‘HEALTHY’ and ‘COMPLETE’, then the dependency will not be resolved. For example:

•	If a container A has a dependency on container B and the condition is ‘START’, the StartTimeout for container B will roughly be the time required for it to exit successfully with exit code 0
•	If a container A has a dependency on container B and the condition is ‘COMPLETE’, the StartTimeout for container B will roughly be the time required for it to exit.
•	If a container A has a dependency on container B and the condition is ‘HEALTHY’, the StartTimeout for container B will roughly be the time required for it to emit a ‘HEALTHY’ status.

If the StartTimeout exceeds in any of the above cases, container A will not be able to transition to ‘CREATED’ status.

It effectively reverts the implementation of StartTimeout in commit: 79bd517
engine: add ordering integration tests
This is the first batch of integration tests for container ordering. The
tests handle the basic use cases for each of the conditions that
introduces new behavior into agent (HEALTHY,COMPLETE,SUCCESS).
Dependency Condition Naming change:
* "START" Dependency condition has been changed to "CREATE" as it waits for the dependency to atleast get created
* "RUNNING" Dependency Condition has been changed to "START" as it waits for the dependency to get started.
Validating that the dependency container has not timed out
Here, the time duration(StartTimeout) mentioned by the user for a container is expired or not is checked before resolving the dependency for target container.

For example,
* if a target container 'A' has dependency on container 'B' and the dependency condiiton is 'SUCCESS', then the dependency will not be resolved if B times out before exitting successfully with exit code 0.
* if a target container 'A' has dependency on container 'B' and the dependency condiiton is 'COMPLETE', then the dependency will not be resolved if B times out before exitting.
* if a target container 'A' has dependency on container 'B' and the dependency condiiton is 'HEALTHY', then the dependency will not be resolved if B times out before emtting 'Healthy' status.

The advantage of this is that the user will get to know that something is wrong with the task if the task is stuck in pending..
Merge pull request #1882 from ubhattacharjya/Naming
Dependency Condition Naming change:
Merge pull request #1880 from ubhattacharjya/ChangeStartTimeout
Remove the functionality of StartTimeout as Docker API Start Timeout
tests: make windows test run a bit faster
* Remove need to pull 'latest' server core

By removing the :latest tag from all windowsservercore containers,
we will have the tests use the container thats already baked into
the AMI.

* Remove depdency on golang and python containers

We are removing the need to use any containers other than servercore
and nanoserver. This reduces the number of downloads needed and the
number of builds that happen before the tests start running.

* Explicit timeouts on order tests

The ordering tests are broken at the moment, so we are capping them
with a fixed timeout.
engine: adding timeouts to the other order tests
The other order tests need timeouts so that the test fails fast
instead of waiting for the test runnner to timeout.
Merge pull request #1884 from ubhattacharjya/bugFix
Checking dependency resolution after timeout and successful exit check
tests: prefer cached images
This should eventually make windows tests faster to run.

Fixes a bug where task context cancel causes an infinite steady state
loop. Previously if the context expired, waitSteady() will spin forever
since the timeout no longer works. This introduces a check for context
expiration earlier in the code.

petderek and others added some commits Feb 27, 2019

Merge pull request #1885 from ubhattacharjya/ACS
ACS Model change for container ordering
integ tests: allocate more cpu per container
The default helper function will now allocate 1024 cpu shares or 100%
cpu-percent on windows. This will enable the windows based tests to
finish in more predictable ways. When Windows tests were constrained,
simple tasks liek "sleep 10" were taking much longer than the expected
10 seconds.
Merge pull request #1887 from ubhattacharjya/StopTimeoutTest
Adding Integ Tests for Granular Stop Timeout

@ubhattacharjya ubhattacharjya requested a review from aws/aws-ecs-agent Feb 28, 2019

@ubhattacharjya ubhattacharjya changed the title Merge cont order Merge branch 'container-ordering-feature' into dev Feb 28, 2019

@yunhee-l
Copy link
Contributor

yunhee-l left a comment

Once tests pass!

@petderek

This comment has been minimized.

Copy link
Contributor

petderek commented Feb 28, 2019

The windows functional tests are expected to fail. I'll have a fix out for the broken tests soon.

@petderek

This comment has been minimized.

Copy link
Contributor

petderek commented Feb 28, 2019

Arm tests are failing since the library/nginx image. We are going to go ahead and merge this in since we saw acceptance on this earlier today. We re-validate arm for both the plugin change and the release.

@petderek petderek merged commit 90793e4 into aws:dev Feb 28, 2019

10 of 13 checks passed

ecs/arm/functional_test Failed: Arm functional test suite
ecs/arm/integ_test Failed: Arm integration test suite
ecs/windows/functional_test Failed: Windows functional test suite
continuous-integration/travis-ci/pr The Travis CI build passed
Details
ecs/arm/build Passed: Arm build
ecs/arm/unit_test Passed: Arm unit test suite
ecs/linux/build Passed: Linux build
ecs/linux/functional_test Passed: Linux functional test suite
ecs/linux/integ_test Passed: Linux integration test suite
ecs/linux/unit_test Passed: Linux unit test suite
ecs/windows/build Passed: Windows build
ecs/windows/integ_test Passed: Windows integration test suite
ecs/windows/unit_test Passed: Windows unit test suite
@petderek

This comment has been minimized.

Copy link
Contributor

petderek commented Feb 28, 2019

The arm issue will be fixed on #1898

This was referenced Feb 28, 2019

@ubhattacharjya ubhattacharjya deleted the ubhattacharjya:MergeContOrder branch Apr 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.