Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker_network IPAM test / general docker test cleanup #50499

Merged
merged 4 commits into from Jan 8, 2019

Conversation

felixfontein
Copy link
Contributor

@felixfontein felixfontein commented Jan 3, 2019

SUMMARY

The test had to be deactivated (in #50477) since it caused a lot of spurious failures; see also here.

Fixes #50527, fixes #50478.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

docker_network

@felixfontein
Copy link
Contributor Author

(Copying from #50354, since that's closed now)

CC @mattclay as he knows a lot more about the CI system.

@smueller18 I think this can only happen in one of two instances:

  1. An earlier run of the test died and the network wasn't cleaned up; this should only happen in case the test run was interrupted (or the instance of the docker_network module used was horribly broken, but there hasn't been any such change to it recently) so that the test role didn't finish (it contains cleanup code). I don't know what happens if running tests are cancelled;

    @mattclay is it possible that cancelling interrupts the role and doesn't run cleanup? And if that happens, is it possible that the same CI node is used for another test (without "cleaning up" docker in some way first)?

  2. Two docker_network tests are running on the same CI node simulatenously, or are somehow else using the same docker daemon. @mattclay is this possible?

@ansibot
Copy link
Contributor

ansibot commented Jan 3, 2019

@ansibot ansibot added WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers. affects_2.8 This issue/PR affects Ansible v2.8 bug This issue/PR relates to a bug. cloud docker needs_triage Needs a first human triage before being processed. support:community This issue/PR relates to code supported by the Ansible community. test This PR relates to tests. labels Jan 3, 2019
@felixfontein
Copy link
Contributor Author

Idea: modify setup_docker to clean up docker:

  • force remove all containers starting with ansible-test-
  • force remove all networks starting with ansible-test-
  • force remove all volumes starting with ansible-test-

That should prevent errors like left-over containers or networks from before reserving resources (like IP addresses, subnets, ...) which are needed.

(I think the CI system already clears up docker containers, but probably doesn't clean up docker networks or docker volumes.)

@felixfontein felixfontein force-pushed the docker_network-ipam-tests branch 5 times, most recently from 376238f to f5dcf10 Compare January 4, 2019 11:27
@felixfontein felixfontein changed the title [WIP] docker_network IPAM test docker_network IPAM test / general docker test cleanup Jan 4, 2019
@felixfontein
Copy link
Contributor Author

I think this should do it (at least as a first approximation).

ready_for_review

@ansibot ansibot added community_review In order to be merged, this PR must follow the community review workflow. and removed WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers. labels Jan 4, 2019
@smueller18
Copy link
Contributor

smueller18 commented Jan 4, 2019

I would wait with merging until we get more information about the underlying CI system.

(Copying from #50354, since that's closed now)

CC @mattclay as he knows a lot more about the CI system.

@smueller18 I think this can only happen in one of two instances:

  1. An earlier run of the test died and the network wasn't cleaned up; this should only happen in case the test run was interrupted (or the instance of the docker_network module used was horribly broken, but there hasn't been any such change to it recently) so that the test role didn't finish (it contains cleanup code). I don't know what happens if running tests are cancelled;
    @mattclay is it possible that cancelling interrupts the role and doesn't run cleanup? And if that happens, is it possible that the same CI node is used for another test (without "cleaning up" docker in some way first)?
  2. Two docker_network tests are running on the same CI node simulatenously, or are somehow else using the same docker daemon. @mattclay is this possible?

If 2 is possible, than cleaning up docker networks might cause other issues. Have a look at

- name: Create network with custom IPAM config
docker_network:
name: "{{ nname_ipam_1 }}"
ipam_config:
- subnet: 172.3.27.0/24
gateway: 172.3.27.2
iprange: 172.3.27.0/26
aux_addresses:
host1: 172.3.27.3
host2: 172.3.27.4
register: network
- assert:
that:
- network is changed
- name: Change subnet, gateway, iprange and auxiliary addresses of network with custom IPAM config
docker_network:
name: "{{ nname_ipam_1 }}"
ipam_config:
- subnet: 172.3.28.0/24
gateway: 172.3.28.2
iprange: 172.3.28.0/26
aux_addresses:
host1: 172.3.28.3
register: network
diff: yes
- assert:
that:
- network is changed
- network.diff.differences | length == 4
- '"ipam_config[0].subnet" in network.diff.differences'
- '"ipam_config[0].gateway" in network.diff.differences'
- '"ipam_config[0].iprange" in network.diff.differences'
- '"ipam_config[0].aux_addresses" in network.diff.differences'
- name: Remove gateway and iprange of network with custom IPAM config
docker_network:
name: "{{ nname_ipam_1 }}"
ipam_config:
- subnet: 172.3.28.0/24
register: network
- assert:
that:
- network is not changed
- name: Cleanup network with custom IPAM config
docker_network:
name: "{{ nname_ipam_1 }}"
state: absent
Imagine there are 2 tests, both with docker_network and sharing the same docker host. Test 1 is creating and modifying the docker network in multiple steps (see script excerpt). So there is a little time window of some seconds the network must not be deleted). At this moment, test 2 starts running and therefore force the cleanup of every network (also that one test 1 is working on). Then the steps of test 1 that are not executed so far will fail because the network was removed in the meantime.

edit:
Just figured out that the same behavior is happening if 2 tests run in parallel with a minimal offset.

@smueller18
Copy link
Contributor

While I developed the tests for the IPAM options, I overtook the existing IP addresses for tests. But the chosen IPv4 addresses are not in the private network segments described in https://en.wikipedia.org/wiki/Private_network#Private_IPv4_addresses. There could be an interference.

@smueller18
Copy link
Contributor

Here are all observed tests failures that appeared so far (that they do not get lost in the other closed issues):

docker.errors.APIError: 500 Server Error: Internal Server Error ("failed to allocate gateway (172.3.28.2): Address already in use")

docker.errors.APIError: 403 Client Error: Forbidden ("cannot create network 6cf2f23a9499ff3904eb21d0524250fd5f7abae8fff42dfdd1cc794c4643b9b8 (br-6cf2f23a9499): conflicts with network d501a521181cc354889484d36dd0901e3f035a07589bf48fee2add1b27abb57c (br-d501a521181c): networks have overlapping IPv6")

@ansibot ansibot removed the needs_triage Needs a first human triage before being processed. label Jan 6, 2019
@mattclay
Copy link
Member

mattclay commented Jan 8, 2019

@felixfontein You are correct. If a Shippable run is cancelled, none of our cleanup code will execute. Test nodes on Shippable are used for running only one job at a time, so there's no need to worry about parallel execution on a single host.

@felixfontein
Copy link
Contributor Author

Great, thanks! I think this is ready to be merged, then. @smueller18 any more objections? :)

@smueller18
Copy link
Contributor

@felixfontein LGTM

@jborean93
Copy link
Contributor

Thanks @felixfontein and @smueller18 for working through this one.

@jborean93 jborean93 merged commit b52d7d5 into ansible:devel Jan 8, 2019
@felixfontein felixfontein deleted the docker_network-ipam-tests branch January 9, 2019 04:52
@felixfontein
Copy link
Contributor Author

Thanks @smueller18 @jborean93 @mattclay for suggestions and reviewing and merging!

felixfontein added a commit to felixfontein/ansible that referenced this pull request Jan 9, 2019
* Re-enable docker_network tests.

* Basic cleanup of docker daemon.

* Add docker CLI detection.

* YAML notation.

(cherry picked from commit b52d7d5)
@felixfontein
Copy link
Contributor Author

I created a backport to stable-2.7 in #50696.

abadger pushed a commit that referenced this pull request Jan 9, 2019
* Re-enable docker_network tests.

* Basic cleanup of docker daemon.

* Add docker CLI detection.

* YAML notation.

(cherry picked from commit b52d7d5)
kbreit pushed a commit to kbreit/ansible that referenced this pull request Jan 11, 2019
* Re-enable docker_network tests.

* Basic cleanup of docker daemon.

* Add docker CLI detection.

* YAML notation.
knumskull pushed a commit to knumskull/ansible that referenced this pull request Jan 21, 2019
* Re-enable docker_network tests.

* Basic cleanup of docker daemon.

* Add docker CLI detection.

* YAML notation.
@ansible ansible locked and limited conversation to collaborators Jul 22, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
affects_2.8 This issue/PR affects Ansible v2.8 bug This issue/PR relates to a bug. cloud community_review In order to be merged, this PR must follow the community review workflow. docker support:community This issue/PR relates to code supported by the Ansible community. test This PR relates to tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unstable integration test docker_network
5 participants