addServiceBinding and rmServiceBinding log messages filling up disk #1983

Closed
hairyhenderson opened this Issue Oct 13, 2017 · 6 comments

hairyhenderson commented Oct 13, 2017

Over the past few days I've been battling an issue where disk space was rapidly declining (and which had caused one of my swarms to lose quorum entirely). It turned out to be some extremely large messages (~100 kB each) being logged at a rate of about once per second. Because of another bug (docker/for-aws#111), these messages were being written to disk in three places, compounding the problem further, but that's only tangentially relevant to this issue.

The messages are all similar to the following (with lines broken for clarity):

Oct 12 14:31:07 moby root: time="2017-10-12T14:31:07.052191199Z" level=warning 
msg="addServiceBinding 
b9805e0d00432d23c37d3c2a5dde0741961deebbfbb4fb383cd144222e7d1e46 possible trainsient state 
ok:true entries:1729 set:true 
Set{afb4064e5f279d04c8cde8ffa04bf70776a30d95ebd0f660cf1f2be26805ef4f, f8355a00b5104760c7f22a1a6dfb2c319ab80b9d518975b95ff38b094d8d933a,
...(~100kB of entries)
5e598be09a25bb0ebe927333e354fdf40b38f20266caaed6298013ca447ecfb6}"

The bug is mainly that the message prints all entries of the set, regardless of how large the set actually is. I'm not sure what use this information would even have in a troubleshooting scenario, but more importantly it's certainly not relevant at a warning level.

These are the two lines responsible:

logrus.Warnf("addServiceBinding %s possible trainsient state ok:%t entries:%d set:%t %s", eID, ok, entries, b, setStr)

logrus.Warnf("rmServiceBinding %s possible trainsient state ok:%t entries:%d set:%t %s", eID, ok, entries, b, setStr)

I'll issue a PR that removes the setStr argument from the message, though I'm certainly open to other solutions.
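For illustration, here is a minimal sketch of one such alternative: capping how many set entries are rendered into the message, which is roughly the direction the eventual fix took (see the maintainer's comment below about keeping around 5 endpoints). The truncateSet helper, the maxLoggedEntries constant, and the example values are hypothetical, not actual libnetwork code:

package main

import (
    "fmt"
    "strings"

    "github.com/sirupsen/logrus"
)

// maxLoggedEntries is a hypothetical cap on how many set members are
// included in the warning, keeping the log line bounded.
const maxLoggedEntries = 5

// truncateSet renders at most maxLoggedEntries elements of the set and
// notes how many were omitted, so a huge set cannot blow up the log line.
func truncateSet(entries []string) string {
    if len(entries) <= maxLoggedEntries {
        return "Set{" + strings.Join(entries, ", ") + "}"
    }
    return fmt.Sprintf("Set{%s, ...(%d more)}",
        strings.Join(entries[:maxLoggedEntries], ", "),
        len(entries)-maxLoggedEntries)
}

func main() {
    // Stand-in endpoint IDs; the real ones are 64-char hex strings.
    entries := []string{"ep1", "ep2", "ep3", "ep4", "ep5", "ep6", "ep7"}
    logrus.Warnf("addServiceBinding %s possible transient state ok:%t entries:%d set:%t %s",
        "b9805e0d0043", true, len(entries), true, truncateSet(entries))
}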

Thanks!

fcrisciani commented Oct 13, 2017

@hairyhenderson we should investigate further; the transient state should resolve in a matter of seconds and accounts for the propagation time of networkDB and local events. If you are accumulating 100 kB of entries, it means the endpoints are not being properly deleted.

hairyhenderson commented Oct 13, 2017

@fcrisciani sure - the root issue was that a global service was flapping for around 24 hours (dying and restarting very quickly), so a lot of endpoints were being added and deleted.

At the very least, if the full set of endpoints is printed, there needs to be some sort of limit on the length of the string being printed!
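A hedged sketch of that suggestion - a byte-length cap on the rendered string rather than an entry-count cap (truncateString is a hypothetical helper, not libnetwork code; slicing by bytes is safe here because the endpoint IDs are ASCII hex):

// truncateString caps s at maxLen bytes and appends a marker so the
// reader knows the log line was cut. Illustration only.
func truncateString(s string, maxLen int) string {
    if len(s) <= maxLen {
        return s
    }
    return s[:maxLen] + "...(truncated)"
}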

m4r10k commented Nov 13, 2017

@hairyhenderson I can confirm this behavior. We had the same situation with a flapping service (prometheus/nodeexporter) for about 24 hours too. The syslog filled up and keeps filling up. We will remove the offending containers and truncate the syslog.

level=warning msg="rmServiceBinding 4801e5eb1403
201bb1a008ac341f6fbc01ffc5afda2ebd7ce4bea99ec96763b2 possible trainsient state ok:true entries:1194 set:true Set{9ae756614969de8147aaf
476d13ee86289b04447f7ff379212dbdff60bec55fd, d35f6d941a96dcfc919802e7bfc6909c44148a847ae639f7377790e562608f42, 840b213a3e35e13599b85de
89c4b682b0db793fffbb7951ca0e4e2626ec7c16b, c3117adf179bbdd7ec9856749fa0e0c8345e1621aa38ad6022ca0f5412d76438, 2d8559d21782ef79b9f470276
c5d75c087a60a68b7924b9fade3e348a25de107, 4626511406e3abe11b2af55f5781808696d7f27556c62b4ac15ca4347e2cfd30, c531593...```

Can I do anything on our side to help you find further hints, @fcrisciani?

docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
m4r10k commented Nov 13, 2017

Oh, I found one more thing. After we managed to delete the malfunctioning service, and after we repaired the Swarm (the managers were out of order), we noticed that docker network inspect -v <network> does not show any service information for the network, which still exists. What I mean is that no services are visible, even though they are functioning without any problems. If we redeploy a stack, all services are visible afterwards.

docker service inspect:

       },
        "Endpoint": {
            "Spec": {
                "Mode": "vip"
            },
            "VirtualIPs": [
                {
                    "NetworkID": "j8g1fllw1ns3wpw5xejfnrmjt",
                    "Addr": "10.0.12.9/24"
                }
            ]
        },

docker network inspect -v

[
    {
        "Name": "<serivcename>",
        "Id": "j8g1fllw1ns3wpw5xejfnrmjt",
        "Created": "0001-01-01T00:00:00Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": []
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4109"
        },
        "Labels": null
    }
]

fcrisciani commented Feb 23, 2018

@hairyhenderson PR #2088 should address your concern while still leaving enough info for debugging (around 5 endpoints).

hairyhenderson commented Feb 26, 2018

Thanks @fcrisciani!
