Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect network scope in prometheus plugin #58

Merged
merged 13 commits into from
Jul 9, 2019

Conversation

philipnrmn
Copy link

@philipnrmn philipnrmn commented Jul 3, 2019

This PR fixes an issue where metrics were pulled via the local agent IP address instead of the task's IP, which could prevent it working in container mode.

Instead of joining the port up to the local IP, we now use the ip_addresses field in container status to collect the container's declared ip address, which is present in all networking modes.

Unit tests: 7eb7ee4
Implementation: ce332c6

The final five commits update the existing tests to use a sample IP address instead of their original value.

This commit adds a test case which tests prometheus-labeled ports in all three
network modes: host, bridge and container.

Add ports and labels to test case
@philipnrmn philipnrmn self-assigned this Jul 3, 2019
@philipnrmn
Copy link
Author

The failing test on this PR is a flake.

branden
branden previously approved these changes Jul 8, 2019
Copy link

@branden branden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -202,7 +202,7 @@ func (p *Prometheus) GetAllURLs() (map[string]URLAndAddress, error) {
return allURLs, err
}

for _, url := range getMesosTaskPrometheusURLs(tasks, p.mesosHostname) {
for _, url := range getMesosTaskPrometheusURLs(tasks) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to call getTaskIP above this line and pass in the IP, keeping this and the other function signatures the same? It would also only run the getTaskIP logic once as opposed to in multiple places.

I also did a cursory search of p.mesosHostname and it doesn't look like it's used anywhere anymore so maybe we can do a clean up and remove that field and getMesosHostname

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change this PR makes is that instead of using the hostname to form the metrics source URL, we now use the task's ip address. So we no longer need to pass the hostname in.

Each task can potentially have a different IP address. Since this method retrieves URLs from multiple tasks, passing it a single IP address won't work - it needs to use the IP address from each task.

WRT your second point, you're quite right - I had thought it was used in getMesosClient, but it isn't. I'll remove it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see, missed that loop the first time round :). It looks like both getEndpointsFromTaskPorts and getEndpointFromTaskLabels get passed the same task, so maybe getTaskIP can just be called once on the task and passed into both of those functions to clean that up a bit and avoid the extra call - but it's not a big deal and I won't block on it 😸

branden
branden previously approved these changes Jul 9, 2019
@philipnrmn
Copy link
Author

The DC/OS integration tests have caught an issue! In bridge mode, we need either to use host_ip:host_port or container_ip:container_port. Right now we're using container_ip:host_port, which doesn't work.

gracedo
gracedo previously approved these changes Jul 9, 2019
@@ -202,7 +202,7 @@ func (p *Prometheus) GetAllURLs() (map[string]URLAndAddress, error) {
return allURLs, err
}

for _, url := range getMesosTaskPrometheusURLs(tasks, p.mesosHostname) {
for _, url := range getMesosTaskPrometheusURLs(tasks) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see, missed that loop the first time round :). It looks like both getEndpointsFromTaskPorts and getEndpointFromTaskLabels get passed the same task, so maybe getTaskIP can just be called once on the task and passed into both of those functions to clean that up a bit and avoid the extra call - but it's not a big deal and I won't block on it 😸

The correct hostname for a task is determined by the network-scope label on the
port. If set to 'container', the hostname is the first IP address in the last
container status. If set to 'host' or absent, the hostname is the same as the
local node.
@philipnrmn philipnrmn dismissed stale reviews from gracedo and branden via 9198182 July 9, 2019 20:16
@gracedo
Copy link

gracedo commented Jul 9, 2019

retest this please

@philipnrmn
Copy link
Author

the blkio_cfq input in dcos_containers test is a flake and unrelated to code changed in this PR. I'm going to merge this now.

@philipnrmn philipnrmn merged commit 4dba7d9 into 1.9.4-dcos Jul 9, 2019
@philipnrmn philipnrmn deleted the prom-network-scope branch July 9, 2019 23:09
@vespian
Copy link

vespian commented Jul 10, 2019

I wonder if networks isolation (e.g. some iptables rules) in calico will not interfere here - i.e. it will not be possible to make a connection from host to the container that is attached to the SDN network.

@philipnrmn
Copy link
Author

@vespian thanks for pointing this out. We can't readily work around it, so we'll document this as a limitation: https://jira.mesosphere.com/browse/DCOS-56350

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants