
Unable to activate alerts + must manually restart monitor to register new alerts #39

Closed
st3xupery opened this issue Mar 2, 2018 · 11 comments


st3xupery commented Mar 2, 2018

My main problem is that no matter how restrictive I make my alert, I cannot get it to show as active on the /alerts page in Prometheus. In the example below I have set the service's memory limit to 1000M and the alert threshold to 10% of that limit (@service_mem_limit:0.1), with no time span, even though at rest the service in question uses at least 60% of its available memory limit. Yet no matter how long I wait, the alert still says (0 active).

[screenshot: Prometheus /alerts page showing the memlimit alert with 0 active]

      resources:
        limits:
          memory: 1000M
      labels:
        - com.df.notify=true
        - com.df.alertName=memlimit
        - com.df.alertIf=@service_mem_limit:0.1

This is how the alert translates into Prometheus

alert: monitoring_elasticsearch_memlimit
expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"}
  / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"}
  > 0.1
labels:
  receiver: system
  service: monitoring_elasticsearch
annotations:
  summary: Memory of the service monitoring_elasticsearch is over 0.1

When I plug the expr into the Prometheus expression browser I get no data. Not even container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} on its own produces a result.
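
As a sanity check against the Prometheus HTTP API (a rough sketch, using the 9090 port published by my stack):

# List the scrape targets Prometheus currently knows about
curl -s "http://localhost:9090/api/v1/targets"

# Instant query: is any target reporting up at all?
curl -s "http://localhost:9090/api/v1/query?query=up"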

Here are the relevant docker-compose instructions

  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    networks:
      - proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure
      - DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove
    deploy:
      placement:
        constraints: [node.role == manager]
  monitor:
    image: vfarcic/docker-flow-monitor:${TAG:-latest}
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - GLOBAL_SCRAPE_INTERVAL=10s
    networks:
      - proxy
    deploy:
      placement:
        constraints:
          - node.role == manager
    ports:
      - 9090:9090

It may be worth noting that I have not incorporated Alertmanager yet, as I didn't want to set it up and figured I could test my alert settings before moving on to that step. Am I wrong in assuming I can continue with docker-flow-monitor without Alertmanager?

It's also worth noting that I am using proxy as the shared network between docker-flow-monitor and docker-flow-swarm-listener because I am also using docker-flow-proxy in this stack.

It may also be worth noting that after spinning up other services, I must manually restart the docker-flow-monitor service for their new alerts to register in the Prometheus web console. I am not sure if that is intended behavior; perhaps it is a sign of something else being wrong.
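
For reference, the manual restart is just a forced update of the monitor service (a sketch; the proxy stack name is inferred from the log prefix below), which makes DFM request the service list from the listener again at startup:

# Force Swarm to redeploy the monitor so it re-reads services from the listener
docker service update --force proxy_monitor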

Nothing in the monitor logs seems to indicate anything is amiss either:

proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Requesting services from Docker Flow Swarm Listener
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Processing: [{"alertFor":"30s","alertIf":"@service_mem_limit:0.8","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_kibana"},{"alertIf":"@service_mem_limit:0.1","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_elasticsearch"},{"distribute":"true","port":"80","replicas":"1","serviceName":"proxy_letsencrypt-companion","servicePath":"/.well-known/acme-challenge"}]
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to alert.rules
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Docker Flow Monitor
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Prometheus
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 /bin/sh -c prometheus --config.file="/etc/prometheus/prometheus.yml" --storage.tsdb.path="/prometheus" --web.console.libraries="/usr/share/prometheus/console_libraries" --web.console.templates="/usr/share/prometheus/consoles"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425281311Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425401927Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.42546681Z caller=main.go:227 host_details="(Linux 4.4.0-1047-aws #56-Ubuntu SMP Sat Jan 6 19:39:06 UTC 2018 x86_64 ba5f63bfc96a (none))"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425555206Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.428645759Z caller=main.go:499 msg="Starting TSDB ..."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.438652055Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443432951Z caller=main.go:509 msg="TSDB started"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443526522Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444105035Z caller=main.go:486 msg="Server is ready to receive web requests."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444482222Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."

I am fully at a loss as to how to debug this further. Perhaps I have made a mistake along the way, or I misunderstand what I should be expecting.

vfarcic (Collaborator) commented Mar 2, 2018

Can you execute the container_memory_usage_bytes expression and send the output?

@st3xupery (Author)

When I execute

container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} > 0.1

or

container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} 

or

container_memory_usage_bytes

I get no data. Something about my deployment must be off.
[screenshot: Prometheus expression browser returning no data for the query]

@st3xupery (Author)

Here are some additional specs I can find

prometheus --version

prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
  build user:       root@6e784304d3ff
  build date:       20180119-12:01:23
  go version:       go1.9.2

cat /etc/prometheus/prometheus.yml

global:
  scrape_interval: 10s
rule_files:
- alert.rules

The content of alert.rules also looks to be in order.
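
One thing I notice is that the prometheus.yml above has no scrape_configs section at all. If I understand DFM correctly, once the listener notifies it about an exporter it should append something roughly like the following (a sketch assuming a cadvisor exporter scraped on port 8080):

scrape_configs:
- job_name: "cadvisor"
  dns_sd_configs:
  - names:
    - tasks.cadvisor
    type: A
    port: 8080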

vfarcic (Collaborator) commented Mar 2, 2018

The problem is in DFSL. It has only the proxy as the address in DF_NOTIFY_CREATE_SERVICE_URL and DF_NOTIFY_REMOVE_SERVICE_URL. You need to add the address of DFM (Prometheus) as well, comma-separated. Otherwise, DFM will never receive a notification about exporters.

@st3xupery (Author)

OH! I do see what you are saying and have modified my environment variables accordingly.

  swarm-listener:
    ...
    environment:
      - 'DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure,http://monitor:8080/v1/docker-flow-proxy/reconfigure'
      - 'DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove,http://monitor:8080/v1/docker-flow-proxy/remove'

What I had failed to see was that proxy was a reference to the service and not to the overlay network. Because of that confusion, I thought placing DFP, DFSL, and DFM on the proxy network was sufficient and that proxy in the URL would reach them all. I do know that's not how overlay networks work, but clearly I needed a second set of eyes to help.

That being said, it has been several minutes now and my aforementioned issues don't seem to have changed.

st3xupery reopened this Mar 2, 2018
@st3xupery (Author)

I even went so far as to remove DFP from the equation, but none of the queries (e.g. container_memory_usage_bytes) seems to produce any result in the Prometheus dashboard. Even an error would be more insightful to me.

vfarcic (Collaborator) commented Mar 4, 2018

The problem is that you changed the name of the service to monitor but left the rest of the address intact (http://monitor:8080/v1/docker-flow-proxy/reconfigure). The reconfigure address should be http://monitor:8080/v1/docker-flow-monitor/reconfigure. You can find an example in http://monitor.dockerflow.com/tutorial/.

@st3xupery (Author)

Oh wow, I feel rather stupid. Well, I appreciate your patience in assisting me, as this certainly resolves my issue. Much thanks again!

@st3xupery (Author)

I found some time to revisit this part of my project, hopeful that resolving my URL mistake would be the key, but I still find myself with unresponsive alerts and queries that return no data.

In the example below I keep swarm-listener on both the proxy and monitor networks, with DFM sharing the monitor network. I also tried putting them both exclusively on proxy. In both cases nothing changed.

  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    networks:
      - proxy
      - monitor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - 'DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure,http://monitor:8080/v1/docker-flow-monitor/reconfigure'
      - 'DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove,http://monitor:8080/v1/docker-flow-monitor/remove'
    deploy:
      placement:
        constraints: [node.role == manager]

  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - GLOBAL_SCRAPE_INTERVAL=10s
    networks:
      - monitor
    ports:
      - 9090:9090

I really wish I could provide more substantial info, but I have exhausted all the logs I can find.
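
Beyond the logs, about the only checks I can still think of are to ask the listener what services it is tracking and to hit the monitor's reconfigure endpoint by hand; a rough sketch, assuming the usual get-services and reconfigure endpoints on port 8080, run from a container attached to the same networks:

# Ask DFSL which services (and labels) it is currently reporting
curl -s "http://swarm-listener:8080/v1/docker-flow-swarm-listener/get-services"

# Manually register one alert with DFM, mirroring what the listener would send
curl -s "http://monitor:8080/v1/docker-flow-monitor/reconfigure?serviceName=monitoring_elasticsearch&alertName=memlimit&alertIf=@service_mem_limit:0.1"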

Is there a known-working example that uses both DFM and DFP that I can experiment with locally?

st3xupery reopened this Mar 7, 2018
vfarcic (Collaborator) commented Mar 11, 2018

Please send me the current config of your stacks and I'll try to replicate the problem inside one of my clusters.

P.S. Sorry for not responding earlier. I had too much work on my plate.

vfarcic (Collaborator) commented Mar 31, 2018

Closing due to inactivity.

vfarcic closed this as completed Mar 31, 2018