Skip to content
This repository has been archived by the owner on Nov 6, 2023. It is now read-only.

Latest commit

 

History

History
854 lines (768 loc) · 37.9 KB

ch10-monitoring.adoc

File metadata and controls

854 lines (768 loc) · 37.9 KB

Monitoring Docker Containers

This chapter will cover different ways to monitor a Docker container.

Docker CLI

docker container stats command displays a live stream of container(s) runtime metrics. The command supports CPU, memory usage, memory limit, and network IO metrics.

  1. Start a container: docker container run --name web -d jboss/wildfly:10.1.0.Final

  2. Check the container stats using docker container stats web. It shows the output as:

    CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
    web                 0.14%               274.9MiB / 1.952GiB   13.75%              828B / 0B           96.7MB / 4.1kB      53

    The output is continually updated. It shows:

    1. Container name

    2. Percent CPU utilization

    3. Total memory usage vs amount available to the container

    4. Percent memory utilization

    5. Network activity

    6. Disk activity

    7. PIDS??

  3. Check stats for multiple containers

    1. Terminate single instance of the container

      docker container stop web
      docker container rm web
    2. Create a service with multiple replicas of the container

      docker swarm init
      docker service create --name=web jboss/wildfly
    3. Now check the stats again:

      docker stats

      to see the output:

      CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
      1f9d5a1c08a8        0.24%               294.9MiB / 1.952GiB   14.75%              828B / 0B           96.5MB / 4.1kB      115

      Note that the container id is shown instead of container’s name in this case.

  4. Scale the service to 3 replicas:

    docker service scale web=3
  5. The stats are now shown for all three containers:

    CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
    1f9d5a1c08a8        0.14%               287.8MiB / 1.952GiB   14.40%              1.03kB / 0B         96.5MB / 4.1kB      53
    a3395efbb6ff        0.77%               263.4MiB / 1.952GiB   13.18%              578B / 0B           0B / 4.1kB          114
    f484ff2f4c12        0.06%               294.8MiB / 1.952GiB   14.75%              578B / 0B           0B / 4.1kB          114
  6. Display only container id and percent CPU utilization using the command docker container stats --format "{{.Container}}: {{.CPUPerc}}":

    55198043b6aa: 0.10%
    5b5dd33b675d: 0.11%
    6e98a9597e6a: 0.10%
  7. Format the output in a table. The results should include container name, percent CPU utilization and percent memory utilization. This can be achieved using the command:

    docker container stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

    It shows the output:

    NAME                              CPU %               MEM USAGE / LIMIT
    web.2.1ic0vevvvu2nwwyc6css58ref   0.10%               267.5MiB / 1.952GiB
    web.3.pwcgr58s1xo28gwa1znlrn2s3   0.11%               274.7MiB / 1.952GiB
    web.1.bg1dcwagl9tzz0azuegxzoys8   0.12%               274MiB / 1.952GiB
  8. Display only the first result using the command docker container stats --no-stream:

    CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
    55198043b6aa        0.12%               267.5MiB / 1.952GiB   13.39%              1.44kB / 0B         0B / 4.1kB          51
    5b5dd33b675d        0.07%               274.7MiB / 1.952GiB   13.75%              1.51kB / 0B         0B / 4.1kB          51
    6e98a9597e6a        0.26%               274.1MiB / 1.952GiB   13.71%              1.69kB / 0B         0B / 4.1kB          51

Docker Remote API

Docker Remote API provides a lot more details about the health of the container. It can be invoked using the following format:

curl --unix-socket /var/run/docker.sock http://localhost/containers/<name>/stats

On Docker for Mac, enabling remote HTTP API still requires a few steps. So this command uses the --unix-socket option to invoke the Remote API.

A specific invocation for a container can be done as:

curl --unix-socket /var/run/docker.sock http://localhost/containers/web.1.bg1dcwagl9tzz0azuegxzoys8/stats

Note, the container name is from the previous run of the service. It will show the output:

{"read":"2017-10-13T14:08:38.045217731Z","preread":"0001-01-01T00:00:00Z","pids_stats":{"current":51},"blkio_stats":{"io_service_bytes_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":4096},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":4096},{"major":8,"minor":0,"op":"Total","value":4096}],"io_serviced_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":1},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":1},{"major":8,"minor":0,"op":"Total","value":1}],"io_queue_recursive":[],"io_service_time_recursive":[],"io_wait_time_recursive":[],"io_merged_recursive":[],"io_time_recursive":[],"sectors_recursive":[]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":11130296115,"percpu_usage":[2687118654,3014514615,2971860160,2456802686],"usage_in_kernelmode":2700000000,"usage_in_usermode":7630000000},"system_cpu_usage":952826800000000,"online_cpus":4,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":0,"usage_in_kernelmode":0,"usage_in_usermode":0},"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":288051200,"max_usage":297189376,"stats":{"active_anon":283893760,"active_file":0,"cache":135168,"dirty":16384,"hierarchical_memory_limit":9223372036854771712,"hierarchical_memsw_limit":9223372036854771712,"inactive_anon":0,"inactive_file":135168,"mapped_file":32768,"pgfault":83204,"pgmajfault":0,"pgpgin":78441,"pgpgout":9093,"rss":283914240,"rss_huge":0,"swap":0,"total_active_anon":283893760,"total_active_file":0,"total_cache":135168,"total_dirty":16384,"total_inactive_anon":0,"total_inactive_file":135168,"total_mapped_file":32768,"total_pgfault":83204,"total_pgmajfault":0,"total_pgpgin":78441,"total_pgpgout":9093,"total_rss":283914240,"total_rss_huge":0,"total_swap":0,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":2095874048},"name":"/web.1.bg1dcwagl9tzz0azuegxzoys8","id":"6e98a9597e6af085e73a4d211fff9a164aa012727a46525d4fbaa164b572e23f","networks":{"eth0":{"rx_bytes":1882,"rx_packets":37,"rx_errors":0,"rx_dropped":0,"tx_bytes":0,"tx_packets":0,"tx_errors":0,"tx_dropped":0}}}

As you can see, far more details about container’s health are shown here. These stats are refereshed every one second. The continuous refresh of metrics can be terminated using Ctrl + C.

Docker Events

docker system events provide real time events for the Docker host.

  1. In one terminal (T1), type docker system events. The command does not show output and waits for any event worth reporting to occur. The list of events is listed at https://docs.docker.com/engine/reference/commandline/events/#/extended-description.

  2. In a new terminal (T2), kill existing container using docker service scale web=2.

  3. T1 shows the updated list of events as:

    2017-10-13T07:12:00.223791013-07:00 service update r4i0x8ujnn2q8osj8dowgvw72 (name=web, replicas.new=2, replicas.old=3)
    2017-10-13T07:12:00.332724880-07:00 container kill 5b5dd33b675d3b6be3e6aaf0ecde928b3ac882b0a221ff71e57c86faae8181ab (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=pwcgr58s1xo28gwa1znlrn2s3, com.docker.swarm.task.name=web.3.pwcgr58s1xo28gwa1znlrn2s3, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.pwcgr58s1xo28gwa1znlrn2s3, signal=15, vendor=CentOS)
    2017-10-13T07:12:00.613143701-07:00 container die 5b5dd33b675d3b6be3e6aaf0ecde928b3ac882b0a221ff71e57c86faae8181ab (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=pwcgr58s1xo28gwa1znlrn2s3, com.docker.swarm.task.name=web.3.pwcgr58s1xo28gwa1znlrn2s3, exitCode=0, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.pwcgr58s1xo28gwa1znlrn2s3, vendor=CentOS)
    2017-10-13T07:12:00.897831488-07:00 network disconnect 8f8e6ce771d6db6065f2472a7e83612ff6a657de3b6d08dab0617b8a596234fa (container=5b5dd33b675d3b6be3e6aaf0ecde928b3ac882b0a221ff71e57c86faae8181ab, name=bridge, type=bridge)
    2017-10-13T07:12:01.017523717-07:00 container stop 5b5dd33b675d3b6be3e6aaf0ecde928b3ac882b0a221ff71e57c86faae8181ab (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=pwcgr58s1xo28gwa1znlrn2s3, com.docker.swarm.task.name=web.3.pwcgr58s1xo28gwa1znlrn2s3, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.pwcgr58s1xo28gwa1znlrn2s3, vendor=CentOS)
    2017-10-13T07:12:01.023414108-07:00 container destroy 5b5dd33b675d3b6be3e6aaf0ecde928b3ac882b0a221ff71e57c86faae8181ab (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=pwcgr58s1xo28gwa1znlrn2s3, com.docker.swarm.task.name=web.3.pwcgr58s1xo28gwa1znlrn2s3, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.pwcgr58s1xo28gwa1znlrn2s3, vendor=CentOS)

    The output shows a list of events, one in each line. The events shown here are container kill, container die, network disconnect, container stop, and container destroy. Date and timestamp for each event is displayed at the beginning of the line. Other event specific information is displayed as well.

  4. In T2, scale the service back to 3 replicas: docker service scale web=3

  5. The output in T1 is updated to show:

    2017-10-13T07:13:47.161848609-07:00 service update r4i0x8ujnn2q8osj8dowgvw72 (name=web, replicas.new=3, replicas.old=2)
    2017-10-13T07:13:47.429074382-07:00 container create 0574d1fd74bef2e6fc54174e1fbeda25efd7ed270dce1d6dbede4ead19c7c485 (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=xcmylcwlag5vot4tp3l5z6oam, com.docker.swarm.task.name=web.3.xcmylcwlag5vot4tp3l5z6oam, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.xcmylcwlag5vot4tp3l5z6oam, vendor=CentOS)
    2017-10-13T07:13:47.445010259-07:00 network connect 8f8e6ce771d6db6065f2472a7e83612ff6a657de3b6d08dab0617b8a596234fa (container=0574d1fd74bef2e6fc54174e1fbeda25efd7ed270dce1d6dbede4ead19c7c485, name=bridge, type=bridge)
    2017-10-13T07:13:47.778855117-07:00 container start 0574d1fd74bef2e6fc54174e1fbeda25efd7ed270dce1d6dbede4ead19c7c485 (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=xcmylcwlag5vot4tp3l5z6oam, com.docker.swarm.task.name=web.3.xcmylcwlag5vot4tp3l5z6oam, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.3.xcmylcwlag5vot4tp3l5z6oam, vendor=CentOS)

    The list of events shown here are container create, network connect, and container start.

Use filters

The list of events can be restricted by filters specified using --filter or -f option. The currently supported filters are:

  1. container (container=<name or id>)

  2. daemon (daemon=<name or id>)

  3. event (event=<event action>)

  4. image (image=<tag or id>)

  5. label (label=<key> or label=<key>=<value>)

  6. network (network=<name or id>)

  7. plugin (plugin=<name or id>)

  8. type (type=<container or image or volume or network or daemon>)

  9. volume (volume=<name or id>)

Let’s look at the list of running containers first using docker container ls, and then learn how to apply these filters.

Here is the list of running containers from the service:

CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
074447f26452        jboss/wildfly:latest   "/opt/jboss/wildfl..."   3 minutes ago       Up 3 minutes        8080/tcp            web.1.ytyv0gqi7dzxtetssrlsgvvbu
0574d1fd74be        jboss/wildfly:latest   "/opt/jboss/wildfl..."   8 minutes ago       Up 8 minutes        8080/tcp            web.3.xcmylcwlag5vot4tp3l5z6oam
55198043b6aa        jboss/wildfly:latest   "/opt/jboss/wildfl..."   25 minutes ago      Up 25 minutes       8080/tcp            web.2.1ic0vevvvu2nwwyc6css58ref

Let’s apply the filters.

  1. Show events for a container by name

    1. In T1, give the command to listen to a specific container as:

      docker system events -f container=web.1.ytyv0gqi7dzxtetssrlsgvvbu

      You may have to terminate previous run of docker system events using Ctrl + C to give this new command.

    2. In T2, terminate the second replica of the service as docker container rm -f web.2.1ic0vevvvu2nwwyc6css58ref.

    3. T1 does not show any events because its only listening for events from the first replica of the service.

  2. Show events for an event

    1. In T1, give the command docker system events -f event=create.

    2. In T2, scale the service by one more replica:

      docker service scale web=4
    3. T1 shows the event for container creation

      2017-10-13T07:24:22.971050949-07:00 container create 84e4604ffd983cfcc53ad619b4c11156518834fe23e4a0a8b299905b978a0022 (build-date=20170911, com.docker.swarm.node.id=wgujclh0492kkszpil81d3ugb, com.docker.swarm.service.id=r4i0x8ujnn2q8osj8dowgvw72, com.docker.swarm.service.name=web, com.docker.swarm.task=, com.docker.swarm.task.id=38unfmcsxmnvr844gysn28lwa, com.docker.swarm.task.name=web.4.38unfmcsxmnvr844gysn28lwa, image=jboss/wildfly:latest@sha256:d3af084d024753e4799809c10cd188f675a5b254a8e279b34709035b95d27dc7, license=GPLv2, name=web.4.38unfmcsxmnvr844gysn28lwa, vendor=CentOS)

      This is accurate as a new container is created and the event is shown in T1 console.

    4. In T2, scale the service back to 2 using the command docker servie scale web=2

    5. T1 does not show any additional events because its only looking for create events

    6. More samples are explained at https://docs.docker.com/engine/reference/commandline/events/#/filter-events-by-criteria.

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit. Prometheus collects metrics from monitored targets by scraping metrics from HTTP endpoints on these targets. Docker instance can be configured as Prometheus target.

Different targets to scrape are defined in the Prometheus configuration file. Targets may be statically configured via the static_configs parameter in the configuration fle or dynamically discovered using one of the supported service-discovery mechanisms (Consul, DNS, Etcd, etc.).

Prometheus collects metrics from monitored targets by scraping metrics from HTTP endpoints on these targets. Since Prometheus also exposes data in the same manner about itself, it can also scrape and monitor its own health.

Docker metrics for Prometheus

Docker exposes Prometheus-compatible metrics on port 9323. This support is only available as an experimental feature.

  1. For Docker for Mac, click on Docker icon in the status menu

  2. Select Preferences…​, Daemon, Advanced tab

  3. Update daemon settings:

    {
      "metrics-addr" : "0.0.0.0:9323",
      "experimental" : true
    }
  4. Click on Apply & Restart to restart the daemon

    prometheus metrics config
  5. Show the complete list of metrics using curl http://localhost:9323/metrics

  6. Show the list of engine metrics using curl http://localhost:9323/metrics | grep engine

Start Prometheus

In this section, we’ll start Prometheus and use it to scrape it’s own health.

  1. Create a new directory prometheus and change to that directory

  2. Create a text file prometheus.yml and use the following content

    # A scrape configuration scraping a Node Exporter and the Prometheus server
    # itself.
    scrape_configs:
      # Scrape Prometheus itself every 5 seconds.
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']

    This configuration file scrapes data from the Prometheus container which will be started subsequently on port 9090.

  3. Start a single-replica Prometheus service:

    docker service create \
      --replicas 1 \
      --name metrics \
      --mount type=bind,source=`pwd`/prometheus.yml,destination=/etc/prometheus/prometheus.yml \
      --publish 9090:9090/tcp \
      prom/prometheus

    This will start the Prometheus container on port 9090.

  4. Prometheus dashboard is at http://localhost:9090. Check the list of enabled targets at http://localhost:9090/targets (also accessible from StatusTargets menu).

    prometheus metrics target

    It shows that the Prometheus endpoint is available for scraping.

  5. Click on Graph and click on -insert metric at cursor- to see the list of metrics available:

    prometheus metrics1

    These are all the metrics published by the Prometheus endpoint.

  6. Choose http_request_total metrics, click on Execute

    prometheus metrics2
  7. Switch from Console to Graph

    prometheus metrics3
  8. Change the duration from 1h to 5m

    prometheus metrics4
  9. Click on Add Graph, select a different metric, say http_requests_duration_microseconds, and click on Execute

    prometheus metrics5
  10. Switch from Console to Graph and change the duration from 1h to 5m

    prometheus metrics6
  11. Stop the container: docker service rm metrics

Multiple graphs can be added this way.

Node health

In this section, we’ll start Prometheus node exporter that will publish machine metrics. Then we’ll use Prometheus to scrape its health information about the node running Docker.

Start Node Exporter

  1. All containers need to use the same overlay network so that they can communicate with each other. Let’s create an overlay network:

    docker network create --driver overlay prom
  2. Start Prometheus node exporter:

    docker service create --name node \
     --mode global \
     --mount type=bind,source=/proc,target=/host/proc \
     --mount type=bind,source=/sys,target=/host/sys \
     --mount type=bind,source=/,target=/rootfs \
     --network prom \
     --publish 9100:9100 \
     prom/node-exporter:v0.15.0 \
      --path.procfs /host/proc \
      --path.sysfs /host/sys \
      --collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"

    A few observations in this command:

    1. This is started as a global service such that it is started on all nodes of the cluster.

    2. As explained in prometheus/node_exporter#610, node exporter only works with host network on Mac OSX. This is not needed if you are running on Linux.

    3. It uses the overlay network previously created.

    4. It needs access to host’s filesystems such that the metrics about the node can be published.

Restart Prometheus

  1. Update prometheus.yml to the following text:

    global:
      scrape_interval: 10s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets:
            - 'localhost:9090'
      - job_name: 'node resources'
        dns_sd_configs:
          - names: ['tasks.node']
            type: 'A'
            port: 9100
        params:
          collect[]:
            - cpu
            - meminfo
            - diskstats
            - netdev
            - netstat
    
      - job_name: 'node storage'
        scrape_interval: 1m
        dns_sd_configs:
          - names: ['tasks.node']
            type: 'A'
            port: 9100
        params:
          collect[]:
            - filefd
            - filesystem
            - xfs

    A few observations:

    1. DNS-based service discovery is used to discover the scraper for node-exporter. This is further explained at dns_sd_configs. A record-based queries are used to discover the service.

    2. Two different jobs are created even though they are scraping from the same endpoint. This provides a more logical way to represent data.

  2. Terminate previously running Prometheus service:

    docker service rm metrics
  3. Restart the Prometheus service, this time using the overlay network, as:

    docker service create \
      --replicas 1 \
      --name metrics \
      --network prom \
      --mount type=bind,source=`pwd`/prometheus.yml,destination=/etc/prometheus/prometheus.yml \
      --publish 9090:9090/tcp \
      prom/prometheus

Check metrics

  1. Confirm that both the services have started:

    ID                  NAME                MODE                REPLICAS            IMAGE                       PORTS
    lzl41s2i66jd        metrics             replicated          1/1                 prom/prometheus:latest      *:9090->9090/tcp
    dro3ncpyuchp        node                global              1/1                 prom/node-exporter:latest
  2. Confirm that all the targets are configured correctly at Prometheus dashboard:

    prometheus metrics target2
  3. Now a lot more metrics, this time from the node, are also available:

    prometheus metrics7

    Console output and graphs for all these metrics is now available:

    prometheus metrics8

    Complete list of metrics is available at https://github.com/prometheus/node_exporter.

Query using PromQL (TODO)

Add some fun queries from https://prometheus.io/docs/querying/basics/.

cAdvisor

cAdvisor (Container Advisor) provides resource usage and performance characteristics running containers. Let’s take a look on how cAdvisor can be used to get these metrics from containers.

  1. Run cAdvisor

    docker container run \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:rw \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --publish=8080:8080 \
      --detach=true \
      --name=cadvisor \
      google/cadvisor:latest
  2. Dashboard is available at http://localhost:8080

    cadvisor default dashboard
  3. A high-level CPU and Memory utilization is shown. More details about CPU, memory, network and filesystem usage is shown in the same page. CPU usage looks like as shown:

    cadvisor cpu snapshot
  4. All Docker containers are in /docker sub-container.

    cadvisor docker metrics

    Click on any of the containers and see more details about the container.

cAdvisor samples once a second and has historical data for only one minute. The data generated from cAdvisor can be exported to InfluxDB. Optionally, you may use a Grafana front end to visualize the data as explained in How to setup Docker monitoring.

Prometheus and cAdvisor

cAdvisor also exposes container statistics as Prometheus metrics out of the box. By default, these metrics are served under the /metrics HTTP endpoint. Let’s take a look at how these container metrics can be observed using Prometheus.

  1. Terminate previously running cAdvisor:

    docker container rm -f cadvisor
  2. Start a new cAdvisor service, using the prom overlay network created earlier:

    docker service create \
      --name cadvisor \
      --network prom \
      --mode global \
      --mount type=bind,source=/,target=/rootfs \
      --mount type=bind,source=/var/run,target=/var/run \
      --mount type=bind,source=/sys,target=/sys \
      --mount type=bind,source=/var/lib/docker,target=/var/lib/docker \
      google/cadvisor:latest
  3. Terminate the previously running Prometheus service:

    docker service rm metrics
  4. The update prometheus.yml configuration file is:

    global:
      scrape_interval: 10s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets:
            - 'localhost:9090'
    
      - job_name: 'node resources'
        dns_sd_configs:
          - names: ['tasks.node']
            type: 'A'
            port: 9100
        params:
          collect[]:
            - cpu
            - meminfo
            - diskstats
            - netdev
            - netstat
    
      - job_name: 'node storage'
        scrape_interval: 1m
        dns_sd_configs:
          - names: ['tasks.node']
            type: 'A'
            port: 9100
        params:
          collect[]:
            - filefd
            - filesystem
            - xfs
    
      - job_name: 'cadvisor'
        dns_sd_configs:
          - names: ['tasks.cadvisor']
            type: 'A'
            port: 8080
  5. Start the new Prometheus service

    docker service create \
      --replicas 1 \
      --name metrics \
      --network prom \
      --mount type=bind,source=`pwd`/prometheus.yml,destination=/etc/prometheus/prometheus.yml \
      --publish 9090:9090/tcp \
      prom/prometheus
  6. Confirm that all the targets are configured correctly at Prometheus dashboard:

    prometheus metrics target3

    Note, all four scrape endpoints are shown here.

  7. In Graphs, now, a lot more metrics, this time from cAdvisor, are also available:

    prometheus metrics9

    Console output and graphs for all these metrics is now available:

    prometheus metrics10

    Complete list of metrics is available at https://github.com/google/cadvisor.

Here is a basic query written using PromQL worth trying:

sum by (container_label_com_docker_swarm_node_id) (
  irate(
    container_cpu_usage_seconds_total{
      container_label_com_docker_swarm_service_name="metrics"
      }[1m]
  )
)

This shows the average amount of CPU used per minute by the service metrics aggregated over multiple CPUs. The graph will look as shown:

prometheus metrics11

Monitor Java Applications

This section will explain how an existing Java application can be updated to publish metrics and monitored by Prometheus.

Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.

As discussed earlier, Prometheus collects metrics from monitored targets by scraping from an HTTP endpoint on these targets. By default, these metrics are expected to be published at /metrics. Any existing Java application can be updated to publish Prometheus-style metrics at this endpoint.

An earlier chapter explained a simple Java EE application that talks to a MySQL database. This application also publishes Prometheus-style metrics for the underlying JVM at /metrics. It also publishes application-specific metrics such as total number of times GET / and GET /{id} is called.

The complete set of JVM metrics are explained at https://github.com/prometheus/client_java. Refer to https://github.com/arun-gupta/docker-javaee/tree/master/employees/src/main/java/org/javaee/samples/employees/metrics for more details on how these metrics are enabled.

Start Java application

  1. Use the Compose file to deploy a simple the Java EE application. This will start WildFly Swarm application and MySQL database.

    docker stack deploy --compose-file=docker-compose.yml webapp

    This will create webapp_default overlay network, and start the webapp_web and webapp_db services.

  2. Verify the network:

    $ docker network ls
    NETWORK ID          NAME                DRIVER              SCOPE
    u6ybdaqx5h5y        webapp_default      overlay             swarm

    Other networks may be shown here as well.

  3. Verify the services:

    $ docker service ls
    ID                  NAME                MODE                REPLICAS            IMAGE                            PORTS
    ucztcpf1vw0a        webapp_db           replicated          1/1                 mysql:8                          *:3306->3306/tcp
    jttfgvr09kre        webapp_web          replicated          1/1                 arungupta/docker-javaee:latest   *:8080->8080/tcp,*:9990->9990/tcp
  4. Verify that the endpoint is accessible:

    $ curl http://localhost:8080/resources/employees
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?><collection><employee><id>1</id><name>Penny</name></employee><employee><id>2</id><name>Sheldon</name></employee><employee><id>3</id><name>Amy</name></employee><employee><id>4</id><name>Leonard</name></employee><employee><id>5</id><name>Bernadette</name></employee><employee><id>6</id><name>Raj</name></employee><employee><id>7</id><name>Howard</name></employee><employee><id>8</id><name>Priya</name></employee></collection>
  5. Access the metrics published by the endpoint using curl http://localhost:8080/metrics to see the output:

    # HELP jvm_info JVM version info
    # TYPE jvm_info gauge
    jvm_info{version="1.8.0_141-8u141-b15-1~deb9u1-b15",vendor="Oracle Corporation",} 1.0
    # HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
    # TYPE jvm_gc_collection_seconds summary
    jvm_gc_collection_seconds_count{gc="PS Scavenge",} 25.0
    jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.386
    jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 6.0
    jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.546
    # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
    # TYPE process_cpu_seconds_total counter
    process_cpu_seconds_total 25.5
    # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
    # TYPE process_start_time_seconds gauge
    process_start_time_seconds 1.508056592419E9
    # HELP process_open_fds Number of open file descriptors.
    # TYPE process_open_fds gauge
    process_open_fds 499.0
    # HELP process_max_fds Maximum number of open file descriptors.
    # TYPE process_max_fds gauge
    process_max_fds 1048576.0
    # HELP process_virtual_memory_bytes Virtual memory size in bytes.
    # TYPE process_virtual_memory_bytes gauge
    process_virtual_memory_bytes 4.244393984E9
    # HELP process_resident_memory_bytes Resident memory size in bytes.
    # TYPE process_resident_memory_bytes gauge
    process_resident_memory_bytes 5.06601472E8
    # HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
    # TYPE jvm_classes_loaded gauge
    jvm_classes_loaded 13096.0
    # HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
    # TYPE jvm_classes_loaded_total counter
    jvm_classes_loaded_total 13096.0
    # HELP jvm_classes_unloaded_total The total number of classes that have been unloaded since the JVM has started execution
    # TYPE jvm_classes_unloaded_total counter
    jvm_classes_unloaded_total 0.0
    # HELP jvm_threads_current Current thread count of a JVM
    # TYPE jvm_threads_current gauge
    jvm_threads_current 60.0
    # HELP jvm_threads_daemon Daemon thread count of a JVM
    # TYPE jvm_threads_daemon gauge
    jvm_threads_daemon 12.0
    # HELP jvm_threads_peak Peak thread count of a JVM
    # TYPE jvm_threads_peak gauge
    jvm_threads_peak 67.0
    # HELP jvm_threads_started_total Started thread count of a JVM
    # TYPE jvm_threads_started_total counter
    jvm_threads_started_total 93.0
    # HELP jvm_threads_deadlocked Cycles of JVM-threads that are in deadlock waiting to acquire object monitors or ownable synchronizers
    # TYPE jvm_threads_deadlocked gauge
    jvm_threads_deadlocked 0.0
    # HELP jvm_threads_deadlocked_monitor Cycles of JVM-threads that are in deadlock waiting to acquire object monitors
    # TYPE jvm_threads_deadlocked_monitor gauge
    jvm_threads_deadlocked_monitor 0.0
    # HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
    # TYPE jvm_memory_bytes_used gauge
    jvm_memory_bytes_used{area="heap",} 1.2072508E8
    jvm_memory_bytes_used{area="nonheap",} 9.3550048E7
    # HELP jvm_memory_bytes_committed Committed (bytes) of a given JVM memory area.
    # TYPE jvm_memory_bytes_committed gauge
    jvm_memory_bytes_committed{area="heap",} 2.69484032E8
    jvm_memory_bytes_committed{area="nonheap",} 1.0133504E8
    # HELP jvm_memory_bytes_max Max (bytes) of a given JVM memory area.
    # TYPE jvm_memory_bytes_max gauge
    jvm_memory_bytes_max{area="heap",} 4.66092032E8
    jvm_memory_bytes_max{area="nonheap",} -1.0
    # HELP jvm_memory_pool_bytes_used Used bytes of a given JVM memory pool.
    # TYPE jvm_memory_pool_bytes_used gauge
    jvm_memory_pool_bytes_used{pool="Code Cache",} 1.4589888E7
    jvm_memory_pool_bytes_used{pool="Metaspace",} 6.9998048E7
    jvm_memory_pool_bytes_used{pool="Compressed Class Space",} 8962112.0
    jvm_memory_pool_bytes_used{pool="PS Eden Space",} 2.3732032E7
    jvm_memory_pool_bytes_used{pool="PS Survivor Space",} 6073592.0
    jvm_memory_pool_bytes_used{pool="PS Old Gen",} 9.0919456E7
    # HELP jvm_memory_pool_bytes_committed Committed bytes of a given JVM memory pool.
    # TYPE jvm_memory_pool_bytes_committed gauge
    jvm_memory_pool_bytes_committed{pool="Code Cache",} 1.47456E7
    jvm_memory_pool_bytes_committed{pool="Metaspace",} 7.5800576E7
    jvm_memory_pool_bytes_committed{pool="Compressed Class Space",} 1.0788864E7
    jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 9.2274688E7
    jvm_memory_pool_bytes_committed{pool="PS Survivor Space",} 3.8797312E7
    jvm_memory_pool_bytes_committed{pool="PS Old Gen",} 1.38412032E8
    # HELP jvm_memory_pool_bytes_max Max bytes of a given JVM memory pool.
    # TYPE jvm_memory_pool_bytes_max gauge
    jvm_memory_pool_bytes_max{pool="Code Cache",} 2.5165824E8
    jvm_memory_pool_bytes_max{pool="Metaspace",} -1.0
    jvm_memory_pool_bytes_max{pool="Compressed Class Space",} 1.073741824E9
    jvm_memory_pool_bytes_max{pool="PS Eden Space",} 9.699328E7
    jvm_memory_pool_bytes_max{pool="PS Survivor Space",} 3.8797312E7
    jvm_memory_pool_bytes_max{pool="PS Old Gen",} 3.49700096E8

    It shows all the JVM metrics that are published by the Prometheus JVM Client. The metrics generated by the application are not shown yet. It requires for the application to be accessed first.

Let’s access the JVM metrics in Prometheus dashboard first, and then we’ll access the app to show app-specific metrics.

Start Prometheus service

  1. Make sure to terminate any previously running Prometheus endpoints:

    docker service rm metrics
  2. Create a directory prometheus and change into that directory.

  3. Create a text file prometheus.yml and add the following content:

    global:
      scrape_interval: 10s
    scrape_configs:
      - job_name: 'webapp'
        dns_sd_configs:
          - names: ['tasks.webapp_web']
            type: 'A'
            port: 8080

    This defines the configuration for the HTTP endpoint that publishes Prometheus-style metrics from the Java application.

  4. Start Prometheus service:

    docker service create \
      --replicas 1 \
      --network webapp_default \
      --name metrics \
      --mount type=bind,source=`pwd`/prometheus.yml,destination=/etc/prometheus/prometheus.yml \
      --publish 9090:9090 \
      prom/prometheus

    Note, this service is using the webapp_default overlay network that is created when the application stack was deployed.

  5. Access Prometheus dashboard at http://localhost:9090

  6. Check the configured targets at http://localhost:9090/targets:

    prometheus metrics target4

    It shows that the application metrics HTTP endpoint is configured as a Prometheus target.

View application metrics

  1. On Prometheus dashboard, click on -insert metric at cursor- to see the list of metrics available:

    prometheus metrics12

    JVM metrics shown earlier are displayed here as well.

  2. Select jvm_memory_pool_bytes_used metric and click on Execute to view the metric.

    prometheus metrics13
  3. Select Graph to view the graphical representation

    prometheus metrics14
  4. Now access the application using curl http://localhost:8080/resources/employees a few times.

  5. Refresh Prometheus dashboard and see the updated list of metrics:

    prometheus metrics15

    Note, app* and requests* that are generated by the application.

  6. Select requests_get_all metric and view the graph:

    prometheus metrics16
  7. Access the application a few times using curl http://localhost:8080/resources/employees/5 and then watch the requests_get_one metric.

Grafana

Grafana is an open source metric analytics & visualization suite. It supports many different storage backends, called as Data Source. Prometheus can be added as Grafana data source. It even provides support for runnning Prometheus queries from the Grafana dashboard as well. More details can be found in Using Prometheus in Grafana.

Start Grafana

This section will explain how to start Grafana, use Prometheus as the data source, and view some container metrics.

  1. Start Grafana:

    docker container run \
      -d \
      -p 3000:3000 \
      --name=grafana \
      -e "GF_SECURITY_ADMIN_PASSWORD=secret" \
      grafana/grafana

    Use the login name admin and password secret.

    Read more details about different configuration options.

  2. Access Grafana dashboard at http://localhost:3000. Use the login and password as credentials to see Grafana console.

    grafana metrics1

Add Prometheus as data source

  1. Click the Add data source button in the top header.

  2. Specify the parameters as shown:

    grafana metrics2
  3. Click on Add to test and save the data source:

    grafana metrics3

    The green bar indicates that the data source was added successfully.

Create chart with Prometheus data source

  1. Click on Create your first dashboard, save it and give it a name, say Docker and Java dashboard

  2. Click on Graph, edit, under the Metrics tab, select your Prometheus data source.

  3. Enter the following Prometheus query expressions in the query field. The graphs will referesh in a few seconds and will look like as shown:

    grafana metrics4