Exercise 1 - Horizontal Scaling App - Not enough resources
-
Deploy an application
dcos marathon app add app-scaling.json
-
Scale the instances to 100
dcos marathon app update /app-scaling instances=100
dcos marathon app list
dcos marathon deployment list
-
Look into the app details. Can you find the reason why the deployment is in waiting state?
dcos marathon debug details /app-scaling
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177  ok    ok           -     ok   ok    ok     2018-03-26T13:09:02.156Z
10.0.1.239  ok    ok           -     ok   ok    ok     2018-03-26T13:09:02.155Z
10.0.3.134  ok    ok           -     ok   ok    ok     2018-03-26T13:09:02.154Z
10.0.7.199  -     ok           -     -    ok    -      2018-03-26T13:09:02.155Z
-
Reset the deployment by forcing the instances to be 1
dcos marathon app update /app-scaling --force instances=1
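The debug details table can also be read mechanically. The sketch below assumes that a "-" in a column means the offer from that host did not satisfy that requirement, and that the input consists only of the header row and the host rows. The function name declined_columns is hypothetical, and the sample rows are pasted from the output above rather than fetched from a live cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a debug details table on stdin and prints, per
# host, the columns marked "-" (assumed to mean "requirement not satisfied").
declined_columns() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }  # remember header names
    {
      out = ""
      # Skip the first column (HOSTNAME) and the last one (RECEIVED).
      for (i = 2; i < NF; i++)
        if ($i == "-") out = out (out == "" ? "" : ",") name[i]
      print $1 ": " (out == "" ? "none" : out)
    }'
}

# Two rows copied from the output above. On a live cluster one would pipe
# the output of "dcos marathon debug details /app-scaling" into the function.
declined_columns <<'EOF'
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177  ok    ok           -     ok   ok    ok     2018-03-26T13:09:02.156Z
10.0.7.199  -     ok           -     -    ok    -      2018-03-26T13:09:02.155Z
EOF
```

For the sample rows this reports CPUS for 10.0.0.177 and ROLE, CPUS, MEM, PORTS for 10.0.7.199, which points at CPU as the resource that ran out on the agents whose role matches.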
Exercise 2 - Vertical scaling - No matching resources
-
Now increase the MEM allocation of the app
dcos marathon app update /app-scaling mem=100000
dcos marathon app list
dcos marathon deployment list
-
Look into the app details. Can you find the reason why the deployment is in waiting state?
dcos marathon debug details /app-scaling
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177  ok    ok           ok    -    ok    ok     2018-03-26T13:35:21.817Z
10.0.1.239  ok    ok           ok    -    ok    ok     2018-03-26T13:35:21.817Z
10.0.3.134  ok    ok           ok    -    ok    ok     2018-03-26T13:35:21.816Z
10.0.7.199  -     ok           -     -    ok    -      2018-03-26T13:35:21.817Z
-
Reset the deployment by forcing the mem allocation to be 128
dcos marathon app update /app-scaling --force mem=128
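The difference from Exercise 1 can be sketched in a few lines. When scaling out, each instance only has to fit somewhere in the cluster, whereas a single instance with mem=100000 (Marathon reads mem in MiB, so roughly 97.7 GiB) has to fit inside one agent's offer. The helper name fits_on_one_agent and the per-agent free-memory figures below are hypothetical and chosen for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if at least one agent has enough free memory
# for a single instance. Usage: fits_on_one_agent REQUESTED_MIB FREE_MIB...
fits_on_one_agent() {
  local requested=$1
  shift
  local free
  for free in "$@"; do
    if [ "$free" -ge "$requested" ]; then
      return 0
    fi
  done
  return 1
}

# Illustrative free memory in MiB for three agents (not taken from a real cluster).
agents_free_mib=(14000 14000 14000)

for mem in 128 100000; do
  if fits_on_one_agent "$mem" "${agents_free_mib[@]}"; then
    echo "mem=$mem fits on at least one agent"
  else
    echo "mem=$mem fits on no single agent, so the deployment would keep waiting"
  fi
done
```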
Exercise 3 - OOM Situations
-
Deploy the file app-oom.json
dcos marathon app add app-oom.json
-
Look into the Marathon logs. Can you find the reason why the application is failing?
dcos node ssh --master-proxy --leader sudo journalctl -flu dcos-marathon | grep app-oom
-
Log in to one of the agents the task failed on. In the journal you can see that the process was killed because of the cgroup memory violation.
dcos node ssh --master-proxy --mesos-id=$(dcos task app-oom --json | jq -r '.[] | .slave_id') journalctl -f _TRANSPORT=kernel
-
Remove the application
dcos marathon app remove /app-oom
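When the kernel journal is noisy, it can help to filter for the out-of-memory killer's messages. This is a minimal sketch, assuming the kernel uses the usual "out of memory" or "Killed process" wording (the exact text varies between kernel versions) and that the filter runs on the agent. The function name oom_lines is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: keeps only lines that look like OOM-killer messages.
# The patterns are assumptions about common kernel wording, not a complete list.
oom_lines() {
  grep -Ei 'out of memory|oom-killer|killed process'
}

# Usage on an agent (journalctl -k shows kernel messages, --no-pager prints and exits):
#   journalctl -k --no-pager | oom_lines
```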
Exercise 4 - Docker Images
-
Deploy the file dockerimage.json
dcos marathon app add dockerimage.json
The app will fail over and over again without any logs.
-
Can you find the reason why the application is failing?
docker pull noimage:idonotexist
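Because the task never starts, there is no task log to read. One quick check is to pull the image name out of the app definition and try fetching it by hand. The helper below is a hypothetical convenience that does a plain-text match rather than real JSON parsing, and it assumes the definition contains an "image" key on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the value of the first "image" key in a JSON file.
# This is a text match, not a JSON parser; adequate for a quick manual check.
image_of() {
  sed -n 's/.*"image"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage, assuming dockerimage.json is in the current directory and docker is installed:
#   docker pull "$(image_of dockerimage.json)"
```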
Exercise 5 - Debugging Web Applications
Prerequisite:
- Marathon-LB installed on all public nodes
- IP of your public agent
- DC/OS CLI installed
We want to deploy a webserver listening on port 3030, and also set the service port to 10105 so we can reach it via marathon-lb.
-
Deploy the file webserver1.json
dcos marathon app add webserver1.json
This will start the webserver on port 3030 and also define a label that marathon-lb uses to bind to the service port 10105 on the public agent.
-
Try to reach it via marathon-lb
We try to go to http://<public_agent>:10105/ in our browser, but the webapp does not show up. So let us try to figure out why this is not working.
-
Check marathon-lb/HAProxy
In order to see if/how marathon-lb has picked up the app, we look at the HAProxy stats. These are available on all nodes where marathon-lb is installed:
http://<public_ip>:9090/haproxy?stats
The page should look similar to the screenshot below. We can see that marathon-lb has picked up the app, and that 10.0.0.212:7727 is used as the backend for the service port frontend 10105.
-
Check whether we can reach that backend from within the cluster
dcos node ssh --leader --master-proxy curl 10.0.0.212:7727
We should see something similar to:
curl: (7) Failed to connect to 10.0.0.212 port 7727: Connection refused
So seemingly the app isn't serving on 10.0.0.212:7727, which is the address used by marathon-lb.
-
Check on which port our app is listening
Let us revisit our application definition:
"cmd": "echo 'Hello DC/OS' > index.html && python -m http.server 3030"
So let us check whether we can reach our application on port 3030
dcos node ssh --leader --master-proxy curl 10.0.0.212:3030
Hello DC/OS
So we have identified the problem: marathon-lb tries to redirect to the randomly assigned port 7727, while our app is listening on port 3030. Note that port 3030 is not allocated (as it is not in the app specification), so it might actually happen that another app is already using that port. It also implies that we can only run a single instance per node.
-
Fixing the problem
We have two options for fixing this problem:
a) Have the application listen on the random port. DC/OS will give you the PORT0 environment variable holding the first random port assigned to your app. So we could change our webserver to listen on that port:
"cmd": "echo 'Hello DC/OS' > index.html && python -m http.server $PORT0"
b) Sometimes using a random port is not possible, as the application needs to listen on a fixed port (e.g., 3030). In that case we can run our docker container in bridge mode (see here for details on host versus bridge mode). That would mean that inside the container network the application can use port 3030, which is mapped to a random port on the host. See webserver2.json for more details.
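Option b) could look roughly like the app definition written out below. This is an illustrative sketch and not the contents of webserver2.json: the app id, the python:3 image, the resource figures, and the file name webserver-bridge-sketch.json are all assumptions. The field layout is the older Docker containerizer style and may need adjusting for your Marathon version. The HAPROXY_GROUP label is the one marathon-lb commonly uses to decide which apps to expose.

```shell
#!/usr/bin/env bash
# Writes a hypothetical bridge-mode app definition to the current directory.
# containerPort 3030 is the fixed port inside the container; hostPort 0 asks
# for a random host port; servicePort 10105 is what marathon-lb would bind.
cat > webserver-bridge-sketch.json <<'EOF'
{
  "id": "/webserver-bridge-sketch",
  "cmd": "echo 'Hello DC/OS' > index.html && python -m http.server 3030",
  "cpus": 0.1,
  "mem": 64,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "python:3",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 3030, "hostPort": 0, "servicePort": 10105, "protocol": "tcp" }
      ]
    }
  },
  "labels": { "HAPROXY_GROUP": "external" }
}
EOF

# Deploy it the same way as the other exercises:
#   dcos marathon app add webserver-bridge-sketch.json
```

Because the host port is assigned per task, several instances can share a node without colliding on port 3030.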
Exercise 6 - Debugging nginx using dcos task exec
Prerequisite:
- IP of your public agent
- DC/OS CLI installed
- Marathon-LB not installed
We want to deploy an nginx webserver, but cannot reach nginx (despite it running on a public agent).
-
Deploy the file nginx.json.
dcos marathon app add nginx.json
This will start the nginx on the public agent.
-
Try to reach via the public agent
We try to go to http://<public_agent>/ (default port is 80) in our browser, but the webapp does not show up. So let us try to figure out why this is not working.
-
Check that app is running
Let us first check whether the app is really running. This can be done either via the UI or the CLI:
$ dcos task
NAME   HOST        USER  STATE  ID                                          MESOS ID
nginx  10.0.6.146  root  R      nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3  6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0
Looks like everything is ok here (state R, i.e. running). So let us continue.
-
Check whether we can reach the instance from within the cluster
dcos node ssh --leader --master-proxy curl <public_agent>
We should see something similar to:
curl: (7) Failed to connect to ...: Connection refused
If we are on AWS it might be worth trying to curl both the internal and external IP of the public agent.
-
Check nginx logs
$ dcos task log nginx
Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/active\/mesos\/libexec\/mesos\/mesos-containerizer"}'
Executing pre-exec command '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slave\/slaves\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0\/frameworks\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-0001\/executors\/nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3\/runs\/3f0846e3-9c40-430f-82a6-b217de90ae15","\/var\/lib\/mesos\/slave\/provisioner\/containers\/3f0846e3-9c40-430f-82a6-b217de90ae15\/backends\/overlay\/rootfses\/996094c4-9a28-47d7-9678-ccd3c2458bd1\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
Executing pre-exec command '{"shell":true,"value":"mount -n -t proc proc \/proc -o nosuid,noexec,nodev"}'
Executing pre-exec command '{"arguments":["mount","-n","-t","ramfs","ramfs","\/var\/lib\/mesos\/slave\/slaves\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0\/frameworks\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-0001\/executors\/nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3\/runs\/3f0846e3-9c40-430f-82a6-b217de90ae15\/.secret-ae0b7c8c-19bb-436e-9bb5-09036ff3bdee"],"shell":false,"value":"mount"}'
Changing root to /var/lib/mesos/slave/provisioner/containers/3f0846e3-9c40-430f-82a6-b217de90ae15/backends/overlay/rootfses/996094c4-9a28-47d7-9678-ccd3c2458bd1
Unfortunately, still nothing helpful...
-
Exec into task
Next, let us interactively debug the task by launching a shell in the container environment. Note: dcos task exec currently only works with UCR containers. If you are using the docker runtime, you will need to ssh to the node where the task is running and then use docker exec.
dcos task exec -it nginx bash
Then from inside the container environment, we would check the running processes.
# ps aux
USER  PID %CPU %MEM    VSZ   RSS TTY  STAT START  TIME COMMAND
root    1  0.1  0.1 106468 31368 ?    Ss   21:19  0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root    6  0.2  0.2 770436 34092 ?    Sl   21:19  0:00 mesos-executor --launcher_dir=/opt/mesosphere/active/mesos/libexe
root   16  0.0  0.0   4336   728 ?    Ss   21:19  0:00 sh -c sleep 100000
root   17  0.0  0.0   4236   712 ?    S    21:19  0:00 sleep 100000
root   19  0.2  0.1 106488 31200 ?    Ss   21:19  0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root   20  0.0  0.0  20240  3292 ?    S    21:19  0:00 bash
root   23  0.0  0.0  17500  2128 ?    R+   21:19  0:00 ps aux
It seems nginx is not running (but instead the sleep command we supplied in nginx.json).
Let us double check:
# /usr/sbin/service nginx status
nginx is not running ... failed!
# /usr/sbin/service nginx start
If we now check http://<public_agent>/ again, we should see "Welcome to nginx!".
-
Update the app definition
Even though it seems that everything is fine now, the next crucial step is to update your nginx.json by either removing the cmd override or explicitly starting nginx yourself. Otherwise, the failure will occur again as soon as the app is redeployed or scaled.
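The second option, starting nginx explicitly, could be sketched as follows. The real nginx.json is not reproduced here, so the file name nginx-fixed-sketch.json, the app id /nginx, the image name, the resource figures, and the public-agent role setting are all assumptions. The only point being illustrated is that cmd runs nginx in the foreground instead of sleep.

```shell
#!/usr/bin/env bash
# Writes a hypothetical corrected app definition to the current directory.
# "nginx -g 'daemon off;'" keeps nginx in the foreground so the task stays alive.
cat > nginx-fixed-sketch.json <<'EOF'
{
  "id": "/nginx",
  "cmd": "nginx -g 'daemon off;'",
  "cpus": 0.5,
  "mem": 128,
  "instances": 1,
  "acceptedResourceRoles": ["slave_public"],
  "container": {
    "type": "MESOS",
    "docker": { "image": "nginx" }
  }
}
EOF

# One way to apply it, using only the commands from the exercises above:
#   dcos marathon app remove /nginx
#   dcos marathon app add nginx-fixed-sketch.json
```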