Debug common issues

Exercise 1 - Horizontal Scaling App - Not enough resources

Deploy an application
```
dcos marathon app add app-scaling.json
```

Scale the instances to 100

```
dcos marathon app update /app-scaling instances=100
dcos marathon app list
dcos marathon deployment list
```

Look into the app details. Can you find the reason why the deployment is in waiting state?

```
dcos marathon debug details /app-scaling
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.156Z
10.0.1.239   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.155Z
10.0.3.134   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.154Z
10.0.7.199   -         ok       -     -    ok     -    2018-03-26T13:09:02.155Z
```
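The table is easier to read once the unmatched columns are listed per agent. The Python sketch below does that for the output shown above. It assumes that `-` means the offer did not match on that column and `ok` means it did, and that the columns are whitespace-separated exactly as printed here; the function name `unmatched_resources` is made up for this illustration.

```python
from typing import Dict, List

# Output of `dcos marathon debug details /app-scaling`, copied from above.
DEBUG_DETAILS = """\
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.156Z
10.0.1.239   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.155Z
10.0.3.134   ok        ok       -     ok   ok     ok   2018-03-26T13:09:02.154Z
10.0.7.199   -         ok       -     -    ok     -    2018-03-26T13:09:02.155Z
"""


def unmatched_resources(table: str) -> Dict[str, List[str]]:
    """Map each agent to the columns marked '-' (assumed to mean 'no match')."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    result: Dict[str, List[str]] = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Skip HOSTNAME (first) and RECEIVED (last); they are not match columns.
        result[row["HOSTNAME"]] = [c for c in header[1:-1] if row[c] == "-"]
    return result


for host, missing in unmatched_resources(DEBUG_DETAILS).items():
    print(host, "unmatched:", ", ".join(missing) or "none")
```

In this output every agent lists CPUS as unmatched, which suggests the cluster does not have enough free CPU for 100 instances.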

Reset the deployment by forcing the instances to be 1

```
dcos marathon app update /app-scaling --force instances=1
```

Exercise 2 - Vertical scaling - No matching resources

Now increase the MEM allocation of the app

```
dcos marathon app update /app-scaling mem=100000
dcos marathon app list
dcos marathon deployment list
```

Look into the app details. Can you find the reason why the deployment is in waiting state?

```
dcos marathon debug details /app-scaling
HOSTNAME    ROLE  CONSTRAINTS  CPUS  MEM  DISK  PORTS  RECEIVED
10.0.0.177   ok        ok       ok    -    ok     ok   2018-03-26T13:35:21.817Z
10.0.1.239   ok        ok       ok    -    ok     ok   2018-03-26T13:35:21.817Z
10.0.3.134   ok        ok       ok    -    ok     ok   2018-03-26T13:35:21.816Z
10.0.7.199   -         ok       -     -    ok     -    2018-03-26T13:35:21.817Z
```

Reset the deployment by forcing the mem allocation to be 128
```
dcos marathon app update /app-scaling --force mem=128
```

Exercise 3 - OOM Situations

Deploy the file app-oom.json
```
dcos marathon app add app-oom.json
```

Look into the Marathon logs. Can you find the reason why the application is failing?

```
dcos node ssh --master-proxy --leader
sudo journalctl -flu dcos-marathon | grep app-oom
```

Log in to one of the agents the task failed on. You can see in the journal that the process was killed because it exceeded the cgroup memory limit.
```
dcos node ssh --master-proxy --mesos-id=$(dcos task app-oom --json | jq -r '.[] | .slave_id')
journalctl -f _TRANSPORT=kernel
```
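The kernel journal can be noisy, so it may help to filter it down to OOM-related lines. The following Python sketch is one way to do that. The marker substrings are phrases the Linux kernel commonly logs around an OOM kill, but the exact wording differs between kernel versions, so the list is an assumption to adjust; the name `oom_lines` is made up for this illustration.

```python
from typing import Iterable, Iterator

# Substrings the Linux kernel commonly logs around an OOM kill.
# Wording varies between kernel versions; treat this list as a starting point.
OOM_MARKERS = ("out of memory", "oom-killer", "killed process")


def oom_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that look like OOM-killer messages."""
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in OOM_MARKERS):
            yield line.rstrip("\n")
```

To use it on an agent, pipe `journalctl _TRANSPORT=kernel` (without `-f`) into a script that passes `sys.stdin` to `oom_lines` and prints the result.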
Remove the application
```
dcos marathon app remove /app-oom
```

Exercise 4 - Docker Images

Deploy the file dockerimage.json
```
dcos marathon app add dockerimage.json
```
The app will fail over and over again without any logs.
Can you find the reason why the application is failing?
```
docker pull noimage:idonotexist
```
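To see which image Marathon is trying to launch, the image name can be read from the app definition and then checked with `docker pull` as above. The Python sketch below shows the lookup. The JSON is a hypothetical stand-in for dockerimage.json (only the image name is taken from this exercise), and it assumes the image is stored under `container.docker.image`.

```python
import json

# Hypothetical stand-in for dockerimage.json; only the image name comes from this exercise.
APP_DEFINITION = """
{
  "id": "/dockerimage",
  "container": {
    "type": "DOCKER",
    "docker": {"image": "noimage:idonotexist"}
  }
}
"""


def docker_image(app_json: str) -> str:
    """Return the image name configured in a Marathon app definition."""
    app = json.loads(app_json)
    return app["container"]["docker"]["image"]


print(docker_image(APP_DEFINITION))
```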

Exercise 5 - Debugging Web Applications

Prerequisite:

Marathon-LB installed on all public nodes
IP of your public agent
DC/OS CLI installed

We want to deploy a webserver listening on port 3030, and set the service port to 10105 so that we can reach it via marathon-lb.

Deploy the file webserver1.json
```
dcos marathon app add webserver1.json
```
This will start the webserver on port 3030 and define a label that marathon-lb uses to bind the app to service port 10105 on the public agent.
Try to reach it via marathon-lb

We try to go to http://<public_agent>:10105/ in our browser, but the webapp does not show up. So let us try to figure out why this is not working.

Check marathon-lb/HAProxy

To see whether and how marathon-lb has picked up the app, we look at the HAProxy stats. These are available on all nodes where marathon-lb is installed: http://<public_ip>:9090/haproxy?stats

The page should look similar to the screenshot below. We can see that marathon-lb has picked up the app, and that 10.0.0.212:7727 is used as the backend for the service port frontend 10105.

Check whether we can reach that backend from within the cluster
```
dcos node ssh --leader --master-proxy
curl 10.0.0.212:7727
```
We should see something similar to: curl: (7) Failed to connect to 10.0.0.212 port 7727: Connection refused

So seemingly the app isn't serving on 10.0.0.212:7727, which is the address used by marathon-lb.
Check on which port our app is listening

Let us revisit our application definition:
```
"cmd": "echo 'Hello DC/OS' > index.html && python -m http.server 3030"
```
So let us check whether we can reach our application on port 3030
```
dcos node ssh --leader --master-proxy
curl 10.0.0.212:3030
Hello DC/OS
```
So we have identified the problem: marathon-lb forwards to the randomly assigned port 7727, while our app is listening on port 3030. Note that port 3030 is not allocated (it is not in the app specification), so another app might already be using that port. It also implies that we can only run a single instance per node.
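The two curl checks can also be expressed as a plain TCP probe, which is handy when several host and port pairs need checking. The Python sketch below is one minimal way to do it. The helper name `can_connect` is made up for this illustration, and it only tests whether a TCP connection can be opened, not whether the app answers HTTP correctly.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Run from a node inside the cluster, `can_connect("10.0.0.212", 7727)` would be expected to return `False` and `can_connect("10.0.0.212", 3030)` to return `True`, matching the curl results above.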

Fixing the problem

We have two options for fixing this problem:

a) Have the application listen on the random port. DC/OS provides the PORT0 environment variable, which holds the first random port assigned to your app. So we could change our webserver to listen on that port:
```
"cmd": "echo 'Hello DC/OS' > index.html && python -m http.server $PORT0"
```
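If the webserver is your own code rather than `python -m http.server`, the same idea applies: read the port from the environment instead of hard-coding it. The Python sketch below illustrates this. The function names are made up for this illustration, and the fallback to 3030 when PORT0 is unset is an assumption that is only meant for running the script outside DC/OS.

```python
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler
from typing import Mapping


def resolve_port(environ: Mapping[str, str] = os.environ, default: int = 3030) -> int:
    """Use the first assigned port (PORT0) if set, else fall back to a default."""
    return int(environ.get("PORT0", default))


def make_server(port: int) -> HTTPServer:
    """Create (but do not start) an HTTP server serving the current directory."""
    return HTTPServer(("0.0.0.0", port), SimpleHTTPRequestHandler)


# To run it: make_server(resolve_port()).serve_forever()
```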
b) Sometimes using a random port is not possible, as the application needs to listen on a fixed port (e.g., 3030). In that case we can run our Docker container in bridge mode (see here for details on host versus bridge mode). Inside the container network the application can then use port 3030, which is mapped to a random port on the host.

See webserver2.json for more details.
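To give a rough idea of what a bridge-mode definition can look like, the Python sketch below builds one as a dictionary and prints it as JSON. It is not the content of webserver2.json: the id, image, and resource values are placeholders, and the field names assume the Marathon 1.5+ networking format (`networks` plus `container.portMappings`).

```python
import json

# Sketch of a bridge-mode app definition (Marathon 1.5+ field names assumed).
# The id, image, and resources are placeholders; the real webserver2.json may differ.
app = {
    "id": "/webserver2",
    "cmd": "echo 'Hello DC/OS' > index.html && python -m http.server 3030",
    "cpus": 0.1,
    "mem": 128,
    "instances": 1,
    "networks": [{"mode": "container/bridge"}],
    "container": {
        "type": "DOCKER",
        "docker": {"image": "python:3"},
        "portMappings": [
            # hostPort 0 requests a random host port, mapped to 3030 inside the container.
            {"containerPort": 3030, "hostPort": 0, "servicePort": 10105, "protocol": "tcp"}
        ],
    },
    "labels": {"HAPROXY_GROUP": "external"},
}

print(json.dumps(app, indent=2))
```

With a mapping like this, marathon-lb would forward service port 10105 to the random host port, which in turn reaches port 3030 inside the container.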

Exercise 6 - Debugging nginx using dcos task exec

Prerequisite:

IP of your public agent
DC/OS CLI installed
Marathon-LB not installed

We want to deploy an nginx webserver, but cannot reach nginx (despite it running on a public agent).

Deploy the file nginx.json.
```
dcos marathon app add nginx.json
```
This will start nginx on the public agent.
Try to reach via the public agent

We try to go to http://<public_agent>/ (default port is 80) in our browser, but the webapp does not show up. So let us try to figure out why this is not working.

Check that app is running

Let us first check whether the app is really running. This can be done via either the UI or the CLI:

```
$ dcos task
NAME   HOST        USER  STATE  ID                                          MESOS ID
nginx  10.0.6.146  root    R    nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3  6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0
```

Looks like everything is ok here (state R means running). So let us continue.

Check whether we can reach the instance from within the cluster
```
dcos node ssh --leader --master-proxy
curl <public_agent>
```
We should see something similar to: curl: (7) Failed to connect to ...: Connection refused

If we are on AWS, it might be worth trying to curl both the internal and external IP of the public agent.

Check nginx logs

```
$ dcos task log nginx
Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/active\/mesos\/libexec\/mesos\/mesos-containerizer"}'
Executing pre-exec command '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slave\/slaves\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0\/frameworks\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-0001\/executors\/nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3\/runs\/3f0846e3-9c40-430f-82a6-b217de90ae15","\/var\/lib\/mesos\/slave\/provisioner\/containers\/3f0846e3-9c40-430f-82a6-b217de90ae15\/backends\/overlay\/rootfses\/996094c4-9a28-47d7-9678-ccd3c2458bd1\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
Executing pre-exec command '{"shell":true,"value":"mount -n -t proc proc \/proc -o nosuid,noexec,nodev"}'
Executing pre-exec command '{"arguments":["mount","-n","-t","ramfs","ramfs","\/var\/lib\/mesos\/slave\/slaves\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-S0\/frameworks\/6a2b8b96-82bb-4ee8-b26a-39d3bb553a73-0001\/executors\/nginx.cd3b7ef1-0e9e-11e8-b971-92f543ad2ed3\/runs\/3f0846e3-9c40-430f-82a6-b217de90ae15\/.secret-ae0b7c8c-19bb-436e-9bb5-09036ff3bdee"],"shell":false,"value":"mount"}'
Changing root to /var/lib/mesos/slave/provisioner/containers/3f0846e3-9c40-430f-82a6-b217de90ae15/backends/overlay/rootfses/996094c4-9a28-47d7-9678-ccd3c2458bd1
```

Unfortunately, still nothing helpful...

Exec into task

Next, let us interactively debug the task by launching a shell in the container environment. Note that `dcos task exec` currently only works with UCR containers. If you are using the Docker runtime, you will need to ssh to the node where the task is running and then use `docker exec`.

```
dcos task exec -it nginx bash
```

Then from inside the container environment, we would check the running processes.

```
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.1 106468 31368 ?        Ss   21:19   0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root         6  0.2  0.2 770436 34092 ?        Sl   21:19   0:00 mesos-executor --launcher_dir=/opt/mesosphere/active/mesos/libexe
root        16  0.0  0.0   4336   728 ?        Ss   21:19   0:00 sh -c sleep 100000
root        17  0.0  0.0   4236   712 ?        S    21:19   0:00 sleep 100000
root        19  0.2  0.1 106488 31200 ?        Ss   21:19   0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root        20  0.0  0.0  20240  3292 ?        S    21:19   0:00 bash
root        23  0.0  0.0  17500  2128 ?        R+   21:19   0:00 ps aux
```

It seems nginx is not running. Instead, the container is running the sleep command we supplied in nginx.json.
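Scanning the COMMAND column by eye works for a short listing; for longer ones the check can be scripted. The Python sketch below parses the `ps aux` output shown above and tests whether any command mentions a given name. The helper names are made up for this illustration, and the check is a plain substring match, so it is only a rough heuristic.

```python
from typing import List

# `ps aux` output copied from above (COMMAND column truncated as printed).
PS_AUX = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.1 106468 31368 ?        Ss   21:19   0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root         6  0.2  0.2 770436 34092 ?        Sl   21:19   0:00 mesos-executor --launcher_dir=/opt/mesosphere/active/mesos/libexe
root        16  0.0  0.0   4336   728 ?        Ss   21:19   0:00 sh -c sleep 100000
root        17  0.0  0.0   4236   712 ?        S    21:19   0:00 sleep 100000
root        19  0.2  0.1 106488 31200 ?        Ss   21:19   0:00 /opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer la
root        20  0.0  0.0  20240  3292 ?        S    21:19   0:00 bash
root        23  0.0  0.0  17500  2128 ?        R+   21:19   0:00 ps aux
"""


def commands(ps_output: str) -> List[str]:
    """Return the COMMAND column of each process row."""
    rows = ps_output.splitlines()[1:]  # drop the header line
    # COMMAND is the 11th column and may itself contain spaces.
    return [row.split(None, 10)[10] for row in rows if row.strip()]


def is_running(name: str, ps_output: str) -> bool:
    """Rough check: does any command line contain the given name?"""
    return any(name in cmd for cmd in commands(ps_output))


print("nginx running:", is_running("nginx", PS_AUX))
print("sleep running:", is_running("sleep", PS_AUX))
```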

Let us double check:

```
# /usr/sbin/service nginx status
nginx is not running ... failed!

# /usr/sbin/service nginx start
```

If we now check again http://<public_agent>/, we should see Welcome to nginx!.

Update the app definition

Even though it seems that everything is fine now, the next crucial step is to update your nginx.json by either removing the cmd override or explicitly starting nginx yourself. Otherwise, the failure will occur again as soon as the app is redeployed or scaled.
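Both fixes amount to a small change in the app definition. The Python sketch below shows them on a hypothetical stand-in for nginx.json (only the `sleep 100000` override is taken from this exercise). Removing `cmd` assumes the container image starts nginx by default, and `nginx -g 'daemon off;'` assumes the `nginx` binary is on the PATH in the container.

```python
import copy

# Hypothetical stand-in for nginx.json; only the sleep override comes from this exercise.
broken = {"id": "/nginx", "cmd": "sleep 100000", "instances": 1}


def without_cmd(app: dict) -> dict:
    """Option 1: drop the cmd override so the image's default command runs."""
    fixed = copy.deepcopy(app)
    fixed.pop("cmd", None)
    return fixed


def with_nginx_cmd(app: dict) -> dict:
    """Option 2: start nginx explicitly, in the foreground."""
    fixed = copy.deepcopy(app)
    fixed["cmd"] = "nginx -g 'daemon off;'"
    return fixed
```

Either result can be written back to nginx.json and redeployed; the original dictionary is left unchanged so the two variants can be compared.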
