This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

Application logrotation #4752

Closed
nathansamson opened this issue Nov 15, 2015 · 1 comment

Comments

@nathansamson
Contributor

TL;DR:
Application logs need some log rotation so they cannot grow unbounded, filling up the disk and causing all sorts of problems with ceph / the registry / deploying new releases / scaling applications.

I leave the implementation up to the deis maintainers, but I would implement it as follows.
Instead of storing all logs of one app in one file, I would use one directory per app, with one file per day in that directory. That way, logs older than X days can easily be removed (automatically via a deis setting, or manually when needed).
As a separate setting, a maximum log size per app could be enforced: if the total log size exceeds the allowed size (for that app), old log files are removed first until the limit is no longer exceeded.

I imagine 3 settings:

- Log max age (in days; false/null = default = never remove old log files): files older than this are automatically removed by the logger component once a day.
- Log minimum retention history (in days; default = 0, no guaranteed retention): files younger than this period are never automatically deleted (see next setting).
- Log max size (in bytes; 0 = default = no max size): once an app's total log size exceeds this limit, the oldest files are removed first until the total is within the limit again. Files younger than the minimum retention history are never automatically deleted.

Obviously, if you set the minimum retention history to a few days and the app goes haywire, this still won't fully protect you, but at least it won't fill up the disk behind your back in most cases, provided the settings are well chosen.
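A minimal sketch of how the three settings could interact, as a daily sweep in shell. All paths, variable names, and defaults here are hypothetical, not actual deis settings, and it assumes GNU find/du:

```shell
#!/bin/sh
# Hypothetical sketch of the proposed rotation policy; LOG_ROOT, the
# variable names, and the defaults are illustrative, not deis settings.
LOG_ROOT="${LOG_ROOT:-/var/log/deis/apps}"  # one directory per app, one file per day
MAX_AGE_DAYS="${MAX_AGE_DAYS:-30}"          # "log max age"
MIN_KEEP_DAYS="${MIN_KEEP_DAYS:-2}"         # "log minimum retention history"
MAX_BYTES="${MAX_BYTES:-1073741824}"        # "log max size" per app (here 1 GiB)

rotate_app() {
    app_dir="$1"
    # Age-based cleanup: drop daily files older than MAX_AGE_DAYS.
    find "$app_dir" -type f -mtime "+$MAX_AGE_DAYS" -delete
    # Size-based cleanup: while the app's total exceeds MAX_BYTES, delete
    # the oldest file, but never one younger than MIN_KEEP_DAYS.
    while [ "$(du -sb "$app_dir" | cut -f1)" -gt "$MAX_BYTES" ]; do
        oldest=$(find "$app_dir" -type f -mtime "+$MIN_KEEP_DAYS" \
                     -printf '%T@ %p\n' | sort -n | head -n1 | cut -d' ' -f2-)
        [ -n "$oldest" ] || break  # nothing old enough left to delete
        rm -f -- "$oldest"
    done
}

for dir in "$LOG_ROOT"/*/; do
    if [ -d "$dir" ]; then rotate_app "$dir"; fi
done
```

Running the age-based pass before the size-based pass means the size limit only has to contend with files that are still inside the retention window.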


Background story:

For the last few days, my cluster was having hard disk trouble: 4 of my 5 nodes were low on disk space.
I already had some monitors down after passing 95% disk usage. (Note: I am running on DO with 160 GB disks.)
I tried freeing disk space with modified commands from the deis docs:

docker ps -a | grep "Exited" | grep -v "hours" | grep -v deis | cut -f1 -d" " | xargs docker rm
docker images -aq | xargs -l10 docker rmi

(Side note: I don't think the -a is necessary in the second command, since docker automatically removes a removed image's untagged ancestors. Dropping it produces fewer errors and speeds up the command.)
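As a sketch of that suggestion, the simplified command would look like the commented line below. It is left commented out because it needs a live Docker daemon; the runnable line only demonstrates the `xargs -l10` batching, with `echo` standing in for `docker rmi`:

```shell
# Suggested variant: without -a, only top-level images are listed, and
# docker rmi removes their untagged parent layers automatically:
#   docker images -q | xargs -l10 docker rmi
# -l10 makes xargs invoke the command once per 10 input lines; demonstrated
# here with a small batch size and echo in place of docker rmi:
printf '%s\n' id1 id2 id3 | xargs -l2 echo docker rmi
# prints:
#   docker rmi id1 id2
#   docker rmi id3
```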

This helped a bit, but far from enough.

My next step was looking into the "S3" of deis to see how many images were stored and whether I could remove some of them. Listing them didn't give a clear idea of what was using so much disk space.

(Important side story) While working on my cluster, I noticed that a test version of my app was going a bit crazy: A) crashing and rebooting, B) sending a large API query (> 10 MB) to a 3rd party every time, which got rejected because of its size, and C) spewing the logs of this query every time.
I fixed the issue, since this was obviously unintentional.

Continuing my investigation, I saw that ceph was reporting how much disk space it was using (142 GB). Only later did I realize this might have been caused by the app that was spewing out logs at a fast rate. And indeed, the log file for this app was 126 GB. Removing it suddenly freed up 377 GB across my whole cluster (almost exactly 3 × 126 GB, presumably due to three-way replication). This solved all my problems.

@krancour
Contributor

This is actually a duplicate of #4280. I'm going to close this and move further discussion over there.
