
Long running job blocks execution of short scheduled jobs #335

Closed
HenryRc0 opened this issue Oct 22, 2018 · 27 comments

@HenryRc0

Hi,

A long-running job blocks the execution of short scheduled jobs, even if it is allowed to execute more
jobs in parallel.

Example:
In Parameters: Max parallel jobs is 5.
5 jobs are configured with a policy to be executed every hour (redo logs from databases).
1 job is configured with a policy to be executed on a daily basis (online/offline backup of a database).
The job doing the online/offline backup runs for about 3 hours.
While that job is running, the tick command adds the 5 jobs correctly to the queue (status queued), but
they aren't executed. Only after the long-running job is finished are all 5 hourly jobs started
(in parallel).
Therefore 2 hourly backups aren't executed.

Henry

@xezpeleta
Contributor

Are your 6 jobs all in the same client, or do you have different clients? How many clients do you have?

@xezpeleta
Contributor

By the way, do you have different backup locations? Did you remember to set the max parallel jobs option there as well?

Thanks

@HenryRc0
Author

They are all different clients.
Max parallel option in backup locations is also set to 5.
The parallel execution is working (every hour in my case).
More detailed example:
at 08:00h, 5 jobs are executed in parallel (run time between 5 and 10 minutes each)
at 09:00h, 5 jobs are executed in parallel (run time between 5 and 10 minutes each)
at 09:30h, 1 long-running job is executed (run time will be about 3 hours)
at 10:00h, 5 jobs are queued but not executed
at 11:00h, since the 5 jobs are already queued, nothing happens
at 12:00h, since the 5 jobs are already queued, nothing happens
at 12:30h, the long-running job finishes
at 12:31h, the 5 queued jobs are executed in parallel

hope this helps

@xezpeleta
Contributor

xezpeleta commented Oct 24, 2018

Hi @HenryRc0

Sorry, but I'm not able to reproduce your issue:

  • I have 3 clients. Each client has 1 job (let's say job 1.1, job 2.2, job 3.3)
  • Job 1.1 is scheduled at 21:00
  • Jobs 2.2 and 3.3 are scheduled at 21:30
  • If job 1.1 is still running at 21:30, jobs 2.2 and 3.3 are added to the queue and they start running immediately
  • I see the 3 jobs in running status

There must be something I'm missing. Just to be sure, I'd like to highlight the 3 configurable options related to parallel jobs:

  • General max parallel job option in Parameters (or envvar if you use Docker)
  • Max parallel job in the backup location (Backup locations window)
  • Max parallel job per client (Client window)

In the example above, I left my max parallel jobs per client at 1, as every job is from a different client.

@xezpeleta
Contributor

By the way, can you try enqueueing the jobs on-demand using the "Enqueue now" button?

  • Enqueue first your long job
  • Wait 1 minute until it starts running
  • Enqueue the other jobs

They should start immediately. Does it work like that?

thanks

@HenryRc0
Author

Hi @xezpeleta
I tested enqueueing on demand - it works there as expected.
My config in short: general max is 5, backup location 5, per client always 2.
Screens from today:
[screenshot: elkar1]
Hourly Backup Jobs (5 parallel)
At 08:01h I enqueued a long-running one on demand.
I waited until it was running (the start time of 08:04 is because I initially forgot to activate the job).
After it started, I could successfully start some hourlies on demand and they were processed.
At 09:00h the hourlies were due and this is the screen:
[screenshot: elkar3]
And here the logs:
[screenshot: elkar2]
So: on demand works, scheduled doesn't.
Do you have any hint what else I could look at (debug)?

Henry

@HenryRc0
Author

Additional screenshot of the logs from enqueueing on demand:
[screenshot: elkar4]
Job 4.4 was started right after it was queued.

@HenryRc0
Author

One difference I can see is that with on-demand enqueueing the log entry "StatusReport QUEUED" has a user entry (root in my case), while the scheduled entries don't.
To explain the (root) from on demand:
The server is started as user root (./console server:start)
Apache is configured as a proxy for the internal server
The cron entry for the tick command is configured to be executed as root
Reason for this setup: I can successfully back up and recover files with the proper user/group/permissions and even SELinux attributes.
Downside of this setup: security (but the server IP is in a management VLAN, so no big worries here)

@xezpeleta
Contributor

As I'm not able to reproduce your issue, I guess it's related to your custom installation. Let me know if you have any more updates.

@HenryRc0
Copy link
Author

@xezpeleta New findings from today's testing:
In short: it could be some timeout issue.

For about 50 minutes after the long-running job is started, everything works as expected.
Scheduled and on-demand jobs are queued and executed.
After about 50 minutes the process "php /opt/elkarbackup/app/console elkarbackup:tick"
goes into zombie state: Zs [php] defunct
There is no entry in any logfile I could find which explains this.
Any new job after this is only queued, not executed, until the long-running one is finished.
In detail: rsnapshot, rsync and "php /opt/elkarbackup/app/console elkarbackup:run_job XX"
When those are finished, the zombie process also stops, freeing the lock, and the next tick command
starts all queued jobs.
(Sometimes I had to remove the long-running job from the Queue in MySQL to get it working again.)

Any ideas/hints on how to proceed?

@HenryRc0
Author

@xezpeleta I think I found the real problem:

The process "php /opt/elkarbackup/app/console elkarbackup:tick" seems to have a memory leak.
Over time it consumes more and more memory, until it either reaches PHP's maximum allowed memory size
or finishes the last job.
In my case it took about 50 minutes to reach the default memory limit of 134217728 bytes (RHEL7).

You could probably not reproduce this issue because you either have a higher memory limit configured, or in your tests the long-running job was not as long-running as mine.
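For reference, the 134217728-byte limit mentioned above is PHP's common default `memory_limit` of 128M; quick shell arithmetic confirms the figure (a sketch, any POSIX shell):

```shell
# 128M expressed in bytes -- matches the 134217728 limit reported above
echo $((128 * 1024 * 1024))
# -> 134217728
```

The effective limit for the CLI can be checked with `php -i | grep memory_limit` (output format varies by PHP build).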

@xezpeleta
Contributor

Thanks for the info, I'm gonna try to reproduce it using a fake rsnapshot binary with a sleep of 50min. I'll keep you updated.

@xezpeleta
Contributor

xezpeleta commented Oct 30, 2018

@HenryRc0 can you try commenting out the following lines?

$this->info('Client "%clientid%", Job "%jobid%" du begin.', array('%clientid%' => $job->getClient()->getId(), '%jobid%' => $job->getId()), $context);
$du = (int)shell_exec(sprintf("du -ks '%s' | sed 's/\t.*//'", $job->getSnapshotRoot()));
$job->setDiskUsage($du);
$this->info('Client "%clientid%", Job "%jobid%" du end.', array('%clientid%' => $job->getClient()->getId(), '%jobid%' => $job->getId()), $context);
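As an aside, the shell pipeline embedded in that `shell_exec` call can be exercised on its own to see exactly what the PHP code parses; a sketch against a throwaway directory rather than the real snapshot root (GNU `sed` assumed, since `\t` in the pattern relies on it):

```shell
# du -ks prints "<kilobytes><TAB><path>"; the sed strips the tab and path,
# leaving only the number that the PHP code casts to (int)
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/blob" bs=1024 count=64 2>/dev/null
du -ks "$dir" | sed 's/\t.*//'
rm -rf "$dir"
```

The printed value is the directory's disk usage in kilobytes (at least 64 here, depending on filesystem block size).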

@HenryRc0
Author

@xezpeleta I commented out the lines -> no effect on memory consumption
("ps xv | grep tick" -> the process "/opt/elkarbackup/app/console elkarbackup:tick"
still increases its memory usage over time).
Did you want me to look for something else while the job is running?

FYI: What I did in the meantime to slow down the increasing memory consumption was to
change the sleep time within TickCommand.php (function waitWithTimout):
I changed "sleep(1);" to "sleep(5);".
With this change the increase is slowed down by a factor of 5.
This is not a solution, but the workaround covers my use case.
I guess the increase happens somewhere within the while loop in the execute function (TickCommand.php), since my change only slows down how often that loop runs within
a given timeframe.
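The per-process memory growth observed via `ps` can also be tracked with a small polling loop; a minimal sketch (assumes a Linux `ps` supporting `-o rss=`; the PID, interval, and sample count are placeholders, with `$$` standing in for the tick command's PID):

```shell
# Sample the resident set size (in KB) of a process at a fixed interval.
pid=$$
for i in 1 2 3; do
    ps -o rss= -p "$pid"
    sleep 1
done
```

Logging these samples with timestamps over an hour would make the leak rate easy to plot.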

@HenryRc0
Author

Supplemental info: the long-running job has now finished. In the overview it says "Disk Usage 0 MB", but that is expected since the lines above are commented out.

@xezpeleta
Contributor

Hi @HenryRc0 ,

Looking for info about memory leaks and Doctrine, I've stumbled upon two possible causes: debug and the profiler.

Debug is disabled by default in Elkarbackup, but the profiler is enabled:

$bundles[] = new Symfony\Bundle\WebProfilerBundle\WebProfilerBundle();

Can you disable it by commenting out that line?

@HenryRc0
Author

HenryRc0 commented Nov 6, 2018

Hi @xezpeleta
I commented out the line as suggested.
I also had to comment out the first six lines in routing_dev.yml and three in config_dev.yml (web_profiler) to get it working again, but I guess that was expected.

I did two tests: one with those changes and one without.
Without changes:
TickCommand MemUsage at the beginning 38100k
after 5 minutes 48344k
after 10 minutes 62640k
With changes:
TickCommand MemUsage at the beginning 28368k
after 5 minutes 43416k
after 10 minutes 57760k

So the changes have an impact on the initial mem consumption, but interestingly
it seems that the increase over time is higher.
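For comparison, the growth rates over the two 10-minute windows above can be computed with plain shell arithmetic (using the figures reported in this comment):

```shell
# Without changes: 38100k -> 62640k over 10 minutes
echo $(( (62640 - 38100) / 10 ))   # KB per minute -> 2454
# With changes: 28368k -> 57760k over 10 minutes
echo $(( (57760 - 28368) / 10 ))   # KB per minute -> 2939
```

So roughly 2.4 MB/min without the profiler changes versus 2.9 MB/min with them, consistent with the observation that the increase is higher despite the lower starting point.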

Henry

@xezpeleta
Contributor

xezpeleta commented Nov 7, 2018

Hi @HenryRc0

How are you measuring the memory usage of TickCommand? Just using ps aux or any other utility?

I'm trying to simulate your issue using a fake rsnapshot but I don't see memory usage increasing...

Thanks

EDIT: ah ok, I saw your previous comment. You're using: ps xv | grep tick

@xezpeleta
Contributor

xezpeleta commented Nov 7, 2018

I've just checked it in my test environment (a Docker container with a fake rsnapshot script, PHP v7.1.22) and my production environment (Debian 9, "real" rsnapshot v1.4.2 and PHP v7.0.30).

In my test environment, after a 50min long job the memory usage of TickCommand remains stable: 40368

In my production server, after more than 20 jobs (some of them take 20min), memory usage remains stable: 33176

@HenryRc0
Author

Environment here:
RHEL 7.5
php 7.1.22
rsnapshot 1.4.2
mariadb 5.5.60

In your environments, is the memory usage of TickCommand always stable?
What I also see: while a job is running and others get scheduled and executed, the increase is higher.

@HenryRc0
Author

@xezpeleta
I found the reason for the increasing memory - and a solution for my installation.
Within TickCommand (function execute) right after the line:
$this->manager = $this->container->get('doctrine')->getManager();
I added the line:
$this->manager->getConnection()->getConfiguration()->setSQLLogger(null);

With this change the memory consumption stays stable at ~30MB for the TickCommand process.

Source:
https://coderwall.com/p/awzjhw/avoiding-memory-leaks-in-symfony2-doctrine-entitymanager

So there is only one question: Why does only my installation have this issue?

Henry

@xezpeleta
Contributor

Hi @HenryRc0

Yes, I've found that website and more: https://stackoverflow.com/questions/9699185/memory-leaks-symfony2-doctrine2-exceed-memory-limit

Apparently the issue is related to the debugging option, which should be disabled by default. So I imagine that your problem comes from the non-Debian-based installation process (the installation script, I guess?).

@HenryRc0
Author

Hi @xezpeleta

yes the initial installation was done with the script.

Here is the (snipped) output of ./console debug:config doctrine

doctrine:
    dbal:
        default_connection: default
        connections:
            default:
                driver: pdo_mysql
                host: localhost
                port: null
                dbname: elkarbackup
                user: elkarbackup
                password: ************ (cleartext password)
                charset: UTF8
                server_version: 5.5
                logging: true
                profiling: true
                options: { }
                mapping_types: { }
                default_table_options: { }
                slaves: { }
                shards: { }

I guess logging is false on your installations.

I will also try the other solution mentioned in your link:

doctrine:
    dbal:
        driver: ...
        ...
        logging: false
        profiling: false

@HenryRc0
Author

I have now finished tests with the parameters set in config.yml.

Memory consumption stays stable, but at a higher level of ~39MB.
Removing the parameters and adding the one-liner in TickCommand again,
the memory consumption is stable at ~30MB.

The additional 9MB are interesting, but both solutions work.

@xezpeleta Any idea yet which direction you are heading in?

@elacunza
Contributor

Can we close this issue?

@HenryRc0
Author

For me it is solved. I don't know if either of the solutions is now included in the project (I changed the parameters locally).

@elacunza
Contributor

Thanks @HenryRc0. We're discontinuing script-based installation in 1.4.0, so I'm closing this issue since you fixed the problem.

I suggest you switch to the Docker image, which will be much more maintainable.

Thanks
