
Long running job blocks execution of short scheduled jobs #335

Closed
HenryRc0 opened this issue Oct 22, 2018 · 27 comments

@HenryRc0

Hi,

A long-running job blocks the execution of short scheduled jobs, even if it is allowed to execute more
jobs in parallel.

Example:
In Parameters: Max parallel jobs is 5.
5 jobs are configured with a policy to be executed every hour (redo logs from databases).
1 job is configured with a policy to be executed on a daily basis (online/offline backup of a database).
The job doing the online/offline backup runs for about 3 hours.
While that job is running, the tick command adds the 5 jobs correctly to the queue (status queued), but
they aren't executed. Only after the long-running job is finished are all 5 hourly jobs started
(in parallel).
Therefore 2 hourly backups aren't executed.

Henry

@xezpeleta
Contributor

Are your 6 jobs all in the same client, or do you have different clients? How many clients do you have?

@xezpeleta
Contributor

By the way, do you have different backup locations? Did you remember to set the max parallel jobs option there as well?

Thanks

@HenryRc0
Author

They are all different clients.
Max parallel option in backup locations is also set to 5.
The parallel execution is working (every hour in my case).
More detailed example:
at 08:00h, 5 jobs are executed in parallel (run time between 5 and 10 minutes each)
at 09:00h, 5 jobs are executed in parallel (run time between 5 and 10 minutes each)
at 09:30h, 1 long-running job is executed (run time will be about 3 hours)
at 10:00h, 5 jobs are queued but not executed
at 11:00h, since the 5 jobs are already queued, nothing happens
at 12:00h, since the 5 jobs are already queued, nothing happens
at 12:30h, the long-running job finishes
at 12:31h, the 5 queued jobs are executed in parallel

hope this helps

@xezpeleta
Contributor

xezpeleta commented Oct 24, 2018

Hi @HenryRc0

Sorry, but I'm not able to reproduce your issue:

  • I have 3 clients. Each client has 1 job (let's say job 1.1, job 2.2, job 3.3)
  • Job 1.1 is scheduled at 21:00
  • Jobs 2.2 and 3.3 are scheduled at 21:30
  • If job 1.1 is still running at 21:30, jobs 2.2 and 3.3 are added to the queue and they start running immediately
  • I see the 3 jobs in running status

There must be something I'm missing. Just to be sure, I'd like to highlight the 3 configurable options related to parallel jobs:

  • General max parallel job option in Parameters (or envvar if you use Docker)
  • Max parallel job in the backup location (Backup locations window)
  • Max parallel job per client (Client window)

In the example above, I left my max parallel jobs per client at 1, as every job is from a different client.

@xezpeleta
Contributor

By the way, can you try enqueueing the jobs on-demand using the "Enqueue now" button?

  • Enqueue first your long job
  • Wait 1 minute until it starts running
  • Enqueue the other jobs

They should start immediately. Does it work like that?

thanks

@HenryRc0
Author

Hi @xezpeleta
I tested enqueueing on demand - it works there as expected.
My config in short: general max is 5, backup location 5, per client always 2.
Screens from today:
[screenshot: elkar1]
Hourly Backup Jobs (5 parallel)
At 08:01h I enqueued a long-running one on demand.
I waited until it was running (the start time of 08:04 is because I initially forgot to activate the job).
After it started, I could successfully start some hourlies on demand and they were processed.
At 09:00h the hourlies were due and this is the screen:
[screenshot: elkar3]
And here the logs:
[screenshot: elkar2]
So: on demand works, scheduled doesn't.
Do you have any hint what else I could look at (debug)?

Henry

@HenryRc0
Author

Additional screenshot of the logs from enqueueing on demand:
[screenshot: elkar4]
Job 4.4 was started right after it was queued.

@HenryRc0
Author

One difference I can see is that with on-demand enqueueing the log entry "StatusReport QUEUED" has a user entry (root in my case), while the scheduled entries don't.
To explain the (root) from on demand:
The server is started as user root (./console server:start)
Apache is configured as a proxy for the internal server
The cron entry for the tick command is configured to be executed as root
Reason for this setup: I can successfully back up and recover files with the proper user/group/permissions and even SELinux attributes.
Downside of this setup: security (but the server IP is in a management VLAN, so no big worries here)

@xezpeleta
Contributor

As I'm not able to reproduce your issue, I guess it's related to your custom installation. Let me know if you have any more updates.

@HenryRc0
Copy link
Author

@xezpeleta New findings from today's testing:
In short: it could be some timeout issue.

For about 50 minutes after the long-running job is started, everything works as expected.
Scheduled and on-demand jobs are queued and executed.
After about 50 minutes the process "php /opt/elkarbackup/app/console elkarbackup:tick"
goes into zombie state: Zs [php] defunct
There is no entry in any logfile I could find which explains this.
Any new job after this is only queued, not executed, until the long-running one is finished.
In detail: rsnapshot, rsync and "php /opt/elkarbackup/app/console elkarbackup:run_job XX"
When those are finished, the zombie process also stops, freeing the lock, and the next tick command
starts all queued jobs.
(Sometimes I had to remove the long-running job from the Queue in MySQL to get it working again.)

Any ideas/hints on how to proceed?

@HenryRc0
Author

@xezpeleta I think I found the real problem:

The process "php /opt/elkarbackup/app/console elkarbackup:tick" seems to have a memory leak.
Over time it consumes more and more memory, until it either reaches PHP's maximum allowed memory size
or finishes the last job.
In my case it took about 50 minutes to reach the default memory limit of 134217728 bytes (RHEL7).

You could probably not reproduce this issue because you either have a higher memory limit configured, or in your tests the long-running job was not as long-running as mine.
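For reference, the 134217728-byte limit mentioned above is PHP's common default `memory_limit` of 128M; quick shell arithmetic confirms the figure (a sketch, any POSIX shell):

```shell
# 128M expressed in bytes -- matches the 134217728 limit reported above
echo $((128 * 1024 * 1024))
# -> 134217728
```

The effective limit for the CLI can be checked with `php -i | grep memory_limit` (output format varies by PHP build).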

@xezpeleta
Contributor

Thanks for the info, I'm gonna try to reproduce it using a fake rsnapshot binary with a sleep of 50min. I'll keep you updated.

@xezpeleta
Contributor

xezpeleta commented Oct 30, 2018

@HenryRc0 can you try commenting out the following lines?

$this->info('Client "%clientid%", Job "%jobid%" du begin.', array('%clientid%' => $job->getClient()->getId(), '%jobid%' => $job->getId()), $context);
$du = (int)shell_exec(sprintf("du -ks '%s' | sed 's/\t.*//'", $job->getSnapshotRoot()));
$job->setDiskUsage($du);
$this->info('Client "%clientid%", Job "%jobid%" du end.', array('%clientid%' => $job->getClient()->getId(), '%jobid%' => $job->getId()), $context);
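As an aside, the shell pipeline embedded in that `shell_exec` call can be exercised on its own to see exactly what the PHP code parses; a sketch against a throwaway directory rather than the real snapshot root (GNU `sed` assumed, since `\t` in the pattern relies on it):

```shell
# du -ks prints "<kilobytes><TAB><path>"; the sed strips the tab and path,
# leaving only the number that the PHP code casts to (int)
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/blob" bs=1024 count=64 2>/dev/null
du -ks "$dir" | sed 's/\t.*//'
rm -rf "$dir"
```

The printed value is the directory's disk usage in kilobytes (at least 64 here, depending on filesystem block size).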

@HenryRc0
Author

@xezpeleta I commented out the lines -> no effect on memory consumption
("ps xv | grep tick" -> the process "/opt/elkarbackup/app/console elkarbackup:tick"
still increases its memory usage over time).
Did you want me to look for something else while the job is running?

FYI: What I did in the meantime to slow down the increasing memory consumption was to
change the sleep time within TickCommand.php (function waitWithTimout):
I changed "sleep(1);" to "sleep(5);".
With this change the increase is slowed down by a factor of 5.
This is not a solution, but the workaround covers my use case.
I guess the increase happens somewhere within the while loop in the execute function (TickCommand.php), since my change only slows down how often that loop runs within
a given timeframe.
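The per-process memory growth observed via `ps` can also be tracked with a small polling loop; a minimal sketch (assumes a Linux `ps` supporting `-o rss=`; the PID, interval, and sample count are placeholders, with `$$` standing in for the tick command's PID):

```shell
# Sample the resident set size (in KB) of a process at a fixed interval.
pid=$$
for i in 1 2 3; do
    ps -o rss= -p "$pid"
    sleep 1
done
```

Logging these samples with timestamps over an hour would make the leak rate easy to plot.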

@HenryRc0
Author

Supplemental info: the long-running job has now finished. In the overview it says "Disk Usage 0 MB", but that is expected since the lines above are commented out.

@xezpeleta
Contributor

Hi @HenryRc0 ,

Looking for info about memory leaks and Doctrine, I've stumbled upon two possible causes: debug and the profiler.

Debug is disabled by default in Elkarbackup, but the profiler is enabled:

$bundles[] = new Symfony\Bundle\WebProfilerBundle\WebProfilerBundle();

Can you disable it by commenting out that line?

@HenryRc0
Author

HenryRc0 commented Nov 6, 2018

Hi @xezpeleta
I commented out the line as suggested.
I also had to comment out the first six lines in routing_dev.yml and three in config_dev.yml (web_profiler) to get it working again, but I guess that was expected.

I did two tests: one with those changes and one without.
Without changes:
TickCommand MemUsage at the beginning 38100k
after 5 minutes 48344k
after 10 minutes 62640k
With changes:
TickCommand MemUsage at the beginning 28368k
after 5 minutes 43416k
after 10 minutes 57760k

So the changes have an impact on the initial mem consumption, but interestingly
it seems that the increase over time is higher.
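For comparison, the growth rates over the two 10-minute windows above can be computed with plain shell arithmetic (using the figures reported in this comment):

```shell
# Without changes: 38100k -> 62640k over 10 minutes
echo $(( (62640 - 38100) / 10 ))   # KB per minute -> 2454
# With changes: 28368k -> 57760k over 10 minutes
echo $(( (57760 - 28368) / 10 ))   # KB per minute -> 2939
```

So roughly 2.4 MB/min without the profiler changes versus 2.9 MB/min with them, consistent with the observation that the increase is higher despite the lower starting point.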

Henry

@xezpeleta
Contributor

xezpeleta commented Nov 7, 2018

Hi @HenryRc0

How are you measuring the memory usage of TickCommand? Just using ps aux or any other utility?

I'm trying to simulate your issue using a fake rsnapshot but I don't see memory usage increasing...

Thanks

EDIT: ah ok, I saw your previous comment. You're using: ps xv | grep tick

@xezpeleta
Contributor

xezpeleta commented Nov 7, 2018

I've just checked it in my test environment (a Docker container with a fake rsnapshot script, PHP v7.1.22) and my production environment (Debian 9, "real" rsnapshot v1.4.2 and PHP v7.0.30).

In my test environment, after a 50min long job the memory usage of TickCommand remains stable: 40368

In my production server, after more than 20 jobs (some of them take 20min), memory usage remains stable: 33176

@HenryRc0
Author

Environment here:
RHEL 7.5
php 7.1.22
rsnapshot 1.4.2
mariadb 5.5.60

In your environments, is the memory usage of TickCommand always stable?
What I also see: while a job is running and others get scheduled and executed, the increase is higher.

@HenryRc0
Author

@xezpeleta
I found the reason for the increasing memory - and a solution for my installation.
Within TickCommand (function execute) right after the line:
$this->manager = $this->container->get('doctrine')->getManager();
I added the line:
$this->manager->getConnection()->getConfiguration()->setSQLLogger(null);

With this change the memory consumption stays stable at ~30MB for the TickCommand process.

Source:
https://coderwall.com/p/awzjhw/avoiding-memory-leaks-in-symfony2-doctrine-entitymanager

So there is only one question: Why does only my installation have this issue?

Henry

@xezpeleta
Contributor

Hi @HenryRc0

Yes, I've found that website and more: https://stackoverflow.com/questions/9699185/memory-leaks-symfony2-doctrine2-exceed-memory-limit

Apparently the issue is related to the debugging option, which should be disabled by default. So I imagine that your problem comes from the non-Debian-based installation process (the installation script, I guess?).

@HenryRc0
Author

Hi @xezpeleta

yes the initial installation was done with the script.

Here is the (snipped) output of ./console debug:config doctrine

doctrine:
    dbal:
        default_connection: default
        connections:
            default:
                driver: pdo_mysql
                host: localhost
                port: null
                dbname: elkarbackup
                user: elkarbackup
                password: ************ (cleartext password)
                charset: UTF8
                server_version: 5.5
                logging: true
                profiling: true
                options: { }
                mapping_types: { }
                default_table_options: { }
                slaves: { }
                shards: { }

I guess logging is false on your installations.

I will also try the other solution mentioned in your link:

doctrine:
    dbal:
        driver: ...
        ...
        logging: false
        profiling: false

@HenryRc0
Author

I have now finished tests with the parameters set in config.yml.

Memory consumption stays stable, but at a higher level of ~39MB.
Removing the parameters and adding the one-liner in TickCommand again,
the memory consumption is stable at ~30MB.

The additional 9MB are interesting, but both solutions work.

@xezpeleta Any idea yet which direction you are heading in?

@elacunza
Contributor

Can we close this issue?

@HenryRc0
Author

For me it is solved. I don't know if either of the solutions is now included in the project (I changed the parameters locally).

@elacunza
Contributor

Thanks @HenryRc0. We're discontinuing script-based installation in 1.4.0, so I'm closing this issue since you fixed the problem.

I suggest you switch to the Docker image, which will be much more maintainable.

Thanks
