
Troubleshooting: Condor


Jobs in Hold status

If a job is held, use condor_q -l $JOBID | grep Hold to see why: this prints the HoldReason and HoldReasonCode attributes of the job.
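
Alternatively, condor_q provides a flag that lists all of your held jobs together with a short hold reason:

condor_q -hold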

Jobs are using too much memory

Short-term solution

You can manually change the amount of memory requested for a given job using the condor_qedit command (see the condor_qedit documentation). For example, to set the requested memory to 3 GB for job 12345, call:

condor_qedit 12345 RequestMemory '3000'
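
You can confirm that the edit took effect by inspecting the job ClassAd again:

condor_q -l 12345 | grep RequestMemory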

If you want to edit all jobs held because they exceeded their memory request (HoldReasonCode 34), you can do that in one go:

condor_qedit -n $SCHEDULER -constraint 'HoldReasonCode == 34' RequestMemory '3000'

where SCHEDULER is the scheduler your jobs run on, e.g. lpcschedd1.fnal.gov (call condor_q to find out which one it is).
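
Note that condor_qedit only changes the job ClassAd; it does not take jobs out of the Hold state. Once the memory request is raised, release the jobs so they can be rescheduled:

condor_release -constraint 'HoldReasonCode == 34' -name $SCHEDULER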

Long-term solution

Generally, we want to avoid requesting more memory than the default 2100 MB, since higher requests lead to longer wait times for a job slot. You can reduce the memory consumption of your jobs in two ways:

  1. Don't save as much stuff. Try to minimize your histograms, calculations, etc. to what you really need. If you need extra histograms, regions, etc. for a specific study, either keep that code in a separate branch or implement a configuration switch that lets you turn off the creation of these extra objects by default (see the first sketch after this list).

  2. Process fewer events in one go. This can be accomplished by changing the chunksize parameter of the do_worker function, which controls how many events are processed as one contiguous set; lower values mean lower memory consumption. At the same time, lower values make the overall execution slower, so keep this value as high as you can afford (see the second sketch after this list).
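
As a rough illustration of both points, here are two hypothetical Python sketches. Neither is this repository's actual code; all names are made up for illustration.

For point 1, a configuration switch gating extra objects could look like this:

def create_histograms(make_extra_histograms=False):
    # Only book the histograms that every run needs.
    histograms = {"met": [], "jet_pt": []}
    if make_extra_histograms:
        # Extra study-specific objects are skipped by default,
        # keeping the per-job memory footprint down.
        histograms["study_region_a"] = []
    return histograms

For point 2, the chunksize trade-off works roughly like this:

def do_worker(total_events, chunksize=100000):
    # Events are handled in contiguous blocks of at most `chunksize`
    # events, so only one block is resident in memory at a time.
    for start in range(0, total_events, chunksize):
        stop = min(start + chunksize, total_events)
        chunk = range(start, stop)  # stand-in for loading a block of events
        # Processing of `chunk` happens here: smaller chunks lower the
        # peak memory, but the added per-chunk overhead slows the job down.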