# Troubleshooting: Condor
If a job is held, run `condor_q -l $JOBID | grep Hold` to find out why: the output includes the `HoldReason` and `HoldReasonCode` attributes, which describe what went wrong.
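For example, assuming the held job has ID 12345 (the ID here is illustrative):

```
# Show the hold-related attributes of job 12345.
# A HoldReasonCode of 34 means the job exceeded its requested memory.
condor_q -l 12345 | grep Hold
```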
You can manually change the amount of memory requested for a given job using the `condor_qedit` command (see the `condor_qedit` documentation). `RequestMemory` is given in MB, so to set the requested memory to 3 GB for job 12345, call:
```
condor_qedit 12345 RequestMemory '3000'
```
If you want to edit all jobs held due to the memory issue, you can do that in one call:

```
condor_qedit -n $SCHEDULER -constraint 'HoldReasonCode == 34' RequestMemory '3000'
```

where `$SCHEDULER` is the scheduler your jobs run on, e.g. `lpcschedd1.fnal.gov` (run `condor_q` to find out which one it is).
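Note that `condor_qedit` does not release held jobs by itself; once the request is raised, you still need to release them with `condor_release`, shown here as a sketch reusing the same constraint as above:

```
# Release all jobs that were held for exceeding their memory request.
condor_release -name $SCHEDULER -constraint 'HoldReasonCode == 34'
```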
Generally, we want to avoid requesting more memory than the default 2100 MB, since higher requests lead to longer wait times for a job slot. You can reduce the memory consumption of your jobs in two ways:
- Don't save as much stuff. Try to minimize your histograms, calculations, etc. to what you really need. If you need extra histograms, regions, etc. for a specific study, either store that code in a separate branch or implement a configuration switch that allows you to turn off the creation of these extra objects by default.
- Process fewer events in one go. This can be accomplished by lowering the `chunksize` parameter of the `do_worker` function, which controls how many events are processed as a contiguous set, so lower values mean lower memory consumption. At the same time, lower values result in overall slower execution, so we want this value to remain as high as we can afford.
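Whichever approach you take, you can check whether it had the desired effect by comparing a finished job's requested and actual memory. `RequestMemory` and `MemoryUsage` are standard HTCondor job attributes, and `-af` (autoformat) prints just those values (again with an illustrative job ID):

```
# Requested vs. peak memory (both in MB) for completed job 12345.
condor_history 12345 -af RequestMemory MemoryUsage
```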