SGE Changes Tracking #24
Changed default qsub priority to 100 from 0. This allows a user to …
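For context, per-job priority is set at submit time with qsub's -p flag; a minimal sketch (the script name and value are only illustrative):

```bash
# Submit with an explicit priority; -p takes an integer in the
# range -1023..1024 and influences scheduling order, subject to policy.
qsub -p 50 -cwd ./run_analysis.sh
```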
The initial default qsub memory target will be 4 GB. Memory as a consumable is NOT enabled yet, but noting this selection from the Skype call.
Due to the priority request, the two high-memory nodes will be removed from all.q (the default) in a few minutes. Please confirm the name of the desired long-running queue. I believe it was …
long queue is fine. Thanks!
Nodes 15 and 16 will shortly be removed from the default queue (all.q). Do you want just those two nodes in the long-running queue, or do you want all the nodes to be able to run some number of long.q jobs as well when they are idle?
Memory as a consumable will be prepared today but NOT activated, as @hurleyLi is, I believe, running a stack of jobs. I will be oversubscribing RAM on the nodes by 20% to start. The default per-job memory, if not specified at qsub time, will be 4 GB.
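A hedged sketch of what enabling memory as a consumable typically involves; the complex name (h_vmem), host name, and values here are assumptions, since the thread does not record the exact attribute that will be used:

```bash
# Sketch: mark a memory complex as consumable with a 4G per-slot default.
qconf -sc > complexes.txt
# edit the h_vmem line so the consumable/default columns read roughly:
#   h_vmem  h_vmem  MEMORY  <=  YES  YES  4G  0
qconf -Mc complexes.txt

# Give each exec host a bookable capacity (about 1.2x physical RAM for
# the 20% oversubscription mentioned above), e.g. on a hypothetical node:
qconf -me node01    # set: complex_values h_vmem=<about 1.2x physical RAM>
```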
Let's do the first option (nodes 15 and 16 as the long queue).
When the jobs on nodes 15 and 16 are complete, I will test that long.q is correct and ready for use, and add the usage statements to the wiki. What was the other queue you wanted?
Also, as discussed, please set up 128 cores (four nodes' worth) as the short queue (2-day maximum). The remaining 10 nodes will be the week queue (7-day maximum).
Noted.
What queue do you want as the default if one isn't specified? I assume …
Yes, let's make the week queue the default.
For simplicity, the week queue is simply going to be all.q for now. I will set the maximum run time to one week after @hurleyLi's jobs are done.
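A minimal sketch of what the run-time caps could look like once applied; the short queue's name (short.q) and the exact h_rt values are assumptions:

```bash
# Sketch: hard wall-clock limits of 2 days on the short queue and
# 7 days on all.q while it doubles as the week queue.
qconf -mattr queue h_rt 48:00:00 short.q
qconf -mattr queue h_rt 168:00:00 all.q
```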
I'll stop submitting jobs for now then.
No, that's not been part of anything I've recommended. I was told you should be pushing jobs through! My efforts here are to NOT conflict with that, as a stated goal from Kelly.
Sorry, I misunderstood. But if I keep running those jobs continuously, they'll probably take at least a week to finish. So you'll activate the changes at that time?
I've been told your jobs are the priority. If my changes are going to interfere with that, I will leave them until you are done. However, yes, you are correct on that n16 matter, which is a config item I need to finish from when I believed getting that queue set up was my priority. I might as well finish it if you can hold a moment.
I've disabled that queue for now. It shouldn't happen again, and I will let your jobs finish.
Basically, I could still use some clarification: are people waiting to run jobs besides @hurleyLi? If @hurleyLi reduces his slot count a bit, as noted in #34, there will even be a place to run such jobs. But I defer to his goals in getting things done for those numbers.
Hi Paul, how long does it take for you to activate/implement those changes?
I have the first pass ready for some testing, to make sure what has been described is what will occur. I am not, however, renaming all.q to week.q at the moment, due to your jobs in said queue.
For example, at the moment, because node 3 is idle, I am testing the short queue.
What I can do is kill the jobs and let you make the changes, because I also want these implemented and tested sooner rather than later. I have to stop at some point anyway for you to activate the changes; if I don't stop, the whole process after these jobs will last a month. Right now I'm really just waiting on 2-3 jobs to finish: 276, 279, 285. They'll probably finish around 2-3pm this afternoon. Do you think you can activate the changes this afternoon, so I can restart those jobs later today? In the meantime, I can test some of my changes as mentioned in #34. Does that sound like a plan?
I will implement the changes when I see your jobs exit. Be aware that your 2-3PM is my end of day. After-hours monitoring is best effort, which is why I normally do not make heavy changes on a Friday. But in the interest of moving this along, and given I am mostly around this weekend, I will attempt to implement the queue items above. IMPORTANT question however: are you really ready for memory reservations? I can do that separately, but you would need to add the proper … As we are outpacing my ability to document and configure, we'd have to sort out any issues as we go.
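For reference, a hedged sketch of the per-job request that would likely be needed once memory becomes consumable (the h_vmem name and the 8G value are assumptions):

```bash
# Reserve 8 GB for this job instead of the 4 GB default.
qsub -l h_vmem=8G -cwd ./run_analysis.sh
```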
Do you prefer to do this on Monday? We can do it next week if it's best for you to monitor the cluster after these changes. I don't think other people are gonna use the cluster besides me.
If that is truly the case I would prefer Monday. That way you will get uninterrupted use this weekend. While SGE is fairly direct, it periodically gets confused, and I'd rather not scramble through config if I can do it during regular work hours.
That's fine with me. Let's change the config on Monday.
Old nodes, except for … Remember, it's a single 1G link over to these nodes, and for now we are using an NFS->Lustre method.
So this is to remind me: I have NOT turned on memory as a consumable yet on … There is no walltime limit on the …
Walltime on …
The SGE s_core limit (soft core limit) was changed from UNLIMITED to 0 a moment ago per #102. This means the default UNIX "coredump" limit is set to zero, which results in a crashing program not dumping a potentially large core file. A recent reminder of the joys of this, with many large-memory jobs all crashing due to a bug, was the impetus for doing what I normally do as a default and forgot. If you NEED coredumps in a job, add a …
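The exact flag to add is cut off above; one hedged way to get core files back for a single job, assuming the hard core limit (h_core) is still unlimited so the soft limit can be raised inside the job script:

```bash
#!/bin/bash
#$ -cwd
# Raise the soft coredump limit for this job only; this works only
# while the hard limit has not also been forced to 0.
ulimit -S -c unlimited
./my_crashy_program    # hypothetical binary
```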
I submitted a group of jobs (job IDs 405730-405751) by … I'm expecting the submitted jobs to run for more than two days (that is why I submitted them to the long queue), but it looks like some of the jobs are running in all.q and will be terminated after two days? Is this situation (submitted to long.q, but actually using all.q) expected? Do you have any ideas?
I'm looking at job 405731 as an example of this. No, I would say that is not expected, but I will look at the logs to see what I can find to explain it, or locate a config item that needs to change.
And if you have the precise qsub command you issued handy, it would be helpful.
Assuming it wasn't just …
Thank you for taking a look. The command was (in the Makefile): calc_LD_005_ALL: … The path is: …
OK. I noted the "pe smp 1" in … and wanted to check on that. Is there a reason for requesting a PE of only one processor? (Compared to more than one, that is really the same as not asking for a PE.) It shouldn't matter, but I've actually never used that construct. I will try to reproduce it using a similar makefile. I wonder if the behavior will change if you embed the SGE hard resource in the script, ala:
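Something roughly like this (a sketch only; the -l long boolean and the file names are assumptions based on the flags quoted later in this thread):

```bash
#!/bin/bash
# Sketch: embed the resource request as SGE directives in the job
# script itself rather than on the qsub command line.
#$ -cwd
#$ -l long        # hard request for the long queue's resource
#$ -pe smp 1
./calc_LD.sh      # hypothetical payload
```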
I have reproduced the above. For whatever reason, the first job is issued to long.q as expected, but the following jobs land in all.q. I am trying to step through the output to see why.
When I changed from "-l long" to "-l week", all of the jobs ran in week.q (405752-405773).
Interesting. When I removed "pe smp 1", they all ended up in long.q ;) So I suspect something about the final commands being output here. Is there a reason you are doing this make loop rather than an SGE array job?
Very interesting. No, I didn't have a specific reason for not using an array job. Next time, I'll try removing "pe smp 1" when I run single-CPU jobs, or try using an array job.
Let me look at it a bit more in the morning. Whatever is going on, it's subtle. The array job suggestion I can explain more: it's just a simpler way of running a series of jobs from the same submit script, using SGE's built-in shell variables that increment much like your makefile loop. Basically a notch less setup work, and SGE is a notch more efficient with array jobs, particularly for large arrays. But I'd still like to understand what is going on there. I am suspicious of the PE perhaps confusing things, but I don't know why it would drop into all.q. So "unsure" is the answer for now, and I will look closer with some coffee in the morning.
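A minimal sketch of the array-job pattern being suggested (file names, task range, and the input naming scheme are hypothetical):

```bash
#!/bin/bash
#$ -cwd
#$ -l long
# Submit once with: qsub -t 1-22 calc_ld_array.sh
# SGE runs one task per index and sets $SGE_TASK_ID for each task.
INPUT=chunk_${SGE_TASK_ID}.vcf    # hypothetical input naming
./calc_LD.sh "$INPUT"
```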
Thanks, please let me know if you figure it out in more detail. I just re-submitted the jobs to the long queue.
So basically it's got something to do with the … Removing the …
So I'm suspecting some kind of qsub argument bug at the moment, but I will double-check my queue assumptions. I am also going to try different …
Interestingly, using … So I'm guessing it's a form of bug involving asking for an SMP environment (multiple processors) but then really saying "not really", as that is what "1" would mean.
There are a few mutterings on the mailing lists of somewhat similar experiences, but nothing conclusive. I believe the queues are correct, but that the outcome when the PE is …
BTW, random choices of PE smp values above 1 all seem to result in the jobs correctly landing in long.q.
SGE spool moved to …
Moved seq_no for juplow and juphigh to seq_no's that don't match the other queues. Current settings for reference:
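The current values can be read back from the queue configurations, e.g. (assuming the full queue names are juplow.q and juphigh.q):

```bash
# Show the scheduler sequence number for each queue.
qconf -sq juplow.q  | grep seq_no
qconf -sq juphigh.q | grep seq_no
```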
Just for reference.
For reference: per Git #278, the s_rt and h_rt limits were removed from juphigh.q.
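A hedged sketch of clearing run-time limits on a queue with qconf (INFINITY is SGE's "no limit" value for these attributes); the exact commands used for #278 are not recorded in this thread:

```bash
# Remove the soft and hard wall-clock limits from juphigh.q, then verify.
qconf -mattr queue s_rt INFINITY juphigh.q
qconf -mattr queue h_rt INFINITY juphigh.q
qconf -sq juphigh.q | grep _rt
```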
I will be documenting changes to the SGE config as we move toward a first goal of tracking the queue and settings config here. I will also attempt to get the details and usage of such changes into the wiki.