SGE Changes Tracking #24

Open
tatarsky opened this issue Nov 4, 2015 · 180 comments

@tatarsky (Collaborator) commented Nov 4, 2015

I will document changes I make to the SGE config here as we move toward the first goal of a working queue and settings configuration. I will also attempt to add the details and usage of these changes to the wiki.

@tatarsky (Collaborator Author) commented Nov 4, 2015

Changed the default qsub priority from 0 to 100. This allows users to qalter the priority of their own queued jobs later, or set it on the qsub command line, without SGE manager intervention. Details coming to the wiki when I get a moment. The goal is to let users re-order their own priorities, within reason, once jobs are queued.
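For reference, a minimal sketch of how that plays out (the job ID, script name, and priority values below are hypothetical, not taken from the cluster):

# Submit below the new 100 default, or lower an already-queued job to push it back:
qsub -p 50 myjob.sh
qalter -p 10 12345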

@tatarsky (Collaborator Author) commented Nov 4, 2015

The initial default qsub memory target will be 4GB. Memory as a consumable is NOT enabled yet, but noting this selection from the Skype call.

@tatarsky (Collaborator Author) commented Nov 6, 2015

Due to the priority request, the two high-memory nodes will be removed from all.q (the default) in a few minutes. Please confirm the name of the desired long-running queue; I believe it was long.

@nariai (Contributor) commented Nov 6, 2015

long queue is fine. thanks!

@tatarsky (Collaborator Author) commented Nov 6, 2015

Nodes 15 and 16 will shortly not be in the default queue (all.q).

Do you want just those two nodes in the long-running queue, or do you want all the nodes to be able to run some number of long.q jobs as well when they are idle?

@tatarsky (Collaborator Author) commented Nov 6, 2015

Memory as a consumable will be prepared today but NOT activated, as I believe @hurleyLi is running a stack of jobs. I will be oversubscribing RAM on the nodes by 20% to start. The default request, if not specified at qsub time, will be 4GB.
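For reference, a hedged sketch of the kind of configuration this involves (the host name and capacity figure are placeholders, not the values that will actually be used):

# Make h_vmem a requestable consumable with a 4G default via the complex editor:
qconf -mc     # edit the h_vmem line to read: h_vmem  h_vmem  MEMORY  <=  YES  YES  4G  0
# Then give each exec host a capacity roughly 20% above its physical RAM, e.g.:
qconf -me node1     # set: complex_values  h_vmem=300G   (host name and figure are examples only)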

@nariai (Contributor) commented Nov 6, 2015

Let's do the first option (nodes 15 and 16 as the long queue).
We want to use the long queue for jobs that require high memory.

@tatarsky (Collaborator Author) commented Nov 6, 2015

When the jobs on nodes 15 and 16 are complete, I will test that long.q is correct and ready for use, and add the usage statements to the wiki.

What was the other queue you wanted?

@nariai (Contributor) commented Nov 6, 2015

Also, as discussed, please make 128 cores (the equivalent of four nodes) the short queue (2 days maximum). In this case, let all the nodes be able to run short-queue jobs as well if they are idle.

The remaining nodes (10 nodes) will be the week queue (7 days maximum). In this case, if the high-memory nodes (2 nodes) are idle, they can also be used for the week queue.

@tatarsky (Collaborator Author) commented Nov 6, 2015

Noted.

@tatarsky (Collaborator Author) commented Nov 6, 2015

Which queue do you want as the default if users don't specify one? I assume week.

@nariai (Contributor) commented Nov 6, 2015

Yes, let's make the week queue the default.

@tatarsky (Collaborator Author) commented Nov 6, 2015

For simplicity, the week queue is simply going to be all.q for now. I will set the maximum run time to one week after @hurleyLi's jobs are done.
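For reference, a hedged sketch of how those wallclock limits would appear once set (2 days for short and 7 days for week, per the plan above; this assumes the queues end up named short.q and week.q, and shows only the relevant field):

qconf -sq short.q | grep h_rt    # expected: h_rt  48:00:00
qconf -sq week.q  | grep h_rt    # expected: h_rt  168:00:00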

@hurleyLi commented Nov 6, 2015

I'll stop submitting jobs for now then.

@tatarsky (Collaborator Author) commented Nov 6, 2015

No, that has not been part of anything I've recommended. I was told you should be pushing jobs through! My efforts here are meant NOT to conflict with that, which is a stated goal from Kelly.

tatarsky self-assigned this Nov 6, 2015
@hurleyLi commented Nov 6, 2015

Sorry, I misunderstood. But if I keep running those jobs continuously, they probably won't be finished for at least a week. So will you activate the changes then?
Another thing I noticed: there are two jobs running on n16 right now, although both of them are specified as using 32 cores.

@tatarsky (Collaborator Author) commented Nov 6, 2015

I've been told your jobs are the priority. If my changes are going to interfere with that, I will hold off on them until you are done.

However, yes, you are correct about the n16 matter. That is a config item I still need to finish from when I believed getting that queue set up was my priority.

I might as well finish it now if you can hold on a moment.

@tatarsky (Collaborator Author) commented Nov 6, 2015

I've disabled that queue for now. It shouldn't happen again, and I will let your jobs finish.

@tatarsky (Collaborator Author) commented Nov 6, 2015

Basically, I could still use some clarification: are people besides @hurleyLi waiting to run jobs? If so, why? Just because we don't have named queues yet? I am trying to honor a request not to cause issues with his jobs, but configuring a scheduler while people are using it is not really my favorite thing to do. However, there is nothing preventing anyone from submitting jobs to the default queue.

If @hurleyLi reduces his slot count a bit, as noted in #34, there will even be room to run such jobs. But I defer to his goals in getting things done for those numbers.

@hurleyLi commented Nov 6, 2015

Hi Paul, how long will it take you to activate/implement those changes?

@tatarsky (Collaborator Author) commented Nov 6, 2015

I have the first pass ready for some testing to make sure that what has been described is what will occur. I am not, however, renaming all.q to week.q at the moment, due to your jobs in that queue.

@tatarsky (Collaborator Author) commented Nov 6, 2015

For example, since node 3 is idle at the moment, I am testing the short queue.

@hurleyLi commented Nov 6, 2015

What I can do is kill the jobs and let you make the changes, because I also want these to be implemented and tested sooner rather than later. I have to stop at some point anyway for you to activate these changes; if I don't stop, the whole process after these jobs will last a month.

Right now I'm really just waiting for 2-3 jobs to finish: 276, 279, and 285. They'll probably finish around 2-3pm this afternoon. Do you think you can activate the changes this afternoon, so I can restart those jobs later today? In the meantime, I can test some of my changes as mentioned in #34.

Does that sound like a plan?

@tatarsky (Collaborator Author) commented Nov 6, 2015

I will implement the changes when I see your jobs exit. Be aware that your 2-3PM is my end of day. After-hours monitoring is best effort, which is why I normally do not make heavy changes on a Friday.

But in the interest of moving this along, and given that I am mostly around this weekend, I will attempt to implement the queue items above.

IMPORTANT question, however: are you really ready for memory reservations? I can do that separately, but you would need to add the proper -l h_vmem=XXG to your jobs....
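For example, a minimal submit-script sketch with the reservation in place (the 8G figure and the script contents are purely illustrative; size the request to the actual job):

#!/bin/bash
#$ -l h_vmem=8G    # memory reservation; 8G is an example value only
#$ -cwd
./my_analysis.sh   # placeholder for the real workload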

As we are outpacing my ability to document and configure, we'll have to sort out any issues as we go.

@hurleyLi commented Nov 6, 2015

Would you prefer to do this on Monday? We can do it next week if it's better for you to be able to monitor the cluster after these changes. I don't think other people are going to use the cluster besides me.

@tatarsky (Collaborator Author) commented Nov 6, 2015

If that is truly the case, I would prefer Monday. That way you will get uninterrupted use this weekend. While SGE is fairly straightforward, it periodically gets confused, and I'd rather not scramble through the config when I could do it during regular work hours.

@nariai (Contributor) commented Nov 6, 2015

That's fine with me. Let's change the config on Monday.

@tatarsky (Collaborator Author) commented:

The old nodes, except for cn7 and cn12, are in the SGE opt.q, which is selected with the -l opt resource. Report issues in #33 and I'll get to them fairly quickly, though I'll be in and out today.

Remember, it's a single 1G link over to those nodes, and for now we are using an NFS->Lustre method.

@tatarsky (Collaborator Author) commented:

So this is to remind me: I have NOT turned on memory as a consumable on opt.q yet. I will shortly, after an errand. I am going to start with an oversubscription to 74G (real = 54G).

There is no walltime limit on opt.q either. I assume for now that is fine.

@tatarsky (Collaborator Author) commented:

Walltime on opt.q set to one week per the above, and memory as a consumable activated with some oversubscription.

@tatarsky (Collaborator Author) commented Feb 4, 2016

The SGE s_core limit (soft core limit) was changed from UNLIMITED to 0 a moment ago per #102.

This means the default UNIX "coredump" limit is set to zero, so a crashing program will not dump a potentially large core file. A recent reminder of the joys of this, with many large-memory jobs all crashing due to a bug, was the impetus for doing what I normally set as a default and had forgotten.

If you NEED core dumps in a job, add a ulimit -c unlimited to your submit script.
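For example (the program name is a placeholder):

#!/bin/bash
#$ -cwd
ulimit -c unlimited    # re-enable core dumps for this job only
./my_program           # placeholder for the real command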

@nariai (Contributor) commented Feb 19, 2016

I submitted a group of jobs (job IDs 405730-405751) with:
qsub -l long
However, it looks like only two of the submitted jobs are using long.q, whereas the other jobs are using all.q.

I'm expecting the submitted jobs to run for more than two days (that is why I submitted them to the long queue), but it looks like some of the jobs running in all.q will be terminated after two days? Is this situation (submitting jobs to long.q but actually running in all.q) expected? Do you have any ideas?

@tatarsky (Collaborator Author) commented:

I'm looking at job 405731 as an example of this.

No, I would say that is not expected, but I will look at the logs to see whether I can find something that explains it, or a config item that needs to change.

@tatarsky (Collaborator Author) commented:

And if you have the precise qsub command you issued handy, that would be helpful.

@tatarsky (Collaborator Author) commented:

Assuming it wasn't just qsub -l long calc_LD_005_ALL.sh, BTW... just checking whether any other command-line args were used...

@nariai (Contributor) commented Feb 19, 2016

Thank you for taking a look.

The command was (in Makefile):

calc_LD_005_ALL:
	number=1 ; while [[ $$number -le 22 ]] ; do \
		qsub -l long -p 0 -pe smp 1 -cwd -e logs -o stdouts -v ARG1=$$number calc_LD_005_ALL.sh ; \
		((number = number + 1)) ; \
	done

The path is:
/projects/T2D/analysis/common_variants

@tatarsky (Collaborator Author) commented:

OK. I noted the "-pe smp 1" in there and wanted to check on it. Is there a reason for requesting a PE of only one processor? (Compared to more than one; it is really the same as not asking for a PE at all.) It shouldn't matter, but I've actually never used that construct.

I will try to reproduce this with a similar Makefile. I wonder if the -l long is somehow not being fully preserved in the make commands that are finally issued.

I wonder if the behavior will change if you embed the SGE hard resource in the script, e.g.:

#$ -l long
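For example, a sketch of how that would look at the top of calc_LD_005_ALL.sh (the surrounding lines are illustrative):

#!/bin/bash
#$ -l long    # request the long queue resource from inside the script
#$ -cwd
# ...rest of calc_LD_005_ALL.sh unchanged...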

@tatarsky (Collaborator Author) commented:

I have reproduced the above. For whatever reason, the first job is issued to long.q as expected, but the following jobs land in all.q.

I am trying to step through the output to see why.

@nariai (Contributor) commented Feb 19, 2016

When I changed "-l long" to "-l week", all of the jobs ran in week.q (405752-405773).
The other parameters were exactly the same.
Is this something long.q-specific?

@tatarsky (Collaborator Author) commented:

Interesting. When I removed "-pe smp 1", they all ended up in long.q ;)

So I suspect something about the final commands being emitted here.

Is there a reason you are using this make loop rather than an SGE array job?

@nariai (Contributor) commented Feb 19, 2016

Very interesting. No, I didn't have a specific reason for not using an array job. Next time I'll try removing "-pe smp 1" when I run jobs with one CPU, or try using an array job.

@tatarsky (Collaborator Author) commented:

Let me look at it a bit more in the morning. Whatever is going on, it's subtle.

I can explain the array job suggestion in more detail later; it's just a simpler way of running a series of jobs from the same submit script, using SGE's built-in shell variables that increment much like your Makefile loop counter. Basically a notch less setup work, and SGE is a notch more efficient with array jobs, particularly for large arrays.
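For reference, a sketch of the array-job equivalent of that make loop (the flags and script name are taken from your Makefile above; the task-ID handling is the only new part):

# One array job covering tasks 1-22; $SGE_TASK_ID replaces the make loop counter.
qsub -l long -t 1-22 -cwd -e logs -o stdouts calc_LD_005_ALL.sh
# Inside calc_LD_005_ALL.sh, use $SGE_TASK_ID wherever $ARG1 was used.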

But I'd still like to understand what is going on there.

I am suspicious that the PE is somehow confusing things, but I don't know why it would drop jobs into all.q.

So "unsure" is the answer for now, and I will look closer with some coffee in the morning.

@nariai (Contributor) commented Feb 19, 2016

Thanks; please let me know if you figure it out in more detail. I just re-submitted the jobs to the long queue.

@tatarsky (Collaborator Author) commented:

So basically it has something to do with the -pe smp 1, but I don't know why. Tracing the Makefile shows clearly that your qsubs are as expected, but under some circumstances it is as if the -l long were dropped.

Remove the -pe smp 1 and it works fine. Placing the -l long into the submit script does NOT change the outcome. Changing the argument order does NOT change the outcome.

qsub -l long -pe smp 1 -p 0 -cwd -e logs -o stdouts -v ARG1=17 foo.sh

So I'm suspecting some kind of qsub argument-handling bug at the moment, but I will double-check my queue assumptions.

I am also going to try different -pe smp values. Technically "-pe smp 1" is not something you ever need to specify, but I would assume that even if you did, SGE wouldn't care.

@tatarsky (Collaborator Author) commented:

Interestingly, using -pe smp 2 results in everything going into long.q fine, at least in my tests.

So I'm guessing it's a form of bug triggered by asking for an SMP environment (multiple processors) and then effectively saying "not really," which is what "1" would mean.

@tatarsky (Collaborator Author) commented:

There are a few mutterings on the mailing lists about somewhat similar experiences, but nothing conclusive. I believe the queues are configured correctly and that the outcome when the PE slot count is 1 is probably a bug. I'll look a little more, but my belief is also that that stanza is never needed (asking for one processor slot).

@tatarsky (Collaborator Author) commented:

BTW, random choices of -pe smp values above 1 all seem to result in the jobs correctly landing in long.q, so I'm leaning more toward it being a bug.

@tatarsky (Collaborator Author) commented:

The SGE spool was moved to fl-ims per #178 to reduce the impact if fl-hn1 crashes during attempts to determine the reason. The fl-hn2 shadow qmaster will now not hang on an NFS spool. No cross-head-node dependencies.
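For reference, a hedged sketch of how the active and shadow qmaster assignment can be checked (this assumes $SGE_ROOT is set and the default cell name is in use):

cat $SGE_ROOT/default/common/act_qmaster      # host currently acting as qmaster
cat $SGE_ROOT/default/common/shadow_masters   # shadow master host list (should include fl-hn2)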

@tatarsky (Collaborator Author) commented:

Moved the seq_no values for juplow and juphigh so they no longer match those of the other queues.
This is based on possible assignment to those queues in some cases where jobs do not reference their required resource. #188

Current settings for reference:

short     0,[@notlonghosts=0],[@longhosts=5]
juplow    1,[@notlonghosts=1],[@longhosts=6]
juphigh   2,[@notlonghosts=2],[@longhosts=7]
opt       3
week      5,[@notlonghosts=5],[@longhosts=10]
all       10,[@notlonghosts=10],[@longhosts=15]
long      10
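As a reminder, those values live in each queue's seq_no field and can be checked with qconf, e.g.:

qconf -sq juplow.q | grep seq_no
# expected per the table above:  seq_no  1,[@notlonghosts=1],[@longhosts=6]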

@tatarsky (Collaborator Author) commented:

Just for reference: c7.q was added, with a single host (cn12) in its hostgroup and queue resource c7. Defaults for everything else.
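For reference, selecting it should look something like the following (the script name is a placeholder), following the same pattern as -l opt above:

qsub -l c7 myjob.sh    # should land on cn12 via c7.q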

@tatarsky (Collaborator Author) commented:

Make a "jupsmp" PE environment in prep for possible fix for #188
Remove make and mpi from juplow.q and juphigh.q as never needed for those queues and in theory related to proposed solution to #188 (although we don't see much mpi or make PE on this cluster)
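For reference, a hedged sketch of how to verify the PE change (exact output will vary):

qconf -sp jupsmp                     # show the new jupsmp parallel environment
qconf -sq juplow.q | grep pe_list    # make and mpi should no longer appear here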

@tatarsky (Collaborator Author) commented:

For reference: per #278, s_rt and h_rt were removed from juphigh.q.

@tatarsky (Collaborator Author) commented Jan 7, 2020

opt.q and all cn* nodes will be removed shortly. I made a backup of the SGE config and will start removing them. They have already been removed as an option in the JupyterHubs.
