<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="keywords" content="Simple Linux Utility for Resource Management, SLURM, resource management,
Linux clusters, high-performance computing, Livermore Computing">
<meta name="LLNLRandR" content="UCRL-WEB-209488">
<meta name="LLNLRandRdate" content="24 February 2005">
<meta name="distribution" content="global">
<meta name="description" content="Simple Linux Utility for Resource Management">
<meta name="copyright"
content="This document is copyrighted U.S.
Department of Energy under Contract W-7405-Eng-48">
<meta name="Author" content="Morris Jette">
<meta name="email" content="jette1@llnl.gov">
<meta name="Classification"
content="DOE:DOE Web sites via organizational
structure:Laboratories and Other Field Facilities">
<title>Simple Linux Utility for Resource Management: Blue Gene User and Administrator Guide</title>
<link href="slurmstyles.css" rel="stylesheet" type="text/css">
</head>
<body bgcolor="#000000" text="#000000" leftmargin="0" topmargin="0">
<table width="770" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><img src="slurm_banner.jpg" width="770" height="145" usemap="#Map" border="0" alt="Simple Linux Utility for Resource Management"></td>
</tr>
</table>
<table width="770" border="0" cellspacing="0" cellpadding="3" bgcolor="#FFFFFF">
<tr>
<td width="100%">
<table width="760" border="0" cellspacing="0" cellpadding="4" align="right">
<tr>
<td valign="top" bgcolor="#000000"><p><img src="spacer.gif" width="110" height="1" alt=""></p>
<p><a href="slurm.html" class="nav" align="center">Home</a></p>
<p><span class="whitetext">About</span><br>
<a href="overview.html" class="nav">Overview</a><br>
<a href="news.html" class="nav">What's New</a><br>
<a href="publications.html" class="nav">Publications</a><br>
<a href="team.html" class="nav">SLURM Team</a></p>
<p><span class="whitetext">Using</span><br>
<a href="documentation.html" class="nav">Documentation</a><br>
<a href="faq.html" class="nav">FAQ</a><br>
<a href="help.html" class="nav">Getting Help</a></p>
<p><span class="whitetext">Installing</span><br>
<a href="platforms.html" class="nav">Platforms</a><br>
<a href="download.html" class="nav">Download</a><br>
<a href="quickstart_admin.html" class="nav">Guide</a></p></td>
<td><img src="spacer.gif" width="10" height="1" alt=""></td>
<td valign="top"><h2>Blue Gene User and Administrator Guide</h2>
<h3>Overview</h3>
<p>This document describes the unique features of SLURM on the
<a href="http://www.research.ibm.com/bluegene">IBM Blue Gene</a> systems.
You should be familiar with SLURM's mode of operation on Linux clusters
before studying the relatively few differences in Blue Gene operation
described in this document.</p>
<p>Blue Gene systems have several unique features that result in a few
differences in how SLURM operates there.
The basic unit of resource allocation is a <i>base partition</i>.
The <i>base partitions</i> are connected in a three-dimensional torus.
Each <i>base partition</i> includes 512 <i>c-nodes</i>, each containing two processors:
one designed primarily for computations and the other primarily for managing communications.
SLURM considers each <i>base partition</i> as one node with 1024 processors.
The <i>c-nodes</i> can each execute only one process and thus are unable to execute both
the user's job and SLURM's <i>slurmd</i> daemon.
The <i>slurmd</i> daemon therefore executes on one of the Blue Gene <i>Front End Nodes</i>.
This <i>slurmd</i> daemon provides (almost) all of the normal SLURM services
for every <i>base partition</i> on the system. </p>
<h3>User Tools</h3>
<p>The normal set of SLURM user tools (srun, scancel, sinfo, squeue and scontrol)
provides all of the expected services except support for job steps.
SLURM performs resource allocation for the job, but initiation of tasks is performed
using the <i>mpirun</i> command. SLURM has no concept of a job step on Blue Gene.
Four new srun options are available:
<i>--geometry</i> (specify job size in each dimension),
<i>--no-rotate</i> (disable rotation of geometry),
<i>--conn-type</i> (specify interconnect type between base partitions, mesh or torus), and
<i>--node-use</i> (specify how the second processor on each c-node is to be used,
coprocessor or virtual).
You can also continue to use the <i>--nodes</i> option with a minimum and (optionally)
maximum node count. The <i>--ntasks</i> option continues to be supported.
See the srun man pages for details. </p>
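<p>For illustration, the new options might be combined as follows. This is only
a sketch: the script name <i>myjob.sh</i> is a hypothetical example, and the
option values shown must be adjusted to your system's configuration.</p>
<pre>
# Request a 2x2x2 block of base partitions connected as a torus,
# with geometry rotation disabled and the second c-node processor
# used in coprocessor mode:
srun --batch --geometry=2x2x2 --no-rotate --conn-type=torus \
     --node-use=coprocessor myjob.sh
</pre>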
<p>To reiterate: srun is used to submit a job script, but mpirun is used to launch the parallel tasks.
<b>It is highly recommended that the srun <i>--batch</i> option be used to submit a script.</b>
While the srun <i>--allocate</i> option may be used to create an interactive SLURM job,
it will be the responsibility of the user to ensure that the <i>bglblock</i>
is ready for use before initiating any mpirun commands.
SLURM will assume this responsibility for batch jobs.
The script that you submit to SLURM can contain multiple invocations of mpirun as
well as any desired commands for pre- and post-processing.
The mpirun command will get its <i>bglblock</i> or BGL partition information from the
<i>MPIRUN_PARTITION</i> environment variable, which is set by SLURM. A sample script is shown below.
<pre>
#!/bin/bash
# pre-processing
date
# processing
mpirun -exec /home/user/prog -cwd /home/user -args 123
mpirun -exec /home/user/prog -cwd /home/user -args 124
# post-processing
date
</pre></p>
<a name="naming"></a>
<p>The naming of nodes includes a three-digit suffix representing the base partition's
location in the X, Y and Z dimensions with a zero origin.
For example, "bgl012" represents the base partition whose location is at X=0, Y=1 and Z=2.
Since jobs must be allocated consecutive nodes in all three dimensions, we have developed
an abbreviated format for describing the nodes in one of these three-dimensional blocks.
The node's prefix is followed by the end-points of the block enclosed in square-brackets.
For example, "bgl[620x731]" is used to represent the eight nodes enclosed in a block
with endpoints bgl620 and bgl731 (bgl620, bgl621, bgl630, bgl631, bgl720, bgl721,
bgl730 and bgl731).</p>
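<p>The abbreviated format can be expanded mechanically: each digit of a member
name runs from the first endpoint's digit to the second endpoint's digit in
the X, Y and Z dimensions. A small bash sketch expanding "bgl[620x731]":</p>
<pre>
#!/bin/bash
# Expand the abbreviated node list bgl[620x731] into its eight member
# names. Each dimension runs from the first endpoint's digit to the
# second's (X: 6..7, Y: 2..3, Z: 0..1).
nodes=""
for x in 6 7; do
  for y in 2 3; do
    for z in 0 1; do
      nodes="$nodes bgl${x}${y}${z}"
    done
  done
done
echo $nodes
# prints: bgl620 bgl621 bgl630 bgl631 bgl720 bgl721 bgl730 bgl731
</pre>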
<p>One new tool provided is <i>smap</i>.
Smap is aware of the system topology and provides a map of which nodes are allocated
to jobs, partitions, etc.
See the smap man page for details.
A sample of smap output is provided below showing the location of five jobs.
Note the format of the list of nodes allocated to each job.
Also note that idle (unassigned) base partitions are indicated by a period.
Down and drained base partitions (those not available for use) are
indicated by a number sign (bgl703 in the display below).
The legend is for illustrative purposes only.
The origin (zero in every dimension) is shown at the rear left corner of the bottom plane.
Each set of four consecutive lines represents a plane in the Y dimension.
Values in the X dimension increase to the right.
Values in the Z dimension increase down and toward the left.</p>
<pre>
a a a a b b d d ID JOBID PARTITION USER NAME ST TIME NODES NODELIST
a a a a b b d d a 12345 batch joseph tst1 R 43:12 64 bgl[000x333]
a a a a b b c c b 12346 debug chris sim3 R 12:34 16 bgl[420x533]
a a a a b b c c c 12350 debug danny job3 R 0:12 8 bgl[622x733]
d 12356 debug dan colu R 18:05 16 bgl[600x731]
a a a a b b d d e 12378 debug joseph asx4 R 0:34 4 bgl[612x713]
a a a a b b d d
a a a a b b c c
a a a a b b c c
a a a a . . d d
a a a a . . d d
a a a a . . e e Y
a a a a . . e e |
|
a a a a . . d d 0----X
a a a a . . d d /
a a a a . . . . /
a a a a . . . # Z
</pre>
<p class="footer"><a href="#top">top</a></p>
<h3>System Administration</h3>
<p>Building a Blue Gene compatible version of SLURM depends upon the <i>configure</i>
program locating some expected files. You should see "#define HAVE_BGL 1" and
"#define HAVE_FRONT_END 1" in the "config.h" file before making SLURM.</p>
<p>The slurmctld daemon should execute on the system's service node.
If an optional backup daemon is used, it must be in some location where
it is capable of writing to MMCS.
One slurmd daemon should be configured to execute on one of the front end nodes.
That one slurmd daemon serves as the communications channel for every base partition.
A future release of SLURM will support multiple slurmd daemons on multiple
front end nodes.
You can use the scontrol command to drain individual nodes as desired and
return them to service. </p>
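<p>For example, a base partition might be drained and later returned to
service along these lines (check the scontrol man page for the exact node
states supported by your SLURM version):</p>
<pre>
# Drain the base partition at X=0, Y=1, Z=2 so no new jobs start on it:
scontrol update NodeName=bgl012 State=DRAIN Reason="hardware repair"

# Return it to service once repairs are complete:
scontrol update NodeName=bgl012 State=RESUME
</pre>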
<p>The slurm.conf (configuration) file needs to have the value of <i>InactiveLimit</i>
set to zero or not specified (it defaults to a value of zero).
This is because there are no job steps and we don't want to purge jobs prematurely.
The value of <i>SelectType</i> must be set to "select/bluegene" in order to have
node selection performed by a plugin that is aware of the system's topology
and interfaces.
The value of <i>SchedulerType</i> should be set to "sched/builtin".
The value of <i>Prolog</i> should be set to a program that will delay
execution until the bglblock identified by the MPIRUN_PARTITION environment
variable is ready for use. It is recommended that you construct a script
that serves this function and calls the supplied program <i>slurm_prolog</i>.
The value of <i>Epilog</i> should be set to a program that will wait
until the bglblock identified by the MPIRUN_PARTITION environment
variable has been freed. It is recommended that you construct a script
that serves this function and calls the supplied program <i>slurm_epilog</i>.
The prolog and epilog programs are used to ensure proper synchronization
between the slurmctld daemon, the user job, and MMCS.
Since jobs with different geometries or other characteristics do not interfere
with each other's scheduling, backfill scheduling is not presently meaningful.
SLURM's builtin scheduler on Blue Gene will sort pending jobs and then attempt
to schedule all of them in priority order. </p>
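<p>Putting the settings above together, the relevant portion of
<i>slurm.conf</i> might look like the following sketch. The prolog and
epilog script paths are hypothetical examples; they should point to your
wrapper scripts that call <i>slurm_prolog</i> and <i>slurm_epilog</i>.</p>
<pre>
# Blue Gene specific settings in slurm.conf
InactiveLimit=0              # no job steps, so never purge "inactive" jobs
SelectType=select/bluegene   # topology-aware node selection
SchedulerType=sched/builtin  # backfill is not meaningful here
Prolog=/usr/local/slurm/etc/prolog   # waits until the bglblock is ready
Epilog=/usr/local/slurm/etc/epilog   # waits until the bglblock is freed
</pre>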
<p>SLURM node and partition descriptions should make use of the
<a href="#naming">naming</a> conventions described above. For example,
"NodeName=bgl[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024".
Note that the values of both NodeAddr and NodeHostname for all
128 base partitions are the name of the front end node executing
the slurmd daemon.
The NodeName values represent base partitions.
No computers are actually expected to return a value of "bgl000"
in response to the <i>hostname</i> command nor will any attempt
be made to route message traffic to this address. </p>
<p>While users are unable to initiate SLURM job steps on Blue Gene systems,
this restriction does not apply to user root or SlurmUser.
Be advised that the one slurmd supporting all nodes is unable to manage a
large number of job steps, so this ability should be used only to verify normal
SLURM operation.
If large numbers of job steps are initiated, expect the slurmd daemon to
fail due to lack of memory. </p>
<p>Presently the system administrator must explicitly define each of the
Blue Gene partitions (or bglblocks) available to execute jobs.
(<b>NOTE:</b> Blue Gene partitions are unrelated to SLURM partitions.)
Jobs must then execute in one of these pre-defined bglblocks.
This is known as <i>static partitioning</i>.
Each of these bglblocks is explicitly configured with either a mesh or
torus interconnect.
In addition to the normal <i>slurm.conf</i> file, a new
<i>bluegene.conf</i> configuration file is required with this information.
Put <i>bluegene.conf</i> into the SLURM configuration directory with
<i>slurm.conf</i>.
System administrators should use the smap tool to build an appropriate
configuration file for static partitioning.
See the smap man page for more information.
Note that in addition to the bglblocks defined in <i>bluegene.conf</i>, an
additional block containing all resources is created.
Make use of the SLURM partition mechanism to control access to these
bglblocks.</p>
<p>Two other changes are required to support SLURM interactions with
the DB2 database.
The <i>db2profile</i> script must be executed prior to the execution
of the slurmctld daemon.
This may be accomplished by executing the script from
<i>/etc/sysconfig/slurm</i>, which is executed by
<i>/etc/init.d/slurm</i>.
The second required file is <i>db.properties</i>, which should
be copied into the SLURM configuration directory with <i>slurm.conf</i>.
Again, this can be accomplished using /etc/sysconfig/slurm.</p>
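<p>A minimal <i>/etc/sysconfig/slurm</i> handling both DB2 requirements
might look like the following sketch. All paths shown are hypothetical
examples; substitute the locations appropriate for your DB2 installation
and SLURM configuration directory.</p>
<pre>
# Source the DB2 environment before slurmctld starts (path is site-specific):
. /path/to/sqllib/db2profile

# Ensure db.properties is present in the SLURM configuration directory:
if [ ! -f /usr/local/etc/slurm/db.properties ]; then
    cp /path/to/db.properties /usr/local/etc/slurm/
fi
</pre>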
<p>At some time in the future, we expect SLURM to support <i>dynamic
partitioning</i> in which Blue Gene job partitions are created and destroyed
as needed to accommodate the workload.
At that time the <i>bluegene.conf</i> configuration file will become obsolete.
Dynamic partitioning does involve substantial overhead, including the
rebooting of c-nodes and I/O nodes.</p>
<p>Assuming that you build RPMs for SLURM, note that the smap and bluegene
RPMs must be built on the service node (where the BGL Bridge API libraries
exist) and installed on both the service node and front-end nodes (which
lack the API libraries).</p>
<p class="footer"><a href="#top">top</a></p></td>
</tr>
<tr>
<td colspan="3"><hr> <p>For information about this page, contact <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</p>
<p><a href="http://www.llnl.gov/"><img align=middle src="lll.gif" width="32" height="32" border="0"></a></p>
<p class="footer">UCRL-WEB-209488<br>
Last modified 24 February 2005</p></td>
</tr>
</table>
</td>
</tr>
</table>
<map name="Map">
<area shape="rect" coords="616,4,762,97" href="../">
<area shape="rect" coords="330,1,468,11" href="http://www.llnl.gov/disclaimer.html">
<area shape="rect" coords="11,23,213,115" href="slurm.html">
</map>
</body>
</html>