Nek5000 Bakeoff problems #1

Merged: thilinarmtb merged 33 commits into master from nek5k on Jun 8, 2017

Contributor

@thilinarmtb thilinarmtb commented May 31, 2017

This PR is an effort to add Nek5000 bake-off problems to the
CEED benchmarks repo.

Current status

As of now, users can run

./go.sh --config vulcan --compiler gcc --build "nek5000"
./go.sh --config vulcan --compiler gcc --run tests/nek5000_bps/bp1/bp1.sh

to clone Nek5000, build the required tools (genmap and genbox),
and then build the executable for different lx1 (order) values.
This creates a separate directory for each lx1 value and
copies the box geometries into each of these directories. The next step
is to run the executable inside each of the box geometries and collect
the data, which should be straightforward. The range of lx1 and
the elements in the generated box geometries are easy to change as well. For the time
being, the user has to edit the scripts to change these two parameters.
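
The per-order layout described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name setup_lx1_dirs, the lx<N> directory naming, and the argument order are hypothetical and are not taken from the scripts in this PR.

```sh
#!/bin/sh
# Hypothetical sketch: create one build directory per lx1 value and give
# each its own copy of the generated box geometries.
# All names here are illustrative, not the ones used in bp1.sh.

# setup_lx1_dirs ROOT BOXES_DIR LX1_MIN LX1_MAX
setup_lx1_dirs() {
    root=$1; boxes=$2; lx=$3; lx_max=$4
    while [ "$lx" -le "$lx_max" ]; do
        dir="$root/lx$lx"
        mkdir -p "$dir"
        # Copy every box geometry directory (b1, b2, ...) into this build.
        cp -R "$boxes"/. "$dir"/
        lx=$((lx + 1))
    done
}
```

Taking the range as arguments, rather than hard-coding it, would let the lx1 range be changed without editing the script.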

Currently, this PR uses the values from the makenek file, the build
script for Nek5000, to build the executable. So I need
to figure out how to change the parameters in the makenek file depending
on the machine configuration we use.

There are two approaches. We can keep a single makenek file and
edit it with sed depending on the machine configuration we want,
or create different makenek files for different machines and select one
based on the --config parameter. I think the latter is better. Let me
know if there is a better way.
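
A minimal sketch of the second approach (one makenek file per machine, chosen by the --config value) might look like this. The file naming scheme makenek.<config>, the makenek.default fallback, and the function name select_makenek are assumptions made for illustration only.

```sh
#!/bin/sh
# Hypothetical sketch: pick a machine-specific makenek file by config name.
# The makenek.<config> naming and the default fallback are assumptions.

# select_makenek CONFIG DIR
# Prints the path of the makenek variant for CONFIG, falling back to a
# default file when no machine-specific one exists.
select_makenek() {
    config=$1; dir=$2
    if [ -f "$dir/makenek.$config" ]; then
        printf '%s\n' "$dir/makenek.$config"
    elif [ -f "$dir/makenek.default" ]; then
        printf '%s\n' "$dir/makenek.default"
    else
        echo "no makenek file for config '$config'" >&2
        return 1
    fi
}
```

Compared with editing a single file with sed, this keeps each machine's settings readable on their own, at the cost of some duplication between files.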

TODO

  • Use the correct build parameters in the makenek file depending on
    the machine for building the executables.
  • Run the generated executables.
  • Get the output data from the generated log files after the run
    and plot the graphs.
  • Polish the script files (the current script files in this repo are pretty
    nice :) ).

Build script only builds the tools needed for the bakeoff problems:
genmap and genbox
Need to move these to the build directory
TODO:
  * use the correct build parameters in makenek file
  * Run the tests
  * get the output
  * polish the scripts
@@ -0,0 +1,10 @@
cp $1.box ttt.box
../../../product-sources/Nek5000/bin/genbox << EOF
Contributor Author

@thilinarmtb thilinarmtb May 31, 2017

This ../../../ path is not pretty. genbb is executed inside boxes.sh,
which is called from tests/nek5000_bps/bp1/bp1.sh.

I can't think of a way to pass variables from go.sh to the genbb script
without making them environment variables. The other option is to copy
all of these snippets into one .sh file and define them as functions, say
inside bp1.sh.

box
.1
EOF
mvn box $1
Contributor Author

This line will give an error if run on a machine that does not
have the Nek5000/bin directory in its PATH. I will fix this soon.
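
One possible way to avoid depending on PATH is to call the tool through an explicit directory. The sketch below is an assumption about how that could be done, not the fix that was eventually committed; the helper name run_tool is hypothetical, while NEK5K_DIR is the variable mentioned elsewhere in this PR.

```sh
#!/bin/sh
# Hypothetical helper: run a tool by its full path so that the caller's
# PATH does not need to contain the tool's directory.

# run_tool BIN_DIR TOOL [ARGS...]
run_tool() {
    bin_dir=$1; tool=$2; shift 2
    if [ ! -x "$bin_dir/$tool" ]; then
        echo "error: $bin_dir/$tool not found or not executable" >&2
        return 127
    fi
    "$bin_dir/$tool" "$@"
}

# Example call, assuming NEK5K_DIR points at the Nek5000 checkout:
#   run_tool "$NEK5K_DIR/bin" mvn box "$1"
```

Returning 127 mirrors the status a shell reports for a command it cannot find, so callers can treat both cases the same way.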

thilinarmtb and others added 8 commits May 31, 2017 11:30
non-mpi compilers separately: CC, CXX, FC.

Update the nek5000.sh to build in a separate directory based on the
config + compiler and to use the Fortran and C compilers, FC and CC.

Still need to update the compilers in machine-configs/{vulcan,ray}.sh.
Move it as a function inside the boxes.sh
Moved everything to bp1.sh script. There is a temporary
workaround in build_and_run_tests in bp1.sh script. For
some reason I can't access NEK5K_DIR variable defined
in go.sh.
Also, managed to get rid of the temporary work-around by adding
test_required_packages at the end.
I am going to soon expand this to run on other environments
as well. This is just a test to see if everything works in the
Nek5000 side.
Contributor Author

@thilinarmtb thilinarmtb left a comment

I exported some of the variables in this commit so that they are accessible from the makenek script.
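
For readers unfamiliar with why the export is needed: a variable set in one shell is only visible to scripts it launches if it has been exported. The sketch below illustrates this with made-up variable names and a placeholder path; none of them come from go.sh or makenek.

```sh
#!/bin/sh
# Illustration only: variable names and the path are placeholders.
# Unexported variables stay in the current shell; exported ones are
# inherited by child processes such as a separately invoked build script.
NOT_EXPORTED="hidden"
export EXPORTED_DIR="/path/to/Nek5000"

child_view() {
    # sh -c starts a separate process, much like calling another script.
    sh -c 'printf "%s|%s\n" "${NOT_EXPORTED:-unset}" "${EXPORTED_DIR:-unset}"'
}
```

Calling child_view shows the unexported variable as unset in the child while the exported one keeps its value, which is consistent with NEK5K_DIR not being visible from a script that go.sh runs as a separate process.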

do
cd b$j
cp ../nek5000 .
$NEK5K_DIR/bin/nekbmpi b$j 4
Contributor Author

This is just a test. I will update this with more general commands.

@thilinarmtb
Contributor Author

I think the executable can now be run on both macOS and Linux.
I need to figure out how to run it on, say, Cetus, which uses
qsub.

Does Vulcan use qsub?

@jdahm
Contributor

jdahm commented Jun 1, 2017

It looks like it uses slurm - but I've never used it. If you have access to LLNL confluence pages there is more information here https://lc.llnl.gov/confluence/display/BGQ/Running+jobs.

@v-dobrev
Member

v-dobrev commented Jun 1, 2017

You can just use machine-configs/vulcan.sh. The variables MPIEXEC, MPIEXEC_OPTS, and MPIEXEC_NP are set so that, when used in the scripts, they submit a job and wait for it to finish. The only thing is that you may need to set the bank: I set it to "ceed", so you may need to change that if you do not have access to that bank.

@thilinarmtb
Contributor Author

Thanks for the replies @jdahm and @v-dobrev !! I will try and let you know.

Also, I stopped passing $CFLAGS and $FLAGS to the
maketools file. We use maketools to build genmap and genbox
on the front-end nodes on machines like vulcan, so
they don't need the flags that we use to compile a Nek5000
case.
@thilinarmtb
Contributor Author

Is there a way to change the number of nodes allocated in the sbatch command?

Currently, when I check the status of my jobs on Vulcan using squeue -u thilina, I see that
each job has requested 1.5K nodes, which is too many.

I tried sbatch -N 1 ./submit.sh and sbatch --nodes=1 ./submit.sh, but both
fail with the following error message:

sbatch: error: Batch job submission failed: Node count specification invalid

I tried different numbers of nodes as well. Any idea why this happens? If I don't use
-N or --nodes, the jobs get submitted but remain in the queue.

Added a post-processing script that plots a basic graph of DOFs/s vs. DOFs.
This can be used to generate the graphs until the Python plotting
is implemented. Below is the command to draw the graphs:

./go.sh --config <cfg> --compiler <cmplr> -pp tests/nek5000/bp1/pp_bp1.sh

-pp (--post-process) sets up the environment in which the post-processing
script runs.
@v-dobrev
Member

v-dobrev commented Jun 5, 2017

@thilinarmtb,
You should specify the same options to sbatch as to srun: e.g. in the nekmpi function you can execute:

   sbatch $MPIEXEC_OPTS $MPIEXEC_NP $num_proc_node ./submit.sh $1 $2

I tried that and I get the same error I get when I run the job interactively with srun, i.e. just running submit.sh as a script:

   ./submit.sh $1 $2

There seems to be some issue with the executable. Here is what I get:

2017-06-05 15:17:37.487 (FATAL) [0x400033ad230] 28534:ibm.runjob.client.Job: could not start job: job failed to start
2017-06-05 15:17:37.488 (FATAL) [0x400033ad230] 28534:ibm.runjob.client.Job: Load failed on R00-ID-J07: Generating static TLB map for application failed, errno 0

@ikarlin, any idea what this means? Or who we can ask about this error?

Veselin

Edit: The issue with the executable was that the combined size of all 16 tasks was more than 16 GB - the bp1.sh script now checks for that on vulcan.
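
The kind of check described in the edit above could be sketched as follows. This is not the code from bp1.sh: the function name fits_on_node and its arguments are hypothetical, how the per-task size is measured is left open, and treating a total exactly equal to the limit as acceptable is an assumption.

```sh
#!/bin/sh
# Hypothetical sketch of a per-node memory check; not the code in bp1.sh.

# fits_on_node TASK_BYTES TASKS_PER_NODE LIMIT_BYTES
# Succeeds when the combined footprint of all tasks on one node is
# within the limit (a total equal to the limit is accepted here).
fits_on_node() {
    task_bytes=$1; tasks=$2; limit=$3
    total=$((task_bytes * tasks))
    [ "$total" -le "$limit" ]
}

# Example: skip a build whose 16 tasks would exceed a 16 GiB node.
#   gib=$((1024 * 1024 * 1024))
#   fits_on_node "$task_bytes" 16 $((16 * gib)) || echo "skipping: too large"
```

Keeping the limit as a parameter fits with defining it per machine config, as the later commit message about 'node_virt_mem_lim' suggests.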

@thilinarmtb
Contributor Author

Thanks for the reply @v-dobrev. I will try to fix the issue. If I remember correctly, this commit ran on Vulcan. I will look into what is going on.

Otherwise nekbuilds with higher lx1 values do not
get built.
@thilinarmtb
Contributor Author

thilinarmtb commented Jun 7, 2017

I was able to run on Vulcan using the xlc compilers, but I had to comment out the CFLAGS.
They are not parsed correctly in the makenek script when sed is called. My guess is that
sed does not like the ":" in the CFLAGS. I will look into it.

@v-dobrev
Member

v-dobrev commented Jun 7, 2017

@thilinarmtb,

I know about that issue with sed. You can simply not export CFLAGS and FFLAGS and let the nek script pick the options it uses by default. (The problem with the exported CFLAGS is that it contains colons, :, which sed interprets the wrong way. One solution is to export CFLAGS with each : escaped with a backslash, \:.)
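
The escaping workaround can be illustrated with a small sketch. It assumes that the build script substitutes the flags with a sed s command that uses : as its delimiter, which is what the colon problem suggests but is not confirmed here; the function names and the sample flag string are made up for the example.

```sh
#!/bin/sh
# Hypothetical sketch: escape ':' before using a value in a sed
# substitution whose delimiter is ':' (assumed, not confirmed).

# escape_colons STRING
# Prints STRING with every ':' preceded by a backslash.
escape_colons() {
    printf '%s\n' "$1" | sed 's/:/\\:/g'
}

# set_cflags FLAGS
# Reads a makenek-like file on stdin and rewrites its CFLAGS= line.
set_cflags() {
    flags=$(escape_colons "$1")
    sed "s:^CFLAGS=.*:CFLAGS=\"$flags\":"
}

# Example with a made-up flag string:
#   printf 'CFLAGS=""\n' | set_cflags '-O2 -qfoo=a:b'
```

Without the escaping step, the first colon inside the flags ends the replacement text early and sed rejects what follows. Other characters that are special in a replacement, such as & and backslash, would need similar treatment and are not handled in this sketch.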

Even with this issue resolved (which is just a build issue), the problem is that the produced executable does not work, regardless of how you run it with SLURM - either interactively, with srun, or in batch jobs with sbatch.

When you say "I was able to run in Vulcan using xlc compilers.", do you mean "run the scripts" or actually "run the executable produced by the Nek5000 build system"?

-Veselin

@thilinarmtb
Contributor Author

Thanks for the reply @v-dobrev.

I was actually able to run the executables on Vulcan, and they produced the
log files. However, I will double-check.

@v-dobrev
Member

v-dobrev commented Jun 7, 2017

Did you push that version to the repository? I tried the current version (20b2d3a) and I get a sed error. I'm just running:

./go.sh -c vulcan -m xlc -r tests/nek5000_bps/bp1/bp1.sh

…or configs

defining the variable 'node_virt_mem_lim'.
@thilinarmtb thilinarmtb changed the title from "[WIP] Nek5000 Bakeoff problems" to "Nek5000 Bakeoff problems" on Jun 8, 2017
@thilinarmtb
Contributor Author

Thanks for approving changes @tzanio ! Thanks everyone for all the help.
I will merge this branch and send other updates in a separate PR.

@thilinarmtb thilinarmtb merged commit 70d9c72 into master Jun 8, 2017
@thilinarmtb thilinarmtb deleted the nek5k branch June 8, 2017 15:04
4 participants