piXedfit not working in HPC #15

Open
Nikhil0504 opened this issue Dec 2, 2023 · 3 comments

@Nikhil0504
Contributor

Hello,

I am trying to run the code for a JWST project on an HPC cluster. But when I try to generate the model rest-frame spectra, my code breaks with an MPI error that seems to be caused by piXedfit.

I did test the MPI installation on my cluster and everything seems fine there, with no conflicts; it only breaks when I run piXedfit.
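
For context, a minimal mpi4py check would look something like this (just a sketch; it assumes mpi4py is installed in the job's environment and the script is launched with mpirun, and the filename is illustrative):

# mpi_check.py (illustrative name) -- minimal sanity check that MPI ranks start up correctly
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")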

The error looks like a networking error, but my HPC support team checked and didn't find any problems with the way I set up my jobs; they lean toward the problem being in the way the MPI implementation is used here.

Please let me know how I can fix it, because this code takes a very long time to run on my laptop (MacBook Air M1). I would also like to know whether there are any new features for JWST imaging and spectroscopy.

Thanks,
Nikhil


FILE DUMPS:

Here is the HPC job shell script:

#!/bin/bash
# a name for this job; letters and digits only please. (optional)
#SBATCH --job-name=run_pixed_fit

# Researchers can buy in for priority queueing. However, the amount of time 
# they can use for this is limited.  Everyone has access to unlimited
# windfall, but any priority jobs will go first.  The partitions (queues) on
# Puma are windfall, standard, and high_pri
# Non-windfall partitions require a "#SBATCH --account=your_PI_group" line.
#SBATCH --account=bfrye
#SBATCH --partition=standard

# Standard Puma Nodes have 94 available cores and 480GB of available RAM each.
# Since no memory allocation is set, the RAM available will be 5GB per core
# (task in Slurm).
# Note: the fewer resources you request, the less time your job will spend in
# waiting and the more resources will be available for others to use.
#SBATCH --nodes=4
#SBATCH --ntasks=32
#SBATCH --mem=32gb

#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikhilgaruda@arizona.edu

# This is the amount of time you think your job will take to run.
# 240 hours (10 days) is the maximum.
# This request asks for 10 hours.
#SBATCH --time=10:00:00

# Reset modules, so we have a known starting point...
module purge
module load autotools prun/1.3 gnu8/8.3.0 openmpi3/3.1.4 ohpc
module load anaconda
module list

source /home/u4/nikhilgaruda/.bashrc

export SPS_HOME=/home/u4/nikhilgaruda/Software/fsps
export PIXEDFIT_HOME=/home/u4/nikhilgaruda/Software/piXedfit
export OMPI_MCA_opal_cuda_support=true

conda activate /home/u4/nikhilgaruda/.conda/envs/pixedfit
echo $CONDA_DEFAULT_ENV

conda list

conda install astropy -y


cd /groups/bfrye/G191/pixedfit
strace mpirun model_rest_frame_spectra.py

# *** If you need assistance using Puma, please email
# *** hpc-consult@list.arizona.edu

exit 0
#
#-end of puma_cpu sample script.

Here is the output from the run:
slurm-1653886.out.txt

@aabdurrouf
Owner

aabdurrouf commented Dec 3, 2023

Hi @Nikhil0504, sorry to hear that you ran into this issue. First of all, can I see your model_rest_frame_spectra.py script? Also, the way you run this Python script seems incorrect -- there is no need to call mpirun yourself. Please try this instead:
python model_rest_frame_spectra.py

Regarding the slow run on the MacBook Air M1: how many cores does your laptop have, and how many of them are used (nproc) when running your piXedfit script? And did you put mpirun in the terminal command when executing the script (which you shouldn't)? On my MBP, using 5 cores, generating 100,000 models usually takes around 10-15 minutes.
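
If you are not sure how many cores you have, a quick check from Python is (just a sketch using the standard library):

import os
print(os.cpu_count())   # number of logical cores visible to Python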

@Nikhil0504
Contributor Author

Hey @aabdurrouf, sorry for the late reply! It's the end of the semester for me, filled with lots of finals.

Here is the model_rest_frame_spectra.py file. I set nproc=-1 to use all the cores and just ran it as a normal Python script instead of using mpirun.

I am not sure why it's super slow and buggy for me.

# %%
import numpy as np
from astropy.cosmology import FlatLambdaCDM
from piXedfit.piXedfit_model import save_models_rest_spec

# %%
imf_type  = 1
sfh_form = 1
dust_law = 1
duste_switch = 1
add_neb_emission = 1
add_agn = 0

nmodels=100000
nproc=16

min_z = 2.47                     # minimum redshift, which determines the maximum age of the models
cosmo = FlatLambdaCDM(H0=70.0, Om0=0.3)
age_univ = cosmo.age(min_z)
max_log_age = np.log10(age_univ.value)

# # we fix the ionization parameter to log(U)=-2.0
# params_range = {'dust1':[0.0,4.0], 
#                 'dust2':[0.0,4.0], 
#                 'log_age':[-1.0,max_log_age], 
#                 'log_tau':[-1.0,1.5], 
#                 'gas_logu':[-2.0,-2.0]}

name_out = 'model_specs.hdf5'
save_models_rest_spec(imf_type=imf_type, sfh_form=sfh_form, dust_law=dust_law,
                        duste_switch=duste_switch, add_neb_emission=add_neb_emission, add_agn=add_agn,
                        nmodels=nmodels, nproc=nproc, name_out=name_out)

# %%
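
(If the commented-out ranges above were re-enabled, they would be passed to the call like this -- just a sketch, assuming save_models_rest_spec accepts a params_range keyword as in the piXedfit examples:)

# Sketch: passing custom parameter ranges (assumes a params_range keyword exists).
params_range = {'dust1': [0.0, 4.0],
                'dust2': [0.0, 4.0],
                'log_age': [-1.0, max_log_age],
                'log_tau': [-1.0, 1.5],
                'gas_logu': [-2.0, -2.0]}   # fix the ionization parameter to log(U)=-2.0

save_models_rest_spec(imf_type=imf_type, sfh_form=sfh_form, dust_law=dust_law,
                      duste_switch=duste_switch, add_neb_emission=add_neb_emission,
                      add_agn=add_agn, params_range=params_range,
                      nmodels=nmodels, nproc=nproc, name_out=name_out)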

@Nikhil0504
Contributor Author

Update:
I got it working on the HPC using an interactive node. The first problem was the way the conda environment was defined, and also the way I installed my mpi4py package.

But after installing, none of the code worked because of a conflict among the packages defined in the installation (mainly astropy), so I had to pin the versions manually. I can push a fix for the package conflicts with a new requirements.txt that freezes the versions that work best for this release.

There was also a bug where the code breaks when nproc=-1; I am still trying to figure out where that occurs.
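
In the meantime, a simple workaround sketch (standard library only) is to resolve -1 to the core count before the call:

# Workaround sketch: map nproc=-1 to the number of available cores
# before calling save_models_rest_spec.
import multiprocessing

nproc = -1
if nproc == -1:
    nproc = multiprocessing.cpu_count()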
