piXedfit not working in HPC #15

Open
Nikhil0504 opened this issue Dec 2, 2023 · 3 comments

@Nikhil0504
Contributor

Hello,

I am trying to run the code for a JWST project on an HPC cluster. But when I try to generate the model rest-frame spectra, my code breaks with an MPI error that seems to be caused by piXedfit.

I did test the MPI installation on my cluster and everything seems fine there, with no conflicts; it only breaks when I run piXedfit.
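
For context, a minimal mpi4py check would look something like this (just a sketch; it assumes mpi4py is installed in the job's environment and the script is launched with mpirun, and the filename is illustrative):

# mpi_check.py (illustrative name) -- minimal sanity check that MPI ranks start up correctly
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")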

The error looks like a networking error, but my HPC support team checked and didn't find any problems with the way I set up my jobs; they lean toward the problem being in the way the MPI implementation is used here.

Please let me know how I can fix it, because this code takes a very long time to run on my laptop (MacBook Air M1). I would also like to know whether there are any new features for JWST imaging and spectroscopy.

Thanks,
Nikhil


FILE DUMPS:

Here is the HPC job shell script:

#!/bin/bash
# a name for this job; letters and digits only please. (optional)
#SBATCH --job-name=run_pixed_fit

# Researchers can buy in for priority queueing. However, the amount of time 
# they can use for this is limited.  Everyone has access to unlimited
# windfall, but any priority jobs will go first.  The partitions (queues) on
# Puma are windfall, standard, and high_pri
# Non-windfall partitions require a "#SBATCH --account=your_PI_group" line.
#SBATCH --account=bfrye
#SBATCH --partition=standard

# Standard Puma Nodes have 94 available cores and 480GB of available RAM each.
# Since no memory allocation is set, the RAM available will be 5GB per core
# (task in Slurm).
# Note: the fewer resources you request, the less time your job will spend in
# waiting and the more resources will be available for others to use.
#SBATCH --nodes=4
#SBATCH --ntasks=32
#SBATCH --mem=32gb

#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikhilgaruda@arizona.edu

# This is the amount of time you think your job will take to run.
# 240 hours (10 days) is the maximum.
# This request asks for 10 hours.
#SBATCH --time=10:00:00

# Reset modules, so we have a known starting point...
module purge
module load autotools prun/1.3 gnu8/8.3.0 openmpi3/3.1.4 ohpc
module load anaconda
module list

source /home/u4/nikhilgaruda/.bashrc

export SPS_HOME=/home/u4/nikhilgaruda/Software/fsps
export PIXEDFIT_HOME=/home/u4/nikhilgaruda/Software/piXedfit
export OMPI_MCA_opal_cuda_support=true

conda activate /home/u4/nikhilgaruda/.conda/envs/pixedfit
echo $CONDA_DEFAULT_ENV

conda list

conda install astropy -y


cd /groups/bfrye/G191/pixedfit
strace mpirun model_rest_frame_spectra.py

# *** If you need assistance using Puma, please email
# *** hpc-consult@list.arizona.edu

exit 0
#
#-end of puma_cpu sample script.

Here is the output from the run:
slurm-1653886.out.txt

@aabdurrouf
Owner

aabdurrouf commented Dec 3, 2023

Hi @Nikhil0504, sorry to hear that you ran into this issue. First of all, can I see your model_rest_frame_spectra.py script? Also, the way you run this Python script seems incorrect -- there is no need to call mpirun yourself. Please try this instead:
python model_rest_frame_spectra.py

Regarding the slow run on the MacBook Air M1: how many cores does your laptop have, and how many of them are used (nproc) when running your piXedfit script? And did you put mpirun in the terminal command when executing the script (which you shouldn't)? On my MBP, using 5 cores, generating 100,000 models usually takes around 10-15 minutes.
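
If you are not sure how many cores you have, a quick check from Python is (just a sketch using the standard library):

import os
print(os.cpu_count())   # number of logical cores visible to Python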

@Nikhil0504
Contributor Author

Hey @aabdurrouf, sorry for the late reply! It's the end of the semester for me, filled with lots of finals.

Here is the model_rest_frame_spectra.py file. I set nproc=-1 to use all the cores and just ran it as a normal Python script instead of using mpirun.

I am not sure why it's super slow and buggy for me.

# %%
import numpy as np
from astropy.cosmology import FlatLambdaCDM
from piXedfit.piXedfit_model import save_models_rest_spec

# %%
imf_type  = 1
sfh_form = 1
dust_law = 1
duste_switch = 1
add_neb_emission = 1
add_agn = 0

nmodels=100000
nproc=16

min_z = 2.47                     # minimum redshift, which determines the maximum age of the models
cosmo = FlatLambdaCDM(H0=70.0, Om0=0.3)
age_univ = cosmo.age(min_z)
max_log_age = np.log10(age_univ.value)

# # we fix the ionization parameter to log(U)=-2.0
# params_range = {'dust1':[0.0,4.0], 
#                 'dust2':[0.0,4.0], 
#                 'log_age':[-1.0,max_log_age], 
#                 'log_tau':[-1.0,1.5], 
#                 'gas_logu':[-2.0,-2.0]}

name_out = 'model_specs.hdf5'
save_models_rest_spec(imf_type=imf_type, sfh_form=sfh_form, dust_law=dust_law,
                        duste_switch=duste_switch, add_neb_emission=add_neb_emission, add_agn=add_agn,
                        nmodels=nmodels, nproc=nproc, name_out=name_out)

# %%
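
(If the commented-out ranges above were re-enabled, they would be passed to the call like this -- just a sketch, assuming save_models_rest_spec accepts a params_range keyword as in the piXedfit examples:)

# Sketch: passing custom parameter ranges (assumes a params_range keyword exists).
params_range = {'dust1': [0.0, 4.0],
                'dust2': [0.0, 4.0],
                'log_age': [-1.0, max_log_age],
                'log_tau': [-1.0, 1.5],
                'gas_logu': [-2.0, -2.0]}   # fix the ionization parameter to log(U)=-2.0

save_models_rest_spec(imf_type=imf_type, sfh_form=sfh_form, dust_law=dust_law,
                      duste_switch=duste_switch, add_neb_emission=add_neb_emission,
                      add_agn=add_agn, params_range=params_range,
                      nmodels=nmodels, nproc=nproc, name_out=name_out)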

@Nikhil0504
Contributor Author

Update:
I got it working on the HPC using an interactive node. The first problem was the way the conda environment was defined, and also the way I installed my mpi4py package.

But after installing, none of the code worked because of a conflict among the packages defined in the installation (mainly astropy), so I had to pin the versions manually. I can push a fix for the package conflicts with a new requirements.txt that freezes the versions that work best for this release.

There was also a bug where the code breaks when nproc=-1; I am still trying to figure out where that occurs.
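
In the meantime, a simple workaround sketch (standard library only) is to resolve -1 to the core count before the call:

# Workaround sketch: map nproc=-1 to the number of available cores
# before calling save_models_rest_spec.
import multiprocessing

nproc = -1
if nproc == -1:
    nproc = multiprocessing.cpu_count()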
