The script doesn't create a correct Standard_HC44rs cluster with Mellanox EDR InfiniBand #374
Comments
AN (Accelerated Networking) is not available for HC. Please set …
@jithinjosepkl in the config file I shared? Will that fix the interconnect issue?
Please follow this article to make Intel MPI (IMPI) pick the mlx provider. The next update of the CentOS HPC images will include IMPI 2019-U8, where you won't have to specify this environment parameter (it picks up the mlx provider by default).
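(For reference, a minimal sketch of the environment settings commonly used to steer Intel MPI 2019 toward the mlx libfabric provider; these are standard Intel MPI variables, but confirm the exact recommendation against the linked article:)

    # Select the OFI/libfabric path and the Mellanox (UCX-based) mlx provider
    # for Intel MPI 2019 on InfiniBand-equipped nodes such as HC44rs.
    export I_MPI_FABRICS=shm:ofi
    export FI_PROVIDER=mlx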
@jithinjosepkl I would appreciate a working scenario, as the cost is getting very high without us getting a job run correctly. Surprisingly, the OpenLogic:CentOS:7_7-gen2:latest image didn't come with any IMPI installed (it sounds like it gets overridden, maybe via the azurehpc scripts?), so I had to install IMPI manually. When I set up the mlx provider, I get an error because the interconnect is not installed/set up correctly:
The CentOS images do not contain any pre-installed OFED drivers or MPI libraries (including Intel MPI); try using the CentOS-HPC images instead (e.g. OpenLogic:CentOS-HPC:7_7-gen2:latest). The CentOS-HPC images should be ready to go if you want to use InfiniBand on HC44 SKUs.
@garvct I'm already using the OpenLogic:CentOS-HPC:7_7-gen2:latest image. Please see my original post.
@Smahane, based on your config file, you are using OpenLogic:CentOS:7_7-gen2:latest. You need the OpenLogic:CentOS-HPC:7_7-gen2:latest image instead for the MPIs to be pre-installed.
Once you sort out your image, you can execute the IMB-MPI1 benchmark using the scripts in azurehpc/apps/imb-mpi. Examples of running IMB-MPI1 with different MPI libraries are provided.
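(As a rough sketch of what such a run looks like with Intel MPI, assuming IMB-MPI1 is on the PATH and ${hostfile} lists two nodes; the azurehpc/apps/imb-mpi scripts wrap this in scheduler submission, so the exact invocation there may differ:)

    # Two ranks, one per node: a basic inter-node PingPong test.
    mpiexec.hydra -f ${hostfile} -n 2 -ppn 1 IMB-MPI1 PingPong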
@garvct and @jithinjosepkl thank you for pointing me to the HPC image.
This worked. Thank you, everyone.
Hello @xpillons, @garvct, and @jithinjosepkl. I ran LAMMPS on up to 8 Standard_HC44rs nodes, but I'm having performance issues at 8 nodes; I think one or more nodes are bad. Do you know of any smoke test or script that can help me check the nodes and detect which one is not working well? The azhpc_install_config/install/11_node_healthchecks.log doesn't show any errors. Thank you,
@Smahane you can use the MPI PingPong test; we have an example here: https://github.com/Azure/azurehpc/tree/master/apps/imb-mpi
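(One way to turn that into a node-isolation smoke test is to run PingPong between a fixed reference node and each other node in turn and compare the results. A rough sketch, assuming a nodes.txt file with one hostname per line and IMB-MPI1 available on every node; the file name and layout are illustrative, not something the azurehpc scripts produce:)

    #!/bin/bash
    # Pairwise PingPong from the first node to every other node.
    # A node with noticeably higher latency or lower bandwidth than the rest
    # is the likely culprit.
    ref=$(head -n 1 nodes.txt)
    for node in $(tail -n +2 nodes.txt); do
        echo "== ${ref} <-> ${node} =="
        mpiexec.hydra -hosts ${ref},${node} -n 2 -ppn 1 IMB-MPI1 PingPong
    done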
Describe the bug
I need to run HPC applications (like LAMMPS) over a high-bandwidth interconnect, but the cluster doesn't seem to be configured correctly.
To Reproduce
Steps to reproduce the behavior:
create a cluster using this config.json
mlx-fail.config.txt
set up the environment and run:
mpiexec.hydra -f ${hostfile} -n 88 -ppn 44 ./lmp_intel_cpu_intelmpi -in in.intel.lc -v x 4 -v y 2 -v z 2 -pk intel 0 omp 1 -sf intel
Expected behavior
[0] MPI startup(): libfabric provider: mlx
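(As an aside: Intel MPI only prints the MPI startup() lines when its debug output is enabled, so a quick sketch for confirming which provider was actually selected, assuming IMB-MPI1 is available as a small test program, looks like this:)

    # Enable Intel MPI startup diagnostics and filter for the provider line.
    export I_MPI_DEBUG=5
    mpiexec.hydra -f ${hostfile} -n 2 -ppn 1 IMB-MPI1 PingPong 2>&1 | grep "libfabric provider"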
Other details
Created a cluster with OpenLogic:CentOS:7_7-gen2:latest
Used Standard_HC44rs for compute nodes
Used Standard_D8s_v3 for headnode
Additional details
To try to guarantee that Mellanox is installed and set up correctly, I tried to set the headnode to be Standard_HC44rs, but it fails with the error below: