Adding robustness to the hostname indexing #466

Merged
TApplencourt merged 3 commits into devel from hostname_add_robustness
Feb 16, 2026

@colleeneb
Contributor

When testing the timeline on Sunspot, we ran into a failure like:

> mpirun -n 2 -ppn 1 -- iprof -l -- gpu_tile_compact.sh ./flops
Single Precision Peak Flops: 42422 GFlop/s
Double Precision Peak Flops: 21316.3 GFlop/s
/opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:262:in `*': nil can't be coerced into Integer (TypeError)

      (MAX_UINT64_VALUE / number_hostname) * hostname_index
                                             ^^^^^^^^^^^^^^
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:262:in `rank_offset'
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:714:in `bt_analysis'
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:980:in `block in all_trace_and_processing'
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:469:in `open'
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:966:in `all_trace_and_processing'
        from /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/thapi-git.426c8097e536fe40a8148bb83787ea695c91c57a_0.0.12-git.113-bstf5lt/bin/iprof:1121:in `<main>'
x1921c1s5b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov: rank 1 exited with code 1
x1921c1s2b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov: rank 0 died from signal 15

This happened because hostname_id was nil: in hostname_id = hostnames.find_index(Socket.gethostname), the value returned by Socket.gethostname was not found in the hostnames list. The hostnames list on Sunspot contains hostnames with -hsn0 in them:

x1921c1s2b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov
x1921c1s5b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov

while Socket.gethostname returns just x1921c1s2b0n0.
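The mismatch can be reproduced without a cluster. The sketch below hard-codes the two Sunspot hostfile entries quoted above and uses a literal short name in place of Socket.gethostname; the variable names are illustrative and are not taken from iprof.

```ruby
# Hostfile entries as quoted above (hard-coded here instead of read from a file).
hostnames = [
  'x1921c1s2b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov',
  'x1921c1s5b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov',
]
# Stand-in for Socket.gethostname on the first node (assumption for this sketch).
short_name = 'x1921c1s2b0n0'

# Exact match on the raw entries: no entry equals the short name.
exact_index = hostnames.find_index(short_name)

# Keeping only the text before the first '.' still leaves the "-hsn0" suffix.
split_names = hostnames.map { |el| el.split('.', 2).first }
split_index = split_names.find_index(short_name)

# Prefix match, as proposed in this PR.
prefix_index = hostnames.find_index { |host| host.start_with?(short_name) }

p exact_index   # nil
p split_index   # nil
p prefix_index  # 0
```

Only the prefix match finds the entry, because the short name is followed by -hsn0 rather than by a dot.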

To work around this, this PR looks up the index of the first element in hostnames that starts with the result of Socket.gethostname. It also raises an error and prints a message if hostname_id is still nil, for easier debugging in the future.
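A minimal sketch of that approach follows. The helper name hostname_index_for, the wording of the error message, and the definition of MAX_UINT64_VALUE as 2**64 - 1 are assumptions made for illustration; only the prefix lookup and the offset expression from the traceback come from this PR.

```ruby
# Hypothetical helper, not the iprof source: returns the position of the first
# hostfile entry that starts with local_name, or raises with a descriptive message.
def hostname_index_for(hostnames, local_name)
  index = hostnames.find_index { |host| host.start_with?(local_name) }
  return index unless index.nil?

  raise "Hostname #{local_name} not found in hostfile entries: #{hostnames.map(&:chomp).inspect}"
end

# Assumed definition; only the constant name appears in the traceback.
MAX_UINT64_VALUE = 2**64 - 1

# Entries keep their trailing newline, as File.readlines would return them.
hostnames = [
  "x1921c1s2b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov\n",
  "x1921c1s5b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov\n",
]

index = hostname_index_for(hostnames, 'x1921c1s5b0n0')
# Same expression as the line that failed in the traceback.
offset = (MAX_UINT64_VALUE / hostnames.size) * index
puts "index=#{index} offset=#{offset}"

begin
  hostname_index_for(hostnames, 'unknown-node')
rescue RuntimeError => e
  warn e.message # a clear message instead of "nil can't be coerced into Integer"
end
```

Failing at the lookup keeps the error next to its cause, rather than letting nil reach the multiplication in rank_offset.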

Thanks to @TApplencourt for advice on solution!

@@ -255,7 +255,13 @@ module MPITopo
# (e.g. x4117c4s4b0n0.hsn.cm.aurora.alcf.anl.gov) and the Socket.gethostname is just x4117c4s4b0n0
hostnames = File.readlines(hostfile).map { |el| el.split('.', 2).first }
Collaborator

I suppose we can remove the split and stuff now :)

Contributor Author

good idea! pushed!

hostnames = File.readlines(hostfile)
# find index of hostname_string in list_hostnames
- hostname_id = hostnames.find_index(Socket.gethostname)
+ hostname_id = hostnames.find_index{ |host| host.start_with?(Socket.gethostname) }
Collaborator

So we can do it in one line now :D
File.readlines(hostfile).find_index{ |hostname| hostname.start_with?(Socket.gethostname) }

Contributor Author

ah, we need the hostnames to get the number_hostname. We could make this change but then we'd need to read the hostfile again to get the number of hosts.
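The trade-off described here can be sketched as follows: reading the hostfile once into an array yields both the number of hosts and the local index. The helper name host_count_and_index and the temporary hostfile are hypothetical, and a literal short name stands in for Socket.gethostname.

```ruby
require 'tempfile'

# Hypothetical helper: reads the hostfile once and returns
# [number_hostname, hostname_id] from the same array.
def host_count_and_index(hostfile, local_name)
  hostnames = File.readlines(hostfile)
  [hostnames.size, hostnames.find_index { |host| host.start_with?(local_name) }]
end

Tempfile.create('hostfile') do |f|
  f.write("x1921c1s2b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov\n")
  f.write("x1921c1s5b0n0-hsn0.hsn.cm.sunspot.alcf.anl.gov\n")
  f.flush
  number_hostname, hostname_id = host_count_and_index(f.path, 'x1921c1s5b0n0')
  puts "#{number_hostname} hosts, local index #{hostname_id}"
end
```

The one-line form would discard the array, so the host count would require a second File.readlines call.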

Collaborator

@TApplencourt TApplencourt Feb 16, 2026

Oh true, my bad then! Didn't read enough of the diff :D

@TApplencourt TApplencourt merged commit 0940d81 into devel Feb 16, 2026
23 checks passed
@TApplencourt
Collaborator

Thanks a lot \o/

@TApplencourt TApplencourt deleted the hostname_add_robustness branch February 16, 2026 18:47