Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default MPI_BASE_PORT and turn into env. variable #150

Merged
merged 1 commit into from
Oct 6, 2021

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Oct 5, 2021

All tests running on Knative with, specifically, 10 ranks would fail, claiming that the port 9090 was already in use. Sure enough, ports 9090 and 9091 are used by knative-serving on the faasm-worker service.

The fact that this only happens with 10 ranks (i.e. MPI_WORLD_SIZE = 10) is because the ports we use are a combination of the outbound-inbound ranks and the world size.

To solve this problem, I change the default MPI base port to a port area, apparently, less cluttered: 10800. However, for us to actually benefit from this patch we will now have to:

  1. Merge this PR into faabric.
  2. Bump the faabric dependency in faasm.
  3. Bump the faasm version and re-build containers.

To prevent this from occuring in the future, I also make the MPI_BASE_PORT an environment variable in the faabric config.

One may argue that the over-arching problem is that, in a multi-tenant environment with big MPI jobs, we end up using too many ports, and errors like this will keep appearing. I could agree with that. And given that we now will start working on multi-tenant MPI jobs, it's about time to think about this in detail. I say we move discussion for this offline.

@csegarragonz csegarragonz added the bug Something isn't working label Oct 5, 2021
@csegarragonz csegarragonz self-assigned this Oct 5, 2021
Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this seems like the right approach for now. Let's take this offline, I have some ideas for how we can do this without taking up hundreds of ports.

@csegarragonz csegarragonz merged commit 1e08647 into master Oct 6, 2021
@csegarragonz csegarragonz deleted the change-mpi-base-port branch October 6, 2021 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants