Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canopie22 artifacts for reproducibility of our work #31

Merged
merged 5 commits into from
Oct 13, 2022

Conversation

milroy
Copy link
Member

@milroy milroy commented Oct 5, 2022

YAMLs, dockerfiles, Python scripts, JSONs and output data necessary to reproduce our CANOPIE 2022 paper: One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC.

@milroy milroy requested a review from cmisale October 5, 2022 08:30
Copy link
Collaborator

@cmisale cmisale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all looks good to me

@milroy
Copy link
Member Author

milroy commented Oct 5, 2022

Thanks @cmisale! I'm going to update a few things and then we can merge it soon.

@vsoch
Copy link
Member

vsoch commented Oct 5, 2022

Only 45K lines of yaml! 😆 😭

Are there instructions for how to use the containers outside of the MPI operator (maybe a TODO for a separate repository with very basic examples of running flux?) (understanding that this repository is scoped to reproduce your paper work!)

@milroy
Copy link
Member Author

milroy commented Oct 5, 2022

Are there instructions for how to use the containers outside of the MPI operator (maybe a TODO for a separate repository with very basic examples of running flux?) (understanding that this repository is scoped to reproduce your paper work!)

After a bit more thought those build and run instructions (which I still need to write) are better for the MPI Operator repo. I'll get a PR ready for that.

@milroy milroy force-pushed the canopie22-artifacts branch 2 times, most recently from 3a92dcb to 72c10df Compare October 9, 2022 08:27
@milroy milroy mentioned this pull request Oct 11, 2022
@milroy
Copy link
Member Author

milroy commented Oct 11, 2022

@cmisale: how should we go about replacing kubeflux with fluence in the manifests, etc.? I assume a recursive replace will break KubeFlux, right?

@vsoch
Copy link
Member

vsoch commented Oct 11, 2022

Just be careful with the .git directory for doing it on the command line - I have broke many a git directory with this kind of substitution!

If you have VSCode they have a nice find and replace that will show you the places before doing it.

@milroy
Copy link
Member Author

milroy commented Oct 11, 2022

Just be careful with the .git directory for doing it on the command line - I have broke many a git directory with this kind of substitution!

Excellent point; noted.

@cmisale
Copy link
Collaborator

cmisale commented Oct 11, 2022

@cmisale: how should we go about replacing kubeflux with fluence in the manifests, etc.? I assume a recursive replace will break KubeFlux, right?

@milroy yeah it would break a few things. I would need to 1. change the container names and 2. change the hard coded name.

Should I do that now?

@milroy
Copy link
Member Author

milroy commented Oct 11, 2022

Should I do that now?

I was about to say yes, but just considered something. All the data that I put in this PR references "KubeFlux". Should I leave the references in the data? If so, plotting (and job management, eventually) will need to change, too. For example, in the Python plotting scripts, we have:

        sns.boxplot(
            x="ranks",
            y="real",
            hue="scheduler",
            data=df,
            whis=[5, 95],
            palette={"default-scheduler": "#4878d0", "kubeflux": "#ee854a"},
        )

default-scheduler and kubeflux are keys read from the raw output data.

@milroy
Copy link
Member Author

milroy commented Oct 11, 2022

I think a simple check in the plotting scripts to check for kubeflux or fluence scheduler names will solve this. I'll try this later today and update you, @cmisale.

@milroy
Copy link
Member Author

milroy commented Oct 12, 2022

Ok, I've cleaned up the plotting scripts and ensured they work. @cmisale go ahead and make the two changes you mentioned above.

@cmisale
Copy link
Collaborator

cmisale commented Oct 12, 2022

@milroy sounds good! I should be able to complete it by today

@cmisale
Copy link
Collaborator

cmisale commented Oct 12, 2022

  • Sidecar: git@github.com:flux-framework/flux-k8s.git updated, along with the tag
  • Fluence: git@github.com:openshift-psap/scheduler-plugins.git updated
  • New container image release for Fluence sidecar: quay.io/cmisale1/fluence-sidecar:latest
  • New container image release for Fluence plugin scheduler: quay.io/cmisale1/fluence:upstream
  • Tested on a Kubernetes cluster with dummy workloads, installed through helm

Pretty name showing up

claudias-air:charts cmisale$ k get po -n scheduler-plugins 
NAME                                           READY   STATUS    RESTARTS   AGE
fluence-bc9758657-4m57b                        2/2     Running   0          65m
scheduler-plugins-controller-f5cdf9674-8ptsr   1/1     Running   0          65m

@milroy
Copy link
Member Author

milroy commented Oct 12, 2022

Thank you @cmisale!

I added a fixup commit with the changes to the repo. I'll test the combined changes on EKS ASAP and let you know what happens.

@milroy
Copy link
Member Author

milroy commented Oct 13, 2022

It works:

Events:
  Type    Reason     Age   From     Message
  ----    ------     ----  ----     -------
  Normal  Scheduled  14s   fluence  Successfully assigned default/lammps-4661ea379f0b-worker-0 to ip-192-168-41-26.us-east-2.compute.internal
  Normal  Pulling    14s   kubelet  Pulling image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa"
  Normal  Pulled     13s   kubelet  Successfully pulled image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa" in 275.853816ms
  Normal  Created    13s   kubelet  Created container worker
  Normal  Started    13s   kubelet  Started container worker

run_experiments.py works as expected, too.

@milroy milroy force-pushed the canopie22-artifacts branch 2 times, most recently from 287e7fb to 4820b56 Compare October 13, 2022 06:20
@milroy
Copy link
Member Author

milroy commented Oct 13, 2022

I'm merging this PR. Thank you very much @cmisale!

@milroy milroy merged commit 62e9277 into flux-framework:canopie22-artifacts Oct 13, 2022
@cmisale
Copy link
Collaborator

cmisale commented Oct 13, 2022

Great!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants