resources: Update GPUFS disk to ROCm 6.1, HCL format #29
Conversation
Not sure. Do I need to update resources.json for this?
Looks great. I believe everything below is just suggestions, so feel free to let me know if there's anything you don't want to do.
@@ -0,0 +1,14 @@
#!/bin/bash
Any reason this is totally different from the base image's script? It's OK if it needs to be different, but if there's no need, then I would prefer the execution path of each disk to be as similar as possible.
I didn't really follow that script. We want to read a script, run it, and exit. If it fails (e.g., under QEMU), this will simply drop to a shell. I'm not sure why all the other stuff is needed. To get interactive, for example, we just run bash --norc as the application (which avoids an infinite loop of reading the bashrc, which calls m5 readfile, etc.).
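A hypothetical sketch of the boot flow described above (this is an illustration of the idea, not the actual script in the disk image; the path `/tmp/runscript` is an assumption):

```bash
#!/bin/bash
# Hypothetical minimal gem5 init flow: pull the run script from the
# host with `m5 readfile`, run it, and end the simulation with
# `m5 exit`. If readfile yields nothing (e.g., booting under plain
# QEMU with no gem5 host), drop to an interactive shell instead.
if m5 readfile > /tmp/runscript 2>/dev/null && [ -s /tmp/runscript ]; then
    chmod +x /tmp/runscript
    /tmp/runscript
    m5 exit
else
    # --norc avoids re-sourcing a bashrc that itself calls
    # `m5 readfile`, which would otherwise loop forever.
    exec bash --norc
fi
```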
One other major difference from the base image, which I found after a lot of testing, is that we must log in as root for GPU. This is because we need to modprobe the driver, which we have blacklisted when Linux boots. The reason for that is to work around KVM not having memory holes: we copy the GPU BIOS to the x86 VGA ROM region in gem5 before modprobe. The GPU driver supports multiple ways to read the GPU BIOS, but some of those are not compatible with some upcoming products we'd like to support in gem5.
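The blacklist-then-modprobe pattern described above might look roughly like this (file path and module name are assumptions for illustration; the actual driver module may differ):

```bash
# At image-build time: keep the GPU driver from loading during boot,
# so gem5 can place the GPU BIOS into the VGA ROM region first.
echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf

# At run time, after gem5 has copied the GPU BIOS: load the driver
# manually (this is why a root login is currently needed).
modprobe amdgpu
```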
Hmm, I would like to understand more of what you're saying here. If there's a better way to accomplish our goals, I would love to hear it.
We're trying to avoid logging in as root because it screws up other things (e.g., MPI). Would it work to make sudo passwordless and use sudo modprobe? I think that could be something that we change in the base image :)
Edit: No need to change this. But if sudo modprobe would work for you, let's make that change later. @Harshil2107 note this down.
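The passwordless-sudo setup suggested here could be a sudoers drop-in like the following (the username `gem5` and the file name are assumptions, not what the base image actually uses):

```bash
# /etc/sudoers.d/gem5 -- hypothetical drop-in granting the image's
# normal user passwordless sudo, so boot scripts can run
# `sudo modprobe` without a root login.
# gem5   ALL=(ALL) NOPASSWD: ALL

# Installing it safely (validate with visudo before it takes effect):
echo 'gem5 ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/gem5
chmod 0440 /etc/sudoers.d/gem5
visudo -c -f /etc/sudoers.d/gem5
```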
- password: 12345
## Example gem5 commands
Can you add a description of the m5 exit calls and how users should expect the disk to behave?
I would prefer, though it's not required, to have all of our disks match as closely as possible. E.g., on the new ubuntu-22.04 disk we have an exit after kernel boot, after systemd, etc.
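The per-phase exit pattern on the ubuntu-22.04 disk amounts to something like this (the hook locations are illustrative; the real disk wires these into its boot scripts/systemd units):

```bash
# Hypothetical sketch of staged `m5 exit` checkpoints during boot:

m5 exit   # 1st exit: kernel has finished booting
# ... systemd brings services up ...
m5 exit   # 2nd exit: systemd has reached its target
# ... the workload/run script executes ...
m5 exit   # final exit: end of simulation (unless the gem5 config
          # script catches the exit event and resumes)
```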
m5 exit ends the simulation. Anything else would require changes to the config scripts. I wasn't aware that this was changing. Any reason not to create a new m5op?
Yeah, we're working on creating new m5ops for each of these different points in the process. @BobbyRBruce is assigned this. For now, let's just document what we're doing, and then Bobby or Harshil can fix up this disk image once we've finalized the new ops.
I don't have any major issues with this, but since @powerjg has some questions, I'll wait on approving.
I'll also need to test this locally, but that is unlikely to happen this week.
If we need to make a brand new disk image, then I can bump the version of the existing one in resources (MongoDB), or make one if it doesn't exist.
Force-pushed from 4ac69ba to 34ab891.
lgtm
We could reuse the "x86-gpu-fs-img" resource... although maybe it makes sense to rename it to match the directory name.
Thanks for the approvals, but ROCm 6.1 was released yesterday so I am going to iterate on this one more time.
Update the disk image creation scripts to use the HCL Packer format, modified from the x86-ubuntu gem5 resource. The major feature of this is a simplified one-step command to build the disk image. In addition, the GPU package versions are bumped to ROCm 6.1, PyTorch 2.2.2, and TensorFlow 2.14.0. The previous ROCm 5.4.2 resource is removed, and the new disk image uses an unversioned x86-ubuntu-gpu-ml directory.
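With the HCL Packer format, the one-step build mentioned above would look roughly like this (the directory path is an assumption based on the description, not verified against the repo):

```bash
# Hypothetical one-step disk image build with Packer (HCL templates):
cd src/x86-ubuntu-gpu-ml   # assumed resource directory
packer init .              # fetch the plugins the .pkr.hcl template declares
packer build .             # build the disk image in a single step
```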
Force-pushed from 34ab891 to d49cb7e.
Bumped to ROCm 6.1 and added the cmake package, which helps with testing.
Bump
lgtm