Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resources: Update GPUFS disk to ROCm 6.1, HCL format #29

Merged
merged 1 commit into from
Apr 25, 2024

Conversation

abmerop
Copy link
Member

@abmerop abmerop commented Apr 13, 2024

Update the disk image creation scripts to use the HCL packer formatted, modified from the x86-ubuntu gem5 resource. The major feature of this is a simplified one-step command to build the disk image.

In addition the GPU package versions are bumped to ROCm 6.1, PyTorch 2.2.2, and TensorFlow 2.14.0

The previous ROCm 5.4.2 resource is removed and the new disk image uses an unversioned x86-ubuntu-gpu-ml directory.

@abmerop abmerop requested a review from mattsinc April 13, 2024 21:46
@abmerop
Copy link
Member Author

abmerop commented Apr 14, 2024

Not sure. Do I need to update resources.json for this?

Copy link
Contributor

@powerjg powerjg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I believe everything below are just suggestions, so feel free to let me know what you don't want to do anything.

src/x86-ubuntu-gpu-ml/scripts/rocm-install.sh Outdated Show resolved Hide resolved
@@ -0,0 +1,14 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is totally different from the base image's script? It's OK if it needs to be different, but if there's no need, then I would prefer the execution path of each disk to be as similar as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really follow that script. We want to read a script, run, and exit. If it fails (e.g., QEMU) this will simply drop to shell. I'm not sure why all the other stuff is needed. To get interactive for example, we just run bash --norc as the application (which avoids an infinite loop reading the bashrc, which calls m5 readfile, etc.)

One other major difference from the base image I found after a lot of testing is that we must login a root user for GPU. This is because we need to modprobe the driver. We have that blacklisted when Linux boots. The reason for that is we need to workaround KVM not having memory holes and we copy the GPU bios to the x86 VGA ROM region in gem5 before modprobe. The GPU driver supports multiple ways to read the GPU bios, but some of those are not compatible with some upcoming products we'd like to support in gem5

Copy link
Contributor

@powerjg powerjg Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would like to understand more of what you're saying here. If there's a better way to accomplish our goals, I would love to hear it.

We're trying to avoid logging in as root because it screws up other things (e.g., MPI). Would it work to make sudo passwordless and use sudo modprobe? I think that could be something that we change in the base image :)

Edit: No need to change this. But if sudo modprobe would work for you, let's make that change later. @Harshil2107 note this down.

- password: 12345

## Example gem5 commands

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a description of the m5 exit calls and how users should expect the disk to behave?

I would prefer, though it's not required, to have all of our disks match as closely as possible. E.g., on the new ubuntu-22.04 disk we have an exit after kernel boot, after systemd, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m5 exit ends the simulation. Anything else would require changes to the config scripts. I wasn't aware that this was changing. Any reason not to create a new m5op?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we're working on creating new m5ops for each of these different points in the process. @BobbyRBruce is assigned this. For now, let's just document what we're doing and then Bobby or Harshil can fixup this disk image once we've finalized the new ops.

src/x86-ubuntu-gpu-ml/BUILDING.md Outdated Show resolved Hide resolved
Copy link
Contributor

@mattsinc mattsinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any major issues with this, but since @powerjg has some questions, I'll wait on approving.

I'll also need to test this locally, but that is unlikely to happen this week.

@Harshil2107
Copy link
Contributor

Harshil2107 commented Apr 15, 2024

Not sure. Do I need to update resources.json for this?

If we need to make a brand new diskimage, then I can bump the version of the existing one in resources (MongoDB) or make one if it doesn't exist.

Harshil2107
Harshil2107 previously approved these changes Apr 16, 2024
Copy link
Contributor

@Harshil2107 Harshil2107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

powerjg
powerjg previously approved these changes Apr 17, 2024
@abmerop
Copy link
Member Author

abmerop commented Apr 18, 2024

Not sure. Do I need to update resources.json for this?

If we need to make a brand new diskimage, then I can bump the version of the existing one in resources (MongoDB) or make one if it doesn't exist.

We could reuse the "x86-gpu-fs-img" resource... Although maybe it makes sense to rename to the same name as the directory

@abmerop
Copy link
Member Author

abmerop commented Apr 18, 2024

Thanks for the approvals, but ROCm 6.1 was released yesterday so I am going to iterate on this one more time

Update the disk image creation scripts to use the HCL packer formatted,
modified from the x86-ubuntu gem5 resource. The major feature of this is
a simplified one-step command to build the disk image.

In addition the GPU package versions are bumped to ROCm 6.1, PyTorch
2.2.2, and TensorFlow 2.14.0

The previous ROCm 5.4.2 resource is removed and the new disk image uses
an unversioned x86-ubuntu-gpu-ml directory.
@abmerop abmerop dismissed stale reviews from powerjg and Harshil2107 via d49cb7e April 18, 2024 18:05
@abmerop abmerop changed the title resources: Update GPUFS disk to ROCm 6.0.2, HCL format resources: Update GPUFS disk to ROCm 6.1, HCL format Apr 18, 2024
@abmerop
Copy link
Member Author

abmerop commented Apr 18, 2024

Bumped to ROcm 6.1 and added cmake package which helps with testing

@abmerop
Copy link
Member Author

abmerop commented Apr 25, 2024

Bump

Copy link
Contributor

@Harshil2107 Harshil2107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@abmerop abmerop merged commit b9492fd into gem5:stable Apr 25, 2024
@abmerop abmerop deleted the ubuntu2204-rocm6 branch April 25, 2024 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants