resources: Update GPUFS disk to ROCm 6.1, HCL format #29

abmerop · 2024-04-13T21:46:28Z

Update the disk image creation scripts to use the HCL Packer format, modified from the x86-ubuntu gem5 resource. The main benefit is a simplified, one-step command to build the disk image.

In addition, the GPU package versions are bumped to ROCm 6.1, PyTorch 2.2.2, and TensorFlow 2.14.0.

The previous ROCm 5.4.2 resource is removed, and the new disk image uses an unversioned x86-ubuntu-gpu-ml directory.

abmerop · 2024-04-14T23:28:42Z

Not sure. Do I need to update resources.json for this?

powerjg

Looks great. I believe everything below is just a suggestion, so feel free to let me know if there's anything you don't want to do.

src/x86-ubuntu-gpu-ml/scripts/rocm-install.sh

src/x86-ubuntu-gpu-ml/x86-ubuntu-gpu-ml.pkr.hcl

powerjg · 2024-04-15T15:47:15Z

src/x86-ubuntu-gpu-ml/files/run_gem5_app.sh

@@ -0,0 +1,14 @@
+#!/bin/bash


Any reason this is totally different from the base image's script? It's OK if it needs to be different, but if there's no need, then I would prefer the execution path of each disk to be as similar as possible.

abmerop · 2024-04-16

I didn't really follow that script. We want to read a script, run it, and exit. If that fails (e.g., under QEMU), this will simply drop to a shell. I'm not sure why all the other stuff is needed. To get an interactive session, for example, we just run bash --norc as the application (which avoids an infinite loop reading the bashrc, which calls m5 readfile, etc.).
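A minimal sketch of the read-run-exit flow described above, assuming the guest has the m5 utility on its PATH. The function name run_gem5_app, the M5 and SCRIPT variables, and the temporary path are hypothetical and are not taken from the script in this PR.

```shell
#!/bin/bash
# Hypothetical sketch of a read-run-exit flow; not the run_gem5_app.sh from this PR.
# M5 and SCRIPT are illustrative variables so the flow can be exercised off-simulator.
M5="${M5:-m5}"
SCRIPT="${SCRIPT:-/tmp/gem5_app.sh}"

run_gem5_app() {
    # Ask the simulator for the host-provided script. Outside gem5 (e.g., QEMU)
    # this fails or yields nothing, so return and leave the user at a shell.
    if ! "$M5" readfile > "$SCRIPT" 2>/dev/null || [ ! -s "$SCRIPT" ]; then
        echo "No script provided; dropping to shell."
        return 1
    fi

    # Run the application script, then end the simulation.
    bash "$SCRIPT"
    "$M5" exit
}
```

A login hook could then call run_gem5_app. Passing a host script that only runs bash --norc would be one way to get the interactive case mentioned above.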

One other major difference from the base image, which I found after a lot of testing, is that we must log in as the root user for GPU. This is because we need to modprobe the driver, which we have blacklisted when Linux boots. The reason for that is we need to work around KVM not having memory holes, and we copy the GPU BIOS to the x86 VGA ROM region in gem5 before modprobe. The GPU driver supports multiple ways to read the GPU BIOS, but some of those are not compatible with some upcoming products we'd like to support in gem5.

powerjg · 2024-04-16

Hmm, I would like to understand more of what you're saying here. If there's a better way to accomplish our goals, I would love to hear it.

We're trying to avoid logging in as root because it screws up other things (e.g., MPI). Would it work to make sudo passwordless and use sudo modprobe? I think that could be something that we change in the base image :)

Edit: No need to change this. But if sudo modprobe would work for you, let's make that change later. @Harshil2107 note this down.
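If passwordless sudo is adopted later, the rule could be limited to modprobe rather than all commands. A sketch, assuming a hypothetical user name gem5, a hypothetical drop-in file name, the modprobe path /usr/sbin/modprobe, and amdgpu as the driver module (none of these are stated in the thread):

```shell
#!/bin/bash
# Hypothetical sketch: user name, file name, modprobe path, and module name
# are all assumptions made for illustration.

# Print a sudoers rule letting one user run modprobe as root without a password.
sudoers_modprobe_rule() {
    local user="$1"
    printf '%s ALL=(root) NOPASSWD: /usr/sbin/modprobe\n' "$user"
}

# At image build time (as root), the rule could be installed and checked with:
#   sudoers_modprobe_rule gem5 > /etc/sudoers.d/gpu-modprobe
#   chmod 0440 /etc/sudoers.d/gpu-modprobe
#   visudo -cf /etc/sudoers.d/gpu-modprobe
#
# At run time, the unprivileged user would then load the driver with:
#   sudo modprobe amdgpu
```

Scoping the rule to a single command keeps the non-root login that MPI and similar tools expect while still allowing the driver load.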

powerjg · 2024-04-15T15:49:09Z

src/x86-ubuntu-gpu-ml/README.md

+- password: 12345
+
+## Example gem5 commands
+


Can you add a description of the m5 exit calls and how users should expect the disk to behave?

I would prefer, though it's not required, to have all of our disks match as closely as possible. E.g., on the new ubuntu-22.04 disk we have an exit after kernel boot, after systemd, etc.

abmerop · 2024-04-16

m5 exit ends the simulation. Anything else would require changes to the config scripts. I wasn't aware that this was changing. Any reason not to create a new m5op?

powerjg · 2024-04-16

Yeah, we're working on creating new m5ops for each of these different points in the process. @BobbyRBruce is assigned to this. For now, let's just document what we're doing, and then Bobby or Harshil can fix up this disk image once we've finalized the new ops.

src/x86-ubuntu-gpu-ml/BUILDING.md

mattsinc

I don't have any major issues with this, but since @powerjg has some questions, I'll wait on approving.

I'll also need to test this locally, but that is unlikely to happen this week.

Harshil2107 · 2024-04-15T20:51:58Z

> Not sure. Do I need to update resources.json for this?

If we need to make a brand new disk image, then I can bump the version of the existing one in resources (MongoDB) or make one if it doesn't exist.

Harshil2107

lgtm

abmerop · 2024-04-18T14:49:33Z

> > Not sure. Do I need to update resources.json for this?
>
> If we need to make a brand new disk image, then I can bump the version of the existing one in resources (MongoDB) or make one if it doesn't exist.

We could reuse the "x86-gpu-fs-img" resource... although maybe it makes sense to rename it to the same name as the directory.

abmerop · 2024-04-18T14:50:01Z

Thanks for the approvals, but ROCm 6.1 was released yesterday, so I am going to iterate on this one more time.

abmerop · 2024-04-18T18:07:01Z

Bumped to ROCm 6.1 and added the cmake package, which helps with testing.

abmerop · 2024-04-25T14:23:37Z

Bump

Harshil2107

lgtm

abmerop requested a review from mattsinc April 13, 2024 21:46

powerjg reviewed Apr 15, 2024

powerjg requested a review from Harshil2107 April 15, 2024 15:52

mattsinc reviewed Apr 15, 2024

abmerop force-pushed the ubuntu2204-rocm6 branch from 4ac69ba to 34ab891 Compare April 16, 2024 00:57

powerjg approved these changes Apr 16, 2024

Harshil2107 previously approved these changes Apr 16, 2024

powerjg previously approved these changes Apr 17, 2024

abmerop dismissed stale reviews from powerjg and Harshil2107 via d49cb7e April 18, 2024 18:05

abmerop force-pushed the ubuntu2204-rocm6 branch from 34ab891 to d49cb7e Compare April 18, 2024 18:05

abmerop changed the title ~~resources: Update GPUFS disk to ROCm 6.0.2, HCL format~~ resources: Update GPUFS disk to ROCm 6.1, HCL format Apr 18, 2024

Harshil2107 approved these changes Apr 25, 2024

abmerop merged commit b9492fd into gem5:stable Apr 25, 2024

abmerop deleted the ubuntu2204-rocm6 branch April 25, 2024 17:53

ivanaamit mentioned this pull request May 9, 2024

Build ROCm disk image gem5/gem5#1119

Closed
