Skip to content

Conversation

MitchellAugustin
Copy link
Member

@MitchellAugustin MitchellAugustin commented Jul 7, 2025

Description

Some modern GPUs have BAR sizes that exceed the default QEMU pci64-hole-size allowance, or require additional host memory / CPU allocations to work properly in guest VMs when passed through.

This series enables users of the gpgpu provider to apply custom environment options to override these defaults, as-needed.

Users who do not specify these optional environment variables will see no change in behavior.

Resolved issues

Options required for passthrough to work correctly with large-BAR GPUs can now be applied by the user

Documentation

I have added a new unit test for the LXD.set_config() function, and for the additional arguments I added to LXDVM

Tests

This PR was verified on a DGX B200 (buzzsaw) with the following environment variables:

[environment]
QEMU_OPTS=" -global q35-pcihost.pci-hole64-size=8192G"
VM_RAM_MB=8192
VM_CPUS=8

This section was appended to the default ./test-gpgpu launcher. I then verified that the VM and container GPU passthrough tests both passed as expected when run with these parameters on that platform.

To verify that there were no regressions for other platforms, I loaded my branch on a platform with a consumer-grade GPU (luma), then ran the unmodified ./test-gpgpu and confirmed that all tests passed (as they had when this platform was certified originally) and no custom options were applied.

@codecov
Copy link

codecov bot commented Jul 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.63%. Comparing base (2dab338) to head (1e06546).
⚠️ Report is 115 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2007      +/-   ##
==========================================
+ Coverage   50.60%   50.63%   +0.02%     
==========================================
  Files         384      384              
  Lines       41180    41203      +23     
  Branches     7636     7639       +3     
==========================================
+ Hits        20841    20864      +23     
  Misses      19594    19594              
  Partials      745      745              
Flag Coverage Δ
checkbox-support 63.95% <100.00%> (+0.01%) ⬆️
provider-gpgpu 93.06% <100.00%> (+0.44%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MitchellAugustin
Copy link
Member Author

MitchellAugustin commented Jul 7, 2025

I'm not sure why codecov claims my tests are missing coverage for the arguments it highlighted in yellow. I tested to confirm that both the test I wrote in de7fa54, as well as the autogenerated test_parse_args() test are hitting those conditionals. (I did this by adding a write to a file on my local branch that saves the passed value within that conditional, and confirmed that they were as expected when I ran python3 manage.py test.)

Since it seems the coverage was actually already the same before de7fa54, just with mock values, let me know if you want me to remove de7fa54 or keep it as-is.

With that said, this should be ready for review.

@pedro-avalos pedro-avalos self-requested a review July 7, 2025 21:48
@pedro-avalos pedro-avalos self-assigned this Jul 7, 2025
@MitchellAugustin
Copy link
Member Author

Here are the passing tests on Buzzsaw:

Independent of my changes, the VM tests will hang if they run after the container tests due to the driver still being in use. (@pedro-avalos and I have discussed addressing this in a separate PR.)

Copy link
Collaborator

@pedro-avalos pedro-avalos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for all the research and effort you put into this. I've left some suggestions mainly for the arguments and argument-parsing.

@MitchellAugustin
Copy link
Member Author

Thanks for the review @pedro-avalos . I applied all suggestions and will retest shortly and update you here once the B200 is available.

pedro-avalos
pedro-avalos previously approved these changes Jul 8, 2025
Copy link
Collaborator

@pedro-avalos pedro-avalos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for adding those changes!

… gpgpu VM passthrough test

These settings are required for certain platforms
(such as those with large-BAR GPUs)
@MitchellAugustin
Copy link
Member Author

@pedro-avalos I cleaned up and signed all of the commits in the tree, and verified that everything is still working as described previously. Let me know if you need any more adjustments.

@pedro-avalos pedro-avalos merged commit e59adb9 into canonical:main Jul 8, 2025
35 checks passed
@MitchellAugustin
Copy link
Member Author

Thanks @pedro-avalos !

mreed8855 pushed a commit that referenced this pull request Jul 30, 2025
…fix) (#2007)

* Add LXD support for `lxc config`

* Add options to override QEMU settings and guest RAM+CPU allocation in gpgpu VM passthrough test

These settings are required for certain platforms
(such as those with large-BAR GPUs)
mreed8855 pushed a commit that referenced this pull request Jul 31, 2025
…fix) (#2007)

* Add LXD support for `lxc config`

* Add options to override QEMU settings and guest RAM+CPU allocation in gpgpu VM passthrough test

These settings are required for certain platforms
(such as those with large-BAR GPUs)
bladernr pushed a commit that referenced this pull request Aug 28, 2025
…fix) (#2007)

* Add LXD support for `lxc config`

* Add options to override QEMU settings and guest RAM+CPU allocation in gpgpu VM passthrough test

These settings are required for certain platforms
(such as those with large-BAR GPUs)
stanley31huang pushed a commit that referenced this pull request Oct 3, 2025
…fix) (#2007)

* Add LXD support for `lxc config`

* Add options to override QEMU settings and guest RAM+CPU allocation in gpgpu VM passthrough test

These settings are required for certain platforms
(such as those with large-BAR GPUs)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants