Try GPU CI with cupy (DNM) #466

Draft
wants to merge 72 commits into gpu-tests

Conversation

jaimergp
Member

jaimergp commented Jan 23, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Same as #446

Issues:

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-webservice.

It appears you are making a pull request from a branch in your feedstock and not a fork. This procedure will generate a separate build for each push to the branch and is thus not allowed. See our documentation for more details.

Please close this pull request and remake it from a fork of this feedstock.

Have a great day!

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@jaimergp
Member Author

@leofang - seems to be working :D Can you take a look at the logs for 0d99872? I'll see what happens with qemu now.

@leofang
Member

leofang commented Jan 27, 2023

Oh wow @jaimergp many thanks for the test drive! It seems to work fine! (cc: @kmaehashi, we're testing GPU CI for conda-forge!)
https://github.com/conda-forge/cf-autotick-bot-test-package-feedstock/actions/runs/3998064331/jobs/6860216209#step:3:4910

Jaime, can we turn on tests? Even running a subset of GPU tests is a great improvement.

@kmaehashi
Member

Great, thank you @jaimergp for testing!

@leofang
Member

leofang commented Oct 19, 2023

btw test_callback.py failed because it needs libcufft-static and libcufft-dev
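
For local reproduction, a minimal sketch of pulling them into the test environment (assuming these are the conda-forge package names, as given above):

conda install -c conda-forge libcufft-dev libcufft-static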

@kmaehashi
Member

CuPy caches generated kernels on disk. They are stored at $CUPY_CACHE_DIR, so if you zip the artifacts and keep them somewhere (cloud or local storage), and unzip them when a fresh CI process starts, it should help. @kmaehashi might be able to share how it's done in the CuPy CI (I only know this and this are relevant bits).

Yes, basically that is all that we do 😃 In CuPy's CI, after running tests:

  1. Run an internal utility to trim the kernel cache. In our setup we keep the most-recently-used 3 GB (a generous estimate).
  2. Generate a tar archive of the kernel cache ($CUPY_CACHE_DIR, ~/.cupy by default) and ccache (~/.cache/ccache by default).
  3. Upload it to Google Cloud Storage.

And before running tests, we download the archive if it exists and expand it back in place.
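
A rough bash sketch of that round trip, assuming gsutil for the Google Cloud Storage step and the default cache locations mentioned above; the bucket name is a placeholder, and the cache-trimming step uses a CuPy-internal utility that is not shown:

CACHE_ARCHIVE=kernel-cache.tar.gz
BUCKET=gs://some-ci-cache-bucket   # placeholder bucket name

# Before tests: restore previous caches if an archive was uploaded earlier.
if gsutil -q stat "$BUCKET/$CACHE_ARCHIVE"; then
    gsutil cp "$BUCKET/$CACHE_ARCHIVE" .
    tar xzf "$CACHE_ARCHIVE" -C "$HOME"
fi

# ... run the test suite here ...

# After tests: archive the kernel cache (~/.cupy by default) and ccache
# (~/.cache/ccache by default), then upload the archive.
tar czf "$CACHE_ARCHIVE" -C "$HOME" .cupy .cache/ccache
gsutil cp "$CACHE_ARCHIVE" "$BUCKET/$CACHE_ARCHIVE"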

@jaimergp
Member Author

Thanks for the tips! My goal here is not so much to make everything pass, but at least to ensure that the UX is nice, and having the machine die mid-job is not a nice UX 😬 I'll try adding more deps and see if that passes, but in the meantime I wonder if this is a resource starvation issue. It passes on CUDA 11, but maybe CUDA 12 is heavier? How much disk/RAM are you using in your test boxes? Thanks!

@jaimergp
Member Author

Interesting, do we have any metrics on what the machine was doing before it stopped? Was there high memory usage or something else amiss?

Sadly it's an ephemeral VM and OpenStack doesn't offer any immediate way of keeping a history around as far as I can see, but it would be interesting to have some info, so I'll see what we can do.

@kmaehashi
Member

How much disk/RAM are you using in your test boxes?

We are running with 8 CPU cores & 20 GB of disk. As for RAM, we allocate 52 GB, but this includes RAM disk space, so I'm not sure how much is actually required for the tests themselves, unfortunately. That said, I guess the problem is elsewhere, as 99% of the tests have passed. How about adding -v to pytest, or trying a subset (e.g., cupy_tests only), to see what happens?
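
Something like the following, for instance (the test paths assume CuPy's source layout):

python -m pytest -v tests/cupy_tests    # core tests only, verbose output
python -m pytest -v tests/cupyx_tests   # then the cupyx tests separately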

@jaimergp
Member Author

Hm, these machines are:

CPU  RAM    Disk
4    12 GB  60 GB

@jaimergp
Member Author

OK, if I don't run cupyx_tests it doesn't error out badly. I have a VM with Grafana running the whole thing now, so we'll see how it goes :)

We still see cufft errors though, despite the packages being in meta.yaml:

In file included from /tmp/tmpmfodjtfc/tmpnl7smisg/cupy_callback_c05c60196d1dae6877fd7811c4f34c5b.cpp:771:
/tmp/tmpmfodjtfc/tmpnl7smisg/cupy_cufft.h:11:10: fatal error: cufft.h: No such file or directory
   11 | #include <cufft.h>
      |          ^~~~~~~~~
compilation terminated.
_ Test1dCallbacks.test_fft_load[_param_2_{n=None, norm=None, shape=(10, 10)}] __

@leofang
Member

leofang commented Oct 19, 2023

We still see cufft errors though, despite the packages being in meta.yaml:

I see. I think this is a package layout problem specific to conda. CuPy expects the headers to be found in $CUDA_PATH/include, so in the past we set $CUDA_PATH to $CONDA_PREFIX in the activation script. But with CUDA 12.0+ the CTK headers are only available in $CONDA_PREFIX/targets/x86_64-linux/include/, hence the missing headers.

We either have to patch CuPy or rework the conda package layout, and neither is a trivial task. I suggest we note this issue and move on.
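
To illustrate the mismatch (not a proposed fix): with the CUDA 12 conda packages the header lives under the targets directory, while CuPy resolves includes relative to $CUDA_PATH, so an activation-script workaround would have to look roughly like the untested lines below, and it would only cover headers, not libraries:

ls "$CONDA_PREFIX/targets/x86_64-linux/include/cufft.h"   # should exist once libcufft-dev is installed
export CUDA_PATH="$CONDA_PREFIX/targets/x86_64-linux"      # untested; addresses headers only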

@leofang
Member

leofang commented Oct 19, 2023

Also, the cuFFT callback support in CuPy was never expected to work with conda packages, since static libcufft and its headers were not shipped in the past. NVIDIA is working on a new solution that would give Python libraries like CuPy a lift here, so let's not be bothered by this 🙂

@jaimergp
Member Author

At 76% of the test suite, this is how it's looking:

tests/cupyx_tests/scipy_tests/signal_tests/test_signaltools.py ......... [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
.................................................s...................... [ 76%]

@jaimergp
Member Author

The VM test finished correctly, but it's true that we are dangerously close to the RAM limits:

[Grafana screenshot: memory usage close to the limit]

Since this is not using the GitHub runner (I ran the build manually), maybe it does OOM under the runner, and hence the issues we are seeing? What do you think, @aktech?

@jaimergp
Member Author

CUDA 12 logs

@aktech

aktech commented Oct 24, 2023

Since this is not using the GitHub runner (I ran the build manually), maybe it does OOM under the runner, and hence the issues we are seeing? What do you think, @aktech?

I think that does explain all the jobs that failed without apparent reason. GitHub's message about a lack of resources (memory/CPU) was right. We also have a larger flavor available with 16 GB RAM; if we know (or can find out) that it won't exceed that, then we can give that a try as well.

@jakirkham
Member

I don't think we know how much memory is used, so we probably need to collect more data first.

Perhaps it is worth running something like pytest-monitor to collect and analyze data about memory usage in tests. It stores results in a SQLite DB, which can be analyzed later, but we would need to persist that database as an artifact for retrieval once the job has ended.

Is there some way to store artifacts from completed CI runs in our setup here? Can we store results even if the job fails?
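
For what it's worth, a hedged sketch of how that data collection could look, assuming pytest-monitor's SQLite output (the --db flag and the TEST_METRICS / MEM_USAGE names match recent pytest-monitor releases but may differ between versions); the resulting .sqlite file is what would need to be persisted as a CI artifact:

python -m pip install pytest-monitor
python -m pytest --db ./pymon.sqlite tests/

# Afterwards, list the most memory-hungry tests (MEM_USAGE is reported in MB).
sqlite3 ./pymon.sqlite "SELECT ITEM, MEM_USAGE FROM TEST_METRICS ORDER BY MEM_USAGE DESC LIMIT 20;"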

@jaimergp
Member Author

I haven't tried adding CI artifacts yet, actually. That's a good point. After increasing the VM RAM to 16 GB it doesn't OOM. We are also investigating how we can protect the GHA runner from the OOM killer a bit, so that other processes are stopped instead of that one. That should alleviate the disconnection problems we've seen. If the runner process dies, there's nothing we can do about sending CI artifacts or running other "post-mortem" diagnosis steps via GHA.
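
One possible (untested) way to do that on the VM, assuming the runner's listener process is called Runner.Listener and we have root: a score of -1000 exempts a process from the OOM killer, so the kernel would reap the build/test processes instead.

# Exempt the GitHub Actions runner process from the OOM killer (assumed process name).
for pid in $(pgrep -f Runner.Listener); do
    echo -1000 | sudo tee "/proc/$pid/oom_score_adj"
done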

@jaimergp
Member Author

store_build_artifacts requires 7z and/or zip, but these are not present in the VM image right now. We'll add them in the next update.

@leofang
Member

leofang commented Oct 27, 2023

Q: I may have lost track of the chronological order of the commits & the comments above; was the OOM hit only when cupyx_tests was enabled?

@jaimergp
Member Author

Correct! There's a grafana plot a few messages above.

@leofang
Member

leofang commented Oct 27, 2023

Not sure if there's a memory leak; what if we run the cupyx tests in a separate process?
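
A hypothetical sketch of that idea, assuming CuPy's tests/ layout: run each cupyx subpackage in its own pytest process, so memory held by one run is released before the next starts.

set -e
# One pytest process per cupyx test subpackage (glob assumes CuPy's test layout).
for d in tests/cupyx_tests/*_tests; do
    python -m pytest "$d"
done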
