
Support batched QR solver #5583

Merged
merged 13 commits into from
Aug 6, 2021

Conversation

leofang
Member

@leofang leofang commented Jul 30, 2021

Close #4986.

To prepare for the upcoming NumPy 1.22 (numpy/numpy#19151) and the Array API standard.

UPDATE: I made some educated guesses about zero-size input, but until NumPy 1.22 is out and we can actually test all corner cases, their validity remains to be seen. As a result, I require those tests to be run on NumPy 1.22+, and added an experimental warning when batching is in use.

TODO:

  • HIP
  • tests
  • docstring
  • benchmark
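For reference, the batched semantics being targeted (per NumPy 1.22+ and the Array API) can be sketched against plain NumPy. `batched_qr` below is a hypothetical helper for illustration, not part of this PR: QR is applied independently to each matrix in the leading (batch) dimensions.

```python
import numpy as np

def batched_qr(a):
    # Hypothetical reference for the batched semantics: apply
    # np.linalg.qr to every matrix in the leading (batch) dimensions.
    batch_shape = a.shape[:-2]
    m, n = a.shape[-2:]
    k = min(m, n)
    flat = a.reshape(-1, m, n)
    qs = np.empty((flat.shape[0], m, k), dtype=a.dtype)
    rs = np.empty((flat.shape[0], k, n), dtype=a.dtype)
    for i in range(flat.shape[0]):
        qs[i], rs[i] = np.linalg.qr(flat[i])  # 'reduced' mode by default
    return qs.reshape(*batch_shape, m, k), rs.reshape(*batch_shape, k, n)

a = np.random.rand(4, 5, 3)  # batch of 4 matrices, each 5x3
q, r = batched_qr(a)
print(q.shape, r.shape)      # (4, 5, 3) (4, 3, 3)
assert np.allclose(q @ r, a)
```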

@leofang
Member Author

leofang commented Aug 1, 2021

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit 777bd74, target branch master) succeeded!

@leofang
Member Author

leofang commented Aug 2, 2021

I looked at the perf numbers and compared the execution times of np.linalg.qr() and cp.linalg.qr() for the same problem sizes. The baseline is a single matrix (i.e., no batching); the speedup from using the GPU is expected to remain roughly constant as the batch size increases, which is indeed observed:

Script

import numpy as np
import cupy as cp
from cupy._core.internal import prod
from cupyx.time import repeat
        

s = [(256, 256), (4, 256, 256), (16, 256, 256), (64, 256, 256), (256, 256, 256)]
t = [np.float32, np.complex64]

def np_timing(a):
    # NumPy < 1.22 has no batched QR, so loop over the batch in Python.
    out = []
    batch_size = prod(a.shape[:-2])
    a = a.reshape(batch_size, *a.shape[-2:])
    for i in range(batch_size):
        out.append(np.linalg.qr(a[i]))
    return out

for dtype in t:
    for shape in s:
        a = np.random.random(shape).astype(dtype)
        perfs = []
        for xp in (np, cp):
            a = xp.asarray(a)
            if xp is np:
                func = np_timing
            else:
                func = cp.linalg.qr
            perf = repeat(func, (a,), n_repeat=20, name=f"{'numpy' if xp is np else 'cupy'} QR")
            perfs.append(perf)
        print(f'shape={shape}, dtype={dtype}, speedup (np/cp)={perfs[0].gpu_times.mean()/perfs[1].gpu_times.mean()}')

Output (CUDA 11.2 + 2080 Ti):

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=3.3447683143206732
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=3.774302833246912
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=4.879499521524501
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=4.90168403346067
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=5.117126872097228
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.524473737838519
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.926423133355091
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.952283613467871
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=8.367485490705524
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=8.241908668974554

In fact, the speedup increases slightly as the batch size increases, presumably because this PR loops in C while I looped np.linalg.qr() in Python. And, of course, in the batched cases the GPU is kept busy enough.
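The C-loop-vs-Python-loop overhead is visible even on the CPU alone. A minimal sketch (assuming NumPy 1.22+, where np.linalg.qr itself gained batching):

```python
import numpy as np
from timeit import timeit

a = np.random.rand(64, 32, 32)

# One np.linalg.qr call per matrix: per-call Python overhead adds up.
t_loop = timeit(lambda: [np.linalg.qr(m) for m in a], number=10)
# One call for the whole stack: the batch loop runs in C.
t_batch = timeit(lambda: np.linalg.qr(a), number=10)
print(f'Python loop: {t_loop:.4f}s, batched call: {t_batch:.4f}s')
```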

@leofang leofang marked this pull request as ready for review August 2, 2021 03:40
@leofang
Member Author

leofang commented Aug 2, 2021

Apparently rocSOLVER performs QR decomposition very badly even with a single matrix...Way slower than NumPy does lol

Output (ROCm 4.2.0 + Radeon VII):
(Note: this machine is the same one as in the CUDA test above.)

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.3170826395755196
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.4534171187704479
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5732453205566855
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.6151140636461172
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.6325455493813151
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.4376101582719772
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.7760912733517519
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.0252764644948935
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.122433502580071
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.1445024874706888

Output (ROCm 4.2.0 + MI50):

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.26306059511239
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.4122473466044876
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5134260967111005
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5482251047447738
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5646276030758934
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.4540793638368552
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.7977612377117496
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.0661199672883184
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.149365631729448
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.1547136400778477

cc: @amathews-amd

@leofang
Member Author

leofang commented Aug 2, 2021

Apparently rocSOLVER performs QR decomposition very badly even with a single matrix...Way slower than NumPy does lol

The CPU side did take quite some time, yes, but the GPU spent much longer:

shape=(256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU: 4791.347 us   +/-1556.343 (min: 4091.627 / max:10118.520) us     GPU-0: 4940.162 us   +/-1564.095 (min: 4236.147 / max:10296.607) us
shape=(256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU: 5329.389 us   +/-151.928 (min: 5164.621 / max: 5667.224) us     GPU-0:15758.160 us   +/-144.810 (min:15603.188 / max:16371.347) us
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:16846.287 us   +/-53.650 (min:16790.809 / max:17044.884) us     GPU-0:17002.876 us   +/-54.536 (min:16939.829 / max:17199.028) us
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU: 9575.828 us   +/-219.489 (min: 9388.597 / max:10263.316) us     GPU-0:37768.517 us   +/-154.657 (min:37598.297 / max:38397.892) us
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:67346.692 us   +/-232.117 (min:67154.150 / max:68185.524) us     GPU-0:67514.505 us   +/-232.858 (min:67325.073 / max:68358.833) us
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU:26223.060 us   +/-207.032 (min:25854.373 / max:26561.738) us     GPU-0:119712.383 us   +/-87.472 (min:119468.697 / max:119874.306) us
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:272685.892 us   +/-1427.016 (min:271298.504 / max:277643.158) us     GPU-0:272877.339 us   +/-1427.184 (min:271503.357 / max:277834.869) us
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU:94575.008 us   +/-260.714 (min:94224.461 / max:95177.299) us     GPU-0:445704.062 us   +/-220.978 (min:445277.405 / max:446132.355) us

@leofang
Member Author

leofang commented Aug 3, 2021

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit bb4bdbb, target branch master) succeeded!

@leofang
Copy link
Member Author

leofang commented Aug 3, 2021

The ROCm 4.2 CI will be fixed once the gpg key is updated.
https://gitter.im/cupy/devel-rocm?at=610953689484630efa1b6f20

@toslunar toslunar self-requested a review August 5, 2021 06:15
Member

@toslunar toslunar left a comment


LGTM

@toslunar toslunar added this to the v10.0.0b2 milestone Aug 6, 2021
@toslunar toslunar merged commit d9a668f into cupy:master Aug 6, 2021
@leofang leofang deleted the batched_qr branch August 6, 2021 12:06
toslunar added a commit to chainer-ci/cupy that referenced this pull request Aug 17, 2021
`_util._assert_rank2` in `qr` was moved by cupy#5583.
@kmaehashi kmaehashi mentioned this pull request Dec 2, 2021