
Support batched QR solver #5583

Merged
merged 13 commits into from
Aug 6, 2021

Conversation

leofang
Member

@leofang leofang commented Jul 30, 2021

Close #4986.

To prepare for the upcoming NumPy 1.22 (numpy/numpy#19151) and the Array API standard.

UPDATE: I made some educated guesses about zero-size input, but until NumPy 1.22 is out and we can actually test all corner cases, their validity remains to be seen. As a result, I require those tests to be run on NumPy 1.22+, and added an experimental warning when batching is in use.

TODO:

  • HIP
  • tests
  • docstring
  • benchmark
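For reference, the batched semantics being targeted (per NumPy 1.22+ and the Array API) can be sketched against plain NumPy. `batched_qr` below is a hypothetical helper for illustration, not part of this PR: QR is applied independently to each matrix in the leading (batch) dimensions.

```python
import numpy as np

def batched_qr(a):
    # Hypothetical reference for the batched semantics: apply
    # np.linalg.qr to every matrix in the leading (batch) dimensions.
    batch_shape = a.shape[:-2]
    m, n = a.shape[-2:]
    k = min(m, n)
    flat = a.reshape(-1, m, n)
    qs = np.empty((flat.shape[0], m, k), dtype=a.dtype)
    rs = np.empty((flat.shape[0], k, n), dtype=a.dtype)
    for i in range(flat.shape[0]):
        qs[i], rs[i] = np.linalg.qr(flat[i])  # 'reduced' mode by default
    return qs.reshape(*batch_shape, m, k), rs.reshape(*batch_shape, k, n)

a = np.random.rand(4, 5, 3)  # batch of 4 matrices, each 5x3
q, r = batched_qr(a)
print(q.shape, r.shape)      # (4, 5, 3) (4, 3, 3)
assert np.allclose(q @ r, a)
```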

@leofang
Member Author

leofang commented Aug 1, 2021

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit 777bd74, target branch master) succeeded!

@leofang
Member Author

leofang commented Aug 2, 2021

I looked at the perf numbers and compared the execution times of np.linalg.qr() and cp.linalg.qr() for the same problem sizes. The baseline is a single matrix (i.e., no batching); the speedup from using the GPU is expected to remain roughly constant as the batch size increases, which is indeed observed:

Script

import numpy as np
import cupy as cp
from cupy._core.internal import prod
from cupyx.time import repeat
        

s = [(256, 256), (4, 256, 256), (16, 256, 256), (64, 256, 256), (256, 256, 256)]
t = [np.float32, np.complex64]

def np_timing(a):
    # NumPy < 1.22 has no batched QR, so loop over the batch in Python.
    out = []
    batch_size = prod(a.shape[:-2])
    a = a.reshape(batch_size, *a.shape[-2:])
    for i in range(batch_size):
        out.append(np.linalg.qr(a[i]))
    return out

for dtype in t:
    for shape in s:
        a = np.random.random(shape).astype(dtype)
        perfs = []
        for xp in (np, cp):
            a = xp.asarray(a)
            if xp is np:
                func = np_timing
            else:
                func = cp.linalg.qr
            perf = repeat(func, (a,), n_repeat=20, name=f"{'numpy' if xp is np else 'cupy'} QR")
            perfs.append(perf)
        print(f'shape={shape}, dtype={dtype}, speedup (np/cp)={perfs[0].gpu_times.mean()/perfs[1].gpu_times.mean()}')

Output (CUDA 11.2 + 2080 Ti):

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=3.3447683143206732
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=3.774302833246912
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=4.879499521524501
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=4.90168403346067
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=5.117126872097228
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.524473737838519
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.926423133355091
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=7.952283613467871
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=8.367485490705524
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=8.241908668974554

In fact, the speedup increases slightly as the batch size increases, presumably because this PR loops in C while I looped np.linalg.qr() in Python. And, of course, in the batched cases the GPU is kept busy enough.
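The C-loop-vs-Python-loop overhead is visible even on the CPU alone. A minimal sketch (assuming NumPy 1.22+, where np.linalg.qr itself gained batching):

```python
import numpy as np
from timeit import timeit

a = np.random.rand(64, 32, 32)

# One np.linalg.qr call per matrix: per-call Python overhead adds up.
t_loop = timeit(lambda: [np.linalg.qr(m) for m in a], number=10)
# One call for the whole stack: the batch loop runs in C.
t_batch = timeit(lambda: np.linalg.qr(a), number=10)
print(f'Python loop: {t_loop:.4f}s, batched call: {t_batch:.4f}s')
```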

@leofang leofang marked this pull request as ready for review August 2, 2021 03:40
@leofang
Member Author

leofang commented Aug 2, 2021

Apparently rocSOLVER performs QR decomposition very badly even with a single matrix...Way slower than NumPy does lol

Output (ROCm 4.2.0 + Radeon VII):
(Note: this machine is the same one as in the CUDA test above.)

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.3170826395755196
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.4534171187704479
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5732453205566855
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.6151140636461172
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.6325455493813151
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.4376101582719772
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.7760912733517519
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.0252764644948935
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.122433502580071
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.1445024874706888

Output (ROCm 4.2.0 + MI50):

shape=(256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.26306059511239
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.4122473466044876
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5134260967111005
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5482251047447738
shape=(256, 256, 256), dtype=<class 'numpy.float32'>, speedup (np/cp)=0.5646276030758934
shape=(256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.4540793638368552
shape=(4, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=0.7977612377117496
shape=(16, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.0661199672883184
shape=(64, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.149365631729448
shape=(256, 256, 256), dtype=<class 'numpy.complex64'>, speedup (np/cp)=1.1547136400778477

cc: @amathews-amd

@leofang
Member Author

leofang commented Aug 2, 2021

Apparently rocSOLVER performs QR decomposition very badly even with a single matrix...Way slower than NumPy does lol

The CPU side did take quite some time, yes, but the GPU spent much longer:

shape=(256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU: 4791.347 us   +/-1556.343 (min: 4091.627 / max:10118.520) us     GPU-0: 4940.162 us   +/-1564.095 (min: 4236.147 / max:10296.607) us
shape=(256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU: 5329.389 us   +/-151.928 (min: 5164.621 / max: 5667.224) us     GPU-0:15758.160 us   +/-144.810 (min:15603.188 / max:16371.347) us
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:16846.287 us   +/-53.650 (min:16790.809 / max:17044.884) us     GPU-0:17002.876 us   +/-54.536 (min:16939.829 / max:17199.028) us
shape=(4, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU: 9575.828 us   +/-219.489 (min: 9388.597 / max:10263.316) us     GPU-0:37768.517 us   +/-154.657 (min:37598.297 / max:38397.892) us
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:67346.692 us   +/-232.117 (min:67154.150 / max:68185.524) us     GPU-0:67514.505 us   +/-232.858 (min:67325.073 / max:68358.833) us
shape=(16, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU:26223.060 us   +/-207.032 (min:25854.373 / max:26561.738) us     GPU-0:119712.383 us   +/-87.472 (min:119468.697 / max:119874.306) us
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, time: numpy QR            :    CPU:272685.892 us   +/-1427.016 (min:271298.504 / max:277643.158) us     GPU-0:272877.339 us   +/-1427.184 (min:271503.357 / max:277834.869) us
shape=(64, 256, 256), dtype=<class 'numpy.float32'>, time: cupy QR             :    CPU:94575.008 us   +/-260.714 (min:94224.461 / max:95177.299) us     GPU-0:445704.062 us   +/-220.978 (min:445277.405 / max:446132.355) us

@leofang
Member Author

leofang commented Aug 3, 2021

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit bb4bdbb, target branch master) succeeded!

@leofang
Copy link
Member Author

leofang commented Aug 3, 2021

The ROCm 4.2 CI will be fixed once the gpg key is updated.
https://gitter.im/cupy/devel-rocm?at=610953689484630efa1b6f20

@toslunar toslunar self-requested a review August 5, 2021 06:15
Member

@toslunar toslunar left a comment


LGTM

@toslunar toslunar added this to the v10.0.0b2 milestone Aug 6, 2021
@toslunar toslunar merged commit d9a668f into cupy:master Aug 6, 2021
@leofang leofang deleted the batched_qr branch August 6, 2021 12:06
toslunar added a commit to chainer-ci/cupy that referenced this pull request Aug 17, 2021
`_util._assert_rank2` in `qr` was moved by cupy#5583.
@kmaehashi kmaehashi mentioned this pull request Dec 2, 2021