BLAS: address banded matrix storage (gbmv, tbmv, etc.) #24872

Draft · wants to merge 2 commits into main

Conversation

@ninivert (Contributor) commented Apr 17, 2024

This PR attempts to address the complexities of the BLAS banded-matrix routine wrappers (general banded matrix-vector product gbmv, triangular banded matrix-vector product tbmv).

Context

gbmv (resp. tbmv) computes the matrix-vector product y = Ax, where A is a general banded (resp. triangular banded) matrix of shape m,n, x is a vector of length n, and y is a vector of length m.
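
For reference, the full operations as defined by the reference BLAS (this PR only concerns how the storage of A is handled) are

$$y \leftarrow \alpha \, \mathrm{op}(A) \, x + \beta \, y \quad (\texttt{gbmv}), \qquad x \leftarrow \mathrm{op}(A) \, x \quad (\texttt{tbmv}),$$

where op(A) is A, its transpose, or its conjugate transpose depending on trans, and for tbmv the matrix A is square and triangular banded.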

General banded storage

A banded matrix A of shape m,n, with ku upper bands (super-diagonals) and kl lower bands (sub-diagonals), can be represented in general as (adapted from the oneAPI spec docs):

$$\begin{split}A = \left[\begin{smallmatrix} A_{11} & A_{12} & A_{13} & \ldots & A_{1,ku+1} & * & \ldots & \ldots & \ldots & \ldots & \ldots & * \\\ A_{21} & A_{22} & A_{23} & A_{24} & \ldots & A_{2,ku+2} & * & \ldots & \ldots & \ldots & \ldots & * \\\ A_{31} & A_{32} & A_{33} & A_{34} & A_{35} & \ldots & A_{3,ku+3} & * & \ldots & \ldots & \ldots & * \\\ \vdots & A_{42} & A_{43} & \ddots & \ddots & \ddots & \ddots & \ddots & * & \ldots & \ldots & \vdots \\\ A_{kl+1,1} & \vdots & A_{53} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & * & \ldots & \vdots \\\ * & A_{kl+2,2} & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\\ \vdots & * & A_{kl+3,3} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & * \\\ \vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{n-ku,n}\\\ \vdots & \vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\\ \vdots & \vdots & \vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{m-2,n} \\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{m-1,n} \\\ * & * & * & \ldots & \ldots & \ldots & * & A_{m,m-kl} & \ldots & A_{m,n-2} & A_{m,n-1} & A_{m,n} \end{smallmatrix}\right]\end{split}$$

In banded storage, the matrix A is stored in a contiguous array as (see the documentation for ?gbmv routines):

$$\scriptscriptstyle a = [\underbrace{ \underbrace{\underbrace{*,...,*}_\text{kl},A_{11}, A_{12},...,A_{1,min(ku+1,n)},*,...,*}_\text{lda}, \underbrace{\underbrace{*,...,*}_\text{kl-1},A_{2,max(1,2-kl)},...,A_{2,min(ku+2,n)},*,...*}_\text{lda}, ..., \underbrace{\underbrace{*,...,*}_\text{max(0,kl-m+1)},A_{m,max(1,m-kl)},...,A_{m,min(ku+m,n)},*,...*}_\text{lda} }_\text{lda $\times$ m}]$$ $$\scriptscriptstyle a = [\underbrace{ \underbrace{\underbrace{*,...,*}_\text{ku},A_{11}, A_{12},...,A_{min(kl+1,m),1},*,...,*}_\text{lda}, \underbrace{\underbrace{*,...,*}_\text{ku-1},A_{max(1,2-ku),2},...,A_{min(kl+2,m),2},*,...*}_\text{lda}, ..., \underbrace{\underbrace{*,...,*}_\text{max(0,ku-n+1)},A_{max(1,n-ku),n},...,A_{min(kl+n,m),n},*,...*}_\text{lda} }_\text{lda $\times$ n}]$$

Elements marked * are not referenced by the ?gbmv routines and may hold arbitrary values (e.g. zero).

Example

For example, consider the matrix A with m=9, n=8, ku=3 and kl=2 (adapted from: IBM docs):

        | 11  12  13  14   0   0   0   0 |
        | 21  22  23  24  25   0   0   0 |
        | 31  32  33  34  35  36   0   0 |
        |  0  42  43  44  45  46  47   0 |
    A = |  0   0  53  54  55  56  57  58 |
        |  0   0   0  64  65  66  67  68 |
        |  0   0   0   0  75  76  77  78 |
        |  0   0   0   0   0  86  87  88 |
        |  0   0   0   0   0   0  97  98 |

Column-major layout

In the column-major layout, the matrix A is stored as:

    *   *   *   11  21  31  *   *  12  22  32  42  *   13   ..  97  58  68  78  88  98  *
    |      AB(.., 1)        |      AB(.., 2)       |                |      AB(.., n)    |
     <-    LDA elements   ->
     <------------------------------- n x LDA elements -------------------------------->

The column-major layout can be interpreted as a matrix AB_col of shape LDA,n (with LDA=kl+ku+1) in a column-major language (e.g. Fortran), where the bands are arranged row-wise, from the upper bands to the lower bands.

         |  *   *   *  14  25  36  47  58 |
         |  *   *  13  24  35  46  57  68 |
AB_col = |  *  12  23  34  45  56  67  78 |
         | 11  22  33  44  55  66  77  88 |  <- main diagonal
         | 21  32  43  54  65  76  87  98 |
         | 31  42  53  64  75  86  97   * |

Interpreted in a row-major language (e.g. C or Chapel), the same memory is read as the transpose AB_col^t, of shape n,LDA:

           |  *   *   *  11  21  31 |
           |  *   *  12  22  32  42 |
           |  *  13  23  33  43  53 |
AB_col^t = | 14  24  34  44  54  64 |
           | 25  35  45  55  65  75 |
           | 36  46  56  66  76  86 |
           | 47  57  67  77  87  97 |
           | 58  68  78  88  98  *  |
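
In index form, this column-major banded storage follows the standard BLAS convention

$$AB^{col}_{ku+1+i-j,\; j} = A_{i,j} \quad \text{for} \quad \max(1,\, j-ku) \le i \le \min(m,\, j+kl),$$

and the transpose AB_col^t read from a row-major language simply swaps the two indices, AB_col^t(j, ku+1+i-j) = A(i, j). This is the mapping that the Order.Col overload of bandedStorage in the code demo below builds directly.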

Row-major layout

In row-major layout, the matrix A is stored as:

      *   *  11  12  13  14   *  21  22  ..   *   *   97  98  *   *   *   *
      |      AB(1, ..)        |                       |    AB(m, ..)      |
       <-    LDA elements   ->
       <--------------------------- m x LDA elements -------------------->

The row-major layout can be interpreted as a matrix AB_row of shape m,LDA (with LDA=kl+ku+1) in a row-major language, where the bands are arranged column-wise, from the lower bands to the upper bands:

                     main diagonal
                     v
           | *   *   11  12  13  14 |
           | *   21  22  23  24  25 |
           | 31  32  33  34  35  36 |
           | 42  43  44  45  46  47 |
  AB_row = | 53  54  55  56  57  58 |
           | 64  65  66  67  68   * |
           | 75  76  77  78   *   * |
           | 86  87  88   *   *   * |
           | 97  98   *   *   *   * |

The row-major layout, interpreted in a column-major language, is read as the transpose AB_row^t of shape LDA,m.
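
In index form, the row-major banded storage corresponds to

$$AB^{row}_{i,\; kl+1+j-i} = A_{i,j} \quad \text{for} \quad \max(1,\, i-kl) \le j \le \min(n,\, i+ku),$$

which is the mapping that the Order.Row overload of bandedStorage in the code demo below implements.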

BLAS routine wrappers

In modules/packages/BLAS.chpl, the procedure gbmv wraps the usage of cblas_?gbmv by providing automatic detection of the array shapes.

Detection of LDA is done with:

var _ldA = getLeadingDim(A, order);

// ...

inline proc getLeadingDim(A: [?Adom], order : Order) : c_int {
  require header;
  if order==Order.Row then
    return Adom.dim(1).size : c_int;
  else
    return Adom.dim(0).size : c_int;
}

Due to the signature of the gbmv wrapper:

proc gbmv(A : [?Adom] ?eltType,
          X : [?Xdom] eltType, Y : [?Ydom] eltType,
          alpha: eltType, beta: eltType,
          kl : int = 0, ku : int = 0,
          trans : Op =  Op.N,
          order : Order = Order.Row, incx : c_int = 1, incy : c_int = 1)
          where (Adom.rank == 2) && (Xdom.rank==1) && (Ydom.rank == 1)

a user is forced to pass a matrix (Adom.rank == 2) as A, which means they pass:

  • AB_col^t of shape n,LDA if order == Order.Col
  • AB_row of shape m,LDA if order == Order.Row

Proposed solution

This means that LDA = kl+ku+1 is always the size of the last dimension (whereas for order == Order.Col the current getLeadingDim returns n instead of LDA), so the wrapper can simply use:

var _ldA = Adom.dim(1).size : c_int;

The counterpoint to this solution is that LDA is then, semantically, no longer the "leading dimension" (number of rows) but the number of columns; one can argue, however, that the "leading dimension" is the number of rows in a column-major language and the number of columns in a row-major language.
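
If it helps readability, the proposal could also be expressed as a dedicated helper; a minimal sketch follows (the name getBandedLeadingDim is hypothetical and not part of the current module; c_int comes from CTypes / the surrounding BLAS module):

// Hypothetical helper for the banded wrappers (sketch only): the storage
// matrix passed by the user always carries lda = kl + ku + 1 on its second
// dimension, regardless of `order`:
//   order == Order.Row: A is AB_row   of shape m,lda
//   order == Order.Col: A is AB_col^t of shape n,lda
inline proc getBandedLeadingDim(A: [?Adom]) : c_int
    where Adom.rank == 2 {
  return Adom.dim(1).size : c_int;
}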

Alternative solutions

  1. In order to be consistent with the "math" point of view (col-layout interpreted by col-language, row-layout interpreted by row-language), the user passes AB_col with order == Order.Col and AB_row with order == Order.Row. In this case _ldA = getLeadingDim(A, order) is correct, but the matrix AB_col needs to be transposed before the cblas_?gbmv call so that it has the correct layout in memory, which might be expensive.

  2. Force the user to pass a one-dimensional array to gbmv. The user is then responsible for the index computations, and LDA should also be provided explicitly (a sketch of the caller-side packing is given after this list). This somewhat defeats the purpose of having a wrapper in the first place, though.
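
For illustration, here is a sketch of the packing a caller would have to do under alternative 2 (row-major layout only; packBandedRowMajor1D is hypothetical and not part of this PR):

// Hypothetical caller-side helper (sketch only): flatten the row-major banded
// storage of A into a 1D array with lda = kl + ku + 1 entries per row.
proc packBandedRowMajor1D(A: [?D] ?T, kl: int, ku: int) {
  const m = D.dim(0).size, n = D.dim(1).size, lda = kl + ku + 1;
  var ab: [0..#(m*lda)] T;
  for i in 1..m do
    for j in max(1, i-kl)..min(n, i+ku) do
      ab((i-1)*lda + (kl + j - i)) = A(i, j);  // same mapping as AB_row above
  return ab;
}

The caller would then pass ab together with m, n, kl, ku and lda = kl+ku+1 to such a 1D overload.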

Code demo

With the proposed changes:

// file: bandedformat.chpl
use Random only fillRandom;
import LinearAlgebra, BLAS;

config const n = 6, m = 7, kl = 2, ku = 3;
config const seed = 42;
config const epsilon = 1e-13;
config const verbose = true;

var A: [1..m, 1..n] real;
fillRandom(A, seed);
A = floor(9*A + 1);
makeBanded(A, kl, ku);
if verbose then writeln("A=\n", A);

var x: [1..n] real;
fillRandom(x, seed+1);
x = floor(9*x + 1);
if verbose then writeln("x=\n", x);

var y: [1..m] real = LinearAlgebra.dot(A, x);
if verbose then writeln("y=\n", y);


for param order in (BLAS.Order.Row..BLAS.Order.Col) {
  writeln("-- order = ", order);

  var AB = bandedStorage(A, kl, ku, order);
  if verbose then writeln("AB_", order, if order == BLAS.Order.Col then "^t" else "", "=\n", AB);

  var y_: [1..m] real;
  BLAS.gbmv(AB, x, y_, 1.0, 0.0, kl, ku, BLAS.Op.N, order);

  if verbose then writeln("y_=\n", y_);
  writeln("correct? ", && reduce [d in abs(y - y_)] d < epsilon);
}


proc makeBanded(ref A: [?D] ?T, kl: int, ku: int) {
  // zero out every element of A outside the band (kl sub-diagonals, ku super-diagonals)
  const m = D.dim(0).size;
  for (i,j) in D do
    A(i,j) = if (max(1,j-ku) <= i && i <= min(m, j+kl)) then A(i,j) else 0.0: T;
}

proc bandedStorage(A: [?D] ?T, kl: int, ku: int, param order: BLAS.Order)
  where order == BLAS.Order.Row {
  const m = D.dim(0).size, n = D.dim(1).size, lda = kl+ku+1;
  var AB: [1..m, 1..lda] T;
  // adapted from: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/cblas-gbmv.html
  for i in 1..m {
    var k = kl + 1 - i;
    for j in max(1, i-kl)..min(n, i+ku) do AB(i, k+j) = A(i, j);
  }
  return AB;
}

proc bandedStorage(A: [?D] ?T, kl: int, ku: int, param order: BLAS.Order)
  where order == BLAS.Order.Col {
  const m = D.dim(0).size, n = D.dim(1).size, lda = kl+ku+1;
  var AB: [1..n, 1..lda] T;
  // adapted from: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/cblas-gbmv.html
  for j in 1..n {
    var k = ku + 1 - j;
    for i in max(1, j-ku)..min(m, j+kl) do AB(j, k+i) = A(i, j);
  }
  return AB;
}

Output (compiled with -lopenblas):

./bandedformat --n 6 --m 7 --kl 2 --ku 3
A=
3.0 9.0 4.0 9.0 0.0 0.0
5.0 5.0 1.0 3.0 7.0 0.0
1.0 3.0 2.0 7.0 7.0 7.0
0.0 9.0 9.0 1.0 3.0 2.0
0.0 0.0 6.0 7.0 6.0 9.0
0.0 0.0 0.0 2.0 2.0 9.0
0.0 0.0 0.0 0.0 2.0 5.0
x=
9.0 2.0 6.0 4.0 2.0 5.0
y=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
-- order = Row
AB_Row=
0.0 0.0 3.0 9.0 4.0 9.0
0.0 5.0 5.0 1.0 3.0 7.0
1.0 3.0 2.0 7.0 7.0 7.0
9.0 9.0 1.0 3.0 2.0 0.0
6.0 7.0 6.0 9.0 0.0 0.0
2.0 2.0 9.0 0.0 0.0 0.0
2.0 5.0 0.0 0.0 0.0 0.0
m=7 n=6 kl=2 ku=3 ldA=6 incx=1 incy=1
y_=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
correct? true
-- order = Col
AB_Col^t=
0.0 0.0 0.0 3.0 5.0 1.0
0.0 0.0 9.0 5.0 3.0 9.0
0.0 4.0 1.0 2.0 9.0 6.0
9.0 3.0 7.0 1.0 7.0 2.0
7.0 7.0 3.0 6.0 2.0 2.0
7.0 2.0 9.0 9.0 5.0 0.0
m=7 n=6 kl=2 ku=3 ldA=6 incx=1 incy=1
y_=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
correct? true

Next steps

  1. discuss which solution is more appropriate
  2. implement the same changes to triangular banded storage
  3. implement tests for the banded routines in test/library/BLAS/test_blas2.chpl
  4. update documentation

@jeremiah-corrado (Contributor) commented:

Hi @ninivert, thanks for the contribution and the detailed explanation!

I agree with your proposed direction. WRT the alternatives:

In this case _ldA = getLeadingDim(A, order) is correct, but the matrix AB_col needs to be transposed before the cblas_?gbmv call so that it has the correct layout in memory, which might be expensive.

I think you're right that this would be overly computationally expensive, so I'd rather stray a bit from the purely mathematical approach. That said, I think it will be very important to clearly document that gbmv accepts either the banded storage of the matrix for row-major order, or the transpose of the banded storage for col-major order.
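
For example, the gbmv doc comment could state something along these lines (wording is only a suggestion, not final text):

/* Sketch of possible doc wording (suggestion only):
   `A` must be given in banded storage:
   - with `order = Order.Row`, pass the row-major banded storage AB_row,
     of shape m, kl+ku+1;
   - with `order = Order.Col`, pass the transpose of the column-major banded
     storage, AB_col^t, of shape n, kl+ku+1. */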

Force the user to pass a one-dimensional array to gbmv

It may be useful to have other overloads of gbmv and tbmv that do this in the future, but yeah, I think the more convenient wrappers should be kept essentially as-is.

I think you're good to move on to the next steps you've outlined above. Once you have tests for these functions, I'll double check that they pass on my system.
