
Optimize spmatrix._set_many #7888

Merged 3 commits into cupy:main on Oct 3, 2023
Conversation

loganbvh (Contributor)

This PR implements the change proposed in #7876.
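
For context, the heart of the change, condensed from the diff discussed below (a sketch, not the full method):

# Old (main): membership test runs directly on a device array
if -1 not in offsets:
    self.data[offsets] = x
    return

# New (as first proposed in this PR): build a boolean mask once and reuse it
mask = offsets > -1
self.data[offsets[mask]] = x[mask]
if mask.all():
    return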

Performance tests

Run with CuPy 11.0.0 on Google Colab because I don't have a local GPU.

Update only, no insert

def test_update_only(size, nnz):

  import numpy as np
  import cupy
  import cupyx

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)
  
  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    vals = cupy.roll(new_vals, -1)
    mat_cupy_old._set_many(rows, cols, vals)
    return mat_cupy_old.get()

  def run_new():
    vals = cupy.roll(new_vals, -1)
    # _set_many is the new implementation proposed in this PR
    # (shown later in the thread)
    _set_many(mat_cupy_new, rows, cols, vals)
    return mat_cupy_new.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  %timeit mat_scipy_old = run_old()

  print("New _set_many:")
  %timeit mat_scipy_new = run_new()

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results:

size = 1000, nnz = 100
Old _set_many:
8.06 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
1.84 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size = 10000, nnz = 1000
Old _set_many:
74.9 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
New _set_many:
2.05 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size = 100000, nnz = 1000
Old _set_many:
73.4 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
New _set_many:
2.1 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 100000, nnz = 10000
Old _set_many:
647 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New _set_many:
2.17 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Insert only, no update

I expect very little performance difference here, because this line

if -1 not in offsets:

will immediately find a -1 in the offsets when every requested entry is an insert.
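
As a minimal standalone illustration of that short-circuit (a sketch, not code from the PR):

import cupy

# In the all-insert case, _get_arrayXarray returns the not_found_val
# of -1 for every requested (i, j):
offsets = cupy.full(10000, -1, dtype=cupy.int32)

# `in` on a cupy.ndarray appears to fall back to element-by-element
# iteration, but offsets[0] already matches, so the scan stops after
# a single device synchronization instead of one per element.
print(-1 in offsets)  # True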

def test_insert_only(size, nnz):

  import numpy as np
  import cupy
  import cupyx

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)

  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    mat = cupyx.scipy.sparse.csr_matrix((size, size), dtype=old_vals.dtype)
    mat._set_many(rows, cols, new_vals)
    return mat.get()

  def run_new():
    mat = cupyx.scipy.sparse.csr_matrix((size, size), dtype=old_vals.dtype)
    _set_many(mat, rows, cols, new_vals)
    return mat.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  %timeit mat_scipy_old = run_old()

  print("New _set_many:")
  %timeit mat_scipy_new = run_new()

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results: old and new are the same to within the noise/reproducibility of the test.

size = 1000, nnz = 100
Old _set_many:
5.47 ms ± 696 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
4.36 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 10000, nnz = 1000
Old _set_many:
5.08 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.87 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


size = 100000, nnz = 1000
Old _set_many:
5.71 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.07 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 10000, nnz = 10000
Old _set_many:
5.48 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.43 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@emcastillo self-assigned this on Sep 29, 2023
@emcastillo added the cat:performance (Performance in terms of speed or memory consumption) and prio:medium labels on Sep 29, 2023
@loganbvh (Contributor, Author)

@emcastillo how does this CI work? Are the tests actually running, or do they not start until triggered by something?

@emcastillo (Member)

CIs are triggered after a review by one of the maintainers.

@emcastillo (Member) commented on the diff:

mask = offsets > -1
self.data[offsets[mask]] = x[mask]

if mask.all():

This is still doing device synchronization, since you need to bring the value back to the host for the if comparison, so I think the execution time won't benefit from these changes.
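
For illustration, a minimal sketch of the synchronization being described (not code from the PR):

import cupy

mask = cupy.random.random(1_000_000) > 0.5

flag = mask.all()  # asynchronous: launches a reduction kernel and
                   # returns a 0-d device array
if flag:           # bool(flag) copies that scalar to the host,
                   # blocking until the kernel has finished
    pass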

@emcastillo (Member) commented Oct 2, 2023

Thanks a lot for the PR!

Sorry, can you try to benchmark the code using this?

https://docs.cupy.dev/en/stable/user_guide/performance.html

I am interested in seeing the GPU performance. Thanks

@loganbvh (Contributor, Author) commented Oct 2, 2023

@emcastillo thanks for the comments. I understand that there will still be a synchronization because of the if. I think that python_scalar not in cupy_ndarray must be doing something slow and linear in the array size (maybe copying values from the device to the host one by one for comparison?) to cause the performance difference I am seeing. See below.

Test function (run on Google Colab with T4 GPU)
import numpy as np
import cupy
import cupyx

def benchmark_update_only(size, nnz):

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)

  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    vals = cupy.roll(new_vals, -1)
    mat_cupy_old._set_many(rows, cols, vals)
    return mat_cupy_old.get()

  def run_new():
    # _set_many() uses mask.all()
    vals = cupy.roll(new_vals, -1)
    _set_many(mat_cupy_new, rows, cols, vals)
    return mat_cupy_new.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  print(cupyx.profiler.benchmark(run_old, (), n_repeat=100))

  print("New _set_many:")
  print(cupyx.profiler.benchmark(run_new, (), n_repeat=100))

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results:

size = 1000, nnz = 100
Old _set_many:
run_old             :    CPU: 8825.070 us   +/-1310.078 (min: 7724.606 / max:16055.254) us     GPU-0: 8835.807 us   +/-1311.485 (min: 7733.472 / max:16068.256) us
New _set_many:
run_new             :    CPU: 1966.075 us   +/-380.427 (min: 1650.213 / max: 3323.051) us     GPU-0: 1975.862 us   +/-381.888 (min: 1658.272 / max: 3338.944) us

size = 10000, nnz = 1000
Old _set_many:
run_old             :    CPU:77947.207 us   +/-14357.633 (min:65536.332 / max:116974.177) us     GPU-0:77964.677 us   +/-14360.006 (min:65550.400 / max:116999.359) us
New _set_many:
run_new             :    CPU: 2164.446 us   +/-354.291 (min: 1912.754 / max: 4756.252) us     GPU-0: 2174.831 us   +/-356.572 (min: 1921.440 / max: 4793.504) us

size = 100000, nnz = 1000
Old _set_many:
run_old             :    CPU:77371.187 us   +/-14551.222 (min:64965.192 / max:121929.457) us     GPU-0:77391.704 us   +/-14553.185 (min:64981.827 / max:121949.310) us
New _set_many:
run_new             :    CPU: 2201.750 us   +/-283.191 (min: 1999.019 / max: 3904.617) us     GPU-0: 2213.568 us   +/-283.910 (min: 2010.080 / max: 3918.848) us

size = 100000, nnz = 10000
Old _set_many:
run_old             :    CPU:737002.206 us   +/-101138.287 (min:659308.547 / max:1129850.700) us     GPU-0:737029.525 us   +/-101140.965 (min:659333.740 / max:1129882.568) us
New _set_many:
run_new             :    CPU: 2391.390 us   +/-299.001 (min: 2131.012 / max: 4668.803) us     GPU-0: 2403.519 us   +/-299.872 (min: 2142.144 / max: 4689.088) us
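
For reference, here is a standalone micro-benchmark of just the membership test, isolating the hypothesis above (a sketch; the function names are mine and the behavior noted in the comments is inferred, not confirmed in this thread):

import cupy
import cupyx.profiler

offsets = cupy.random.randint(0, 100, 10000).astype(cupy.int32)

def contains_device():
    return -1 in offsets  # apparent element-wise fallback, one sync per element

def contains_mask():
    return bool((offsets > -1).all())  # one reduction kernel + one sync

def contains_host():
    return -1 in offsets.get()  # one bulk device-to-host copy, then a NumPy scan

for f in (contains_device, contains_mask, contains_host):
    print(cupyx.profiler.benchmark(f, (), n_repeat=100))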

@emcastillo (Member)

This seems like a significant improvement, let me kick the CIs.

@emcastillo (Member)

/test mini

@loganbvh (Contributor, Author) commented Oct 2, 2023

Actually, this looks to be just as fast as using mask.all(), maybe faster:

Note that the only change from main is if -1 not in offsets --> if -1 not in offsets.get(). It might be worth looking into how cupy.ndarray handles in and not in in Python.

def _set_many(self, i, j, x):
    """Sets value at each (i, j) to x
    Here (i,j) index major and minor respectively, and must not contain
    duplicate entries.
    """
    i, j, M, N = self._prepare_indices(i, j)
    x = cupy.array(x, dtype=self.dtype, copy=True, ndmin=1).ravel()

    new_sp = cupyx.scipy.sparse.csr_matrix(
        (cupy.arange(self.nnz, dtype=cupy.float32),
          self.indices, self.indptr), shape=(M, N))

    offsets = new_sp._get_arrayXarray(
        i, j, not_found_val=-1).astype(cupy.int32).ravel()

    if -1 not in offsets.get():  # This is the only line that is different from main
        # only affects existing non-zero cells
        self.data[offsets] = x
        return

    mask = offsets > -1
    self.data[offsets[mask]] = x[mask]
    # only insertions remain
    warnings.warn('Changing the sparsity structure of a '
                  '{}_matrix is expensive.'.format(self.format),
                  _base.SparseEfficiencyWarning)
    mask = ~mask
    i = i[mask]
    i[i < 0] += M
    j = j[mask]
    j[j < 0] += N
    self._insert_many(i, j, x[mask])
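
The .get() variant pays for a single bulk device-to-host copy of the int32 offsets array and then runs NumPy's C-level membership scan on the host. The most plausible explanation for the slow device-side in (an inference, not confirmed in this thread) is that cupy.ndarray does not implement __contains__, so Python falls back to iterating over the array element by element, forcing one device synchronization per element.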

Benchmark:

size = 1000, nnz = 100
Old _set_many:
run_old             :    CPU: 8188.918 us   +/-1384.732 (min: 7172.135 / max:13345.911) us     GPU-0: 8199.408 us   +/-1386.683 (min: 7181.216 / max:13361.408) us
New _set_many:
run_new             :    CPU: 1409.316 us   +/-53.622 (min: 1353.613 / max: 1714.726) us     GPU-0: 1418.421 us   +/-53.911 (min: 1362.432 / max: 1723.136) us

size = 10000, nnz = 1000
Old _set_many:
run_old             :    CPU:72163.210 us   +/-12805.408 (min:61641.605 / max:107873.061) us     GPU-0:72179.694 us   +/-12807.955 (min:61653.633 / max:107892.639) us
New _set_many:
run_new             :    CPU: 1851.025 us   +/-413.465 (min: 1626.933 / max: 4501.087) us     GPU-0: 1861.673 us   +/-414.309 (min: 1635.808 / max: 4515.264) us

size = 100000, nnz = 1000
Old _set_many:
run_old             :    CPU:81232.915 us   +/-24788.292 (min:61793.035 / max:181135.444) us     GPU-0:81253.433 us   +/-24791.356 (min:61808.414 / max:181161.377) us
New _set_many:
run_new             :    CPU: 1918.286 us   +/-503.328 (min: 1617.062 / max: 5732.157) us     GPU-0: 1930.703 us   +/-504.159 (min: 1637.344 / max: 5749.920) us

size = 100000, nnz = 10000
Old _set_many:
run_old             :    CPU:689913.079 us   +/-130151.154 (min:609431.234 / max:1441613.560) us     GPU-0:689948.436 us   +/-130155.355 (min:609462.891 / max:1441665.771) us
New _set_many:
run_new             :    CPU: 1951.146 us   +/-264.610 (min: 1703.771 / max: 3905.877) us     GPU-0: 1963.304 us   +/-265.588 (min: 1714.656 / max: 3926.432) us

@emcastillo (Member) left a review:

LGTM! Nice find.

@emcastillo merged commit 199c616 into cupy:main on Oct 3, 2023 (52 of 53 checks passed)
@loganbvh deleted the spmatrix-setitem branch on Oct 3, 2023
@asi1024 added this to the v13.0.0rc1 milestone on Oct 6, 2023
Labels: cat:performance (Performance in terms of speed or memory consumption), prio:medium

3 participants