
Optimize spmatrix._set_many #7888

Merged 3 commits into cupy:main on Oct 3, 2023
Conversation

loganbvh (Contributor)

This PR implements the change proposed in #7876.
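
For context, the heart of the change, condensed from the diff discussed below (a sketch, not the full method):

# Old (main): membership test runs directly on a device array
if -1 not in offsets:
    self.data[offsets] = x
    return

# New (as first proposed in this PR): build a boolean mask once and reuse it
mask = offsets > -1
self.data[offsets[mask]] = x[mask]
if mask.all():
    return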

Performance tests

Run with CuPy 11.0.0 on Google Colab because I don't have a local GPU.

Update only, no insert

def test_update_only(size, nnz):

  import numpy as np
  import cupy
  import cupyx

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)
  
  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    vals = cupy.roll(new_vals, -1)
    mat_cupy_old._set_many(rows, cols, vals)
    return mat_cupy_old.get()

  def run_new():
    vals = cupy.roll(new_vals, -1)
    # _set_many is the new implementation proposed in this PR
    # (shown later in the thread)
    _set_many(mat_cupy_new, rows, cols, vals)
    return mat_cupy_new.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  %timeit mat_scipy_old = run_old()

  print("New _set_many:")
  %timeit mat_scipy_new = run_new()

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results:

size = 1000, nnz = 100
Old _set_many:
8.06 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
1.84 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size = 10000, nnz = 1000
Old _set_many:
74.9 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
New _set_many:
2.05 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size = 100000, nnz = 1000
Old _set_many:
73.4 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
New _set_many:
2.1 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 100000, nnz = 10000
Old _set_many:
647 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New _set_many:
2.17 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Insert only, no update

I expect very little performance difference here, because this line

if -1 not in offsets:

will immediately find a -1 in the offsets when every requested entry is an insert.
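
As a minimal standalone illustration of that short-circuit (a sketch, not code from the PR):

import cupy

# In the all-insert case, _get_arrayXarray returns the not_found_val
# of -1 for every requested (i, j):
offsets = cupy.full(10000, -1, dtype=cupy.int32)

# `in` on a cupy.ndarray appears to fall back to element-by-element
# iteration, but offsets[0] already matches, so the scan stops after
# a single device synchronization instead of one per element.
print(-1 in offsets)  # True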

def test_insert_only(size, nnz):

  import numpy as np
  import cupy
  import cupyx

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)

  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    mat = cupyx.scipy.sparse.csr_matrix((size, size), dtype=old_vals.dtype)
    mat._set_many(rows, cols, new_vals)
    return mat.get()

  def run_new():
    mat = cupyx.scipy.sparse.csr_matrix((size, size), dtype=old_vals.dtype)
    _set_many(mat, rows, cols, new_vals)
    return mat.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  %timeit mat_scipy_old = run_old()

  print("New _set_many:")
  %timeit mat_scipy_new = run_new()

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results: old and new are the same to within the noise/reproducibility of the test.

size = 1000, nnz = 100
Old _set_many:
5.47 ms ± 696 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
4.36 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 10000, nnz = 1000
Old _set_many:
5.08 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.87 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


size = 100000, nnz = 1000
Old _set_many:
5.71 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.07 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

size = 10000, nnz = 10000
Old _set_many:
5.48 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New _set_many:
5.43 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@emcastillo self-assigned this on Sep 29, 2023
@emcastillo added the cat:performance (Performance in terms of speed or memory consumption) and prio:medium labels on Sep 29, 2023
@loganbvh (Contributor, Author)

@emcastillo how does this CI work? Are the tests actually running, or do they not start until triggered by something?

@emcastillo (Member)

CIs are triggered after a review by one of the maintainers.

@emcastillo (Member) commented on the diff:

mask = offsets > -1
self.data[offsets[mask]] = x[mask]

if mask.all():

This is still doing device synchronization, since you need to bring the value back to the host for the if comparison, so I think the execution time won't benefit from these changes.
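
For illustration, a minimal sketch of the synchronization being described (not code from the PR):

import cupy

mask = cupy.random.random(1_000_000) > 0.5

flag = mask.all()  # asynchronous: launches a reduction kernel and
                   # returns a 0-d device array
if flag:           # bool(flag) copies that scalar to the host,
                   # blocking until the kernel has finished
    pass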

@emcastillo (Member) commented Oct 2, 2023

Thanks a lot for the PR!

Sorry, can you try to benchmark the code using this?

https://docs.cupy.dev/en/stable/user_guide/performance.html

I am interested in seeing the GPU performance. Thanks

@loganbvh (Contributor, Author) commented Oct 2, 2023

@emcastillo thanks for the comments. I understand that there will still be a synchronization because of the if. I think that python_scalar not in cupy_ndarray must be doing something slow and linear in the array size (maybe copying values from the device to the host one by one for comparison?) to cause the performance difference I am seeing. See below.

Test function (run on Google Colab with T4 GPU)
import numpy as np
import cupy
import cupyx

def benchmark_update_only(size, nnz):

  rows = cupy.random.randint(0, size, nnz)
  cols = cupy.random.randint(0, size, nnz)
  old_vals = cupy.random.random(nnz)
  new_vals = cupy.random.random(nnz)

  mat_cupy_old = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))
  mat_cupy_new = cupyx.scipy.sparse.csr_matrix((old_vals, (rows, cols)))

  def run_old():
    vals = cupy.roll(new_vals, -1)
    mat_cupy_old._set_many(rows, cols, vals)
    return mat_cupy_old.get()

  def run_new():
    # _set_many() uses mask.all()
    vals = cupy.roll(new_vals, -1)
    _set_many(mat_cupy_new, rows, cols, vals)
    return mat_cupy_new.get()

  print(f"{size = }, {nnz = }")

  print("Old _set_many:")
  print(cupyx.profiler.benchmark(run_old, (), n_repeat=100))

  print("New _set_many:")
  print(cupyx.profiler.benchmark(run_new, (), n_repeat=100))

  mat_scipy_old = run_old()
  mat_scipy_new = run_new()

  assert np.array_equal(mat_scipy_old.indices, mat_scipy_new.indices)
  assert np.array_equal(mat_scipy_old.indptr, mat_scipy_new.indptr)
  assert np.allclose(mat_scipy_old.data, mat_scipy_new.data)
  return mat_scipy_old, mat_scipy_new

Results:

size = 1000, nnz = 100
Old _set_many:
run_old             :    CPU: 8825.070 us   +/-1310.078 (min: 7724.606 / max:16055.254) us     GPU-0: 8835.807 us   +/-1311.485 (min: 7733.472 / max:16068.256) us
New _set_many:
run_new             :    CPU: 1966.075 us   +/-380.427 (min: 1650.213 / max: 3323.051) us     GPU-0: 1975.862 us   +/-381.888 (min: 1658.272 / max: 3338.944) us

size = 10000, nnz = 1000
Old _set_many:
run_old             :    CPU:77947.207 us   +/-14357.633 (min:65536.332 / max:116974.177) us     GPU-0:77964.677 us   +/-14360.006 (min:65550.400 / max:116999.359) us
New _set_many:
run_new             :    CPU: 2164.446 us   +/-354.291 (min: 1912.754 / max: 4756.252) us     GPU-0: 2174.831 us   +/-356.572 (min: 1921.440 / max: 4793.504) us

size = 100000, nnz = 1000
Old _set_many:
run_old             :    CPU:77371.187 us   +/-14551.222 (min:64965.192 / max:121929.457) us     GPU-0:77391.704 us   +/-14553.185 (min:64981.827 / max:121949.310) us
New _set_many:
run_new             :    CPU: 2201.750 us   +/-283.191 (min: 1999.019 / max: 3904.617) us     GPU-0: 2213.568 us   +/-283.910 (min: 2010.080 / max: 3918.848) us

size = 100000, nnz = 10000
Old _set_many:
run_old             :    CPU:737002.206 us   +/-101138.287 (min:659308.547 / max:1129850.700) us     GPU-0:737029.525 us   +/-101140.965 (min:659333.740 / max:1129882.568) us
New _set_many:
run_new             :    CPU: 2391.390 us   +/-299.001 (min: 2131.012 / max: 4668.803) us     GPU-0: 2403.519 us   +/-299.872 (min: 2142.144 / max: 4689.088) us
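
For reference, here is a standalone micro-benchmark of just the membership test, isolating the hypothesis above (a sketch; the function names are mine and the behavior noted in the comments is inferred, not confirmed in this thread):

import cupy
import cupyx.profiler

offsets = cupy.random.randint(0, 100, 10000).astype(cupy.int32)

def contains_device():
    return -1 in offsets  # apparent element-wise fallback, one sync per element

def contains_mask():
    return bool((offsets > -1).all())  # one reduction kernel + one sync

def contains_host():
    return -1 in offsets.get()  # one bulk device-to-host copy, then a NumPy scan

for f in (contains_device, contains_mask, contains_host):
    print(cupyx.profiler.benchmark(f, (), n_repeat=100))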

@emcastillo (Member)

This seems like a significant improvement, let me kick the CIs.

@emcastillo (Member)

/test mini

@loganbvh (Contributor, Author) commented Oct 2, 2023

Actually, this looks to be just as fast as using mask.all(), maybe faster:

Note that the only change from main is if -1 not in offsets --> if -1 not in offsets.get(). It might be worth looking into how cupy.ndarray handles in and not in in Python.

def _set_many(self, i, j, x):
    """Sets value at each (i, j) to x
    Here (i,j) index major and minor respectively, and must not contain
    duplicate entries.
    """
    i, j, M, N = self._prepare_indices(i, j)
    x = cupy.array(x, dtype=self.dtype, copy=True, ndmin=1).ravel()

    new_sp = cupyx.scipy.sparse.csr_matrix(
        (cupy.arange(self.nnz, dtype=cupy.float32),
          self.indices, self.indptr), shape=(M, N))

    offsets = new_sp._get_arrayXarray(
        i, j, not_found_val=-1).astype(cupy.int32).ravel()

    if -1 not in offsets.get():  # This is the only line that is different from main
        # only affects existing non-zero cells
        self.data[offsets] = x
        return

    mask = offsets > -1
    self.data[offsets[mask]] = x[mask]
    # only insertions remain
    warnings.warn('Changing the sparsity structure of a '
                  '{}_matrix is expensive.'.format(self.format),
                  _base.SparseEfficiencyWarning)
    mask = ~mask
    i = i[mask]
    i[i < 0] += M
    j = j[mask]
    j[j < 0] += N
    self._insert_many(i, j, x[mask])
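
The .get() variant pays for a single bulk device-to-host copy of the int32 offsets array and then runs NumPy's C-level membership scan on the host. The most plausible explanation for the slow device-side in (an inference, not confirmed in this thread) is that cupy.ndarray does not implement __contains__, so Python falls back to iterating over the array element by element, forcing one device synchronization per element.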

Benchmark:

size = 1000, nnz = 100
Old _set_many:
run_old             :    CPU: 8188.918 us   +/-1384.732 (min: 7172.135 / max:13345.911) us     GPU-0: 8199.408 us   +/-1386.683 (min: 7181.216 / max:13361.408) us
New _set_many:
run_new             :    CPU: 1409.316 us   +/-53.622 (min: 1353.613 / max: 1714.726) us     GPU-0: 1418.421 us   +/-53.911 (min: 1362.432 / max: 1723.136) us

size = 10000, nnz = 1000
Old _set_many:
run_old             :    CPU:72163.210 us   +/-12805.408 (min:61641.605 / max:107873.061) us     GPU-0:72179.694 us   +/-12807.955 (min:61653.633 / max:107892.639) us
New _set_many:
run_new             :    CPU: 1851.025 us   +/-413.465 (min: 1626.933 / max: 4501.087) us     GPU-0: 1861.673 us   +/-414.309 (min: 1635.808 / max: 4515.264) us

size = 100000, nnz = 1000
Old _set_many:
run_old             :    CPU:81232.915 us   +/-24788.292 (min:61793.035 / max:181135.444) us     GPU-0:81253.433 us   +/-24791.356 (min:61808.414 / max:181161.377) us
New _set_many:
run_new             :    CPU: 1918.286 us   +/-503.328 (min: 1617.062 / max: 5732.157) us     GPU-0: 1930.703 us   +/-504.159 (min: 1637.344 / max: 5749.920) us

size = 100000, nnz = 10000
Old _set_many:
run_old             :    CPU:689913.079 us   +/-130151.154 (min:609431.234 / max:1441613.560) us     GPU-0:689948.436 us   +/-130155.355 (min:609462.891 / max:1441665.771) us
New _set_many:
run_new             :    CPU: 1951.146 us   +/-264.610 (min: 1703.771 / max: 3905.877) us     GPU-0: 1963.304 us   +/-265.588 (min: 1714.656 / max: 3926.432) us

@emcastillo (Member) left a review:

LGTM! Nice find.

@emcastillo merged commit 199c616 into cupy:main on Oct 3, 2023 (52 of 53 checks passed)
@loganbvh deleted the spmatrix-setitem branch on Oct 3, 2023
@asi1024 added this to the v13.0.0rc1 milestone on Oct 6, 2023
Labels: cat:performance (Performance in terms of speed or memory consumption), prio:medium

3 participants