Implemented side parameter (resolves #1) #7

Open · wants to merge 9 commits into base: master

Conversation

@baldassarreFe (Contributor)

This PR includes:

  • C++ code for searchsorted with correct support of side=[left|right]
  • CUDA code for searchsorted with correct support of side=[left|right]
  • helper functions for broadcasting numpy arrays and pytorch tensors to be used in searchsorted
  • more unit tests to make sure all of the above work
  • minor changes in the README (how to check CUDA availability during installation, how to run tests, how to run the benchmark)
  • new benchmark script that uses the builtin timeit

Please check out the new implementation (the CUDA and C++ parts should be much more readable), run the unit tests, and let me know if you get similar times in the benchmark ;)
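
To illustrate the new parameter, here is a minimal usage sketch (not taken from the repository; it assumes the call signature exercised by the benchmark script further down, `searchsorted(a, v, out, side)`, and numpy-style `side` semantics):

```python
import torch
from torchsearchsorted import searchsorted

# Each row of `v` is searched in the corresponding sorted row of `a`.
a = torch.sort(torch.randn(5, 300), dim=1)[0]   # `a` must be sorted along dim 1
v = torch.randn(5, 100)
out = torch.empty(5, 100, dtype=torch.long)

# side='left':  smallest index i with a[row, i-1] <  value <= a[row, i]
searchsorted(a, v, out, side='left')
# side='right': smallest index i with a[row, i-1] <= value <  a[row, i]
searchsorted(a, v, out, side='right')
```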

@aliutkus (Owner)

Thank you very much, I'll review this shortly.

@aliutkus self-requested a review on November 25, 2019 at 07:07
@aliutkus (Owner) left a comment:

Dear @baldassarreFe,

It looks like it took you a lot of time to prepare this PR and I sincerely thank you very much for this.

I also took some time to review this carefully, and although I believe it should be merged soon, I cannot accept the PR as it is now for several reasons:

1. The new version is slower

Proposed version:

Benchmark searchsorted:

  • a [5000 x 300]
  • v [5000 x 100]
  • reporting fastest time of 20 runs
  • each run executes searchsorted 100 times

Numpy: 4.360683292150497
CPU: 12.054927673190832
CUDA: 0.00202458119019866

Previous version:

Benchmark searchsorted:

  • a [5000 x 300]
  • v [5000 x 100]
  • reporting fastest time of 20 runs
  • each run executes searchsorted 100 times

Numpy: 4.632954204920679
CPU: 5.094183708075434
CUDA: 0.0007853466086089611

(exactly the same machine and setup)
As you can see, the PR makes both the CPU and the CUDA versions roughly 2.5x slower. Since this searchsorted operation is typically a very low-level one that is called intensively (at least in my own work), I can't really accept the slow-down.

2. Details in the code

I will mention them in the code directly

Thanks for everything.

Comment on lines +28 to +30
```bash
pip install -v .
```
@aliutkus (Owner):

Your modification removes the very important information that nvcc sometimes needs versions of gcc and g++ older than the system defaults, so your installation procedure will not work in that case.

@baldassarreFe (Contributor, PR author):

All right, sorry for removing that disclaimer. I tested on my machine and on a few servers we have here and did not notice any issue; that's why it felt redundant to me. Also, g++ usually ships alongside the matching gcc version, and I haven't seen cases where the two differ.

Anyway, let's just restore that warning so that no one runs into problems ;)

Comment on lines +36 to +40
The verbose flag `-v` is not mandatory, but it will print whether the installer was able to find `nvcc` and install the CUDA version of `torchsearchsorted`.
If you're having problems with the installation, make sure `nvcc` and `gcc` are installed and available in your path, e.g.:
```bash
export PATH="/usr/local/cuda/bin:${PATH}"
export CPATH="/usr/local/cuda/include:${CPATH}"
```
@aliutkus (Owner):

This is not enough to have gcc and g++ in the path. What is actually required is for nvcc to find versions of gcc and g++ that it supports. For me, I had to do the following:

    sudo apt-get install g++-8 gcc-8
    sudo ln -s /usr/bin/gcc-8 /usr/local/cuda-10.1/bin/gcc
    sudo ln -s /usr/bin/g++-8 /usr/local/cuda-10.1/bin/g++

@baldassarreFe (Contributor, PR author):

Oh, that's strange; for me nvcc V10.1.105 and gcc 7.4.0 work together without issues.

I think the trick with PATH and CPATH (which is what torch_scatter suggests) is safer than creating system-wide links, for the simple reason that if some other program relies on specific versions, it won't be affected by changing the PATH for the current shell.

We can document both approaches in the installation instructions, I think.

@aliutkus (Owner):

Yes, gcc 7 should work, but I had gcc 9 and it doesn't.

Comment on lines +79 to +80
Numpy: 3.4524286500000017
CPU: 10.617608329001087
@aliutkus (Owner):

As you can see, your CPU version is significantly slower than numpy's.
Although this is not a big issue for training deep models, which is typically done on GPU, it is a problem for inference, which is often done on CPU.

Comment on lines +1 to +66
import timeit

import torch
import numpy as np
from torchsearchsorted import searchsorted, numpy_searchsorted

B = 5_000
A = 300
V = 100

repeats = 20
number = 100

print(
    f'Benchmark searchsorted:',
    f'- a [{B} x {A}]',
    f'- v [{B} x {V}]',
    f'- reporting fastest time of {repeats} runs',
    f'- each run executes searchsorted {number} times',
    sep='\n',
    end='\n\n'
)


def get_arrays():
    a = np.sort(np.random.randn(B, A), axis=1)
    v = np.random.randn(B, V)
    out = np.empty_like(v, dtype=np.long)
    return a, v, out


def get_tensors(device):
    a = torch.sort(torch.randn(B, A, device=device), dim=1)[0]
    v = torch.randn(B, V, device=device)
    out = torch.empty(B, V, device=device, dtype=torch.long)
    return a, v, out


numpy = timeit.repeat(
    stmt="numpy_searchsorted(a, v, out, side='left')",
    setup="a, v, out = get_arrays()",
    globals=globals(),
    repeat=repeats,
    number=number
)
print('Numpy: ', min(numpy), sep='\t')

cpu = timeit.repeat(
    stmt="searchsorted(a, v, out, side='left')",
    setup="a, v, out = get_tensors(device='cpu')",
    globals=globals(),
    repeat=repeats,
    number=number
)
print('CPU: ', min(cpu), sep='\t')

if torch.cuda.is_available():
    gpu = timeit.repeat(
        stmt="searchsorted(a, v, out, side='left')",
        setup="a, v, out = get_tensors(device='cuda')",
        globals=globals(),
        repeat=repeats,
        number=number
    )
    print('CUDA: ', min(gpu), sep='\t')

@aliutkus (Owner):

very nice

if CUDA_HOME:
    print('torchsearchsorted will be installed with CUDA support')
@aliutkus (Owner):

nice

    modules.append(
        CUDAExtension('torchsearchsorted.cuda',
                      ['src/cuda/searchsorted_cuda_wrapper.cpp',
                       'src/cuda/searchsorted_cuda_kernel.cu'])
    )
else:
    print('torchsearchsorted will be installed for CPU only')
@aliutkus (Owner):

nice

Comment on lines +34 to +35
raise ValueError(f"Output `out` must have the same shape as `v`, "
                 f"got {out.shape} and {a.shape}")
@aliutkus (Owner):

This is not true: the output must have shape (max(a.shape[0], v.shape[0]), v.shape[1]).

@baldassarreFe (Contributor, PR author):

Let's see if I get the broadcasting semantics right.

Let's use these variables:

B = batch dimension, say 300
A = number of columns of a, say 15
V = number of columns of v, say 7

  • If a contains many rows and v contains many rows, the values in each row of v are searched in the corresponding row of a; in this case a.shape[0] and v.shape[0] must match
  • If a contains only one row and v contains many rows, all values in v are searched in that single row of a
  • If a contains many rows and v contains only one row, the values of that single row of v are searched in all rows of a

This is a summary of the cases:

       a        v         out
    (B, A),  (B, V)  ->  (B, V)
    (B, A),  (1, V)  ->  (B, V)
    (1, A),  (B, V)  ->  (B, V)
    (1, A),  (1, V)  ->  (1, V)
    (X, A),  (Y, V)  ->  error (X != Y, both > 1)
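
To make the rule concrete, here is a tiny sketch (a hypothetical helper, not code from the PR) that just encodes the table above:

```python
def out_shape(a_shape, v_shape):
    """Output shape implied by the table: batch rows broadcast, columns come from v."""
    (rows_a, _), (rows_v, cols_v) = a_shape, v_shape
    if rows_a != rows_v and 1 not in (rows_a, rows_v):
        # the (X, A), (Y, V) case with X != Y and both > 1
        raise ValueError("`a` and `v` must have the same number of rows, "
                         "or one of them must have a single row")
    return max(rows_a, rows_v), cols_v

assert out_shape((300, 15), (300, 7)) == (300, 7)   # (B, A), (B, V) -> (B, V)
assert out_shape((300, 15), (1, 7)) == (300, 7)     # (B, A), (1, V) -> (B, V)
assert out_shape((1, 15), (300, 7)) == (300, 7)     # (1, A), (B, V) -> (B, V)
assert out_shape((1, 15), (1, 7)) == (1, 7)         # (1, A), (1, V) -> (1, V)
```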

When you say:

the output must have the same shape as (max(a.shape[0], v.shape[0]), v.shape[1])
how do you handle the case where a.shape[0] and v.shape[0] are different and are both > 1? Let's say max(a.shape[0], v.shape[0]) = max(20, 45) = 45. In this case how can you match every row of values to be searched to a row of a?

A couple of lines before this check, the shapes of a and v are broadcast together along the first dimension (raising an error if that is not possible), so I think it's fine to check that out.shape == v.shape. The only issue I see is that I put the wrong letter in the exception message:

Instead of {a.shape}:

    if out is not None:
        if out.shape != v.shape:
            raise ValueError(f"Output `out` must have the same shape as `v`, "
                             f"got {out.shape} and {a.shape}")

Should be {v.shape}:

    if out is not None:
        if out.shape != v.shape:
            raise ValueError(f"Output `out` must have the same shape as `v`, "
                             f"got {out.shape} and {v.shape}")

@aliutkus (Owner):

Hey again, let me be more precise:
your solution works just fine. The only part that I think doesn't work (am I wrong?) is the error message:
after broadcasting, I don't agree that out should have the same shape as the original (i.e. not yet broadcast) v. Let me give you an example: we may have v.shape[0] == 1 and a.shape[0] > 1. In this case, we want out to be (a.shape[0], v.shape[1]).
So I totally agree with the code: at this point (after broadcasting) we may check whether out has the same shape as v. But the user possibly provided a different v, so the message is not clear.
See what I mean?


left_side = 1 if side=='left' else 0
left_side = side == 'left'
@aliutkus (Owner):

lol, good catch

Comment on lines -41 to -42
raise Exception('torchsearchsorted on CUDA device is asked, but it seems '
                'that it is not available. Please install it')
@aliutkus (Owner):

I'm missing the case where you call searchsorted with a CUDA tensor but torchsearchsorted.cuda is not built.

@baldassarreFe (Contributor, PR author):

In that case you get a warning the first time torchsearchsorted is imported. See lines 8-15 in searchsorted.py:

if torch.cuda.is_available():
    try:
        from torchsearchsorted.cuda import searchsorted_cuda_wrapper
    except ImportError as e:
        warnings.warn("PyTorch is installed with CUDA support, but "
                      "torchsearchsorted for CUDA was not installed, "
                      "please repeat the installation or avoid passing "
                      "CUDA tensors to the `searchsorted`.")

And if you still try to call the searchsorted function with a CUDA tensor you get the usual error message from PyTorch that says something like "RuntimeError: expected backend CPU and dtype Float but got backend CUDA and dtype Float".

I think letting PyTorch handle this exception for us and outputting the default message is better in this case.

@aliutkus (Owner):

all right yes

Comment on lines +4 to +25
def numpy_searchsorted(a: np.ndarray, v: np.ndarray,
                       out: np.ndarray=None, side='left') -> np.ndarray:
    """Batch-wise version of numpy's searchsorted"""
    a = np.asarray(a)
    v = np.asarray(v)
    a, v = broadcast_arrays(a, v, axis=0)
    if out is None:
        out = np.empty(v.shape, dtype=np.long)
    for i in range(v.shape[0]):
        out[i] = np.searchsorted(a[i], v[i], side=side)
    return out


def broadcast_arrays(*arrays, axis=0):
    """Broadcast arrays along one axis, leaving other axes unchanged"""
    if axis < 0:
        raise ValueError(f"Negative axis not supported, got {axis}")
    axis_size = max(a.shape[axis] for a in arrays)
    return [
        np.broadcast_to(a, (*a.shape[:axis], axis_size, *a.shape[axis + 1:]))
        for a in arrays
    ]
@aliutkus (Owner):

I don't think it's that much more elegant in this case. IMO you should revert to the original version, which is more concise.

@baldassarreFe (Contributor, PR author):

The thing is, in the original version the sel function is called every time a row of the a and v matrices has to be selected, which means two function calls for every iteration of the for loop. On the other hand, we can make use of the native broadcasting functions of numpy/pytorch, which expand a dimension by setting its stride to 0 and only need to be called once (quick illustration below).

Also, don't you like a separate, reusable and documented function better than a nested function?
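
As a quick standalone numpy illustration of the stride point (not code from the PR):

```python
import numpy as np

a = np.arange(15, dtype=np.int64).reshape(1, 15)
b = np.broadcast_to(a, (300, 15))   # a read-only view, no data is copied

print(b.shape)    # (300, 15)
print(b.strides)  # (0, 8) on a 64-bit platform: the batch axis has stride 0,
                  # so all 300 rows alias the same 15 values
```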

@aliutkus (Owner):

Well, I don't think this would be reused a lot, since it's quite hidden in our module.
If you like this, you could actually replace the sel function with a one-liner using your np.broadcast_to call; that would maybe be the best option? I agree that this is probably more elegant.
Did you try it: is it the same speed as sel, slower, or faster?

@baldassarreFe (Contributor, PR author)

I've noticed the speed issue; I think it could be related to the left/right side handling, which at least in this version gives correct results. Anyway, I'll look into speeding up the implementation ;)

Comment on lines -25 to -28
assert (a.shape[0] == v.shape[0]
        or a.shape[0] == 1
        or v.shape[0] == 1), ("`a` and `v` must have the same number of "
                              "rows or one of them must have only one ")
@aliutkus (Owner):

Answering your question: this is where the case of a.shape[0] and v.shape[0] being different and both > 1 was handled.
