Implemented side parameter (resolves #1) #7

Open · wants to merge 9 commits into base: master

Conversation

@baldassarreFe (Contributor)

This PR includes:

  • C++ code for searchsorted with correct support of side=[left|right]
  • CUDA code for searchsorted with correct support of side=[left|right]
  • helper functions for broadcasting numpy arrays and pytorch tensors to be used in searchsorted
  • more unit tests to make sure all of the above work
  • minor changes in the README (how to check CUDA availability during installation, how to run tests, how to run the benchmark)
  • new benchmark script that uses the builtin timeit

Please check out the new implementation (the CUDA and C++ parts should be much more readable), run the unit tests, and let me know if you get similar times in the benchmark ;)
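
To illustrate the new parameter, here is a minimal usage sketch (not taken from the repository; it assumes the call signature exercised by the benchmark script further down, `searchsorted(a, v, out, side)`, and numpy-style `side` semantics):

```python
import torch
from torchsearchsorted import searchsorted

# Each row of `v` is searched in the corresponding sorted row of `a`.
a = torch.sort(torch.randn(5, 300), dim=1)[0]   # `a` must be sorted along dim 1
v = torch.randn(5, 100)
out = torch.empty(5, 100, dtype=torch.long)

# side='left':  smallest index i with a[row, i-1] <  value <= a[row, i]
searchsorted(a, v, out, side='left')
# side='right': smallest index i with a[row, i-1] <= value <  a[row, i]
searchsorted(a, v, out, side='right')
```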

@aliutkus (Owner)

Thank you very much, I'll review this shortly.

@aliutkus self-requested a review on November 25, 2019 at 07:07
@aliutkus (Owner) left a comment:

Dear @baldassarreFe,

It looks like it took you a lot of time to prepare this PR and I sincerely thank you very much for this.

I also took some time to review this carefully, and although I believe it should be merged soon, I cannot accept the PR as it is now for several reasons:

1. The new version is slower

Proposed version:

Benchmark searchsorted:

  • a [5000 x 300]
  • v [5000 x 100]
  • reporting fastest time of 20 runs
  • each run executes searchsorted 100 times

Numpy: 4.360683292150497
CPU: 12.054927673190832
CUDA: 0.00202458119019866

Previous version:

Benchmark searchsorted:

  • a [5000 x 300]
  • v [5000 x 100]
  • reporting fastest time of 20 runs
  • each run executes searchsorted 100 times

Numpy: 4.632954204920679
CPU: 5.094183708075434
CUDA: 0.0007853466086089611

(exactly the same machine and setup)
As you can see, the PR makes both the CPU and the CUDA versions roughly 2.5x slower. Since this searchsorted operation is typically a very low-level one that is called intensively (at least in my own work), I can't really accept the slow-down.

2. Details in the code

I will mention them in the code directly

Thanks for everything.

Comment on lines +28 to +30
```bash
pip install -v .
```
@aliutkus (Owner):

Your modification removes the very important information that nvcc sometimes needs versions of gcc and g++ older than the system defaults, so your installation procedure will not work in that case.

@baldassarreFe (Contributor, PR author):

All right, sorry for removing that disclaimer. I tested on my machine and on a few servers we have here and did not notice any issue; that's why it felt redundant to me. Also, g++ usually ships alongside the matching gcc version, and I haven't seen cases where the two differ.

Anyway, let's just restore that warning so that no one runs into problems ;)

Comment on lines +36 to +40
The verbose flag `-v` is not mandatory, but it will print whether the installer was able to find `nvcc` and install the CUDA version of `torchsearchsorted`.
If you're having problems with the installation, make sure `nvcc` and `gcc` are installed and available in your path, e.g.:
```bash
export PATH="/usr/local/cuda/bin:${PATH}"
export CPATH="/usr/local/cuda/include:${CPATH}"
```
@aliutkus (Owner):

This is not enough to have gcc and g++ in the path. What is actually required is for nvcc to find versions of gcc and g++ that it supports. For me, I had to do the following:

    sudo apt-get install g++-8 gcc-8
    sudo ln -s /usr/bin/gcc-8 /usr/local/cuda-10.1/bin/gcc
    sudo ln -s /usr/bin/g++-8 /usr/local/cuda-10.1/bin/g++

@baldassarreFe (Contributor, PR author):

Oh, that's strange; for me nvcc V10.1.105 and gcc 7.4.0 work together without issues.

I think the trick with PATH and CPATH (which is what torch_scatter suggests) is safer than creating system-wide links, for the simple reason that if some other program relies on specific versions, it won't be affected by changing the PATH for the current shell.

We can document both approaches in the installation instructions, I think.

@aliutkus (Owner):

Yes, gcc 7 should work, but I had gcc 9 and it doesn't.

Comment on lines +79 to +80
Numpy: 3.4524286500000017
CPU: 10.617608329001087
@aliutkus (Owner):

As you can see, your CPU version is significantly slower than numpy's.
Although this is not a big issue for training deep models, which is typically done on GPU, it is a problem for inference, which is often done on CPU.

Comment on lines +1 to +66
import timeit

import torch
import numpy as np
from torchsearchsorted import searchsorted, numpy_searchsorted

B = 5_000
A = 300
V = 100

repeats = 20
number = 100

print(
    f'Benchmark searchsorted:',
    f'- a [{B} x {A}]',
    f'- v [{B} x {V}]',
    f'- reporting fastest time of {repeats} runs',
    f'- each run executes searchsorted {number} times',
    sep='\n',
    end='\n\n'
)


def get_arrays():
    a = np.sort(np.random.randn(B, A), axis=1)
    v = np.random.randn(B, V)
    out = np.empty_like(v, dtype=np.long)
    return a, v, out


def get_tensors(device):
    a = torch.sort(torch.randn(B, A, device=device), dim=1)[0]
    v = torch.randn(B, V, device=device)
    out = torch.empty(B, V, device=device, dtype=torch.long)
    return a, v, out


numpy = timeit.repeat(
    stmt="numpy_searchsorted(a, v, out, side='left')",
    setup="a, v, out = get_arrays()",
    globals=globals(),
    repeat=repeats,
    number=number
)
print('Numpy: ', min(numpy), sep='\t')

cpu = timeit.repeat(
    stmt="searchsorted(a, v, out, side='left')",
    setup="a, v, out = get_tensors(device='cpu')",
    globals=globals(),
    repeat=repeats,
    number=number
)
print('CPU: ', min(cpu), sep='\t')

if torch.cuda.is_available():
    gpu = timeit.repeat(
        stmt="searchsorted(a, v, out, side='left')",
        setup="a, v, out = get_tensors(device='cuda')",
        globals=globals(),
        repeat=repeats,
        number=number
    )
    print('CUDA: ', min(gpu), sep='\t')

@aliutkus (Owner):

very nice

if CUDA_HOME:
    print('torchsearchsorted will be installed with CUDA support')
@aliutkus (Owner):

nice

    modules.append(
        CUDAExtension('torchsearchsorted.cuda',
                      ['src/cuda/searchsorted_cuda_wrapper.cpp',
                       'src/cuda/searchsorted_cuda_kernel.cu'])
    )
else:
    print('torchsearchsorted will be installed for CPU only')
@aliutkus (Owner):

nice

Comment on lines +34 to +35
raise ValueError(f"Output `out` must have the same shape as `v`, "
                 f"got {out.shape} and {a.shape}")
@aliutkus (Owner):

This is not true: the output must have shape (max(a.shape[0], v.shape[0]), v.shape[1]).

@baldassarreFe (Contributor, PR author):

Let's see if I get the broadcasting semantics right.

Let's use these variables:

B = batch dimension, say 300
A = number of columns of a, say 15
V = number of columns of v, say 7

  • If a contains many rows and v contains many rows, the values in each row of v are searched in the corresponding row of a; in this case a.shape[0] and v.shape[0] must match
  • If a contains only one row and v contains many rows, all values in v are searched in that single row of a
  • If a contains many rows and v contains only one row, the values of that single row of v are searched in all rows of a

This is a summary of the cases:

       a        v         out
    (B, A),  (B, V)  ->  (B, V)
    (B, A),  (1, V)  ->  (B, V)
    (1, A),  (B, V)  ->  (B, V)
    (1, A),  (1, V)  ->  (1, V)
    (X, A),  (Y, V)  ->  error (X != Y, both > 1)
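
To make the rule concrete, here is a tiny sketch (a hypothetical helper, not code from the PR) that just encodes the table above:

```python
def out_shape(a_shape, v_shape):
    """Output shape implied by the table: batch rows broadcast, columns come from v."""
    (rows_a, _), (rows_v, cols_v) = a_shape, v_shape
    if rows_a != rows_v and 1 not in (rows_a, rows_v):
        # the (X, A), (Y, V) case with X != Y and both > 1
        raise ValueError("`a` and `v` must have the same number of rows, "
                         "or one of them must have a single row")
    return max(rows_a, rows_v), cols_v

assert out_shape((300, 15), (300, 7)) == (300, 7)   # (B, A), (B, V) -> (B, V)
assert out_shape((300, 15), (1, 7)) == (300, 7)     # (B, A), (1, V) -> (B, V)
assert out_shape((1, 15), (300, 7)) == (300, 7)     # (1, A), (B, V) -> (B, V)
assert out_shape((1, 15), (1, 7)) == (1, 7)         # (1, A), (1, V) -> (1, V)
```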

When you say:

the output must have the same shape as (max(a.shape[0], v.shape[0]), v.shape[1])
how do you handle the case where a.shape[0] and v.shape[0] are different and are both > 1? Let's say max(a.shape[0], v.shape[0]) = max(20, 45) = 45. In this case how can you match every row of values to be searched to a row of a?

A couple of lines before this check, the shapes of a and v are broadcast together along the first dimension (raising an error if that is not possible), so I think it's fine to check that out.shape == v.shape. The only issue I see is that I put the wrong letter in the exception message:

Instead of {a.shape}:

    if out is not None:
        if out.shape != v.shape:
            raise ValueError(f"Output `out` must have the same shape as `v`, "
                             f"got {out.shape} and {a.shape}")

Should be {v.shape}:

    if out is not None:
        if out.shape != v.shape:
            raise ValueError(f"Output `out` must have the same shape as `v`, "
                             f"got {out.shape} and {v.shape}")

@aliutkus (Owner):

Hey again, let me be more precise:
your solution works just fine. The only part that I think doesn't work (am I wrong?) is the error message:
after broadcasting, I don't agree that out should have the same shape as the original (i.e. not yet broadcast) v. Let me give you an example: we may have v.shape[0] == 1 and a.shape[0] > 1. In this case, we want out to be (a.shape[0], v.shape[1]).
So I totally agree with the code: at this point (after broadcasting) we may check whether out has the same shape as v. But the user possibly provided a different v, so the message is not clear.
See what I mean?


left_side = 1 if side=='left' else 0
left_side = side == 'left'
@aliutkus (Owner):

lol, good catch

Comment on lines -41 to -42
raise Exception('torchsearchsorted on CUDA device is asked, but it seems '
                'that it is not available. Please install it')
@aliutkus (Owner):

I'm missing the case where you call searchsorted with a CUDA tensor but torchsearchsorted.cuda is not built.

@baldassarreFe (Contributor, PR author):

In that case you get a warning the first time torchsearchsorted is imported. See lines 8-15 in searchsorted.py:

if torch.cuda.is_available():
    try:
        from torchsearchsorted.cuda import searchsorted_cuda_wrapper
    except ImportError as e:
        warnings.warn("PyTorch is installed with CUDA support, but "
                      "torchsearchsorted for CUDA was not installed, "
                      "please repeat the installation or avoid passing "
                      "CUDA tensors to the `searchsorted`.")

And if you still try to call the searchsorted function with a CUDA tensor you get the usual error message from PyTorch that says something like "RuntimeError: expected backend CPU and dtype Float but got backend CUDA and dtype Float".

I think letting PyTorch handle this exception for us and outputting the default message is better in this case.

@aliutkus (Owner):

all right yes

Comment on lines +4 to +25
def numpy_searchsorted(a: np.ndarray, v: np.ndarray,
                       out: np.ndarray=None, side='left') -> np.ndarray:
    """Batch-wise version of numpy's searchsorted"""
    a = np.asarray(a)
    v = np.asarray(v)
    a, v = broadcast_arrays(a, v, axis=0)
    if out is None:
        out = np.empty(v.shape, dtype=np.long)
    for i in range(v.shape[0]):
        out[i] = np.searchsorted(a[i], v[i], side=side)
    return out


def broadcast_arrays(*arrays, axis=0):
    """Broadcast arrays along one axis, leaving other axes unchanged"""
    if axis < 0:
        raise ValueError(f"Negative axis not supported, got {axis}")
    axis_size = max(a.shape[axis] for a in arrays)
    return [
        np.broadcast_to(a, (*a.shape[:axis], axis_size, *a.shape[axis + 1:]))
        for a in arrays
    ]
@aliutkus (Owner):

I don't think it's that much more elegant in this case. IMO you should revert to the original version, which is more concise.

@baldassarreFe (Contributor, PR author):

The thing is, in the original version the sel function is called every time a row of the a and v matrices has to be selected, which means two function calls for every iteration of the for loop. On the other hand, we can make use of the native broadcasting functions of numpy/pytorch, which expand a dimension by setting its stride to 0 and only need to be called once (quick illustration below).

Also, don't you like a separate, reusable and documented function better than a nested function?
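
As a quick standalone numpy illustration of the stride point (not code from the PR):

```python
import numpy as np

a = np.arange(15, dtype=np.int64).reshape(1, 15)
b = np.broadcast_to(a, (300, 15))   # a read-only view, no data is copied

print(b.shape)    # (300, 15)
print(b.strides)  # (0, 8) on a 64-bit platform: the batch axis has stride 0,
                  # so all 300 rows alias the same 15 values
```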

@aliutkus (Owner):

Well, I don't think this would be reused a lot, since it's quite hidden in our module.
If you like this, you could actually replace the sel function with a one-liner using your np.broadcast_to call; that would maybe be the best option? I agree that this is probably more elegant.
Did you try it: is it the same speed as sel, slower, or faster?

@baldassarreFe (Contributor, PR author)

I've noticed the speed issue; I think it could be related to the left/right side handling, which at least in this version gives correct results. Anyway, I'll look into speeding up the implementation ;)

Comment on lines -25 to -28
assert (a.shape[0] == v.shape[0]
        or a.shape[0] == 1
        or v.shape[0] == 1), ("`a` and `v` must have the same number of "
                              "rows or one of them must have only one ")
@aliutkus (Owner):

Answering your question: this is where the case of a.shape[0] and v.shape[0] being different and both > 1 was handled.
