-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge sorted sequences of sorted numpy arrays #52
Comments
mrocklin
changed the title
Merge sorted for sequence of numpy arrays
Merge sorted for sequences of numpy arrays
Feb 26, 2015
mrocklin
changed the title
Merge sorted for sequences of numpy arrays
Merge sorted sequences of sorted numpy arrays
Feb 26, 2015
@hhuuggoo might have thoughts on this too. |
cytoolz solution import numpy as np
from cytoolz import merge_sorted, concat, map, partition_all
def shard(n, x):
"""
>>> list(shard(3, list(range(10))))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
"""
for i in range(0, len(x), n):
yield x[i: i + n]
def array_merge_sorted(seqs, out=None, out_chunksize=2**14):
""" Merge step of external sort
Merged sorted sequences of numpy arrays in to out result
"""
assert out is not None
seqs2 = [concat(x.tolist() for x in seq) for seq in seqs]
seq = merge_sorted(*seqs2)
chunks = map(np.array, partition_all(out_chunksize, seq))
for i, chunk in enumerate(chunks):
out[i*out_chunksize: min(len(out), (i+1)*out_chunksize)] = chunk
return out test file from dask.frame.esort import array_merge_sorted, shard
import numpy as np
def test_shard():
result = list(shard(3, np.arange(10)))
assert result[0].tolist() == [0, 1, 2]
assert result[1].tolist() == [3, 4, 5]
assert result[2].tolist() == [6, 7, 8]
assert result[3].tolist() == [9]
def test_esort():
seqs = [np.random.random(size=(np.random.randint(100))) for i in range(5)]
sorteds = [np.sort(x) for x in seqs]
chunks = [shard(5, x) for x in sorteds]
out = np.empty(shape=(sum(len(x) for x in seqs),))
array_merge_sorted(chunks, out=out)
assert (out == np.sort(np.concatenate(seqs))).all() |
Closing this in favor of approximate percentiles |
mrocklin
added a commit
to mrocklin/dask
that referenced
this issue
Mar 28, 2019
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In the near future we may want to implement an external sort. One missing piece in our ecosystem is an out-of-core merge sorted operation for sequences of numpy arrays. E.g. given
We want to walk through each iterator, pull off a numpy array from each, and then perform a merge sorted operation (see toolz api for example) on these blocks, building up an intermediate block which we periodically push off on to the external store. In principle this should be almost
O(N)
and probably I/O bound.It is safe to assume that having one block from each of the iterators still fits in memory.
We will also need the original index of the values.
One approach might be to just use
cytoolz.merge_sorted
. I assume that this is slower than a straight C/Cython solution operating on binary arrays but it's worth checking out.This seems like the kind of thing that might interest @eriknw @cpcloud @nevermindewe
The text was updated successfully, but these errors were encountered: