Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge sorted sequences of sorted numpy arrays #52

Closed
mrocklin opened this issue Feb 26, 2015 · 3 comments
Closed

Merge sorted sequences of sorted numpy arrays #52

mrocklin opened this issue Feb 26, 2015 · 3 comments

Comments

@mrocklin
Copy link
Member

In the near future we may want to implement an external sort. One missing piece in our ecosystem is an out-of-core merge sorted operation for sequences of numpy arrays. E.g. given

  1. A few iterators of sorted numpy arrays in sorted order (e.g. a list of iterables of numpy ndarrays)
  2. An on-disk store supporting numpy setitem syntax (e.g. hdf5, bcolz)

We want to walk through each iterator, pull off a numpy array from each, and then perform a merge sorted operation (see toolz api for example) on these blocks, building up an intermediate block which we periodically push off on to the external store. In principle this should be almost O(N) and probably I/O bound.

It is safe to assume that having one block from each of the iterators still fits in memory.

We will also need the original index of the values.

One approach might be to just use cytoolz.merge_sorted. I assume that this is slower than a straight C/Cython solution operating on binary arrays but it's worth checking out.

This seems like the kind of thing that might interest @eriknw @cpcloud @nevermindewe

@mrocklin mrocklin changed the title Merge sorted for sequence of numpy arrays Merge sorted for sequences of numpy arrays Feb 26, 2015
@mrocklin mrocklin changed the title Merge sorted for sequences of numpy arrays Merge sorted sequences of sorted numpy arrays Feb 26, 2015
@quasiben
Copy link
Member

@hhuuggoo might have thoughts on this too.

@mrocklin
Copy link
Member Author

mrocklin commented Mar 2, 2015

cytoolz solution

import numpy as np
from cytoolz import merge_sorted, concat, map, partition_all

def shard(n, x):
    """

    >>> list(shard(3, list(range(10)))) 
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    for i in range(0, len(x), n):
        yield x[i: i + n]


def array_merge_sorted(seqs, out=None, out_chunksize=2**14):
    """ Merge step of external sort

    Merged sorted sequences of numpy arrays in to out result
    """
    assert out is not None

    seqs2 = [concat(x.tolist() for x in seq) for seq in seqs]

    seq = merge_sorted(*seqs2)
    chunks = map(np.array, partition_all(out_chunksize, seq))

    for i, chunk in enumerate(chunks):
        out[i*out_chunksize: min(len(out), (i+1)*out_chunksize)] = chunk

    return out

test file

from dask.frame.esort import array_merge_sorted, shard
import numpy as np

def test_shard():
    result = list(shard(3, np.arange(10)))
    assert result[0].tolist() == [0, 1, 2]
    assert result[1].tolist() == [3, 4, 5]
    assert result[2].tolist() == [6, 7, 8]
    assert result[3].tolist() == [9]


def test_esort():
    seqs = [np.random.random(size=(np.random.randint(100))) for i in range(5)]
    sorteds = [np.sort(x) for x in seqs]
    chunks = [shard(5, x) for x in sorteds]

    out = np.empty(shape=(sum(len(x) for x in seqs),))

    array_merge_sorted(chunks, out=out)

    assert (out == np.sort(np.concatenate(seqs))).all() 

@mrocklin
Copy link
Member Author

Closing this in favor of approximate percentiles

mrocklin added a commit to mrocklin/dask that referenced this issue Mar 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants