secondary index vs dictionary implementation using pyarrow #6284

healiseu · 2020-01-26T12:29:15Z

Hi, I am searching for an efficient solution to build a secondary index in Python using a high-level optimised mathematical package such as numpy and arrow. I am excluding pandas for performance reasons.

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly pyarrow.Array.dictionary_encode method can transform the value array into a dictionary encoded representation that is close to a secondary index.

val.dictionary_encode()
Out[55]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

Pause a bit here to highlight an inconsistent behaviour, if we dictionary-encode a pyarrow array of floats we take:

pa.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], type='float32').dictionary_encode()

Out[62]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8bad0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    null,
    null,
    null,
    3,
    4
  ]

This time, encoding doesn't have null values in the dictionary but it has null values in the indices. In my opinion that is confusing, because the two array representations should produce consistent results in processing.

Solution

I have searched both in the past and in the present for a solution but I have not found one that satisfies my appetite, so this time I decided to write one that covers the null case. Do notice also that secondary index is also very close to adjacency list representation, something which I am using a lot in my TRIADB project and that is the reason behind searching for the solution. It's basically one line code using numpy

idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')

idx['val']
Out[68]: 
array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)

idx['pk']
Out[69]: 
array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)

Another solution (faster)

this is the case where pk has values in range(n)

idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]
idx_val
Out[93]: array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)

You may compare the secondary index idx_val with the dictionary and idx_pk with the indices of your dictionary_encode() method. In fact this is the case where the dictionary values are sorted in ascending order.

I guess that can be relatively easy to implement in your pyarrow package and I think it will be extremely helpful in the future to use for filtering based on secondary index and for building graph representations !

To conclude you can have an option in your dictionary_encode method to produce a secondary index, i.e. a sorted dictionary.

PS: In case I am missing a better solution, which is likely the case, please add this here or to the relevant post in stackoverflow

The text was updated successfully, but these errors were encountered:

wesm · 2020-01-27T05:06:28Z

Would you like to discuss this on user@arrow.apache.org?

wesm · 2020-01-27T15:29:32Z

Closing this so we can continue the discussion on the mailing list.

wesm closed this as completed Jan 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

secondary index vs dictionary implementation using pyarrow #6284

secondary index vs dictionary implementation using pyarrow #6284

healiseu commented Jan 26, 2020 •

edited

wesm commented Jan 27, 2020

wesm commented Jan 27, 2020

secondary index vs dictionary implementation using pyarrow #6284

secondary index vs dictionary implementation using pyarrow #6284

Comments

healiseu commented Jan 26, 2020 • edited

Solution

Another solution (faster)

wesm commented Jan 27, 2020

wesm commented Jan 27, 2020

healiseu commented Jan 26, 2020 •

edited