Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

secondary index vs dictionary implementation using pyarrow #6284

Closed
healiseu opened this issue Jan 26, 2020 · 2 comments
Closed

secondary index vs dictionary implementation using pyarrow #6284

healiseu opened this issue Jan 26, 2020 · 2 comments

Comments

@healiseu
Copy link

healiseu commented Jan 26, 2020

Hi, I am searching for an efficient solution to build a secondary index in Python using a high-level optimised mathematical package such as numpy and arrow. I am excluding pandas for performance reasons.

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly pyarrow.Array.dictionary_encode method can transform the value array into a dictionary encoded representation that is close to a secondary index.

val.dictionary_encode()
Out[55]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

Pause a bit here to highlight an inconsistent behaviour, if we dictionary-encode a pyarrow array of floats we take:

pa.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], type='float32').dictionary_encode()

Out[62]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8bad0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    null,
    null,
    null,
    3,
    4
  ]

This time, encoding doesn't have null values in the dictionary but it has null values in the indices. In my opinion that is confusing, because the two array representations should produce consistent results in processing.

Solution

I have searched both in the past and in the present for a solution but I have not found one that satisfies my appetite, so this time I decided to write one that covers the null case. Do notice also that secondary index is also very close to adjacency list representation, something which I am using a lot in my TRIADB project and that is the reason behind searching for the solution. It's basically one line code using numpy

idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')

idx['val']
Out[68]: 
array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)

idx['pk']
Out[69]: 
array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)

Another solution (faster)

this is the case where pk has values in range(n)

idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]
idx_val
Out[93]: array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)

You may compare the secondary index idx_val with the dictionary and idx_pk with the indices of your dictionary_encode() method. In fact this is the case where the dictionary values are sorted in ascending order.

I guess that can be relatively easy to implement in your pyarrow package and I think it will be extremely helpful in the future to use for filtering based on secondary index and for building graph representations !

To conclude you can have an option in your dictionary_encode method to produce a secondary index, i.e. a sorted dictionary.

PS: In case I am missing a better solution, which is likely the case, please add this here or to the relevant post in stackoverflow

@wesm
Copy link
Member

wesm commented Jan 27, 2020

Would you like to discuss this on user@arrow.apache.org?

@wesm
Copy link
Member

wesm commented Jan 27, 2020

Closing this so we can continue the discussion on the mailing list.

@wesm wesm closed this as completed Jan 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants