Hi, I am searching for an efficient way to build a secondary index in Python using a high-level optimised numerical package such as numpy or arrow. I am excluding pandas for performance reasons.
Let's take a simple example; we can scale this later on to produce some benchmarks:
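For concreteness, here is a tiny setup of the kind meant above (the names pk and val and the data are my own assumptions, not the original snippet): a primary-key column and a value column with repeated values and a null, where the secondary index should map each distinct value to the primary keys carrying it.

```python
import numpy as np

# A tiny example table: a primary-key column and a value column.
pk  = np.array([100, 101, 102, 103, 104, 105])
val = np.array([2.0, 1.0, 2.0, np.nan, 1.0, 3.0])   # np.nan plays the null

# The secondary index we want, distinct value -> primary keys:
#   1.0 -> [101, 104], 2.0 -> [100, 102], 3.0 -> [105], nan -> [103]
```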
Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array into a dictionary-encoded representation that is close to a secondary index.
Pause a bit here to highlight an inconsistent behaviour: if we dictionary-encode a pyarrow array of floats, this time the encoding doesn't have null values in the dictionary but it does have null values in the indices. In my opinion that is confusing, because the two array representations should produce consistent results in processing.
Solution
I have searched, both in the past and more recently, for a solution, but I have not found one that satisfies my appetite, so this time I decided to write one that covers the null case. Notice also that a secondary index is very close to an adjacency-list representation, something I use a lot in my TRIADB project, which is the reason behind searching for a solution. It's basically one line of code using numpy.
Another solution (faster)
This is the case where pk has values in range(n).
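A sketch of both solutions (the function names are mine; NaN stands in for null, and collapsing all NaNs into a single trailing group relies on the np.unique behaviour of NumPy >= 1.21):

```python
import numpy as np

def secondary_index(pk, val):
    """Sort val, then group the primary keys by each distinct value.
    NaN entries sort last and end up as one trailing group, covering
    the null case."""
    order = np.argsort(val, kind='stable')                 # NaNs sort last
    idx_val, starts = np.unique(val[order], return_index=True)
    idx_pk = np.split(pk[order], starts[1:])               # pks grouped per value
    return idx_val, idx_pk

def secondary_index_rangepk(val):
    """Faster variant when pk is simply range(n): the argsort positions
    already are the primary keys, so no second gather is needed."""
    order = np.argsort(val, kind='stable')
    idx_val, starts = np.unique(val[order], return_index=True)
    return idx_val, np.split(order, starts[1:])
```

For example, with pk = [100..105] and val = [2.0, 1.0, 2.0, nan, 1.0, 3.0], idx_val comes out as [1.0, 2.0, 3.0, nan] and idx_pk as [[101, 104], [100, 102], [105], [103]].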
You may compare the secondary index idx_val with the dictionary, and idx_pk with the indices, of the dictionary_encode() method. In fact, this is the case where the dictionary values are sorted in ascending order.
I guess that would be relatively easy to implement in the pyarrow package, and I think it would be extremely helpful in the future for filtering based on a secondary index and for building graph representations!
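As a sketch of the filtering use-case (this is my own illustration, not an existing pyarrow API): once the values are sorted, a binary search returns all primary keys that hold a given value.

```python
import numpy as np

pk  = np.array([100, 101, 102, 103, 104])
val = np.array([2.0, 1.0, 2.0, 3.0, 1.0])

# All primary keys whose value equals 2.0, via the sorted view.
order = np.argsort(val, kind='stable')
sv = val[order]
lo = np.searchsorted(sv, 2.0, side='left')
hi = np.searchsorted(sv, 2.0, side='right')
matching_pk = pk[order[lo:hi]]           # -> [100, 102]
```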
To conclude, you could add an option to the dictionary_encode method to produce a secondary index, i.e. a sorted dictionary.
PS: In case I am missing a better solution, which is likely, please add it here or to the relevant post on Stack Overflow.