Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] "take" kernel crashes on ExtensionArrays w/ underlying dictionary type #34619

Closed
FancyAlex opened this issue Mar 17, 2023 · 0 comments · Fixed by #34684
Closed

[C++] "take" kernel crashes on ExtensionArrays w/ underlying dictionary type #34619

FancyAlex opened this issue Mar 17, 2023 · 0 comments · Fixed by #34684

Comments

@FancyAlex
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Version pyarrow=11.0.0 on Debian 5.10.162.

import pyarrow as pa                                                                                                                                                                                                                                                                                    
import pyarrow.compute as pc                                                                                                                                                                                                                                                                            

                                                                                                                                                                                                                                                                                                        
class MyEnum(pa.ExtensionType):                                                                                                                                                                                                                                                                         
    def __init__(self):                                                                                                                                                                                                                                                                                 
        super().__init__(pa.dictionary(pa.int32(), pa.string()), "MyEnum")                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                        
    def __arrow_ext_serialize__(self):                                                                                                                                                                                                                                                                  
        return b''                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                        
    @classmethod                                                                                                                                                                                                                                                                                        
    def __arrow_ext_deserialize__(cls, storage_type, serialized):                                                                                                                                                                                                                                       
        return cls()                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                        
arr = pa.array(["a", "b", "a", "c"]).dictionary_encode()                                                                                                                                                                                                                                                
print(pc.take(arr, [0, 1]))
# <pyarrow.lib.DictionaryArray object at 0x7ffa8fee3bc0>                                                                                                                                                                                                                                                
#                                                                                                                                                                                                                                                                                                       
# -- dictionary:                                                                                                                                                                                                                                                                                        
#   [                                                                                                                                                                                                                                                                                                   
#     "a",                                                                                                                                                                                                                                                                                              
#     "b",                                                                                                                                                                                                                                                                                              
#     "c"                                                                                                                                                                                                                                                                                               
#   ]                                                                                                                                                                                                                                                                                                   
# -- indices:                                                                                                                                                                                                                                                                                           
#   [                                                                                                                                                                                                                                                                                                   
#     0,                                                                                                                                                                                                                                                                                                
#     1                                                                                                                                                                                                                                                                                                 
#   ]                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                        
ext = pa.ExtensionArray.from_storage(MyEnum(), arr)                                                                                                                                                                                                                                                     
pc.take(ext, [0, 1])                                                                                                                                                                                                                                                                                
# WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                                                                                                                                                                                      
# F20230317 10:59:30.370121 2048457 array_dict.cc:83]  Check failed: (data->dictionary) != (nullptr)                                                                                                                                                                                                    
# *** Check failure stack trace: ***                                                                                                                                                                                                                                                                    
# Aborted  

Component(s)

C++

@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Mar 18, 2023
@jorisvandenbossche jorisvandenbossche changed the title pyarrow.compute.take crashes on ExtensionArrays w/ underlying dictionary type. [C++] "take" kernel crashes on ExtensionArrays w/ underlying dictionary type Mar 18, 2023
westonpace added a commit that referenced this issue Mar 24, 2023
…34684)

### Rationale for this change

The take kernel would generate a segmentation fault when used on an extension array with a dictionary storage type.

### What changes are included in this PR?

Conversion to/from ArraySpan now uses the storage type to determine if it needs to assign a dictionary

### Are these changes tested?

I added unit tests that verify ArraySpan round trip for most types, including an extension dictionary type (and verified this test fails without the fix).

### Are there any user-facing changes?

No
* Closes: #34619

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue Mar 27, 2023
…rsion (apache#34684)

### Rationale for this change

The take kernel would generate a segmentation fault when used on an extension array with a dictionary storage type.

### What changes are included in this PR?

Conversion to/from ArraySpan now uses the storage type to determine if it needs to assign a dictionary

### Are these changes tested?

I added unit tests that verify ArraySpan round trip for most types, including an extension dictionary type (and verified this test fails without the fix).

### Are there any user-facing changes?

No
* Closes: apache#34619

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…rsion (apache#34684)

### Rationale for this change

The take kernel would generate a segmentation fault when used on an extension array with a dictionary storage type.

### What changes are included in this PR?

Conversion to/from ArraySpan now uses the storage type to determine if it needs to assign a dictionary

### Are these changes tested?

I added unit tests that verify ArraySpan round trip for most types, including an extension dictionary type (and verified this test fails without the fix).

### Are there any user-facing changes?

No
* Closes: apache#34619

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants