[Python] pyarrow.dataset.FileSystemDataset.take method causes Segmentation Fault #30777

Closed
asfimport opened this issue Jan 8, 2022 · 7 comments

Comments

@asfimport

Whenever I try calling the pyarrow.dataset.FileSystemDataset.take method, I get a segmentation fault. 

I first encountered this using a proprietary dataset and recreated it with the UC Davis data below. I can successfully run the pyarrow.dataset.FileSystemDataset.to_batches method but not the take() method.

Steps to recreate:

!wget https://anson.ucdavis.edu/~clarkf/pems_parquet.zip
!unzip -q pems_parquet.zip

import pyarrow.dataset as ds
file_path = "./pems_sorted/station=402264/part-r-00151-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet"
dataset = ds.dataset(file_path)
dataset.take(1)
>>> 80874 segmentation fault

Creating a dataset from a directory as below also results in a segfault.

dir_path = "./pems_sorted/station=402264"
dataset = ds.dataset(dir_path)
dataset.take(1)

Environments tried:

  • Ubuntu 18.04, Python 3.8.6
  • macOS 11.6.2, Python 3.7.5

Environment: Ubuntu 18.04, Python 3.8.6; macOS 11.6.2, Python 3.7.5
Reporter: Dustin Zubke
Assignee: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-15286. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~dzubke] thanks for the report!

I can reproduce this, and running it under gdb gives me the following traceback:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fb5d0f2b8dc in std::__shared_ptr<arrow::ArrayData, (__gnu_cxx::_Lock_policy)2>::get (this=0x8)
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1310
1310	      { return _M_ptr; }
(gdb) bt
#0  0x00007fb5d0f2b8dc in std::__shared_ptr<arrow::ArrayData, (__gnu_cxx::_Lock_policy)2>::get (this=0x8)
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1310
#1  0x00007fb5d0f2864b in std::__shared_ptr_access<arrow::ArrayData, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x8)
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1021
#2  0x00007fb5d0f25495 in std::__shared_ptr_access<arrow::ArrayData, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x8)
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1015
#3  0x00007fb5d0f22387 in arrow::Array::null_count (this=0x0) at ../src/arrow/array/array_base.cc:52
#4  0x00007fb5cd0cd9ea in arrow::dataset::Scanner::TakeRows (this=0x5584436ebbd0, indices=...) at ../src/arrow/dataset/scanner.cc:1036
#5  0x00007fb5cd32cfcc in __pyx_pw_7pyarrow_8_dataset_7Scanner_15take(_object*, _object*) () from /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
A simpler observation: passing the indices as a pyarrow array works fine (dataset.take(pa.array([1]))), but passing them as a list or scalar does not (dataset.take([1])).

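This happens because we pass the user-supplied indices straight to pyarrow_unwrap_array, which returns an empty shared_ptr[CArray]() (a null pointer, not a zero-length array) when the object is not a pyarrow.Array. That matches the gdb traceback, where arrow::Array::null_count is called with this=0x0.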

@asfimport

Dustin Zubke:
@jorisvandenbossche thank you for your prompt response!

And thank you for the observation to use a pyarrow array. I have confirmed that this approach works for me without raising a Segmentation Fault.

It may be helpful for future users if the documentation stated that the indices argument of dataset.take needs to be a pyarrow array. The current documentation does not clearly specify the expected type of the indices argument.

@asfimport

Dustin Zubke:
I'm not sure how you handle Jira issues. I consider this issue closed but am not sure what to select in the Resolution field. Feel free to "close" this issue in whatever way you see fit, or give me instructions on how to do so.

Thanks!

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
I opened a PR to automatically convert any input to a pyarrow array, to avoid such segfaults.
When that PR is merged, this issue will automatically get resolved.

@asfimport

Dustin Zubke:
Awesome, thanks for taking care of this!

Happy New Year!

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 12115
#12115
