New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10910: [Python] Provide better error message when trying to read from None source #10054
Conversation
python/pyarrow/io.pxi
Outdated
@@ -1449,6 +1449,10 @@ cdef shared_ptr[CBuffer] as_c_buffer(object o) except *: | |||
|
|||
|
|||
cdef NativeFile get_native_file(object source, c_bool use_memory_map): | |||
if source is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is useful. Why would you pass None
in the first place?
Not crashing is nice, but I don't think we want to maintain a specific error message for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested to add object source not None
in the signature (that would automatically give a better error message), but so apparently that only works for def
functions and not for cdef
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can confirm that not None
is not allowed in cdef
functions:
pyarrow/io.pxi:1451:32: 'not None' only allowed in Python functions
I think that the main value probably comes from having a test to cover for a past segfault
, the better error message is a nice to have but not strictly something that values the extra code. I wonder if we should not just keep the test without any error message customization.
If we want to go deeper we might want to investigate where the segfault was originally coming from (as my feeling is that we have just hidden it behind other python changes) but it doesn't seem to value the extra time at the moment if it's already solved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amol- another solution could be to add not None
in places where get_native_file
eventually gets called, to get an error higher up. Eg ParquetReader.open
or ORCReader.open
could have object source not None
, since those are def
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same issue seems to involved btw also callers of get_input_stream
which is a bit more widespread.
See for example
>>> json.read_json(None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/_json.pyx", line 238, in pyarrow._json.read_json
_get_reader(input_file, &stream)
File "pyarrow/_json.pyx", line 193, in pyarrow._json._get_reader
get_input_stream(input_file, use_memory_map, out)
File "pyarrow/io.pxi", line 1494, in pyarrow.lib.get_input_stream
input_stream = nf.get_input_stream()
AttributeError: 'NoneType' object has no attribute 'get_input_stream'
I wonder if a possible strategy might also be to provide an error signaling return value to get_reader
and get_input_stream
(as they don't return anything) and use it for the purpose of adding an except -1
or something like that.
That seems would cover more paths that could all end up in this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm not sure what the concern is. Is it common to pass None
instead of a file object? It's not an optional parameter, so I don't know where the None
could come from.
(also, getting this kind of AttributeError
when passing the wrong type is quite common in Python)
@jorisvandenbossche @pitrou should I delete this one? If we don't think it values doing it there is little benefit in keeping the PR around |
Personally, I still find it worth it to provide a better error message. Yes, it's not typical to pass For example adding a simple --- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -927,7 +927,7 @@ cdef class ParquetReader(_Weakrefable):
self.pool = maybe_unbox_memory_pool(memory_pool)
self._metadata = None
- def open(self, object source, bint use_memory_map=True,
+ def open(self, object source not None, bint use_memory_map=True,
read_dictionary=None, FileMetaData metadata=None,
int buffer_size=0): could also help if we don't want to add an explicit |
I simplified further the solution as suggested by @jorisvandenbossche |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @amol- :-)
…d from None source Closes apache#10054 from amol-/ARROW-10910 Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…d from None source Closes apache#10054 from amol-/ARROW-10910 Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…d from None source Closes apache#10054 from amol-/ARROW-10910 Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
No description provided.