Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10910: [Python] Provide better error message when trying to read from None source #10054

Closed
wants to merge 3 commits into from

Conversation

amol-
Copy link
Contributor

@amol- amol- commented Apr 15, 2021

No description provided.

@github-actions
Copy link

@@ -1449,6 +1449,10 @@ cdef shared_ptr[CBuffer] as_c_buffer(object o) except *:


cdef NativeFile get_native_file(object source, c_bool use_memory_map):
if source is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is useful. Why would you pass None in the first place?
Not crashing is nice, but I don't think we want to maintain a specific error message for this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested to add object source not None in the signature (that would automatically give a better error message), but so apparently that only works for def functions and not for cdef functions.

Copy link
Contributor Author

@amol- amol- Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can confirm that not None is not allowed in cdef  functions:

pyarrow/io.pxi:1451:32: 'not None' only allowed in Python functions

I think that the main value probably comes from having a test to cover for a past segfault, the better error message is a nice to have but not strictly something that values the extra code. I wonder if we should not just keep the test without any error message customization.

If we want to go deeper we might want to investigate where the segfault was originally coming from (as my feeling is that we have just hidden it behind other python changes) but it doesn't seem to value the extra time at the moment if it's already solved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amol- another solution could be to add not None in places where get_native_file eventually gets called, to get an error higher up. Eg ParquetReader.open or ORCReader.open could have object source not None, since those are def methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issue seems to involved btw also callers of get_input_stream which is a bit more widespread.
See for example

>>> json.read_json(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_json.pyx", line 238, in pyarrow._json.read_json
    _get_reader(input_file, &stream)
  File "pyarrow/_json.pyx", line 193, in pyarrow._json._get_reader
    get_input_stream(input_file, use_memory_map, out)
  File "pyarrow/io.pxi", line 1494, in pyarrow.lib.get_input_stream
    input_stream = nf.get_input_stream()
AttributeError: 'NoneType' object has no attribute 'get_input_stream'

I wonder if a possible strategy might also be to provide an error signaling return value to get_reader and get_input_stream (as they don't return anything) and use it for the purpose of adding an except -1 or something like that.

That seems would cover more paths that could all end up in this situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not sure what the concern is. Is it common to pass None instead of a file object? It's not an optional parameter, so I don't know where the None could come from.

(also, getting this kind of AttributeError when passing the wrong type is quite common in Python)

@amol-
Copy link
Contributor Author

amol- commented Apr 26, 2021

@jorisvandenbossche @pitrou should I delete this one? If we don't think it values doing it there is little benefit in keeping the PR around

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 26, 2021

Personally, I still find it worth it to provide a better error message. Yes, it's not typical to pass None as the file source, but it can accidentally happen, and then the current message is not that helpful (it doesn't give any indication which keyword is incorrectly being passed as None, but just showing a later error about None not having some attribute)

For example adding a simple not None to ParquetReader:

--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -927,7 +927,7 @@ cdef class ParquetReader(_Weakrefable):
         self.pool = maybe_unbox_memory_pool(memory_pool)
         self._metadata = None
 
-    def open(self, object source, bint use_memory_map=True,
+    def open(self, object source not None, bint use_memory_map=True,
              read_dictionary=None, FileMetaData metadata=None,
              int buffer_size=0):

could also help if we don't want to add an explicit if .. is None: .. check.

@amol-
Copy link
Contributor Author

amol- commented Apr 28, 2021

I simplified further the solution as suggested by @jorisvandenbossche
I think that at this point the change is so minor that even if we don't think it provides much value the maintenance cost is negligible and still verifies for something that was a segfault in the past.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @amol- :-)

@pitrou pitrou closed this in 08acfa5 Apr 28, 2021
DileepSrigiri pushed a commit to DileepSrigiri/arrow that referenced this pull request May 6, 2021
…d from None source

Closes apache#10054 from amol-/ARROW-10910

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
…d from None source

Closes apache#10054 from amol-/ARROW-10910

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…d from None source

Closes apache#10054 from amol-/ARROW-10910

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants