ARROW-822: [Python] StreamWriter Wrapper for Socket and File-like Objects without tell()#569
Conversation
…d unittest for StreamWriter and StreamReader over local socket
|
@wesm, I just added a simple wrapper to implement This is the way Spark sets up sockets, so it would be great to get that fixed here. I'm not sure where exactly this is happening, just somewhere in the StreamWriter constructor, but I'll keep trying to debug it. |
|
|
||
| def stopClientServer(self): | ||
| import struct | ||
| self.sink.write(struct.pack('i', 0)) |
There was a problem hiding this comment.
@wesm, would it be possible to include the writing of a 0 when the StreamWriter is closed to signal the EOS? I believe the Java implementation does this here https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java#L54
There was a problem hiding this comment.
Sure. I am not sure if it will be more complicated than writing a 4-byte 0 in https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L623. If you'd rather do that in a separate patch I can take a look after this goes in
python/pyarrow/_io.pyx
Outdated
| if isinstance(sink, socket.socket): | ||
| # Use a buffer of 0 to flush after each write, otherwise it is | ||
| # possible to close socket without flushing | ||
| sink_file = sink.makefile("wb", 0) |
There was a problem hiding this comment.
I was kind of on the fence about automatically wrapping a socket object like this, let me know if you prefer to take this out and maybe just require the application to call makefile instead
python/pyarrow/_io.pyx
Outdated
| if not hasattr(sink, "tell"): | ||
| return _StreamSinkWrapper(sink) | ||
| else: | ||
| return sink |
There was a problem hiding this comment.
Instead of using this wrapper class, can you instead add int64_t position_ member to PyOutputStream here https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/io.cc#L192, initalized to 0, and return that position_ from Tell instead of calling the file's tell method? Then in PyOutputStream::Write, you would increment the position.
There was a problem hiding this comment.
Yeah, so PyOutputStream would always use this position_ member for Tell or is it still possible to check if the python handle object already definestell() and then make it conditional there? I'm just wondering if there is a sink that implements a valid tell() that we wouldn't want to override?
There was a problem hiding this comment.
I don't know of one off hand; I will probably take the YAGNI approach on this one and keep things simple (no need to invoke tell() on the Python object)
There was a problem hiding this comment.
Sure, np. That will probably fix the other issue with using os.fdopen(sock.fileno(),... too
python/pyarrow/tests/test_ipc.py
Outdated
| # NOTE: must start and stop server in test | ||
| pass | ||
|
|
||
| def startClientServer(self, do_read_all): |
There was a problem hiding this comment.
Can you use start_client_server here and same naming convention below?
|
|
||
| def stopClientServer(self): | ||
| import struct | ||
| self.sink.write(struct.pack('i', 0)) |
There was a problem hiding this comment.
Sure. I am not sure if it will be more complicated than writing a 4-byte 0 in https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L623. If you'd rather do that in a separate patch I can take a look after this goes in
Can you paste a standalone repro here (or with this PR) and I can help debug? |
|
Ok, that strange error is coming from Looks like the real error is after calling Is it taking the empty parenthesis to be a tuple argument? repro is here |
|
@wesm I removed the wrapper and made changes to |
wesm
left a comment
There was a problem hiding this comment.
+1. This looks good, thanks @BryanCutler! I will take a look at the other error also
|
@BryanCutler I can't get the code from ARROW-822 to work either with Python 2.7 or 3.5. Can you paste a reproduction? |
|
Sorry @wesm, I should have said you need to run "netcat -l 8080" in another
terminal before running the code to give it something to connect to. It
might be "nc -l 8080" depending on what os you have.
Also, it won't happen after this fix since python tell() isn't called anymore, I'll see if I can find another way to reproduce.
|
…ects without tell() Added a wrapper for StreamWriter to implement the required tell() method so that python sockets and file-like objects can be used as sinks. The tell() method will report the position by starting at 0 when the StreamWriter is created and incrementing by number of bytes after each write. Added unittests that use local socket as the source/sink for streaming. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#569 from BryanCutler/pyarrow-stream-writer-socket-ARROW-822 and squashes the following commits: 6cdec4f [Bryan Cutler] Removed StreamWriter wrapper and put position handling in PyStreamWriter instead 2bd669f [Bryan Cutler] Added StreamSinkWrapper to ensure stream sink has tell() method, added unittest for StreamWriter and StreamReader over local socket
Added a wrapper for StreamWriter to implement the required tell() method so that python sockets and file-like objects can be used as sinks. The tell() method will report the position by starting at 0 when the StreamWriter is created and incrementing by number of bytes after each write.
Added unittests that use local socket as the source/sink for streaming.