ARROW-3399: [Python] Implementing numpy matrix serialization#4096
rok wants to merge 4 commits into apache:master
Are you sure this works? (I mean, it looks like it does; it is a rhetorical question.) I have no idea why I was getting an error when I tried this. I must have made some mistake. This is great.
robertnishihara left a comment
Thanks for doing this! Nice work.
python/pyarrow/tests/test_plasma.py
Can we test a few more dtypes as well?
I've added other types. Do you think str, int and float are enough?
That's pretty good, but are we actually hitting the if obj.dtype.str != '|O': code path? Maybe add a test that uses a numpy matrix that contains some custom object, like
    class Foo(object):
        pass
    x = np.matrix([Foo()])
There should be an analogous test for numpy arrays. If so, maybe copy that one.
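To make the dtype check in this comment concrete, here is a minimal sketch using NumPy only. It shows that a matrix holding a custom object reports the dtype string '|O', while a numeric matrix does not. The names x_object, x_float, object_path and numeric_path are illustrative and not taken from the PR, and the warning filter assumes a NumPy version where np.matrix emits PendingDeprecationWarning.

```python
import warnings

import numpy as np


class Foo(object):
    """Custom type like the one suggested in the review comment."""
    pass


# np.matrix may emit PendingDeprecationWarning; silence it locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    x_object = np.matrix([Foo()])       # 1x1 matrix of Python objects
    x_float = np.matrix([[1.0, 2.0]])   # plain numeric matrix

# The serialization code discussed here branches on this string.
object_path = x_object.dtype.str == '|O'
numeric_path = x_float.dtype.str != '|O'
```

A test that wants to cover both branches would need at least one input of each kind.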
I've added CustomType to the list of types we check for. I've added some logging locally and we appear to be hitting if obj.dtype.str != '|O': now.
We do get a lot of np.matrix deprecation warnings, should we silence them?
"We do get a lot of np.matrix deprecation warnings, should we silence them?"
In tests you mean?
Yes, the tests trigger them. I am not sure how this works once it is merged into Arrow: would every deserialization trigger warnings or not?
I think we could silence them during tests. But during regular use I would just leave it to the user to do so if they want.
I don't have a strong opinion either way. What is the project default @robertnishihara?
I don't see the warnings. Where are they? Did they appear in the arrow CI?
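One possible way to silence the np.matrix warnings in tests only, as suggested in this thread, is a local warnings filter. This is a sketch, not the approach taken in the PR: make_matrix_quietly is a hypothetical helper, and it assumes the warning in question is the PendingDeprecationWarning that recent NumPy versions raise when np.matrix is constructed.

```python
import warnings

import numpy as np


def make_matrix_quietly(data):
    """Build an np.matrix while suppressing its deprecation warning.

    Hypothetical test helper; the filter is restored on exit, so code
    outside the `with` block still sees warnings as usual.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", PendingDeprecationWarning)
        return np.matrix(data)
```

Because the filter is scoped to the with block, regular users of the library would still see the warning and can decide for themselves whether to silence it.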
python/pyarrow/tests/test_plasma.py
Will this catch errors if the dtype changes?
It wouldn't, e.g. for int and float. I've added a dtype comparison.
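The reason a value comparison alone is not enough can be sketched as follows: np.testing.assert_array_equal accepts an int matrix and a float matrix with equal values, so a changed dtype goes unnoticed unless it is compared explicitly. The helper name assert_same_matrix is hypothetical and is not the code added in the PR.

```python
import warnings

import numpy as np


def assert_same_matrix(expected, actual):
    """Check container type, dtype and values (hypothetical helper)."""
    assert type(expected) is type(actual)
    assert expected.dtype == actual.dtype
    np.testing.assert_array_equal(expected, actual)


with warnings.catch_warnings():
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    as_int = np.matrix([[1, 2], [3, 4]], dtype=np.int64)
    as_float = np.matrix([[1, 2], [3, 4]], dtype=np.float64)

# Values alone compare equal, so this does not raise.
np.testing.assert_array_equal(as_int, as_float)

# With the dtype comparison the mismatch is detected.
try:
    assert_same_matrix(as_int, as_float)
    dtype_mismatch_detected = False
except AssertionError:
    dtype_mismatch_detected = True
```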
python/pyarrow/serialization.py
I just did a little timing and the time taken by this method seems to scale linearly in the size of the array. Can we fix this?
I think np.matrix(...) is the slow line.
Indeed, copying doesn't seem like the thing to do here. I think np.matrix(data, copy=False) would just create a new view.
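The difference between the default and copy=False can be checked with np.shares_memory. This is a small sketch on a 2-D ndarray input; the variable names are illustrative and the snippet is independent of the actual serialization code in the PR.

```python
import warnings

import numpy as np

data = np.arange(6, dtype=np.float64).reshape(2, 3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    copied = np.matrix(data)              # default copy=True duplicates the buffer
    viewed = np.matrix(data, copy=False)  # reuses the buffer of `data`

copied_shares = np.shares_memory(copied, data)
viewed_shares = np.shares_memory(viewed, data)

# Writing through the original array is visible only in the view.
data[0, 0] = 42.0
```

Since the view does not touch the underlying buffer, its cost should not grow with the array size, which is consistent with the timing concern raised above.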
…s datatypes at np.matrix deserialization.
I'm having issues building, any idea why after running: I get: I'm using conda and this worked up until some point recently and now stopped. Any ideas?
I'm not sure about the compilation error, but it's unrelated to this PR.
I figured out my build issue. It was kinda stupid in hindsight. I opened an issue on Jira.
@rok nice work! I should have mentioned this earlier, but didn't notice until just now. This test should be in
Thanks @robertnishihara :).
Thanks @rok! Nice work, I merged this.
Nice! Thanks @robertnishihara and @mitar!
See ARROW-3399.