Skip to content

ARROW-3399: [Python] Implementing numpy matrix serialization#4096

Closed
rok wants to merge 4 commits intoapache:masterfrom
rok:ARROW-3399
Closed

ARROW-3399: [Python] Implementing numpy matrix serialization#4096
rok wants to merge 4 commits intoapache:masterfrom
rok:ARROW-3399

Conversation

@rok
Copy link
Member

@rok rok commented Apr 2, 2019

See ARROW-3399.

@mitar
Copy link
Contributor

mitar commented Apr 10, 2019

Are you sure this works (I mean it looks it does, it is a rhetorical question)? I have no idea why then I was getting an error when I tried this. I must have made some mistake.

This is great.

Copy link
Contributor

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Nice work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test a few more dtypes as well?

Copy link
Member Author

@rok rok Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added other types. Do you think str, int and float are enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty good, but are we actually hitting the if obj.dtype.str != '|O': code path? Maybe add a test that uses a numpy matrix that contains some custom object, like

class Foo(object):
    pass

x = np.matrix([Foo()])

There should be an analogous test for numpy arrays. If so, maybe copy that one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added CustomType to the list of types we check for. I've added some logging locally and we appear to be hitting if obj.dtype.str != '|O': now.
We do get a lot of np.matrix deprecation warnings, should we silence them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do get a lot of np.matrix deprecation warnings, should we silence them?

In tests you mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes tests trigger them. I am not sure how it works if this is merged into arrow - would every deserialization trigger warnings or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could silence them during tests. But during regular use I would just leave to the user to do so if they want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion either way. What is the project default @robertnishihara?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the warnings. Where are they? Did they appear in the arrow CI?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd - locally I see a warning similar to this one about np.matrix being deprecated. I can't find it in the CI log. I like @mitar's point I'll leave this as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this catch errors if the dtype changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't e.g. for int and float. I've added a dtype comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a little timing and the time taken by this method seems to scale linearly in the size of the array. Can we fix this?

I think np.matrix(...) is the slow line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, copying doesn't seem like the thing to do here. I think np.matrix(data, copy=False) would just create a new view.

@rok
Copy link
Member Author

rok commented Apr 11, 2019

I'm having issues building, any idea why after running:

pushd arrow/python
export PYARROW_WITH_FLIGHT=1
export PYARROW_WITH_GANDIVA=1
export PYARROW_WITH_ORC=1
export PYARROW_WITH_PARQUET=1
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --inplace
popd

I get:

...
[100%] Linking CXX shared module release/_parquet.cpython-37m-x86_64-linux-gnu.so
[100%] Built target _parquet
-- Finished cmake --build for pyarrow
Bundling includes: include
error: [Errno 2] No such file or directory: 'include'

I'm using conda and this worked up until some point recently and now stopped. Any ideas?

@robertnishihara
Copy link
Contributor

I'm not sure about the compilation error, but it's unrelated to this PR.

@rok
Copy link
Member Author

rok commented Apr 12, 2019

I figured out my build issue. It was kinda stupid in hindsight. I opened an issue on Jira.

@robertnishihara
Copy link
Contributor

@rok nice work! I should have mentioned this earlier, but didn't notice until just now. This test should be in test_serialization.py instead of test_plasma.py. That will require the test to be changed a little bit (to not use plasma).

@rok
Copy link
Member Author

rok commented Apr 15, 2019

Thanks @robertnishihara :).
Yeah test_serialization does make more sense. I pushed a change now.

@robertnishihara
Copy link
Contributor

Thanks @rok! Nice work, I merged this.

@rok
Copy link
Member Author

rok commented Apr 17, 2019

Nice! Thanks @robertnishihara and @mitar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants