Skip to content

ARROW-6054: [Python] Fix the type erasion bug when serializing structured type ndarray.#4953

Closed
sighingnow wants to merge 3 commits intoapache:masterfrom
sighingnow:fix-structured-dtype-serialization
Closed

ARROW-6054: [Python] Fix the type erasion bug when serializing structured type ndarray.#4953
sighingnow wants to merge 3 commits intoapache:masterfrom
sighingnow:fix-structured-dtype-serialization

Conversation

@sighingnow
Copy link
Copy Markdown
Contributor

Fix the type erasion bug when serializing structured arrays of numpy.

Without this patch, we could see something like:

In [1]: import pyarrow as pa
In [2]: import numpy as np
In [3]: x = np.array([(1, "a"), (2, "bb")], dtype=np.dtype([('x', 'int32'), ('y', '<U4')]))
In [4]: y = pa.deserialize(pa.serialize(x).to_buffer())
In [5]: y
Out[5]:
array([b'\x01\x00\x00\x00\x61\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
       b'\x02\x00\x00\x00\x62\x00\x00\x00\x62\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'],
      dtype='|V20')
In [6]: x.dtype
Out[6]: dtype([('x', '<i4'), ('y', '<U4')])

Note that the dtype of deserialized result y is lost, which is really annoying. The reason is that x.type.str is '|V20', rather than the structured type it self. Thus we need dtype.descr.

After this PR, we get something like

In [1]: import pyarrow as pa
In [2]: import numpy as np
In [3]: x = np.array([(1, "a"), (2, "bb")], dtype=np.dtype([('x', 'int32'), ('y', '<U4')]))
In [4]: y  = pa.deserialize(pa.serialize(x).to_buffer())
In [5]: y
Out[5]: array([(1, 'a'), (2, 'bb')], dtype=[('x', '<i4'), ('y', '<U4')])
In [6]: y.dtype
Out[6]: dtype([('x', '<i4'), ('y', '<U4')])

I didn't see any existing test that checks the dtype when testing serialization of numpy thus I didn't add test case for this PR. If a test case is needed please let me know.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 26, 2019

Can you add a unit test and open a JIRA issue for this?

http://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@sighingnow sighingnow changed the title Use dtype.descr rather than str to avoid type erasion. ARROW-6054: [Python] Fix the type erasion bug when serializing structured type ndarray. Jul 27, 2019
@sighingnow
Copy link
Copy Markdown
Contributor Author

Jira issue: ARROW-6054.

Test for structured dtype has been added as well.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #4953 into master will decrease coverage by 23.65%.
The diff coverage is 88.23%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4953       +/-   ##
===========================================
- Coverage   88.53%   64.88%   -23.66%     
===========================================
  Files         912      488      -424     
  Lines      116620    65118    -51502     
  Branches     1418        0     -1418     
===========================================
- Hits       103255    42253    -61002     
- Misses      13003    22865     +9862     
+ Partials      362        0      -362
Impacted Files Coverage Δ
python/pyarrow/serialization.py 84.76% <100%> (ø) ⬆️
python/pyarrow/tests/test_serialization.py 89.4% <100%> (+0.02%) ⬆️
python/pyarrow/compat.py 62.29% <84%> (+5.59%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util-internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/sse-util.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/logical_type.h 0% <0%> (-100%) ⬇️
... and 668 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 937bcb7...f1778f6. Read the comment docs.

@sighingnow
Copy link
Copy Markdown
Contributor Author

hi @wesm, the AMD64 Conda C++ failure is not caused by this PR. Do you have any further comment on this PR?

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question here.

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pitrou pitrou closed this in 171c3f7 Jul 29, 2019
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jul 29, 2019

Thanks @sighingnow for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants