Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3890: [Python] Handle NumPy binary arrays with UTF-8 validation when converting to StringArray #3063

Closed
wants to merge 4 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Nov 30, 2018

I'm not sure if all compilers will be smart enough to do loop unswitching here. If it ends up being a bottleneck I suggest rewriting in a follow up patch.

The BinaryArray overflow issue (ChunkedArray is not being produced) is still present here. We will need to address that in ARROW-2970

This patch also includes symbol export macros particular to the arrow_python shared library. These are needed so that global data members in arrow.dll can be accessed in arrow_python.dll

@wesm
Copy link
Member Author

wesm commented Dec 1, 2018

On windows:

src\arrow\python\CMakeFiles\arrow_python_shared.dir\serialize.cc.obj /out:release\arrow_python.dll /implib:release\arrow_python.lib /pdb:release\arrow_python.pdb /dll /version:12.0 /machine:x64 /INCREMENTAL:NO release\arrow.lib C:\Miniconda36-x64\envs\arrow\libs\Python36.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:release\arrow_python.dll.manifest" failed (exit code 1120) with the following output:
   Creating library release\arrow_python.lib and object release\arrow_python.exp
numpy_to_arrow.cc.obj : error LNK2019: unresolved external symbol "unsigned short * arrow::util::internal::utf8_large_table" (?utf8_large_table@internal@util@arrow@@3PAGA) referenced in function "unsigned short __cdecl arrow::util::internal::ValidateOneUTF8Byte(unsigned char,unsigned short)" (?ValidateOneUTF8Byte@internal@util@arrow@@YAGEG@Z)
release\arrow_python.dll : fatal error LNK1120: 1 unresolved externals
[262/341] Building CXX object src\parquet\CMakeFiles\parquet_static.dir\column_writer.cc.obj
[263/341] Building CXX object src\arrow\python\CMakeFiles\python-test.dir\python-test.cc.obj

Static members across DLLs in Windows is a bit zany, taking a look...

@wesm
Copy link
Member Author

wesm commented Dec 1, 2018

Hm, so I'm surprised we haven't hit this issue yet, but it seems unfortunately that we need to use different visibility macros for libarrow_python. The issue is that we need __declspec(dllexport) to be set for arrow_python.dll APIs but __declspec(dllimport) set for arrow.dll APIs when we are compiling the arrow_python library. If __declspec(dllimport) is not set on global data members (like arrow::util::internal::utf8_large_table) then they will not be properly imported.

It seems this is a well known problem; I think other C++ projects address this by just not having a lot of DLL artifacts, see https://blog.kitware.com/create-dlls-on-windows-without-declspec-using-new-cmake-export-all-feature/. Using WINDOWS_EXPORT_ALL_SYMBOLS is not a panacea since

For global data symbols, __declspec(dllimport) must still be used when compiling against the code in the DLL.

Darn. Well, it is what it is.

…pyarrow.array with type=pyarrow.string()

Change-Id: Ie3e047eda3bf8c7619944bf9cac8d67084df420c
Change-Id: Ic60726716a34827126dc22914402238e1c7b860a
…mbers from arrow.dll can be accessed correctly in arrow_python.dll

Change-Id: Ib5748ce1cf9aebd60f8cec20ee4ee03617bec983
@codecov-io
Copy link

Codecov Report

Merging #3063 into master will increase coverage by 1.05%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #3063      +/-   ##
=========================================
+ Coverage   87.15%   88.2%   +1.05%     
=========================================
  Files         489     431      -58     
  Lines       69161   65470    -3691     
=========================================
- Hits        60275   57748    -2527     
+ Misses       8787    7722    -1065     
+ Partials       99       0      -99
Impacted Files Coverage Δ
cpp/src/arrow/python/arrow_to_pandas.h 100% <ø> (ø) ⬆️
cpp/src/arrow/python/helpers.h 90% <ø> (ø) ⬆️
cpp/src/arrow/python/common.h 100% <ø> (ø) ⬆️
cpp/src/arrow/python/decimal.h 100% <ø> (ø) ⬆️
cpp/src/arrow/python/python_to_arrow.h 100% <ø> (ø) ⬆️
cpp/src/arrow/python/inference.cc 54.95% <ø> (ø) ⬆️
python/pyarrow/__init__.py 69.23% <0%> (ø) ⬆️
cpp/src/arrow/python/serialize.h 0% <0%> (ø) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 93.73% <100%> (+0.19%) ⬆️
python/pyarrow/tests/test_array.py 99.1% <100%> (+0.02%) ⬆️
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a667fca...dac4995. Read the comment docs.

@wesm
Copy link
Member Author

wesm commented Dec 2, 2018

+1. I want to get to work on the overflow issues. @pitrou or @xhochy if you have comments on these changes please let me know and I'll incorporate into the upcoming patches

@wesm wesm closed this in 67b9215 Dec 2, 2018
@wesm wesm deleted the ARROW-3890 branch December 2, 2018 16:56
#ifdef ARROW_STATIC
#define ARROW_PYTHON_EXPORT
#elif defined(ARROW_PYTHON_EXPORTING)
#define ARROW_PYTHON_EXPORT __declspec(dllexport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an explanation why we need those separate macros?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wrote in the comments. So the issue is that we need ARROW_EXPORT to mean dllimport when we are compiling arrow_python.dll, or global data members from arrow.dll cannot be imported. So we need a separate dllexport macro for this separate DLL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether or not there are global data members in arrow.dll, it is the more correct thing to have different export macros for each distinct DLL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants