Conversation

alphalfalfa commented Feb 6, 2018

Currently the take kernel supports:

  • Primitive value types (NA, Boolean and numeric types)
  • Single numeric index
  • Multiple flat indices in an array

The kernel does not yet handle special indices such as negative integers or out-of-bounds indices; ARROW-2105 has been created for that feature. The parent ticket is ARROW-772.
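
For orientation, here is a minimal sketch of invoking the kernel as proposed in this PR, using the `Take` signature quoted in the review below; the header paths and `FunctionContext` construction are assumptions about this era's API:

```cpp
#include "arrow/api.h"
#include "arrow/compute/api.h"  // assumed umbrella header for compute

// Take rows from `values` at positions given by flat integer `indices`.
arrow::Status TakeExample(const arrow::Array& values, const arrow::Array& indices,
                          std::shared_ptr<arrow::Array>* out) {
  arrow::compute::FunctionContext ctx(arrow::default_memory_pool());
  arrow::compute::TakeOptions options;  // out-of-bounds modes not yet implemented
  return arrow::compute::Take(&ctx, values, indices, options, out);
}
```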

wagavulin and others added 30 commits October 20, 2017 09:42
The current build script for Travis CI (`ci/travis_before_script_c_glib.sh`) does not use the `--enable-gtk-doc` option on Mac. The option can now be used on Mac as well, so it should be enabled to check the build status.

Author: Wataru Shimizu <waruzilla@gmail.com>

Closes apache#1195 from wagavulin/ci-mac-enable-gtk-doc and squashes the following commits:

1b93aa7 [Wataru Shimizu] Add environment variable for xsltproc on macOS
675c910 [Wataru Shimizu] Enable gtk-doc on Travis CI Mac environment
…eated in custom serializer

This uses the NumPyBuffer built into Arrow's Tensor facility to protect the numpy arrays holding the Tensors to be serialized. See also the problem description in https://issues.apache.org/jira/browse/ARROW-1695.

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1220 from pcmoritz/fix-serialize-tensors and squashes the following commits:

7e23bb5 [Philipp Moritz] fix linting
dce92ad [Philipp Moritz] fix handling of numpy arrays generated in the custom serializer methods
… directory

Omitting the trailing slash results in a `java/apidocs` subdirectory being created

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1221 from wesm/ARROW-1686 and squashes the following commits:

565681e [Wes McKinney] Move API docs generation readme to RELEASE_MANAGEMENT.md
2fcf5fc [Wes McKinney] rsync contents of apidocs directory into site java directory
…ch more than once

Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#1222 from TheNeuralBit/ARROW-1698 and squashes the following commits:

aec2f6d [Brian Hulette] readDictionary now returns a single vector or null
f7b251b [Brian Hulette] only load a dictionary once
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1230 from xhochy/update-manylinux1-jemalloc and squashes the following commits:

6a29fa8 [Uwe L. Korn] Update jemalloc in manylinux1 build
This is being used in PySpark. After 0.8.0 we should migrate Spark to use the `pyarrow.types` APIs and then remove this from the top level namespace again

cc @BryanCutler

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1229 from wesm/ARROW-1683 and squashes the following commits:

7378d9d [Wes McKinney] Restore TimestampType to pyarrow namespace
This will save us repeating ourselves when users report bugs or request features on GitHub

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1226 from wesm/ARROW-1697 and squashes the following commits:

8a2ed58 [Wes McKinney] Add note about mailing list
b11e497 [Wes McKinney] Add ISSUE_TEMPLATE.md
…ation

This also restructures the code such that it is easier to reset the default serialization context back to the initial state after more handlers have been registered.

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1223 from pcmoritz/torch-tensor and squashes the following commits:

db09afc [Philipp Moritz] fix test
ba0856c [Philipp Moritz] don't run pytorch test on python 2.7
264d199 [Philipp Moritz] remove import that is not required
882d9a5 [Philipp Moritz] small restructuring and support all PyTorch tensor types
c6dac9e [Philipp Moritz] add -q flag
23de67b [Philipp Moritz] add -y to torch installation
9814897 [Philipp Moritz] test torch tensor conversion
375bbfa [Philipp Moritz] support serializing torch tensors
This is caused by the GObject Introspection package in Ubuntu 14.04 being old.
Recent GObject Introspection doesn't have this problem.
This PR makes it possible to add serialization handlers that allow deserializing Python objects using zero copy. If the serialization handler returns an Arrow buffer, it will be appended to the serialized objects, and the object can be reconstructed from there without copying the buffer.

TODO before merge:

- [x] Add pandas zero copy buffer read
- [x] See if fixing the TODO about union tags makes the code cleaner

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1231 from pcmoritz/buffer-serialization and squashes the following commits:

10a01c0 [Philipp Moritz] run clang-format
529366d [Philipp Moritz] fix linting
7d88a67 [Philipp Moritz] fixes and make pandas deserialization zero-copy
2b75a0d [Philipp Moritz] fix linting
e078cb0 [Philipp Moritz] use union array fields to distinguish between ints and tensor/buffer references
dbd2364 [Philipp Moritz] add read path
0cc6d45 [Philipp Moritz] add read path
9f0173d [Philipp Moritz] buffer serialization write path
Initially I was looking to see if we could gracefully skip the tests when we have the requisite libraries available but are unable to connect to a running HDFS cluster, but I think this would mask errors that we want to see. So this just makes it possible for developers who don't want to see the test failure to skip building the unit test executable.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1232 from wesm/ARROW-641 and squashes the following commits:

eed9ea4 [Wes McKinney] Do not build io-hdfs-test if ARROW_HDFS is off
…s in all builds

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1219 from wesm/ARROW-1087 and squashes the following commits:

5444b82 [Wes McKinney] Use more stable include
6cc855d [Wes McKinney] Add stub rst file about C extensions with pyarrow
…actor existing code to new variant

We should be very strict about not returning Status from functions that cannot fail.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1227 from wesm/ARROW-1671 and squashes the following commits:

8a15a06 [Wes McKinney] Deprecate arrow::MakeArray that returns Status, refactor existing code to use new MakeArray returning std::shared_ptr<Array>
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1236 from xhochy/ARROW-1708 and squashes the following commits:

0695745 [Uwe L. Korn] ARROW-1708: [JS] Fix linter error
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1235 from xhochy/ARROW-1707 and squashes the following commits:

21a6734 [Uwe L. Korn] Add link to GitBox for account setup
d484bb0 [Uwe L. Korn] ARROW-1707: Update dev README after movement to GitBox
…ng, improve docs

This was actually already documented. I improved the docstring for `ParquetWriter` and added a unit test to validate the user API. Also added `ParquetWriter` to the API listing.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1218 from wesm/ARROW-571 and squashes the following commits:

e4f44b4 [Wes McKinney] Use py27-compatible class __doc__ setting
bb16d9b [Wes McKinney] Add ParquetWriter to API listing
4cda8ab [Wes McKinney] Add unit test for incremental Parquet file building
…nit tests

In the event that the offsets array has nulls, this will backward-fill the offsets to compute the correct value sizes.
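
A minimal sketch of the C++ entry point this commit adds, assuming the `ListArray::FromArrays(offsets, values, pool, out)` signature:

```cpp
#include "arrow/api.h"

arrow::Status BuildListWithNullOffset(std::shared_ptr<arrow::Array>* out) {
  arrow::Int32Builder offsets_builder;
  RETURN_NOT_OK(offsets_builder.Append(0));
  RETURN_NOT_OK(offsets_builder.Append(2));
  RETURN_NOT_OK(offsets_builder.AppendNull());  // null offset -> a null list slot
  RETURN_NOT_OK(offsets_builder.Append(5));
  std::shared_ptr<arrow::Array> offsets;
  RETURN_NOT_OK(offsets_builder.Finish(&offsets));

  arrow::Int64Builder values_builder;
  for (int64_t v : {1, 2, 3, 4, 5}) {
    RETURN_NOT_OK(values_builder.Append(v));
  }
  std::shared_ptr<arrow::Array> values;
  RETURN_NOT_OK(values_builder.Finish(&values));

  // Produces a list<int64> array of length 3; the slot corresponding to the
  // null offset is null, and the remaining offsets are backward-filled.
  return arrow::ListArray::FromArrays(*offsets, *values,
                                      arrow::default_memory_pool(), out);
}
```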

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1224 from wesm/ARROW-507 and squashes the following commits:

9027c14 [Wes McKinney] Clean valid bits to remove trailing set bit
8d2cb51 [Wes McKinney] Implement / add tests for ListArray.from_arrays in Python
1c6a870 [Wes McKinney] Complete C++ implementation, unit test for ListArray::FromArrays, handling of offsets with nulls
This patch includes ARROW-1172, which I put up separately. Putting this up for comments on the API before I go too far down the rabbit hole. The idea is to make it simpler for users to construct record batches given a known schema; for example, this could be used in turbodbc or another database driver.

Incidentally, this closes apache#810.
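
A short sketch of the API direction described above, assuming the `Make`/`GetFieldAs`/`Flush` names from the commit log below:

```cpp
#include "arrow/api.h"

arrow::Status BuildBatch(std::shared_ptr<arrow::RecordBatch>* batch) {
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  RETURN_NOT_OK(arrow::RecordBatchBuilder::Make(
      schema, arrow::default_memory_pool(), &builder));
  // Append through the typed builder for field 0, then flush a completed
  // batch (per this PR, Flush takes a flag controlling whether builders reset).
  auto x_builder = builder->GetFieldAs<arrow::Int64Builder>(0);
  RETURN_NOT_OK(x_builder->Append(42));
  return builder->Flush(batch);
}
```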

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#802 from wesm/ARROW-1114 and squashes the following commits:

5f104c4 [Wes McKinney] Rename Create to Make, remove const method versions
a8b6c5c [Wes McKinney] Update doxygen comments, change FlushAndReset to instead have Flush with a flag for resetting
2573ae7 [Wes McKinney] Test invalid field length
256419f [Wes McKinney] Basic test passing
be7e232 [Wes McKinney] Start table_builder-test.cc
9660502 [Wes McKinney] Draft RecordBatchBuilder, no tests or benchmarks yet
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1238 from wesm/ARROW-1654 and squashes the following commits:

2e6f9e3 [Wes McKinney] Add pickling test cases for timestamp, decimal
1827b23 [Wes McKinney] Fix pickling on py27, implement for Schema. Also pickle field/schema metadata
1395583 [Wes McKinney] Implement pickling for list, struct, add __richcmp__ for Field
366f428 [Wes McKinney] Start implementing pickling for DataType, Field
This closes [ARROW-1720](https://issues.apache.org/jira/browse/ARROW-1720).

Author: Licht-T <licht-t@outlook.jp>

Closes apache#1243 from Licht-T/fix-unbound-chunk and squashes the following commits:

cabdd43 [Licht-T] TST: Add bounds check tests for chunk getter
bda7f4c [Licht-T] BUG: Implement bounds check in chunk getter
This got messed up during one of the patches in which these files were refactored. Once the build fails, I will fix the lint errors

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1242 from wesm/ARROW-1711 and squashes the following commits:

cd4b655 [Wes McKinney] Fix more flake8 warnings
2eb8bf4 [Wes McKinney] Fix flake8 issues
cef7a7c [Wes McKinney] Fix flake8 calls to lint the right directories
… to avoid using nullptr in public headers

cc @TobyShaw. Can you test this?

Closes apache#1098

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1228 from wesm/ARROW-1134 and squashes the following commits:

bf18158 [Wes McKinney] Only define NULLPTR if not already defined
a51dd88 [Wes McKinney] Add NULLPTR macro to avoid using nullptr in public headers for C++/CLI users
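
The NULLPTR change above can be sketched as a header guard ("only define NULLPTR if not already defined"); the C++/CLI branch using MSVC's `__nullptr` is an assumption:

```cpp
// nullptr is not usable in code compiled as C++/CLI, so public headers use a
// macro that client code can override.
#ifndef NULLPTR
#ifdef __cplusplus_cli
#define NULLPTR __nullptr  // assumption: MSVC C++/CLI null literal
#else
#define NULLPTR nullptr
#endif
#endif
```
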
Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#1211 from cpcloud/ARROW-1588 and squashes the following commits:

ae0d562 [Phillip Cloud] ARROW-1588: [C++/Format] Harden Decimal Format
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1247 from kou/c-glib-release-verify and squashes the following commits:

e9f2307 [Kouhei Sutou] [GLib] Add setup description to verify C GLib build
…p units

Several JIRAs here that made sense to tackle together:

* ARROW-1680
* ARROW-1482
* ARROW-1484
* ARROW-1524

This also fixes bugs relating to ignoring the offset in sliced arrays in some of the cast kernel implementations.

cc @BryanCutler @xhochy @cpcloud

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1245 from wesm/ARROW-1484 and squashes the following commits:

6f6107f [Wes McKinney] Test that timestamp truncation raises for safe casts
1d28ce2 [Wes McKinney] Fix typo
a820ed9 [Wes McKinney] Add test case for ARROW-1680
ac9845e [Wes McKinney] Implement date32<->date64 casts, unit tests
4911b09 [Wes McKinney] Add time casts
6441a95 [Wes McKinney] Write unit tests for timestamp to timestamp casts
1bde233 [Wes McKinney] Cast table for timestamps
569dfb8 [Wes McKinney] Fix existing cast tests with sliced arrays
884f6d2 [Wes McKinney] Baby steps
a040136 [Wes McKinney] Placeholders for time32/64/timestamp cast functions
fddf466 [Wes McKinney] Stub out unit test for timestamp unit change
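
The commit above implements unit-changing timestamp casts with safety checks; here is a minimal sketch of invoking such a cast. The `CastOptions` field name and header paths are assumptions about this era's API:

```cpp
#include "arrow/api.h"
#include "arrow/compute/api.h"

// Cast timestamp[ms] -> timestamp[s]; a "safe" cast errors out if values
// would be truncated rather than silently losing sub-second precision.
arrow::Status CastToSeconds(const arrow::Array& ts_millis,
                            std::shared_ptr<arrow::Array>* out) {
  arrow::compute::FunctionContext ctx(arrow::default_memory_pool());
  arrow::compute::CastOptions options;
  options.allow_time_truncate = false;  // assumed field controlling safety
  return arrow::compute::Cast(&ctx, ts_millis,
                              arrow::timestamp(arrow::TimeUnit::SECOND),
                              options, out);
}
```
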
…'t supported in numpy_to_arrow.cc

This closes [ARROW-1721](https://issues.apache.org/jira/projects/ARROW/issues/ARROW-1721).

Author: Licht-T <licht-t@outlook.jp>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1246 from Licht-T/feature-object-from_pandas-mask and squashes the following commits:

41a1229 [Wes McKinney] Fix flake8 issues
d754533 [Licht-T] Fix lint issues by clang-format-4.0
7ef7f78 [Licht-T] Revert "Fix lint issues"
5c6c182 [Licht-T] Fix lint issues
78d3c3f [Licht-T] TST: Add tests of null-mask check for object types
72030bf [Licht-T] ENH: Implement null-mask check for object types
This also makes Feather writes more robust to columns having a mix of unicode and bytes (these get coerced to binary).

Also resolves ARROW-1672

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1250 from wesm/ARROW-1675 and squashes the following commits:

9d55886 [Wes McKinney] Use RecordBatch.from_pandas in Feather write path. Fix test cases
…test pandas roundtrips

I ran into this rough edge today. Invariably, serialization code paths will need to send across a DataFrame with no columns, so this needs to work even if `preserve_index=False`.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1252 from wesm/ARROW-1732 and squashes the following commits:

a240c05 [Wes McKinney] Permit creating record batches with no columns, test pandas roundtrips
This makes performance debugging much easier, as it allows you to track down what (Arrow) data is causing unexpected delays in loading. It also makes testing features like ARROW-1689 easier, as you can prove (via unit tests) that copies are not being made.

Author: Nick White <n.j.white@gmail.com>

Closes apache#1233 from njwhite/feature/zerocopycategories and squashes the following commits:

b06f50d [Nick White] ARROW-1689 Don't Deserialize the Dictionary Twice
a968b0b [Nick White] ARROW-1689 Allow User To Request No Data Copies
This was ported from parquet-mr/parquet-cpp and modified to work in Python 2 and 3

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1248 from wesm/ARROW-587 and squashes the following commits:

da49b1c [Wes McKinney] Another buglet
d65a1b5 [Wes McKinney] Fix a python2 only statement
938b6dc [Wes McKinney] Fix script to pass right JIRA object
bbbce86 [Wes McKinney] Add fix version to PR merge tool (ported from apache/parquet-mr)
robertnishihara and others added 21 commits January 20, 2018 13:41
…tRequest.

Author: Robert Nishihara <robertnishihara@gmail.com>

Closes apache#1479 from robertnishihara/deduplicatefiledescriptors and squashes the following commits:

9be9643 [Robert Nishihara] Fix bug.
8a827cf [Robert Nishihara] Remove mmap_size from PlasmaObject.
ab30d7d [Robert Nishihara] Fix tests.
2916e87 [Robert Nishihara] Remove mmap_size from PlasmaObjectSpec, and file_descriptor -> fd.
7f5c618 [Robert Nishihara] Deduplicate file descriptors when store replies to Get.
ab12d63 [Robert Nishihara] Make Create return a MutableBuffer.
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1494 from kou/glib-support-distclean and squashes the following commits:

d660e0f [Kouhei Sutou] [GLib] Support "make distclean"
Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#1489 from cpcloud/ARROW-1580 and squashes the following commits:

ff81567 [Phillip Cloud] Move to sphinx
700d2c5 [Phillip Cloud] Remove link to nightlies
9fbe9ac [Phillip Cloud] Add build artifact location
cb9f2a5 [Phillip Cloud] [Python] Instructions for setting up nightly builds on Linux
…trings_to_categorical` is True

This closes [ARROW-1997](https://issues.apache.org/jira/browse/ARROW-1997).

The problem is:
```python
>>> import pandas as pd
>>> import pyarrow as pa
>>>
>>> df = pd.DataFrame({
... 'Foo': ['A', 'A', 'B', 'B']
... })
>>> table = pa.Table.from_pandas(df)
>>> df = table.to_pandas(strings_to_categorical=True)
<class 'pandas.core.categorical.Categorical'>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 1043, in pyarrow.lib.Table.to_pandas
  File "pyarrow/pandas_compat.py", line 535, in table_to_blockmanager
    blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
  File "pyarrow/pandas_compat.py", line 629, in _table_to_blocks
    return [_reconstruct_block(item) for item in result]
  File "pyarrow/pandas_compat.py", line 436, in _reconstruct_block
    ordered=item['ordered'])
  File "/home/rito/miniconda3/envs/pyarrow-dev-27/lib/python2.7/site-packages/pandas/core/categorical.py", line 624, in from_codes
    raise ValueError("codes need to be between -1 and "
ValueError: codes need to be between -1 and len(categories)-1
```

When `strings_to_categorical=True`, the categorical index is newly created in the `to_pandas` procedure.
But the data is passed to Python by zero copy, so the newly created array is deallocated while its memory is still referenced.
https://github.com/Licht-T/arrow/blob/be58af6dd0333652abbe2333ee5968df3f2e371f/cpp/src/arrow/python/arrow_to_pandas.cc#L1040

Author: Licht-T <licht-t@outlook.jp>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1480 from Licht-T/fix-to_pandas-with-strings_to_categorical and squashes the following commits:

61eac9c [Wes McKinney] Adjust error message
c1bc353 [Licht-T] TST: Add test for to_pandas no-NA strings with strings_to_categorical
cce3f50 [Licht-T] BUG: Avoid zero-copy-option in to_pandas when strings_to_categorical is True
Author: yosuke shiro <ys2014hs@gmail.com>

Closes apache#1496 from shiro615/build-instruction-on-macos-and-homebrew-is-incomplete and squashes the following commits:

6b32f68 [yosuke shiro] [C++] fix Build instruction on macOS and Homebrew
…lue data

Modified BinaryBuilder::Resize(int64_t) so that when building BinaryArrays with a known size, space is also reserved for value_data_builder_ to prevent internal reallocation.
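
A short sketch of using the new reservation method, assuming `ReserveData` takes the total number of value bytes:

```cpp
#include <string>
#include <vector>
#include "arrow/api.h"

arrow::Status BuildBinary(const std::vector<std::string>& values,
                          std::shared_ptr<arrow::Array>* out) {
  arrow::BinaryBuilder builder;
  int64_t total_bytes = 0;
  for (const auto& v : values) total_bytes += static_cast<int64_t>(v.size());
  RETURN_NOT_OK(builder.Resize(static_cast<int64_t>(values.size())));
  RETURN_NOT_OK(builder.ReserveData(total_bytes));  // no reallocation while appending
  for (const auto& v : values) {
    RETURN_NOT_OK(builder.Append(reinterpret_cast<const uint8_t*>(v.data()),
                                 static_cast<int32_t>(v.size())));
  }
  return builder.Finish(out);
}
```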

Author: Panchen Xue <pan.panchen.xue@gmail.com>

Closes apache#1481 from xuepanchen/master and squashes the following commits:

707b67b [Panchen Xue] ARROW-1712: [C++] Fix lint errors
360e601 [Panchen Xue] Merge branch 'master' of https://github.com/xuepanchen/arrow
d4bbd15 [Panchen Xue] ARROW-1712: [C++] Modify test case for BinaryBuilder::ReserveData() and change arguments for offsets_builder_.Resize()
77f8f3c [Panchen Xue] Merge pull request apache#5 from apache/master
bc5db7d [Panchen Xue] ARROW-1712: [C++] Remove unneeded data member in BinaryBuilder and modify test case
5a5b70e [Panchen Xue] Merge pull request apache#4 from apache/master
8e4c892 [Panchen Xue] Merge pull request apache#3 from xuepanchen/xuepanchen-arrow-1712
d3c8202 [Panchen Xue] ARROW-1945: [C++] Fix a small typo
0b07895 [Panchen Xue] ARROW-1945: [C++] Add data_capacity_ to track capacity of value data
18f90fb [Panchen Xue] ARROW-1945: [C++] Add data_capacity_ to track capacity of value data
bbc6527 [Panchen Xue] ARROW-1945: [C++] Update test case for BinaryBuild data value space reservation
15e045c [Panchen Xue] Add test case for array-test.cc
5a5593e [Panchen Xue] Update again ReserveData(int64_t) method for BinaryBuilder
9b5e805 [Panchen Xue] Update ReserveData(int64_t) method signature for BinaryBuilder
8dd5eaa [Panchen Xue] Update builder.cc
b002e0b [Panchen Xue] Remove override keyword from ReserveData(int64_t) method for BinaryBuilder
de318f4 [Panchen Xue] Implement ReserveData(int64_t) method for BinaryBuilder
e0434e6 [Panchen Xue] Add ReserveData(int64_t) and value_data_capacity() for methods for BinaryBuilder
5ebfb32 [Panchen Xue] Add capacity() method for TypedBufferBuilder
5b73c1c [Panchen Xue] Update again BinaryBuilder::Resize(int64_t capacity) in builder.cc
d021c54 [Panchen Xue] Merge pull request apache#2 from xuepanchen/xuepanchen-arrow-1712
232024e [Panchen Xue] Update BinaryBuilder::Resize(int64_t capacity) in builder.cc
c2f8dc4 [Panchen Xue] Merge pull request apache#1 from apache/master
…rks fine

Previously, creating two instances of `HadoopFileSystem` using the same init parameters would result in both pointing to the same `hdfsFS` object. If one `HadoopFileSystem` disconnected, the underlying `hdfsFS` would be closed for both instances.

To fix this, we force a new instance of `hdfsFS` on connect, removing this caching behavior.

Author: Jim Crist <jiminy.crist@gmail.com>

Closes apache#1499 from jcrist/no-cache-hdfs and squashes the following commits:

f8ff135 [Jim Crist] Add test
bf6627e [Jim Crist] Force libhdfs/libhdfs3 to return new FS on connect
…internals.make_block

Apparently this argument is not used at all in pandas, and the pandas developers wish to simply remove it rather than go through a deprecation cycle.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1507 from wesm/ARROW-2003 and squashes the following commits:

a838226 [Wes McKinney] Remove use of fastpath parameter to pandas.core.internals.make_block
* [C++] Update README for linting

There are hidden gotchas when trying to move the linting Makefile targets, so mention them.

* Mention the LLVM releases page
Previously, checking whether the file was closed was subclass-specific and wasn't caught in the HDFS-backed file, leading to program crashes.

This adds a check in `NativeFile.tell` that the file is open, and a test on a few subclasses of `NativeFile` to ensure the error is raised.

Note that since most Python file-like objects raise a `ValueError` for operations after close, I changed the type of the existing error for these cases. This could be changed back, but an error should at least be thrown.

Author: Jim Crist <jiminy.crist@gmail.com>

Closes apache#1502 from jcrist/hdfs-tell-on-closed and squashes the following commits:

8a9dc94 [Jim Crist] NativeFile.tell errors after close
…IST (apache#1497)

* ARROW-2019: [JAVA] Control the memory allocated for inner vector in LIST

* address review comments
Adds support for pickling `HadoopFileSystem`

A few additional small fixes:
- Adds a check that `driver` is one of {'libhdfs3', 'libhdfs'}
- Makes a few small tweaks to the HDFS tests so they are easier to run locally.

Author: Jim Crist <jiminy.crist@gmail.com>

Closes apache#1505 from jcrist/pickle-hdfs-filesystem and squashes the following commits:

b19ed3e [Jim Crist] Compat with older cython versions
1f74726 [Jim Crist] HadoopFileSystem is pickleable
… files when passing flavor='spark'

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1511 from wesm/ARROW-1961 and squashes the following commits:

e13b6b4 [Wes McKinney] Preserve pre-existing schema metadata when sanitizing fields when passing flavor='spark'
…nt64 range

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1504 from wesm/ARROW-2017 and squashes the following commits:

56e67dc [Wes McKinney] Use unsigned PyLong API for uint64 values over int64 range
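
The fix above can be sketched with CPython's unsigned PyLong API; the helper name is hypothetical:

```cpp
#include <Python.h>
#include <cstdint>
#include <limits>

// Box a uint64 value without overflowing the signed conversion path.
PyObject* UInt64ToPyLong(uint64_t value) {
  if (value > static_cast<uint64_t>(std::numeric_limits<int64_t>::max())) {
    return PyLong_FromUnsignedLongLong(value);
  }
  return PyLong_FromLongLong(static_cast<int64_t>(value));
}
```
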
…hen possible for inner arrays

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1509 from wesm/ARROW-2007 and squashes the following commits:

cd12626 [Wes McKinney] Pin thrift-cpp in Appveyor
326c82e [Wes McKinney] Pin Thrift 0.10.0 in toolchain
e334f4e [Wes McKinney] Add explicit type check
db04659 [Wes McKinney] Implement float32 conversions, use NumPy dtype when possible for inner arrays rather than dispatching to the generic sequence routine
Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1516 from pitrou/ARROW-2035-update-cpplint and squashes the following commits:

8f2f892 [Wes McKinney] Fix IWYU errors surfaced by newer cpplint
f5175dd [Antoine Pitrou] ARROW-2035: [C++] Update vendored cpplint.py to a Py3-compatible one
Author: yosuke shiro <ys2014hs@gmail.com>

Closes apache#1521 from shiro615/change-description-from-osx-to-macos and squashes the following commits:

ab03f72 [yosuke shiro] [C++] change description from OS X to macOS
alphalfalfa commented

@wesm @xuepanchen


wesm commented Feb 12, 2018

@alphalfalfa thank you for starting on these kernels; I'm sorry it's taken me a little while to get down to reviewing, but I intend to do so this week.

namespace compute {

struct ARROW_EXPORT TakeOptions {
enum OutOfBoundMode { RAISE, WRAP, CLIP };

Add comments describing the behavior of each option.
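
For instance (illustrative wording, mirroring NumPy's take modes rather than text from this PR):

```cpp
struct ARROW_EXPORT TakeOptions {
  enum OutOfBoundMode {
    RAISE,  // return an error Status for negative or out-of-bounds indices
    WRAP,   // wrap indices around modulo the input length
    CLIP    // clamp indices to the range [0, length - 1]
  };
};
```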

}
};

TEST_F(TestTake, NumericValueTypeCases) {

This could be simplified with GTest's templated tests feature (and type_traits.h).
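
A sketch of that suggestion, assuming a hypothetical fixture `TestTakeNumeric` and this era's `TYPED_TEST_CASE` macro:

```cpp
#include <gtest/gtest.h>
#include "arrow/type.h"

template <typename ArrowType>
class TestTakeNumeric : public ::testing::Test {};

using NumericArrowTypes =
    ::testing::Types<arrow::Int8Type, arrow::Int16Type, arrow::Int32Type,
                     arrow::Int64Type, arrow::FloatType, arrow::DoubleType>;
TYPED_TEST_CASE(TestTakeNumeric, NumericArrowTypes);

TYPED_TEST(TestTakeNumeric, ValueTypeCases) {
  using CType = typename TypeParam::c_type;
  // ... build input/indices/expected arrays of CType and run Take ...
}
```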

const vector<bool>& is_valid_expect,
const TakeOptions& options) {
shared_ptr<Array> input_a, indices_a, expect_a;
if (is_valid_input.empty()) {

Is there a reason why the empty check is done? You can still pass a vector with zero elements.

ASSERT_RAISES(NotImplemented, Take(&ctx_, input, indices, options, &result));
}

template <typename ValueType, typename V, typename IndexType, typename I>

Consider using default template arguments to simplify the call site, e.g.

template <typename ValueType, typename IndexType, typename V = typename ValueType::c_type, typename I = typename IndexType::c_type>

Status Take(FunctionContext* context, const Array& in, const Array& indices,
const TakeOptions& options, std::shared_ptr<Array>* out);

// Single integer index

Why does it require a special case for a single element? I think it adds a lot of clutter; if it's for testing purposes, consider using ArrayFromVector (you already do) or ArrayFromJSON in tests.
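
For reference, a sketch of the suggested alternative, assuming the `ArrayFromVector` helper from Arrow's test utilities:

```cpp
#include "arrow/test-util.h"  // assumed header for ArrayFromVector in this era

void MakeSingleIndex(std::shared_ptr<arrow::Array>* indices) {
  // A length-1 indices array covers the single-index case without a
  // dedicated Take overload.
  arrow::ArrayFromVector<arrow::Int32Type, int32_t>({7}, indices);
}
```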

const Type::type type_id = out->type->id();
if (!(is_primitive(type_id) || type_id == Type::FIXED_SIZE_BINARY ||
type_id == Type::DECIMAL)) {
std::stringstream ss;

No need for a stringstream; see the other comment on the subject.


// ----------------------------------------------------------------------

typedef std::function<void(FunctionContext*,

Why the typedef? Can it just pass the functor around?

class TakeKernel : public UnaryKernel {};

template <>
class TakeKernel<Int32Type> : public UnaryKernel {

Why Int32Type? This seems fishy, as it doesn't appear to be used in the rest of the code.

FN(VALUE_TYPE, UInt64Type);

#define GET_TAKE_FUNCTION(CASE_GENERATOR, ValueType) \
static std::unique_ptr<UnaryKernel> Get##ValueType##TakeFunc( \

I'd make this function return Status and pass the unique_ptr as an output parameter (as the parent call site also does).

func = [](FunctionContext* ctx, const TakeOptions& options, const ArrayData& input, \
const ArrayData& indices, ArrayData* out) { \
TakeFunctor<ValueType, IndexType> func; \
func(ctx, options, input, indices, out); \

There's way too much `func` shadowing involved here.


wesm commented Feb 12, 2019

As this PR is over a year old, I think it probably needs a fresh start.


wesm commented Mar 13, 2019

Closing in favor of new PR, thank you for starting on this @alphalfalfa!
