Skip to content

ARROW-1473: [JAVA] Initial prototype code hierarchy for review#1163

Closed
siddharthteotia wants to merge 14 commits intoapache:masterfrom
siddharthteotia:ARROW-1473
Closed

ARROW-1473: [JAVA] Initial prototype code hierarchy for review#1163
siddharthteotia wants to merge 14 commits intoapache:masterfrom
siddharthteotia:ARROW-1473

Conversation

@siddharthteotia
Copy link
Contributor

cc @jacques-n , @BryanCutler , @icexelloss , @elahrvivaz

  1. BaseFixedWidthVector abstract class - can be extended with a template to generate all the fixed width vector types.

  2. FixedValueVectorsPrototype.java - template similar to what we have currently in FixedValueVectors.java . However, all the common functionality has been moved from the template to BaseFixedWidthVector. So the template pretty much contains the custom mutator and accessor methods which are extremely complex as there is no partial approach to refactor them.

This is why I was suggesting that we should try to get rid of template for fixed value vectors.

Another approach could be to use the template for simple types like INT, FLOAT4, FLOAT8 since the "if" conditions in template code are probably not very complicated for these types. But for types like TIMESTAMP and others, there are giant "if" blocks.

  1. PrototypeIntVectorNonCodegen - Non code generated IntVector class that implements IntVector specific functionality by extending BaseFixedWidthVector. The get() functions in the Accessor bypass ArrowBuf and directly work with the memory address to fish out the value.

  2. BaseVariableWidthVector abstract class - looking at the code of VarCharVector and VarBinaryVector, it seems like the code is 95% same with only 1 or 2 accessor functions being different. So we can implement all the functionality (including mutator and accessor as well) in the base class and then have small non code generated subclasses that just have the specific functionality for VarChar and VarBinary. We will no longer need VariableLengthVectors.java template.

cpcloud and others added 14 commits September 27, 2017 20:58
Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#1128 from cpcloud/ARROW-1607 and squashes the following commits:

3137566 [Phillip Cloud] ARROW-1607: [C++] Implement DictionaryBuilder for Decimals
see apache#1139

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1144 from pcmoritz/plasma-xcode-9-workaround and squashes the following commits:

7642499 [Philipp Moritz] fix on other platforms
c5cdddd [Philipp Moritz] xcode 9 compilation workaround
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1141 from xhochy/ARROW-1620 and squashes the following commits:

30da182 [Uwe L. Korn] ARROW-1620: Python: Download Boost in manylinux1 build from bintray
cc @jacques-n , @icexelloss , @BryanCutler

This is initial small phase of our attempt to reduce heap usage per vector.

As part of investigation, we realized its better to address few things as part of subtasks for ARROW-1463. Accordingly, I need to update the requirements document w.r.t heap usage for ARROW-1471.

This patch gets rid of Release Listener object in Allocation Manager as all the logic is implemented as part of AllocationManager itself.

https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit

Author: siddharth <siddharth@dremio.com>

Closes apache#1142 from siddharthteotia/ARROW-1618 and squashes the following commits:

77151a2 [siddharth] ARROW-1618: Reduce Heap Usage (Phase 1)
When reading a vector in JsonFileReader, lastSet should be set in VariableWidthVectors after reading inner vectors or else subsequent operations could corrupt the offsets.  This also allows to simplify some of the related code.  Additionally, ListVector.lastSet should be explicitly initialized to 0, which is it's starting offset.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1140 from BryanCutler/java-JsonReader-setLast-ARROW-1619 and squashes the following commits:

8f97a3d [Bryan Cutler] added test for VarBinaryVector that checks lastSet after reading
70df0cc [Bryan Cutler] set lastSet in JsonFileReader and initialize lastSet for ListVector
…CxxFlags.cmake

Author: Rene Sugar <rene.sugar@gmail.com>

Closes apache#1145 from renesugar/ARROW-1615 and squashes the following commits:

71a615e [Rene Sugar] ARROW-1615 Add -Wno-vla-extension and change non-checkin builds back to -Wall
1895843 [Rene Sugar] ARROW-1615 Add -Wno-cast-align
5fe4e8e [Rene Sugar] ARROW-1615 Move -Wno-shorten-64-to-32 after -Wconversion
9d3c7ec [Rene Sugar] ARROW-1615 Identify compiler version for clang-802 plus more warning entries
5ebaf86 [Rene Sugar] ARROW-1615 Moved version specific warning entry
971e61a [Rene Sugar] ARROW-1615 Fixed version specific warning entry
6cf2497 [Rene Sugar] ARROW-1615 Added more version specific Clang warning entries
50def43 [Rene Sugar] ARROW-1615 Updated build warning level terminology
ea906eb [Rene Sugar] ARROW-1615 Check compiler version before disabling some warnings
159e189 [Rene Sugar] ARROW-1615 Include CompilerInfo before SetupCxxFlags in arrow/python
8359c96 [Rene Sugar] ARROW-1615 Added BUILD_WARNING_LEVEL and BUILD_WARNING_FLAGS to SetupCxxFlags.cmake
Many other libraries interchange binary data with `std::string`. This makes it easy to wrap such data in an `arrow::Buffer`.

It may be worth adding a function that creates a buffer from a string, but owns its memory.

I also deprecated `arrow::GetBufferFromString`, which shouldn't have been public in the first place, since the new ctor is more general

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1147 from wesm/ARROW-1600 and squashes the following commits:

f60f502 [Wes McKinney] Remove TestBuffer fixture
644bf2b [Wes McKinney] Add Buffer ctor that wraps std::string
…riginating in pandas

This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, it is always zero-copy, but then `from_pandas=True` and no null mask is passed, then a null bitmap is constructed and populated.

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1146 from wesm/expand-py-array-method and squashes the following commits:

1570e52 [Wes McKinney] Code review comments
d3bbb3c [Wes McKinney] Handle type aliases in cast, too
797f015 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575 [Wes McKinney] Add direct types sequence converters for more data types
cf40b76 [Wes McKinney] Add type aliases, some unit tests
7b530e4 [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array
…is tool called infer

Author: Rene Sugar <rene.sugar@gmail.com>

Closes apache#1149 from renesugar/infer and squashes the following commits:

8591b5f [Rene Sugar] ARROW-1626 Add make targets to run the inter-procedural static analysis tool called infer
…ppressions

I'm going to quick make a pass through later today and see how many of these warning suppressions I can remove

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1148 from wesm/warning-fixes and squashes the following commits:

e930152 [Wes McKinney] Only build compute modules if -DARROW_COMPUTE=ON
d6ca7ac [Wes McKinney] Slight refactor of CMakeLists.txt to move Arrow library setup to src/arrow
e2c61a2 [Wes McKinney] Use -Wno-unknown-warning-option
3d2d726 [Wes McKinney] Fix travis CI script
6d0d411 [Wes McKinney] Use BUILD_WARNING_LEVEL in Travis CI
1bec4a7 [Wes McKinney] Fix some more compiler warnings
cae05fb [Wes McKinney] Fix documentation compiler warnings
f76b2b9 [Wes McKinney] Fix a bunch of documentation warnings
ac54e2d [Wes McKinney] Remove some clang warning suppressions, fix warnings
6935b8c [Wes McKinney] Fix compiler warnings with clang-4.0
…n infer tool output

This was an interesting journey through some esoterica. I went through all the warnings/errors that infer (fbinfer.com) outputs and made changes if it seemed warranted. Some of the checks might be overkill.

See https://gist.github.com/wesm/fc6809e4f4aaef3ecfeb21b8123627bc for a summary of actions on each warning

Most of the errors that Infer wasn't happy about were already addressed by DCHECKs. This was useful to go through all these cases -- in nearly all cases the null references are impossible or would be the result of an error on behalf of the application programmer. For example: we do not do array boundschecking in most cases in production builds, but these boundschecks are included in debug builds to assist with catching bugs caused by improper use by application developers.

As a matter of convention, we should not use error `Status` to do parameter validation or asserting pre-conditions that are the responsibility of the library user. If parameter validation is required in binding code (e.g. Python), then this validation should happen in the binding layer, not in the core C++ library.

There are some other cases where we have a `std::shared_ptr<T>` out variable with code like:

```
RETURN_NOT_OK(Foo(..., &out));
out->Method(...);
```

Here, infer complains that `out` could contain a null pointer, but our contract with developers is that if `Foo` returns successfully that `out` is non-null.

Interestingly, infer doesn't like some stack variables that are bound in C++11 lambda expressions. I noted these in the gist with `LAMBDA`.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1151 from wesm/fix-infer-issues and squashes the following commits:

f285be9 [Wes McKinney] Restore code paths for empty chunked arrays for backwards compat
5aa86ce [Wes McKinney] More DCHECK esoterica / tweaks based on infer report
22c5d36 [Wes McKinney] Address a couple more infer warnings
75131a6 [Wes McKinney] Some more minor infer fixes
5ff3e3a [Wes McKinney] Compilation fix
05316ce [Wes McKinney] Fix miscellaneous things that infer does not like. Make some Python helper functions internal / non-exported
… platforms

Close apache#1136

Change-Id: Ib9261972fb720df351931902a1666f23b0a0132f
…lization

This PR adds support for OrderedDicts and default dicts using custom serialization handlers.

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1152 from pcmoritz/pydict-exact2 and squashes the following commits:

431e027 [Philipp Moritz] make cloudpickle optional
052b1aa [Philipp Moritz] I'd prefer this not to be a runtime dependency
db19ab9 [Philipp Moritz] add tests
799d983 [Philipp Moritz] do not interpret OrderedDict as dict
@siddharthteotia
Copy link
Contributor Author

Please ignore this PR. Something is wrong with my git environment. Closing this one and will recreate.

@siddharthteotia siddharthteotia deleted the ARROW-1473 branch October 4, 2017 18:05
@wesm
Copy link
Member

wesm commented Oct 4, 2017

@siddharthteotia this could have been fixed by rebasing on apache/master. I rebased master after the 0.7.1 release vote passed

@siddharthteotia
Copy link
Contributor Author

Yes, that's exactly what I did. Later realized that the PR would have been fixed automatically

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants