
ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8 #8401

Closed
wants to merge 12 commits

Conversation

jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Oct 8, 2020

This PR proposes adding support for the C data interface by implementing the functionality needed to both consume and produce structs under its ABI and lifetime rules.

For now this is limited to primitive types and strings (utf8), but it generalizes easily to all types whose data is encapsulated in ArrayData (things with buffers and child data).

Some design choices:

  • import and export do not care about the type of the data that is in memory (previously BufferData, now Bytes) - they only care about how it is converted between ArrayData and the C data interface.
  • import wraps incoming pointers in a struct behind an Arc, so that we refcount them thread-safely and can share them between buffers, arrays, etc.
  • export places Buffers in private_data for bookkeeping and releases them when the consumer invokes release.
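
The export bookkeeping above is a standard FFI ownership pattern. Here is a minimal sketch of it with simplified, hypothetical types (the real struct is the spec's FFI_ArrowArray, which carries many more fields): the producer moves its buffers into `private_data`, and the `release` callback reclaims them when the consumer is done.

```rust
use std::os::raw::c_void;

// Hypothetical, stripped-down stand-in for the C data interface's array
// struct: only the two fields relevant to the bookkeeping pattern.
#[repr(C)]
struct CArray {
    private_data: *mut c_void,
    release: Option<unsafe extern "C" fn(*mut CArray)>,
}

// Producer-side state: owning the buffers keeps the exported memory alive.
struct PrivateData {
    _buffers: Vec<Vec<u8>>,
}

unsafe extern "C" fn release_array(array: *mut CArray) {
    // Reconstruct the Box so Rust drops (frees) the buffers.
    drop(Box::from_raw((*array).private_data as *mut PrivateData));
    // Per the spec, a released struct marks itself by nulling `release`.
    (*array).release = None;
}

fn export(buffers: Vec<Vec<u8>>) -> CArray {
    CArray {
        private_data: Box::into_raw(Box::new(PrivateData { _buffers: buffers })) as *mut c_void,
        release: Some(release_array),
    }
}

fn main() {
    let mut array = export(vec![vec![1u8, 2, 3]]);
    // The consumer signals it is done by invoking `release`.
    if let Some(release) = array.release {
        unsafe { release(&mut array) };
    }
    assert!(array.release.is_none());
}
```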

I do not expect this PR to be easy to review, as it touches sensitive (aka unsafe) code. However, based on the tests I have done so far, I am sufficiently happy with it to open this PR.

This PR has three main parts:

  1. Addition of an ffi module that contains the import and export functionality
  2. Add some helpers to import and export an Array from C Data Interface
  3. A crate to test this against Python/C++'s API

It also does a small refactor of BufferData, renaming it to Bytes (motivated by the popular bytes crate), and moving it to a separate file.

What is tested:

  • round-trip Python -> Rust -> Python (new separate crate, arrow-c-integration)
  • round-trip Rust -> Python -> Rust (new separate crate, arrow-c-integration)
  • round-trip Rust -> Rust -> Rust
  • memory allocation counts

Finally, this PR owes a large contribution to @pitrou, who took a lot of his time to explain to me how the C++ implementation does it and the main things I had to worry about here.

@github-actions bot commented Oct 8, 2020

@kou kou changed the title ARROW-10109: Add support to the C data interface for primitive types ARROW-10109: [Rust] Add support to the C data interface for primitive types Oct 8, 2020
@jorgecarleitao jorgecarleitao changed the title ARROW-10109: [Rust] Add support to the C data interface for primitive types ARROW-10109: [Rust] Add support to the C data interface for primitive types and string Oct 9, 2020
@jorgecarleitao jorgecarleitao changed the title ARROW-10109: [Rust] Add support to the C data interface for primitive types and string ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8 Oct 9, 2020
@andygrove
Member

Thanks @jorgecarleitao, this is really interesting. I've had a first pass through to get familiar with it, and I will hopefully try building it locally sometime this weekend.

@pitrou
Member

pitrou commented Oct 12, 2020

memory allocation counts (I am still trying to find a way of doing this in Rust)

Just FTR, in C++ we have a global function that returns the number of currently Arrow-allocated bytes. This helps us write crude resource allocation tests (here through the Python wrapper pyarrow.get_total_allocated_bytes):
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py#L116-L131

Another possibility would be to use the callback facility on your Bytes object to check that the destructor was called at some point.
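
The callback idea can be sketched as follows: a hypothetical, simplified `Bytes`-like owner whose destructor flips a shared flag, so a test can assert the destructor actually ran (in the real type, this is where foreign memory would be released).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the crate's `Bytes`: the only point here is
// observing that its destructor runs when the owner lets go.
struct Bytes {
    _data: Vec<u8>,
    dropped: Arc<AtomicBool>,
}

impl Drop for Bytes {
    fn drop(&mut self) {
        // In the real type, foreign memory would be released here.
        self.dropped.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let dropped = Arc::new(AtomicBool::new(false));
    let bytes = Bytes { _data: vec![0u8; 64], dropped: dropped.clone() };
    assert!(!dropped.load(Ordering::SeqCst));
    drop(bytes); // the consumer releases its reference
    assert!(dropped.load(Ordering::SeqCst), "destructor did not run");
}
```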

@jorgecarleitao
Member Author

Some changes since last time:

  1. added a small tool to verify memory allocations, and verified that there are no leaks
  2. added the test that @pitrou suggested, to verify that we call C++'s release.

The tool counts all memory allocations and deallocations, like in C++. It runs as a final test after all other tests, validating that the test suite does not leak.

I've placed it behind a feature gate, as it currently only works in single-threaded programs. Suggestions to improve it further are welcome.
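
The PR's tool tracks only Arrow's own allocations; as a general illustration of the idea, a counting wrapper around the system allocator (using an atomic counter, so it also works multi-threaded) could look like this:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicIsize, Ordering};

// Net allocated bytes; a leak-free program returns to its baseline.
static ALLOCATED: AtomicIsize = AtomicIsize::new(0);

struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size() as isize, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size() as isize, Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    {
        let _v: Vec<u8> = Vec::with_capacity(1024); // allocates 1024 bytes
    } // dropped here: the same 1024 bytes are deallocated
    assert_eq!(ALLOCATED.load(Ordering::Relaxed), before, "leak detected");
}
```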

I think that this is now ready to review.

@nevi-me nevi-me added the needs-rebase A PR that needs to be rebased by the author label Nov 7, 2020
nevi-me pushed a commit that referenced this pull request Nov 7, 2020
This is a major refactor of the `equal.rs` module.

The rationale for this change is manifold:

* currently, array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomic, as the user must "guess" what to downcast for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and in some of the builders' tests.
* the code in array comparison is difficult to follow, given the number of calls it performs around offsets.
* the implementation indirectly uses many of our `unsafe` APIs (via pointer arithmetic), which makes it risky to operate on and mutate.
* some code is repeated.

This PR:

1. adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. makes array equality depend only on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related APIs) to perform comparisons
3. significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. makes array equality statically dispatched, via `match datatype`
5. DRYs the code around array equality
6. fixes an error in equality of dictionaries with equal values
7. adds tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. splits `equal.rs` into smaller, more manageable files
9. removes `ArrayListOps`, since it is no longer needed
10. moves JSON equality to its own module, for clarity
11. removes the need to have two functions per type to compare arrays
12. adds the number of buffers and their respective widths to datatypes, from the specification (backported from #8401)
13. adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only for `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on in #8200.
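
The shape of the main change can be sketched with toy types (`ArrayData`, `Array`, `PrimitiveArray` are real names from the crate, but these definitions are heavily simplified for illustration): equality lives on the trait object and goes through `Array::data`, so callers never downcast.

```rust
// Toy stand-ins for the crate's types.
struct ArrayData {
    values: Vec<i64>,
    validity: Vec<bool>,
}

fn data_equal(lhs: &ArrayData, rhs: &ArrayData) -> bool {
    // In the real crate this statically dispatches on the datatype;
    // here we compare the fields directly.
    lhs.values == rhs.values && lhs.validity == rhs.validity
}

trait Array {
    fn data(&self) -> &ArrayData;
}

// The key change: equality on the trait object, based on `Array::data`,
// so callers no longer need to downcast to concrete array types.
impl PartialEq for dyn Array {
    fn eq(&self, other: &Self) -> bool {
        data_equal(self.data(), other.data())
    }
}

struct PrimitiveArray(ArrayData);

impl Array for PrimitiveArray {
    fn data(&self) -> &ArrayData {
        &self.0
    }
}

fn main() {
    let a: Box<dyn Array> =
        Box::new(PrimitiveArray(ArrayData { values: vec![1, 2], validity: vec![true, true] }));
    let b: Box<dyn Array> =
        Box::new(PrimitiveArray(ArrayData { values: vec![1, 2], validity: vec![true, true] }));
    assert!(*a == *b); // no downcasting required
}
```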

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

<details>
 <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.

Closes #8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@jorgecarleitao
Copy link
Member Author

@nevi-me, @andygrove, @pitrou, @alamb, I have rebased this PR.

I need your guidance here:

  • is this something that we still want to pursue, or should we close this? If we keep it open:
  • Currently, memory tracking is behind a feature gate. This is faster, but running the tests with tracking requires a separate compilation.
  • Currently, memory leaks are checked via a test that runs after all other tests (and under the feature gate). This covers all tests implicitly, but tests that panic are intrinsically leaky, so there is a non-trivial interaction between the tests and the memory-check test.
  • The integration test with Python/C++ requires another compilation, with other compiler flags, which is an extra CI burden.

Some ideas:

  • Refactor the memory check to run on a per-test basis, so that tests do not interact. The downside is that we need to perform the memory check explicitly in each test we want covered.
  • Move the allocation/deallocation counting outside the feature gate. This may make the code slower (I need to profile), as we need to update an atomic counter, but it significantly reduces the complexity around CI (no need to re-compile with a different feature gate).
  • Make the Python/C++ tests use C++ libs and headers instead of running against pyarrow. This is something I do not know how to do, so it will take me some time.

@jorgecarleitao jorgecarleitao removed the needs-rebase A PR that needs to be rebased by the author label Nov 11, 2020
@alamb (Contributor) left a comment

Sorry it took so long to review this PR -- there was a lot here.

Something I had missed earlier was the rationale for this work, which is well spelled out in https://issues.apache.org/jira/browse/ARROW-10109, but I hadn't read it before. I recommend reviewing that description to gain more context for this PR, for anyone else who is interested.

I am not an expert in any of these technologies, but I did read the PR carefully and it makes sense to me.

is this something that we still want to pursue, or should we close this?

I think it is definitely worth pursuing. Thank you for doing so!

Currently, memory tracking is behind a feature gate. This is faster, but running the tests with tracking requires a separate compilation.

I think we should always track memory; as @pitrou says, the overhead of updating a counter (especially as the semantics of doing so can be very relaxed) is likely to be noise compared to the actual work of the allocation itself.

Currently, memory leaks are checked via a test that runs after all other tests (and under the feature gate). This covers all tests implicitly, but tests that panic are intrinsically leaky, so there is a non-trivial interaction between the tests and the memory-check test.

I don't understand the assertion that tests that panic! are leaky (the stack is still unwound in an orderly fashion and allocations are Drop'd, as I understand it), so I think testing at the end is fine unless we get evidence to the contrary.

The integration test with Python/C++ requires another compilation, with other compiler flags, which is an extra CI burden.

Maybe we could look into extending some of the existing Python tests, or only running the integration test when the ffi or Python-binding code changes, to reduce the burden.

// specific language governing permissions and limitations
// under the License.

//! Contains functionality to load an ArrayData from the C Data Interface
Contributor

Suggested change
//! Contains functionality to load an ArrayData from the C Data Interface
//! Contains functionality to load data to/from `ArrayData` from the C Data Interface

@jorgecarleitao
Member Author

I made the following changes:

  1. Made the allocation counter an AtomicIsize and removed it from the feature gate
  2. Added CI
  3. Renamed the crate interacting with Python to arrow-pyarrow-integration-testing

The CI will likely fail, as I will need to fiddle with it. There will be some spam. Sorry guys!

@codecov-io

codecov-io commented Nov 21, 2020

Codecov Report

Merging #8401 (faabc3d) into master (322cd01) will decrease coverage by 0.07%.
The diff coverage is 76.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8401      +/-   ##
==========================================
- Coverage   84.54%   84.47%   -0.08%     
==========================================
  Files         185      190       +5     
  Lines       46176    46836     +660     
==========================================
+ Hits        39041    39565     +524     
- Misses       7135     7271     +136     
Impacted Files Coverage Δ
rust/arrow-pyarrow-integration-testing/src/lib.rs 0.00% <0.00%> (ø)
rust/arrow/src/array/array.rs 83.49% <0.00%> (-7.04%) ⬇️
rust/arrow/src/array/transform/boolean.rs 76.92% <0.00%> (-23.08%) ⬇️
rust/arrow/src/array/transform/list.rs 36.36% <0.00%> (-8.09%) ⬇️
rust/arrow/src/compute/kernels/comparison.rs 96.27% <ø> (ø)
rust/arrow/src/compute/kernels/filter.rs 61.27% <ø> (ø)
rust/arrow/src/compute/kernels/limit.rs 100.00% <ø> (ø)
rust/arrow/src/error.rs 5.00% <0.00%> (-0.27%) ⬇️
rust/arrow/src/record_batch.rs 88.29% <ø> (-2.06%) ⬇️
rust/datafusion/src/logical_plan/plan.rs 83.68% <ø> (ø)
... and 63 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17805f3...faabc3d. Read the comment docs.

@jorgecarleitao
Member Author

@kszucs, I am trying to fix the Dev / Source Release and Merge Script check, but I have been unable to: I do not understand what I need to do beyond what I have already done.

From what I understood, this script is used to replace versions throughout the project. I added a new section for the new crate, but I cannot understand why the test is failing.

@jorgecarleitao
Member Author

@alamb @pitrou , @paddyhoran @andygrove , the integration with CI is in place, the tests pass, and all comments from @alamb and @pitrou were addressed. If anyone wants to take a final pass, please let me know, otherwise, for me this is ready to merge.

/// Mode of deallocating memory regions
pub enum Deallocation {
    /// Native deallocation, using Rust deallocator with Arrow-specific memory alignment
    Native(usize),
Contributor

I wonder whether it's worth optimizing the size of the Bytes struct. If we use NonZeroUsize here, the enum itself might become only the size of one usize, but I haven't tried whether that really works.
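
It does work in practice, at least for a simplified shape of the enum (note: the real `Deallocation::Foreign` carries a payload, which changes the arithmetic, and enum layout is not guaranteed by the language, though rustc applies this niche optimization reliably):

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

// With a plain usize payload the enum needs a separate discriminant word...
#[allow(dead_code)]
enum DeallocPlain {
    Native(usize),
    Foreign,
}

// ...while NonZeroUsize's forbidden zero value gives the compiler a niche
// to encode `Foreign` in, collapsing the enum to a single word.
#[allow(dead_code)]
enum DeallocNiche {
    Native(NonZeroUsize),
    Foreign,
}

fn main() {
    assert_eq!(size_of::<DeallocPlain>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<DeallocNiche>(), size_of::<usize>());
}
```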

Deallocation::Native(capacity) => capacity,
// we cannot determine this in general,
// and thus we state that this is externally-owned memory
Deallocation::Foreign(_) => 0,
Contributor

Returning len here might be better, as it would keep the invariant that capacity() >= len().

Member Author

Good point

Have you seen a good use case for having capacity public? IMO it is an implementation detail of the buffer, related to how much it expands on every reserve / resize. Since we perform the check internally for safety reasons, using it outside of buffer seems unnecessary.

Contributor

Also a good point. It seems it is only used for reporting the memory usage of arrays, and you could argue that that does not need to include foreign memory. For that use case the method could have a better name, like memory_usage; with the current name, I think returning len sounds more logical.

*(self.array.buffers as *mut *const u8).add(1) as *const i32
};
// get last offset
(unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize
Contributor

I spent several minutes thinking about whether the - 1 is correct; it seems it is, because len is the length of the offset buffer, which is one larger than the length of the corresponding array. Not sure how to phrase that in a small comment, though.

Member Author

Thinking about this, an alternative is to safely cast all of this to i32 and perform the op on &[i32]. IMO the current code is pretty unsafe if someone sends us data with the wrong alignment. :/
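
A sketch of that alternative, with hypothetical helper names: a utf8 array's offsets buffer holds len + 1 i32 values, and the last one is the total byte length of the values buffer. The cast rejects misaligned or truncated input instead of silently misreading it.

```rust
/// Reinterpret the raw offsets bytes as `&[i32]`, failing on misalignment.
fn cast_offsets(bytes: &[u8]) -> Option<&[i32]> {
    // align_to is sound for i32 (any bit pattern is a valid i32); a
    // non-empty prefix/suffix means the data was misaligned or truncated.
    let (prefix, slice, suffix) = unsafe { bytes.align_to::<i32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(slice)
    } else {
        None
    }
}

/// The last offset is the total length of the values buffer.
fn last_offset(offsets: &[i32]) -> usize {
    *offsets.last().expect("offsets buffer is never empty") as usize
}

fn main() {
    // offsets for the strings ["ab", "c", "def"]
    let offsets = [0i32, 2, 3, 6];
    assert_eq!(last_offset(&offsets), 6);

    // A Vec<u8> carries no i32 alignment guarantee, so the cast may
    // legitimately fail; handle both outcomes.
    let bytes: Vec<u8> = offsets.iter().flat_map(|o| o.to_ne_bytes()).collect();
    if let Some(s) = cast_offsets(&bytes) {
        assert_eq!(last_offset(s), 6);
    }
}
```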

Member Author

@jhorstmann, I agree with you. I am working on this in a separate PR; it is not forgotten.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 1, 2020
@jorgecarleitao jorgecarleitao deleted the arrow-c-inte branch December 4, 2020 07:41
@jorgecarleitao jorgecarleitao restored the arrow-c-inte branch December 4, 2020 07:41
@jorgecarleitao jorgecarleitao reopened this Dec 4, 2020
@jorgecarleitao jorgecarleitao deleted the arrow-c-inte branch December 5, 2020 08:10
@jorgecarleitao
Member Author

I merged this and marked all associated issues as resolved.

We can now FFI to and from C++ and pyarrow 🎉

@pitrou
Member

pitrou commented Dec 5, 2020

Wahoo! This is a great milestone! You could announce it on the mailing list.

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021 (the same `equal.rs` refactor commit message quoted earlier in this thread).
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021 (the merge commit of this PR, with the same description quoted at the top of this thread)

Closes apache#8401 from jorgecarleitao/arrow-c-inte

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
7 participants