
ARROW-8883: [Rust] [Integration] Enable more tests #8200

Closed

Conversation


nevi-me (Contributor) commented Sep 16, 2020

This enables more integration tests, and makes the necessary changes to make them pass.


Status

Primitive Arrays

These are passing, but I have questions on whether the way I'm comparing them is fine.
Specifically, when reading an array that has 3 values, the `Buffer` is returned aligned to 8 bytes, such that an `i8`'s buffer would have `{ data: [v1, v2, v3, 0, 0, 0, 0, 0] }`.

This causes problems, because when we reconstruct the above from JSON, we would have `{ data: [v1, v2, v3] }`.
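
To make the mismatch concrete, here's a minimal sketch (plain slices, not the real arrow types) of why a byte-for-byte comparison fails while a length-aware one succeeds:

```rust
fn main() {
    // An IPC-read buffer is padded to an 8-byte boundary...
    let from_ipc: &[i8] = &[1, 2, 3, 0, 0, 0, 0, 0];
    // ...while a JSON-reconstructed buffer holds exactly the 3 values.
    let from_json: &[i8] = &[1, 2, 3];

    let len = 3; // the logical length carried by the array
    assert_ne!(from_ipc, from_json); // raw contents differ because of padding
    assert_eq!(&from_ipc[..len], &from_json[..len]); // first `len` values match
}
```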

I have implemented `PartialEq` manually for `ArrayData` at 261c021. I'd appreciate some feedback on whether the approach is fine. I deliberately used `assert_eq!(a, b, "helpful message")` to make it a bit easier to track down the differences between objects being compared, as it's otherwise very difficult and tedious without some `log!()` implementation.
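
A hedged sketch of that assert-based style, with a simplified stand-in for `ArrayData` (the struct and its fields here are illustrative, not the real type):

```rust
#[derive(Debug)]
struct SimpleArrayData {
    len: usize,
    null_count: usize,
}

// Each assertion carries its own message, so a failure names the exact
// property that differs instead of dumping two opaque structs.
fn assert_data_eq(a: &SimpleArrayData, b: &SimpleArrayData) {
    assert_eq!(a.len, b.len, "lengths differ");
    assert_eq!(a.null_count, b.null_count, "null counts differ");
}

fn main() {
    let a = SimpleArrayData { len: 3, null_count: 0 };
    let b = SimpleArrayData { len: 3, null_count: 0 };
    assert_data_eq(&a, &b);
}
```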

@jhorstmann's work on bit slicing has been instrumental in allowing us to compare `ArrayData`, so I'm leveraging that for buffer comparisons. However, slicing buffers with offsets can leave two buffers with different capacities. _I don't think this is a bug_.

Due to the above, I've also had to compare `Buffer` by length, and not capacity, at 69f4ef3. @sunchao @jorgecarleitao PTAL and advise. I made the change so tests could pass, but perhaps we could agree on what constitutes equal buffers (whether we should compare capacity).
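
As a sketch of what length-based equality could look like, assuming a simplified `Buffer` (not arrow's actual type):

```rust
#[derive(Debug)]
struct SimpleBuffer {
    data: Vec<u8>, // data.len() plays the role of capacity here
    len: usize,    // logical length; may be smaller than data.len() due to padding
}

impl PartialEq for SimpleBuffer {
    fn eq(&self, other: &Self) -> bool {
        // Capacity is an allocation detail: buffers are equal when their
        // logical lengths and visible bytes match.
        self.len == other.len && self.data[..self.len] == other.data[..other.len]
    }
}

fn main() {
    let a = SimpleBuffer { data: vec![1, 2, 3, 0, 0, 0, 0, 0], len: 3 };
    let b = SimpleBuffer { data: vec![1, 2, 3], len: 3 };
    assert!(a == b); // equal despite different "capacities"
}
```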

I'd prefer to make the above change as a separate PR, so I can address offsets and also add tests for coverage.

Complex Types

Lists and structs currently fail because we can't carry custom names and nullability. For example, if a list's child is itself a list, that child could be named anything (I've seen `item` or `list`). We currently have no way of encoding this information, so the tests fail on schema equality.
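
A minimal sketch of the failure mode, using simplified stand-ins for `Field` and `DataType` (the real types differ):

```rust
#[derive(Debug, PartialEq)]
enum SimpleType {
    Int32,
    List(Box<SimpleField>), // the child field carries its own name/nullability
}

#[derive(Debug, PartialEq)]
struct SimpleField {
    name: String,
    data_type: SimpleType,
    nullable: bool,
}

fn list_of_int32(child_name: &str) -> SimpleField {
    SimpleField {
        name: "col".to_string(),
        data_type: SimpleType::List(Box::new(SimpleField {
            name: child_name.to_string(),
            data_type: SimpleType::Int32,
            nullable: true,
        })),
        nullable: true,
    }
}

fn main() {
    // Logically both are List<Int32>, but different implementations pick
    // different child names, so schema equality fails on the name alone.
    assert_ne!(list_of_int32("item"), list_of_int32("list"));
}
```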

I opened https://issues.apache.org/jira/browse/ARROW-10261 to address this. I'll open a PR on the branch that's soon to be renamed from `master` 🤭, as this will affect Parquet and DataFusion.

Dictionaries

We seem to have covered a fair bit of dictionary support, but because we can't write dictionaries yet, we can't complete the integration testing.
I've opened https://issues.apache.org/jira/browse/ARROW-10268 for any interested takers. I think it should be fairly trivial to implement, but I'm already trying to do too much.
@carols10cents this builds on the work that you did for dictionary support in `arrow::ipc::conversion`, so thanks for that 😃

Metadata

We fail on some metadata tests because `arrow::datatypes::Field` doesn't support metadata.
This one's a bit difficult to implement, as `HashMap<String, String>` doesn't support `#[derive(Hash)]` and other requirements.
I was thinking that maybe we could use `Vec<(String, String)>` and handle deduplication of entries, and JSON-compatible serde, internally.
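
A hedged sketch of that idea (illustrative types, not a committed design): keep the entries sorted and unique so the containing struct can still derive `Hash` and `Eq`:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct FieldMetadata {
    entries: Vec<(String, String)>, // sorted by key, keys unique
}

impl FieldMetadata {
    fn from_map(map: HashMap<String, String>) -> Self {
        let mut entries: Vec<_> = map.into_iter().collect();
        // A canonical order makes Hash/Eq stable across insertion orders.
        entries.sort();
        FieldMetadata { entries }
    }
}

fn main() {
    let mut m = HashMap::new();
    m.insert("origin".to_string(), "ipc".to_string());
    let meta = FieldMetadata::from_map(m);
    assert_eq!(meta.entries.len(), 1);
}
```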

I opened https://issues.apache.org/jira/browse/ARROW-10259 for the above.

Quicker Testing

One of the reasons I've taken long to work on this is that I struggled a bit with running the integration tests (at the least, I had to go into a few directories and build things). `docker-compose --file docker-compose.yml run --rm conda-integration` took long, and I was getting sporadic failures in Rust.

I've temporarily created a script called `integration-testing.sh`, so that whoever pitches in to help can use it to quickly build the C++, Java and Rust binaries for testing.

@pitrou @nealrichardson I'll remove the script and the `.dockerignore` entry before merging this.

Lastly, Antoine, seeing that we can't complete this in time for 2.0.0, I'm fine with leaving the implementation status fields for Rust blank for a few more months.

Apologies for the "CC the world" update, but I think this is important enough for us to get more eyes on it.


pitrou commented Sep 25, 2020

Remember to update the implementation status when done ;-)

@nevi-me nevi-me force-pushed the rust-integration-more-tests-sep2020 branch 2 times, most recently from 63c6e4c to 24d274a Compare October 10, 2020 02:59
jorgecarleitao (Member) commented

Thanks a lot for driving this and CCing. This is definitely important.

I have hit that capacity problem multiple times myself! Once when I was trying to simplify `equal.rs`, for the reason you wrote about (buffers compare as different when their capacities differ); the second time on #8401, since a buffer that receives data via the C data interface does not even know (or care) about its capacity.

As far as I can tell, we have two use cases for capacity at the moment:

  • deallocating the region when it is no longer used
  • computing the total size in bytes of arrays

Because arrays share buffers, the total size of an array is currently misleading. For example, when an array is computed from `is_not_null` of another array, both the null buffer (buffer 0) and the value buffer (buffer 1) share the same memory region, and thus IMO the total size computation based on capacity is incorrect. This is also true for complex structs in which buffers are shared within the same array.
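
A minimal sketch of that double counting, with `Rc` standing in for arrow's shared buffer handle:

```rust
use std::rc::Rc;

fn main() {
    // One 64-byte allocation...
    let region: Rc<Vec<u8>> = Rc::new(vec![0u8; 64]);
    // ...referenced by both the null buffer (buffer 0) and the value
    // buffer (buffer 1) of the same array.
    let null_buffer = Rc::clone(&region);
    let value_buffer = Rc::clone(&region);

    // Summing per-buffer sizes reports 128 bytes, but only 64 are allocated.
    let naive_total = null_buffer.len() + value_buffer.len();
    assert_eq!(naive_total, 128);
    assert_eq!(Rc::strong_count(&region), 3); // all three handles share one region
}
```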

Thus, IMO capacity's main use case is bookkeeping: how to deallocate the region. #8401 systematizes that idea: `BufferData` (renamed `Bytes` there) no longer has a capacity, but an enum describing how to deallocate itself.

Regardless, I would say that in the vast majority of cases where we compare buffers, we want to compare their contents, irrespective of capacity or how they should be deallocated. Therefore, I would be happy to have buffer comparison based on actual content (in bytes). We just need to be careful about bitmaps, where the comparison should be made in bits.
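
A small sketch of bit-level bitmap comparison: only the first `len` bits are significant, so trailing bits in the last byte must be ignored:

```rust
// Compare two validity bitmaps over their first `len_bits` bits only
// (LSB-first bit order, as in the Arrow format).
fn bitmaps_eq(a: &[u8], b: &[u8], len_bits: usize) -> bool {
    (0..len_bits).all(|i| {
        let bit = |buf: &[u8]| (buf[i / 8] >> (i % 8)) & 1;
        bit(a) == bit(b)
    })
}

fn main() {
    // The same 3 significant bits, different garbage beyond bit 2.
    assert!(bitmaps_eq(&[0b0000_0011], &[0b1111_0011], 3));
    // A genuine difference within the significant bits is caught.
    assert!(!bitmaps_eq(&[0b0000_0011], &[0b0000_0001], 3));
}
```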


nevi-me commented Oct 10, 2020

Thus, IMO capacity's main use case is bookkeeping: how to deallocate the region. #8401 systematizes that idea: `BufferData` (renamed `Bytes` there) no longer has a capacity, but an enum describing how to deallocate itself.

I agree with your thinking here, especially on equality and arrays with offsets. One of the things we don't test (extensively or at all) is whether we're able to write arrays with unaligned offsets to the IPC format. I'll put in time to work on this whenever I get to v5 of the Arrow format.

Regardless, I would say that in the vast majority of cases where we compare buffers, we want to compare their contents, irrespective of capacity or how they should be deallocated. Therefore, I would be happy to have buffer comparison based on actual content (in bytes). We just need to be careful about bitmaps, where the comparison should be made in bits.

I'd also prefer not to compare using buffer capacity, as I'm mainly interested in the buffer contents, especially when dealing with a frozen buffer where we won't be modifying contents.

On bitmaps, I'll have a look, but I think that the approach of using the bit iterators should cover this.

Apologies again for not having gotten to #8401 yet; I hope you're not in a rush to get it implemented (especially as we're now looking at 3.0 for all new work).

@nevi-me nevi-me force-pushed the rust-integration-more-tests-sep2020 branch from 832c187 to d103c3f Compare October 27, 2020 05:10
@nevi-me nevi-me marked this pull request as ready for review October 27, 2020 05:11
@jorgecarleitao (Member) left a comment

I did a quick run through and left some comments. I am still missing the large bin file.

So far, all changes make sense to me (apart from the comments).

Review threads (outdated, resolved) on:
rust/arrow/src/array/data.rs
rust/arrow/src/ipc/convert.rs
rust/arrow/src/util/integration_util.rs
@nevi-me nevi-me force-pushed the rust-integration-more-tests-sep2020 branch from d103c3f to e3b6e42 Compare October 29, 2020 06:20
@nevi-me nevi-me force-pushed the rust-integration-more-tests-sep2020 branch 2 times, most recently from 5ec1b21 to 4b5dc0d Compare November 7, 2020 12:37
nevi-me pushed a commit that referenced this pull request Nov 7, 2020
This is a major refactor of the `equal.rs` module.

The rationale for this change is manifold:

* Currently, array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomic, as the user must "guess" what to downcast for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and some of the tests of the builders.
* The code in array comparison is difficult to follow, given the number of calls it performs around offsets.
* The implementation currently (indirectly) uses many of the `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate on and mutate.
* Some code is repeated.

This PR:

1. Adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality depend only on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related APIs) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality statically dispatched, via `match datatype`.
5. DRYs the code around array equality
6. Fixes an error in equality of dictionaries with equal values
7. Adds tests for equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. Splits `equal.rs` into smaller, more manageable files.
9. Removes `ArrayListOps`, since it is no longer needed
10. Moves JSON equality to its own module, for clarity.
11. Removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective widths to datatypes from the specification. This was backported from #8401
13. Adds a benchmark for array equality
13. adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on in #8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

<details>
 <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.
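
As a hedged illustration of the `impl PartialEq for dyn Trait` pattern behind this (a toy trait, not arrow's `Array`):

```rust
trait Data {
    fn values(&self) -> &[i32];
}

// Equality is defined once, on the trait object, through the common data
// view, so callers never downcast to concrete types.
impl PartialEq for dyn Data {
    fn eq(&self, other: &Self) -> bool {
        self.values() == other.values()
    }
}

struct VecData(Vec<i32>);

impl Data for VecData {
    fn values(&self) -> &[i32] {
        &self.0
    }
}

fn main() {
    let a: Box<dyn Data> = Box::new(VecData(vec![1, 2, 3]));
    let b: Box<dyn Data> = Box::new(VecData(vec![1, 2, 3]));
    assert!(*a == *b); // idiomatic `==` on trait objects, no downcasting
}
```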

Closes #8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me and others added 10 commits November 10, 2020 08:09
Tested with Rust and Java, but there are Java failures

make ARROW_MAGIC and CONT_MARKER consts

remove unused import
We weren't able to read dictionaries because of legacy issues
expose Field's dictionary fields via methods

some progress on dictionary integration
The intention is not to commit this, but to share with other users who might be interested in helping out with the integration testing.
I found the docker-compose process too slow, and it was returning errors on Rust :(
This is to be more in line with integration testing behaviour
After slicing buffers, we can end up with 2 buffers having the same len and data, but not the same capacity.
Such buffers fail the equality test.

CC @sunchao @jorgecarleitao as you've worked on this.
Perhaps we can find a suitable definition of equality, as we're otherwise failing integration tests because of this.
We were only doing basic comparisons of array properties,
such as whether data types and lengths are the same.
This extends the comparison to the array data.

fix lint
remove my integration testing helper

rebase, fix failing test
(cherry picked from commit 03b8aa4)
@nevi-me nevi-me force-pushed the rust-integration-more-tests-sep2020 branch from 2e3763f to dde96c6 Compare November 10, 2020 06:26

nevi-me commented Nov 10, 2020

@jorgecarleitao @carols10cents I'm contemplating closing this and breaking it up into smaller PRs (if this round of changes doesn't pass integration). I'll work on it this week in my evenings (GMT+2) so I can get it out of the way and go back to working on Parquet.

jorgecarleitao (Member) commented

From what I have read in this PR, it looks good so far. Is there anything blocking it?


nevi-me commented Nov 10, 2020

From what I have read in this PR, it looks good so far. Is there anything blocking it?

@jorgecarleitao the actual integration tests not passing is the blocker 😅

There were 10 failures in the last run, all related to logical equality. It's actually odd, because in my local testing I get 120+ failures between Rust and the other languages (except C++, which passes).

I'll monitor https://github.com/apache/arrow/pull/8200/checks?check_run_id=1378263093 in the background; if it passes, we can merge this (if you don't have any outstanding queries/concerns).

I used your suggested implementation of logical equality (dde96c6)


nevi-me commented Nov 10, 2020

@jorgecarleitao we're down to 3 failures.

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', arrow/src/array/equal/variable_size.rs:31:21

Looks like we're not comparing zero-length arrays correctly.
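
A hedged sketch of the shape of a fix (not the actual `variable_size.rs` code): short-circuit before touching buffers that may be empty:

```rust
// Compare variable-size array offsets for `len` elements; an empty array
// has nothing to index, so equality must not touch the buffers at all.
fn variable_size_eq(offsets_a: &[i32], offsets_b: &[i32], len: usize) -> bool {
    if len == 0 {
        return true; // indexing here would panic: "len is 0 but the index is 0"
    }
    offsets_a[..=len] == offsets_b[..=len]
}

fn main() {
    assert!(variable_size_eq(&[], &[], 0)); // panics without the guard
    assert!(variable_size_eq(&[0, 2, 5], &[0, 2, 5], 2));
}
```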


nevi-me commented Nov 10, 2020

@jorgecarleitao finally passed! There are still some tests that don't pass, but I've updated the status documentation with them (cc @andygrove, as you had opened a JIRA for this).

@nevi-me nevi-me closed this in 18b9281 Nov 10, 2020
yordan-pavlov pushed a commit to yordan-pavlov/arrow that referenced this pull request Nov 14, 2020
nevi-me pushed a commit that referenced this pull request Dec 3, 2020
…info

The purpose of this PR is refactoring `read_dictionary` to only need one kind of `Schema`, which lets us then remove the `find_dictionary_field` function and the `ipc_schema` field on `StreamReader` by adding a way to look up schema fields that use a particular dictionary by ID.

I'm also resubmitting a change to the `dict_id`/`dict_is_ordered` methods on `Field`; I had submitted this to @nevi-me to become part of #8200 but it looks like it got lost in a rebase or something? I think it's more correct to only return values if the fields have a dictionary as their datatype.
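
A hedged sketch (simplified types and names, not the PR's actual code) of looking up schema fields by dictionary ID:

```rust
#[derive(Debug)]
struct SimpleField {
    name: String,
    // Some(id) only when the field's data type is actually a dictionary.
    dict_id: Option<i64>,
}

// Find every schema field that sources its keys from dictionary `id`.
fn fields_with_dict_id(fields: &[SimpleField], id: i64) -> Vec<&SimpleField> {
    fields.iter().filter(|f| f.dict_id == Some(id)).collect()
}

fn main() {
    let schema = vec![
        SimpleField { name: "plain".to_string(), dict_id: None },
        SimpleField { name: "encoded".to_string(), dict_id: Some(42) },
    ];
    let hits = fields_with_dict_id(&schema, 42);
    assert_eq!(hits.len(), 1);
    assert_eq!(hits[0].name, "encoded");
}
```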

Closes #8820 from carols10cents/simplify-dictionary-field-lookup

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me nevi-me deleted the rust-integration-more-tests-sep2020 branch January 5, 2021 19:11
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021