Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38361: [MATLAB] Add validation logic for offsets and values to arrow.array.ListArray.fromArrays #38531

Merged
merged 10 commits into from
Oct 31, 2023

Conversation

kevingurney
Copy link
Member

@kevingurney kevingurney commented Oct 31, 2023

Rationale for this change

This pull request adds a new ValidationMode name-value pair to the arrow.array.ListArray.fromArrays function. This allows client code to validate whether provided offsets and values are valid.

What changes are included in this PR?

  1. Added a new name-value pair ValidationMode = "None" | "Minimal" (default) | "Full". to the arrow.array.ListArrays.fromArrays function. If ValidationMode is set to "Minimal" or "Full" and the provided offsets and values arrays are invalid, then an error will be thrown when calling the arrow.array.ListArrays.fromArrays function.
  2. Set the default ValidationMode for arrow.array.ListArray.fromArrays to "Minimal" to balance usability and performance when creating ListArrays. Hopefully, this should help more MATLAB users navigate the complexities of creating ListArrays "from scratch" using offsets and values arrays.
  3. Added a new arrow.array.ValidationMode enumeration class. This is used as the type of the ValidationMode name-value pair on the arrow.array.ListArray.fromArrays function.

Supported values for arrow.array.ValidationMode include:

  • arrow.array.ValidationMode.None - Do no validation checks on the given Array.
  • arrow.array.ValidationMode.Minimal - Do relatively inexpensive validation checks on the given Array. Delegates to the C++ Array::Validate method under the hood.
  • arrow.array.ValidationMode.Full - Do expensive / robust validation checks on the given Array. Delegates to the C++ Array::ValidateFull method under the hood.

Example

>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets = 

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values = 

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using  . 
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using  . 
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array = 

  ListArray with 5 elements and 0 null values:

<Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>

Are these changes tested?

Yes.

  1. Added new test cases for verifying ValidationMode behavior to tListArray.m.

Are there any user-facing changes?

Yes.

  1. Client code can now control validation behavior when calling arrow.array.ListArray.fromArrays by using the new ValidationMode name-value pair.
  2. By default, an error will now be thrown by arrow.array.ListArray.fromArrays for certain invalid combinations of offsets and values. In other words, arrow.array.ListArray.fromArrays will call the C++ method Array::Validate by default, which corresponds to arrow.array.ValidationMode.Minimal.
  3. Client code can now create arrow.array.ValidationMode enumeration values.

This PR includes breaking changes to public APIs.

Previously, all offsets and values would be accepted by the arrow.array.ListArray.fromArrays function. However, this pull request changes the default behavior to call the C++ Array::Validate method under the hood, which means that some previously accepted offsets and values will now result in a validation error. This can be worked around by setting ValidationMode to "None" when calling arrow.array.ListArray.fromArrays.

Future Directions

  1. Currently ValidationMode has only been added to the arrow.array.ListArray.fromArrays method. However, in the future, it may make sense to generalize validation behavior and provide ValidationMode on other fromMATLAB and fromArrays methods for other Array types. We may also want to add a stand-alone validate method on all arrow.array.Array classes ([MATLAB] Add a validate method to all arrow.array.Array classes #38532). We decided to start with ListArray as an incremental first step since we suspect creating valid ListArrays from offsets and values will generally be more error prone than creating simpler Array types like Float64Array or StringArray.

Notes

  1. We chose to set the default ValidationMode value to arrow.array.ValidationMode.Minimal to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to arrow.array.ValidationMode.None in the future.
  2. Thank you @sgilmore10 for your help with this pull request!

2. Hard-code enumeration values for `ValidationMode` to match in both
MATLAB and C++.
3. Set default `ValidationMode` to `Minimal` for
`arrow.array.ListArrays.fromArrays`.
4. Move `arrow.array.internal.validation.ValidationMode` to
`arrow.array.ValidationMode`.
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 31, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 31, 2023
are valid, no error is thrown, even if `ValidationMode` is set to
`"Minimal"` or `"Full"`.
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 31, 2023
@sgilmore10
Copy link
Member

Apologies for prematurely leaving comments on this PR. I did not realize it was still a draft.

@kevingurney
Copy link
Member Author

Apologies for prematurely leaving comments on this PR. I did not realize it was still a draft.

No worries at all! Thanks for the helpful feedback!

@kevingurney kevingurney marked this pull request as ready for review October 31, 2023 17:48
@kevingurney kevingurney requested a review from kou as a code owner October 31, 2023 17:48
@kevingurney
Copy link
Member Author

+1

@kevingurney kevingurney merged commit 66844e9 into apache:main Oct 31, 2023
24 checks passed
@kevingurney kevingurney deleted the GH-38361 branch October 31, 2023 18:16
@kevingurney kevingurney removed the awaiting changes Awaiting changes label Oct 31, 2023
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66844e9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

@kou kou changed the title GH-38361: Add validation logic for offsets and values to arrow.array.ListArray.fromArrays GH-38361: [MATLAB] Add validation logic for offsets and values to arrow.array.ListArray.fromArrays Nov 1, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…rrow.array.ListArray.fromArrays` (apache#38531)

### Rationale for this change

This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid.

### What changes are included in this PR?

1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full".` to the `arrow.array.ListArrays.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArrays.fromArrays` function.
2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays.
3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function.

Supported values for `arrow.array.ValidationMode` include:
* `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`.
* `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood.
* `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood.

**Example**
```matlab
>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets = 

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values = 

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using  . 
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using  . 
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array = 

  ListArray with 5 elements and 0 null values:

<Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>
```

### Are these changes tested?

Yes.

1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`.

### Are there any user-facing changes?

Yes.

1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair.
2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`.
3. Client code can now create `arrow.array.ValidationMode` enumeration values.

**This PR includes breaking changes to public APIs.**

Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`.

### Future Directions

1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (apache#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`.

### Notes

1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future. 
2. Thank you @ sgilmore10 for your help with this pull request!
* Closes: apache#38361 

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…rrow.array.ListArray.fromArrays` (apache#38531)

### Rationale for this change

This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid.

### What changes are included in this PR?

1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full".` to the `arrow.array.ListArrays.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArrays.fromArrays` function.
2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays.
3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function.

Supported values for `arrow.array.ValidationMode` include:
* `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`.
* `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood.
* `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood.

**Example**
```matlab
>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets = 

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values = 

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using  . 
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using  . 
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array = 

  ListArray with 5 elements and 0 null values:

<Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>
```

### Are these changes tested?

Yes.

1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`.

### Are there any user-facing changes?

Yes.

1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair.
2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`.
3. Client code can now create `arrow.array.ValidationMode` enumeration values.

**This PR includes breaking changes to public APIs.**

Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`.

### Future Directions

1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (apache#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`.

### Notes

1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future. 
2. Thank you @ sgilmore10 for your help with this pull request!
* Closes: apache#38361 

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MATLAB] Add validation logic for offsets and values to arrow.array.ListArray.fromArrays
2 participants