
[MATLAB] Add an InferNulls name-value pair for controlling null value inference during construction of arrow.array.Array #35676

Closed
kevingurney opened this issue May 18, 2023 · 2 comments · Fixed by #35827


kevingurney commented May 18, 2023

Describe the enhancement requested

This is a follow-up to the initial null value handling support that was added in #35598.

In order to give clients more flexibility in how null values in MATLAB arrays are detected when constructing an arrow.array.Array, it would be helpful to expose more name-value pairs on the arrow.array.Array class (and concrete subclasses).

One possible name-value pair for handling null value inference would be InferNulls, which is described below.

InferNulls

Supported values: true (default) | false

true - "automatically" detect null values in the input MATLAB array based on the presence of MATLAB type-specific missing values (e.g. NaN for double, <missing> for string, NaT for datetime, etc.).

false - Do not "automatically" detect null values.

Example:

>> matlabArray = string(["A", missing, "C", missing])'

matlabArray = 

  4x1 string array

    "A"
    <missing>
    "C"
    <missing>

% Infer null values from MATLAB <missing> string values
>> arrowArray = arrow.array.StringArray(matlabArray, InferNulls=true)
[
    "A",
    null,
    "C",
    null
]

Note: For some MATLAB types (e.g. int64) there is no concept of a missing value. In this case the value of InferNulls won't impact the resulting arrow.array.Array.
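
As a rough illustration of what "type-specific missing values" means here, the sketch below uses only the built-in `ismissing` function to derive a logical mask of valid elements for several MATLAB types. It is not the Arrow implementation, and the variable names are made up for this example.

```matlab
% Sample data for several MATLAB types (illustrative values only).
doubleData   = [1 NaN 3]';
stringData   = ["A"; missing; "C"];
datetimeData = [datetime(2023, 5, 18); NaT];
int64Data    = int64([1 2 3])';

% ismissing flags NaN, <missing>, and NaT according to the input type.
validDouble   = ~ismissing(doubleData);    % [true; false; true]
validString   = ~ismissing(stringData);    % [true; false; true]
validDatetime = ~ismissing(datetimeData);  % [true; false]

% int64 has no missing value, so every element is valid.
validInt64    = ~ismissing(int64Data);     % [true; true; true]
```

For `int64` the mask is all `true`, which is why the value of InferNulls would make no difference for that type.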

Component(s)

MATLAB

kevingurney changed the title from "[MATLAB] Add name-value pairs for controlling null value handling during construction of arrow.array.Array" to "[MATLAB] Add an InferNulls name-value pair for controlling null value inference during construction of arrow.array.Array" May 24, 2023

kevingurney commented May 24, 2023

After further consideration, it may make sense to simplify the proposed name-value pairs to only include InferNulls = true | false rather than DetectNulls and NullDetectionFcn.

Rather than using a function_handle, clients can pre-compute null values using whatever approach they would like and then pass in a validity bitmap via the Valid name-value pair proposed in #35693.

I've updated the issue title and description accordingly.
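
To make the "pre-compute null values" idea concrete, here is a minimal sketch of a client-side null rule that treats a sentinel value as null in addition to `NaN`. The sentinel `-999` and the variable names are assumptions for illustration. The `Valid` name-value pair is only proposed in #35693, so the constructor call is left as a comment, and combining it with `InferNulls=false` is a guess at how the two options might be used together.

```matlab
% Sensor readings where -999 is a made-up "no reading" sentinel.
readings = [20.5 -999 NaN 22.1]';

% Custom rule: an element is valid only if it is neither NaN nor the sentinel.
valid = ~(isnan(readings) | readings == -999);   % [true; false; false; true]

% Hypothetical usage once a Valid name-value pair exists (proposed in #35693):
% arrowArray = arrow.array.Float64Array(readings, InferNulls=false, Valid=valid);
```

Because the mask is an ordinary logical array, any detection logic can be used without the array class needing to accept a function_handle.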

@sgilmore10

take

@kou kou closed this as completed in #35827 Jun 3, 2023
kou pushed a commit that referenced this issue Jun 3, 2023
…g null value inference during construction of `arrow.array.Array` (#35827)

### Rationale for this change
This change lets users toggle the automatic null-value detection behavior. By default, values MATLAB considers to be missing (e.g. `NaN` for `double`, `<missing>` for `string`, and `NaT` for `datetime`) will be treated as `null` values. Users can turn this behavior on and off using the `InferNulls` name-value pair.

**Example**
```matlab
>> matlabArray = [1 NaN 3]'

matlabArray =

     1
     NaN
     3
      
% Treat NaN as a null value
>> arrowArray1 = arrow.array.Float64Array(matlabArray, InferNulls=true)

arrowArray1 = 

[
  1,
  null,
  3
]

% Don't treat NaN as a null value 
>> arrowArray2 = arrow.array.Float64Array(matlabArray, InferNulls=false)
   
arrowArray2 = 

[
  1,
  nan,
  3
]

```
We've only added this name-value pair to `arrow.array.Float64Array` for now. We'll add it to the other types in a follow-up change.

### What changes are included in this PR?

1. Added `InferNulls` name-value pair to `arrow.array.Float64Array`.
2. Added common validation function `arrow.args.validateTypeAndShape` to remove duplicate validation code among the numeric classes.
3. Added a function called `arrow.args.parseValidElements` that the `arrow.array.<Type>Array` classes will be able to share for generating the logical mask of valid elements.
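
The shared mask-generation step in item 3 might look roughly like the sketch below. This is not the actual `arrow.args.parseValidElements` code: the helper name `sketchValidElements` and its two-argument signature are hypothetical, and only the built-in `ismissing` is used.

```matlab
% Hypothetical helper: returns a logical mask of valid elements.
% When inferNulls is false, the scalar ~inferNulls is true and the OR
% makes every element valid; otherwise missing values are marked invalid.
sketchValidElements = @(data, inferNulls) ~ismissing(data) | ~inferNulls;

data = [1 NaN 3]';
validInferred = sketchValidElements(data, true);    % [true; false; true]
validAsIs     = sketchValidElements(data, false);   % [true; true; true]
```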

### Are these changes tested?

Yes, we added a test point called `InferNulls` to the test class `tFloat64Array.m`.

### Are there any user-facing changes?

Yes, users can now control how `NaN` values are treated when creating an `arrow.array.Float64Array`.

### Future Directions

1. Add a name-value pair to allow users to specify the valid elements themselves.
2. Extend null support to other numeric types.
3. We've been working on adding error-handling support to `mathworks/libmexclass`. We have a prototype to do this using status-like and result-like objects already pushed to a [branch](https://github.com/mathworks/libmexclass/tree/33). Once this branch is merged with the `main` branch of `mathworks/libmexclass`, we'll port it over.

### Notes

Thank you @ kevingurney for all the help with this PR! 
* Closes: #35676

Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 13.0.0 milestone Jun 3, 2023