apacheGH-37448: [MATLAB] Add arrow.array.ChunkedArray class (apache#37525)

### Rationale for this change

In order to add an `arrow.tabular.Table` class to the MATLAB Interface, we first need to add a MATLAB class representing `arrow::ChunkedArray`s. This is required because an `arrow::Table` is backed by a vector of `arrow::ChunkedArray`s, and the output of its `column(int index)` method is an `arrow::ChunkedArray`.

### What changes are included in this PR?

1. Introduced a new class called `arrow.array.ChunkedArray`. 
2. `arrow.array.ChunkedArray` has the following properties:
    1. `Type` - Data type of the `arrow.array.Array`s
    2. `Length` - Sum of the `arrow.array.Array` lengths
    3. `NumChunks` - Number of `arrow.array.Array`s
3. `arrow.array.ChunkedArray` has the following methods:
    1. `chunk(index)` - Returns the `arrow.array.Array` stored at the specified index
    2. `fromArrays(array1, array2, ..., arrayN, Type=type)` - Creates a `ChunkedArray` from the arrays provided. If `Type` is provided, all arrays are expected to have the specified `Type`.

**Example Usage**

```matlab
>> a1 = arrow.array(1:100);
>> a2 = arrow.array(101:250);
>> a3 = arrow.array(251:300);

% Create a ChunkedArray from 3 Float64Arrays
>> c = arrow.array.ChunkedArray.fromArrays(a1, a2, a3)

c = 

  ChunkedArray with properties:

         Type: [1×1 arrow.type.Float64Type]
    NumChunks: 3
       Length: 300

% Extract the first chunk and compare it to a1
>> c1 = c.chunk(1);
>> tf = isequal(c1, a1)

tf =

  logical

   1

% Create an empty ChunkedArray by providing the Type name-value pair
>> c = arrow.array.ChunkedArray.fromArrays(Type=arrow.timestamp())

c = 

  ChunkedArray with properties:

         Type: [1×1 arrow.type.TimestampType]
    NumChunks: 0
       Length: 0

```
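
For readers who want the semantics of `Length`, `NumChunks`, and `chunk(index)` in one place, here is a minimal C++ sketch. It is an illustration only, not the code in this PR: `MiniChunkedArray` and its member names are hypothetical, and each chunk is modeled as a plain `std::vector<double>` instead of an `arrow::Array`. The two error conditions mirror the checks that `getChunk` performs in the proxy (zero chunks, and an index outside `1..NumChunks`).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked array: several independently
// allocated chunks that together form one logical column.
class MiniChunkedArray {
 public:
  explicit MiniChunkedArray(std::vector<std::vector<double>> chunks)
      : chunks_(std::move(chunks)) {}

  // NumChunks: the number of underlying arrays.
  std::int32_t numChunks() const {
    return static_cast<std::int32_t>(chunks_.size());
  }

  // Length: the sum of the lengths of all chunks.
  std::int64_t length() const {
    std::int64_t total = 0;
    for (const auto& chunk : chunks_) {
      total += static_cast<std::int64_t>(chunk.size());
    }
    return total;
  }

  // chunk(index): takes a 1-based index (as MATLAB does), validates it,
  // and only then converts it to the 0-based index used for storage.
  const std::vector<double>& chunk(std::int32_t matlab_index) const {
    if (numChunks() == 0) {
      throw std::out_of_range("chunked array has zero chunks");
    }
    if (matlab_index < 1 || matlab_index > numChunks()) {
      throw std::out_of_range("Invalid chunk index: " +
                              std::to_string(matlab_index));
    }
    return chunks_[static_cast<std::size_t>(matlab_index - 1)];
  }

 private:
  std::vector<std::vector<double>> chunks_;
};
```

With three chunks of 100, 150, and 50 elements, this sketch reports 3 chunks and a length of 300, which matches the MATLAB example above.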

### Are these changes tested?

Yes. I added a new test class called `tChunkedArray.m` that contains unit tests for the new class.

### Are there any user-facing changes?

Yes. Users can now create a `ChunkedArray` in the MATLAB Interface. 

### Future Directions

1. In this PR, we deliberately didn't include a convenience constructor function because we're not sure if we want users to create `ChunkedArray`s themselves. We think users will mostly use `ChunkedArray` when extracting columns from `Table`s. 
2. We will implement more methods on `ChunkedArray`, such as `flatten()` and `combineChunks()`.
* Closes: apache#37448

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
sgilmore10 authored and dgreiss committed Feb 17, 2024
1 parent 71bf20b commit c8ec7ad
Showing 25 changed files with 694 additions and 52 deletions.
22 changes: 12 additions & 10 deletions matlab/src/cpp/arrow/matlab/array/proxy/array.cc
@@ -32,9 +32,9 @@ namespace arrow::matlab::array::proxy {
// Register Proxy methods.
REGISTER_METHOD(Array, toString);
REGISTER_METHOD(Array, toMATLAB);
-REGISTER_METHOD(Array, length);
-REGISTER_METHOD(Array, valid);
-REGISTER_METHOD(Array, type);
+REGISTER_METHOD(Array, getLength);
+REGISTER_METHOD(Array, getValid);
+REGISTER_METHOD(Array, getType);
REGISTER_METHOD(Array, isEqual);

}
@@ -51,13 +51,13 @@ namespace arrow::matlab::array::proxy {
context.outputs[0] = str_mda;
}

-void Array::length(libmexclass::proxy::method::Context& context) {
+void Array::getLength(libmexclass::proxy::method::Context& context) {
::matlab::data::ArrayFactory factory;
auto length_mda = factory.createScalar(array->length());
context.outputs[0] = length_mda;
}

-void Array::valid(libmexclass::proxy::method::Context& context) {
+void Array::getValid(libmexclass::proxy::method::Context& context) {
auto array_length = static_cast<size_t>(array->length());

// If the Arrow array has no null values, then return a MATLAB
@@ -77,7 +77,7 @@ namespace arrow::matlab::array::proxy {
context.outputs[0] = valid_elements_mda;
}

-void Array::type(libmexclass::proxy::method::Context& context) {
+void Array::getType(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;

mda::ArrayFactory factory;
@@ -87,11 +87,13 @@ namespace arrow::matlab::array::proxy {
context,
error::ARRAY_FAILED_TO_CREATE_TYPE_PROXY);

-auto type_id = type_proxy->unwrap()->id();
-auto proxy_id = libmexclass::proxy::ProxyManager::manageProxy(type_proxy);
+const auto type_id = static_cast<int32_t>(type_proxy->unwrap()->id());
+const auto proxy_id = libmexclass::proxy::ProxyManager::manageProxy(type_proxy);

-context.outputs[0] = factory.createScalar(proxy_id);
-context.outputs[1] = factory.createScalar(static_cast<int64_t>(type_id));
+mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
+output[0]["ProxyID"] = factory.createScalar(proxy_id);
+output[0]["TypeID"] = factory.createScalar(type_id);
+context.outputs[0] = output;
}

void Array::isEqual(libmexclass::proxy::method::Context& context) {
6 changes: 3 additions & 3 deletions matlab/src/cpp/arrow/matlab/array/proxy/array.h
@@ -36,11 +36,11 @@ class Array : public libmexclass::proxy::Proxy {

void toString(libmexclass::proxy::method::Context& context);

-void length(libmexclass::proxy::method::Context& context);
+void getLength(libmexclass::proxy::method::Context& context);

-void valid(libmexclass::proxy::method::Context& context);
+void getValid(libmexclass::proxy::method::Context& context);

-void type(libmexclass::proxy::method::Context& context);
+void getType(libmexclass::proxy::method::Context& context);

virtual void toMATLAB(libmexclass::proxy::method::Context& context) = 0;

187 changes: 187 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/chunked_array.cc
@@ -0,0 +1,187 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/util/utf8.h"

#include "arrow/matlab/array/proxy/chunked_array.h"
#include "arrow/matlab/array/proxy/array.h"
#include "arrow/matlab/error/error.h"
#include "arrow/matlab/type/proxy/wrap.h"
#include "arrow/matlab/array/proxy/wrap.h"

#include "libmexclass/proxy/ProxyManager.h"

namespace arrow::matlab::array::proxy {

namespace {
libmexclass::error::Error makeEmptyChunkedArrayError() {
const std::string error_msg = "Numeric indexing using the chunk method is not supported for chunked arrays with zero chunks.";
return libmexclass::error::Error{error::CHUNKED_ARRAY_NUMERIC_INDEX_WITH_EMPTY_CHUNKED_ARRAY, error_msg};
}

libmexclass::error::Error makeInvalidNumericIndexError(const int32_t matlab_index, const int32_t num_chunks) {
std::stringstream error_message_stream;
error_message_stream << "Invalid chunk index: ";
error_message_stream << matlab_index;
error_message_stream << ". Chunk index must be between 1 and the number of chunks (";
error_message_stream << num_chunks;
error_message_stream << ").";
return libmexclass::error::Error{error::CHUNKED_ARRAY_INVALID_NUMERIC_CHUNK_INDEX, error_message_stream.str()};
}
}

ChunkedArray::ChunkedArray(std::shared_ptr<arrow::ChunkedArray> chunked_array) : chunked_array{std::move(chunked_array)} {

// Register Proxy methods.
REGISTER_METHOD(ChunkedArray, getLength);
REGISTER_METHOD(ChunkedArray, getNumChunks);
REGISTER_METHOD(ChunkedArray, getChunk);
REGISTER_METHOD(ChunkedArray, getType);
REGISTER_METHOD(ChunkedArray, isEqual);
}


libmexclass::proxy::MakeResult ChunkedArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
namespace mda = ::matlab::data;

mda::StructArray opts = constructor_arguments[0];
const mda::TypedArray<uint64_t> array_proxy_ids = opts[0]["ArrayProxyIDs"];
const mda::TypedArray<uint64_t> type_proxy_id = opts[0]["TypeProxyID"];

std::vector<std::shared_ptr<arrow::Array>> arrays;
// Retrieve all of the Array Proxy instances from the libmexclass ProxyManager.
for (const auto& array_proxy_id : array_proxy_ids) {
auto proxy = libmexclass::proxy::ProxyManager::getProxy(array_proxy_id);
auto array_proxy = std::static_pointer_cast<proxy::Array>(proxy);
auto array = array_proxy->unwrap();
arrays.push_back(array);
}

auto proxy = libmexclass::proxy::ProxyManager::getProxy(type_proxy_id[0]);
auto type_proxy = std::static_pointer_cast<type::proxy::Type>(proxy);
auto type = type_proxy->unwrap();

MATLAB_ASSIGN_OR_ERROR(auto chunked_array,
arrow::ChunkedArray::Make(arrays, type),
error::CHUNKED_ARRAY_MAKE_FAILED);

return std::make_unique<proxy::ChunkedArray>(std::move(chunked_array));
}

std::shared_ptr<arrow::ChunkedArray> ChunkedArray::unwrap() {
return chunked_array;
}

void ChunkedArray::getLength(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;
auto length_mda = factory.createScalar(chunked_array->length());
context.outputs[0] = length_mda;
}

void ChunkedArray::getNumChunks(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;
auto num_chunks_mda = factory.createScalar(chunked_array->num_chunks());
context.outputs[0] = num_chunks_mda;
}

void ChunkedArray::getChunk(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;

mda::StructArray args = context.inputs[0];
const mda::TypedArray<int32_t> index_mda = args[0]["Index"];
const auto matlab_index = int32_t(index_mda[0]);

// Note: MATLAB uses 1-based indexing, so subtract 1.
// arrow::ChunkedArray::chunk does not do any bounds checking.
const int32_t index = matlab_index - 1;
const auto num_chunks = chunked_array->num_chunks();

if (num_chunks == 0) {
context.error = makeEmptyChunkedArrayError();
return;
}

if (matlab_index < 1 || matlab_index > num_chunks) {
context.error = makeInvalidNumericIndexError(matlab_index, num_chunks);
return;
}

const auto array = chunked_array->chunk(index);
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto array_proxy,
arrow::matlab::array::proxy::wrap(array),
context,
error::UNKNOWN_PROXY_FOR_ARRAY_TYPE);


const auto array_proxy_id = libmexclass::proxy::ProxyManager::manageProxy(array_proxy);
const auto type_id = static_cast<int64_t>(array->type_id());

mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
output[0]["ProxyID"] = factory.createScalar(array_proxy_id);
output[0]["TypeID"] = factory.createScalar(type_id);
context.outputs[0] = output;
}


void ChunkedArray::getType(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;

mda::ArrayFactory factory;

MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto type_proxy,
type::proxy::wrap(chunked_array->type()),
context,
error::ARRAY_FAILED_TO_CREATE_TYPE_PROXY);


const auto proxy_id = libmexclass::proxy::ProxyManager::manageProxy(type_proxy);
const auto type_id = static_cast<int32_t>(type_proxy->unwrap()->id());

mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
output[0]["ProxyID"] = factory.createScalar(proxy_id);
output[0]["TypeID"] = factory.createScalar(type_id);
context.outputs[0] = output;
}

void ChunkedArray::isEqual(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;

const mda::TypedArray<uint64_t> chunked_array_proxy_ids = context.inputs[0];

bool is_equal = true;
for (const auto& chunked_array_proxy_id : chunked_array_proxy_ids) {
// Retrieve the ChunkedArray proxy from the ProxyManager
auto proxy = libmexclass::proxy::ProxyManager::getProxy(chunked_array_proxy_id);
auto chunked_array_proxy = std::static_pointer_cast<proxy::ChunkedArray>(proxy);
auto chunked_array_to_compare = chunked_array_proxy->unwrap();

// Use the ChunkedArray::Equals(const ChunkedArray& other) overload instead
// of ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) to
// ensure we don't assume chunked arrays with the same memory address are
// equal. This ensures we treat NaNs as not equal by default.
if (!chunked_array->Equals(*chunked_array_to_compare)) {
is_equal = false;
break;
}
}
mda::ArrayFactory factory;
context.outputs[0] = factory.createScalar(is_equal);
}
}
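
The comment in `ChunkedArray::isEqual` above gives the reason for comparing by reference: a comparison that short-circuits when both sides are the same object would report an array containing NaN as equal to itself. The following self-contained sketch illustrates that difference with plain `std::vector<double>` values. The function names `equalsByValue` and `equalsWithIdentityShortcut` are hypothetical and exist only for this illustration; the actual behavior of `arrow::ChunkedArray::Equals` is defined by the Arrow C++ library, not by this sketch.

```cpp
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

using Values = std::vector<double>;

// A quiet NaN, used to show the difference between the two comparisons.
const double kNaN = std::numeric_limits<double>::quiet_NaN();

// Element-wise comparison. Under IEEE 754, NaN == NaN is false, so a
// sequence containing NaN does not compare equal even to itself.
bool equalsByValue(const Values& a, const Values& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (!(a[i] == b[i])) return false;
  }
  return true;
}

// Pointer-based comparison with an identity shortcut (hypothetical).
// If both pointers refer to the same object, the elements are never
// inspected, so NaN values go unnoticed.
bool equalsWithIdentityShortcut(const std::shared_ptr<Values>& a,
                                const std::shared_ptr<Values>& b) {
  if (a == b) return true;
  if (!a || !b) return false;
  return equalsByValue(*a, *b);
}
```

For a shared `Values{1.0, kNaN}`, `equalsWithIdentityShortcut(v, v)` returns `true` while `equalsByValue(*v, *v)` returns `false`; the second result is the one the proxy wants by default.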
51 changes: 51 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/chunked_array.h
@@ -0,0 +1,51 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include "arrow/chunked_array.h"

#include "libmexclass/proxy/Proxy.h"

namespace arrow::matlab::array::proxy {

class ChunkedArray : public libmexclass::proxy::Proxy {
public:
ChunkedArray(std::shared_ptr<arrow::ChunkedArray> chunked_array);

~ChunkedArray() {}

std::shared_ptr<arrow::ChunkedArray> unwrap();

static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

protected:

void getLength(libmexclass::proxy::method::Context& context);

void getNumChunks(libmexclass::proxy::method::Context& context);

void getChunk(libmexclass::proxy::method::Context& context);

void getType(libmexclass::proxy::method::Context& context);

void isEqual(libmexclass::proxy::method::Context& context);

std::shared_ptr<arrow::ChunkedArray> chunked_array;
};

}
4 changes: 4 additions & 0 deletions matlab/src/cpp/arrow/matlab/error/error.h
@@ -190,5 +190,9 @@ namespace arrow::matlab::error {
static const char* FEATHER_VERSION_UNKNOWN = "arrow:io:feather:FeatherVersionUnknown";
static const char* FEATHER_FAILED_TO_READ_TABLE = "arrow:io:feather:FailedToReadTable";
static const char* FEATHER_FAILED_TO_READ_RECORD_BATCH = "arrow:io:feather:FailedToReadRecordBatch";
+static const char* CHUNKED_ARRAY_MAKE_FAILED = "arrow:chunkedarray:MakeFailed";
+static const char* CHUNKED_ARRAY_NUMERIC_INDEX_WITH_EMPTY_CHUNKED_ARRAY = "arrow:chunkedarray:NumericIndexWithEmptyChunkedArray";
+static const char* CHUNKED_ARRAY_INVALID_NUMERIC_CHUNK_INDEX = "arrow:chunkedarray:InvalidNumericChunkIndex";
+

}
2 changes: 2 additions & 0 deletions matlab/src/cpp/arrow/matlab/proxy/factory.cc
@@ -21,6 +21,7 @@
#include "arrow/matlab/array/proxy/timestamp_array.h"
#include "arrow/matlab/array/proxy/time32_array.h"
#include "arrow/matlab/array/proxy/time64_array.h"
+#include "arrow/matlab/array/proxy/chunked_array.h"
#include "arrow/matlab/tabular/proxy/record_batch.h"
#include "arrow/matlab/tabular/proxy/schema.h"
#include "arrow/matlab/error/error.h"
@@ -55,6 +56,7 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name,
REGISTER_PROXY(arrow.array.proxy.Time32Array , arrow::matlab::array::proxy::NumericArray<arrow::Time32Type>);
REGISTER_PROXY(arrow.array.proxy.Time64Array , arrow::matlab::array::proxy::NumericArray<arrow::Time64Type>);
REGISTER_PROXY(arrow.array.proxy.Date32Array , arrow::matlab::array::proxy::NumericArray<arrow::Date32Type>);
+REGISTER_PROXY(arrow.array.proxy.ChunkedArray , arrow::matlab::array::proxy::ChunkedArray);
REGISTER_PROXY(arrow.tabular.proxy.RecordBatch , arrow::matlab::tabular::proxy::RecordBatch);
REGISTER_PROXY(arrow.tabular.proxy.Schema , arrow::matlab::tabular::proxy::Schema);
REGISTER_PROXY(arrow.type.proxy.Field , arrow::matlab::type::proxy::Field);
8 changes: 4 additions & 4 deletions matlab/src/cpp/arrow/matlab/tabular/proxy/record_batch.cc
@@ -52,8 +52,8 @@ namespace arrow::matlab::tabular::proxy {

RecordBatch::RecordBatch(std::shared_ptr<arrow::RecordBatch> record_batch) : record_batch{record_batch} {
REGISTER_METHOD(RecordBatch, toString);
-REGISTER_METHOD(RecordBatch, numColumns);
-REGISTER_METHOD(RecordBatch, columnNames);
+REGISTER_METHOD(RecordBatch, getNumColumns);
+REGISTER_METHOD(RecordBatch, getColumnNames);
REGISTER_METHOD(RecordBatch, getColumnByIndex);
REGISTER_METHOD(RecordBatch, getColumnByName);
REGISTER_METHOD(RecordBatch, getSchema);
@@ -104,15 +104,15 @@ namespace arrow::matlab::tabular::proxy {
return record_batch_proxy;
}

-void RecordBatch::numColumns(libmexclass::proxy::method::Context& context) {
+void RecordBatch::getNumColumns(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;
const auto num_columns = record_batch->num_columns();
auto num_columns_mda = factory.createScalar(num_columns);
context.outputs[0] = num_columns_mda;
}

-void RecordBatch::columnNames(libmexclass::proxy::method::Context& context) {
+void RecordBatch::getColumnNames(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;
const int num_columns = record_batch->num_columns();
4 changes: 2 additions & 2 deletions matlab/src/cpp/arrow/matlab/tabular/proxy/record_batch.h
@@ -35,8 +35,8 @@ namespace arrow::matlab::tabular::proxy {

protected:
void toString(libmexclass::proxy::method::Context& context);
-void numColumns(libmexclass::proxy::method::Context& context);
-void columnNames(libmexclass::proxy::method::Context& context);
+void getNumColumns(libmexclass::proxy::method::Context& context);
+void getColumnNames(libmexclass::proxy::method::Context& context);
void getColumnByIndex(libmexclass::proxy::method::Context& context);
void getColumnByName(libmexclass::proxy::method::Context& context);
void getSchema(libmexclass::proxy::method::Context& context);