Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-602: [C++] Provide iterator access to primitive elements inside an Array #13009

Conversation

AlvinJ15
Copy link
Contributor

Create Iterator method in stl for Array and ChunkedArray

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from a2a4f06 to 1b98d31 Compare April 27, 2022 05:22
@github-actions
Copy link

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch 4 times, most recently from 8690f28 to 2941f70 Compare April 27, 2022 07:00
int64_t index_in_chunk =
chunk_index ? index_ + n - chunks_lengths_[chunk_index - 1] : index_ + n;
return iterators_list_[chunk_index]
[index_in_chunk - iterators_list_[chunk_index].index()];
Copy link
Contributor

@edponce edponce Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChunkedArray already has GetScalar(index) which is efficient for accessing sequential scalars at particular logical indices. I suggest to use it for the iteration. There is no need to perform complex index calculations/checks in the iterator. This would be somewhat analogous to the Array iterator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edponce The GetScalar suggestion doesn't seem right to me. The iterator should just yield C values, not boxed scalars which are much more expensive.

Copy link
Contributor

@edponce edponce Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, well what about adding a GetView method to ChunkedArray which invokes the GetView of the underlying Arrays and use the ChunkedResolver to resolve the logical index?

Copy link
Contributor

@edponce edponce Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ok I see now that it is using the Array iterators.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for attempting this @AlvinJ15 . I think the first thing to do would be to simplify the implementation by relying on ChunkResolver instead of doing chunk resolution yourself. You may want to add a friend declaration inside ChunkedArray if that would help.

int64_t GetChunkIndex(int64_t index) const {
return static_cast<int64_t>(
std::upper_bound(chunks_lengths_.begin(), chunks_lengths_.end(), index) -
chunks_lengths_.begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have ChunkResolver now which can be used without needing to reinvent it (and which should generally be faster).

chunks_lengths_.begin());
}

void InitializeComponents(ChunkedArrayIterator<ArrayType>& chunked_array_iterator,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this UpdateComponents as it doesn't only initialize but is also used for updates.

} else {
chunked_array_iterator.chunks_lengths_.push_back(chunk_length);
}
chunk_index++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Factor out the push_back operation from the if-else

auto chunk_length = ...;
if (chunk_index) {
   chunk_length += chunked_array_iterator.chunks_lengths_[chunk_index - 1]);
}
chunked_array_iterator.chunks_lengths_.push_back(chunk_length);

Comment on lines 313 to 270
auto& current_iterator =
chunked_array_iterator
.iterators_list_[chunked_array_iterator.current_chunk_index_];
current_iterator -= current_iterator.index();
if (chunked_array_iterator.current_chunk_index_)
current_iterator +=
chunked_array_iterator.index_ -
chunked_array_iterator
.chunks_lengths_[chunked_array_iterator.current_chunk_index_ - 1];
else
current_iterator += chunked_array_iterator.index_;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is enforced in Arrow but I suggest to always use curly-braces for if-else blocks.
Nit: This can be simplified to

auto& current_iterator = chunked_array_iterator.iterators_list_[chunked_array_iterator.current_chunk_index_] - current_iterator.index() + chunked_array_iterator.index_;
if (chunked_array_iterator.current_chunk_index_) {
  current_iterator += chunked_array_iterator.chunks_lengths_[chunked_array_iterator.current_chunk_index_ - 1];
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this suggestion for the record. We don't enforce it but we try to maintain it.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from 2941f70 to 158e7c5 Compare April 27, 2022 20:54
@AlvinJ15 AlvinJ15 requested a review from pitrou April 27, 2022 21:14
@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from 158e7c5 to b4074bb Compare April 27, 2022 21:38
@@ -141,6 +141,9 @@ class ARROW_EXPORT ChunkedArray {
/// \brief Return the type of the chunked array
const std::shared_ptr<DataType>& type() const { return type_; }

/// \brief Return chunk resolver of the chunked array
const internal::ChunkResolver& GetChunkResolver() const { return chunk_resolver_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exposing an internal facility as a public member API. I don't know what our API hygiene for this should be. @westonpace @xhochy @lidavidm Thoughts?

(we could also make this private and use a friend declaration for the consuming class)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least it needs to be clearly marked off as an internal API, right now it looks like any other public API. Ideally it would be private with a friend declaration.

Suggested change
const internal::ChunkResolver& GetChunkResolver() const { return chunk_resolver_; }
const internal::ChunkResolver& chunk_resolver() const { return chunk_resolver_; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of using a friend declaration.

Comment on lines 313 to 314
std::vector<int64_t> chunks_lengths_;
std::vector<ArrayIterator<ArrayType>> iterators_list_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised, are these members still needed? Ideally, everything can be done using the ChunkResolver...

Comment on lines 252 to 255
auto chunk0 = ArrayFromJSON(int32(), "[4, 5, null]");
auto chunk1 = ArrayFromJSON(int32(), "[6]");

ASSERT_OK_AND_ASSIGN(auto result, ChunkedArray::Make({chunk0, chunk1}, int32()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note there's a ChunkedArrayFromJSON.

Can we also test a chunked array with no chunks?

cpp/src/arrow/stl_iterator_test.cc Show resolved Hide resolved
@@ -141,6 +141,9 @@ class ARROW_EXPORT ChunkedArray {
/// \brief Return the type of the chunked array
const std::shared_ptr<DataType>& type() const { return type_; }

/// \brief Return chunk resolver of the chunked array
const internal::ChunkResolver& GetChunkResolver() const { return chunk_resolver_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least it needs to be clearly marked off as an internal API, right now it looks like any other public API. Ideally it would be private with a friend declaration.

Suggested change
const internal::ChunkResolver& GetChunkResolver() const { return chunk_resolver_; }
const internal::ChunkResolver& chunk_resolver() const { return chunk_resolver_; }

@edponce
Copy link
Contributor

edponce commented Apr 28, 2022

Here are some observations on the Array/ChunkedArray iterator support in Arrow. I am not requesting to resolve all of these alongside this PR, but they came to mind while reviewing this PR.

  1. For ChunkedArray, how would we know if the iterator is in the begin/end?
    I suggest we add begin/end methods to ChunkedArray ( for example, see NumericArray) so that it can be traversed in for-loop idioms.
    NOTE: Currently, not all Arrow Arrays have begin/end methods.

    # Iterator
    for (auto it = chunked_array.begin(); it != chunked_array.end(); ++it) { ... }
    
    # STL for_each
    std::for_each(chunked_array.begin(), chunked_array.end(), [](T const& elem) { ... }
    
    # range-for
    for (const auto& elem: chunked_array) { ... }
  2. C++ has begin/end and their const variants cbegin/cend. In Arrow Array and ChunkedArray are immutable types, so I would expect for them to only support cbegin/cend, but range-for requires begin/end. Here is an example showing how we can constraint mutability through the iterators:

#include <iostream>
#include <vector>

struct A {
  using T = int;
  using vector_type = typename std::vector<T>;
  using iterator_type = typename std::vector<const T>::iterator;

  A(vector_type values) : values_(values) {}

  // NOTE: We are disguising cbegin() via begin()
  const iterator_type begin() const {
    return std::cbegin(values_);
  }

  // NOTE: We are disguising cend() via end()
  const iterator_type end() const {
    return std::cend(values_);
  }

 private:
  vector_type values_;
};


int main() {
  auto a = A{{1, 2, 3, 4, 5}};

  // Value is a copy so it can be mutated
  for (auto v : a) {
    v += 10;
    std::cout << v << " ";
  }

  // Value is a non-const reference taken from a const iterator,
  // so it cannot be mutated
  for (auto& v : a) {
    // v += 10;
    std::cout << v << " ";
  }

  // Value is a const reference taken from a const iterator,
  // so it cannot be mutated
  for (const auto& v : a) {
    // v += 10;
    std::cout << v << " ";
  }
}
  1. There is an experimental MultipleChunkIterator that seems to not be used in the codebase. Should it be removed?

@pitrou
Copy link
Member

pitrou commented Apr 28, 2022

  1. I suggest we add begin/end methods to ChunkedArray

That wouldn't work, because there are no specialized subclasses of ChunkedArray. We want the iterator to yield raw C values whose type is datatype-specific.

There is an experimental MultipleChunkIterator that seems to not be used in the codebase.

It's used by ApplyBinaryChunked.

@edponce
Copy link
Contributor

edponce commented Apr 28, 2022

That wouldn't work, because there are no specialized subclasses of ChunkedArray. We want the iterator to yield raw C values whose type is datatype-specific.

...I still think it can work for the Array types that currently support iteration (std::enable_if magic). The ChunkedArrayIterator in this PR can extract raw C values from a ChunkedArray so I am not suggesting something different.

It's used by ApplyBinaryChunked.

My mistake, did a grep and thought only declaration/definition were in chunked_array.h/cc

@lidavidm
Copy link
Member

That wouldn't work, because there are no specialized subclasses of ChunkedArray. We want the iterator to yield raw C values whose type is datatype-specific.

...I still think it can work for the Array types that currently support iteration (std::enable_if magic). The ChunkedArrayIterator in this PR can extract raw C values from a ChunkedArray so I am not suggesting something different.

Er, you simply can't implement begin right? It would have to be

template <typename Type>
ChunkedArrayIterator<Type> begin() const;

which is a little clumsy

@pitrou
Copy link
Member

pitrou commented Apr 28, 2022

...I still think it can work for the Array types that currently support iteration (std::enable_if magic).

begin() and end() must return a well-known iterator type and it cannot depend on the datatype.

@edponce
Copy link
Contributor

edponce commented Apr 28, 2022

I am not fully convinced but no need on continuing this rant. I will wait for this PR to be merged and try something out.
Thanks for the all the great feedback!

@edponce
Copy link
Contributor

edponce commented Apr 29, 2022

@pitrou @lidavidm I see now the issue with typing and ChunkedArray. Related to this, @AlvinJ15 attempted to add ChunkedArrayIterator as a friend class to ChunkedArray, but ChunkedArrayIterator requires the ArrayType template parameter.

ChunkedArray is agnostic of the Array type which ChunkedArrayIterator requires. Nevertheless, we can get the ArrayType via TypeTraits if we have the data type at compile-time. Specialized Arrays have the TypeClass attribute which would work, but this attribute is not part of the base Array class which is what ChunkedArray works with.

If we could add a TypeClass attribute to base Array then I think we are good, but what value would it have?
Other ideas?

@lidavidm
Copy link
Member

If we could add a TypeClass attribute to base Array then I think we are good, but what value would it have?

How would this work? The value would vary at runtime, but we need it to be compile time.

template <typename Type, typename ArrayType = typename TypeTraits<Type>::ArrayType>
ChunkedArrayIterator<ArrayType> Iterate(const ChunkedArray& chunked_array) {
return stl::ChunkedArrayIterator<ArrayType>(chunked_array);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these Iterate methods? One can simply create the corresponding iterator directly. Also, it is prone to error because it allows for the Type parameter to be different from the actual Arrays contained in the chunked_array.

On the other hand, if we are to keep factory functions, the convention is to prefix them with Make, so MakeIterator would be more clear, as Iterate is an action verb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add a condition for validate that and raise properly an error message, but I don't know if this is a good solution for that problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edponce Iterating is what this API is meant for. We have a separate Iterator concept in Arrow C++ that we don't want to mix up with this.

@edponce
Copy link
Contributor

edponce commented Apr 29, 2022

A solution is to have the factory MakeIterator not be templated and check the type of the chunked_array at runtime, and use a switch-case to dispatch the correct ChunkedArrayIterator. Then the MakeIterator would be the friend to ChunkedArray and then you can pass the chunk_resolver_ as a parameter to the iterator.

I am simply thinking out loud.

@edponce
Copy link
Contributor

edponce commented Apr 29, 2022

@pitrou @lidavidm I apologize for all the unnecessary comments/discussion I made, I thought this was a new JIRA issue (the 602 clearly gives it away :s) and had no idea of all the discussions that had happened in the JIRA comments.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from ade6af7 to df5b3e8 Compare May 3, 2022 23:30
@AlvinJ15 AlvinJ15 requested review from pitrou and lidavidm May 3, 2022 23:47
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. This is looking better, but here are some more comments.

@@ -59,11 +62,12 @@ class ArrayIterator {

// Value access
value_type operator*() const {
return array_->IsNull(index_) ? value_type{} : array_->GetView(index_);
return !array_ || array_->IsNull(index_) ? value_type{} : array_->GetView(index_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any circumstance where array_ is null and you're still calling this operator? This doesn't seem right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a ChunkedArray is empty, I initialize an ArrayIterator with the empty constructor which initialize the array_ with a NULLPTR, and this extra condition is more feasible/short than adding more conditions on ChunkedArrayIterator for validate multiple cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think dereferencing such an iterator should be a valid operation, though, so the check should not be needed.

Comment on lines 152 to 155
current_array_iterator_ = ArrayIterator<ArrayType>(
*arrow::internal::checked_pointer_cast<ArrayType>(
chunked_array_->chunk(static_cast<int>(chunk_location.chunk_index))),
index_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more logical:

Suggested change
current_array_iterator_ = ArrayIterator<ArrayType>(
*arrow::internal::checked_pointer_cast<ArrayType>(
chunked_array_->chunk(static_cast<int>(chunk_location.chunk_index))),
index_);
current_array_iterator_ = ArrayIterator<ArrayType>(
*arrow::internal::checked_pointer_cast<ArrayType>(
chunked_array_->chunk(static_cast<int>(chunk_location.chunk_index))),
chunk_location.index_in_chunk);

... and then you don't need the subtraction below.


value_type operator[](difference_type n) const {
auto chunk_location = GetChunkLocation(index_ + n);
if (current_chunk_index_ == chunk_location.chunk_index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this if branch is needed, constructing a ArrayIterator should be cheap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arrow::internal::checked_pointer_cast<ArrayType> when creating an ArrayIterator, is less expensive than int64_t comparissons and a arithmetic operation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, first you should use checked_cast as there's no reason to create a new shared_ptr. Second, in release mode checked_cast is a static cast that doesn't have any actual cost.

ChunkedArrayIterator& operator+=(difference_type n) {
index_ += n;
auto chunk_location = GetChunkLocation(index_);
if (current_chunk_index_ == chunk_location.chunk_index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think this if branch is needed as constructing an ArrayIterator should be cheap.

Comment on lines +183 to +197
ChunkedArrayIterator& operator++() {
(*this) += 1;
return *this;
}
ChunkedArrayIterator& operator--() {
(*this) -= 1;
return *this;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one place where current_array_iterator_ could be useful for performance but you're not using it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra conditions would have to be added to control overflows, so I better call the operator += which already has the necessary conditions, and also the ChunkResolver implementation managing a cache which avoids repeated searches when it is in the same chunk

ASSERT_EQ(it[1], 11);
ASSERT_EQ(it[2], 13);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. can you add tests with an empty chunked array?
  2. can you add tests with some standard algorithms as done above for ArrayIterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. At the beginning of the EmptyChunks it has and empty ChunkedArray
TEST(ChunkedArrayIterator, EmptyChunks) {
  auto result = ChunkedArrayFromJSON(int32(), {});
  auto it = Iterate<Int32Type>(*result);
  ASSERT_EQ(*it, nullopt);
  ASSERT_EQ(it[0], nullopt);
  1. Added

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch 2 times, most recently from 4b74c8e to 883b51f Compare May 5, 2022 01:26
@AlvinJ15
Copy link
Contributor Author

AlvinJ15 commented May 5, 2022

@lidavidm @pitrou @edponce How can I check why it is failing just in Windows?

@pitrou
Copy link
Member

pitrou commented May 5, 2022

@pitrou
Copy link
Member

pitrou commented May 5, 2022

Hmm, I think my Iterate suggestion was a bit too vague. To satisfy the requirements for iteration and for call STL algorithms, we need "begin" and "end" iterators. Basically, I would suggest something like:

/// Return an iterator to the beginning of the chunked array
template <typename Type, typename ArrayType = typename TypeTraits<Type>::ArrayType>
ChunkedArrayIterator<ArrayType> Begin(const ChunkedArray& chunked_array) {
  return stl::ChunkedArrayIterator<ArrayType>(chunked_array);
}

/// Return an iterator to the end of the chunked array
template <typename Type, typename ArrayType = typename TypeTraits<Type>::ArrayType>
ChunkedArrayIterator<ArrayType> End(const ChunkedArray& chunked_array) {
  return stl::ChunkedArrayIterator<ArrayType>(chunked_array, chunked_array.length());
}

template <typename ArrayType>
struct ChunkedArrayRange {
  const ChunkedArray* chunked_array;

  ChunkedArrayIterator<ArrayType> begin() {
    return stl::ChunkedArrayIterator<ArrayType>(*chunked_array);
  }
  ChunkedArrayIterator<ArrayType> end() {
    return stl::ChunkedArrayIterator<ArrayType>(*chunked_array, chunked_array.length());
  }
};

/// Return an iterable range over the chunked array
///
/// This makes it easy to use a chunked array in a for-range construct.
template <typename Type, typename ArrayType = typename TypeTraits<Type>::ArrayType>
ChunkedArrayRange<ArrayType> Iterate(const ChunkedArray& chunked_array) {
  return stl::ChunkedArrayRange<ArrayType>(&chunked_array);
}

This way you can write:

for (util::optional<int64_t> : Iterate<Int64Type>(chunked_array)) {
  ...

as well as:

// Find the first item that's non-null and at least 42
auto it = std::find_if(
    Begin<Int64Type>(chunked_array), End<Int64Type>(chunked_array),
    [util::optional<int64_t> item]() { return item.has_value() && *item >= 42});

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch 2 times, most recently from 3acf34f to 68e1ee8 Compare May 9, 2022 19:09
@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from 68e1ee8 to 50dac46 Compare May 9, 2022 23:29
@AlvinJ15 AlvinJ15 requested a review from pitrou May 9, 2022 23:46
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the update @AlvinJ15 ! This looked mostly good, I've pushed a couple very minor changes.

@pitrou
Copy link
Member

pitrou commented May 10, 2022

@westonpace @lidavidm Feel free to leave any comments on the API even after this is merged.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, some minor suggestions but looks good overall

namespace stl {
template <typename T, typename V>
class ChunkedArrayIterator;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird, doesn't the linter usually require } // namespace stl here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should indeed.

// Value access
value_type operator*() const { return *current_array_iterator_; }

value_type operator[](difference_type n) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we DCHECK(chunked_array_) in most of these operations to guard against use of a default-initialized iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some assert for check it.

}
ChunkedArrayIterator& operator+=(difference_type n) {
index_ += n;
if (index_ != chunked_array_->length()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also DCHECK here and in the constructor that we aren't exceeding end()?

Copy link
Member

@pitrou pitrou May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DCHECKs are not allowed in public headers. We could perhaps use an assert though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I always forget that…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check should be if (index_ < chunked_array_->length()) so that it does not include index values that exceed or equal the length. Also, here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edponce Hmm, possibly. The C++ spec doesn't clearly say whether creating an out-of-bounds iterator is well-defined. However, the operational semantics involve incrementing/decrementing and it seems that incrementing/decrementing past extremities is undefined. @bkietz Do you have any thoughts on this?

Copy link
Contributor

@edponce edponce May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that any value is valid as long as you can return to the initial value with its inverse operation. So the check != follows the operational semantics of a RandomAccessIterator.
Actually, the < check would be insufficient because it allows negative indices (need absolute value of index).

The question now becomes, should Arrow iterators guard against out-of-bounds accesses?
From a performance perspective, they have basically the same cost (+ std::abs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, my preference is to maximize the similarity of any random access iterators to indices/non-owning pointers. In this case I'd try to avoid accessing elements at all when incrementing the iterator, deferring the resolution of current_array_iterator_ until something is actually dereferenced. This would also move out-of-bounds/invalid current array checking into the dereferencing member functions where correct behavior is far more obvious.

The specific question of how to validate out-of-bounds iterators becomes easier to answer too: We can construct an iterator with whatever index and compare/increment/etc without worry since that operates only on the index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edponce In https://en.cppreference.com/w/cpp/named_req/ForwardIterator, you can see that only dereferenceable iterators can be incremented. And all other operations are derived from incrementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional observations on current operator+=:

  • This dereference can be invalid because chunk_index is not validated and can be out-of-bounds.

  • An Array and an Iterator are constructed for every access via +=. There are opportunities to reuse current_array_iterator_.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz Ok, I did that. It also turns out it's not very useful to memoize current_array_iterator_ currently.

// Arithmetic
difference_type operator-(const ChunkedArrayIterator& other) const {
return index_ - other.index_;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method part of the public API of a standard iterator? or is it a helper method that can be private?
I couldn't find where it is being used.

Also, why is there not an equivalent method for addition?

Copy link
Contributor Author

@AlvinJ15 AlvinJ15 May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is used in the Test ChunkedArrayIterator.Arithmetic ASSERT_EQ(it - it2, -2);
There is a similar method in ArrayIterator, for calculate the difference between two iterators(like a distance), I think make a + operator would not be used

Copy link
Member

@pitrou pitrou May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edponce See https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator for the requirements for a random access iterator.

template <typename Type, typename ArrayType = typename TypeTraits<Type>::ArrayType>
ChunkedArrayIterator<ArrayType> End(const ChunkedArray& chunked_array) {
return stl::ChunkedArrayIterator<ArrayType>(chunked_array, chunked_array.length());
}
Copy link
Contributor

@edponce edponce May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Sometimes ChunkedArrayIterator<> is used and other times stl::ChunkedArrayIterator<>.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays branch from b94cf75 to 48c815c Compare May 11, 2022 08:27
@pitrou pitrou closed this in 9ae12a5 May 11, 2022
@ursabot
Copy link

ursabot commented May 13, 2022

Benchmark runs are scheduled for baseline = 0bae875 and contender = 9ae12a5. 9ae12a5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.31% ⬆️0.0%] test-mac-arm
[Finished ⬇️2.5% ⬆️0.36%] ursa-i9-9960x
[Finished ⬇️1.54% ⬆️0.43%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9ae12a56 ec2-t3-xlarge-us-east-2
[Finished] 9ae12a56 test-mac-arm
[Finished] 9ae12a56 ursa-i9-9960x
[Finished] 9ae12a56 ursa-thinkcentre-m75q
[Finished] 0bae875b ec2-t3-xlarge-us-east-2
[Failed] 0bae875b test-mac-arm
[Finished] 0bae875b ursa-i9-9960x
[Finished] 0bae875b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 13, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants