Contributor

@SuKi2cn SuKi2cn commented Nov 21, 2025

This PR addresses issue #330 by introducing aggregate expressions & execution support.

  • Add aggregate expression family (count / count_null / count_star / max / min)
    with bound/unbound types, visitor and binder support.

  • Add AggregateEvaluator for count/max/min execution over StructLike rows.

  • Expose aggregate factories in Expressions and wire into CMake/Meson builds
    with new aggregate tests.

@SuKi2cn SuKi2cn mentioned this pull request Nov 21, 2025
6 tasks
Member

@wgtmac wgtmac left a comment

Thanks for your contribution! I just skimmed the design, and it does not seem consistent with either the existing C++ code or the Java implementation. What I have in mind is something like the following:

template <typename T>
concept TermType = std::derived_from<T, Term>;

template <TermType T>
class ICEBERG_EXPORT Aggregate : public virtual Expression {
 public:
  Expression::Operation op() const override { return operation_; }

  const std::shared_ptr<T>& term() const { return term_; }

 protected:
  Expression::Operation operation_;
  std::shared_ptr<T> term_;
};

class ICEBERG_EXPORT UnboundAggregate : public virtual Expression,
                                        public Unbound<Expression> {
 public:
  Result<std::shared_ptr<Expression>> Bind(const Schema& schema,
                                           bool case_sensitive) const override = 0;

  bool is_unbound_aggregate() const override { return true; }
};

template <typename B>
class ICEBERG_EXPORT UnboundAggregateImpl : public UnboundAggregate,
                                            public Aggregate<UnboundTerm<B>> {
  using BASE = Aggregate<UnboundTerm<B>>;

 public:
  std::shared_ptr<NamedReference> reference() override {
    return BASE::term() ? BASE::term()->reference() : nullptr;
  }

  Result<std::shared_ptr<Expression>> Bind(const Schema& schema,
                                           bool case_sensitive) const override;
};

class ICEBERG_EXPORT BoundAggregate : public Aggregate<BoundTerm>, public Bound {
 public:
  using Aggregate<BoundTerm>::op;
  using Aggregate<BoundTerm>::term;

  std::shared_ptr<BoundReference> reference() override {
    return term_ ? term_->reference() : nullptr;
  }

  Result<Literal> Evaluate(const StructLike& data) const override;

  bool is_bound_aggregate() const override { return true; }

  enum class Kind : int8_t {
    // Count aggregates (COUNT, COUNT_STAR, COUNT_NULL)
    kCount = 0,
    // Value aggregates (MIN, MAX)
    kValue,
  };

  virtual Kind kind() const = 0;
};

class ICEBERG_EXPORT CountAggregate : public BoundAggregate {
 public:
  Result<Literal> Evaluate(const StructLike& data) const override;

  Kind kind() const override { return Kind::kCount; }
};

class ICEBERG_EXPORT ValueAggregate : public BoundAggregate {
 public:
  Result<Literal> Evaluate(const StructLike& data) const override;
  
  Kind kind() const override { return Kind::kValue; }
};

};

/// \brief COUNT aggregate variants.
class ICEBERG_EXPORT CountAggregate : public UnboundAggregate {
Member

We don't need to define subclasses for unbound aggregates. Please check the current design of Predicate in both the C++ and Java implementations.

Contributor Author

Gotcha! In progress.

Contributor Author

SuKi2cn commented Nov 21, 2025

I've updated the design to align with the current Predicate pattern and the Java implementation:

  • Removed unbound aggregate subclasses (e.g. CountAggregate on the unbound side).
  • Introduced UnboundAggregateImpl to handle all unbound aggregates, similar to Predicate.
  • Refactored BoundAggregate to be the primary execution unit and moved row-level logic into its subclasses.
  • Updated AggregateEvaluator to act as a driver over bound aggregates instead of owning aggregate semantics.
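As a rough illustration of the "evaluator as driver" split described in the list above, the following self-contained sketch keeps the row-level semantics inside each aggregate and leaves the evaluator with nothing but a loop. Every name in it (`Row`, `Aggregator`, `CountNonNull`, `CountStar`, `MaxValue`, `Evaluator`) is hypothetical and simplified to integer columns; it is not the API added by this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical row type: one optional integer per column (nullopt means NULL).
using Row = std::vector<std::optional<std::int64_t>>;

// Each aggregate owns its own row-level semantics.
class Aggregator {
 public:
  virtual ~Aggregator() = default;
  virtual void Update(const Row& row) = 0;
  virtual std::optional<std::int64_t> Value() const = 0;
};

// COUNT(col): counts rows whose column at pos_ is non-null.
class CountNonNull : public Aggregator {
 public:
  explicit CountNonNull(std::size_t pos) : pos_(pos) {}
  void Update(const Row& row) override {
    if (row.at(pos_).has_value()) ++count_;
  }
  std::optional<std::int64_t> Value() const override { return count_; }

 private:
  std::size_t pos_;
  std::int64_t count_ = 0;
};

// COUNT(*): counts every row and needs no column.
class CountStar : public Aggregator {
 public:
  void Update(const Row&) override { ++count_; }
  std::optional<std::int64_t> Value() const override { return count_; }

 private:
  std::int64_t count_ = 0;
};

// MAX(col): largest non-null value seen so far, nullopt if there was none.
class MaxValue : public Aggregator {
 public:
  explicit MaxValue(std::size_t pos) : pos_(pos) {}
  void Update(const Row& row) override {
    const auto& v = row.at(pos_);
    if (v && (!max_ || *v > *max_)) max_ = v;
  }
  std::optional<std::int64_t> Value() const override { return max_; }

 private:
  std::size_t pos_;
  std::optional<std::int64_t> max_;
};

// The driver only forwards rows and collects results.
class Evaluator {
 public:
  void Add(std::unique_ptr<Aggregator> agg) { aggs_.push_back(std::move(agg)); }
  void Update(const Row& row) {
    for (auto& agg : aggs_) agg->Update(row);
  }
  std::vector<std::optional<std::int64_t>> Results() const {
    std::vector<std::optional<std::int64_t>> out;
    out.reserve(aggs_.size());
    for (const auto& agg : aggs_) out.push_back(agg->Value());
    return out;
  }

 private:
  std::vector<std::unique_ptr<Aggregator>> aggs_;
};
```

With this shape, supporting another input type would mean adding another `Update` overload to the aggregates rather than changing the driver.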

@SuKi2cn SuKi2cn requested a review from wgtmac November 21, 2025 12:06
@Jinchul81 Jinchul81 left a comment

  • If this follows the Iceberg Java implementation, please add some context.
  • In the header files, please add a comment for each function and member variable.
  • Unit test coverage for this change looks insufficient. I am not sure whether you are already working on more unit tests; please add more test cases.


namespace {

std::string OperationToPrefix(Expression::Operation op) {

Why don't we use std::string_view instead of std::string if the return values should be constant string literals?

Contributor Author

> Why don't we use std::string_view instead of std::string if the return values should be constant string literals?

Good point — these return values are all string literals with static lifetime, so std::string_view is more appropriate here. I'll update the signature accordingly.

Please make the return type constexpr std::string_view.
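A minimal sketch of what the requested signature could look like. The `Operation` enum below is a stand-in for `Expression::Operation`, and the returned prefixes are assumed from the aggregate names in the PR description; the strings used in the actual patch may differ.

```cpp
#include <string_view>

// Hypothetical stand-in for Expression::Operation, aggregate operations only.
enum class Operation { kCount, kCountNull, kCountStar, kMax, kMin };

// constexpr applies to the function; the return type is std::string_view.
// Each returned view refers to a string literal, which has static storage
// duration, so the view cannot dangle.
constexpr std::string_view OperationToPrefix(Operation op) {
  switch (op) {
    case Operation::kCount:
      return "count";
    case Operation::kCountNull:
      return "count_null";
    case Operation::kCountStar:
      return "count_star";
    case Operation::kMax:
      return "max";
    case Operation::kMin:
      return "min";
  }
  return {};  // Only reached for a value outside the enumerators listed above.
}

// Because the function is constexpr, the mapping can be checked at compile time.
static_assert(OperationToPrefix(Operation::kMax) == "max");
```

Compared with returning `std::string`, this avoids constructing a string on every call.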


// Aggregates

std::shared_ptr<CountAggregate> Expressions::Count(std::string name) {

You can change the parameter to std::string&& to avoid an unnecessary copy and clearly express that the function intends to take ownership of the string. Using an rvalue reference makes the API’s intent clearer and can reduce overhead in performance-critical paths.
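For comparison, here is a small sketch of the two parameter styles under discussion. `NamedRef`, `MakeByValue` and `MakeByRvalue` are hypothetical names used only for illustration. A by-value parameter accepts both lvalues (one copy) and rvalues (a move), while an rvalue-reference parameter binds only to temporaries or to arguments wrapped in `std::move`.

```cpp
#include <string>
#include <utility>

// Hypothetical holder standing in for a named reference.
struct NamedRef {
  std::string name;
};

// By-value sink: an lvalue argument is copied once, an rvalue is moved.
inline NamedRef MakeByValue(std::string name) {
  return NamedRef{std::move(name)};
}

// Rvalue-reference sink: only temporaries or std::move'd strings bind here,
// so the transfer of ownership is visible at every call site.
inline NamedRef MakeByRvalue(std::string&& name) {
  return NamedRef{std::move(name)};
}
```

With the second form, `MakeByRvalue(col)` for an lvalue `col` does not compile; the caller has to write `MakeByRvalue(std::move(col))` or pass a temporary.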

  return std::shared_ptr<CountAggregate>(std::move(agg));
}

std::shared_ptr<CountAggregate> Expressions::CountNull(std::string name) {

Same as above.

  return std::shared_ptr<CountAggregate>(std::move(agg));
}

std::shared_ptr<CountAggregate> Expressions::CountNotNull(std::string name) {

Same as above.

  return std::shared_ptr<CountAggregate>(std::move(agg));
}

std::shared_ptr<ValueAggregate> Expressions::Max(std::string name) {

Same as above.

    default:
      break;
  }
  return "aggregate";

I don't think this is the right return value. Why do we have to map the others to aggregate?

      Expression::Operation::kCount, Mode::kNull, std::move(term), std::move(ref)));
}

std::unique_ptr<CountAggregate> CountAggregate::CountStar() {

Why does it return std::unique_ptr<CountAggregate> instead of Result<std::unique_ptr<CountAggregate>>?

    : BoundAggregate(op, std::move(term)), mode_(mode) {}

std::string BoundCountAggregate::ToString() const {
  if (mode_ == CountAggregate::Mode::kStar) {

Why is there special handling for kStar?

Contributor Author

> Why is there special handling for kStar?

Good catch — mapping all remaining operations to "aggregate" is not really correct.

In the Java implementation, Aggregate.toString() handles each aggregate operation explicitly and the default branch throws an UnsupportedOperationException("Invalid aggregate: " + op()), so there is no generic "aggregate" fallback.

I’ll update the C++ helper to mirror that behavior: only the supported aggregate operations will be handled explicitly, and for anything else we’ll treat it as invalid/unreachable instead of returning "aggregate". This way we don’t silently hide unsupported operations and stay consistent with the Java implementation.

Let's set aside the idea of copying the Java implementation. Can you please clarify why we cannot use a generic approach here?

Contributor Author

> Let's set aside the idea of copying the Java implementation. Can you please clarify why we cannot use a generic approach here?

Got it, thanks for clarifying.

I understand your point. The goal here is not to blindly mirror the Java implementation, but to design a more generic and idiomatic C++ solution.

I'll take some time to revisit the current design and see how kStar can be modeled in a more generic way (e.g. reducing special-casing in ToString() and improving consistency with the Predicate style), while still preserving the same external semantics.

I’ll follow up with an updated implementation once I’ve reworked this part.

Please file an issue for the follow-up item, and add a code comment that explains why the special handling is applied and includes TODO(#XYZ).

      bool case_sensitive) const override;

 private:
  CountAggregate(Expression::Operation op, Mode mode,

Please add a comment explaining why a public constructor is not allowed.
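One plausible rationale such a comment could record, sketched under assumptions: if construction goes only through static factories, each mode is always paired with a valid argument set. `CountExpr` and its members are hypothetical and simplified (a column name instead of a term); only the `kNull` and `kStar` mode names come from the diff, and `kNonNull` is invented here.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>

class CountExpr {
 public:
  enum class Mode { kNonNull, kNull, kStar };

  // Factories are the only way to build an instance.
  static std::unique_ptr<CountExpr> Count(std::string column) {
    return std::unique_ptr<CountExpr>(
        new CountExpr(Mode::kNonNull, std::move(column)));
  }
  static std::unique_ptr<CountExpr> CountNull(std::string column) {
    return std::unique_ptr<CountExpr>(
        new CountExpr(Mode::kNull, std::move(column)));
  }
  static std::unique_ptr<CountExpr> CountStar() {
    return std::unique_ptr<CountExpr>(new CountExpr(Mode::kStar, std::nullopt));
  }

  Mode mode() const { return mode_; }
  const std::optional<std::string>& column() const { return column_; }

 private:
  // Private on purpose: a public constructor would allow invalid combinations,
  // such as kStar with a column or kNull without one.
  CountExpr(Mode mode, std::optional<std::string> column)
      : mode_(mode), column_(std::move(column)) {}

  Mode mode_;
  std::optional<std::string> column_;
};
```

Note that `std::make_unique` cannot reach a private constructor, which is why the factories call `new` directly.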

@SuKi2cn SuKi2cn marked this pull request as draft November 23, 2025 10:57

Result<std::shared_ptr<Expression>> CountAggregate::Bind(const Schema& schema,
                                                         bool case_sensitive) const {
  std::shared_ptr<BoundTerm> bound_term;
  if (term_ != nullptr) {

Please write the code like if (term_ != nullptr) [[unlikely]] { to help branch prediction.

  ICEBERG_DCHECK(aggregate != nullptr, "Aggregate cannot be null");

  if (auto count = std::dynamic_pointer_cast<BoundCountAggregate>(aggregate)) {
    if (count->mode() != CountAggregate::Mode::kStar && !count->term()) {

Please write the code like if (count->mode() != CountAggregate::Mode::kStar && !count->term()) [[unlikely]] { to help branch prediction.

  explicit VectorStructLike(std::vector<Scalar> fields) : fields_(std::move(fields)) {}

  Result<Scalar> GetField(size_t pos) const override {
    if (pos >= fields_.size()) {

Please write the code like if (pos >= fields_.size()) [[unlikely]] { to help branch prediction.
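A sketch of the requested form, assuming a C++20 compiler (the `[[likely]]` and `[[unlikely]]` attributes were added in C++20). The attribute is a hint that the compiler may use when laying out code; it does not change behavior. `GetFieldOrNull` is a hypothetical helper over a plain `std::vector<int>`, not the `VectorStructLike` class from the test.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical accessor mirroring the bounds check in the test helper:
// returns std::nullopt when pos is out of range.
inline std::optional<int> GetFieldOrNull(const std::vector<int>& fields,
                                         std::size_t pos) {
  if (pos >= fields.size()) [[unlikely]] {
    return std::nullopt;  // Error path, expected to be rare.
  }
  return fields[pos];
}
```

The hint is only worth adding on branches that really are rare, such as error paths; marking a common branch as unlikely can make the generated code worse.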

Contributor Author

SuKi2cn commented Nov 24, 2025

@wgtmac @Jinchul81 Thanks for the review, I'll take some time to address these changes and follow up shortly.

SuKi2cn and others added 4 commits November 24, 2025 07:48
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
This PR refactors the C++ aggregate implementation to better align with the Java Iceberg design and existing Predicate patterns.

The changes introduce a dedicated Aggregator hierarchy, simplify COUNT handling by splitting it into distinct classes, and improve evaluator extensibility for future input types (e.g. DataFile).
@SuKi2cn SuKi2cn marked this pull request as ready for review November 24, 2025 18:07
@SuKi2cn SuKi2cn requested a review from wgtmac November 24, 2025 18:07
This change refactors the aggregate framework to better match the Java implementation and improves API clarity, performance, and test coverage.
@SuKi2cn SuKi2cn requested a review from wgtmac November 26, 2025 09:36
Member

@wgtmac wgtmac left a comment

Thanks for working on this, @SuKi2cn! This looks good to me.

Contributor Author

SuKi2cn commented Nov 28, 2025

> Thanks for working on this, @SuKi2cn! This looks good to me.

Thanks a lot for the review and guidance, @wgtmac! I really appreciate it. Happy Thanksgiving!

@wgtmac wgtmac merged commit dbcbdf2 into apache:main Nov 28, 2025
10 checks passed
@SuKi2cn SuKi2cn deleted the fix-issue-330 branch November 28, 2025 05:17
HuaHuaY added a commit to HuaHuaY/iceberg-cpp that referenced this pull request Nov 28, 2025