@AliRana30 AliRana30 commented Jan 31, 2026

Rationale for This Change

The SparseCSFIndex::Equals method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the indices() and indptr() vectors of the current object and accesses the corresponding elements in the other object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

What Changes Are Included in This PR?

This change adds explicit size equality checks for the indices() and indptr() vectors at the beginning of the SparseCSFIndex::Equals method. If the dimensions do not match, the method now safely returns false instead of attempting invalid memory access.
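As a rough illustration of the guard described above, the following self-contained sketch uses a hypothetical FakeTensor type and hypothetical TensorVectorsEqual and MakeTensors helpers in place of Arrow's real classes (none of these names come from the Arrow codebase). It only shows the pattern: compare sizes first, then compare elements.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the per-dimension tensors held by a sparse index.
struct FakeTensor {
  std::vector<long> values;
  bool Equals(const FakeTensor& other) const { return values == other.values; }
};

using TensorVector = std::vector<std::shared_ptr<FakeTensor>>;

// Compare two vectors of tensors element by element, but only after
// confirming that both vectors have the same length.
bool TensorVectorsEqual(const TensorVector& lhs, const TensorVector& rhs) {
  if (lhs.size() != rhs.size()) return false;  // guard against out-of-bounds reads
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    if (!lhs[i]->Equals(*rhs[i])) return false;
  }
  return true;
}

// Illustration-only helper: builds `n` single-element tensors holding 0..n-1.
TensorVector MakeTensors(std::size_t n) {
  TensorVector out;
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(std::make_shared<FakeTensor>(FakeTensor{{static_cast<long>(i)}}));
  }
  return out;
}
```

With this pattern, comparing a two-element vector against a three-element one returns false without ever indexing past the end of the shorter vector.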

Are These Changes Tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

Are There Any User-Facing Changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@AliRana30 AliRana30 changed the title from "[C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions" to "GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions" on Jan 31, 2026
@github-actions

⚠️ GitHub issue #49104 has been automatically assigned in GitHub to PR creator.

@kou
Member

kou commented Feb 1, 2026

Could you add a test for this case?

@AliRana30 AliRana30 force-pushed the fix-sparsecsfindex-equals-segfault branch from d60ea08 to 3e1cbd6 Compare February 1, 2026 13:50
@AliRana30
Author

@kou As requested, I have added a new test case, TestEqualityMismatchedDimensions, in cpp/src/arrow/sparse_tensor_test.cc.

Test details:
This test compares SparseCSFIndex objects with mismatched dimensions (for example, 1D vs 2D) to verify that Equals now safely returns false instead of causing a segfault.

Fix summary:
The fix adds explicit checks for the sizes of indices, indptr, and axis_order before attempting to iterate over them, ensuring safe comparison when dimensions do not match.

@kou
Member

kou commented Feb 2, 2026

Could you fix the lint failure?

@AliRana30
Author

I have fixed the lint failures by reformatting SparseCSFIndex::Equals() to comply with the 90-character line limit and Arrow's style guide. All functionality remains unchanged.

Please take a look.

The TEST(TestSparseCSFIndex, EqualsMismatchedDimensions) test created
SparseCSFIndex objects with empty tensors (nullptr buffers, 0-length shape),
causing segfaults during validation on ASAN/UBSAN and 'front() called on
empty vector' errors on MSVC. The typed test TestEqualityMismatchedDimensions
already properly validates the fix with valid CSF index structures.
@AliRana30
Author

AliRana30 commented Feb 2, 2026

Note: Some packaging/JNI tests are failing because of Docker image naming on my fork. The core C++ tests should be passing.
Some builds are also failing because of Google Benchmark deprecation warnings in benchmark_util.h, which this PR does not modify. The core fix and its tests are complete.
I don't think these failures are a problem.

@AliRana30
Author

@kou could you take a look at this?

Comment on lines 418 to 426
   for (int64_t i = 0; i < static_cast<int64_t>(indices().size()); ++i) {
-    if (!indices()[i]->Equals(*other.indices()[i])) return false;
+    if (!indices()[i]->Equals(*other.indices()[i])) {
+      return false;
+    }
   }
   for (int64_t i = 0; i < static_cast<int64_t>(indptr().size()); ++i) {
-    if (!indptr()[i]->Equals(*other.indptr()[i])) return false;
+    if (!indptr()[i]->Equals(*other.indptr()[i])) {
+      return false;
+    }
Member

Why is this being changed?

Member

Please revert this to make the PR more readable.

Comment on lines 414 to 416
  if (axis_order().size() != other.axis_order().size()) {
    return false;
  }
Member

Could you explain why this is required?

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Feb 3, 2026
@AliRana30
Author

AliRana30 commented Feb 3, 2026

@raulcd This change fixes a segmentation fault that occurs when comparing SparseCSFIndex objects with mismatched dimensions.

Bug description:
The original implementation calls indices()[i]->Equals(...) without first verifying that both objects have the same number of dimensions. When comparing,
for example, a 2D index with a 3D index:

  • indices().size() returns different values (e.g., 2 vs 3)
  • The loop iterates using the size of the first object
  • Accessing other.indices()[i] with mismatched sizes results in out-of-bounds access, leading to a segmentation fault

Fix:
Lines 408–416 add early size checks before any iteration:

if (indices().size() != other.indices().size()) return false;
if (indptr().size() != other.indptr().size()) return false;
if (axis_order().size() != other.axis_order().size()) return false;

These checks ensure that Equals safely returns false when dimensions do not match, preventing access to invalid memory.

Test coverage:
Added TestEqualityMismatchedDimensions (lines 1644–1661), which reproduces the original issue. Without this fix, the test would crash due to the segmentation fault.

@raulcd
Member

raulcd commented Feb 3, 2026

Hi @Alirana2829, thanks for the comment.
I wasn't asking for a summary of the overall change; I do understand what we are trying to solve with the PR. I am talking about the specific line changes I pointed out in the review comments. Those two specific changes do not seem to be required.

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Feb 3, 2026
Keep only essential size checks. Maintainers requested reverting
formatting changes to reduce diff noise and improve readability.
@AliRana30
Author

@raulcd You're absolutely right - the axis_order size check was redundant. Since the final return axis_order() == other.axis_order() already uses vector equality (which checks size internally), that check was unnecessary.
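A small sketch of why the extra check is redundant, assuming axis_order() returns a std::vector of 64-bit integers (an assumption about the Arrow API, not something shown in this thread). AxisOrdersEqual is a hypothetical helper name used only for illustration.

```cpp
#include <cstdint>
#include <vector>

// std::vector's operator== is true only when both vectors have the same
// number of elements and every pair of elements compares equal, so no
// separate size() comparison is needed before calling it.
bool AxisOrdersEqual(const std::vector<std::int64_t>& lhs,
                     const std::vector<std::int64_t>& rhs) {
  return lhs == rhs;
}
```

Vectors of different lengths compare unequal here, and so do vectors with the same elements in a different order.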

I've removed it. The PR now only contains the essential size checks for indices() and indptr() that prevent the segfault from out-of-bounds access.

@AliRana30
Author

@rok I've reverted the formatting changes. The loop bodies are back to single-line format.

The axis_order().size() check was unnecessary because vector equality
operator already compares sizes. Keeping only the essential checks for
indices() and indptr() that prevent segfault from out-of-bounds access.
@AliRana30 AliRana30 requested review from raulcd and rok February 3, 2026 19:25
Member

@rok rok left a comment

This looks ok, I would just add the one test.

Comment on lines +1662 to +1663
ASSERT_FALSE(si_3D->Equals(*si_2D));
}
Member

Suggested change
   ASSERT_FALSE(si_3D->Equals(*si_2D));
+  ASSERT_TRUE(si_2D->Equals(*si_2D));
 }

Comment on lines 407 to 422
bool SparseCSFIndex::Equals(const SparseCSFIndex& other) const {
  if (indices().size() != other.indices().size()) {
    return false;
  }
  if (indptr().size() != other.indptr().size()) {
    return false;
  }

  for (int64_t i = 0; i < static_cast<int64_t>(indices().size()); ++i) {
    if (!indices()[i]->Equals(*other.indices()[i])) return false;
  }
  for (int64_t i = 0; i < static_cast<int64_t>(indptr().size()); ++i) {
    if (!indptr()[i]->Equals(*other.indptr()[i])) return false;
  }
  return axis_order() == other.axis_order();
}
Member

How about just replacing this function?

Suggested change
bool SparseCSFIndex::Equals(const SparseCSFIndex& other) const {
  auto eq = [](const auto& a, const auto& b) { return a->Equals(*b); };
  return axis_order() == other.axis_order()
      && std::ranges::equal(indices(), other.indices(), eq)
      && std::ranges::equal(indptr(), other.indptr(), eq);
}
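
For comparison, here is a hedged sketch of the same idea using the two-range (four-iterator) overload of std::equal from C++14, swapped in for std::ranges::equal so that the example does not require C++20. Both return false when the two ranges differ in length, which is what makes a separate size check unnecessary. StubTensor, StubVectorsEqual and MakeStubs are hypothetical names, not Arrow types.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical stand-in for a tensor that exposes an Equals() method.
struct StubTensor {
  int id;
  bool Equals(const StubTensor& other) const { return id == other.id; }
};

using StubVector = std::vector<std::shared_ptr<StubTensor>>;

// Length-aware comparison: std::equal with two complete ranges reports false
// when the ranges have different lengths, so it never reads past the end of
// the shorter one.
bool StubVectorsEqual(const StubVector& lhs, const StubVector& rhs) {
  auto eq = [](const auto& a, const auto& b) { return a->Equals(*b); };
  return std::equal(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(), eq);
}

// Illustration-only helper: builds `n` stub tensors with ids 0..n-1.
StubVector MakeStubs(int n) {
  StubVector out;
  for (int i = 0; i < n; ++i) {
    out.push_back(std::make_shared<StubTensor>(StubTensor{i}));
  }
  return out;
}
```

The trade-off is mostly one of style: the explicit size checks keep the diff small, while the algorithm-based form folds the length comparison into the call itself.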
