[BWARE] Add removeEmpty support to compressed column groups#2504
Merged
Conversation
Implement removeEmptyRows and removeEmptyCols for compressed matrices without full decompression. Add CLALibRemoveEmpty driver and LibMatrixReorg helpers (rmemptyEarlyAbort/rmemptyUnsafe), per-column-group removeEmptyRows/removeEmptyColsSubset, dictionary sliceColumns, and offset/mapping support for index-only row removal.
…ide effect - Add Apache license header to CLALibRemoveEmpty (fixes RAT license check) - Remove dead partition/swap helpers and unused imports in Dictionary/QDictionary (fixes checkstyle unused-import errors) - Make MatrixBlockDictionary.sliceColumns read-through instead of calling sparseToDense() in place, so it no longer mutates the shared dictionary block - Validate selected-row count up front in AMapToData.removeEmpty instead of catching ArrayIndexOutOfBoundsException for control flow - Lower the decompress-fallback log in CLALibRemoveEmpty to guarded debug - Document the null-return contract on removeEmptyCols and rmemptyEarlyAbort - Remove leftover commented-out debug code in AOffset and ColGroupSDCSingleZeros - Use Assume.assumeTrue for size-gated removeEmpty tests instead of silent skips - Revert unrelated CompressForce and EncodeSample test changes out of scope
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2504 +/- ##
============================================
+ Coverage 71.45% 71.54% +0.09%
- Complexity 48864 49043 +179
============================================
Files 1573 1574 +1
Lines 189239 189565 +326
Branches 37128 37188 +60
============================================
+ Hits 135215 135632 +417
+ Misses 43576 43457 -119
- Partials 10448 10476 +28 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Route CompressedMatrixBlock.removeEmptyOperations through CLALibRemoveEmpty.rmempty instead of always decompressing and delegating. This makes the compressed removeEmpty implementation (per-column-group removeEmptyRows/removeEmptyColsSubset, dictionary sliceColumns, and offset/mapping handling) actually reachable; it was previously dead code with 0% test coverage. The existing removeEmptyOperations tests in CompressedMatrixTest now exercise the compressed path, while the null-select case still falls back to full decompression.
Extend CompressedMatrixTest with additional removeEmptyOperations cases that exercise branches the existing tests missed: emptyReturn=true with a selection vector, a denser row selection, select-all (nothing removed), and the all-rows/all-cols-removed paths (rOut==0 / cOut==0). Each case compares the compressed result against the uncompressed reference across the full compression-config matrix.
CompressedMatrixBlock.removeEmptyOperations was wired to CLALibRemoveEmpty, but the select-based fast path iterated column groups and called removeEmptyRows/removeEmptyColsSubset with no fallback. OLE, RLE, LinearFunctional, and UncompressedArray groups throw NotImplementedException for these operations, so a compressed matrix containing one of them would hard-fail where the previous always-decompress implementation succeeded. Wrap the per-group loops in rmEmptyRows/rmEmptyCols and route to the existing decompression fallback on NotImplementedException, restoring the prior "works for any encoding" contract. Add CompressedRemoveEmptyForcedTest, which forces OLE/RLE encodings and asserts removeEmptyOperations matches the uncompressed reference instead of throwing.
Cover the per-encoding removeEmptyColsSubset dictionary-slicing paths that were previously untested by selecting a strict subset of a multi-column column group: - SDC, SDC-single, and SDCFOR (via sparsifyFOR) reference/default-tuple slicing branches. - DDC subset removal. - RLE/OLE fallback to decompression when index-only column removal is unsupported (NotImplementedException path). - Selection vectors with unknown (-1) non-zero counts, exercising the recompute branches in CLALibRemoveEmpty for both rows and columns. Identical columns are used so co-coding merges them into one multi-column group, guaranteeing the subset path is reached instead of the all-selected copyAndSet shortcut.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Implement removeEmptyRows and removeEmptyCols for compressed matrices without full decompression. Add CLALibRemoveEmpty driver and LibMatrixReorg helpers (rmemptyEarlyAbort/rmemptyUnsafe), per-column-group removeEmptyRows/removeEmptyColsSubset, dictionary sliceColumns, and offset/mapping support for index-only row removal.