fix: resolve GPU embedding performance bottleneck in TransformerEmbedder#43
Conversation
- Pass device directly to SentenceTransformer constructor instead of using .to(), which left _target_device out of sync and caused the model to be moved back to CPU before each forward pass on CUDA - Replace manual per-batch encode loop with a single model.encode() call to eliminate repeated DataLoader/tokenization overhead and enable length-based sorting - Measured ~15% embedding throughput improvement on CPU; GPU improvement to be measured soon
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughTransformerEmbedder now constructs SentenceTransformer on the target device and encodes all windows in one call (delegating batching and normalization to sentence-transformers). Tests mock encode to return correctly shaped normalized embeddings. Package version metadata updated to 1.1.1. ChangesGPU embedding generation optimization
Sequence DiagramsequenceDiagram
participant TransformerEmbedder
participant SentenceTransformer
participant Config
TransformerEmbedder->>SentenceTransformer: __init__(device=str(self.device))
SentenceTransformer-->>TransformerEmbedder: model instance on device
TransformerEmbedder->>TransformerEmbedder: collect window texts
TransformerEmbedder->>SentenceTransformer: encode(texts, batch_size=Config.batch_size, normalize_embeddings=True, convert_to_numpy=True)
SentenceTransformer-->>TransformerEmbedder: numpy normalized embeddings
TransformerEmbedder->>TransformerEmbedder: yield (window, embedding) pairs (zip strict)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cordon/embedding/transformer.py`:
- Line 90: Replace the non-strict zip to fail fast: before yielding, compare
len(window_list) and len(all_embeddings) (the outputs of window_list and
encode()) and raise a ValueError with a clear message if they differ, then use
zip(window_list, all_embeddings, strict=True) (or simply zip(..., strict=True)
if you prefer relying on the built-in check); reference the variables
window_list, all_embeddings and the encode() call so the error message explains
that encode() returned a different row count than inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 15616cfb-981e-45a8-b5d7-a8c9dc9fa282
📒 Files selected for processing (2)
src/cordon/embedding/transformer.pytests/test_transformer.py
- Add explicit length check between window_list and encode() output before yielding results, raising ValueError with a descriptive message if they differ - Switch zip from strict=False to strict=True as a secondary safeguard
|
|
Actionable comments posted: 0 |



Summary
devicedirectly to theSentenceTransformerconstructor instead of using.to(), which left_target_deviceout of sync and caused the model to potentially move back to CPU before each forward pass on CUDAencode()loop with a singlemodel.encode()call to eliminate repeated DataLoader/tokenization overhead and enable length-based sorting for optimal paddingBenchmark Results
All benchmarks compare the old code (manual batching +
.to()device init) against the fixed code (singleencode()+ constructor device param), using--batch-size 64andBAAI/bge-base-en-v1.5.CPU (Apple Silicon, local)
GPU (NVIDIA RTX 3060, 12GB VRAM, CUDA cu121)
GPU improvement scales with workload size. Fixed code sustains peaks of ~21 batch/s vs ~16.8 batch/s during sustained processing on the large file.
Why the measured GPU improvement is conservative
The device initialization bug (
_target_devicestaying out of sync) may not fully manifest on all hardware/driver combinations. On the reporter's RTX A4000 setup, the real-world improvement may be larger..Closes #42
Summary by CodeRabbit
Bug Fixes
Refactor
Tests
Chores