Add protocols, dependency injection, character-limit line splitting, and test infrastructure#37
Conversation
- Add max_line_length config option to split long lines into virtual lines for windowing, preventing token truncation in embeddings - Refactor windower: replace deque with list, extract _flush_buffer helper to eliminate duplicated TextWindow construction - Add --max-line-length CLI argument - Replace bare tuples in IntervalMerger with NamedTuple for clarity - Rename confusing percentile variables in Thresholder - Add comprehensive tests for line splitting edge cases Co-authored-by: Cursor <cursoragent@cursor.com>
- Define Reader, Segmenter, Thresholder, Merger, and Formatter protocols - Add dependency injection to SemanticLogAnalyzer for all pipeline stages - Rename OutputFormatter to XmlFormatter - Replace backend if/elif chain with registry pattern - Add WindowConfig and ScoringConfig sub-config dataclasses - Add pipeline unit tests verifying DI works - Update tests for XmlFormatter rename Co-authored-by: Cursor <cursoragent@cursor.com>
- Create tests/conftest.py with shared fixtures for common test data - Add return type annotations to all test fixtures - Add @pytest.mark.integration to tests loading real models - Register integration marker in pyproject.toml - Refactor tests to use shared fixtures where appropriate Co-authored-by: Cursor <cursoragent@cursor.com>
📝 WalkthroughWalkthroughThis PR introduces a pluggable pipeline architecture enabling dependency injection of semantic log analyzer components, adds configuration-driven line splitting for long logs, renames the output formatter for clarity, and significantly expands test coverage with shared fixtures and specialized test classes. ChangesPipeline Architecture Refactoring
🎯 4 (Complex) | ⏱️ ~75 minutes Possibly Related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_transformer.py (1)
237-244:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert identity with
isin the identity test.Line 243 uses
==, which only checks value equality. This test should enforce object identity.Suggested fix
- assert result_window == window + assert result_window is window🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_transformer.py` around lines 237 - 244, In test_embed_preserves_window_identity replace the value-equality assertion with an identity check so the returned window object is the same instance: assert result_window is window; locate the test function test_embed_preserves_window_identity and the variables window and result_window produced by calling TransformerEmbedder.embed_windows (and TextWindow) and change the assertion accordingly to use "is" instead of "==".
🧹 Nitpick comments (3)
tests/test_segmentation.py (1)
106-130: ⚡ Quick winStrengthen split assertions to catch ordering/truncation regressions.
Line 109 and Line 122 currently use exact-multiple lengths, and the assertions mostly check substring presence. That can still pass if chunks are reordered or a tail chunk is lost. Prefer exact content assertions and at least one non-multiple-length case.
Proposed test tightening
def test_split_long_line_exceeds_limit(self) -> None: """Test that a line exceeding max_line_length is split into virtual lines.""" config = AnalysisConfig(window_size=3, max_line_length=5) - lines = iter([(1, "abcdefghijklmno")]) + lines = iter([(1, "abcdefghijklmn")]) # 5, 5, 4 segmenter = SlidingWindowSegmenter() windows = list(segmenter.segment(lines, config)) assert len(windows) == 1 assert windows[0].start_line == 1 assert windows[0].end_line == 1 - assert "abcde" in windows[0].content - assert "fghij" in windows[0].content - assert "klmno" in windows[0].content + assert windows[0].content == "abcde\nfghij\nklmn" @@ def test_split_creates_multiple_windows(self) -> None: """Test that a very long line creates multiple windows.""" config = AnalysisConfig(window_size=2, max_line_length=5) lines = iter([(1, "abcdefghijklmnopqrst")]) segmenter = SlidingWindowSegmenter() windows = list(segmenter.segment(lines, config)) assert len(windows) == 2 assert windows[0].start_line == 1 assert windows[0].end_line == 1 assert windows[1].start_line == 1 assert windows[1].end_line == 1 + assert windows[0].content == "abcde\nfghij" + assert windows[1].content == "klmno\npqrst"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_segmentation.py` around lines 106 - 130, The tests use only substring checks which can miss ordering/truncation regressions; update test_split_long_line_exceeds_limit and test_split_creates_multiple_windows (which exercise SlidingWindowSegmenter.segment with AnalysisConfig) to assert exact virtual-chunk contents and ordering instead of membership: for test_split_long_line_exceeds_limit assert windows[0].content equals the concatenation of the expected ordered chunks (e.g., "abcde|fghij|klmno" or however chunks are delimited by the segmenter) and for test_split_creates_multiple_windows add at least one case where the final chunk length is not an exact multiple and assert the exact content of windows[0].content and windows[1].content as well as their order to catch reordering/truncation regressions.tests/test_pipeline_unit.py (1)
26-26: ⚡ Quick winAssert exact call arguments for injected collaborators.
On Line 26 and Line 86,
assert_called_once()misses argument-level regressions. Useassert_called_once_with(...)to verify path/config wiring too.Proposed test hardening
@@ - mock_reader.read_lines.assert_called_once() + mock_reader.read_lines.assert_called_once_with(Path("dummy.log")) @@ - mock_scorer.score_windows.assert_called_once() + mock_scorer.score_windows.assert_called_once_with([], default_config)Also applies to: 86-86
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_pipeline_unit.py` at line 26, Replace the generic mock assertions that use mock_reader.read_lines.assert_called_once() with argument-aware assertions to catch wiring regressions: change mock_reader.read_lines.assert_called_once() and the similar call at Line 86 to mock_reader.read_lines.assert_called_once_with(<expected_path_or_config>) (use the actual expected path/config value used when constructing the Pipeline under test) so the test verifies both that read_lines was called once and that it received the correct path/config; locate these assertions by the symbol mock_reader.read_lines in tests/test_pipeline_unit.py.src/cordon/postprocess/__init__.py (1)
1-4: ⚡ Quick winPreserve
OutputFormatteras a compatibility alias during the renameLine 4 removes
OutputFormatterfrom the public export surface, which is a breaking change for existing consumers. Consider exporting both names for one deprecation cycle.♻️ Proposed compatibility patch
diff --git a/src/cordon/postprocess/formatter.py b/src/cordon/postprocess/formatter.py @@ class XmlFormatter: @@ output_parts.append("</anomalies>") return "\n".join(output_parts) + +# Backward-compatible alias; remove in a future major release. +OutputFormatter = XmlFormatterdiff --git a/src/cordon/postprocess/__init__.py b/src/cordon/postprocess/__init__.py @@ -from cordon.postprocess.formatter import XmlFormatter +from cordon.postprocess.formatter import OutputFormatter, XmlFormatter from cordon.postprocess.merger import IntervalMerger -__all__ = ["IntervalMerger", "XmlFormatter"] +__all__ = ["IntervalMerger", "XmlFormatter", "OutputFormatter"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cordon/postprocess/__init__.py` around lines 1 - 4, The public API removed the old name OutputFormatter; restore backward compatibility by adding a compatibility alias (e.g., set OutputFormatter = XmlFormatter) and include "OutputFormatter" in __all__ alongside "XmlFormatter" and "IntervalMerger"; optionally emit a DeprecationWarning when OutputFormatter is referenced to guide consumers to use XmlFormatter instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cordon/segmentation/windower.py`:
- Around line 40-43: The current code materializes chunks via chunks =
self._split_line(...) and then iterates, causing large temporary allocations for
very long lines; change to iterate the generator/iterator returned by
_split_line directly (e.g., for chunk_line_num, chunk_text in
self._split_line(line_num, line_text, max_line_length):) and append each tuple
to buffer, and apply the same lazy iteration replacement in the other occurrence
(the block covering lines 75-92) so no intermediate list is created.
In `@tests/test_transformer.py`:
- Around line 217-220: The test currently sets mock_model.encode.return_value to
an already-normalized vector (raw / np.linalg.norm(...)), which masks any bug in
the embedder's normalization; change the mock to return the raw unnormalized
vector (e.g., raw as created by rng.standard_normal) so that the embedder's
normalization logic (the code under test exercised via mock_model.encode and
mock_st) must produce a unit vector for the test to pass.
---
Outside diff comments:
In `@tests/test_transformer.py`:
- Around line 237-244: In test_embed_preserves_window_identity replace the
value-equality assertion with an identity check so the returned window object is
the same instance: assert result_window is window; locate the test function
test_embed_preserves_window_identity and the variables window and result_window
produced by calling TransformerEmbedder.embed_windows (and TextWindow) and
change the assertion accordingly to use "is" instead of "==".
---
Nitpick comments:
In `@src/cordon/postprocess/__init__.py`:
- Around line 1-4: The public API removed the old name OutputFormatter; restore
backward compatibility by adding a compatibility alias (e.g., set
OutputFormatter = XmlFormatter) and include "OutputFormatter" in __all__
alongside "XmlFormatter" and "IntervalMerger"; optionally emit a
DeprecationWarning when OutputFormatter is referenced to guide consumers to use
XmlFormatter instead.
In `@tests/test_pipeline_unit.py`:
- Line 26: Replace the generic mock assertions that use
mock_reader.read_lines.assert_called_once() with argument-aware assertions to
catch wiring regressions: change mock_reader.read_lines.assert_called_once() and
the similar call at Line 86 to
mock_reader.read_lines.assert_called_once_with(<expected_path_or_config>) (use
the actual expected path/config value used when constructing the Pipeline under
test) so the test verifies both that read_lines was called once and that it
received the correct path/config; locate these assertions by the symbol
mock_reader.read_lines in tests/test_pipeline_unit.py.
In `@tests/test_segmentation.py`:
- Around line 106-130: The tests use only substring checks which can miss
ordering/truncation regressions; update test_split_long_line_exceeds_limit and
test_split_creates_multiple_windows (which exercise
SlidingWindowSegmenter.segment with AnalysisConfig) to assert exact
virtual-chunk contents and ordering instead of membership: for
test_split_long_line_exceeds_limit assert windows[0].content equals the
concatenation of the expected ordered chunks (e.g., "abcde|fghij|klmno" or
however chunks are delimited by the segmenter) and for
test_split_creates_multiple_windows add at least one case where the final chunk
length is not an exact multiple and assert the exact content of
windows[0].content and windows[1].content as well as their order to catch
reordering/truncation regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7aa6195b-1a8b-473c-bfa4-8ca9c760203a
📒 Files selected for processing (20)
pyproject.tomlsrc/cordon/analysis/thresholder.pysrc/cordon/cli.pysrc/cordon/core/__init__.pysrc/cordon/core/config.pysrc/cordon/core/types.pysrc/cordon/embedding/__init__.pysrc/cordon/pipeline.pysrc/cordon/postprocess/__init__.pysrc/cordon/postprocess/formatter.pysrc/cordon/postprocess/merger.pysrc/cordon/segmentation/windower.pytests/conftest.pytests/test_core.pytests/test_llama_cpp.pytests/test_pipeline_unit.pytests/test_postprocess.pytests/test_remote.pytests/test_segmentation.pytests/test_transformer.py
| chunks = self._split_line(line_num, line_text, max_line_length) | ||
|
|
||
| for chunk_line_num, chunk_text in chunks: | ||
| buffer.append((chunk_line_num, chunk_text)) |
There was a problem hiding this comment.
Stream chunk generation to avoid memory spikes on very long lines.
Line 40 materializes all chunks as a list before window flushing. On very long lines this creates a large intermediate allocation, which weakens the line-splitting safety goal. Iterate lazily from _split_line instead.
Proposed fix
- chunks = self._split_line(line_num, line_text, max_line_length)
-
- for chunk_line_num, chunk_text in chunks:
+ for chunk_line_num, chunk_text in self._split_line(
+ line_num, line_text, max_line_length
+ ):
buffer.append((chunk_line_num, chunk_text))
@@
- def _split_line(line_num: int, line_text: str, max_length: int | None) -> list[tuple[int, str]]:
+ def _split_line(
+ line_num: int, line_text: str, max_length: int | None
+ ) -> Iterator[tuple[int, str]]:
@@
if max_length is None or len(line_text) <= max_length:
- return [(line_num, line_text)]
+ yield (line_num, line_text)
+ return
- return [
- (line_num, line_text[i : i + max_length]) for i in range(0, len(line_text), max_length)
- ]
+ for i in range(0, len(line_text), max_length):
+ yield (line_num, line_text[i : i + max_length])Also applies to: 75-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cordon/segmentation/windower.py` around lines 40 - 43, The current code
materializes chunks via chunks = self._split_line(...) and then iterates,
causing large temporary allocations for very long lines; change to iterate the
generator/iterator returned by _split_line directly (e.g., for chunk_line_num,
chunk_text in self._split_line(line_num, line_text, max_line_length):) and
append each tuple to buffer, and apply the same lazy iteration replacement in
the other occurrence (the block covering lines 75-92) so no intermediate list is
created.
| rng = np.random.default_rng(0) | ||
| raw = rng.standard_normal((1, 384)).astype(np.float32) | ||
| mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True) | ||
| mock_st.return_value = mock_model |
There was a problem hiding this comment.
Use non-normalized mock output so the normalization test can fail on regressions.
Line 219 returns an already normalized vector, so test_embed_returns_normalized_vectors can pass even if normalization is broken in the embedder.
Suggested fix
mock_model = MagicMock()
rng = np.random.default_rng(0)
raw = rng.standard_normal((1, 384)).astype(np.float32)
- mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True)
+ mock_model.encode.return_value = raw
mock_st.return_value = mock_model📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rng = np.random.default_rng(0) | |
| raw = rng.standard_normal((1, 384)).astype(np.float32) | |
| mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True) | |
| mock_st.return_value = mock_model | |
| rng = np.random.default_rng(0) | |
| raw = rng.standard_normal((1, 384)).astype(np.float32) | |
| mock_model.encode.return_value = raw | |
| mock_st.return_value = mock_model |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_transformer.py` around lines 217 - 220, The test currently sets
mock_model.encode.return_value to an already-normalized vector (raw /
np.linalg.norm(...)), which masks any bug in the embedder's normalization;
change the mock to return the raw unnormalized vector (e.g., raw as created by
rng.standard_normal) so that the embedder's normalization logic (the code under
test exercised via mock_model.encode and mock_st) must produce a unit vector for
the test to pass.



Summary
max_line_lengthconfig option to split long lines into virtual lines for windowing, preventing token truncation in embeddingsdequewithlist, extract_flush_bufferhelper--max-line-lengthCLI argumentIntervalMergerwithNamedTuplefor clarityThresholderReader,Segmenter,Embedder,Scorer,Thresholder,Merger,FormatterSemanticLogAnalyzerfor all pipeline stagesOutputFormattertoXmlFormatterWindowConfigandScoringConfigsub-config dataclassestests/conftest.pywith shared fixtures@pytest.mark.integrationmarkers for tests loading real modelsSummary by CodeRabbit
Release Notes
New Features
--max-line-lengthCLI option to control line splitting during analysis.Refactor
Tests