Skip to content

Add protocols, dependency injection, character-limit line splitting, and test infrastructure#37

Merged
calebevans merged 3 commits into
mainfrom
refactor/architecture
May 23, 2026
Merged

Add protocols, dependency injection, character-limit line splitting, and test infrastructure#37
calebevans merged 3 commits into
mainfrom
refactor/architecture

Conversation

@calebevans
Copy link
Copy Markdown
Owner

@calebevans calebevans commented May 23, 2026

Summary

  • Add max_line_length config option to split long lines into virtual lines for windowing, preventing token truncation in embeddings
  • Refactor windower: replace deque with list, extract _flush_buffer helper
  • Add --max-line-length CLI argument
  • Replace bare tuples in IntervalMerger with NamedTuple for clarity
  • Rename confusing percentile variables in Thresholder
  • Define protocols for all 7 pipeline stages: Reader, Segmenter, Embedder, Scorer, Thresholder, Merger, Formatter
  • Add dependency injection to SemanticLogAnalyzer for all pipeline stages
  • Rename OutputFormatter to XmlFormatter
  • Replace backend if/elif chain with registry pattern using importlib
  • Add WindowConfig and ScoringConfig sub-config dataclasses
  • Create tests/conftest.py with shared fixtures
  • Add return type annotations to all test fixtures
  • Add @pytest.mark.integration markers for tests loading real models

Summary by CodeRabbit

Release Notes

  • New Features

    • Added --max-line-length CLI option to control line splitting during analysis.
    • Pipeline now supports dependency injection of custom analysis components.
  • Refactor

    • Formatter implementation renamed for clarity.
    • Internal improvements to threshold calculation and window merging logic.
    • Enhanced segmentation to support dynamic line splitting.
  • Tests

    • Added comprehensive test fixtures and coverage for new line-splitting functionality.
    • Added unit tests for dependency injection feature.

Review Change Stack

calebevans and others added 3 commits May 23, 2026 01:06
- Add max_line_length config option to split long lines into virtual
  lines for windowing, preventing token truncation in embeddings
- Refactor windower: replace deque with list, extract _flush_buffer
  helper to eliminate duplicated TextWindow construction
- Add --max-line-length CLI argument
- Replace bare tuples in IntervalMerger with NamedTuple for clarity
- Rename confusing percentile variables in Thresholder
- Add comprehensive tests for line splitting edge cases

Co-authored-by: Cursor <cursoragent@cursor.com>
- Define Reader, Segmenter, Thresholder, Merger, and Formatter protocols
- Add dependency injection to SemanticLogAnalyzer for all pipeline stages
- Rename OutputFormatter to XmlFormatter
- Replace backend if/elif chain with registry pattern
- Add WindowConfig and ScoringConfig sub-config dataclasses
- Add pipeline unit tests verifying DI works
- Update tests for XmlFormatter rename

Co-authored-by: Cursor <cursoragent@cursor.com>
- Create tests/conftest.py with shared fixtures for common test data
- Add return type annotations to all test fixtures
- Add @pytest.mark.integration to tests loading real models
- Register integration marker in pyproject.toml
- Refactor tests to use shared fixtures where appropriate

Co-authored-by: Cursor <cursoragent@cursor.com>
@calebevans calebevans self-assigned this May 23, 2026
@calebevans calebevans added the enhancement New feature or request label May 23, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

Walkthrough

This PR introduces a pluggable pipeline architecture enabling dependency injection of semantic log analyzer components, adds configuration-driven line splitting for long logs, renames the output formatter for clarity, and significantly expands test coverage with shared fixtures and specialized test classes.

Changes

Pipeline Architecture Refactoring

Layer / File(s) Summary
Pipeline component interface contracts
src/cordon/core/types.py, src/cordon/core/__init__.py
Five Protocol interfaces (Reader, Segmenter, Thresholder, Merger, Formatter) define the expected method signatures for each pipeline stage, enabling custom implementations and dependency injection.
Configuration model extensions and validation
src/cordon/core/config.py, src/cordon/cli.py
WindowConfig and ScoringConfig dataclasses organize configuration concerns; AnalysisConfig adds max_line_length field with validation rejecting values < 1; CLI exposes --max-line-length option.
Dependency injection in SemanticLogAnalyzer
src/cordon/pipeline.py
Constructor accepts optional injected components (reader, segmenter, scorer, thresholder, merger, formatter) as keyword-only parameters with defaults to concrete implementations; analyze_file_detailed uses injected components instead of creating local instances.
Formatter class renaming and export updates
src/cordon/postprocess/formatter.py, src/cordon/postprocess/__init__.py, tests/test_postprocess.py
OutputFormatter renamed to XmlFormatter throughout codebase; module exports and all test instantiations updated for consistency.
Configuration-driven line splitting in segmentation
src/cordon/segmentation/windower.py
SlidingWindowSegmenter optionally splits long lines into virtual chunks based on max_line_length; _flush_buffer helper converts buffered chunks to TextWindow objects; _split_line performs split logic while preserving real line numbers for window boundaries.
Type-safe window interval handling in merger
src/cordon/postprocess/merger.py
IntervalMerger introduces _WindowInterval NamedTuple for typed window segments (start, end, window_id, score); refactored merge_windows performs overlap/adjacency detection and gap handling using named fields instead of raw tuples.
Dynamic embedder backend registry pattern
src/cordon/embedding/__init__.py
Refactors embedder factory from explicit if/import branches to registry-based approach; _BACKEND_REGISTRY maps backend names to fully-qualified class paths; uses importlib for reflective instantiation with improved error messaging.
Code clarity improvements
src/cordon/analysis/thresholder.py
Thresholder.select_significant range-mode refactored: upper_percentile/lower_percentile renamed to exclusion_cutoff/inclusion_floor with clarifying comments explaining percentile-to-score-threshold mapping.
Shared test fixtures and configuration
tests/conftest.py, pyproject.toml
Added conftest.py with reusable pytest fixtures (default_config, sample_windows, sample_embeddings, sample_scored_windows, sample_merged_blocks); pytest.ini_options gains integration marker definition.
Configuration validation and segmentation splitting tests
tests/test_core.py, tests/test_segmentation.py
Added AnalysisConfig.max_line_length validation tests (rejects < 1, accepts None/positive); comprehensive SlidingWindowSegmenter tests covering boundary, over-limit, multiple-window, under-limit, and disabled splitting scenarios.
Dependency injection unit tests
tests/test_pipeline_unit.py
New test module with TestPipelineDI verifying custom reader/formatter/scorer invocation, default component instantiation when not provided, and formatter output pass-through.
Test type annotations and integration marker coverage
tests/test_llama_cpp.py, tests/test_remote.py, tests/test_transformer.py
TYPE_CHECKING-guarded imports and return type annotations in test fixtures enable static typing without runtime side effects; transformer, llama_cpp, and remote tests marked @pytest.mark.integration; new TestTransformerEmbedderUnit class with mocked embeddings verifies normalization and window identity preservation.

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly Related PRs

  • calebevans/cordon#34: Refactors the formatter API signature to pass pre-ingested lines instead of a Path, directly affecting how the pipeline calls the formatter stage introduced in this PR.
  • calebevans/cordon#33: Modifies Thresholder.select_significant validation in the same file as the variable naming clarity refactor, adding explicit range consistency checks.

Poem

🐰 Hop along the pipeline bright,
Components inject with all their might,
Line splitting flows where limits lie,
XmlFormatter tags code up high,
Tests now dance with fixtures neat,
Architecture refactored—complete!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main changes: adding protocols, dependency injection, character-limit line splitting, and test infrastructure are all central features introduced by this changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/architecture

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_transformer.py (1)

237-244: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert identity with is in the identity test.

Line 243 uses ==, which only checks value equality. This test should enforce object identity.

Suggested fix
-        assert result_window == window
+        assert result_window is window
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_transformer.py` around lines 237 - 244, In
test_embed_preserves_window_identity replace the value-equality assertion with
an identity check so the returned window object is the same instance: assert
result_window is window; locate the test function
test_embed_preserves_window_identity and the variables window and result_window
produced by calling TransformerEmbedder.embed_windows (and TextWindow) and
change the assertion accordingly to use "is" instead of "==".
🧹 Nitpick comments (3)
tests/test_segmentation.py (1)

106-130: ⚡ Quick win

Strengthen split assertions to catch ordering/truncation regressions.

Line 109 and Line 122 currently use exact-multiple lengths, and the assertions mostly check substring presence. That can still pass if chunks are reordered or a tail chunk is lost. Prefer exact content assertions and at least one non-multiple-length case.

Proposed test tightening
     def test_split_long_line_exceeds_limit(self) -> None:
         """Test that a line exceeding max_line_length is split into virtual lines."""
         config = AnalysisConfig(window_size=3, max_line_length=5)
-        lines = iter([(1, "abcdefghijklmno")])
+        lines = iter([(1, "abcdefghijklmn")])  # 5, 5, 4
         segmenter = SlidingWindowSegmenter()
         windows = list(segmenter.segment(lines, config))
         assert len(windows) == 1
         assert windows[0].start_line == 1
         assert windows[0].end_line == 1
-        assert "abcde" in windows[0].content
-        assert "fghij" in windows[0].content
-        assert "klmno" in windows[0].content
+        assert windows[0].content == "abcde\nfghij\nklmn"
@@
     def test_split_creates_multiple_windows(self) -> None:
         """Test that a very long line creates multiple windows."""
         config = AnalysisConfig(window_size=2, max_line_length=5)
         lines = iter([(1, "abcdefghijklmnopqrst")])
         segmenter = SlidingWindowSegmenter()
         windows = list(segmenter.segment(lines, config))
         assert len(windows) == 2
         assert windows[0].start_line == 1
         assert windows[0].end_line == 1
         assert windows[1].start_line == 1
         assert windows[1].end_line == 1
+        assert windows[0].content == "abcde\nfghij"
+        assert windows[1].content == "klmno\npqrst"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_segmentation.py` around lines 106 - 130, The tests use only
substring checks which can miss ordering/truncation regressions; update
test_split_long_line_exceeds_limit and test_split_creates_multiple_windows
(which exercise SlidingWindowSegmenter.segment with AnalysisConfig) to assert
exact virtual-chunk contents and ordering instead of membership: for
test_split_long_line_exceeds_limit assert windows[0].content equals the
concatenation of the expected ordered chunks (e.g., "abcde|fghij|klmno" or
however chunks are delimited by the segmenter) and for
test_split_creates_multiple_windows add at least one case where the final chunk
length is not an exact multiple and assert the exact content of
windows[0].content and windows[1].content as well as their order to catch
reordering/truncation regressions.
tests/test_pipeline_unit.py (1)

26-26: ⚡ Quick win

Assert exact call arguments for injected collaborators.

On Line 26 and Line 86, assert_called_once() misses argument-level regressions. Use assert_called_once_with(...) to verify path/config wiring too.

Proposed test hardening
@@
-        mock_reader.read_lines.assert_called_once()
+        mock_reader.read_lines.assert_called_once_with(Path("dummy.log"))
@@
-        mock_scorer.score_windows.assert_called_once()
+        mock_scorer.score_windows.assert_called_once_with([], default_config)

Also applies to: 86-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_pipeline_unit.py` at line 26, Replace the generic mock assertions
that use mock_reader.read_lines.assert_called_once() with argument-aware
assertions to catch wiring regressions: change
mock_reader.read_lines.assert_called_once() and the similar call at Line 86 to
mock_reader.read_lines.assert_called_once_with(<expected_path_or_config>) (use
the actual expected path/config value used when constructing the Pipeline under
test) so the test verifies both that read_lines was called once and that it
received the correct path/config; locate these assertions by the symbol
mock_reader.read_lines in tests/test_pipeline_unit.py.
src/cordon/postprocess/__init__.py (1)

1-4: ⚡ Quick win

Preserve OutputFormatter as a compatibility alias during the rename

Line 4 removes OutputFormatter from the public export surface, which is a breaking change for existing consumers. Consider exporting both names for one deprecation cycle.

♻️ Proposed compatibility patch
diff --git a/src/cordon/postprocess/formatter.py b/src/cordon/postprocess/formatter.py
@@
 class XmlFormatter:
@@
         output_parts.append("</anomalies>")
         return "\n".join(output_parts)
+
+# Backward-compatible alias; remove in a future major release.
+OutputFormatter = XmlFormatter
diff --git a/src/cordon/postprocess/__init__.py b/src/cordon/postprocess/__init__.py
@@
-from cordon.postprocess.formatter import XmlFormatter
+from cordon.postprocess.formatter import OutputFormatter, XmlFormatter
 from cordon.postprocess.merger import IntervalMerger
 
-__all__ = ["IntervalMerger", "XmlFormatter"]
+__all__ = ["IntervalMerger", "XmlFormatter", "OutputFormatter"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cordon/postprocess/__init__.py` around lines 1 - 4, The public API
removed the old name OutputFormatter; restore backward compatibility by adding a
compatibility alias (e.g., set OutputFormatter = XmlFormatter) and include
"OutputFormatter" in __all__ alongside "XmlFormatter" and "IntervalMerger";
optionally emit a DeprecationWarning when OutputFormatter is referenced to guide
consumers to use XmlFormatter instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cordon/segmentation/windower.py`:
- Around line 40-43: The current code materializes chunks via chunks =
self._split_line(...) and then iterates, causing large temporary allocations for
very long lines; change to iterate the generator/iterator returned by
_split_line directly (e.g., for chunk_line_num, chunk_text in
self._split_line(line_num, line_text, max_line_length):) and append each tuple
to buffer, and apply the same lazy iteration replacement in the other occurrence
(the block covering lines 75-92) so no intermediate list is created.

In `@tests/test_transformer.py`:
- Around line 217-220: The test currently sets mock_model.encode.return_value to
an already-normalized vector (raw / np.linalg.norm(...)), which masks any bug in
the embedder's normalization; change the mock to return the raw unnormalized
vector (e.g., raw as created by rng.standard_normal) so that the embedder's
normalization logic (the code under test exercised via mock_model.encode and
mock_st) must produce a unit vector for the test to pass.

---

Outside diff comments:
In `@tests/test_transformer.py`:
- Around line 237-244: In test_embed_preserves_window_identity replace the
value-equality assertion with an identity check so the returned window object is
the same instance: assert result_window is window; locate the test function
test_embed_preserves_window_identity and the variables window and result_window
produced by calling TransformerEmbedder.embed_windows (and TextWindow) and
change the assertion accordingly to use "is" instead of "==".

---

Nitpick comments:
In `@src/cordon/postprocess/__init__.py`:
- Around line 1-4: The public API removed the old name OutputFormatter; restore
backward compatibility by adding a compatibility alias (e.g., set
OutputFormatter = XmlFormatter) and include "OutputFormatter" in __all__
alongside "XmlFormatter" and "IntervalMerger"; optionally emit a
DeprecationWarning when OutputFormatter is referenced to guide consumers to use
XmlFormatter instead.

In `@tests/test_pipeline_unit.py`:
- Line 26: Replace the generic mock assertions that use
mock_reader.read_lines.assert_called_once() with argument-aware assertions to
catch wiring regressions: change mock_reader.read_lines.assert_called_once() and
the similar call at Line 86 to
mock_reader.read_lines.assert_called_once_with(<expected_path_or_config>) (use
the actual expected path/config value used when constructing the Pipeline under
test) so the test verifies both that read_lines was called once and that it
received the correct path/config; locate these assertions by the symbol
mock_reader.read_lines in tests/test_pipeline_unit.py.

In `@tests/test_segmentation.py`:
- Around line 106-130: The tests use only substring checks which can miss
ordering/truncation regressions; update test_split_long_line_exceeds_limit and
test_split_creates_multiple_windows (which exercise
SlidingWindowSegmenter.segment with AnalysisConfig) to assert exact
virtual-chunk contents and ordering instead of membership: for
test_split_long_line_exceeds_limit assert windows[0].content equals the
concatenation of the expected ordered chunks (e.g., "abcde|fghij|klmno" or
however chunks are delimited by the segmenter) and for
test_split_creates_multiple_windows add at least one case where the final chunk
length is not an exact multiple and assert the exact content of
windows[0].content and windows[1].content as well as their order to catch
reordering/truncation regressions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7aa6195b-1a8b-473c-bfa4-8ca9c760203a

📥 Commits

Reviewing files that changed from the base of the PR and between 736c9a9 and dbc8972.

📒 Files selected for processing (20)
  • pyproject.toml
  • src/cordon/analysis/thresholder.py
  • src/cordon/cli.py
  • src/cordon/core/__init__.py
  • src/cordon/core/config.py
  • src/cordon/core/types.py
  • src/cordon/embedding/__init__.py
  • src/cordon/pipeline.py
  • src/cordon/postprocess/__init__.py
  • src/cordon/postprocess/formatter.py
  • src/cordon/postprocess/merger.py
  • src/cordon/segmentation/windower.py
  • tests/conftest.py
  • tests/test_core.py
  • tests/test_llama_cpp.py
  • tests/test_pipeline_unit.py
  • tests/test_postprocess.py
  • tests/test_remote.py
  • tests/test_segmentation.py
  • tests/test_transformer.py

Comment on lines +40 to +43
chunks = self._split_line(line_num, line_text, max_line_length)

for chunk_line_num, chunk_text in chunks:
buffer.append((chunk_line_num, chunk_text))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stream chunk generation to avoid memory spikes on very long lines.

Line 40 materializes all chunks as a list before window flushing. On very long lines this creates a large intermediate allocation, which weakens the line-splitting safety goal. Iterate lazily from _split_line instead.

Proposed fix
-            chunks = self._split_line(line_num, line_text, max_line_length)
-
-            for chunk_line_num, chunk_text in chunks:
+            for chunk_line_num, chunk_text in self._split_line(
+                line_num, line_text, max_line_length
+            ):
                 buffer.append((chunk_line_num, chunk_text))
@@
-    def _split_line(line_num: int, line_text: str, max_length: int | None) -> list[tuple[int, str]]:
+    def _split_line(
+        line_num: int, line_text: str, max_length: int | None
+    ) -> Iterator[tuple[int, str]]:
@@
         if max_length is None or len(line_text) <= max_length:
-            return [(line_num, line_text)]
+            yield (line_num, line_text)
+            return
 
-        return [
-            (line_num, line_text[i : i + max_length]) for i in range(0, len(line_text), max_length)
-        ]
+        for i in range(0, len(line_text), max_length):
+            yield (line_num, line_text[i : i + max_length])

Also applies to: 75-92

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cordon/segmentation/windower.py` around lines 40 - 43, The current code
materializes chunks via chunks = self._split_line(...) and then iterates,
causing large temporary allocations for very long lines; change to iterate the
generator/iterator returned by _split_line directly (e.g., for chunk_line_num,
chunk_text in self._split_line(line_num, line_text, max_line_length):) and
append each tuple to buffer, and apply the same lazy iteration replacement in
the other occurrence (the block covering lines 75-92) so no intermediate list is
created.

Comment thread tests/test_transformer.py
Comment on lines +217 to +220
rng = np.random.default_rng(0)
raw = rng.standard_normal((1, 384)).astype(np.float32)
mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True)
mock_st.return_value = mock_model
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use non-normalized mock output so the normalization test can fail on regressions.

Line 219 returns an already normalized vector, so test_embed_returns_normalized_vectors can pass even if normalization is broken in the embedder.

Suggested fix
         mock_model = MagicMock()
         rng = np.random.default_rng(0)
         raw = rng.standard_normal((1, 384)).astype(np.float32)
-        mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True)
+        mock_model.encode.return_value = raw
         mock_st.return_value = mock_model
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rng = np.random.default_rng(0)
raw = rng.standard_normal((1, 384)).astype(np.float32)
mock_model.encode.return_value = raw / np.linalg.norm(raw, axis=1, keepdims=True)
mock_st.return_value = mock_model
rng = np.random.default_rng(0)
raw = rng.standard_normal((1, 384)).astype(np.float32)
mock_model.encode.return_value = raw
mock_st.return_value = mock_model
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_transformer.py` around lines 217 - 220, The test currently sets
mock_model.encode.return_value to an already-normalized vector (raw /
np.linalg.norm(...)), which masks any bug in the embedder's normalization;
change the mock to return the raw unnormalized vector (e.g., raw as created by
rng.standard_normal) so that the embedder's normalization logic (the code under
test exercised via mock_model.encode and mock_st) must produce a unit vector for
the test to pass.

@calebevans calebevans merged commit a818f0a into main May 23, 2026
14 checks passed
@calebevans calebevans deleted the refactor/architecture branch May 23, 2026 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant