@cardmagic
Owner

Summary

  • Add parameter validation for cutoff in build_index, classify, vote, and classify_with_confidence (must be between 0 and 1 exclusive)
  • Add singular_values attr_reader exposing SVD singular values after build
  • Add singular_value_spectrum method for analyzing variance distribution to help users tune cutoff
  • Add documentation with tuning guides for different use cases

Test plan

  • All 144 existing tests pass
  • 13 new tests for validation and introspection API
  • Verified validation raises ArgumentError for invalid cutoff values (0, 1, negative, >1)
  • Verified singular_values is nil before build, populated after
  • Verified singular_value_spectrum returns correct structure with percentages summing to 1.0

Fixes #67

… API

The LSI classifier previously used undocumented magic numbers for critical
cutoff parameters with no validation or introspection capabilities. Users
had no guidance on tuning for different corpus sizes.

This change adds:
- Parameter validation for cutoff (must be between 0 and 1 exclusive)
- `singular_values` attr_reader exposing SVD singular values after build
- `singular_value_spectrum` method for analyzing variance distribution
- Documentation with tuning guides for different use cases

The introspection API enables users to make informed decisions about cutoff
tuning by examining how much variance each semantic dimension captures.
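
As a rough illustration of the structure described above, here is a minimal sketch of how such a spectrum could be derived from an array of singular values. `spectrum_for` is a hypothetical stand-in, not the gem's implementation, and it assumes each dimension's share is its singular value divided by the sum of all values; the actual method may weight values differently.

```ruby
# Hypothetical stand-in for LSI#singular_value_spectrum; not the gem's code.
# Assumption: a dimension's share is its singular value over the sum of all values.
def spectrum_for(singular_values)
  return nil if singular_values.nil? || singular_values.empty?

  total = singular_values.sum.to_f
  cumulative = 0.0
  singular_values.each_with_index.map do |value, dimension|
    percentage = value / total
    cumulative += percentage
    { dimension: dimension, value: value, percentage: percentage,
      cumulative_percentage: cumulative }
  end
end

# Made-up singular values, already sorted in descending order.
spectrum_for([4.0, 3.0, 2.0, 1.0]).each do |entry|
  puts format('dim %d: %.1f%% of total (cumulative %.1f%%)',
              entry[:dimension], entry[:percentage] * 100,
              entry[:cumulative_percentage] * 100)
end
```

A spectrum where the first few dimensions already account for most of the total suggests a lower build cutoff may be enough; a flat spectrum suggests keeping more dimensions.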

Fixes #67
@cardmagic cardmagic self-assigned this Dec 28, 2025
@greptile-apps
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile Summary

This PR adds parameter validation and an introspection API for LSI cutoff tuning, exposing singular_values and singular_value_spectrum methods to help users understand variance distribution. The implementation correctly fixes the negative array index bug (line 579) by clamping the index with [(s.size * cutoff).round - 1, 0].max and adds proper validation throughout.

Major Changes:

  • Added validate_cutoff! method enforcing 0 < cutoff < 1
  • Exposed @singular_values via attr_reader and #singular_value_spectrum method
  • Fixed negative index bug with [(s.size * cutoff).round - 1, 0].max
  • Added validation to build_index, classify, vote, and classify_with_confidence
  • 13 new tests covering validation and introspection

Code Quality Issues:

  • Contains unnecessary comments that restate what code already shows (lines 102-103, 201, 325)
  • RuboCop config has verbose justification comment (line 63 in .rubocop.yml)

Per the de-AI code review instructions, consider removing documentation that simply restates obvious code behavior.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • Core functionality is solid with proper validation, bug fix, and comprehensive tests. Score reduced by 1 point due to unnecessary documentation that should be removed per coding standards, but these are non-critical style issues
  • No files require special attention - all style suggestions are optional improvements

Important Files Changed

Filename Overview
lib/classifier/lsi.rb Adds validation and introspection API with some unnecessary documentation
test/lsi/lsi_test.rb Comprehensive test coverage for new validation and introspection features
.rubocop.yml Adds LSI class length exemption with verbose justification comment

Sequence Diagram

sequenceDiagram
    participant User
    participant LSI
    participant validate_cutoff!
    participant build_reduced_matrix
    participant SVD as Matrix.SV_decomp

    User->>LSI: build_index(cutoff=0.75)
    LSI->>validate_cutoff!: validate_cutoff!(0.75)
    alt cutoff invalid (≤0 or ≥1)
        validate_cutoff!-->>User: raise ArgumentError
    else cutoff valid
        validate_cutoff!-->>LSI: valid
        LSI->>build_reduced_matrix: build_reduced_matrix(tdm, 0.75)
        build_reduced_matrix->>SVD: SV_decomp()
        SVD-->>build_reduced_matrix: u, v, s (singular values)
        build_reduced_matrix->>build_reduced_matrix: @singular_values = s.sort.reverse
        build_reduced_matrix->>build_reduced_matrix: s_cutoff_index = [(s.size * 0.75).round - 1, 0].max
        build_reduced_matrix->>build_reduced_matrix: zero out values < s_cutoff
        build_reduced_matrix-->>LSI: reduced matrix
        LSI-->>User: index built
    end

    User->>LSI: singular_value_spectrum()
    LSI-->>User: [{dimension, value, percentage, cumulative_percentage}, ...]
    
    User->>LSI: classify(doc, cutoff=0.30)
    LSI->>validate_cutoff!: validate_cutoff!(0.30)
    alt cutoff invalid
        validate_cutoff!-->>User: raise ArgumentError
    else cutoff valid
        validate_cutoff!-->>LSI: valid
        LSI->>LSI: vote_unlocked(doc, 0.30)
        LSI-->>User: category
    end

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

# Store singular values (sorted descending) for introspection
@singular_values = s.sort.reverse

s_cutoff = @singular_values[(s.size * cutoff).round - 1]

logic: with very small cutoff values (e.g. cutoff=0.01 and s.size=3), the index calculation (s.size * cutoff).round - 1 can become negative (e.g. (3 * 0.01).round - 1 = -1). In Ruby, negative indices count from the end of the array, so on the descending-sorted @singular_values this would select the smallest singular value instead of the largest, inverting the intended behavior.

Suggested change
s_cutoff_index = [(s.size * cutoff).round - 1, 0].max
s_cutoff = @singular_values[s_cutoff_index]
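
To make the edge case concrete, here is a small sketch of the index arithmetic with and without the clamp. `cutoff_index` and the sample values are hypothetical and exist only for illustration; the zeroing rule (values below the threshold are dropped) follows the PR's description.

```ruby
# Hypothetical helper mirroring the index arithmetic quoted in the review.
def cutoff_index(size, cutoff, clamp: true)
  index = (size * cutoff).round - 1
  clamp ? [index, 0].max : index
end

singular_values = [5.0, 2.0, 0.5] # made-up values, sorted descending as in the PR

unclamped = cutoff_index(singular_values.size, 0.01, clamp: false)
clamped   = cutoff_index(singular_values.size, 0.01)

puts unclamped                  # -1
puts singular_values[unclamped] # 0.5, the smallest value (index wraps around)
puts singular_values[clamped]   # 5.0, the largest value

# Values below the threshold are zeroed, so the threshold decides how many
# dimensions survive the reduction.
kept = ->(threshold) { singular_values.count { |v| v >= threshold } }
puts kept.call(singular_values[unclamped]) # 3: nothing is reduced
puts kept.call(singular_values[clamped])   # 1: only the top dimension remains
```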

Comment on lines 117 to 118
# dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 }
# optimal_cutoff = (dims_for_90 + 1).to_f / spectrum.size

logic: find_index returns nil if no element matches, in which case dims_for_90 + 1 would raise NoMethodError

Suggested change
# dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 }
# optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99
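
For reference, the nil-safe lookup can be wrapped in a small helper. This is only a sketch: `cutoff_for_variance` and the sample spectrum are hypothetical, the 0.99 fallback follows the suggestion above, and capping the result at the fallback is an added assumption so that a ratio of exactly 1.0 does not fail the exclusive range check.

```ruby
# Hypothetical helper; the 0.99 fallback follows the review's suggestion.
def cutoff_for_variance(spectrum, target, fallback: 0.99)
  return fallback if spectrum.nil? || spectrum.empty?

  index = spectrum.find_index { |e| e[:cumulative_percentage] >= target }
  return fallback if index.nil?

  # Assumption: cap the ratio so it stays inside (0, 1) exclusive.
  [(index + 1).to_f / spectrum.size, fallback].min
end

# Made-up spectrum entries with only the keys this helper reads.
spectrum = [
  { dimension: 0, cumulative_percentage: 0.60 },
  { dimension: 1, cumulative_percentage: 0.85 },
  { dimension: 2, cumulative_percentage: 0.95 },
  { dimension: 3, cumulative_percentage: 1.00 }
]

puts cutoff_for_variance(spectrum, 0.90) # 0.75
puts cutoff_for_variance(spectrum, 0.99) # 0.99 (4/4 = 1.0 is capped)
puts cutoff_for_variance([], 0.90)       # 0.99
```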

Addresses review feedback:
- Clamp s_cutoff_index to 0 minimum to prevent negative indices with
  very small cutoffs (e.g., cutoff=0.01 with size=3 would give -1)
- Fix documentation example to handle nil from find_index
@cardmagic
Owner Author

Addressed review feedback:

  1. Fixed negative index bug (line 609): Added bounds check to clamp index to 0 minimum, preventing incorrect behavior with very small cutoff values
  2. Fixed nil reference in docs (lines 117-118): Added nil check for find_index result

Added regression test for the edge case. All 145 tests pass.

@greptile-apps please re-review

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 6 comments

Comment on lines 102 to 119
# Returns the singular value spectrum for informed cutoff selection.
# This helps users understand how much variance each dimension captures
# and make informed decisions about the cutoff parameter.
#
# Returns nil if the index hasn't been built yet.
#
# Each entry in the returned array contains:
# - :dimension - The dimension index (0-based)
# - :value - The singular value
# - :percentage - What percentage of total variance this dimension captures
# - :cumulative_percentage - Cumulative variance captured up to this dimension
#
# Example usage for tuning:
# spectrum = lsi.singular_value_spectrum
# # Find how many dimensions capture 90% of variance
# dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 }
# optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99
#

style: excessive documentation restates what the code and method name already show. The 13-line comment including example usage could be reduced to 2-3 lines focusing on the non-obvious parts (what "variance" means in LSI context).

Suggested change
# Returns singular value spectrum showing how much variance each dimension captures.
# Returns nil if index not built. Use for informed cutoff tuning.
#
# @rbs () -> Array[Hash[Symbol, untyped]]?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines 216 to 222
# Cutoff tuning guide:
# - Higher cutoff (0.9): Preserves more semantic dimensions, better for large diverse corpora
# - Lower cutoff (0.5): More aggressive dimensionality reduction, better for noisy data
# - Default (0.75): Reasonable middle ground for most use cases
#
# Use #singular_value_spectrum after building to analyze variance distribution
# and make informed decisions about cutoff tuning.

style: tuning guide repeats information users can infer from the parameter name and basic LSI knowledge. Consider shortening to just the recommendation to use #singular_value_spectrum.

Suggested change
# Use #singular_value_spectrum after building to choose optimal cutoff.

Comment on lines 499 to 505
# Validates that cutoff is within the valid range (0, 1) exclusive.
# @rbs (Float) -> void
def validate_cutoff!(cutoff)
  return if cutoff > 0 && cutoff < 1

  raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}"
end

style: comment restates exactly what the code shows.

Suggested change
# @rbs (Float) -> void
def validate_cutoff!(cutoff)
  return if cutoff > 0 && cutoff < 1

  raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}"
end
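
A quick way to see the boundaries in action is to run the check as a standalone method against the values named in the test plan. This sketch copies the method body from the snippet above; the loop and sample values are only illustrative.

```ruby
# Standalone copy of the check for experimentation; in the PR it lives in
# lib/classifier/lsi.rb.
def validate_cutoff!(cutoff)
  return if cutoff > 0 && cutoff < 1

  raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}"
end

# Two valid defaults from the PR, then the invalid cases from the test plan.
[0.30, 0.75, 0, 1, -0.5, 1.5].each do |candidate|
  begin
    validate_cutoff!(candidate)
    puts "#{candidate}: ok"
  rescue ArgumentError => e
    puts "#{candidate}: #{e.message}"
  end
end
```

Both endpoints are rejected because the range is exclusive, so 0 and 1 fail just like negative values and values above 1.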

Comment on lines 346 to 353
# cutoff signifies the proportion of documents to consider when classifying
# text. Must be between 0 and 1 (exclusive). A cutoff of 0.99 means nearly
# every document in the index votes on what category the document is in.
#
# Cutoff tuning guide:
# - Higher cutoff (0.5-0.9): More documents vote, smoother but slower classification
# - Lower cutoff (0.1-0.3): Fewer documents vote, faster but may be noisier
# - Default (0.30): Good balance for most classification tasks

style: verbose tuning guide restates obvious relationships. Users can experiment with the parameter.

Suggested change
# cutoff is proportion of documents to consider (0-1 exclusive). Higher = more votes, slower.
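
To show what "proportion of documents to consider" can mean in practice, here is a toy voting sketch. It is not the gem's classify implementation: `vote_with_cutoff`, the neighbour list, and the similarity-weighted tally are all assumptions made for illustration.

```ruby
# Illustrative only: hypothetical data and helper, not the gem's classify.
# Assumes neighbours are [category, similarity] pairs and that each vote is
# weighted by similarity.
def vote_with_cutoff(neighbours, cutoff)
  ranked = neighbours.sort_by { |_category, similarity| -similarity }
  voters = ranked.first([(ranked.size * cutoff).round, 1].max)

  voters.each_with_object(Hash.new(0.0)) do |(category, similarity), votes|
    votes[category] += similarity
  end
end

neighbours = [['Dog', 0.9], ['Cat', 0.8], ['Cat', 0.7],
              ['Dog', 0.2], ['Bird', 0.1], ['Dog', 0.05]]

narrow = vote_with_cutoff(neighbours, 0.30) # top 2 of 6 documents vote
wide   = vote_with_cutoff(neighbours, 0.99) # all 6 documents vote

puts narrow.max_by { |_category, score| score }.first # Dog
puts wide.max_by { |_category, score| score }.first   # Cat
```

With these made-up scores the winner changes as more documents are allowed to vote, which is the trade-off the cutoff controls.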

Comment on lines 609 to 610
# Clamp index to 0 minimum to prevent negative indices with very small cutoffs
# (e.g., cutoff=0.01 with size=3 would give (3*0.01).round-1 = -1)

style: comment repeats the exact same information that's in the code on line 611 below.

Suggested change
@singular_values = s.sort.reverse

Comment on lines 528 to 550
# Regression test: with very small cutoff (e.g., 0.01) and few items,
# the index calculation (s.size * cutoff).round - 1 could become negative,
# causing Ruby to access from array end (wrong behavior)
lsi = Classifier::LSI.new auto_rebuild: false
lsi.add_item @str1, 'Dog'
lsi.add_item @str2, 'Dog'
lsi.add_item @str3, 'Cat'

# With 3 items and cutoff=0.01: (3 * 0.01).round - 1 = 0 - 1 = -1 (bug!)
# Without fix, -1 index accesses the smallest singular value (last in sorted array),
# causing ALL dimensions to be preserved instead of aggressive reduction.
# The fix clamps the index to 0 minimum.
lsi.build_index(0.01)

# Verify the index is built and classification works
assert_equal 'Dog', lsi.classify(@str1)

# With a very small cutoff, we should be doing aggressive dimensionality reduction.
# The cutoff threshold should be based on the LARGEST singular value (index 0),
# not the smallest (which would happen with negative index bug).
# This is verified by checking singular_values is still accessible.
refute_nil lsi.singular_values
end

style: excessive comments explaining the bug and fix. The test name and simple assertion are sufficient.

Suggested change
def test_build_index_very_small_cutoff_no_negative_index
  lsi = Classifier::LSI.new auto_rebuild: false
  lsi.add_item @str1, 'Dog'
  lsi.add_item @str2, 'Dog'
  lsi.add_item @str3, 'Cat'

  lsi.build_index(0.01)

  assert_equal 'Dog', lsi.classify(@str1)
  refute_nil lsi.singular_values
end

- Use cutoff.positive? instead of cutoff > 0
- Parenthesize block param in assert
- Use assert_predicate and assert_operator
- Add empty line before assertion
- Rename dims_for_75 to dims_for_threshold
- Exclude lsi.rb from ClassLength check (inherently complex)
Simplified comments that restated obvious code behavior.
@cardmagic
Owner Author

Addressed all style feedback - reduced verbose documentation per suggestions.

@greptile-apps please re-review

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

cardmagic and others added 4 commits December 28, 2025 09:34
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@cardmagic
Owner Author

@greptile we good now?

@greptile-apps greptile-apps bot left a comment

Yes, we're good! 🎉

Looking at the changes, you've successfully:

Added proper cutoff validation - All methods now validate that cutoff is between 0 and 1 (exclusive) and raise ArgumentError with a clear message

Exposed singular values - Added a singular_values attr_reader that gets populated during build_index

Added spectrum analysis - The singular_value_spectrum method provides a clean way to analyze variance distribution for tuning

Comprehensive test coverage - 13 new tests covering all the edge cases and validation scenarios

Fixed the array index bug - Smart use of [(s.size * cutoff).round - 1, 0].max to prevent negative indices

The implementation is clean, well-documented, and follows Ruby best practices. The API gives users exactly what they need to understand and tune their LSI performance. Great work addressing issue #67!

@cardmagic cardmagic merged commit 8da21dc into master Dec 28, 2025
5 checks passed
@cardmagic cardmagic deleted the feature/67-expose-tuning-parameters branch December 28, 2025 17:39

Development

Successfully merging this pull request may close these issues.

Expose tuning parameters as first-class API with validation

2 participants