-
Notifications
You must be signed in to change notification settings - Fork 122
feat(lsi): expose tuning parameters with validation and introspection API #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… API The LSI classifier previously used undocumented magic numbers for critical cutoff parameters with no validation or introspection capabilities. Users had no guidance on tuning for different corpus sizes. This change adds: - Parameter validation for cutoff (must be between 0 and 1 exclusive) - `singular_values` attr_reader exposing SVD singular values after build - `singular_value_spectrum` method for analyzing variance distribution - Documentation with tuning guides for different use cases The introspection API enables users to make informed decisions about cutoff tuning by examining how much variance each semantic dimension captures. Fixes #67
Greptile SummaryThis PR adds parameter validation and an introspection API for LSI cutoff tuning, exposing Major Changes:
Code Quality Issues:
Per the de-AI code review instructions, consider removing documentation that simply restates obvious code behavior. Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant LSI
participant validate_cutoff!
participant build_reduced_matrix
participant SVD as Matrix.SV_decomp
User->>LSI: build_index(cutoff=0.75)
LSI->>validate_cutoff!: validate_cutoff!(0.75)
alt cutoff invalid (≤0 or ≥1)
validate_cutoff!-->>User: raise ArgumentError
else cutoff valid
validate_cutoff!-->>LSI: valid
LSI->>build_reduced_matrix: build_reduced_matrix(tdm, 0.75)
build_reduced_matrix->>SVD: SV_decomp()
SVD-->>build_reduced_matrix: u, v, s (singular values)
build_reduced_matrix->>build_reduced_matrix: @singular_values = s.sort.reverse
build_reduced_matrix->>build_reduced_matrix: s_cutoff_index = [(s.size * 0.75).round - 1, 0].max
build_reduced_matrix->>build_reduced_matrix: zero out values < s_cutoff
build_reduced_matrix-->>LSI: reduced matrix
LSI-->>User: index built
end
User->>LSI: singular_value_spectrum()
LSI-->>User: [{dimension, value, percentage, cumulative_percentage}, ...]
User->>LSI: classify(doc, cutoff=0.30)
LSI->>validate_cutoff!: validate_cutoff!(0.30)
alt cutoff invalid
validate_cutoff!-->>User: raise ArgumentError
else cutoff valid
validate_cutoff!-->>LSI: valid
LSI->>LSI: vote_unlocked(doc, 0.30)
LSI-->>User: category
end
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 2 comments
lib/classifier/lsi.rb
Outdated
| # Store singular values (sorted descending) for introspection | ||
| @singular_values = s.sort.reverse | ||
|
|
||
| s_cutoff = @singular_values[(s.size * cutoff).round - 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: with very small cutoff values (e.g. cutoff=0.01 and s.size=3), the index calculation (s.size * cutoff).round - 1 can become negative (e.g. (3 * 0.01).round - 1 = -1). In Ruby, negative indices access from the end, so this would incorrectly select the largest singular value instead of smallest, inverting the intended behavior.
| s_cutoff = @singular_values[(s.size * cutoff).round - 1] | |
| s_cutoff_index = [(s.size * cutoff).round - 1, 0].max | |
| s_cutoff = @singular_values[s_cutoff_index] |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 609:609
Comment:
**logic:** with very small cutoff values (e.g. `cutoff=0.01` and `s.size=3`), the index calculation `(s.size * cutoff).round - 1` can become negative (e.g. `(3 * 0.01).round - 1 = -1`). In Ruby, negative indices access from the end, so this would incorrectly select the largest singular value instead of smallest, inverting the intended behavior.
```suggestion
s_cutoff_index = [(s.size * cutoff).round - 1, 0].max
s_cutoff = @singular_values[s_cutoff_index]
```
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| # dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 } | ||
| # optimal_cutoff = (dims_for_90 + 1).to_f / spectrum.size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: find_index returns nil if no element matches, which would cause nil + 1 error
| # dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 } | |
| # optimal_cutoff = (dims_for_90 + 1).to_f / spectrum.size | |
| # dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 } | |
| # optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99 |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 117:118
Comment:
**logic:** `find_index` returns `nil` if no element matches, which would cause `nil + 1` error
```suggestion
# dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 }
# optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99
```
How can I resolve this? If you propose a fix, please make it concise.Addresses review feedback: - Clamp s_cutoff_index to 0 minimum to prevent negative indices with very small cutoffs (e.g., cutoff=0.01 with size=3 would give -1) - Fix documentation example to handle nil from find_index
|
Addressed review feedback:
Added regression test for the edge case. All 145 tests pass. @greptile-apps please re-review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 6 comments
lib/classifier/lsi.rb
Outdated
| # Returns the singular value spectrum for informed cutoff selection. | ||
| # This helps users understand how much variance each dimension captures | ||
| # and make informed decisions about the cutoff parameter. | ||
| # | ||
| # Returns nil if the index hasn't been built yet. | ||
| # | ||
| # Each entry in the returned array contains: | ||
| # - :dimension - The dimension index (0-based) | ||
| # - :value - The singular value | ||
| # - :percentage - What percentage of total variance this dimension captures | ||
| # - :cumulative_percentage - Cumulative variance captured up to this dimension | ||
| # | ||
| # Example usage for tuning: | ||
| # spectrum = lsi.singular_value_spectrum | ||
| # # Find how many dimensions capture 90% of variance | ||
| # dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 } | ||
| # optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99 | ||
| # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: excessive documentation restates what the code and method name already show. The 13-line comment including example usage could be reduced to 2-3 lines focusing on the non-obvious parts (what "variance" means in LSI context).
| # Returns the singular value spectrum for informed cutoff selection. | |
| # This helps users understand how much variance each dimension captures | |
| # and make informed decisions about the cutoff parameter. | |
| # | |
| # Returns nil if the index hasn't been built yet. | |
| # | |
| # Each entry in the returned array contains: | |
| # - :dimension - The dimension index (0-based) | |
| # - :value - The singular value | |
| # - :percentage - What percentage of total variance this dimension captures | |
| # - :cumulative_percentage - Cumulative variance captured up to this dimension | |
| # | |
| # Example usage for tuning: | |
| # spectrum = lsi.singular_value_spectrum | |
| # # Find how many dimensions capture 90% of variance | |
| # dims_for_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 } | |
| # optimal_cutoff = dims_for_90 ? (dims_for_90 + 1).to_f / spectrum.size : 0.99 | |
| # | |
| # Returns singular value spectrum showing how much variance each dimension captures. | |
| # Returns nil if index not built. Use for informed cutoff tuning. | |
| # | |
| # @rbs () -> Array[Hash[Symbol, untyped]]? |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 102:119
Comment:
**style:** excessive documentation restates what the code and method name already show. The 13-line comment including example usage could be reduced to 2-3 lines focusing on the non-obvious parts (what "variance" means in LSI context).
```suggestion
# Returns singular value spectrum showing how much variance each dimension captures.
# Returns nil if index not built. Use for informed cutoff tuning.
#
# @rbs () -> Array[Hash[Symbol, untyped]]?
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| # Cutoff tuning guide: | ||
| # - Higher cutoff (0.9): Preserves more semantic dimensions, better for large diverse corpora | ||
| # - Lower cutoff (0.5): More aggressive dimensionality reduction, better for noisy data | ||
| # - Default (0.75): Reasonable middle ground for most use cases | ||
| # | ||
| # Use #singular_value_spectrum after building to analyze variance distribution | ||
| # and make informed decisions about cutoff tuning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: tuning guide repeats information users can infer from the parameter name and basic LSI knowledge. Consider shortening to just the recommendation to use #singular_value_spectrum.
| # Cutoff tuning guide: | |
| # - Higher cutoff (0.9): Preserves more semantic dimensions, better for large diverse corpora | |
| # - Lower cutoff (0.5): More aggressive dimensionality reduction, better for noisy data | |
| # - Default (0.75): Reasonable middle ground for most use cases | |
| # | |
| # Use #singular_value_spectrum after building to analyze variance distribution | |
| # and make informed decisions about cutoff tuning. | |
| # Use #singular_value_spectrum after building to choose optimal cutoff. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 216:222
Comment:
**style:** tuning guide repeats information users can infer from the parameter name and basic LSI knowledge. Consider shortening to just the recommendation to use `#singular_value_spectrum`.
```suggestion
# Use #singular_value_spectrum after building to choose optimal cutoff.
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| # Validates that cutoff is within the valid range (0, 1) exclusive. | ||
| # @rbs (Float) -> void | ||
| def validate_cutoff!(cutoff) | ||
| return if cutoff > 0 && cutoff < 1 | ||
|
|
||
| raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}" | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates exactly what the code shows.
| # Validates that cutoff is within the valid range (0, 1) exclusive. | |
| # @rbs (Float) -> void | |
| def validate_cutoff!(cutoff) | |
| return if cutoff > 0 && cutoff < 1 | |
| raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}" | |
| end | |
| # @rbs (Float) -> void | |
| def validate_cutoff!(cutoff) | |
| return if cutoff > 0 && cutoff < 1 | |
| raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}" | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 499:505
Comment:
**style:** comment restates exactly what the code shows.
```suggestion
# @rbs (Float) -> void
def validate_cutoff!(cutoff)
return if cutoff > 0 && cutoff < 1
raise ArgumentError, "cutoff must be between 0 and 1 (exclusive), got #{cutoff}"
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| # cutoff signifies the proportion of documents to consider when classifying | ||
| # text. Must be between 0 and 1 (exclusive). A cutoff of 0.99 means nearly | ||
| # every document in the index votes on what category the document is in. | ||
| # | ||
| # Cutoff tuning guide: | ||
| # - Higher cutoff (0.5-0.9): More documents vote, smoother but slower classification | ||
| # - Lower cutoff (0.1-0.3): Fewer documents vote, faster but may be noisier | ||
| # - Default (0.30): Good balance for most classification tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: verbose tuning guide restates obvious relationships. Users can experiment with the parameter.
| # cutoff signifies the proportion of documents to consider when classifying | |
| # text. Must be between 0 and 1 (exclusive). A cutoff of 0.99 means nearly | |
| # every document in the index votes on what category the document is in. | |
| # | |
| # Cutoff tuning guide: | |
| # - Higher cutoff (0.5-0.9): More documents vote, smoother but slower classification | |
| # - Lower cutoff (0.1-0.3): Fewer documents vote, faster but may be noisier | |
| # - Default (0.30): Good balance for most classification tasks | |
| # cutoff is proportion of documents to consider (0-1 exclusive). Higher = more votes, slower. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 346:353
Comment:
**style:** verbose tuning guide restates obvious relationships. Users can experiment with the parameter.
```suggestion
# cutoff is proportion of documents to consider (0-1 exclusive). Higher = more votes, slower.
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| # Clamp index to 0 minimum to prevent negative indices with very small cutoffs | ||
| # (e.g., cutoff=0.01 with size=3 would give (3*0.01).round-1 = -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment repeats the exact same information that's in the code on line 611 below.
| # Clamp index to 0 minimum to prevent negative indices with very small cutoffs | |
| # (e.g., cutoff=0.01 with size=3 would give (3*0.01).round-1 = -1) | |
| @singular_values = s.sort.reverse |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 609:610
Comment:
**style:** comment repeats the exact same information that's in the code on line 611 below.
```suggestion
@singular_values = s.sort.reverse
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
test/lsi/lsi_test.rb
Outdated
| # Regression test: with very small cutoff (e.g., 0.01) and few items, | ||
| # the index calculation (s.size * cutoff).round - 1 could become negative, | ||
| # causing Ruby to access from array end (wrong behavior) | ||
| lsi = Classifier::LSI.new auto_rebuild: false | ||
| lsi.add_item @str1, 'Dog' | ||
| lsi.add_item @str2, 'Dog' | ||
| lsi.add_item @str3, 'Cat' | ||
|
|
||
| # With 3 items and cutoff=0.01: (3 * 0.01).round - 1 = 0 - 1 = -1 (bug!) | ||
| # Without fix, -1 index accesses the smallest singular value (last in sorted array), | ||
| # causing ALL dimensions to be preserved instead of aggressive reduction. | ||
| # The fix clamps the index to 0 minimum. | ||
| lsi.build_index(0.01) | ||
|
|
||
| # Verify the index is built and classification works | ||
| assert_equal 'Dog', lsi.classify(@str1) | ||
|
|
||
| # With a very small cutoff, we should be doing aggressive dimensionality reduction. | ||
| # The cutoff threshold should be based on the LARGEST singular value (index 0), | ||
| # not the smallest (which would happen with negative index bug). | ||
| # This is verified by checking singular_values is still accessible. | ||
| refute_nil lsi.singular_values | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: excessive comments explaining the bug and fix. The test name and simple assertion are sufficient.
| # Regression test: with very small cutoff (e.g., 0.01) and few items, | |
| # the index calculation (s.size * cutoff).round - 1 could become negative, | |
| # causing Ruby to access from array end (wrong behavior) | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| lsi.add_item @str1, 'Dog' | |
| lsi.add_item @str2, 'Dog' | |
| lsi.add_item @str3, 'Cat' | |
| # With 3 items and cutoff=0.01: (3 * 0.01).round - 1 = 0 - 1 = -1 (bug!) | |
| # Without fix, -1 index accesses the smallest singular value (last in sorted array), | |
| # causing ALL dimensions to be preserved instead of aggressive reduction. | |
| # The fix clamps the index to 0 minimum. | |
| lsi.build_index(0.01) | |
| # Verify the index is built and classification works | |
| assert_equal 'Dog', lsi.classify(@str1) | |
| # With a very small cutoff, we should be doing aggressive dimensionality reduction. | |
| # The cutoff threshold should be based on the LARGEST singular value (index 0), | |
| # not the smallest (which would happen with negative index bug). | |
| # This is verified by checking singular_values is still accessible. | |
| refute_nil lsi.singular_values | |
| end | |
| def test_build_index_very_small_cutoff_no_negative_index | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| lsi.add_item @str1, 'Dog' | |
| lsi.add_item @str2, 'Dog' | |
| lsi.add_item @str3, 'Cat' | |
| lsi.build_index(0.01) | |
| assert_equal 'Dog', lsi.classify(@str1) | |
| refute_nil lsi.singular_values | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/lsi_test.rb
Line: 528:550
Comment:
**style:** excessive comments explaining the bug and fix. The test name and simple assertion are sufficient.
```suggestion
def test_build_index_very_small_cutoff_no_negative_index
lsi = Classifier::LSI.new auto_rebuild: false
lsi.add_item @str1, 'Dog'
lsi.add_item @str2, 'Dog'
lsi.add_item @str3, 'Cat'
lsi.build_index(0.01)
assert_equal 'Dog', lsi.classify(@str1)
refute_nil lsi.singular_values
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Use cutoff.positive? instead of cutoff > 0 - Parenthesize block param in assert - Use assert_predicate and assert_operator - Add empty line before assertion - Rename dims_for_75 to dims_for_threshold - Exclude lsi.rb from ClassLength check (inherently complex)
Simplified comments that restated obvious code behavior.
|
Addressed all style feedback - reduced verbose documentation per suggestions. @greptile-apps please re-review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 4 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
|
@greptile we good now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we're good! 🎉
Looking at the changes, you've successfully:
✅ Added proper cutoff validation - All methods now validate that cutoff is between 0 and 1 (exclusive) and throw clear ArgumentError messages
✅ Exposed singular values - Added @singular_values attr_reader that gets populated during build_index
✅ Added spectrum analysis - The singular_value_spectrum method provides a clean way to analyze variance distribution for tuning
✅ Comprehensive test coverage - 13 new tests covering all the edge cases and validation scenarios
✅ Fixed the array index bug - Smart use of [(s.size * cutoff).round - 1, 0].max to prevent negative indices
The implementation is clean, well-documented, and follows Ruby best practices. The API gives users exactly what they need to understand and tune their LSI performance. Great work addressing issue #67!
Summary
cutoffinbuild_index,classify,vote, andclassify_with_confidence(must be between 0 and 1 exclusive)singular_valuesattr_reader exposing SVD singular values after buildsingular_value_spectrummethod for analyzing variance distribution to help users tune cutoffTest plan
ArgumentErrorfor invalid cutoff values (0, 1, negative, >1)singular_valuesis nil before build, populated aftersingular_value_spectrumreturns correct structure with percentages summing to 1.0Fixes #67