@Mpdreamz (Member) commented Nov 27, 2025

Improve Search Relevance with Refined Query Strategy

This PR overhauls the Elasticsearch query implementation to significantly improve search result relevance across the documentation site. The changes focus on better field weighting, improved tokenization, and more targeted query matching.

Strategic Approach: Precision First, Recall Later

This PR temporarily removes RRF (Reciprocal Rank Fusion) and semantic search to establish a new baseline with comprehensive test coverage. The previous implementation combined lexical and semantic search through RRF, making it difficult to debug relevance issues and understand why certain results ranked as they did.

By focusing on lexical search precision first, we can:

  1. Build a robust test suite with expected results for common queries
  2. Understand and optimize the baseline query behavior
  3. Establish clear relevance benchmarks
  4. Later reintroduce semantic search + RRF with larger recall once precision is validated

This approach follows the information retrieval principle: optimize precision before expanding recall. Semantic search will be reintroduced after we have confidence in the lexical baseline.

Key Search Relevance Improvements

1. Streamlined Query Structure

The previous implementation used RRF combining 13+ separate query clauses with complex boolean logic. This created unpredictable scoring where short titles (like "ECS") could dominate results due to high term frequency scores, making it nearly impossible to debug why specific documents ranked higher.

The new implementation uses a focused multi-match strategy:

  • Primary matching via MultiMatchQuery with bool_prefix type on new search_title completion fields (boost: 2.0)
  • Content matching on stripped_body with reduced weight (boost: 0.2) to prevent body text from overwhelming title relevance
  • URL matching for single-word queries to ensure direct matches (e.g., "templates") rank appropriately
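
As a rough illustration, the three clauses above could be expressed in Elasticsearch query DSL along these lines. This is a sketch written as Python dictionaries, not the PR's C# code: the `build_query` helper, the `url` field name, and the use of `should` with `minimum_should_match` are assumptions, while `search_title`, `stripped_body`, `bool_prefix`, and the boosts come from the description.

```python
def build_query(search_query: str) -> dict:
    """Sketch of the lexical query described above (hypothetical helper)."""
    tokens = search_query.split()
    should = [
        {
            "multi_match": {
                "query": search_query,
                "type": "bool_prefix",
                # search_as_you_type exposes _2gram/_3gram subfields next to the root field.
                "fields": [
                    "search_title",
                    "search_title._2gram",
                    "search_title._3gram",
                ],
                "boost": 2.0,
            }
        },
        # Body text is deliberately down-weighted relative to the title fields.
        {"match": {"stripped_body": {"query": search_query, "boost": 0.2}}},
    ]
    if len(tokens) == 1:
        # Assumed "url" field; constant_score assigns the same score to every match.
        should.append(
            {
                "constant_score": {
                    "filter": {"match": {"url": tokens[0]}},
                    "boost": 1.0,
                }
            }
        )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

Under these assumptions, `build_query("templates")` adds the URL clause, while `build_query("data streams")` does not.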

2. New search_title Field

Introduced a dedicated search_title field that combines the document title with URL components:

  • Prevents documents with very short titles from receiving disproportionate relevance scores
  • Uses search_as_you_type mapping with shingle subfields (_2gram, _3gram) for better prefix matching
  • Indexed with synonym analysis to handle abbreviations and alternative terms
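
A minimal sketch of how such a field might be produced and mapped, written in Python for illustration. The description does not say exactly how title and URL components are combined, so `build_search_title` and its path-splitting rule are assumptions; the `search_as_you_type` type and the `synonyms_analyzer` name are taken from this PR description.

```python
from urllib.parse import urlparse


def build_search_title(title: str, url: str) -> str:
    """Hypothetical combination: the title followed by the URL path words."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    words = " ".join(s.replace("-", " ") for s in segments)
    return f"{title} {words}".strip()


# Mapping fragment: search_as_you_type creates the _2gram/_3gram subfields itself.
SEARCH_TITLE_MAPPING = {
    "properties": {
        "search_title": {
            "type": "search_as_you_type",
            "analyzer": "synonyms_analyzer",
        }
    }
}
```

With this rule, a page titled "ECS" would be indexed with a longer `search_title`, which is one way the very-short-title effect could be reduced.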

3. Enhanced Text Analysis Pipeline

Indexing improvements:

  • Added kstem filter to the synonyms analyzer for better stemming (e.g., "getting" → "get")
  • Expanded synonym dictionary with new mappings:
    • sso → single sign-on
    • querydsl → query dsl

Field-specific optimizations:

  • search_title and title now use search_as_you_type for autocomplete-style matching
  • Consistent synonyms_analyzer application across searchable fields
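
For orientation, index settings for an analyzer of this kind might look roughly like the following, shown as a Python dictionary. Only `kstem`, the analyzer name `synonyms_analyzer`, and the two synonym pairs come from the description; the filter name `synonyms_filter`, the `standard` tokenizer, the `lowercase` filter, the filter order, and the comma-separated (equivalent) synonym syntax are assumptions.

```python
def build_analysis_settings(synonyms: list[str]) -> dict:
    """Sketch of index analysis settings with a synonym filter and kstem."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; rules use the Solr synonym format.
                "synonyms_filter": {"type": "synonym", "synonyms": synonyms},
            },
            "analyzer": {
                "synonyms_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Assumed order: lowercase, expand synonyms, then stem.
                    "filter": ["lowercase", "synonyms_filter", "kstem"],
                }
            },
        }
    }


SETTINGS = build_analysis_settings(
    ["sso, single sign-on", "querydsl, query dsl"]
)
```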

4. Boosting and Dampening Strategy

Positive signals:

  • Multi-match on completion fields (2.0x boost) - prioritizes title/URL matches
  • Single-word URL matches wrapped in constant_score (1.0x) - each URL match adds a fixed score rather than a variable relevance score, so URL hits are not over-weighted

Negative signals:

  • BoostingQuery with negative dampening (0.8x) for generic terms:
    • Documents with "plugin", "client", or "integration" in title/headings/URL receive reduced scores
    • This prevents overly generic integration/plugin docs from dominating specific feature queries
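
The dampening can be pictured as an Elasticsearch `boosting` query: documents that match the negative clause keep their place in the result set, but their score is multiplied by `negative_boost`. The sketch below is in Python; the `dampen_generic` helper and the `headings` and `url` field names are assumptions, while the three terms and the 0.8 factor come from the description.

```python
GENERIC_TERMS = ["plugin", "client", "integration"]


def dampen_generic(positive: dict, negative_boost: float = 0.8) -> dict:
    """Wrap a query so documents matching generic terms are scored lower."""
    negative = {
        "multi_match": {
            # multi_match defaults to OR, so any single term triggers the dampening.
            "query": " ".join(GENERIC_TERMS),
            # Assumed field names for title, headings, and URL.
            "fields": ["title", "headings", "url"],
        }
    }
    return {
        "boosting": {
            "positive": positive,
            "negative": negative,
            "negative_boost": negative_boost,
        }
    }
```

For example, a document that would score 10.0 on the positive query and also matches "plugin" in its title would end up at 8.0.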

5. Edge Case Handling

Added special handling for ambiguous terms:

  • datastream/datastreams/data-stream → automatically expanded to "data streams"
    • Fixes N-of-1 corpus problem where a single page used non-standard terminology

Testing Infrastructure Enhancements

New Search.IntegrationTests Project

  • Dedicated integration test suite for search relevance validation
  • Comprehensive test cases covering common search patterns:
    • Language-specific queries (c# client, dotnet client)
    • Acronym searches (ecs, sso)
    • Multi-word phrases (elasticsearch getting started)
    • Variant spellings (data-streams, datastream)

Extended TheoryData Support

  • Tests now support optional additional expected URLs for first-page verification
  • Enables testing that related results appear together (e.g., "runscript" should show both operation docs and security action docs)
  • Better validation of result quality beyond just the top result
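
The first-page check can be sketched independently of the test framework. The Python helper below is illustrative only: `check_first_page`, the page size of 10, and the example URLs are assumptions, not the PR's xUnit code.

```python
def check_first_page(result_urls, expected_top, also_expected=(), page_size=10):
    """Return failure messages; an empty list means the expectations hold."""
    first_page = list(result_urls)[:page_size]
    failures = []
    # The primary expectation: a specific URL must be the top result.
    if not first_page or first_page[0] != expected_top:
        failures.append(
            f"expected top result {expected_top!r}, got {first_page[:1]}"
        )
    # Optional expectations: related URLs only need to appear on the first page.
    for url in also_expected:
        if url not in first_page:
            failures.append(f"{url!r} missing from first page")
    return failures
```

A test would then assert that the returned list is empty, which reports every missing URL at once instead of stopping at the first mismatch.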

Elasticsearch Connection Guard

  • Tests now skip gracefully when Elasticsearch is unavailable (SkipUnless attribute)
  • Improved developer experience for running tests locally

Breaking Changes

  • Temporarily removed RRF retriever: Simplified to single query with post-filter for better explainability
  • Semantic search disabled: Will be reintroduced after precision baseline is established
  • Removed NormalizeSearchQuery: The "dotnet" → "net" transformation is now handled by synonyms

Migration Notes

This change requires reindexing to populate the new search_title field. The field is generated during indexing by combining title and URL components.

Results

The test suite now validates 7+ different search patterns with expected first-page results. The new query strategy produces more intuitive results where:

  • Direct title matches rank highest
  • Generic integration/plugin docs don't overshadow specific features
  • Abbreviations and synonyms work consistently
  • Related documentation appears together on the first page

Next Steps

Once the precision baseline is validated in production:

  1. Reintroduce semantic search for better natural language understanding
  2. Re-enable RRF to combine lexical + semantic signals
  3. Expand test coverage based on production search analytics
  4. Optimize for recall while maintaining precision standards

@Mpdreamz requested review from a team as code owners November 27, 2025 12:43
@Mpdreamz requested a review from cotti November 27, 2025 12:43
@Mpdreamz added the fix label Nov 27, 2025
@Mpdreamz self-assigned this Nov 27, 2025
@Mpdreamz changed the title from "fix/search relevance" to "Improve Search Relevance with new query strategy" Nov 27, 2025
@reakaleek (Member) commented Nov 27, 2025

We need to re-add the changes from #2277 in ElasticsearchGateway.cs

Comment on lines +96 to +104
var tokens = searchQuery.Split(" ");
if (tokens is ["datastream" or "datastreams" or "data-stream" or "data-streams"])
{
    // /docs/api/doc/kibana/operation/operation-delete-fleet-epm-packages-pkgname-pkgversion-datastream-assets
    // is the only page that uses "datastream" instead of "data streams", which gives it an N of 1 in the entire corpus.
    // That is hard to fix by tweaking boosts; the page should be updated to use "data streams" instead.
    searchQuery = "data streams";
    tokens = ["data", "streams"];
}
@reakaleek (Member) Nov 27, 2025

We might want to move this logic into a separate class if we find additional edge-case handling like this.

@Mpdreamz (Member, Author)

++ I will follow up by making this a new index-time synonym list. I just need to make sure that when that list is updated we reindex everything automatically (like we do for all setting/mapping changes).

@Mpdreamz (Member, Author)

@reakaleek #2277 is re-applied after fixing the merge conflict.

@Mpdreamz enabled auto-merge (squash) November 27, 2025 15:32
@Mpdreamz merged commit 876162a into main Nov 28, 2025
28 checks passed
@Mpdreamz deleted the fix/search-relevance branch November 28, 2025 11:54