Improve Search Relevance with new query strategy #2279

Mpdreamz · 2025-11-27T12:43:25Z

Improve Search Relevance with Refined Query Strategy

This PR overhauls the Elasticsearch query implementation to significantly improve search result relevance across the documentation site. The changes focus on better field weighting, improved tokenization, and more targeted query matching.

Strategic Approach: Precision First, Recall Later

This PR temporarily removes RRF (Reciprocal Rank Fusion) and semantic search to establish a new baseline with comprehensive test coverage. The previous implementation combined lexical and semantic search through RRF, making it difficult to debug relevance issues and understand why certain results ranked as they did.

By focusing on lexical search precision first, we can:

- Build a robust test suite with expected results for common queries
- Understand and optimize the baseline query behavior
- Establish clear relevance benchmarks
- Later reintroduce semantic search + RRF with larger recall once precision is validated

This approach follows the information retrieval principle: optimize precision before expanding recall. Semantic search will be reintroduced after we have confidence in the lexical baseline.

Key Search Relevance Improvements

1. Streamlined Query Structure

The previous implementation used RRF combining 13+ separate query clauses with complex boolean logic. This created unpredictable scoring where short titles (like "ECS") could dominate results due to high term frequency scores, making it nearly impossible to debug why specific documents ranked higher.

The new implementation uses a focused multi-match strategy:

- Primary matching via MultiMatchQuery with bool_prefix type on the new search_title search_as_you_type fields (boost: 2.0)
- Content matching on stripped_body with reduced weight (boost: 0.2) to prevent body text from overwhelming title relevance
- URL matching for single-word queries to ensure direct matches (e.g., "templates") rank appropriately
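As a rough illustration of the three clauses above, the request body could be assembled like this. This is a sketch, not the code in ElasticsearchGateway.cs: combining the clauses in a `bool`/`should`, using a plain `match` for the body and URL clauses, and the `url` field name are all assumptions made for the example.

```python
from typing import Any

# Root field plus the shingle subfields that a search_as_you_type mapping generates.
TITLE_FIELDS = ["search_title", "search_title._2gram", "search_title._3gram"]


def build_lexical_query(search_query: str) -> dict[str, Any]:
    """Build a query body with the title, body, and (single-word only) URL clauses."""
    should: list[dict[str, Any]] = [
        # Primary signal: prefix-aware match on the title fields, boosted 2.0.
        {
            "multi_match": {
                "query": search_query,
                "type": "bool_prefix",
                "fields": TITLE_FIELDS,
                "boost": 2.0,
            }
        },
        # Secondary signal: body text, down-weighted to 0.2.
        {"match": {"stripped_body": {"query": search_query, "boost": 0.2}}},
    ]

    tokens = search_query.split()
    if len(tokens) == 1:
        # Hypothetical `url` field: a URL hit adds a fixed score of 1.0.
        should.append(
            {"constant_score": {"filter": {"match": {"url": tokens[0]}}, "boost": 1.0}}
        )

    # Assumed combination: at least one clause must match.
    return {"bool": {"should": should, "minimum_should_match": 1}}
```

Calling `build_lexical_query("templates")` yields three clauses, while a multi-word query such as `"data streams"` yields only the title and body clauses.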

2. New `search_title` Field

Introduced a dedicated search_title field that combines the document title with URL components:

- Prevents documents with very short titles from receiving disproportionate relevance scores
- Uses search_as_you_type mapping with shingle subfields (_2gram, _3gram) for better prefix matching
- Indexed with synonym analysis to handle abbreviations and alternative terms
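A minimal sketch of what such a field could look like, assuming the analyzer is named `synonyms_analyzer` and that the URL is split on `/`, `-`, and `_`. The helper `build_search_title` and its splitting rule are hypothetical; the PR does not spell out how the title and URL components are combined.

```python
import re

# Mapping fragment: search_as_you_type generates the _2gram and _3gram subfields
# when max_shingle_size is 3 (which is also the default).
SEARCH_TITLE_MAPPING = {
    "properties": {
        "search_title": {
            "type": "search_as_you_type",
            "analyzer": "synonyms_analyzer",
            "max_shingle_size": 3,
        }
    }
}


def build_search_title(title: str, url: str) -> str:
    """Hypothetical helper: append the URL path components to the title."""
    parts = [p for p in re.split(r"[/\-_]+", url) if p]
    return " ".join([title, *parts])
```

With this helper, a page titled "ECS" at `/docs/reference/ecs` would be indexed with a longer `search_title` than its three-character title, which is the effect the first bullet describes.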

3. Enhanced Text Analysis Pipeline

Indexing improvements:

- Added kstem filter to the synonyms analyzer for better stemming (e.g., "getting" → "get")
- Expanded synonym dictionary with new mappings:
  - sso ↔ single sign-on
  - querydsl ↔ query dsl
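For orientation, index analysis settings along these lines would produce that behavior. The filter name `synonyms_filter`, the `standard` tokenizer, and the filter order are assumptions for this sketch; only the `kstem` filter and the two synonym pairs come from the description above.

```python
import json

ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            # Comma-separated entries are treated as equivalent terms.
            "synonyms_filter": {
                "type": "synonym",
                "synonyms": [
                    "sso, single sign-on",
                    "querydsl, query dsl",
                ],
            }
        },
        "analyzer": {
            "synonyms_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                # Assumed order: lowercase, then synonyms, then kstem stemming.
                "filter": ["lowercase", "synonyms_filter", "kstem"],
            }
        },
    }
}

if __name__ == "__main__":
    # Print the settings body as it would be sent when creating the index.
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```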

Field-specific optimizations:

- search_title and title now use search_as_you_type for autocomplete-style matching
- Consistent synonyms_analyzer application across searchable fields

4. Boosting and Dampening Strategy

Positive signals:

- Multi-match on the search_as_you_type fields (2.0x boost) - prioritizes title/URL matches
- Single-word URL matches wrapped in constant_score (1.0x) - a URL hit adds a fixed score instead of a term-frequency-dependent one

Negative signals:

- BoostingQuery with negative dampening (0.8x) for generic terms:
  - Documents with "plugin", "client", or "integration" in title/headings/URL receive reduced scores
  - This prevents overly generic integration/plugin docs from dominating specific feature queries
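The dampening step can be pictured as wrapping the positive query in an Elasticsearch `boosting` query. This is a sketch under assumptions: the field names `headings` and `url` and the use of a single `multi_match` for the negative clause are illustrative, not taken from the PR's code.

```python
from typing import Any

GENERIC_TERMS = "plugin client integration"
# Assumed field names for title, headings, and URL.
DAMPENED_FIELDS = ["title", "headings", "url"]


def dampen_generic_docs(positive: dict[str, Any], negative_boost: float = 0.8) -> dict[str, Any]:
    """Wrap `positive` so documents matching the generic terms keep only 0.8x of their score."""
    negative = {
        "multi_match": {
            "query": GENERIC_TERMS,
            "fields": DAMPENED_FIELDS,
            "operator": "or",  # any one generic term in any listed field triggers the dampening
        }
    }
    return {
        "boosting": {
            "positive": positive,
            "negative": negative,
            "negative_boost": negative_boost,
        }
    }
```

Documents that match the negative clause are still returned; their score from the positive query is multiplied by `negative_boost`, so they are demoted rather than filtered out.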

5. Edge Case Handling

Added special handling for ambiguous terms:

- datastream/datastreams/data-stream/data-streams → automatically rewritten to "data streams"
  - Fixes N-of-1 corpus problem where a single page used non-standard terminology

Testing Infrastructure Enhancements

New `Search.IntegrationTests` Project

- Dedicated integration test suite for search relevance validation
- Comprehensive test cases covering common search patterns:
  - Language-specific queries (c# client, dotnet client)
  - Acronym searches (ecs, sso)
  - Multi-word phrases (elasticsearch getting started)
  - Variant spellings (data-streams, datastream)

Extended TheoryData Support

- Tests now support optional additional expected URLs for first-page verification
- Enables testing that related results appear together (e.g., "runscript" should show both operation docs and security action docs)
- Better validation of result quality beyond just the top result

Elasticsearch Connection Guard

- Tests now skip gracefully when Elasticsearch is unavailable (SkipUnless attribute)
- Improved developer experience for running tests locally

Breaking Changes

- Temporarily removed RRF retriever: Simplified to single query with post-filter for better explainability
- Semantic search disabled: Will be reintroduced after precision baseline is established
- Removed NormalizeSearchQuery: The "dotnet" → "net" transformation is now handled by synonyms

Migration Notes

This change requires reindexing to populate the new search_title field. The field is generated during indexing by combining title and URL components.

Results

Test suite now validates 7+ different search patterns with expected first-page results. The new query strategy produces more intuitive results where:

- Direct title matches rank highest
- Generic integration/plugin docs don't overshadow specific features
- Abbreviations and synonyms work consistently
- Related documentation appears together on the first page

Next Steps

Once the precision baseline is validated in production:

- Reintroduce semantic search for better natural language understanding
- Re-enable RRF to combine lexical + semantic signals
- Expand test coverage based on production search analytics
- Optimize for recall while maintaining precision standards

reakaleek · 2025-11-27T12:53:07Z

We need to re-add the changes from #2277 in ElasticsearchGateway.cs

reakaleek · 2025-11-27T12:56:09Z

src/api/Elastic.Documentation.Api.Infrastructure/Adapters/Search/ElasticsearchGateway.cs

+		var tokens = searchQuery.Split(" ");
+		if (tokens is ["datastream" or "datastreams" or "data-stream" or "data-streams"])
+		{
+			// /docs/api/doc/kibana/operation/operation-delete-fleet-epm-packages-pkgname-pkgversion-datastream-assets
+			// Is the only page that uses "datastream" instead of "data streams" this gives it an N of 1 in the entire corpus
+			// which is hard to fix through tweaking boosting, should update the page to use "data streams" instead
+			searchQuery = "data streams";
+			tokens = ["data", "streams"];
+		}


We might want to move this logic into a separate class if we find additional edge-case handling like this.

++ I will follow up by making this a new index-time synonym list. I just need to make sure that when that list is updated we reindex everything automatically (as we do for all setting/mapping changes).

Mpdreamz · 2025-11-27T15:32:20Z

@reakaleek #2277 is re-applied after fixing the merge conflict.

Mpdreamz added 2 commits November 27, 2025 13:38

New query for search with updated relevance tests

b64a334

SkipUnless connected to Elasticsearch

8cd7411

Mpdreamz requested review from a team as code owners November 27, 2025 12:43

Mpdreamz requested a review from cotti November 27, 2025 12:43

Mpdreamz added the fix label Nov 27, 2025

Mpdreamz self-assigned this Nov 27, 2025

github-code-quality bot found potential problems Nov 27, 2025

tests-integration/Elastic.Assembler.IntegrationTests/Search/SearchBootstrapFixture.cs (dismissed)

reakaleek reviewed Nov 27, 2025

src/api/Elastic.Documentation.Api.Infrastructure/Adapters/Search/ElasticsearchGateway.cs (resolved)

reakaleek approved these changes Nov 27, 2025

Mpdreamz changed the title ~~fix/search relevance~~ Improve Search Relevance with new query strategy Nov 27, 2025

reakaleek reviewed Nov 27, 2025

Merge remote-tracking branch 'origin/main' into fix/search-relevance

a871d4a

Mpdreamz enabled auto-merge (squash) November 27, 2025 15:32

Delete SearchRelevanceTests from Assembler.IntegrationTests now lives under Search.IntegrationTests

3e23133

reakaleek approved these changes Nov 27, 2025

Mpdreamz added 2 commits November 28, 2025 10:00

update SkipUnless

149908a

update SkipUnless

efc245a

Mpdreamz merged commit 876162a into main Nov 28, 2025
28 checks passed

Mpdreamz deleted the fix/search-relevance branch November 28, 2025 11:54
