Improve Search Relevance with new query strategy #2279
tests-integration/Elastic.Assembler.IntegrationTests/Search/SearchBootstrapFixture.cs
src/api/Elastic.Documentation.Api.Infrastructure/Adapters/Search/ElasticsearchGateway.cs
We need to re-add the changes from #2277 in ElasticsearchGateway.cs
    var tokens = searchQuery.Split(" ");
    if (tokens is ["datastream" or "datastreams" or "data-stream" or "data-streams"])
    {
        // /docs/api/doc/kibana/operation/operation-delete-fleet-epm-packages-pkgname-pkgversion-datastream-assets
        // is the only page that uses "datastream" instead of "data streams". This gives it an N of 1 in the entire corpus,
        // which is hard to fix by tweaking boosting. We should update the page to use "data streams" instead.
        searchQuery = "data streams";
        tokens = ["data", "streams"];
    }
We might want to move this logic into a separate class if we find additional edge-case handling like this.
++ I will follow up by making this a new index-time synonym list. I just need to make sure that when that list is updated, we reindex everything automatically (like we do for all setting/mapping changes).
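For illustration only, here is a minimal sketch of what extracting the edge case into its own class could look like until the index-time synonym list exists. The class name `QueryRewrites`, its `Apply` method, and the case-insensitive, trimmed lookup are all assumptions made for this sketch; none of them come from the PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: the name and shape are assumptions, not code from this PR.
public static class QueryRewrites
{
    // Whole-query rewrites for terms that are too rare in the corpus to rank well.
    // Case-insensitive matching is an assumption; the original check is case-sensitive.
    private static readonly Dictionary<string, string> Rewrites = new(StringComparer.OrdinalIgnoreCase)
    {
        ["datastream"] = "data streams",
        ["datastreams"] = "data streams",
        ["data-stream"] = "data streams",
        ["data-streams"] = "data streams",
    };

    // Returns the (possibly rewritten) query together with its tokens.
    // Queries without a matching rewrite pass through unchanged.
    public static (string Query, string[] Tokens) Apply(string searchQuery)
    {
        var trimmed = searchQuery.Trim();
        var query = Rewrites.TryGetValue(trimmed, out var replacement) ? replacement : searchQuery;
        return (query, query.Split(' ', StringSplitOptions.RemoveEmptyEntries));
    }
}
```

With this sketch, `QueryRewrites.Apply("Data-Streams")` returns the query "data streams" with the tokens "data" and "streams", while a multi-word query such as "elasticsearch getting started" is returned as-is. New edge cases would become one more dictionary entry rather than another branch in the gateway.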
@reakaleek #2277 is re-applied after fixing the merge conflict.
… under Search.IntegrationTests
Improve Search Relevance with Refined Query Strategy
This PR overhauls the Elasticsearch query implementation to significantly improve search result relevance across the documentation site. The changes focus on better field weighting, improved tokenization, and more targeted query matching.
Strategic Approach: Precision First, Recall Later
This PR temporarily removes RRF (Reciprocal Rank Fusion) and semantic search to establish a new baseline with comprehensive test coverage. The previous implementation combined lexical and semantic search through RRF, making it difficult to debug relevance issues and understand why certain results ranked as they did.
By focusing on lexical search precision first, we can:
This approach follows the information retrieval principle: optimize precision before expanding recall. Semantic search will be reintroduced after we have confidence in the lexical baseline.
Key Search Relevance Improvements
1. Streamlined Query Structure
The previous implementation used RRF combining 13+ separate query clauses with complex boolean logic. This created unpredictable scoring where short titles (like "ECS") could dominate results due to high term frequency scores, making it nearly impossible to debug why specific documents ranked higher.
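As a rough illustration of why a very short title can dominate, assume the fields are scored with BM25, Elasticsearch's default similarity (the PR does not state which similarity is configured, so this is an assumption). In its textbook form, the contribution of a term $t$ in a field $d$ is:

```latex
\[
\operatorname{score}(t, d) =
  \operatorname{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}
\]
```

Here $f(t, d)$ is the term frequency, $|d|$ is the field length in tokens, $\mathrm{avgdl}$ is the average field length, and $k_1 = 1.2$, $b = 0.75$ are the Elasticsearch defaults. For a one-token title such as "ECS", $|d|$ is far below $\mathrm{avgdl}$, so the denominator shrinks and the normalized term-frequency component approaches its maximum. A single match in such a title can therefore outscore matches in longer, more descriptive titles.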
The new implementation uses a focused multi-match strategy:
- MultiMatchQuery with bool_prefix type on new search_title completion fields (boost: 2.0)
- stripped_body with reduced weight (boost: 0.2) to prevent body text from overwhelming title relevance
2. New search_title Field
Introduced a dedicated search_title field that combines the document title with URL components:
- search_as_you_type mapping with n-gram subfields (_2gram, _3gram) for better prefix matching
3. Enhanced Text Analysis Pipeline
Indexing improvements:
- kstem filter to the synonyms analyzer for better stemming (e.g., "getting" → "get")
- sso ↔ single sign-on
- querydsl ↔ query dsl
Field-specific optimizations:
- search_title and title now use search_as_you_type for autocomplete-style matching
- synonyms_analyzer application across searchable fields
4. Boosting and Dampening Strategy
Positive signals:
- constant_score (1.0x) - prevents over-optimization
Negative signals:
- BoostingQuery with negative dampening (0.8x) for generic terms:
5. Edge Case Handling
Added special handling for ambiguous terms:
- datastream / datastreams / data-stream → automatically expanded to "data streams"
Testing Infrastructure Enhancements
New Search.IntegrationTests Project
- c# client, dotnet client
- ecs, sso
- elasticsearch getting started
- data-streams, datastream
Extended TheoryData Support
Elasticsearch Connection Guard
- SkipUnless attribute
Breaking Changes
- NormalizeSearchQuery: The "dotnet" → "net" transformation is now handled by synonyms
Migration Notes
This change requires reindexing to populate the new search_title field. The field is generated during indexing by combining title and URL components.
Results
Test suite now validates 7+ different search patterns with expected first-page results. The new query strategy produces more intuitive results where:
Next Steps
Once the precision baseline is validated in production: