Improve README and add engine customisation docs
Showing 2 changed files with 130 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,90 @@ | ||
# search-api-v2 | ||
Future API to search for content on GOV.UK (not yet live) | ||
API and synchronisation worker for general site search on GOV.UK | ||
|
||
This application powers the new site search for GOV.UK using Google Cloud Platform (GCP)'s [Vertex | ||
AI Search][vertex-docs] ("Discovery Engine") product as its underlying search engine. It provides | ||
two core pieces of functionality: | ||
- An API that is "minimally compatible" with the existing `search-api` REST interface to the extent | ||
necessary to power the ["site search" (`/search/all`) finder][search-all-finder]. | ||
- A synchronisation worker that receives content updates from the Publishing API message queue and | ||
updates the Discovery Engine dataset accordingly. | ||
|
||
## Local development | ||
The official way of running this application locally is through [GOV.UK Docker][govuk-docker], where | ||
a project is defined for it. Because this application is deeply integrated with a SaaS product, you | ||
will need access to a GCP Discovery Engine engine to do anything more meaningful than running the | ||
test suite. | ||
|
||
If you work on the GOV.UK team, you should be able to add a development engine for yourself through | ||
[`search-v2-infrastructure`][search-v2-infrastructure] and configure your local setup accordingly. | ||
|
||
Otherwise, you can create your own Discovery Engine engine in GCP and provide the engine's serving | ||
config path and datastore branch. | ||
|
||
You can then run the application from within the `govuk-docker` repository directory as follows: | ||
|
||
```bash | ||
# Add these to your "dotfiles" for convenience, or just export them in your terminal session if you | ||
# prefer: | ||
export DISCOVERY_ENGINE_SERVING_CONFIG=... | ||
export DISCOVERY_ENGINE_DATASTORE_BRANCH=... | ||
|
||
make search-api-v2 | ||
govuk-docker up -d search-api-v2-app # or search-api-v2-lite if you just want to run tests | ||
``` | ||
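
If the application starts but cannot reach an engine, an unset variable is a likely cause. The following Ruby sketch shows one way to check for that up front. It is illustrative only: `missing_discovery_engine_vars` is a hypothetical helper that is not part of this application, and only the two variable names are taken from the snippet above.

```ruby
# Hypothetical helper, not part of search-api-v2: lists which of the
# Discovery Engine environment variables are unset or blank.
REQUIRED_DISCOVERY_ENGINE_VARS = %w[
  DISCOVERY_ENGINE_SERVING_CONFIG
  DISCOVERY_ENGINE_DATASTORE_BRANCH
].freeze

# `env` defaults to the process environment; any Hash-like object
# responding to #[] can be passed instead (useful in tests).
def missing_discovery_engine_vars(env = ENV)
  REQUIRED_DISCOVERY_ENGINE_VARS.select do |name|
    env[name].to_s.strip.empty?
  end
end
```

Calling `missing_discovery_engine_vars` with no arguments returns an empty array when both variables are set, and the names of any missing ones otherwise.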
|
||
## Design goals and `search-api-v2` vs `search-api` | ||
Our primary product goal was to improve the quality of search results for the majority of GOV.UK | ||
users. | ||
|
||
The existing search powers a significant number of use cases within GOV.UK, including numerous | ||
user-facing "finder" pages handled by [Finder Frontend][finder-frontend] (among them the | ||
`/search/all` finder that handles _the_ main search page, which we usually refer to as "site | ||
search"), but also acts as a very general "everything but the kitchen sink" API for retrieving | ||
content by a set of criteria. | ||
|
||
We established that attempting to migrate all of these use cases with over a decade of accumulated | ||
logic and edge cases would distract us from our primary goal and be a poor fit for a next-generation | ||
search product anyway (the overwhelming majority of non-"site search" queries being trivial content | ||
retrieval filtered by certain attributes that could be handled by a relational database). | ||
|
||
We therefore made a tactical decision to focus on "site search" only and find the minimal subset of | ||
the existing API contract that is necessary to render search results in this context, and update | ||
[Finder Frontend][finder-frontend] to call our new application if and only if the user is using the | ||
general "site search" finder. | ||
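
The "if and only if" routing rule described above can be sketched in a few lines of Ruby. This is not Finder Frontend's actual code: `search_backend_for`, the constant, and the returned symbols are hypothetical names chosen for illustration, and any further conditions Finder Frontend applies are left out.

```ruby
# Hypothetical sketch of the routing decision, not Finder Frontend's
# real implementation: only the general "site search" finder is served
# by the new application; every other finder stays on the original API.
SITE_SEARCH_FINDER_PATH = "/search/all"

def search_backend_for(finder_path)
  finder_path == SITE_SEARCH_FINDER_PATH ? :search_api_v2 : :search_api
end
```

Keeping the decision to a single exact path comparison reflects the tactical scope: no other finder is affected unless it is deliberately migrated.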
|
||
Nothing in this application precludes more use cases being migrated to it in the future, but for the | ||
time being, it is intentionally not a complete replacement for [Search API][search-api] (despite the | ||
"v2" name). | ||
|
||
See [Search API compatibility](docs/search_api_compatibility.md) for more information about our | ||
compatibility design choices. | ||
|
||
## "Vertex" vs "Discovery Engine" | ||
The marketing name of the search product we use (_Google Vertex AI Search and Conversation_) | ||
changed several times while this application was first being developed, and some concepts are named | ||
differently in the Google Cloud Platform UI than in the underlying APIs themselves. | ||
|
||
We have chosen to exclusively use the more stable API naming (_Discovery Engine_, _engine_ instead | ||
of _app_, etc.) throughout the codebase and documentation to avoid having to rename things as the | ||
product reached general availability, but you may see the terms "Vertex" or "Vertex Search" as well | ||
as some other marketing terms used in some project artefacts. | ||
|
||
## Related projects | ||
- [`finder-frontend`][finder-frontend]: Displays results from this application's API depending on | ||
the "finder" in use and some other conditions | ||
- [`search-api`][search-api]: The original Search API, a subset of which this application's API | ||
replicates | ||
- [`search-v2-infrastructure`][search-v2-infrastructure]: Provisions infrastructure for Discovery | ||
Engine including cloud resources and event ingestion for continuous training of the search | ||
engine | ||
- [`search-v2-evaluator`][search-v2-evaluator]: Internal tool to test and rate search results | ||
|
||
|
||
[vertex-docs]: https://cloud.google.com/generative-ai-app-builder/docs/introduction | ||
[search-all-finder]: https://www.gov.uk/search/all | ||
[govuk-docker]: https://github.com/alphagov/govuk-docker | ||
[finder-frontend]: https://github.com/alphagov/finder-frontend | ||
[search-api]: https://github.com/alphagov/search-api | ||
[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure | ||
[search-v2-evaluator]: https://github.com/alphagov/search-v2-evaluator |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Engine customisation | ||
Out of the box, using a next-generation semantic search product already gives us improved results | ||
over | ||
|
||
## Event ingestion | ||
We capture interaction data from opted-in users in Google Analytics. This data is processed and | ||
ingested into Discovery Engine in bulk on a daily basis to help train the model on what content | ||
users are most likely to be looking for. | ||
|
||
This is orchestrated by a set of serverless GCP Cloud Functions and associated plumbing in | ||
[search-v2-infrastructure][search-v2-infrastructure]. | ||
|
||
## Boosting | ||
We apply boosting to documents based on certain criteria. This is a way of asking Discovery Engine | ||
to prioritise some results over others while still optimising for overall relevance. | ||
|
||
### Always active | ||
"Always active" boosts are defined as Discovery Engine serving controls. These apply to all searches | ||
regardless of the user's query or other factors and are defined in | ||
[search-v2-infrastructure][search-v2-infrastructure]. | ||
|
||
We know that certain types of content are much more likely to be useful to the average user than | ||
others, and we want to prioritise them unless the user's query is extremely specific. For example, a user | ||
searching for "income tax" will be more interested in services and public-facing information around | ||
income tax than internal HMRC manuals. | ||
|
||
### Query-time | ||
Some boosts only make sense to apply at the time a query is made. These are defined in this | ||
application as part of the API and include: | ||
- "best bets", which heavily promote one or more specific pieces of content when a user searches for | ||
a specific search term (see [best_bets.yml](../config/best_bets.yml)) | ||
- boosting for news based on recency, to make sure breaking news is promoted and old news is demoted | ||
(see [news_recency_boost.rb](../app/services/discovery_engine/news_recency_boost.rb)) | ||
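
The two query-time boosts listed above can be sketched in Ruby as follows. This is a minimal illustration rather than the application's implementation: the `BEST_BETS` data, both method names, the age thresholds, and the boost values are all invented, and the real structure of `best_bets.yml` and the logic in `news_recency_boost.rb` may differ. The boost values assume a scale from -1.0 (strongest demotion) to 1.0 (strongest promotion).

```ruby
require "date"

# Invented example data: normalised search term => content paths to promote.
# The actual format of config/best_bets.yml is not shown here and may differ.
BEST_BETS = {
  "example term" => ["/example/promoted-page"],
}.freeze

# Returns the paths to promote for a query, or an empty array if the
# (case-insensitive, whitespace-trimmed) query has no best bets.
def best_bets_for(query, best_bets = BEST_BETS)
  best_bets.fetch(query.strip.downcase, [])
end

# Returns an illustrative boost for a news item based on its age.
# Thresholds and values are made up for this sketch.
def news_recency_boost(published_on, today = Date.today)
  age_in_days = (today - published_on).to_i

  return 0.2 if age_in_days <= 7    # promote breaking news
  return -0.5 if age_in_days > 730  # demote old news

  0.0                               # leave everything else untouched
end
```

Both methods take their data or reference date as optional arguments so that they can be exercised without depending on configuration files or the current date.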
|
||
## Synonyms | ||
As a semantic search engine, Discovery Engine needs less synonym configuration than a more | ||
traditional "bag of words" keyword search engine. | ||
|
||
Still, there are certain domain synonyms that we can't expect a general-purpose model to know about, so we define a set of synonyms in [search-v2-infrastructure][search-v2-infrastructure]. | ||
|
||
[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure |