Improve README and add engine customisation docs
csutter committed Nov 21, 2023
1 parent 9f315a8 commit d4c53ca
Showing 2 changed files with 130 additions and 1 deletion.
90 changes: 89 additions & 1 deletion README.md
@@ -1,2 +1,90 @@
# search-api-v2
Future API to search for content on GOV.UK (not yet live)
API and synchronisation worker for general site search on GOV.UK

This application powers the new site search for GOV.UK using Google Cloud Platform (GCP)'s [Vertex
AI Search][vertex-docs] ("Discovery Engine") product as its underlying search engine. It provides
two core pieces of functionality:
- An API that is "minimally compatible" with the existing `search-api` REST interface to the extent
necessary to power the ["site search" (`/search/all`) finder][search-all-finder].
- A synchronisation worker that receives content updates from the Publishing API message queue and
updates the Discovery Engine dataset accordingly.

## Local development
The official way of running this application locally is through [GOV.UK Docker][govuk-docker], where
a project is defined for it. Because this application is deeply integrated with a SaaS product, you
will have to have access to a GCP Discovery Engine engine to be able to do anything more meaningful
than running the test suite.

If you work on the GOV.UK team, you should be able to add a development engine for yourself through
[`search-v2-infrastructure`][search-v2-infrastructure] and configure your local setup accordingly.

Otherwise, you can create your own Discovery Engine engine in GCP and provide the engine's serving
config path and datastore branch.

You can then run the application from within the `govuk-docker` repository directory as follows:

```bash
# Add these to your "dotfiles" for convenience, or just export them in your terminal session if you
# prefer:
export DISCOVERY_ENGINE_SERVING_CONFIG=...
export DISCOVERY_ENGINE_DATASTORE_BRANCH=...

make search-api-v2
govuk-docker up -d search-api-v2-app # or search-api-v2-lite if you just want to run tests
```
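
If you want to confirm that both variables are visible to Ruby before starting the application, a
fail-fast check along the following lines may help. This is a minimal sketch: the helper name
`discovery_engine_settings` is hypothetical, and it only assumes the two variable names shown above,
not how the application itself reads its configuration.

```ruby
# Hypothetical helper, not part of this application: collects the two
# Discovery Engine settings and fails early if either one is blank.
REQUIRED_SETTINGS = %w[
  DISCOVERY_ENGINE_SERVING_CONFIG
  DISCOVERY_ENGINE_DATASTORE_BRANCH
].freeze

# `env` defaults to the process environment, but any Hash-like object works,
# which keeps the check easy to exercise in isolation.
def discovery_engine_settings(env = ENV)
  missing = REQUIRED_SETTINGS.select { |name| env[name].to_s.strip.empty? }
  unless missing.empty?
    raise ArgumentError, "Missing environment variables: #{missing.join(', ')}"
  end

  {
    serving_config: env.fetch("DISCOVERY_ENGINE_SERVING_CONFIG"),
    datastore_branch: env.fetch("DISCOVERY_ENGINE_DATASTORE_BRANCH"),
  }
end
```

Calling `discovery_engine_settings` with no arguments checks the current shell environment and
raises an `ArgumentError` naming whichever variable is missing.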

## Design goals and `search-api-v2` vs `search-api`
Our primary product goal was to improve the quality of search results for the majority of GOV.UK
users.

The existing search powers a significant number of use cases within GOV.UK, including numerous
user-facing "finder" pages handled by [Finder Frontend][finder-frontend] (among them the
`/search/all` finder that handles _the_ main search page, which we usually refer to as "site
search"), but also acts as a very general "everything but the kitchen sink" API for retrieving
content by a set of criteria.

We established that attempting to migrate all of these use cases with over a decade of accumulated
logic and edge cases would distract us from our primary goal and be a poor fit for a next-generation
search product anyway (the overwhelming majority of non-"site search" queries being trivial content
retrieval filtered by certain attributes that could be handled by a relational database).

We therefore made a tactical decision to focus on "site search" only and find the minimal subset of
the existing API contract that is necessary to render search results in this context, and update
[Finder Frontend][finder-frontend] to call our new application if and only if the user is using the
general "site search" finder.

Nothing in this application precludes more use cases being migrated to it in the future, but for the
time being, it is intentionally not a complete replacement for [Search API][search-api] (despite the
"v2" name).

See [Search API compatibility](docs/search_api_compatibility.md) for more information about our
compatibility design choices.

## "Vertex" vs "Discovery Engine"
The marketing name of the search product we use (_Google Vertex AI Search and Conversation_)
underwent several changes while this application was first being developed, and some concepts are
named differently in the Google Cloud Platform UI than in the underlying APIs themselves.

We have chosen to exclusively use the more stable API naming (_Discovery Engine_, _engine_ instead
of _app_, etc.) throughout the codebase and documentation to avoid having to rename things as the
product reached general availability, but you may see the terms "Vertex" or "Vertex Search" as well
as some other marketing terms used in some project artefacts.

## Related projects
- [`finder-frontend`][finder-frontend]: Displays results from this application's API depending on
the "finder" in use and some other conditions
- [`search-api`][search-api]: The original Search API, a subset of which this application's API
replicates
- [`search-v2-infrastructure`][search-v2-infrastructure]: Provisions infrastructure for Discovery
Engine including cloud resources and event ingestion for continuous training of the search
engine
- [`search-v2-evaluator`][search-v2-evaluator]: Internal tool to test and rate search results


[vertex-docs]: https://cloud.google.com/generative-ai-app-builder/docs/introduction
[search-all-finder]: https://www.gov.uk/search/all
[govuk-docker]: https://github.com/alphagov/govuk-docker
[finder-frontend]: https://github.com/alphagov/finder-frontend
[search-api]: https://github.com/alphagov/search-api
[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure
[search-v2-evaluator]: https://github.com/alphagov/search-v2-evaluator
41 changes: 41 additions & 0 deletions docs/engine_customisation.md
@@ -0,0 +1,41 @@
# Engine customisation
Out of the box, using a next-generation semantic search product already gives us improved results
over

## Event ingestion
We capture interaction data from opted-in users in Google Analytics. This data is processed and
ingested into Discovery Engine in bulk on a daily basis to help train the model on what content
users are most likely to be looking for.

This is orchestrated by a set of serverless GCP Cloud Functions and associated plumbing in
[search-v2-infrastructure][search-v2-infrastructure].
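
To give a rough idea of the kind of transformation involved, the sketch below turns one processed
analytics row into a user event and serialises a batch as newline-delimited JSON. It is illustrative
only: the `AnalyticsRow` structure and its columns are hypothetical, the real pipeline lives in
`search-v2-infrastructure` and may work quite differently, and the event field names (`eventType`,
`userPseudoId`, `eventTime`, `searchInfo`, `documents`) reflect our reading of the Discovery Engine
user event schema and should be checked against Google's documentation.

```ruby
require "json"
require "time"

# Hypothetical shape of one processed analytics row; the real pipeline's
# columns are not described in this document.
AnalyticsRow = Struct.new(
  :pseudonymous_id, :occurred_at, :search_term, :viewed_content_id,
  keyword_init: true
)

# Builds a user event hash from a row. A row with a viewed content ID is
# treated as a "view-item" event, anything else as a "search" event.
def to_user_event(row)
  event = {
    eventType: row.viewed_content_id ? "view-item" : "search",
    userPseudoId: row.pseudonymous_id,
    eventTime: row.occurred_at.utc.iso8601,
  }
  event[:searchInfo] = { searchQuery: row.search_term } if row.search_term
  event[:documents] = [{ id: row.viewed_content_id }] if row.viewed_content_id
  event
end

# Serialises a batch of rows as newline-delimited JSON, one event per line,
# which is a common format for bulk imports.
def to_jsonl(rows)
  rows.map { |row| JSON.generate(to_user_event(row)) }.join("\n")
end
```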

## Boosting
We apply boosting to documents based on certain criteria. This is a way of asking Discovery Engine
to prioritise some results over others while still optimising for overall relevance.

### Always active
"Always active" boosts are defined as Discovery Engine serving controls. These apply to all searches
regardless of the user's query or other factors and are defined in
[search-v2-infrastructure][search-v2-infrastructure].

We know that certain types of content are much more likely to be useful to the average user than
others, and we want to prioritise them unless their query is extremely specific. For example, a user
searching for "income tax" will be more interested in services and public-facing information around
income tax than internal HMRC manuals.
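
Conceptually, each always active boost pairs a filter expression with a boost amount. The sketch
below models that pairing for the income tax example. It makes several assumptions: the
`boost_control` helper is hypothetical, the `document_type` field and its values are invented for
illustration, the boost amounts are arbitrary, and the real definitions in
`search-v2-infrastructure` are not written in Ruby. The -1 to 1 range and the `ANY(...)` filter
syntax follow our understanding of Discovery Engine's boost controls.

```ruby
# Hypothetical builder for a boost definition: a filter expression selecting
# documents, plus a boost amount (positive promotes, negative demotes).
def boost_control(name:, filter:, boost:)
  unless (-1.0..1.0).cover?(boost)
    raise ArgumentError, "boost must be between -1 and 1, got #{boost}"
  end

  { display_name: name, boost_action: { filter: filter, boost: boost } }
end

# Illustrative values only; field names and amounts are assumptions.
ALWAYS_ACTIVE_BOOSTS = [
  boost_control(
    name: "Promote services",
    filter: 'document_type: ANY("transaction")',
    boost: 0.3
  ),
  boost_control(
    name: "Demote internal manuals",
    filter: 'document_type: ANY("hmrc_manual")',
    boost: -0.5
  ),
].freeze
```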

### Query-time
Some boosts only make sense to apply at the time a query is made. These are defined in this
application as part of the API and include:
- "best bets", which heavily promote one or more specific pieces of content when a user searches for
a specific search term (see [best_bets.yml](../config/best_bets.yml))
- boosting for news based on recency, to make sure breaking news is promoted and old news is demoted
(see [news_recency_boost.rb](../app/services/discovery_engine/news_recency_boost.rb))
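
The two query-time boosts above can be sketched as a function that assembles a boost specification
for a single search request. This is not the application's implementation: the `BEST_BETS` mapping
is a hypothetical stand-in for whatever `best_bets.yml` actually contains, the `link`,
`content_purpose_supergroup` and `public_timestamp` field names are assumptions, and the thresholds
and boost amounts are invented. The overall shape (a list of condition and boost pairs, with boosts
between -1 and 1) follows our understanding of Discovery Engine's search request boost
specification.

```ruby
# Hypothetical best bets mapping from a normalised query to promoted links.
BEST_BETS = {
  "self assessment" => ["/example/self-assessment"],
}.freeze

# Illustrative thresholds in seconds; not the application's real values.
RECENT_NEWS_WINDOW = 7 * 24 * 60 * 60
OLD_NEWS_THRESHOLD = 365 * 24 * 60 * 60

# Heavily promotes specific links when the query matches a best bet exactly
# (ignoring case and surrounding whitespace).
def best_bet_specs(query)
  links = BEST_BETS.fetch(query.strip.downcase, [])
  return [] if links.empty?

  quoted = links.map { |link| %("#{link}") }.join(",")
  [{ condition: "link: ANY(#{quoted})", boost: 1.0 }]
end

# Promotes news published within the recent window and demotes news older
# than the threshold; news in between is left untouched.
def news_recency_specs(now)
  recent_cutoff = (now - RECENT_NEWS_WINDOW).to_i
  old_cutoff = (now - OLD_NEWS_THRESHOLD).to_i
  news = 'content_purpose_supergroup: ANY("news_and_communications")'

  [
    { condition: "#{news} AND public_timestamp: IN(#{recent_cutoff}i, *)", boost: 0.2 },
    { condition: "#{news} AND public_timestamp: IN(*, #{old_cutoff}e)", boost: -0.5 },
  ]
end

# Combines both kinds of boost into one specification for a search request.
def boost_spec(query, now: Time.now)
  { condition_boost_specs: best_bet_specs(query) + news_recency_specs(now) }
end
```

Because the recency cutoffs depend on the current time, they have to be computed per request, which
is why this kind of boost cannot be defined once as an always active serving control.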

## Synonyms
As a semantic search engine, Discovery Engine needs less synonym configuration than a more
traditional "bag of words" keyword search engine.

Still, there are certain domain synonyms that we can't expect a general-purpose model to know about,
so we define a set of synonyms in [search-v2-infrastructure][search-v2-infrastructure].

[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure