Improve README and add engine customisation docs

alphagov · Nov 21, 2023 · d4c53ca
1 parent 9f315a8
Showing 2 changed files with 130 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,2 +1,90 @@
 # search-api-v2
-Future API to search for content on GOV.UK (not yet live)
+API and synchronisation worker for general site search on GOV.UK
+
+This application powers the new site search for GOV.UK using Google Cloud Platform (GCP)'s [Vertex
+AI Search][vertex-docs] ("Discovery Engine") product as its underlying search engine. It provides
+two core pieces of functionality:
+- An API that is "minimally compatible" with the existing `search-api` REST interface to the extent
+  necessary to power the ["site search" (`/search/all`) finder][search-all-finder].
+- A synchronisation worker that receives content updates from the Publishing API message queue and
+  updates the Discovery Engine dataset accordingly.
+
+## Local development
+The official way of running this application locally is through [GOV.UK Docker][govuk-docker], where
+a project is defined for it. Because this application is deeply integrated with a SaaS product, you
+will have to have access to a GCP Discovery Engine engine to be able to do anything more meaningful
+than running the test suite.
+
+If you work on the GOV.UK team, you should be able to add a development engine for yourself through
+[`search-v2-infrastructure`][search-v2-infrastructure] and configure your local setup accordingly.
+
+Otherwise, you can create your own Discovery Engine engine in GCP and provide the engine's serving
+config path and datastore branch.
+
+You can then run the application from within the `govuk-docker` repository directory as follows:
+
+```bash
+# Add these to your "dotfiles" for convenience, or just export them in your terminal session if you
+# prefer:
+export DISCOVERY_ENGINE_SERVING_CONFIG=...
+export DISCOVERY_ENGINE_DATASTORE_BRANCH=...
+
+make search-api-v2
+govuk-docker up -d search-api-v2-app # or search-api-v2-lite if you just want to run tests
+```
+
+## Design goals and `search-api-v2` vs `search-api`
+Our primary product goal was to improve the quality of search results for the majority of GOV.UK
+users.
+
+The existing search powers a significant number of use cases within GOV.UK, including numerous
+user-facing "finder" pages handled by [Finder Frontend][finder-frontend] (among them the
+`/search/all` finder that handles _the_ main search page which we usually refer to as "site
+search"), but also acts as a very general "everything but the kitchen sink" API for retrieving
+content by a set of criteria.
+
+We established that attempting to migrate all of these use cases with over a decade of accumulated
+logic and edge cases would distract us from our primary goal and be a poor fit for a next-generation
+search product anyway (the overwhelming majority of non-"site search" queries being trivial content
+retrieval filtered by certain attributes that could be handled by a relational database).
+
+We therefore made a tactical decision to focus on "site search" only and find the minimal subset of
+the existing API contract that is necessary to render search results in this context, and update
+[Finder Frontend][finder-frontend] to call our new application if and only if the user is using the
+general "site search" finder.
+
+Nothing in this application precludes more use cases being migrated to it in the future, but for the
+time being, it is intentionally not a complete replacement for [Search API][search-api] (despite the
+"v2" name).
+
+See [Search API compatibility](docs/search_api_compatibility.md) for more information about our
+compatibility design choices.
+
+## "Vertex" vs "Discovery Engine"
+The marketing name of the search product we use (_Google Vertex AI Search and Conversation_)
+changed several times while this application was first being developed, and some concepts are
+named differently in the Google Cloud Platform UI than in the underlying APIs themselves.
+
+We have chosen to exclusively use the more stable API naming (_Discovery Engine_, _engine_ instead
+of _app_, etc.) throughout the codebase and documentation to avoid having to rename things as the
+product reached general availability, but you may see the terms "Vertex" or "Vertex Search" as well
+as some other marketing terms used in some project artefacts.
+
+## Related projects
+- [`finder-frontend`][finder-frontend]: Displays results from this application's API depending on
+      the "finder" in use and some other conditions
+- [`search-api`][search-api]: The original Search API, a subset of which this application's API
+      replicates
+- [`search-v2-infrastructure`][search-v2-infrastructure]: Provisions infrastructure for Discovery
+      Engine including cloud resources and event ingestion for continuous training of the search
+      engine
+- [`search-v2-evaluator`][search-v2-evaluator]: Internal tool to test and rate search results
+
+
+[vertex-docs]: https://cloud.google.com/generative-ai-app-builder/docs/introduction
+[search-all-finder]: https://www.gov.uk/search/all
+[govuk-docker]: https://github.com/alphagov/govuk-docker
+[finder-frontend]: https://github.com/alphagov/finder-frontend
+[search-api]: https://github.com/alphagov/search-api
+[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure
+[search-v2-evaluator]: https://github.com/alphagov/search-v2-evaluator
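
Reviewer note, not part of the commit: the two environment variables exported in the "Local development" section are the only configuration the README names, so a boot-time check along these lines might make a missing value fail fast. This is a minimal sketch. The variable names come from the README, while `DiscoveryEngineSettings`, `from_env`, and the error message are hypothetical and do not come from the search-api-v2 codebase.

```ruby
# Hypothetical helper for illustration only; not taken from search-api-v2.
module DiscoveryEngineSettings
  # Variable names as given in the README's "Local development" section.
  REQUIRED_VARS = %w[
    DISCOVERY_ENGINE_SERVING_CONFIG
    DISCOVERY_ENGINE_DATASTORE_BRANCH
  ].freeze

  # Returns a hash of the required settings read from `env` (ENV by default).
  # Raises KeyError naming every variable that is unset or blank.
  def self.from_env(env = ENV)
    missing = REQUIRED_VARS.select { |name| env[name].to_s.strip.empty? }
    unless missing.empty?
      raise KeyError, "Missing Discovery Engine settings: #{missing.join(', ')}"
    end

    REQUIRED_VARS.to_h { |name| [name, env[name]] }
  end
end
```

Passing a plain hash instead of `ENV` keeps the check easy to exercise in tests without touching the real environment.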
diff --git a/docs/engine_customisation.md b/docs/engine_customisation.md
@@ -0,0 +1,41 @@
+# Engine customisation
+Out of the box, using a next-generation semantic search product already gives us improved results
+over the existing search.
+
+## Event ingestion
+We capture interaction data from opted-in users in Google Analytics. This data is processed and
+ingested into Discovery Engine in bulk on a daily basis to help train the model on what content
+users are most likely to be looking for.
+
+This is orchestrated by a set of serverless GCP Cloud Functions and associated plumbing in
+[search-v2-infrastructure][search-v2-infrastructure].
+
+## Boosting
+We apply boosting to documents based on certain criteria. This is a way of asking Discovery Engine
+to prioritise some results over others while still optimising for overall relevance.
+
+### Always active
+"Always active" boosts are defined as Discovery Engine serving controls. These apply to all searches
+regardless of the user's query or other factors and are defined in
+[search-v2-infrastructure][search-v2-infrastructure].
+
+We know that certain types of content are much more likely to be useful to the average user than
+others, and we want to prioritise them unless their query is extremely specific. For example, a user
+searching for "income tax" will be more interested in services and public-facing information around
+income tax than internal HMRC manuals.
+
+### Query-time
+Some boosts only make sense to apply at the time a query is made. These are defined in this
+application as part of the API and include:
+- "best bets", which heavily promote one or more specific pieces of content when a user searches for
+  a specific search term (see [best_bets.yml](../config/best_bets.yml))
+- boosting for news based on recency, to make sure breaking news is promoted and old news is demoted
+  (see [news_recency_boost.rb](../app/services/discovery_engine/news_recency_boost.rb))
+
+## Synonyms
+As a semantic search engine, Discovery Engine needs less synonym configuration than a more
+traditional "bag of words" keyword search engine.
+
+Still, there are certain domain-specific synonyms that we can't expect a general-purpose model to know about, so we define a set of synonyms in [search-v2-infrastructure][search-v2-infrastructure].
+
+[search-v2-infrastructure]: https://github.com/alphagov/search-v2-infrastructure
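
Reviewer note, not part of the commit: the two query-time boosts described in `docs/engine_customisation.md` could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `best_bets.yml` or `news_recency_boost.rb`. The module name, the shape of the best bets mapping, the 7-day and 365-day thresholds, and the ±0.5 boost values are all hypothetical. The only external facts relied on are that a Discovery Engine condition boost pairs a filter `condition` string with a `boost` between -1.0 and 1.0, and that `document_id: ANY(...)` is a valid condition.

```ruby
# Hypothetical sketch; none of these names are taken from search-api-v2.
module QueryTimeBoostSketch
  # Assumed shape of a best bets config: normalised query => document IDs.
  BEST_BETS = {
    "example query" => %w[example-document-id],
  }.freeze

  SECONDS_PER_DAY = 86_400
  # Assumed thresholds: news up to a week old gets the full promotion,
  # news a year old or more gets the full demotion.
  RECENT_DAYS = 7
  OLD_DAYS = 365

  module_function

  # Returns a strong positive boost for the documents listed against the
  # query (ignoring case and surrounding whitespace), or nil if none match.
  def best_bet_boost(query, best_bets = BEST_BETS)
    ids = best_bets[query.to_s.strip.downcase]
    return nil if ids.nil? || ids.empty?

    quoted = ids.map { |id| %("#{id}") }.join(", ")
    { condition: "document_id: ANY(#{quoted})", boost: 1.0 }
  end

  # Maps a news item's age onto a boost between 0.5 (recent) and -0.5 (old),
  # falling linearly between the two thresholds.
  def news_recency_boost(published_at, now: Time.now)
    age_days = (now - published_at) / SECONDS_PER_DAY
    return 0.5 if age_days <= RECENT_DAYS
    return -0.5 if age_days >= OLD_DAYS

    0.5 - (age_days - RECENT_DAYS) / (OLD_DAYS - RECENT_DAYS).to_f
  end
end
```

Keeping the recency boost well inside the -1.0 to 1.0 range leaves room for a best bet to outrank it, which matches the document's description of best bets as the heavier promotion.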