Update documentation

alphagov · Mar 19, 2020 · 2a614ce
1 parent 413ea6a
Showing 6 changed files with 76 additions and 55 deletions.
diff --git a/README.md b/README.md
@@ -25,12 +25,6 @@ You can also find some examples in the blog post:
 The instructions will help you to get Search API running
 locally on your machine.
 
-### Dependencies
-
-- [Elasticsearch](https://github.com/elastic/elasticsearch) - "You Know, for Search...".
-- [Redis](https://redis.io/) - used by indexing workers.
-- [AWS Sagemaker](https://aws.amazon.com/sagemaker/) - used for [search relevancy](docs/relevancy.md)
-
 ### Prerequisites
 
 Install [govuk-docker](https://github.com/alphagov/govuk-docker)!
@@ -99,6 +93,21 @@ It does some clever stuff at both parts, but that's the meat of it.
 Read the [documentation](/doc) to find out [how documents are indexed](doc/indexing.md)
 or [how documents are retrieved](doc/how-search-works.md).
 
+### Dependencies
+
+Search API depends on other services in order to index documents and provide
+relevant search results:
+
+- [Elasticsearch](https://github.com/elastic/elasticsearch) - "You Know, for Search...".
+- [Redis](https://redis.io/) - used by indexing workers.
+- [AWS Sagemaker](https://aws.amazon.com/sagemaker/) (optional) - used for [search relevancy](docs/relevancy.md)
+
+If you use govuk-docker locally, the required dependencies will be started
+automatically when you start Search API. You don't need to set these up yourself.
+
+See the [learning to rank documentation](doc/learning-to-rank.md) for
+guidance on how to run the ranking model locally.
+
 ### Additional Docs
 
 - [New indexing process](doc/new-indexing-process.md): how to update a format to use the new indexing process

diff --git a/doc/documents.md b/doc/documents.md
@@ -1,6 +1,6 @@
 # Documents API (to be deprecated)
 
-> **Note**: Once whitehall and Search Admin are using the [new indexing process](doc/new-indexing-process.md),
+> **Note**: Once whitehall and Search Admin are using the [new indexing process](new-indexing-process.md),
 the documents API will be removed and search API will consume only from the publishing API.
 
 ### `POST /:index/documents`

diff --git a/doc/how-search-works.md b/doc/how-search-works.md
@@ -11,13 +11,12 @@ On receiving a request to `/search` (a search query), Search API will parse
 the query, construct an Elasticsearch query, and then retrieve documents
 from Elasticsearch.
 
-## Relevancy
+Search API provides a simplified API so that other applications in the GOV.UK
+stack don't need to know how to construct Elasticsearch queries.
 
-If a search query requests that search results be ordered by relevance to the
-query, Search API will attempt to order the search results in the most
-relevant way possible.
+## Relevancy
 
-See the [relevancy documentation](doc/relevancy.md) to learn more about how
+See the [relevancy documentation](relevancy.md) to learn more about how
 Search API determines how relevant a document is to a query.
 
 ### Reranking
@@ -28,7 +27,7 @@ Elasticsearch, the results are re-ranked by a machine learning model.
 This process ensures that we show the most relevant documents at the top
 of the search results.
 
-See the [learning to rank documentation](doc/learning-to-rank.md) to learn
+See the [learning to rank documentation](learning-to-rank.md) to learn
 more about the reranking model.
 
 ## Evaluating search quality
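
Note on the `doc/how-search-works.md` changes above: the retrieve-then-rerank flow they describe can be sketched in a few lines of Ruby. This is a minimal illustration, not Search API's implementation. The `Document` struct, the `popularity` feature, the sample scores, and the `TOY_MODEL` lambda are all made up for the example and stand in for Elasticsearch retrieval and the real learning to rank model. The `es_score` and `combined_score` field names follow `doc/relevancy.md`.

```ruby
# Hypothetical document shape; only es_score and combined_score are
# names taken from the Search API documentation.
Document = Struct.new(:title, :es_score, :popularity, :combined_score,
                      keyword_init: true)

# Made-up sample data for illustration only.
DOCS = [
  Document.new(title: "Harry Potter", es_score: 1.0, popularity: 0.1),
  Document.new(title: "Harry Kane", es_score: 0.5, popularity: 0.9),
  Document.new(title: "Ron Weasley", es_score: 0.05, popularity: 0.2),
].freeze

# Stand-in for the pre-trained model: es_score is one feature among
# several, so the final order can differ from the es_score order.
TOY_MODEL = ->(doc) { doc.es_score + doc.popularity }

# Stage 1: keep only the top scoring documents by es_score.
def retrieve(documents, limit:)
  documents.max_by(limit) { |doc| doc.es_score }
end

# Stage 2: score each retrieved document with the model and sort by
# the resulting combined_score, highest first.
def rerank(documents, model)
  documents
    .map { |doc| doc.dup.tap { |d| d.combined_score = model.call(d) } }
    .sort_by { |doc| -doc.combined_score }
end

rerank(retrieve(DOCS, limit: 2), TOY_MODEL).each do |doc|
  puts "#{doc.title}: #{doc.combined_score}"
end
```

Keeping retrieval and reranking as separate steps mirrors the documentation: the model only ever sees the documents that retrieval passes on, so a document with a low `es_score` cannot be rescued by the reranker.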

diff --git a/doc/indexing.md b/doc/indexing.md
@@ -29,39 +29,39 @@ documents are added to the search indexes -->
 
 There are two ways documents get added to a search index:
 
-1. HTTP requests to Search API's [Documents API](doc/documents.md) (deprecated)
+1. HTTP requests to Search API's [Documents API](documents.md) (deprecated)
 2. Search API subscribes to RabbitMQ messages from the
 	 [Publishing API](https://github.com/alphagov/publishing-api).
 
-Search API search results are weighted by [popularity](doc/popularity.md). We
+Search API search results are weighted by [popularity](popularity.md). We
 rebuild the index nightly to incorporate the latest analytics.
 
 #### Publishing API integration
 
 Search API subscribes to a RabbitMQ queue of updates from publishing-api. This
 still requires Sidekiq to be running.
 
-		bundle exec rake message_queue:insert_data_into_govuk
+	bundle exec rake message_queue:insert_data_into_govuk
 
 There is also a separate process that listens to only 'links' updates from the publishing API. This is used for updating old indexes that are populated through the '/documents' API (`government`, `detailed`) and can be removed once those indexes no longer exist.
 
-    bundle exec rake message_queue:listen_to_publishing_queue
+  bundle exec rake message_queue:listen_to_publishing_queue
 
 ### Internal only APIs
 
 There are some other APIs that are only exposed internally:
 
-- [doc/content-api.md](doc/content-api.md) for the `/content/*` endpoint.
-- [doc/documents.md](doc/documents.md) for the `*/documents/` endpoint.
+- [doc/content-api.md](content-api.md) for the `/content/*` endpoint.
+- [doc/documents.md](documents.md) for the `*/documents/` endpoint.
 
 These are used by [search admin](https://github.com/alphagov/search-admin/).
 
 ## Schemas
 
-See [schemas](doc/schemas.md) for more detail.
+See [schemas](schemas.md) for more detail.
 
 ### Changing the schema/Reindexing
 
 After changing the schema, you'll need to recreate the index. This reindexes documents from the existing index.
 
-    SEARCH_INDEX=all bundle exec rake search:migrate_schema
+  SEARCH_INDEX=all bundle exec rake search:migrate_schema
diff --git a/doc/relevancy.md b/doc/relevancy.md
@@ -3,66 +3,49 @@
 This document explains how relevancy ordering works when performing a
 search.
 
-<!-- TODO: Update this document with Learning to Rank information -->
-
 ## Contents
 
 1. [What is relevancy?](#what-is-relevancy)
 2. [What impacts relevancy?](#what-impacts-relevancy)
+3. [What impacts document retrieval?](#what-impacts-document-retrieval)
    1. [Boosting](#boosting)
    2. [Best and worst bets](#best-and-worst-bets)
    3. [Stopwords](#stopwords)
    4. [Synonyms](#synonyms)
    5. [Categorisation of fields](#categorisation-of-fields)
    6. [Analyzers](#analyzers)
    7. [Excluded formats](#excluded-formats)
-3. [Possible problems with queries and relevance](#possible-problems-with-queries-and-relevance)
-4. [Finding underperforming queries](#finding-underperforming-queries)
+4. [Possible problems with queries and relevance](#possible-problems-with-queries-and-relevance)
+5. [Finding underperforming queries](#finding-underperforming-queries)
 
 
 ## What is relevancy?
 
-A list of documents returned by Search API will include an `es_score`
-on every document.
+A list of documents returned by Search API will include an `es_score` and
+a `combined_score` on every document.
 
 ```ruby
 # Response for a search for 'Harry Potter'
 [
-  { title: "Harry Potter", es_score: 1 },
-  { title: "Harry Kane", es_score: 0.5 },
-  { title: "Ron Weasley", es_score: 0.05 }
+  { title: "Harry Potter", combined_score: 3 },
+  { title: "Harry Kane", combined_score: 2 },
+  { title: "Ron Weasley", combined_score: 1 }
 ]
 ```
 
-The `es_score` value is used for ranking results and represents how
+The `combined_score` is used for ranking results and represents how
 relevant we think a result is to your query.
 
-### Debugging es_score
-
-If you want to understand why a result has a given `es_score`, you can
-use the Elasticsearch [Explain API][explain].  This is exposed by the
-Search API.
-
-You can see the reasons behind an `es_score` by including the
-`debug=explain` query parameter in your query.  This will add an
-`_explanation` field to every result, similar to a SQL-like `EXPLAIN`.
-
-For example, see the explanation produced by [searching for "harry
-potter"][explain-example].  This shows an example of stemming, where
-"harry" becomes "harri".  This is due to the rule "replace suffix 'y'
-or 'Y' by 'i' if preceded by a non-vowel which is not the first letter
-of the word".  You can also see that text similarity scoring ([BM25][]
-in Elasticsearch 6) works by considering both term frequency and
-document frequency.
+## What impacts relevancy?
 
-You can see the query Search API sends to Elasticsearch with the
-`debug=show_query` parameter.  Debug parameters can be combined, like
-`debug=show_query,explain`.  The debug output is verbose, so sometimes
-restricting to only a handful of results, with `count=0` or `count=1`,
-is useful.
+Once Search API has [retrieved](#what-impacts-document-retrieval) the
+top scoring documents from the search indexes, it ranks the results
+in order of relevance using a pre-trained model.
 
+See the [learning to rank](learning-to-rank.md) documentation for
+more details.
 
-## What impacts relevancy?
+## What impacts document retrieval?
 
 Out of the box, Elasticsearch comes with a decent scoring algorithm.
 They have a [guide on scoring relevancy][scoring] which is worth
@@ -72,6 +55,9 @@ We've done some work in Search API to [tune relevancy][relevancy],
 overriding the default Elasticsearch behaviour, which we go into
 below.
 
+The following factors are combined into a single `es_score`. The
+top scoring documents will be retrieved for ranking.
+
 ### Boosting
 
 We don't only use the query relevancy score to rank documents, we
@@ -390,6 +376,33 @@ We also exclude some paths, such as `/random`, `/homepage`, and `/humans.txt`.
 
 These no-indexed paths and formats are defined in [`config/govuk_index/migrated_formats.yaml`](https://github.com/alphagov/search-api/blob/master/config/govuk_index/migrated_formats.yaml).
 
+### Debugging es_score
+
+If you want to understand why a result has a given `es_score`, you can
+use the Elasticsearch [Explain API][explain].  This is exposed by the
+Search API.
+
+Please note that `es_score` is just one feature used by the reranking
+model; we don't rank results using `es_score` alone.
+
+You can see the reasons behind an `es_score` by including the
+`debug=explain` query parameter in your query.  This will add an
+`_explanation` field to every result, similar to a SQL-like `EXPLAIN`.
+
+For example, see the explanation produced by [searching for "harry
+potter"][explain-example].  This shows an example of stemming, where
+"harry" becomes "harri".  This is due to the rule "replace suffix 'y'
+or 'Y' by 'i' if preceded by a non-vowel which is not the first letter
+of the word".  You can also see that text similarity scoring ([BM25][]
+in Elasticsearch 6) works by considering both term frequency and
+document frequency.
+
+You can see the query Search API sends to Elasticsearch with the
+`debug=show_query` parameter.  Debug parameters can be combined, like
+`debug=show_query,explain`.  The debug output is verbose, so sometimes
+restricting to only a handful of results, with `count=0` or `count=1`,
+is useful.
+
 ## Possible problems with queries and relevance
 
 ### The way we handle longer search terms is broken
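
Note on the `doc/relevancy.md` changes above: the debug parameters described under "Debugging es_score" can be combined into a single request URL. The following is a minimal sketch, assuming a hypothetical host and assuming the search term is passed as `q` (neither is stated in this diff). The `/search` path and the `debug` and `count` parameters come from the documentation itself.

```ruby
require "uri"

# Hypothetical host: replace with wherever your Search API is running.
SEARCH_ENDPOINT = "http://search-api.example.test/search"

# Builds a /search URL with debug output enabled.
# Using `q` as the query parameter name is an assumption.
def debug_search_url(query, debug: %w[show_query explain], count: 1)
  params = { q: query, debug: debug.join(","), count: count }
  "#{SEARCH_ENDPOINT}?#{URI.encode_www_form(params)}"
end

puts debug_search_url("harry potter")
```

Defaulting `count` to 1 follows the advice in the documentation to restrict the number of results, since the debug output is verbose.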

diff --git a/doc/search-quality-metrics.md b/doc/search-quality-metrics.md
@@ -21,7 +21,7 @@ tell us how search is performing against relevance judgements.
 ## Offline metrics
 
 Our main offline metric is nDCG. We measure this before and after
-re-ranking by our [learning to rank model](doc/learning-to-rank.md).
+re-ranking by our [learning to rank model](learning-to-rank.md).
 
 We use Elasticsearch's [Ranking Evaluation API](ranking_evaluation_api)
 to assess the quality of results retrieved from Elasticsearch prior