Update documentation
Bill Franklin committed Mar 19, 2020
1 parent 413ea6a commit 2a614ce
Showing 6 changed files with 76 additions and 55 deletions.
21 changes: 15 additions & 6 deletions README.md
@@ -25,12 +25,6 @@ You can also find some examples in the blog post:
The instructions will help you to get Search API running
locally on your machine.

### Dependencies

- [Elasticsearch](https://github.com/elastic/elasticsearch) - "You Know, for Search...".
- [Redis](https://redis.io/) - used by indexing workers.
- [AWS Sagemaker](https://aws.amazon.com/sagemaker/) - used for [search relevancy](docs/relevancy.md)

### Prerequisites

Install [govuk-docker](https://github.com/alphagov/govuk-docker)!
@@ -99,6 +93,21 @@ It does some clever stuff at both parts, but that's the meat of it.
Read the [documentation](/doc) to find out [how documents are indexed](doc/indexing.md)
or [how documents are retrieved](doc/how-search-works.md).

### Dependencies

Search API depends on other services in order to index documents and provide
relevant search results:

- [Elasticsearch](https://github.com/elastic/elasticsearch) - "You Know, for Search...".
- [Redis](https://redis.io/) - used by indexing workers.
- [AWS Sagemaker](https://aws.amazon.com/sagemaker/) (optional) - used for [search relevancy](doc/relevancy.md)

If you use govuk-docker locally, the required dependencies will be started
automatically when you start Search API. You don't need to set these up yourself.

See the [learning to rank documentation](doc/learning-to-rank.md) for
guidance on how to run the ranking model locally.

### Additional Docs

- [New indexing process](doc/new-indexing-process.md): how to update a format to use the new indexing process
2 changes: 1 addition & 1 deletion doc/documents.md
@@ -1,6 +1,6 @@
# Documents API (to be deprecated)

> **Note**: Once whitehall and Search Admin are using the [new indexing process](doc/new-indexing-process.md),
> **Note**: Once whitehall and Search Admin are using the [new indexing process](new-indexing-process.md),
the documents API will be removed and search API will consume only from the publishing API.

### `POST /:index/documents`
11 changes: 5 additions & 6 deletions doc/how-search-works.md
@@ -11,13 +11,12 @@ On receiving a request to `/search` (a search query), Search API will parse
the query, construct an Elasticsearch query, and then retrieve documents
from Elasticsearch.

## Relevancy
Search API provides a simplified API so that other applications in the GOV.UK
stack don't need to know how to construct Elasticsearch queries.

If a search query requests that search results be ordered by relevance to the
query, Search API will attempt to order the search results in the most
relevant way possible.
## Relevancy

See the [relevancy documentation](doc/relevancy.md) to learn more about how
See the [relevancy documentation](relevancy.md) to learn more about how
Search API determines how relevant a document is to a query.

### Reranking
@@ -28,7 +27,7 @@ Elasticsearch, the results are re-ranked by a machine learning model.
This process ensures that we show the most relevant documents at the top
of the search results.

See the [learning to rank documentation](doc/learning-to-rank.md) to learn
See the [learning to rank documentation](learning-to-rank.md) to learn
more about the reranking model.

## Evaluating search quality
16 changes: 8 additions & 8 deletions doc/indexing.md
@@ -29,39 +29,39 @@ documents are added to the search indexes -->

There are two ways documents get added to a search index:

1. HTTP requests to Search API's [Documents API](doc/documents.md) (deprecated)
1. HTTP requests to Search API's [Documents API](documents.md) (deprecated)
2. Search API subscribes to RabbitMQ messages from the
[Publishing API](https://github.com/alphagov/publishing-api).

Search API search results are weighted by [popularity](doc/popularity.md). We
Search API search results are weighted by [popularity](popularity.md). We
rebuild the index nightly to incorporate the latest analytics.

#### Publishing API integration

Search API subscribes to a RabbitMQ queue of updates from publishing-api. This
still requires Sidekiq to be running.

bundle exec rake message_queue:insert_data_into_govuk
bundle exec rake message_queue:insert_data_into_govuk

There is also a separate process that listens to only 'links' updates from the publishing API. This is used for updating old indexes that are populated through the '/documents' API (`government`, `detailed`) and can be removed once those indexes no longer exist.

bundle exec rake message_queue:listen_to_publishing_queue
bundle exec rake message_queue:listen_to_publishing_queue

### Internal only APIs

There are some other APIs that are only exposed internally:

- [doc/content-api.md](doc/content-api.md) for the `/content/*` endpoint.
- [doc/documents.md](doc/documents.md) for the `*/documents/` endpoint.
- [doc/content-api.md](content-api.md) for the `/content/*` endpoint.
- [doc/documents.md](documents.md) for the `*/documents/` endpoint.

These are used by [search admin](https://github.com/alphagov/search-admin/).

## Schemas

See [schemas](doc/schemas.md) for more detail.
See [schemas](schemas.md) for more detail.

### Changing the schema/Reindexing

After changing the schema, you'll need to recreate the index. This reindexes documents from the existing index.

SEARCH_INDEX=all bundle exec rake search:migrate_schema
SEARCH_INDEX=all bundle exec rake search:migrate_schema
79 changes: 46 additions & 33 deletions doc/relevancy.md
@@ -3,66 +3,49 @@
This document explains how relevancy ordering works when performing a
search.

<!-- TODO: Update this document with Learning to Rank information -->

## Contents

1. [What is relevancy?](#what-is-relevancy)
2. [What impacts relevancy?](#what-impacts-relevancy)
3. [What impacts document retrieval?](#what-impacts-document-retrieval)
1. [Boosting](#boosting)
2. [Best and worst bets](#best-and-worst-bets)
3. [Stopwords](#stopwords)
4. [Synonyms](#synonyms)
5. [Categorisation of fields](#categorisation-of-fields)
6. [Analyzers](#analyzers)
7. [Excluded formats](#excluded-formats)
3. [Possible problems with queries and relevance](#possible-problems-with-queries-and-relevance)
4. [Finding underperforming queries](#finding-underperforming-queries)
4. [Possible problems with queries and relevance](#possible-problems-with-queries-and-relevance)
5. [Finding underperforming queries](#finding-underperforming-queries)


## What is relevancy?

A list of documents returned by Search API will include an `es_score`
on every document.
A list of documents returned by Search API will include an `es_score` and
a `combined_score` on every document.

```ruby
# Response for a search for 'Harry Potter'
[
{ title: "Harry Potter", es_score: 1 },
{ title: "Harry Kane", es_score: 0.5 },
{ title: "Ron Weasley", es_score: 0.05 }
{ title: "Harry Potter", combined_score: 3 },
{ title: "Harry Kane", combined_score: 2 },
{ title: "Ron Weasley", combined_score: 1 }
]
```

The `es_score` value is used for ranking results and represents how
The `combined_score` is used for ranking results and represents how
relevant we think a result is to your query.

### Debugging es_score

If you want to understand why a result has a given `es_score`, you can
use the Elasticsearch [Explain API][explain]. This is exposed by the
Search API.

You can see the reasons behind an `es_score` by including the
`debug=explain` query parameter in your query. This will add an
`_explanation` field to every result, similar to a SQL-like `EXPLAIN`.

For example, see the explanation produced by [searching for "harry
potter"][explain-example]. This shows an example of stemming, where
"harry" becomes "harri". This is due to the rule "replace suffix 'y'
or 'Y' by 'i' if preceded by a non-vowel which is not the first letter
of the word". You can also see that text similarity scoring ([BM25][]
in Elasticsearch 6) works by considering both term frequency and
document frequency.
## What impacts relevancy?

You can see the query Search API sends to Elasticsearch with the
`debug=show_query` parameter. Debug parameters can be combined, like
`debug=show_query,explain`. The debug output is verbose, so sometimes
restricting to only a handful of results, with `count=0` or `count=1`,
is useful.
Once Search API has [retrieved](#what-impacts-document-retrieval) the
top scoring documents from the search indexes, it ranks the results
in order of relevance using a pre-trained model.

See the [learning to rank](learning-to-rank.md) documentation for
more details.
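
The two stages described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the `retrieve` and `rerank` helpers, the `other_feature` key and the toy lambda standing in for the pre-trained model are all hypothetical and are not part of Search API.

```ruby
# Stage 1 (retrieval): keep the `limit` documents with the highest es_score.
def retrieve(documents, limit)
  documents.max_by(limit) { |doc| doc[:es_score] }
end

# Stage 2 (ranking): score each retrieved document with a model and
# order the results by that score, highest first.
def rerank(documents, model)
  documents
    .map { |doc| doc.merge(combined_score: model.call(doc)) }
    .sort_by { |doc| -doc[:combined_score] }
end

# Hypothetical documents. `other_feature` is an invented stand-in for
# whatever else the real model looks at besides es_score.
documents = [
  { title: "A", es_score: 0.9, other_feature: 0.1 },
  { title: "B", es_score: 0.8, other_feature: 0.5 },
  { title: "C", es_score: 0.2, other_feature: 0.9 },
]

# Toy model: NOT the real learning to rank model, just a placeholder.
toy_model = ->(doc) { doc[:es_score] + doc[:other_feature] }

ranked = rerank(retrieve(documents, 2), toy_model)
puts ranked.map { |doc| doc[:title] }.inspect
```

In this sketch document "C" is never ranked because it does not survive retrieval, even though the toy model would have scored it above "A". The model only ever sees what the first stage returns.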

## What impacts relevancy?
## What impacts document retrieval?

Out of the box, Elasticsearch comes with a decent scoring algorithm.
They have a [guide on scoring relevancy][scoring] which is worth
@@ -72,6 +55,9 @@ We've done some work in Search API to [tune relevancy][relevancy],
overriding the default Elasticsearch behaviour, which we go into
below.

The following factors are combined into a single `es_score`. The
top scoring documents will be retrieved for ranking.

### Boosting

We don't only use the query relevancy score to rank documents, we
@@ -390,6 +376,33 @@ We also exclude some paths, such as `/random`, `/homepage`, and `/humans.txt`.

These no-indexed paths and formats are defined in [`config/govuk_index/migrated_formats.yaml`](https://github.com/alphagov/search-api/blob/master/config/govuk_index/migrated_formats.yaml).

### Debugging es_score

If you want to understand why a result has a given `es_score`, you can
use the Elasticsearch [Explain API][explain]. This is exposed by the
Search API.

Please note that `es_score` is just one feature used by the reranking
model; we don't rank results using `es_score` alone.

You can see the reasons behind an `es_score` by including the
`debug=explain` query parameter in your query. This will add an
`_explanation` field to every result, similar to a SQL-like `EXPLAIN`.

For example, see the explanation produced by [searching for "harry
potter"][explain-example]. This shows an example of stemming, where
"harry" becomes "harri". This is due to the rule "replace suffix 'y'
or 'Y' by 'i' if preceded by a non-vowel which is not the first letter
of the word". You can also see that text similarity scoring ([BM25][]
in Elasticsearch 6) works by considering both term frequency and
document frequency.
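
To make the quoted stemming rule concrete, here is a minimal Ruby sketch of that single rule. It is not the stemmer Elasticsearch runs; the method name `replace_trailing_y` is hypothetical, and the vowel set (a, e, i, o, u, y) is an assumption, since the rule quoted above does not define it.

```ruby
# Applies only the rule quoted above: replace a trailing 'y' or 'Y' with 'i'
# when the preceding letter is a non-vowel and is not the first letter
# of the word. A real stemmer applies many more rules than this one.
def replace_trailing_y(word)
  vowels = %w[a e i o u y] # assumed vowel set

  return word unless word.end_with?("y", "Y")
  # With fewer than three letters, the preceding letter is the first letter.
  return word if word.length < 3
  return word if vowels.include?(word[-2].downcase)

  "#{word[0...-1]}i"
end

%w[harry cry say by].each do |word|
  puts "#{word} -> #{replace_trailing_y(word)}"
end
```

"say" is left alone because the preceding letter is a vowel, and "by" is left alone because the preceding letter is the first letter of the word.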

You can see the query Search API sends to Elasticsearch with the
`debug=show_query` parameter. Debug parameters can be combined, like
`debug=show_query,explain`. The debug output is verbose, so sometimes
restricting to only a handful of results, with `count=0` or `count=1`,
is useful.
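
As a rough sketch, a debug request could be built like this in Ruby. The host name and the `debug_search_url` helper are hypothetical, and the `q` parameter name is an assumption that this document does not confirm; `debug`, `count` and the `/search` path come from the text above.

```ruby
require "uri"

# Builds a /search URL with combined debug parameters and a small count
# to keep the verbose debug output manageable.
def debug_search_url(base_url, query, debug: %w[show_query explain], count: 1)
  uri = URI.join(base_url, "/search")
  uri.query = URI.encode_www_form(
    q: query,               # assumed name of the search term parameter
    debug: debug.join(","), # debug parameters are comma-separated
    count: count,
  )
  uri.to_s
end

# "search-api.example" is a placeholder host, not a real environment.
url = debug_search_url("http://search-api.example", "harry potter")
puts url
```

The comma between the debug values is percent-encoded by `URI.encode_www_form`, which decodes back to `show_query,explain` on the receiving side.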

## Possible problems with queries and relevance

### The way we handle longer search terms is broken
2 changes: 1 addition & 1 deletion doc/search-quality-metrics.md
@@ -21,7 +21,7 @@ tell us how search is performing against relevance judgements.
## Offline metrics

Our main offline metric is nDCG. We measure this before and after
re-ranking by our [learning to rank model](doc/learning-to-rank.md).
re-ranking by our [learning to rank model](learning-to-rank.md).
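
For readers unfamiliar with nDCG, the following Ruby sketch shows one common way to compute it and compare a ranking before and after re-ranking. It assumes the linear-gain form of DCG (relevance divided by log2 of rank plus one); this document does not say which variant the project uses, and the judgement values are made up for illustration.

```ruby
# Discounted cumulative gain: each relevance judgement is discounted by
# the log of its (1-based) rank, so results lower in the list count less.
def dcg(relevances)
  relevances.each_with_index.sum do |relevance, index|
    relevance / Math.log2(index + 2)
  end
end

# Normalised DCG: DCG of the actual ordering divided by the DCG of the
# ideal ordering (judgements sorted from most to least relevant).
def ndcg(relevances)
  ideal = dcg(relevances.sort.reverse)
  return 0.0 if ideal.zero?

  dcg(relevances) / ideal
end

# Made-up relevance judgements, listed in the order results were shown.
before_reranking = [1, 3, 2]
after_reranking = [3, 2, 1]

puts "nDCG before: #{ndcg(before_reranking).round(3)}"
puts "nDCG after:  #{ndcg(after_reranking).round(3)}"
```

A score of 1.0 means the results are already in the ideal order, so comparing the two values shows whether re-ranking moved relevant documents closer to the top.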

We use Elasticsearch's [Ranking Evaluation API](ranking_evaluation_api)
to assess the quality of results retrieved from Elasticsearch prior