
Commit: Review documentation
barrucadu committed Apr 17, 2019
1 parent bdafaa8 commit 1351ba0
Showing 10 changed files with 56 additions and 94 deletions.
22 changes: 14 additions & 8 deletions doc/adding-new-fields.md
# Adding new fields to a document type

### The schema

`config/schema` contains a bunch of JSON files that together define a schema for documents in Search API. This is described in more detail in the [README](../config/schema/README.md).

First you need to decide which field type to use.
`field_types.json` defines common elasticsearch configuration that we reuse for multiple fields with the same type.

The type you use affects whether the field is [analysed][] by elasticsearch and whether you can use it in [filters][] and [aggregates][].

[analysed]: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-analysis.html
[filters]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-filter-context.html
[aggregates]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-aggregations.html

Add your new field to `field_definitions.json`.
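For illustration only — the field name below is invented, and the exact shape should be copied from the existing entries in `field_definitions.json` — a new entry might look roughly like:

```json
{
  "example_new_field": {
    "type": "identifier",
    "description": "A hypothetical field added for illustration"
  }
}
```

The `type` value must be one of the types declared in `field_types.json`.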

The easiest way to test the new fields is to write an integration test for it.

### Transformation during indexing

Some fields get transformed by Search API before they are stored in Elasticsearch. This is handled by the `DocumentPreparer` class.

### Presenting for search

Some fields get expanded by Search API when they are presented in search results. For example, `specialist_sector` links get expanded by looking up the corresponding documents from the search index and extracting title, content id, and link fields. This is handled by `Search::BaseRegistry`.
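As an illustration of this expansion — a Python sketch, not the actual Ruby `Search::BaseRegistry` implementation, with simplified, assumed data shapes:

```python
def expand_links(raw_links, index):
    """Expand bare link paths (e.g. specialist_sector links) into hashes of
    title, content id, and link by looking them up in `index`, a dict standing
    in for the search index. Purely illustrative."""
    expanded = []
    for link in raw_links:
        doc = index.get(link)
        if doc:
            expanded.append({
                "title": doc["title"],
                "content_id": doc["content_id"],
                "link": link,
            })
        else:
            # Fall back to the bare link when the document isn't indexed.
            expanded.append({"link": link})
    return expanded
```
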

### Updating Search API schema indexes on all environments

**Caution:** Do not run this rake task in production during working hours except in an emergency. Content published while the task is running will not be available in search results until the task completes. The impact of this can be reduced if you run the task out of peak publishing hours.

In order for the new field to work as expected, you will need to run a Jenkins job on all environments. The job is "Search reindex with new schema" ([Link to integration version of task][reindex]), and will run the `rummager:migrate_schema` rake task. It can take over 40 minutes to complete.

[reindex]: https://deploy.integration.publishing.service.gov.uk/job/search_api_reindex_with_new_schema/

This job will block other rake tasks from being run for 15 minutes to an hour.

For the new elasticsearch configuration to take effect, you need to manually rebuild the search indexes.

In the past, this was done automatically every night by the [`search_fetch_analytics`](https://github.com/alphagov/search-analytics) jenkins job, but this automation [was reverted](https://github.com/alphagov/search-analytics/commit/a5c3ac58f7198eba74ab7b5bd5555aa07490442a#diff-0484c7ea1cf547a292a2190d0c1c060b). You must run this manually.

If you prefer running a rake task rather than a pre-written Jenkins job, you can run `RUMMAGER_INDEX=all CONFIRM_INDEX_MIGRATION_START=1 rummager:migrate_schema`.
8 changes: 4 additions & 4 deletions doc/content-api.md
# Content API

### `GET /content/?link=/a-link`

Returns information about the search result with the specified link.
### Example response

```
curl -XGET http://search-api.dev.gov.uk/content?link=/vehicle-tax
```

Currently returns a hash with one element: `raw_source`, which contains the raw elasticsearch document.

```json
{
```

### `DELETE /content/?link=/a-link`

Deletes the search result with the specified link.
## Example response

```
curl -XDELETE http://search-api.dev.gov.uk/content?link=/vehicle-tax
```

Returns 404 when the link is not found, and 204 when it is deleted.
47 changes: 10 additions & 37 deletions doc/content_overview.md
# Content in the search index and where it comes from


For an overview of the sorts of content that are available, see [Document types on GOV.UK](https://docs.publishing.service.gov.uk/document-types.html).

## Whitehall

This is what most publishers use to publish. Content appears on the ["inside government" part of GOV.UK](https://www.gov.uk/government/publications). There are 200,000 documents.

Implemented in [searchable.rb](https://github.com/alphagov/whitehall/blob/master/app/models/searchable.rb).


## Other publishing apps

Most publishing apps, such as publisher and specialist-publisher, do not send
content to Search API directly. Instead, they publish content to the
[publishing-api][publishing_api] which adds the content to a notifications queue
to be ingested by search-api.

See [ADR 001][adr_001] for more details on this approach.

[publishing_api]: https://github.com/alphagov/publishing-api
[adr_001]: https://github.com/alphagov/search-api/blob/master/doc/arch/adr-001-use-of-both-rabbitmq-and-sidekiq-queues.md

## Search admin
Admin for GOV.UK search. Publishes "recommended links" to Search API,
so we can show external links in search results; and "best bets", so
selected search results can be artificially boosted to the top of the
list.

Implemented in [elastic_search_recommended_link.rb](https://github.com/alphagov/search-admin/blob/master/app/models/elastic_search_recommended_link.rb) and [rummager_saver.rb](https://github.com/alphagov/search-admin/blob/master/app/services/rummager_saver.rb).
2 changes: 1 addition & 1 deletion doc/documents.md
# Documents API

### `POST /:index/documents`

52 changes: 15 additions & 37 deletions doc/new-indexing-process.md
Example PRs for [adding a new document type](https://docs.publishing.service.gov
- [Publish recommended links to publishing API](https://github.com/alphagov/search-admin/pull/97)
- [Add rake task for publishing all external links to the publishing API](https://github.com/alphagov/search-admin/pull/100/files)

Ensure that all fields the publishing app currently sends to Search API are included in the payload sent to the publishing API. If anything is missing, you'll need to update the publishing app and content schemas, and then re-publish existing content to the publishing API.

You may also want to clean up other inconsistencies before changing the indexing method.

Example PRs:
- [Ensure we pass the description text to publishing API](https://github.com/alphagov/calendars/pull/162/files)

## Update the presenter to handle the new format
You'll need to update the elasticsearch presenter in Search API so that it handles any fields which are not yet used by other formats in the govuk index.

Fields that are common to multiple document types should be handled in a consistent way by Search API. Don't add in special cases without good reason, even if the publishing app used to do something different.

This is especially true for key fields like `title`, `description`, and `indexable_content`, although in some cases we do prefix titles so that similar looking content is distinguishable.

Example PRs:

- [Make policies indexable](https://github.com/alphagov/search-api/pull/1053)

## Get the data in sync on integration

1. Remove the format from `non_indexable` in `migrated_formats.yaml` and deploy to integration.

This makes Search API update the `govuk` index when content is published or unpublished.

1. Delete any existing data from the `govuk` index for the unmigrated format.
This makes sure that it only contains the data you send to it from the publishing api.

``` rake delete:by_format[<format>,govuk]```

1. Resend from publishing API on integration

``` rake queue:requeue_document_type[<format>]```

If nothing happens, check the sidekiq logs for the Search API govuk index worker.

You can also monitor the resending using the Search API deployment dashboard or the [elasticsearch dashboard](https://grafana.integration.publishing.service.gov.uk/dashboard/file/search_api_elasticsearch.json).

## Deploy to production
When it looks consistent on integration, deploy to production.

You will need to run the steps above on each environment.

Verify that the new indexing process runs without errors for a few days.

## Mark the format as `migrated` in `migrated_formats.yaml`
This will cause Search API to use the new index for queries.

Test all search pages/finders that can show the format, and run the search healthcheck.

If anything goes wrong, roll back to `indexable`.

## Remove the indexing code from the publishing app
Once everything is working, the publishing app doesn't need to integrate
with Search API any more.

Example PRs:

4 changes: 2 additions & 2 deletions doc/popularity.md
If you do need to fetch the analytics data directly yourself, the
[search-analytics project README](https://github.com/alphagov/search-analytics)
describes how to set up and run the extraction of page traffic information from
Google Analytics. It will produce a dump file suitable for loading into an
elasticsearch index using the `bulk_load` tool.

Once you have the popularity data in a file named, say, `page-traffic.dump`,
load it into elasticsearch using:
is run after populating the page-traffic index. As part of the migration, the
popularity for each document will be computed from the page-traffic index and
merged into the documents. To do this, run:

RUMMAGER_INDEX=all CONFIRM_INDEX_MIGRATION_START=1 bundle exec rake rummager:migrate_schema
2 changes: 1 addition & 1 deletion doc/public-api/faceted-search.md
The special value `_MISSING` may be specified as a filter value - this will match documents that are missing the field.

For string fields, values are the field value to match.

For date fields, values are date ranges. These are specified as comma separated lists of key:value parameters, where key is one of `from` or `to`, and the value is an ISO formatted date (with no timezone). UTC is assumed for all dates. Date ranges are inclusive of their endpoints.

For example: `from:2014-04-01 00:00,to:2014-04-02 00:00` is a range for the 24 hours from midnight at the start of April the 1st 2014, including midnight on both days.
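The `from`/`to` syntax above can be parsed along these lines — an illustrative Python sketch, not Search API's actual (Ruby) implementation, and the function name is invented:

```python
from datetime import datetime

def parse_date_range(value):
    """Parse a filter value such as 'from:2014-04-01 00:00,to:2014-04-02 00:00'
    into datetime endpoints. UTC is assumed, as stated above."""
    endpoints = {}
    for part in value.split(","):
        key, _, raw = part.partition(":")
        if key not in ("from", "to"):
            raise ValueError("unknown range key: %s" % key)
        # Accept an ISO date with or without a time component.
        fmt = "%Y-%m-%d %H:%M" if " " in raw else "%Y-%m-%d"
        endpoints[key] = datetime.strptime(raw, fmt)
    return endpoints
```

Endpoint inclusivity would then typically be enforced when the parsed values are turned into an elasticsearch range query (`gte`/`lte`), not by the parser itself.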

8 changes: 4 additions & 4 deletions doc/publishing-finders.md
# Publishing finders

Search API is currently used to publish some finders which do not fit
the standard specialist document finder pattern. These are:

- Advanced search, available at the path `/search/advanced`
- Find EU Exit guidance for business (currently in development)
FINDER_CONFIG=news_and_communications.yml EMAIL_SIGNUP_CONFIG=news_and_communica

**NOTE:** The `find-eu-exit-guidance-business` finder config is overwritten by a
[shared definition in the govuk-app-deployment-secrets repo](https://github.com/alphagov/govuk-app-deployment-secrets/blob/master/shared_config/find-eu-exit-guidance-business.yml), the file committed to the
Search API repo is a development copy.
Binary file removed doc/rough_content_breakdown.png
5 changes: 5 additions & 0 deletions doc/schemas.md
The files contain a JSON object with the following keys:
with the hash `{ "label": "Bar the bar", "value": "bar" }`. This can be used
when displaying the search results.

Even though we have different schemas for different "elasticsearch
document types", in practice elasticsearch only knows about one
"type", which is the union of all the schemas. This is because
Elasticsearch 6 does not allow multiple types in the same index.
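As a hypothetical illustration (the document types and field names here are invented, not taken from the real schemas), two document-type schemas like:

```json
{
  "news_article": { "fields": ["title", "public_timestamp"] },
  "organisation": { "fields": ["title", "acronym"] }
}
```

would result in a single elasticsearch mapping containing the union of their fields: `title`, `public_timestamp`, and `acronym`.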

## Indexes

Indexes in elasticsearch are defined by files in the `indexes` directory.
