From 6ef21d7a4a9544f835c3182a2c5fdcfb425f9512 Mon Sep 17 00:00:00 2001 From: Tijmen Brommet Date: Mon, 12 Dec 2016 12:36:49 +0000 Subject: [PATCH] Move schema documentation to docs directory This makes it easier to discover. --- README.md | 1 + config/schema/README.md | 162 +--------------------------------------- docs/schemas.md | 161 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 163 insertions(+), 161 deletions(-) create mode 100644 docs/schemas.md diff --git a/README.md b/README.md index 5db990604..7769ab0b9 100644 --- a/README.md +++ b/README.md @@ -123,6 +123,7 @@ For the most up to date query syntax and API output: ### Additional Docs +- [Schemas](docs/schemas.md): how to work with schemas and the document types - [Health Check](docs/health-check.md): usage instructions for the Health Check functionality. - [Popularity information](docs/popularity.md): Rummager uses Google Analytics diff --git a/config/schema/README.md b/config/schema/README.md index 4d846e494..df299dc91 100644 --- a/config/schema/README.md +++ b/config/schema/README.md @@ -1,161 +1 @@ -# Schema definitions - -This directory tree holds the schema definitions for our search indexes. -The purpose of the schema is to define: - - - which fields are allowed to be present in documents passed to the search - indexer - - how these fields should be processed at indexing time - - how these fields should be searched at query time - -The schema is used to produce the elasticsearch configuration for each index, -but is also used to control how the fields are processed before passing them to -elasticsearch, and after retrieving results from elasticsearch. - -There are several "layers" of the schema; field types, which are used to define -the configuration for a named set of fields, which are then used to define a -set of document types, which are then used to make up the configuration for an -index. - -## Field types - -The `field_types.json` file contains a JSON object describing some "high-level" -named types which can be applied to fields in documents. Each type is keyed by -the name of the type, and has the following properties: - - - `description`: a human-readable description of what the type is to be used - for (this is essentially a comment field). - - - `es_config`: the elasticsearch configuration to be applied to fields using - this type. - - - `multivalued`: (optional, default of `false`) - if true, the field is - allowed to contain multiple values, and will be represented as an array when - documents are being returned. - - - `children`: (optional) - if present, the field contains child fields (ie, - the field values will be JSON objects containing the child fields). This - must be set to one of two values: - - - `named`: The definition of the field (see next section) must contain a - `children` property, defining the allowed child fields. - - - `dynamic`: Arbitrary child fields will be accepted, but how they are - handled will be configured dynamically according to the elasticsearch - configuration. - -## Field definitions - -To keep things simple we require that for a field of a given name, the same -configuration must be used regardless of document type, or even search index -that the document is being placed in. - -The `field_definitions.json` file describes what configuration should be used -for each named field; this single-point of configuration ensures that differing -configuration cannot be used for a single field name. - -The file contains a JSON object in which the keys are the field names. Each -value has the following properties: - - - `description`: a human-readable description of what type of data is expected - to exist in the field. - - - `type`: one of the type values defined in the `field_types.json` file. - - - `children`: (only for fields for which the type had the `children` property - set to `named`) the field definitions for the child fields. These are in - the same format as the top-level field definitions in the file (and could - even be recursive). - -## Elasticsearch document types - -Documents in an elasticsearch index have a type, and each type may have very -different configuration. We call this type "elasticsearch type" to differentiate -with the `document_type` used by GOV.UK to describe types of pages. - -We define a basic configuration which is used across all document types, and -additional configuration for each document type is then merged with this. - -The basic configuration is defined in a `base_elasticsearch_type.json` file. -Document types are defined in the `elasticsearch_types` directory, with the -additional configuration for each type being defined by a JSON file in that -directory. - -The files contain JSON object with the following keys: - - - `fields`: An array of field names which are allowed in documents of this - type. The field names must be defined in the `field_definitions.json` file. - - - `expanded_search_result_fields`: DEPRECATED. Sets up field "expansion" for a - field. This is used by the search result presenter to replace a raw value. - - For example: - - ``` - "expanded_search_result_fields": { - "fault_type": [ - { - "label": "Recall", - "value": "recall" - }, - { - "label": "Non-urgent fault", - "value": "non_urgent_fault" - } - ], - ``` - - Will return make the presenter replace the `fault_type` with value `recall` - with the hash `{ "label": "Recall", "value": "recall" }`. This can be used - when displaying the search results. - -## Indexes - -Indexes in elasticsearch are defined by files in the `indexes` directory. -These files contain a JSON object with the following keys: - - - `elasticsearch_types`: An array of the names of the document types allowed in this index. - -## Synonyms - -Synonyms are defined in the `synonyms.yml` file. - -Synonyms are specified in "lucene" syntax. Each synonym group is represented -as comma separated lists of synonyms, and optionally a "=>" symbol. - -If the => is provided, the left hand side contains a list of words which are -combined at search time into a group, and the right hand side contains the list -of words which are combined at index time into that same group. ie, the left -hand side is the mapping applied to searches, and the right hand side is the -mapping applied to documents. - -If there is no =>, the same group of words is used at index time as at -search time. - -Some examples: - - foo, bar => baz - -means "a search for 'foo' or 'bar' should return documents with 'baz' in them" - - foo => bar, baz - -means "a search for 'foo' should return documents with 'bar' or 'baz' in them" - - foo, bar, baz - -is the same as `foo, bar, baz => foo, bar, baz` and means "a search for -foo, bar or baz should return documents with any of them". - -## Additional configuration - -Additional configuration is defined in the `elasticsearch_schema.yml` and -`stems.yml` files. This configuration is merged with the JSON configuration, -and then passed to elasticsearch directly. - -The `old_synonyms.yml` contains a set of synonyms which were applied in an old, -and moderately broken, way. This is kept separate from the `synonyms.yml` file -so that we can support and modify the lists for the new synonym approach -without changing things for the old approach. Once we're happy with the new -approach, we'll delete the `old_synonyms.yml` file, and remove support for -using it. +[Read docs/schemas.md](/docs/schemas.md) diff --git a/docs/schemas.md b/docs/schemas.md new file mode 100644 index 000000000..a9ff08f84 --- /dev/null +++ b/docs/schemas.md @@ -0,0 +1,161 @@ +# Schema definitions + +The `config/schema` tree holds the schema definitions for our search indexes. +The purpose of the schema is to define: + + - which fields are allowed to be present in documents passed to the search + indexer + - how these fields should be processed at indexing time + - how these fields should be searched at query time + +The schema is used to produce the elasticsearch configuration for each index, +but is also used to control how the fields are processed before passing them to +elasticsearch, and after retrieving results from elasticsearch. + +There are several "layers" of the schema; field types, which are used to define +the configuration for a named set of fields, which are then used to define a +set of document types, which are then used to make up the configuration for an +index. + +## Field types + +The `field_types.json` file contains a JSON object describing some "high-level" +named types which can be applied to fields in documents. Each type is keyed by +the name of the type, and has the following properties: + + - `description`: a human-readable description of what the type is to be used + for (this is essentially a comment field). + + - `es_config`: the elasticsearch configuration to be applied to fields using + this type. + + - `multivalued`: (optional, default of `false`) - if true, the field is + allowed to contain multiple values, and will be represented as an array when + documents are being returned. + + - `children`: (optional) - if present, the field contains child fields (ie, + the field values will be JSON objects containing the child fields). This + must be set to one of two values: + + - `named`: The definition of the field (see next section) must contain a + `children` property, defining the allowed child fields. + + - `dynamic`: Arbitrary child fields will be accepted, but how they are + handled will be configured dynamically according to the elasticsearch + configuration. + +## Field definitions + +To keep things simple we require that for a field of a given name, the same +configuration must be used regardless of document type, or even search index +that the document is being placed in. + +The `field_definitions.json` file describes what configuration should be used +for each named field; this single-point of configuration ensures that differing +configuration cannot be used for a single field name. + +The file contains a JSON object in which the keys are the field names. Each +value has the following properties: + + - `description`: a human-readable description of what type of data is expected + to exist in the field. + + - `type`: one of the type values defined in the `field_types.json` file. + + - `children`: (only for fields for which the type had the `children` property + set to `named`) the field definitions for the child fields. These are in + the same format as the top-level field definitions in the file (and could + even be recursive). + +## Elasticsearch document types + +Documents in an elasticsearch index have a type, and each type may have very +different configuration. We call this type "elasticsearch type" to differentiate +with the `document_type` used by GOV.UK to describe types of pages. + +We define a basic configuration which is used across all document types, and +additional configuration for each document type is then merged with this. + +The basic configuration is defined in a `base_elasticsearch_type.json` file. +Document types are defined in the `elasticsearch_types` directory, with the +additional configuration for each type being defined by a JSON file in that +directory. + +The files contain JSON object with the following keys: + + - `fields`: An array of field names which are allowed in documents of this + type. The field names must be defined in the `field_definitions.json` file. + + - `expanded_search_result_fields`: DEPRECATED. Sets up field "expansion" for a + field. This is used by the search result presenter to replace a raw value. + + For example: + + ``` + "expanded_search_result_fields": { + "fault_type": [ + { + "label": "Recall", + "value": "recall" + }, + { + "label": "Non-urgent fault", + "value": "non_urgent_fault" + } + ], + ``` + + Will return make the presenter replace the `fault_type` with value `recall` + with the hash `{ "label": "Recall", "value": "recall" }`. This can be used + when displaying the search results. + +## Indexes + +Indexes in elasticsearch are defined by files in the `indexes` directory. +These files contain a JSON object with the following keys: + + - `elasticsearch_types`: An array of the names of the document types allowed in this index. + +## Synonyms + +Synonyms are defined in the `synonyms.yml` file. + +Synonyms are specified in "lucene" syntax. Each synonym group is represented +as comma separated lists of synonyms, and optionally a "=>" symbol. + +If the => is provided, the left hand side contains a list of words which are +combined at search time into a group, and the right hand side contains the list +of words which are combined at index time into that same group. ie, the left +hand side is the mapping applied to searches, and the right hand side is the +mapping applied to documents. + +If there is no =>, the same group of words is used at index time as at +search time. + +Some examples: + + foo, bar => baz + +means "a search for 'foo' or 'bar' should return documents with 'baz' in them" + + foo => bar, baz + +means "a search for 'foo' should return documents with 'bar' or 'baz' in them" + + foo, bar, baz + +is the same as `foo, bar, baz => foo, bar, baz` and means "a search for +foo, bar or baz should return documents with any of them". + +## Additional configuration + +Additional configuration is defined in the `elasticsearch_schema.yml` and +`stems.yml` files. This configuration is merged with the JSON configuration, +and then passed to elasticsearch directly. + +The `old_synonyms.yml` contains a set of synonyms which were applied in an old, +and moderately broken, way. This is kept separate from the `synonyms.yml` file +so that we can support and modify the lists for the new synonym approach +without changing things for the old approach. Once we're happy with the new +approach, we'll delete the `old_synonyms.yml` file, and remove support for +using it.