# Data Validation Error Format

Version 0.0.1

Jakob Voß ([ORCID: 0000-0002-7613-4123](https://orcid.org/0000-0002-7613-4123)), Verbundzentrale des GBV (VZG)

This document specifies a data format to report validation errors of digital objects.

# 1. Introduction

## Motivation and scope

Data validation is a crucial part of managing data quality and interoperability. Validation is applied in many ways and contexts, for instance in input forms and editors with visual feedback, or in schema languages with formal error reports. The diversity of use cases implies a variety of error results. No uniform standard exists to express error reports.[1]

The specification of **Data Validation Error Format** has two goals:

-   unify how validation errors are reported by different validators
-   address positions of errors in validated documents

Last but not least, the format should help to better separate validation from the presentation of validation results, so that both can be handled by different applications.

The format is strictly limited to errors and error positions. It includes neither other kinds of analysis results, such as statistics and summaries of documents, nor concepts of test cases, schemas or other information about validation internals.

## Overview

<a href="#fig-validation" class="quarto-xref">Figure 1</a> illustrates the validation process with the core concepts used in this specification: a **validator** checks whether a **document** conforms to some requirements and returns a list of **errors**. Each error can refer to its locations in the document via **positions**. These concepts are defined more formally in the following normative sections.

<figure class=''>

<pre class="mermaid mermaid-js">graph LR
   document --- validator --&gt; errors
   errors -. positions .-&gt; document
   validator(validator)
</pre>

</figure>

Figure 1: Validation process

Every document conforms to a **document model**. For instance, a JSON document conforms to the JSON model, and a character string conforms to the model “sequence of characters from a known character set”. Document models come with **encodings** that define how to express documents on a lower level as documents of another document model. For instance, JSON documents can be encoded with JSON syntax as Unicode strings, and Unicode strings can be encoded with UTF-8 as sequences of bytes (<a href="#fig-encodings-and-locators" class="quarto-xref">Figure 2</a>, labelled arrows).

> **Note**
>
> Eventually all documents are given as digital objects, encoded as a sequence of bytes. Encodings using a sequence of characters are also called textual data formats, in contrast to binary data formats.

Error positions are given in the form of **locators**, each conforming to a **locator model** and encoded as a Unicode string with a **locator format**. Each locator model refers to a limited set of document models: for instance, JSON Pointer refers to JSON, line numbers refer to any model segmenting a document into a sequence of lines, and offsets refer to simple models of a sequence of elements, such as sequences of bytes (<a href="#fig-encodings-and-locators" class="quarto-xref">Figure 2</a>).[2]

<figure class=''>

<pre class="mermaid mermaid-js">graph LR
   JSON -- JSON syntax --&gt; Unicode
   Unicode -- UTF-8   --&gt; Bytes
   Unicode[Unicode string]

   jsonpointer(JSON Pointer)
   line(line number)
   offset(byte offset)

   style jsonpointer fill:#fff,stroke:#fff
   style line fill:#fff,stroke:#fff
   style offset fill:#fff,stroke:#fff

   jsonpointer -.-&gt; JSON
   line -.-&gt; Unicode
   offset -.-&gt; Bytes
</pre>

</figure>

Figure 2: Example of encodings and locator formats

## Examples

A JSON file can be invalid on many levels. For example the JSON document `{"åå":5}` could be invalid on schema level if element `åå` is expected to hold a string instead of a number (<a href="#lst-1" class="quarto-xref">Example 1</a>):

``` json
{
  "message": "Expected string, got number at element /åå",
  "position": { "jsonpointer": "/åå", "line": "1" }
}
```

Example 1: Error in a JSON document
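
The JSON Pointer of such an error can be derived from the member names that lead to the offending value. The following Python sketch is an illustration only: the helper names `json_pointer` and `check_string_member` are hypothetical and not part of this specification, and the escaping rules (`~` becomes `~0`, `/` becomes `~1`) are those of JSON Pointer (RFC 6901).

```python
import json


def json_pointer(tokens):
    """Build a JSON Pointer (RFC 6901) from a list of reference tokens."""
    escaped = (str(t).replace("~", "~0").replace("/", "~1") for t in tokens)
    return "".join("/" + t for t in escaped)


def check_string_member(document, name):
    """Hypothetical check: member `name` of a JSON object must hold a string.

    Returns a list of errors, each with a message and a locator map.
    """
    value = document.get(name)
    if isinstance(value, str):
        return []
    pointer = json_pointer([name])
    return [{
        "message": f"Expected string at element {pointer}",
        "position": {"jsonpointer": pointer},
    }]


# The member holds a number, so one error is reported.
errors = check_string_member(json.loads('{"åå":5}'), "åå")
```

Keeping pointer construction in a separate function ensures that member names containing `/` or `~` are still escaped correctly.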

The JSON syntax could also be invalid. For example, the document `{åå:5}` is valid JavaScript but not valid JSON (<a href="#lst-2" class="quarto-xref">Example 2</a>):

``` json
{
  "message": "Expected property name or '}' at index 2",
  "position": { "line": "1", "char": "2" }
}
```

Example 2: Error in JSON syntax
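
A syntax error of this kind can be obtained from an existing JSON parser. The sketch below wraps Python's standard `json` module; the function name `syntax_errors` is hypothetical, and it is an assumption of this sketch that the parser's 1-based line and column numbers can be used directly as values of the locator formats `line` and `char`. The wording of the message depends on the parser.

```python
import json


def syntax_errors(text):
    """Return a list with one error if `text` is not valid JSON, else an empty list."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return [{
            "message": exc.msg,
            # Assumption: the parser's 1-based line and column numbers
            # serve as values of the locator formats `line` and `char`.
            "position": {"line": str(exc.lineno), "char": str(exc.colno)},
        }]
    return []


errors = syntax_errors("{åå:5}")
```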

Last but not least, the UTF-8 encoding could be invalid. The following document contains a byte that is invalid when decoded as UTF-8. Some validators may replace the byte with the Unicode replacement character `U+FFFD`, but the resulting Unicode string is still invalid JSON syntax (<a href="#lst-3" class="quarto-xref">Example 3</a>).

|  |  |  |  |  |  |  |  |  |  |  |
|---------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Byte** | `7b` | `22` | `c3` | `a5` | `c3` | `a5` | `22` | `3a` | `c0` | `7d` |
| **Code point** | `U+007B` | `U+0022` | `U+00E5` |  | `U+00E5` |  | `U+0022` | `U+003A` | ERROR `U+FFFD` | `U+007D` |
| **Character** | `{` | `"` | `å` |  | `å` |  | `"` | `:` | `�` | `}` |

``` json
[
  {
    "level": "warning",
    "message": "Ill-formed UTF-8 byte sequence at offset 8",
    "position": { "offset": "8", "line": "1", "char": "7" }
  },
  {
    "level": "error",
    "message": "Expected JSON value at line 1, column 7",
    "position": { "line": "1", "char": "7" }
  }
]
```

Example 3: Invalid JSON on multiple levels
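
The first error of this example relates a byte offset to a line and character position. As a rough sketch of how a validator might compute the three locators for the byte sequence in the table above, the following Python function relies on the standard UTF-8 decoder. The function name `utf8_errors` is hypothetical, and the sketch assumes that lines are separated by U+000A only and that `char` counts code points within the line, starting at 1.

```python
def utf8_errors(data):
    """Return a list with one error for the first ill-formed UTF-8 sequence in `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the 0-based byte offset of the offending byte;
        # everything before it is well-formed and can be decoded.
        prefix = data[:exc.start].decode("utf-8")
        last_line = prefix.split("\n")[-1]
        return [{
            "level": "warning",
            "message": f"Ill-formed UTF-8 byte sequence at offset {exc.start}",
            "position": {
                "offset": str(exc.start),
                # Assumption: lines end with U+000A only, and `char`
                # counts code points within the line, starting at 1.
                "line": str(prefix.count("\n") + 1),
                "char": str(len(last_line) + 1),
            },
        }]
    return []


data = bytes.fromhex("7b22c3a5c3a5223ac07d")
errors = utf8_errors(data)
```

The byte offset and the character position differ because each `å` occupies two bytes but counts as one character.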

# 2. Terminology

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 ([RFC 2119](https://tools.ietf.org/html/rfc2119) and [RFC 8174](https://tools.ietf.org/html/rfc8174)) when, and only when, they appear in all capitals, as shown here.

A **validator** is an executable function that transforms a **document** into a (possibly empty) set of **errors**.

# 3. Errors

An **Error** is a JSON object with:

-   mandatory field `message` with an **error message**, being a non-empty string
-   optional field `type` with an **error type**, being a non-empty string (SHOULD be a URI)
-   optional field `level` with one of the strings `error` or `warning`
-   optional field `position` with [positions](#positions)

An error is also called a **warning** if field `level` has value `warning`.
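
The field rules above can be sketched as a small Python check. The function names `is_error` and `is_warning` are hypothetical, and the content of field `position` is deliberately not inspected here.

```python
def is_error(obj):
    """Check the field rules of an error object (field `position` is not inspected)."""
    if not isinstance(obj, dict):
        return False
    # Mandatory field `message`: a non-empty string.
    message = obj.get("message")
    if not isinstance(message, str) or message == "":
        return False
    # Optional field `type`: a non-empty string if present.
    if "type" in obj and (not isinstance(obj["type"], str) or obj["type"] == ""):
        return False
    # Optional field `level`: one of the two allowed strings if present.
    if "level" in obj and obj["level"] not in ("error", "warning"):
        return False
    return True


def is_warning(obj):
    """An error is called a warning if its field `level` has value `warning`."""
    return is_error(obj) and obj.get("level") == "warning"
```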

> **Note**
>
> Language and localisation of error messages are out of scope of this specification.

# 4. Positions

An error can have one or more **positions**. Positions are

-   either a JSON array of [locators](#locators) (detailed form),
-   or a [locator map](#locator-maps) (condensed form).

Every locator map can be transformed to an equivalent array of locators. The reverse transformation is not always possible.

## Locators

A **locator** is a JSON object with

-   mandatory field `format` with the [locator format](#locator-formats)
-   mandatory field `value` with the **locator value**, being a string
-   optional field `position` with nested [positions](#positions)

``` json
{ "format": "line", "value": "7" }
```

Example 4: A simple locator indicating the position line (locator format) 7 (locator value)

Nested positions allow referencing locations within nested documents (<a href="#lst-nested-example" class="quarto-xref">Example 5</a>).

``` json
{
  "message": "Invalid value in line 2 of file example.txt in file archive.zip",
  "position": [ {
    "format": "file",
    "value": "archive.zip",
    "position": [ {
      "format": "file",
      "value": "example.txt",
      "position": [ { "format": "line", "value": "2" } ]
    } ]
  } ]
}
```

Example 5: Error in line 2 of file `example.txt` in archive `archive.zip`
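
An application that consumes nested positions typically needs the chain of locators from the outermost document down to the innermost location. A minimal Python sketch of such a traversal follows; the function name `locator_paths` is hypothetical, and the sketch assumes that all positions, including nested ones, are given in detailed form as arrays of locators.

```python
def locator_paths(positions, prefix=()):
    """Yield one tuple of (format, value) pairs per innermost location.

    Assumption: `positions` and all nested positions are arrays of locators.
    """
    for locator in positions:
        path = prefix + ((locator["format"], locator["value"]),)
        nested = locator.get("position")
        if nested:
            # Descend into the nested document.
            yield from locator_paths(nested, path)
        else:
            yield path


position = [{
    "format": "file",
    "value": "archive.zip",
    "position": [{
        "format": "file",
        "value": "example.txt",
        "position": [{"format": "line", "value": "2"}],
    }],
}]

paths = list(locator_paths(position))
```

For the position above, `paths` contains a single path with three steps: file `archive.zip`, file `example.txt`, line `2`.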

[1] Notable exceptions are formats from software development used in unit testing, such as [JUnit XML](https://github.com/testmoapp/junitxml).

[2] Unicode strings are sequences as well, but it is not obvious whether the elements are code points, code units or characters. Line numbers in Unicode are not trivial either, because multiple definitions of line breaks exist.

## Locator maps

A **locator map** is a JSON object that maps [locator formats](#locator-formats) to **locator values**.

``` json
{ "line": "7" }
```

Example 6: A simple locator map indicating the position line 7

A locator map is equivalent to an array of locators with key and value of each JSON object entry mapped to fields `format` and `value` of a locator.

> **Note**
>
> Locator maps simplify access to error positions when applications assume known locator formats without nested positions.
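
The transformation between both forms can be sketched in Python as follows. The function names are hypothetical. The sketch assumes two reasons why an array of locators has no equivalent locator map: a locator carries nested positions, or the same locator format occurs more than once (a JSON object cannot hold the same key twice).

```python
def map_to_locators(locator_map):
    """Expand a locator map into the equivalent array of locators."""
    return [{"format": fmt, "value": value} for fmt, value in locator_map.items()]


def locators_to_map(locators):
    """Condense an array of locators into a locator map.

    Returns None when no equivalent map exists (assumption of this sketch:
    nested positions or a repeated locator format).
    """
    result = {}
    for locator in locators:
        if "position" in locator or locator["format"] in result:
            return None
        result[locator["format"]] = locator["value"]
    return result
```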

# 5. Locator formats

Each locator format encodes a locator model that refers to a set of document models.

The Data Validation Error Format requires each locator model to have exactly one encoding, called its **locator format**, to encode locators as Unicode strings.

A **locator format** is a string that identifies a formal language to locate positions or sections in a document. The identifier must start with a lowercase letter `a` to `z`, optionally followed by a sequence of lowercase letters, digits `0` to `9` and/or `-`.
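
The syntax of locator format identifiers can be expressed as a regular expression. The following Python sketch uses the same pattern as field `format` in the non-normative JSON Schema; the names `LOCATOR_FORMAT` and `is_locator_format` are hypothetical.

```python
import re

# A lowercase letter, followed by any number of lowercase letters, digits or hyphens.
LOCATOR_FORMAT = re.compile(r"[a-z][a-z0-9-]*")


def is_locator_format(identifier):
    """Check whether a value is a syntactically valid locator format identifier."""
    return isinstance(identifier, str) and LOCATOR_FORMAT.fullmatch(identifier) is not None
```

This only checks the syntax of an identifier, not whether the locator format is known to an application.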

Some locator formats (final version of this specification needs to define a registry of locator formats):

| identifier | locator format | document model |
|---:|----|----|
| `offset` | number (first: 0) | sequence of elements |
| `line` | line number (first: 1) | sequence of lines |
| `char` | character position | sequence of characters or code points |
| `character` | character position | sequence of (possibly composed) Unicode characters |
| `jsonpointer` | JSON Pointer | JSON |
| `file` | POSIX Path | directory tree |
| `xpath` | XPath (or a subset) | XML |
| `fq` | format and path | all supported by [fq](https://github.com/wader/fq?tab=readme-ov-file#fq) |

The locator formats require more detailed specification. For instance, line numbers depend on a common definition of line breaks: some formats also treat U+000B, U+000C, U+0085, U+2028, U+2029… as line breaks.
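
The effect of different line break definitions can be observed in Python, used here purely as an illustration: `str.split("\n")` only breaks at U+000A, whereas `str.splitlines()` also breaks at characters such as U+000B, U+000C, U+0085, U+2028 and U+2029. The helper `line_number` is hypothetical.

```python
text = "first\u2028second\nthird"

# Splitting at U+000A only yields two lines.
lines_lf = text.split("\n")

# str.splitlines() also treats U+2028 as a line boundary, yielding three lines.
lines_unicode = text.splitlines()


def line_number(lines, needle):
    """Return the 1-based number of the first line containing `needle`, or None."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            return number
    return None
```

With these two definitions the same text `third` is located in line 2 or in line 3, so a `line` locator is only meaningful if validator and consumer agree on the definition of line breaks.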

More candidates:

-   IIIF
-   [RFC 7111](https://tools.ietf.org/html/rfc7111)
-   Cell address in a spreadsheet (the column uses the hexavigesimal system, the row uses numbers)
-   PDF highlighted text annotations
-   Subsets of query languages (SQL, SPARQL…)

``` json
{
  "message": "Timestamp must not be in the future!",
  "position": {
    "fq": "gzip:.members[0].mtime"
  }
}
```

Example 7: Error using fq to locate the internal timestamp of a file in a .gz archive

# 6. Notes and ideas

## Including document values/fragments

Given a JSON document

``` json
{ "authors": "Bob" }
```

An error in field `authors` could be enriched with document content:

``` json
{
  "message": "authors must be array",
  "position": [
    {
      "format": "jsonpointer",
      "value": "/authors",
      "document": [{ "format": "string", "value": "Bob" }]
    }
  ]
}
```

Problem: nested documents don’t have a simple `format`.

Another example, adding `message` to a position (does this make sense?):

``` json
{
  "message": "Invalid file foo.txt",
  "position": [
    {
      "format": "file",
      "value": "foo.txt",
      "document": [{ "format": "string", "value": "alice\nbob\n" }],
      "message": "Invalid value in line 2",
      "position": { "line": "2" }
    }
  ]
}
```

# 7. References

## Normative References

-   Bradner, S.: *Key words for use in RFCs to Indicate Requirement Levels*. BCP 14, RFC 2119, March 1997, <http://www.rfc-editor.org/info/rfc2119>.

-   Bray, T.: *The JavaScript Object Notation (JSON) Data Interchange Format*. RFC 8259, December 2017, <https://tools.ietf.org/html/rfc8259>.

-   Leiba, B.: *Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words*. BCP 14, RFC 8174, May 2017, <http://www.rfc-editor.org/info/rfc8174>.

## Informative references

-   [JSON Schema](https://json-schema.org/) schema language

# Appendices

The following information is non-normative.

## JSON Schemas

Error records can be validated with the non-normative JSON Schema [`schema.json`](schema.json) in the specification repository. The JSON Schema does not cover all rules of this specification.

``` json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "$defs": {
    "locator": {
      "type": "object",
      "properties": {
        "format": {
          "type": "string",
          "pattern": "^[a-z][a-z0-9-]*$"
        },
        "value": {
          "type": "string"
        },
        "position": { "$ref": "#/$defs/position" }
      },
      "required": ["format", "value"]
    },
    "position": {
      "description": "positions",
      "anyOf": [
        {
          "type": "array",
          "items": { "$ref": "#/$defs/locator" }
        },
        {
          "type": "object",
          "patternProperties": {
            "^[a-z][a-z0-9-]*$": {
              "type": "string"
            }
          },
          "additionalProperties": false
        }
      ]
    }
  },
  "properties": {
    "message": {
      "type": "string",
      "minLength": 1,
      "description": "error message"
    },
    "type": {
      "type": "string",
      "minLength": 1,
      "description": "identifier of the error type"
    },
    "level": {
      "type": "string",
      "enum": ["error", "warning"],
      "default": "error",
      "description": "error level ('error' or 'warning')"
    },
    "position": { "$ref": "#/$defs/position" }
  },
  "required": ["message"]
}
```

## Changes

This document is managed in a revision control system at <https://github.com/gbv/validation-error-format>, including an [issue tracker](https://github.com/gbv/validation-error-format/issues).

-   **Version 0.0.0**

    Work in progress.

## Acknowledgements

…