Author: Michał Muskała <micmus(at)whatsapp(dot)com>
Status: Draft
Type: Standards Track
Created: 12-02-2024
Erlang-Version:
Post-History:
****
# EEP 68: JSON library
----

## Abstract

This EEP proposes introducing a module `json` to the Erlang standard
library with support for encoding and decoding [JSON][1] documents
to and from Erlang data structures. The main reason is to cover
a gap in the Erlang standard library with regard to such a vastly
popular and widespread data format.

## Rationale

JSON is commonly used in many different use-cases:

* by web services as a lightweight and human-readable data interchange format;
* as a configuration language in static files;
* as a data interchange format by developer tooling;
* and more.

There are many existing JSON libraries for Erlang and other BEAM languages;
however, adding such support to the standard library would offer unique benefits,
most notably the ability to use it in situations where leveraging third-party
libraries is complex or cumbersome, such as stand-alone escripts,
fundamental tooling like a build system, or OTP itself.

There have been previous attempts to bring JSON support into OTP, most notably
[EEP 18][EEP], which ultimately weren't adopted for various reasons.
However, I believe the time is right to revisit this subject with a fresh
take on the interface such support could offer.

JSON is a well-defined format specified in parallel in [RFC 8259][RFC] and
[ECMA 404][ECMA]; however, how this representation should be translated
into Erlang is not fully clear, since the data structures don't present
a direct, 1:1 mapping. To help with this, this EEP proposes an interface
that presents both a convenient and simple "canonical" API, as well
as an extensible and highly-customisable API with a common underlying
implementation.

This EEP proposes a JSON library which:

* should be easy to adopt in large codebases using one of the popular,
  existing, open-source JSON libraries;
* will allow the existing open-source libraries with custom features
  (like support for Elixir protocols) to become thin wrappers around
  this library;
* will improve, or at least not regress, performance compared to
  leading open-source JSON libraries.

The proposed JSON library will provide:

* JSON encoding, allowing for single-pass encoding of custom data types --
  in particular, for Elixir, integrating with a protocol through a thin layer
  (implemented outside of OTP);
* JSON decoding with some streaming support, making it possible to decode
  messages that don't fully fit into memory;
* JSON decoding with support for decoding values split across separate
  messages without fully concatenating them upfront;
* focus on high-performance encoding and decoding;
* full conformance to [RFC 8259][RFC] and [ECMA 404][ECMA] standards;
  the decoder should pass the entire [JSONTestSuite][JSONTestSuite];
* simple API for common use-cases with canonical data type mapping.

## Design choices

### Data mapping

We propose, in the "canonical" API, to map JSON data structures to
Erlang and back in the following way:

| **Decoding from JSON** | **Erlang**           | **Encoding into JSON** |
|------------------------|----------------------|------------------------|
| Number                 | integer() \| float() | Number                 |
| Boolean                | true \| false        | Boolean                |
| Null                   | null                 | Null                   |
| String                 | binary()             | String                 |
|                        | atom()               | String                 |
| Array                  | list()               | Array                  |
| Object                 | #{binary() => _}     | Object                 |
|                        | #{atom() => _}       | Object                 |
|                        | #{integer() => _}    | Object                 |

Erlang generally has a richer value system than JSON; therefore
there are more types that can be encoded into JSON,
even if they can never be produced directly by the decoder.

However, with the flexible API, as demonstrated below, the user will
be able to customize the decoding & encoding routines to produce and
consume any Erlang term as necessary in the particular application.

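As a rough illustration of the canonical mapping in the table above, the
sketch below round-trips a small document. It assumes that a `json` module
with `decode/1` and `encode/1` behaving as proposed in this EEP is available
(a module with these functions shipped in OTP 27). The module name
`json_mapping_example` and its function names are made up for this example.

```erlang
-module(json_mapping_example).
-export([decoded/0, encoded/0]).

%% Decoding: objects become maps with binary keys, strings become binaries,
%% and true, false and null become the atoms of the same name.
decoded() ->
    json:decode(<<"{\"name\": \"erlang\", \"tags\": [1, 2.5, true, null]}">>).

%% Encoding accepts more types than decoding produces: atoms are emitted
%% as strings, and atom or integer map keys are emitted as string keys.
%% encode/1 returns iodata(), so it is flattened here for inspection.
encoded() ->
    iolist_to_binary(json:encode(#{status => ok, 1 => [<<"a">>, null]})).
```

Decoding the output of `encoded/0` again yields binary keys and a binary
value for `ok`, which shows that the mapping is not symmetric for atoms and
integer keys.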
### Streaming vs value-based parser

When it comes to data-structure parsers it's common to encounter two
types: ones that, given the data, produce a complete parsed value,
and others that, given the same data, produce a stream of events that can
later be processed to extract values.

The first kind, which we'll call here value-based, is generally simpler,
usually more efficient, and more convenient to use. The second one offers
unique advantages in specific use-cases: for example, where data
can't fully fit into memory.

For the proposed `json` library this EEP suggests a hybrid approach.

First, a simple, value-based API:

```erlang
-type value() ::
    integer() |
    float() |
    boolean() |
    null |
    binary() |
    list(value()) |
    #{binary() => value()}.

-spec decode(binary()) -> value().
```

Error handling is achieved through exceptions. The following errors
are possible:

```erlang
-type error() ::
    unexpected_end |
    {unexpected_sequence, binary()} |
    {invalid_byte, byte()}.
```

The exceptions might be enhanced through the [Error Info][ERRINFO] mechanism
with additional meta-data like the byte offset where the error occurred.

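A caller that prefers tagged tuples over exceptions could wrap `decode/1`
along the following lines. This is only a sketch: it assumes the exceptions
are raised with class `error` and with exactly the reasons listed in the
`error()` type above, and the wrapper name `safe_decode/1` is hypothetical.

```erlang
-module(json_errors_example).
-export([safe_decode/1]).

%% Convert the exceptions of json:decode/1 into {ok, _} | {error, _}.
%% Assumption: the reasons match the error() type proposed in this EEP.
safe_decode(Bin) ->
    try json:decode(Bin) of
        Value -> {ok, Value}
    catch
        error:unexpected_end ->
            {error, unexpected_end};
        error:{unexpected_sequence, Seq} ->
            {error, {unexpected_sequence, Seq}};
        error:{invalid_byte, Byte} ->
            {error, {invalid_byte, Byte}}
    end.
```

Any other exception, for instance a `badarg` for a non-binary argument, is
deliberately left to propagate.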
For the advanced and customizable API, this EEP proposes a callback-based
API that the decoder will use to produce values from the data it parses.

```erlang
-type from_binary_fun() :: fun((binary()) -> dynamic()).
-type array_start_fun() :: fun((Acc :: dynamic()) -> ArrayAcc :: dynamic()).
-type array_push_fun() :: fun((Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()).
-type array_finish_fun() :: fun((ArrayAcc :: dynamic()) -> dynamic()).
-type object_start_fun() :: fun((Acc :: dynamic()) -> ObjectAcc :: dynamic()).
-type object_push_fun() :: fun((Key :: dynamic(), Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()).
-type object_finish_fun() :: fun((ObjectAcc :: dynamic()) -> dynamic()).

-type decoders() :: #{
    empty_array => term(),
    array_start => array_start_fun(),
    array_push => array_push_fun(),
    array_finish => array_finish_fun(),
    empty_object => term(),
    object_start => object_start_fun(),
    object_push => object_push_fun(),
    object_finish => object_finish_fun(),
    float => from_binary_fun(),
    integer => from_binary_fun(),
    string => from_binary_fun(),
    null => term()
}.

-spec decode(binary(), Acc :: dynamic(), decoders()) ->
    {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()}.
```

This allows the user to fully customize the decoded format, including
features seen in open-source JSON libraries:

* decoding string keys as atoms;
* decoding objects as lists of pairs;
* decoding floats as custom structures with decimal precision;
* decoding `null` as another atom, in particular `undefined` or `nil`;
* using `binary:copy/1` on strings that will be retained in memory;
* decoding multiple JSON messages from a single binary blob;
* and more.

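A few of the customizations listed above can be combined in a short sketch.
It relies only on the `null` and `string` entries of `decoders()` and on the
`{Value, FinalAcc, Rest}` return value of `decode/3` as specified above, and
it assumes the decoder consumes whitespace around a value so that `Rest` is
either empty or starts at the next message. The module and function names are
hypothetical.

```erlang
-module(json_custom_decode_example).
-export([decode_all/1]).

%% Decode every JSON message found in a single binary blob.
%% `null` is mapped to `undefined`, and strings are copied so that they
%% do not keep the (possibly large) input binary alive.
decode_all(Blob) ->
    Decoders = #{null => undefined, string => fun binary:copy/1},
    decode_all(Blob, Decoders, []).

decode_all(<<>>, _Decoders, Messages) ->
    lists:reverse(Messages);
decode_all(Blob, Decoders, Messages) ->
    %% The accumulator is not used here, so a placeholder atom is passed
    %% in and expected back unchanged.
    {Value, ok, Rest} = json:decode(Blob, ok, Decoders),
    decode_all(Rest, Decoders, [Value | Messages]).
```

For example, `decode_all(<<"{\"a\": null} [1, \"x\"]">>)` would be expected
to return a list with one map and one list, in input order.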
Furthermore, this allows the user to only retain parts of the data structure
to achieve results similar to using a streaming SAX-like parser for data
that doesn't fully fit into memory.

All the callbacks are optional and have a default value corresponding to the
"simple" API behaviour and using lists as accumulators.

### Incomplete data parsing

We propose a future enhancement to the full `decode/3` API, where
it can return an `{incomplete, continuation()}` value that can be used to
decode values split across multiple binary blobs (for example as received
from a TCP socket).

```erlang
-spec decode_continue(binary(), continuation()) ->
    {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()} |
    {incomplete, continuation()}.
```

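The EEP leaves the exact shape of this API to future work. Purely as an
illustration of the idea, the sketch below uses the incremental functions
that later shipped in OTP 27, `json:decode_start/3` and
`json:decode_continue/2`. Note the difference from the proposal above: those
functions signal missing input with `{continue, State}` rather than
`{incomplete, continuation()}`, and accept the atom `end_of_input` to finish
decoding. The module name and `decode_chunks/1` are hypothetical.

```erlang
-module(json_chunks_example).
-export([decode_chunks/1]).

%% Decode one JSON value that arrives split across a list of binaries,
%% for example successive TCP packets, without concatenating them first.
%% Returns the value, the unconsumed rest of the last chunk that was fed,
%% and the chunks that were never fed to the decoder.
decode_chunks([First | Chunks]) ->
    feed(json:decode_start(First, ok, #{}), Chunks).

feed({continue, State}, [Chunk | Chunks]) ->
    feed(json:decode_continue(Chunk, State), Chunks);
feed({continue, State}, []) ->
    %% No more input: tell the decoder so that it either finishes the
    %% value or raises an error for truncated data.
    feed(json:decode_continue(end_of_input, State), []);
feed({Value, _Acc, Rest}, Remaining) ->
    {Value, Rest, Remaining}.
```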
### Encoding API

For encoding this EEP again proposes two separate sets of APIs.
A simple API using "canonical" data types:

```erlang
-type encode_value() ::
    integer() |
    float() |
    boolean() |
    null |
    binary() |
    atom() |
    list(encode_value()) |
    #{binary() | atom() | integer() => encode_value()}.

-spec encode(encode_value()) -> iodata().
```

And an advanced, callback-based API allowing for single-pass encoding
of custom data structures. This API is accompanied by a set of functions
facilitating the implementation of custom encoding callbacks.

```erlang
-type encoder() :: fun((dynamic(), encoder()) -> iodata()).

-spec encode(dynamic(), encoder()) -> iodata().

-spec encode_value(dynamic(), encoder()) -> iodata().
-spec encode_atom(atom(), encoder()) -> iodata().
-spec encode_integer(integer()) -> iodata().
-spec encode_float(float()) -> iodata().
-spec encode_list(list(), encoder()) -> iodata().
-spec encode_map(map(), encoder()) -> iodata().
-spec encode_map_checked(map(), encoder()) -> iodata().
-spec encode_key_value_list([{dynamic(), dynamic()}], encoder()) -> iodata().
-spec encode_key_value_list_checked([{dynamic(), dynamic()}], encoder()) -> iodata().
-spec encode_binary(binary()) -> iodata().
-spec encode_binary_escape_all(binary()) -> iodata().
```

The `encoder()` callback is invoked on every value during traversal.
The simple API specified above is equivalent to using the
`fun json:encode_value/2` function as the encoder.

The `*_checked/2` variants of the functions additionally verify that the
encoder doesn't produce repeated keys.

The default `encode_binary/1` function will emit unescaped Unicode values
as allowed by the specifications; however, for compatibility reasons
we provide the optional `encode_binary_escape_all/1` function
that will always produce purely ASCII messages, encoding all higher
Unicode values with the `\u` escape sequences.

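To show how the helper functions compose with `encode/2`, here is a minimal
sketch of an encoder that produces ASCII-only output by routing every binary
through `encode_binary_escape_all/1`. It assumes the functions behave as
specified above; the module name and `encode_ascii/1` are hypothetical.

```erlang
-module(json_ascii_example).
-export([encode_ascii/1]).

%% Encode a term so that the output contains only ASCII bytes.
encode_ascii(Term) ->
    iolist_to_binary(json:encode(Term, fun ascii_encoder/2)).

%% Binaries are fully escaped; every other value falls back to the
%% default behaviour, which calls back into this encoder for nested values.
ascii_encoder(Bin, _Encode) when is_binary(Bin) ->
    json:encode_binary_escape_all(Bin);
ascii_encoder(Other, Encode) ->
    json:encode_value(Other, Encode).
```

Because the escaping only changes the textual representation, decoding the
result gives back the original UTF-8 binaries.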
### Formatting and pretty-printing

This EEP further proposes an additional API for formatting (and pretty-printing)
JSON messages. This API consists of transforming a textual JSON message into
a formatted JSON message.

This is the most flexible solution that orthogonally supports
formatting the results of custom encoding functions such as those described
above, without adding the burden of complex formatting options in the middle
of the encoders.

Formatting isn't usually done in critical hot-paths of high-performance
services, therefore the overhead of a two-pass formatting is deemed acceptable.

```erlang
-type format_option() :: #{
    indent => iodata(),
    line_separator => iodata(),
    after_colon => iodata()
}.

-spec format(iodata()) -> iodata().
-spec format(iodata(), format_option()) -> iodata().
```

## Reference Implementation

[PR-8111][PR] implements the `encode/1`, `encode/2`, `decode/1`, and `decode/3`
functions as proposed in this EEP.

The formatting API and the support for incomplete message decoding are left
as a follow-up task.

## Appendix

### Example of a decoding trace

Given the following data:

```json
{"a": [[], {}, true, false, null, {"foo": "baz"}], "b": [1, 2.0, "three"]}
```

the decoding APIs will be called with the following arguments:

```erlang
object_start(Acc0) => Acc1
string(<<"a">>) => Str1
array_start(Acc1) => Acc2
empty_array() => Arr1
array_push(Arr1, Acc2) => Acc3
empty_object() => Obj1
array_push(Obj1, Acc3) => Acc4
array_push(true, Acc4) => Acc5
array_push(false, Acc5) => Acc6
null() => Null
array_push(Null, Acc6) => Acc7
object_start(Acc7) => Acc8
string(<<"foo">>) => Str2
string(<<"baz">>) => Str3
object_push(Str2, Str3, Acc8) => Acc9
object_finish(Acc9) => Obj2
array_push(Obj2, Acc7) => Acc10
array_finish(Acc10) => Arr2
object_push(Str1, Arr2, Acc1) => Acc11
string(<<"b">>) => Str4
array_start(Acc11) => Acc12
integer(<<"1">>) => Int1
array_push(Int1, Acc12) => Acc13
float(<<"2.0">>) => Float1
array_push(Float1, Acc13) => Acc14
string(<<"three">>) => Str5
array_push(Str5, Acc14) => Acc15
array_finish(Acc15) => Arr3
object_push(Str4, Arr3, Acc11) => Acc16
object_finish(Acc16) => Obj3
% final decode/3 return
{Obj3, Acc16, <<"">>}
```

### Example of a custom encoder

An example of a custom encoder that would support using a heuristic
to differentiate object-like lists of key-value pairs from plain
lists of values could look as follows:

```erlang
custom_encode(Value) -> json:encode(Value, fun encoder/2).

encoder(null, _Encode) -> <<"\"null\"">>;
encoder(nil, _Encode) -> <<"null">>;
encoder([{_, _} | _] = Value, Encode) -> json:encode_key_value_list(Value, Encode);
encoder(Other, Encode) -> json:encode_value(Other, Encode).
```

Another encoder that supports using Elixir `nil` as Null and protocols for
further customisation could look as follows:

```erlang
encoder(nil, _Encode) -> <<"null">>;
encoder(null, _Encode) -> <<"\"null\"">>;
%% Elixir structs are maps with a '__struct__' key; the atom has to be quoted
%% and matched with := for this to be a valid Erlang pattern.
encoder(#{'__struct__' := _} = Struct, Encode) -> 'Elixir.JSONProtocol':encode(Struct, Encode);
encoder(Other, Encode) -> json:encode_value(Other, Encode).
```

[1]: https://www.json.org/json-en.html
    "Introducing JSON"

[RFC]: https://datatracker.ietf.org/doc/html/rfc8259
    "The JavaScript Object Notation (JSON) Data Interchange Format"

[ECMA]: https://ecma-international.org/publications-and-standards/standards/ecma-404/
    "The JSON data interchange syntax"

[EEP]: https://github.com/erlang/eep/blob/master/eeps/eep-0018.md
    "EEP 18: JSON bifs"

[ERRINFO]: https://github.com/erlang/eep/blob/master/eeps/eep-0054.md
    "EEP 54: Provide more information about errors"

[JSONTestSuite]: https://github.com/nst/JSONTestSuite

[PR]: https://github.com/erlang/otp/pull/8111

## Copyright

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.