Poor error messages for untagged enums ("error occurred in:") #28

emk · 2022-06-07T17:06:39Z

openapi-interfaces generates some particularly confusing error messages. These look something like:

ERROR: error: components.interfaces: expected oneOf interface: unknown field `discriminator`, expected one of `description`, `title`, `oneOf`
error occurred in:
---
(entire interface's worth of YAML here)

Sometimes, we may get several of these errors nested inside of each other, with multiple error occurred in: lines showing larger and smaller chunks of context. This is awful, but it's surprisingly hard to fix. I'm creating this bug to so I remember why, and don't need to work it out from first principles every couple of years.

Untagged enums

An untagged enum uses #[serde(untagged)] to allow several different variants.

/// Custom interface type.
///
/// This is our main extension to OpenAPI. It allows specifying an object schema
/// in way that's less "validation-like" and more "type-like".
#[derive(Clone, Debug, Eq, PartialEq, Serialize)]
#[serde(untagged)]
#[allow(clippy::large_enum_variant)]
pub enum Interface {
    /// An interface that `$includes` another interface. We can't parse this
    /// until the inclusion has been computed.
    Includes(IncludesInterface),
    /// An type-union interface. This exists so it can use `$interface:
    /// ...#SameAsInterface` in a type union.
    OneOf(OneOfInterface),
    /// A fully-resolved interface definition.
    Basic(BasicInterface),
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase")]
pub struct IncludesInterface {
    /// This field is unique to this variant.
    #[serde(rename = "$includes")]
    base: String,

    /// This can be literally any value, which is treated as a JSON Merge patch
    /// against the base type.
    #[serde(flatten)]
    merge_patch: Map<String, Value>,
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct OneOfInterface {
    /// This field is unique to this variant.
    one_of: Vec<InterfaceRef>,

    #[serde(default)]
    description: Option<String>,

    #[serde(default, skip_serializing_if = "Option::is_none")]
    title: Option<String>,
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct BasicInterface {
    #[serde(default, skip_serializing_if = "BTreeMap::is_empty")]
    members: BTreeMap<String, Member>,

    // ...more
}

On the wire, these interfaces look like:

interface:
  SampleIncludesInterface:
    $includes: "BaseInterface"
    # Arbitrary unstructured YAML
  SampleOneOfInterface:
    oneOf: [...]
  SampleBasicInterface:
    members: { ... }

We can tell them apart by checking for $includes and oneOf.

`serde` parsing strategy

To parse Interface, serde attempts to parse each of the three variants, and it takes the first one which matches. If we expand the #[derive(Deserializer)] macro, we get:

// Recursive expansion of Deserialize! macro
// ==========================================

#[doc(hidden)]
#[allow(non_upper_case_globals,unused_attributes,unused_qualifications)]
const _:() = {
  #[allow(unused_extern_crates,clippy::useless_attribute)]
  extern crate serde as _serde;
  #[allow(unused_macros)]
  macro_rules!try {
    ($__expr:expr) = >{
      match$__expr {
        _serde::__private::Ok(__val) = >__val,_serde::__private::Err(__err) = >{
          return _serde::__private::Err(__err);    
          }
      }
    }
  }#[automatically_derived]
  impl < 'de>_serde::Deserialize< 'de>for Interface {
    fn deserialize<__D>(__deserializer:__D) -> _serde::__private::Result<Self,__D::Error>where __D:_serde::Deserializer< 'de> ,{
      let __content = try!(<_serde::__private::de::Content as _serde::Deserialize> ::deserialize(__deserializer));
      if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<IncludesInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::Includes){
        return _serde::__private::Ok(__ok);
        
      }if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<OneOfInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::OneOf){
        return _serde::__private::Ok(__ok);
        
      }if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<BasicInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::Basic){
        return _serde::__private::Ok(__ok);
        
      }_serde::__private::Err(_serde::de::Error::custom("data did not match any variant of untagged enum Interface"))
    }
    }
  };

In pseudocode, this is:

const content = deserialize_as::<serde::__private::de::Content>(&input)?;
if let Ok(includes) = deserialize_as::<IncludesInterface>(&content)? {
    return Ok(Interface::Includes(includes));
}
if let Ok(oneOf) = deserialize_as::<OneOfInterface>(&content)? {
    return Ok(Interface::OneOf(oneOf));
}
if let Ok(basic) = deserialize_as::<BasicInterface>(&content)? {
    return Ok(Interface::Basic(basic));
}
Err(serde::de::Error::custom("data did not match any variant of untagged enum Interface"))

The code first parses the object as a value of type serde::__private::de::Content (see below). This loses line & column number information, and it looks something like:

    pub enum Content<'de> {
        Bool(bool),

        U8(u8),
        U16(u16),
        U32(u32),
        // ...
}

There are a few unfortunate things going on in the generated deserializer:

If we can't match any interface, we print a generic error.
Even if we have a value containing the key "oneOf", and the actual error occurs many levels down inside, we still fail to match OneOfInterface, and we fall back to the unhelpful generic error.
We lose line number information for errors, which is incredibly frustrating in a large JSON file.

Things which don't work

We can't use serde::__private::de::Content directly, because it's unstable and only allowed to be used in code generated by serde_derive.
Even if we could use Content, it doesn't actually buy us anything, because it still loses line numbers.
We can replace Content with serde_yaml::Value, which is basically the same thing. It also lacks line numbers.
In fact, we could write our own Content type, and we'd still have trouble getting line numbers, because Deserialize is normally implemented in a format-independent way, expecting an arbitrary implementation of Deserializer. And many serde formats don't have line numbers, so Deserializer offers no generic way to get the current line number.
Even if we could use some trickery to reach through Deserializer to get line numbers, serde_yaml doesn't provide them, anyways.
The yaml_rust library used by serde_yaml actually does have some line number support. See Marker and mark. But we'd have to patch serde_rust to make this available.

Also, if we use a tagged enum like:

#[derive(Deserialize)]
#[serde(tag = "type")]
enum Foo {
    Bar { x: String },
    Baz { y: String },
}

...the generated code still goes through Content in some cases. So serde basically always follows the same strategy with enums in YAML-like formats that aren't "self-describing": Parse them to Content, lose line numbers, then re-parse Content.

Our general approach

We use a custom Deserialize implementation that parses the input to serde_yaml:Mapping, losing line number information. Then we use contains_key to manually check for each of our special keys. Once we figure out what kind of structure we're parsing, we parse it normally using deserialize_enum_helper to consume our serde_yaml:Mapping.

impl<'de> Deserialize<'de> for Interface {
    // Manually deserialize for slightly better error messages.
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        use serde_yaml::{Mapping, Value};

        // Parse it as raw YAML.
        let yaml = Mapping::deserialize(deserializer)?;

        // Look for `$includes`.
        let includes_key = Value::String(String::from("$includes"));
        let oneof_key = Value::String(String::from("oneOf"));
        if yaml.contains_key(&includes_key) {
            Ok(Interface::Includes(deserialize_enum_helper::<D, _>(
                "`$includes` interface",
                yaml,
            )?))
        } else if yaml.contains_key(&oneof_key) {
            Ok(Interface::OneOf(deserialize_enum_helper::<D, _>(
                "oneOf interface",
                yaml,
            )?))
        } else {
            Ok(Interface::Basic(deserialize_enum_helper::<D, _>(
                "interface",
                yaml,
            )?))
        }
    }
}

Drawbacks to this approach include:

We lose line number information early, and we can't get it back without ridiculous amounts of work.
We may have multiple levels of untagged enums nested inside each other.
All our errors are passed through serde as raw strings, not structured errors. So it's hard to coordinate between multiple levels of tagged enums.

Possible fixes

Any of Better message when failing to match any variant of an untagged enum serde-rs/serde#2157, Collect errors when deserializing untagged enums serde-rs/serde#1544 or this serde fork might allow us to get rid of the custom Deserialize. But that would still leave the loss of line-number information, and the confusing errors from nested untagged enum types.
We could at least improve errors somewhat if we (ugh) encoded our errors in a machine-readable format, and used that information to detect when we were inside multiple levels of tagged enums. This would allow us to clean our error messages up slightly, at least.

The text was updated successfully, but these errors were encountered:

emk · 2022-06-07T17:58:22Z

Ah, I had completely forgotten the last alternative here! Many of the various serde bindings provide a Spanned type, as discussed in dtolnay/serde-yaml#201. This has been officially rejected from serde_yaml.

See https://docs.rs/toml/latest/toml/struct.Spanned.html, which looks fairly clean. And https://docs.rs/json-spanned-value/latest/json_spanned_value/, which appears to only work for spanned::Value(???), and which relies on a hacked Reader type that uses thread-local storage and forces 1-byte-at-a-time reads.

We could implement something inspired by toml::Spanned by forking serde_yaml, if we really wanted it that badly.

Note that to implement $include, we convert two parsed data structures to JSON, merge them using JSON Merge Patch, and then parse them again. Preserving line numbers through this process would require a custom YAML SpannedValue = Spanned<UnspannedValueWithSpannedChildren> type, and a quick reimplementation of Merge Patch (which is like 5 lines). But this would further require reading spans from two kinds of input:

We'd need to be able to deserialize Spanned<T> from raw YAML.
We'd need to be able to deserialize Spanned<T> from our custom SpannedValue type.

So this is possible, but it's a huge project.

emk · 2022-06-07T18:04:48Z

Some other links from our original comments:

emk · 2022-06-07T19:40:47Z

OK, by inspecting the text of the error messages, and by adding two new custom Deserialize routines, we can transform this:

ERROR: error parsing merged "Widget"
  caused by: expected $ref, $interface or a schema with one of allOf, oneOf, or type: expected schema: data did not match any variant of untagged enum VecOrScalar
error occurred in:
---
type: foo

error occurred in:
---
type: foo

...into this:

ERROR: error parsing merged "Widget"
  caused by: unknown variant `foo`, expected one of `string`, `number`, `integer`, `object`, `array`, `boolean`, `null`
error occurred when expecting one of "string", "number", "integer", "object", "array", "boolean" or "null" in:
  ---
  foo
  ---
...which appeared when expecting schema in:
  ---
  type: foo
  ---

The most important part of this change is unknown variant 'foo', expected one of ..., which is much better than expected $ref, $interface or a schema with one of allOf, oneOf, or type: expected schema: data did not match any variant of untagged enum VecOrScalar.

These errors may still be hard to find a in big source file, but at least they should be a lot clearer. Sadly, line numbers are not on the roadmap.

emk added the wontfix This will not be worked on label Jun 7, 2022

emk added this to Backlog in Kanban for openapi-interfaces via automation Jun 7, 2022

emk closed this as completed in a0c67a9 Jun 7, 2022

Kanban for openapi-interfaces automation moved this from Backlog to Done Jun 7, 2022

emk mentioned this issue Jun 7, 2022

Include definitions from external OpenAPI YAML files #24

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor error messages for untagged enums ("error occurred in:") #28

Poor error messages for untagged enums ("error occurred in:") #28

emk commented Jun 7, 2022

emk commented Jun 7, 2022 •

edited

Loading

emk commented Jun 7, 2022

emk commented Jun 7, 2022

Poor error messages for untagged enums ("error occurred in:") #28

Poor error messages for untagged enums ("error occurred in:") #28

Comments

emk commented Jun 7, 2022

Untagged enums

serde parsing strategy

Things which don't work

Our general approach

Possible fixes

emk commented Jun 7, 2022 • edited Loading

emk commented Jun 7, 2022

emk commented Jun 7, 2022

`serde` parsing strategy

emk commented Jun 7, 2022 •

edited

Loading