Audit JSON validation coverage against RFC 8259 and industry test suites #37

@membphis

Motivation

As a JSON parser library, correctness and stability are fundamental requirements. Before wider adoption, we should systematically audit our validation coverage against:

  1. RFC 8259 (The JSON Data Interchange Format) - the authoritative specification
  2. JSONTestSuite (https://github.com/nst/JSONTestSuite) - the de facto industry standard for parser validation, containing 300+ edge cases
  3. Peer implementations - lua-cjson and lua-resty-simdjson both have mature test suites we can learn from

This audit will identify gaps in our validation logic and help prioritize fixes before they become breaking changes.


Current State

What We Validate Well

| Category | Status | Location |
|---|---|---|
| Bracket/brace pairing | ✅ Complete | scan/scalar.rs, scan/mod.rs::validate_brackets |
| String escape sequences | ✅ Complete | decode/string.rs - all 8 escapes + \uXXXX + surrogate pairs |
| Numeric parsing (i64/f64) | ✅ Complete | decode/number.rs - overflow detection, type mismatch |
| Path resolution | ✅ Complete | path.rs - keys, indices, nesting |
| Type detection | ✅ Complete | doc.rs::type_of |
| SIMD/scalar parity | ✅ Complete | scanner_crosscheck.rs - proptest with 2000 cases |
| FFI safety | ✅ Complete | ffi.rs - panic barrier, null pointer checks |

What We Don't Validate (Potential Gaps)

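| Category | Current Behavior | RFC 8259 Requirement | Risk |
|---|---|---|---|
| Leading zeros in numbers | Accepted (007 -> 7) | Not permitted by the number grammar (§6) | Medium |
| Leading plus sign | Accepted (+1 -> 1) | Not permitted by the number grammar (§6) | Medium |
| Bare decimal point | Accepted (.5, 1.) | Not permitted by the number grammar (§6) | Medium |
| Max nesting depth | Unlimited | Implementation-defined (§9 allows a depth limit) | Medium (stack overflow) |
| Control chars in strings | Accepted unescaped (0x00-0x1F) | MUST be escaped (§7) | Low |
| Invalid UTF-8 sequences | Passed through | MUST be valid UTF-8 (§8.1) | Low |
| Trailing content after root | Ignored | Not permitted by the JSON-text grammar (§2) | Low |
| UTF-8 BOM | Not handled | Parsers MAY ignore a BOM (§8.1) | Low |
| Duplicate object keys | Last wins (implicit) | Names SHOULD be unique; handling is implementation-defined (§4) | Low |

Note: RFC 8259 §9 requires parsers to accept every conforming text but lets them accept extensions, so rejecting the non-conforming forms above is a strictness goal we set for this library rather than a literal MUST in the RFC.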

Test Coverage Comparison

lua-cjson tests:

  • RFC 4627 example files
  • Configurable nesting depth limits (the tests lower the limit to 5; the library default is 1000)
  • Invalid number detection (hex, leading zeros, Inf/NaN)
  • Locale handling (comma decimal separators)
  • Comment support (single/multi-line)

lua-resty-simdjson tests:

  • Deep nesting (10 levels)
  • Large payloads (2100+ elements)
  • Reentrancy behavior
  • Numeric precision (14-16 digits)
  • Null compatibility (ngx.null, cjson.null)

JSONTestSuite categories:

  • y_* - roughly 100 cases parsers MUST accept
  • n_* - roughly 190 cases parsers MUST reject
  • i_* - roughly 35 implementation-defined edge cases

Spec

Phase 1: RFC 8259 Compliance Test Suite (Next Step)

Goal: Build a comprehensive test suite based on RFC 8259, referencing lua-cjson's test approach.

Reference: https://github.com/openresty/lua-cjson/tree/master/tests

Test Categories to Cover:

1.1 Valid JSON (MUST accept)

// Primitive values
"null"
"true"
"false"
"0"
"-0"
"123"
"-456"
"3.14"
"-2.718"
"1e10"
"1E10"
"1e+10"
"1e-10"
"1.5e2"
"\"\""                    // empty string
"\"hello\""
"\"hello\\nworld\""       // escaped newline
"\"\\u0041\""             // unicode escape -> "A"
"\"\\uD83D\\uDE00\""      // surrogate pair -> emoji
"[]"                      // empty array
"[1,2,3]"
"[1, 2, 3]"               // with whitespace
"{}"                      // empty object
"{\"a\":1}"
"{\"a\": 1, \"b\": 2}"    // with whitespace
"[{\"a\":[1,{\"b\":2}]}]" // nested structures

1.2 Invalid JSON (MUST reject)

// Structural errors
""                        // empty input
"{"                       // unclosed brace
"["                       // unclosed bracket
"{]"                      // mismatched brackets
"[}"                      // mismatched brackets
"{\"a\":}"                // missing value
"{\"a\"}"                 // missing colon and value
"[,]"                     // leading comma
"[1,]"                    // trailing comma
"{\"a\":1,}"              // trailing comma in object
"[1 2]"                   // missing comma

// Invalid numbers
"+1"                      // leading plus
"01"                      // leading zero
"00"                      // leading zeros
".5"                      // no integer part
"1."                      // no fraction part
"1.e5"                    // no fraction digits
"0x1F"                    // hex notation
"NaN"                     // not a JSON value
"Infinity"                // not a JSON value
"-Infinity"               // not a JSON value
"1e"                      // incomplete exponent
"1e+"                     // incomplete exponent

// Invalid strings
"\"hello"                 // unclosed string
"'hello'"                 // single quotes
"\"\\x41\""               // invalid escape sequence
"\"\\u00G0\""             // invalid hex in unicode
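"\"\\uD800\""             // lone high surrogate (grammar permits it, see RFC 8259 §8.2; we choose to reject)
"\"\\uDC00\""             // lone low surrogate (grammar permits it, see RFC 8259 §8.2; we choose to reject)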

// Invalid literals
"TRUE"                    // wrong case
"False"                   // wrong case
"NULL"                    // wrong case
"nil"                     // not JSON
"undefined"               // not JSON

// Trailing content
"{}[]"                    // multiple values
"1 2"                     // multiple values
"true false"              // multiple values

1.3 Whitespace Handling

// Valid whitespace (space, tab, newline, carriage return)
" { } "
"\t{\t}\t"
"\n{\n}\n"
"\r{\r}\r"
"{ \"a\" : 1 }"
"[\n  1,\n  2\n]"

1.4 String Edge Cases

// All valid escape sequences
"\"\\\"\""                // \"
"\"\\\\\""                // \\
"\"\\/\""                 // \/
"\"\\b\""                 // \b (backspace)
"\"\\f\""                 // \f (form feed)
"\"\\n\""                 // \n (newline)
"\"\\r\""                 // \r (carriage return)
"\"\\t\""                 // \t (tab)

// Unicode edge cases
"\"\\u0000\""             // U+0000 (NUL)
"\"\\u007F\""             // U+007F (DEL)
"\"\\u0080\""             // U+0080 (first non-ASCII)
"\"\\uFFFF\""             // U+FFFF (BMP limit)
"\"\\uD834\\uDD1E\""      // U+1D11E (surrogate pair)

1.5 Number Edge Cases

// Valid numbers
"0"
"-0"
"1"
"-1"
"123456789"
"1.0"
"1.5"
"-1.5"
"1e1"
"1E1"
"1e+1"
"1e-1"
"1.5e10"
"-1.5e-10"
"1E100"                   // large exponent
"1e-100"                  // small exponent
"9223372036854775807"     // i64::MAX
"-9223372036854775808"    // i64::MIN

// Invalid numbers (to be rejected)
"+1"
"01"
"1."
".1"
"1e"
"1e+"
"1e-"
"Infinity"
"-Infinity"
"NaN"

Implementation:

  • Create tests/rfc8259_compliance.rs
  • Organize tests by category with clear documentation
  • Each test should reference the relevant RFC section

Acceptance Criteria:

  • All valid JSON cases parse successfully
  • All invalid JSON cases return appropriate errors
  • Test file is well-documented with RFC references
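
One possible layout for tests/rfc8259_compliance.rs is a table of cases, each tagged with the RFC section it exercises, plus a small helper that reports mismatches. The sketch below is only an illustration: `Case`, `INVALID_NUMBERS`, and `mismatches` are hypothetical names, and `parse` stands in for whatever entry point the crate exposes.

```rust
/// One compliance case: the input text plus the RFC 8259 section it exercises.
/// (Hypothetical type, not part of the existing crate.)
pub struct Case {
    pub input: &'static str,
    pub rfc_section: &'static str,
    pub note: &'static str,
}

/// A few of the invalid-number cases from section 1.2 above.
pub const INVALID_NUMBERS: &[Case] = &[
    Case { input: "+1", rfc_section: "6", note: "leading plus" },
    Case { input: "01", rfc_section: "6", note: "leading zero" },
    Case { input: "1.", rfc_section: "6", note: "no fraction digits" },
    Case { input: "1e", rfc_section: "6", note: "incomplete exponent" },
];

/// Runs `parse` over `cases` and describes every case whose outcome
/// differs from `expect_ok`. `parse` returns true when the input is accepted.
pub fn mismatches(
    cases: &[Case],
    expect_ok: bool,
    parse: impl Fn(&str) -> bool,
) -> Vec<String> {
    cases
        .iter()
        .filter(|c| parse(c.input) != expect_ok)
        .map(|c| format!("RFC 8259 section {}: {:?} ({})", c.rfc_section, c.input, c.note))
        .collect()
}
```

A test per category can then assert that `mismatches(INVALID_NUMBERS, false, parse)` is empty, which keeps the RFC reference next to each failing input in the report.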

Phase 2: Number Format Validation

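Goal: Reject non-RFC-compliant number formats during the parser's Phase 2 (decode) step. "Phase 2" here is the parser's internal decode stage, not a phase of this plan.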

Spec:

number = [ "-" ] int [ frac ] [ exp ]
int    = "0" / ( digit1-9 *digit )
frac   = "." 1*digit
exp    = ("e" / "E") [ "+" / "-" ] 1*digit

Reject:

  • Leading zeros: 007, 00.5
  • Leading plus: +1, +0
  • Bare decimal: .5, 1., -.5
  • Hex notation: 0x1F
  • Special values: NaN, Infinity, -Infinity

Implementation: Add validate_number_format() in decode/number.rs, called before parse_i64/parse_f64.
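
A minimal sketch of how the grammar above could be checked byte by byte. The function name comes from this issue, but the signature and the `bool` return are assumptions; the real version would presumably return the crate's error type.

```rust
/// Returns true if `s` matches the RFC 8259 section 6 `number` production.
/// Signature is an assumption made for this sketch.
pub fn validate_number_format(s: &[u8]) -> bool {
    let at = |i: usize| s.get(i).copied();
    let mut i = 0;

    // Optional minus sign. A leading '+' is not part of the grammar.
    if at(i) == Some(b'-') {
        i += 1;
    }

    // int = "0" / ( digit1-9 *digit )
    match at(i) {
        Some(b'0') => i += 1,
        Some(b'1'..=b'9') => {
            while matches!(at(i), Some(b'0'..=b'9')) {
                i += 1;
            }
        }
        _ => return false,
    }

    // frac = "." 1*digit
    if at(i) == Some(b'.') {
        i += 1;
        let start = i;
        while matches!(at(i), Some(b'0'..=b'9')) {
            i += 1;
        }
        if i == start {
            return false; // "1." has no fraction digits
        }
    }

    // exp = ("e" / "E") [ "+" / "-" ] 1*digit
    if matches!(at(i), Some(b'e' | b'E')) {
        i += 1;
        if matches!(at(i), Some(b'+' | b'-')) {
            i += 1;
        }
        let start = i;
        while matches!(at(i), Some(b'0'..=b'9')) {
            i += 1;
        }
        if i == start {
            return false; // "1e" and "1e+" have no exponent digits
        }
    }

    // Anything left over (for example the "x1F" in "0x1F") is invalid.
    i == s.len()
}
```

Because the check only walks the bytes once and allocates nothing, it can run in front of both parse_i64 and parse_f64 without changing their behavior on valid input.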


Phase 3: Nesting Depth Limit

Goal: Prevent stack overflow on maliciously deep input.

Spec:

  • Default max depth: 128 (the same as serde_json's default recursion limit; simdjson's default is 1024)
  • Configurable via Document::parse_with_options(buf, Options { max_depth: 512 })
  • Error: QJD_NESTING_TOO_DEEP (new error code)

Implementation: Track depth in validate_brackets() or add a separate pass.
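
As a rough illustration of the separate-pass option, the sketch below counts open brackets iteratively and skips bracket characters that appear inside strings. `check_nesting_depth` and its return type are hypothetical; mapping the `Err` case to QJD_NESTING_TOO_DEEP is left to the caller.

```rust
/// Scans `buf` once and returns the deepest nesting level seen, or
/// `Err(offset)` of the opening bracket that exceeds `max_depth`.
/// Hypothetical helper; it does not check that brackets are balanced.
pub fn check_nesting_depth(buf: &[u8], max_depth: usize) -> Result<usize, usize> {
    let mut depth = 0usize;
    let mut deepest = 0usize;
    let mut in_string = false;
    let mut escaped = false;

    for (offset, &b) in buf.iter().enumerate() {
        if in_string {
            // Brackets inside strings are data, so only track where the string ends.
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'[' | b'{' => {
                depth += 1;
                if depth > max_depth {
                    return Err(offset);
                }
                deepest = deepest.max(depth);
            }
            // Mismatched closers are left to validate_brackets(); avoid underflow here.
            b']' | b'}' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    Ok(deepest)
}
```

The loop uses no recursion, so a 10,000-deep input is rejected at the first bracket past the limit without growing the call stack.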


Phase 4: String Content Validation

Goal: Reject unescaped control characters per RFC 8259.

Spec:

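  • Reject raw bytes 0x00-0x1F inside strings; they must be written as escapes (for example \n or \u001F)
  • Accept 0x7F (DEL) unescaped: RFC 8259 §7 only requires escaping U+0000 through U+001F, and rejecting DEL would fail JSONTestSuite's y_string_with_del_character case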

Implementation: Add check in scan/scalar.rs during string scanning.
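
A sketch of the check for the 0x00-0x1F range, assuming the scanner can hand over the bytes between a string's opening and closing quotes. `find_raw_control` is a hypothetical helper name.

```rust
/// Returns the offset of the first raw control byte (0x00-0x1F) in `body`,
/// where `body` is the content between a string's quotes.
/// Hypothetical helper for illustration only.
pub fn find_raw_control(body: &[u8]) -> Option<usize> {
    // An escaped control character is spelled with printable bytes
    // (backslash + 'n', or backslash + "u000A"), so any raw byte below
    // 0x20 is an error regardless of what precedes it.
    body.iter().position(|&b| b < 0x20)
}
```

For example, the three bytes `a`, TAB, `b` yield `Some(1)`, while the four bytes `a`, `\`, `t`, `b` yield `None`.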


Phase 5: UTF-8 Validation

Goal: Reject invalid UTF-8 sequences in string values.

Spec:

  • Validate UTF-8 during the parser's Phase 2 (string decode) step
  • Reject overlong encodings, surrogate halves, out-of-range codepoints

Implementation: Use std::str::from_utf8() or a dedicated validator in decode/string.rs.
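
The standard-library route could look like the sketch below. `validate_utf8` is a hypothetical wrapper; `std::str::from_utf8` and `Utf8Error::valid_up_to` are the real std APIs, and the former already rejects overlong encodings, encoded surrogates, and code points above U+10FFFF.

```rust
/// Validates `bytes` as UTF-8. On failure, returns the offset of the first
/// invalid byte so the caller can report a position.
/// Hypothetical wrapper around the standard library validator.
pub fn validate_utf8(bytes: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}
```

A dedicated SIMD validator would only be worth swapping in if profiling shows this call dominating string decode.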


Phase 6: Trailing Content Detection

Goal: Reject input with non-whitespace after the root value.

Spec:

  • After the parser's Phase 1 scan, verify that only whitespace follows the end of the root value (a closing bracket/brace, or the last byte of a scalar root)
  • Error: QJD_TRAILING_CONTENT (new error code)

Implementation: Check in Document::parse() after scan() returns.
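
A small sketch of the post-scan check, assuming scan() can report the offset one past the end of the root value. `check_trailing_content` and the `root_end` parameter are hypothetical; the `Err` offset would be mapped to QJD_TRAILING_CONTENT.

```rust
/// Returns `Err(offset)` of the first non-whitespace byte at or after
/// `root_end`, where `root_end` is one past the last byte of the root value.
/// Hypothetical helper for illustration only.
pub fn check_trailing_content(buf: &[u8], root_end: usize) -> Result<(), usize> {
    // RFC 8259 section 2 whitespace: space, horizontal tab, line feed, carriage return.
    let is_ws = |b: u8| matches!(b, b' ' | b'\t' | b'\n' | b'\r');

    // An out-of-range `root_end` is treated as "nothing follows".
    let tail = buf.get(root_end..).unwrap_or(&[]);
    match tail.iter().position(|&b| !is_ws(b)) {
        Some(rel) => Err(root_end + rel),
        None => Ok(()),
    }
}
```

With this shape, `{}[]` with `root_end = 2` reports offset 2, and `1 2` with `root_end = 1` reports offset 2.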


Phase 7: JSONTestSuite Integration

Goal: Automated regression testing against the industry standard.

Spec:

  • Add tests/json_test_suite.rs
  • Download/vendor JSONTestSuite test files
  • Run all y_* files: assert parse succeeds
  • Run all n_* files: assert parse fails
  • Run i_* files: document our behavior (no assertion)
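
The runner for tests/json_test_suite.rs might be shaped like the sketch below. `Expectation`, `expectation_for`, and `run_suite` are hypothetical names, the vendored directory is whatever path the files end up under, and `parse_ok` stands in for the crate's parse entry point.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What JSONTestSuite expects for a file, derived from its name prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Expectation {
    MustAccept,            // y_*
    MustReject,            // n_*
    ImplementationDefined, // i_*
}

/// Classifies a test file by prefix; non-JSON or unprefixed files are skipped.
pub fn expectation_for(path: &Path) -> Option<Expectation> {
    let name = path.file_name()?.to_str()?;
    if !name.ends_with(".json") {
        return None;
    }
    if name.starts_with("y_") {
        Some(Expectation::MustAccept)
    } else if name.starts_with("n_") {
        Some(Expectation::MustReject)
    } else if name.starts_with("i_") {
        Some(Expectation::ImplementationDefined)
    } else {
        None
    }
}

/// Runs every classified file in `dir` through `parse_ok` (true = accepted)
/// and returns one message per y_/n_ mismatch. i_ results are only printed.
pub fn run_suite(dir: &Path, parse_ok: impl Fn(&[u8]) -> bool) -> io::Result<Vec<String>> {
    let mut failures = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let Some(expectation) = expectation_for(&path) else {
            continue;
        };
        let accepted = parse_ok(&fs::read(&path)?);
        match (expectation, accepted) {
            (Expectation::MustAccept, false) => {
                failures.push(format!("rejected valid input: {}", path.display()))
            }
            (Expectation::MustReject, true) => {
                failures.push(format!("accepted invalid input: {}", path.display()))
            }
            (Expectation::ImplementationDefined, _) => {
                println!("{}: accepted = {}", path.display(), accepted)
            }
            _ => {}
        }
    }
    Ok(failures)
}
```

Collecting all mismatches before asserting, rather than failing on the first one, gives a full gap list from a single test run.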

Non-Goals (Explicitly Out of Scope)

  1. BOM handling - Callers should strip BOM before passing to us
  2. Comment support - Not part of RFC 8259
  3. Trailing commas - Not part of RFC 8259
  4. Duplicate key policy - Current "last wins" behavior is acceptable
  5. Streaming/incremental parsing - Different API surface

Success Criteria

  • All y_* inputs from JSONTestSuite parse successfully
  • All n_* inputs from JSONTestSuite are rejected with an appropriate error
  • No stack overflow on 10,000-deep nesting
  • Documented behavior for all i_* edge cases

Labels: enhancement