
JSONSerialization: Use NSDecimalNumber for parsing JSON numbers. #1655

Closed

Conversation

@spevans (Collaborator) commented Aug 6, 2018

  • Replace strtol()/strtod() parsing with Scanner.scanDecimal().
    This allows parsing of integers > Int64.max (up to UInt64.max)
    without losing information due to the fallback parsing provided
    by parsing as a Double.

  • Also allows parsing larger integers into a Decimal.

Requires PR #1653 for some NSDecimalNumber and NSNumber fixes

@spevans requested a review from @itaiferber on August 6, 2018 06:48
@alblue (Contributor) commented Aug 7, 2018

@swift-ci please test

@spevans (Collaborator, Author) commented Aug 7, 2018

@alblue This will currently fail tests until #1653 is merged

@itaiferber (Contributor) left a comment

I'm concerned about three main things here:

  1. The cost of parsing as NSDecimalNumber every time, especially by spinning up a new Scanner for every single number read

  2. Potential new differences in precision and representation. I haven't followed the implementation of NSDecimalNumber on swift-corelibs-foundation, but is its implementation significantly different from NSDecimalNumber on Darwin? Because on Darwin, NSDecimalNumber has a significantly lower possible magnitude than Double does (IIRC 10^±127 vs. 10^±308), with the tradeoff being that it allows for higher precision. On Darwin, we prefer to parse as a Double if the value fits both within the magnitude and precision of a Double — the C standard guarantees this to be true if the string representation of the mantissa is less than DBL_DECIMAL_DIG in length, so this is what we check. If we cannot fit the number in a Double without loss of precision, then we prefer NSDecimalNumber if the value fits within its magnitude; otherwise we're willing to lose precision to fit in a Double. If the number is too large to fit in either, we bail. Here we'd be giving up on the more efficient representation in an Int/Double, along with the greater range that Double provides (unless I'm missing something and NSDecimalNumber is more capable here than it is on Darwin)

  3. @millenomi — this one is more for you. With all of the bridging work you did, do we have any additional testing/coverage for NSDecimalNumber and how that works out? Folks probably have a lot of code which currently checks as? Int and as? Double since for a long time those were the only two supported options here — I'd hate for an NSDecimalNumber to accidentally slip through with loss of precision or representation or something. Even so, the representation in NSDecimalNumber is not good enough for Ints. On Darwin, the following code prints nil due to loss of precision making the number no longer "cleanly" representable in an Int:

    let dec = Decimal(1123123123)
    let num = NSDecimalNumber(decimal: dec)
    let i = (num as NSNumber) as? Int
    print(i)

Overall I think the changes here are good in spirit but we'll likely need to maintain some of the existing representations we already have unless my concerns are unfounded due to my lack of experience here (which I would happily accept!). What we might be able to do is transition over to more of what we have on Darwin — prefer Double unless we can get away with an NSDecimalNumber.
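
A minimal sketch of that Darwin-style policy, assuming a DBL_DECIMAL_DIG (17) digit cutoff and using Decimal(string:) as a stand-in for the NSDecimalNumber parsing path; the function name and thresholds are illustrative, not the actual implementation:

    import Foundation

    // Sketch: prefer Double when the mantissa is short enough to round-trip
    // losslessly; otherwise try Decimal for its extra precision; otherwise
    // fall back to a possibly lossy Double.
    func preferredNumber(for string: String) -> Any? {
        // Count significant digits in the mantissa (ignore sign, '.', exponent).
        let mantissa = string.prefix { $0 != "e" && $0 != "E" }
        let digitCount = mantissa.filter { ("0"..."9").contains($0) }.count

        // Fewer than DBL_DECIMAL_DIG (17) digits round-trip through a Double.
        if digitCount < 17, let d = Double(string), d.isFinite {
            return d
        }
        // Needs more precision: prefer Decimal if the magnitude fits
        // (roughly 10^±127 for Decimal vs. 10^±308 for Double).
        if let dec = Decimal(string: string), dec.isFinite {
            return dec
        }
        // Too precise for Double and too large for Decimal: accept the loss.
        if let d = Double(string), d.isFinite {
            return d
        }
        return nil
    }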

// Validate the first few characters look like a JSON encoded number:
// Optional '-' sign at start; only 1 leading zero if followed by a decimal point.
var index = input
func nextDigit() -> UInt8? {
@itaiferber (Contributor):

Should this be more clearly called nextASCII(), nextCharacter(), or similar?

let MINUS = UInt8(ascii: "-")

func parseTypedNumber(_ string: String) throws -> (Any, IndexDistance)? {
let scan = Scanner(string: string)
@itaiferber (Contributor):

It's worth testing the performance of creating a new Scanner for every single number we try to parse — I suspect that this is a potentially prohibitively expensive operation.
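
One rough way to measure that cost (a hypothetical micro-benchmark; scanDecimal(_:) here is the Darwin-style API and the iteration count is arbitrary):

    import Foundation

    // Rough micro-benchmark: create a fresh Scanner for every number scanned,
    // the way the proposed parser would, and time the whole loop.
    let iterations = 100_000
    let input = "12345.6789"

    let start = Date()
    for _ in 0..<iterations {
        let scanner = Scanner(string: input)   // new Scanner per number
        var decimal = Decimal()
        _ = scanner.scanDecimal(&decimal)
    }
    print("\(iterations) scans took \(Date().timeIntervalSince(start))s")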

@itaiferber (Contributor) commented:

Note also that this conflicts with the work in #1654 so we might need to coordinate something here.

@spevans (Collaborator, Author) commented Aug 7, 2018

@itaiferber Thanks for your feedback. I hadn't actually considered the cost of the Scanner startup at all, to be honest, so that may be a point against this solution.

As to the difference between Double and Decimal: I was mostly aiming to hold more information for integers > UInt64.max and also to allow parsing of numbers into Decimal as well as Double.
Of course, these types both have their advantages and disadvantages and no one type is better overall, and I hadn't considered the large range of Double being more important.

As to the solution in #1654, I did originally implement that same solution and there are a number of issues with it. Firstly, it can involve up to 3 separate string parses. Also, care must be taken with strtoull, as it will return positive values for numbers between Int64.min and -UInt64.max, i.e. it effectively negates the value modulo 2^64, yielding a positive result, unless a '-' sign is checked for separately.
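
To illustrate (a small snippet; the printed values assume 64-bit UInt64 and follow the C standard's strtoull semantics):

    import Foundation   // pulls in strtoull via Darwin/Glibc

    // strtoull accepts a leading '-' and negates in unsigned (mod 2^64)
    // arithmetic, so negative inputs silently come back as positive values.
    print(strtoull("-5", nil, 10))                     // 18446744073709551611 (2^64 - 5)
    print(strtoull("-18446744073709551615", nil, 10))  // 1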

As for the bridging issues, my tests showed up some problems but these should be resolved in #1653 .

However, thinking about this problem further, I realised the real issue is that we parse the string before knowing how we want to use it. So I came up with the following solution, which is to store the number as a String and only parse it when it is unboxed:

class NSJSONNumber: NSNumber {
    let value: String

    init?(_ str: String) {
        var _value = ""
        for ch in str {
            // Validate this matches the JSON number format and return nil otherwise
            // ...
            _value.append(ch)
        }
        value = _value
    }

    var valueAsDouble: Double? { return Double(value) }
    var valueAsDecimal: Decimal? { return Decimal(string: value) }
    var valueAsUInt64: UInt64? { return UInt64(value) }
    // etc.
}
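
For example (hypothetical usage of the NSJSONNumber sketch above):

    // The string is validated once at construction, then parsed only for
    // the specific type requested.
    if let num = NSJSONNumber("18446744073709551615") {   // UInt64.max, > Int64.max
        print(num.valueAsUInt64 as Any)   // Optional(18446744073709551615)
        print(num.valueAsDouble as Any)   // Optional(1.8446744073709552e+19), lossy
    }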

This would ensure that full JSON number validation is performed, which currently doesn't happen, and the value is only parsed once using the correct parser. Does this seem reasonable, or am I missing anything here?

@itaiferber (Contributor) commented:

@spevans I was quickly putting together an approach similar to #1654 — I agree that it can involve up to 3 parses but they should be relatively short and quick as the input is limited in length. Otherwise, though: how do we know how we want to use a parsed number? Aren't these values just returned to the developer directly? Or are you proposing a further deferral that will only parse the value once it's read and cast out (e.g. a number that totally defers evaluation until you try to access it)?

(If so, it doesn't necessarily improve things, as it's totally valid to write if let i = num as? Int /* fail */ { ... } else if let u = num as? UInt64 /* fail */ { ... } else if let d = num as? Double /* success */ { ... }, which doesn't decrease the number of attempted parses.)

@spevans (Collaborator, Author) commented Aug 7, 2018

I was thinking of not parsing until it's accessed, and with Codable doesn't it know the type it wants? With the #1654 solution every Double is parsed 3 times, which only matches the worst case you listed above.

Also, if it is a subclass of NSNumber the other properties are still available (e.g. intValue), which is currently lossy anyway for out-of-range numbers.

@spevans (Collaborator, Author) commented Aug 7, 2018

FYI, here is the version I had using the strto* functions:

func parseNumber(_ input: Index, options opt: JSONSerialization.ReadingOptions) throws -> (Any, Index)? {
        func parseTypedNumber(_ address: UnsafePointer<UInt8>, count: Int) throws -> (Any, IndexDistance)? {
            let temp_buffer_size = 64
            var temp_buffer = [CChar](repeating: 0, count: temp_buffer_size)

            return try temp_buffer.withUnsafeMutableBufferPointer { (buffer: inout UnsafeMutableBufferPointer<CChar>) throws -> (Any, IndexDistance)? in
                memcpy(buffer.baseAddress!, address, min(count, temp_buffer_size - 1)) // ensure null termination

                var chIndex = 0
                var ch = buffer[chIndex]
                let isNegative: Bool
                if ch == CChar(UInt8(ascii: "-")) {
                    isNegative = true
                    chIndex = chIndex + 1
                    ch = buffer[chIndex]
                } else {
                    isNegative = false
                }

                guard ch >= CChar(UInt8(ascii: "0")) && ch <= CChar(UInt8(ascii: "9")) else {
                    return nil
                }

                // Only allow 1 leading zero and it must be followed by a decimal point
                if ch == CChar(UInt8(ascii: "0")) && buffer[chIndex + 1] == CChar(UInt8(ascii: "0")) {
                    throw NSError(domain: NSCocoaErrorDomain, code: CocoaError.propertyListReadCorrupt.rawValue,
                                  userInfo: ["NSDebugDescription" : "Number with leading zero around character 16." ])
                }

                let startPointer = buffer.baseAddress!
                let endPointer = UnsafeMutablePointer<UnsafeMutablePointer<Int8>?>.allocate(capacity: 1)
                defer { endPointer.deallocate() }

                func parseAsDouble() -> (Double, IndexDistance) {
                    let doubleResult = strtod(startPointer, endPointer)
                    let distance = startPointer.distance(to: endPointer[0]!)
                    return (doubleResult, distance)
                }

                func parseAsInt64() -> (Int64?, IndexDistance) {
                    errno = 0
                    let int64Result = strtoll(startPointer, endPointer, 10)
                    let outOfRange = ((int64Result == Int64.max || int64Result == Int64.min) && errno == ERANGE)
                    let invalid = (int64Result == 0 && errno == EINVAL)
                    let distance = startPointer.distance(to: endPointer[0]!)

                    if distance == 0 || outOfRange || invalid {
                        return (nil, distance)
                    } else {
                        return (int64Result, distance)
                    }
                }

                func parseAsUInt64() -> (UInt64?, IndexDistance) {
                    // strtoull applies a leading '-' by negating in unsigned arithmetic,
                    // always returning a positive value, so check against the flag set earlier
                    guard !isNegative else { return (nil, 0) }
                    errno = 0
                    let uint64Result = strtoull(startPointer, endPointer, 10)
                    let outOfRange = (uint64Result == UInt64.max && errno == ERANGE)
                    let invalid = (uint64Result == 0 && errno == EINVAL)
                    let distance = startPointer.distance(to: endPointer[0]!)

                    if distance == 0 || outOfRange || invalid {
                        return (nil, distance)
                    } else {
                        return (uint64Result, distance)
                    }
                }

                // Try parsing as a Double, checks that it is some sort of number
                let (doubleResult, doubleDistance) = parseAsDouble()
                guard doubleDistance > 0 else { return nil }

                let (int64Result, int64distance) = parseAsInt64()
                if let result = int64Result {
                    if int64distance == doubleDistance {
                        // The parsed number can be represented as an Int64
                        return (NSNumber(value: result), int64distance)
                    } else {
                        return (NSNumber(value: doubleResult), doubleDistance)
                    }
                }

                // Int64 wasn't sufficient, try parsing as a UInt64
                let (uint64Result, uint64distance) = parseAsUInt64()
                if let result = uint64Result {
                    return (NSNumber(value: result), uint64distance)
                }

                // This double result will be 0 or infinity for out of range values. How should they be
                // treated?
                return (NSNumber(value: doubleResult), doubleDistance)
            }
        }

        if source.encoding == .utf8 {
            return try parseTypedNumber(source.buffer.baseAddress!.advanced(by: input), count: source.buffer.count - input).map { return ($0.0, input + $0.1) }
        }
        else {
            var numberCharacters = [UInt8]()
            var index = input
            while let (ascii, nextIndex) = source.takeASCII(index), JSONReader.numberCodePoints.contains(ascii) {
                numberCharacters.append(ascii)
                index = nextIndex
            }
            numberCharacters.append(0)

            return try numberCharacters.withUnsafeBufferPointer {
                try parseTypedNumber($0.baseAddress!, count: $0.count)
                }.map { return ($0.0, index) }
        }
    }

@spevans (Collaborator, Author) commented Aug 8, 2018

Closed in favour of #1657
