Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decodable sequence #15

Closed
josh opened this issue Apr 1, 2020 · 9 comments
Closed

Decodable sequence #15

josh opened this issue Apr 1, 2020 · 9 comments
Labels
enhancement New feature or request question Further information is requested
Projects

Comments

@josh
Copy link
Contributor

josh commented Apr 1, 2020

Hey @dehesa!

I think I was too slow, it looks like you already implemented the sequential buffering strategy for 0.5.2. I was taking some time to learn about the Decoder protocol internals.

What I learned is that it's possible to decode an UnkeyedDecodingContainer into any sequence without buffering. ShadowDecoder.UnkeyedContainer seems to do a good job of iteratively decoding each item.

The README demos decoding into a preallocated array.

let decoder = CSVDecoder { $0.bufferingStrategy = .sequential }
let content: [Student] = try decoder([Student].self, from: URL("~/Desktop/Student.csv"))

Instead of an Array, I created a custom sequence wrapper. With the added benefit of customizing how the result is wrapped. I had my 🤞 that AnySequence was Decodable, but it's not.

class DecodableSequence<T: Decodable>: Sequence, IteratorProtocol, Decodable {
    private var container: UnkeyedDecodingContainer

    required init(from decoder: Decoder) throws {
        container = try decoder.unkeyedContainer()
    }

    func next() -> Result<T, Error>? {
        if container.isAtEnd {
            return nil
        }
        // or could use a try! here
        return Result { try container.decode(T.self) }
    }
}

Then:

let decoder = CSVDecoder { $0.bufferingStrategy = .sequential }
let url = URL(fileURLWithPath: "Student.csv")
let results = try decoder.decode(DecodableSequence<Student>.self, from: url)

for result in results {
    print(try result.get())
}

Any thoughts on this technique or Alternatives? Would a sequence wrapper like this be useful to include as part of the library?

Thanks!
@josh

@dehesa dehesa added question Further information is requested enhancement New feature or request labels Apr 2, 2020
@dehesa
Copy link
Owner

dehesa commented Apr 2, 2020

Hi @josh,

Yeah, last week was pretty productive 😄 I didn't manage to do everything I wanted (the .unrequested decoding buffering strategy is still missing), but I succeeded on having full Codable adoption, which makes me very happy.

Referring to your idea:

  1. I quite like it.

    Having a counterpart to CSVReader.readRow() on the decoder seems to be worth while. It would help parsing huge files, since the result is decoded on demand (currently you can only receive the whole decoded file). That said...

  2. Foundation's JSONDecoder and PropertyListDecoder have established the expectation that calling .decode(...) on a decoder will return the whole instance being decoded.

    We would need to come up with a different API that makes it clear you are decoding on demand. The solution you propose is, under my opinion, confusing because the user would expect to fully decode the type they are writing as argument. However, you are just returning a sequence to decode later on.

  3. I don't understand why your DecodableSequence needs to conform to Decodable. There doesn't seem to be a need for that.

    At a high-level:

    let decoder = CSVDecoder()
    let sequence = decoder.__nameForOnDemandRowDecoding__(CustomRowType.self, from: url)
    for row in sequence { /* Do what you want with it */ }

    The sequence could be just a wrapper over the shadowDecoder returning a unkeyed container, so you can reuse everything else in the library.

    extension CSVDecoder {
        internal ShadowSequence<Element>: Sequence, IteratorProtocol {
            private let decoder: ShadowDecoder
            
            func next() -> Result<Element,Error>? {
                guard !__ask_self.decoder.source_if_is_at_end() else { return nil }
                do {
                    try self.decoder.unkeyedContainer().decode(Element.self)
                } catch let error {
                    return .failure(error)
                }
            }
        }
    }
  4. I am not yet 100% convinced that wrapping the decoded row twice is a great idea. One wrapping for Result<T,Error> and another wrapping for Optional<Result<T,Error>>.

    If feels ugly 😅 that you have to send two failure signal on error: one for the Result.failure and another one sending a nil to tell any sequence syntax consumer that the sequence has finished.

    The only other solution I can think of, is to get away completely from the Sequence syntax and have ShadowSequence have a throwing method to be called in a while loop. For example.

    let decoder = CSVDecoder()
    let sequence = decoder.__nameForOnDemandRowDecoding__(CustomRowType.self, from: url)
    while let row = try sequence.next() { /* Do what you want with it */ }

    It would be a pity to throw away all the nice Sequence extension, though. Maybe it is something we can bring up in the Swift forums. The Compiler devs are pretty active there (specifically @itaiferber, who is the creator of the Codable API).

@itaiferber
Copy link

I haven't gotten a chance to look at any of this decoder work at all, so take my comments with a grain of salt, but two small notes:

  1. In general, it is not safe to hold on to a decoding container outside of the scope of an init(from:) call:

    private var container: UnkeyedDecodingContainer
    
    required init(from decoder: Decoder) throws {
        container = try decoder.unkeyedContainer()
    }

    Decoders may rely on the fact that once an init(from:) has returned, all container references are no longer valid. Attempting to decode from the container is undefined — it might work, but more likely than not, the container will decode something you don't expect, or point to something that no longer exists. If CSVDecoder guarantees safety for this, then you can consider creating a type scoped specifically to CSVDecoder, perhaps?

  2. To make this sort of operation safe, you'd really need to expose an interface on CSVDecoder that guarantees this operation being valid. e.g. instead of writing a general DecodableSequence, you should go with something like

    let sequence = CSVDecoder().decodeSequence(of: CustomRowType.self, from: url)
    for row in sequence { ... }

    The Sequence you return would need to hold on to the internal state of the Decoder so it can safely be used without forcing any internal components (like containers) to outlive their intended scope.

In general, I agree with @dehesa that when an object is decoded, it is expected to be fully initialized. If lazily decoding is entirely transparent (e.g. a CSVLazyDecodableSequence holds on to the underlying CSV data and decodes it internally as I iterate it without exposing that to me directly), then that could be safe.

@dehesa
Copy link
Owner

dehesa commented Apr 2, 2020

Hi @itaiferber, thank you for pitching in here. I completely agree with all the points you've raised. My doubt hangs more over my point 4. Let me rephrase it:

For a lazily sequence decoder, as the one you have described:

let sequence = CSVDecoder().decodeSequence(of: CustomRowType.self, from: url)
for row in sequence { ... }

How could we iterate through its elements when the operation may throw at any point? We would like to keep the Sequence syntax and benefit from all the Sequence extensions, but IteratorProtocol's next() function doesn't support throwing an error.

  • One option is to wrap the sequence's element in a Result type, but that actually doesn't stop sequence. The next logic step is to wrap the result in an optional.

    In case of error, the sequence would receive a failure and next time is queried again, it will forward nil. This is problematic for cases where the user is counting the elements (for example).

  • Another option is to wrap the sequence's element in an optional, and when a nil is received, check an error status on the sequence. Which would force the user to always be wary about the decoding state.

Do you see any other option for lazy sequence decoding?

@josh
Copy link
Contributor Author

josh commented Apr 2, 2020

Exposing something like decodeSequence would be great. The only reason I'm using this work around is the exposed public decoding APIs provide the ability to decode a single row. The DecodableSequence is just a hack to get a handle on that internal shadow decoder's container.

@dehesa dehesa added this to Open in Features via automation Apr 3, 2020
@itaiferber
Copy link

@dehesa Good point, I wasn't thinking about that.

I was going to write the following:

Personally, I would go the Result route, wrapping the sequence element (such that next() returns Optional<Result<...>>nil when the sequence is over, and Result<...> per element).

In case of an error, though, I would consider not returning nil but instead repeatedly returning the same error again. This is consistent with UnkeyedDecodingContainer, which will not advance in case of an error, allowing you to retry decoding as another type. In order to skip an element, you need to decode a dummy type which always works. In the Sequence world, you can probably dropFirst() to achieve the same goal.

But the more I think about it, the more Sequence conformance feels somewhat limiting in this case. It feels very easy to accidentally end up in an infinite loop if you're not paying attention:

for row in sequence {
    guard case let .success(row) = row else {
        // Skip invalid rows.
        // <oops, this doesn't actually advance the sequence>
        continue
    }

    // Do something with `row`.
}

Alternatively, if you do want to dropFirst() correctly, you can't easily use for-in syntax because dropFirst() returns a new Sequence to iterate.

So, it sounds like your original suggestion of

let decoder = CSVDecoder()
let sequence = decoder.__nameForOnDemandRowDecoding__(CustomRowType.self, from: url)
while let row = try sequence.next() { /* Do what you want with it */ }

is more apt. Getting rid of Sequence/IteratorProtocol conformance is unfortunate, but easier to get right from the client side.

Two alternative ideas that might be interesting:

  1. You keep IteratorProtocol conformance but add a manual skip() method in case decoding fails (this allows for the equivalent of dropFirst() while still letting you keep the loop simple)
  2. You keep Sequence conformance but have the Element type be an opaque value which has an explicit decode() method which itself throws. This lets you keep the for-in syntax and gives you an easier out by not having to wrap with Result. In this case, I would consider having a failed decode() call allow the loop to continue (unlike my previous suggestion above in which the Sequence would not advance) — in this case, the Sequence you are iterating is of opaque values which are divorced from the underlying encoded representation.

This last idea, to me, seems the more interesting one to investigate.

@dehesa
Copy link
Owner

dehesa commented Apr 3, 2020

Ohh, I like the last idea you mentioned. Thank you @itaiferber (and congratulations for coming up with the amazing Codable interface) 😉

What do you say, @josh? Are you still up for giving it a go with all the information gathered?

By the way @itaiferber, if you find yourself with a bit of time and you end up checking the library. I will be very happy to receive any comment/criticism that you may have 😄

@josh
Copy link
Contributor Author

josh commented Apr 5, 2020

@dehesa yeah, I'd like to give it a try. How does this interface look?

extension CSVDecoder {
    /// Decode CSV data row by row.
    ///
    /// Does not throw since parsing does not start until requested by iterator.
    ///
    /// - Returns: Iterator for decoding each row.
    func decodeSequence<T: Decodable>(of type: T.Type, from url: URL) -> CSVDecodableIterator<T>
}

struct CSVDecodableIterator<Element: Decodable> {
    /// Decode next row.
    /// - Returns: Decoded row as `Element` or `nil` if no next row exists.
    /// - Throws: `DecodingError` if row can not be decoded.
    func next() throws -> Element?
}

@dehesa
Copy link
Owner

dehesa commented Apr 6, 2020

If I understand correctly, @josh's proposed API will look as follows on regular usage:

let decoder = CSVDecoder(configuration: ...)
for row in decoder.decodeSequence(of: Student.self, from: url) {
    let student = try row.decode()
    // Do something
}

I would suggest the following changes:

  1. Let's use the lazy moniker as in the Swift Standard Library.

    We are actually providing that kind of functionality. I believe it is pretty descriptive.

  2. Don't force the outcome type to be passed at the sequence level, but at the row decoding level.

    This will provide the user with the freedom to decode different rows with different types if she wanted.

With those changes in mind. What do you think about this?

let decoder = CSVDecoder(configuration: ...)
for row in decoder.lazy(from: url) {
    let student = try row.decode(Student.self)
    // Do something
}

The benefit of point 2 to is that the user can choose different row representation as she wants. For example, you may have a students CSV where the students from 0 to 99 are represented as a different type than the students from 100 to 199. Although the underlying CSV representation is the same.

@dehesa
Copy link
Owner

dehesa commented Apr 6, 2020

Now, on an implementation perspective. There are some challenges that you will face:

  • Most users will likely write the code as we intended. However it is technically possible to do the following:

    let decoder = CSVDecoder(configuration: ...)
    let sequence = decoder.lazy(from: url)
    guard let rowA = sequence.next(),
          let rowB = sequence.next() else { ... }
    let studentB = try rowB.decode(Student.self)
    let studentA = try rowA.decode(Student.self)

    Meaning, the second row is actually decoded before the first. What happens if studentA contains an error (whether a problem in the underlying string encoding, or the user mistype an escaping character)? We cannot recover from that.

    The solution for this is buffering and decoding the previous rows. But this will require a none trivial amount of state management. You could reuse the buffers I implemented (which will be optimized for performance later on), but even so, it will be some work.

  • The Sequence protocol encompasses the destructive case as it is ours. Meaning, that only one pass is allowed.

    We should be clear in the documentation that that is the case and decide what to do with some Sequence properties such as underestimatedCount. Should we always return 0?, or should we return the number of decoded rows? or should we do a full pass and give the actual number of rows?

  • Take to heart what @itaiferber said about keeping decoding containers alive. They are intended for scoped in-order use and they all reuse the same ShadowDecoder.Source.

    They are used like that in my current decoder implementation. Check it out.

I am happy for you to give it a go, but if it becomes too much, we can always fall back in the easier-to-implement API, which doesn't conform to the Sequence protocol and doesn't use buffering.

let decoder = CSVDecoder()
let sequence = decoder.lazy(from: url)
while let row = try sequence.next() {
    // Do what you want with it
}

@dehesa dehesa moved this from Open to In progress in Features Apr 10, 2020
@dehesa dehesa moved this from In progress to Done in Features Apr 10, 2020
@dehesa dehesa closed this as completed Apr 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
Features
  
Done
Development

No branches or pull requests

3 participants