What is needed to support encoding #2

Closed
lightsprint09 opened this issue Oct 15, 2019 · 16 comments
Labels
question Further information is requested

Comments

@lightsprint09
Contributor

Could I help you with supporting Encodable?

@dehesa
Owner

dehesa commented Oct 18, 2019

Hi @lightsprint09, sorry for the delay in answering (somehow I am not receiving notifications on this project).

Any help is welcome, so if you want to work on it, I would be happy to review or help. Most of the architecture for Encodable is there and it should be fairly similar to Decodable (which is currently working). To be honest, I don't remember what is missing (it has been some time). I will try to look at it this weekend and give you a more thorough answer.

The problem is that adopting and developing for Codable is not straightforward since there are a lot of intermediate structures. It should be a fairly entertaining educational experience, though :D

@lightsprint09
Contributor Author

Sounds good. Would you mind if I set up Travis CI?

@dehesa
Owner

dehesa commented Oct 18, 2019

That sounds interesting. I have never set it up myself for an open-source project on GitHub. As far as I understand, it is a CI system that runs your tests on every commit. But what will it actually entail? Do we have to keep a specific Git structure? Who has access to its configuration? Etc.

@dehesa
Owner

dehesa commented Oct 21, 2019

@lightsprint09 I managed to take a look at the state of the code this weekend and the major problem with the Encodable adoption is the "random access" encoding that the user might decide to do while using encode(to:).

In simple words:

  • The Codable support is based on the underlying CSVReader/CSVWriter.
  • The CSVReader/CSVWriter are sequential and are based on streams (currently only supporting in-memory), but with only additive modifications they could access files directly or any other stream (including on-demand encoding/decoding for network streams).
  • The current problem with encoding is that the user might have written a CSV row further ahead (let's say index 10) and then decide to write a row at index 6 (for example).
    Given the sequential stream nature of CSVWriter, that is not allowed.

Thus, there are several options we can pursue:

  • The easiest/ugliest: throw an encoding error as soon as the user tries to do that and implement the most straightforward CSVWriter operation.
  • Implement an intermediate in-memory cache where all records are stored and only write them to the stream once the encoding has completed.
  • Don't use CSVWriter and instead implement something else that doesn't use Foundation's OutputStream (which would hinder future support for file or network writing).

@lightsprint09
Contributor Author

How is it possible to write further ahead using the same stream?

@dehesa
Owner

dehesa commented Oct 21, 2019

Well, it is not yet implemented, but I suppose that if the user decides to write row 10 while currently on row 5, the CSVWriter would write 4 row delimiters and then start writing whatever the user inputs.

Writing empty rows is actually quite common in CSV files.
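
To make that concrete, here is a minimal sketch of such padding; the SequentialRowWriter type and its write(row:at:) method are hypothetical, purely for illustration, not the library's actual API:

struct SequentialRowWriter {
    private(set) var nextRowIndex = 0
    private(set) var output = ""

    enum WriteError: Error { case rowAlreadyWritten }

    /// Writes `fields` at `targetIndex`, padding any gap with empty rows (bare row delimiters).
    mutating func write(row fields: [String], at targetIndex: Int) throws {
        guard targetIndex >= nextRowIndex else { throw WriteError.rowAlreadyWritten }
        while nextRowIndex < targetIndex {
            output += "\n"                                  // an empty row is just a row delimiter
            nextRowIndex += 1
        }
        output += fields.joined(separator: ",") + "\n"
        nextRowIndex += 1
    }
}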

@lightsprint09
Contributor Author

let stream = OutputStream(url: ...)
let encoder = CSVEncoder(stream: stream)
let codable = ["A", "B", "C"]

try encoder.encode(codable)

How would I define where to write? I would expect to write at the end of the stream?

@dehesa
Owner

dehesa commented Oct 21, 2019

OutputStream is just a dumb pipeline that only "transmits" the bytes you are sending. Therefore, you can only write sequentially through its write(_ buffer: UnsafePointer<UInt8>, maxLength len: Int) function.

Encodable, however, offers a way to randomly access where to write through its KeyedEncodingContainer (it is similar to a dictionary where you pass a CodingKey specifying where to write and the value you are encoding).

As you can see, there is an incompatibility between the sequential expectations of OutputStream and the freedom given to the user by KeyedEncodingContainer. The solution must decide how to operate:

  • Option 1 is to restrict the freedom to write randomly by throwing an error when you do so.
    Users will probably find that a bit too "crashy".
  • Option 2 is to implement an intermediate buffer and, only once encoding has completed, write sequentially from the beginning to OutputStream (see the sketch below).
  • Option 3 is to not use OutputStream at all.
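
As a rough sketch of option 2, something along these lines could sit between the encoding containers and the stream; the type and method names below are hypothetical, purely to illustrate the buffering idea:

import Foundation

struct EncodingBuffer {
    /// Rows are kept in memory, keyed by row index, so random-access writes are cheap.
    private var rows: [Int: [String]] = [:]

    mutating func set(row fields: [String], at index: Int) {
        rows[index] = fields
    }

    /// Only once encoding has completed do we touch the sequential OutputStream.
    func flush(to stream: OutputStream) {
        guard let lastIndex = rows.keys.max() else { return }
        stream.open()
        defer { stream.close() }
        for index in 0...lastIndex {
            let line = (rows[index] ?? []).joined(separator: ",") + "\n"    // missing indices become empty rows
            let bytes = Array(line.utf8)
            bytes.withUnsafeBufferPointer { buffer in
                _ = stream.write(buffer.baseAddress!, maxLength: buffer.count)
            }
        }
    }
}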

@lightsprint09
Contributor Author

Ok. I see.

Seems like I don't understand the Codable implementation correctly.

Can you show me an example of how one would implement Encodable so that it would be problematic?

@dehesa
Owner

dehesa commented Oct 21, 2019

Don't despair, the Codable internal behavior takes some time to get used to :D Let me give you several examples:

Compiler Generation Example

Let's suppose you have a CSV file listing students in a school. Each row is a student and the columns are: name, age, hasPet. The simple Encodable definition would be:

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool
}

Then you can simply use the CSVEncoder like so:

let students = [
    Student(name: "Heidrun", age: 12, hasPet: true),
    Student(name: "Gudrun", age: 11, hasPet: false)
]

let encoder = CSVEncoder()
try encoder.encode(students)

Generated Code

You are letting the compiler generate the encode() function for you. The compiler will write the following:

extension Student {
    private enum CodingKeys: String, CodingKey {
        case name, age, hasPet
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.name, forKey: .name)
        try container.encode(self.age, forKey: .age)
        try container.encode(self.hasPet, forKey: .hasPet)
    }
}

But you are also using the internal encode() function from the Array implementation, which probably looks as follows:

extension Array: Encodable where Element: Encodable {
    func encode(to encoder: Encoder) throws {
        var container = encoder.unkeyedContainer()
        for element in self {
            try container.encode(element)   // where an element is a student
        }
    }
}

All Together

So the actions when encoder.encode(students) is executed will look as follows:

  1. Call the internal Array implementation of encode().
  2. The function creates an UnkeyedContainer (from the shadow encoder) which will sequentially go through each element/student.
  3. Call the generated encode() on the Student structure.
  4. Create a KeyedEncodingContainer (from the unkeyedContainer) each time a student is reached.
  5. Encode the student's properties.

The rows/students will be sequentially encoded, so we wouldn't have the "random-access write" problem there. However, nowhere is the order of the row's fields (i.e. the student's properties) defined. The CodingKeys generated by the compiler use String values (not Ints). At this point the program will likely crash. You can overcome this first problem in several ways:

  • The user specifies the headers in the CSVEncoder.headers setting.
    There you are passing an array of Strings and thus you can match String values with Int values (i.e. the position in the array).
    let encoder = CSVEncoder()
    encoder.headers = ["name", "age", "hasPet"]     // The values must exactly match the `CodingKeys` values generated by the compiler
  • Make our UnkeyedContainer "clever" by making it figure out the definition order of CodingKeys.
    Some solutions might be explored, such as a CaseIterable requirement (see the sketch at the end of this comment). I haven't checked whether this is possible.

If the previous problem is solved correctly, the program will most likely match the CodingKeys with sequential Int values and generate the expected encoded CSV.
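
Regarding the second idea, here is a sketch of how CaseIterable could expose the key order. Whether a generic encoder can detect that conformance at runtime is exactly the part I haven't verified; the snippet below is illustrative only:

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    // Adding CaseIterable makes the declaration order of the keys available at runtime.
    enum CodingKeys: String, CodingKey, CaseIterable {
        case name, age, hasPet
    }
}

// A hypothetical encoder could derive the column positions from `allCases`.
let order = Student.CodingKeys.allCases.map { $0.stringValue }       // ["name", "age", "hasPet"]
let position = Dictionary(uniqueKeysWithValues: order.enumerated().map { ($1, $0) })
// position["age"] == 1, i.e. the Int value to hand over to the sequential writer.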

@dehesa
Owner

dehesa commented Oct 21, 2019

The previous example was the simple/naive one. Let's suppose the framework user is savvy in the ways of Codable; in that case, she can define things more tightly.

Unkeyed Fields Example

Let's have the same Student listing as in the previous example, but now let's write our own Encodable conformance.

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    func encode(to encoder: Encoder) throws {
        var container = encoder.unkeyedContainer()
        try container.encode(self.name)
        try container.encode(self.age)
        try container.encode(self.hasPet)
    }
}

In this example, we achieve the best performance since we are creating a sequential container (UnkeyedContainer) and then we are encoding one field at a time. For this case, we could write directly to CSVWriter.

Keyed Fields Example (Int Values)

We could have decided to use a KeyedEncodingContainer instead of an UnkeyedContainer.

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.name, forKey: .name)
        try container.encode(self.age, forKey: .age)
        try container.encode(self.hasPet, forKey: .hasPet)
    }

    private enum CodingKeys: Int, CodingKey {
        case name = 2
        case age = 1
        case hasPet = 0
    }
}

Interestingly, here we are defining the order of the fields within a row; and, weirdly, the column order defined by the key values does not match the order in which the fields are encoded. Here we encounter the first "random-access write" problem, albeit on a small scale, since it only pertains to a single row.

We could solve this problem by caching a whole row and only writing it when a new row is started or the end of the file is reached.
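
A minimal sketch of that row-level cache (hypothetical names, not the library's API): fields arrive out of order, keyed by their Int CodingKey value, and are only joined into a line once the row is finished.

struct RowCache {
    private var fields: [Int: String] = [:]

    mutating func set(_ value: String, at columnIndex: Int) {
        fields[columnIndex] = value
    }

    /// Called when a new row starts or the file ends; emits the columns in ascending key order.
    mutating func finishRow() -> String {
        defer { fields.removeAll() }
        return fields.sorted { $0.key < $1.key }
                     .map { $0.value }
                     .joined(separator: ",")
    }
}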

@dehesa
Owner

dehesa commented Oct 21, 2019

> Would it be fine to get a CSV with random order when not setting the headers?

Yes and no. In CSV files, the column order carries meaning, therefore the user must know exactly the order of the generated outcome. It cannot be "execution magic". Also, the order must be consistent across several Student encodings.

That is however not the major problem, I believe we can find an "easy solution" for that.

@dehesa
Owner

dehesa commented Oct 21, 2019

Unordered Rows Example

Let's suppose a similar case (the same student listing), but now the row positions have meaning. Let's say:

struct Class: Encodable {
    var heidrun: Student
    var gutrun: Student
    var alexander: Student
    var sigrid: Student

    private enum CodingKeys: Int, CodingKey {
        case alexander = 1
        case gutrun = 2
        case heidrun = 3
        case sigrid = 4
    }
}

The compiler-generated code will start encoding with heidrun, although the first row should be alexander. Here lies one of the major problems. For a correct implementation, we would need to cache the whole CSV file and only write it to the OutputStream once we are sure of the order.
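
For illustration, the compiler-generated encode(to:) would presumably follow the property declaration order, roughly like this:

extension Class {
    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.heidrun, forKey: .heidrun)        // encoded first, but its key says row 3
        try container.encode(self.gutrun, forKey: .gutrun)
        try container.encode(self.alexander, forKey: .alexander)    // should actually end up as row 1
        try container.encode(self.sigrid, forKey: .sigrid)
    }
}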

@dehesa added the question (Further information is requested) label Jan 29, 2020
@dehesa
Owner

dehesa commented Mar 10, 2020

Hi @lightsprint09,

Just as a heads-up, I have reimplemented the Decodable part of the library, making it much simpler, using small value types across the board, and building the buffering infrastructure.

I will probably write the Encodable system in the following two months, but if you ever wanted to challenge yourself and try a Codable implementation, now would be a good time 😄

Cheers

@lightsprint09
Contributor Author

Hey @dehesa,

Sounds awesome. I will have a look.

@dehesa
Owner

dehesa commented Mar 29, 2020

Alright, I am happy to announce that after some hard labor, I have the first Encodable implementation. It still needs polishing and support for other buffering strategies, but it currently works 😄

You can find it in the develop branch if you are still interested. It will take me a bit more time to land it in master, though.
