What is needed to support encoding #2

Closed
lightsprint09 opened this issue Oct 15, 2019 · 16 comments
Labels
question Further information is requested

Comments

@lightsprint09
Contributor

Could I help you with supporting Encodable?

@dehesa
Owner

dehesa commented Oct 18, 2019

Hi @lightsprint09, sorry for the delay in answering (somehow I am not receiving notifications on this project).

Any help is welcome, so if you want to work on it, I would be happy to review or help. Most of the architecture for Encodable is there and it should be fairly similar to Decodable (which is currently working). To be honest, I don't remember what is missing (it has been some time). I will try to look at it this weekend and give you a more thorough answer.

The problem is that adopting and developing for Codable is not straightforward since there are a lot of intermediate structures. It should be a fairly entertaining educational experience, though :D

@lightsprint09
Contributor Author

Sounds good. Would you mind if I set up Travis CI?

@dehesa
Owner

dehesa commented Oct 18, 2019

That sounds interesting. I have never set it up myself for an open-source project on GitHub. As far as I understand, it is a CI system that runs your tests on every commit. But what will it actually entail? Do we have to keep a specific Git structure? Who has access to its configuration? Etc.

@dehesa
Owner

dehesa commented Oct 21, 2019

@lightsprint09 I managed to take a look at the state of the code this weekend and the major problem with the Encodable adoption is the "random access" encoding that the user might decide to do while using encode(to:).

In simple words:

  • The Codable support is based on the underlying CSVReader/CSVWriter.
  • The CSVReader/CSVWriter are sequential and are based on streams (currently only supporting in-memory), but with only additive modifications they could access files directly or any other stream (including on-demand encoding/decoding for network streams).
  • The current problem with encoding is that the user might have written a CSV row further ahead (let's say index 10) and then decide to write a row at index 6 (for example).
    Given the sequential stream nature of CSVWriter, that is not allowed.

Thus, there are several options we can pursue:

  • The easiest/ugliest: throw an encoding error as soon as the user tries to do that and implement the most straightforward CSVWriter operation.
  • Implement an intermediate in-memory cache where all records are stored and only write them to the stream once the encoding has completed.
  • Don't use CSVWriter and instead implement something else that doesn't use Foundation's OutputStream (which would hinder future support for file or network writing).

@lightsprint09
Contributor Author

How is it possible to write further ahead using the same stream?

@dehesa
Owner

dehesa commented Oct 21, 2019

Well, it is not yet implemented, but I suppose that if the user decides to write row 10 while currently on row 5, the CSVWriter would write 4 row delimiters and then start writing whatever the user inputs.

Writing empty rows is actually quite common in CSV files.
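
To make that concrete, here is a minimal sketch of such padding; the SequentialRowWriter type and its write(row:at:) method are hypothetical, purely for illustration, not the library's actual API:

struct SequentialRowWriter {
    private(set) var nextRowIndex = 0
    private(set) var output = ""

    enum WriteError: Error { case rowAlreadyWritten }

    /// Writes `fields` at `targetIndex`, padding any gap with empty rows (bare row delimiters).
    mutating func write(row fields: [String], at targetIndex: Int) throws {
        guard targetIndex >= nextRowIndex else { throw WriteError.rowAlreadyWritten }
        while nextRowIndex < targetIndex {
            output += "\n"                                  // an empty row is just a row delimiter
            nextRowIndex += 1
        }
        output += fields.joined(separator: ",") + "\n"
        nextRowIndex += 1
    }
}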

@lightsprint09
Contributor Author

let stream = OutputStream(url: ...)
let encoder = CSVEncoder(stream: stream)
let codable = ["A", "B", "C"]

try encoder.encode(codable)

How would I define where to write? I would expect to write at the end of the stream?

@dehesa
Owner

dehesa commented Oct 21, 2019

OutputStream is just a dumb pipeline that only "transmits" the bytes you are sending. Therefore, you can only write sequentially through its write(_ buffer: UnsafePointer<UInt8>, maxLength len: Int) function.

Encodable, however, offers a way to randomly access where to write through its KeyedEncodingContainer (it is similar to a dictionary where you pass a CodingKey specifying where to write and the value you are encoding).

As you can see, there is an incompatibility between the sequential expectations of OutputStream and the freedom given to the user by KeyedEncodingContainer. The solution must decide how to operate:

  • Option 1 is to restrict the freedom to write randomly by throwing an error when you do so.
    Users will probably find that a bit too "crashy".
  • Option 2 is to implement an intermediate buffer and, only once encoding has completed, write sequentially from the beginning to OutputStream (see the sketch below).
  • Option 3 is to not use OutputStream at all.
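
As a rough sketch of option 2, something along these lines could sit between the encoding containers and the stream; the type and method names below are hypothetical, purely to illustrate the buffering idea:

import Foundation

struct EncodingBuffer {
    /// Rows are kept in memory, keyed by row index, so random-access writes are cheap.
    private var rows: [Int: [String]] = [:]

    mutating func set(row fields: [String], at index: Int) {
        rows[index] = fields
    }

    /// Only once encoding has completed do we touch the sequential OutputStream.
    func flush(to stream: OutputStream) {
        guard let lastIndex = rows.keys.max() else { return }
        stream.open()
        defer { stream.close() }
        for index in 0...lastIndex {
            let line = (rows[index] ?? []).joined(separator: ",") + "\n"    // missing indices become empty rows
            let bytes = Array(line.utf8)
            bytes.withUnsafeBufferPointer { buffer in
                _ = stream.write(buffer.baseAddress!, maxLength: buffer.count)
            }
        }
    }
}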

@lightsprint09
Contributor Author

Ok. I see.

Seems like I don't understand the Codable implementation correctly.

Can you show me an example of how one would implement Encodable so that it would be problematic?

@dehesa
Owner

dehesa commented Oct 21, 2019

Don't despair, the Codable internal behavior takes some time to get used to :D Let me give you several examples:

Compiler Generation Example

Let's suppose you have a CSV file listing students in a school. Each row is a student and the columns are: name, age, hasPet. The simple Encodable definition would be:

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool
}

Then you can simply use the CSVEncoder like so:

let students = [
    Student(name: "Heidrun", age: 12, hasPet: true),
    Student(name: "Gudrun", age: 11, hasPet: false)
]

let encoder = CSVEncoder()
try encoder.encode(students)

Generated Code

You are letting the compiler generate the encode() function for you. The compiler will write the following:

extension Student {
    private enum CodingKeys: String, CodingKey {
        case name, age, hasPet
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.name, forKey: .name)
        try container.encode(self.age, forKey: .age)
        try container.encode(self.hasPet, forKey: .hasPet)
    }
}

But you are also using the internal encode() function from the Array implementation, which probably looks as follows:

extension Array: Encodable where Element: Encodable {
    func encode(to encoder: Encoder) throws {
        var container = encoder.unkeyedContainer()
        for element in self {
            try container.encode(element)   // where an element is a student
        }
    }
}

All Together

So the actions when encoder.encode(students) is executed will look as follows:

  1. Call the internal Array implementation of encode().
  2. The function creates an UnkeyedContainer (from the shadow encoder) which will sequentially go through each element/student.
  3. Call the generated encode() on the Student structure.
  4. Create a KeyedEncodingContainer (from the unkeyedContainer) each time a student is reached.
  5. Encode the student's properties.

The rows/students will be sequentially encoded, so we wouldn't have the "random-access write" problem there. However, nowhere is the order of the row's fields (i.e. the student's properties) defined. The CodingKeys generated by the compiler use String values (not Ints). At this point the program will likely crash. You can overcome this first problem in several ways:

  • The user specifies the headers in the CSVEncoder.headers setting.
    There you are passing an array of Strings and thus you can match String values with Int values (i.e. the position in the array).
    let encoder = CSVEncoder()
    encoder.headers = ["name", "age", "hasPet"]     // The values must exactly match the `CodingKeys` values generated by the compiler
  • Make our UnkeyedContainer "clever" by making it figure out the definition order of CodingKeys.
    Some solutions might be explored, such as a CaseIterable requirement (see the sketch at the end of this comment). I haven't checked whether this is possible.

If the previous problem is solved correctly, the program will most likely match the CodingKeys with sequential Int values and generate the expected encoded CSV.
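
Regarding the second idea, here is a sketch of how CaseIterable could expose the key order. Whether a generic encoder can detect that conformance at runtime is exactly the part I haven't verified; the snippet below is illustrative only:

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    // Adding CaseIterable makes the declaration order of the keys available at runtime.
    enum CodingKeys: String, CodingKey, CaseIterable {
        case name, age, hasPet
    }
}

// A hypothetical encoder could derive the column positions from `allCases`.
let order = Student.CodingKeys.allCases.map { $0.stringValue }       // ["name", "age", "hasPet"]
let position = Dictionary(uniqueKeysWithValues: order.enumerated().map { ($1, $0) })
// position["age"] == 1, i.e. the Int value to hand over to the sequential writer.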

@dehesa
Owner

dehesa commented Oct 21, 2019

The previous example was the simple/naive one. Let's suppose the framework user is savvy in the ways of Codable; in that case, she can define things more tightly.

Unkeyed Fields Example

Let's have the same Student listing as in the previous example, but now let's write our own Encodable conformance.

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    func encode(to encoder: Encoder) throws {
        var container = encoder.unkeyedContainer()
        try container.encode(self.name)
        try container.encode(self.age)
        try container.encode(self.hasPet)
    }
}

In this example, we achieve the best performance since we are creating a sequential container (UnkeyedContainer) and then we are encoding one field at a time. For this case, we could write directly to CSVWriter.

Keyed Fields Example (Int Values)

We could have decided to use a KeyedEncodingContainer instead of an UnkeyedContainer.

struct Student: Encodable {
    let name: String
    let age: Int
    let hasPet: Bool

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.name, forKey: .name)
        try container.encode(self.age, forKey: .age)
        try container.encode(self.hasPet, forKey: .hasPet)
    }

    private enum CodingKeys: Int, CodingKey {
        case name = 2
        case age = 1
        case hasPet = 0
    }
}

Interestingly, here we are defining the order of the fields within a row; and, weirdly, the column order defined by the key values does not match the order in which the fields are encoded. Here we encounter the first "random-access write" problem, albeit on a small scale, since it only pertains to a single row.

We could solve this problem by caching a whole row and only writing it when a new row is started or the end of the file is reached.
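
A minimal sketch of that row-level cache (hypothetical names, not the library's API): fields arrive out of order, keyed by their Int CodingKey value, and are only joined into a line once the row is finished.

struct RowCache {
    private var fields: [Int: String] = [:]

    mutating func set(_ value: String, at columnIndex: Int) {
        fields[columnIndex] = value
    }

    /// Called when a new row starts or the file ends; emits the columns in ascending key order.
    mutating func finishRow() -> String {
        defer { fields.removeAll() }
        return fields.sorted { $0.key < $1.key }
                     .map { $0.value }
                     .joined(separator: ",")
    }
}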

@dehesa
Owner

dehesa commented Oct 21, 2019

> Would it be fine to get a CSV with random order when not setting the headers?

Yes and no. In CSV files, the column order carries meaning, therefore the user must know exactly the order of the generated outcome. It cannot be "execution magic". Also, the order must be consistent across several Student encodings.

That is however not the major problem, I believe we can find an "easy solution" for that.

@dehesa
Owner

dehesa commented Oct 21, 2019

Unordered Rows Example

Let's suppose a similar case (the same student listing), but now the row positions have meaning. Let's say:

struct Class: Encodable {
    var heidrun: Student
    var gutrun: Student
    var alexander: Student
    var sigrid: Student

    private enum CodingKeys: Int, CodingKey {
        case alexander = 1
        case gutrun = 2
        case heidrun = 3
        case sigrid = 4
    }
}

The compiler-generated code will start encoding with heidrun, although the first row should be alexander. Here lies one of the major problems. For a correct implementation, we would need to cache the whole CSV file and only write it to the OutputStream once we are sure of the order.
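
For illustration, the compiler-generated encode(to:) would presumably follow the property declaration order, roughly like this:

extension Class {
    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(self.heidrun, forKey: .heidrun)        // encoded first, but its key says row 3
        try container.encode(self.gutrun, forKey: .gutrun)
        try container.encode(self.alexander, forKey: .alexander)    // should actually end up as row 1
        try container.encode(self.sigrid, forKey: .sigrid)
    }
}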

@dehesa added the question (Further information is requested) label Jan 29, 2020
@dehesa
Owner

dehesa commented Mar 10, 2020

Hi @lightsprint09,

Just as a heads-up, I have reimplemented the Decodable part of the library, making it much simpler, using small value types across the board, and building the buffering infrastructure.

I will probably write the Encodable system in the following two months, but if you ever wanted to challenge yourself and try a Codable implementation, now would be a good time 😄

Cheers

@lightsprint09
Contributor Author

Hey @dehesa,

Sounds awesome. I will have a look.

@dehesa
Owner

dehesa commented Mar 29, 2020

Alright, I am happy to announce that after some hard labor, I have the first Encodable implementation. It still needs polishing and support for other buffering strategies, but it currently works 😄

You can find it in the develop branch if you are still interested. It will take me a bit more time to land it in master, though.
