Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is encoding Double or Float supported? #20

Closed
leisurehound opened this issue May 30, 2020 · 16 comments
Closed

Is encoding Double or Float supported? #20

leisurehound opened this issue May 30, 2020 · 16 comments
Assignees
Labels
bug Something isn't working
Projects

Comments

@leisurehound
Copy link

When I change the floatStrategy as .convert I get an fatal error in this code:

            case .throw: throw CSVEncoder.Error._invalidFloatingPoint(value, codingPath: self.codingPath)
            case .convert(let positiveInfinity, let negativeInfinity, let nan):
                if value.isNaN {
                    return nan
                } else if value.isInfinite {
                    return (value < 0) ? negativeInfinity : positiveInfinity
                } else { fatalError() }
            }

So with either strategy I either get the thrown error or a fatal error if a valid Double is attempted to be encoded. Is this expected or is there another configuration I'm missing?

Similarly, when I try to decode a double, I get an error thrown, but when I decode it as a string and convert the string to a double in my structs init(from: Decoder) I process the field correctly.

@leisurehound leisurehound added the question Further information is requested label May 30, 2020
@dehesa
Copy link
Owner

dehesa commented May 30, 2020

Hi @leisurehound

It seems I introduced a bug encoding/decoding Doubles and Floats in one of the last releases and my tests were only covering integer numbers (and not floating-point ones).

It is an easy fix. I will have a patch ready in a couple of hours. Thank you for reporting.

@dehesa dehesa added bug Something isn't working and removed question Further information is requested labels May 30, 2020
@dehesa dehesa added this to Needs triage in Bugs via automation May 30, 2020
@dehesa dehesa moved this from Needs triage to High priority in Bugs May 30, 2020
dehesa added a commit that referenced this issue May 30, 2020
dehesa added a commit that referenced this issue May 30, 2020
dehesa added a commit that referenced this issue May 30, 2020
@dehesa
Copy link
Owner

dehesa commented May 30, 2020

I have fixed the floating-point problem and added some tests to check correct behavior. The fixes can be found in the develop branch.

Please, give it a try and if it solves your problems, I will promote the fix to master and release a new patch version.

@dehesa
Copy link
Owner

dehesa commented May 30, 2020

Oh, and I forgot to mention that I have renamed floatStrategy to nonConformingFloatStrategy. This way, it makes it more clear that the encoding/decoding strategy only gives the option to modify non conforming floating-point values (i.e. infinity, NaN).

@leisurehound
Copy link
Author

Thanks @dehesa.

The following statement now works:

duration = try container.decode(Double.self)

but the following does not:

duration = try container.decodeIfPresent(Double.self)

@dehesa
Copy link
Owner

dehesa commented May 30, 2020

Uhm, ok. That is interesting. Could you provide a small example with the behavior you are expecting, such as the one I have in the tests. It will help me out debugging.

@leisurehound
Copy link
Author

leisurehound commented May 30, 2020

Sorry, am not sure if my Codable experience allows me to replicate without the container that I'm using within the init(from: Decoder).

I'm trying to decode an enum property in the parent struct (see below) by decoding each field individually. I'd really like to be able to decode the Codable enum like I can with JSON, but the new container index seems to get mixed up and always fails.

I'm essentially serializing an enum which has one case with an associated value and the other does not (its essentially a simplified Result type). Here's how I'd expect it to work (but doesn't):

struct myStruct : Codable {
  let dateSent: Date
  let response: Response
  init(from decoder: Decoder) throws {
    var container = try decoder.container.unkeycontainer()
    // ... decode the date, which works fine...
    response = try container.decode(Response.self)
  }
}

enum Response : Codable {
  case time(duration: Double)
  case failure
  init(from decoder: Decoder) {
    var container = try decoder.container.unkeyedcontainer()
    let duration = try decoder.decodeIfPresent(Double.self)
    let failure = try decoder.decodeIfPresent(String.self)
    switch (duration, failure) {
      case (.some(let duration), .none): self = .time(duration: duration)
      case (.none, .some) : self = .failure
      case (.some, .some) : throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure populated"))
      case (.none, .none) : throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure empty"))
    }
}

Here's what I've had to resort to because of the container index becomes problematic when moving to the embedded enum, but it also doesn't work because of the decodeIfPresent(Double.self) always throws.

struct myStruct : Codable {
  let dateSent: Date
  let response: Response
  init(from decoder: Decoder) throws {
    var container = try decoder.container.unkeycontainer()
    // ... decode the date, which works fine...
    // response = try container.decode(Response.self)
    let duration = try decoder.decodeIfPresent(Double.self)
    let failure = try decoder.decodeIfPresent(String.self)
    switch (duration, failure) {
      case (.some(let duration), .none): response = .time(duration: duration)
      case (.none, .some) : response = .failure
      case (.some, .some) : throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure populated"))
      case (.none, .none) : throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure empty"))
    }
  }
}

enum Response {
  case time(duration: Double)
  case failure
}

I'm expecting (and encoding to) the following string records
<datestring>,3.23423,
or
<datestring>,,failure
but am struggling to decode the same

Sorry if this is fundamentally the wrong approach here, but its what I'm coming from the experience I have coding/decoding JSON with Codable.

@dehesa
Copy link
Owner

dehesa commented May 30, 2020

Alright. Let me take a look at it. I have a further question, though. Is the CSV schema closed? I mean, is that row definition final? <datestring>,<float>, and <datestring>,<empty>,failure or would you be open to define it in another way? e.g. <datestring>,<optionalFloat>

@leisurehound
Copy link
Author

leisurehound commented May 30, 2020

For this simple problem then yes, that's doable. I'm basically working on a dummy project to learn some aspects of SwiftUI plotting, Combine and Coding CSV values.

The real project I plan to use all this on, however, is managing an assortment time series that are sampled at various frequencies. So column positions are important to determine if there is a datum at any particular sample time. Time series CSV is generally columnar data and in the exports I'll have to process I won't have control of the collapsing of columns in particular rows, they'll simply be empty.

So, for the dummy project yes its easy to adapt. But the dummy project has a pattern I've chosen specifically as its the construct I'll have to support when I get past learning/prototyping (I'm actually a believer in a different second system effect, the first time I don't know what I'm doing and it ends up ugly, unmaintainable, and often inefficient. When I do it for the second time I have the knowledge to do it right and am not left with refactoring artifacts I wish weren't there...so thats why I'm constraining this work a bit more than necessary...)

dehesa added a commit that referenced this issue May 30, 2020
@dehesa
Copy link
Owner

dehesa commented May 30, 2020

I fixed a small behavior on decodeNil, which is specified in the documentation, but I didn't account for; i.e. if nil is encountered, then +1, must be added to the index; but if not, it should not. With this fix, now decodeIfPresent should work as intended.

For your specific problem, you actually have several solutions (be sure to download the latest from develop branch):

  • Changing row definition to <date>,<floatOrNil>. Similar to this:

    let data = """
        <dateA>,1.23
        <dateB>,
        <dateC>,4.56
        """.data(using: .utf8)!
    let decoder = CSVDecoder { $0.headerStrategy = .none }
    let structs = try decoder.decode([CustomStruct].self, from: data)
    • Easiest.

      struct CustomStruct: Decodable {
          let date: Date
          let response: Double?
          
          private enum CodingKeys: Int, CodingKey {
              case date = 0
              case response = 1
          }
      }
    • Keeping the enum.

      struct CustomStruct: Decodable {
          let date: Date
          let response: Response
      
          init(from decoder: Decoder) throws {
              var container = try decoder.unkeyedContainer()
              self.num = try container.decode(Date.self)
              self.response = try container.decode(Response.self)
          }
      }
      
      enum Response: Decodable {
          case time(duration: Double)
          case failure
      
          init(from decoder: Decoder) throws {
              let container = try decoder.singleValueContainer()
              self = container.decodeNil() ? .failure
                                           : .time(duration: try container.decode(Double.self))
          }
      }
  • Keeping row definition (<datestring>,<float>, or <datestring>,<empty>,failure)

    struct CustomStruct: Decodable {
        let date: Date
        let response: Response
    
        init(from decoder: Decoder) throws {
            var container = try decoder.unkeyedContainer()
            self.num = try container.decode(Date.self)
            if let duration = try container.decodeIfPresent(Double.self) {
                self.response = .time(duration: duration)
            } else {
                self.response = .failure
            }
        }
    }

@dehesa
Copy link
Owner

dehesa commented May 30, 2020

Let me know if that works for you and then I will upstream the patch to master. Again, thank you for reporting.

@leisurehound
Copy link
Author

leisurehound commented May 31, 2020

Yes, I can decodeIfPresent() successfully now.

One last question, I encodeRow() one result at a time (because they're a live time series) thus am using lazy. After each encoding I endEncoding() and then write the data with outputStream.writeBytes(). I've learned that I have to re-create a new encoder for the next datum after I've endEncoding() on the last live time series datum. Is there a manner in which I can reuse an encoder or is creating a new encoder cheap enough on every live datum that is collected?

@dehesa
Copy link
Owner

dehesa commented May 31, 2020

That is not ideal. Creating a new encoder is not "very expensive", but you are creating and deallocating OutputStreams for each row (which is not the intended usage).

Why don't you just store the lazy encoder somewhere (it is a reference type) and reuse when a new row arrives?

// 1. Create lazy encoder and store it somewhere.
self.cacheEncoder = CSVEncoder {
    $0.bufferingStrategy = .sequential
}.lazy(into: /* Whatever you need */)

// 2. Then, every time a new row arrives.
try self.cacheEncoder.encodeRow(customStructure)

// 3. Once you stop the live subscription, call
try self.cacheEncoder.endEncoding()
// there is also a defined deinit, which will `endEncoding()` for you (if you haven't called it) when the reference is deallocated.

@leisurehound
Copy link
Author

leisurehound commented May 31, 2020

I did try to store the encoder as a class property, but endEncoding on each datum so the file is tail-able, for example. This caused the next datum encoding to fail. For a long running process with a live feed, the data persistence is paramount, so I keep my output stream tied to a file open and continuously append to it, and then close the output stream on deinit, which could be never (or at server reboot, or when files roll). Its somewhat akin to using CSV encoding for a logger, I guess, which may be a bit corrupt. But data loggers do that by nature, and CSV is a natural export format for columnar data of time series data.

Would an option to flush the encoder instead of end it be viable here? The other option is for the client code do so on a Timer, letting the buffer grow and end encoding then create a new encoder again.

I think this may simply be a case where encoding is not worth the effort beyond:

let data = "\(dateStr),\(duratitonStr),\(failureStr)".data(using: .utf8)

@dehesa
Copy link
Owner

dehesa commented May 31, 2020

If I understood everything correctly, I believe you have three options to satisfy your requirements.

  1. Create a lazy encoder with .sequential buffering strategy.

    Store the lazy encoder and call encodeRow(customType). The sequential buffering strategy only keeps in memory the last row to be written. Therefore, when a new row arrives, it writes the previous row. That wouldn't do what you want, but if the type being encoded uses a unkeyedContainer, the fields are written sequential and the encoder is actually writing in the file the fields as they arrive (and doesn't keep it in memory).

    So to sum up, you would use a lazy encoder, call encodeRow every time a new row arrives (do not call endEncoding afterwards), and ensures that you are using unkeyedContainers and you are encoding all fields in the row (including empty fields). Then it will have the behavior you want.

  2. Use the imperative CSVWriter.

    This writer do exactly what you tell it to do. Thus, if you call in the writer's write(row:) it actually writes that row in the file including the row delimiter.

  3. Just append data to the file as you were suggesting.

    The upside of using an encoder or writer, is that they have an easier-to-understand API, they will handle any edge case (such as CSV escaping, etc.) and they free you from muddling with file handles. Still, sometimes using low-level APIs might be the solution if your data format never changes.

@leisurehound
Copy link
Author

ok, thanks the details and the help here. Those details were not gleamed from your readme page when I first attempted to integrate the library. I may send you a PR for the README to expand on usage so its easier for first time users who may be new to CSV coding; what's there now seems to have some unstated assumptions/knowledge that I certainly didn't have. I'll admit, I had to trap errors to figure out what I needed to do or was doing wrong, which is not an efficient means and even still wasn't/am not really using it correctly without what you thus shared here.

@dehesa
Copy link
Owner

dehesa commented May 31, 2020

I agree there might be some assumptions in the README about how to use the library. I have spent so much time on it that some knowledge is second nature to me. I would be delighted if you send PRs my way; having a different view on how to express things is always valuable.

I will close this issue now, but feel free to open new ones if you have any more questions or you would like to improve something. If you decide to get involved in the project, the following things help:

  • Check the features board (some of those TODO tickets reference README enhancements).
  • Star the project.
  • Report more bug reports or confusing behavior.

@dehesa dehesa closed this as completed May 31, 2020
Bugs automation moved this from High priority to Closed May 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Bugs
  
Closed
Development

No branches or pull requests

2 participants