Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding CSV file with CRLF line endings fail with error if the last column is quoted #35

Closed
xsleonard opened this issue Mar 20, 2021 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@xsleonard
Copy link
Contributor

xsleonard commented Mar 20, 2021

Describe the bug

A clear and concise description of what the bug is.

Decoding a CSV file with CRLF line endings fails with an error, if the last field in a row is quoted.

The error:

Invalid input
	Reason: The last field is escaped (through quotes) and an EOF (End of File) was encountered before the field was properly closed (with a final quote character).
	Help: End the targeted field with a quote.

To Reproduce

Steps to reproduce the behavior:

Using a CSV file with CRLF line endings (url), decode with this:

    let decoder = try CSVDecoder(configuration: {
        $0.encoding = .utf8
        $0.bufferingStrategy = .sequential
        $0.headerStrategy = .firstLine
        $0.trimStrategy = .whitespaces
        $0.delimiters.row = "\r\n" // or "\n", also fails
    }).lazy(from: url)

Expected behavior

No error

System

  • CodableCSV: 0.6.6

Additional context

This was introduced in v0.6.6

@xsleonard xsleonard added the bug Something isn't working label Mar 20, 2021
@dehesa
Copy link
Owner

dehesa commented Mar 20, 2021

Hi @xsleonard, I am out for the weekend so I don't think I will manage to solve this till Tuesday. Your bug description seems clear and I believe I understand the problem, but I would appreciate if you could provide a CSV with two or three rows (of dummy data) triggering the problem. That way I can even include it as a test so it doesn't happen again.

@xsleonard
Copy link
Contributor Author

Ok, I'll look into more on Monday as well.
The attached file is the simplest case with a header (1 column) so its easier to step through in debugging.
It has a .txt extension because .csv extensions are blocked for upload here, you can rename it as .csv.
csv-crlf.txt

@xsleonard
Copy link
Contributor Author

xsleonard commented Mar 21, 2021

Using a trimStrategy of ["\r"] avoids a crash (CharacterSet.whitespaces.union(["\r"]) in my case).

A more intuitive solution would accept multicharacter Delimiter.Row (that's what I was doing before, $0.delimiter.row = "\r\n") but seems trickier to implement in the _parseEscapedField function.

@dehesa
Copy link
Owner

dehesa commented Mar 23, 2021

Hey @xsleonard. I built some tests and tried your file and everything worked.

extension DecodingBadInputTests {
    /// Tests CRLF delimiters with escape fields.
    func testFileCRLF() throws {
        let decoder = CSVDecoder {
            $0.encoding = .utf8
            $0.bufferingStrategy = .sequential
            $0.headerStrategy = .firstLine
            $0.trimStrategy = .whitespaces
            $0.delimiters.row = "\r\n"
        }
        let file = try decoder.decode([[String]].self, from: URL(fileURLWithPath: "/Users/marcos/Desktop/test.csv"))
        XCTAssertEqual(1, file.count)
        XCTAssertEqual(["G"], file[0])
    }
   
    /// Tests CRLF delimiters with escape fields.
    func testLazyFileCRLF() throws {
        let decoder = try CSVDecoder {
            $0.encoding = .utf8
            $0.bufferingStrategy = .sequential
            $0.headerStrategy = .firstLine
            $0.trimStrategy = .whitespaces
            $0.delimiters.row = "\r\n"
        }.lazy(from: URL(fileURLWithPath: "/Users/marcos/Desktop/test.csv"))
        XCTAssertEqual(["G"], try decoder.decodeRow([String].self))
        XCTAssertNil(decoder.next())
    }
}

I also recreated the file in code in case you sent me an invalid file, but it also worked.

extension DecodingBadInputTests {
    /// Tests CRLF delimiters with escape fields.
    func testInputCRLF() throws {
        let input = "Name\r\n\"G\"\r\n"
        let decoder = CSVDecoder {
            $0.encoding = .utf8
            $0.bufferingStrategy = .sequential
            $0.headerStrategy = .firstLine
            $0.trimStrategy = .whitespaces
            $0.delimiters.row = "\r\n"
        }
        let file = try decoder.decode([String].self, from: input)
        for row in file { print(row) }
    }
}

Are you sure the problem is not on your side. Maybe bad file data?

@xsleonard
Copy link
Contributor Author

Sorry, I described/diagnosed the problem wrong.

If you change $0.delimiters.row = "\r\n" to $0.delimiters.row = "\n" (or leave it unassigned to default to \n) you will see the error.

The error (invalidInput) occurs if the file has \r\n line endings but $0.delimiters.row = "\n". Previously the parser would leave a \r on the last field, but otherwise parse the file, so I'd trim control characters from the last field (post-decoding) in case the file was CRLF. I don't have control over whether or not the file is LF or CRLF.

The solution to parsing both CRLF and LF files is using a trimStrategy containing \r, which is fine for my case, but is not obvious, and the error message (invalidInput) does not point to the problem. Any CSV file generated on a windows machine will have CRLF line endings, so it needs to be handled for user-provided/wild CSVs.

I suggest one of these:

  • Improving the error message if it suspects CRLF, and/or including the invalid scalar in the error message userinfo, and a pointer to the fix to enable both CRLF + LF handling (trimStrategy)
  • Handling both LF and CRLF by default

dehesa added a commit that referenced this issue Mar 24, 2021
@dehesa
Copy link
Owner

dehesa commented Mar 24, 2021

You are right. The error message doesn't provide correct information. I have changed it adding more useful data and push to master.

Regarding changing the default (which is the same problem you brought in #33): It has some side-effects, since \r may be meaningful to some users and they may want that data in their fields. That is why I make the user to explicitly change the configuration values. I agree is not ideal and I will probably add some information in the README about CLRF files.

@xsleonard
Copy link
Contributor Author

I wouldn't make the \r in trimCharacters the default, besides your point, it also forces the use of CharacterSet which you've mentioned is a performance issue, so would impact those who need performance. Possibly it can support multiple row delimiters (\n and \r\n) or if delimiter is not explicitly configured, handle it with special-case code internally

@dehesa
Copy link
Owner

dehesa commented Mar 24, 2021

I agree that having a special row delimiter that cover both \n and \r\n would be nice to have. I did add that to the Roadmap the first time you brought it up. But I don't know when I will have time to implement it. If it is something that really bothers you, you can always try to implement it yourself 😅 The code that needs to be changed is pretty encapsulated in CSVReader.

@dehesa
Copy link
Owner

dehesa commented Aug 27, 2021

This functionality is now supported by CodableCSV. Thanks @xsleonard for bringing this forward.

@dehesa dehesa closed this as completed Aug 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants