Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsv-summarize: doesn't support windows line-ending #96

Closed
Halmaethor opened this issue Nov 24, 2017 · 6 comments
Closed

tsv-summarize: doesn't support windows line-ending #96

Halmaethor opened this issue Nov 24, 2017 · 6 comments

Comments

@Halmaethor
Copy link

When aggregating the last column, the prebuilt binary for linux (v1.1.15) returns the following error:

' when converting from type const(char)[] to type doubleUnexpected '
File: dummy.tsv Line: 2

I've used tsv-append, tsv-uniq and tsv-select, and each works just fine. After updating line-endings to unix, tsv-summarize works.

@jondegenhardt
Copy link
Contributor

jondegenhardt commented Nov 25, 2017

Thanks for the report and for trying the tools.

The only tool explicitly supporting Windows line-endings on Linux is csv2tsv. You are correct, tsv-append works fine, and a couple other tools as well, but it's more accidental than by design.

What's going on is that the tools are using D standard library functions for reading lines, these functions assume unix line endings on unix platforms. If the file has Windows line ending (a \r\n pair), the line is left with an extraneous \r character at the end of the last field. If this extraneous \r interfers with processing, the tool doesn't work.

In the case of tsv-summarize and tsv-filter, if they try to interpret that the last field as numeric value, the conversion will fail. However, even if they don't perform a conversion, the tools are not necessarily working correctly. tsv-select for example, isn't really preserving Windows line endings.

As an example:

$ # This outputs the Windows line endings
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | grep $'YY\r'
AA	XX	YY
BB	XX	YY
$ # Select the last field.
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 3
YY
YY
$ # Grep shows the Windows line ending
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 3 | grep $'YY\r'
YY
YY
$ # Select the second field
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 2
XX
XX
$ # Grep shows its not a Windows line ending.
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 2 | grep $'XX\r'
$

I'm not inclined to add support for Windows line endings on Unix platforms. The dos2unix tool is a good tool for this and fits the pipeline approach being used by the tools.

However, something is going wrong with the error message formatting, and the error message should identify a Windows line ending as a likely problem. The documentation for the tools should also discuss line endings. I'll have to look into both of these.

Regarding csv2tsv - This tool explicitly supports Windows line-endings because they are commonly used in many programs that generate CSV files.

@jondegenhardt
Copy link
Contributor

The badly formatted error message is due to the \r character being included in the error message. I'll have to fix it.

@Halmaethor
Copy link
Author

Thanks for your explanation and the great tools.

As you said, I used dos2unix to convert the line endings, so this issue wasn't deterrent to my work. Updating the documentation and especially the error message is more than enough to solve this issue.

@jondegenhardt
Copy link
Contributor

Current plan: On Unix builds, check for Windows/DOS line endings when processing the first line of a file. That should handle most cases prior to ever hitting the error message. Regarding the poor error message format: There's an open D bug for it: https://issues.dlang.org/show_bug.cgi?id=17708

jondegenhardt added a commit that referenced this issue Dec 18, 2017
* Fix Issue 96: Detect Windows line endings.

* Update tests.
@jondegenhardt
Copy link
Contributor

Addressed by PR #103, merged to master. Will be included in the next release.

@jondegenhardt
Copy link
Contributor

Included in release v.1.1.16.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants