Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ledger: handling unicode #195

Closed
petertseng opened this issue Oct 31, 2015 · 9 comments
Closed

ledger: handling unicode #195

petertseng opened this issue Oct 31, 2015 · 9 comments

Comments

@petertseng
Copy link
Member

While doing the ledger exercise, I noticed that both the example solution and initial implementation have a defect in how they count lines that are too long. The "Description" column of the ledger has width 25, so any string that has more than 25 characters should display the first 22 characters followed by "..."

And yet, the example solution and initial implementation both check for length greater than 27 characters, and cut at index 24 instead of 22.
https://github.com/exercism/xgo/blob/master/ledger/example.go#L44
https://github.com/exercism/xgo/blob/master/ledger/ledger.go#L98

You may ask, is it something to do with the padding in the string " | " that separates each column? No, it's not even that because there is in fact a test that tests for proper cutoffs: https://github.com/exercism/xgo/blob/master/ledger/ledger_test.go#L128

In fact it's because that is the only test that tests too-long descriptions and it happens to have two unicode codepoints that take up two bytes each, the two ö (C3 B6 in utf-8). Therefore, for that particular string, cutting at the 24th byte is cutting at the 22nd rune, and everything is correct.

I'm not sure if I'm comfortable leaving that code in example.go, because it's very much not generic - only works on strings with two more bytes than runes. I'm very tempted to add tests that more thoroughly test cutoffs, such as master...petertseng:ledger-unicode and then fix the code. Both the example and initial implementations currently fail:

--- FAIL: TestFormatLedgerSuccess (0.00s)
    ledger_test.go:347: FormatLedger for input named "overlong ascii descriptions" was expected to return...
        Date_______|_Description_______________|_Change
        01/01/2015_|_This_description_is_wa..._|________$1.00_

        ...but returned...
        Date_______|_Description_______________|_Change
        01/01/2015_|_This_description_is_way_..._|________$1.00_

(They cut the string at the wrong place and misalign the columns)

Thing is, if I add the tests, that means the initial implementation has to pass the tests too, and that means I have to also write bad code that happens to pass the tests. I'll consider whether that's a thing I want to do.

So while I think about that, I wanted to open an issue and ask what y'all think about it as well. Worth adding the tests? And ideas on how to write an intentionally-bad implementation that nevertheless passes tests?

As for a good implementation that works correctly, despite there being https://golang.org/pkg/unicode/utf8/#RuneCountInString , I'm not aware of something in Go that lets you substring at a certain rune count instead of a byte count. So that means I may have to iterate over the first 25 runes of the string (j := 0; for i, _ := range description { if (j == 25) { ... } ... j++ } or similar)

@soniakeys
Copy link
Contributor

Hmm, lemme try working the exercise...

@soniakeys
Copy link
Contributor

It looks like a little better would be to count just runes that pass unicode.IsGraphic? (didn't quite work.)

Edit: And a little better would be to use godoc.org/golang.org/x/text/width...

Ha! I have my program passing with descriptions of "Freude schöner Götterfunken" (those are combining characters) and "Ode To Joy 기쁨 에 송시 Hyuna". I'll hold off on posting an iteration so I don't spoil your fun.

@soniakeys
Copy link
Contributor

Huh. I was trying to use godoc.org/golang.org/x/text/currency but sadly ran into this:

    // TODO: use pattern.
    io.WriteString(s, opt.symbol(lang, cur))
    if v.amount != nil {
        s.Write(space)

The "pattern" mentioned in the comment would be a CLDR Number Pattern, which would do the right thing with a space between the currency symbol and the amount -- rather than hard coding a space.

@petertseng
Copy link
Member Author

Ahhh, so we should think about wide characters as well, and this is super interesting for your example "Ode To Joy 기쁨 에 송시 Hyuna" because the cut should happen in the middle of a wide character (in the middle of that 시) . What would you say the result should be when that happens? Replace the wide character with a " ...", or should it just be "..." without the space?

Date       | Description               | Change
01/01/2015 | Ode To Joy 기쁨 에 송...  |        $1.00 

or

Date       | Description               | Change
01/01/2015 | Ode To Joy 기쁨 에 송 ... |        $1.00 

(Oy, that doesn't even look right on my browser because the wide characters aren't exactly the width of two narrows... but it does look fine in a monospace font in my editor)

@petertseng
Copy link
Member Author

Eh, submitted an iteration that deals with wide characters and combining characters, though I'm not confident that I'm really doing the combining characters the right way. I'm somewhat wondering whether we should have people deal with wide characters at all. The argument for it is that if we're dealing with one aspect of unicode (that some codepoints take more than one byte) then we should deal with other aspects too (that characters have varying widths). On the other hand, it could be a lot to deal with in an exercise that is mostly about refactoring some existing code. Then again, maybe it's not that bad if we write some existing code in the initial ledger.go

I am almost tempted to say "this variable width business even deserves its own exercise", but I don't know.

@soniakeys
Copy link
Contributor

Cool. I uploaded an iteration at http://exercism.io/submissions/fbab39b6626b498d96e9f0466d4a5c6d. You can see I handle combining characters with if !unicode.Is(unicode.Mn, r) {.

Unicode is a rabbit hole. To keep the exercise focused on refactoring, we could just take out the multibyte test case and make it all ASCII. If we leave it in, we should at least handle the existing test case properly. Beyond that, I don't know. The exercise gets more complex, but I think in the real world most anyone who handles Unicode has to handle combining characters. I hear Go is popular in China. It's good stuff to be able do.

@petertseng
Copy link
Member Author

Ah I see. I guess my solution just kinda works by chance. I also do a width.LookupRune(r).Kind() but it so happened that the comining characters were EastAsianAmbiguous and so I just treated them as zero... but I'm sure there have to be other EastAsianAmbiguous characters that have nonzero widths, so my solution is not quite good enough.

Given all this, maybe the best thing to do is to go with ASCII only in the ledger tests, and then possibly think of some separate exercise that will deal with Unicode. Will probably be about 24 hours before I can do the first thing so someone could feel free to beat me to it. The Unicode exercise will of course need a bit of thought.

@kytrinyx
Copy link
Member

kytrinyx commented Nov 2, 2015

I think you're both right that

a) unicode handling is important, and
b) it's a rabbit hole, and
c) we should probably put it in a separate exercise.

@soniakeys
Copy link
Contributor

resolved with #199

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants