Golang String Gotchas
Go's strings are UTF-8 encoded. This is a great feature but does require you to understand the implications. Note: You can find the code file with all the examples from this article at stringgotchas
In Phillip Compeau's Finding Replication Origins in Bacterial Genomes lecture, we needed a function to reverse a string of characters representing a DNA sequence, which is a sequence of 4 nucleotides represented by the characters 'A', 'T', 'G', 'C'.
Phillip's string reverse function looks like this. The "P" in the name indicates it's Phillip's version:
func reverseP(s string) string {
	s2 := ""
	n := len(s)
	for i := range s {
		s2 += string(s[n-1-i])
	}
	return s2
}
He was only dealing with ASCII strings, and it's definitely the right approach for the task at hand. But if you use the function on strings that contain Unicode characters, you will see some strange results! It has to do with the way Go implements strings.
Here's another slightly different implementation (B is for Bill)
func reverseB(s string) string {
	s2 := ""
	for i := len(s) - 1; i >= 0; i-- {
		s2 += string(s[i])
	}
	return s2
}
Let's try these with an ASCII string:
s1 := "ATGC"
fmt.Println("Input string:", s1, "is length: ", len(s1))
fmt.Println("ReversedP:", reverseP(s1))
fmt.Println("ReversedB:", reverseB(s1))
This will output:
Input string: ATGC is length: 4
ReversedP: CGTA
ReversedB: CGTA
Just as we expect.
On the main Golang page, the Hello World example is written with the string "Hello, 世界". It demonstrates Go's full support for Unicode characters. Let's try reversing this string:
s2 := "Hello, 世界" //copied from https://golang.org
fmt.Println("Input string:", s2, "is length:", len(s2))
fmt.Println("ReversedP:", reverseP(s2))
fmt.Println("ReversedB:", reverseB(s2))
This will output:
Input string: Hello, 世界 is length: 13
ReversedP: ç¸ä ,l
ReversedB: ç¸ä ,olleH
Yikes! What gives here?
- I can only see 9 characters in the string. Why does `len(s2)` return a length of 13?
- Why do the Unicode characters "世界" get all messed up when we reverse the string?
- Why do `reverseP` and `reverseB` give such different results? `reverseP` seems to have skipped over some characters.
The answer is that Go strings are represented using the UTF-8 encoding format. UTF-8 is a very compact, efficient string format that is used by over 95% of the world's web pages, and has become the de facto standard for representing strings of text. One of its many great features is that it's backwards compatible with ASCII encoding: all 7-bit ASCII values are represented as they always were, in a single byte. Unicode characters have variable-sized encodings in UTF-8, ranging in length from 1 to 4 bytes (the original design allowed up to 6 bytes, but the current standard, RFC 3629, caps it at 4). An interesting bit of trivia: the inventors of UTF-8 are Ken Thompson and Rob Pike, who are also two of the inventors of Go (the other father of Go is Robert Griesemer).
The "世界" characters are each encoded as 3 bytes in UTF-8. This means the UTF-8 string "Hello, 世界" is actually 13 bytes long, even though it contains 9 characters.
But Go treats a string as a slice of bytes (type `[]byte`) under the hood. ASCII characters map unchanged to single bytes in UTF-8. As long as your strings only use the ASCII character set, you can treat Go strings as a read-only slice of bytes and everything will be fine. But when non-ASCII characters are in play, we must take into account the fact that they are represented as multiple bytes in UTF-8.
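As a quick check of the byte counts described above, here is a minimal sketch that compares `len` with `utf8.RuneCountInString` and dumps the raw bytes of the two Chinese characters. The helper name `byteAndRuneCounts` is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (a hypothetical helper) returns the number of bytes
// and the number of Unicode code points (runes) in s.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hello, 世界"
	nBytes, nRunes := byteAndRuneCounts(s)
	fmt.Println(nBytes, nRunes) // 13 9

	// The last 6 bytes hold the two 3-byte characters.
	fmt.Printf("% x\n", s[7:]) // e4 b8 96 e7 95 8c
}
```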
So:
- The reverseB function simply reverses all the bytes in the slice, so it messes up the UTF-8 multibyte characters.
- The failure mode of reverseP is a little more complicated. That's because when `range` is used with a string it ranges over the UTF-8 characters, not the bytes. This is one of the few places where Go treats a string as UTF-8 text rather than a slice of bytes. So `range` over "Hello, 世界" will set `i` to 0,1,2,3,4,5,6,7,10 in succession. `i` will never take on the values 8,9,11,12, even though these are valid indices when extracting bytes from the string. The loop therefore runs only 9 times, reading the bytes at positions 12 down to 5 and then position 2, which is why part of the string goes missing.
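The index sequence is easy to confirm. This small sketch collects the values `range` assigns to the index variable; `runeStarts` is a name invented for the example.

```go
package main

import "fmt"

// runeStarts (a hypothetical helper) returns the byte offsets that range
// assigns to the index variable when looping over s.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	fmt.Println(runeStarts("Hello, 世界")) // [0 1 2 3 4 5 6 7 10]
}
```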
Here's what you need to know to understand how things work:
- character set: a set of characters, along with some unique way to identify them.
- character encoding: a way of encoding (representing) a string of characters into an array of bits. There are many, many character encodings, and a specific character set may have many associated character encodings.
- ASCII: a character set and character encoding, where characters are represented using 7-bits of an 8-bit byte.
- Unicode: a huge, international standard character set. Each character in the character set is denoted by a "code point". A code point is formally an abstract identifier, without a defined encoding. Code points are written as `U+nnnn`, where nnnn is a hexadecimal number of 4 to 6 digits.
- UTF-8: a character encoding for Unicode, which is backwards compatible with ASCII, and which uses a variable-length encoding of 1 to 4 bytes to encode each character. (Other Unicode encodings include UTF-16, UTF-32, and UCS-2. But UTF-8 is by far the most common.)
- rune: Go calls a single character a `rune`. The type `rune` is just an alias for `int32`. In Go, character literals are placed in single quotes like `'A'` or `'⌘'`. No matter what the character, a single character is represented as an `int32` in Go. The value of a rune is the same as its Unicode code point.
- string: a Go `string` is, by convention, encoded as UTF-8. Each character in a Go string will vary in length from 1 to 4 bytes.
- Go strings do not contain runes! A transformation is required to go to/from a single rune to its equivalent variable-length UTF-8 encoding.
- Go `string`s are represented internally as a read-only slice of bytes (`[]byte`). `byte` is just a type alias for `uint8`.
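To make the rune-to-UTF-8 transformation concrete, here is a minimal sketch using `utf8.EncodeRune` and `utf8.DecodeRune` from the standard `unicode/utf8` package. The wrapper `encodeRune` is a hypothetical helper, not something from the article's code file.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRune (a hypothetical helper) returns the UTF-8 bytes for one rune.
func encodeRune(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	b := encodeRune('世')
	fmt.Printf("% x\n", b) // e4 b8 96

	// Decoding the bytes gives back the rune and its encoded size.
	r, size := utf8.DecodeRune(b)
	fmt.Printf("%#U %d\n", r, size) // U+4E16 '世' 3
}
```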
To make things fairly easy on the programmer, Go does a number of things:
Type conversions between `string`, `rune` and `[]rune` perform the UTF-8 encoding/decoding:
s4 := string('⌘') + string('A') + string('𐅻') // encodes rune characters into UTF-8 strings
fmt.Println(s4)
s4runes := []rune(s4) // decodes UTF-8 string into a slice of runes
s4UTF8 := string(s4runes) // encodes a slice of runes back into a UTF-8 string
fmt.Println(s4runes)
fmt.Println(s4UTF8)
prints:
⌘A𐅻
[8984 65 65915]
⌘A𐅻
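One practical use of these conversions: because a `[]rune` is mutable and indexed by character rather than by byte, it can be edited and then converted back. The following is only a sketch; `replaceRune` is a made-up helper and it panics if the index is out of range.

```go
package main

import "fmt"

// replaceRune (a hypothetical helper) returns a copy of s in which the
// character at position i, counted in runes rather than bytes, is
// replaced by r. It panics if i is out of range.
func replaceRune(s string, i int, r rune) string {
	runes := []rune(s)   // decode the UTF-8 string into a slice of runes
	runes[i] = r         // slices are mutable, strings are not
	return string(runes) // encode the runes back into a UTF-8 string
}

func main() {
	fmt.Println(replaceRune("⌘A𐅻", 1, '世')) // ⌘世𐅻
}
```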
The `for ... range` loop on a string treats the string as UTF-8 encoded text, not a `[]byte` slice. A range over a string will return two values:
- The first is an `int` which ranges over the starting position of each character in the string.
- The second is the actual rune extracted from the UTF-8 encoded string.
The code:
s3 := "Hi 𐅻世⌘"
fmt.Println("Input string:", s3, "is length:", len(s3))
fmt.Println("Input string:", s3, "is runelength:", utf8.RuneCountInString(s3))
for index, runeValue := range s3 {
	fmt.Printf("rune %#U has type %T starts at byte position %d\n", runeValue, runeValue, index)
	fmt.Printf("byte %#U has type %T starts at byte position %d\n", s3[index], s3[index], index)
}
results in the following output:
Input string: Hi 𐅻世⌘ is length: 13
Input string: Hi 𐅻世⌘ is runelength: 6
rune U+0048 'H' has type int32 starts at byte position 0
byte U+0048 'H' has type uint8 starts at byte position 0
rune U+0069 'i' has type int32 starts at byte position 1
byte U+0069 'i' has type uint8 starts at byte position 1
rune U+0020 ' ' has type int32 starts at byte position 2
byte U+0020 ' ' has type uint8 starts at byte position 2
rune U+1017B '𐅻' has type int32 starts at byte position 3
byte U+00F0 'ð' has type uint8 starts at byte position 3
rune U+4E16 '世' has type int32 starts at byte position 7
byte U+00E4 'ä' has type uint8 starts at byte position 7
rune U+2318 '⌘' has type int32 starts at byte position 10
byte U+00E2 'â' has type uint8 starts at byte position 10
First, let's go over the format "verbs" we used in `fmt.Printf()`:
- `%#U`: prints the representation for a Unicode code point, followed by its character literal
- `%T`: prints the type of the value it's given. How cool is that?!
- `%d`: prints an integer in base 10.
- `%v`: prints the value in a default format, based on its type.
Now let's look at the output in detail:
- The string `s3` is 13 bytes long, because a number of the characters in it have multibyte character encodings in UTF-8
- The function `utf8.RuneCountInString(s3)` gives an accurate count of the number of characters, which is 6
- The `for ... range` loop on a string treats strings as UTF-8, not a []byte slice. So:
  - The first variable is an `int` which ranges over the starting position of each character in the string. In this case that's 0,1,2,3,7,10
  - The second is a `rune` which returns the actual character extracted from the UTF-8 encoded string.
Since the example prints both the rune and the byte found at the first position of each character, it's easy to see how simply treating a Go string as a sequence of bytes can cause problems. Note how:
- ASCII characters are 1 byte long, and the byte version and rune version are equivalent.
- The '𐅻' (the Greek drachma character, BTW) is 4 bytes long, while the '世' Chinese character is 3 bytes. The '⌘' (Swedish point-of-interest character) is also 3 bytes.
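The start positions and byte widths noted above can also be recovered by decoding the string by hand with `utf8.DecodeRuneInString`, which is roughly what `range` does for you. This is a sketch only; the `runeSpan` type and `spans` function are names invented for the illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSpan (a hypothetical type) records where a character starts in a
// string and how many bytes its UTF-8 encoding occupies.
type runeSpan struct {
	Start int
	Size  int
	Rune  rune
}

// spans decodes s one character at a time, advancing by the size of
// each decoded rune.
func spans(s string) []runeSpan {
	var out []runeSpan
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, runeSpan{Start: i, Size: size, Rune: r})
		i += size
	}
	return out
}

func main() {
	for _, sp := range spans("Hi 𐅻世⌘") {
		fmt.Printf("%c starts at byte %d and is %d byte(s) long\n", sp.Rune, sp.Start, sp.Size)
	}
}
```

For "Hi 𐅻世⌘" the start positions come out as 0,1,2,3,7,10 and the sizes as 1,1,1,4,3,3.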
Here's a very simple way to reverse any UTF-8 string:
func reverseRuneB(s string) string {
	s2 := ""
	for _, char := range s {
		s2 = string(char) + s2
	}
	return s2
}
If you really want to know the most efficient, Go idiomatic way to safely reverse a string, look at the implementation of stringutil.Reverse(). Here's a version I edited for clarity:
// This version works for runes and is very efficient.
// We convert the input string into a slice of runes (which is mutable) and do the string reverse in place.
// We index from the front and the back, swap the 2 characters, then move 1 step toward the
// middle of the slice from both ends and repeat.
// This is the implementation of the stringutil.Reverse example at https://github.com/golang/example
// It's the Go idiomatic way to reverse a string.
func reverse(s string) string {
	r := []rune(s) // convert from a string to a slice of runes. Go will "unpack" the UTF-8 into runes
	for front, back := 0, len(r)-1; front < len(r)/2; front, back = front+1, back-1 {
		r[front], r[back] = r[back], r[front]
	}
	return string(r) // and convert from a []rune slice back to a UTF-8 string.
}
Let's try it:
s2 := "Hello, 世界" //copied from https://golang.org
fmt.Println("Input string:", s2, "is length:", len(s2))
fmt.Println("Reversed:", reverse(s2))
fmt.Println("ReversedRuneB:", reverseRuneB(s2))
fmt.Println("Reversed stringutil:", stringutil.Reverse(s2))
This will output:
Input string: Hello, 世界 is length: 13
Reversed: 界世 ,olleH
ReversedRuneB: 界世 ,olleH
Reversed stringutil: 界世 ,olleH
OK sorted! We can reverse any string! I'd encourage you to take the time to walk through in detail how `reverseP`, `reverseB` and `reverse` work and why they give the results that they do.
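One simple sanity check while walking through the functions is that reversing twice should give back the original string. Here is a minimal sketch of that check for the rune-based `reverse`, assuming the input is valid UTF-8 (invalid bytes are replaced with U+FFFD during the `[]rune` conversion, so they would not round-trip). The `roundTrips` helper is made up for this example.

```go
package main

import "fmt"

// reverse is the rune-based reversal from this article, repeated here so
// the example is self-contained.
func reverse(s string) string {
	r := []rune(s)
	for front, back := 0, len(r)-1; front < len(r)/2; front, back = front+1, back-1 {
		r[front], r[back] = r[back], r[front]
	}
	return string(r)
}

// roundTrips (a hypothetical helper) reports whether reversing s twice
// gives back s.
func roundTrips(s string) bool {
	return reverse(reverse(s)) == s
}

func main() {
	for _, s := range []string{"", "ATGC", "Hello, 世界", "Hi 𐅻世⌘"} {
		fmt.Printf("%q round-trips: %v\n", s, roundTrips(s))
	}
}
```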
- For more details on how Go implements strings read The Go Blog - Strings, bytes, runes and characters in Go by Rob Pike, one of the Go creators.
- To learn all about the package `fmt` formatting verbs, see the package fmt documentation and Go by Example - String Formatting
- For detailed background on Unicode and UTF-8, see UTF-8: Bits, Bytes, and Benefits and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Check out the Go strings package for lots of utilities for working with strings. The Go Unicode package has utilities to simplify working with runes. And the Go utf8 package has utilities to help with detailed encoding and decoding of UTF-8 strings.
- Find and copy your favourite Unicode characters at Unicode Character Table