Golang String Gotchas

abbourne edited this page Nov 10, 2019 · 64 revisions

Go's strings are UTF-8 encoded. This is a great feature but does require you to understand the implications. Note: You can find the code file with all the examples from this article at stringgotchas

Go's string - More Than Meets the Eye

In Phillip Compeau's Finding Replication Origins in Bacterial Genomes lecture, we needed a function to reverse a string of characters representing a DNA sequence, which is a sequence of 4 nucleotides represented by the characters 'A', 'T', 'G', 'C'.

Phillip's string reverse function looks like this. The "P" in the name indicates it's Phillip's version:

  func reverseP(s string) string {
      s2 := ""
      n := len(s)
      for i := range s {
          s2 += string(s[n-1-i])
      }
      return s2
  }

He was only dealing with ASCII strings, and it's definitely the right approach for the task at hand. But if you use the function on strings that contain Unicode characters, you will see some strange results! It has to do with the way Go implements strings.

Here's another slightly different implementation (B is for Bill):

  func reverseB(s string) string {
      s2 := ""
      for i := len(s) - 1; i >= 0; i-- {
          s2 += string(s[i])
      }
      return s2
  }

Let's try these with an ASCII string:

    s1 := "ATGC"
    fmt.Println("Input string:", s1, "is length: ", len(s1))
    fmt.Println("ReversedP:", reverseP(s1))
    fmt.Println("ReversedB:", reverseB(s1))

This will output:

  Input string: ATGC is length:  4
  ReversedP: CGTA
  ReversedB: CGTA

Just as we expect.

On the main Golang page, the Hello World example is written with the string "Hello, 世界". It demonstrates Go's full support for Unicode characters. Let's try reversing this string:

    s2 := "Hello, 世界" //copied from https://golang.org
    fmt.Println("Input string:", s2, "is length:", len(s2))
    fmt.Println("ReversedP:", reverseP(s2))
    fmt.Println("ReversedB:", reverseB(s2))

This will output:

    Input string: Hello, 世界 is length: 13
    ReversedP: ç¸ä ,l
    ReversedB: ç¸ä ,olleH

Yikes! What gives here?

  • I can only see 9 characters in the string. Why does len(s2) return a length of 13??
  • Why do the Unicode characters "世界" get all messed up when we reverse the string?
  • Why do reverseP and reverseB give such different results? reverseP seems to have skipped over some characters

The answer is that Go strings are represented using the UTF-8 encoding format. UTF-8 is a very compact, efficient string format that is used by over 95% of the world's web pages, and has become the de facto standard for representing strings of text. One of its many great features is that it's backwards compatible with ASCII encoding: all 7-bit ASCII values are represented as they always were, in a single byte. Unicode characters have variable-sized encodings in UTF-8, ranging in length from 1 to 4 bytes. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current standard caps it at 4.) An interesting bit of trivia: the inventors of UTF-8 are Ken Thompson and Rob Pike, who are also two of the inventors of Go (the other father of Go is Robert Griesemer).

The "世界" characters are each encoded as 3 bytes in UTF-8. This means the UTF-8 string "Hello, 世界" is actually 13 bytes long (7 ASCII bytes plus 2 × 3 bytes), even though it contains 9 characters.

Under the hood, though, Go treats a string as a read-only slice of bytes (much like a []byte). ASCII characters map unchanged to single bytes in UTF-8. As long as your strings only use the ASCII character set, you can index a Go string byte by byte and everything will be fine. But when non-ASCII characters are in play, we must take into account the fact that they are represented as multiple bytes in UTF-8.
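
To make the byte/character distinction concrete, here is a minimal sketch that prints the byte length, the character count, and the raw bytes of the example string. The helper name describe is made up for this illustration; len and utf8.RuneCountInString come from the language and the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for this example. It returns the
// number of bytes and the number of Unicode characters (runes) in s.
func describe(s string) (numBytes, numRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hello, 世界"
	b, r := describe(s)
	fmt.Println("bytes:", b, "runes:", r) // bytes: 13 runes: 9

	// "% x" prints every byte of the string in hex, separated by spaces.
	// The last six bytes are the two 3-byte encodings of 世 and 界.
	fmt.Printf("% x\n", s)
}
```

The hex dump ends in e4 b8 96 e7 95 8c: three bytes for '世' followed by three bytes for '界'.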

So:

  • The reverseB function simply reverses all the bytes in the slice, so it messes up the UTF-8 multibyte characters.
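  • The failure mode of reverseP is a little more complicated. That's because when range is used with a string it ranges over the UTF-8 characters, not the bytes. This is one of the few places where Go treats a string as UTF-8 rather than as a slice of bytes (conversions to and from rune and []rune are another). So range over "Hello, 世界" will set i to 0,1,2,3,4,5,6,7,10 in succession. i will never take on the values 8,9,11,12, even though these are valid indices when extracting bytes from the string. The loop therefore runs only 9 times, and s[n-1-i] picks out individual bytes from the end of the string, which is why the output is both garbled and too short.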
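
The index behaviour of range is easy to check for yourself. The following sketch collects the index values that range produces for the example string; rangeIndices is a hypothetical helper name used only for this illustration.

```go
package main

import "fmt"

// rangeIndices is a hypothetical helper for this example. It collects
// the index values that a for ... range loop produces for s, i.e. the
// byte offset at which each character starts.
func rangeIndices(s string) []int {
	var idx []int
	for i := range s {
		idx = append(idx, i)
	}
	return idx
}

func main() {
	s := "Hello, 世界"
	fmt.Println(rangeIndices(s)) // [0 1 2 3 4 5 6 7 10]
	fmt.Println(len(s))          // 13, so byte indices run from 0 to 12
}
```

Nine indices come back for a 13-byte string, which is exactly why reverseP only appends nine times.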

Working with rune, Unicode code points, UTF-8, and ASCII bytes

Here's what you need to know to understand how things work:

  • character set: a set of characters, along with some unique way to identify them.
  • character encoding: a way of encoding (representing) a string of characters into an array of bits. There are many, many character encodings, and a specific character set may have many associated character encodings.
  • ASCII: a character set and character encoding, where characters are represented using 7 bits of an 8-bit byte.
  • Unicode: a huge, international standard character set. Each character in the character set is denoted by a "code point". A code point is formally an abstract identifier, without a defined encoding. Code points are written as U+ followed by four to six hexadecimal digits, for example U+0041 for 'A' or U+1017B for '𐅻'.
  • UTF-8: a character encoding for Unicode, which is backwards compatible with ASCII, and which uses a variable-length encoding of 1 to 4 bytes to encode each character. (Other Unicode encodings include UTF-16, UTF-32, and UCS-2. But UTF-8 is by far the most common).
  • rune: Go calls a single character a rune. The type rune is just an alias for int32. In Go, character literals are placed in single quotes like 'A' or '⌘'. No matter what the character, a single character is represented as an int32 in Go. The value of a rune character is the same as its Unicode code point.
  • string: a Go string is, by convention, encoded as UTF-8 (string literals normally are, because Go source code is UTF-8, although a string value can technically hold arbitrary bytes). Each character in a Go string will vary in length from 1 to 4 bytes.
  • Go strings do not contain runes! A transformation is required to go to/from a single rune and its equivalent variable-length UTF-8 encoding.
  • Go strings are represented internally as a read-only sequence of bytes, which behaves much like a []byte. byte is just a type alias for uint8.
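
The transformation between a rune and its UTF-8 bytes can also be done by hand with the standard unicode/utf8 package. This is only a sketch to show the idea: encode is a hypothetical helper name, while utf8.EncodeRune, utf8.DecodeRune and utf8.UTFMax (the maximum encoded length of a rune, which is 4) are real standard library names.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is a hypothetical helper for this example. It returns the
// UTF-8 bytes that represent the single rune r.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // large enough for any rune
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes written
	return buf[:n]
}

func main() {
	b := encode('⌘')
	fmt.Printf("% x\n", b) // e2 8c 98

	// Decoding goes the other way: bytes in, rune and byte count out.
	r, size := utf8.DecodeRune(b)
	fmt.Printf("%#U %d\n", r, size) // U+2318 '⌘' 3
}
```

One int32-sized rune goes in, and between 1 and 4 bytes come out, depending on the code point.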

To make things fairly easy on the programmer, Go does a number of things:

Type conversions between string, rune and []rune perform the UTF-8 encoding/decoding:

   s4 := string('⌘') + string('A') + string('𐅻') // encodes rune characters into UTF-8 strings
   fmt.Println(s4)
   s4runes := []rune(s4)     // decodes UTF-8 string into a slice of runes
   s4UTF8 := string(s4runes) // encodes a slice of runes back into a UTF-8 string
   fmt.Println(s4runes)
   fmt.Println(s4UTF8)

prints:

⌘A𐅻
[8984 65 65915]
⌘A𐅻

The for ... range loop on a string treats strings as UTF-8 encoded, not a []byte slice. A range over a string will return two values:

  • The first is an int which ranges over the starting position of each character in the string.
  • The second is the actual rune (character) decoded from the UTF-8 encoded string at that position.

The code:

	s3 := "Hi 𐅻世⌘"
	fmt.Println("Input string:", s3, "is length:", len(s3))
	fmt.Println("Input string:", s3, "is runelength:", utf8.RuneCountInString(s3))
	for index, runeValue := range s3 {
		fmt.Printf("rune %#U has type %T starts at byte position %d\n", runeValue, runeValue, index)
		fmt.Printf("byte %#U has type %T starts at byte position %d\n", s3[index], s3[index], index)
	}

results in the following output:

Input string: Hi 𐅻世⌘ is length: 13
Input string: Hi 𐅻世⌘ is runelength: 6
rune U+0048 'H' has type int32 starts at byte position 0
byte U+0048 'H' has type uint8 starts at byte position 0
rune U+0069 'i' has type int32 starts at byte position 1
byte U+0069 'i' has type uint8 starts at byte position 1
rune U+0020 ' ' has type int32 starts at byte position 2
byte U+0020 ' ' has type uint8 starts at byte position 2
rune U+1017B '𐅻' has type int32 starts at byte position 3
byte U+00F0 'ð' has type uint8 starts at byte position 3
rune U+4E16 '世' has type int32 starts at byte position 7
byte U+00E4 'ä' has type uint8 starts at byte position 7
rune U+2318 '⌘' has type int32 starts at byte position 10
byte U+00E2 'â' has type uint8 starts at byte position 10

First, let's go over the format "verbs" we used in fmt.Printf():

  • %#U: prints the Unicode code point in U+ notation, followed by its character literal
  • %T: prints the type of the value it's given. How cool is that?!
  • %d: prints an integer in base 10. We used it for the byte position.
  • %v: prints the value in a default format, based on its type. (Not used above, but handy to know.)

Now let's look at the output in detail:

  • The string s3 is 13 bytes long, because a number of the characters in it have multibyte character encodings in UTF-8
  • The function utf8.RuneCountInString(s3) gives an accurate count of the number of characters, which is 6
  • The for ... range loop on a string treats strings as UTF-8, not a []byte slice. So:
    • The first variable is an int which ranges over the starting position of each character in the string. In this case that's 0,1,2,3,7,10
    • The second is a rune: the actual character decoded from the UTF-8 encoded string.

Since the example prints both the rune and the byte found at the starting position of each character, it's easy to see how simply treating a Go string as a sequence of bytes can cause problems. Note how:

  • ASCII characters are 1 byte long, and the byte version and rune version are equivalent.
  • The '𐅻' (the Greek drachma character, BTW) is 4 bytes long, while the '世' Chinese character is 3 bytes. The '⌘' (Swedish point-of-interest character) is also 3 bytes. For these, the first byte on its own prints as an unrelated character ('ð', 'ä', 'â').
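
If you just want the encoded size of each character without doing the index arithmetic yourself, the standard library's utf8.RuneLen reports it directly. A small sketch, where runeLens is a hypothetical helper name used only here:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLens is a hypothetical helper for this example. It returns the
// number of bytes each character of s occupies in UTF-8.
func runeLens(s string) []int {
	var lens []int
	for _, r := range s {
		lens = append(lens, utf8.RuneLen(r))
	}
	return lens
}

func main() {
	fmt.Println(runeLens("Hi 𐅻世⌘")) // [1 1 1 4 3 3]
}
```

The sizes add up to 13, matching len(s3) above.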

Wrapping Up

Here's a very simple way to reverse any UTF-8 string:

   func reverseRuneB(s string) string {
       s2 := ""
       for _, char := range s {
           s2 = string(char) + s2
       }
       return s2
   }
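
One thing to be aware of: prepending to s2 copies the whole string on every iteration, so this simple version does more and more work as the input grows. As one possible alternative (my own sketch, not taken from the lecture or from stringutil), you can walk the string from the back with utf8.DecodeLastRuneInString and append each rune to a strings.Builder. The name reverseBuilder is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// reverseBuilder is a hypothetical alternative to reverseRuneB. It
// decodes one rune at a time from the end of s and appends it to a
// strings.Builder, so earlier output is never copied again.
func reverseBuilder(s string) string {
	var b strings.Builder
	b.Grow(len(s)) // valid UTF-8 reverses to the same number of bytes
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		b.WriteRune(r)
		s = s[:len(s)-size] // drop the rune we just consumed
	}
	return b.String()
}

func main() {
	fmt.Println(reverseBuilder("Hello, 世界")) // 界世 ,olleH
}
```

For short strings like the ones in this article the difference is negligible, so pick whichever version reads best to you.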

If you really want to know the most efficient, Go idiomatic way to safely reverse a string, look at the implementation of stringutil.Reverse(). Here's a version I edited for clarity:

   // This version works for runes and is very efficient.
   // We convert the input string into a slice of runes (which is mutable) and reverse it in place.
   // We index from the front and the back, swap the 2 characters, then move 1 step toward the
   // middle of the slice from both ends and repeat.
   // This is the implementation of the stringutil.Reverse example at https://github.com/golang/example
   // It's the Go idiomatic way to reverse a string.
   func reverse(s string) string {
       r := []rune(s) // convert from a string to a slice of runes. Go will "unpack" the UTF-8 into runes
       for front, back := 0, len(r)-1; front < len(r)/2; front, back = front+1, back-1 {
           r[front], r[back] = r[back], r[front]
       }
       return string(r) // and convert from the []rune slice back to a UTF-8 string
   }

Let's try it:

   s2 := "Hello, 世界" //copied from https://golang.org
   fmt.Println("Input string:", s2, "is length:", len(s2))
   fmt.Println("Reversed:", reverse(s2))
   fmt.Println("ReversedRuneB:", reverseRuneB(s2))
   fmt.Println("Reversed stringutil:", stringutil.Reverse(s2))

This will output:

  Input string: Hello, 世界 is length: 13
  Reversed: 界世 ,olleH
  ReversedRuneB: 界世 ,olleH
  Reversed stringutil: 界世 ,olleH

OK sorted! We can reverse any string! I'd encourage you to take the time to walk through in detail how reverseP, reverseB and reverse work and why they give the results that they do.

Learning More