The previous code was more concise, but alas GHC boxed each Word8 it read from the ByteString, which resulted in poor performance. This mankier code adds (seemingly required) strictness annotations, along with a little bit of manual CSE. Timing of the DecodeUtf8/Strict benchmark went from 41.8ms to 19.6ms, a pleasing improvement.
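As a rough illustration of the kind of change described (not the actual decoder), here is a minimal sketch that walks a UTF-8 ByteString with strict loop variables and reads each byte exactly once into a strict binding. The names countCodePoints and seqLen are hypothetical, and the sketch assumes well-formed input.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeIndex)
import Data.Word (Word8)

-- Hypothetical helper: length of a UTF-8 sequence, judged from its
-- lead byte. Assumes well-formed input; no validation is performed.
seqLen :: Word8 -> Int
seqLen b
  | b < 0x80  = 1
  | b < 0xE0  = 2
  | b < 0xF0  = 3
  | otherwise = 4

-- Hypothetical example: count code points in a UTF-8 ByteString.
-- The bangs on i and n keep the loop state evaluated on every
-- iteration, and the byte is read once into a strict binding and
-- reused (the "manual CSE").
countCodePoints :: B.ByteString -> Int
countCodePoints bs = go 0 0
  where
    len = B.length bs
    go :: Int -> Int -> Int
    go !i !n
      | i >= len  = n
      | otherwise =
          let !b = unsafeIndex bs i  -- single read; safe because i < len
          in go (i + seqLen b) (n + 1)
```

Whether the bangs are strictly necessary depends on what GHC's strictness analysis manages to infer for a given loop; here they simply make the intent explicit.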
When performance testing encodeUtf8, I noticed that "ensure" was still showing up in the profile, when I expected it not to. Turns out I was using a "min" where I should have been using a "max", so the initial bytestring was almost always too small, forcing reallocations and copying. Boo!
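To make the min/max slip concrete, here is a small sketch. The operands are assumptions (a 4-byte floor and the input length), since the message does not say what was actually being compared, and the doubling growth policy is likewise assumed. All names are hypothetical.

```haskell
-- Hypothetical: initial buffer size chosen from the input length.
-- With "min", the floor becomes a ceiling, so the buffer starts tiny.
initialSizeBuggy :: Int -> Int
initialSizeBuggy len = min 4 len

-- With "max", the buffer is at least as large as the input.
initialSizeFixed :: Int -> Int
initialSizeFixed len = max 4 len

-- Hypothetical: how many resizes a doubling buffer needs to grow
-- from capacity cap to hold need bytes.
resizesNeeded :: Int -> Int -> Int
resizesNeeded cap need
  | cap >= need = 0
  | otherwise   = 1 + resizesNeeded (max 1 (cap * 2)) need
```

Under these assumptions, a 1000-byte ASCII input needs 8 resizes with the "min" version (4 doubling up to 1024) and none with the "max" version.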
We had been performing a resize any time that (a) we had data to write and (b) we got to within 4 bytes of filling the target bytestring. This was safe, but suboptimal, as it meant that in the common case of encoding ASCII text, we would *always* perform a resize. Now we compute the exact number of bytes the next character needs, and resize only if they won't fit. This eliminates resizes for ASCII data, and makes them a little less likely for other data.
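The two checks can be sketched side by side. This is an illustration under assumptions, not the library's code: the names are hypothetical, cap is the buffer capacity, and used is the number of bytes already written.

```haskell
import Data.Char (ord)

-- Number of bytes a code point occupies in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Hypothetical old check: resize whenever fewer than 4 bytes remain,
-- regardless of how many bytes the next character needs.
needsResizeOld :: Int -> Int -> Char -> Bool
needsResizeOld cap used _ = used + 4 > cap

-- Hypothetical new check: resize only if this character will not fit.
needsResizeNew :: Int -> Int -> Char -> Bool
needsResizeNew cap used c = used + utf8Len c > cap
```

For ASCII input in a buffer sized to the input length, the last few characters always land within 4 bytes of the end, so the old check fires even though each needs only one byte: needsResizeOld 10 9 'a' is True, while needsResizeNew 10 9 'a' is False.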