# Encoding

Did you ever open up a file and see something like the following?:

or even

What about importing a text data file only to find two or three strange characters in your first field, which can be resistant to removal? Such as:

Then you've already experienced problems with character encoding. You may have had many more instances where an encoding problem was more subtle and it escaped your attention.
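
As a rough sketch of how such symptoms can arise, assume a utf-8 file is read by a tool that expects cp1252. The sample string and variable names are illustrative, not taken from a real file:

```python
import codecs

# utf-8 bytes for 'café' (the \u00e9 escape is the e with an acute accent)
raw = 'caf\u00e9'.encode('utf-8')

# Decoding with the wrong code page silently produces the wrong characters
wrong = raw.decode('cp1252')
print(wrong)            # the accented letter comes out as two characters

# A utf-8 byte order mark read as cp1252 shows up as three stray characters
bom_as_cp1252 = codecs.BOM_UTF8.decode('cp1252')
print(bom_as_cp1252)

# Decoding with the right encoding recovers the text; 'utf-8-sig' also drops the mark
right = (codecs.BOM_UTF8 + raw).decode('utf-8-sig')
print(right)
```

The three stray characters at the start of a first field are a common sign that a file starts with a byte order mark the reader did not expect.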

One of the more common challenges for a data wrangler is text data dumps from SQL databases. It is fairly easy to specify a custom delimiter like pipes (|, always a good idea) in the Options dialog of SQL Server Management Studio, but I know of no way to alter the character encoding specified in the table/database design. Thus one will frequently encounter utf-16 encoding in text dumps from SQL databases. Not only are these generally twice as big as they need to be, they cause numerous other challenges as well (see below).
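
If the encoding cannot be changed at the source, one option is to re-encode the dump afterwards. The following is a minimal sketch, assuming the dump starts with a byte order mark (which Python's generic 'utf-16' codec relies on); the function name and file names are hypothetical:

```python
import os
import tempfile

def convert_utf16_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Re-encode a utf-16 text file as utf-8, one chunk at a time."""
    # newline='' keeps the original line endings untouched
    with open(src_path, 'r', encoding='utf-16', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for chunk in iter(lambda: src.read(chunk_chars), ''):
            dst.write(chunk)

with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, 'dump_utf16.txt')   # hypothetical file names
    dst = os.path.join(tmpdir, 'dump_utf8.txt')
    # Create a tiny pipe-delimited stand-in for a database dump
    with open(src, 'w', encoding='utf-16', newline='') as fh:
        fh.write('id|name\n1|Smith\n2|Jones\n')
    convert_utf16_to_utf8(src, dst)
    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    with open(dst, 'r', encoding='utf-8', newline='') as fh:
        converted = fh.read()

print(src_size, dst_size)
```

Reading in chunks keeps memory use flat for large dumps; for small files a single read() and write() would do just as well.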

One of the biggest changes from Python 2.xx to 3.xx was the added native support of Unicode in 3.xx. Many of the complaints about 3.xx are rooted in the fact that it is frankly so much easier to work just with ASCII. However, the Internet and the international exchange of ideas and text are here to stay and one has to accept that Unicode will overtake ASCII, just as ASCII beat out its competitors back in the dawn of the computer age.

First a little setup code to prepare for the examples later in this document.

In [1]:
import os
os.chdir(r'C:\Users\dowes\OneDrive\Projects\Encoding')

import pandas as pd
import codecs as cds

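def endecode(x, cp):
    # Treat x as a single byte value (0-255) and decode it with code page cp.
    # Return '' when that byte is undefined in the code page.
    try:
        s = chr(x)
        bs = cds.encode(s, 'latin_1')
        return cds.decode(bs, cp)
    except ValueError:  # UnicodeError is a subclass of ValueError
        return ''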
  
f  = lambda x : hex(x)[2:].zfill(2)
bi = lambda x : bin(x)[2:].zfill(8)
ed1252 = lambda x : endecode(x, 'cp1252')
ed1253 = lambda x : endecode(x, 'cp1253')
ed1254 = lambda x : endecode(x, 'cp1254')
ed_ascii = lambda x : endecode(x, 'ascii')
ed_latin = lambda x : endecode(x, 'latin_1')
ed_8859_2 = lambda x : endecode(x, 'iso8859_2')
ed_8859_5 = lambda x : endecode(x, 'iso8859_5')
ed_macrom = lambda x : endecode(x, 'mac_roman')

## The Basics

As everyone knows, computers work with binary 'bits'. For example:

But that's tough to interpret, even for a computer. So, we need some rules. For example, let's agree to consider bits in blocks of 8, which we'll call bytes:
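
As a rough illustration of the idea, here is a run of bits with and without the 8-bit grouping. The two bytes used are an arbitrary choice for this sketch:

```python
# A stream of bits with no structure (these happen to be the bytes of b'Hi')
bits = ''.join(format(byte, '08b') for byte in b'Hi')
print(bits)

# The same bits, read 8 at a time
groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]
print(' '.join(groups))
```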

## ASCII

That's better. Now let's have these bytes "stand for something", such as text characters (including numeric text characters). Note that 8 binary bits can represent 256 different values ($2^8 = 256$) from 0 to 255. You would think that is plenty of values that we can use to code for (stand for) text characters. In fact, 128 values should be enough. Such was the start of ASCII, a rule system for encoding characters. ASCII came into prominence in the 1960s, beating out alternatives such as BCD (binary-coded decimal), FIELDATA (US Army), and EBCDIC (extended BCD, an IBM development for the System/360). Incidentally, it was IBM's adoption of ASCII for the IBM PC that really cemented ASCII's position.

Here are most of the 128 ASCII values:  
- 0-31 are hidden because they are control codes that only machines understand  
- most of the middle rows are elided in the output for brevity  
- the unlabeled first column (the index) is the decimal value  
- the second column is the byte itself, in binary  
- the third is the value in hexadecimal  
- the fourth is the character.

In [3]:
dec = pd.Series(data = range(128))
hx = dec.map(f)
bn = dec.map(bi)
c_ascii = dec.map(ed_ascii)

df = pd.DataFrame(data = {'bin' : bn, 'hex' : hx, 'ascii' : c_ascii})
df = df[['bin', 'hex', 'ascii']]
print(df[32:])

          bin hex ascii
32   00100000  20      
33   00100001  21     !
34   00100010  22     "
35   00100011  23     #
36   00100100  24     $
..        ...  ..   ...
123  01111011  7b     {
124  01111100  7c     |
125  01111101  7d     }
126  01111110  7e     ~
127  01111111  7f     

[96 rows x 3 columns]


Here's "Hello World" in binary, decimal, and hex bytes, along with ASCII characters. 

It is not hard to see the appeal of using hex for byte values:
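
A minimal sketch of that comparison, using only built-in string formatting (the variable names are illustrative):

```python
text = 'Hello World'
data = text.encode('ascii')   # one byte per character in the ASCII range

print('char  dec  hex  bin')
for ch, byte in zip(text, data):
    print(f'{ch!r:>4}  {byte:>3}  {byte:02x}   {byte:08b}')

# The whole string as hex is far more compact than as binary
hex_form = data.hex()
bin_form = ' '.join(f'{byte:08b}' for byte in data)
print(hex_form)
print(bin_form)
```

Each byte takes exactly two hex digits, versus eight binary digits or a variable number of decimal digits.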

## Extended ASCII

ASCII is the acronym for American Standard Code for Information Interchange. As Americans, the inventors felt it was more than adequate. However, Europeans with their umlauts, tildes, and other diacritical marks needed more. Since basic ASCII is 7-bit (it leaves the high bit of each byte unused), going to 8 bits was a way to gain 128 more characters to satisfy the Europeans while retaining backward compatibility with ASCII. Thus Extended ASCII was born - actually, a variety of extended ASCIIs were born:

- IBM - various "code pages"  
- Apple - MacRoman  
- DEC - Multinational Character Set (a forerunner of ISO 8859-1)

Eventually, the ISO released a set of standards called "ISO 8859" of which "ISO 8859-1" (aka Latin-1) is the one used in Western Europe and the USA. Another standard was created for Eastern Europe (ISO 8859-2), one for Cyrillic languages (ISO 8859-5), and others.

In [3]:
dec = pd.Series(data = range(128, 256))
hx = dec.map(f)
bn = dec.map(bi)
c8859_1 = dec.map(ed_latin)
c8859_2 = dec.map(ed_8859_2)
c8859_5 = dec.map(ed_8859_5)

df = pd.DataFrame(data = {'bin' : bn, 'hex' : hx, 'iso_8859-1' : c8859_1, 'iso_8859-2' : c8859_2, 'iso_8859-5' : c8859_5})
df = df[['bin', 'hex', 'iso_8859-1', 'iso_8859-2', 'iso_8859-5']]
df.index = range(128, 256)
print(df[33:70])
print(df[100:])
## again some rows are hidden for the sake of brevity

          bin hex iso_8859-1 iso_8859-2 iso_8859-5
161  10100001  a1          ¡          Ą          Ё
162  10100010  a2          ¢          ˘          Ђ
163  10100011  a3          £          Ł          Ѓ
164  10100100  a4          ¤          ¤          Є
165  10100101  a5          ¥          Ľ          Ѕ
166  10100110  a6          ¦          Ś          І
167  10100111  a7          §          §          Ї
168  10101000  a8          ¨          ¨          Ј
169  10101001  a9          ©          Š          Љ
170  10101010  aa          ª          Ş          Њ
171  10101011  ab          «          Ť          Ћ
172  10101100  ac          ¬          Ź          Ќ
173  10101101  ad          ­          ­          ­
174  10101110  ae          ®          Ž          Ў
175  10101111  af          ¯          Ż          Џ
176  10110000  b0          °          °          А
177  10110001  b1          ±          ą          Б
178  10110010  b2          ²          ˛          В
179  10110011  b3          ³   

To be clear, a "code page" is typically a mapping of all 256 1-byte values to characters. What I first showed (ASCII) were the first 128 character encodings, which are generally common to all code pages. Then I showed the next 128 characters of some extensions of ASCII, starting with "ISO 8859-1" (aka Latin-1).

Unfortunately, having a code page for Western Europe (ISO 8859-1), another for Eastern Europe (ISO 8859-2), one for Cyrillic languages (ISO 8859-5), etc. makes it difficult to share documents and messages.
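
To make the sharing problem concrete, here is a small sketch that decodes one and the same byte under three of these code pages, using Python's built-in codec names:

```python
raw = bytes([0xA3])   # a single byte with decimal value 163

codecs_to_try = ('latin_1', 'iso8859_2', 'iso8859_5')
decoded = {codec: raw.decode(codec) for codec in codecs_to_try}

for codec, char in decoded.items():
    print(codec, char)
```

The byte never changes; only the reader's assumption about the code page does, which is why a document written under one code page can look wrong under another.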

On top of that, Microsoft has pursued its penchant for proprietary solutions by developing its own set of code pages, cp1250 through cp1258 for Windows plus more for MS-DOS. Of these, the Western Europe version cp1252 is the best known. It matches ISO 8859-1 except that it assigns printable characters to byte values 128 through 159, and it is often loosely called "Latin-1" as well, a common source of confusion.

The following shows the variability among these code pages - again, the differences are primarily in the byte values of 128 through 255.

In [4]:
dec = pd.Series(data = range(128, 256))
hx = dec.map(f)
bn = dec.map(bi)
c8859_1 = dec.map(ed_latin)
c1252 = dec.map(ed1252)
c1253 = dec.map(ed1253)
c1254 = dec.map(ed1254)
cMacRom = dec.map(ed_macrom)

df = pd.DataFrame(data = {'bin' : bn, 'hex' : hx, 'iso_8859-1' : c8859_1, 'Win_1252' : c1252, 'Win_1253' : c1253, 'Win_1254' : c1254, 'Mac_Roman' : cMacRom})
df = df[['bin', 'hex', 'iso_8859-1', 'Win_1252', 'Win_1253', 'Win_1254', 'Mac_Roman']]
df.index = range(128, 256)
print(df[:70])
print(df[100:])
## again some rows are hidden for the sake of brevity

          bin hex iso_8859-1 Win_1252 Win_1253 Win_1254 Mac_Roman
128  10000000  80                  €        €        €         Ä
129  10000001  81                                              Å
130  10000010  82                  ‚        ‚        ‚         Ç
131  10000011  83                  ƒ        ƒ        ƒ         É
132  10000100  84                  „        „        „         Ñ
133  10000101  85                  …        …        …         Ö
134  10000110  86                  †        †        †         Ü
135  10000111  87                  ‡        ‡        ‡         á
136  10001000  88                  ˆ                 ˆ         à
137  10001001  89                  ‰        ‰        ‰         â
138  10001010  8a                  Š                 Š         ä
139  10001011  8b                  ‹        ‹        ‹         ã
140  10001100  8c                  Œ                 Œ         å
141  10001101  8d                                              ç
142  10001

Note that the biggest difference between the ISO standard and the proprietary code pages is the use of the byte values 128 - 159 for characters (in the proprietary code pages). By the way, cp1252 is also known as "ANSI-1252" or "Windows-1252". It is also called "Latin-1" although that term is best reserved for ISO-8859-1.
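
The difference in the 128 - 159 range can be checked directly. This is a small sketch using Python's built-in codecs; the variable names are illustrative:

```python
raw = bytes([0x80])   # byte value 128

as_cp1252 = raw.decode('cp1252')       # a printable character in cp1252
as_latin1 = raw.decode('latin_1')      # an invisible control character in ISO 8859-1
as_macroman = raw.decode('mac_roman')  # a different printable character again
print(repr(as_cp1252), repr(as_latin1), repr(as_macroman))

# Python's cp1252 codec leaves a few byte values undefined, so decoding them fails
try:
    bytes([0x81]).decode('cp1252')
    undefined_in_cp1252 = False
except UnicodeDecodeError:
    undefined_in_cp1252 = True
print(undefined_in_cp1252)
```

The undefined bytes are why the cp1252 column in the table has a few blank cells.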

As if encoding weren't already getting out of hand, the advent of the Internet brought a need to handle the Russian, Hindi, Arabic, Hebrew, Korean, etc. alphabets as well.

The solution? Multi-byte encoding and something called Unicode.

## Unicode

Actually, Unicode is not an encoding system. Unicode is an agreed-upon standard of *code points* (numeric values) for characters. How the code points are turned into bytes is left to separate encodings. We'll see how these can vary when we compare utf-8 and utf-16.
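
A short sketch of the distinction: one character has one code point, but its bytes depend on the encoding chosen. The Euro sign is used here only as a convenient example:

```python
ch = '\u20ac'                 # the Euro sign
code_point = ord(ch)          # the code point is just a number
print(code_point, hex(code_point))

# The same code point turned into bytes by three different encodings
encodings = ('utf-8', 'utf-16be', 'utf-32be')
as_bytes = {enc: ch.encode(enc) for enc in encodings}
for enc, data in as_bytes.items():
    print(f'{enc:<9}', ' '.join(f'{b:02x}' for b in data))
```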

Unicode code points run from 0 to 1,114,111, so an encoding needs at most 4 bytes per character, and there is room for over 1 million characters. That's enough for all the known human languages and then some (a proposal to add *Klingon* was considered, though it was turned down).
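
The size of the code point range can be checked from Python itself. This is a small sketch; the sample code points in the last line are arbitrary picks:

```python
import sys

highest = sys.maxunicode          # the largest code point chr() accepts
total_code_points = highest + 1
print(highest, hex(highest), total_code_points)

# Anything beyond the range is rejected outright
try:
    chr(highest + 1)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
print(beyond_range_ok)

# Even the highest code point fits in 4 utf-8 bytes
widest = max(len(chr(cp).encode('utf-8')) for cp in (0x41, 0xE7, 0x20AC, highest))
print(widest)
```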

### utf-8 and utf-16

The problem with using 4 bytes for every character (as utf-32 does) is that most of the time it is remarkably inefficient. One byte is sufficient for the vast majority of characters in typical Western text. Going to 4 bytes will unnecessarily quadruple the size of most such documents. The answer is to use *variable-length encoding* that only uses extra bytes when necessary.

utf-8 and utf-16 are the two most common implementations of Unicode. Google estimates that 84% of the Internet's web pages are utf-8 encoded. utf-8 has at least 1 byte per character and as many as 4. utf-16 starts at 2 bytes and can have as many as 4. This last fact explains why utf-16 encoded text data files of mostly ASCII-range text will be roughly twice as large as their utf-8 counterparts. utf-16 files also normally start with a *byte order mark* (BOM) to tell the program what to expect in terms of encoding. The BOM indicates whether the encoding is *big endian* or *little endian*, a topic beyond the scope of this notebook.
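
A quick sketch of the size difference and of the BOM, using a short ASCII-only string. Python's generic 'utf-16' codec writes a BOM in the machine's native byte order, so the check below accepts either order:

```python
import codecs

text = 'Hello'
utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16')        # BOM + native byte order
utf16le = text.encode('utf-16-le')   # explicit byte order, no BOM

print(len(utf8), len(utf16le), len(utf16))

# The first two bytes of the generic encoding are the byte order mark
bom = utf16[:2]
print(bom == codecs.BOM_UTF16_LE, bom == codecs.BOM_UTF16_BE)

# The BOM lets the generic 'utf-16' codec pick the right byte order when decoding
round_trip = utf16.decode('utf-16')
print(round_trip)
```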

utf-8 is identical to ASCII for the first 128 *code points* (values 0 - 127). Thus, simple text encoded in utf-8 can be read by programs designed for ASCII or even extended-ASCII. This can be another source of puzzling problems. Everything can be going swimmingly for you and your software tools when suddenly you start getting uninterpretable character sequences.

From code point 128 onward, utf-8 uses 2, 3, or 4 bytes to encode a character, with the *leading byte* signaling how many bytes are used for that particular character. The second, third (if needed), and fourth (if needed) bytes are *continuation bytes* that provide the rest of the character's definition.
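
The signaling can be sketched in a few lines. The helper below is hypothetical (not part of any library) and only inspects the bit pattern of the first byte; it does not validate the rest of the sequence:

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a utf-8 sequence has, judged from its first byte."""
    if (lead_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:   # 110xxxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:   # 1110xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:   # 11110xxx
        return 4
    # Continuation bytes look like 10xxxxxx and cannot start a sequence
    raise ValueError('not a utf-8 leading byte')

lengths = {}
for cp in (99, 231, 8364, 150350):
    encoded = chr(cp).encode('utf-8')
    lengths[cp] = utf8_sequence_length(encoded[0])
    assert lengths[cp] == len(encoded)   # agrees with Python's own encoder
    print(cp, f'{encoded[0]:08b}', lengths[cp])
```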

Lest you think this is arcane, let me point out a couple of examples from medicine and physics. The inflammatory disease of mucosal surfaces is called:

In [5]:
print('Beh' + chr(231) + "et's Disease.")

Behçet's Disease.


The unit of length used to measure distances between atoms or molecules is the:

In [6]:
print(chr(197) + 'ngstr' + chr(246) +'m.')

Ångström.


and its symbol is:

In [7]:
print(chr(197))

Å


I could have done those examples using cp1252, but frankly it is easier to use the code point numbers and let Python handle the encoding. The point is that scientific and medical terminology includes a number of names, measures, eponyms, etc. that are beyond the encoding range of ASCII, and sometimes beyond Latin-1 as well.

### A Note About Python 3.xx and utf-8

Some say that Python 3.xx uses utf-8 internally. That is not quite correct. utf-8 is the default encoding for source files and for str.encode() and bytes.decode(), but what you see at the console or in a text file generally depends on the platform's locale unless you specify otherwise (on Windows it is often cp1252). "Internally", Python is working with the numeric values of the Unicode **code points**. The *string* object, which has its own methods, attributes, etc., stores its characters as a series of numeric **code point** values.
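
A small sketch of the difference between a string's code points and its encoded bytes. The sample word is built with chr() so that it does not depend on how this text itself is encoded:

```python
s = chr(197) + 'ngstr' + chr(246) + 'm'

# A str is a sequence of code points; its length counts characters, not bytes
code_points = [ord(c) for c in s]
print(len(s), code_points)

# Bytes only appear once an encoding is chosen, and the count depends on it
n_utf8 = len(s.encode('utf-8'))
n_utf16 = len(s.encode('utf-16-le'))
n_latin1 = len(s.encode('latin_1'))
print(n_utf8, n_utf16, n_latin1)
```

When reading or writing files, passing encoding= explicitly to open() avoids relying on whatever default the platform supplies.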

The built-in chr() function always takes a **code point** value as its argument, and this has nothing to do with encodings. For example:

In [8]:
# just as with ASCII
print(chr(99))
# just as with cp1252
print(chr(231))
# but here's one the code pages can't do
print(chr(999))

c
ç
ϧ


Since all the unicode implementations use the same **code points** and these all map to the same character, it would not be interesting to see a list of **code points** and **characters** for the different implementations (utf-8, utf-16, utf-32). Instead, it would be interesting to see the differences in bytes.

In this next example, we'll look at the binary bytes required for encoding lower-end characters (those in the ASCII range). Here utf-8 is more efficient - it is the same as ASCII and thus requires only 1 byte. On the other hand, the 2-byte patterns of the utf-16 encodings are very inefficient for these characters - the first byte is always NULL (all zeroes).

In [9]:
dec = pd.Series(data = range(50, 75))

def binbytes(cp, enc):
    # encode the character for code point cp, then show each byte in binary
    bs = cds.encode(chr(cp), enc)
    # the leading spaces pad the column in the printed table
    return '   ' + ' '.join(bi(b) for b in bs)

binbytes_8 = lambda x : binbytes(x, 'utf-8')
binbytes_16 = lambda x : binbytes(x, 'utf-16be')
# the be stands for 'big endian'

bb8 = dec.map(binbytes_8)
bb16 = dec.map(binbytes_16)
char = dec.map(chr)

df = pd.DataFrame(data = {'char' : char, 'utf-8 binary' : bb8, 'utf-16be binary' : bb16})
df = df[['char', 'utf-8 binary', 'utf-16be binary']]
df.index = range(50, 75)
print(df)

   char utf-8 binary       utf-16be binary
50    2     00110010     00000000 00110010
51    3     00110011     00000000 00110011
52    4     00110100     00000000 00110100
53    5     00110101     00000000 00110101
54    6     00110110     00000000 00110110
55    7     00110111     00000000 00110111
56    8     00111000     00000000 00111000
57    9     00111001     00000000 00111001
58    :     00111010     00000000 00111010
59    ;     00111011     00000000 00111011
60    <     00111100     00000000 00111100
61    =     00111101     00000000 00111101
62    >     00111110     00000000 00111110
63    ?     00111111     00000000 00111111
64    @     01000000     00000000 01000000
65    A     01000001     00000000 01000001
66    B     01000010     00000000 01000010
67    C     01000011     00000000 01000011
68    D     01000100     00000000 01000100
69    E     01000101     00000000 01000101
70    F     01000110     00000000 01000110
71    G     01000111     00000000 01000111
72    H    

In this next code example, we'll look at some "mid-range" characters, code points just above the "extended-ASCII" range. Note the "be" stands for **big endian** (as opposed to "le" for **little endian**).

In [10]:
dec = pd.Series(data = range(350, 375))

bb8 = dec.map(binbytes_8)
bb16 = dec.map(binbytes_16)
char = dec.map(chr)

df = pd.DataFrame(data = {'char' : char, 'utf-8 binary' : bb8, 'utf-16be binary' : bb16})
df = df[['char', 'utf-8 binary', 'utf-16be binary']]
df.index = range(350, 375)
print(df)

    char          utf-8 binary       utf-16be binary
350    Ş     11000101 10011110     00000001 01011110
351    ş     11000101 10011111     00000001 01011111
352    Š     11000101 10100000     00000001 01100000
353    š     11000101 10100001     00000001 01100001
354    Ţ     11000101 10100010     00000001 01100010
355    ţ     11000101 10100011     00000001 01100011
356    Ť     11000101 10100100     00000001 01100100
357    ť     11000101 10100101     00000001 01100101
358    Ŧ     11000101 10100110     00000001 01100110
359    ŧ     11000101 10100111     00000001 01100111
360    Ũ     11000101 10101000     00000001 01101000
361    ũ     11000101 10101001     00000001 01101001
362    Ū     11000101 10101010     00000001 01101010
363    ū     11000101 10101011     00000001 01101011
364    Ŭ     11000101 10101100     00000001 01101100
365    ŭ     11000101 10101101     00000001 01101101
366    Ů     11000101 10101110     00000001 01101110
367    ů     11000101 10101111     00000001 01

In this next segment, we'll consider *code points* in a much higher range, one that includes the **Euro sign** (8364). In this range, utf-8 actually requires more bytes (3) than does utf-16 (2). This is because utf-8 spends some bits of every byte on bookkeeping: the *leading byte* signals how many bytes are used and each *continuation byte* is marked as such, so 3 bytes carry only 16 bits of the code point.

In [11]:
dec = pd.Series(data = range(8350, 8375))

bb8 = dec.map(binbytes_8)
bb16 = dec.map(binbytes_16)
char = dec.map(chr)

df = pd.DataFrame(data = {'char' : char, 'utf-8 binary' : bb8, 'utf-16be binary' : bb16})
df = df[['char', 'utf-8 binary', 'utf-16be binary']]
df.index = range(8350, 8375)
print(df)

     char                   utf-8 binary       utf-16be binary
8350    ₞     11100010 10000010 10011110     00100000 10011110
8351    ₟     11100010 10000010 10011111     00100000 10011111
8352    ₠     11100010 10000010 10100000     00100000 10100000
8353    ₡     11100010 10000010 10100001     00100000 10100001
8354    ₢     11100010 10000010 10100010     00100000 10100010
8355    ₣     11100010 10000010 10100011     00100000 10100011
8356    ₤     11100010 10000010 10100100     00100000 10100100
8357    ₥     11100010 10000010 10100101     00100000 10100101
8358    ₦     11100010 10000010 10100110     00100000 10100110
8359    ₧     11100010 10000010 10100111     00100000 10100111
8360    ₨     11100010 10000010 10101000     00100000 10101000
8361    ₩     11100010 10000010 10101001     00100000 10101001
8362    ₪     11100010 10000010 10101010     00100000 10101010
8363    ₫     11100010 10000010 10101011     00100000 10101011
8364    €     11100010 10000010 10101100     00100000 1
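
The three-byte pattern in the utf-8 column can be reproduced by hand. The function below is a hypothetical helper written for this sketch, covering only the 3-byte case, and it is checked against Python's own encoder:

```python
def encode_3_byte_utf8(cp):
    """Hand-pack a code point in the range 0x0800-0xFFFF into 3 utf-8 bytes."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError('code point does not use a 3-byte sequence')
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError('surrogate code points cannot be encoded')
    return bytes([
        0b11100000 | (cp >> 12),               # leading byte: 1110 + top 4 bits
        0b10000000 | ((cp >> 6) & 0b111111),   # continuation: 10 + middle 6 bits
        0b10000000 | (cp & 0b111111),          # continuation: 10 + low 6 bits
    ])

euro = 8364
by_hand = encode_3_byte_utf8(euro)
print(' '.join(f'{b:08b}' for b in by_hand))

matches_codec = by_hand == chr(euro).encode('utf-8')
print(matches_codec)
```

Of the 24 bits in the three bytes, 8 are markers (4 + 2 + 2), which leaves 16 bits for the code point itself.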

In this next segment, we'll look at "high end" Han (Chinese) characters. These have code points well past even the most obscure Western characters. In this range, utf-8 and utf-16 encodings both require 4 bytes. (For space reasons, the results are stacked rather than side-by-side.)

One can see the utf-8 *leading byte* in action in these examples. Below, the first 4 bits of the utf-8 *leading byte* are all 1s, indicating the encoding is 4 bytes. In the previous example, the first 3 bits of the *leading byte* are 1s, indicating 3 bytes. In the 350-375 **code point** range, the encodings were 2 bytes and the 2 leading 1 bits of the *leading byte* indicated this. In the first utf-8 example, each byte started with a 0 bit, meaning there were no continuation bytes and utf-8 required only 1 byte.

In [12]:
dec = pd.Series(data = range(150350, 150375))

bb8 = dec.map(binbytes_8)
bb16 = dec.map(binbytes_16)
char = dec.map(chr)

df = pd.DataFrame(data = {'char' : char, 'utf-8 binary' : bb8, 'utf-16be binary' : bb16})
df = df[['char', 'utf-8 binary', 'utf-16be binary']]
df.index = range(150350, 150375)
print(df)

       char                            utf-8 binary  \
150350    𤭎     11110000 10100100 10101101 10001110   
150351    𤭏     11110000 10100100 10101101 10001111   
150352    𤭐     11110000 10100100 10101101 10010000   
150353    𤭑     11110000 10100100 10101101 10010001   
150354    𤭒     11110000 10100100 10101101 10010010   
150355    𤭓     11110000 10100100 10101101 10010011   
150356    𤭔     11110000 10100100 10101101 10010100   
150357    𤭕     11110000 10100100 10101101 10010101   
150358    𤭖     11110000 10100100 10101101 10010110   
150359    𤭗     11110000 10100100 10101101 10010111   
150360    𤭘     11110000 10100100 10101101 10011000   
150361    𤭙     11110000 10100100 10101101 10011001   
150362    𤭚     11110000 10100100 10101101 10011010   
150363    𤭛     11110000 10100100 10101101 10011011   
150364    𤭜     11110000 10100100 10101101 10011100   
150365    𤭝     11110000 10100100 10101101 10011101   
150366    𤭞     11110000 10100100 10101101 10011110   
150367    
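
For code points above 65535, utf-16 reaches 4 bytes by splitting the value across two 2-byte units known as a surrogate pair. The text above does not go into this mechanism, so treat the following as a supplementary sketch; the helper name is hypothetical and the result is checked against Python's own encoder:

```python
def utf16_surrogate_pair(cp):
    """Split a code point above 0xFFFF into its utf-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('only code points above 0xFFFF use a surrogate pair')
    offset = cp - 0x10000                 # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

cp = 150350
high, low = utf16_surrogate_pair(cp)
by_hand = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
print(' '.join(f'{b:08b}' for b in by_hand))

matches_codec = by_hand == chr(cp).encode('utf-16be')
print(matches_codec)
```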

One can see how utf-8 is much more efficient than utf-16 for typical Western documents. In the traditional ASCII range, utf-8 requires only 1 byte. On the other hand, it can expand up to 4 bytes to code for any **code point**.

That does not mean that utf-16 is dead. There is a large range of **code points** for Han characters where utf-8 requires 3 bytes while utf-16 requires only 2. The segment below lists a few of these. I am no expert on Chinese, but the main block of commonly used Han characters (code points 19968 through 40959) also falls in this range. Thus, in representing Han characters, utf-16 can be more efficient than utf-8.

In [13]:
dec = pd.Series(data = range(18350, 18375))

bb8 = dec.map(binbytes_8)
bb16 = dec.map(binbytes_16)
char = dec.map(chr)

df = pd.DataFrame(data = {'char' : char, 'utf-8 binary' : bb8, 'utf-16be binary' : bb16})
df = df[['char', 'utf-8 binary', 'utf-16be binary']]
df.index = range(18350, 18375)
print(df)

      char                   utf-8 binary       utf-16be binary
18350    䞮     11100100 10011110 10101110     01000111 10101110
18351    䞯     11100100 10011110 10101111     01000111 10101111
18352    䞰     11100100 10011110 10110000     01000111 10110000
18353    䞱     11100100 10011110 10110001     01000111 10110001
18354    䞲     11100100 10011110 10110010     01000111 10110010
18355    䞳     11100100 10011110 10110011     01000111 10110011
18356    䞴     11100100 10011110 10110100     01000111 10110100
18357    䞵     11100100 10011110 10110101     01000111 10110101
18358    䞶     11100100 10011110 10110110     01000111 10110110
18359    䞷     11100100 10011110 10110111     01000111 10110111
18360    䞸     11100100 10011110 10111000     01000111 10111000
18361    䞹     11100100 10011110 10111001     01000111 10111001
18362    䞺     11100100 10011110 10111010     01000111 10111010
18363    䞻     11100100 10011110 10111011     01000111 10111011
18364    䞼     11100100 10011110 1011110
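
The size trade-off can be totalled up for a whole string. This is a minimal sketch that reuses the 25 code points from the table above and compares them with a plain ASCII string:

```python
# 25 consecutive code points from the same range as the table above
text = ''.join(chr(cp) for cp in range(18350, 18375))

n_chars = len(text)
n_utf8 = len(text.encode('utf-8'))
n_utf16 = len(text.encode('utf-16-le'))   # no BOM, so the count is payload only
print(n_chars, n_utf8, n_utf16)

# The same comparison for plain ASCII text goes the other way
ascii_text = 'Hello World'
ascii_utf8 = len(ascii_text.encode('utf-8'))
ascii_utf16 = len(ascii_text.encode('utf-16-le'))
print(ascii_utf8, ascii_utf16)
```

Which encoding is smaller therefore depends on the mix of characters in the text, not on the encoding alone.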