# ASCII and Unicode
- In Python 3, the string data type (`str`) is stored internally as a sequence of Unicode code points
- We've covered that strings can be encoded as bytes and bytes can be decoded to strings. What does this have to do with Unicode?  What is UTF-8?  To answer these questions, we start with ASCII.

---

## ASCII
- **ASCII**--American Standard Code for Information Interchange.  It was developed in the early 1960s by the American Standards Association, the body that later became the American National Standards Institute (ANSI).  Uppercase and lowercase English letters, the digits 0-9, punctuation/symbols, and some basic control commands are mapped to numbers.  ASCII is nothing more than this mapping of characters to numbers.  These numbers are commonly written in base 2, 10, or 16 (hexadecimal).
- Standard ASCII is 7 bits long and can map $2^7 = 128$ unique characters.  
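- Extended ASCII is 8 bits long and can map $2^8 = 256$ unique characters.  There was never a single extended ASCII standard; the name covers many different 8-bit character sets that keep the 128 ASCII characters and add up to 128 more.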

![](images/ASCIITable.png)

- Notice how a character is listed and encoded with a number, starting with 0
    - Decimal (base 10) numbers that encode the characters go from 0-127
    - Binary (base 2) numbers go from 0 to 1111111
    - Octal (base 8) numbers go from 0 to 177
    - Hexadecimal (base 16) numbers go from 0 to 7F
- Notice how the character "A" is mapped to the decimal number 65.  This was chosen because 65 written as a binary number is 1000001.  If we look at the right-most 5 digits (00001), use them as binary, and convert them to decimal, we get the decimal number 1.  "A" is the first letter of the alphabet.  "Z" maps to the decimal number 90, which is 1011010 in binary.  Take the 5 digits to the right (11010), use them as binary, convert to decimal, and we get 26.  "Z" is the 26th letter of the alphabet.
- Notice how the same thing was done with the lowercase letters.  "a" starts at 97, which is 1100001 in binary.  "z" is 122, which is 1111010.  The 5 digits to the right are the same as those for uppercase.   Mind blown!
- Notice that uppercase characters are mapped to lower numbers than lowercase characters.  This is why Python sorts uppercase letters before lowercase ones. **ASCII-betical order**.
- Notice how the characters 0-9 on the keyboard are mapped to the decimal numbers 48 to 57.  This was chosen because we can write 48-57 as binary (base 2) numbers, then take the right 4 digits.  These right 4 digits are themselves the binary (base 2) numbers for the decimal numbers 0-9.  MIND BLOWN!!!

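The bit patterns described above can be checked directly in Python. This is a small sketch, not part of the original notes; the names `letter_position`, `digit_value`, and `words` are illustrative only.

```python
# Show the decimal and 7-bit binary value of a few ASCII characters.
for ch in "AZaz09":
    code = ord(ch)
    print(ch, code, format(code, "07b"))

# The right-most 5 bits of a letter give its position in the alphabet.
letter_position = {ch: ord(ch) & 0b11111 for ch in "AZaz"}
print(letter_position)  # {'A': 1, 'Z': 26, 'a': 1, 'z': 26}

# The right-most 4 bits of a digit character give the digit's value.
digit_value = {ch: ord(ch) & 0b1111 for ch in "09"}
print(digit_value)  # {'0': 0, '9': 9}

# Uppercase letters have smaller numbers, so they sort first.
words = sorted(["banana", "Cherry", "apple", "Apple"])
print(words)  # ['Apple', 'Cherry', 'apple', 'banana']
```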
---

## ISO-8859

- ISO-8859 refers to a family of one-byte character sets (mappings) that rose to prominence in the 1980s and 1990s. ISO-8859 character sets have largely been replaced by Unicode and are only used on about 1% of web pages today.
- ASCII focused on the English alphabet and fit 128 characters into 7 bits.  There are 15 ISO-8859 character sets (numbered 1 to 16, with number 12 never published) that cover different European and other alphabets.  Each character set uses 8 bits (1 byte) and can encode up to 256 unique characters.
- ISO-8859-1, or Latin-1 for short, is a character mapping for Western Europe.  The first 128 characters (numbers 0-127) are the same as ASCII, and it introduces new characters for the numbers 128-255.  Only the first 128 characters of Latin-1 (the ASCII range) are stored as the same bytes in UTF-8.

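The compatibility point above can be illustrated with Python's built-in `latin-1` and `utf-8` codecs. This is a minimal sketch; the variable names are illustrative, and the character is written as an escape (`\u00e9` is "é") so the snippet does not depend on the file's own encoding.

```python
ch = "\u00e9"  # "é", number 233, in the upper half of Latin-1

latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")
print(latin1_bytes)  # b'\xe9'      (one byte)
print(utf8_bytes)    # b'\xc3\xa9'  (two bytes)

# Characters in the ASCII range produce identical bytes in both encodings.
ascii_same = "A".encode("latin-1") == "A".encode("utf-8") == b"A"
print(ascii_same)  # True

# The single Latin-1 byte for "é" is not valid UTF-8 on its own.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```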
---

## Unicode
- **Unicode**--Unicode is the successor to ASCII and ISO-8859.  ASCII focused on the English alphabet.  ISO-8859 mappings focused on European languages while sticking to only 1 byte.  Unicode maps all the world's scripts (alphabets), the digits 0-9, punctuation/symbols, basic control commands, and emojis to numbers.  Unicode is nothing more than this mapping of characters to numbers.  There is space to map about 1.1 million unique characters (1,114,112 code points).
- These numbers are commonly written in base 2, 10, or 16 (hexadecimal)
- **Unicode code point**--the number assigned to a character, conventionally written in hexadecimal with "U+" in front of it
    - E.g. the character "A" is mapped to the decimal number 65, the binary (base 2) number 1000001, the hex number 41, and the Unicode code point U+0041
- *Python stores strings internally as sequences of Unicode code points*

Code | Use
--- | ---
`ord()` | Returns the base 10 integer that is mapped to the Unicode character specified
`chr()` | Returns the Unicode character that is mapped to the base 10 integer specified

---

**EXAMPLES**

**`ord()`**

```python
print(ord('A'))
print(ord('a'))
print(ord('0'))
```

```
65
97
48
```


**`chr()`**

```python
print(chr(65))
print(chr(97))
print(chr(48))
```

```
A
a
0
```

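Building on `ord()` and `chr()`, the "U+" notation can be produced with ordinary string formatting. The helper name `code_point_label` is hypothetical, introduced here only for illustration.

```python
import sys

def code_point_label(ch):
    """Format one character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))           # U+0041
print(code_point_label("\u00e9"))      # U+00E9  (e with acute accent)
print(code_point_label("\u20ac"))      # U+20AC  (euro sign)
print(code_point_label("\U0001F600"))  # U+1F600 (an emoji)

# The same number for "A" in base 10, base 2, and base 16.
print(ord("A"), bin(ord("A")), hex(ord("A")))  # 65 0b1000001 0x41

# chr() is the inverse of ord() for any code point.
print(chr(0x20AC) == "\u20ac")  # True

# Total number of code points Python can represent.
print(sys.maxunicode + 1)  # 1114112
```

The last line matches the "about 1.1 million" figure quoted for Unicode.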

---

## UTF-8
- Unicode has standardized the mapping of characters to numbers, which are written as Unicode code points. Many programs, Python included, store character strings as sequences of Unicode code points.
- However, computers ultimately store 1s and 0s and so we use the binary (base 2) number for each character when we store them on the computer or when we send data across the internet
- Depending on the character, the binary number could take up anywhere from 1-4 bytes of space.  And how do we concatenate binary numbers so that we know which 1s and which 0s correspond to which characters?  This is where UTF-8, UTF-16, and UTF-32 come in.

**UTF-32**

- UTF-32 stands for Unicode Transformation Format 32.  It uses one 32-bit chunk (code unit) to encode each character.  It is not often used for storage or transmission because it uses memory inefficiently.

**UTF-16**

- UTF-16 stands for Unicode Transformation Format 16.  UTF-16 is a variable-width encoding that uses one or two 16-bit chunks (code units) to encode each character. Microsoft Windows, Java, and JavaScript use UTF-16 internally.  Windows has been gradually adding UTF-8 support.

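To make the code-unit sizes concrete, the sketch below encodes four sample characters with Python's built-in codecs. The big-endian variants (`utf-32-be`, `utf-16-be`) are an assumption chosen here so that no byte order mark is added to the output; the sample characters and variable names are illustrative.

```python
# "A", e with acute accent, euro sign, and an emoji, written as escapes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf32 = ch.encode("utf-32-be")
    utf16 = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}  UTF-32: {len(utf32)} bytes  UTF-16: {len(utf16)} bytes")

# UTF-32 always uses one 4-byte code unit.
utf32_sizes = [len(ch.encode("utf-32-be")) for ch in samples]
print(utf32_sizes)  # [4, 4, 4, 4]

# UTF-16 uses one 2-byte code unit, or two of them for the emoji.
utf16_sizes = [len(ch.encode("utf-16-be")) for ch in samples]
print(utf16_sizes)  # [2, 2, 2, 4]

# The emoji is stored as two 16-bit code units (a surrogate pair).
emoji_utf16_hex = "\U0001F600".encode("utf-16-be").hex()
print(emoji_utf16_hex)  # d83dde00
```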
**UTF-8**

- UTF-8 stands for Unicode Transformation Format 8.  UTF-8 is a variable-width encoding that uses one to four 8-bit chunks (code units) for each character.  So depending on the character, it may use 8, 16, 24, or 32 bits, that is, 1, 2, 3, or 4 bytes.
- The reason for this variable-width encoding comes back to efficient use of computer memory.  If we map/encode a single character to its binary (base 2) number and always store this using 32 bits for every character, we end up with lots of 0s that are just filler.  Ex. "A" would be 00000000 00000000 00000000 01000001.
- For a character that uses 1 byte, the standard 7 bit ASCII codes are used. UTF-8 is simply that 7 bit binary number with a 0 in front. The UTF-8 code for "A" is 01000001.  So standard ASCII is valid UTF-8!
- For characters stored using 2-4 bytes, UTF-8 uses certain bits to indicate how many bytes the character is.  In the following layouts the `x`s are the bits of the character's binary (base 2) number, while the 1s and 0s are the prefixes that indicate how many bytes there are.  A human can count the leading 1s in the first byte: two 1s means two bytes, three 1s means three bytes, four 1s means four bytes.  Every byte after the first starts with 10, which marks it as a continuation of the same character.
 
```
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```
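The layouts above can be turned into a hand-written encoder and compared against Python's own `str.encode("utf-8")`. This is a sketch for illustration only: `utf8_encode_code_point` and `bits` are hypothetical helper names, and the function does not reject invalid input such as surrogate code points or numbers above U+10FFFF.

```python
def utf8_encode_code_point(cp):
    """Encode one code point by filling the x positions of the layouts."""
    if cp < 0x80:      # fits in 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:   # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # otherwise up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

def bits(data):
    """Show bytes as space-separated groups of 8 binary digits."""
    return " ".join(format(b, "08b") for b in data)

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode_code_point(ord(ch))
    # Compare the hand-built bytes with Python's built-in encoder.
    print(f"U+{ord(ch):04X}", bits(encoded), encoded == ch.encode("utf-8"))
```

Each printed line ends in `True` when the hand-built bytes match the built-in encoder.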

- Now imagine we had the binary of multiple characters concatenated together.  In the sequence below, we could use the prefixes to tell that we have a 2-byte character, then a 1-byte character, then a 4-byte character, then a 3-byte character.

```
110xxxxx10xxxxxx0xxxxxxx11110xxx10xxxxxx10xxxxxx10xxxxxx1110xxxx10xxxxxx10xxxxxx
```
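The same prefix-reading idea can be sketched in code. The helpers `sequence_length` and `split_characters` are hypothetical names, and this simplified version trusts the leading byte without validating the continuation bytes, which a real decoder would also do. The sample string is chosen to give the same 2, 1, 4, 3 byte pattern as the sequence above.

```python
def sequence_length(first_byte):
    """Return how many bytes a character uses, based on its first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid leading byte")

def split_characters(data):
    """Split UTF-8 bytes into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

# e with acute accent, "A", an emoji, and the euro sign, in that order.
data = "\u00e9A\U0001F600\u20ac".encode("utf-8")
chunks = split_characters(data)
lengths = [len(c) for c in chunks]
print(lengths)  # [2, 1, 4, 3]

labels = [f"U+{ord(c.decode('utf-8')):04X}" for c in chunks]
print(labels)  # ['U+00E9', 'U+0041', 'U+1F600', 'U+20AC']
```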
 
- The internet almost exclusively uses UTF-8 as it encodes everything, is backwards compatible with ASCII, and doesn't use an unnecessary number of bits
- UTF-8 is the default encoding for Python's `str.encode()` and `bytes.decode()` methods

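The last two points can be confirmed with a short check. This is a minimal sketch with illustrative variable names; the sample text uses an escape for "é".

```python
text = "caf\u00e9"  # "café"

# With no argument, encode() and decode() use UTF-8.
default_bytes = text.encode()
explicit_bytes = text.encode("utf-8")
print(default_bytes == explicit_bytes)  # True
print(default_bytes)                    # b'caf\xc3\xa9'
print(default_bytes.decode() == text)   # True

# Bytes produced by the ASCII codec are already valid UTF-8.
ascii_bytes = "Hello".encode("ascii")
print(ascii_bytes.decode("utf-8"))  # Hello
```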
---