# Covered here:


- [Number systems](#Number-systems)
- [What's a bit?](#What's-a-bit?)
- [Representing text in digital form](#Representing-text-in-digital-form)
- [Hashing & one-hot encoding](#Hashing-&-one-hot-encoding)

## Resources & references

* Python docs: [Unicode HOWTO](https://docs.python.org/3/howto/unicode.html#python-s-unicode-support)
* [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
* [robelle.com - ASCII Characters for MPE Users](http://www.robelle.com/library/smugbook/ascii.html)
* [Eric Muller/Adobe - To the BMP and Beyond](http://www.unicode.org/notes/tn23/Muller-Slides+Narr.pdf)
* Petzold - _Code: The Hidden Language of Computers_
    * Chapter 7: Our Ten Digits
    * Chapter 8: Alternatives to Ten
    * Chapter 9: Bit by Bit
    * Chapter 15: Bytes and Hex
    * Chapter 20: ASCII and a Cast of Characters
* Slatkin - _59 Ways to Write Better Python_ - Item 3: Know the Differences Between bytes, str, and unicode
* Jukka Korpela - [A tutorial on character code issues](http://jkorpela.fi/chars.html)
* David Beazley - [Mastering Python 3 I/O (part 1)](http://pyvideo.org/pycon-us-2010/pycon-2010--mastering-python-3-i-o.html) (start around 16:30)

# Number systems

_Directly excerpted or paraphrased from Petzold (2000)_.

The use of a _base-ten_ or _decimal_ (from the Latin for _ten_) number scheme inherent to the Indo-Arabic system is somewhat arbitrary.  Each position in a number corresponds to a power of ten:

![base10](imgs/base10b.jpg)

As mentioned above, the choice of _base-10_ is somewhat arbitrary, and we can derive number systems using other bases:
- An _octal_ system is _base-8_.
- A _hexadecimal_ ("hex") system is _base-16_.
- A _binary_ system is _base-2_.

Here's counting using an octal system:

![](imgs/base8a.jpg)
![](imgs/base8b.jpg)

Here's a representation of each position of a number in the octal system:

![octal](imgs/octal.jpg)
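
The positional idea is the same in every base: each digit is multiplied by the base raised to its position, counting from the right. Here is a minimal sketch of that rule; the helper name `from_digits` is made up for illustration, and it does not check that every digit is valid for the chosen base.

```python
def from_digits(digits, base):
    """Interpret a string of digits in the given base (2 through 16)."""
    symbols = '0123456789ABCDEF'
    value = 0
    for position, digit in enumerate(reversed(digits.upper())):
        # Each position is worth base**position, counting from the right.
        value += symbols.index(digit) * base ** position
    return value

print(from_digits('4095', 10))  # 4*1000 + 0*100 + 9*10 + 5*1
print(from_digits('7777', 8))   # same quantity written in octal
print(from_digits('1011', 2))   # 8 + 0 + 2 + 1

# The built-in int() accepts a base and should agree with the sketch.
print(from_digits('7777', 8) == int('7777', 8))
```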

In a binary system, _the number after 1 is 10_:

> 0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111, 10000, 10001…

In a multidigit binary number, the positions of the digits correspond to powers of two:

![](imgs/binary_counting.jpg)

People who work with binary numbers often write them with leading zeros (that is, zeros to the left of the first 1)—for example, 0011 rather than just 11. This doesn’t change the value of the number at all; it’s just for cosmetic purposes.

In [30]:
import pandas as pd

decimal = range(16)
binary = [format(d, '04b') for d in decimal]
pd.DataFrame({'decimal': decimal, 'binary': binary})

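| | binary | decimal |
| :-- | :----- | :------ |
| 0 | 0000 | 0 |
| 1 | 0001 | 1 |
| 2 | 0010 | 2 |
| 3 | 0011 | 3 |
| 4 | 0100 | 4 |
| 5 | 0101 | 5 |
| 6 | 0110 | 6 |
| 7 | 0111 | 7 |
| 8 | 1000 | 8 |
| 9 | 1001 | 9 |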


The table below shows the result of powers of two in different number systems.

In [31]:
import numpy as np
import pandas as pd

def vec_base(op, i):
    return op(i)
vec_base = np.vectorize(vec_base)

strpows = ['2^%s' % i for i in range(13)]
# np.int was removed from NumPy; the built-in int works as the dtype here
decimal = np.power(2., np.arange(13)).astype(int)
octal = vec_base(oct, decimal)
binary = vec_base(bin, decimal)

pd.DataFrame({'decimal': decimal, 'octal': octal, 'binary': binary},
             index=strpows)

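| | binary | decimal | octal |
| :-- | :----- | :------ | :---- |
| 2^0 | 0b1 | 1 | 0o1 |
| 2^1 | 0b10 | 2 | 0o2 |
| 2^2 | 0b100 | 4 | 0o4 |
| 2^3 | 0b1000 | 8 | 0o10 |
| 2^4 | 0b10000 | 16 | 0o20 |
| 2^5 | 0b100000 | 32 | 0o40 |
| 2^6 | 0b1000000 | 64 | 0o100 |
| 2^7 | 0b10000000 | 128 | 0o200 |
| 2^8 | 0b100000000 | 256 | 0o400 |
| 2^9 | 0b1000000000 | 512 | 0o1000 |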


# What's a bit?

A bit (**b**inary dig**it**) is either of the digits 0 or 1 when used in the binary number system.  Think of a bit as **0 or 1**.  There are $2^8=256$ different possible values for 8-bit sequences:

In [32]:
# The first five 8-bit sequences
from itertools import product
bits = list(product([0, 1], repeat=8))
bits[:5] 

[(0, 0, 0, 0, 0, 0, 0, 0),
 (0, 0, 0, 0, 0, 0, 0, 1),
 (0, 0, 0, 0, 0, 0, 1, 0),
 (0, 0, 0, 0, 0, 0, 1, 1),
 (0, 0, 0, 0, 0, 1, 0, 0)]

In [33]:
len(bits)

256

**A group of eight binary digits is commonly called one byte**.  (Historically, though, the size of a byte was not strictly fixed at eight bits.)

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255.

The number of different possible values for different bit-length sequences would be:

In [3]:
import numpy as np
import pandas as pd
bits = np.arange(0, 17, 2)
s = pd.Series(np.power(2, bits), index=bits, name='possible_values')
s = s.rename_axis('bits')  # with inplace=True the call returns None and nothing is displayed
s

bits
0         1
2         4
4        16
6        64
8       256
10     1024
12     4096
14    16384
16    65536
Name: possible_values, dtype: int64

If you know how many codes, or representations, you need, how do you know how many bits you need?  You use the base-two logarithm.

In [35]:
def how_many_bits(codes):
    return np.ceil(np.log2(codes))

for code in [128, 256, 200]:
    print(how_many_bits(code))

7.0
8.0
8.0
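
The logarithm version returns a float. An integer-only alternative, sketched here under the assumption that `codes` is a positive Python `int`, uses `int.bit_length()`. The function name `how_many_bits_int` is made up for this example.

```python
def how_many_bits_int(codes):
    """Smallest number of bits that can distinguish `codes` different values."""
    if codes < 1:
        raise ValueError('need at least one code')
    # The largest code is codes - 1; its bit length is the width required.
    return (codes - 1).bit_length()

for codes in [128, 200, 256, 257]:
    print(codes, how_many_bits_int(codes))
```

Going from 256 to 257 codes is the point where an eighth bit stops being enough and a ninth is required.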


**One byte is always represented by a pair of hexadecimal digits.**

**Four bits are needed to express one hexadecimal digit.**  Note that because our system has a base higher than 10, we need additional "digits" (below, actually letters) to express the new additions.  The letters A thru F are chosen.

In [36]:
decimal = range(16)
binary = [format(d, '04b') for d in decimal]
hexa = [str(i) for i in range(10)] + list('ABCDEF')
pd.DataFrame({'hex': hexa, 'binary': binary, 'decimal': decimal})

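| | binary | decimal | hex |
| :-- | :----- | :------ | :-- |
| 0 | 0000 | 0 | 0 |
| 1 | 0001 | 1 | 1 |
| 2 | 0010 | 2 | 2 |
| 3 | 0011 | 3 | 3 |
| 4 | 0100 | 4 | 4 |
| 5 | 0101 | 5 | 5 |
| 6 | 0110 | 6 | 6 |
| 7 | 0111 | 7 | 7 |
| 8 | 1000 | 8 | 8 |
| 9 | 1001 | 9 | 9 |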
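
Because one hex digit covers exactly four bits, a byte splits cleanly into a high half and a low half, each of which becomes one hex digit. The sketch below shows that split with bit operations; `byte_to_hex_pair` is a hypothetical helper name, not something from a library.

```python
def byte_to_hex_pair(value):
    """Write one byte (0 through 255) as two hexadecimal digits."""
    if not 0 <= value <= 255:
        raise ValueError('a byte must be in the range 0 through 255')
    digits = '0123456789ABCDEF'
    high = value >> 4      # top four bits
    low = value & 0x0F     # bottom four bits
    return digits[high] + digits[low]

for value in [0, 65, 182, 255]:
    # Show the 8-bit pattern next to the two-digit hex form.
    print(format(value, '08b'), byte_to_hex_pair(value))
```

The built-in `format(value, '02X')` should give the same two digits.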


Here's a fuller table of what we've covered thus far:

In [37]:
decimal = range(256)
hexadecimal = [hex(i) for i in decimal]
octal = [oct(i) for i in decimal]
binary = [format(i, '08b') for i in decimal]
with pd.option_context("display.max_rows", None):
    df = pd.DataFrame({
            'dec': decimal,
            'hex': hexadecimal,
            'oct': octal,
            'bin': binary
            })
    print(df)

          bin  dec   hex    oct
0    00000000    0   0x0    0o0
1    00000001    1   0x1    0o1
2    00000010    2   0x2    0o2
3    00000011    3   0x3    0o3
4    00000100    4   0x4    0o4
5    00000101    5   0x5    0o5
6    00000110    6   0x6    0o6
7    00000111    7   0x7    0o7
8    00001000    8   0x8   0o10
9    00001001    9   0x9   0o11
10   00001010   10   0xa   0o12
11   00001011   11   0xb   0o13
12   00001100   12   0xc   0o14
13   00001101   13   0xd   0o15
14   00001110   14   0xe   0o16
15   00001111   15   0xf   0o17
16   00010000   16  0x10   0o20
17   00010001   17  0x11   0o21
18   00010010   18  0x12   0o22
19   00010011   19  0x13   0o23
20   00010100   20  0x14   0o24
21   00010101   21  0x15   0o25
22   00010110   22  0x16   0o26
23   00010111   23  0x17   0o27
24   00011000   24  0x18   0o30
25   00011001   25  0x19   0o31
26   00011010   26  0x1a   0o32
27   00011011   27  0x1b   0o33
28   00011100   28  0x1c   0o34
29   00011101   29  0x1d   0o35
30   000

# Representing text in digital form

We need codes to represent:
1. alphanumeric characters
2. numbers
3. punctuation
4. other symbols.

Such a system is sometimes known as a **coded character set**, and the individual codes are known as **character codes**.

One of the earliest such codes was Baudot code (aka Murray code), used with 30-key typewriters:

![murray.jpg](imgs/murray.jpg)

The "figure shift" key was used to transition to a second set of 32 codes for punctuation (still 00 thru 1F).

## ASCII

American Standard Code for Information Interchange ([ASCII](https://en.wikipedia.org/wiki/ASCII); pronounced "ASK-ee") is a character encoding standard formalized in 1967.
* ASCII is a 7-bit code using binary codes 0000000 through 1111111, which are hexadecimal codes 00h through 7Fh.
* The ASCII character set defines 128 characters (0 to 127 decimal).  _Control characters_ are in 0 thru 31 plus 127 (DEL), while _printing characters_ are in 32 thru 126.
 * Control characters: codes originally intended not to represent printable information, but rather to control devices (such as printers).  In fact, they have no true visual representation.
     * Many of these are now obscure because they were originally intended for typewriters, not computer keyboards.
     * The idea here is that control characters can be intermixed with graphic characters to do some rudimentary formatting of the text.
 * Printable characters: represent letters, digits, punctuation marks, and a few miscellaneous symbols.

Here is the full ASCII lookup table:

![](imgs/asciifull.gif)
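
One way to see the split between the two groups is to ask Python which of the 128 codes it considers printable. This is only a sketch: `str.isprintable()` is Python's own notion of printability (it counts the space character as printable), used here as a stand-in for the ASCII definition.

```python
control = [i for i in range(128) if not chr(i).isprintable()]
printable = [i for i in range(128) if chr(i).isprintable()]

print(len(control), len(printable))
print(control[:3], control[-1])      # the lowest control codes and the highest one
print(printable[0], printable[-1])   # first and last printable codes
print(''.join(chr(i) for i in printable))
```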

## Unicode

The issue with ASCII is that it is just "too American."  Although it does include 10 codes designated for "national use" and has been [expanded to a 256-character set](https://en.wikipedia.org/wiki/Extended_ASCII), ASCII is hardly suitable even for other nations whose principal language is English.

Unicode is "a brave effort to create a single character set that included every reasonable writing system on the planet."
- Unicode development was begun in 1988.
- A major consequence of using additional bits to represent one character is that sequences of characters now take up more memory in this expanded character encoding system.

Is Unicode a 16-bit system?  Well, not really:
> Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode.
>
> ...
>
> The modern Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10FFFF in base 16).
> 
> ...
>
> The latest version of Unicode contains a repertoire of 136,755 characters (32-bit) covering 139 modern and historic scripts, as well as multiple symbol sets.

In [47]:
import sys

sys.maxunicode

1114111

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:
<center>A -> 0100 0001</center>
In Unicode, a letter maps to something called a **code point** which is still just a theoretical concept.  In Unicode, the letter A is a "platonic ideal."  The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. 

* Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639.  
* This magic **number** is called a code point. 
* The U+ means “Unicode” and the numbers are hexadecimal (base 16).  --> `U+HHHH`

"Hello" corresponds to:

In [4]:
s = 'Hello'

# `ord`:
# Given a string representing one Unicode character, 
# return an integer representing the Unicode code point of that character.

# `hex`:
# Convert an integer number to a lowercase hexadecimal 
# string prefixed with “0x”.

t = [hex(ord(i)) for i in s]
t

['0x48', '0x65', '0x6c', '0x6c', '0x6f']

Or, in the notation from above:

In [39]:
# a bit of a hack
['U+00' + i.partition('x')[-1] for i in t]

['U+0048', 'U+0065', 'U+006c', 'U+006c', 'U+006f']

Putting all of that together visually:

In [8]:
df = pd.DataFrame({'character': list(s), 
                   'code_point': [ord(i) for i in s],
                   'hex_string': [hex(ord(i)) for i in s],
                   'alternate_notation': ['U+00' + i.partition('x')[-1] for i in [hex(ord(i)) for i in s]]})
df[['character', 'code_point', 'hex_string', 'alternate_notation']]

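| | character | code_point | hex_string | alternate_notation |
| :-- | :-------- | :--------- | :--------- | :----------------- |
| 0 | H | 72 | 0x48 | U+0048 |
| 1 | e | 101 | 0x65 | U+0065 |
| 2 | l | 108 | 0x6c | U+006c |
| 3 | l | 108 | 0x6c | U+006c |
| 4 | o | 111 | 0x6f | U+006f |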


In [9]:
s = 'Python'
df = pd.DataFrame({'character': list(s), 
                   'code_point': [ord(i) for i in s],
                   'hex_string': [hex(ord(i)) for i in s],
                   'alternate_notation': ['U+00' + i.partition('x')[-1] for i in [hex(ord(i)) for i in s]]})
df[['character', 'code_point', 'hex_string', 'alternate_notation']]

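| | character | code_point | hex_string | alternate_notation |
| :-- | :-------- | :--------- | :--------- | :----------------- |
| 0 | P | 80 | 0x50 | U+0050 |
| 1 | y | 121 | 0x79 | U+0079 |
| 2 | t | 116 | 0x74 | U+0074 |
| 3 | h | 104 | 0x68 | U+0068 |
| 4 | o | 111 | 0x6f | U+006f |
| 5 | n | 110 | 0x6e | U+006e |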


As another example, a code point is written using the notation `U+12CA` to mean the character with value `0x12ca` (4,810 decimal).
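
The string-concatenation approach to building `U+HHHH` labels only works when the hex value has exactly two digits. A slightly more general sketch uses a format spec that pads to at least four uppercase hex digits; `code_point_label` is a made-up name for this example.

```python
def code_point_label(char):
    """Return the U+HHHH label for a single character."""
    # 04X: uppercase hex, zero-padded to at least four digits.
    return 'U+{:04X}'.format(ord(char))

for char in ['H', chr(0x12CA), chr(0x1F600)]:
    print(code_point_label(char), ord(char))
```

Code points above U+FFFF simply get five or six digits.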

## UTF-8, UTF-16, & UTF-32

To summarize the previous section: a Unicode string is a sequence of code points.  But **we still need a way to represent these sequences as sequences of bytes (meaning, values from 0 through 255) in memory.**

The rules for translating a Unicode string into a sequence of bytes are called an **encoding**.  Unicode can be **implemented by different character encodings.** The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use.

UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding.

UTF-8 encodes each of the 1,112,064 valid [code points](https://en.wikipedia.org/wiki/Code_point) in Unicode using **one to four (8-bit) bytes**.  For example, two 8-bit bytes would be:

> <center>00001111 01010101</center>

Specifically, the rules for UTF-8 are that for each code point,

1. If the code point is < 128, it’s represented by the corresponding byte value.
2. If the code point is >= 128, it’s turned into a sequence of **two, three, or four bytes**, where each byte of the sequence is between 128 and 255.

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.  This is convenient for text that uses mostly unaccented Latin characters, because each such character still takes only one byte.

Structure of UTF-8 encoding:

![](imgs/utf8b.png)
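
The byte layout can be sketched by hand with a few shifts and masks. This is an illustrative re-implementation, not how Python does it internally; the function names are made up, and the sketch does not reject the surrogate range U+D800 through U+DFFF, which a real encoder would.

```python
def utf8_encode_code_point(cp):
    """Encode one code point as UTF-8 bytes (simplified sketch)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf8_encode(text):
    """Encode a whole string by concatenating the per-code-point bytes."""
    return b''.join(utf8_encode_code_point(ord(ch)) for ch in text)

sample = 'A' + chr(0xF1) + chr(40960) + chr(0x1F600)
print(utf8_encode(sample))
print(utf8_encode(sample) == sample.encode('utf-8'))  # compare with the built-in
```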

UTF-16 uses two bytes for each of the first 65,536 code points and four bytes (a _surrogate pair_) for every code point beyond that.  UTF-32 uses four bytes for every code point.  It is not widely in use because of the space that it demands.
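
A quick way to compare the three encodings is to measure how many bytes each one produces for the same character. The sample characters below are arbitrary choices, and the `-le` codec variants are used so that no byte-order mark is added to the counts.

```python
samples = ['A', '\u00f1', '\u20ac', '\U0001F600']
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

sizes = {}
for ch in samples:
    # Number of bytes each encoding needs for this single character.
    sizes[ch] = tuple(len(ch.encode(enc)) for enc in encodings)
    print('U+{:04X}'.format(ord(ch)), dict(zip(encodings, sizes[ch])))
```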

Usage of UTF-8 has eclipsed other encodings rapidly:

![](imgs/encodings.png)

## Python 2 versus Python 3

In Python 3, there are two types that represent sequences of characters: `bytes` and `str`.
- Instances of **`bytes` contain raw 8-bit values**. 
- Instances of `str` contain Unicode characters.  **All text and text-based I/O is Unicode in Python 3.**

In Python 2, there are two types that represent sequences of characters: `str` and `unicode`. 
- Instances of `str` contain raw 8-bit values. 
- Instances of `unicode` contain Unicode characters.

Some wrapper functions for safer conversion:

In [40]:
def to_str(b, encoding='utf-8', errors='strict'):
    """Converts bytes (or string) to string."""
    if isinstance(b, bytes):
        res = b.decode(encoding=encoding, errors=errors)
    else:
        res = b
    return res


def to_bytes(s, encoding='utf-8', errors='strict'):
    """Converts string (or bytes) to bytes."""
    if isinstance(s, str):
        res = s.encode(encoding=encoding, errors=errors)
    else:
        res = s
    return res
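
The `errors` argument that these wrappers pass through decides what happens when the bytes do not match the requested encoding. A small sketch of the three most common settings, using bytes that are deliberately encoded as Latin-1 and then (wrongly) decoded as UTF-8:

```python
raw = 'Jalape\u00f1o'.encode('latin-1')  # b'Jalape\xf1o', which is not valid UTF-8

try:
    raw.decode('utf-8')                  # errors='strict' is the default
except UnicodeDecodeError as exc:
    print('strict raised:', exc.reason)

replaced = raw.decode('utf-8', errors='replace')  # bad byte becomes U+FFFD
ignored = raw.decode('utf-8', errors='ignore')    # bad byte is dropped
print(replaced)
print(ignored)
```

Both lenient modes lose information, so `'strict'` is usually the safer default while debugging.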

## Python functions related to encoding

Built-ins:

| Function | Use | Example |
| :------- | :-- | :------ |
| `ascii(obj, /)` | Return an ASCII-only representation of an object. | |
| `bin(number, /)` | Return the binary representation of an integer. | `bin(2796202)` |
| `bytes([source[, encoding[, errors]]])` | Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. | | 
| `chr(i)` | Return the string representing a character whose Unicode code point is the integer `i`.  This is the inverse of `ord()`. | | 
| `hash(object)` | Return the hash value of the object (if it has one). Hash values are integers. | | 
| `hex(x)` | Convert an integer number to a lowercase hexadecimal string prefixed with “0x”. | `hex(255)`, `hex(-42)` |
| `oct(x)` | Convert an integer number to an octal string prefixed with “0o”. | |
| `ord(c)` | Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. | | 
| `repr(object)` | Return a string containing a printable representation of an object. | | 
| `str(bytes_or_buffer[, encoding[, errors]])` | Return a str version of `object`. | |

In [12]:
char = 'a'

In [13]:
ord(char)  # integer representing the Unicode code point of 'a'

97

In [14]:
hex(ord(char))  # convert this integer to lowercase hexadecimal string prefixed with “0x”.

'0x61'

In [17]:
chr(97)  # chr(ord(char)) - string that maps to this code point integer

'a'

The opposite method of `bytes.decode()` is `str.encode()`, which returns a `bytes` representation of the Unicode string, encoded in the requested encoding.

In [25]:
u = chr(40960) + 'abcd' + chr(1972)

In [27]:
print(u)

ꀀabcd޴


In [37]:
ord(chr(40960))

40960

In [28]:
u.encode('utf-8')

b'\xea\x80\x80abcd\xde\xb4'

The above returns a [bytes](https://docs.python.org/3/library/stdtypes.html#bytes) object.  (More about that below.)

Python is aware of Unicode characters directly typed into source code.  We could type:

In [39]:
s = "That's a spicy Jalapeño!"

But, what if we didn't know how to actually type this?  Alternatively, we could use a **Unicode code-point escape**.  As alluded to above there are 3 Unicode escapes:
- `\xhh` embeds the character with 8-bit hex value `hh`.  (Code points U+00-U+FF.)
- `\uhhhh` embeds the _character with 16-bit hex value `hhhh`_ in the string.  (Code points U+0100-U+FFFF.)
- `\Uhhhhhhhh` (capital `U`, eight hex digits) embeds the character with 32-bit hex value `hhhhhhhh`.  (Code points U+10000 and above.)

In [41]:
# hex(ord('ñ')) == '0xf1'
t = "That's a spicy Jalape\xf1o!"       # \xhh
t2 = "That's a spicy Jalape\u00f1o!"  # \uhhhh
s == t == t2

True

In [42]:
print(t)

That's a spicy Jalapeño!
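
For code points beyond U+FFFF the escape takes a capital `U` and eight hex digits, and Python also accepts `\N{...}` with the character's Unicode name. A short sketch of both; the emoji code point is an arbitrary example.

```python
import unicodedata

grin = '\U0001F600'                             # eight hex digits after a capital U
n_tilde = '\N{LATIN SMALL LETTER N WITH TILDE}' # escape by character name

print(len(grin), hex(ord(grin)))       # one character, one code point
print(n_tilde == '\xf1' == '\u00f1')   # all three escapes name the same character
print(unicodedata.name(n_tilde))       # look the name up from the character
```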


Now what happens when we encode this to bytes?

In [43]:
print(s.encode('utf-8'))

b"That's a spicy Jalape\xc3\xb1o!"


The character becomes `\xc3` and `\xb1`.  To see why, remember the [rules](#UTF-8,-UTF-16,-&-UTF-32) from above.  The code point here is:

In [44]:
ord('ñ')

241

Which is > 128.  So we will need **a sequence of two, three, or four bytes**, where each byte of the sequence is between 128 and 255.

Looking back at the table,

![](imgs/utf8b.png)

In [46]:
'U+00' + hex(ord('ñ')).partition('x')[-1]

'U+00f1'

This falls in the range U+0080 through U+07FF, so we need two bytes.
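
Going the other way makes the two-byte layout concrete: strip the marker bits from each byte and glue the remaining payload bits back together. This is a hand-rolled sketch for the two-byte case only.

```python
first, second = '\xf1'.encode('utf-8')   # iterating over bytes yields integers
print(format(first, '08b'), format(second, '08b'))

payload_high = first & 0b00011111    # drop the 110 marker of the lead byte
payload_low = second & 0b00111111    # drop the 10 marker of the continuation byte
code_point = (payload_high << 6) | payload_low

print(code_point, hex(code_point), chr(code_point))
```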

For more see the [tables](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) for recognized escape sequences.

# Hashing & one-hot encoding

## One-hot encoding

The following example one-hot-encodes 2 strings to create "feature vectors."  

In [41]:
# Derived from Mueller/Massaron p. 224

def uniquewords(*args):
    """Return the unique words across *args, in order of first appearance."""
    allwords = ' '.join(args).split()
    return sorted(set(allwords), key=allwords.index)

def encode(*args):
    """One-hot encode the given input strings."""
    unique = uniquewords(*args)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args):
        words = s.split()  # compare whole words, not substrings
        for num, word in enumerate(unique):
            vec[num] = word in words
    return feature_vectors

s1 = 'awaken my love'
s2 = 'awaken the beast'

encode(s1, s2)

array([[ 1.,  1.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  1.]])

An alternate encoding that does not pay attention to original order:

In [42]:
def encode(*phrases):
    words = set(' '.join(phrases).split())
    features = np.zeros((len(phrases), len(words)))
    for feature, phrase in zip(features, phrases):
        for pos, word in enumerate(words):
            feature[pos] = word in phrase.split()  # whole-word match, not substring
    return features

encode(s1, s2)

array([[ 0.,  0.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  0.,  0.]])

The issue with one-hot encoding is that it becomes difficult to handle when your project's inputs vary a lot: every previously unseen word changes the length of the feature vectors.  Hashing is a smart way to handle unpredictability in your inputs.

## Hashing

Python docs: [`hash`](https://docs.python.org/3/library/functions.html#hash)

A hash function is any function that can be used to **map data of arbitrary size to data of fixed size**.  Hash values are integers.
* You can't convert a hashed code to its original value.
* In some rare cases, different words generate the same hashed result.
* There are many hash functions, with MD5 and SHA being the most popular.
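
The "arbitrary size in, fixed size out" property is easy to check with the standard-library `hashlib` module. The inputs below are arbitrary, and MD5 appears only as an illustration (some locked-down Python builds disable it).

```python
import hashlib

digest_lengths = []
for text in ['a', 'awaken my love', 'x' * 10000]:
    data = text.encode('utf-8')              # hash functions operate on bytes
    md5_hex = hashlib.md5(data).hexdigest()
    sha_hex = hashlib.sha256(data).hexdigest()
    digest_lengths.append((len(md5_hex), len(sha_hex)))
    print(len(data), len(md5_hex), len(sha_hex))
```

However long the input is, the MD5 digest stays at 32 hex digits and the SHA-256 digest at 64.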

## Defining a simple hashing trick

In [43]:
# from Mueller/Massaron
def hashing_trick(input_str, vec_size=20):
    feature_vector = [0] * vec_size
    for word in input_str.split(' '):
        index = abs(hash(word)) % vec_size
        feature_vector[index] = 1
    return feature_vector

print(hashing_trick(s1))
print(hashing_trick(s2))

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
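
One caveat: Python's built-in `hash()` is salted per process for strings (unless `PYTHONHASHSEED` is set), so the vectors above can differ from run to run. A variant that stays stable across runs, sketched here with `hashlib` as a swapped-in hash function rather than the book's approach, could look like this; `stable_hashing_trick` is a made-up name.

```python
import hashlib

def stable_hashing_trick(input_str, vec_size=20):
    """Hashing trick with a digest-based index that is stable across runs."""
    feature_vector = [0] * vec_size
    for word in input_str.split():
        digest = hashlib.sha256(word.encode('utf-8')).digest()
        # Use the first 8 bytes of the digest as an integer, then wrap it.
        index = int.from_bytes(digest[:8], 'big') % vec_size
        feature_vector[index] = 1
    return feature_vector

v1 = stable_hashing_trick('awaken my love')
v2 = stable_hashing_trick('awaken the beast')
print(v1)
print(v2)
```

Collisions are still possible, so a vector may contain fewer ones than the phrase has distinct words.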
