# Unicode, Encoding, Bytes and Strings

## How Does a Computer Store Data?

Bits! ... 0's and 1's.

### Straightforward for Numbers: Binary!

* use some set number of bits to represent:
    * whole numbers... or floating point numbers
* maybe some bits can be reserved for:
    * where to place decimal point
    * positive or negative
* need _moar_ values? add more bits!
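
Here's a quick sketch of those ideas in Python. The specific values (65, -5, 0.15625) are just illustrations, and two's complement and IEEE 754 are only two common schemes for the sign and the decimal point, picked here as assumptions:

```python
import struct

# whole numbers: 65 written out as a fixed-width group of 8 bits
sixty_five = format(65, '08b')
print(sixty_five)

# need more values? 8 bits gives 256 of them, 16 bits gives 65536
print(2 ** 8, 2 ** 16)

# positive or negative: two's complement uses the top bit as the sign
neg_five = (-5).to_bytes(1, 'big', signed=True)
neg_five_bits = format(neg_five[0], '08b')
print(neg_five_bits)

# floating point: pack 0.15625 as a 32-bit IEEE 754 float and look at its bits
# (1 sign bit, 8 exponent bits, 23 fraction bits)
float_bits = ''.join(format(byte, '08b') for byte in struct.pack('>f', 0.15625))
print(float_bits)
```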

### For Text: Map Numbers to Characters

Map numbers (_code points_) to corresponding characters...

* for example: 65 -> A
* you may have a table of mappings from code points to characters (something like [http://www.asciitable.com](http://www.asciitable.com))
* those mappings have to be encoded into (some number of) bits

### Prove It! (In 🐍)

In [None]:
# ord and chr will convert to and from a character and code point
print(ord('A'))
print(chr(65))

## ASCII

* ASCII is encoded using 7 bits (or 8 for the various "extended" versions!)
    * 7 bits gives exactly 128 different values
    * good for a-z, A-Z, 0-9 etc.
    * not so good if you live in Korea, Pakistan or any other country whose language uses a different character set
    * not so good if you want to send 🤢 or other emoji
* ASCII is both the name of the mapping and name of the encoding
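
To see that limit in action, here's a small sketch using `str.encode` with the `'ascii'` codec. The Korean character is just an example picked for illustration:

```python
# 'A' fits in ASCII: one byte, with the value 65
ascii_bytes = 'A'.encode('ascii')
print(list(ascii_bytes))

# a Korean character has no ASCII mapping, so encoding fails
try:
    '한'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as e:
    ascii_failed = True
    print('nope:', e)
```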
    

## _Other Encodings_

Because ASCII was limited, *many* other encodings were created. I mean, [many encodings](https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings). These encodings weren't guaranteed to have common mappings, even if they were meant to represent the same character set! 😩
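
A tiny sketch of the problem: the same single byte means different things depending on which encoding you assume. The byte value and the two encodings here (latin-1 and cp1251) are arbitrary picks for illustration:

```python
# one byte, value 233 (0xE9 in hex)
mystery_byte = bytes([0xE9])

# the same byte maps to a different character in each encoding
as_latin1 = mystery_byte.decode('latin-1')
as_cp1251 = mystery_byte.decode('cp1251')
print(f'latin-1 says: {as_latin1}')
print(f'cp1251 says:  {as_cp1251}')
```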

What to do?

## Unicode

Unicode is the name of a mapping of _code points_ only (it does not specify encoding!). It can represent over 1 million characters! Everything from Cyrillic to all of your fav emoji 🙌 🤯.

The links below show some tables. Code points may be represented in binary, decimal, and hexadecimal. Many tables use hexadecimal... but the resulting code point is still the same value.

* unicode.org has all the charts: [https://unicode.org/charts/](https://unicode.org/charts/)
* the first 128 characters are backward compatible with ASCII: [https://unicode.org/charts/PDF/U0000.pdf](https://unicode.org/charts/PDF/U0000.pdf)
* here are some emoji mappings if want 'em 🙏: [https://unicode.org/emoji/charts/full-emoji-list.html](https://unicode.org/emoji/charts/full-emoji-list.html)
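
As a quick sketch, here is one code point shown in decimal, hex, and binary, plus the official character name from the standard library's `unicodedata` module. The emoji is just an example:

```python
import unicodedata

ch = '😂'
code_point = ord(ch)

# same value, three notations
print(code_point)        # decimal
print(hex(code_point))   # hexadecimal (what most charts use)
print(bin(code_point))   # binary

# a \U escape takes the hex code point, padded to 8 digits
same_char = '\U0001F602'
print(same_char == ch)

# the official name of the character
char_name = unicodedata.name(ch)
print(char_name)
```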



## Encodings for Unicode

Again, unicode is just the name of the mapping from code points to characters. Want to actually _encode_ a character? You haz some choices:

### utf-8

Can store characters in 1 byte or as many as 4 bytes (variable length encoding)

* even though the basic unit is only 8 bits / 1 byte, it can represent any other unicode character by adding additional bytes
* the high (left-most) bits of the first byte specify whether or not other bytes should be combined
* for example, if left-most bit is 0, then character can be represented by a single byte
* if first bit is 1, then multiple bytes needed to represent character!
    * starting with 110xxxxx means two bytes needed
    * starting with 1110xxxx means three bytes needed
    * see a pattern? number of leading 1's specifies number of bytes to represent character
    * additional / continuation bytes are prefixed with 10
    * so, take the binary representation of a code point and fill in the x's
    * for something that needs 4 bytes to represent:
        * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        * first 4 1's and 0 mean 4 bytes
        * remaining 3 bytes are prefixed with 10
        * fill in x's with bits from binary representation of code point
* you can find a nice [explanation on stackoverflow](https://stackoverflow.com/a/44568131)

What? 🤔... Here's an actual example: 

* Let's check out 😂 ([tears of joy](https://www.fileformat.info/info/unicode/char/1f602/index.htm))
* Its unicode code point, in decimal, is: `128514`
* In binary, `128514` is `000011111011000000010` (with some 0 padding)
* This can't be represented in a single byte or even 3 bytes in utf-8... but we can do it with 4 bytes
* Here's the pattern for 4 bytes: `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`
* Breaking up our binary representation of the code point to fit in the x's above, we have: `000 011111 011000 000010`
* And, finally: `11110000 10011111 10011000 10000010`

### Let's prove that this works...

This is a bit much; it's really here just to show that we can encode and decode using utf-8 using the rules above... and we can do it with a little bit of Python.

In [2]:
# here is tears of joy...
ch = '😂'
print(f'Tears of joy: {ch}\n============')

Tears of joy: 😂


In [3]:
# let's see the code point using ord
print(f'Let\'s see the utf-8 encoding of {ch}!\n----')
      
code_point = ord(ch)
print(f'code point for {ch} is: {code_point}')

Let's see the utf-8 encoding of 😂!
----
code point for 😂 is: 128514


In [4]:
# let's see the binary version of the code point

# this format specifier, 021b, means:
# * pad with 0's
# * there should be 21 total digits (the width)
# * format as binary number
format_as_padded_bin = '021b'

# use the format specifier as nested variable in format string after colon
padded_bin = f'{code_point:{format_as_padded_bin}}'
print(f'{code_point} in binary is: {padded_bin}')

128514 in binary is: 000011111011000000010


In [5]:
# distribute into 4 bytes: UTF-8 FTW!
# fill bits into the x's in the pattern below:
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
a, b, c, d = padded_bin[:3], padded_bin[3:9], padded_bin[9:15], padded_bin[15:]
encoded = f'11110{a}10{b}10{c}10{d}'
print(f'encoded as utf-8, in binary: {encoded}')

encoded as utf-8, in binary: 11110000100111111001100010000010


In [6]:
# ok, let's test if this is encoded correctly by decoding it!
print(f'\nLet\'s go from utf-8 back to the actual character (decode), {ch}\n----')

# let's turn this into a sequence of bytes, with each byte shown in binary (as a string)
bytes_as_bin = [encoded[i: i + 8] for i in range(0, len(encoded), 8)]
print(f'these are our bytes as strings in a list: {bytes_as_bin}')



Let's go from utf-8 back to the actual character (decode), 😂
----
these are our bytes as strings in a list: ['11110000', '10011111', '10011000', '10000010']


In [7]:
# convert each string into an int, and use that to create a bytes object
# call decode on bytes object to get back character
# decode will decode a series of bytes using utf-8 
# (though you can specify an encoding as a keyword arg)
b = bytes([int(i, 2) for i in bytes_as_bin])
print('decode those bytes to get the original character:', b.decode())

decode those bytes to get the original character: 😂
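
The cells above hard-code the 4-byte case. Below is a rough sketch of the same rules generalized to 1 through 4 bytes. `utf8_encode` is a hypothetical helper written for this notebook (it skips some checks a real encoder does, like rejecting surrogates); in practice you'd just call `str.encode`.

```python
def utf8_encode(code_point):
    """Encode one code point as utf-8 by hand (illustration only)."""
    if not 0 <= code_point < 0x110000:
        raise ValueError('not a valid unicode code point')
    if code_point < 0x80:
        return bytes([code_point])             # 0xxxxxxx
    if code_point < 0x800:
        n_bytes, lead_prefix = 2, 0b11000000   # 110xxxxx
    elif code_point < 0x10000:
        n_bytes, lead_prefix = 3, 0b11100000   # 1110xxxx
    else:
        n_bytes, lead_prefix = 4, 0b11110000   # 11110xxx

    # peel off 6 bits at a time, right to left, for the continuation bytes
    tail = []
    for _ in range(n_bytes - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))  # 10xxxxxx
        code_point >>= 6

    # whatever bits are left go into the x's of the leading byte
    lead = lead_prefix | code_point
    return bytes([lead] + tail[::-1])


for sample in 'A', 'ж', '€', '😂':
    by_hand = utf8_encode(ord(sample))
    print(sample, by_hand, by_hand == sample.encode('utf-8'))
```

Each line printed should end in `True`, meaning the hand-rolled bytes match what Python's built-in encoder produces.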


### Addendum

Most of the characters in everyday use (character sets from most natural languages) are in the first ~65,000 code points (called the _Basic Multilingual Plane_). Most emoji exist above that, and require 4 bytes to represent in utf-8.
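
A quick sketch to check this: for a handful of arbitrarily chosen characters, see whether the code point falls in the Basic Multilingual Plane (at or below `0xFFFF`) and how many bytes utf-8 needs.

```python
samples = ['A', 'ж', '€', '☃', '😂']

for ch in samples:
    in_bmp = ord(ch) <= 0xFFFF
    n_bytes = len(ch.encode('utf-8'))
    print(f'{ch}  code point {hex(ord(ch)):>8}  in BMP: {in_bmp!s:5}  utf-8 bytes: {n_bytes}')

# collect the byte counts so they're easy to inspect
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
```

Note that the snowman is in the Basic Multilingual Plane and only needs 3 bytes, while tears of joy is above it and needs 4.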


### Other Encoding Schemes: utf-16, utf-32

If using mostly ASCII characters, then utf-8 is a great choice. However, if using many characters that can only be encoded in more than one byte, utf-32 (or, I guess, utf-16), might be a better option. 

Why use utf-8 (usually)?

* if most of your characters can be encoded in 1 byte; use it! 
* it saves space... (why use 4 bytes to represent `A` when you can use 1?)

Why use utf-32 instead?

* if using lots of code points that require multiple bytes, it's a bit more complex decoding utf-8, since the number of bytes used per character has to be determined
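
To make the tradeoff concrete, here's a sketch comparing how many bytes each encoding needs for a few sample strings. `encoded_sizes` is a hypothetical helper, and the `-le` variants are used only so that no byte order mark gets added to the counts:

```python
def encoded_sizes(text, encodings=('utf-8', 'utf-16-le', 'utf-32-le')):
    """Return the number of bytes each encoding needs for text (illustration only)."""
    return {enc: len(text.encode(enc)) for enc in encodings}


for sample in ('hello, world', 'привет', '😂🙌🤯'):
    print(repr(sample), encoded_sizes(sample))

# utf-32 is fixed width: character n always starts at byte 4 * n,
# so there's no need to inspect leading bits to find a character
encoded = '😂🙌🤯'.encode('utf-32-le')
second_char = encoded[4:8].decode('utf-32-le')
print(second_char)
```

For the ASCII-only string, utf-8 is the clear winner; for the emoji, all three come out the same size.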

### Er? Y Does This Matter?

Why might knowing about encodings be useful? ...Sometimes you source a file, but you don’t know what encoding it is

* If you have a series of bytes, you can decode with a scheme of your choice (utf-8, latin-1, etc.?)
* Automatic detection of encoding is tricky! (no standard for embedding the encoding in a file, usually the encoding is not even included!)
* Editors/viewers will use different strategies, but no guarantee guess will be right! 😮
* btw, some tools: `file` and `enca` to guess at encoding... sublime, atom, etc. to load in different encoding
* and, of course, Python can read files with different encodings (the default depends on your platform's locale, though it's often utf-8)
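
Here's a sketch of the "decode with a scheme of your choice" idea. `try_decodings` is a hypothetical helper and the candidate list is an arbitrary assumption; the point is only that some guesses fail loudly, some "work" but give garbage, and (hopefully) one is right:

```python
# pretend these bytes arrived without a label saying how they were encoded
raw = 'Той'.encode('cp1251')


def try_decodings(data, candidates=('utf-8', 'latin-1', 'cp1251')):
    """Decode data with each candidate encoding; None means it raised an error."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


results = try_decodings(raw)
for enc, text in results.items():
    print(f'{enc:>8}: {text!r}')
```

Notice that latin-1 doesn't raise an error; it just quietly gives the wrong characters, which is part of why automatic detection is hard.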

### Example / Mystery!

Download this file into the same directory as your notebook:

[https://www.gutenberg.org/files/4909/old/olavg10.txt](https://www.gutenberg.org/files/4909/old/olavg10.txt) 

Try to figure out how to _read_ it correctly. 🕵

* open it in a text editor, what do you see?
* reopen, but change the encoding in the text editor of your choice; does that fix things?
* note that most text editors, like sublime and atom, can be set to use a specific encoding
* choose CP1251 or Windows-1251

If you're unable to change the encoding, we can look at it with python too 👀:

1. first as utf-8 (which causes an exception)
2. then as cp1251 (which shows us cyrillic)

In [8]:
# we try to read the file with open
# we explicitly ask for utf-8 (the default depends on your platform)
# there are some invalid continuation bytes
# ... so we'll get an exception
try:
    with open('olavg10.txt', 'r', encoding='utf-8') as f:
        print(f.read())
except FileNotFoundError as e:
    print('ERROR!!!!!!')
    print('plz download https://www.gutenberg.org/files/4909/old/olavg10.txt into same directory as this notebook first. k thx bai.')
except UnicodeDecodeError as e:
    print('Welp! Cannot decode this file... we are trying utf-8, but. it. is. not. that.')
    print(e)

Welp! Cannot decode this file... we are trying utf-8, but. it. is. not. that.
'utf-8' codec can't decode byte 0xce in position 1494: invalid continuation byte


In [9]:
# now let's use codecs.open so we can read the file with a specific encoding
# (the built-in open accepts an encoding keyword arg too)
import codecs
try:
    with codecs.open('olavg10.txt', encoding='cp1251') as f:
        lines = f.readlines()
        print('a line in our text:')
        print(lines[51])
except FileNotFoundError as e:
    print('ERROR!!!!!!')
    print('plz download https://www.gutenberg.org/files/4909/old/olavg10.txt into same directory as this notebook first. k thx bai.')

a line in our text:
Той и сам не знае кога е роден, но като го запитат, казва, че сега е на 36 години. Родното му градче, Грасдорф, е скътано в една гънка от росните поли на Харц, оная величествена планина, която лежи като мила сянка на душата му и която нему е по сърце да възпява в своите песни. Цял ред години го познавам аз, както познавам себе си, и макар помежду ни често да е минавала черна котка - ний сме си и до ден днешен интимни и добри приятели. И това, което аз зная и ще кажа за него, той сам ще го подпише.
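
If you can't download the file, here's a self-contained sketch of the same idea. It writes a few words to a temporary file as cp1251, then reads them back as raw bytes and as text. The file name and sample text are made up for illustration:

```python
import os
import tempfile

text = 'Той и сам не знае'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample_cp1251.txt')

    # write the text out using cp1251
    with open(path, 'w', encoding='cp1251') as f:
        f.write(text)

    # 'rb' gives us the raw bytes, with no decoding at all
    with open(path, 'rb') as f:
        raw = f.read()

    # reading with the matching encoding gives back the original text
    with open(path, encoding='cp1251') as f:
        round_tripped = f.read()

print(raw)
print(round_tripped)
# cp1251 uses one byte per character, so the lengths match
print(len(text), len(raw))
```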



## Strings vs Bytes

### Strings
From the docs: "The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal"... see below!

In [10]:
s = "this is clearly a string"
s2 = "also a string ☃"
print(s2)

also a string ☃


In [11]:
s3 = "not sure 🙃"
print(s3)

not sure 🙃
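
A string is a sequence of code points, not bytes. As a small sketch, compare the length of the string from above with the length of its utf-8 encoding:

```python
s2 = "also a string ☃"

# encode turns a str into bytes using the given encoding
encoded = s2.encode('utf-8')

print(len(s2))       # number of characters (code points)
print(len(encoded))  # number of bytes: the snowman alone takes 3
print(encoded)

# decode reverses the trip
print(encoded.decode('utf-8') == s2)
```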


### bytes Objects

From Python docs: "Bytes objects are immutable sequences of single bytes." ...

* sequence of ints 0 - 255
* can be created with a bytes literal: a `b` prefix on a string of ascii characters

In [12]:
b = b"hello"

In [13]:
b[0]

104

In [14]:
ord('h')

104

In [15]:
try:
    b + '!!!!!'
except TypeError as e:
    print(type(e), e)

<class 'TypeError'> can't concat str to bytes


In [16]:
type(b)

bytes

### Use `decode` Method to Convert to a String

* interpret a series of bytes as utf-8

In [17]:
b.decode('utf-8')

'hello'
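
With `decode` (and its counterpart, `encode`) in hand, the `TypeError` from mixing `bytes` and `str` can be avoided by converting one side first. A minimal sketch, assuming utf-8 for both directions:

```python
b = b"hello"

# option 1: turn the str into bytes, result is bytes
as_bytes = b + '!!!!!'.encode('utf-8')
print(type(as_bytes), as_bytes)

# option 2: turn the bytes into a str, result is str
as_text = b.decode('utf-8') + '!!!!!'
print(type(as_text), as_text)
```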

### Now Let's Try utf-16

In [18]:
b = b'hello!'
b.decode('utf-8')
# ... works as you expect!

'hello!'

In [19]:
b.decode('utf-16')
# ... how about same bytes as utf-16

'敨汬Ⅿ'
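
Why those characters? utf-16 reads the bytes two at a time. Here's a sketch that redoes the decoding by hand, assuming little-endian byte order (which is what the output above implies for the machine it ran on; with no byte order mark, plain `'utf-16'` uses the machine's native order):

```python
b = b'hello!'

# split into 2-byte chunks: b'he', b'll', b'o!'
pairs = [b[i:i + 2] for i in range(0, len(b), 2)]

# combine each pair into one number, low byte first (little-endian)
code_points = [int.from_bytes(pair, 'little') for pair in pairs]
print([hex(cp) for cp in code_points])

# each number is a code point... just not the ones we meant!
text = ''.join(chr(cp) for cp in code_points)
print(text)
print(text == b.decode('utf-16-le'))
```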

# String Formatting

## Using the `format` Method

In [20]:
name = 'joe'
num = 20
food = 'apple pies'
"Hi, my name is {}, and I have {} {}!".format(name, num, food)

'Hi, my name is joe, and I have 20 apple pies!'

### Using `format` with Positions

In [21]:
"Hi, my name is {2}, and I have {0} {1}!".format(num, food, name)

'Hi, my name is joe, and I have 20 apple pies!'

### Using `format` with Positions and Format Specifier

In [22]:
"Hi, my name is {2:s}, and I have {0:.2f} {1:s}!".format(num, food, name)

'Hi, my name is joe, and I have 20.00 apple pies!'

### Format String Literals

In [23]:
f"Hi, my name is {name}, and I have {num} {food}!"

'Hi, my name is joe, and I have 20 apple pies!'

In [24]:
f"Hi, my name is {name}, and I have {num:.2f} {food}!"

'Hi, my name is joe, and I have 20.00 apple pies!'

### Using the % Operator

In [25]:
result = "%s %s" % (num, food)
print(result)

20 apple pies
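
A few more format specifiers, as a sketch (these are a small, arbitrary selection; the same specifiers work with both `format` and format string literals):

```python
num = 20

zero_padded = f'{num:05d}'    # pad with 0's to a width of 5
as_binary = f'{num:b}'        # binary, like the 021b specifier earlier
as_hex = f'{num:x}'           # hexadecimal
right_aligned = f'{num:>6}'   # right-align in a field 6 wide

# the % operator has its own, similar, specifiers
percent_style = '%.2f %s' % (num, 'apple pies')

for result in (zero_padded, as_binary, as_hex, right_aligned, percent_style):
    print(repr(result))
```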
