# Unicode, Encoding, Bytes and Strings

## Representation of Data

Use bits! ... 0's and 1's

* straightforward for numbers... store the number in binary, often in a fixed number of bits
* representing text... needs a mapping of numbers to characters (65 ↔ A in ASCII and many other encodings)
* unicode is just the mapping (character ↔ number, a.k.a. code point)
* WAT about encoding???
    * utf-8
    * utf-16
    * utf-32
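
A quick sketch of the mapping vs. encoding distinction from the list above: `ord` / `chr` expose the Unicode mapping, while `encode` turns a code point into actual bytes. The variable names are just for illustration, and the `-le` (little-endian) codec variants are used so that no byte order mark is prepended.

```python
# Unicode maps characters to numbers (code points).
code_point = ord("A")        # 65
char = chr(65)               # "A"

# An encoding turns code points into bytes; the same character
# produces different byte sequences under different encodings.
snowman = "\u2603"           # ☃, code point U+2603
as_utf8 = snowman.encode("utf-8")       # 3 bytes
as_utf16 = snowman.encode("utf-16-le")  # 2 bytes
as_utf32 = snowman.encode("utf-32-le")  # 4 bytes

print(code_point, char)
print(hex(ord(snowman)))
print(as_utf8, as_utf16, as_utf32)
```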

## Encodings

### utf-8

Variable-length encoding. Even though each unit is only 8 bits / 1 byte, it can represent any unicode character by using additional bytes, up to 4 per character (the high bits of each byte specify whether it stands alone, starts a multi-byte sequence, or continues one).

* sometimes the encoding of a file is not known
* ...if you have a series of bytes, you can decode with a scheme of your choice (utf-8, latin-1, etc.)
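
To make the "variable length" part concrete, here is a small sketch that encodes four characters needing 1, 2, 3 and 4 bytes and prints the bits of each byte. The sample characters are an arbitrary choice for illustration.

```python
# A, é (U+00E9), ☃ (U+2603), 🙃 (U+1F643)
chars = ["A", "\u00e9", "\u2603", "\U0001F643"]

for ch in chars:
    encoded = ch.encode("utf-8")
    # Show every byte as 8 bits so the leading marker bits are visible:
    # 0xxxxxxx = single byte, 110/1110/11110 = start of a 2/3/4-byte
    # sequence, 10xxxxxx = continuation byte.
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(ch, len(encoded), bits)

lengths = [len(ch.encode("utf-8")) for ch in chars]
```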

### others

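If the text is mostly ASCII characters, then utf-8 is a great choice (one byte each). However, if most characters need three bytes in utf-8 (many East Asian scripts, for example), utf-16 might be a better option since it stores those in two. utf-32 uses four bytes for every character, which usually takes up too much space to be practical.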
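
A rough way to check the size trade-off yourself: encode the same text three ways and count bytes. The two sample strings are assumptions picked for illustration, and the `-le` codec variants are used so the byte order mark does not inflate the counts.

```python
samples = {
    "ascii": "hello world",
    "japanese": "こんにちは世界",
}

sizes = {}
for label, text in samples.items():
    # Byte count of the same text under each encoding.
    sizes[label] = {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(label, len(text), "chars", sizes[label])
```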

### why even? 

Ok. Sooo… how is this practical at all? Why might knowing about encodings be useful? ...Sometimes you source a file, but you don’t know what encoding it is

* If you have a series of bytes, you can decode with a scheme of your choice (utf-8, latin-1, etc.)
* Automatic detection of encoding is tricky! (no standard for embedding the encoding in a file, and usually the encoding isn't included at all!)
* Editors/viewers will use different strategies, but no guarantee guess will be right! 😮
* btw, some tools: `file` and `enca` to guess at encoding... sublime, atom, etc. to load in a different encoding
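
The "decode with a scheme of your choice" idea can be sketched in a few lines. `try_decodings` is a hypothetical helper written for these notes (not a library function), and the candidate list is an arbitrary assumption. Note that a decode that succeeds is not necessarily the right one!

```python
def try_decodings(data, candidates=("utf-8", "latin-1", "utf-16")):
    """Hypothetical helper: attempt each candidate encoding in turn."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None   # these bytes are not valid in this encoding
    return results

# Pretend we don't know these bytes were written as latin-1.
mystery = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
results = try_decodings(mystery)

for name, text in results.items():
    print(name, repr(text))
```

Here utf-8 fails outright, latin-1 gives the intended text, and utf-16 "works" but produces nonsense, which is why guessing is unreliable.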

### example

Download this file... try to figure out what encoding it is:

[Mystery Encoding](https://www.gutenberg.org/files/4909/old/olavg10.txt) 🕵

* open it in a text editor
* use `file` to figure out what it is
* if you can, try `enca`
* reopen, but change encoding
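
The "reopen, but change encoding" step can also be done from Python with the `encoding` argument of `open`. This sketch does not touch the downloaded file; it writes a small stand-in file (made-up content, latin-1 on purpose) to a temporary path and reads it back two ways.

```python
import os
import tempfile

# Stand-in for a file of unknown encoding: latin-1 text with ö and å.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("sm\u00f6rg\u00e5sbord".encode("latin-1"))
    path = f.name

# First guess: utf-8. The latin-1 bytes are not valid utf-8, so this fails.
try:
    with open(path, encoding="utf-8") as fh:
        as_utf8 = fh.read()
except UnicodeDecodeError:
    as_utf8 = None

# Second guess: latin-1. Every byte is valid, and here it is also correct.
with open(path, encoding="latin-1") as fh:
    as_latin1 = fh.read()

os.remove(path)
print(repr(as_utf8), repr(as_latin1))
```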

## Strings vs Bytes

### Strings
From the docs: "The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal"... see below!

In [1]:
s = "this is clearly a string"
s2 = "also a string ☃"
print(s2)

In [2]:
s3 = "not sure 🙃"
print(s3)

### bytes Objects

From Python docs: "Bytes objects are immutable sequences of single bytes." ...

* sequence of ints in the range 0 - 255
* can be created from a literal of ASCII characters prefixed with `b`

In [3]:
b = b"hello"

In [4]:
b[0]

104

In [5]:
ord('h')

104

In [6]:
try:
    b + '!!!!!'
except TypeError as e:
    print(type(e), e)

<class 'TypeError'> can't concat str to bytes


In [7]:
type(b)

bytes
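
A few more things to poke at, as a sketch: building the same `bytes` object from ints, seeing that it is immutable, and two ways around the `TypeError` from mixing `bytes` and `str`. Variable names are illustrative only.

```python
b = b"hello"

# The same object can be built from a list of ints in range(256).
from_ints = bytes([104, 101, 108, 108, 111])
values = list(b)

# bytes are immutable: item assignment raises TypeError.
try:
    b[0] = 72
except TypeError as e:
    error = str(e)
    print(type(e), e)

# Concatenation works when both sides are the same type.
shout = b + b"!!!!!"                     # bytes + bytes
shout_str = b.decode("ascii") + "!!!!!"  # str + str
print(from_ints, values, shout, shout_str)
```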

### Use `decode` Method to Convert to a String

* interpret a series of bytes as utf-8

In [8]:
b.decode('utf-8')

'hello'
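
Going the other way uses `encode`. A minimal round-trip sketch with a non-ASCII character, showing that the number of characters and the number of bytes differ; it also uses the `errors="replace"` option of `decode` for bytes that are not valid utf-8.

```python
s = "also a string \u2603"      # ends with ☃

encoded = s.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(s), len(encoded))     # characters vs. bytes
print(encoded)
print(decoded == s)

# Invalid bytes raise UnicodeDecodeError by default; errors="replace"
# substitutes U+FFFD (the replacement character) instead.
lossy = b"\xff".decode("utf-8", errors="replace")
```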

### Now Let's Try utf-16

In [9]:
b = b'hello!'
b.decode('utf-8')
# ... works as you expect!

'hello!'

In [10]:
b.decode('utf-16')
# ... how about the same bytes as utf-16?

'敨汬Ⅿ'
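
Why those characters? utf-16 reads the bytes two at a time, so six ASCII bytes turn into three 16-bit code units. The sketch below redoes the decoding by hand, assuming little-endian byte order (spelled out here as `utf-16-le`; plain `utf-16` without a byte order mark falls back to the machine's native order, which is little-endian on most machines).

```python
b = b"hello!"

# Split into 2-byte pairs: b'he', b'll', b'o!'
pairs = [b[i:i + 2] for i in range(0, len(b), 2)]

# Combine each pair into one 16-bit number, low byte first.
code_units = [int.from_bytes(pair, "little") for pair in pairs]
print([hex(u) for u in code_units])

# Each number is then looked up as a code point.
manual = "".join(chr(u) for u in code_units)
print(manual, manual == b.decode("utf-16-le"))
```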

# String Formatting

## Using the `format` Method

In [11]:
name = 'joe'
num = 20
food = 'apple pies'
"Hi, my name is {}, and I have {} {}!".format(name, num, food)

'Hi, my name is joe, and I have 20 apple pies!'

### Using `format` with Positions

In [12]:
"Hi, my name is {2}, and I have {0} {1}!".format(num, food, name)

'Hi, my name is joe, and I have 20 apple pies!'

### Using `format` with Positions and Format Specifier

In [13]:
"Hi, my name is {2:s}, and I have {0:.2f} {1:s}!".format(num, food, name)

'Hi, my name is joe, and I have 20.00 apple pies!'
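
A few more format specifiers, as a sketch of what can go after the colon: width, alignment, zero padding, precision and a thousands separator. The values are made up for illustration.

```python
name = "joe"

right = "{:>8}".format(name)         # right-align in a field 8 wide
left = "{:<8}|".format(name)         # left-align in a field 8 wide
padded = "{:08.3f}".format(3.14159)  # width 8, zero padded, 3 decimals
grouped = "{:,}".format(1234567)     # thousands separator

print(repr(right), repr(left), repr(padded), repr(grouped))
```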

### Format String Literals

In [14]:
f"Hi, my name is {name}, and I have {num} {food}!"

'Hi, my name is joe, and I have 20 apple pies!'

In [15]:
f"Hi, my name is {name}, and I have {num:.2f} {food}!"

'Hi, my name is joe, and I have 20.00 apple pies!'

### Using the % Operator

In [16]:
result = "%s %s" % (num, food)
print(result)

20 apple pies
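
The `%` operator takes conversion specifiers too. A small sketch reusing the same `name`, `num` and `food` values as the earlier cells:

```python
name = "joe"
num = 20
food = "apple pies"

typed = "%d %s" % (num, food)   # %d for ints, %s for anything str() accepts
fixed = "%.2f" % num            # two decimal places
wide = "%5d" % num              # right-align in a field 5 wide
named = "%(name)s has %(num)d" % {"name": name, "num": num}  # dict lookup

print(typed, fixed, repr(wide), named)
```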
