# Strings

https://pyformat.info


Some relevant functions for strings. Let us consider the following string:

```python
x = 'This is a string.'
```

- `x.split(sep=None)`
  - If `sep` is not specified or is `None`, any
    whitespace string is a separator and empty strings are
    removed from the result.

- `x.upper()`
  - Returns the string `x` with all characters uppercase.

- `x.lower()`
  - Returns the string `x` with all characters in lowercase.


In [86]:
x = 'This is a string'
type(x)

str

In [87]:
words = x.split()
words

['This', 'is', 'a', 'string']

In [88]:
# by default split uses any run of whitespace as separator
x = 'We can split separate sentences. Using sep argument.'
x.split()

['We', 'can', 'split', 'separate', 'sentences.', 'Using', 'sep', 'argument.']

In [90]:
# We can define custom separators

x = 'We can split separate sentences. Using sep argument'
x.split(sep=".")

['We can split separate sentences', ' Using sep argument']
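The `upper` and `lower` methods listed at the top of this section can be tried in the same way. The following is only a small sketch; the variable names (`sentence`, `parts`, ...) are illustrative and do not come from the cells above.

```python
sentence = 'This is a string.'

upper_sentence = sentence.upper()   # every character in uppercase
lower_sentence = sentence.lower()   # every character in lowercase
words = sentence.split()            # split on runs of whitespace
parts = sentence.split(sep='i')     # split on a custom separator

print(upper_sentence)
print(lower_sentence)
print(words)
print(parts)
```

Notice that the separator itself is removed from the pieces, while the spaces around it are kept.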

# Formatting strings

We can format strings using the `x.format` method. This method allows us to insert information into the string `x`. We write placeholders `{}` inside the string `x` at the positions where we want the information to appear.

Let us see some examples:


In [34]:
names = ['David', 'Jaquim', 'Michael']
ages = [19, 30, 50]

In [35]:
for name,age in zip(names,ages):
    print('name: {}\t age: {}'.format(name,age))

name: David	 age: 19
name: Jaquim	 age: 30
name: Michael	 age: 50


## Placeholders and digit formatting

We can specify the number of digits used when formatting a number, or the number of decimal places.

In [36]:
print('big number {} hard to read'.format(10**10))

big number 10000000000 hard to read


In [37]:
print('big number {:,} better with commas'.format(10**10))

big number 10,000,000,000 better with commas


In [38]:
print('big number {:,.3f} also with decimal places'.format(10**10))

big number 10,000,000,000.000 also with decimal places


In [39]:
for i in range(5,15):
    print('number {:02}'.format(i))

number 05
number 06
number 07
number 08
number 09
number 10
number 11
number 12
number 13
number 14


In [40]:
# Decide the number of decimals
for i in range(5,15):
    print('number 2 decimals {:.2f}'.format(i/3.),end="\t")
    print('number 3 decimals {:.3f}'.format(i/3.))

number 2 decimals 1.67	number 3 decimals 1.667
number 2 decimals 2.00	number 3 decimals 2.000
number 2 decimals 2.33	number 3 decimals 2.333
number 2 decimals 2.67	number 3 decimals 2.667
number 2 decimals 3.00	number 3 decimals 3.000
number 2 decimals 3.33	number 3 decimals 3.333
number 2 decimals 3.67	number 3 decimals 3.667
number 2 decimals 4.00	number 3 decimals 4.000
number 2 decimals 4.33	number 3 decimals 4.333
number 2 decimals 4.67	number 3 decimals 4.667


## Placeholders with integer values 
We can also use placeholders with integers inside, which is useful in a variety of situations. For example, if we want to print a repeated value inside a string, we don't need to pass it to the `format` method several times.

In [91]:
# We don't need to do this
for name,age in zip(names,ages):
    print('name: {}\t name again: {} \t age: {}'.format(name, name, age))

name: David	 name again: David 	 age: 19
name: Jaquim	 name again: Jaquim 	 age: 30
name: Michael	 name again: Michael 	 age: 50


In [92]:
# We can simply use placeholders with integers inside to refer
# to the position of the arguments of the format method.
for name,age in zip(names,ages):
    print('name: {0}\t name again: {0} \t age: {1}'.format(name,age))

name: David	 name again: David 	 age: 19
name: Jaquim	 name again: Jaquim 	 age: 30
name: Michael	 name again: Michael 	 age: 50


## Placeholders with keyword arguments

We can also use keyword arguments inside the `format` method. By doing so we don't need to take into account the order in which the arguments of `format` are passed.


In [93]:
for n,a in zip(names,ages):
    print('name: {name}\t \t age: {age}'.format(age=a, name=n))

name: David	 	 age: 19
name: Jaquim	 	 age: 30
name: Michael	 	 age: 50


## Placeholders with dictionary inputs

We can pass dictionaries in the `format` method and use the keys of the dictionaries inside the placeholders.

In [94]:
names = ['David', 'Jaquim', 'Michael']
ages = [19, 30, 50]

d = []

for n,a in zip(names,ages):
    d.append({'name':n, 'age':a})

In [95]:
for d_k in d:
    print('name: {name}\t \t age: {age}'.format(**d_k))

name: David	 	 age: 19
name: Jaquim	 	 age: 30
name: Michael	 	 age: 50


# Date strings

In [96]:
import datetime
my_date = datetime.datetime(2017,10,5,10)

In [732]:
'The date was {m.day}/{m.month}/{m.year}'.format(m=my_date)

'The date was 5/10/2017'
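Besides attribute access, a `datetime` object also accepts a `strftime`-style format specification after the colon of a placeholder. The sketch below reuses `my_date` from the cell above; the names `plain` and `padded` are only illustrative.

```python
import datetime

my_date = datetime.datetime(2017, 10, 5, 10)

# Attribute access inside the placeholder, as in the cell above
plain = 'The date was {m.day}/{m.month}/{m.year}'.format(m=my_date)

# strftime-style directives after the colon give zero-padded fields
padded = 'The date was {:%d/%m/%Y at %H:%M}'.format(my_date)

print(plain)
print(padded)
```

The second form is convenient when day and month should always use two digits.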

# Working with bytes

Strings are internally encoded in many different ways but, for the most part, they end up being represented as a sequence of bytes. Just like with numbers, one needs some extra information to understand the meaning of a sequence of bytes. If no context is provided, `10` can be ten (in base 10), two (in base 2) or sixteen (in base 16).
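The ambiguity of `10` can be checked directly with `int`, which takes the base as its second argument. A minimal sketch (variable names are illustrative):

```python
digits = "10"

as_base_10 = int(digits, 10)   # read the digits in base 10
as_base_2 = int(digits, 2)     # read the same digits in base 2
as_base_16 = int(digits, 16)   # read the same digits in base 16

print(as_base_10, as_base_2, as_base_16)
```

The same two characters give three different numbers; only the base, which is the "context", changes.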

In order to decode a sequence of bytes into a string we need to know how the sequence of bytes was encoded.


It is therefore relevant to know how to work with bytes in Python.

- **`bytes(k)`** creates a `bytes` object with `k` zero bytes (when `k` is an integer).


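- In Python, bytes are displayed as `b'\x##\x##...\x##'`, where each `##` is a pair of hexadecimal digits. Bytes that correspond to printable ASCII characters are displayed as those characters instead.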


- Recall that a byte is 8 bits and a hexadecimal digit is a number from 0 to 15 represented as `0,1,2,...,9,a,b,c,d,e,f`. Therefore any byte, which can represent values up to 255 (2^8-1), can be written as two hexadecimal digits.
    
    - Hexadecimal numbers are represented with the prefix `0x`.
      For example number 10 in hexadecimal is `0xa`
    
    - Example: `00000110` is represented as `0x06` in hexadecimal


- The function **`hex`** can be used to obtain the hexadecimal form of an integer, returned as a string.

    - Example: `hex(65)` is `0x41`, `hex(10)` is `0xa`


- The function **`int(h,16)`** can be used to obtain the integer (in base 10) of the number in hexadecimal form `h`. More generally `int(n,b)` returns the integer (in base 10) of the number `n` passed as a string assuming `n` is represented in base `b`.

    - Example: `int("101",2)` is `5`
    - Example: `int("101",16)` is `257`.
    - Example: `int("0xF",16)` is `15`.
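The points in the list above can be checked with a few lines. This is only a sketch; the variable names are illustrative.

```python
zero_bytes = bytes(3)             # three bytes, all of them zero
value = 0b00000110                # the bit pattern 00000110
as_bits = format(value, '08b')    # the value as eight binary digits
as_hex = format(value, '02x')     # the value as two hexadecimal digits
largest = bytes([255])            # the largest value that fits in one byte

print(zero_bytes)
print(as_bits, as_hex)
print(largest, largest.hex())
```

`format(value, '02x')` is used here instead of `hex(value)` because it pads to two digits and omits the `0x` prefix.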


##### `bytes` and `bytearray`

- The function **`bytearray.fromhex`** can be used to create a `bytearray` object from a string containing a sequence of hexadecimal digits.
    
    - Example:
    ```
    b_array = bytearray.fromhex("0f")
    b_array == b'\x0f'
    True
    ```
    Notice that  `bytearray.fromhex("f")` will cause an error because two concatenated hexadecimal digits are needed to define a byte.


In Python 3, `bytes` consists of sequences of 8-bit values, while `str` consists of sequences of Unicode characters. `bytes` and `str` cannot be used together with operators like `>` or `+`.
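A short sketch of what happens when both types are mixed, and of the usual way out (convert one side explicitly). Variable names are illustrative.

```python
text = 'abc'
raw = b'abc'

# Mixing str and bytes with + raises a TypeError
try:
    text + raw
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Convert explicitly so that both operands have the same type
joined_text = text + raw.decode('ascii')
joined_raw = text.encode('ascii') + raw

# A str never compares equal to a bytes object, even with the "same" content
same = (text == raw)

print(mixing_failed, joined_text, joined_raw, same)
```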



In [823]:
type(b'this is a byte string')

bytes

In [780]:
0xA, hex(10)

(10, '0xa')

We can use `hex` to generate the hexadecimal representation of an integer. An integer literal written without a prefix, such as `10`, is read in base 10.

In [792]:
ten_base16 = hex(10)
ten_base16

'0xa'

You can get an integer from a string containing a hexadecimal number. You can use `int(n,b)` to convert `n` in base `b` to an integer in base 10.

In [794]:
ten_base10 = int(ten_base16, 16)
ten_base10

10

This also works with other bases

In [795]:
int("101",2), int("101",16), int("101",10)

(5, 257, 101)

In [811]:
int('0xa',16), int('0xF',16), int(b"0x0f",16)

(10, 15, 15)

Notice the difference between `b'x00'` and `'x00'`.

In [812]:
type(b'x00'), type('x00')

(bytes, str)

Create a byte array from the hexadecimal digits "ab0c" (two bytes: `ab` and `0c`)

In [813]:
b'\x0f' == bytearray.fromhex("0f")

True

In [814]:
my_byte_array = bytearray.fromhex("ab0c")

In [815]:
my_byte_array

bytearray(b'\xab\x0c')

In [816]:
len(my_byte_array)

2

In [817]:
# this should produce an error since two hexadecimals are needed to create a byte
bytearray.fromhex("f")

ValueError: non-hexadecimal number found in fromhex() arg at position 1

Working with a bunch of bytes

In [590]:
empty_bytes = bytes(1)

In [591]:
empty_bytes

b'\x00'

In [538]:
"\x41"

'A'

In [529]:
type("\x0A")

str

In [530]:
"\x0A"

'\n'

# String Encodings


- We can write **`b'cafe'`** to create a `bytes` literal (ASCII characters only).
- We can write **`u'cafe'`** to create a Unicode string; in Python 3 this is the same as `'cafe'`.

## Encoding in utf-8, utf-16, utf-32

We can use `x.encode('utf-8')`, `x.encode('utf-16')`, `x.encode('utf-32')` to encode `x` in utf-8, utf-16 or utf-32 respectively.


Table containing `utf-8` representation 
```
Bytes Bits  Byte representation
1     7      0xxxxxxx            
2     11     110xxxxx    10xxxxxx        
3     16     1110xxxx    10xxxxxx    10xxxxxx    
4     21     11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
```
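The table can be checked by encoding one character from each row. This is a sketch; the sample characters are chosen here for illustration and are not taken from the text.

```python
# One character per row of the table: 1, 2, 3 and 4 bytes in UTF-8
samples = ['a', 'ñ', '€', '😀']

for ch in samples:
    encoded = ch.encode('utf-8')
    bits = ' '.join(format(byte, '08b') for byte in encoded)
    print(ch, hex(ord(ch)), len(encoded), bits)

byte_counts = [len(ch.encode('utf-8')) for ch in samples]
```

In the printed bit patterns, the first byte of each character starts with `0`, `110`, `1110` or `11110`, and every following byte starts with `10`.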


We can choose the type of a string literal by preceding it with:

- `u` for unicode
- `b` for binary (ASCII)

Notice that not all strings can be encoded as ASCII.


In [841]:
u'café'

'café'

In [843]:
b'café'

SyntaxError: bytes can only contain ASCII literal characters. (<ipython-input-843-eb49c049fc5c>, line 1)

In [853]:
'\u0061'

'a'

Examples of encodings in `utf-8` and `utf-16`

In [100]:
cafe_utf8 = 'café'.encode('utf-8')
cafe_utf8

b'caf\xc3\xa9'

In [101]:
cafe_utf8.decode('utf-8')

'café'

In [102]:
cafe_utf16 = 'café'.encode('utf-16')
cafe_utf16

b'\xff\xfec\x00a\x00f\x00\xe9\x00'

In [103]:
cafe_utf16.decode('utf-16')

'café'

# ASCII ENCODING (American Standard Code for Information Interchange)

A natural question might arise: why do we have all these encoding options for strings?
    
Why don't we always use unicode?

It turns out that, for historical reasons (computers were developed mainly in the USA), there was no need for many different characters to represent English text. ASCII uses 7 bits to represent any character. Notice that 2^7 = 128. Therefore in ASCII we can represent up to 128 different characters.

The bijection between `[0,127] <-> ASCII`  can be found in the following table:


##### ASCII table


```

Dec  Char                           Dec  Char     Dec  Char     Dec  Char
---------                           ---------     ---------     ----------
  0  NUL (null)                      32  SPACE     64  @         96  `
  1  SOH (start of heading)          33  !         65  A         97  a
  2  STX (start of text)             34  "         66  B         98  b
  3  ETX (end of text)               35  #         67  C         99  c
  4  EOT (end of transmission)       36  $         68  D        100  d
  5  ENQ (enquiry)                   37  %         69  E        101  e
  6  ACK (acknowledge)               38  &         70  F        102  f
  7  BEL (bell)                      39  '         71  G        103  g
  8  BS  (backspace)                 40  (         72  H        104  h
  9  TAB (horizontal tab)            41  )         73  I        105  i
 10  LF  (NL line feed, new line)    42  *         74  J        106  j
 11  VT  (vertical tab)              43  +         75  K        107  k
 12  FF  (NP form feed, new page)    44  ,         76  L        108  l
 13  CR  (carriage return)           45  -         77  M        109  m
 14  SO  (shift out)                 46  .         78  N        110  n
 15  SI  (shift in)                  47  /         79  O        111  o
 16  DLE (data link escape)          48  0         80  P        112  p
 17  DC1 (device control 1)          49  1         81  Q        113  q
 18  DC2 (device control 2)          50  2         82  R        114  r
 19  DC3 (device control 3)          51  3         83  S        115  s
 20  DC4 (device control 4)          52  4         84  T        116  t
 21  NAK (negative acknowledge)      53  5         85  U        117  u
 22  SYN (synchronous idle)          54  6         86  V        118  v
 23  ETB (end of trans. block)       55  7         87  W        119  w
 24  CAN (cancel)                    56  8         88  X        120  x
 25  EM  (end of medium)             57  9         89  Y        121  y
 26  SUB (substitute)                58  :         90  Z        122  z
 27  ESC (escape)                    59  ;         91  [        123  {
 28  FS  (file separator)            60  <         92  \        124  |
 29  GS  (group separator)           61  =         93  ]        125  }
 30  RS  (record separator)          62  >         94  ^        126  ~
 31  US  (unit separator)            63  ?         95  _        127  DEL
 ```


We can encode a string `s` in `ASCII`  using `s.encode("ascii")`

In [116]:
s = "this is a string"
s = s.encode("ascii")

In [183]:
import sys
sys.getsizeof(s)

49

Remember that `sys.getsizeof(x)` returns the size of object `x` in bytes.


In [158]:
s_ascii   = "t".encode("ascii")
s_unicode = "t"

print("size of a single char in ascii",  sys.getsizeof(s_ascii)) 
print("size of a single char in unicode",sys.getsizeof(s_unicode))

size of a single char in ascii 34
size of a single char in unicode 50


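Notice that the size we get in bytes for a simple `"t"` is much greater than 7 bits. This happens because Python strings and bytes are full Python objects: besides the character data they store bookkeeping information (such as the type, a reference count and the length), so `sys.getsizeof` reports far more than the encoded text alone.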

In [159]:
 sys.getsizeof(s_ascii)

34

In [184]:
sys.getsizeof(b"t")

34
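To separate the fixed object overhead from the actual data, we can compare `bytes` objects of different lengths. The sketch below assumes CPython, where the overhead of a `bytes` object is constant; the exact numbers may differ between Python versions and platforms.

```python
import sys

empty = sys.getsizeof(b"")           # overhead only, no data
one = sys.getsizeof(b"t")            # overhead + 1 byte of data
many = sys.getsizeof(b"t" * 100)     # overhead + 100 bytes of data

print("overhead:", empty)
print("extra memory for 1 byte:", one - empty)
print("extra memory for 99 more bytes:", many - one)
```

Each additional ASCII character costs exactly one byte; the rest is the price of having a Python object.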

There are clever details in the ASCII encoding table. One of the most useful is that upper-case and lower-case letters a,b,c,...
can be recognized easily from the binary representation.

- Capital letters start with `010` (written with 8 bits)
- Notice that in ascii `A` is 65, in binary is `01000001`
- Notice that in ascii `B` is 66, in binary is `01000010`
- Notice that in ascii `C` is 67, in binary is `01000011`


- Lowercase letters start with `011` (written with 8 bits)
- Notice that in ascii `a` is 97, in binary is `01100001`
- Notice that in ascii `b` is 98, in binary is `01100010`
- Notice that in ascii `c` is 99, in binary is `01100011`
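The pattern above means that an upper-case letter and its lower-case version differ in a single bit. A small sketch (names such as `case_bit` are illustrative):

```python
# Print the 8-bit patterns of a few letters side by side
for upper, lower in zip('ABC', 'abc'):
    print(upper, format(ord(upper), '08b'), lower, format(ord(lower), '08b'))

case_bit = ord('a') ^ ord('A')          # the only bit that differs
to_lower = chr(ord('T') | case_bit)     # setting the bit gives lowercase
to_upper = chr(ord('t') & ~case_bit)    # clearing the bit gives uppercase

print(case_bit, to_lower, to_upper)
```

This trick only applies to the ASCII letters; in practice `str.upper` and `str.lower` should be used.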






In [412]:
char = "t".encode("ascii")

In [414]:
aux = bytearray(char)

In [416]:
aux

bytearray(b't')

In [31]:
aux[0]

116

##### doc_0.txt

We can inspect `files/doc_0.txt` which only contains a single letter `a`.

Using **`xxd`** we can inspect the binary document (the byte representation of the document).



In [83]:
!cat files/doc_0.txt

a


In [47]:
!xxd ./files/doc_0.txt

00000000: 610a                                     a.


Notice that the file consists of just a couple of bytes: `610a`.

We can see the Unicode characters that correspond to those bytes.

In [35]:
chr(0x61),chr(0x0a)

('a', '\n')

##### doc_1.txt

The file `files/doc_1.txt` simply contains

```
a dog eats.
a dog can not fly.
```

We can inspect again the bytes that encode that information.


In [82]:
!cat files/doc_1.txt

a dog eats.
a dog can not fly.


In [48]:
!xxd ./files/doc_1.txt

00000000: 6120 646f 6720 6561 7473 2e0a 6120 646f  a dog eats..a do
00000010: 6720 6361 6e20 6e6f 7420 666c 792e 0a    g can not fly..


We can inspect the different bytes and see that we can retrieve the text from the bytes.

In [50]:
chr(0x61),chr(0x20), chr(0x64), chr(0x6f), chr(0x67),  chr(0x20)

('a', ' ', 'd', 'o', 'g', ' ')

##### doc_2.txt

The file `files/doc_2.txt` contains a single letter: `ñ`.

In [77]:
!cat files/doc_2.txt

ñ


In [80]:
!xxd ./files/doc_2.txt

00000000: c3b1 0a                                  ...


Notice though that now we don't see `ñ` using the same technique as before

In [81]:
chr(0xc3)

'Ã'

In order to understand how to interpret the previous sequence of bytes we need to understand unicode and the different types of encodings that exist. 

# Unicode Encoding

The Unicode standard is based on a dictionary where keys are characters and values are integers. For example `unicode("A") = 65`.

In python we have 

- the function **`ord`** which returns the Unicode code point for a given one-character string.
- the function **`chr`** which returns the character string for a given code point. A code point can be passed as a Python integer or as a hexadecimal number.



In reality, keys are characters and values are "code points". Each code point is simply an integer represented in the following format:`U+****`. 

For example, `unicode_codepoint("A")=U+0041`. 

Notice that this dictionary says nothing about how `65` should be stored into Memory (RAM). There are different ways to encode strings to memory. The most revolutionary idea about unicode is the **separation between the mapping of characters-integers from the actual memory representation of the integers**.
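The separation between code point and memory layout can be seen by encoding the same character in several ways. A minimal sketch; the `-be` suffix selects big endian byte order without a byte order mark, and the variable names are illustrative.

```python
code_point = ord('A')   # the integer assigned by Unicode

# The same code point stored with three different memory layouts
layouts = {name: 'A'.encode(name) for name in ('utf-8', 'utf-16-be', 'utf-32-be')}

print(code_point)
for name, raw in layouts.items():
    print(name, raw.hex())
```

The code point is always 65, while the number of bytes used to store it goes from 1 to 4.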


A nice table containing code points for different characters can be found here:

https://pkg.julialang.org/docs/julia/THl1k/1.1.1/manual/unicode-input.html



In [323]:
print("The symbol '{}' has code point associated {} ".format("a", ord("a")))

The symbol 'a' has code point associated 97 


Notice `chr` can get as input hexadecimal numbers and integers

In [860]:
chr(0x41), chr(65)

('A', 'A')

We can also write a Unicode character in Python by writing `"\uXXXX"`

In [373]:
some_unicode_char = u'\u0061'
print(some_unicode_char)

a


We can build the ascii table


In [347]:
for i in range(0,128):
    print(i, chr(i), end="\n")

0  
1 
2 
3 
4 
5 
6 
7 
8 
9 	
10 

11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32  
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
58 :
59 ;
60 <
61 =
62 >
63 ?
64 @
65 A
66 B
67 C
68 D
69 E
70 F
71 G
72 H
73 I
74 J
75 K
76 L
77 M
78 N
79 O
80 P
81 Q
82 R
83 S
84 T
85 U
86 V
87 W
88 X
89 Y
90 Z
91 [
92 \
93 ]
94 ^
95 _
96 `
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
123 {
124 |
125 }
126 ~
127 


We can also observe other characters...

In [299]:
for i in range(50,1100):
    print(chr(i),end="  ")

2  3  4  5  6  7  8  9  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~                                                                       ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬  ­  ®  ¯  °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿  À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï  Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß  à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï  ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ  Ā  ā  Ă  ă  Ą  ą  Ć  ć  Ĉ  ĉ  Ċ  ċ  Č  č  Ď  ď  Đ  đ  Ē  ē  Ĕ  ĕ  Ė  ė  Ę  ę  Ě  ě  Ĝ  ĝ  Ğ  ğ  Ġ  ġ  Ģ  ģ  Ĥ  ĥ  Ħ  ħ  Ĩ  ĩ  Ī  ī  Ĭ  ĭ  Į  į  İ  ı  Ĳ  ĳ  Ĵ  ĵ  Ķ  ķ  ĸ  Ĺ  ĺ  Ļ  ļ  Ľ  ľ  Ŀ  ŀ  Ł  ł  Ń  ń  Ņ  ņ  Ň  ň  ŉ  Ŋ  ŋ  Ō  ō  Ŏ  ŏ  Ő  ő  Œ  œ  Ŕ  ŕ  Ŗ  ŗ  Ř  ř  Ś  ś  Ŝ  ŝ  Ş  ş  Š  š  Ţ  ţ  Ť  ť  Ŧ  ŧ  Ũ  ũ  Ū  ū  Ŭ  ŭ  Ů  ů  Ű  ű  Ų  ų  Ŵ  ŵ  Ŷ  ŷ  Ÿ  Ź  ź  Ż  ż  Ž  ž  ſ

In [281]:
# https://www.compart.com/en/unicode/U+4005
'\u4005'

'䀅'

In [284]:
"\u2640"

'♀'

In [12]:
chr(40863)

'龟'

In [31]:
turtle_hex = hex(40863)
turtle_hex

'0x9f9f'

In [26]:
"\u9F9F"

'龟'

In [86]:
# Curiosity: "\u0FD6" is a Tibetan character (I won't print it)

##### Going back to `doc_2.txt` 

In [89]:
!cat files/doc_2.txt

ñ


In [102]:
!xxd files/doc_2.txt

00000000: c3b1 0a                                  ...


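Notice that we can read the previous file without specifying the encoding.
That is because it was encoded as UTF-8, which is the default encoding Python uses here (when no `encoding` is given, `open` uses the platform's preferred encoding, which is UTF-8 on this system).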

In [145]:
with  open("./files/doc_2.txt") as file:
    print(file.read())

ñ



We can see that in UTF-8, `ñ` corresponds to the two bytes `\xc3\xb1`.

In [146]:
bytes('ñ','utf-8')

b'\xc3\xb1'

Nevertheless with other encodings this is not the case

In [147]:
bytes('ñ','latin-1')

b'\xf1'

Now we will create a file containing the word `castaña` with encoding `latin-1`.

We will see that it is not correctly interpreted if we try to read it without specifying an encoding.

In [126]:
with open("./files/my_doc.txt", "wb") as file:
    file.write("castaña".encode("latin-1"))

In [128]:
!cat files/my_doc.txt

casta�a

If we don't specify the encoding python can't even read the file!

In [139]:
with open("./files/my_doc.txt") as file:
    aux = file.read()
    print(aux)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

If we specify the encoding then it works as expected

In [142]:
with open("./files/my_doc.txt", encoding="latin-1") as file:
    aux = file.read()
    print(aux)

castaña
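The same behaviour can be reproduced without any file, working directly on the bytes. The sketch below also uses the `errors="replace"` option of `decode`, which substitutes undecodable bytes with the replacement character `�` (U+FFFD), similar to what the terminal showed above.

```python
raw = "castaña".encode("latin-1")

# Decoding with the wrong codec, replacing what cannot be decoded
lossy = raw.decode("utf-8", errors="replace")

# Decoding with the codec that was used for encoding
correct = raw.decode("latin-1")

print(raw)
print(lossy)
print(correct)
```

Notice that `errors="replace"` avoids the exception but loses information: the original `ñ` cannot be recovered from `lossy`.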



##### Character representations

There are 3 well known character representations for unicode codepoints: UTF-32, UTF-16, UTF-8.

## UTF-32

- The memory layout for a character `c` is simply the binary representation of `c` using 4 bytes.
- UTF-32 needs for every character 4 bytes (32 bits). It is a fixed-size character encoding.
- There are no markers inside bytes to define when a new character starts (there is no need, all characters are stored in memory using the same  number of bytes).

Examples 

If we are given that `unicode("A") = 65` and `unicode("B") = 66`  then

```
encoding(A, utf-32) = [[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,1,0,0,0,0,0,1]]
```

```
encoding(B, utf-32) = [[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,1,0,0,0,0,1,0]]
```

Notice that all English text that could be encoded with 7 bits per character using ASCII now needs 4 times the amount of memory needed for the same information in ASCII (assuming each ASCII character is stored in one byte). This happens because every character has 3 bytes that are 0 and then the last byte is simply the ASCII representation of the character. Therefore a lot of memory is "wasted" (simply full of zeros).

A nice mnemonic to remember how UTF-32 encodes the information is that it always uses 4 bytes (32 bits) for any character.
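The fixed size can be verified in Python. This sketch uses the `utf-32-be` codec (big endian, no byte order mark) so that the bytes match the layout shown above; variable names are illustrative.

```python
text = 'AB'
raw = text.encode('utf-32-be')

print(raw)
print(len(raw), 4 * len(text))

# A character outside the ASCII range still takes exactly 4 bytes
wide = '😀'.encode('utf-32-be')
print(len(wide))
```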

## UTF-16

- For characters that fit in 2 bytes, the memory layout of a character `c` is simply the binary representation of its code point using 2 bytes; all other characters are stored as a pair of 2-byte units (4 bytes).
- UTF-16 uses  4 bytes (32 bits) or 2 bytes (16 bits) for every character. It is a variable-size character encoding.
- There are markers at the beginning of a file in UTF-16 that define how the encoding is done (BE or LE, more details below).

The advantage of UTF-16 over UTF-32 is that a text file does not need 4 times the space needed to store the same text in ASCII (only twice the amount).

Notice that a file containing only ASCII text, when converted to UTF-16 will result in a file size of double the original.

###### BE and LE 
Notice that there are 2 ways to write an ASCII character in UTF-16: writing the zero byte before or after.

The "big endian" byte order stores the most significant byte first (the leftmost byte).

The "little endian" byte order stores the least significant byte first (the rightmost byte).

If we are given that `unicode("A") = 65` and `unicode("B") = 66`  then


```
encoding(A, UTF-16BE) = [[0,0,0,0,0,0,0,0], [0,1,0,0,0,0,0,1]]
```


```
encoding(A, UTF-16LE) = [[0,1,0,0,0,0,0,1], [0,0,0,0,0,0,0,0]]
```

How can we know if a file uses `BE` or `LE` ? 

A byte order mark was introduced. 

- Little Endian files start with ```[1,1,1,1,1,1,1,1], [1,1,1,1,1,1,1,0]```.

- Big Endian files start with ```[1,1,1,1,1,1,1,0], [1,1,1,1,1,1,1,1] ```.


Byte order marks are not mandatory. If none is found, a parser will assume one of the encodings and, if it encounters an error, will use the other encoding (and parse again).
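Python exposes both byte orders as separate codecs and the byte order marks as constants in the `codecs` module. A small sketch (variable names are illustrative); the plain `utf-16` codec writes a mark in the machine's native byte order.

```python
import codecs

big = 'A'.encode('utf-16-be')       # zero byte first, no mark
little = 'A'.encode('utf-16-le')    # zero byte last, no mark
with_bom = 'A'.encode('utf-16')     # native byte order, preceded by a mark

print(big, little, with_bom)
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# The generic decoder reads the mark and picks the right byte order
decoded_be = (codecs.BOM_UTF16_BE + big).decode('utf-16')
decoded_le = (codecs.BOM_UTF16_LE + little).decode('utf-16')
```

In hexadecimal the marks are `feff` (big endian) and `fffe` (little endian), which are the bit patterns listed above.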


##  UTF-8

- For ASCII characters, the memory layout of a character `c` is simply the ASCII binary representation of `c` stored in 8 bits.
- UTF-8 can use 1,2,3 or 4 bytes for a given character. It is a variable-size character encoding.


- In order to parse a stream of bytes in UTF-8 there are some markers inside the bytes.
- If a byte starts with 0 it means the character was encoded in a single byte (it's an ASCII char).
- If a byte starts with 110 it means the character was encoded in 2 bytes.
- If a byte starts with 1110 it means the character was encoded in 3 bytes.
- If a byte starts with 11110 it means the character was encoded in 4 bytes.


There are two types of marks.

In the following example we know a character uses 2 bytes because the first byte starts with 110. A byte carrying this kind of mark is called the leading byte; a byte starting with 10 is called a continuation byte.
```
110XXXXX 10XXXXXX
```
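The marks can be read with bit shifts. The function `byte_kind` below is a hypothetical helper written for this illustration (it is not part of Python); it only looks at the leading bits and does not validate the stream.

```python
def byte_kind(byte):
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if byte >> 7 == 0b0:
        return 'single'          # 0xxxxxxx
    if byte >> 6 == 0b10:
        return 'continuation'    # 10xxxxxx
    if byte >> 5 == 0b110:
        return 'lead-2'          # 110xxxxx
    if byte >> 4 == 0b1110:
        return 'lead-3'          # 1110xxxx
    if byte >> 3 == 0b11110:
        return 'lead-4'          # 11110xxx
    return 'invalid'

# 'a' uses 1 byte, 'ñ' uses 2 bytes and '€' uses 3 bytes
kinds = [byte_kind(b) for b in 'añ€'.encode('utf-8')]
print(kinds)
```

Because continuation bytes are always recognizable, a parser that starts in the middle of a stream can skip forward to the next leading byte.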




Unicode can use more bits than ASCII, allowing more characters. For example we can check `café` cannot be encoded as `ascii`:


In [187]:
'café'.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)

but there is no problem encoding it as `utf-8`

In [71]:
'café'.encode('utf-8')

b'caf\xc3\xa9'

##### utf-8 format


- The utf-8 format uses one or more bytes to represent each character.

- The number of bytes in utf-8 is **variable**.

- The leading bits of the first byte tell us how many bytes were used to encode a value.

- With 1-byte encodings we can represent numbers from `0` to `127=2**7-1` (the first bit, `0`, tells us
  a single byte is used to represent the character).

To make binary representations of bytes easier to read we will mark the middle of each byte (after the 4th bit) with a vertical bar `|`.


##### 1 byte encodings: 

1-byte encodings have the following form:

`[0,*,*,* | *,*,*,*]`

This is the same as ASCII

##### 2 byte encodings: 

 2-byte encodings  have the following form: 

`[ [1,1,0,* | *,*,*,*], [1,0,*,* | *,*,*,*] ]`

Notice

- The first 3 bits of the first byte are set to `110`
- The first 2 bits of the second byte are set to `10`
- The remaining `11`  bits (`8*2 -3-2`) encode the actual character.
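As a check, we can recover the code point of `ñ` from its two UTF-8 bytes by removing the marks and joining the remaining 11 bits. This is a sketch with illustrative variable names.

```python
first, second = 'ñ'.encode('utf-8')       # the bytes 0xc3 and 0xb1

payload_high = first & 0b00011111         # drop the 110 mark, keep 5 bits
payload_low = second & 0b00111111         # drop the 10 mark, keep 6 bits
code_point = (payload_high << 6) | payload_low

print(format(first, '08b'), format(second, '08b'))
print(code_point, hex(code_point), chr(code_point))
```

The result is 241 (`0xf1`), which is exactly `ord('ñ')`.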



##### 3 byte encodings: 

3-byte encodings  have the following form: 

`[ [1,1,1,0 | *,*,*,*], [1,0,*,* | *,*,*,*], [1,0,*,* | *,*,*,*] ]`

Notice

- The first 4 bits of the first byte are set to `1110`
- The first 2 bits of the second byte are set to `10`
- The first 2 bits of the third byte are set to `10`
- The remaining `16`  bits (`8*3 -4-2-2`) encode the actual character.


##### 4 byte encodings: 

4-byte encodings  have the following form: 

`[ [1,1,1,1 | 0,*,*,*], [1,0,*,* | *,*,*,*], [1,0,*,* | *,*,*,*], [1,0,*,* | *,*,*,*] ]`

Notice

- The first 5 bits of the first byte are set to `11110`
- The first 2 bits of the second byte are set to `10`
- The first 2 bits of the third byte are set to `10`
- The first 2 bits of the fourth byte are set to `10`
- The remaining `21`  bits (`8*4 -5-2-2-2`) encode the actual character.
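The four layouts can be put together in a small encoder. `utf8_encode` is a hypothetical function written for illustration; it assumes a valid code point and does no validation (for instance, it does not reject surrogates or values above `0x10FFFF`). In practice `str.encode('utf-8')` should be used.

```python
def utf8_encode(code_point):
    """Encode one code point following the 1- to 4-byte layouts above."""
    if code_point < 0x80:                       # fits in 7 bits
        return bytes([code_point])
    if code_point < 0x800:                      # fits in 11 bits
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:                    # fits in 16 bits
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),      # fits in 21 bits
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

# Compare against Python's own encoder for 1, 2, 3 and 4 byte characters
checks = [utf8_encode(ord(ch)) == ch.encode('utf-8') for ch in 'añ€😀']
print(checks)
```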



##### Binary, Decimal and Hexadecimal number representations

```
     Binary    Decimal    Hex    
     0         0          0      
     1         1          1      
     10        2          2      
     11        3          3      
     100       4          4      
     101       5          5      
     110       6          6      
     111       7          7      
     1000      8          8      
     1001      9          9      
     1010      10         A      
     1011      11         B      
     1100      12         C      
     1101      13         D      
     1110      14         E      
     1111      15         F      
```

If  you want to convert an integer to hexadecimal you can use `hex`.

Remember that in hexadecimal we use symbols ` 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F`

In [348]:
hex(65)

'0x41'

In [367]:
# 0x41 to base 10
4*16 + 1

65

In [369]:
chr(221)

'Ý'

In [371]:
hex(221)

'0xdd'

In [372]:
0xdd

221

##### Memory requirements for different encodings

In [77]:
# Different encodings give different representations in binary
# The different representations use a different amount of bytes
import sys 

# x_ascii = 'café'.encode('ascii') would fail: accented characters are not present in ASCII
word_utf8  = 'café'.encode('utf-8')
word_utf16 = 'café'.encode('utf-16')
word_utf32 = 'café'.encode('utf-32')

[sys.getsizeof(x) for x in [word_utf8,word_utf16,word_utf32]]

[38, 43, 53]

In [78]:
[x for x in [word_utf8,word_utf16,word_utf32]]

[b'caf\xc3\xa9',
 b'\xff\xfec\x00a\x00f\x00\xe9\x00',
 b'\xff\xfe\x00\x00c\x00\x00\x00a\x00\x00\x00f\x00\x00\x00\xe9\x00\x00\x00']

In [79]:
ex  = 'cafe'.encode('ascii')
sys.getsizeof(ex)

37

## Unicode data 

Unicode is a standard that assigns a number to each symbol. Unicode supports over a million symbols (or characters). Each character is assigned a number, called a code point. Code points up to `FFFF` are written in Python as `\uXXXX`, where XXXX is the number in four-digit hexadecimal form.

A font (like 'Times New Roman') is a mapping from characters to images (glyphs).

We can manipulate unicode strings as 'normal' strings.

Some  important things to consider:

- When reading data from files, expect bytes and decode them with **`b.decode('utf-8')`**.

- When writing data back to a file, encode it with **`s.encode('utf-8')`**.

- Avoid using **`str()`** or  **`bytes()`** without an encoding to convert between types.
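The advice above can be sketched as a round trip through a file opened in binary mode. The file name `example.txt` and the temporary folder are assumptions made only for this illustration.

```python
import os
import tempfile

text = 'castaña'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt')

    # Writing: encode the string explicitly and write bytes
    with open(path, 'wb') as file:
        file.write(text.encode('utf-8'))

    # Reading: read bytes and decode them explicitly
    with open(path, 'rb') as file:
        raw = file.read()

decoded = raw.decode('utf-8')
print(raw, decoded)
```

Being explicit about the encoding on both sides avoids depending on the default encoding of the machine where the code runs.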

In [217]:
some_unicode_char = u'\u0061'

In [218]:
some_unicode_char 

'a'

In [219]:
print(some_unicode_char)

a


##### `ord` and `chr` functions



In [228]:
?ord

In [223]:
# \epsilon + tab
ord("ϵ")

1013

In [227]:
# \epsilon + tab
chr(1013)

'ϵ'

## Checking properties of strings and characters



In [312]:
'A'.isupper(), 'a'.isupper(), 'a'.isdigit(), '1'.isdigit()

(True, False, False, True)


# String similarity with fuzzywuzzy

In [290]:
import fuzzywuzzy

In [293]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [302]:
x = 'this is a string'
y = 'this is a string!'
z = 'this is also an string'

In [311]:
fuzz.ratio(x, y), fuzz.ratio(x, z)

(97, 84)
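If `fuzzywuzzy` is not installed, a similar score can be computed with the standard library. The sketch below swaps in `difflib.SequenceMatcher`, which is a different tool from `fuzz.ratio`; the helper `similarity` is a hypothetical name, and its scores need not agree with those of `fuzzywuzzy` in every case.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

x = 'this is a string'
y = 'this is a string!'
z = 'this is also an string'

score_xy = similarity(x, y)
score_xz = similarity(x, z)
print(score_xy, score_xz)
```

The ratio is twice the number of matching characters divided by the total length of both strings, so `x` and `y` score `2*16/33`, which rounds to 97.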