# Introduction

## Character Issues

The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa. The code point for A (U+0041) is encoded as the single byte \x41 in the UTF-8 encoding, or as the two bytes \x41\x00 in the UTF-16LE encoding. As another example, the Euro sign (U+20AC) becomes three bytes in UTF-8 (\xe2\x82\xac), but in UTF-16LE it is encoded as two bytes: \xac\x20.
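As a quick check of the byte sequences quoted above, here is a minimal sketch that encodes both code points with `str.encode`. The variable names are illustrative only, and `bytes.hex()` is used so the UTF-16LE bytes are not shown as printable ASCII.

```python
# Encode two code points with two encodings and compare the byte sequences.
letter_a = '\u0041'  # LATIN CAPITAL LETTER A
euro = '\u20ac'      # EURO SIGN

a_utf8 = letter_a.encode('utf_8')
a_utf16le = letter_a.encode('utf_16le')
euro_utf8 = euro.encode('utf_8')
euro_utf16le = euro.encode('utf_16le')

# hex() shows the raw byte values regardless of how repr() would print them
print(a_utf8.hex(), a_utf16le.hex())        # 41 4100
print(euro_utf8.hex(), euro_utf16le.hex())  # e282ac ac20
```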

In [2]:
s = 'café'
b = s.encode('utf8')
'len s = {} is {}, len b = {} is {}'.format(s,len(s),b,len(b))

"len s = café is 4, len b = b'caf\\xc3\\xa9' is 5"

In [8]:
# In bytes and bytearray, each item is an integer from 0 to 255; however, a slice produces a binary sequence of the same type
# Most string methods are also available on binary sequences
cafe = bytes('café', encoding='utf-8')
cafe,cafe[0],cafe[:1]

(b'caf\xc3\xa9', 99, b'c')

In [10]:
cafe_arr = bytearray(cafe)
cafe_arr,cafe_arr[-1:]

(bytearray(b'caf\xc3\xa9'), bytearray(b'\xa9'))

In [20]:
bytes.fromhex('31 4b ce a9')

b'1K\xce\xa9'

## Structs and memory views

In [23]:
# Using memoryview and struct to inspect a GIF image header

import struct
fmt = '<3s3sHH'
with open('test.gif','rb') as fp:
    img = memoryview(fp.read())

header = img[:10]
bytes(header)

b"GIF89a\xf2\x00'\x00"

In [25]:
struct.unpack(fmt, header)

(b'GIF', b'89a', 242, 39)
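If no `test.gif` is at hand, the same unpacking can be tried on a synthetic header. The sketch below assumes the ten header bytes shown above and rebuilds them with `struct.pack`. It does not read a real image file.

```python
import struct

# '<' little-endian; two 3-byte strings (type, version); two unsigned shorts (width, height)
fmt = '<3s3sHH'

# Synthetic header built in memory (an assumption standing in for test.gif)
raw = struct.pack(fmt, b'GIF', b'89a', 242, 39)

view = memoryview(raw)
header = view[:struct.calcsize(fmt)]  # slicing a memoryview does not copy bytes
fields = struct.unpack(fmt, header)

print(bytes(header))  # b"GIF89a\xf2\x00'\x00"
print(fields)         # (b'GIF', b'89a', 242, 39)
```

`struct.calcsize(fmt)` returns 10 here, which is why the first ten bytes are enough for this format string.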

## Basic encoders/decoders

The Python distribution bundles more than 100 codecs (encoders/decoders) for text to
byte conversion and vice versa. Each codec has a name, like 'utf_8', and often aliases,
such as 'utf8', 'utf-8' and 'U8', which you can use as the encoding argument in
functions like open(), str.encode(), bytes.decode() and so on.
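One way to confirm that these aliases name the same codec is to resolve them with `codecs.lookup` from the standard library. This is a minimal sketch, and the dictionary names are illustrative.

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# codecs.lookup() returns a CodecInfo object; its .name is the canonical codec name
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

# Every alias should also produce identical bytes for the same text
encoded = {alias: 'ñ'.encode(alias) for alias in aliases}

for alias in aliases:
    print(alias, canonical[alias], encoded[alias], sep='\t')
```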

In [31]:
for codec in ['latin_1','utf_8','utf_16']:
    encoded = 'El Niño se río'.encode(codec)
    print(codec,encoded, len(encoded), sep='\t')

latin_1	b'El Ni\xf1o se r\xedo'	14
utf_8	b'El Ni\xc3\xb1o se r\xc3\xado'	16
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00 \x00s\x00e\x00 \x00r\x00\xed\x00o\x00'	30


## Encoding & Decoding Files

In [4]:
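# Write the file as UTF-8, then read it back with the platform default encoding,
# which is not UTF-8 on this machine, hence the garbled result
open('cafe.txt', 'w', encoding='utf_8').write('café')
open('cafe.txt', 'r').read()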

'cafÃ©'

In [5]:
# Always pass the encoding explicitly when opening text files
open('cafe.txt','r', encoding='utf_8').read()

'café'
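The mismatch above can be reproduced on any platform by naming the wrong codec explicitly. The sketch below assumes that the garbled read used a cp1252-style default, which is consistent with the output shown but not confirmed by it. It writes to a temporary directory and uses `with` blocks so the files are closed promptly.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cafe.txt')

    # Write four characters as UTF-8: 'é' takes two bytes, so the file holds five
    with open(path, 'w', encoding='utf_8') as fp:
        fp.write('café')
    size = os.path.getsize(path)

    # Reading with a mismatched codec (cp1252 is an assumed example) garbles the text
    with open(path, 'r', encoding='cp1252') as fp:
        wrong = fp.read()

    # Reading with the codec used for writing restores the original text
    with open(path, 'r', encoding='utf_8') as fp:
        right = fp.read()

print(size, repr(wrong), repr(right))  # 5 'cafÃ©' 'café'
```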

## Normalizing sequences

In [7]:
from unicodedata import normalize
s1 = 'café'  # composed "e" with acute accent
s2 = 'cafe\u0301'  # decomposed "e" and acute accent
s1==s2

False

In [9]:
normalize('NFC',s1) == normalize('NFC',s2)

True
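To see what the normalization does to each string, the lengths under NFC and NFD can be compared. This is a small sketch reusing the two strings above, and the `lengths` dictionary is illustrative.

```python
from unicodedata import normalize

s1 = 'caf\u00e9'   # composed: "é" is a single code point
s2 = 'cafe\u0301'  # decomposed: "e" followed by COMBINING ACUTE ACCENT

# NFC composes to the shortest equivalent string; NFD expands into base + combining marks
lengths = {form: (len(normalize(form, s1)), len(normalize(form, s2)))
           for form in ('NFC', 'NFD')}

print(len(s1), len(s2))  # 4 5
print(lengths)           # {'NFC': (4, 4), 'NFD': (5, 5)}
```

Either form works for comparisons as long as both strings are normalized with the same one.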

## The Unicode Database

In [20]:
import unicodedata
import re

re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print('U+%04x' % ord(char),
          char.center(6),
          're_dig' if re_digit.match(char) else '_',
          'isdig' if char.isdigit() else '_',
          'isnum' if char.isnumeric() else '_',
          format(unicodedata.numeric(char), '5.2f'),
          unicodedata.name(char),
          sep='\t')


U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	_	_	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	_	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	_	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	_	_	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	_	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	_	_	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	_	_	isnum	 6.00	CIRCLED IDEOGRAPH SIX
