# Text versus Bytes

Python 3 draws a sharp distinction between strings of human text and sequences of raw bytes.

## What is a character?

-  Python 3: str holds Unicode characters, like the Python 2 unicode object.
-  Python 2: the unicode object holds Unicode characters.
-  Python 2: str holds raw bytes.

The Unicode standard explicitly separates the identity of characters from specific byte representations.

-  A Unicode code point is a number from 0 to 1,114,111 (base 10).
-  It is written in the Unicode standard as 4 to 6 hexadecimal digits with a “U+” prefix.
-  About 10% of the valid code points have characters assigned to them.
-  The actual bytes that represent a character depend on the encoding in use, where an encoding is an algorithm that converts code points to byte sequences and vice versa.
    -  The code point for A (U+0041) is encoded as the single byte \x41 in UTF-8 and as the two bytes \x41\x00 in UTF-16LE.
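
The relationship between a code point and its bytes can be checked interactively. The following is a minimal sketch using only built-ins; the variable names are illustrative and not part of the original notes.

```python
# A code point is just an integer; ord() and chr() convert between
# a one-character str and its code point.
code_point = ord('A')
print(f'U+{code_point:04X}')        # U+0041

# The same code point becomes different bytes under different encodings.
utf8_bytes = 'A'.encode('utf_8')
utf16_bytes = 'A'.encode('utf_16le')
print(utf8_bytes)                   # b'A'      (the single byte \x41)
print(utf16_bytes)                  # b'A\x00'  (\x41 followed by \x00)

# The highest valid code point is 1,114,111 (U+10FFFF).
highest = chr(0x10FFFF)
print(ord(highest))                 # 1114111
```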

```python
# 'café' ends with a non-ASCII character, é (U+00E9)
s = 'café'
# the str has 4 Unicode characters
print(len(s))
# encode the str to bytes using UTF-8
b = s.encode('utf8')
print(b)
# é takes 2 bytes in UTF-8, so the bytes object has 5 items
print(len(b))
# decode the bytes back to a str
print(b.decode('utf8'))
```

```
4
b'caf\xc3\xa9'
5
café
```


## New binary sequences in Py3

-  bytes (immutable; each item is an int from 0 to 255)
-  bytearray (mutable; each item is an int from 0 to 255)
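
The practical difference between the two types shows up on item assignment. A small sketch, with hypothetical variable names chosen for illustration:

```python
frozen = bytes([99, 97, 102])        # b'caf'
changeable = bytearray(frozen)       # mutable copy of the same bytes

# Each item is an int in range(256).
first_item = frozen[0]               # 99

# bytearray supports item assignment and in-place growth.
changeable[0] = ord('C')
changeable.extend(b'e')
print(changeable)                    # bytearray(b'Cafe')

# bytes rejects item assignment with a TypeError.
try:
    frozen[0] = ord('C')
except TypeError as exc:
    assignment_error = exc
    print('bytes is immutable:', exc)
```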

```python
# build bytes from a str; the precomposed é (U+00E9) encodes to \xc3\xa9 in UTF-8
# (\xcc\x81 would be the combining acute accent, U+0301)
cafe = bytes('café', encoding='utf_8')
# displays as a bytes literal with \x escapes, not as U+ code points
print(cafe)
# indexing returns an int: 99 is the ASCII code for 'c'
print(cafe[0])
# slicing returns an object of the same type, even for a slice of length 1
print(cafe[:1])
cafe_arr = bytearray(cafe)
# a bytearray displays as bytearray(b'...'); 'caf' is in the printable ASCII range, so it is shown as characters
print(cafe_arr)
# cafe_arr has 5 bytes, 2 of them for é
print(len(cafe_arr))
# so a slice holding the last item contains the second byte of é, \xa9
print(cafe_arr[-1:])
```

```
b'caf\xc3\xa9'
99
b'c'
bytearray(b'caf\xc3\xa9')
5
bytearray(b'\xa9')
```


When a binary sequence is displayed, each byte is shown in one of three ways:
-  For bytes in the printable ASCII range, from space to ~, the ASCII character itself is used.
-  For bytes corresponding to tab, newline, carriage return and \, the escape sequences \t, \n, \r and \\ are used.
-  For every other byte value, a hexadecimal escape sequence is used, e.g. \x00 is the null byte.
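
These display rules can be seen by building a bytes object that contains one byte of each kind. A short sketch; the sample values are chosen only for illustration:

```python
# 65 and 126 are printable ASCII ('A' and '~'); 9, 10, 13 are tab, newline
# and carriage return; 92 is the backslash; 0 and 255 fall outside the
# printable range and get hexadecimal escapes.
sample = bytes([65, 126, 9, 10, 13, 92, 0, 255])
shown = repr(sample)
print(shown)    # b'A~\t\n\r\\\x00\xff'
```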

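Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few that depend on Unicode data (casefold, isdecimal, isidentifier, isnumeric, isprintable and encode). This means that you can use familiar string methods like endswith, replace, strip, translate and upper.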
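
As a rough illustration, the familiar methods behave as they do on str, as long as the arguments are bytes too. The header line below is made-up sample data, and the list of checked method names is only a sample:

```python
raw = b'  Content-Type: text/html \r\n'

# strip() removes leading and trailing ASCII whitespace.
cleaned = raw.strip()
print(cleaned)                       # b'Content-Type: text/html'
print(cleaned.endswith(b'html'))     # True

# Methods can be chained; every argument is a bytes literal.
shouted = cleaned.replace(b'html', b'plain').upper()
print(shouted)                       # b'CONTENT-TYPE: TEXT/PLAIN'

# Formatting and Unicode-dependent methods are not defined on bytes.
candidates = ('format', 'format_map', 'casefold', 'isnumeric')
missing = [name for name in candidates if not hasattr(bytes, name)]
print(missing)
```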

Other ways of building bytes or bytearray instances are calling their constructors with:
-  a str and an encoding keyword argument.
-  an iterable providing items with values from 0 to 255.
-  a single integer, to create a binary sequence of that size initialized with null bytes.
-  an object that implements the buffer protocol (e.g. bytes, bytearray, memoryview, array.array); this copies the bytes from the source object to the newly created binary sequence.
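
The four constructor forms listed above can be sketched side by side. The variable names are illustrative, and bytearray accepts the same arguments:

```python
# 1. A str plus an encoding.
from_text = bytes('café', encoding='utf_8')
print(from_text)                     # b'caf\xc3\xa9'

# 2. An iterable of ints in range(256).
from_ints = bytes([99, 97, 102])
print(from_ints)                     # b'caf'

# 3. A single integer: that many null bytes.
zeroed = bytes(4)
print(zeroed)                        # b'\x00\x00\x00\x00'

# 4. An object supporting the buffer protocol; the data is copied.
source = bytearray(b'caf')
from_buffer = bytes(memoryview(source))
source[0] = ord('C')                 # changing the source leaves the copy alone
print(from_buffer)                   # b'caf'

# Values outside 0..255 are rejected with a ValueError.
try:
    bytes([256])
except ValueError as exc:
    range_error = exc
```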

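```python
# build bytes by parsing pairs of hexadecimal digits (spaces are optional)
print(bytes.fromhex('31 4B CE A9'))

# initialize bytes from the raw data of an array
import array
# typecode 'h' creates an array of signed 16-bit integers
numbers = array.array('h', [-2, -1, 0, 1, 2])
# copy the 10 bytes that make up the 5 numbers
octets = bytes(numbers)
# the byte order shown below is from a little-endian machine
print(octets)
```

```
b'1K\xce\xa9'
b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
```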

## Structs and memory views

