# Chapter 4: Text and Bytes

Humans use text. Computers speak bytes.

Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes (this is also one of the main differences from Python 2).

This chapter covers:
- Unicode strings
- binary sequences
- encodings used to convert between them

## 1. Character Issues

A string is a sequence of characters. That raises the question: what is a character?

The Unicode standard separates the **identity of characters** from their **specific byte representations**:
- the identity of a character - **code point**
    - Written as the prefix U+ followed by 4 to 6 hexadecimal digits, e.g. U+20AC (euro sign)
    - About 10% of the valid code points have characters assigned to them
    - A Unicode str can be regarded as "human text"

- byte representation - **encoding**:
    - An encoding is an algorithm that converts code points to byte sequences and vice versa.
    - Converting from code points to bytes is encoding. We **encode** str to bytes for storage or transmission.
    - Converting from bytes to code points is decoding. We **decode** bytes to str to get human-readable text.

In [1]:
s = 'café'
len(s)  # 4 Unicode characters

4

In [7]:
b = s.encode('utf8')  # Encoder using UTF-8
b

b'caf\xc3\xa9'

In [8]:
len(b)  # é is one code point but is encoded as two bytes in UTF-8, so the length is 5

5

In [9]:
b.decode('utf8')  # decode utf-8 bytes to str using UTF-8

'café'

#### how to print the code points?
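One way to answer this, as a minimal sketch: the built-in ord() returns the integer code point of a one-character str, and formatting that integer as at least four uppercase hexadecimal digits gives the conventional U+ notation. The variable names are illustrative, and é is written as the escape \u00e9 so the example does not depend on how this file is encoded.

```python
s = 'caf\u00e9'  # 'café' with a precomposed é (U+00E9)

# ord() maps each character to its integer code point
code_points = [f'U+{ord(ch):04X}' for ch in s]
print(code_points)  # ['U+0063', 'U+0061', 'U+0066', 'U+00E9']

# chr() is the inverse: integer code point -> character
rebuilt = ''.join(chr(int(cp[2:], 16)) for cp in code_points)
print(rebuilt == s)  # True
```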

## 2. Byte Essentials

As introduced above, converting a code point to one or more bytes is called encoding. Next we look at bytes themselves.

Python 3 has two basic built-in types for **binary sequences**:
- immutable **bytes**
- mutable **bytearray**

Each item in **bytes** or **bytearray** is an integer from 0 to 255.

A slice of a binary sequence always produces a binary sequence of the same type. Lists behave the same way: a list may hold integers, but a slice of that list is still a list, even when the slice contains a single element.

In [10]:
cafe = bytes('café', encoding='utf_8')  # building bytes from a str requires an encoding
cafe

b'caf\xc3\xa9'

In [11]:
len(cafe)  # 5 items

5

In [12]:
for c in cafe:  # each item is an integer from 0 to 255
    print(c)

99
97
102
195
169


In [13]:
cafe[0]  # retrieves an int

99

In [14]:
cafe[:1]  # Slices of bytes are also bytes — even slices of a single byte.


b'c'

In [15]:
cafe[:1] == cafe[:1]  # equal one-byte slices; cafe[0] == cafe[:1] would be False (int vs bytes)

True

In [16]:
cafe_arr = bytearray(cafe)  # There is no literal syntax for bytearray
cafe_arr

bytearray(b'caf\xc3\xa9')

#### what is a literal syntax?
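A literal is notation that creates a value directly in source code, without calling a constructor. As a brief sketch of the difference (the variable names are illustrative): bytes has the b'...' literal form, while a bytearray can only be built by calling bytearray().

```python
literal_bytes = b'caf\xc3\xa9'          # bytes literal: the b prefix is the syntax
built_array = bytearray(literal_bytes)  # bytearray: only a constructor call is available

built_array[0] = ord('C')               # bytearray is mutable, bytes is not
print(literal_bytes)  # b'caf\xc3\xa9'
print(built_array)    # bytearray(b'Caf\xc3\xa9')
```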

In [17]:
cafe_arr[-1:]  # a slice of a bytearray is also a bytearray

bytearray(b'\xa9')

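As the examples above show, a bytes object is really a sequence of integers, yet its display does not show the integers directly. Each byte value is displayed in one of three ways:
1. Printable ASCII range (from space to ~): the ASCII character itself
2. Tab, newline, carriage return, and backslash: the escape sequences \t, \n, \r, \\
3. Every other value: a hexadecimal escape sequence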
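The three display rules can be seen together in a small sketch. The byte values below were picked for illustration: two printable ASCII characters, the four escaped characters, and two values outside the printable range.

```python
# 65 'A' and 126 '~' are printable; 9, 10, 13, 92 are tab, newline,
# carriage return, backslash; 0 and 255 fall outside printable ASCII
sample = bytes([65, 126, 9, 10, 13, 92, 0, 255])

shown = repr(sample)
print(shown)  # b'A~\t\n\r\\\x00\xff'
```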

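Bytes support the vast majority of str methods:
- The exceptions include casefold, isdecimal, isidentifier, isnumeric, and isprintable
- Methods such as endswith, replace, strip, translate, and upper all work; in addition, the functions in re also work on binary sequences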
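A short sketch of these str-like methods on bytes, with a made-up sample value. Note that the arguments must themselves be bytes, that upper() only changes ASCII letters, and that with a bytes pattern \w in re matches only ASCII word characters.

```python
import re

data = b'  caf\xc3\xa9 menu.TXT  '   # hypothetical sample data

trimmed = data.strip()               # b'caf\xc3\xa9 menu.TXT'
upper = trimmed.upper()              # only ASCII letters change
has_suffix = trimmed.endswith(b'.TXT')
replaced = trimmed.replace(b'menu', b'list')

# A bytes pattern (rb'...') is required when searching bytes
words = re.findall(rb'\w+', trimmed)

print(upper)       # b'CAF\xc3\xa9 MENU.TXT'
print(has_suffix)  # True
print(replaced)    # b'caf\xc3\xa9 list.TXT'
print(words)       # [b'caf', b'menu', b'TXT']
```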

In addition, binary sequences have one class method that str does not support:
- fromhex

In [23]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'
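As a small follow-up sketch, the hex() method goes in the opposite direction of fromhex, and decoding the same four bytes as UTF-8 shows that the last two bytes form a single character.

```python
raw = bytes.fromhex('31 4B CE A9')

hex_text = raw.hex()        # bytes -> hexadecimal string
text = raw.decode('utf8')   # \xce\xa9 is the UTF-8 encoding of U+03A9 (Ω)

print(hex_text)  # 314bcea9
print(text)      # 1KΩ
```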

Besides that, a bytes or bytearray object can also be built by calling its constructor.

In [18]:
import array

numbers = array.array('h', [-2, -1, 0, 1, 2])  # Typecode 'h' creates an array of short integers (16 bits).
octets = bytes(numbers)
print(octets)  # These are the 10 bytes that represent the five short integers


b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'


Creating a bytes or bytearray object from any buffer-like source will always **copy** the bytes. In contrast, memoryview objects let you **share** memory between binary data structures.
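The difference between copying and sharing can be checked with a minimal sketch. The array contents and variable names are illustrative: a bytes object built from an array keeps its old contents after the array changes, while a memoryview over the same array sees the change.

```python
import array

numbers = array.array('h', [1, 2, 3])
snapshot = bytes(numbers)      # copies the array's underlying bytes
view = memoryview(numbers)     # shares the array's memory, no copy

numbers[0] = 100               # modify the original array

# The copy still holds the bytes of [1, 2, 3]; the view reflects the change
copy_unchanged = snapshot == bytes(array.array('h', [1, 2, 3]))
view_sees_change = view[0] == 100
print(copy_unchanged, view_sees_change)  # True True
```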

### 2.1 Memory view class: 
https://docs.python.org/3/library/stdtypes.html#memory-views

Memoryview provides shared memory access to slices of data from
other binary sequences, packed arrays, and buffers.

The great thing about it is that it uses the **buffer protocol** beneath the covers to avoid copies and just juggle pointers to data. 



In [43]:
byte_array = bytearray('XYZ', 'utf-8') 
print('Before update:', byte_array) 
  
mem_view = memoryview(byte_array) 
print(mem_view)

# set the item at index 2 of mem_view to 74, the code for 'J'
mem_view[2] = 74
print('Memory view:', bytes(mem_view)) 
print('After update:', byte_array) 

Before update: bytearray(b'XYZ')
<memory at 0x104d8d7c8>
Memory view: b'XYJ'
After update: bytearray(b'XYJ')


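As the example above shows, mem_view and byte_array actually share the same memory. When we modify mem_view, byte_array is modified too.

When we print a memoryview, what we see is a memory address.

However, plain reference assignment gives the same sharing behavior.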

In [26]:
char_lst1 = list('XYZ')
char_lst1

['X', 'Y', 'Z']

In [27]:
char_lst2 = char_lst1
char_lst2[-1] = 'W'

print('char_lst1: ', char_lst1)
print('char_lst2: ', char_lst2)

char_lst1:  ['X', 'Y', 'W']
char_lst2:  ['X', 'Y', 'W']


The real value of a memoryview is that creating a slice does not copy the data. Consider the following example:

In [30]:
char_slice1 = char_lst1[1:]
char_slice2 = char_lst1[1:]

print(id(char_slice1), char_slice1)
print(id(char_slice2), char_slice2)

4377569928 ['Y', 'W']
4377570376 ['Y', 'W']


In [31]:
char_slice1[0] = 'Z'

print(id(char_slice1), char_slice1)
print(id(char_slice2), char_slice2)

4377569928 ['Z', 'W']
4377570376 ['Y', 'W']


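We first create two identical slices and then modify one of them; the other is unchanged. In other words, slicing a list copies the sliced data into a new list object. This is slow when working with large data sets.

Now let's see how a memoryview handles the same situation.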


In [41]:
byte_arr_slice1 = mem_view[1:]
byte_arr_slice2 = mem_view[1:]

print(id(byte_arr_slice1), byte_arr_slice1, bytes(byte_arr_slice1))
print(id(byte_arr_slice2), byte_arr_slice2, bytes(byte_arr_slice2))

4376287368 <memory at 0x104d8d888> b'YJ'
4376287560 <memory at 0x104d8d948> b'YJ'


In [42]:
byte_arr_slice1[1] = 100

print(id(byte_arr_slice1), byte_arr_slice1, bytes(byte_arr_slice1))
print(id(byte_arr_slice2), byte_arr_slice2, bytes(byte_arr_slice2))
print(bytes(mem_view))

4376287368 <memory at 0x104d8d888> b'Yd'
4376287560 <memory at 0x104d8d948> b'Yd'
b'XYd'


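⚠️ As we can see, a memoryview slice does not copy anything; it refers directly to the underlying memory. Modifying the data through one slice therefore changes what every other view shows, because they all refer to the same memory.

Next, we see how a memoryview improves performance on a larger data set. It can yield large performance gains when operating on large objects since it doesn’t create a copy when slicing.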

In [25]:
import time
for n in (100000, 200000, 300000, 400000):
    data = bytes('x'*n, encoding='utf_8')
    start = time.time()
    b = data
    while b:
        b = b[1:]  # repeat slicing
    print('bytes', n, time.time()-start)

for n in (100000, 200000, 300000, 400000):
    data = bytes('x'*n, encoding='utf_8')
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]  # repeat slicing
    print('memoryview', n, time.time()-start)
    

bytes 100000 0.27358317375183105
bytes 200000 1.2130539417266846
bytes 300000 2.9604148864746094
bytes 400000 5.418951034545898
memoryview 100000 0.01489877700805664
memoryview 200000 0.02967095375061035
memoryview 300000 0.04642605781555176
memoryview 400000 0.059120893478393555


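The timings for repeated bytes slicing grow quadratically: each b[1:] copies all the remaining bytes, so n successive slices copy roughly n²/2 bytes in total. Slicing a memoryview copies nothing, so the total cost is reduced to linear.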

### 2.2 Structs and Memory Views
https://docs.python.org/3/library/struct.html

#### Struct module:
to extract structured information from binary sequences

The struct module provides functions to parse packed bytes into a tuple of fields of different types and to perform the opposite conversion, from a tuple into packed bytes.

struct is used with bytes, bytearray, and **memoryview** objects.

In [46]:
import struct

header = b'GIF89a+\x02\xe6\x00'  # 10 bytes

"""
Struct format:
    - <: little-endian byte order
    - 3s3s: two sequences of 3 bytes each
    - HH: two 16-bit unsigned integers
"""

fmt = '<3s3sHH'

struct.unpack(fmt, header)

(b'GIF', b'89a', 555, 230)
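Since struct also works with memoryview objects, the same header can be parsed through a view, and struct.pack performs the opposite conversion. The following is a minimal sketch reusing the header bytes and format string from above; the other names are illustrative.

```python
import struct

fmt = '<3s3sHH'
header = b'GIF89a+\x02\xe6\x00'

size = struct.calcsize(fmt)               # number of bytes the format describes
view = memoryview(header)
fields = struct.unpack(fmt, view[:size])  # slicing the view copies nothing

packed = struct.pack(fmt, *fields)        # tuple of fields -> packed bytes

print(size)              # 10
print(fields)            # (b'GIF', b'89a', 555, 230)
print(packed == header)  # True
```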

⚠️ If you need to read and modify binary files frequently, see mmap (Memory-mapped file support).
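As a rough sketch of what mmap offers, the example below maps a small scratch file and edits it in place through slice assignment. The temporary file, its contents, and the variable names are assumptions made for this demonstration only.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (demo content only)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'GIF89a+\x02\xe6\x00')
    path = tmp.name

with open(path, 'r+b') as f:
    # A length of 0 maps the whole file
    with mmap.mmap(f.fileno(), 0) as mm:
        first_three = mm[:3]   # slicing reads bytes from the mapped file
        mm[3:6] = b'87a'       # same-length slice assignment edits in place
        mm.flush()             # write the change back to the file

with open(path, 'rb') as f:
    content = f.read()
os.remove(path)

print(first_three)  # b'GIF'
print(content)      # b'GIF87a+\x02\xe6\x00'
```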

## 3. Basic Encoders / Decoders

## 4. Understanding Encoder / Decoder Problems

## 5. Handling Text Files

## 6. Normalizing Unicode for Saner Comparisons

## 7. Sorting Unicode Text

## 8. The Unicode Database

## 9. Dual-Mode str and bytes APIs

## 10. Chapter Summary