# String 

### 1. Unicode
* standard for consistent encoding, representation, and handling of text
* can be implemented by different character encodings (e.g. UTF-8, UTF-16, UTF-32)
* **encoding**: code points $\rightarrow$ bytes
* **decoding**: bytes $\rightarrow$ code points
* **UTF-8**: most widely used encoding
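
To make the code point / bytes distinction concrete, the sketch below encodes one character under several encodings and decodes each result back. The variable names (`text`, `encoded_forms`) are chosen here for illustration only.

```python
text = "가"                      # a single code point, U+AC00
print(hex(ord(text)))            # the code point itself

# The same code point becomes different byte sequences per encoding.
encoded_forms = {}
for name in ("utf-8", "utf-16-le", "utf-32-le", "euc-kr"):
    encoded = text.encode(name)           # encoding: code point -> bytes
    assert encoded.decode(name) == text   # decoding: bytes -> code point
    encoded_forms[name] = encoded
    print(f"{name:10} {len(encoded)} byte(s): {encoded.hex()}")
```

The string always has length 1, while the byte length depends on the encoding chosen.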

### 2. UTF-8 encoding
| Number of bytes | Bits for code point | First code point | Last code point | Byte1 | Byte2 | Byte3 | Byte4 |
|---|---|---|---|---|---|---|---|
|1|7 |U+0000 |U+007F  |0xxxxxxx||||
|2|11|U+0080 |U+07FF  |110xxxxx|10xxxxxx|||
|3|16|U+0800 |U+FFFF  |1110xxxx|10xxxxxx|10xxxxxx||
|4|21|U+10000|U+10FFFF|11110xxx|10xxxxxx|10xxxxxx|10xxxxxx|
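
The bit layout in the table can be followed by hand. The function below is a minimal sketch, not how CPython implements encoding; `utf8_encode_code_point` is a name made up for this example, and the result is compared against the built-in `str.encode`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand, following the table's bit layout.

    Note: unlike str.encode, this sketch does not reject surrogates
    (U+D800 to U+DFFF).
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


# One sample character per row of the table.
for ch in ("A", "é", "가", "😀"):
    manual = utf8_encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")
    print(ch, hex(ord(ch)), manual)
```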

In [7]:
a = 'A'
a, type(a), hex(ord(a)), len(a)

('A', str, '0x41', 1)

In [6]:
a = '가'
a, type(a), hex(ord(a)), len(a)

('가', str, '0xac00', 1)

### 3. bytes
* immutable sequence of bytes (integers in the range 0-255)

In [18]:
b1 = bytes([65, 255, 200, 0])
print(b1, type(b1))
print(b1[0], b1[1])

b'A\xff\xc8\x00' <class 'bytes'>
65 255


In [14]:
b1 = bytes(b'hello')
b1

b'hello'
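
A short sketch of what "immutable" means in practice: item assignment fails, so a changed value has to be built as a new object. The names `b3` and `error_message` are illustrative.

```python
b1 = bytes([65, 255, 200, 0])

error_message = None
try:
    b1[0] = 67                    # bytes does not support item assignment
except TypeError as exc:
    error_message = str(exc)
    print("cannot modify bytes:", exc)

# Build a new bytes object instead of modifying in place.
b3 = bytes([67]) + b1[1:]
print(b1, b3)
```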

### 4. bytearray
* mutable sequence of bytes (integers in the range 0-255)

In [20]:
b2 = bytearray([65, 255, 200, 0])
print(b2, type(b2))
print(b2[0], b2[1])

bytearray(b'A\xff\xc8\x00') <class 'bytearray'>
65 255


In [24]:
b2[0] = 67
b2[0]

67
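
Beyond item assignment, `bytearray` supports the usual in-place sequence operations. The following is a small sketch with made-up values; `frozen` is an illustrative name for the immutable copy.

```python
b2 = bytearray(b"AB")
b2.append(67)          # add one byte (an int in 0-255)
b2.extend(b"DE")       # add several bytes at once
b2[0:2] = b"xy"        # slice assignment replaces a range in place
del b2[-1]             # remove the last byte

frozen = bytes(b2)     # immutable snapshot of the current contents
print(b2, frozen)
```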

### 5. str $\leftrightarrow$ bytes

In [29]:
a = 'hello world'
b = a.encode('utf-8')
b, type(b)

(b'hello world', bytes)

In [30]:
c = b.decode('utf-8')
c, type(c)

('hello world', str)

In [34]:
a = '한글'
b = a.encode('utf-8')
b, type(b)

(b'\xed\x95\x9c\xea\xb8\x80', bytes)

In [32]:
b = a.encode('euc-kr')
b, type(b)

(b'\xc7\xd1\xb1\xdb', bytes)

In [38]:
data = b'\xc0\xa5\xbf\xa1\xbc\xad \xc5\xa9\xb7\xd1\xb8\xb5\xc7\xd1 \xb5\xa5\xc0\xcc\xc5\xcd'
# print(data.decode('utf-8'))  # raises UnicodeDecodeError: the bytes are EUC-KR, not UTF-8
print(data.decode('euc-kr'))

웹에서 크롤링한 데이터
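
When the encoding of incoming bytes is unknown, one common approach is to try a few candidates in order. The helper below is a hypothetical sketch (`decode_with_fallback` and its default candidate list are assumptions made for this example). Treat it as a heuristic, since a wrong encoding can sometimes decode without raising an error and silently produce the wrong text.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "euc-kr")) -> str:
    """Hypothetical helper: return the first successful strict decoding."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return data.decode(encodings[0], errors="replace")


euc_kr_bytes = "한글".encode("euc-kr")
utf8_bytes = "한글".encode("utf-8")

print(decode_with_fallback(euc_kr_bytes))   # UTF-8 fails, EUC-KR succeeds
print(decode_with_fallback(utf8_bytes))     # UTF-8 succeeds immediately
```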


### 6. Using Unicode in regular expressions

In [41]:
import re
pattern = r'[가-힣]+'
re.findall(pattern, 'this is a english string 한글이 숨어있다. haha python good 여기도 있다 gogo')

['한글이', '숨어있다', '여기도', '있다']

In [42]:
pattern = '[\uac00-\ud7a3]+'
re.findall(pattern, 'this is a english string 한글이 숨어있다. haha python good 한글만 추출됩니다 gogo')

['한글이', '숨어있다', '한글만', '추출됩니다']
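
The two patterns above describe the same range, because `가` is U+AC00 and `힣` is U+D7A3. That range covers precomposed Hangul syllables only, so standalone jamo such as `ㅋㅋ` are not matched. The sketch below checks both points; the pattern names and the sample text are made up for illustration.

```python
import re

# The literal characters and the escapes are the same code points.
assert "가" == "\uac00" and "힣" == "\ud7a3"

HANGUL_SYLLABLES = re.compile(r"[가-힣]+")
# Assumed extension: also accept compatibility jamo (ㄱ-ㅎ consonants, ㅏ-ㅣ vowels).
HANGUL_WITH_JAMO = re.compile(r"[가-힣ㄱ-ㅎㅏ-ㅣ]+")

sample = "ㅋㅋ 한글 test"
syllables_only = HANGUL_SYLLABLES.findall(sample)
with_jamo = HANGUL_WITH_JAMO.findall(sample)
print(syllables_only, with_jamo)
```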

### 7. Pickle
* object $\rightarrow$ bytes (serialization) $\rightarrow$ store in file
* [more about Pickle](https://docs.python.org/3/library/pickle.html)

In [53]:
import pickle
persons = {'abc' : 1, 'def' : 2, 'ghi' : 3}

### object $\rightarrow$ bytes

In [54]:
string = pickle.dumps(persons)  # returns a bytes object, despite the variable name
print(string)

b'\x80\x03}q\x00(X\x03\x00\x00\x00abcq\x01K\x01X\x03\x00\x00\x00defq\x02K\x02X\x03\x00\x00\x00ghiq\x03K\x03u.'


### bytes $\rightarrow$ object

In [55]:
print(pickle.loads(string))

{'abc': 1, 'def': 2, 'ghi': 3}
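
The exact bytes produced by `pickle.dumps` depend on the pickle protocol, which in turn depends on the Python version, so the output shown above may differ on another interpreter. A minimal sketch of the in-memory round trip with an explicit protocol (the names `payload` and `restored` are illustrative):

```python
import pickle

persons = {'abc': 1, 'def': 2, 'ghi': 3}

# Pin the protocol explicitly; HIGHEST_PROTOCOL varies by Python version.
payload = pickle.dumps(persons, protocol=pickle.HIGHEST_PROTOCOL)
print(type(payload), len(payload))

# For protocol 2 and later, the stream starts with 0x80 and the protocol number.
print(payload[0] == 0x80, payload[1] == pickle.HIGHEST_PROTOCOL)

restored = pickle.loads(payload)
print(restored == persons, restored is persons)  # equal content, new object
```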


### write to file

In [56]:
# Pickle files must be opened in binary mode; `with` closes the file afterwards.
with open('obj', 'wb') as f:
    pickle.dump(persons, f)

### read from file

In [57]:
# Read in binary mode; `with` closes the file afterwards.
with open('obj', 'rb') as f:
    p2 = pickle.load(f)
print(p2)

{'abc': 1, 'def': 2, 'ghi': 3}
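
The same file round trip can be tried without leaving a file behind by working in a temporary directory. This is a sketch under assumptions: the file name `obj.pkl` and the nested sample data are made up here. Note that the pickle documentation warns against unpickling data from untrusted sources, because loading can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

# Pickle handles nested containers, not just flat dictionaries.
record = {'persons': {'abc': 1, 'def': 2, 'ghi': 3}, 'tags': ['x', 'y']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'obj.pkl'        # hypothetical file name
    with path.open('wb') as f:
        pickle.dump(record, f)
    with path.open('rb') as f:
        loaded = pickle.load(f)             # only load files you trust

print(loaded == record)
```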
