# Text versus Bytes

Python 3 draws a sharp distinction between strings of human text and sequences of raw bytes.

## What is a character?

-  Python 3 str = Unicode characters (similar to the Python 2 unicode object)
-  Python 2 unicode object = Unicode characters
-  Python 2 str = raw bytes

The Unicode standard explicitly separates the identity of characters from specific byte representations.

-  A Unicode code point is a number from 0 to 1,114,111 (base 10).
-  It is written in the Unicode standard as 4 to 6 hexadecimal digits with a “U+” prefix, e.g. U+0041 for the letter A.
-  About 10% of the valid code points have characters assigned to them.
-  The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa.
    -  The code point for A (U+0041) is encoded as the single byte \x41 in UTF-8, or as the bytes \x41\x00 in UTF-16LE.
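
As a minimal sketch of the points above, the snippet below shows a code point in U+ notation and the bytes that two encodings produce for it. The helper name `code_point` is made up for this illustration and is not part of the standard library.

```python
def code_point(char):
    """Return the 'U+XXXX' notation for a single character (hypothetical helper)."""
    return f'U+{ord(char):04X}'

print(code_point('A'))           # U+0041
print('A'.encode('utf_8'))       # b'A', i.e. the single byte \x41
print('A'.encode('utf_16le'))    # b'A\x00', i.e. the bytes \x41\x00
print(code_point('\U0001F40D'))  # U+1F40D, a code point that needs 5 hex digits
```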

In [14]:
#'café' ends in é, a non-ASCII character
s = 'café'
#the str has 4 Unicode characters
print(len(s))
#encode the str to bytes using UTF-8
b = s.encode('utf8')
print(b)
#é is encoded as 2 bytes in UTF-8, so len = 5
print(len(b))
#decode the bytes back to a str
print(b.decode('utf8'))

4
b'caf\xc3\xa9'
5
café


## New binary sequences in Py3

-  bytes (immutable; items are integers from 0 to 255)
-  bytearray (mutable; items are integers from 0 to 255)

In [28]:
#build bytes from a str; the composed é is encoded as \xc3\xa9 (not as e + \xcc\x81)
cafe = bytes('café', encoding='utf_8')
#displayed as a bytes literal with \x escapes, not as U+ code points
print(cafe)
#each item is an integer from 0 to 255: 'c' is 99 in ASCII
print(cafe[0])
#a slice of bytes is also bytes, even a slice of length 1
print(cafe[:1])
cafe_arr = bytearray(cafe)
#a bytearray displays as bytearray(b'...'); "caf" is in the printable ASCII range, so it is shown as is
print(cafe_arr)
#cafe_arr has 5 bytes: 2 of them for é
print(len(cafe_arr))
#so the last item is the second of the two é bytes, i.e. \xa9; a slice of a bytearray is a bytearray
print(cafe_arr[-1:])

b'caf\xc3\xa9'
99
b'c'
bytearray(b'caf\xc3\xa9')
5
bytearray(b'\xa9')


-  For bytes in the printable ASCII range — from space to ~ — the ASCII character itself is used.
-  For bytes corresponding to tab, newline, carriage return and \, the escape sequences \t, \n, \r and \\ are used.
-  For every other byte value, a hexadecimal escape sequence is used, e.g. \x00 is the null byte.
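
The three display rules can be checked with one small bytes object. This is only an illustrative sketch; the byte values were chosen to hit each rule once.

```python
# A, ~ (printable ASCII), tab, newline, carriage return, backslash, null, 255
raw = bytes([65, 126, 9, 10, 13, 92, 0, 255])
print(raw)  # b'A~\t\n\r\\\x00\xff'
```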

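Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few others that depend on Unicode data, including casefold, isdecimal, isidentifier, isnumeric, isprintable and encode. This means that you can use familiar string methods like endswith, replace, strip, translate, upper etc., as long as the arguments are bytes rather than str.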
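
A short sketch of familiar string methods applied to bytes. The sample value is made up; note that the arguments are bytes literals, not str.

```python
data = b'  Hello, bytes!  '

print(data.strip().upper())            # b'HELLO, BYTES!'
print(data.strip().endswith(b'!'))     # True
print(data.replace(b'bytes', b'str'))  # b'  Hello, str!  '

# The formatting and Unicode-dependent methods are absent from bytes
print(hasattr(bytes, 'format'))        # False
print(hasattr(bytes, 'casefold'))      # False
```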

Other ways of building bytes or bytearray instances are calling their constructors with:
-  a str and an encoding keyword argument.
-  an iterable providing items with values from 0 to 255.
-  a single integer, to create a binary sequence of that size initialized with null bytes.
-  an object that implements the buffer protocol (e.g. bytes, bytearray, memoryview, array.array); this copies the bytes from the source object to the newly created binary sequence.
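
The four constructor forms listed above can be sketched as follows. The variable names are chosen only for this illustration.

```python
import array

from_str = bytes('caf\u00e9', encoding='utf_8')    # a str plus an encoding
from_iterable = bytes([99, 97, 102])               # an iterable of ints 0..255
zeros = bytearray(3)                               # a size: three null bytes
from_buffer = bytes(array.array('B', [1, 2, 3]))   # a buffer-protocol object (copied)

print(from_str)       # b'caf\xc3\xa9'
print(from_iterable)  # b'caf'
print(zeros)          # bytearray(b'\x00\x00\x00')
print(from_buffer)    # b'\x01\x02\x03'
```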

In [30]:
#parse pairs of hex digits (spaces are optional) into bytes
print(bytes.fromhex('31 4B CE A9'))

# Initializing bytes from the raw data of an array.
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) 
octets = bytes(numbers)
octets

b'1K\xce\xa9'


b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

## Structs and memory views

The struct module provides functions to parse packed bytes into a tuple of fields of different types and to perform the opposite conversion, from a tuple into packed bytes. 

The memoryview class does not let you create or store byte sequences, but provides shared memory access to slices of data from other binary sequences, packed arrays and buffers such as PIL images, without copying the bytes.
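
Both ideas can be tried without an image file. The sketch below packs a GIF-like header in memory (the field values are made up to mirror a 601x203 image), unpacks it again, and then edits the buffer through a memoryview slice to show that the slice shares memory with the source.

```python
import struct

# < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers
fmt = '<3s3sHH'

packed = struct.pack(fmt, b'GIF', b'87a', 601, 203)
print(packed)                      # b'GIF87aY\x02\xcb\x00'
print(struct.unpack(fmt, packed))  # (b'GIF', b'87a', 601, 203)

# A memoryview slice refers to the same memory as the bytearray
buffer = bytearray(packed)
view = memoryview(buffer)
width_bytes = view[6:8]            # the two bytes that hold the width
width_bytes[0] = 0                 # writes straight into buffer
print(struct.unpack(fmt, buffer))  # (b'GIF', b'87a', 512, 203)
```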

In [32]:
import struct

#struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.
fmt = '<3s3sHH'

#python.gif = 601x203px
with open('python.gif', 'rb') as fp:
    img = memoryview(fp.read())

header = img[:10]

print(bytes(header))
#fields: type, version, width, height
print(struct.unpack(fmt, header))
#delete the references to release the memory held by the memoryview instances
del header
del img

b'GIF87aY\x02\xcb\x00'
(b'GIF', b'87a', 601, 203)


## Basic encoders/decoders



for codec in ['latin_1','utf_8', 'utf_16']:
    # ñ must be the composed Latin-1 character (U+00F1), not the decomposed n + combining tilde that OSX input may produce
    print(codec, 'El Niño'.encode(codec), sep='\t')

In [45]:
city = 'São Paulo'
print(city.encode('utf_8'))
print(city.encode('utf_16'))
# print(city.encode('iso8859_1'))
#errors='ignore' silently skips characters that cannot be encoded
print(city.encode('cp437', errors='ignore'))
#errors='replace' substitutes ? for them
print(city.encode('cp437', errors='replace'))
#errors='xmlcharrefreplace' substitutes XML character references
print(city.encode('cp437', errors='xmlcharrefreplace'))

#Coping with UnicodeDecodeError

octets = b'Montr\xe9al' 
print(octets.decode('cp1252'))
print(octets.decode('iso8859_7'))
print(octets.decode('koi8_r'))
#print(octets.decode('utf_8')) #'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
print(octets.decode('utf_8', errors='replace'))

b'Sa\xcc\x83o Paulo'
b'\xff\xfeS\x00a\x00\x03\x03o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
b'Sao Paulo'
b'Sa?o Paulo'
b'Sa&#771;o Paulo'
Montréal
Montrιal
MontrИal
Montr�al


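The Chardet package can be used to guess the encoding of a byte sequence, based on heuristics about the byte patterns that are common in each encoding. The result is a guess, not a guarantee.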

## Endianness, and byte order

One big advantage of UTF-8 is that it produces the same byte sequence regardless of machine endianness, so no BOM is needed. 
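
A small sketch of the difference, using only the standard codecs. The plain utf_16 codec prepends a BOM in the byte order of the machine, so the check below accepts either order; the explicit utf_16le and utf_16be codecs write no BOM.

```python
import codecs

text = 'El Ni\u00f1o'

# utf_16 starts with a byte-order mark in native byte order
utf16 = text.encode('utf_16')
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# Explicit byte orders produce different bytes for the same text, with no BOM
print(text.encode('utf_16le')[:2])  # b'E\x00'
print(text.encode('utf_16be')[:2])  # b'\x00E'

# UTF-8 has a single byte order, so no BOM is written
print(text.encode('utf_8')[:2])     # b'El'

# The BOM lets the decoder pick the right byte order
print(utf16.decode('utf_16') == text)  # True
```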

## Handling text files

The “Unicode sandwich” is the recommended practice for handling text:
-  Bytes should be decoded to str as early as possible on input, e.g. when opening a file for reading.
-  The business logic of your program should handle text exclusively as str objects. You should never be encoding or decoding in the middle of other processing.
-  On output, str should be encoded to bytes as late as possible.

Python 3 makes it easier to follow the advice of the Unicode sandwich, because the open built-in does the necessary decoding when reading and encoding when writing files in text mode, so all you get from my_file.read() and pass to my_file.write(text) are str objects.
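
A minimal sketch of the sandwich, assuming a throwaway file in a temporary directory (the path and variable names are made up for this example). The encoding is passed explicitly at both edges, and the processing in the middle only sees str.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Output edge: str is encoded to bytes as late as possible, by open()
with open(path, 'w', encoding='utf_8') as fp:
    chars_written = fp.write('caf\u00e9')

# Input edge: bytes are decoded to str as early as possible, by open()
with open(path, encoding='utf_8') as fp:
    text = fp.read()

# The middle of the sandwich: only str objects
shout = text.upper()

# Binary mode shows what is really stored on disk
with open(path, 'rb') as fp:
    raw = fp.read()

print(chars_written, len(raw))  # 4 5
print(shout)
```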

In [58]:
fp = open('cafe.txt', 'w', encoding='utf_8')
#returns TextIOWrapper object
print(fp)
print(fp.write('café'))
#call close() so the text is flushed to disk (fp.close without parentheses does nothing)
fp.close()
import os
os.stat('cafe.txt').st_size
#opens with locale default encoding (ASCII for me)
fp2 = open('cafe.txt')
print(fp2)
print(fp2.encoding)
fp3 = open('cafe.txt', 'rb')
print(fp3)
#read the raw bytes
print(fp3.read())

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>
4
<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='US-ASCII'>
US-ASCII
<_io.BufferedReader name='cafe.txt'>
b'caf\xc3\xa9'


In [64]:
import sys, locale
expressions = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
"""

#locale.getpreferredencoding() = default encoding taken from the locale settings
#my_file.encoding = text files get their default encoding from the locale
#sys.stdout.isatty() = False means output is not going to a console
#sys.stdout.encoding = encoding used for standard output (UTF-8 in this run)
#sys.stdin.isatty()
#sys.stdin.encoding
#sys.stderr.isatty()
#sys.stderr.encoding
#sys.getdefaultencoding() = default from an internal Python setting
#sys.getfilesystemencoding() = mbcs on Windows. On GNU/Linux and OSX all of these encodings...
    #are set to UTF-8 by default, and have been for several years, so I/O handles all Unicode characters.

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression) 
    print(expression.rjust(30), '->', repr(value))

 locale.getpreferredencoding() -> 'US-ASCII'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'US-ASCII'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'US-ASCII'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


The best advice about encoding defaults is: do not rely on them!
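
A short illustration of what can go wrong when the encoding used for reading differs from the one used for writing. The codecs below are picked only as examples of plausible defaults on different systems.

```python
raw = 'caf\u00e9'.encode('utf_8')   # bytes as written by a UTF-8 system

print(raw.decode('utf_8'))    # café
print(raw.decode('cp1252'))   # cafÃ©, silently wrong ("mojibake")

try:
    raw.decode('ascii')       # an ASCII default fails loudly instead
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)
```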

## Normalizing Unicode for saner comparisons

NFC (Normalization Form C) composes the code points to produce the shortest equivalent string, while NFD decomposes, expanding composed characters into base characters and separate combining characters. Both of these normalizations make comparisons work as expected:

In [69]:
from unicodedata import normalize
s1 = 'cafe\u0301' # decomposed: "e" followed by a combining acute accent
s2 = 'caf\u00e9' # composed "e" with acute accent
print(len(s1), len(s2))
#using NFC
print(len(normalize('NFC', s1)), len(normalize('NFC', s2)))
print(normalize('NFC', s1) == normalize('NFC', s2))
#Using NFD
print(len(normalize('NFD', s1)), len(normalize('NFD', s2)))
print(normalize('NFD', s1) == normalize('NFD', s2))

5 4
4 4
True
5 5
True


In [70]:
from unicodedata import normalize, name
#the ohm sign, used for the unit
ohm = '\u2126'
print(name(ohm))
#NFC normalizes it to the Greek capital omega
ohm_c = normalize('NFC', ohm)
print(name(ohm_c))
#the originals don't match
print(ohm == ohm_c)
#the normalized forms do
print(normalize('NFC', ohm) == normalize('NFC', ohm_c))

OHM SIGN
GREEK CAPITAL LETTER OMEGA
False
True


In [74]:
from unicodedata import normalize, name 

half = '½'
print(normalize('NFKC', half))
four_squared = '4²'
print(normalize('NFKC', four_squared))
#the micro sign is considered a “compatibility character”.
micro = 'µ'
micro_kc = normalize('NFKC', micro) 
print(micro, micro_kc)
print(ord(micro), ord(micro_kc))
print(name(micro), name(micro_kc))


1⁄2
42
µ μ
181 956
MICRO SIGN GREEK SMALL LETTER MU


In the NFKC and NFKD forms, each compatibility character is replaced by a “compatibility decomposition” of one or more characters that are considered a “preferred” representation, even if there is some formatting loss.

## Case folding

Case folding is essentially converting all text to lowercase, with some additional transformations. For any string s containing only Latin-1 characters, s.casefold() produces the same result as s.lower(), with only two exceptions: the micro sign 'µ' (U+00B5) is changed to the Greek lowercase mu 'μ' (U+03BC, which looks the same in most fonts) and the German Eszett or “sharp s” (ß) becomes “ss”.
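
The claim about Latin-1 can be checked by brute force. This is a sketch that walks the 256 Latin-1 code points and keeps those where casefold() and lower() disagree.

```python
# Every Latin-1 character whose casefold() differs from its lower()
differences = [c for c in map(chr, range(256)) if c.casefold() != c.lower()]

for c in differences:
    print(f'U+{ord(c):04X}', repr(c), '->', repr(c.casefold()))
```

In current Python versions this should list only U+00B5 (the micro sign) and U+00DF (the Eszett).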


In [79]:
from unicodedata import name

micro = 'µ'
micro_cf = micro.casefold() 
print(name(micro_cf))
print(micro, micro_cf)
eszett = 'ß'
print(name(eszett))
eszett_cf = eszett.casefold()
print(eszett, eszett_cf)

GREEK SMALL LETTER MU
µ μ
LATIN SMALL LETTER SHARP S
ß ss


In [83]:
from unicodedata import normalize 

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())

s1 = 'café'
s2 = 'cafe\u0301'
#composed and decomposed é are different code point sequences
print(s1 == s2)
#equal after NFC normalization
print(nfc_equal(s1, s2))
#NFC does not change case, so 'A' and 'a' are still different
print(nfc_equal('A', 'a'))

s3 = 'Straße'
s4 = 'strasse'
#not equal
print(s3 == s4)
#still not equal after NFC: normalization does not turn ß into ss or change case
print(nfc_equal(s3, s4))
#case folding turns ß into ss and lowercases the S
print(fold_equal(s3, s4))
#é is normalized before folding
print(fold_equal(s1, s2))
#case differences are removed by casefold
print(fold_equal('A', 'a'))

False
True
False
False
False
True
True
True


## Extreme “normalization”: taking out diacritics

In [100]:
import unicodedata
import string

def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_txt = unicodedata.normalize('NFD', txt) 
    shaved = ''.join(c for c in norm_txt
        if not unicodedata.combining(c)) 
    return unicodedata.normalize('NFC', shaved)

def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters""" 
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    keepers = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base: 
            continue # ignore diacritic on Latin base char
        keepers.append(c)
        # if it isn't combining char, it's a new base char 
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters 
    shaved = ''.join(keepers)
    return unicodedata.normalize('NFC', shaved)

single_map = str.maketrans("""‚ƒ„†ˆ‹‘’“”•–—~›""",
                           """'f"*^<''""---~>""")

multi_map = str.maketrans({
    '€': '<euro>',
    'Œ': 'OE',
    '‰': '<per mille>',
})

multi_map.update(single_map)

def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences""" 
    return txt.translate(multi_map)

def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt)) 
    no_marks = no_marks.replace('ß', 'ss')
    return unicodedata.normalize('NFKC', no_marks)

order = '“Herr Voß: • ½ cup of ŒtkerTM caffè latte • bowl of açaí.”'

print(shave_marks(order))
print(dewinize(order))
print(asciize(order))

“Herr Voß: • ½ cup of ŒtkerTM caffe latte • bowl of acai.”
"Herr Voß: - ½ cup of OEtkerTM caffè latte - bowl of açaí."
"Herr Voss: - 1⁄2 cup of OEtkerTM caffe latte - bowl of acai."


In [106]:
#Note: on OSX the locale setting did not seem to change the result (both outputs below are identical)
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sorted(fruits))

import locale
locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
fruits2 = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits2, key=locale.strxfrm)
print(sorted_fruits)

#Module not found here: pyuca is a separate third-party package (pip install pyuca), not part of Django
#import pyuca
#coll = pyuca.Collator()
#fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
#sorted_fruits = sorted(fruits, key=coll.sort_key)
#sorted_fruits

['acerola', 'açaí', 'atemoia', 'cajá', 'caju']
['acerola', 'açaí', 'atemoia', 'cajá', 'caju']


## The Unicode database

The Unicode standard provides an entire database that includes not only the table mapping code points to character names, but also a lot of metadata about the individual characters and how they are related. For example:

-  the Unicode database records whether a character is printable, is a letter, is a decimal digit or is some other numeric symbol. 
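
As a small sketch, the general category stored in the database can be read with unicodedata.category, and several str methods answer the questions above directly. The sample characters are arbitrary.

```python
import unicodedata

chars = ['A', '7', '\u00bd', '\n']   # letter, decimal digit, fraction, control

# Two-letter general category for each sample character
categories = {c: unicodedata.category(c) for c in chars}

for c in chars:
    print(repr(c).ljust(6), categories[c],
          'alpha' if c.isalpha() else '-',
          'decimal' if c.isdecimal() else '-',
          'numeric' if c.isnumeric() else '-',
          'printable' if c.isprintable() else '-',
          sep='\t')
```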

In [107]:
import unicodedata
import re

re_digit = re.compile(r'\d')
sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print('U+%04x' % ord(char),                        #the Unicode code point
          char.center(6),                              #the character itself
          're_dig' if re_digit.match(char) else '-',   #re_dig if it matches the r'\d' regex
          'isdig' if char.isdigit() else '-',          #isdig if char.isdigit() is True
          'isnum' if char.isnumeric() else '-',        #isnum if char.isnumeric() is True
          format(unicodedata.numeric(char), '5.2f'),   #numeric value
          unicodedata.name(char),                      #Unicode name
          sep='\t')


U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


## str versus bytes in regular expressions

You can use regular expressions on str and bytes, but in the second case bytes outside the ASCII range are treated as non-digits and non-word characters.

In [109]:
import re
re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')
text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef" 
            " as 1729 = 1³ + 12³ = 9³ + 10³.")
text_bytes = text_str.encode('utf_8')
print('Text', repr(text_str), sep='\n ') 
print('Numbers')
print(' str :', re_numbers_str.findall(text_str)) 
print(' bytes:', re_numbers_bytes.findall(text_bytes)) 
print('Words')
print(' str :', re_words_str.findall(text_str)) 
print(' bytes:', re_words_bytes.findall(text_bytes))

Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
 str : ['௧௭௨௯', '1729', '1', '12', '9', '10']
 bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
 bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']
