# Text, bytes and fun

In [4]:
def czynność(аргумент):
    print(аргумент * 2)
    
czynność(42)

84


Wait a minute, does Python allow Unicode?

In [5]:
czynność.__name__

'czynność'

0_o with great power comes great responsibility

Let's harness it.

# Today's agenda
* Strings in Python 3.x
* `str` and `bytes`
* < Intermission > : Python's built-in functions
* Unicode support: encoding & decoding
* Encoding detection
* Regular expressions 101

### (Not relevant for English speakers)
Rule #0: don't use non-Latin letters in code. `def czynność():` is unacceptable.

## Some simple string methods

### Strings support slices

In [6]:
s = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit'
print(s[10 : 20])
print(s[10 : 20 : 3])

m dolor si
mori
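
A slice has the form `s[start:stop:step]`, where `stop` is exclusive and any of the three parts may be omitted. A minimal sketch of a few more cases (the variable `text` is just an illustrative name):

```python
text = 'Lorem ipsum'

first_word = text[:5]        # start omitted: slice from the beginning
last_word = text[-5:]        # negative index counts from the end
reversed_text = text[::-1]   # negative step walks backwards
every_second = text[::2]     # step 2 keeps every other character

print(first_word, last_word, reversed_text, every_second, sep='\n')
```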


### Immutable, support basic maths (addition, multiplication)

In [7]:
s += '!' * 2
print(s)
s[1] = '?'

Lorem ipsum dolor sit amet, consectetur adipiscing elit!!


TypeError: 'str' object does not support item assignment
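
Because strings are immutable, the usual workaround is to build a new string. A small sketch of two common ways to do it (the names `word`, `patched` and `replaced` are only for illustration):

```python
word = 'Lorem ipsum'

# Strings cannot be changed in place, so assemble a new one instead.
patched = word[:1] + '?' + word[2:]     # swap the character at index 1
replaced = word.replace('o', '0', 1)    # replace only the first 'o'

print(patched)
print(replaced)
print(word)  # the original string is unchanged
```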

### String join / split

In [8]:
print(s.split())
print(s.split('sit'))

['Lorem', 'ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit!!']
['Lorem ipsum dolor ', ' amet, consectetur adipiscing elit!!']


In [9]:
words = s.split(' ')

In [10]:
print(''.join(words))
print(' '.join(words))
print(' ^_^ '.join(words))

Loremipsumdolorsitamet,consecteturadipiscingelit!!
Lorem ipsum dolor sit amet, consectetur adipiscing elit!!
Lorem ^_^ ipsum ^_^ dolor ^_^ sit ^_^ amet, ^_^ consectetur ^_^ adipiscing ^_^ elit!!
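
Two details are easy to miss here: `split()` without arguments collapses runs of whitespace, while `split(' ')` does not, and `join` only accepts strings. A short sketch with made-up sample data:

```python
line = 'a  b   c'  # note the repeated spaces

collapsed = line.split()      # splits on any run of whitespace
literal = line.split(' ')     # splits on every single space

numbers = [1, 2, 3]
# join() needs strings, so convert the numbers first
joined = ', '.join(map(str, numbers))

print(collapsed)
print(literal)
print(joined)
```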


### Basic transformations: upper, lower

In [14]:
s.upper()

'LOREM IPSUM DOLOR SIT AMET, CONSECTETUR ADIPISCING ELIT!!'

In [15]:
s.lower()

'lorem ipsum dolor sit amet, consectetur adipiscing elit!!'

In [16]:
s.lower().capitalize()

'Lorem ipsum dolor sit amet, consectetur adipiscing elit!!'

In [4]:
s.title()

'Lorem Ipsum Dolor Sit Amet, Consectetur Adipiscing Elit!!'
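
For case-insensitive comparison of non-English text, `str.casefold()` is usually a safer choice than `lower()`. A minimal sketch, using a German word as an assumed example:

```python
german = 'Straße'

lowered = german.lower()        # 'ß' stays as it is
folded = german.casefold()      # 'ß' is folded to 'ss'
uppered = german.upper()        # 'ß' becomes 'SS'

# Comparing casefolded strings ignores such case differences
same = uppered.casefold() == german.casefold()

print(lowered, folded, uppered, same)
```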

### Substring search

In [17]:
'lorem' in s, 'lorem' in s.lower()

(False, True)

In [18]:
s.find('ipsum') # returns first index

6

In [19]:
s.find('nonexistent') # or -1 

-1

In [None]:
s.index('ipsum') # like find, returns the first index

In [None]:
s.index('nonexistent') # but raises ValueError instead of returning -1
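
The difference between `find` and `index` only matters when the substring is missing. A small sketch, where `locate` is a hypothetical helper written for this example:

```python
text = 'Lorem ipsum dolor sit amet'

def locate(haystack, needle):
    """Return the index of needle, or None when it is absent."""
    try:
        return haystack.index(needle)
    except ValueError:  # index() raises instead of returning -1
        return None

found = locate(text, 'ipsum')
missing = locate(text, 'nonexistent')
sentinel = text.find('nonexistent')  # find() signals absence with -1

print(found, missing, sentinel)
```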

### String examination: isalpha, isdigit etc

In [11]:
strings = ['abc', '2', '   ']

print('\t\t'.join('string isalpha isdigit isspace'.split()))
for s in strings:
    print('"{}"'.format(s), s.isalpha(), s.isdigit(), s.isspace(), sep='\t\t')

string		isalpha		isdigit		isspace
"abc"		True		False		False
"2"		False		True		False
"   "		False		False		True


### Misc: startswith, endswith, strip

In [21]:
'Hello, world!'.startswith('Hel')

True

In [22]:
'Hello, world!'.endswith('world')

False

In [23]:
'    Hello world    '.strip()

'Hello world'
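
`strip` also has one-sided variants and accepts a set of characters to remove. Note that the argument is a set of characters, not a suffix. A sketch with made-up strings (`removesuffix` assumes Python 3.9 or newer):

```python
padded = '  --Hello world--  '

left = padded.lstrip()           # whitespace removed on the left only
right = padded.rstrip()          # whitespace removed on the right only
bare = padded.strip(' -')        # strip spaces and dashes from both ends

filename = 'report.txt.txt'
# rstrip treats '.txt' as the characters '.', 't', 'x', so it eats too much
over_stripped = filename.rstrip('.txt')
# removesuffix (Python 3.9+) removes the exact suffix once
trimmed = filename.removesuffix('.txt')

print(repr(left), repr(right), bare, over_stripped, trimmed, sep='\n')
```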

## String module

In [12]:
import string

print(string.ascii_letters)
print(string.ascii_lowercase)
print(string.digits)
string.whitespace

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789


' \t\n\r\x0b\x0c'
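
These constants are handy for filtering characters. A minimal sketch that strips punctuation with `str.translate` and keeps only ASCII letters (the sample strings are assumptions for illustration):

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans('', '', string.punctuation)
cleaned = 'Hello, world!'.translate(remove_punctuation)

# Keep only the characters that are ASCII letters
only_letters = ''.join(ch for ch in 'abc123def' if ch in string.ascii_letters)

print(cleaned)
print(only_letters)
```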

### String formatting

https://pyformat.info/

In [24]:
# Old style

name = 'Bob'
'Hello, %s' % name

'Hello, Bob'

In [2]:
name = 'Bob'
'Hello, %(name)s' % {'name':name} 

'Hello, Bob'

In [25]:
# New style
name = 'Bob'
'Hello, {}'.format(name)

'Hello, Bob'

In [13]:
names = ['Bob','Alice']
print('Hello, {0}! Hello, {1}! Hello, {0} again!'.format(*names))

Hello, Bob! Hello, Alice! Hello, Bob again!


In [14]:
names = {'name1':'Bob','name2':'Alice'}
print('Hello, {name1}! Hello, {name2}! Hello, {name1} again!'.format(**names))

Hello, Bob! Hello, Alice! Hello, Bob again!


In [15]:
number = 50159747054
name = 'Bob'

print('Hey {}, I have a decimal number {}!'.format(name, number))

# Supports indexes
print('Sammy is {} {} {} {}!'.format('a', 'happy', 'blue', 'shark'))
print('Sammy is {3} {2} {1} {0}!'.format('a', 'happy', 'blue', 'shark'))
print('Sammy is {1} {1} {1} {3}!'.format('a', 'happy', 'blue', 'shark'))

# and named args
print('Coordinates: {latitude}, {longitude}'.format(latitude=37.24, longitude=-115.81))

Hey Bob, I have a decimal number 50159747054!
Sammy is a happy blue shark!
Sammy is shark blue happy a!
Sammy is happy happy happy shark!
Coordinates: 37.24, -115.81


In [17]:
# f-strings

year = 2022
season = 1

# Notice the f prefix in front of the string!
print(f'In the year {year} it will be season {season} of the IIMCB Python course') 

In the year 2022 it will be season 1 of the IIMCB Python course


In [22]:
# f-strings support number formatting

year = 2022
season = 1

print(f'In the year {year} it will be season {season:.2f} of the IIMCB Python course') 

In the year 2022 it will be season 1.00 of the IIMCB Python course


In [23]:
# various operations can be performed inside f-strings

year = 2022

print(f'In the year {year} it will be season {year-2022+1} of the IIMCB Python course') 

In the year 2022 it will be season 1 of the IIMCB Python course


In [24]:
# you can access list items by index

years = [2021,2022,2023,]
season = 1


print(f'In the year {years[season]} it will be season {season} of the IIMCB Python course') 

In the year 2022 it will be season 1 of the IIMCB Python course


In [26]:
# And even call methods and functions!

year = 2022
season = 1
name = 'Python'

print(f'In the year {year} it will be season {season} of the IIMCB {name.upper()} course') 

In the year 2022 it will be season 1 of the IIMCB PYTHON course
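
The part after the colon in `{season:.2f}` is a format specification, and it can do more than set decimal places. A short sketch of some common specifiers (the values are arbitrary examples):

```python
value = 50159747054
ratio = 0.4567
name = 'Bob'

thousands = f'{value:,}'       # thousands separator
percent = f'{ratio:.1%}'       # percentage with one decimal place
left = f'[{name:<6}]'          # left-aligned in a field of width 6
right = f'[{name:>6}]'         # right-aligned
centered = f'[{name:^7}]'      # centered in a field of width 7
bases = f'{255:#x} {255:08b}'  # hexadecimal with prefix, zero-padded binary
quoted = f'{name!r}'           # !r applies repr()

print(thousands, percent, left, right, centered, bases, quoted, sep='\n')
```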


## Symbols & bytes

In Unicode, a symbol's identity (such as `r`, `Я` or `韩`) is separate from its byte representation.

You can convert between a character (e.g. `𝄞`, code point U+1D11E) and its integer code point with `ord` and `chr`:

In [27]:
ord('r')

114

In [28]:
ord('\U0001D11E'), chr(ord('\U0001D11E')), chr(119070)

(119070, '𝄞', '𝄞')

In [29]:
ord('韩')

38889

In [30]:
chr(ord('韩')) == '韩'

True
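
The standard `unicodedata` module can show the official name behind each code point. A minimal sketch that prints the `U+XXXX` notation, the integer value and the name for the characters used above:

```python
import unicodedata

for char in ['r', '韩', '\U0001D11E']:
    code_point = ord(char)
    # :04X formats the integer as upper-case hex, at least four digits wide
    print(f'U+{code_point:04X}', code_point, unicodedata.name(char))

clef_notation = f'U+{ord(chr(119070)):04X}'
```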

How can we store it in memory?

```
Encoding: symbols (human-readable) -> bytes 
Decoding: bytes -> symbols
```

In [31]:
s = 'café'
len(s)

4

In [32]:
b = s.encode('utf8')
print(len(b), type(b))
print(*b)
b # binary representation

5 <class 'bytes'>
99 97 102 195 169


b'caf\xc3\xa9'

In [33]:
b.decode('utf8')

'café'

In [33]:
print(b)
for byte in b:
    print(hex(byte), byte, chr(byte))

b'caf\xc3\xa9'
0x63 99 c
0x61 97 a
0x66 102 f
0xc3 195 Ã
0xa9 169 ©
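
The last two rows show why `é` turned into `Ã©`: `chr(byte)` treats each byte as a separate code point, which is the same as decoding the bytes as Latin-1. A small sketch of that equivalence:

```python
b = 'café'.encode('utf8')

# Interpreting each byte as its own code point...
per_byte = ''.join(chr(byte) for byte in b)
# ...gives the same result as decoding with the wrong codec
as_latin1 = b.decode('latin_1')
# Only the matching codec restores the original text
as_utf8 = b.decode('utf8')

print(per_byte, as_latin1, as_utf8)
```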


## < Fun >
Let's print the full ASCII table with one print statement

### First, some Python built-in functions

* `map(function, *iterables)`

In [34]:
map(chr, range(50, 70))

<map at 0x7f3ef4c49be0>

In [35]:
print(list(map(chr, range(50, 70))))

['2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E']


In [36]:
string = '2 3 5 7 11 13' # input()
print(string.split())
integers = list(map(int, string.split()))
integers

['2', '3', '5', '7', '11', '13']


[2, 3, 5, 7, 11, 13]

* `zip(*iterables)`

In [37]:
names = ['Bob', 'Sam']
surnames = ['Bronx', 'Jackson']
print(zip(names, surnames))
print(list(zip(names, surnames)))

<zip object at 0x7f3ef0366840>
[('Bob', 'Bronx'), ('Sam', 'Jackson')]
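
A few more things `zip` is commonly used for: building a dictionary, "unzipping" pairs with `*`, and pairing iterables of different lengths. A sketch reusing the same sample names:

```python
names = ['Bob', 'Sam']
surnames = ['Bronx', 'Jackson']

pairs = list(zip(names, surnames))
by_name = dict(zip(names, surnames))   # first items become keys

# zip(*pairs) transposes the pairs back into two tuples
first_names, last_names = zip(*pairs)

# zip stops as soon as the shortest iterable is exhausted
short = list(zip([1, 2, 3], 'ab'))

print(by_name, first_names, last_names, short, sep='\n')
```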


### ASCII table

In [38]:
for char_index in range(128):
    print('{}\t||\t{}\t||\t{}'.format(char_index, hex(char_index), chr(char_index)))

0	||	0x0	||	 
1	||	0x1	||	
2	||	0x2	||	
3	||	0x3	||	
4	||	0x4	||	
5	||	0x5	||	
6	||	0x6	||	
7	||	0x7	||	
8	||	0x8	||	
9	||	0x9	||		
10	||	0xa	||	

11	||	0xb	||	
12	||	0xc	||	
13	||	0xd	||	
14	||	0xe	||	
15	||	0xf	||	
16	||	0x10	||	
17	||	0x11	||	
18	||	0x12	||	
19	||	0x13	||	
20	||	0x14	||	
21	||	0x15	||	
22	||	0x16	||	
23	||	0x17	||	
24	||	0x18	||	
25	||	0x19	||	
26	||	0x1a	||	
27	||	0x1b	||	
28	||	0x1c	||	
29	||	0x1d	||	
30	||	0x1e	||	
31	||	0x1f	||	
32	||	0x20	||	 
33	||	0x21	||	!
34	||	0x22	||	"
35	||	0x23	||	#
36	||	0x24	||	$
37	||	0x25	||	%
38	||	0x26	||	&
39	||	0x27	||	'
40	||	0x28	||	(
41	||	0x29	||	)
42	||	0x2a	||	*
43	||	0x2b	||	+
44	||	0x2c	||	,
45	||	0x2d	||	-
46	||	0x2e	||	.
47	||	0x2f	||	/
48	||	0x30	||	0
49	||	0x31	||	1
50	||	0x32	||	2
51	||	0x33	||	3
52	||	0x34	||	4
53	||	0x35	||	5
54	||	0x36	||	6
55	||	0x37	||	7
56	||	0x38	||	8
57	||	0x39	||	9
58	||	0x3a	||	:
59	||	0x3b	||	;
60	||	0x3c	||	<
61	||	0x3d	||	=
62	||	0x3e	||	>
63	||	0x3f	||	?
64

Can you do this in one line?

In [8]:
# ascii_range = range(128)

print(*['\t||\t'.join(map(str, row)) for row in zip(range(128), map(hex, range(128)), map(chr, range(128)))], sep='\n')

0	||	0x0	||	 
1	||	0x1	||	
2	||	0x2	||	
3	||	0x3	||	
4	||	0x4	||	
5	||	0x5	||	
6	||	0x6	||	
7	||	0x7	||	
8	||	0x8	||	
9	||	0x9	||		
10	||	0xa	||	

11	||	0xb	||	
12	||	0xc	||	
13	||	0xd	||	
14	||	0xe	||	
15	||	0xf	||	
16	||	0x10	||	
17	||	0x11	||	
18	||	0x12	||	
19	||	0x13	||	
20	||	0x14	||	
21	||	0x15	||	
22	||	0x16	||	
23	||	0x17	||	
24	||	0x18	||	
25	||	0x19	||	
26	||	0x1a	||	
27	||	0x1b	||	
28	||	0x1c	||	
29	||	0x1d	||	
30	||	0x1e	||	
31	||	0x1f	||	
32	||	0x20	||	 
33	||	0x21	||	!
34	||	0x22	||	"
35	||	0x23	||	#
36	||	0x24	||	$
37	||	0x25	||	%
38	||	0x26	||	&
39	||	0x27	||	'
40	||	0x28	||	(
41	||	0x29	||	)
42	||	0x2a	||	*
43	||	0x2b	||	+
44	||	0x2c	||	,
45	||	0x2d	||	-
46	||	0x2e	||	.
47	||	0x2f	||	/
48	||	0x30	||	0
49	||	0x31	||	1
50	||	0x32	||	2
51	||	0x33	||	3
52	||	0x34	||	4
53	||	0x35	||	5
54	||	0x36	||	6
55	||	0x37	||	7
56	||	0x38	||	8
57	||	0x39	||	9
58	||	0x3a	||	:
59	||	0x3b	||	;
60	||	0x3c	||	<
61	||	0x3d	||	=
62	||	0x3e	||	>
63	||	0x3f	||	?
64

## < /Fun >

Which encodings can we use? Lots.

In [39]:
string = 'El Niño'

for codec in ['latin_1', 'utf_8', 'utf_16', 'cp437']:
    encoded = string.encode(codec)
    print(codec, encoded.decode(codec), encoded, sep='\t\t')

latin_1		El Niño		b'El Ni\xf1o'
utf_8		El Niño		b'El Ni\xc3\xb1o'
utf_16		El Niño		b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
cp437		El Niño		b'El Ni\xa4o'


Some encodings can't represent every character:

In [40]:
city = 'São Paulo'
print(city.encode('utf_8'))
print(city.encode('utf_16'))
print(city.encode('iso8859_1'))
print(city.encode('cp437'))

b'S\xc3\xa3o Paulo'
b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
b'S\xe3o Paulo'


UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

Oops!

In [41]:
print(city.encode('cp437', errors='ignore')) # bad
print(city.encode('cp437', errors='replace')) # better
print(city.encode('cp437', errors='xmlcharrefreplace')) # still not perfect

b'So Paulo'
b'S?o Paulo'
b'S&#227;o Paulo'
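
The `errors` argument works the same way when decoding. A minimal sketch that decodes Latin-1 bytes with the wrong codec on purpose:

```python
raw = 'São Paulo'.encode('latin_1')  # b'S\xe3o Paulo' is not valid UTF-8

ignored = raw.decode('utf_8', errors='ignore')            # drops the bad byte
replaced = raw.decode('utf_8', errors='replace')          # inserts U+FFFD
escaped = raw.decode('utf_8', errors='backslashreplace')  # keeps the byte value visible

print(ignored, replaced, escaped, sep='\n')
```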


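You can also look at the `codecs` module. It was widely used in Python 2, but in Python 3 the built-in `open()` with an `encoding` argument, `str.encode` and `bytes.decode` cover most everyday needs.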

## File support

In [42]:
with open('unicode_file.txt', 'w', encoding='utf-16le') as f:
    f.write('韩国烧酒')

In [43]:
with open('unicode_file.txt', 'r') as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

In [44]:
with open('unicode_file.txt', 'r', encoding='utf-16') as f:
    print(f.read())

UnicodeError: UTF-16 stream does not start with BOM

In [41]:
with open('unicode_file.txt', 'r', encoding='utf-16le') as f:
    print(f.read())

韩国烧酒
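
The BOM error comes from the difference between the two codec names: `utf-16` writes a byte order mark (BOM) when encoding, while `utf-16le` fixes the byte order and writes none. A small in-memory sketch of that difference:

```python
import codecs

text = '韩国烧酒'

with_bom = text.encode('utf-16')        # native byte order, BOM prepended
without_bom = text.encode('utf-16-le')  # explicit byte order, no BOM

has_bom = with_bom.startswith(codecs.BOM_UTF16)
lacks_bom = not without_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Each byte string decodes cleanly with the codec that produced it
print(with_bom.decode('utf-16'))
print(without_bom.decode('utf-16-le'))
print(len(with_bom) - len(without_bom))  # the BOM takes two bytes
```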


So far we haven't encountered any **tough** cases.

We can handle them with `chardet` and `UnicodeDammit`, which is part of the Beautiful Soup package.

### chardet

In [42]:
# !pip install chardet

In [53]:
import urllib.request
import chardet

rawdata = urllib.request.urlopen('https://www.baidu.com/').read()
chardet.detect(rawdata)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
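
When `chardet` is not installed, a much cruder alternative is to try a few candidate encodings in order. This sketch does not reproduce `chardet`'s detection; `decode_with_fallback` and the candidate list are hypothetical names chosen for illustration:

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings matched')

utf8_result = decode_with_fallback('café'.encode('utf-8'))
cp1252_result = decode_with_fallback('café'.encode('cp1252'))

print(utf8_result)
print(cp1252_result)
```

Keep `latin-1` last: it maps every byte to a character, so it never fails and would hide all the candidates after it.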

### UnicodeDammit

In [44]:
# !pip install bs4

In [1]:
from bs4 import UnicodeDammit

Let's build a really weird string


In [2]:
snowmen = (u'\N{SNOWMAN}' * 3)
print(snowmen)
quote = (u'\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}')
print(quote)
doc = snowmen.encode('utf8') + quote.encode('windows_1252')

☃☃☃
“I like snowmen!”


In [3]:
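print(doc)
# raw bytes: UTF-8 snowmen followed by Windows-1252 quotes

print(doc.decode('windows-1252'))
# the quotes are fine, but the snowmen turn into mojibake: â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

print(doc.decode('utf8'))
# raises UnicodeDecodeError: the Windows-1252 quote bytes are not valid UTF-8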

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\x93I like snowmen!\x94'
â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 9: invalid start byte

UnicodeDammit CAN handle it!

In [4]:
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc)
print(new_doc.decode('utf8'))
# ☃☃☃“I like snowmen!”

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\xe2\x80\x9cI like snowmen!\xe2\x80\x9d'
☃☃☃“I like snowmen!”


## Regular expressions 101

Debugging regular expressions is easy with https://regex101.com/

Metacharacters: ```. ^ $ * + ? { } [ ] \ | ( )```

Documentation: https://docs.python.org/3/library/re.html

In [5]:
import re

string = '__abc__acc__abc__a6c__'
print(re.findall(r'abc', string))
print(re.findall(r'a\dc', string))
print(re.findall(r'a\wc', string))

['abc', 'abc']
['a6c']
['abc', 'acc', 'abc', 'a6c']


In [6]:
print(re.sub(r'a\wc', '***', string))

__***__***__***__***__
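
Parentheses from the metacharacter list create capture groups, which `findall`, `sub` and match objects can all use. A minimal sketch on the same test string (the group name `digit` is an arbitrary choice):

```python
import re

string = '__abc__acc__abc__a6c__'

# With one group, findall returns only the captured part of each match
middle_chars = re.findall(r'a(\w)c', string)

# \1 in the replacement refers back to the first group
wrapped = re.sub(r'a(\w)c', r'<\1>', string)

# search returns the first match object (or None); groups can be named
match = re.search(r'a(?P<digit>\d)c', string)

print(middle_chars)
print(wrapped)
print(match.group('digit'), match.span())
```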


In [7]:
text = u'Français złoty Österreich'
pattern = r'\w+'
ascii_pattern = re.compile(pattern, re.ASCII)
unicode_pattern = re.compile(pattern)

print('Text    :', text)
print('Pattern :', pattern)
print('ASCII   :', list(ascii_pattern.findall(text)))
print('Unicode :', list(unicode_pattern.findall(text)))

Text    : Français złoty Österreich
Pattern : \w+
ASCII   : ['Fran', 'ais', 'z', 'oty', 'sterreich']
Unicode : ['Français', 'złoty', 'Österreich']


## Code Review
### git


Git. And GitHub.

http://rogerdudler.github.io/git-guide/

Don't worry, you'll need only `init`, `clone`, `add`, `commit` and `push` commands. Maybe some more.

If you're stuck, just ask for help!