# Text, bytes and fun..

## Functions!

Why oh why not again please . . .

Suppose you want to write an open-source library in Python. 

Consider a piece of very important core code of your library:

In [9]:
# this is my first open-source library

# does stuff
def some_really_complex_function(arg, another_arg, *here_go_more_args, **and_some_kwargs_too):
    print('Here I do some really complicated code')
    return 42

In [10]:
some_really_complex_function(1, 2)

Here I do some really complicated code


42

In [11]:
help(some_really_complex_function)

Help on function some_really_complex_function in module __main__:

some_really_complex_function(arg, another_arg, *here_go_more_args, **and_some_kwargs_too)
    # does stuff



How to make this function self-explanatory?

You should use docstrings!

In [3]:
def some_really_complex_function_with_docstring(arg, another_arg, *here_go_more_args, **and_some_kwargs_too):
    '''
    This is a really complex function.
    
    Args:
        arg: some argument
        another_arg: some other argument
        
    (c) Лягушонок Пеп <------- you see that?
    '''
    print('Here I do some really complicated code')
    # Yeah, and returns 42
    return 42

In [12]:
help(some_really_complex_function_with_docstring)

Help on function some_really_complex_function_with_docstring in module __main__:

some_really_complex_function_with_docstring(arg, another_arg, *here_go_more_args, **and_some_kwargs_too)
    This is a really complex function.
    
    Args:
        arg: some argument
        another_arg: some other argument
        
    (c) Лягушонок Пеп <------- you see that?
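
`help()` is not the only way to get at a docstring. A minimal sketch, assuming a made-up function `answer` (not part of the library above): the raw text lives in `__doc__`, and `inspect.getdoc` returns it with the common indentation removed.

```python
import inspect

# Hypothetical function, used only to illustrate docstring access.
def answer(question, *extra, **options):
    """Return the answer to any question.

    Args:
        question: the question being asked (ignored in this sketch)
    """
    return 42

# The raw docstring, indentation included, is stored on the function object.
raw_doc = answer.__doc__

# inspect.getdoc() strips the common leading indentation.
clean_doc = inspect.getdoc(answer)

print(clean_doc.splitlines()[0])
```

Note that a comment inside the function body (like `# Yeah, and returns 42`) never shows up in either of them.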



Wait a minute, does Python allow Unicode?

In [13]:
def дважды_принт(аргумент):
    print(аргумент * 2)

дважды_принт(42)

84


In [14]:
дважды_принт.__name__

'дважды_принт'

0_o with great power comes great responsibility

Let's harness it.

# Today's agenda
* Strings in Python 3.x
* `str` and `bytes`
* < Intermission > : Python's builtin functions
* Unicode support: encoding & decoding
* Encoding detection
* Regular expressions 101

Rule #0: you shouldn't use Cyrillic letters in code. `def дважды_принт():` is unacceptable.

## Some simple string methods

* Strings support slices

In [23]:
s = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit'
print(s[10 : 20])
print(s[20 : 10 : -1])
print(s[:: -1])

m dolor si
tis rolod 
tile gnicsipida rutetcesnoc ,tema tis rolod muspi meroL


* Strings are immutable, but support basic maths: concatenation (`+`) and repetition (`*`)

In [24]:
s += '!' * 2
print(s)
s[-1] = '?'

Lorem ipsum dolor sit amet, consectetur adipiscing elit!!


TypeError: 'str' object does not support item assignment
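
Since item assignment fails, "changing" a string always means building a new one. A small sketch of three common ways to do that (the variable names are just for illustration):

```python
s = 'Lorem ipsum!!'

# Slicing plus concatenation creates a new string object.
replaced_last = s[:-1] + '?'

# str.replace also returns a new string; the third argument limits the count.
replaced_first = s.replace('!', '?', 1)

# For many edits, convert to a list of characters and join at the end.
chars = list(s)
chars[0] = 'l'
lowered_first = ''.join(chars)

print(replaced_last, replaced_first, lowered_first, s, sep='\n')
```

The original `s` is untouched in all three cases.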

* String join / split

In [11]:
print(s.split())
print(s.split('sit'))

['Lorem', 'ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit!!']
['Lorem ipsum dolor ', ' amet, consectetur adipiscing elit!!']


In [26]:
words = s.split(' ')
words

['Lorem',
 'ipsum',
 'dolor',
 'sit',
 'amet,',
 'consectetur',
 'adipiscing',
 'elit!!']

In [27]:
print(''.join(words))
print(' '.join(words))
print(' ^_^ '.join(words))

Loremipsumdolorsitamet,consecteturadipiscingelit!!
Lorem ipsum dolor sit amet, consectetur adipiscing elit!!
Lorem ^_^ ipsum ^_^ dolor ^_^ sit ^_^ amet, ^_^ consectetur ^_^ adipiscing ^_^ elit!!


* Basic transformations: upper, lower

In [28]:
s.upper()

'LOREM IPSUM DOLOR SIT AMET, CONSECTETUR ADIPISCING ELIT!!'

In [29]:
s.lower()

'lorem ipsum dolor sit amet, consectetur adipiscing elit!!'

In [16]:
s.lower().capitalize()

'Lorem ipsum dolor sit amet, consectetur adipiscing elit!!'

* Substring search

In [30]:
'lorem' in s, 'lorem' in s.lower()

(False, True)

In [31]:
s.find('ipsum') # returns first index

6

In [32]:
s.find('nonexistent') # or -1 

-1
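
`find` has a few relatives worth knowing. A short sketch on a shortened copy of the string: `rfind` searches from the right, `count` counts matches, and `index` raises an exception instead of returning -1.

```python
s = 'Lorem ipsum dolor sit amet'

first_o = s.find('o')     # index of the first match
last_o = s.rfind('o')     # index of the last match
how_many = s.count('o')   # number of non-overlapping matches

# str.index behaves like find, but raises ValueError when nothing is found.
try:
    s.index('nonexistent')
    found = True
except ValueError:
    found = False

print(first_o, last_o, how_many, found)
```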

* String examination: isalpha, isdigit etc

In [34]:
strings = ['abc', '25', '   ']

print('\t\t'.join('string isalpha isdigit isspace'.split()))
for s in strings:
    print("\"{}\"".format(s), s.isalpha(), s.isdigit(), s.isspace(), sep='\t\t')

string		isalpha		isdigit		isspace
"abc"		True		False		False
"25"		False		True		False
"   "		False		False		True
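
With Unicode in play, `isdigit` is broader than it looks. A sketch comparing it with `isdecimal` and `isnumeric` on a few hand-picked characters (the sample list is my own choice, not from the lecture):

```python
# '\u00b2' is SUPERSCRIPT TWO, '\u00bd' is VULGAR FRACTION ONE HALF,
# '\u0663' is ARABIC-INDIC DIGIT THREE.
samples = ['25', '\u00b2', '\u00bd', '\u0663']

checks = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}
for s, flags in checks.items():
    print(repr(s), *flags, sep='\t')

# int() accepts decimal digits from any script, but not superscripts.
arabic_three = int('\u0663')
try:
    int('\u00b2')
    superscript_ok = True
except ValueError:
    superscript_ok = False
```

So `isdecimal` is the safer check before calling `int()`.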


* Misc: startswith, endswith, strip

In [35]:
'Hello, world!'.startswith('Hel')

True

In [36]:
'Hello, world!'.endswith('world')

False

In [40]:
'''    Hello world    

    '''.strip().upper()

'HELLO WORLD'

### String formatting

https://pyformat.info/

In [44]:
# Old style

name = 'Bob'
print('Hello, ' + name + ' ' + 'Marley') # bad
print('Hello, %s %s' % (name, 'Marley')) # good

Hello, Bob Marley
Hello, Bob Marley


In [45]:
# New style

'Hello, {}'.format(name)

'Hello, Bob'

In [46]:
number = 50159747054

print('Hey {}, I have a decimal number {}!'.format(name, number))

# Supports indexes
print('Sammy is {} {} {} {}!'.format('a', 'happy', 'blue', 'shark'))
print('Sammy is {3} {2} {1} {0}!'.format('a', 'happy', 'blue', 'shark'))
print('Sammy is {1} {1} {1} {3}!'.format('a', 'happy', 'blue', 'shark'))

# and named args
print('Coordinates: {latitude}, {longitude}'.format(latitude=37.24, longitude=-115.81))

Hey Bob, I have a decimal number 50159747054!
Sammy is a happy blue shark!
Sammy is shark blue happy a!
Sammy is happy happy happy shark!
Coordinates: 37.24, -115.81
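
Two more things that `format` can do, sketched below: format specifications after a colon (alignment, precision, thousands separators) and, assuming Python 3.6 or newer, f-strings that inline the variables directly.

```python
name = 'Bob'
number = 50159747054

# f-strings (Python 3.6+) evaluate the expression inside the braces.
greeting = f'Hello, {name}!'

# A format spec follows the colon; the same spec works in str.format.
grouped = f'{number:,}'                     # thousands separators
fixed = '{:.2f}'.format(3.14159)            # two digits after the point
padded = f'{name:>6}|{name:<6}|{name:^7}'   # right, left, centered
hexed = f'{255:#x}'                         # hex with the 0x prefix

print(greeting, grouped, fixed, padded, hexed, sep='\n')
```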


## Symbols & bytes

Unicode: a symbol's identifier (the code point of a character such as `r`, `Я` or `韩`) != its byte representation.

You can switch between a Unicode symbol (e.g. `𝄞`, which is U+1D11E) and its integer code point with `ord` and `chr`:

In [47]:
ord('r')

114

In [49]:
ord('\U0001D11E'), chr(ord('\U0001D11E'))

(119070, '𝄞')

In [50]:
ord('韩')

38889

In [51]:
chr(ord('韩')) == '韩'

True
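
Every code point also has an official name. A small sketch using the standard `unicodedata` module to look names up in both directions (the character list is just the examples from this section):

```python
import unicodedata

# 'r', CYRILLIC YA, a CJK ideograph and the musical G clef.
chars = ['r', '\u042f', '\u97e9', '\U0001D11E']

names = {char: unicodedata.name(char) for char in chars}
for char, name in names.items():
    # U+XXXX is the conventional way to write a code point.
    print(f'U+{ord(char):04X}', name, char)

# The reverse direction: from an official name to the character.
snowman = unicodedata.lookup('SNOWMAN')
```

The same names work inside string literals as `'\N{SNOWMAN}'`.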

How can we store it in memory?

```
Encoding: symbols (human-readable) -> bytes 
Decoding: bytes -> symbols
```

In [52]:
s = 'café'
len(s)

4

In [53]:
b = s.encode('utf8')
print(len(b), type(b))
b # binary representation

5 <class 'bytes'>


b'caf\xc3\xa9'

In [54]:
b.decode('utf8')

'café'

In [55]:
print(b)
for byte in b:
    print(hex(byte), byte, chr(byte))

b'caf\xc3\xa9'
0x63 99 c
0x61 97 a
0x66 102 f
0xc3 195 Ã
0xa9 169 ©
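
That is why `len(s)` and `len(b)` differ: UTF-8 uses a variable number of bytes per symbol. A quick sketch that measures a few characters from this section:

```python
# 'c', e with acute accent, a CJK ideograph, the musical G clef.
chars = ['c', '\u00e9', '\u97e9', '\U0001D11E']

widths = {}
for char in chars:
    encoded = char.encode('utf-8')
    widths[char] = len(encoded)
    print(char, len(encoded), encoded.hex())

# Indexing bytes gives an int, slicing gives bytes.
b = 'caf\u00e9'.encode('utf-8')
first_byte, first_slice = b[0], b[:1]
```

So iterating over `bytes` yields integers, and `chr(byte)` on a single byte of a multi-byte sequence gives an unrelated character, as in the loop above.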


## < Fun >
Let's print the full ASCII table (plus the next 128 code points, up to 255) with one print statement

### First, some Python built-in functions

* `map(function, *iterables)`

In [75]:
map(chr, range(50, 70))

<map at 0x7f88270aa940>

In [57]:
print(list(map(int, input().split())))

2 3 4 5 6
[2, 3, 4, 5, 6]


In [76]:
print(list(map(chr, range(50, 70))))

['2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E']


In [67]:
string = '2 3 5 7 11 13' # input()
print(string.split())
integers = list(map(int, string.split()))
integers

['2', '3', '5', '7', '11', '13']


[2, 3, 5, 7, 11, 13]
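
The `<map at 0x...>` output above is a hint that `map` is lazy: it returns an iterator that computes values on demand and can be consumed only once. A sketch of what that means in practice:

```python
numbers = [1, 2, 3]

squares = map(lambda x: x * x, numbers)
first_pass = list(squares)    # consumes the iterator
second_pass = list(squares)   # nothing left the second time

# With several iterables, the function receives one item from each.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))

# A list comprehension is often the more readable equivalent.
same_as_comprehension = [x * x for x in numbers]

print(first_pass, second_pass, sums)
```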

* `zip(*iterables)`

In [58]:
names = ['Bob', 'Sam']
surnames = ['Bronx', 'Jackson']
print(zip(names, surnames))
print(dict(zip(names, surnames)))

<zip object at 0x7fd9004a8dc8>
{'Bob': 'Bronx', 'Sam': 'Jackson'}
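
Two details of `zip` that bite sooner or later, sketched with a slightly longer list of made-up names: it stops at the shortest iterable, and `zip(*pairs)` undoes the zipping.

```python
from itertools import zip_longest

names = ['Bob', 'Sam', 'Ann']
surnames = ['Bronx', 'Jackson']

# zip silently stops when the shortest input runs out.
pairs = list(zip(names, surnames))

# zip_longest pads the shorter inputs instead.
padded = list(zip_longest(names, surnames, fillvalue='?'))

# Unpacking the pairs back into zip "unzips" them into tuples.
first_names, last_names = zip(*pairs)

print(pairs, padded, first_names, last_names, sep='\n')
```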


### ASCII table

In [110]:
for char_index in range(256):
    print('{}\t||\t{}\t||\t{}'.format(char_index, hex(char_index), chr(char_index)))

0	||	0x0	||	 
1	||	0x1	||	
2	||	0x2	||	
3	||	0x3	||	
4	||	0x4	||	
5	||	0x5	||	
6	||	0x6	||	
7	||	0x7	||	
8	||	0x8	||	
9	||	0x9	||		
10	||	0xa	||	

11	||	0xb	||	
12	||	0xc	||	
13	||	0xd	||	
14	||	0xe	||	
15	||	0xf	||	
16	||	0x10	||	
17	||	0x11	||	
18	||	0x12	||	
19	||	0x13	||	
20	||	0x14	||	
21	||	0x15	||	
22	||	0x16	||	
23	||	0x17	||	
24	||	0x18	||	
25	||	0x19	||	
26	||	0x1a	||	
27	||	0x1b	||	
28	||	0x1c	||	
29	||	0x1d	||	
30	||	0x1e	||	
31	||	0x1f	||	
32	||	0x20	||	 
33	||	0x21	||	!
34	||	0x22	||	"
35	||	0x23	||	#
36	||	0x24	||	$
37	||	0x25	||	%
38	||	0x26	||	&
39	||	0x27	||	'
40	||	0x28	||	(
41	||	0x29	||	)
42	||	0x2a	||	*
43	||	0x2b	||	+
44	||	0x2c	||	,
45	||	0x2d	||	-
46	||	0x2e	||	.
47	||	0x2f	||	/
48	||	0x30	||	0
49	||	0x31	||	1
50	||	0x32	||	2
51	||	0x33	||	3
52	||	0x34	||	4
53	||	0x35	||	5
54	||	0x36	||	6
55	||	0x37	||	7
56	||	0x38	||	8
57	||	0x39	||	9
58	||	0x3a	||	:
59	||	0x3b	||	;
60	||	0x3c	||	<
61	||	0x3d	||	=
62	||	0x3e	||	>
63	||	0x3f	||	?
64

Can you do this in one line?

In [113]:
# ascii_range = range(256)

print(*['\t||\t'.join(map(str, row)) for row in zip(range(256), map(hex, range(256)), map(chr, range(256)))], sep='\n')

(Same output as the loop above.)
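
The first 32 rows look empty because they are control characters. A sketch of a friendlier table, assuming "ASCII" means code points 0..127 only: `repr()` makes control characters visible and `str.isprintable()` tells the two groups apart.

```python
# Assumption: ASCII is code points 0..127; anything above is not ASCII.
ascii_chars = [chr(i) for i in range(128)]

# isprintable() is False for control characters such as '\n' or '\x7f'.
printable = [c for c in ascii_chars if c.isprintable()]
print(len(printable), repr(printable[0]), repr(printable[-1]))

# repr() shows escapes instead of sending raw control codes to the terminal.
rows = ['{}\t{}\t{!r}'.format(i, hex(i), chr(i)) for i in range(128)]
print(*rows[8:12], sep='\n')

# Code points above 127 cannot be encoded as ASCII.
try:
    chr(233).encode('ascii')
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
```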

## < /Fun >

Which encodings can we use? Lots.

In [59]:
string = 'El Niño'

for codec in ['latin_1', 'utf_8', 'utf_16', 'cp437']:
    encoded = string.encode(codec)
    print(codec, encoded.decode(codec), encoded, sep='\t\t')

latin_1		El Niño		b'El Ni\xf1o'
utf_8		El Niño		b'El Ni\xc3\xb1o'
utf_16		El Niño		b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
cp437		El Niño		b'El Ni\xa4o'


Some of them won't work:

In [61]:
city = 'São Paulo'
print(city.encode('utf_8'))
print(city.encode('utf_16'))
print(city.encode('iso8859_1'))
print(city.encode('cp437'))

b'S\xc3\xa3o Paulo'
b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
b'S\xe3o Paulo'


UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

Oops!

In [62]:
print(city.encode('cp437', errors='ignore')) # bad
print(city.encode('cp437', errors='replace')) # better
print(city.encode('cp437', errors='xmlcharrefreplace')) # still not perfect

b'So Paulo'
b'S?o Paulo'
b'S&#227;o Paulo'
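
The `errors` argument works in the other direction too. A sketch of what happens when Latin-1 bytes are decoded with the wrong codec (UTF-8) under different error handlers:

```python
# 'S\u00e3o Paulo' is the city name from above, encoded as Latin-1.
raw = 'S\u00e3o Paulo'.encode('latin_1')   # b'S\xe3o Paulo'

# The default handler is 'strict': a lone 0xe3 byte is not valid UTF-8.
try:
    raw.decode('utf_8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode('utf_8', errors='ignore')             # drops the byte
replaced = raw.decode('utf_8', errors='replace')           # inserts U+FFFD
escaped = raw.decode('utf_8', errors='backslashreplace')   # keeps it visible

print(strict_ok, ignored, replaced, escaped, sep='\n')
```

None of the handlers recovers the original letter; only decoding with the right codec does.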


You can also look at the `codecs` module. It was widely used in Python 2 for reading and writing encoded files; in Python 3 the built-in `open(..., encoding=...)` covers most of those cases.

## Files support

In [64]:
with open('unicode_file.txt', 'w', encoding='utf-16le') as f:
    f.write('韩国烧酒')

In [65]:
with open('unicode_file.txt', 'r') as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

In [66]:
with open('unicode_file.txt', 'r', encoding='utf-16') as f:
    print(f.read())

UnicodeError: UTF-16 stream does not start with BOM

In [67]:
with open('unicode_file.txt', 'r', encoding='utf-16le') as f:
    print(f.read())

韩国烧酒
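
The BOM error above comes from the difference between `utf-16le` (fixed byte order, no marker) and plain `utf-16` (byte order announced by a two-byte mark at the start). A sketch that writes the same text with plain `utf-16` into a temporary file (the file name is arbitrary) and inspects the raw bytes:

```python
import codecs
import os
import tempfile

text = '\u97e9\u56fd\u70e7\u9152'   # the same four characters as above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bom_demo.txt')
    # Plain 'utf-16' writes a byte order mark (BOM) before the text.
    with open(path, 'w', encoding='utf-16') as f:
        f.write(text)
    with open(path, 'rb') as f:
        raw = f.read()
    # With the BOM in place, plain 'utf-16' can read the file back.
    with open(path, 'r', encoding='utf-16') as f:
        round_trip = f.read()

has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
no_bom = text.encode('utf-16-le')   # explicit byte order, no BOM

print(has_bom, len(raw), len(no_bom), round_trip == text)
```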


So far we didn't encounter any **tough** cases.

We can handle them with `chardet` and `UnicodeDammit`, which is part of the Beautiful Soup (`bs4`) package.

### chardet

In [42]:
# !pip install chardet

In [43]:
import urllib.request  # a bare `import urllib` does not reliably load the `request` submodule
import chardet

rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
chardet.detect(rawdata)

{'confidence': 0.99, 'encoding': 'utf-8', 'language': ''}
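
If `chardet` is not available, a much cruder fallback is to try a short list of candidate encodings in order. This is not how `chardet` works (it uses statistics); `guess_encoding` and its candidate list are hypothetical, a sketch only:

```python
# Hypothetical helper: returns the first candidate that decodes without errors.
def guess_encoding(raw, candidates=('ascii', 'utf-8', 'latin_1')):
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeError:
            continue
        return name
    return None

# latin_1 maps every byte to a character, so it never fails:
# keep it last and treat it as "unknown 8-bit encoding".
print(guess_encoding(b'hello'))
print(guess_encoding('caf\u00e9'.encode('utf-8')))
print(guess_encoding('caf\u00e9'.encode('latin_1')))
```

A successful decode only means the bytes are valid in that encoding, not that the text is what the author wrote.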

### UnicodeDammit

In [44]:
# !pip install bs4

In [68]:
from bs4 import UnicodeDammit

Let's build a really weird string


In [70]:
snowmen = ('\N{SNOWMAN}' * 3)
print(snowmen)
quote = ('\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}')
print(quote)
doc = snowmen.encode('utf8') + quote.encode('windows_1252')

☃☃☃
“I like snowmen!”


In [71]:
print(doc)
# raw bytes: three UTF-8 snowmen followed by Windows-1252 curly quotes

print(doc.decode('windows-1252'))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

print(doc.decode('utf8'))
# raises UnicodeDecodeError: the Windows-1252 quote byte 0x93 is not valid UTF-8

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\x93I like snowmen!\x94'
â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 9: invalid start byte

UnicodeDammit CAN handle it!

In [72]:
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc)
print(new_doc.decode('utf8'))
# ☃☃☃“I like snowmen!”

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\xe2\x80\x9cI like snowmen!\xe2\x80\x9d'
☃☃☃“I like snowmen!”


## Regular expressions 101

Debugging regular expressions is easy with https://regex101.com/

Metasymbols: ```. ^ $ * + ? { } [ ] | ( )```

In [74]:
import re

string = '__abc__acc__abc__a6c__'
print(re.findall(r'abc', string))
print(re.findall(r'a\dc', string))
print(re.findall(r'a\wc', string))
print(re.findall(r'.*', string))

['abc', 'abc']
['a6c']
['abc', 'acc', 'abc', 'a6c']
['__abc__acc__abc__a6c__', '']


In [75]:
print(re.sub(r'a\wc', '***', string))

__***__***__***__***__
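
Two more basics, sketched on made-up strings: quantifiers such as `*` are greedy unless followed by `?`, and parentheses create groups that can be pulled out of a match.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' that completes a match.
lazy = re.findall(r'<.*?>', html)

# Groups: findall returns tuples, search gives access by name.
settings = 'width=640, height=480'
pairs = re.findall(r'(\w+)=(\d+)', settings)
match = re.search(r'(?P<key>\w+)=(?P<value>\d+)', settings)

print(greedy, lazy, pairs, match.group('key'), match.group('value'), sep='\n')
```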


In [77]:
text = u'Français złoty Österreich'
pattern = r'\w+'
ascii_pattern = re.compile(pattern, re.ASCII)
unicode_pattern = re.compile(pattern)

print('Text    :', text)
print('Pattern :', pattern)
print('ASCII   :', list(ascii_pattern.findall(text)))
print('Unicode :', list(unicode_pattern.findall(text)))

Text    : Français złoty Österreich
Pattern : \w+
ASCII   : ['Fran', 'ais', 'z', 'oty', 'sterreich']
Unicode : ['Français', 'złoty', 'Österreich']
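
The same flag affects `\d`. A sketch with a made-up string that contains a non-Latin decimal digit:

```python
import re

# '\u0663' is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
text = 'room \u0663, floor 42'

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r'\d+', text)

# With re.ASCII, \d is limited to 0-9.
ascii_digits = re.findall(r'\d+', text, flags=re.ASCII)

print(unicode_digits, ascii_digits)
```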


## So you want to write an open-source library, remember?
### Or just pass your 1st Python review

How will you do that? Using Version Control Systems!

Git especially. And GitHub.

http://rogerdudler.github.io/git-guide/

Don't worry, you'll need only `init`, `clone`, `add`, `commit` and `push` commands. Maybe some more.

If you're stuck, just ask for help!