# Introduction to Python and Natural Language Technologies

__Lecture 02, Type system and built-in types__

__Feb 16, 2021__

__Judit Ács__

# Type system

__Dynamic__:
- No need to declare variables
- The `=` operator binds a reference to any arbitrary object

In [1]:
i = 2
type(i), id(i)

(int, 94508059921760)

In [2]:
i = "foo"
type(i), id(i)

(str, 140413869225072)

__Strongly typed__:
- Most implicit conversions are disallowed such as between strings and numeric types:

In [3]:
print("I am " + 20 + " years old")  # raises TypeError

TypeError: can only concatenate str (not "int") to str

We must explicitly cast it:

In [None]:
print("I am " + str(20) + " years old")

Note that many other languages, such as JavaScript, allow implicit conversion:

In [None]:
%%javascript

element.text("I am " + 20 + " years old")

In [None]:
%%javascript

element.text("1" + 1)

In [None]:
%%javascript

element.text(1 + "1")

Conversions between numeric types are OK:

In [None]:
i = 2
f = 1.2
s = i + f
print(type(i), type(f))
print(type(i + f))
print(s)

# Boolean operators and type

Three boolean operators: `and`, `or` and `not`

In [None]:
x = 2

x < 2 or x >= 2

In [4]:
x > 0 and x < 10

True

In [None]:
not x < 0

Two boolean values: `True` and `False` (must be capitalized)

In [None]:
x = True
type(x)

In [None]:
True and False

In [None]:
True or False

In [None]:
not True

# Bitwise operators

Python has 6 bitwise operators: `<<`, `>>`, `|`, `&`, `~`, `^`

Demonstrating each with the binary representation of numbers:

In [None]:
bin(0b00010 << 2)

In [None]:
bin(0b00010 >> 1)

In [None]:
bin(0b1010 | 0b1001)

In [None]:
bin(0b1010 & 0b1001)

`~x` returns the bitwise complement of `x`, which is the same as `-x - 1`:

In [None]:
x = 0b1010
print(f"{x = }")
print(f"{bin(x) = }")
print(f"{~x = }")
print(f"{bin(~x) = }")
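The operator list above also names `^` (XOR), which none of the cells demonstrates. A minimal sketch using the same operands as the `|` and `&` cells; the variable names are only illustrative:

```python
# Illustrative names; the operands match the | and & cells above.
a = 0b1010
b = 0b1001

xor = a ^ b            # bits that are set in exactly one operand
print(bin(xor))

# Applying XOR with the same mask twice restores the original value.
restored = xor ^ b
print(restored == a)

# For Python integers, ~x always equals -x - 1.
print(~a == -a - 1)
```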

# Numeric types

Three numeric types: __`int`__, __`float`__ and __`complex`__ <br/>

An object's type is derived from its initial value:

In [None]:
i = 2
f = 1.2
c = 1+2j

type(i), type(f), type(c)

Implicit conversion between numeric types is supported in arithmetic operations. The result takes the wider of the operand types (`int` < `float` < `complex`):

In [None]:
c2 = i + c
print(c2, type(c2))
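The widening rule can be checked directly. The sketch below also shows that converting an `int` to a `float` is not always lossless: a float has a 53-bit significand, so some large integers cannot be represented exactly. The variable names are illustrative.

```python
i, f, c = 2, 1.2, 1 + 2j

print(type(i + f).__name__)   # int + float
print(type(f + c).__name__)   # float + complex

# 2**53 + 1 needs 54 significant bits, one more than a float can hold.
big = 2**53 + 1
as_float = big + 0.0          # the int is converted to float first

print(as_float == big)        # int/float comparison is exact
print(int(as_float) == 2**53)
```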

Floats can be defined using __scientific notation__:

In [None]:
3e2, 3E2

## Underscores in numeric literals

Examples from [PEP 515](https://www.python.org/dev/peps/pep-0515/):

In [None]:
# grouping decimal numbers by thousands
amount = 10_000_000.0

# grouping hexadecimal addresses by words
addr = 0xCAFE_F00D

# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110

# same, for string conversions
flags = int('0b_1111_0000', 2)

# two or more consecutive underscores are not allowed
# 2__4  # SyntaxError

## Precision and range

Integers have unlimited precision.

In [None]:
type(2**63 + 1)

More information in `sys.int_info`:

In [None]:
import sys
sys.int_info

Floats are usually implemented using C's double.<br />
Complex numbers use two floats for their real and imaginary parts.<br/>
Check `sys.float_info` for more information:

In [None]:
import sys
sys.float_info
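Because floats are binary doubles, many decimal fractions are only approximated. A short sketch of the practical consequences, using only `math` and `sys` from the standard library:

```python
import math
import sys

total = 0.1 + 0.2

print(total == 0.3)               # exact comparison of approximations
print(math.isclose(total, 0.3))   # comparison with a relative tolerance

# Gap between 1.0 and the next representable float.
print(sys.float_info.epsilon)

# Multiplying past the largest finite float overflows to infinity.
overflowed = sys.float_info.max * 10
print(overflowed)
```

For equality tests on floats, `math.isclose` is usually the safer choice.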

## Arithmetic operators

Addition, subtraction and multiplication return the wider of the operand types (`int` < `float` < `complex`), which loses the least information:

In [None]:
i = 2
f = 4.2
c = 4.1-3j

s1 = i + f
s2 = f - c
s3 = i * c
print(s1, type(s1))
print(s2, type(s2))
print(s3, type(s3))

__quotient operator__

In Python 3, the `/` operator computes the float quotient even if both operands are integers, unlike in C++ or Python 2:

In [5]:
3 / 2

1.5

In [6]:
%%python2

print(3 / 2)

1


Explicit __floor quotient operator__

In [7]:
-3.0 // 2, 3 // 2

(-2.0, 1)
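Floor division rounds toward negative infinity, not toward zero, and `%` is defined to match it. A small sketch checking the identity `a == (a // b) * b + a % b` for positive and negative operands:

```python
results = []
for a, b in [(7, 2), (-7, 2), (7, -2)]:
    q, r = divmod(a, b)            # same as (a // b, a % b)
    assert (q, r) == (a // b, a % b)
    assert a == q * b + r          # the identity linking // and %
    results.append((a, b, q, r))
    print(a, b, q, r)

# int() truncates toward zero instead, so it differs for negative values.
print(int(-7 / 2), -7 // 2)
```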

## Comparison operators

In [8]:
x = 23
x < 24, x >= 22

(True, True)

Operators can be chained:

In [9]:
23 < x < 100  # equivalent to (23 < x) and (x < 100)

False

In [10]:
23 <= x < 100

True

## Other operators for numeric types

__remainder__

In [11]:
5 % 3

2

__power__

In [12]:
2 ** 3

8

Using it for square root:

In [13]:
2 ** 0.5

1.4142135623730951

__absolute value__

In [14]:
abs(-2 - 1j), abs(1+1j), abs(-2)

(2.23606797749979, 1.4142135623730951, 2)

__round__

In [15]:
round(2.3456), round(2.3456, 2)

(2, 2.35)
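Two details of `round` often surprise newcomers: exact halves are rounded to the nearest even integer, and decimal inputs may already be slightly off because of their binary representation. A brief sketch:

```python
# Exact halves go to the nearest even integer ("banker's rounding").
halves = [round(x) for x in (0.5, 1.5, 2.5, 3.5)]
print(halves)

# 2.675 is stored as a value slightly below 2.675, so it rounds down.
rounded = round(2.675, 2)
print(rounded)
```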

## Explicit conversions between numeric types

In [16]:
float(2)  # 2.

2.0

In [17]:
int(2.5), int(-2.5)

(2, -2)

## `math` and `cmath`

Additional operations for real and complex numbers:

In [18]:
import math

math.log(16), math.log(16, 2), math.exp(2), \
math.exp(math.log(10))

(2.772588722239781, 4.0, 7.38905609893065, 10.000000000000002)
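The cell above only uses `math`. A minimal sketch of the `cmath` counterpart, which accepts and returns complex numbers where `math` would raise an error:

```python
import cmath
import math

root = cmath.sqrt(-1)          # complex square root
print(root)

try:
    math.sqrt(-1)              # math works on real numbers only
except ValueError as err:
    print("math.sqrt failed:", err)

# Polar form: modulus and phase (in radians) of a complex number.
r, phi = cmath.polar(1j)
print(r, phi)
```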

# Mutable vs. immutable types

- Instances of mutable types can be modified in place
- Immutable objects have the same value during their lifetime
- Are numeric types mutable or immutable?

In [19]:
x = 2
old_id = id(x)
x += 1
print(x, id(x) == old_id)

3 False


Booleans are singleton immutable objects.

In [20]:
x = True
y = False
print(x is y)
z = False
print(z is y)

False
True


In [21]:
(2 == 2) is (3 == 3), (2 != 2) is (3 < 1)

(True, True)

Lists are mutable:

In [22]:
l1 = [0, 1]
old_id = id(l1)
l1.append(2)
old_id == id(l1)

True

# Sequence types

All sequences support the following basic operations:

| operation | behaviour |
| :----- | :----- |
| `x in s` | 	True if an item of s is equal to x, else False |
| `x not in s` | 	False if an item of s is equal to x, else True |
| `s + t` | 	the concatenation of s and t |
| `s * n or n * s` | 	equivalent to adding s to itself n times |
| `s[i]` | 	ith item of s, origin 0 |
| `s[i:j]` | 	slice of s from i to j |
| `s[i:j:k]` | 	slice of s from i to j with step k |
| `len(s)` | 	length of s |
| `min(s)` | 	smallest item of s |
| `max(s)` | 	largest item of s |
| `s.index(x[, i[, j]])` | 	index of the first occurrence of x in s (at or after index i and before index j) |
| `s.count(x)` | 	total number of occurrences of x in s |

[Table source](https://docs.python.org/3/library/stdtypes.html#common-sequence-operations)

Sidenote: `x not in s` is _Pythonic_, `not x in s` is not:

In [23]:
5 not in [ 2, 3, 4]  # this is Pythonic
not 5 in [2, 3, 4]  # this is less Pythonic

True

## Traversing sequences

All sequence types can be traversed with for loops:

In [24]:
l = [1, -1, "foo", 2, "bar"]
for element in l:
    print(element)

1
-1
foo
2
bar


__`enumerate`__

If we need the indices too, the built-in `enumerate` function iterates over index-element pairs:

In [25]:
for i, element in enumerate(l):
    print(i, element)

0 1
1 -1
2 foo
3 2
4 bar


## `list`

Mutable sequence type:

In [26]:
l = [1, 2, 2, 3]
print(l, l[1])

[1, 2, 2, 3] 2


In [27]:
l[4]  # raises IndexError

IndexError: list index out of range

Negative indexing supported:

In [None]:
l[-1], l[len(l)-1]

But it can also be out of range:

In [None]:
l[-5]  # raises IndexError

__`append`__ adds an element to the end of the list:

In [None]:
l = [1, 2, 3]
l.append(3)
l

In [None]:
for e in l:
    print(id(e), e)

__`insert`__ inserts an element at a specific index:

In [None]:
l = [1, 2, 3]
l.insert(1, 5)
l

__`extend`__

In [None]:
l = [1, 2]
l.extend([3, 4, 5])
len(l), l
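`append` and `extend` are easy to mix up. The sketch below contrasts them on the same input; the list names are illustrative.

```python
nested = [1, 2]
nested.append([3, 4])    # adds ONE element, which happens to be a list
print(nested, len(nested))

flat = [1, 2]
flat.extend([3, 4])      # adds each element of the argument
print(flat, len(flat))

# extend accepts any iterable, including strings.
chars = []
chars.extend("ab")
print(chars)
```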

### Under the hood

`append` is amortized O(1), `insert` is O(n) because every element after the insertion point has to be shifted:

In [None]:
l = list(range(100))

In [None]:
%%timeit -n 100

l.append(0)

In [None]:
l = list(range(100))

In [None]:
%%timeit -n 100

l.insert(50, 0)

More about time complexity [here](https://wiki.python.org/moin/TimeComplexity).
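Outside a notebook, the same comparison can be made with the standard `timeit` module. This is a rough sketch, not a rigorous benchmark; the helper names and the sizes are arbitrary choices made for illustration.

```python
import timeit

# Hypothetical helper names, used only for this comparison.
def build_by_append(n):
    items = []
    for i in range(n):
        items.append(i)       # amortized O(1) per call
    return items

def build_by_front_insert(n):
    items = []
    for i in range(n):
        items.insert(0, i)    # shifts all existing elements on each call
    return items

n = 2000
t_append = timeit.timeit(lambda: build_by_append(n), number=20)
t_insert = timeit.timeit(lambda: build_by_front_insert(n), number=20)
print(f"append: {t_append:.4f} s, insert at front: {t_insert:.4f} s")
```

Absolute timings depend on the machine; the gap should widen as `n` grows.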

__Advanced indexing, ranges__

In [28]:
l = []
for i in range(20):
    l.append(2*i + 1)
l

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39]

In [29]:
l[2:5]

[5, 7, 9]

In [30]:
l[:3], l[-4:]

([1, 3, 5], [33, 35, 37, 39])

This can be used for sliding windows:

In [31]:
for i in range(10):
    print(l[i:i+3])

[1, 3, 5]
[3, 5, 7]
[5, 7, 9]
[7, 9, 11]
[9, 11, 13]
[11, 13, 15]
[13, 15, 17]
[15, 17, 19]
[17, 19, 21]
[19, 21, 23]


Step argument:

In [32]:
l[2:10:3] 

[5, 11, 17]

Reversing a list:

In [33]:
l[::-1]

[39, 37, 35, 33, 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3, 1]

Slicing creates a new object:

In [34]:
l[2:3] is l

False

Lists are __mutable__, elements can be added or changed:

In [35]:
l = []
old_id = id(l)

for element in range(1, 3):
    l.append(element)
    print(id(l) == old_id)
l

True
True


[1, 2]

The `=` operator binds another name to the same object. No copy is made, not even a shallow one; use `list.copy` to get a new list object:

In [36]:
l = [1 ,2]
l2 = l
print(l is l2)

l2.append(42)
l

True


[1, 2, 42]

In [37]:
l.copy() is l

False
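`list.copy` makes a shallow copy: the outer list is new, but nested objects are still shared. A short sketch comparing plain assignment, a shallow copy and `copy.deepcopy` from the standard library (variable names are illustrative):

```python
import copy

grid = [[1, 2], [3, 4]]
alias = grid                  # same object, second name
shallow = grid.copy()         # new outer list, shared inner lists
deep = copy.deepcopy(grid)    # fully independent copy

grid[0].append(99)            # mutate a nested list
grid.append([5])              # mutate the outer list

print(alias is grid)
print(shallow)                # sees the nested change only
print(deep)                   # sees neither change
```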

Elements don't need to be of the same type:

In [38]:
l = [1, -1, "foo", 2, "bar"]
l

[1, -1, 'foo', 2, 'bar']

__Sorting lists__

Lists can be sorted using the built-in `sorted` function which returns a new list object.

In [39]:
l = [3, -1, 2, 11]

for e in sorted(l):
    print(e)
    
l

-1
2
3
11


[3, -1, 2, 11]

In [40]:
l is sorted(l)

False

`list.sort` sorts in place and returns `None`:

In [41]:
l = [3, -1, 2, 11]
l.sort()
l

[-1, 2, 3, 11]

In [42]:
l1 = [3, -1, 2]
l2 = sorted(l1)
print(f"{(l1 is l2) = }")
print(f"{l1 = }")
print(f"{l2 = }")

(l1 is l2) = False
l1 = [3, -1, 2]
l2 = [-1, 2, 3]


The sorting key can be specified using the `key` argument:

In [43]:
shopping_list = [
    ["apple", 5],
    ["pear", 2],
    ["milk", 1],
    ["bread", 3],
]

for product in sorted(shopping_list, key=lambda x: -x[1]):
    print(product)

['apple', 5]
['bread', 3]
['pear', 2]
['milk', 1]


In [44]:
sorted(shopping_list, key=lambda x: -x[1])

[['apple', 5], ['bread', 3], ['pear', 2], ['milk', 1]]
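Negating the key only works for numbers. A more general sketch uses the `reverse` argument together with `operator.itemgetter`; Python's sort is stable, so items with equal keys keep their original relative order. The extra `"plum"` entry is added here only to show a tie.

```python
from operator import itemgetter

shopping_list = [
    ["apple", 5],
    ["pear", 2],
    ["milk", 1],
    ["bread", 3],
    ["plum", 2],    # illustrative extra item with the same count as "pear"
]

# itemgetter(1) does the same job as lambda x: x[1]
by_count_desc = sorted(shopping_list, key=itemgetter(1), reverse=True)
names = [name for name, count in by_count_desc]
print(names)
```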

## tuple

Tuples are immutable sequences:

In [45]:
t = ()  # empty tuple
# t = tuple()
print(f"{type(t) = }")
print(f"{len(t) = }")
print(f"{t = }")

type(t) = <class 'tuple'>
len(t) = 0
t = ()


In [46]:
t = (1, 2, 3, "foo")
print(f"{type(t) = }")
print(f"{len(t) = }")
print(f"{t = }")

type(t) = <class 'tuple'>
len(t) = 4
t = (1, 2, 3, 'foo')


Tuples can be indexed the same way as lists:

In [47]:
t[1], t[-1], t[:3]

(2, 'foo', (1, 2, 3))

Tuples with one element need a trailing comma, since `(2)` is just a parenthesized integer:

In [48]:
t = (2,)
type(t), len(t)

(tuple, 1)

Tuples contain immutable references. However, the referenced objects themselves may be mutable, such as lists:

In [49]:
t = ([1, 2, 3], "foo")
# t[0]= "bar"  # raises TypeError

In [50]:
t = ([1, 2, 3], "foo")

# save the id of the first element of the tuple (the list)
old_id = id(t[0])

# change an element of the list (not the tuple!)
t[0][1] = 11
t[0].append(13)

# did the id of the list change?
print(id(t[0]) == old_id)
t

True


([1, 11, 3, 13], 'foo')

__Memory footprint__ of lists and tuples

Note that [`sys.getsizeof`](https://docs.python.org/3/library/sys.html#sys.getsizeof) is only guaranteed to return the correct size for built-in types.

In [51]:
import sys

print(sys.getsizeof(list(range(10000))))
print(sys.getsizeof(tuple(range(10000))))
sys.getsizeof((2, 3)), sys.getsizeof([2, 3])

80056
80040


(56, 72)

Very large integers use more memory than small ones:

In [52]:
sys.getsizeof(2), sys.getsizeof(2**70)

(28, 36)

# Strings

- Strings are **immutable** sequences of Unicode code points
- These code points range from 0 to 0x10FFFF
- No separate character type. Some functions can only be used on characters, a.k.a. strings of length 1
- Strings can be constructed in various ways

In [53]:
single = 'ab\'c'
double = "ab\"c"
print(single, double)
multiline = """
sdfajfklasj;
sdfsdfs
sdfsdf
"""
single == double

ab'c ab"c


False

Newlines in multiline strings are kept:

In [54]:
print(multiline)


sdfajfklasj;
sdfsdfs
sdfsdf



Immutability makes it impossible to change a string in place (unlike C-style strings or C++'s `std::string`):

In [55]:
s = "abc"

# s[1] = "c"  # TypeError

All string operations create new objects:

In [56]:
s = "abc"
old_id = id(s)
s += "def"
id(s) == old_id

False

Strings support most operations available for lists such as __advanced indexing__:

In [57]:
s = "abcdefghijkl"
s[::2]

'acegik'

## Character encodings - Unicode

Unicode provides a mapping from letters to code points or numbers:
  
| character | Unicode code point |
| ---- | ---- |
| a | U+0061 |
| ő | U+0151 | 
| ش | U+0634 |
| گ | U+06AF |
| ¿ | U+00BF |
| ư | U+01B0 |
| Ң | U+04A2 |
| ⛵ | U+26F5 |

- These are abstract code points
- Actual text needs to be stored as a byte array/sequence (byte strings)
- Character encoding: code point - byte array correspondence

- **Encoding**: Unicode code point $\rightarrow$ byte sequence
- **Decoding**: byte sequence $\rightarrow$ Unicode code point
- Most popular encoding: UTF-8. Make sure that your terminal uses UTF-8. This can be verified with the `locale` command on Linux/macOS.

| character | Unicode code point | UTF-8 byte sequence |
| ---- | ---- | ---- |
| a | U+0061 | 61 |
| ő | U+0151 | C5 91 |
| ش | U+0634 | D8 B4 |
| گ | U+06AF | DA AF |
| ¿ | U+00BF | C2 BF |
| ư | U+01B0 | C6 B0 |
| Ң | U+04A2 | D2 A2 |
| ⛵ | U+26F5 | E2 9B B5 |

__chr__ and __ord__

We can look up code points with `ord` and similarly convert them back to characters with `chr`:

In [58]:
ship = chr(0x26f5)
ship

'⛵'

In [59]:
ord(ship), hex(ord(ship)), oct(ord(ship)), bin(ord(ship))

(9973, '0x26f5', '0o23365', '0b10011011110101')

In [60]:
# ord("ab")  # raises TypeError

Python 3 automatically **encodes** Unicode strings when:
- Writing to a file opened in text mode
- Printing
- Performing any other operation that requires conversion to a byte string

It also automatically **decodes** byte sequences when reading from a file opened in text mode.

**IMPORTANT** Python2 does not do this automatically.
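The automatic conversion is easiest to see by writing a string in text mode and reading the same file back in binary mode. This is a small sketch; the file name and the sample text are arbitrary, and passing `encoding` explicitly avoids depending on the platform default.

```python
import os
import tempfile

text = "árvíztűrő"    # arbitrary sample with non-ASCII letters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")    # hypothetical file name

    # Text mode: str is encoded to bytes on write.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: the raw bytes, no decoding.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: bytes are decoded back to str on read.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(type(raw), len(raw), len(text))
print(decoded == text)
```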

## `bytes` in Python 3

- Immutable sequence of bytes.
    - This corresponds to Python2's `str` type.
    - Python3's `str` was `unicode` in Python2.
- Python3 strings can be encoded resulting in a bytes object:

In [61]:
unicode_string = "ábc"
utf8_string = unicode_string.encode("utf8")
type(utf8_string), utf8_string, len(utf8_string), len(unicode_string)

(bytes, b'\xc3\xa1bc', 4, 3)
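With `ord` and `str.encode`, the rows of the UTF-8 table shown earlier can be reproduced. A minimal sketch; `utf8_hex` is a helper name made up for this example, and `bytes.hex` with a separator requires Python 3.8 or later.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated hex (illustrative helper)."""
    return ch.encode("utf-8").hex(" ").upper()

for ch in "a\u0151\u00bf\u26f5":     # a, ő, ¿, ⛵
    print(ch, f"U+{ord(ch):04X}", utf8_hex(ch))
```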

Different encodings result in different byte sequences:

In [62]:
unicode_string = "ábc"
utf8_string = unicode_string.encode("utf8")
utf16_string = unicode_string.encode("utf16")
latin2_string = unicode_string.encode("latin2")

type(unicode_string), type(utf8_string), type(utf16_string), type(latin2_string)

(str, bytes, bytes, bytes)

Their length is different too:

In [63]:
len(unicode_string), len(utf8_string), len(utf16_string), len(latin2_string)

(3, 4, 8, 3)

`str.encode` raises an error when a particular encoding does not support the input. Latin-1 (ISO 8859-1) does not support the Hungarian ő:

In [64]:
unicode_string = "őbc"
latin1_string = unicode_string.encode("latin1")  # raises UnicodeEncodeError

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151' in position 0: ordinal not in range(256)
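When raising is not desirable, the `errors` argument of `str.encode` selects a different strategy. The sketch below shows three of the standard error handlers and the matching failure on the decoding side:

```python
text = "\u0151bc"    # "őbc"

replaced = text.encode("latin1", errors="replace")            # unsupported char becomes ?
ignored = text.encode("latin1", errors="ignore")              # unsupported char is dropped
escaped = text.encode("latin1", errors="xmlcharrefreplace")   # numeric character reference
print(replaced, ignored, escaped)

# Decoding with the wrong codec fails in a similar way.
raw = text.encode("utf-8")
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("decode failed:", err)

print(raw.decode("utf-8") == text)
```

Note that `replace` and `ignore` lose information, so the original string cannot be recovered from their output.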

## String operations

Large variety of basic string manipulation: lower, upper, title

In [None]:
"abC".upper(), "ABC".lower(), "abc def".title()

Concatenation with `+`:

In [None]:
s = "\tabc  \n"
print("<START>" + s + "<STOP>")

In [None]:
s = "\tabc  \n"
s.strip()

In [None]:
s.rstrip()

In [65]:
s.lstrip()

'abc  \n'

In [66]:
"abca".strip("ba")

'c'

Since each method returns a new string, calls can be chained one after another:

In [67]:
" abcd abc".strip().rstrip("c").lstrip("ab")

'cd ab'

__Binary predicates (i.e. yes-no questions)__

In [68]:
"abc".startswith("ab"), "abc".endswith("cd")  # "abc"[:2] == "ab"

(True, False)

In [69]:
%%timeit -n 1000

"abcsdfsfsdfadadasdasdasfasdasdfadsfaf"[:2] == "absdfsfsfsd"

73.9 ns ± 0.597 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [70]:
%%timeit -n 1000

"abcsdfsfsdfadadasdasdasfasdasdfadsfaf".startswith("absdfsdfsf")

83.6 ns ± 0.829 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [71]:
"abc".istitle(), "Abc".istitle()

(False, True)

In [72]:
"  \t\n".isspace()

True

In [73]:
"989".isdigit(), "1.5".isdigit()

(True, False)

__`split`__

In [74]:
s = "the quick brown fox jumps over the lazy dog"
words = s.split()
words

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [75]:
s = "R.E.M."
s.split(".")

['R', 'E', 'M', '']

__`maxsplit` and `partition`__

In [76]:
config = "name=namewith=sign"

config.split("=", maxsplit=1)

['name', 'namewith=sign']

In [77]:
"name=namewith=sign".partition("=")

('name', '=', 'namewith=sign')

In [78]:
"batch_size: 12".partition(":")

('batch_size', ':', ' 12')
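`partition` always returns three items, which makes it convenient for parsing `key: value` lines. A sketch under the assumption that the first colon separates key and value; `parse_line` is a hypothetical helper, not a standard function.

```python
def parse_line(line):
    """Split 'key: value' at the first colon (hypothetical helper)."""
    key, sep, value = line.partition(":")
    if not sep:                  # separator not found: sep is ''
        return None
    return key.strip(), value.strip()

parsed = parse_line("batch_size: 12")
with_colon = parse_line("url: http://example.com")   # later colons stay in the value
missing = parse_line("no separator here")
print(parsed, with_colon, missing)
```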

__`join`__

In [79]:
"-".join(words)

'the-quick-brown-fox-jumps-over-the-lazy-dog'

Use explicit token separators:

In [80]:
" <W> ".join(words)

'the <W> quick <W> brown <W> fox <W> jumps <W> over <W> the <W> lazy <W> dog'

In [81]:
print("\n".join(words))

the
quick
brown
fox
jumps
over
the
lazy
dog


## String formatting

Python features several string formatting options.

__`str.format`__

- non-str objects are automatically converted to str
  - under the hood: the object's `__format__` method is called; the default implementation inherited from `object` falls back to `__str__`

In [82]:
name = "John"
age = 25

print("My name is {0} and I'm {1} years old. I turned {1} last December".format(name, age))

print("My name is {} and I'm {} years old.".format(name, age))
# print("My name is {} and I'm {} years old. I turned {} last December".format(name, age))  # raises IndexError
print("My name is {name} and I'm {age} years old. I turned {age} last December".format(name=name, age=age))

My name is John and I'm 25 years old. I turned 25 last December
My name is John and I'm 25 years old.
My name is John and I'm 25 years old. I turned 25 last December


[Format specification mini language](https://docs.python.org/3/library/string.html#formatspec)
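A few common format specifications from the mini language, as a quick sketch. The same specifiers work after the colon in both `str.format` fields and f-strings.

```python
pi = 3.14159265

formatted = [
    f"{pi:.2f}",             # fixed-point with 2 decimals
    f"{'abc':>8}",           # right-aligned in 8 columns
    f"{42:05d}",             # zero-padded to width 5
    f"{255:#x}",             # hexadecimal with 0x prefix
    f"{1234567:,}",          # thousands separator
    f"{0.256:.1%}",          # percentage with 1 decimal
    "{:^7}".format("mid"),   # centered in 7 columns
]

for item in formatted:
    print(repr(item))
```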

__% operator__

- note that the arguments need to be parenthesized (make it a tuple)

In [83]:
print("My name is %s and I'm %d years old" % (name, age))

My name is John and I'm 25 years old


__f-strings or string interpolation__

f-strings were added in Python 3.6 by [PEP 498](https://www.python.org/dev/peps/pep-0498/)

In [84]:
import sys

name = "John"
age = 42

if sys.version_info >= (3, 6):
    print(f"My name is {name} and I'm {age} years old")

My name is John and I'm 42 years old


Aside from variables, expressions can be evaluated in f-strings:

In [85]:
s = "abc"
f"Length of string: {len(s)}, contents: {s}"

'Length of string: 3, contents: abc'

Self-documenting support was added in Python 3.8:

In [86]:
s1 = "abc"
s2 = "aBc"

print(f"{s1 = }")
print(f"{s2 = }")
print(f"{(s1 == s2) = }")
print(f"{(s1.lower() == s2.lower()) = }")

s1 = 'abc'
s2 = 'aBc'
(s1 == s2) = False
(s1.lower() == s2.lower()) = True


# dictionary

The only built-in mapping type: it maps keys to values.

Dictionaries can be defined in a number of ways:

In [87]:
d = {}  # empty dictionary  same as d = dict()
d["apple"] = 12
d["plum"] = 2
d

{'apple': 12, 'plum': 2}

Equivalent to:

In [88]:
d = {"apple": 12, "plum": 2}
d

{'apple': 12, 'plum': 2}

Or:

In [89]:
d = dict(apple=12, plum=2)
d

{'apple': 12, 'plum': 2}

__removing keys__

In [90]:
del d["apple"]
d

{'plum': 2}
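Indexing or deleting a key that is not present raises `KeyError`. A short sketch of the usual ways to handle missing keys (`get`, `pop` with a default, `in`), followed by a typical use of `get` for counting:

```python
d = {"apple": 12, "plum": 2}

try:
    d["pear"]                      # missing key
except KeyError as err:
    print("missing key:", err)

print(d.get("pear"))               # None instead of an exception
print(d.get("pear", 0))            # explicit default
removed = d.pop("pear", None)      # delete if present, no exception otherwise
print("pear" in d)

# Counting with get: start from 0 when the word is new.
counts = {}
for word in "the quick brown fox jumps over the lazy dog".split():
    counts[word] = counts.get(word, 0) + 1
print(counts["the"], counts["fox"])
```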

__Iterating dictionaries__

Keys and values can be iterated separately or together.

Iterating keys:

In [91]:
d = {"apple": 12, "plum": 2}
for key in d.keys():  # same as for key in d:
    print(key, d[key])

apple 12
plum 2


Iterating values:

In [92]:
for value in d.values():
    print(value)

12
2


Iterating both:

In [93]:
for key, value in d.items():
    print(key, value)

apple 12
plum 2


In [94]:
for pair in d.items():
    print(type(pair), pair)

<class 'tuple'> ('apple', 12)
<class 'tuple'> ('plum', 2)


__Under the hood__

Dictionaries are hash tables (similar to C++'s `std::unordered_map`).
- Constraints on keys: they must be hashable, which in practice means that built-in mutable containers, and anything containing them, cannot be keys
- Keys can be mixed type.

In [95]:
d = {}
d[1] = "a"  # numeric types are immutable
d[3+2j] = "b"
d["cde"] = 1.0
d

{1: 'a', (3+2j): 'b', 'cde': 1.0}

- tuples are immutable too

In [96]:
d[("apple", 1)] = -2
d

{1: 'a', (3+2j): 'b', 'cde': 1.0, ('apple', 1): -2}

- however lists are not

In [97]:
# d[["apple", 1]] = 12  # raises TypeError

__Q. Can these be dictionary keys?__

In [98]:
key1 = (2, (3, 4))
key2 = (2, [], (3, 4))

d = {}
d[key1] = 1
# d[key2] = 2  # raises TypeError
d

{(2, (3, 4)): 1}
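Whether an object can be a key can be tested by calling the built-in `hash` on it. A minimal sketch; `is_hashable` is a helper name invented for this example.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

checks = {
    "nested tuple": is_hashable((2, (3, 4))),
    "tuple with list": is_hashable((2, [], (3, 4))),
    "frozenset": is_hashable(frozenset([1, 2])),
    "set": is_hashable({1, 2}),
}
print(checks)
```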

__Dictionaries preserve insertion order__ (guaranteed since Python 3.7)

In [99]:
d1 = {}
d1['apple'] = 12
d1['plum'] = 3
for key, value in d1.items():
    print(key, value)

apple 12
plum 3


In [100]:
d2 = {}
d2['plum'] = 3
d2['apple'] = 12
for key, value in d2.items():
    print(key, value)

plum 3
apple 12


Equality ignores insertion order: `d1` and `d2` are different objects with equal content:

In [101]:
d1 == d2, d1 is d2

(True, False)

# set

- Collection of unique, hashable elements
- Implements basic set operations (intersection, union, difference)
- Sets are mutable

In [102]:
s = set()
s.add(2)
s.add(3)
s.add(2)
# s = {2, 3, 2}
s

{2, 3}

In [103]:
s = {2, 3, 2}  # d = {'a': 2}
type(s), s

(set, {2, 3})

In [104]:
# s = {[1, 2], 2, 3}  # raises TypeError, lists are not hashable

In [105]:
s = {2, 3}
old_id = id(s)
s.add(10)
s, old_id == id(s)

({2, 3, 10}, True)

__Deleting elements__

In [106]:
s.add(2)
s.remove(2)
# s.remove(2)  # raises KeyError, since we already removed this element
s.discard(2)  # removes if present, does not raise exception

## frozenset

Immutable counterpart of set:

In [107]:
fs = frozenset([1, 2])
# fs.add(2)  # raises AttributeError

In [108]:
fs = frozenset([1, 2])
s = {1, 2}

d = dict()
d[fs] = 1
# d[s] = 2  # raises TypeError
d

{frozenset({1, 2}): 1}

## set operations

- implemented as
  1. methods
  2. overloaded operators

In [109]:
s1 = {1, 2, 3, 4, 5}
s2 = {2, 5, 6, 7}

s1 & s2  # s1.intersection(s2) or s2.intersection(s1)

{2, 5}

In [110]:
s1 | s2  # s1.union(s2) OR s2.union(s1)

{1, 2, 3, 4, 5, 6, 7}

In [111]:
s1 - s2, s2 - s1  # s1.difference(s2), s2.difference(s1)

({1, 3, 4}, {6, 7})

These operations return new sets

In [112]:
s3 = s1 & s2
type(s3), s3 is s1, s3 is s2

(set, False, False)

Subset testing returns a boolean:

In [113]:
s1 < (s2 | s1)  # proper subset test; s1 <= t is the same as s1.issubset(t)

True
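The difference between `<` (proper subset) and `<=` (subset, same as `issubset`) is easy to overlook. A brief sketch, which also shows symmetric difference and `isdisjoint` on the same sets:

```python
s1 = {1, 2, 3, 4, 5}
s2 = {2, 5, 6, 7}
common = s1 & s2

print(common <= s1, common < s1)   # subset and proper subset
print(s1 <= s1, s1 < s1)           # a set is a subset of itself, but not a proper one

sym_diff = s1 ^ s2                 # elements in exactly one of the sets
print(sym_diff)

print(s1.isdisjoint({8, 9}))       # no common elements
```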

## Useful set properties

Creating a set is a convenient way of getting the unique elements of a sequence

In [114]:
l = [1, 2, 3, -1, 1, 2, 1, 0]
uniq = set(l)
uniq

{-1, 0, 1, 2, 3}
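A set does not remember the order in which elements first appeared. If that order matters, one common idiom relies on dictionaries preserving insertion order; a minimal sketch:

```python
l = [1, 2, 3, -1, 1, 2, 1, 0]

# dict.fromkeys keeps the first occurrence of each key, in order.
uniq_ordered = list(dict.fromkeys(l))
print(uniq_ordered)

# The set contains the same elements, without any ordering guarantee.
print(set(uniq_ordered) == set(l))
```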

## Under the hood

- sets and dictionaries provide average-case O(1) lookup
- in contrast, lists provide O(n) lookup

In [115]:
import random

# let's define our alphabet
letters = "abcdef"
# we generate strings of length 1 to 5
word_len = [1, 2, 3, 4, 5]
# we generate 10000 examples
N = 10000
samples = []

for i in range(N):
    word = []
    # sample a word length
    this_len = random.choice(word_len)
    for j in range(this_len):
        # sample a character from the alphabet and add it to the 'word'
        word.append(random.choice(letters))
    samples.append("".join(word))
    
samples = list(set(samples))

In [116]:
len(samples)

2998

__list lookup__

In [117]:
%%timeit

word = []
for j in range(random.choice(word_len)):
    word.append(random.choice(letters))
word = "".join(word)

word in samples

35.4 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


__set lookup__

In [118]:
samples_set = set(samples)
len(samples_set), len(samples)

(2998, 2998)

In [119]:
%%timeit

word = []
for j in range(random.choice(word_len)):
    word.append(random.choice(letters))
word = "".join(word)

word in samples_set

2.33 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# Miscellaneous

## Mutable default arguments

Mutable default arguments are bound to the function object so they are shared across all function calls:

In [120]:
def insert_value(value, l=[]):
    l.append(value)
    print(l)
    
l1 = []
insert_value(12, l1)
l2 = [-3]
insert_value(14, l2)

[12]
[-3, 14]


In [121]:
insert_value(-1)

[-1]


In [122]:
insert_value(-3)

[-1, -3]
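The shared list can be inspected directly: default values are stored on the function object in its `__defaults__` attribute. A small sketch that redefines a function equivalent to the one above, returning the list instead of printing it:

```python
def insert_value(value, l=[]):
    l.append(value)
    return l

before = insert_value.__defaults__[0].copy()   # snapshot of the default list
print(insert_value.__defaults__)

insert_value(-1)
insert_value(-3)

# The same list object has accumulated values across calls.
print(insert_value.__defaults__)
```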


It's best to avoid using mutable defaults.

One solution is to create a new list inside a function if no list is provided:

In [123]:
def insert_value(value, l=None):
    if l is None:
        l = []
    l.append(value)
    return l

l = insert_value(2)
l

[2]

In [124]:
insert_value(12)

[12]

## Lambda expressions

- Unnamed functions
- May take parameters
- Can access local scope

In [125]:
words = ["Plum", "pear", "Apple", "peach"]
sorted(words)

['Apple', 'Plum', 'peach', 'pear']

Let's sort this in a case insensitive way:

In [126]:
def foo(w):
    return w.lower()

sorted(words, key=foo)
sorted(words, key=lambda w: w.lower())

['Apple', 'peach', 'pear', 'Plum']

In [127]:
sorted([1, -1, 0, 4, -5], key=abs)

[0, 1, -1, 4, -5]

Let's sort them by word length AND alphabetically (case insensitive).

We use the fact that tuples are compared elementwise:

In [128]:
(2, "b") < (3, "a")

True

In [129]:
(2, "b") < (2, "a", "a")

False

We need two keys: the lengths and the lowercase word form:

In [130]:
sorted(words, key=lambda w: (len(w), w.lower()))

['pear', 'Plum', 'Apple', 'peach']
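The lambdas above use only their own parameter. As a sketch of the "can access local scope" point, the lambda below also reads `center`, a local variable of the enclosing function; `sort_by_distance` is a hypothetical name chosen for this example.

```python
def sort_by_distance(numbers, center):
    """Sort numbers by their distance from center (hypothetical helper)."""
    # The lambda reads `center` from the enclosing function's scope.
    return sorted(numbers, key=lambda n: abs(n - center))

near_three = sort_by_distance([1, -1, 0, 4, -5], center=3)
near_zero = sort_by_distance([1, -1, 0, 4, -5], center=0)
print(near_three)
print(near_zero)
```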

# Mandatory reading

- [A4 page printable PEP8 cheat sheet](https://www.kbsoftware.co.uk/docs/_downloads/pep8_cheat.pdf)
- [Introduction to character encodings](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) by Joel Spolsky, co-founder of Stack Overflow

# Suggested reading

- [Time complexity of various operations under CPython](https://wiki.python.org/moin/TimeComplexity)
- [String formatting mini language](https://docs.python.org/3/library/string.html#formatspec)

# Reference

- [Official documentation of built-in types](https://docs.python.org/3/library/stdtypes.html)