Skip to content

Latest commit

 

History

History
1469 lines (994 loc) · 24.2 KB

20130227_stdlib_collections.rst

File metadata and controls

1469 lines (994 loc) · 24.2 KB

Rediscovering the stdlib

The collections module

from collections import (
    deque, defaultdict, namedtuple,
    OrderedDict, Counter
)

About me

Baptiste Mispelon (bmispelon)

I've been using python for over 5 years.

Currently doing web development with django at M2BPO.

About this presentation

No advanced python knowledge required.

Lots of small code examples.

Available online.

Introduction

Python has "batteries included" philosophy,

Introduction

Python has "batteries included" philosophy,

but they are easy to miss if you don't know about them.

What is this talk about?

The :navy:`collections` module.

5 specialized container datatypes.

ABC (not presented today).

1: collections.deque

1: collections.deque

List-like container with fast appends and pops on either end.

Introduced in python 2.4.

Short for "doubled-ended queue".

Time Complexity

Operation list deque
append    
pop (right)    
prepend    
pop (left)    

Time Complexity

Operation list deque
append O(1)  
pop (right)    
prepend    
pop (left)    

Time Complexity

Operation list deque
append O(1)  
pop (right) O(1)  
prepend    
pop (left)    

Time Complexity

Operation list deque
append O(1)  
pop (right) O(1)  
prepend O(n)  
pop (left)    

Time Complexity

Operation list deque
append O(1)  
pop (right) O(1)  
prepend O(n)  
pop (left) O(n)  

Time Complexity

Operation list deque
append O(1) O(1)
pop (right) O(1) O(1)
prepend O(n) O(1)
pop (left) O(n) O(1)

Initialization

deque(iterable, maxlen=None)

Both parameters are optional.

Iterable: list, str, dict, generator, ...

Initialization

deque() # ?

Initialization

deque() # deque([])

Initialization

deque() # deque([])
deque('abc') # ?

Initialization

deque() # deque([])
deque('abc') # deque(['a', 'b', 'c'])

Initialization

deque() # deque([])
deque('abc') # deque(['a', 'b', 'c'])
deque(xrange(1000, 0, -1), maxlen=3)
# ?

Initialization

deque() # deque([])
deque('abc') # deque(['a', 'b', 'c'])
deque(xrange(1000, 0, -1), maxlen=3)
# deque([3, 2, 1], maxlen=3)

Practical Example

Print the last 20 lines of a file.

$ tail -n 20 some_file.txt

Approach 1: Naive

with open('some_file.txt') as f:
    last = list(f)[-20:] ###
print "\n".join(last)

Approach 2: More Robust

last = []
with open('some_file.txt') as f:
    for line in f:
        last.append(line)
        if len(last) > 20:
            last.pop(0) ###
print "\n".join(last)

Approach 3: deque

with open('some_file.txt') as f:
    last = deque(f, maxlen=20) ###
print "\n".join(last)

2: collections.defaultdict

2: collections.defaultdict

Dict subclass that calls a factory function to supply missing values.

Introduced in python 2.5.

Initialization

# defaultdict(factory, *a, **kw)
defaultdict(f)
defaultdict(f, {'foo': 'bar'})
defaultdict(f, [('foo', 'bar')])
defaultdict(f, foo='bar')

What's a factory?

Any callable without required arguments:

  • functions (named or anonymous)
  • classes
  • instance methods

Factory examples:

bool  
int  
float  
complex  
str  
list  
dict  

Factory examples:

bool() False
int  
float  
complex  
str  
list  
dict  

Factory examples:

bool() False
int() 0
float  
complex  
str  
list  
dict  

Factory examples:

bool() False
int() 0
float() 0.0
complex  
str  
list  
dict  

Factory examples:

bool() False
int() 0
float() 0.0
complex() 0j
str  
list  
dict  

Factory examples:

bool() False
int() 0
float() 0.0
complex() 0j
str() ''
list  
dict  

Factory examples:

bool() False
int() 0
float() 0.0
complex() 0j
str() ''
list() []
dict  

Factory examples:

bool() False
int() 0
float() 0.0
complex() 0j
str() ''
list() []
dict() {}

Missing Keys

# Regular dict:
rd = dict(foo='bar')
rd['foo']     # ?

Missing Keys

# Regular dict:
rd = dict(foo='bar')
rd['foo']     # 'bar'

Missing Keys

# Regular dict:
rd = dict(foo='bar')
rd['foo']     # 'bar'
rd['missing'] # ?

Missing Keys

# Regular dict:
rd = dict(foo='bar')
rd['foo']     # 'bar'
rd['missing'] # KeyError

Missing Keys

factory = lambda: 'X'
dd = defaultdict(factory, foo='bar')
dd['foo']     # ?

Missing Keys

factory = lambda: 'X'
dd = defaultdict(factory, foo='bar')
dd['foo']     # 'bar'

Missing Keys

factory = lambda: 'X'
dd = defaultdict(factory, foo='bar')
dd['foo']     # 'bar'
dd['missing'] # ?

Missing Keys

factory = lambda: 'X'
dd = defaultdict(factory, foo='bar')
dd['foo']     # 'bar'
dd['missing'] # 'X'

Practical Example

Given a list of payments (date, amount),

we want to get a mapping of {date: [amounts]}

Approach 1: naive

d = {}
for date, amount in L:
    if date not in L:
        d[date] = []
    d[date].append(amount)

Approach 2: Improved

d = {}
for date, amount in L:
    l = d.setdefault(date, [])
    l.append(amount)

Approach 3: defaultdict

d = defaultdict(list)
for date, amount in L:
    d[date].append(amount)

3: collections.namedtuple

3: collections.namedtuple

Factory function for creating tuple subclasses with named fields.

Introduced in python 2.6.

What is a named tuple?

It's a subclass of tuple whose attributes can also be accessed by name (not just by position).

It uses as much memory as a regular tuple.

Like a tuple, it's immutable.

Initialization

# cls = namedtuple(name, fields)

Initialization

# cls = namedtuple(name, fields)
namedtuple('Point', ['x', 'y', 'z'])

Initialization

# cls = namedtuple(name, fields)
namedtuple('Point', ['x', 'y', 'z'])
namedtuple('Point', 'x y z')

Initialization

# cls = namedtuple(name, fields)
namedtuple('Point', ['x', 'y', 'z'])
namedtuple('Point', 'x y z')
namedtuple('Point', 'x,y,z')

Initialization

# cls = namedtuple(name, fields)
namedtuple('Point', ['x', 'y', 'z'])
namedtuple('Point', 'x y z')
namedtuple('Point', 'x,y,z')
# Creates **classes**,
#     not **instances**.

Creation

Point = namedtuple('Point', 'x y z')

Creation

Point = namedtuple('Point', 'x y z')
p = Point(23, 10, 85)

Creation

Point = namedtuple('Point', 'x y z')
p = Point(23, 10, 85)
p = Point(x=23, y=10, z=85)

Creation

Point = namedtuple('Point', 'x y z')
p = Point(23, 10, 85)
p = Point(x=23, y=10, z=85)
p = Point(y=10, x=23, z=85)

Accessing Fields

p = Point(x=23, y=10, z=85)

Accessing Fields

p = Point(x=23, y=10, z=85)
p[0] # ?

Accessing Fields

p = Point(x=23, y=10, z=85)
p[0] # 23

Accessing Fields

p = Point(x=23, y=10, z=85)
p[0] # 23
p.z  # ?

Accessing Fields

p = Point(x=23, y=10, z=85)
p[0] # 23
p.z  # 85

Attributes And Methods

p = Point(x=23, y=10, z=85)
p._fields # ?

Attributes And Methods

p = Point(x=23, y=10, z=85)
p._fields # ['x', 'y', 'z']

Attributes And Methods

p1 = Point(x=23, y=10, z=85)

Attributes And Methods

p1 = Point(x=23, y=10, z=85)
p2 = p1._replace(z=56)

Attributes And Methods

p1 = Point(x=23, y=10, z=85)
p2 = p1._replace(z=56)
tuple(p2) # ?

Attributes And Methods

p1 = Point(x=23, y=10, z=85)
p2 = p1._replace(z=56)
tuple(p2) # (23, 10, 56)

Attributes And Methods

p = Point(x=23, y=10, z=85)
p._asdict()
# ?

Attributes And Methods

p = Point(x=23, y=10, z=85)
p._asdict()
# {'x': 23, 'y': 10, 'z': 85}

Attributes And Methods

p = Point(x=23, y=10, z=85)
p._asdict()
# {'x': 23, 'y': 10, 'z': 85}
# Actually an OrderedDict instance

4: collections.OrderedDict

4: collections.OrderedDict

Dict subclass that remembers the order entries were added.

Introduced in python 2.7.

Initialization

Identical to dict.

OrderedDict(mapping)
OrderedDict(iterable)
OrderedDict(**kwargs)

dict Has No Order

d = {}
for char in 'abc':
    d[char] = None
print ''.join(d) # ?

dict Has No Order

d = {}
for char in 'abc':
    d[char] = None
print ''.join(d) # 'acb'

dict Has No Order

d = {}
for char in 'abc':
    d[char] = None
print ''.join(d) # 'acb'
# Actually,
# it depends on python's version.

dict Has No Order

d = {}
for char in 'abc':
    d[char] = None
print ''.join(d) # 'acb'
# Actually,
# it depends on python's version.
# And it's also affected by -R flag.

But ...

Order is consistent between two iterations if no items have been added or deleted.

zip(d.keys(), d.values())
# Same as d.items()

Shortcut (fromkeys)

d = {}
for char in 'abc':
    d[char] = None

Shortcut (fromkeys)

d = {}
for char in 'abc':
    d[char] = None
# Can be written as:
d = dict.fromkeys('abc')

Bringing Back Order

OrderDict are ordered by insertion order:

d = OrderedDict.fromkeys('abc')
''.join(d) # ?

Bringing Back Order

OrderDict are ordered by insertion order:

d = OrderedDict.fromkeys('abc')
''.join(d) # 'abc'

Keys are still unique!

d = OrderedDict.fromkeys('abcda')
''.join(d) # ?

Keys are still unique!

d = OrderedDict.fromkeys('abcda')
''.join(d) # 'bcda'

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # ?

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # True

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # True
od1 == od2 # ?

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # True
od1 == od2 # False

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # True
od1 == od2 # False
od1 == rd2 # ?

Comparisons

od1 = OrderedDict.fromkeys('abc')
od2 = OrderedDict.fromkeys('cba')
rd1 = dict.fromkeys('abc')
rd2 = dict.fromkeys('cba')

rd1 == rd2 # True
od1 == od2 # False
od1 == rd2 # True

5: collections.Counter

5: collections.Counter

Dict subclass for counting hashable objects.

Introduced in python 2.7.

5: collections.Counter

Basically a defaultdict(int).

With some useful methods for counting stuff.

Initialization

Counter(iterable)

Iterable: list, str, dict, generator, ...

Counting Characters

# First approach, with a plain dict:
counter = {}
for letter in 'ababac':
    if letter not in counter:
        counter[letter] = 0
    counter[letter] += 1
counter['a'] # 3

Counting Characters

# Third approach, with a defaultdict
counter = defaultdict(int)
for letter in 'ababac':
    counter[letter] += 1
counter['a'] # 3

Counting Characters

# Finally, with a Counter
counter = Counter('ababac')
counter['a'] # 3

Counter Is A dict!

counter = Counter('aaabbc')
counter['a'] # ?

Counter Is A dict!

counter = Counter('aaabbc')
counter['a'] # 3

Counter Is A dict!

counter = Counter('aaabbc')
counter['a'] # 3
counter['d'] # ?

Counter Is A dict!

counter = Counter('aaabbc')
counter['a'] # 3
counter['d'] # 0

Counter Is A dict!

counter = Counter('aaabbc')
sum(counter.values()) # ?

Counter Is A dict!

counter = Counter('aaabbc')
sum(counter.values()) # 6
# Total number of elements

Counter Is A dict!

counter = Counter('aaabbc')
len(counter) # ?

Counter Is A dict!

counter = Counter('aaabbc')
len(counter) # 3
# Total number of unique elements:

Counter Is A dict!

counter = Counter('aaabbc')
counter.keys()
# ?

Counter Is A dict!

counter = Counter('aaabbc')
counter.keys()
# ['a', 'b', 'c']
# List of unique elements

Counter.elements()

# List elements, with repetition
c = Counter('ababac')
c.elements()
# ?

Counter.elements()

# List elements, with repetition
c = Counter('ababac')
c.elements()
# ['a', 'a', 'a', 'b', 'b', 'c']

Counter.elements()

# List elements, with repetition
c = Counter('ababac')
c.elements()
# ['a', 'a', 'a', 'b', 'b', 'c']
# The order is arbitrary!

Counter.most_common()

# List (element, count)
c = Counter('aaabbc')
c.most_common()
# ?

Counter.most_common()

# List (element, count)
c = Counter('aaabbc')
c.most_common()
# [('a', 3), ('b', 2), ('c', 1)]

Counter.most_common(N)

c = Counter('aaabbc')
c.most_common(2)
# ?

Counter.most_common(N)

c = Counter('aaabbc')
c.most_common(2)
# [('a', 3), ('b', 2)]

Counter.update()

# Add elements
c = Counter('aaabb')
c.update(a=2, b=1)
c # ?

Counter.update()

# Add elements
c = Counter('aaabb')
c.update(a=2, b=1)
c # Counter({'a': 5, 'b': 3})

Counter.subtract()

# subtract elements
c = Counter('aaabb')
c.subtract(a=2, b=1)
c # ?

Counter.subtract()

# subtract elements
c = Counter('aaabb')
c.subtract(a=2, b=1)
c # Counter({'a': 1, 'b': 1})

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # ?

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # ?

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # Counter({'a': 2})

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # Counter({'a': 2})
c1 & c2 # ?

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # Counter({'a': 2})
c1 & c2 # Counter({'a': 2, 'b': 2})

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # Counter({'a': 2})
c1 & c2 # Counter({'a': 2, 'b': 2})
c1 | c2 # ?

Arithmetic Operations

c1 = Counter('aaaabb')
c2 = Counter('aabbb')
c1 + c2 # Counter({'a': 6, 'b': 5})
c1 - c2 # Counter({'a': 2})
c1 & c2 # Counter({'a': 2, 'b': 2})
c1 | c2 # Counter({'a': 4, 'b': 3})

Conclusion (TL;DL)

  • deque: list with fast operations on both sides;
  • defaultdict: dict with default values;
  • namedtuple: tuple with named attributes;
  • OrderedDict: ordered dict;
  • Counter: counts things.

Questions?

Bonus: Recipes

Formatting With Defaults

d = dict(y='yes', no='no')
'%(y)s %(n)s' % d
# ?

Formatting With Defaults

d = dict(y='yes', n='no')
'%(y)s %(n)s' % d
# 'yes no'

Formatting With Defaults

d = dict(y='yes')
'%(y)s %(n)s' % d
# ?

Formatting With Defaults

d = dict(y='yes')
'%(y)s %(n)s' % d
# KeyError: 'n'

Formatting With Defaults

factory = lambda: 'no'
d = defaultdict(factory, y='yes')
'%(y)s %(n)s' % d
# ?

Formatting With Defaults

factory = lambda: 'no'
d = defaultdict(factory, y='yes')
'%(y)s %(n)s' % d
# 'yes no'

Auto-vivifying Tree

# don't try this at home!
def factory():
    return defaultdict(factory)
d = defaultdict(factory)

d['a']['b']['c'] = 42

Counting Letters

s = 'a,b.a8a?b'
Counter(c for c in s
        if c.isalpha())
# ?

Counting Letters

s = 'a,b.a8a?b'
Counter(c for c in s
        if c.isalpha())
# Counter({'a': 3, 'b': 2})

Counting Letters (CI)

s = 'a,b.A8A?B'
Counter(c.lower() for c in s
        if c.isalpha())
# ?

Counting Letters (CI)

s = 'a,b.A8A?B'
Counter(c.lower() for c in s
        if c.isalpha())
# Counter({'a': 3, 'b': 2})

Couting words

text = 'foo bar foo'
Counter(text.split())
# ?

Couting words

text = 'foo bar foo'
Counter(text.split())
# Counter({'foo': 2, 'bar': 1})

Questions?