# Chapter 3: Dictionaries and Sets

Dicts are used everywhere in Python, so they are highly optimized.

Like `dict`, `set` is implemented on top of a hash table.

## 1. Generic Mapping Types

A dictionary is, at its core, a mapping.

<img src="figures/mapping_uml.png" width="600"/>



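To create your own mapping class, you usually subclass `dict` or `collections.UserDict` rather than the abstract base classes shown above.

Those base classes mainly serve to:
- standardize the interface
- support `isinstance` tests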

In [3]:
from collections import abc

my_dict = {}
isinstance(my_dict, abc.Mapping)

True

### Hashable keys

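All dict keys must be **hashable**.

#### Hashable data types

- All primitive types are hashable
    - str
    - bytes
    - numeric values
- frozenset
- tuple, but only if all of its elements are hashable
    - `('a', [1, 2])` is not hashable
- User-defined types are hashable by default; the hash is derived from `id()`
    - so, by default, no two distinct instances compare equal
    - implement `__eq__` (together with a matching `__hash__`) to compare by content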
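The rules above can be checked with a short sketch. The classes `PlainPoint` and `ValuePoint` are hypothetical names made up for this illustration; they are not part of the original notes.

```python
# Tuples are hashable only when every element is hashable.
hashable_tuple = (1, 2, (30, 40))
unhashable_tuple = (1, 2, [30, 40])

hash(hashable_tuple)  # works

try:
    hash(unhashable_tuple)
    tuple_error = None
except TypeError as exc:
    tuple_error = exc  # the list inside the tuple is not hashable


class PlainPoint:
    """Hypothetical class with the default identity-based equality and hash."""
    def __init__(self, x, y):
        self.x, self.y = x, y


class ValuePoint:
    """Hypothetical class that compares and hashes by content."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, ValuePoint) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Defining __eq__ alone sets __hash__ to None, so both are defined.
        return hash((self.x, self.y))


plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)       # False: different objects
value_equal = ValuePoint(1, 2) == ValuePoint(1, 2)       # True: same content
lookup = {ValuePoint(1, 2): 'found'}[ValuePoint(1, 2)]   # 'found'
```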


## 2. Dict Comprehension


#### Initializing a dict


The following ways of building a dict are equivalent.

In [4]:
a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'one': 1, 'two': 2})
a == b == c == d == e

True

Besides the approaches above, and just as with lists, we can build dictionaries with dict comprehensions. A dictcomp builds a
dict instance by producing key:value pairs from any iterable.

In [13]:
# list of tuples --- iterable that can be used for dict comprehension


DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan')
    ]

In [14]:
country_code = {country: code for code, country in DIAL_CODES}
country_code

{'Bangladesh': 880,
 'Brazil': 55,
 'China': 86,
 'India': 91,
 'Indonesia': 62,
 'Japan': 81,
 'Nigeria': 234,
 'Pakistan': 92,
 'Russia': 7,
 'United States': 1}

In [16]:
# Countries whose dial code is greater than 66

country_code2 = {code: country.upper() for country, code in country_code.items() if code > 66}
country_code2

{81: 'JAPAN',
 86: 'CHINA',
 91: 'INDIA',
 92: 'PAKISTAN',
 234: 'NIGERIA',
 880: 'BANGLADESH'}

## 3. Overview of Common Mapping Methods

The basic API for mappings is quite rich. The three dict classes we use most often are `dict`, `defaultdict`, and `OrderedDict` (the latter two live in the `collections` module).

<img src="figures/mapping_apis.png" width="500"/>

### 3.1 Handling Missing Keys with `setdefault`

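Looking up `d[k]` raises a `KeyError` when `k` is not a key of `d`.
A common workaround is `d.get(k, default)`, which returns a default value instead.

The drawback is that `get` is inefficient when the value has to be updated afterwards, as the following example shows.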

In [18]:
import sys
import re

WORD_RE = re.compile(r'\w+')
file_name = 'test.txt'


index = {}
with open(file_name, encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        print('line number: ', line_no)
        print('line: ', line)

        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)  # (line, column) of this occurrence; each dict value is a list of such tuples
            # this is ugly; coded like this to make a point
            # First lookup
            occurrences = index.get(word, [])  # get the list of occurrences for word, or [] if not found
            occurrences.append(location)       # append the current location to that list
            # Second lookup: storing the list searches for the key again (find, then update)
            index[word] = occurrences          # put the changed list back; needed because the key may be new

# print in alphabetical order
print("\n=====Word index:===== ")
for word in sorted(index, key=str.upper):  # <4>
    print(word, index[word])

line number:  1
line:  Get the list of occurrences for word, or [] if not found.

line number:  2
line:  Append new location to occurrences.

line number:  3
line:  Put changed occurrences into index dict; this entails a second search through the index.

line number:  4
line:  In the key= argument of sorted I am not calling str.upper, just passing a reference to that method so the sorted function can use it to normalize the words for sorting

=====Word index:===== 
a [(3, 55), (4, 73)]
am [(4, 34)]
Append [(2, 1)]
argument [(4, 13)]
calling [(4, 41)]
can [(4, 123)]
changed [(3, 5)]
dict [(3, 36)]
entails [(3, 47)]
for [(1, 29), (4, 157)]
found [(1, 52)]
function [(4, 114)]
Get [(1, 1)]
I [(4, 32)]
if [(1, 45)]
In [(4, 1)]
index [(3, 30), (3, 83)]
into [(3, 25)]
it [(4, 131)]
just [(4, 60)]
key [(4, 8)]
list [(1, 9)]
location [(2, 12)]
method [(4, 93)]
new [(2, 8)]
normalize [(4, 137)]
not [(1, 48), (4, 37)]
occurrences [(1, 17), (2, 24), (3, 13)]
of [(1, 14), (4, 22)]
or [(1, 39)]
pass

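As shown above, updating a value this way takes two lookups: one to fetch the value and, once it has been changed, another when it is stored back. The `setdefault` method handles this case more elegantly.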

In [19]:
import sys
import re

WORD_RE = re.compile(r'\w+')
file_name = 'test.txt'

index = {}
with open(file_name, encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        print('line number: ', line_no)
        print('line: ', line)

        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)  # (line, column) of this occurrence; each dict value is a list of such tuples
            # The three commented-out lines below are replaced by setdefault,
            # which updates the entry without requiring a second search.
#             occurrences = index.get(word, [])  # get the list of occurrences for word, or [] if not found
#             occurrences.append(location)       # append the current location to that list
#             index[word] = occurrences          # put the changed list back; needed because the key may be new
            index.setdefault(word, []).append(location)
    
# print in alphabetical order
print("\n=====Word index:===== ")
for word in sorted(index, key=str.upper):  # <4>
    print(word, index[word])

line number:  1
line:  Get the list of occurrences for word, or [] if not found.

line number:  2
line:  Append new location to occurrences.

line number:  3
line:  Put changed occurrences into index dict; this entails a second search through the index.

line number:  4
line:  In the key= argument of sorted I am not calling str.upper, just passing a reference to that method so the sorted function can use it to normalize the words for sorting

=====Word index:===== 
a [(3, 55), (4, 73)]
am [(4, 34)]
Append [(2, 1)]
argument [(4, 13)]
calling [(4, 41)]
can [(4, 123)]
changed [(3, 5)]
dict [(3, 36)]
entails [(3, 47)]
for [(1, 29), (4, 157)]
found [(1, 52)]
function [(4, 114)]
Get [(1, 1)]
I [(4, 32)]
if [(1, 45)]
In [(4, 1)]
index [(3, 30), (3, 83)]
into [(3, 25)]
it [(4, 131)]
just [(4, 60)]
key [(4, 8)]
list [(1, 9)]
location [(2, 12)]
method [(4, 93)]
new [(2, 8)]
normalize [(4, 137)]
not [(1, 48), (4, 37)]
occurrences [(1, 17), (2, 24), (3, 13)]
of [(1, 14), (4, 22)]
or [(1, 39)]
pass

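```python
index.setdefault(word, []).append(location)
```

is equivalent to

```python
if word not in index:
    index[word] = []
index[word].append(location)
```

except that the second version performs at least two searches for the key (three if it is not found), while `setdefault` does it all with a single lookup.

The above covers a key that is missing at insertion time. Next we look at handling missing keys on any lookup.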

## 4. Mapping with Flexible Key Lookup

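When a missing key is looked up, a common approach is to return a made-up default value. There are two main ways to do this:
- use a `defaultdict`
- subclass `dict` or another mapping type and implement the `__missing__` method in the subclass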

### 4.1 defaultdict

A defaultdict is configured to create items on demand whenever a missing key is searched.

`collections.defaultdict([default_factory[, ...]])`

- The first argument provides the initial value for the **default_factory** attribute; it defaults to **None**. 
    - This attribute is used by the `__missing__()` method.
- All remaining arguments are treated the same as if they were passed to the dict constructor, including keyword arguments.


Here is how it works: when instantiating a defaultdict, you provide a callable that is used to produce a default value whenever `__getitem__` is passed a nonexistent key argument.

For example, given `dd = defaultdict(list)`, looking up a missing key with `dd['new-key']` does the following:
- Calls list() to create a new list.
- Inserts the list into dd using 'new-key' as key.
- Returns a reference to that list.
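The three steps can be observed directly. This is only a small sketch; the key name `'new-key'` follows the text above and the variable names are illustrative.

```python
from collections import defaultdict

dd = defaultdict(list)

before = 'new-key' in dd          # False: nothing stored yet
value = dd['new-key']             # missing key: list() is called and the result is stored
after = 'new-key' in dd           # True: the lookup itself inserted the key

value.append(1)                   # `value` is the very list held by dd
same_object = dd['new-key'] is value
```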

In [172]:
import sys
import re
import collections

WORD_RE = re.compile(r'\w+')
file_name = 'test.txt'

"""
Create a defaultdict with the list constructor as default_factory
"""
index = collections.defaultdict(list)     

with open(file_name, encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)
            index[word].append(location)  # a missing word yields a new empty list, so append always works

# print in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])
# END INDEX_DEFAULT

a [(3, 55), (4, 73)]
am [(4, 34)]
Append [(2, 1)]
argument [(4, 13)]
calling [(4, 41)]
can [(4, 123)]
changed [(3, 5)]
dict [(3, 36)]
entails [(3, 47)]
for [(1, 29), (4, 157)]
found [(1, 52)]
function [(4, 114)]
Get [(1, 1)]
I [(4, 32)]
if [(1, 45)]
In [(4, 1)]
index [(3, 30), (3, 83)]
into [(3, 25)]
it [(4, 131)]
just [(4, 60)]
key [(4, 8)]
list [(1, 9)]
location [(2, 12)]
method [(4, 93)]
new [(2, 8)]
normalize [(4, 137)]
not [(1, 48), (4, 37)]
occurrences [(1, 17), (2, 24), (3, 13)]
of [(1, 14), (4, 22)]
or [(1, 39)]
passing [(4, 65)]
Put [(3, 1)]
reference [(4, 75)]
search [(3, 64)]
second [(3, 57)]
so [(4, 100)]
sorted [(4, 25), (4, 107)]
sorting [(4, 161)]
str [(4, 49)]
that [(4, 88)]
the [(1, 5), (3, 79), (4, 4), (4, 103), (4, 147)]
this [(3, 42)]
through [(3, 71)]
to [(2, 21), (4, 85), (4, 134)]
upper [(4, 53)]
use [(4, 127)]
word [(1, 33)]
words [(4, 151)]


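Note that `default_factory` is only invoked through `__getitem__`. Other methods are unaffected; for example, `d.get(k)` still returns `None` for a missing key.

Under the hood, `defaultdict` achieves this by implementing the `__missing__` method.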
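A quick sketch of that difference, using throwaway names: `get` and `in` leave the dict untouched, while subscripting calls the factory and stores the result. A `defaultdict` created without a factory behaves like a plain dict.

```python
from collections import defaultdict

dd = defaultdict(list)

got = dd.get('missing')            # None: get() bypasses default_factory
contained = 'missing' in dd        # False: membership tests do not trigger it either
size_before = len(dd)              # 0: nothing was inserted

looked_up = dd['missing']          # []: __getitem__ -> __missing__ -> default_factory()
size_after = len(dd)               # 1: the key now exists

no_factory = defaultdict()         # default_factory is None
try:
    no_factory['x']
    raised = False
except KeyError:
    raised = True                  # same behavior as a plain dict
```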

Setting the default_factory to int makes the defaultdict useful for counting (like a bag or multiset in other languages):

In [175]:
from collections import defaultdict  # see also Counter in Section 5

s = 'mississippi'
d = defaultdict(int)
for k in s:
    d[k] += 1
sorted(d.items())

[('i', 4), ('m', 1), ('p', 2), ('s', 4)]

#### Returning a constant

In [176]:
def constant_factory(value):
    return lambda: value

In [182]:
d = defaultdict(constant_factory('<missing>'))
d.update(name='John', action='ran')
d

defaultdict(<function __main__.constant_factory.<locals>.<lambda>>,
            {'action': 'ran', 'name': 'John'})

In [183]:
print('{} {} to {}'.format(d['name'], d['action'], d['object'])) 


John ran to <missing>


In [184]:
'%(name)s %(action)s to %(object)s' % d

'John ran to <missing>'

### 4.2 The `__missing__` Method

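In a `dict` subclass, when `__getitem__` does not find a key it calls `__missing__`; if `__missing__` is not defined, it raises `KeyError`.

Note that `__missing__` is only called by `__getitem__`; it has no effect on other methods such as `get` or `__contains__`.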
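A minimal sketch of this rule, using a hypothetical subclass named `Zero` that is not part of the original notes:

```python
class Zero(dict):
    """Hypothetical dict subclass that reports 0 for absent keys."""

    def __missing__(self, key):
        return 0  # unlike defaultdict, nothing is stored


z = Zero(a=1)

via_getitem = z['nope']   # 0: __getitem__ fell back to __missing__
via_get = z.get('nope')   # None: dict.get does not call __missing__
via_in = 'nope' in z      # False: __contains__ does not call it either
stored = len(z)           # 1: the lookup did not add a key
```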

In [22]:
d = {'2':'two', '4':'four'}

In [23]:
d['2']

'two'

In [24]:
d[2]

KeyError: 2

Next we define a new class that, when looked up with a non-string key such as an integer, converts the key to `str` and looks it up again.

In [35]:
class StrKeyDict0(dict):  # inherits from dict.

    def __missing__(self, key):
        if isinstance(key, str):  # Check whether key is already a str. If it is, and it’s missing, raise KeyError.
            raise KeyError(key)
        return self[str(key)]  # not a str: convert to str and look it up again

    def get(self, key, default=None):
        try:
            return self[key]  # delegate to d[k], i.e. __getitem__, so __missing__ gets a chance
        except KeyError:
            return default  # not found: return the default

    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()  # match the key as given or its str form

In [26]:
d = StrKeyDict0([('2', 'two'), ('4', 'four')])

#### item retrieval using `d[key]` 

In [27]:
d['2']

'two'

In [28]:
d[2]

'two'

In [36]:
d[1]  # raises KeyError: the key does not exist

KeyError: '1'

#### item retrieval using `d.get(key)`

In [30]:
d.get('2')

'two'

In [31]:
d.get(2)

'two'

In [32]:
d.get(1, 'N/A')

'N/A'

#### the `in` operator

In [33]:
2 in d

True

In [34]:
1 in d

False

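#### Caution: infinite recursion

When implementing these methods it is easy to fall into infinite recursion (ref: page 74). In `StrKeyDict0`, the `isinstance(key, str)` check in `__missing__` is what prevents it: without that check, `self[str(key)]` would call `__missing__` again for an absent `str` key, and so on indefinitely. For the same reason, `__contains__` tests `key in self.keys()` rather than `key in self`, which would call `__contains__` recursively.

`k in d.keys()` is fast:
`dict.keys()` returns a view, which is similar to a set, and containment checks in sets are as fast as in dictionaries.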
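A small sketch of the view behavior, using a throwaway dict `d` (all names here are illustrative): the keys view supports hash-based membership tests and set operations, and it reflects later changes to the dict.

```python
d = {'a': 1, 'b': 2}
keys = d.keys()                    # a view, not a copy

has_a = 'a' in keys                # True: hash-based membership test
common = keys & {'a', 'z'}         # set operations work on key views

d['c'] = 3
sees_update = 'c' in keys          # True: the view tracks the dict
```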

## 5. Variations of dict

### 5.1 `collections.OrderedDict`

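`OrderedDict` works like `dict` but also keeps its keys in order. Since Python 3.7 the built-in `dict` preserves insertion order as well, so `OrderedDict` matters less than it used to. The differences between `dict` and `OrderedDict` are listed below.

Commonly used `OrderedDict` methods:
- `popitem`
- `move_to_end`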
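Another difference worth seeing in code is equality. As a small sketch with made-up data, comparing two `OrderedDict` instances is order-sensitive, while comparisons involving a plain `dict` ignore order:

```python
from collections import OrderedDict

pairs = [('a', 1), ('b', 2)]
swapped = [('b', 2), ('a', 1)]

plain_equal = dict(pairs) == dict(swapped)                   # True: order is ignored
ordered_equal = OrderedDict(pairs) == OrderedDict(swapped)   # False: order matters
mixed_equal = OrderedDict(pairs) == dict(swapped)            # True: plain dict rules apply

# Both types iterate in insertion order (dict since Python 3.7).
plain_order = list(dict(swapped))                            # ['b', 'a']
```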

In [142]:
from collections import OrderedDict

In [162]:
d = OrderedDict.fromkeys('abcde')
d

OrderedDict([('a', None), ('b', None), ('c', None), ('d', None), ('e', None)])

In [163]:
d.keys()

odict_keys(['a', 'b', 'c', 'd', 'e'])

In [164]:
d.move_to_end('b')  # move 'b' to the end
d.keys()

odict_keys(['a', 'c', 'd', 'e', 'b'])

In [165]:
d.move_to_end('b', last=False)
d.keys()

odict_keys(['b', 'a', 'c', 'd', 'e'])

In [166]:
d.popitem()  # pop the last item
d.keys()

odict_keys(['b', 'a', 'c', 'd'])

In [167]:
d.popitem(last=False)  # pop the first item
d.keys()

odict_keys(['a', 'c', 'd'])

#### `reversed()`

In [168]:
for key in reversed(d):
    print(key)

d
c
a


Order is retained for keyword arguments passed to the OrderedDict constructor and its `update()` method.

#### Use cases:
- LRU: OrderedDict can handle frequent reordering operations better than dict. This makes it suitable for **tracking recent accesses** (for example in an LRU cache)

```python
class LRU(OrderedDict):
    'Limit size, evicting the least recently looked-up key when full'

    def __init__(self, maxsize=128, *args, **kwds):
        self.maxsize = maxsize
        super().__init__(*args, **kwds)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if len(self) > self.maxsize:
            oldest = next(iter(self))
            del self[oldest]
```
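To see the eviction in action, here is a usage sketch. The class body is repeated from the block above so the example runs on its own; the cache contents are made up for illustration.

```python
from collections import OrderedDict


class LRU(OrderedDict):
    'Limit size, evicting the least recently looked-up key when full'

    def __init__(self, maxsize=128, *args, **kwds):
        self.maxsize = maxsize
        super().__init__(*args, **kwds)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)          # mark as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if len(self) > self.maxsize:
            oldest = next(iter(self))  # first key = least recently looked up
            del self[oldest]


cache = LRU(maxsize=2)
cache['a'] = 1
cache['b'] = 2
_ = cache['a']             # reading 'a' moves it to the end
cache['c'] = 3             # over capacity: 'b' is now the oldest and is evicted
remaining = list(cache)    # ['a', 'c']
```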

### 5.2 `collections.ChainMap`

In [111]:
from collections import ChainMap

In [115]:
baseline = {'music': 'bach', 'art': 'rembrandt'}
adjustments = {'art': 'van gogh', 'opera': 'carmen'}
cm = ChainMap(adjustments, baseline)
list(cm)

['music', 'art', 'opera']

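Iterating over the ChainMap yields the same keys, in the same order, as the following series of `update()` calls starting from the last mapping: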

In [113]:
combined = baseline.copy()
combined.update(adjustments)
list(combined)

['music', 'art', 'opera']

In [116]:
cm.keys()

KeysView(ChainMap({'art': 'van gogh', 'opera': 'carmen'}, {'music': 'bach', 'art': 'rembrandt'}))

In [117]:
cm.values()

ValuesView(ChainMap({'art': 'van gogh', 'opera': 'carmen'}, {'music': 'bach', 'art': 'rembrandt'}))

In [118]:
'music' in cm.keys()

True

In [119]:
'musique' in cm.keys()

False

In [121]:
cm['art']  # found in the first map, so that value is returned

'van gogh'

In [122]:
cm['sport'] = 'Jordan'  # inserted into the first map

In [123]:
cm 

ChainMap({'art': 'van gogh', 'opera': 'carmen', 'sport': 'Jordan'}, {'music': 'bach', 'art': 'rembrandt'})

In [124]:
cm['art'] = 'Davincci'  # updates the first map only
cm

ChainMap({'art': 'Davincci', 'opera': 'carmen', 'sport': 'Jordan'}, {'music': 'bach', 'art': 'rembrandt'})

In [125]:
del cm['art']  # deletes from the first map only

In [126]:
cm

ChainMap({'opera': 'carmen', 'sport': 'Jordan'}, {'music': 'bach', 'art': 'rembrandt'})

#### `maps`: the list of underlying mappings

In [135]:
cm.maps

[{'opera': 'carmen', 'sport': 'Jordan'}, {'art': 'rembrandt', 'music': 'bach'}]

#### `new_child(m)`
- `m` is a mapping (an empty dict if omitted) placed first in the new ChainMap

In [138]:
cm.new_child()  # returns a new ChainMap with an empty map added in front

ChainMap({}, {'opera': 'carmen', 'sport': 'Jordan'}, {'music': 'bach', 'art': 'rembrandt'})

In [140]:
dd = {'sport':'Jordan'}  

In [141]:
cm.new_child(dd)  # dd becomes the first map

ChainMap({'sport': 'Jordan'}, {'opera': 'carmen', 'sport': 'Jordan'}, {'music': 'bach', 'art': 'rembrandt'})

#### parents

Returns a new ChainMap containing all of the maps except the first one.

In [134]:
cm.parents  # cm has two maps, so the result holds only the second one

ChainMap({'music': 'bach', 'art': 'rembrandt'})

#### Example

A typical use of ChainMap is modeling variable lookup in an interpreter such as Python's: local variables first, then global variables, and so on.
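The lookup order can be imitated with plain dicts. This is only a sketch of the idea: the three scope dicts and their contents are invented stand-ins for real namespaces.

```python
from collections import ChainMap

# Stand-ins for the three scopes; real namespaces hold many more names.
builtin_scope = {'len': 'built-in len'}
global_scope = {'x': 'global x', 'y': 'global y'}
local_scope = {'x': 'local x'}

scope = ChainMap(local_scope, global_scope, builtin_scope)

x_value = scope['x']        # 'local x': the first map that has the key wins
y_value = scope['y']        # 'global y': falls through to the second map
len_value = scope['len']    # 'built-in len': found in the last map

scope['y'] = 'shadowed y'   # writes always go to the first map
```

After the assignment, `local_scope` holds the new `'y'` while `global_scope['y']` is unchanged, which mirrors how a local name shadows a global one.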

### 5.3 `collections.Counter`

A Counter is a dict subclass for counting hashable objects: a mapping that holds an integer count for each key.

It is commonly used to implement a multiset (or bag) that records how many times each element occurs.


reference: https://docs.python.org/3/library/collections.html#counter-objects

In [88]:
from collections import Counter

In [89]:
ct = Counter('abracadabra')
ct

Counter({'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2})

In [41]:
ct.update('aaaaazzz')
ct

Counter({'a': 10, 'b': 2, 'c': 1, 'd': 1, 'r': 2, 'z': 3})

In [42]:
ct.most_common(2)

[('a', 10), ('z', 3)]

In [45]:
words = re.findall(r'\w+', open('test.txt').read().lower())  # read the file and extract all tokens
words

['get',
 'the',
 'list',
 'of',
 'occurrences',
 'for',
 'word',
 'or',
 'if',
 'not',
 'found',
 'append',
 'new',
 'location',
 'to',
 'occurrences',
 'put',
 'changed',
 'occurrences',
 'into',
 'index',
 'dict',
 'this',
 'entails',
 'a',
 'second',
 'search',
 'through',
 'the',
 'index',
 'in',
 'the',
 'key',
 'argument',
 'of',
 'sorted',
 'i',
 'am',
 'not',
 'calling',
 'str',
 'upper',
 'just',
 'passing',
 'a',
 'reference',
 'to',
 'that',
 'method',
 'so',
 'the',
 'sorted',
 'function',
 'can',
 'use',
 'it',
 'to',
 'normalize',
 'the',
 'words',
 'for',
 'sorting']

In [48]:
ct_words = Counter(words)  # build a token counter from the word list; Counter takes an iterable or a mapping
ct_words

Counter({'a': 2,
         'am': 1,
         'append': 1,
         'argument': 1,
         'calling': 1,
         'can': 1,
         'changed': 1,
         'dict': 1,
         'entails': 1,
         'for': 2,
         'found': 1,
         'function': 1,
         'get': 1,
         'i': 1,
         'if': 1,
         'in': 1,
         'index': 2,
         'into': 1,
         'it': 1,
         'just': 1,
         'key': 1,
         'list': 1,
         'location': 1,
         'method': 1,
         'new': 1,
         'normalize': 1,
         'not': 2,
         'occurrences': 3,
         'of': 2,
         'or': 1,
         'passing': 1,
         'put': 1,
         'reference': 1,
         'search': 1,
         'second': 1,
         'so': 1,
         'sorted': 2,
         'sorting': 1,
         'str': 1,
         'that': 1,
         'the': 5,
         'this': 1,
         'through': 1,
         'to': 3,
         'upper': 1,
         'use': 1,
         'word': 1,
         'words': 1})

In [50]:
ct_words.most_common(2)  # the 2 most frequent tokens

[('the', 5), ('occurrences', 3)]

Note: counts are allowed to be any integer value, including zero or **negative** counts.


#### Missing items return a zero count instead of raising a KeyError

In [51]:
ct_words['Hello']

0

#### Setting a count to zero does not remove an element from a counter. Use del to remove it entirely

In [52]:
ct_words.keys()

dict_keys(['get', 'the', 'list', 'of', 'occurrences', 'for', 'word', 'or', 'if', 'not', 'found', 'append', 'new', 'location', 'to', 'put', 'changed', 'into', 'index', 'dict', 'this', 'entails', 'a', 'second', 'search', 'through', 'in', 'key', 'argument', 'sorted', 'i', 'am', 'calling', 'str', 'upper', 'just', 'passing', 'reference', 'that', 'method', 'so', 'function', 'can', 'use', 'it', 'normalize', 'words', 'sorting'])

In [53]:
ct_words['get']

1

In [55]:
ct_words['get'] = 0

In [56]:
ct_words['get'] 

0

In [57]:
ct_words.keys()

dict_keys(['get', 'the', 'list', 'of', 'occurrences', 'for', 'word', 'or', 'if', 'not', 'found', 'append', 'new', 'location', 'to', 'put', 'changed', 'into', 'index', 'dict', 'this', 'entails', 'a', 'second', 'search', 'through', 'in', 'key', 'argument', 'sorted', 'i', 'am', 'calling', 'str', 'upper', 'just', 'passing', 'reference', 'that', 'method', 'so', 'function', 'can', 'use', 'it', 'normalize', 'words', 'sorting'])

In [58]:
del ct_words['get'] 

In [59]:
ct_words.keys()  # 'get' has now been removed entirely

dict_keys(['the', 'list', 'of', 'occurrences', 'for', 'word', 'or', 'if', 'not', 'found', 'append', 'new', 'location', 'to', 'put', 'changed', 'into', 'index', 'dict', 'this', 'entails', 'a', 'second', 'search', 'through', 'in', 'key', 'argument', 'sorted', 'i', 'am', 'calling', 'str', 'upper', 'just', 'passing', 'reference', 'that', 'method', 'so', 'function', 'can', 'use', 'it', 'normalize', 'words', 'sorting'])

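#### Counter supports three additional methods

#### `elements()`: returns every element, repeated as many times as its count
- elements whose count is zero or negative are skipped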

In [64]:
c = collections.Counter(a=4, b=2, c=0, d=-2)
sorted(c.elements())

['a', 'a', 'a', 'a', 'b', 'b']

#### `most_common(n)`: returns the n most common elements

In [65]:
collections.Counter('abracadabra').most_common(3)

[('a', 5), ('b', 2), ('r', 2)]

#### `subtract([iterable-or-mapping])`

In [70]:
c = collections.Counter(a=4, b=2, c=0, d=-2, e=3)
d = collections.Counter(a=1, b=2, c=3, d=4, f=2)
c.subtract(d)
c

Counter({'a': 3, 'b': 0, 'c': -3, 'd': -6, 'e': 3, 'f': -2})

#### `update([iterable-or-mapping])`

In [69]:
c = collections.Counter(a=4, b=2, c=0, d=-2, e=3)
d = collections.Counter(a=1, b=2, c=3, d=4, f=2)
c.update(d)
c

Counter({'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 3, 'f': 2})

#### Operations

In [77]:
c = collections.Counter({'a': 3, 'b': 0, 'c': -3, 'd': -6, 'e': 3, 'f': -2})

In [78]:
+c # remove zero or negative counters

Counter({'a': 3, 'e': 3})

In [79]:
sum(c.values())  #total counts 

-5

In [80]:
list(c)  # list unique elements

['a', 'b', 'c', 'd', 'e', 'f']

In [81]:
set(c)  # convert to set

{'a', 'b', 'c', 'd', 'e', 'f'}

In [82]:
dict(c)  # convert to a regular dictionary

{'a': 3, 'b': 0, 'c': -3, 'd': -6, 'e': 3, 'f': -2}

In [83]:
c.items()  # view of (elem, cnt) pairs; wrap in list() to get a list

dict_items([('a', 3), ('b', 0), ('c', -3), ('d', -6), ('e', 3), ('f', -2)])

In [85]:
n=2
c.most_common()[:-n-1:-1]       # n least common elements

[('d', -6), ('c', -3)]

#### math operations

In [90]:
c = Counter(a=3, b=1)
d = Counter(a=1, b=2)

In [91]:
c + d

Counter({'a': 4, 'b': 3})

In [92]:
c - d  # 'b' is dropped because its count would be below zero

Counter({'a': 2})

In [93]:
d - c  # 'a' is dropped because its count would be below zero

Counter({'b': 1})

In [94]:
c & d  # intersection:  min(c[x], d[x]) 

Counter({'a': 1, 'b': 1})

In [95]:
c | d # union:  max(c[x], d[x])

Counter({'a': 3, 'b': 2})

Unary plus and minus are shortcuts for adding an empty counter and subtracting from an empty counter, respectively.

In [97]:
ct = Counter({'a': 3, 'b': 0, 'c': -3, 'd': -6, 'e': 3, 'f': -2})

In [98]:
+ct

Counter({'a': 3, 'e': 3})

In [101]:
-ct  # negative counts become positive; the rest are dropped

Counter({'c': 3, 'd': 6, 'f': 2})

#### Notes:
- Values can be of any type
- `most_common()` only requires that the values can be ordered
- multiset: all counts are positive (> 0)
- `elements()`: requires integer counts

In [106]:
ct = Counter({'a': 3.5, 'b': 1, 'c': -3.5})

In [107]:
ct.most_common(2)

[('a', 3.5), ('b', 1)]

In [109]:
ct.elements()  # integer counts

<itertools.chain at 0x10889c978>

### 5.4 `collections.UserDict`
Intended mainly as a base class for creating your own mapping types.

## 6. Subclassing UserDict

A better way to create a user-defined mapping type is to subclass `collections.UserDict` instead of `dict`. Subclassing `UserDict` is usually simpler than subclassing `dict`.

Note that **UserDict** does not inherit from dict, but has an internal **dict** instance, called **data**, which holds the actual items.

UserDict <- MutableMapping <- Mapping

So `UserDict` has all the methods of its parent classes, for example:
- `MutableMapping.update` 

- `Mapping.get`

In [185]:
import collections

class StrKeyDict(collections.UserDict):  # subclass UserDict

    def __missing__(self, key):  # same as in StrKeyDict0
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        return str(key) in self.data  # simpler
        # return key in self.keys() or str(key) in self.keys()
        
    def __setitem__(self, key, item):
        self.data[str(key)] = item   # converts any key to a str
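A usage sketch follows. The class is repeated so the block runs on its own, and the sample keys and values are made up. It shows that the inherited `update` and `get` go through the overridden methods, so every stored key ends up as a `str`.

```python
import collections


class StrKeyDict(collections.UserDict):

    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        return str(key) in self.data

    def __setitem__(self, key, item):
        self.data[str(key)] = item   # converts any key to a str


d = StrKeyDict([(2, 'two'), ('4', 'four')])
d.update({6: 'six'})            # inherited update() stores through __setitem__

keys = sorted(d)                # ['2', '4', '6']: all keys are str
by_int = d[2]                   # 'two': __missing__ retries with '2'
has_six = 6 in d                # True
fallback = d.get(8, 'N/A')      # 'N/A': inherited get() honors the overrides
```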

## 7. Immutable Mappings

An immutable mapping guarantees that a user cannot change it by mistake.

`mappingproxy`: a read-only but dynamic view of the original mapping. Changes to the original mapping are visible through the mappingproxy, but the mappingproxy itself cannot be modified.

In [186]:
from types import MappingProxyType

d = {1: 'A'}
d_proxy = MappingProxyType(d)
d_proxy

mappingproxy({1: 'A'})

In [188]:
d_proxy[1]

'A'

In [189]:
d_proxy[1] = 2  # assignment through the proxy raises an error

TypeError: 'mappingproxy' object does not support item assignment

In [190]:
d[1] = 2

In [192]:
d_proxy  # the change made to d is visible through d_proxy

mappingproxy({1: 2})

We have now covered most of the mapping types; next we turn to the set types.

## 8. Set Theory

A set is a collection of unique objects. A basic use case is **removing duplication**.
- Set elements must be hashable
- The set type is not hashable
- The frozenset type is hashable
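These hashability rules can be checked with a short sketch (the sample values are arbitrary): a `set` cannot be an element of another set, while a `frozenset` can be a set element or a dict key.

```python
try:
    nested = {{1, 2}, {3}}          # a set cannot contain sets
    set_error = None
except TypeError as exc:
    set_error = exc                 # set is unhashable

nested = {frozenset({1, 2}), frozenset({3})}          # frozensets are fine
member = frozenset({2, 1}) in nested                  # True: element order is irrelevant
lookup = {frozenset('ab'): 'pair'}[frozenset('ba')]   # 'pair': usable as a dict key
```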

In [193]:
l = ['spam', 'spam', 'eggs', 'spam']
set(l)

{'eggs', 'spam'}

In [194]:
list(set(l))

['eggs', 'spam']

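Another use of sets is set operations such as union, intersection, and difference. Using them can sometimes make a program both faster and more readable.

For example, a bulk **membership test** that finds which elements of one set also occur in another: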

In [203]:
s1 = set([1,2,3,4,5,6,7,8,9,10])
s2 = set([1,3,100])

intersection = s1 & s2
print(len(intersection))  # number of common elements
print(intersection)  # the common elements

2
{1, 3}


In [197]:
intersection = s1.intersection(s2)
intersection

{1, 3}

Without sets, we would have to write a loop like the following, where each `in` test is itself a scan when `iter2` is a list.

In [199]:
def cal_intersection(iter1, iter2):
    intersections = []
    for n in iter1:
        if n in iter2:
            intersections.append(n)
        
    return intersections

In [201]:
intersection = cal_intersection(s1, s2)
intersection

[1, 3]

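Limitations of this approach:
- slightly slower
- less readable than the set intersection operation

Advantage:
- the inputs do not have to be sets, as in the example below.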

In [202]:
lst1 = [1,2,2,3,3,3]
lst2 = [2,3,3,4,4,4]

intersection = cal_intersection(lst1, lst2)
intersection

[2, 2, 3, 3, 3]

As we can see, this is not quite right: the correct result is actually [2,3,3].

TODO: improve this.
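One possible improvement, offered as a sketch rather than the author's intended fix, is to treat both lists as multisets with `collections.Counter`. The `&` operator keeps the minimum count of each element, which gives the expected `[2, 3, 3]`.

```python
from collections import Counter

lst1 = [1, 2, 2, 3, 3, 3]
lst2 = [2, 3, 3, 4, 4, 4]

common = Counter(lst1) & Counter(lst2)               # per element: min of the two counts
multiset_intersection = sorted(common.elements())    # [2, 3, 3]
```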

### 8.1 Set Literals

Literal syntax is usually both faster and more readable.

```python
s1 = {1, 2, 3}
```
is equivalent to
```python
s1 = set([1, 2, 3])
```

#### why literal is faster than constructor
- literal: runs a single `BUILD_SET` bytecode
- constructor:
    - look up the name `set` in the namespace to fetch the constructor
    - build a list
    - pass the list to the constructor

Let's verify this by comparing the bytecode of the two forms.

In [210]:
from dis import dis

In [211]:
dis('{1}')

  1           0 LOAD_CONST               0 (1)
              2 BUILD_SET                1
              4 RETURN_VALUE


In [212]:
dis('set([1])')

  1           0 LOAD_NAME                0 (set)
              2 LOAD_CONST               0 (1)
              4 BUILD_LIST               1
              6 CALL_FUNCTION            1
              8 RETURN_VALUE


#### empty set
- to create an empty set, you should use the constructor without an argument: set(). 
- If you write {}, you’re creating an empty dict

In [205]:
s1 = {1, 2, 3}  # non-empty set built with literal syntax; equivalent to the constructor call s1 = set([1, 2, 3])
type(s1)

set

In [206]:
s2 = {}  # an empty dict, not a set
type(s2)

dict

In [208]:
s3 = set()
type(s3)

set

In [209]:
s3  # empty set

set()

#### FrozenSet

There is no special literal syntax for `frozenset`; it must be created with the constructor.

In [213]:
frozenset([1,2,3])

frozenset({1, 2, 3})

### 8.2 Set Comprehension

In [215]:
lst = list(range(10))
lst

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [217]:
s = {i % 3 for i in lst}  # remainders after dividing by 3
s

{0, 1, 2}

### 8.3 Set Operations

<img src="figures/set_operations.png" width="600"/>

#### math operations

<img src="figures/math_operations.png" width="600"/>

#### set predicates

<img src="figures/set_compare.png" width="600"/>

#### set addition

<img src="figures/set_addition.png" width="500"/>
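The operations summarized in the figures above can be tried out with a short sketch. The sets `a`, `b`, and `c` are arbitrary sample data.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

# Math operations
union = a | b                    # {1, 2, 3, 4, 5}
intersection = a & b             # {3, 4}
difference = a - b               # {1, 2}
symmetric_difference = a ^ b     # {1, 2, 5}

# Predicates
is_subset = {3, 4} <= a          # True
is_proper_subset = a < a         # False: a set is not a proper subset of itself
is_disjoint = a.isdisjoint({9})  # True: no shared elements

# The method forms accept any iterable; the operators require sets.
from_list = a.union([7, 8])      # {1, 2, 3, 4, 7, 8}

# In-place updates and single-element changes (set only, not frozenset)
c = set(a)
c |= b                           # c is now {1, 2, 3, 4, 5}
c.add(6)
c.discard(1)                     # removes 1
c.discard(100)                   # no error when the element is absent
```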


Now that we know how to use sets and dicts, we look next at how they are implemented (hash tables).