# Chapter 3: Dictionaries and Sets

Dicts are used everywhere in Python, so their implementation is highly optimized.

Like `dict`, `set` is implemented on top of a hash table.

## 1. Generic Mapping Types

A dictionary is, at its core, a mapping.

<img src="figures/mapping_uml.png" width="600"/>



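To create your own mapping class, you usually subclass `dict` or `collections.UserDict` rather than the abstract base classes shown above.

The main purposes of those base classes are:
- Standardizing the interface
- Serving as criteria for `isinstance` tests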

In [3]:
from collections import abc

my_dict = {}
isinstance(my_dict, abc.Mapping)

True

### Hashable keys

Every dict key must be **hashable**.

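#### Hashable data types:

- All atomic immutable types are hashable
    - str
    - bytes
    - numeric types
- frozenset
- tuple, but only if all of its elements are hashable
    - `('a', [1, 2])` is not hashable
- user-defined types: by default the hash is derived from `id()`
    - so, by default, no two distinct instances compare equal
    - implement `__eq__` to compare by content; a class that defines `__eq__` should also define a consistent `__hash__`, otherwise its instances become unhashable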
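
A quick way to check these rules is to call `hash()` directly. The sketch below is only illustrative; the `Point` class and the variable names are hypothetical and do not come from the chapter.

```python
# Tuples are hashable only when every element is hashable.
hashable_tuple = (1, 2, (30, 40))
tuple_hash_ok = isinstance(hash(hashable_tuple), int)

try:
    hash(('a', [1, 2]))          # the inner list is mutable, hence unhashable
    list_in_tuple_hashable = True
except TypeError:
    list_in_tuple_hashable = False


class Point:
    """Hypothetical user-defined type that compares and hashes by content."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Objects that compare equal must produce the same hash.
        return hash((self.x, self.y))


lookup = {Point(1, 2): 'found'}
print(lookup[Point(1, 2)])  # an equal Point locates the same entry
```

Without the custom `__eq__` and `__hash__`, the second `Point(1, 2)` would be a different key, because the default comparison is by identity.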


## 2. Dict Comprehension


#### Initializing a dict


The following ways of building a dict are all equivalent.

In [4]:
a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'one': 1, 'two': 2})
a == b == c == d == e

True

Besides the approaches above, dictionaries can be built with dict comprehensions, just as lists can be built with list comprehensions. A dictcomp builds a
dict instance by producing `key: value` pairs from any iterable.

In [13]:
# list of tuples --- iterable that can be used for dict comprehension


DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan')
    ]

In [14]:
country_code = {country: code for code, country in DIAL_CODES}
country_code

{'Bangladesh': 880,
 'Brazil': 55,
 'China': 86,
 'India': 91,
 'Indonesia': 62,
 'Japan': 81,
 'Nigeria': 234,
 'Pakistan': 92,
 'Russia': 7,
 'United States': 1}

In [16]:
# Countries whose dial code is greater than 66

country_code2 = {code: country.upper() for country, code in country_code.items() if code > 66}
country_code2

{81: 'JAPAN',
 86: 'CHINA',
 91: 'INDIA',
 92: 'PAKISTAN',
 234: 'NIGERIA',
 880: 'BANGLADESH'}

## 3. Overview of Common Mapping Methods

The basic API for mappings is quite rich. The three dict classes we use most often are `dict`, `defaultdict`, and `OrderedDict` (the latter two live in the `collections` module).

<img src="figures/mapping_apis.png" width="500"/>

### 3.1 Handling Missing Keys with `setdefault`

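Looking up `d[k]` raises a `KeyError` when `k` is not a key of `d`.
A common workaround is `d.get(k, default)`, which returns `default` instead of raising.

The drawback of this approach is that it is inefficient when the value has to be updated. See the following example.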

In [18]:
import sys
import re

WORD_RE = re.compile(r'\w+')
file_name = 'test.txt'


index = {}
with open(file_name, encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        print('line number: ', line_no)
        print('line: ', line)

        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)  # (line, column) tuple; each dict value is a list of these
            # this is ugly; coded like this to make a point
            # first lookup
            occurrences = index.get(word, [])  # get the list of occurrences for word, or [] if not found
            occurrences.append(location)       # append the new location to occurrences
            # second lookup: the assignment below has to search for the key again
            index[word] = occurrences          # put the changed occurrences back; required because word may be new

# print in alphabetical order
print("\n=====Word index:===== ")
for word in sorted(index, key=str.upper):  # str.upper is passed, not called, to normalize words for sorting
    print(word, index[word])

line number:  1
line:  Get the list of occurrences for word, or [] if not found.

line number:  2
line:  Append new location to occurrences.

line number:  3
line:  Put changed occurrences into index dict; this entails a second search through the index.

line number:  4
line:  In the key= argument of sorted I am not calling str.upper, just passing a reference to that method so the sorted function can use it to normalize the words for sorting

=====Word index:===== 
a [(3, 55), (4, 73)]
am [(4, 34)]
Append [(2, 1)]
argument [(4, 13)]
calling [(4, 41)]
can [(4, 123)]
changed [(3, 5)]
dict [(3, 36)]
entails [(3, 47)]
for [(1, 29), (4, 157)]
found [(1, 52)]
function [(4, 114)]
Get [(1, 1)]
I [(4, 32)]
if [(1, 45)]
In [(4, 1)]
index [(3, 30), (3, 83)]
into [(3, 25)]
it [(4, 131)]
just [(4, 60)]
key [(4, 8)]
list [(1, 9)]
location [(2, 12)]
method [(4, 93)]
new [(2, 8)]
normalize [(4, 137)]
not [(1, 48), (4, 37)]
occurrences [(1, 17), (2, 24), (3, 13)]
of [(1, 14), (4, 22)]
or [(1, 39)]
pass

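As shown above, updating a value this way takes two lookups: one to fetch the value and, after it has been modified, another when it is stored back. Next we use `setdefault` to solve this more elegantly.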

In [19]:
import sys
import re

WORD_RE = re.compile(r'\w+')
file_name = 'test.txt'

index = {}
with open(file_name, encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        print('line number: ', line_no)
        print('line: ', line)

        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)  # (line, column) tuple; each dict value is a list of these
            # the three commented-out lines below are replaced by a single setdefault call,
            # which updates the entry without requiring a second search
#             occurrences = index.get(word, [])  # get the list of occurrences for word, or [] if not found
#             occurrences.append(location)       # append the new location to occurrences
#             index[word] = occurrences          # put the changed occurrences back into index
            index.setdefault(word, []).append(location)
    
# print in alphabetical order
print("\n=====Word index:===== ")
for word in sorted(index, key=str.upper):  # str.upper is passed, not called, to normalize words for sorting
    print(word, index[word])

line number:  1
line:  Get the list of occurrences for word, or [] if not found.

line number:  2
line:  Append new location to occurrences.

line number:  3
line:  Put changed occurrences into index dict; this entails a second search through the index.

line number:  4
line:  In the key= argument of sorted I am not calling str.upper, just passing a reference to that method so the sorted function can use it to normalize the words for sorting

=====Word index:===== 
a [(3, 55), (4, 73)]
am [(4, 34)]
Append [(2, 1)]
argument [(4, 13)]
calling [(4, 41)]
can [(4, 123)]
changed [(3, 5)]
dict [(3, 36)]
entails [(3, 47)]
for [(1, 29), (4, 157)]
found [(1, 52)]
function [(4, 114)]
Get [(1, 1)]
I [(4, 32)]
if [(1, 45)]
In [(4, 1)]
index [(3, 30), (3, 83)]
into [(3, 25)]
it [(4, 131)]
just [(4, 60)]
key [(4, 8)]
list [(1, 9)]
location [(2, 12)]
method [(4, 93)]
new [(2, 8)]
normalize [(4, 137)]
not [(1, 48), (4, 37)]
occurrences [(1, 17), (2, 24), (3, 13)]
of [(1, 14), (4, 22)]
or [(1, 39)]
pass

```python
index.setdefault(word, []).append(location)
```

is equivalent to

```python
if word not in index:
    index[word] = []
index[word].append(location)
```

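The difference is that the second snippet performs at least two searches for the key, three if the key is missing, while the single `setdefault` line does everything with one lookup.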
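
To make the comparison concrete, here is a minimal sketch that builds the same small index both ways and checks that the results match. The sample `words` list and the variable names are made up for illustration.

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']  # sample data (assumed)

# Version 1: explicit membership test, then insert, then look up again to append.
by_letter_manual = {}
for word in words:
    key = word[0]
    if key not in by_letter_manual:
        by_letter_manual[key] = []
    by_letter_manual[key].append(word)

# Version 2: setdefault returns the existing list, or inserts a new one and returns it.
by_letter_setdefault = {}
for word in words:
    by_letter_setdefault.setdefault(word[0], []).append(word)

print(by_letter_manual == by_letter_setdefault)
print(by_letter_setdefault)
```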

So far we have handled keys that are missing at insertion time. Next we look at handling missing keys on any lookup.

## 4. Mapping with Flexible Key Lookup

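When a lookup hits a missing key, a common approach is to return a made-up default value. There are two main ways to do this:
- Use a `defaultdict`
- Subclass `dict` or another mapping type and implement the `__missing__` method in the subclass.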

### 4.1 defaultdict

A defaultdict is configured to create items on demand whenever a missing key is searched.

Here is how it works: when instantiating a defaultdict, you provide a callable that is used to produce a default value whenever `__getitem__` is passed a nonexistent key argument.

For example, given `dd = defaultdict(list)`, looking up a missing key with `dd['new-key']` does the following:
- Calls list() to create a new list.
- Inserts the list into dd using 'new-key' as key.
- Returns a reference to that list.
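
The three steps above can be observed directly in a short session. This is a minimal sketch; the variable names are chosen only for illustration.

```python
from collections import defaultdict

dd = defaultdict(list)

had_key_before = 'new-key' in dd   # nothing has been inserted yet
value = dd['new-key']              # missing key: default_factory (list) is called
had_key_after = 'new-key' in dd    # the new list is now stored under 'new-key'

value.append(1)                    # value is the same list object that dd holds
print(had_key_before, had_key_after, dd['new-key'])
```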

In [None]:
import sys
import re
import collections

WORD_RE = re.compile(r'\w+')

"""
Create a defaultdict with the list constructor as default_factory
"""
index = collections.defaultdict(list)     

with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no, column_no)
            index[word].append(location)  # a missing word yields a new empty list, so append always works

# print in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

Note that `default_factory` is only invoked by `__getitem__`. Other methods do not trigger it; for example, `dd.get(k)` still returns `None` for a missing key.
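
The difference between `dd[k]` and `dd.get(k)` is easy to verify. The sketch below uses illustrative variable names and records the size of the mapping after each kind of access.

```python
from collections import defaultdict

dd = defaultdict(list)

got = dd.get('missing')        # get() does not call default_factory
present = 'missing' in dd      # a membership test does not insert anything either
size_after_get = len(dd)

created = dd['missing']        # only dd[k] triggers default_factory
size_after_getitem = len(dd)

print(got, present, size_after_get, created, size_after_getitem)
```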

Under the hood, `defaultdict` makes this work by implementing the `__missing__` method.

### 4.2 The `__missing__` Method

When `dict.__getitem__` does not find the key, it calls `__missing__` if the class defines it; otherwise it raises `KeyError`.

Note that `__missing__` is called only by `__getitem__`. It has no effect on other methods such as `get` or `__contains__`.
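
As a minimal sketch of this behavior, the hypothetical `PlaceholderDict` below (not part of the chapter) defines only `__missing__` and leaves every other method inherited from `dict`.

```python
class PlaceholderDict(dict):
    """Hypothetical dict subclass that returns a placeholder for unknown keys."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only, when key is not found.
        return f'<no value for {key!r}>'


d = PlaceholderDict(a=1)

via_getitem = d['b']   # __missing__ supplies the result
via_get = d.get('b')   # dict.get ignores __missing__
via_in = 'b' in d      # membership is unaffected, and nothing was inserted

print(via_getitem, via_get, via_in)
```

This is why a subclass that wants consistent behavior usually has to override `get` and `__contains__` as well.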

In [22]:
d = {'2':'two', '4':'four'}

In [23]:
d['2']

'two'

In [24]:
d[2]

KeyError: 2

Next we define a new class that converts a non-string key, such as an integer, to a `str` before looking up its value.

In [35]:
class StrKeyDict0(dict):  # inherits from dict.

    def __missing__(self, key):
        if isinstance(key, str):  # Check whether key is already a str. If it is, and it’s missing, raise KeyError.
            raise KeyError(key)
        return self[str(key)]  # not a str: convert to str and look it up again

    def get(self, key, default=None):
        try:
            return self[key]  # delegate to self[key], i.e. __getitem__, so __missing__ gets a chance to run
        except KeyError:
            return default  # not found: return default

    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()  # match the key as given or its str form

In [26]:
d = StrKeyDict0([('2', 'two'), ('4', 'four')])

#### item retrieval using `d[key]` 

In [27]:
d['2']

'two'

In [28]:
d[2]

'two'

In [36]:
d[1]  # raises KeyError: neither 1 nor '1' is a key

KeyError: '1'

#### item retrieval using `d.get(key)`

In [30]:
d.get('2')

'two'

In [31]:
d.get(2)

'two'

In [32]:
d.get(1, 'N/A')

'N/A'

#### the `in` operator

In [33]:
2 in d

True

In [34]:
1 in d

False

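#### Watch out for infinite recursion

It is easy to fall into infinite recursion when implementing these methods (ref: page 74). In `__missing__`, the `isinstance(key, str)` check is essential: without it, `self[str(key)]` would call `__missing__` again whenever `str(key)` is not an existing key, and so on without end. Likewise, `__contains__` uses `key in self.keys()` rather than `key in self`, because the latter would call `__contains__` recursively.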

`k in d.keys()` is fast.
`dict.keys()` returns a view, which is similar to a set, and containment checks in sets are as fast as in dictionaries.
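
The set-like nature of the keys view can be seen in a short sketch. The variable names here are illustrative.

```python
d = {'2': 'two', '4': 'four'}
keys = d.keys()

has_two = '2' in keys        # hash-based lookup; no list of keys is built
common = keys & {'4', '6'}   # views support set operations such as intersection

d['6'] = 'six'               # a view is dynamic: it reflects later changes to d
has_six = '6' in keys

print(has_two, common, has_six)
```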

## 5. Variations of dict

## 6. Subclassing UserDict

A better way to create a user-defined mapping type is to subclass `collections.UserDict` instead of `dict`.
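
As a rough sketch of what this could look like, the class below reworks the earlier `StrKeyDict0` idea on top of `UserDict`. It is one possible implementation under the assumption that all keys should be stored as `str`; the name `StrKeyDict` is chosen here for illustration.

```python
from collections import UserDict


class StrKeyDict(UserDict):
    """Sketch of a mapping that stores every key as a str."""

    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        # self.data is the plain dict that UserDict wraps.
        return str(key) in self.data

    def __setitem__(self, key, item):
        # Convert keys on the way in, so lookups only deal with str keys.
        self.data[str(key)] = item


d = StrKeyDict([('2', 'two'), (4, 'four')])
print(d[2], d['4'], d.get(1, 'N/A'), 2 in d, 1 in d)
```

Because `UserDict` keeps its contents in the `data` attribute and routes the inherited methods through `__getitem__`, `__setitem__`, and `__contains__`, there is no need to override `get` here, and `__contains__` can consult `self.data` directly without risk of recursion.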