# 1. Installing Beautiful Soup
`pip install beautifulsoup4`

`pip install lxml`

`pip install html5lib`

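* Beautiful Soup supports the HTML parser in the Python standard library (`html.parser`).
* It also supports third-party parsers:
    * If no third-party parser is installed, Python's default parser is used.
    * The lxml parser is more powerful and faster, so installing it is recommended.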
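
The parser choice described above can be sketched as a small fallback helper. `make_soup` is a hypothetical name used only for illustration; the sketch assumes that Beautiful Soup raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available, so fall back to the standard-library parser
        return BeautifulSoup(markup, "html.parser")


demo = make_soup("<p>Hello</p>")
print(demo.p.string)
```

Either parser finds the same `<p>` tag for this small input. On malformed HTML, different parsers may build different trees.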
    

# 2. Creating a Beautiful Soup object

In [1]:
# Import bs4
from bs4 import BeautifulSoup
from bs4 import element

In [2]:
# Create the object from a local HTML file, for example
with open('index.html') as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup)
print(type(soup))


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<class 'bs4.BeautifulSoup'>
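
A `BeautifulSoup` object can also be built from a string instead of a file, which is convenient for quick experiments. This is a minimal sketch; `html_doc` and `soup_from_str` are illustrative names chosen so the `soup` object above is left untouched.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document, held in a string
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

soup_from_str = BeautifulSoup(html_doc, "html.parser")
print(type(soup_from_str).__name__)
print(soup_from_str.title.string)
```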


In [3]:
# Pretty-print the soup
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


# The four kinds of objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment

### (1) What is a Tag
A Tag corresponds to an HTML tag in the original document, for example:

`<title>The Dormouse's story</title>`

`<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>`


In [4]:
print(soup.title)

<title>The Dormouse's story</title>


In [5]:
print(soup.head)

<head><title>The Dormouse's story</title></head>


In [6]:
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>


In [7]:
print(soup.p)

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


In [8]:
print(type(soup.p))
print(type(soup.a))

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
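
As the cells above show, attribute-style access such as `soup.a` returns only the first matching tag. A hedged sketch of collecting every match with `find_all`, using a small stand-in document (`doc` and `links_soup` are illustrative names):

```python
from bs4 import BeautifulSoup

doc = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
links_soup = BeautifulSoup(doc, "html.parser")

# Attribute-style access stops at the first <a> tag
first_link = links_soup.a
print(first_link["id"])

# find_all returns every matching tag in document order
all_ids = [a["id"] for a in links_soup.find_all("a")]
print(all_ids)
```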


#### Two attributes of a Tag
1. name
2. attrs


In [9]:
# The soup object itself is special: its name is [document]. For any other tag, the value is the name of the tag itself.
print(soup.name)
print(soup.head.name)
print(soup.a.name)

[document]
head
a


In [10]:
# attrs: all attributes of the tag, as a dict
print(soup.p.attrs)
print(soup.a.attrs)

{'class': ['title'], 'name': 'dromouse'}
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}


In [11]:
# Access a single attribute
print(soup.p['class'])
# Equivalent way to access it
print(soup.p.get('class'))

['title']
['title']
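
The two access styles above return the same value when the attribute exists, but they differ when it is missing. A small sketch on a stand-in tag (`tag` and `missing_message` are illustrative names):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser").p

# .get() returns None (or a supplied default) for a missing attribute
print(tag.get("id"))
print(tag.get("id", "no-id"))

# Indexing raises KeyError for a missing attribute, like a dict
try:
    tag["id"]
    missing_message = "id is present"
except KeyError:
    missing_message = "id is missing"
print(missing_message)
```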


In [12]:
# Modify an attribute
soup.p['class']="newClass"
print(soup.p)

<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>


In [13]:
# Delete an attribute
del soup.p['class']
print(soup.p)

<p name="dromouse"><b>The Dormouse's story</b></p>


### (2) NavigableString

In [14]:
# Get the text inside the tag
print(soup.p.string)

The Dormouse's story


In [15]:
print(type(soup.p.string))

<class 'bs4.element.NavigableString'>
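
`.string` only yields a value when the tag has a single child to take it from. When a tag has several children it returns `None`, and `get_text()` can be used to join all the text instead. A minimal sketch on a stand-in snippet (`text_soup` is an illustrative name):

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

# <b> has exactly one child, so .string returns its text
print(text_soup.b.string)

# <p> has three children (text, <b>, text), so .string is None
print(text_soup.p.string)

# get_text() concatenates every string below the tag
print(text_soup.p.get_text())
```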


### (3) BeautifulSoup
A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.

In [16]:
print(type(soup))
print(soup.name)
print(soup.attrs)

<class 'bs4.BeautifulSoup'>
[document]
{}


### (4) Comment
A Comment object is a special type of NavigableString object.

In [17]:
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>


In [18]:
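# The content of the a tag is actually a comment, but .string returns it with the
# comment markers stripped, which can cause unexpected trouble.
# Printing its type shows that it is a Comment, so it is best to check the type
# before using the value. The check looks like this: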
if isinstance(soup.a.string, element.Comment):
    print(soup.a.string)

 Elsie 
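
The same type check can be used the other way round, to skip comments and keep only real text. A hedged sketch; `visible_string` is a hypothetical helper, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def visible_string(tag):
    """Return tag.string as plain text, or None if it is missing or a comment."""
    text = tag.string
    if text is None or isinstance(text, Comment):
        return None
    return str(text)


comment_soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")
results = [visible_string(a) for a in comment_soup.find_all("a")]
print(results)
```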


## Traversing the document tree
### (1) Direct children
Key points: the `.contents` and `.children` attributes

In [20]:
print(soup.contents)

['\n', <html><head><title>The Dormouse's story</title></head>
<body>
<p name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>]
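
`.contents` returns the direct children as a list, while `.children` yields the same nodes through an iterator. A minimal sketch on a stand-in document (`tree_soup` and `body` are illustrative names):

```python
from bs4 import BeautifulSoup

tree_soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
body = tree_soup.body

# .contents is a plain list, so it supports len() and indexing
print(len(body.contents))
print(body.contents[0])

# .children is meant for looping over the same direct children
child_texts = [child.string for child in body.children]
print(child_texts)
```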


In [28]:
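# Two-sum: print the index pairs in num1 whose values add up to target.
# Alternative input: num1 = [2, 5, 7, 14, 15]
num1 = [7, 6, -13, 4, 15]
target = 10
# Index -> value view of the input, printed for reference
d1 = {}
for i in range(len(num1)):
    d1[i] = num1[i]
print(d1)
# seen maps each visited value to its index (renamed from dict, which shadowed the builtin)
seen = {}
for i in range(len(num1)):
    if target - num1[i] in seen:
        print([seen[target - num1[i]], i])
    else:
        seen[num1[i]] = i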
#     for j in range(i, len(num1)):
#         if d1[i]+d1[j] == target:
#             list1=[i,j]
#             print(list1)
#             break

{0: 7, 1: 6, 2: -13, 3: 4, 4: 15}
[1, 3]
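
The dictionary lookup in the cell above can be wrapped in a reusable function. This is a sketch under the same assumptions as that cell; `two_sum_pairs` is a hypothetical name.

```python
def two_sum_pairs(nums, target):
    """Return index pairs [i, j] with i < j and nums[i] + nums[j] == target."""
    seen = {}   # value -> index of a value that has not been paired yet
    pairs = []
    for i, value in enumerate(nums):
        complement = target - value
        if complement in seen:
            pairs.append([seen[complement], i])
        else:
            seen[value] = i
    return pairs


pairs = two_sum_pairs([7, 6, -13, 4, 15], 10)
print(pairs)
```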


In [33]:
nums = [1,2,3,5,6,7]
target = 9
k = 2

# Expected result: [[3, 6], [2, 7]]
def dfs(nums, k, target):
    if not nums or k < 0 or target < 0:
        return None
    
    result = []
    
    stack = []
    stack.append([])
    
    while stack:
        curr = stack.pop()
        
#         if len(curr) == k:
#             if sum(curr) == target:
#                 result.append(curr.copy())
#             continue
        
        if sum(curr) == target and len(curr) == k:
            result.append(curr.copy())
            
        if len(curr) == k:
            continue
        
        for i in range(len(nums)):
            if curr and curr[-1] >= nums[i]:
                continue
            next_list = curr.copy()
            next_list.append(nums[i])
            stack.append(next_list)
    
    return result

In [34]:
result = dfs(nums, k, target)
print(result)

[[3, 6], [2, 7]]
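
One way to cross-check the stack-based search is `itertools.combinations` from the standard library. This is a swapped-in technique, not the approach used above. It finds the same combinations but returns them in a different order. `k_sum_combinations` is a hypothetical name.

```python
from itertools import combinations


def k_sum_combinations(nums, k, target):
    """Return every k-element combination of nums whose sum equals target."""
    return [list(combo) for combo in combinations(nums, k) if sum(combo) == target]


check = k_sum_combinations([1, 2, 3, 5, 6, 7], 2, 9)
print(check)
```

The `dfs` function only extends a partial list with values larger than its last element, so it relies on `nums` holding distinct values in ascending order. `combinations` has no such requirement.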
