# Web Scraping: Beautiful Soup in Detail
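> [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data from HTML and XML files.
>
> It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.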


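## Common parsers
| Parser                  | Usage                                                                  | Advantages                                                              | Disadvantages                                          |
| ----------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------- | ------------------------------------------------------ |
| Python standard library | `BeautifulSoup(markup, "html.parser")`                                 | Built into Python; moderate speed; lenient with malformed documents     | Less lenient in versions before Python 2.7.3 or 3.2.2  |
| lxml HTML parser        | `BeautifulSoup(markup, "lxml")`                                        | Fast; lenient with malformed documents                                  | Requires installing a C library                        |
| lxml XML parser         | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`  | Fast; the only parser that supports XML                                 | Requires installing a C library                        |
| html5lib                | `BeautifulSoup(markup, "html5lib")`                                    | Most lenient; parses documents the way a browser does; produces valid HTML5 | Slow; must be installed as an external Python package |
> When the HTML or XML is not well-formed, different parsers can build different structures in the BeautifulSoup object.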
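
The note above can be checked with a small experiment. The following is a minimal sketch, not part of the original notebook: the `broken_markup` string and the `results` dictionary are illustrative names, and `lxml` and `html5lib` are assumed to be optional, so a missing parser is reported instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: an opening <a> followed by a stray closing </p>.
broken_markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # str() serializes the tree that this particular parser built.
        results[parser] = str(BeautifulSoup(broken_markup, parser))
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may not be installed.
        results[parser] = None

for parser, tree in results.items():
    print(parser, "->", tree if tree is not None else "(parser not installed)")
```

With `html.parser` the stray `</p>` is simply dropped; the other parsers, when installed, wrap or repair the fragment differently, which is why it is usually worth naming the parser explicitly.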

## Quick start
The following HTML snippet is used throughout the examples below.

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Build a BeautifulSoup object using html.parser, the parser that ships with Python.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

## Simple ways to navigate the structured data

In [13]:
# Locate the title tag
soup.title
# <title>The Dormouse's story</title>
# Name of the title tag
soup.title.name
# 'title'
# Text content of the title tag
soup.title.string
# "The Dormouse's story"
# Name of the title tag's parent
soup.title.parent.name
# 'head'
# Locate the first p tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
# Value of the p tag's class attribute (class is multi-valued, so a list is returned)
soup.p['class']
# ['title']
# Locate the first a tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Find all a nodes in the BeautifulSoup object
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Find the node whose id attribute is "link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

## Finding the links of all `<a>` tags in the document

In [12]:
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
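
Real pages often contain `<a>` tags without an `href`, in which case `link.get('href')` returns `None`. The sketch below shows one way to skip them by passing `href=True` to `find_all`; the `sample` markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> tag has no href attribute.
sample = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="no-href">anchor without a link</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)

# Without the filter, .get() yields None for the tag that has no href.
all_values = [a.get("href") for a in sample_soup.find_all("a")]
print(all_values)
```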


## Extracting all text from the document

In [25]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
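
The raw `get_text()` output keeps every newline from the source. As a minimal sketch (the `snippet` string is an illustrative assumption, not the notebook's `html_doc`), the `separator` and `strip` arguments of `get_text()` and the `stripped_strings` generator give tidier results:

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical fragment in the style of the example document.
snippet = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a></p>'
)
snippet_soup = BeautifulSoup(snippet, "html.parser")

# strip=True trims each text fragment; separator is placed between fragments.
joined = snippet_soup.get_text(separator="|", strip=True)
print(joined)  # Once upon a time|Elsie|and|Lacie

# stripped_strings yields the same trimmed fragments one at a time.
pieces = list(snippet_soup.stripped_strings)
print(pieces)  # ['Once upon a time', 'Elsie', 'and', 'Lacie']
```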



# Constructing a BeautifulSoup object

In [2]:
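from bs4 import BeautifulSoup
# Parse the contents of an HTML file (index.html must exist in the working directory)
with open("index.html") as fp:
    Soup = BeautifulSoup(fp, "html.parser")
# Parse a string that contains HTML
Soup = BeautifulSoup("<html>data</html>", "html.parser")
# Beautiful Soup converts the HTML document to Unicode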
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

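Running the cell above fails with the following error unless a file named index.html exists in the working directory:

FileNotFoundError: [Errno 2] No such file or directory: 'index.html'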
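
To try the file-based form without depending on an existing `index.html`, one option is to write a throwaway file first. This is a minimal sketch under that assumption; the temporary directory, the file contents, and the name `file_soup` are all illustrative.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a throwaway HTML file so the example does not depend on index.html.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>data</title></head></html>")

    # Pass the open file handle to BeautifulSoup; the with block closes it afterwards.
    with open(path, encoding="utf-8") as fp:
        file_soup = BeautifulSoup(fp, "html.parser")

# The document was read during construction, so the tree outlives the file.
print(file_soup.title.string)  # data
```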

# Node selectors
Use a node's tag name directly as an attribute to get the whole node element.

## Selecting elements

In [6]:
Soup = BeautifulSoup(html_doc, 'lxml')
# Print the title node
print(Soup.title)
# The title node is of type Tag
print(type(Soup.title))
print(Soup.title.string)
print(Soup.head)
print(Soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>


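bs4.element.Tag is the Beautiful Soup data type that represents a node element. It exposes many commonly used attributes, such as string, which returns the node's text content.
As shown above, Soup.p is expected to output a p node. The document contains several p nodes, but only the first match is returned.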
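
The "first match only" behavior can be seen side by side with `find` and `find_all`. The following is a small sketch; the three-paragraph `doc` string is a made-up stand-in for the notebook's `html_doc`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document with three p nodes.
doc = '<p class="title">first</p><p class="story">second</p><p class="story">third</p>'
doc_soup = BeautifulSoup(doc, "html.parser")

# Attribute access returns the first matching node, as a Tag.
first_p = doc_soup.p
print(isinstance(first_p, Tag))        # True
print(first_p.string)                  # first

# It is the same node that find("p") returns.
print(first_p is doc_soup.find("p"))   # True

# find_all("p") is needed to reach every p node.
all_p = doc_soup.find_all("p")
print(len(all_p))                      # 3
```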

## Extracting information
1. Get the name
2. Get the attributes
3. Get the content

In [11]:
print(Soup.title.name)
print(Soup.p.attrs)
print(Soup.p.attrs["class"])
print(Soup.p["class"])

title
{'class': ['title']}
['title']
['title']
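
The output above shows `class` coming back as a list. The sketch below, using a made-up `<a>` tag, illustrates that multi-valued attributes such as `class` are lists while others such as `id` are plain strings, that `.get()` avoids a `KeyError` for a missing attribute, and how the node's content (item 3 in the list) can be read.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute.
tag_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = tag_soup.a

print(link.name)          # a
print(link["class"])      # ['sister', 'big']  (multi-valued attribute: a list)
print(link["id"])         # link1              (single-valued attribute: a str)
print(link.get("title"))  # None               (missing attribute, no KeyError)
print(link.string)        # Elsie              (the node's text content)
```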


## Nested selection
After selecting a node, you can go on to select the nodes beneath it.

In [12]:
# Select the head node first, then the title node inside it
print(Soup.head.title)

<title>The Dormouse's story</title>
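
Chained access works because each step returns another Tag, but a step that finds nothing returns `None`, and the next step then fails. A minimal sketch of both cases follows; the small document and the names `nested_soup`, `missing`, and `error_seen` are illustrative.

```python
from bs4 import BeautifulSoup

nested_soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold</b></p></body></html>",
    "html.parser",
)

# Each step returns a Tag, so selections can be chained.
title = nested_soup.head.title
print(type(title).__name__)  # Tag
print(title.string)

# head contains no b node, so this step returns None.
missing = nested_soup.head.b
print(missing)  # None

# Chaining past a missing node raises AttributeError, so check for None first.
error_seen = False
try:
    nested_soup.head.b.string
except AttributeError:
    error_seen = True
print(error_seen)  # True
```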


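## Relational selection
We may select one particular node but also need information from nodes at other levels, such as its parent, sibling, or child nodes.
1. Child and descendant nodes
2. Parent and ancestor nodes
3. Sibling nodes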

In [36]:
print(Soup.p)
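# Access the b node directly under p
print(Soup.p.b)
# .contents returns the direct children as a list; descendants are not expanded further
print("contents:\n",Soup.body.contents)
# .children returns an iterator over the direct children
print(Soup.body.children)
print("contents:\n",list(Soup.body.children))
# .descendants returns a generator that recurses through every level below, yielding all descendant nodes
print(Soup.body.descendants)
print("contents:\n",list(Soup.body.descendants))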

<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
contents:
 ['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
<list_iterator object at 0x00000204199F6470>
contents:
 ['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom 

<generator object descendants at 0x00000204199BCC50>
