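# Web Scraping: Beautiful Soup in Detail
> [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data from HTML and XML files.
>
> It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document.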


## Main parsers
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; tolerant of malformed documents | Poorly tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; tolerant of malformed documents | Requires installing a C library |
| lxml XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Fast; the only supported XML parser | Requires installing a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; must be installed separately (pure Python, no C extension required) |

> When the HTML or XML is not well-formed, different parsers can build different structures in the BeautifulSoup object.
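
To see how parsers can disagree on malformed markup, the following minimal sketch feeds the same broken fragment to each parser that happens to be installed. The fragment `broken` and the `results` dictionary are illustrative names, not part of the original notebook; `lxml` and `html5lib` are optional packages, so the sketch skips them when they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: an unclosed <a> followed by a stray </p>.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This optional parser is not installed in the current environment.
        results[parser] = None

for parser, tree in results.items():
    print(f"{parser:12} -> {tree}")
```

Comparing the printed trees side by side is a quick way to decide which parser suits a given source of messy HTML.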

## Quick start
The following HTML snippet is used repeatedly in the examples below.

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Build a BeautifulSoup object using html.parser, the parser that ships with Python.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

## Simple ways to navigate the structured data

In [13]:
# Locate the title tag
soup.title
# <title>The Dormouse's story</title>
# Name of the title tag
soup.title.name
# 'title'
# Text of the title tag
soup.title.string
# "The Dormouse's story"
# Name of the title tag's parent
soup.title.parent.name
# 'head'
# Locate the first p tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
# Value of the p tag's class attribute (class is multi-valued, so a list is returned)
soup.p['class']
# ['title']
# Locate the first a tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Find all a nodes in the BeautifulSoup object
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Find the node whose id attribute is "link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

## Finding the links of all `<a>` tags in the document

In [12]:
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


## Getting all the text in the document

In [25]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



# Constructing a BeautifulSoup object

In [42]:
from bs4 import BeautifulSoup
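# Read the content from an HTML file
# Soup = BeautifulSoup(open("index.html"))
# Read the content from an HTML string
# Soup = BeautifulSoup("<html>data</html>")
# BeautifulSoup converts the HTML document to Unicode; because no parser is specified above, it picks the best available parser for the markup (the HTML code)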
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
<a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
<a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""



When no parser is specified, Beautiful Soup warns that it guessed one and suggests changing code that looks like `BeautifulSoup(YOUR_MARKUP)` to `BeautifulSoup(YOUR_MARKUP, "lxml")`.
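
As a minimal sketch of both construction routes with the parser named explicitly (which avoids the warning), the example below builds one soup from a string and one from an open file handle. The temporary directory and the file name `index.html` are stand-ins so the snippet runs anywhere; `html.parser` is used only because it needs no extra installation.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><head><title>data</title></head><body></body></html>"

# 1) From a string that is already in memory.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) From an open file handle; the with-block closes the file afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markup)
    with open(path, encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.title.string)
```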


# Node selectors
Access an entire node element directly by its tag name.

## Selecting elements

In [48]:
Soup = BeautifulSoup(html_doc,'lxml')
# Print the title node
print(Soup.title)
# Print the type of the title node, which is Tag
print(type(Soup.title))
print("title node text:",Soup.title.string)
print("head node:",Soup.head)
print("p node:",Soup.p)
print("node name:",Soup.title.name)
print("p node attributes:",Soup.p.attrs)
print("p node class attribute:",Soup.p.attrs["class"])
print("p node class attribute:",Soup.p["class"])

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
<a class="sister" href="http://example.com/lacie" id="link2"><span>Lacie</span></a> and
<a class="sister" href="http://example.com/tillie" id="link3"><span>Tillie</span></a>
and they lived at the bottom of a well.</p>


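bs4.element.Tag is the Beautiful Soup data type that represents a node element. It exposes many commonly used attributes, such as `string`, which returns the node's text.
As shown above, Soup.p returns a p node; when the document contains several p nodes, only the first match is returned.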
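
The first-match behaviour can be checked with a small sketch. The two-paragraph document below is made up for illustration, and `find_all` is used to show how to get every match instead of only the first.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">first</p>
<p class="story">second</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p             # attribute access returns only the first match
all_p = soup.find_all("p")   # find_all returns every match

print(type(first_p))
print(first_p.string)
print([p.string for p in all_p])
```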

## Extracting information
1. Get the name
2. Get the attributes
3. Get the content

In [32]:
print(Soup.title.name)
print(Soup.p.attrs)
print(Soup.p.attrs["class"])
print(Soup.p["class"])

title
{'class': ['story']}
['story']
['story']
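
The three kinds of information listed above can be sketched on a single tag. The markup is invented for this example; note that `class` comes back as a list because it is a multi-valued attribute, that `get` returns `None` for a missing attribute, and that `string` is `None` when a tag has more than one child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story intro" id="p1">Once <b>upon</b> a time</p>', "html.parser"
)
p = soup.p

name = p.name            # tag name
classes = p["class"]     # multi-valued attribute, returned as a list
pid = p["id"]            # single-valued attribute, returned as a string
missing = p.get("href")  # get() returns None instead of raising KeyError
only_string = p.string   # None, because p has several children
text = p.get_text()      # concatenated text of all descendants

print(name, classes, pid, missing, only_string, text)
```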


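## Nested selection
After selecting a node, you can continue selecting the nodes beneath it.

In [33]:
# Select the head node first, then the title node inside it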
print(Soup.head.title)

<title>The Dormouse's story</title>


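## Relational selection
Sometimes we select one particular node but also need information about related nodes, such as its parent, siblings, or children.
### Children and descendants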

In [36]:
print("Base node\n",Soup.p)
# Select node b directly under p (None here, because this p has no b child)
print(Soup.p.b)
# contents returns the direct children as a list; descendants are not expanded
print("contents (direct children):\n",Soup.p.contents)
# children returns an iterator over the direct children
print("children (iterator over direct children):\n",Soup.p.children)
print("children:\n",list(Soup.p.children))
# descendants returns a generator that recurses downward and yields every descendant, down to the text nodes
print("descendants (recursive generator):\n",Soup.p.descendants)
print("descendants:\n",list(Soup.p.descendants))

Base node
 <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
None
contents (direct children):
 ['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']
children (iterator over direct children):
 <list_iterator object at 0x000001BD01DF7518>
children:
 ['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n', <a class="sister" href="http://ex
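
The difference between direct children and all descendants is easier to see on a small fragment. The sketch below uses an invented one-line paragraph; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<p>Hello <a href="#"><span>Elsie</span></a>!</p>'
p = BeautifulSoup(snippet, "html.parser").p

# Direct children only: the text 'Hello ', the <a> tag, and the text '!'.
direct = list(p.children)

# Every node below p in document order:
# 'Hello ', <a>, <span>, 'Elsie', '!'.
everything = list(p.descendants)

print(len(p.contents), len(direct), len(everything))
```

`contents` and `children` cover the same nodes (a list versus an iterator), while `descendants` also walks into the `<a>` and `<span>` tags.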

### Parent and ancestor nodes

In [52]:
print("Base node\n",Soup.a)
# parent moves up from a to its enclosing p node
print("Parent node\n",Soup.a.parent)
print("Grandparent node\n",Soup.a.parent.parent)
# parents returns a generator over all ancestors of a
print("parents (ancestor generator)\n",Soup.a.parents)

Base node
 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
Parent node
 <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
<a class="sister" href="http://example.com/lacie" id="link2"><span>Lacie</span></a> and
<a class="sister" href="http://example.com/tillie" id="link3"><span>Tillie</span></a>
and they lived at the bottom of a well.</p>
Grandparent node
 <body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
<a class="sister" href="http://example.com/lacie" id="link2"><span>Lacie</span></a> and
<a class="sister" href="http://example.com/tillie" id="link3"><span>Tillie</span></a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
parents (ancestor generator)
 <generator object parents at 0x000001BD01DD38E0>
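
Printing the `parents` generator only shows the generator object. A minimal sketch of how one might consume it, using a cut-down document invented for this example, is to collect the name of each ancestor:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>t</title></head>
<body><p class="story"><a id="link1"><span>Elsie</span></a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# parents is lazy, so loop over it (or wrap it in list()) to see the nodes.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is `[document]`.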


### Sibling nodes

In [51]:
print('Base node\n',Soup.a)
print('Next sibling\n',Soup.a.next_sibling)
print('Previous sibling\n',Soup.a.previous_sibling)

Base node
 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
Next sibling
 

Previous sibling
 Once upon a time there were three little sisters; and their names were


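next_sibling did not return the next `<a>` tag. Inspecting next_siblings shows that it first matched the `\n` newline text node that follows the `<a>` tag. If the sibling tags in the markup are written on one line with nothing between them, the next tag is matched directly: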

```<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a><a class="sister" href="http://example.com/lacie" id="link2"><span>Lacie</span></a>```


In [54]:
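print('Generator over following siblings\n',Soup.a.next_siblings)
print('Generator over preceding siblings\n',Soup.a.previous_siblings)

Generator over following siblings
 <generator object next_siblings at 0x000001BD01DD32B0>
Generator over preceding siblings
 <generator object previous_siblings at 0x...>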
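
When whitespace text nodes get in the way, one option is to filter the siblings by type or to use `find_next_sibling`, which only considers tags. The sketch below assumes a shortened version of the sisters markup; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = (
    '<p><a id="link1">Elsie</a>\n'
    '<a id="link2">Lacie</a>\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

# The immediate sibling is the newline text node, not the next <a> tag.
raw_next = first.next_sibling

# find_next_sibling skips text nodes and returns the next matching tag.
next_tag = first.find_next_sibling("a")

# Keep only Tag objects while walking the lazy next_siblings generator.
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_tag["id"], later_ids)
```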


## Method selectors
Usage: `find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)`
1. Query elements by node name. The result is a list whose elements are of type bs4.element.Tag.

In [2]:

from bs4 import BeautifulSoup
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
    </div>
</div>
'''
Soup = BeautifulSoup(html,'lxml')

In [20]:
# name accepts a tag name; the elements of the returned list are bs4.element.Tag, so they support the same search methods
print(Soup.find_all(name='ul'))
print(type(Soup.find_all(name='ul')[0]))
for ul in Soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar


In [3]:
# attrs takes a dictionary of {attribute: value}; the elements of the returned list are bs4.element.Tag
print(Soup.find_all(attrs={'id':'list-1'}))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
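
The remaining `find_all` parameters can be sketched in the same way. The example below reuses a trimmed copy of the list markup with `html.parser`; `class_` is the keyword form Beautiful Soup provides because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"id": "list-1"})      # dictionary form
by_keyword = soup.find_all(id="list-1")               # keyword form, same result
by_class = soup.find_all("li", class_="element")      # filter on the class attribute
first_two = soup.find_all("li", limit=2)              # stop after two matches
top_level_li = soup.find_all("li", recursive=False)   # only direct children of soup
direct_li = soup.ul.find_all("li", recursive=False)   # only direct children of <ul>

print(len(by_class), [li.string for li in first_two], top_level_li, len(direct_li))
```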


In [5]:
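# text matches the text of a node and may be a regular expression; the elements of the returned list are bs4.element.NavigableString (the matching strings), not Tag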
htmlZ='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <a>hello,I'm study Machine Learning!</a>
        <a>hello,I'm study Deep Learning!</a>
    </div>
 </div>
'''
import re
SoupZ=BeautifulSoup(htmlZ,'lxml')
SoupZ.find_all(text=re.compile('Learning'))

["hello,I'm study Machine Learning!", "hello,I'm study Deep Learning!"]
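
Because a text search returns strings, a common follow-up is to step from each string to the tag that contains it. The sketch below uses invented link text and the `string` argument, which Beautiful Soup 4.4.0 and later accept as the newer name for `text`; treat that choice as an assumption about the installed version.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<div><a>hello, Machine Learning!</a>"
    "<a>hello, Deep Learning!</a><a>other</a></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Searching by string alone returns the matching NavigableString objects.
matches = soup.find_all(string=re.compile("Learning"))
match_types = {type(m).__name__ for m in matches}

# .parent steps from each string to its enclosing tag.
parent_names = [m.parent.name for m in matches]

# Combining a tag name with string returns the tags themselves.
tags = soup.find_all("a", string=re.compile("Learning"))

print(match_types, parent_names, len(tags))
```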

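2. The find method returns a single element, the first match, instead of a list. It is a bs4.element.Tag when searching for tags and a NavigableString when searching by text.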

In [8]:
SoupZ.find(text=re.compile('Learning'))

"hello,I'm study Machine Learning!"

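Other methods, which are called on an element such as a bs4.element.Tag:
find_parents() returns all ancestor nodes

find_parent() returns the direct parent node

find_next_siblings() returns all following sibling nodes

find_next_sibling() returns the first following sibling node

find_previous_siblings() returns all preceding sibling nodes

find_previous_sibling() returns the first preceding sibling node

find_all_next() returns all matching nodes after the node

find_next() returns the first matching node after the node

find_all_previous() returns all matching nodes before the node

find_previous() returns the first matching node before the node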
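
A few of the methods listed above can be sketched from the middle item of a short list. The markup is made up for illustration, and each call is given a tag name so that only tags (not text nodes) are returned.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel"><ul id="list-1">'
    "<li>Foo</li><li>Bar</li><li>Jay</li>"
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find_all("li")[1]  # start from the middle <li>

parent_id = bar.find_parent("ul")["id"]                   # nearest matching ancestor
div_ancestors = bar.find_parents("div")                   # all matching ancestors
before = bar.find_previous_sibling("li").string           # first sibling before
after = bar.find_next_sibling("li").string                # first sibling after
following = [li.string for li in bar.find_all_next("li")]      # matches later in the document
preceding = [li.string for li in bar.find_all_previous("li")]  # matches earlier in the document

print(parent_id, len(div_ancestors), before, after, following, preceding)
```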

In [31]:
SoupZ.find(text=re.compile('Learning')).find_previous()

<a>hello,I'm study Machine Learning!</a>

In [12]:
SoupZ.find(name='a').find_previous()

<div class="panel-body">
<a>hello,I'm study Machine Learning!</a>
<a>hello,I'm study Deep Learning!</a>
</div>

In [13]:
type(SoupZ.find(text=re.compile('Learning')))

bs4.element.NavigableString

In [23]:
SoupZ.find(attrs={'class':"panel-heading"}).find_next_sibling()

<div class="panel-body">
<a>hello,I'm study Machine Learning!</a>
<a>hello,I'm study Deep Learning!</a>
</div>

In [26]:
SoupZ.find(name='a').find_next_sibling()

<a>hello,I'm study Deep Learning!</a>

## CSS selectors

In [2]:
from bs4 import BeautifulSoup
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
    </div>
</div>
'''
Soup = BeautifulSoup(html,'lxml')
# For CSS selector syntax, see http://www.w3school.com.cn/cssref/css_selectors.asp


We only need to call the select() method and pass it a CSS selector:
`select(selector, _candidate_generator=None, limit=None)`

In [3]:
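# ".class" selector: select the element with class="panel", then the element with class="panel-heading" beneath it
print(Soup.select('.panel .panel-heading'))
# Select the li tags beneath ul tags
print(Soup.select('ul li'))
# Select elements with class="element" beneath the element with id="list-2"
print(Soup.select('#list-2 .element'))
# The elements of the list returned by select are bs4.element.Tag
print(type(Soup.select('ul')[0]))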

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
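
A few more selector forms can be sketched on the same kind of markup: `select_one` for a single result, the `>` child combinator, and an attribute prefix selector. The example assumes a Beautiful Soup version whose `select` supports these selectors (recent releases delegate CSS matching to the Soup Sieve package); the markup is a trimmed copy of the list above.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
    <ul class="list" id="list-1"><li class="element">Foo</li></ul>
    <ul class="list list-small" id="list-2">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
second_list = soup.select_one("#list-2")

# "ul.list-small > li" matches li elements that are direct children of that ul.
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

# [id^='list-'] matches elements whose id starts with "list-".
lists_by_prefix = soup.select("ul[id^='list-']")

print(second_list["class"], small_items, len(lists_by_prefix))
```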


## Nested selection

In [7]:
# Select every ul node, then select the li nodes beneath each one
for ul in Soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


## Getting attributes

In [39]:
# Read an attribute from each Tag element
for attr in (map(lambda x:x['id'],Soup.select('ul'))):
    print(attr)

list-1
list-2


## Getting text

In [41]:
# Use get_text() or string
for li in Soup.select('li'):
    print('Get Text:',li.get_text())
    print('String:',li.string)

Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
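
`get_text()` and `string` agree in the output above because each `li` holds a single text node. The sketch below, on an invented item with a nested tag, shows where they differ: `string` becomes `None`, while `get_text()` still joins the text of all descendants.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li>Foo</li><li>Foo <b>Bar</b></li></ul>", "html.parser"
)
simple, nested = soup.select("li")

simple_string = simple.string      # single text child: the text itself
simple_text = simple.get_text()

nested_string = nested.string      # several children: None
nested_text = nested.get_text()    # text of all descendants, concatenated

print(simple_string, simple_text, nested_string, nested_text)
```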
