In [1]:
from bs4 import BeautifulSoup

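Beautiful Soup is also an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.
It supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's XML parser.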

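lxml only traverses the document locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's.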

Install it with pip: pip install beautifulsoup4
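As a quick check that the install works, the sketch below parses a small string. The sample markup and variable names are made up for illustration, and it assumes the built-in html.parser backend so that nothing beyond beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this check
doc = "<p>Hello, <b>world</b></p>"

# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(doc, "html.parser")

print(soup.b.string)      # world
print(soup.p.get_text())  # Hello, world
```

Passing "lxml" instead of "html.parser" selects the lxml backend, provided the lxml package is installed.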

Scraping tool|Speed|Difficulty of use|Installation difficulty
---|:--:|---:|---:
Regex|Fastest|Hard|None (built in)
BeautifulSoup|Slow|Easiest|Easy
lxml|Fast|Easy|Moderate

**Beautiful Soup objects**

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:

    Tag
    NavigableString
    BeautifulSoup
    Comment
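The four kinds can be seen side by side in a small sketch. The one-line document below is invented for illustration, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

# Hypothetical document: one tag holding a text node and a comment node
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

tag = soup.p                  # a Tag
text, comment = tag.contents  # a NavigableString and a Comment

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

BeautifulSoup is a subclass of Tag and Comment is a subclass of NavigableString, so isinstance checks against the base classes also succeed for them.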

**Tag**

Put simply, a Tag is one of the individual tags in the HTML, for example:

    <title>The Dormouse's story</title>

In [10]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
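# Create the Beautiful Soup object; name the parser explicitly to avoid a warning
# ('lxml' needs the lxml package; 'html.parser' is the built-in alternative)
bs = BeautifulSoup(html, 'lxml')
# Or parse a local file
#bs = BeautifulSoup(open('test.html'), 'lxml')

# Pretty-print the contents of the soup object
#print(bs.prettify())

# Look up a Tag by name; this returns only the first matching tag in the whole document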
print(bs.title)
print(bs.a)
print(type(bs.a))

print(bs.name)

# Get all attributes of p
print(bs.p.attrs)
# Get the value of p's class attribute
print(bs.p.attrs['class'])
print(bs.p.get('class'))
# Change the value of p's class attribute
bs.p.attrs['class'] = 'newClass'
print(bs.p.get('class'))
# Delete p's class attribute
del bs.p.attrs['class']
print(bs.p.attrs)
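One detail worth illustrating about the attribute access above: class is a multi-valued attribute, so Beautiful Soup returns it as a list, while single-valued attributes such as id come back as plain strings. The markup and variable names in this sketch are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and an id
soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = soup.p

classes = p["class"]            # tag[...] is shorthand for tag.attrs[...]
pid = p["id"]
missing = p.get("lang")         # get() returns None when the attribute is absent
fallback = p.get("lang", "en")  # or a default value, like dict.get

print(classes)   # ['title', 'big']
print(pid)       # intro
print(missing)   # None
print(fallback)  # en
```

Using get() rather than indexing avoids a KeyError on tags that lack the attribute.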

**NavigableString**

.string: gets the text inside a tag
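A point that often causes confusion: .string only returns a value when the tag has a single child; with several children it returns None, and get_text() is the usual alternative. The sketch below uses an invented paragraph to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with mixed content: text, a <b> tag, more text
soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")
p = soup.p

print(p.string)      # None, because <p> has three children
print(p.b.string)    # upon
print(p.get_text())  # Once upon a time
```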

**BeautifulSoup**

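A BeautifulSoup object represents the content of a whole document. It is a special kind of Tag, so its type, name, and attributes can be read in the same way.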

**Comment**

A Comment object is a special kind of NavigableString; when printed, its content does not include the comment delimiters.
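Because a comment prints like ordinary text, code that reads .string can pick up comment content by accident. One possible guard, sketched here with a hypothetical helper named visible_string, is to check the type before using the value.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_string(tag):
    """Return the tag's text, or None if it is missing or is an HTML comment.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

# Hypothetical markup: the first link contains only a comment
soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>',
    "html.parser",
)
links = soup.find_all("a")

print(visible_string(links[0]))  # None
print(visible_string(links[1]))  # Lacie
```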

In [14]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object (parser named explicitly)
bs = BeautifulSoup(html, 'lxml')

# Get the text inside the tag
print(bs.p.string)

print(bs.name)
print(type(bs.name))
print(bs.attrs)

print(bs.a)
print(bs.a.string)
print(type(bs.a.string))

The Dormouse's story
[document]
<class 'str'>
{}
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>


In [16]:
# Traversing the document tree

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object (parser named explicitly)
bs = BeautifulSoup(html, 'lxml')

# Direct children: .contents (a list) and .children (an iterator)
print(bs.head.contents)
print(bs.head.children)
for child in bs.head.children:
    print(child)

# All descendants: the .descendants attribute
for node in bs.descendants:
    print(node)

[<title>The Dormouse's story</title>]
<list_iterator object at 0x0000024046193128>
<title>The Dormouse's story</title>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://examp
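Since the full traversal output is long, a smaller sketch may make the difference between the three attributes easier to see. The document below is invented for illustration: .contents and .children cover direct children only, while .descendants also walks into nested tags and text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: a <div> with two <p> children, one of them nested deeper
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct = div.contents                   # list of direct children
names = [c.name for c in div.children]  # same nodes, via an iterator
all_nodes = list(div.descendants)       # every nested tag and string
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct))     # 2
print(names)           # ['p', 'p']
print(len(all_nodes))  # 6
print(tag_names)       # ['p', 'b', 'p']
```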

**Searching the document tree**

1.find_all(name, attrs, recursive, text, **kwargs)

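    The name argument finds all tags whose name matches name; string objects are skipped automatically
        A. Pass a string: finds every tag whose name matches the string exactly
        B. Pass a regular expression: tag names are matched with the pattern's search() method
        C. Pass a list: returns every tag that matches any element of the list
    The text argument searches the string content of the document; like name, it accepts a string, a regular expression, or a list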
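The signature above also lists attrs and recursive, which the notes do not cover. The sketch below shows one way they might be used, together with the class_ keyword and the limit argument that find_all also accepts. The markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document modeled loosely on the sample used in this notebook
doc = """<html><head><title>Story</title></head>
<body><a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a>
<a class="sister" id="link3" href="/tillie">Tillie</a></body></html>"""
soup = BeautifulSoup(doc, "html.parser")

# attrs: filter on attribute values given as a dict
by_attrs = soup.find_all("a", attrs={"id": "link2"})

# class_ (trailing underscore, since class is a Python keyword) plus limit
by_class = soup.find_all("a", class_="sister", limit=2)

# recursive=False looks only at direct children of the tag being searched
shallow = soup.html.find_all("title", recursive=False)
deep = soup.html.find_all("title")

# find() returns just the first match, or None when nothing matches
first = soup.find("a")

print([a["id"] for a in by_attrs])  # ['link2']
print([a["id"] for a in by_class])  # ['link1', 'link2']
print(len(shallow), len(deep))      # 0 1
print(first["id"])                  # link1
```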


In [20]:
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object (parser named explicitly)
bs = BeautifulSoup(html, 'lxml')
# Pass a string
print(bs.find_all('a'))

# Pass a regular expression
for tag in bs.find_all(re.compile('^b')):
    print(tag.name)

# Pass a list
print(bs.find_all(['a','b']))
# Keyword argument
print(bs.find_all(id="link2"))
# text argument
print(bs.find_all(text='Lacie'))
print(bs.find_all(text=['Lacie','Tillie']))
print(bs.find_all(text=re.compile('Dormouse')))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
body
b
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
['Lacie']
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
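As the last three results show, a text search returns the matching strings rather than the tags that contain them. A small sketch of getting back to the enclosing tag through .parent follows. It assumes Beautiful Soup 4.4.0 or later, where the argument is named string (text remains as the older spelling), and the markup is invented.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document in which the same phrase appears inside two different tags
doc = "<title>The Dormouse's story</title><p><b>The Dormouse's story</b></p><p>Other</p>"
soup = BeautifulSoup(doc, "html.parser")

# Each match is a NavigableString, not a Tag
matches = soup.find_all(string=re.compile("Dormouse"))

# .parent gives the tag that directly contains each string
parents = [s.parent.name for s in matches]

print(len(matches))  # 2
print(parents)       # ['title', 'b']
```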


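**CSS selectors**

In CSS, a tag name is written with no decoration, a class name is prefixed with ., and an id is prefixed with #.

Elements can be filtered here in a similar way, using the soup.select() method, which returns a list.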
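A common follow-up is pulling attribute values out of the selected tags. The sketch below does that with select(), and also uses select_one(), which returns only the first match or None. The markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links inside a paragraph
doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(doc, "html.parser")

# Tag name + class, combined with the child combinator
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns a single tag instead of a list
second = soup.select_one("#link2")
nothing = soup.select_one("#nope")

print(hrefs)              # ['http://example.com/elsie', 'http://example.com/lacie']
print(second.get_text())  # Lacie
print(nothing)            # None
```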


In [21]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object (parser named explicitly)
bs = BeautifulSoup(html, 'lxml')
# Select by tag name
print(bs.select('title'))

# Select by class name
print(bs.select('.sister'))

# Select by id (the id needs a leading #)
print(bs.select('#link1'))

# Combined selectors, separated by a space
print(bs.select('p #link1'))
print(bs.select('head > title'))

# Attribute selectors: put the attribute in square brackets. The attribute and the tag
# belong to the same node, so there must be no space between them, or nothing will match.
print(bs.select('a[class="sister"]'))
print(bs.select('p a[href="http://example.com/elsie"]'))

# select() always returns a list, so iterate over it and call get_text() to read the content.
soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())

[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
<class 'list'>
The Dormouse's story
The Dormouse's story
