# Quick Start

The HTML snippet below is used repeatedly in the examples that follow.


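Text enclosed in triple quotes (three ASCII quote characters on each side) is also a string, and it keeps its original layout.

In other words, line breaks and indentation are output exactly as they were typed.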
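As a quick check of the point above, here is a minimal sketch in plain Python. The variable names snippet and lines are made up for illustration.

```python
# A triple-quoted string keeps its line breaks and indentation as typed.
snippet = """<p>
  first line
</p>"""

# Splitting on the newline character shows the three original lines.
lines = snippet.split("\n")
print(len(lines))      # 3
print(repr(lines[1]))  # '  first line'
```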

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [2]:
print(html_doc)


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>



Next, we parse the HTML above with BeautifulSoup to get a BeautifulSoup object, which can print the document with standard nested indentation.

In [3]:
from bs4 import BeautifulSoup
# Name the parser explicitly; the outputs in this notebook match lxml.
soup = BeautifulSoup(html_doc, 'lxml')

In [4]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are a few simple ways to navigate the resulting data structure:

In [5]:
soup.title

<title>The Dormouse's story</title>

In [6]:
soup.title.name

'title'

In [7]:
soup.title.string

"The Dormouse's story"

In [8]:
soup.title.parent.name

'head'

In [9]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [10]:
soup.p['class']

['title']

In [11]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Extract the link URL from every a tag in the document:

In [13]:
for href in soup.find_all('a'):
    print(href.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


Extract all of the text in the document:

In [14]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



# Installing BeautifulSoup

pip install beautifulsoup4

# Installing a Parser

pip install lxml
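The parser is chosen by the second argument to the BeautifulSoup constructor. The sketch below assumes only that bs4 is installed: it uses the standard-library 'html.parser' first, then falls back to it if lxml is missing. The names builtin_soup and parser_name are illustrative.

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with Python, so it needs no extra install.
builtin_soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(builtin_soup.p.string)  # hello

# 'lxml' only works once the lxml package is installed; fall back otherwise.
try:
    import lxml  # noqa: F401
    parser_name = "lxml"
except ImportError:
    parser_name = "html.parser"

soup = BeautifulSoup("<p>hello</p>", parser_name)
print(parser_name, soup.p.string)
```

Different parsers can build different trees from the same invalid markup, so naming the parser keeps results reproducible across machines.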

# How to Use It

The BeautifulSoup constructor accepts two kinds of input: a string or an open file object.

In [15]:
from bs4 import BeautifulSoup

In [16]:
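# Pass an open file handle; the with block closes it afterwards.
with open('test.html') as fp:
    soup = BeautifulSoup(fp, 'lxml')

# Or pass the markup directly as a string.
soup = BeautifulSoup('<html>data</html>', 'lxml')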

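The document is first converted to Unicode, and HTML entities are converted to Unicode characters. The parser then normalizes the markup. With lxml, for example, a bare string is wrapped in the tags it was missing: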

In [17]:
BeautifulSoup('hello world')

<html><body><p>hello world</p></body></html>
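Both input types, and the entity conversion described above, can be tried without a file on disk. In this sketch io.StringIO stands in for an open file and 'html.parser' is an assumed parser choice; the markup and variable names are illustrative.

```python
import io
from bs4 import BeautifulSoup

markup = "<html><body><p>data &amp; more</p></body></html>"

# Case 1: pass the markup as a string.
from_string = BeautifulSoup(markup, "html.parser")

# Case 2: pass an object with a read() method; StringIO stands in for a file.
from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

# The &amp; entity has been converted to the Unicode character &.
print(from_string.p.string)                        # data & more
print(from_string.p.string == from_file.p.string)  # True
```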

# Kinds of Objects

BeautifulSoup turns a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds:

Tag, NavigableString, BeautifulSoup, and Comment.

## Tag

A Tag object corresponds to a tag in the original XML or HTML document.

In [18]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

In [19]:
print(tag)
print(type(tag))

<b class="boldest">Extremely bold</b>
<class 'bs4.element.Tag'>


Tags have many methods and attributes. The two most important are name and attributes.

### name

Every tag has a name, available through .name

In [20]:
print(tag.name)

b


If you change a tag's name, the change shows up in any HTML generated from the current BeautifulSoup object.

In [21]:
tag.name = 'blockquote'
print(tag)

<blockquote class="boldest">Extremely bold</blockquote>


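### attributes

A tag may have any number of attributes. The tag <b class="boldest"> has a class attribute whose value is boldest. Tag attributes are accessed the same way as dictionary keys.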

In [22]:
tag['class']

['boldest']

In [23]:
tag.get('class')

['boldest']

In [24]:
tag.attrs

{'class': ['boldest']}

Tag attributes can be added, removed, and modified, again just like dictionary entries.

In [25]:
tag['class'] = 'verybold'
tag['id'] = 'verygood'
tag

<blockquote class="verybold" id="verygood">Extremely bold</blockquote>

In [26]:
del tag['class']

In [27]:
# tag['class']  # would raise KeyError: 'class', because the attribute was deleted

In [28]:
print(tag.get('class'))

None
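The dictionary-like behaviour after a deletion can be checked with a small sketch. The markup, the id value x1, and the 'html.parser' choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

del tag["class"]

# Indexing a missing attribute raises KeyError, as with a dict.
try:
    tag["class"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(missing_raises)           # True
print(tag.get("class", "n/a"))  # n/a  (get() accepts a default, like dict.get)
print(tag.has_attr("id"))       # True
print(sorted(tag.attrs))        # ['id']
```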


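### Multi-valued attributes

The most common multi-valued attribute is class (a tag can have more than one CSS class). Multi-valued attributes are returned as a list.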

In [29]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']

['body', 'strikeout']

In [30]:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']

['body']

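If an attribute looks like it has more than one value, such as id here, but no version of the HTML standard defines it as multi-valued, Beautiful Soup returns it as a single string.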

In [31]:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']

'my id'

When a tag is turned back into a string, the values of a multi-valued attribute are joined into one value.

In [32]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']

['index']

In [33]:
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

<p>Back to the <a rel="index contents">homepage</a></p>


If the document is parsed as XML, tags have no multi-valued attributes.

In [34]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'
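When code needs a list regardless of whether the attribute is multi-valued, the get_attribute_list() method of Tag is one option. This sketch assumes a reasonably recent Beautiful Soup 4 release that provides it, and uses 'html.parser' as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', "html.parser")
p = soup.p

# id is single-valued: plain indexing returns one string.
print(p["id"])                        # my id

# get_attribute_list() always returns a list, whatever the attribute.
print(p.get_attribute_list("id"))     # ['my id']
print(p.get_attribute_list("class"))  # ['body', 'strikeout']
```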

## NavigableString

A string is usually a piece of text inside a tag. Beautiful Soup wraps such text in the NavigableString class.

In [35]:
tag.string

'Extremely bold'

In [36]:
type(tag.string)

bs4.element.NavigableString

A string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method.

In [37]:
tag.string.replace_with('No longer bold')

'Extremely bold'

In [38]:
tag

<blockquote id="verygood">No longer bold</blockquote>
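A NavigableString behaves like an ordinary Python string but also knows its place in the tree. The sketch below, with illustrative names and 'html.parser' as an assumed parser, shows both sides and converts the node to a plain str for use outside Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
text_node = soup.b.string

print(isinstance(text_node, NavigableString))  # True
print(isinstance(text_node, str))              # True
print(text_node.parent.name)                   # b

# str() gives a plain string that no longer refers to the parse tree.
plain = str(text_node)
print(plain.upper())                           # EXTREMELY BOLD
```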

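## Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML file, but a few special objects remain. The one you are most likely to have to deal with is the comment: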

In [39]:
markup = '<b><!--Hey, buddy. Want to buy a used parser?--></b>'
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)

bs4.element.Comment

In [40]:
comment

'Hey, buddy. Want to buy a used parser?'

When a Comment object appears as part of an HTML document, it is displayed with special formatting.

In [41]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
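Because Comment is a subclass of NavigableString, comments can be told apart from ordinary text with isinstance(). A minimal sketch, assuming 'html.parser' and made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node under the root and keep only the comments.
comments = [node for node in soup.descendants if isinstance(node, Comment)]
print(comments)         # [' hidden note ']

# get_text() collects ordinary strings and leaves comments out.
print(soup.get_text())  # visible text
```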


# Navigating the Tree

We again use the following document as the example.

In [42]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # explicit parser; lxml adds the missing <body>

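## Child nodes

A tag may contain strings and other tags. These are the tag's children, and Beautiful Soup provides many attributes for navigating and iterating over them.

Note that strings in Beautiful Soup do not support these attributes, because a string cannot have children.

## Navigating by tag name

The simplest way to navigate the tree is to name the tag you want. To get the head tag, just write soup.head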

In [43]:
soup.head

<head><title>The Dormouse's story</title></head>

In [44]:
soup.title

<title>The Dormouse's story</title>

This shortcut can be chained to zoom in on part of the tree. The following code gets the first b tag inside the body tag.

In [45]:
soup.body.b

<b>The Dormouse's story</b>

In [46]:
soup.b

<b>The Dormouse's story</b>

Accessing a tag name as an attribute returns only the first tag with that name.

In [47]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get every tag with that name, use the find_all() method, which returns a list.

In [48]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
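The difference between the attribute shortcut, find(), and find_all() can be sketched as follows. The markup is a trimmed-down stand-in for the example document, and 'html.parser' is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# soup.a and find() both return only the first match.
print(soup.a["id"])                                    # link1
print(soup.find("a")["id"])                            # link1

# find_all() returns every match; limit caps the number of results.
print([a["id"] for a in soup.find_all("a")])           # ['link1', 'link2', 'link3']
print([a["id"] for a in soup.find_all("a", limit=2)])  # ['link1', 'link2']

# With no match, find() returns None and find_all() returns an empty list.
print(soup.find("table"))                              # None
print(soup.find_all("table"))                          # []
```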

## .contents and .children

A tag's .contents attribute returns its child nodes as a list.

In [49]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [50]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [51]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [52]:
title_tag.contents

["The Dormouse's story"]

The BeautifulSoup object itself has children. Here, the <html> tag is the only child of the BeautifulSoup object.

In [53]:
len(soup.contents)

1

In [54]:
soup.contents[0].name

'html'
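The .children attribute covers the same nodes as .contents but yields them one at a time instead of building a list. A minimal sketch with made-up markup and 'html.parser' as an assumed parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a list, so it supports len() and indexing.
print(len(ul.contents))          # 2
print(ul.contents[0])            # <li>one</li>

# .children is an iterator over the same direct children.
for child in ul.children:
    print(child.name, child.string)
# li one
# li two
```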

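## Filters

## A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches it against tag names. The call below finds every b tag in the document; the string is in fact the value of the name argument.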

In [55]:
soup.find_all('b')

[<b>The Dormouse's story</b>]

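## Keyword arguments

Any keyword argument that is not one of the search method's built-in parameters is treated as a filter on the tag attribute of the same name. If you pass an argument called id, Beautiful Soup filters on each tag's id attribute.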

In [56]:
soup.find_all(id='link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
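Keyword filters accept more than exact strings. The sketch below uses a compiled regular expression, the value True, and two filters combined; the markup is a shortened, illustrative version of the example document and 'html.parser' is an assumed parser.

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a>No id here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is searched for inside the attribute value.
print([a["id"] for a in soup.find_all(href=re.compile("elsie"))])  # ['link1']

# True matches any tag that has the attribute, whatever its value.
print([t["id"] for t in soup.find_all(id=True)])  # ['link1', 'link2']

# Several keyword filters must all match at once.
both = soup.find_all(href=re.compile("lacie"), id="link2")
print([a.get_text() for a in both])  # ['Lacie']
```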

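## Searching by CSS class

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Beautiful Soup provides the class_ argument instead for finding tags with a given class name.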

In [57]:
soup.find_all('a', class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Because class is a multi-valued attribute, a search by CSS class matches against any single one of a tag's classes.

In [58]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all('p', class_='strikeout')

[<p class="body strikeout"></p>]

In [59]:
css_soup.find_all('p', class_='body')

[<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute.

In [60]:
css_soup.find_all('p', class_='body strikeout')

[<p class="body strikeout"></p>]

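When matching the full class string, a query whose class names are in a different order from the document finds nothing. A single class can also be passed through the attrs dictionary instead of class_: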

In [61]:
soup.find_all('a', attrs={'class': 'sister'})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
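The ordering rule for exact class matches can be checked with a short sketch. It assumes 'html.parser' as the parser and uses select() as an order-independent way to require both classes.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string match: same order as in the document, so it is found.
same_order = css_soup.find_all("p", class_="body strikeout")
print(len(same_order))     # 1

# Same classes in a different order: the exact string no longer matches.
swapped_order = css_soup.find_all("p", class_="strikeout body")
print(len(swapped_order))  # 0

# A CSS selector requires both classes regardless of their order.
by_selector = css_soup.select("p.strikeout.body")
print(len(by_selector))    # 1
```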

# CSS Selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the select() method of a Tag or of the BeautifulSoup object to find tags using CSS selector syntax.

In [64]:
soup.select('title')

[<title>The Dormouse's story</title>]

Find tags nested beneath other tags, at any depth:

In [67]:
soup.select('body a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags that are direct children of another tag:

In [68]:
soup.select('head > title')

[<title>The Dormouse's story</title>]

In [69]:
soup.select('p > a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [70]:
soup.select('p > a:nth-of-type(2)')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [71]:
soup.select('p > #link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find tags by CSS class:

In [72]:
soup.select('.sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find a tag by its id. An id value is supposed to be unique within an HTML page:

In [74]:
soup.select('#link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [75]:
soup.select('#link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
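A few more selector forms, as a hedged sketch: select_one() returns only the first match, attribute selectors filter on attribute values, and a comma combines selectors. The markup is a shortened copy of the example document and 'html.parser' is an assumed parser.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching tag, or None if nothing matches.
first = soup.select_one("a.sister")
print(first["id"])                  # link1
print(soup.select_one("#missing"))  # None

# Attribute selector: href values that end with "tillie".
print([a["id"] for a in soup.select('a[href$="tillie"]')])     # ['link3']

# A comma-separated list matches any of the selectors.
print([a.get_text() for a in soup.select("#link1, #link2")])   # ['Elsie', 'Lacie']
```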