# BeautifulSoup

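Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, it uses Python's built-in parser (html.parser). The lxml parser is more powerful and faster, so installing it is recommended.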
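
The parser choice can be sketched as follows. This is a minimal example, assuming lxml may or may not be installed; make_soup is a hypothetical helper name, not part of Beautiful Soup. It tries lxml first and falls back to the standard-library html.parser when Beautiful Soup raises FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Both parsers give the same result for simple, well-formed markup like this; they can differ in how they repair broken HTML.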

## Basic usage

In [15]:
html = """
    <html lang="en" >
    <head>
        <meta charset="utf-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
        <title>客户中心 - Shadowsocks.com</title>
    </head>
    <body>
        <p class="title" name="dromouse"><b> chenzhi </b></p>
        <p class="title" name="dromouse"><b> cz </b></p>
        <p class="title" name="dromouse"><b> cz1 </b></p>
    </body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object

print(soup.prettify())  # pretty-print the HTML; missing closing tags are filled in automatically

print(soup.title.string)  # text of the title tag

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/>
  <title>
   客户中心 - Shadowsocks.com
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    chenzhi
   </b>
  </p>
  <p class="title" name="dromouse">
   <b>
    cz
   </b>
  </p>
  <p class="title" name="dromouse">
   <b>
    cz1
   </b>
  </p>
 </body>
</html>
客户中心 - Shadowsocks.com


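## Tag selectors

Accessing soup.tag (the tag name as an attribute) returns that tag together with its contents.

*If the document contains several tags with that name, only the first one is returned.*

### Selecting elements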

In [2]:
print(soup.title)

print(type(soup.title))

print(soup.head)

<title>客户中心 - Shadowsocks.com</title>
<class 'bs4.element.Tag'>
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/>
<title>客户中心 - Shadowsocks.com</title>
</head>
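
A small sketch of the first-match behaviour described above. The markup is made up for illustration, and html.parser is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

markup = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.p            # attribute access returns the first <p> only
same_tag = soup.find("p")   # find() returns that same element
missing = soup.table        # there is no <table>, so this is None

print(first_p.get_text())
print(first_p is same_tag)
print(missing)
```

Because a missing tag gives None rather than an error, chained access such as soup.table.tr fails with AttributeError; check for None first when a tag may be absent.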


### Getting the tag name

Use the name attribute: soup.tag.name

In [4]:
print(soup.title.name)

title


### Getting attributes

soup.tag['attr'] or soup.tag.attrs['attr']

In [6]:
print(soup.p['name'])

# equivalent to the line above
print(soup.p.attrs['name'])

dromouse
dromouse
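
Two details of attribute access are worth a short sketch: subscripting a missing attribute raises KeyError, while get() returns None or a default, and the class attribute comes back as a list because it can hold several values. The markup below is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">hi</p>', "html.parser")
p = soup.p

name = p["name"]                 # present attribute: plain string
classes = p["class"]             # multi-valued attribute: list of strings
missing = p.get("id")            # absent attribute: None instead of KeyError
fallback = p.get("id", "no-id")  # absent attribute with a default

print(name, classes, missing, fallback)
```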


### Getting the text content

soup.tag.string

In [7]:
print(soup.p.string)

 chenzhi 
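
The string attribute only works when a tag has a single child; otherwise it is None. A minimal sketch comparing it with get_text(), using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b> chenzhi </b></p><div>a<b>b</b>c</div>", "html.parser")

single = soup.p.string               # one child chain, so the inner string is returned
mixed = soup.div.string              # several children, so this is None
joined = soup.div.get_text()         # concatenates all text inside the tag
clean = soup.p.get_text(strip=True)  # same text with surrounding whitespace removed

print(repr(single), mixed, joined, clean)
```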


### Nested selection

In [8]:
print(soup.body.p.string)  # chain tag names to descend into the tree

 chenzhi 


### Children and descendants

In [9]:
# contents returns all direct children as a list
print(soup.body.contents) 

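['\n', <p class="title" name="dromouse"><b> chenzhi </b></p>, '\n', <p class="title" name="dromouse"><b> cz </b></p>, '\n', <p class="title" name="dromouse"><b> cz1 </b></p>, '\n']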


In [10]:
# children also gives the direct children, but as an iterator that has to be looped over

print(soup.p.children)

for i,item in enumerate(soup.p.children):
    print(i,item)

<list_iterator object at 0x7f3b90107990>
0 <b> chenzhi </b>


In [11]:
# descendants gives every descendant (children, grandchildren, ...) as a generator

print(soup.p.descendants)

for i,item in enumerate(soup.p.descendants):
    print(i,item)

<generator object Tag.descendants at 0x7f3b7eeaf250>
0 <b> chenzhi </b>
1  chenzhi 
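
To make the difference between children and descendants concrete, here is a short sketch on made-up markup. Text nodes count as descendants too, so the example also filters for Tag objects.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p><b>x</b></p><p>y</p></div>", "html.parser")
div = soup.div

child_names = [c.name for c in div.children]   # direct children only
all_descendants = list(div.descendants)        # <p>, <b>, 'x', <p>, 'y'
tag_names = [d.name for d in all_descendants if isinstance(d, Tag)]

print(child_names)
print(len(all_descendants))
print(tag_names)
```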


### Parent and ancestors

In [12]:
# parent returns the direct parent
print(soup.p.parent)

<body>
<p class="title" name="dromouse"><b> chenzhi </b></p>
<p class="title" name="dromouse"><b> cz </b></p>
<p class="title" name="dromouse"><b> cz1 </b></p>
</body>


In [14]:
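# parents gives every ancestor as a generator; the last one is the BeautifulSoup object itself, which prints the same as <html>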
 
print(soup.p.parents) 

print(list(enumerate(soup.p.parents)))

<generator object PageElement.parents at 0x7f3b7ec5c3d0>
[(0, <body>
<p class="title" name="dromouse"><b> chenzhi </b></p>
<p class="title" name="dromouse"><b> cz </b></p>
<p class="title" name="dromouse"><b> cz1 </b></p>
</body>), (1, <html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/>
<title>客户中心 - Shadowsocks.com</title>
</head>
<body>
<p class="title" name="dromouse"><b> chenzhi </b></p>
<p class="title" name="dromouse"><b> cz </b></p>
<p class="title" name="dromouse"><b> cz1 </b></p>
</body>
</html>), (2, <html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/>
<title>客户中心 - Shadowsocks.com</title>
</head>
<body>
<p class="title" name="dromouse"><b> chenzhi </b></p>
<p class="title" name="dromouse"><b> cz </b></p>
<p class="title" name="dromouse"><b> cz1 </b></p>
</body>
</html>)]
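
Printing whole ancestors is verbose; listing only their names is often easier to read. A minimal sketch on made-up markup, which also uses find_parent() to jump straight to one named ancestor:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")
b = soup.b

# The final ancestor is the BeautifulSoup object, whose name is "[document]".
ancestor_names = [a.name for a in b.parents]
nearest_body = b.find_parent("body")

print(ancestor_names)
print(nearest_body.name)
```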


### Siblings

In [16]:
# next_siblings gives the siblings that come after the node
# previous_siblings gives the siblings that come before the node

print(list(enumerate(soup.p.next_siblings)))

print(list(enumerate(soup.p.previous_siblings)))

[(0, '\n'), (1, <p class="title" name="dromouse"><b> cz </b></p>), (2, '\n'), (3, <p class="title" name="dromouse"><b> cz1 </b></p>), (4, '\n')]
[(0, '\n')]
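
As the output shows, whitespace between tags counts as a sibling. When only tags are wanted, find_next_sibling() and find_next_siblings() skip the text nodes. A minimal sketch with simplified markup:

```python
from bs4 import BeautifulSoup

markup = "<body>\n<p>chenzhi</p>\n<p>cz</p>\n<p>cz1</p>\n</body>"
soup = BeautifulSoup(markup, "html.parser")
first = soup.p

raw = list(first.next_siblings)                 # tags and '\n' strings mixed
later = [p.get_text() for p in first.find_next_siblings("p")]
next_one = first.find_next_sibling("p").get_text()

print(len(raw))
print(later)
print(next_one)
```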


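## Standard selectors

### find_all

find_all searches the document by tag name (name), attributes (attrs), or text content (text). In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.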

#### name 

In [19]:
html = '''

<div class = "panel">
    <div class = "panel-heading">
        <h4>hello</h4>
    </div>
    <div class = "panel-body">
        <ul class = "list" id = "list-1">
            <li class = "elements">foo<li>
            <li class = "elements">foo1<li>
            <li class = "elements">foo2<li>
        </ul>
        <ul class = "list list-small" id = "list-2">
            <li class = "elements">foo3<li>
            <li class = "elements">foo4<li>
            <li class = "elements">foo5<li>
        </ul>
    </div>
</div>
'''
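# Note: each item above ends with <li> instead of </li>, so the parser adds an empty <li> after every item
soup = BeautifulSoup(html,'lxml')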
print(soup.find_all('ul'))
print()
print(soup.find_all('ul')[0])

[<ul class="list" id="list-1">
<li class="elements">foo</li><li>
</li><li class="elements">foo1</li><li>
</li><li class="elements">foo2</li><li>
</li></ul>, <ul class="list list-small" id="list-2">
<li class="elements">foo3</li><li>
</li><li class="elements">foo4</li><li>
</li><li class="elements">foo5</li><li>
</li></ul>]

<ul class="list" id="list-1">
<li class="elements">foo</li><li>
</li><li class="elements">foo1</li><li>
</li><li class="elements">foo2</li><li>
</li></ul>


#### attrs 

The attrs argument takes a dictionary.

In [20]:
print(soup.find_all(attrs={'id':'list-1'}))

[<ul class="list" id="list-1">
<li class="elements">foo</li><li>
</li><li class="elements">foo1</li><li>
</li><li class="elements">foo2</li><li>
</li></ul>]


In [22]:
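# Common attributes such as id and class can be passed as keyword arguments instead of a dictionary; class is written class_ because class is a reserved word in Python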

print(soup.find_all(id="list-1"))

print(soup.find_all(class_="elements"))

[<ul class="list" id="list-1">
<li class="elements">foo</li><li>
</li><li class="elements">foo1</li><li>
</li><li class="elements">foo2</li><li>
</li></ul>]
[<li class="elements">foo</li>, <li class="elements">foo1</li>, <li class="elements">foo2</li>, <li class="elements">foo3</li>, <li class="elements">foo4</li>, <li class="elements">foo5</li>]
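
A few more find_all patterns, sketched on a shortened and well-formed version of the sample markup (the closing tags here are corrected, which is an assumption about what the sample intended). Matching on class_ succeeds if any one of the tag's classes matches, limit caps the number of results, and a list of names matches any of them.

```python
from bs4 import BeautifulSoup

markup = """
<ul class="list" id="list-1"><li class="elements">foo</li><li class="elements">foo1</li></ul>
<ul class="list list-small" id="list-2"><li class="elements">foo3</li></ul>
"""
soup = BeautifulSoup(markup, "html.parser")

both = [ul["id"] for ul in soup.find_all("ul", class_="list")]     # matches either class list
small = [ul["id"] for ul in soup.find_all(class_="list-small")]
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]  # stop after two matches
mixed = [t.name for t in soup.find_all(["ul", "li"])]               # any of several tag names

print(both, small, first_two, mixed)
```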


#### text 

In [23]:
print(soup.find_all(text="foo"))

['foo']
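
An exact text match is often too strict. The sketch below, on made-up markup, uses the string argument (the newer name for text) with a compiled regular expression, and combines it with a tag name to get the tags rather than the strings.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>foo</li><li>foo1</li><li>bar</li></ul>", "html.parser")

exact = [str(s) for s in soup.find_all(string="foo")]
prefixed = [str(s) for s in soup.find_all(string=re.compile(r"^foo"))]
# With a tag name, find_all returns the tags whose string matches.
tags = [li.name for li in soup.find_all("li", string=re.compile(r"^foo"))]

print(exact, prefixed, tags)
```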


### find

find returns only the first matching element (or None if nothing matches); find_all returns a list of all matches.

In [24]:
print(soup.find('ul'))

<ul class="list" id="list-1">
<li class="elements">foo</li><li>
</li><li class="elements">foo1</li><li>
</li><li class="elements">foo2</li><li>
</li></ul>
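
Since find can return None, code that uses its result usually needs a guard. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>foo</li></ul>', "html.parser")

first_ul = soup.find("ul")
same = soup.find_all("ul", limit=1)[0]  # find behaves like find_all with limit=1
missing = soup.find("table")            # nothing matches, so this is None

# Guard before using the result of find().
label = missing.get_text() if missing is not None else "(no table)"

print(first_ul["id"], first_ul is same, label)
```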


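## CSS selectors

Pass a CSS selector string to select() to get a list of matching elements.

Selector syntax:

1. To select by class, prefix the class name with a dot (.)
2. To select by tag, use the tag name as is
3. To select by id, prefix the id with a hash (#)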

In [26]:
print(soup.select('.panel .panel-heading')) # select by class

print(soup.select('ul li')) # select by tag

print(soup.select('#list-2 .elements')) # select by id

[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="elements">foo</li>, <li>
</li>, <li class="elements">foo1</li>, <li>
</li>, <li class="elements">foo2</li>, <li>
</li>, <li class="elements">foo3</li>, <li>
</li>, <li class="elements">foo4</li>, <li>
</li>, <li class="elements">foo5</li>, <li>
</li>]
[<li class="elements">foo3</li>, <li class="elements">foo4</li>, <li class="elements">foo5</li>]
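
The three selector forms can be combined. The sketch below uses a shortened, well-formed version of the sample markup (an assumption, with closing tags corrected) and shows select_one(), which returns the first match or None, plus the child combinator and a tag-plus-class selector.

```python
from bs4 import BeautifulSoup

markup = """
<div class="panel">
  <ul class="list" id="list-1"><li class="elements">foo</li><li class="elements">foo1</li></ul>
  <ul class="list list-small" id="list-2"><li class="elements">foo3</li></ul>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

first = soup.select_one("li.elements").get_text()                # first match only
children = [li.get_text() for li in soup.select("#list-2 > li")]  # direct children of an id
small = [ul["id"] for ul in soup.select("ul.list-small")]         # tag plus class
missing = soup.select_one("table")                                # None when nothing matches

print(first, children, small, missing)
```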


### Getting attributes

In [27]:
for ul in soup.select('ul'):
    print(ul['id']) # equivalent to ul.attrs['id']

list-1
list-2


### Getting the text content

In [28]:
for li in soup.select('li'):
    print(li.get_text())

foo


foo1


foo2


foo3


foo4


foo5


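
Items that contain only whitespace, like the empty li elements in the earlier outputs, print as blank lines. A minimal sketch, on made-up markup, of stripping whitespace and dropping the empty results:

```python
from bs4 import BeautifulSoup

markup = "<ul><li> foo </li><li>\n</li><li> foo1 </li></ul>"
soup = BeautifulSoup(markup, "html.parser")

texts = [li.get_text(strip=True) for li in soup.select("li")]  # whitespace-only items become ''
non_empty = [t for t in texts if t]                            # keep only items with text

print(texts)
print(non_empty)
```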

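## Summary

The lxml parser is recommended; fall back to html.parser when necessary.

Tag selectors (soup.tag) are fast but offer little filtering.

Use find() to get a single match and find_all() to get all matches.

If you are comfortable with CSS selectors, use select().

Remember the common ways of getting attributes and text values.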