## Parser libraries

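Parser | Usage | Advantages | Disadvantages
:-: | :-: | :-: | :-:
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate speed, lenient with malformed documents | Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, lenient with malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Very fast, the only XML parser supported | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most lenient, parses documents the way a browser does, produces valid HTML5 | Very slow, requires an external Python package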
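
Because the parsers in the table differ in what has to be installed, one option is to try them in order of preference and fall back to the built-in one. The sketch below assumes a hypothetical helper named `make_soup` (not part of bs4); it relies only on `FeatureNotFound`, the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser in the table repairs them.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)
```

`html.parser` ships with Python, so the final fallback should always succeed.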

## Basic usage

In [8]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; 
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


### Tag selectors

If exactly one tag matches, it is returned; if several match, only the first one is returned.
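
A minimal sketch of the "first match only" rule, using a small made-up document and the built-in `html.parser` (chosen here only so that nothing extra needs installing). It also shows that a tag name with no match yields `None` rather than an error.

```python
from bs4 import BeautifulSoup

doc = '<div><p id="first">one</p><p id="second">two</p></div>'
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p                   # attribute access returns the first <p> only
print(first_p["id"])               # first
print(soup.find("p") is first_p)   # the same node that find() returns
print(len(soup.find_all("p")))     # find_all() is needed to reach both <p> tags
print(soup.table)                  # None: no <table> in the document
```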

#### Selecting elements

In [4]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.title))
print(soup.title)
print(soup.head)
print(soup.p)

<class 'bs4.element.Tag'>
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


#### Getting the tag name

In [5]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

title


#### Getting attributes

In [6]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

dromouse
dromouse
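
Two details are easy to miss when reading attributes: `class` is multi-valued, so it comes back as a list, and indexing a missing attribute raises `KeyError`. The sketch below uses a made-up one-line document and `html.parser`; `get()` is the dictionary-style lookup that returns a default instead of raising.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title main" name="dromouse">text</p>', "html.parser")
p = soup.p

print(p.attrs)               # all attributes as a dict
print(p["class"])            # ['title', 'main']: a list, not a string
print(p.get("id"))           # None, because <p> has no id attribute
print(p.get("id", "no-id"))  # explicit default instead of None
```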


#### Getting content

In [9]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

The Dormouse's story
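
`.string` works in the cell above because the `<p>` holds a single child. As a hedged illustration with a made-up snippet (parsed with `html.parser`), a tag with mixed children has no single string, and `get_text()` is the usual alternative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.p.string)      # None: <p> contains both text and a <b> tag
print(soup.p.b.string)    # world
print(soup.p.get_text())  # Hello world
```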


#### Nested selection

In [11]:
html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.html.head.title.string)

The Dormouse's story


#### Children and descendants

In [17]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

['\n        Once upon a time there were three little sisters;and their names were\n        ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n        and \n        ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; \n        and they lived at the bottom of a well.\n    ']


In [20]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x0000021D39E11BC8>
0 
        Once upon a time there were three little sisters;and their names were
        
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
        and 
        
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ; 
        and they lived at the bottom of a well.
    


In [22]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i, child)

<generator object Tag.descendants at 0x0000021D3AA4D248>
0 
        Once upon a time there were three little sisters;and their names were
        
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
        and 
        
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 ; 
        and they lived at the bottom of a well.
    
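
The difference between `children` (direct children only) and `descendants` (every nested node, text included) is easier to see on a tiny tree. This sketch uses a made-up list without whitespace and `html.parser`; the `isinstance` filter is one way to skip text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>a</li><li><b>b</b></li></ul>", "html.parser")
ul = soup.ul

children = list(ul.children)        # the two <li> tags
descendants = list(ul.descendants)  # li, 'a', li, b, 'b'
print(len(children), len(descendants))

# Keep only tags, dropping the text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)
```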


#### Parent and ancestors

In [24]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>


In [27]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parents)
for i,parent in enumerate(soup.a.parents):
    print(i, parent)

<generator object PageElement.parents at 0x0000021D3AAD2D48>
0 <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
1 <body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
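
Printing every ancestor in full quickly becomes hard to read, since each one contains the previous. A compact alternative, sketched here on a made-up document with `html.parser`, is to print only the ancestor names; the last entry is the `BeautifulSoup` object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

doc = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")

ancestor_names = [node.name for node in soup.a.parents]
print(soup.a.parent.name)  # p
print(ancestor_names)      # ['p', 'body', 'html', '[document]']
```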


#### Siblings

In [29]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n        and \n        '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '; \n        and they lived at the bottom of a well.\n    ')]
[(0, '\n        Once upon a time there were three little sisters;and their names were\n        ')]
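
As the output shows, whitespace and plain text between tags count as siblings too. If only tag siblings are wanted, one approach (sketched on a made-up snippet with `html.parser`) is to filter by type or to use `find_next_sibling()` with a tag name:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # '\n': a text node, not a tag
print(first.find_next_sibling("a")["id"])  # link2

# Keep only Tag siblings, skipping the text nodes in between.
tag_sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(tag_sibling_ids)
```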


### Standard selectors

#### find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

##### name

In [36]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('-'*10)
print(type(soup.find_all('ul')[0]))
print('-'*10)
for i,ul in enumerate(soup.find_all('ul')):
    print(i, ul)

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
----------
<class 'bs4.element.Tag'>
----------
0 <ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
1 <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>


In [38]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


##### attrs

In [41]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]


In [45]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class_ with an underscore, because class is a Python keyword

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
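
A few attribute-matching details, sketched on a made-up document with `html.parser`: `class_` matches any one of a tag's classes, attribute names that are not valid Python identifiers (such as the illustrative `data-kind` here) have to go through `attrs`, and `limit` caps the number of results.

```python
from bs4 import BeautifulSoup

doc = ('<ul class="list" id="list-1" data-kind="main"></ul>'
       '<ul class="list list-small" id="list-2"></ul>')
soup = BeautifulSoup(doc, "html.parser")

# class_ matches a single class even when the tag carries several.
ids_list = [ul["id"] for ul in soup.find_all(class_="list")]
ids_small = [ul["id"] for ul in soup.find_all(class_="list-small")]

# data-kind cannot be written as a keyword argument, so use attrs.
ids_main = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]

# limit stops the search after the given number of matches.
only_one = soup.find_all("ul", limit=1)

print(ids_list, ids_small, ids_main, len(only_one))
```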


##### text

In [46]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']
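
In Beautiful Soup 4.4.0 and later the same argument is also available under the name `string`, and it accepts a compiled regular expression as well as an exact string. A small sketch on a made-up list, parsed with `html.parser`:

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>", "html.parser")

# Without a tag name, the matching text nodes themselves are returned.
ba_strings = [str(s) for s in soup.find_all(string=re.compile(r"^Ba"))]
print(ba_strings)

# With a tag name, the tags whose .string matches are returned instead.
foo_tags = soup.find_all("li", string="Foo")
print(foo_tags)
```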


#### find(name, attrs, recursive, text, **kwargs)

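Searches the document by tag name, attributes, or text content.
find returns a single element (or None when nothing matches); find_all returns all matching elements.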

In [49]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.find('ul')))
print(soup.find('ul'))
print(soup.find('page'))

<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
None
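
Since `find()` returns `None` when nothing matches, chaining a method call onto the result raises `AttributeError`. One way to guard against that is a small wrapper; `text_or_default` below is a hypothetical helper written for this sketch, not a bs4 function.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default=""):
    """Hypothetical helper: text of the first <name> tag, or default if absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup("<div><h4> Hello </h4></div>", "html.parser")
print(text_or_default(soup, "h4"))            # Hello
print(text_or_default(soup, "page", "n/a"))   # n/a
```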


#### find_parents() find_parent()
find_parents(name, attrs, recursive, text, \**kwargs) returns all ancestor nodes; find_parent() returns the direct parent node.

#### find_next_siblings(name, attrs, recursive, text, \**kwargs) find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.

#### find_previous_siblings(name, attrs, recursive, text, \**kwargs) find_previous_sibling()
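find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.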

#### find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching one.

#### find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching one.
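
The directional search methods in this group can be compared side by side on one small tree. The document and ids below are made up for illustration, and `html.parser` is used so that no extra wrapper tags are added.

```python
from bs4 import BeautifulSoup

doc = ('<div id="outer"><ul id="menu">'
       '<li id="a">A</li><li id="b">B</li><li id="c">C</li>'
       '</ul></div>')
soup = BeautifulSoup(doc, "html.parser")
a, b, c = (soup.find(id=i) for i in "abc")

# Upwards: nearest matching ancestor, then all matching ancestors (nearest first).
parent_div_id = b.find_parent("div")["id"]
ancestor_ids = [t["id"] for t in b.find_parents(["ul", "div"])]

# Sideways: siblings after and before the current node.
next_ids = [t["id"] for t in a.find_next_siblings("li")]
prev_id = b.find_previous_sibling("li")["id"]

# Document order: everything parsed after / before the current node.
after_ids = [t["id"] for t in a.find_all_next("li")]
before_id = c.find_previous("li")["id"]

print(parent_div_id, ancestor_ids, next_ids, prev_id, after_ids, before_id)
```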

### CSS selectors
Pass a CSS selector string directly to select() to make a selection:
- . selects by class
- \# selects by id
- a bare tag name selects by tag
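
The three selector forms in the list can be tried on a small made-up document. The sketch uses `html.parser` and also shows `select_one()`, which returns the first match (or `None`) instead of a list.

```python
from bs4 import BeautifulSoup

doc = ('<div class="panel"><ul class="list" id="list-1">'
       '<li class="element">Foo</li><li class="element">Bar</li>'
       '</ul></div>')
soup = BeautifulSoup(doc, "html.parser")

by_class = soup.select(".element")   # class selector
by_id = soup.select("#list-1")       # id selector
by_tag = soup.select("li")           # tag selector
first_li = soup.select_one("ul > li")  # first direct <li> child of a <ul>

print(len(by_class), len(by_id), len(by_tag))
print(first_li.get_text())
print(soup.select_one("table"))      # None when nothing matches
```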

In [53]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print('-'*10)
print(soup.select('#list-1 .element'))
print('-'*10)
print(soup.select('.panel-body #list-2 li'))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
----------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
----------
[<li class="element">Foo</li>, <li class="element">Bar</li>]


In [54]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


#### Getting attributes

In [57]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2


#### Getting content

In [59]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Foo
Bar
Jay
Foo
Bar
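
When the markup contains indentation or padding, `get_text()` keeps that whitespace. A hedged sketch of two common ways to clean it up, on a made-up snippet parsed with `html.parser`: pass a separator with `strip=True`, or iterate over `stripped_strings`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> Foo </p>\n<p> Bar <b> Baz </b></p></div>", "html.parser")

raw = soup.get_text()                    # whitespace preserved as-is
clean = soup.get_text(" ", strip=True)   # each piece stripped, joined by a space
pieces = list(soup.stripped_strings)     # the same pieces as a list

print(repr(raw))
print(clean)
print(pieces)
```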
