In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [2]:
from bs4 import BeautifulSoup

In [3]:
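# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")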

In [9]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [44]:
soup.title

<title>The Dormouse's story</title>

In [25]:
soup.title.name

'title'

In [23]:
soup.title.string

"The Dormouse's story"

In [35]:
soup.p.get('class')

['title']

In [32]:
for item in soup.find_all("p"):
    print(item)

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>


In [38]:
for item in soup.find_all("a"):
    print(item.get("href"))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [42]:
soup.find(id="link2")

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

#### Beautiful Soup transforms a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
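
As a quick check of the four kinds listed above, here is a minimal sketch that parses a small made-up snippet (not the `html_doc` used in this notebook) and looks at one object of each kind. The snippet and the variable names are assumptions for illustration, and the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, chosen so that all four object kinds appear.
sample = "<p>plain text<!-- a comment --></p>"
doc = BeautifulSoup(sample, "html.parser")

tag = doc.p                 # Tag
text = doc.p.contents[0]    # NavigableString
note = doc.p.contents[1]    # Comment (a subclass of NavigableString)

kinds = [type(obj).__name__ for obj in (doc, tag, text, note)]
print(kinds)
```

Because Comment is a subclass of NavigableString, an `isinstance(node, NavigableString)` check matches comments as well.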

###### Tag:

In [45]:
a = soup.title

In [46]:
type(a)

bs4.element.Tag

###### The .name attribute of a tag returns the tag's name

In [48]:
a.name

'title'

###### Changing a tag's name affects all HTML generated from the current Beautiful Soup object:

In [49]:
a.name = "interesting"

In [51]:
print(soup.prettify())

<html>
 <head>
  <interesting>
   The Dormouse's story
  </interesting>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


##### A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are accessed the same way as dictionary entries:

In [58]:
b = soup.a

In [60]:
b

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [63]:
print(b.get("class"),b.get("id"),b.get("href"))

['sister'] link1 http://example.com/elsie


###### Alternatively, .attrs returns all of the tag's attributes at once

In [66]:
b.attrs

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
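
The cells above only read attributes with `.get()`. Since attributes behave like dictionary entries, they can also be indexed, assigned, and deleted. The following is a minimal sketch on a made-up one-tag document (kept separate from the notebook's `soup`); the "html.parser" backend and all names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, independent of the notebook's `soup`.
doc = BeautifulSoup('<b class="boldest" id="old">text</b>', "html.parser")
tag = doc.b

first = tag["class"]          # class is multi-valued, so the value is a list
tag["id"] = "new"             # assign like a dict entry
tag["lang"] = "en"            # add a new attribute
del tag["class"]              # delete like a dict entry
missing = tag.get("class")    # .get() returns None for a missing attribute

print(first, missing, tag.attrs)
```

Indexing a missing attribute with `tag["class"]` raises `KeyError`, just as a dictionary would, which is why `.get()` is the safer choice when an attribute may be absent.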

#### Navigable strings

###### Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag:

In [77]:
string = b.string

In [78]:
string

'Elsie'

In [79]:
type(string)

bs4.element.NavigableString

In [80]:
# The string inside a tag can be replaced with replace_with(),
# which returns the string that was replaced
string.replace_with("aaaa")

'Elsie'
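
To make the effect of the replacement easier to see in isolation, here is a small sketch on a made-up one-link document (an assumption, not the notebook's `soup`), again assuming the "html.parser" backend. It also shows that a NavigableString is a `str` subclass that keeps a reference to its parent tag.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document, independent of the notebook's `soup`.
doc = BeautifulSoup("<a>Elsie</a>", "html.parser")
s = doc.a.string

is_str = isinstance(s, str)    # NavigableString subclasses str
parent_name = s.parent.name    # the string still knows which tag holds it
plain = str(s)                 # plain copy with no link back to the tree

s.replace_with("Lacie")        # swap the string inside the tag
after = str(doc)

print(is_str, parent_name, plain, after)
```

Converting to `str` is useful when the text needs to outlive the parsed tree, since the plain copy does not hold a reference to it.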

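#### The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it supports most of the methods described in "Navigating the tree" and "Searching the tree".
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is still sometimes convenient to look at its .name, so the BeautifulSoup object carries a special .name with the value "[document]".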

In [82]:
soup.name

'[document]'

#### The Comment object represents special types of content in a document, such as comments

In [85]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
comment

'Hey, buddy. Want to buy a used parser?'

In [86]:
type(comment)

bs4.element.Comment
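
Because a Comment prints like ordinary text, it is easy to mix comments into extracted text by accident. One possible way to keep them apart is an `isinstance` check while walking the tree. This is only a sketch: the snippet and variable names are made up for illustration, and the "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet mixing visible text with a comment.
sample = "<p>visible<!-- hidden note --></p>"
doc = BeautifulSoup(sample, "html.parser")

# Comment subclasses NavigableString, so test for Comment explicitly.
comments = [str(node) for node in doc.descendants
            if isinstance(node, Comment)]
visible = [str(node) for node in doc.descendants
           if isinstance(node, NavigableString) and not isinstance(node, Comment)]

print(comments, visible)
```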

#### Navigating the tree

###### .contents and .children