# The four object types

BeautifulSoup4 converts an HTML document into a navigable tree in which every node is a Python object. There are four kinds of objects:
+ Tag (an HTML tag)
+ NavigableString (the text inside a tag)
+ BeautifulSoup
+ Comment

In [26]:
from bs4 import BeautifulSoup

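# Read the file in binary mode; the context manager closes it afterwards
with open(r'./data/baidu_urllib.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, 'html.parser')  # name the parser explicitly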

## Tag

In [19]:
print(type(bs.title))
print(bs.title)
print(bs.title.name)
print(bs.title.attrs)

<class 'bs4.element.Tag'>
<title>百度一下，你就知道</title>
title
{}
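
To make the role of `name` and `attrs` more concrete, here is a minimal sketch. The one-line snippet is invented for this example (it is not taken from the Baidu page) and is parsed with the standard-library `html.parser`. It shows attribute access by key, and `get()` for an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
doc = BeautifulSoup('<a class="toindex big" href="/">home</a>', "html.parser")
link = doc.a

tag_name = link.name       # the tag's name
href = link["href"]        # attributes behave like dictionary keys
classes = link["class"]    # class is multi-valued, so it comes back as a list
missing = link.get("id")   # get() returns None instead of raising KeyError

print(tag_name, href, classes, missing)
```

Indexing with `link["id"]` would raise `KeyError` here, which is why `get()` is the safer choice for optional attributes.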


## NavigableString

In [18]:
print(type(bs.title.string))
print(bs.title.string)

<class 'bs4.element.NavigableString'>
百度一下，你就知道


## BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.

In [23]:
print(type(bs))
print(bs.name)
print(bs.attrs)

<class 'bs4.BeautifulSoup'>
[document]
{}
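
The statement that the BeautifulSoup object is a special Tag can be checked directly. The following is a small sketch on a made-up snippet, assuming the standard-library `html.parser`. It verifies the class relationship and shows that Tag methods such as `find()` work on the document object.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, used only for illustration
doc = BeautifulSoup("<p>hello</p>", "html.parser")

is_tag = isinstance(doc, Tag)   # BeautifulSoup is a subclass of Tag
doc_name = doc.name             # the special name given to the document object
paragraph = doc.find("p")       # Tag methods are available on the document

print(is_tag, doc_name, paragraph)
```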


## Comment

A Comment object is a special type of NavigableString; when printed, its content does not include the comment delimiters.

In [37]:
print(type(bs.a))
print(bs.a.prettify())

print(type(bs.a.string))
print(bs.a.string,end='\n\n')

print(bs.a.attrs)

<class 'bs4.element.Tag'>
<a class="toindex" href="/">
 <!--我是注释，包含在第一个a标签中-->
</a>

<class 'bs4.element.Comment'>
我是注释，包含在第一个a标签中

{'class': ['toindex'], 'href': '/'}
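
Because a Comment prints like ordinary text, it is easy to mistake it for visible page content. One possible way to tell the two apart is an `isinstance` check, sketched below on a made-up snippet (the markup and variable names are assumptions for this example).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: one tag holds a comment, the other holds plain text
snippet = '<a class="toindex" href="/"><!--a comment--></a><b>plain text</b>'
doc = BeautifulSoup(snippet, "html.parser")

comment_node = doc.a.string
text_node = doc.b.string

comment_is_comment = isinstance(comment_node, Comment)
comment_is_string = isinstance(comment_node, NavigableString)  # Comment subclasses NavigableString
text_is_comment = isinstance(text_node, Comment)

# Keep only the strings that are not comments
visible = [s for s in doc.find_all(string=True) if not isinstance(s, Comment)]

print(comment_is_comment, comment_is_string, text_is_comment, visible)
```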


# Using the test document from the official documentation

In [56]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class='my' id='zrd'>668</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly

# Pretty-printing

In [57]:
# Pretty-print the HTML with standard indentation
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="my" id="zrd">
   668
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


# Traversal

In [None]:
# .contents and .children

# .descendants

# .parent and .parents

# .next_sibling and .previous_sibling

# .next_siblings and .previous_siblings
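
The traversal attributes listed above can be sketched as follows. The snippet is made up for this example and deliberately has no whitespace between tags. In real pages the sibling of a tag is often a whitespace-only string, so the results there may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with no whitespace between tags
snippet = '<div id="box"><p id="first">one <b>bold</b></p><p id="second">two</p></div>'
doc = BeautifulSoup(snippet, "html.parser")
first = doc.find(id="first")
second = doc.find(id="second")

# .contents is a list of direct children; .children iterates over the same nodes
contents = [str(node) for node in first.contents]
children = [str(node) for node in first.children]

# .descendants walks the whole subtree, depth first
descendants = [str(node) for node in first.descendants]

# .parent is the enclosing tag; .parents walks up to the document object
parent_name = first.parent.name
ancestor_names = [node.name for node in first.parents]

# .next_sibling / .previous_sibling give the adjacent node on the same level
next_is_second = first.next_sibling is second
previous_of_first = first.previous_sibling

# .next_siblings / .previous_siblings iterate over all of them
following_ids = [node["id"] for node in first.next_siblings]
preceding_ids = [node["id"] for node in second.previous_siblings]

print(contents, descendants, ancestor_names, following_ids, preceding_ids)
```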

# Searching

## Searching with find_all()

In [63]:
import re

# Search by string (exact tag name)
print(r"find_all('a')")
tag_list = soup.find_all('a')
for tag in tag_list: print(tag)
print()

# Search by list (any of the listed tag names)
print(r"find_all(['a','b'])")
tag_list = soup.find_all(['a','b'])
for tag in tag_list: print(tag)
print()

# Search by regular expression (matched with search(), so any tag name containing 'a')
print(r"find_all(re.compile('a'))")
tag_list = soup.find_all(re.compile('a'))
for tag in tag_list: print(tag)
print()

# Search by function (filter)
print(r"find_all(lambda tag: ...)")
tag_list = soup.find_all(lambda tag: tag.has_attr('class') and not tag.has_attr('href'))
for tag in tag_list: print(tag)
print()

# Keyword arguments (filter on attribute values)
print(r"find_all(id='link1')")
tag_list = soup.find_all(id='link1')
for tag in tag_list: print(tag)
print()
    
print(r"find_all(href=re.compile('lacie'))")
tag_list = soup.find_all(href=re.compile('lacie'))
for tag in tag_list: print(tag)
print()

# CSS class (class_, because class is a reserved word in Python)
print(r"find_all('p', class_='story')")
tag_list = soup.find_all('p', class_='story')
for tag in tag_list: print(tag)
print()

# string (match text nodes instead of tags)
print(r"find_all(string=re.compile('\d'))")
tag_list = soup.find_all(string=re.compile(r'\d'))
for tag in tag_list: print(tag)
print()

# text (older name for the string argument; prefer string)
print(r"find_all(text=re.compile('\d'))")
tag_list = soup.find_all(text=re.compile(r'\d'))
for tag in tag_list: print(tag)
print()

# limit parameter
print(r"find_all('a', limit=2)")
tag_list = soup.find_all('a', limit=2)
for tag in tag_list: print(tag)
print()

find_all('a')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

find_all(['a','b'])
<b>The Dormouse's story</b>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

find_all(re.compile('a'))
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

find_all(lambda tag: ...)
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id

## find()

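Roughly equivalent to find_all() with limit=1, except that find() returns the matching element itself rather than a list. When nothing matches, find() returns None and find_all() returns [].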
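
The difference in return values can be seen in a short sketch. The snippet and variable names are made up for this example, and the standard-library `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
doc = BeautifulSoup('<p><a id="x">first</a><a id="y">second</a></p>', "html.parser")

first_link = doc.find("a")                # a single Tag
limited = doc.find_all("a", limit=1)      # a list holding that same Tag
missing_one = doc.find("table")           # None when nothing matches
missing_all = doc.find_all("table")       # empty list when nothing matches

print(first_link, limited, missing_one, missing_all)
```

Since find() can return None, chaining a call such as `doc.find("table").find("tr")` raises `AttributeError` when the first lookup fails, so it is worth checking the result before using it.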

## select() with CSS selectors (syntax similar to jQuery)

In [65]:
# Tag name
print(r"select('a')")
tag_list = soup.select('a')
for tag in tag_list: print(tag)
print()

# id: #
print(r"select('#zrd')")
tag_list = soup.select('#zrd')
for tag in tag_list: print(tag)
print()

# Class name: .
print(r"select('.story')")
tag_list = soup.select('.story')
for tag in tag_list: print(tag)
print()

# Direct child: >
print(r"select('head > title')")
tag_list = soup.select('head > title')
for tag in tag_list: print(tag)
print()

# Following siblings: ~
print(r"select('.my ~ .story')")
tag_list = soup.select('.my ~ .story')
for tag in tag_list: print(tag)
print()

# Attribute: []
print("select(\"p[class='my']\")")
tag_list = soup.select("p[class='my']")
for tag in tag_list: print(tag)
print()

select('a')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

select('#zrd')
<p class="my" id="zrd">668</p>

select('.story')
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

select('head > title')
<title>The Dormouse's story</title>

select('.my ~ .story')
<p class="story">...</p>

select("p[class='my']")
<p class="my" id="zrd">668</p>
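
The selectors above can also be combined, and the resulting tags can be processed like any other Tag. The sketch below uses a shortened copy of the test document (trimmed for this example) and pulls out link targets and link texts. It assumes a bs4 release in which `select()` is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Shortened version of the test document, trimmed for this example
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
<p class="my" id="zrd">668</p>
"""
doc = BeautifulSoup(snippet, "html.parser")

# Tag name + class + direct child combined in one selector
hrefs = [a["href"] for a in doc.select("p.story > a.sister")]

# Attribute selector with a prefix match on the id
names = [a.get_text() for a in doc.select("a[id^='link']")]

print(hrefs)
print(names)
```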

