# Beautiful Soup 4 Tutorial
- Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- This tutorial illustrates the major features of Beautiful Soup 4, based on the Beautiful Soup 4.4.0 documentation along with my own practice.
- The environment is Python 3. The Beautiful Soup calls also work in Python 2, but the `urllib.request` module used below is Python 3 only.


## 1. Request and Read HTML

In [1]:
import urllib.request  # a bare `import urllib` does not reliably load the request submodule
from bs4 import BeautifulSoup

content = urllib.request.urlopen('https://google.com').read()
soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify()[0:10000])  # prettify() returns an indented view of the parsed document


<!DOCTYPE doctype html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
 <head>
  <meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description">
   <meta content="noodp" name="robots">
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
     <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image">
      <title>
       Google
      </title>
      <script>
       (function(){window.google={kEI:'kIYTWf2yJJXQjwPy8LeACg',kEXPI:'201761,3700209,3700269,3700347,3700410,4017607,4028875,4029815,4031109,4032677,4036527,4039268,4043492,4045841,4048347,4065787,4071842,4072364,4072774,4075963,4076095,4076999,4078430,4078763,4079444,4081039,4081164,4083044,4083496,4090550,4090553,4091420,4092934,4093313,4093813,4093951,4094251,4094544,4095909,4096324,4097153,4097469,4097922,4097929,4097951,4098

In [2]:
soup.title

<title>Google</title>

In [3]:
soup.title.name

'title'

In [4]:
soup.title.string

'Google'

In [5]:
soup.title.parent.name

'meta'

In [6]:
soup.p

<p style="color:#767676;font-size:8pt">© 2017 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p>

In [7]:
soup.p['style']

'color:#767676;font-size:8pt'

In [8]:
soup.a                        

<a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>

In [9]:
soup.find_all('a')

[<a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>,
 <a class="gb1" href="https://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>,
 <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a>,
 <a class="gb1" href="https://www.youtube.com/?tab=w1">YouTube</a>,
 <a class="gb1" href="https://news.google.com/nwshp?hl=en&amp;tab=wn">News</a>,
 <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>,
 <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>,
 <a class="gb1" href="https://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>,
 <a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>,
 <a class="gb4" href="/preferences?hl=en">Settings</a>,
 <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=https://www.google.com/" id="gb_70" target="_top">Sign in</a>,
 <a href="/advanced_search?hl=en&amp;authuser=0">Advanced search</a

In [10]:
soup.find_all(class_='gb4')

[<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>,
 <a class="gb4" href="/preferences?hl=en">Settings</a>,
 <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=https://www.google.com/" id="gb_70" target="_top">Sign in</a>]

### Common task 1: extract all the URLs found within a page's `<a>` tags

In [55]:
for link in soup.find_all('a'):
    print(link.get('href'))

https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/options/
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
/advanced_search?hl=en&authuser=0
/language_tools?hl=en&authuser=0
/intl/en/ads/
/services/
https://plus.google.com/116899029375914044550
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
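
Several of the extracted links are relative (for example `/preferences?hl=en`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin`, assuming a hypothetical page fragment and base URL that are not part of the original notebook:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical fragment and base URL, used only for illustration.
html = ('<a href="/preferences?hl=en">Settings</a>'
        '<a href="https://maps.google.com/maps">Maps</a>'
        '<a name="top">No href here</a>')
base_url = 'https://www.google.com/'

links_doc = BeautifulSoup(html, 'html.parser')
absolute_urls = []
for link in links_doc.find_all('a'):
    href = link.get('href')          # .get() returns None when href is missing
    if href is not None:
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute_urls.append(urljoin(base_url, href))

print(absolute_urls)
```

Using `link.get('href')` rather than `link['href']` avoids a `KeyError` on anchors that have no `href` attribute.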


### Common task 2: extract all the text from a page:

In [56]:
print(soup.get_text())

Google(function(){window.google={kEI:'paQKWaibFKeF0wLc0rCIDQ',kEXPI:'201761,1352552,1353201,1353476,3700278,3700347,4029765,4031109,4032678,4036527,4039268,4041899,4043492,4045841,4048347,4065786,4072364,4072774,4075963,4076095,4076999,4078430,4081039,4081165,4082441,4083046,4085335,4089939,4090550,4090553,4090806,4091353,4092182,4092935,4093134,4093313,4093499,4093550,4093813,4093951,4094251,4094544,4094837,4095910,4095999,4096323,4096464,4097153,4097922,4097929,4097951,4098096,4098721,4098728,4098752,4100169,4100174,4100380,4100459,4100689,4100828,4101376,4101429,4101750,4102020,4102032,4102099,4102238,4102827,4103215,4103236,4103470,4103475,4103845,4103849,4103999,4104204,4104527,4104620,4105085,4105100,4105178,4105317,4105321,4105469,4105562,4105786,4106949,4107221,4107395,4107628,4107895,4107900,4107956,4107965,4107968,4107989,4108479,4108537,4108539,4108553,4108885,4108932,4109236,4109316,4109490,4109498,4109528,4110510,8503585,8508112,8508229,8508931,8509037,8509090,8509373,1020
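
The output above is dominated by JavaScript, because this run of `get_text()` included the contents of `<script>` tags (newer Beautiful Soup releases may behave differently). One way to get only the readable text is to remove those tags first. This is a sketch on a hypothetical inline document, not the Google page itself:

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = ('<html><head><title>Demo</title><script>var x = 1;</script></head>'
        '<body><p>Hello <b>world</b></p><p>Bye</p></body></html>')
text_doc = BeautifulSoup(html, 'html.parser')

# Drop <script> and <style> elements from the tree before extracting text.
for tag in text_doc.find_all(['script', 'style']):
    tag.decompose()

# separator joins the text pieces; strip=True trims each piece and skips empty ones.
text = text_doc.get_text(separator=' ', strip=True)
print(text)
```

Note that `decompose()` modifies the tree in place, so parse a fresh copy if the scripts are needed later.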

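### The four kinds of objects
- Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.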
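
As a quick illustration of the four kinds, the sketch below parses a small hypothetical fragment (not taken from the pages used in this tutorial) and prints the type of each node:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment: one tag holding a text node and a comment.
html = '<p class="note">Hello<!-- a hidden remark --></p>'
kinds_doc = BeautifulSoup(html, 'html.parser')

p_tag = kinds_doc.p
text_node, comment_node = p_tag.contents   # the two children of <p>

print(type(kinds_doc).__name__)     # the whole document
print(type(p_tag).__name__)         # an HTML tag
print(type(text_node).__name__)     # text inside a tag
print(type(comment_node).__name__)  # an HTML comment
```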


### 1. Tag
- A Tag corresponds to a tag in the HTML, e.g. 'title'/'a'/'p'
- A Tag has two important attributes: name and attrs

In [4]:
soup.title

<title>Google</title>

In [5]:
soup.a

<a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>

In [6]:
soup.p

<p style="color:#767676;font-size:8pt">© 2017 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p>

In [7]:
# soup.<tagname> is an easy way to get a tag, but it only returns the first matching tag in the document.
# To get all matching tags, use find_all()

In [8]:
print(soup.name)      # the soup object itself is special: its name is [document]
print(soup.head.name) # for any other tag, the value is the tag's own name

[document]
head


In [9]:
print(soup.p.attrs)   # prints all attributes of the <p> tag; the result is a dict

{'style': 'color:#767676;font-size:8pt'}


In [10]:
# get one specific attribute
print(soup.a)
print(soup.a['class'])
print(soup.a['href'])

<a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>
['gb1']
https://www.google.com/imghp?hl=en&tab=wi


In [27]:
soup.a['class'] = 'newclass'  # assign a new value to the class attribute
print(soup.a)

<a class="newclass" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>
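
A tag's attributes behave like a dictionary, so they can also be read safely, added, and deleted. A minimal sketch on a hypothetical `<a>` tag (the attribute values are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical tag, used only for illustration.
attr_doc = BeautifulSoup('<a class="gb1 bold" href="/imghp">Images</a>', 'html.parser')
a_tag = attr_doc.a

classes = a_tag['class']       # class is multi-valued, so it comes back as a list
missing = a_tag.get('id')      # .get() returns None instead of raising KeyError

a_tag['id'] = 'first-link'     # add a new attribute
del a_tag['href']              # remove an existing attribute

print(classes, missing)
print(a_tag.attrs)
```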


### 2. NavigableString
- Use .string to get the text inside a tag

In [11]:
print(soup.a.string)

Images


In [29]:
print(type(soup.a.string))

<class 'bs4.element.NavigableString'>
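
`.string` only works when a tag has a single child; for a tag with mixed content it returns `None`. The sketch below uses a simplified, hypothetical version of the footer `<p>` tag shown earlier to contrast `.string`, `.strings`, and `get_text()`:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical footer paragraph with text and two links.
footer_doc = BeautifulSoup(
    '<p>© 2017 - <a href="/privacy">Privacy</a> - <a href="/terms">Terms</a></p>',
    'html.parser')
p_tag = footer_doc.p

whole = p_tag.string            # None, because <p> has several children
first_link = p_tag.a.string     # the single string inside the first <a>
pieces = list(p_tag.strings)    # every text node under <p>, in document order
joined = p_tag.get_text()       # all the pieces concatenated

print(whole, first_link)
print(pieces)
print(joined)
```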


### 3. BeautifulSoup
- A BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object.

In [12]:
print(type(soup.name))
print(soup.name)
print(soup.attrs)  # empty dict

<class 'str'>
[document]
{}


### 4. Comment
- A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters.

In [14]:
print(soup.comment)  # soup.comment looks up a <comment> tag, not HTML comments, so it is None here

None
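
Note that `soup.comment` searches for a tag literally named `<comment>`, so it returns `None` even on a page that does contain HTML comments. A minimal sketch of one way to collect Comment objects, using a hypothetical fragment rather than the pages fetched in this tutorial:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment containing two HTML comments.
html = '<div><!-- tracking disabled --><p>Visible text</p><!-- end --></div>'
comment_doc = BeautifulSoup(html, 'html.parser')

# Pass a function as the string filter and keep only Comment nodes.
comments = comment_doc.find_all(string=lambda s: isinstance(s, Comment))

for c in comments:
    print(repr(c))              # the delimiters <!-- and --> are not included

print(comment_doc.comment)      # there is no <comment> tag in the fragment
```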


In [23]:
import requests

response = requests.get('https://github.com/freena22')
content = response.content

soup = BeautifulSoup(content,'html.parser')
soup.title.text

'freena22 · GitHub'

In [16]:
print(soup.prettify()[0:2000])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8">
   <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-81a59bf26d881d29286674f6deefe779c444382fff322085b50ba455460ccae5.css" media="all" rel="stylesheet"/>
   <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-64951a579f72746470cd6d8d29a3170eb697f3b1e3a7472c5787af321ad3cfc9.css" media="all" rel="stylesheet"/>
   <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/site-7d9c6bd23286465361abfa183deb8a05eabb77438ab42033e8230c0c0768d539.css" media="all" rel="stylesheet"/>
   <meta content="width=device-width" name="viewport">
    <title>
     freena22 · GitHub
    </title>
    <link href="/opensearch.xml" rel="search" title="GitHub" type="application/opensearchdescription+xml">
     <link href="https://github.com/fluidicon.png" rel="fluid-icon" title="GitHub">
      <meta content="1401488693436528" property="fb:app_id">
       <meta content="htt

In [17]:
print(soup.comment)  # still None: the GitHub page has no <comment> tag either

None


In [18]:
soup.find_all(content='freena22')

[<meta content="freena22" property="og:title"/>,
 <meta content="freena22" property="profile:username"/>,
 <meta content="freena22" name="octolytics-dimension-user_login"/>]

### Traversing the document tree

### 1. Direct children
- .contents
- .children

In [31]:
# .contents -- a tag's .contents returns its direct children as a list
import requests

response = requests.get('https://www.reddit.com')
content = response.content

soup = BeautifulSoup(content,'html.parser')
print(soup.head.contents)  # outputs a list

[<title>reddit: the front page of the internet</title>, <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>, <meta content="reddit: the front page of the internet" name="description"/>, <meta content="always" name="referrer"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/><link href="https://www.reddit.com/" rel="canonical"/><meta content="width=1024" name="viewport"><link href="//out.reddit.com" rel="dns-prefetch"><link href="//out.reddit.com" rel="preconnect"><link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/><link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/><link href="https://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/atom+xml"/><link href="//www.redditstati

In [35]:
print(soup.head.contents[0])
print(soup.head.contents[1])  # use a list index to get a single element


<title>reddit: the front page of the internet</title>
<meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>


In [36]:
# .children -- does not return a list; iterate over it to get every direct child
print(soup.head.children)  # output: a list iterator object

<list_iterator object at 0x10e31d128>


In [37]:
for child in soup.head.children:
    print(child)

<title>reddit: the front page of the internet</title>
<meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
<meta content="reddit: the front page of the internet" name="description"/>
<meta content="always" name="referrer"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/><link href="https://www.reddit.com/" rel="canonical"/><meta content="width=1024" name="viewport"><link href="//out.reddit.com" rel="dns-prefetch"><link href="//out.reddit.com" rel="preconnect"><link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/><link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/><link href="https://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/atom+xml"/><link href="//www.redditstatic.co
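
Because the real `<head>` is long, the difference between `.contents` and `.children` is easier to see on a small document. A sketch with a hypothetical three-element `<head>`:

```python
from bs4 import BeautifulSoup

# Hypothetical <head> with three direct children.
head_doc = BeautifulSoup(
    '<head><title>Demo</title><meta charset="utf-8"/>'
    '<link href="/a.css" rel="stylesheet"/></head>',
    'html.parser')
head = head_doc.head

contents = head.contents                               # a real list: supports len() and indexing
child_names = [child.name for child in head.children]  # an iterator: consume it with a loop

print(len(contents), contents[0])
print(child_names)
```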

### 2. All descendants
- .contents and .children only include a tag's direct children. .descendants walks recursively over all of a tag's descendants and has to be iterated.

In [40]:
for child in soup.descendants:
    print(child)

doctype html
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>reddit: the front page of the internet</title><meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/><meta content="reddit: the front page of the internet" name="description"/><meta content="always" name="referrer"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/><link href="https://www.reddit.com/" rel="canonical"/><meta content="width=1024" name="viewport"><link href="//out.reddit.com" rel="dns-prefetch"><link href="//out.reddit.com" rel="preconnect"><link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/><link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/><link href="https://www.reddit.com/.rss" rel="al

<div class="subscribe-thanks"><img alt="_('thanks for subscribing')" src="//www.redditstatic.com/subscribe-header-thanks.svg"/></div>
<img alt="_('thanks for subscribing')" src="//www.redditstatic.com/subscribe-header-thanks.svg"/>
<h2 class="result-message">get the best of reddit, delivered once a week</h2>
get the best of reddit, delivered once a week
<form action="https://www.reddit.com/api/newsletter.json" class="newsletter-signup form-v2 c-form-inline" method="post"><input name="uh" type="hidden" value=""/><input name="source" type="hidden" value="newsletterbar"><div class="c-form-group "><label class="screenreader-only" for="email">email:</label><input class="c-form-control" data-validate-on="change blur" data-validate-url="/api/check_email.json" name="email" placeholder="enter your email" type="email" value=""><div class="c-form-control-feedback-wrapper inside-input"><span class="c-form-control-feedback c-form-control-feedback-throbber"></span><span class="c-form-control-feedbac

<a class="bylink comments may-blank" data-event-action="comments" data-href-url="/r/whitepeoplegifs/comments/6acm7x/when_edibles_kick_in/" data-inbound-url="/r/whitepeoplegifs/comments/6acm7x/when_edibles_kick_in/?utm_content=comments&amp;utm_medium=hot&amp;utm_source=reddit&amp;utm_name=frontpage" href="https://www.reddit.com/r/whitepeoplegifs/comments/6acm7x/when_edibles_kick_in/" rel="nofollow">400 comments</a>
400 comments
<li class="share"><a class="post-sharing-button" href="javascript: void 0;">share</a></li>
<a class="post-sharing-button" href="javascript: void 0;">share</a>
share
<div class="reportform report-t3_6acm7x"></div>
<div class="expando expando-uninitialized" data-cachedhtml=' &lt;iframe src="//www.redditmedia.com/mediaembed/6acm7x" id="media-embed-6acm7x-s1d" class="media-embed " width="368" height="650" border="0" frameBorder="0" scrolling="no" allowfullscreen&gt;&lt;/iframe&gt; ' style="display: none"><span class="error">loading...</span></div>
<span class="error"
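
The output above shows each tag followed by its own children and text nodes. A small sketch on a hypothetical list makes the difference between `.children` and `.descendants` countable:

```python
from bs4 import BeautifulSoup

# Hypothetical nested list, used only for illustration.
tree_doc = BeautifulSoup(
    '<ul><li><a href="/a">share</a></li><li>plain</li></ul>',
    'html.parser')
ul = tree_doc.ul

children = list(ul.children)        # only the two <li> tags
descendants = list(ul.descendants)  # <li>, <a>, 'share', <li>, 'plain'

print(len(children), len(descendants))
for node in descendants:
    # Text nodes have no tag name, so node.name is None for them.
    print(node.name, repr(str(node)))
```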

### Searching the document tree with find_all()
- find_all() looks through all descendants of the current tag and returns those that match the filter

In [42]:
# pass a string: find all <b> tags in the document
soup.find_all('b')

[<b>231.26 minutes</b>]

In [45]:
# pass a regular expression: match tags whose names start with 'b'
import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

body
button
b
button
br


In [51]:
# pass a list: match any of the listed tag names
soup.find_all(['b','style'])


[<style type="text/css">/* Custom css: use this block to insert special translation-dependent css in the page header */</style>,
 <b>231.26 minutes</b>,
 <style>body >.content .link .rank, .rank-spacer { width: 2.2ex } body >.content .link .midcol, .midcol-spacer { width: 6.1ex } .adsense-wrap { background-color: #eff7ff; font-size: 18px; padding-left: 8.3ex; padding-right: 5px; }</style>]

In [58]:
# pass True -- matches every tag
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
meta
meta
meta
meta
link
link
meta
link
link
link
link
link
link
link
link
link
link
script
script
script
style
script
script
script
body
div
script
script
div
a
div
div
div
span
div
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
div
ul
li
a
li
span
a
li
span
a
span
ul
li
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
li
span
a
a
div
a
ul
li
a
li
a
li
a
li
a
li
a
li
a
li
a
li
a
div
span
a
span
ul
li
a
div
div
form
input
input
div
div
p
dl
dt
i
dd
dt
i
dd
dt
i
dd
dt
i
dd
dt
i
dd
dt
dd
dt
dd
p
code
p
a


In [59]:
# pass a function: it takes a tag and returns True for a match
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

In [60]:
soup.find_all(has_class_but_no_id)

[<body class="listing-page hot-page front-page"><div class="GoogleAd HomeAds InArticleAd LeftAd SidebarAd ad-300-250 ad-banner adbar adbox1 ads-area adsense-ad box_ad googad" id="adblock-test"></div><script>if (!window.DO_NOT_TRACK) { var frame = document.createElement('iframe'); frame.style.display = 'none'; frame.referrer = 'no-referrer'; frame.id = 'gtm-jail'; frame.name = JSON.stringify({ subreddit: r.config.post_site, origin: location.origin, url: location.href, userMatching: r.config.feature_ads_user_matching, userId: r.config.user_id, advertiserCategory: r.config.advertiser_category, }); frame.src = '//' + "www.redditmedia.com" + '/gtm/jail?cb=' + "8CqR7FcToPI"; document.body.appendChild(frame); }</script><script>if (!window.DO_NOT_TRACK) { var mf = document.createElement('script'); mf.type = 'text/javascript'; mf.async = true; mf.src = "//www.redditstatic.com/moat/moatframe.js"; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(mf, s); }</script><div
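
On the full page the result is hard to read, so here is the same function filter applied to a small hypothetical fragment, where it is easy to check which tags pass:

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    # True only for tags that define class and do not define id
    return tag.has_attr('class') and not tag.has_attr('id')

# Hypothetical fragment: the <div> has both attributes, the second <p> has neither.
html = ('<div class="a" id="x"><p class="b">one</p><p>two</p>'
        '<span class="c">three</span></div>')
filter_doc = BeautifulSoup(html, 'html.parser')

matches = filter_doc.find_all(has_class_but_no_id)
print([tag.name for tag in matches])
```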

In [68]:
# search by keyword arguments
soup.find_all('a', class_='choice')  # class_ (class is a reserved word in Python, so the argument is named class_)

[<a class="choice" href="https://www.reddit.com/r/announcements/">announcements</a>,
 <a class="choice" href="https://www.reddit.com/r/Art/">Art</a>,
 <a class="choice" href="https://www.reddit.com/r/AskReddit/">AskReddit</a>,
 <a class="choice" href="https://www.reddit.com/r/askscience/">askscience</a>,
 <a class="choice" href="https://www.reddit.com/r/aww/">aww</a>,
 <a class="choice" href="https://www.reddit.com/r/blog/">blog</a>,
 <a class="choice" href="https://www.reddit.com/r/books/">books</a>,
 <a class="choice" href="https://www.reddit.com/r/creepy/">creepy</a>,
 <a class="choice" href="https://www.reddit.com/r/dataisbeautiful/">dataisbeautiful</a>,
 <a class="choice" href="https://www.reddit.com/r/DIY/">DIY</a>,
 <a class="choice" href="https://www.reddit.com/r/Documentaries/">Documentaries</a>,
 <a class="choice" href="https://www.reddit.com/r/EarthPorn/">EarthPorn</a>,
 <a class="choice" href="https://www.reddit.com/r/explainlikeimfive/">explainlikeimfive</a>,
 <a class="ch
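
Other attributes can be filtered the same way as `class_`. The sketch below, on hypothetical links loosely modelled on the output above, shows filtering by `id`, by a regular expression on `href`, and by an `attrs` dictionary for attribute names such as `data-rank` that are not valid Python keyword names:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; the id and data-rank attributes are made up for illustration.
html = ('<a class="choice" href="https://www.reddit.com/r/Art/" id="art">Art</a>'
        '<a class="choice hot" href="/r/aww/" data-rank="2">aww</a>'
        '<a href="#content" id="jumpToContent">jump to content</a>')
kw_doc = BeautifulSoup(html, 'html.parser')

by_id = kw_doc.find_all(id='art')
by_href = kw_doc.find_all(href=re.compile(r'^https://'))
by_class = kw_doc.find_all('a', class_='choice')       # matches any tag whose classes include "choice"
by_data = kw_doc.find_all(attrs={'data-rank': '2'})    # data-rank cannot be written as a keyword argument

for name, result in [('id', by_id), ('href', by_href), ('class', by_class), ('data', by_data)]:
    print(name, [tag.get_text() for tag in result])
```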

In [75]:
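soup.find_all(text=re.compile('^w'))  # match strings that start with 'w'; since Beautiful Soup 4.4.0 this argument is also available as string=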

['worldnews', 'worldnews', 'wiki', 'wiki']
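
A string search returns the matching text nodes, not the tags around them. A minimal sketch, on a hypothetical list of links, of getting back to the enclosing tag through `.parent`, or of combining a tag name with the `string` argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical navigation list, used only for illustration.
html = ('<ul><li><a href="/r/worldnews/">worldnews</a></li>'
        '<li><a href="/wiki/">wiki</a></li>'
        '<li><a href="/r/aww/">aww</a></li></ul>')
string_doc = BeautifulSoup(html, 'html.parser')

# Text nodes whose content starts with "w".
hits = string_doc.find_all(string=re.compile('^w'))
parent_hrefs = [hit.parent['href'] for hit in hits]

# Tags named <a> whose .string starts with "w".
tags = string_doc.find_all('a', string=re.compile('^w'))

print(hits)
print(parent_hrefs)
print(tags)
```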

In [76]:
soup.find_all('a', limit=2)  # limit stops the search after the given number of matches

[<a href="#content" id="jumpToContent" tabindex="1">jump to content</a>,
 <a class="choice" href="https://www.reddit.com/r/announcements/">announcements</a>]
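
When only the first match is needed, `find()` is a convenient alternative to `find_all(..., limit=1)`. The sketch below, on a hypothetical two-link fragment, shows the difference in return types, including what each returns when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links and no table.
html = ('<a href="#content" id="jumpToContent">jump to content</a>'
        '<a class="choice" href="/r/Art/">Art</a>')
limit_doc = BeautifulSoup(html, 'html.parser')

first_list = limit_doc.find_all('a', limit=1)   # a list holding one tag
first_tag = limit_doc.find('a')                 # the tag itself
no_tag = limit_doc.find('table')                # None when nothing matches
no_list = limit_doc.find_all('table')           # an empty list when nothing matches

print(first_list, first_tag)
print(no_tag, no_list)
```

Because `find()` can return `None`, check the result before accessing attributes on it.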