### Installing Beautiful Soup and a parser

In [30]:
# Run these once in a terminal (or prefix each with ! to run them from a notebook cell):
# pip install beautifulsoup4
# pip install lxml

### Import the libraries

In [7]:
from bs4 import BeautifulSoup
import requests

In [5]:
# Download the page; r.ok is True for any status code below 400
r = requests.get('https://beautiful-soup-4.readthedocs.io/en/latest/')
r.ok

True

In [31]:
# Parse the HTML with the lxml parser, then print an indented view of the tree
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
  </title>
  <script src="_static/js/modernizr.min.js" type="text/javascript">
  </script>
  <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js" type="text/javascript">
  </script>
  <script src="_static/jquery.js" type="text/javascript">
  </script>
  <script src="_static/underscore.js" type="text/javascript">
  </script>
  <script src="_static/doctools.js" type="text/javascript">
  </script>
  <script src="_static/language_data.js" type="text/javascript">
  </script>
  <script src="https://assets.readthedocs.org/static/javascript/readthedocs-doc-embed.js" type="text/javascript">
  </script>
  <script src="_st

In [12]:
soup.title

<title>Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation</title>

In [13]:
soup.title.name

'title'

In [14]:
soup.title.string

'Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation'
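`.string` works for the title because the tag holds a single string. As a minimal sketch, assuming a small hypothetical HTML snippet and the built-in `html.parser` rather than the downloaded page, the following shows what happens when a tag has more than one child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = "<title>Plain title</title><p>Text with <b>markup</b> inside</p>"
demo = BeautifulSoup(html, "html.parser")

title_name = demo.title.name      # 'title'
title_text = demo.title.string    # 'Plain title'
p_string = demo.p.string          # None: <p> has more than one child
p_text = demo.p.get_text()        # 'Text with markup inside'

print(title_name, title_text, p_string, p_text, sep="\n")
```

When a tag may contain nested markup, `get_text()` is usually the safer choice, because `.string` silently returns `None`.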

In [15]:
soup.p

<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.</p>

In [16]:
soup.p.a

<a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>

In [17]:
soup.p.a['href']

'http://www.crummy.com/software/BeautifulSoup/'
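Attribute-style access such as `soup.p.a` returns only the first matching tag, and `tag['href']` raises a `KeyError` when the attribute is missing. The sketch below, which assumes a hypothetical snippet and the built-in `html.parser`, shows the more forgiving alternatives:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '<p><a href="https://example.com/">Example</a> and <a>no link</a></p>'
demo = BeautifulSoup(html, "html.parser")

first_link = demo.p.a                  # first <a> inside the first <p>
href_by_key = first_link["href"]       # dictionary-style access
href_by_get = first_link.get("href")   # same value via .get()

bare_link = demo.find_all("a")[1]      # the anchor without an href
missing_href = bare_link.get("href")   # None instead of a KeyError
missing_tag = demo.table               # None: the snippet has no <table>

print(href_by_key, href_by_get, missing_href, missing_tag)
```

Checking for `None` before chaining further (for example `demo.table.tr`) avoids an `AttributeError` on pages that lack the expected tag.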

In [20]:
# find_all('a') returns a list of every anchor tag on the page
soup.find_all('a')

[<a class="icon icon-home" href="#"> Beautiful Soup
           
 
           
           </a>,
 <a class="reference internal" href="#">Beautiful Soup Documentation</a>,
 <a class="reference internal" href="#getting-help">Getting help</a>,
 <a class="reference internal" href="#quick-start">Quick Start</a>,
 <a class="reference internal" href="#installing-beautiful-soup">Installing Beautiful Soup</a>,
 <a class="reference internal" href="#problems-after-installation">Problems after installation</a>,
 <a class="reference internal" href="#installing-a-parser">Installing a parser</a>,
 <a class="reference internal" href="#making-the-soup">Making the soup</a>,
 <a class="reference internal" href="#kinds-of-objects">Kinds of objects</a>,
 <a class="reference internal" href="#tag"><code class="docutils literal notranslate"><span class="pre">Tag</span></code></a>,
 <a class="reference internal" href="#name">Name</a>,
 <a class="reference internal" href="#attributes">Attributes</a>,
 <a class="r
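The list returned by `find_all('a')` can be narrowed with filters. The following is a rough sketch on a hypothetical three-link snippet (parsed with the built-in `html.parser`), showing `find()`, the `limit` argument, a CSS class filter, and a regular expression on an attribute:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the table of contents above.
html = """
<ul>
  <li><a class="reference internal" href="#quick-start">Quick Start</a></li>
  <li><a class="reference external" href="http://lxml.de/">lxml</a></li>
  <li><a class="reference external" href="https://facelessuser.github.io/soupsieve/">Soup Sieve</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

first = demo.find("a")                             # first match only, or None
first_two = demo.find_all("a", limit=2)            # stop after two matches
external = demo.find_all("a", class_="external")   # matches any one of a tag's classes
secure = demo.find_all("a", href=re.compile(r"^https://"))  # href starts with https://

print(first["href"], len(first_two), len(external), len(secure))
```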

In [25]:
# Print the href of every anchor tag whose class is "reference external"
for tag in soup.find_all('a', class_="reference external"):
    print(tag['href'])

http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
http://kondou.com/BS4/
https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/
https://www.crummy.com/software/BeautifulSoup/bs4/doc.ptbr/
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.crummy.com/software/BeautifulSoup/download/4.x/
http://lxml.de/
http://code.google.com/p/html5lib/
https://facelessuser.github.io/soupsieve/
https://facelessuser.github.io/soupsieve/
http://www.w3.org/TR/html5/syntax.html#syntax
http://wiki.python.org/moin/PrintFails
http://lxml.de/
http://pypi.python.org/pypi/cchardet/
http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.python.org/dev/peps/pep-0008/
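The output above contains repeated URLs, and the internal links on the page (such as `#quick-start`) are relative. One possible way to handle both, sketched here on a hypothetical snippet with `base_url` assumed to be the address the page was fetched from, is to resolve each `href` with `urllib.parse.urljoin` and keep only the first occurrence:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base address; in the notebook this would be the URL passed to requests.get().
base_url = "https://beautiful-soup-4.readthedocs.io/en/latest/"

# Hypothetical markup with a fragment link, a duplicate, and a relative path.
html = """
<a class="reference internal" href="#quick-start">Quick Start</a>
<a class="reference external" href="http://lxml.de/">lxml</a>
<a class="reference external" href="http://lxml.de/">lxml again</a>
<a href="_sources/index.rst.txt">View page source</a>
"""
demo = BeautifulSoup(html, "html.parser")

absolute_links = []
for tag in demo.find_all("a", href=True):    # href=True skips anchors without an href
    url = urljoin(base_url, tag["href"])     # resolve relative and fragment links
    if url not in absolute_links:            # keep the first occurrence, preserve order
        absolute_links.append(url)

for url in absolute_links:
    print(url)
```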


In [27]:
# Another common task is extracting all the text from a page
print(soup.text)  # equivalent to print(soup.get_text())




  



Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation



























 Beautiful Soup
          

          
          

                latest
              











Beautiful Soup Documentation
Getting help


Quick Start
Installing Beautiful Soup
Problems after installation
Installing a parser


Making the soup
Kinds of objects
Tag
Name
Attributes
Multi-valued attributes




NavigableString
BeautifulSoup
Comments and other special strings


Navigating the tree
Going down
Navigating using tag names
.contents and .children
.descendants
.string
.strings and stripped_strings


Going up
.parent
.parents


Going sideways
.next_sibling and .previous_sibling
.next_siblings and .previous_siblings


Going back and forth
.next_element and .previous_element
.next_elements and .previous_elements




Searching the tree
Kinds of filters
A string
A regular expression
A list
True
A function


find_all()
The name argument
The keyword arguments
Searching by CSS class
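The long runs of blank lines in the output above come from whitespace between tags. As a minimal sketch, assuming a small hypothetical snippet and the built-in `html.parser`, `get_text()` with a separator and `strip=True`, or the `.stripped_strings` generator, gives more compact text:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the kind of inter-tag whitespace found on real pages.
html = "<div>\n  <h1> Title </h1>\n\n  <p>First <b>bold</b> paragraph.</p>\n</div>"
demo = BeautifulSoup(html, "html.parser")

raw = demo.get_text()                     # keeps every newline and space
compact = demo.get_text(" ", strip=True)  # strip each string, join with one space
pieces = list(demo.stripped_strings)      # the same strings as a list

print(repr(compact))
print(pieces)
```

The list form is handy when each piece of text needs further processing before being joined.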


| Parser | Typical usage | Advantages | Disadvantages |
| :-- | :-- | :-- | :-- |
| Python's html.parser | `BeautifulSoup(markup, "html.parser")` | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2) | Not as fast as lxml; less lenient than html5lib |
| lxml's HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient | External C dependency |
| lxml's XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
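Because lxml and html5lib are external dependencies, asking for one that is not installed raises `bs4.FeatureNotFound`. The helper below is a hypothetical convenience function (`make_soup` is not part of Beautiful Soup) that sketches one way to fall back to the built-in parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. This helper is illustrative only.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # parser not installed, try the next one
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately sloppy markup: neither <p> tag is closed.
demo, used = make_soup("<p>Unclosed paragraph<p>Second")
print(used)
print(demo)
```

The printed tree can differ from parser to parser for invalid markup like this, so naming the parser explicitly keeps results reproducible across machines.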


In [34]:
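# Walk every node in the tree (tags and strings), depth-first.
# Each tag is printed with its full subtree, so nested content repeats.
for child in soup.descendants:
    print(child)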

html
[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]
[if gt IE 8]><!
<html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation</title>
<script src="_static/js/modernizr.min.js" type="text/javascript"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js" type="text/javascript"></script>
<script src="_static/jquery.js" type="text/javascript"></script>
<script src="_static/underscore.js" type="text/javascript"></script>
<script src="_static/doctools.js" type="text/javascript"></script>
<script src="_static/language_data.js" type="text/javascript"></script>
<script src="https://assets.readthedocs.org/static/javascript/readthedocs-doc-embed.js" type="text/javascript"></script>
<script src="_static/js/theme.js" type="text/javascript"></script>
<link href="_static/css/th

<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav class="wy-nav-side" data-toggle="wy-nav-shift">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a class="icon icon-home" href="#"> Beautiful Soup
          

          
          </a>
<div class="version">
                latest
              </div>
<div role="search">
<form action="search.html" class="wy-form" id="rtd-search-form" method="get">
<input name="q" placeholder="Search docs" type="text"/>
<input name="check_keywords" type="hidden" value="yes"/>
<input name="area" type="hidden" value="default"/>
</form>
</div>
</div>
<div aria-label="main navigation" class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation">
<!-- Local TOC -->
<div class="local-toc"><ul>
<li><a class="reference internal" href="#">Beautiful Soup Documentation</a><ul>
<li><a class="reference internal" href="#getting-help">Getting help</a></li>
</ul>
</li>
<li><a class="reference internal" href="#quick-start">Quick Start<

<div class="wy-grid-for-nav">
<nav class="wy-nav-side" data-toggle="wy-nav-shift">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a class="icon icon-home" href="#"> Beautiful Soup
          

          
          </a>
<div class="version">
                latest
              </div>
<div role="search">
<form action="search.html" class="wy-form" id="rtd-search-form" method="get">
<input name="q" placeholder="Search docs" type="text"/>
<input name="check_keywords" type="hidden" value="yes"/>
<input name="area" type="hidden" value="default"/>
</form>
</div>
</div>
<div aria-label="main navigation" class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation">
<!-- Local TOC -->
<div class="local-toc"><ul>
<li><a class="reference internal" href="#">Beautiful Soup Documentation</a><ul>
<li><a class="reference internal" href="#getting-help">Getting help</a></li>
</ul>
</li>
<li><a class="reference internal" href="#quick-start">Quick Start</a></li>
<li><a class="referenc

<li><a class="reference internal" href="#diagnose"><code class="docutils literal notranslate"><span class="pre">diagnose()</span></code></a></li>
<a class="reference internal" href="#diagnose"><code class="docutils literal notranslate"><span class="pre">diagnose()</span></code></a>
<code class="docutils literal notranslate"><span class="pre">diagnose()</span></code>
<span class="pre">diagnose()</span>
diagnose()


<li><a class="reference internal" href="#errors-when-parsing-a-document">Errors when parsing a document</a></li>
<a class="reference internal" href="#errors-when-parsing-a-document">Errors when parsing a document</a>
Errors when parsing a document


<li><a class="reference internal" href="#version-mismatch-problems">Version mismatch problems</a></li>
<a class="reference internal" href="#version-mismatch-problems">Version mismatch problems</a>
Version mismatch problems


<li><a class="reference internal" href="#parsing-xml">Parsing XML</a></li>
<a class="reference internal" hr

<div class="wy-nav-content">
<div class="rst-content">
<div aria-label="breadcrumbs navigation" role="navigation">
<ul class="wy-breadcrumbs">
<li><a href="#">Docs</a> »</li>
<li>Beautiful Soup Documentation</li>
<li class="wy-breadcrumbs-aside">
<a href="_sources/index.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div class="document" itemscope="itemscope" itemtype="http://schema.org/Article" role="main">
<div itemprop="articleBody">
<div class="section" id="beautiful-soup-documentation">
<span id="documentation"></span><h1>Beautiful Soup Documentation<a class="headerlink" href="#beautiful-soup-documentation" title="Permalink to this headline">¶</a></h1>
<img alt='"The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."' class="align-right" src="_images/6.1.jpg"/>
<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML

<div class="rst-content">
<div aria-label="breadcrumbs navigation" role="navigation">
<ul class="wy-breadcrumbs">
<li><a href="#">Docs</a> »</li>
<li>Beautiful Soup Documentation</li>
<li class="wy-breadcrumbs-aside">
<a href="_sources/index.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div class="document" itemscope="itemscope" itemtype="http://schema.org/Article" role="main">
<div itemprop="articleBody">
<div class="section" id="beautiful-soup-documentation">
<span id="documentation"></span><h1>Beautiful Soup Documentation<a class="headerlink" href="#beautiful-soup-documentation" title="Permalink to this headline">¶</a></h1>
<img alt='"The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."' class="align-right" src="_images/6.1.jpg"/>
<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML and XML files. It works
with

<div class="document" itemscope="itemscope" itemtype="http://schema.org/Article" role="main">
<div itemprop="articleBody">
<div class="section" id="beautiful-soup-documentation">
<span id="documentation"></span><h1>Beautiful Soup Documentation<a class="headerlink" href="#beautiful-soup-documentation" title="Permalink to this headline">¶</a></h1>
<img alt='"The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."' class="align-right" src="_images/6.1.jpg"/>
<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.</p>
<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to us

<div itemprop="articleBody">
<div class="section" id="beautiful-soup-documentation">
<span id="documentation"></span><h1>Beautiful Soup Documentation<a class="headerlink" href="#beautiful-soup-documentation" title="Permalink to this headline">¶</a></h1>
<img alt='"The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."' class="align-right" src="_images/6.1.jpg"/>
<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.</p>
<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.</p>


<div class="section" id="kinds-of-objects">
<h1>Kinds of objects<a class="headerlink" href="#kinds-of-objects" title="Permalink to this headline">¶</a></h1>
<p>Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you’ll only ever have to deal with about four
<cite>kinds</cite> of objects: <code class="docutils literal notranslate"><span class="pre">Tag</span></code>, <code class="docutils literal notranslate"><span class="pre">NavigableString</span></code>, <code class="docutils literal notranslate"><span class="pre">BeautifulSoup</span></code>,
and <code class="docutils literal notranslate"><span class="pre">Comment</span></code>.</p>
<div class="section" id="tag">
<span id="id4"></span><h2><code class="docutils literal notranslate"><span class="pre">Tag</span></code><a class="headerlink" href="#tag" title="Permalink to this headline">¶</a></h2>
<p>A <code class="docutils literal notranslate"><span class="pre">Tag</span></code> object corres

<span class="o">.</span>
.
<span class="n">a</span>
a
<span class="p">[</span>
[
<span class="s1">'rel'</span>
'rel'
<span class="p">]</span>
]
 
<span class="o">=</span>
=
 
<span class="p">[</span>
[
<span class="s1">'index'</span>
'index'
<span class="p">,</span>
,
 
<span class="s1">'contents'</span>
'contents'
<span class="p">]</span>
]


<span class="nb">print</span>
print
<span class="p">(</span>
(
<span class="n">rel_soup</span>
rel_soup
<span class="o">.</span>
.
<span class="n">p</span>
p
<span class="p">)</span>
)


<span class="c1"># &lt;p&gt;Back to the &lt;a rel="index contents"&gt;homepage&lt;/a&gt;&lt;/p&gt;</span>
# <p>Back to the <a rel="index contents">homepage</a></p>






<p>You can disable this by passing <code class="docutils literal notranslate"><span class="pre">multi_valued_attributes=None</span></code> as a
keyword argument into the <code class="docutils literal notranslate"><span class="pre">BeautifulSoup</span></code> constructor:</p>
You can disable this 

<span class="n">replace_with</span>
replace_with
<span class="p">(</span>
(
<span class="n">cdata</span>
cdata
<span class="p">)</span>
)



<span class="nb">print</span>
print
<span class="p">(</span>
(
<span class="n">soup</span>
soup
<span class="o">.</span>
.
<span class="n">b</span>
b
<span class="o">.</span>
.
<span class="n">prettify</span>
prettify
<span class="p">())</span>
())


<span class="c1"># &lt;b&gt;</span>
# <b>


<span class="c1">#  &lt;![CDATA[A CDATA block]]&gt;</span>
#  <![CDATA[A CDATA block]]>


<span class="c1"># &lt;/b&gt;</span>
# </b>










<div class="section" id="navigating-the-tree">
<h1>Navigating the tree<a class="headerlink" href="#navigating-the-tree" title="Permalink to this headline">¶</a></h1>
<p>Here’s the “Three sisters” HTML document again:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">html_doc</span> <span class="o">=</span> <span class="s2">"""</span>
<span class="s2">&lt;ht


<span class="c1"># The Dormouse's story</span>
# The Dormouse's story








<div class="section" id="descendants">
<h3><code class="docutils literal notranslate"><span class="pre">.descendants</span></code><a class="headerlink" href="#descendants" title="Permalink to this headline">¶</a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">.contents</span></code> and <code class="docutils literal notranslate"><span class="pre">.children</span></code> attributes only consider a tag’s
<cite>direct</cite> children. For instance, the &lt;head&gt; tag has a single direct
child, the &lt;title&gt; tag:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
<span class="c1"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
</pre></div>
</div>
<p>But the &lt;title&gt; tag itself has a child: the string “The Dormouse’s
story”. There’s

<span class="o">=</span>
=
 
<span class="n">soup</span>
soup
<span class="o">.</span>
.
<span class="n">a</span>
a


<span class="n">link</span>
link


<span class="c1"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


<span class="k">for</span>
for
 
<span class="n">parent</span>
parent
 
<span class="ow">in</span>
in
 
<span class="n">link</span>
link
<span class="o">.</span>
.
<span class="n">parents</span>
parents
<span class="p">:</span>
:

    
<span class="k">if</span>
if
 
<span class="n">parent</span>
parent
 
<span class="ow">is</span>
is
 
<span class="kc">None</span>
None
<span class="p">:</span>
:

        
<span class="nb">print</span>
print
<span class="p">(</span>
(
<span class="n">parent</span>
parent
<span class="p">)</span>
)

    
<span class="k">else</span>
else
<span class="p">:</span>
:

        
<span class="nb">print</span>
print
<span class="p"

</div>


<span id="sibling-generators"></span>
<h3><code class="docutils literal notranslate"><span class="pre">.next_siblings</span></code> and <code class="docutils literal notranslate"><span class="pre">.previous_siblings</span></code><a class="headerlink" href="#next-siblings-and-previous-siblings" title="Permalink to this headline">¶</a></h3>
<code class="docutils literal notranslate"><span class="pre">.next_siblings</span></code>
<span class="pre">.next_siblings</span>
.next_siblings
 and 
<code class="docutils literal notranslate"><span class="pre">.previous_siblings</span></code>
<span class="pre">.previous_siblings</span>
.previous_siblings
<a class="headerlink" href="#next-siblings-and-previous-siblings" title="Permalink to this headline">¶</a>
¶


<p>You can iterate over a tag’s siblings with <code class="docutils literal notranslate"><span class="pre">.next_siblings</span></code> or
<code class="docutils literal notranslate"><span class="pre">.previous_siblings</span><

Searching the tree
<a class="headerlink" href="#searching-the-tree" title="Permalink to this headline">¶</a>
¶


<p>Beautiful Soup defines a lot of methods for searching the parse tree,
but they’re all very similar. I’m going to spend a lot of time explaining
the two most popular methods: <code class="docutils literal notranslate"><span class="pre">find()</span></code> and <code class="docutils literal notranslate"><span class="pre">find_all()</span></code>. The other
methods take almost exactly the same arguments, so I’ll just cover
them briefly.</p>
Beautiful Soup defines a lot of methods for searching the parse tree,
but they’re all very similar. I’m going to spend a lot of time explaining
the two most popular methods: 
<code class="docutils literal notranslate"><span class="pre">find()</span></code>
<span class="pre">find()</span>
find()
 and 
<code class="docutils literal notranslate"><span class="pre">find_all()</span></code>
<span class="pre">find_all()</span>
find_a

</pre>
<span></span>
<span class="k">def</span>
def
 
<span class="nf">not_lacie</span>
not_lacie
<span class="p">(</span>
(
<span class="n">href</span>
href
<span class="p">):</span>
):

    
<span class="k">return</span>
return
 
<span class="n">href</span>
href
 
<span class="ow">and</span>
and
 
<span class="ow">not</span>
not
 
<span class="n">re</span>
re
<span class="o">.</span>
.
<span class="n">compile</span>
compile
<span class="p">(</span>
(
<span class="s2">"lacie"</span>
"lacie"
<span class="p">)</span>
)
<span class="o">.</span>
.
<span class="n">search</span>
search
<span class="p">(</span>
(
<span class="n">href</span>
href
<span class="p">)</span>
)


<span class="n">soup</span>
soup
<span class="o">.</span>
.
<span class="n">find_all</span>
find_all
<span class="p">(</span>
(
<span class="n">href</span>
href
<span class="o">=</span>
=
<span class="n">not_lacie</span>
not_lacie
<span class="p">)</span>
)


<span class="c1"># [&lt;a class="sister" href="http://example.c

</div>
<div class="highlight"><pre><span></span><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"elsie"</span><span class="p">))</span>
<span class="c1"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
</pre></div>
<pre><span></span><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"elsie"</span><span class="p">))</span>
<span class="c1"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
</pre>
<span></span>
<span class="n">soup</span>
soup
<span class

</pre></div>
<pre><span></span><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">string</span><span class="o">=</span><span class="s2">"Elsie"</span><span class="p">)</span>
<span class="c1"># [u'Elsie']</span>

<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">string</span><span class="o">=</span><span class="p">[</span><span class="s2">"Tillie"</span><span class="p">,</span> <span class="s2">"Elsie"</span><span class="p">,</span> <span class="s2">"Lacie"</span><span class="p">])</span>
<span class="c1"># [u'Elsie', u'Lacie', u'Tillie']</span>

<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">string</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"Dormouse"</span><span cl

</div>


<h2><code class="docutils literal notranslate"><span class="pre">find()</span></code><a class="headerlink" href="#find" title="Permalink to this headline">¶</a></h2>
<code class="docutils literal notranslate"><span class="pre">find()</span></code>
<span class="pre">find()</span>
find()
<a class="headerlink" href="#find" title="Permalink to this headline">¶</a>
¶


<p>Signature: find(<a class="reference internal" href="#id11"><span class="std std-ref">name</span></a>, <a class="reference internal" href="#attrs"><span class="std std-ref">attrs</span></a>, <a class="reference internal" href="#recursive"><span class="std std-ref">recursive</span></a>, <a class="reference internal" href="#id12"><span class="std std-ref">string</span></a>, <a class="reference internal" href="#kwargs"><span class="std std-ref">**kwargs</span></a>)</p>
Signature: find(
<a class="reference internal" href="#id11"><span class="std std-ref">name</span></a>
<span class="std std-ref">name</span>
name
, 


<a class="reference internal" href="#id12"><span class="std std-ref">string</span></a>
<span class="std std-ref">string</span>
string
, 
<a class="reference internal" href="#limit"><span class="std std-ref">limit</span></a>
<span class="std std-ref">limit</span>
limit
, 
<a class="reference internal" href="#kwargs"><span class="std std-ref">**kwargs</span></a>
<span class="std std-ref">**kwargs</span>
**kwargs
)


<p>Signature: find_previous_sibling(<a class="reference internal" href="#id11"><span class="std std-ref">name</span></a>, <a class="reference internal" href="#attrs"><span class="std std-ref">attrs</span></a>, <a class="reference internal" href="#id12"><span class="std std-ref">string</span></a>, <a class="reference internal" href="#kwargs"><span class="std std-ref">**kwargs</span></a>)</p>
Signature: find_previous_sibling(
<a class="reference internal" href="#id11"><span class="std std-ref">name</span></a>
<span class="std std-ref">name</span>
name
, 
<a class="reference int

.
<span class="n">select</span>
select
<span class="p">(</span>
(
<span class="s2">"#link1 + .sister"</span>
"#link1 + .sister"
<span class="p">)</span>
)


<span class="c1"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span>
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]






<p>Find tags by CSS class:</p>
Find tags by CSS class:


<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">".sister"</span><span class="p">)</span>
<span class="c1"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
<span class="c1">#  &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
<span class="c1">#  &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>

<span class=

</pre></div>
<pre><span></span><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s1">'&lt;b class="boldest"&gt;Extremely bold&lt;/b&gt;'</span><span class="p">)</span>
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>

<span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s2">"blockquote"</span>
<span class="n">tag</span><span class="p">[</span><span class="s1">'class'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'verybold'</span>
<span class="n">tag</span><span class="p">[</span><span class="s1">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">tag</span>
<span class="c1"># &lt;blockquote class="verybold" id="1"&gt;Extremely bold&lt;/blockquote&gt;</span>

<span class="k">del</span> <span clas

<span class="o">=</span>
=
 
<span class="n">Comment</span>
Comment
<span class="p">(</span>
(
<span class="s2">"Nice to see you."</span>
"Nice to see you."
<span class="p">)</span>
)


<span class="n">tag</span>
tag
<span class="o">.</span>
.
<span class="n">append</span>
append
<span class="p">(</span>
(
<span class="n">new_comment</span>
new_comment
<span class="p">)</span>
)


<span class="n">tag</span>
tag


<span class="c1"># &lt;b&gt;Hello there&lt;!--Nice to see you.--&gt;&lt;/b&gt;</span>
# <b>Hello there<!--Nice to see you.--></b>


<span class="n">tag</span>
tag
<span class="o">.</span>
.
<span class="n">contents</span>
contents


<span class="c1"># [u'Hello', u' there', u'Nice to see you.']</span>
# [u'Hello', u' there', u'Nice to see you.']






<p>(This is a new feature in Beautiful Soup 4.4.0.)</p>
(This is a new feature in Beautiful Soup 4.4.0.)


<p>What if you need to create a whole new tag?  The best solution is to
call the factory method <code class="docutils liter



<span class="n">soup</span>
soup
<span class="o">.</span>
.
<span class="n">b</span>
b


<span class="c1"># &lt;b&gt;&lt;i&gt;Don't&lt;/i&gt; you &lt;div&gt;ever&lt;/div&gt; stop&lt;/b&gt;</span>
# <b><i>Don't</i> you <div>ever</div> stop</b>


<span class="n">soup</span>
soup
<span class="o">.</span>
.
<span class="n">b</span>
b
<span class="o">.</span>
.
<span class="n">contents</span>
contents


<span class="c1"># [&lt;i&gt;Don't&lt;/i&gt;, u' you', &lt;div&gt;ever&lt;/div&gt;, u'stop']</span>
# [<i>Don't</i>, u' you', <div>ever</div>, u'stop']








<div class="section" id="clear">
<h2><code class="docutils literal notranslate"><span class="pre">clear()</span></code><a class="headerlink" href="#clear" title="Permalink to this headline">¶</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">Tag.clear()</span></code> removes the contents of a tag:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">markup</s







<p>The <code class="docutils literal notranslate"><span class="pre">smooth()</span></code> method is new in Beautiful Soup 4.8.0.</p>
The 
<code class="docutils literal notranslate"><span class="pre">smooth()</span></code>
<span class="pre">smooth()</span>
smooth()
 method is new in Beautiful Soup 4.8.0.






<div class="section" id="output">
<h1>Output<a class="headerlink" href="#output" title="Permalink to this headline">¶</a></h1>
<div class="section" id="pretty-printing">
<span id="prettyprinting"></span><h2>Pretty-printing<a class="headerlink" href="#pretty-printing" title="Permalink to this headline">¶</a></h2>
<p>The <code class="docutils literal notranslate"><span class="pre">prettify()</span></code> method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with a separate line for each
tag and each string:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">markup</span> <span class="o">=</

</pre></div>
<pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="kc">None</span><span class="p">))</span>
<span class="c1"># &lt;html&gt;</span>
<span class="c1">#  &lt;body&gt;</span>
<span class="c1">#   &lt;p&gt;</span>
<span class="c1">#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;</span>
<span class="c1">#   &lt;/p&gt;</span>
<span class="c1">#  &lt;/body&gt;</span>
<span class="c1"># &lt;/html&gt;</span>

<span class="n">link_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s1">'&lt;a href="http://example.com/?foo=val1&amp;bar=val2"&gt;A link&lt;/a&gt;'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">link_soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</

<span></span>
<span class="c1"># soup.get_text("|", strip=True)</span>
# soup.get_text("|", strip=True)


<span class="sa">u</span>
u
<span class="s1">'I linked to|example.com'</span>
'I linked to|example.com'






<p>But at that point you might want to use the <a class="reference internal" href="#string-generators"><span class="std std-ref">.stripped_strings</span></a>
generator instead, and process the text yourself:</p>
But at that point you might want to use the 
<a class="reference internal" href="#string-generators"><span class="std std-ref">.stripped_strings</span></a>
<span class="std std-ref">.stripped_strings</span>
.stripped_strings

generator instead, and process the text yourself:


<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_

<span class="pre">.contains_replacement_characters</span>
.contains_replacement_characters
 is 
<code class="docutils literal notranslate"><span class="pre">False</span></code>
<span class="pre">False</span>
False
,
youâll know that the ï¿½ was there originally (as it is in this
paragraph) and doesnât stand in for missing data.


<div class="section" id="output-encoding">
<h2>Output encoding<a class="headerlink" href="#output-encoding" title="Permalink to this headline">¶</a></h2>
<p>When you write out a document from Beautiful Soup, you get a UTF-8
document, even if the document wasn’t in UTF-8 to begin with. Here’s a
document written in the Latin-1 encoding:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">markup</span> <span class="o">=</span> <span class="sa">b</span><span class="s1">'''</span>
<span class="s1"> &lt;html&gt;</span>
<span class="s1">  &lt;head&gt;</span>
<span class="s1">   &lt;meta content="text/html; 

 
<span class="p">[</span>
[
<span class="s2">"windows-1252"</span>
"windows-1252"
<span class="p">],</span>
],
 
<span class="n">smart_quotes_to</span>
smart_quotes_to
<span class="o">=</span>
=
<span class="s2">"ascii"</span>
"ascii"
<span class="p">)</span>
)
<span class="o">.</span>
.
<span class="n">unicode_markup</span>
unicode_markup


<span class="c1"># u'&lt;p&gt;I just "love" Microsoft Word\'s smart quotes&lt;/p&gt;'</span>
# u'<p>I just "love" Microsoft Word\'s smart quotes</p>'






<p>Hopefully you’ll find this feature useful, but Beautiful Soup doesn’t
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else:</p>
Hopefully you’ll find this feature useful, but Beautiful Soup doesn’t
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else:


<div class="highlight-default notranslate"><div cl

<span class="o">.</span>
.
<span class="n">parent</span>
parent


<span class="c1"># None</span>
# None






<p>This is because two different <code class="docutils literal notranslate"><span class="pre">Tag</span></code> objects can’t occupy the same
space at the same time.</p>
This is because two different 
<code class="docutils literal notranslate"><span class="pre">Tag</span></code>
<span class="pre">Tag</span>
Tag
 objects can’t occupy the same
space at the same time.




<div class="section" id="parsing-only-part-of-a-document">
<h1>Parsing only part of a document<a class="headerlink" href="#parsing-only-part-of-a-document" title="Permalink to this headline">¶</a></h1>
<p>Let’s say you want to use Beautiful Soup to look at a document’s &lt;a&gt;
tags. It’s a waste of time and memory to parse the entire document and
then go over it again looking for &lt;a&gt; tags. It would be much faster to
ignore everything that wasn’t an &lt;a&gt; tag in the first place. The
<code cla

html5lib.</span></a>
<span class="std std-ref">install lxml or
html5lib.</span>
install lxml or
html5lib.


<p>The most common type of unexpected behavior is that you can’t find a
tag that you know is in the document. You saw it going in, but
<code class="docutils literal notranslate"><span class="pre">find_all()</span></code> returns <code class="docutils literal notranslate"><span class="pre">[]</span></code> or <code class="docutils literal notranslate"><span class="pre">find()</span></code> returns <code class="docutils literal notranslate"><span class="pre">None</span></code>. This is
another common problem with Python’s built-in HTML parser, which
sometimes skips tags it doesn’t understand.  Again, the solution is to
<a class="reference internal" href="#parser-installation"><span class="std std-ref">install lxml or html5lib.</span></a></p>
The most common type of unexpected behavior is that you can’t find a
tag that you know is in the document. You saw it going in, but

<

</ul>


<li><code class="docutils literal notranslate"><span class="pre">renderContents</span></code> -&gt; <code class="docutils literal notranslate"><span class="pre">encode_contents</span></code></li>
<code class="docutils literal notranslate"><span class="pre">renderContents</span></code>
<span class="pre">renderContents</span>
renderContents
 -> 
<code class="docutils literal notranslate"><span class="pre">encode_contents</span></code>
<span class="pre">encode_contents</span>
encode_contents


<li><code class="docutils literal notranslate"><span class="pre">replaceWith</span></code> -&gt; <code class="docutils literal notranslate"><span class="pre">replace_with</span></code></li>
<code class="docutils literal notranslate"><span class="pre">replaceWith</span></code>
<span class="pre">replaceWith</span>
replaceWith
 -> 
<code class="docutils literal notranslate"><span class="pre">replace_with</span></code>
<span class="pre">replace_with</span>
replace_with


<li><code class="docutil

In [37]:
len(list(soup.children))

4

In [38]:
len(list(soup.descendants))

16645
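
The gap between those two counts comes from how far each iterator walks: `.children` yields only the direct children of a node, while `.descendants` recurses through the whole subtree. A minimal sketch on a toy document small enough to count by hand (the markup and the names `small_soup`, `direct`, and `everything` are made up for illustration, and `html.parser` is used so that no wrapper tags are added):

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical toy document, not the documentation page parsed above.
doc = "<html><head><title>T</title></head><body><p>Hi <b>there</b></p></body></html>"
small_soup = BeautifulSoup(doc, "html.parser")

direct = list(small_soup.children)         # only the top-level <html> tag
everything = list(small_soup.descendants)  # every tag and string, depth-first

# Show tag names for tags and the text itself for strings.
names = [node.name if isinstance(node, Tag) else str(node) for node in everything]

print(len(direct), len(everything))
print(names)
```

Each string counts as a descendant too, which is part of why the count for a full page grows so quickly.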

In [39]:
# If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator
for string in soup.strings:
    print(repr(string))

' '
'\n'
'\n'
'\n'
'\n'
'Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
' Beautiful Soup\n          \n\n          \n          '
'\n'
'\n                latest\n              '
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'Beautiful Soup Documentation'
'\n'
'Getting help'
'\n'
'\n'
'\n'
'Quick Start'
'\n'
'Installing Beautiful Soup'
'\n'
'Problems after installation'
'\n'
'Installing a parser'
'\n'
'\n'
'\n'
'Making the soup'
'\n'
'Kinds of objects'
'\n'
'Tag'
'\n'
'Name'
'\n'
'Attributes'
'\n'
'Multi-valued attributes'
'\n'
'\n'
'\n'
'\n'
'\n'
'NavigableString'
'\n'
'BeautifulSoup'
'\n'
'Comments and other special strings'
'\n'
'\n'
'\n'
'Navigating the tree'
'\n'
'Going down'
'\n'
'Navigating using tag names'
'\n'
'.contents'
' and '
'.children'
'\n'
'.descendants'
'\n'
'.string'
'\n'
'.strings'
' and '
'stripped

'.'
'string'
'.'
'replace_with'
'('
'"No longer bold"'
')'
'\n'
'tag'
'\n'
'# <blockquote>No longer bold</blockquote>'
'\n'
'\n'
'\n'
'NavigableString'
' supports most of the features described in\n'
'Navigating the tree'
' and '
'Searching the tree'
', but not all of\nthem. In particular, since a string can’t contain anything (the way a\ntag may contain a string or another tag), strings don’t support the\n'
'.contents'
' or '
'.string'
' attributes, or the '
'find()'
' method.'
'\n'
'If you want to use a '
'NavigableString'
' outside of Beautiful Soup,\nyou should call '
'unicode()'
' on it to turn it into a normal Python\nUnicode string. If you don’t, your string will carry around a\nreference to the entire Beautiful Soup parse tree, even when you’re\ndone using Beautiful Soup. This is a big waste of memory.'
'\n'
'\n'
'\n'
'BeautifulSoup'
'¶'
'\n'
'The '
'BeautifulSoup'
' object represents the parsed document as a\nwhole. For most purposes, you can t

'title_tag'
'.'
'string'
'\n'
"# u'The Dormouse's story'"
'\n'
'\n'
'\n'
'If a tag’s only child is another tag, and '
'that'
' tag has a\n'
'.string'
', then the parent tag is considered to have the same\n'
'.string'
' as its child:'
'\n'
'head_tag'
'.'
'contents'
'\n'
"# [<title>The Dormouse's story</title>]"
'\n\n'
'head_tag'
'.'
'string'
'\n'
"# u'The Dormouse's story'"
'\n'
'\n'
'\n'
'If a tag contains more than one thing, then it’s not clear what\n'
'.string'
' should refer to, so '
'.string'
' is defined to be\n'
'None'
':'
'\n'
'print'
'('
'soup'
'.'
'html'
'.'
'string'
')'
'\n'
'# None'
'\n'
'\n'
'\n'
'\n'
'\n'
'.strings'
' and '
'stripped_strings'
'¶'
'\n'
'If there’s more than one thing inside a tag, you can still look at\njust the strings. Use the '
'.strings'
' generator:'
'\n'
'for'
' '
'string'
' '
'in'
' '
'soup'
'.'
'strings'
':'
'\n    '
'print'
'('
'repr'
'('
'string'
'))'
'\n'
'# u"The Dormouse\'s story"'
'\n'
"# u'\\n\\n'"
'\n'
'# u"The Dorm

'(['
'"a"'
','
' '
'"b"'
'])'
'\n'
"# [<b>The Dormouse's story</b>,"
'\n'
'#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,'
'\n'
'#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,'
'\n'
'#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]'
'\n'
'\n'
'\n'
'\n'
'\n'
'True'
'¶'
'\n'
'The value '
'True'
' matches everything it can. This code finds '
'all'
'\nthe tags in the document, but none of the text strings:'
'\n'
'for'
' '
'tag'
' '
'in'
' '
'soup'
'.'
'find_all'
'('
'True'
'):'
'\n    '
'print'
'('
'tag'
'.'
'name'
')'
'\n'
'# html'
'\n'
'# head'
'\n'
'# title'
'\n'
'# body'
'\n'
'# p'
'\n'
'# b'
'\n'
'# p'
'\n'
'# a'
'\n'
'# a'
'\n'
'# a'
'\n'
'# p'
'\n'
'\n'
'\n'
'\n'
'\n'
'A function'
'¶'
'\n'
'If none of the other matches work for you, define a function that\ntakes an element as its only argument. The function should return\n'
'True'
' if the argument matches, and '
'False'
' otherwise.'
'\n'
'Hereâ\x

' '
'Dormouse'
"'s story"
'\n  '
'</'
'title'
'>'
'\n '
'</'
'head'
'>'
'\n'
'...'
'\n'
'\n'
'\n'
'The <title> tag is beneath the <html> tag, but it’s not '
'directly'
'\nbeneath the <html> tag: the <head> tag is in the way. Beautiful Soup\nfinds the <title> tag when it’s allowed to look at all descendants of\nthe <html> tag, but when '
'recursive=False'
' restricts it to the\n<html> tag’s immediate children, it finds nothing.'
'\n'
'Beautiful Soup offers a lot of tree-searching methods (covered below),\nand they mostly take the same arguments as '
'find_all()'
': '
'name'
',\n'
'attrs'
', '
'string'
', '
'limit'
', and the keyword arguments. But the\n'
'recursive'
' argument is different: '
'find_all()'
' and '
'find()'
' are\nthe only methods that support it. Passing '
'recursive=False'
' into a\nmethod like '
'find_parents()'
' wouldn’t be very useful.'
'\n'
'\n'
'\n'
'\n'
'Calling a tag is like calling '
'find_all()'
'¶'
'\n'
'Because '
'find_all()'

'# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,'
'\n'
'#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,'
'\n'
'#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]'
'\n\n'
'soup'
'.'
'select'
'('
'\'a[href$="tillie"]\''
')'
'\n'
'# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]'
'\n\n'
'soup'
'.'
'select'
'('
'\'a[href*=".com/el"]\''
')'
'\n'
'# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]'
'\n'
'\n'
'\n'
'There’s also a method called '
'select_one()'
', which finds only the\nfirst tag that matches a selector:'
'\n'
'soup'
'.'
'select_one'
'('
'".sister"'
')'
'\n'
'# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
'\n'
'\n'
'\n'
'If you’ve parsed XML that defines namespaces, you can use them in CSS\nselectors.:'
'\n'
'from'
' '
'bs4'
' '
'import'
' '
'BeautifulSoup'
'\n'
'xml'
' '
'='
' '
'"""<tag xmlns:ns1="http:/

'new_tag'
'('
'"div"'
')'
'\n'
'# <div><p><b>I wish I was bold.</b></p></div>'
'\n'
'\n'
'\n'
'This method is new in Beautiful Soup 4.0.5.'
'\n'
'\n'
'\n'
'unwrap()'
'¶'
'\n'
'Tag.unwrap()'
' is the opposite of '
'wrap()'
'. It replaces a tag with\nwhatever’s inside that tag. It’s good for stripping out markup:'
'\n'
'markup'
' '
'='
' '
'\'<a href="http://example.com/">I linked to <i>example.com</i></a>\''
'\n'
'soup'
' '
'='
' '
'BeautifulSoup'
'('
'markup'
')'
'\n'
'a_tag'
' '
'='
' '
'soup'
'.'
'a'
'\n\n'
'a_tag'
'.'
'i'
'.'
'unwrap'
'()'
'\n'
'a_tag'
'\n'
'# <a href="http://example.com/">I linked to example.com</a>'
'\n'
'\n'
'\n'
'Like '
'replace_with()'
', '
'unwrap()'
' returns the tag\nthat was replaced.'
'\n'
'\n'
'\n'
'smooth()'
'¶'
'\n'
'After calling a bunch of methods that modify the parse tree, you may end up with two or more '
'NavigableString'
' objects next to each other. Beautiful Soup doesn’t have any problems with this, but since it canâ\x

'\n'
'  </body>'
'\n'
' </html>'
'\n'
"'''"
'\n\n'
'soup'
' '
'='
' '
'BeautifulSoup'
'('
'markup'
')'
'\n'
'print'
'('
'soup'
'.'
'prettify'
'())'
'\n'
'# <html>'
'\n'
'#  <head>'
'\n'
'#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />'
'\n'
'#  </head>'
'\n'
'#  <body>'
'\n'
'#   <p>'
'\n'
'#    Sacré bleu!'
'\n'
'#   </p>'
'\n'
'#  </body>'
'\n'
'# </html>'
'\n'
'\n'
'\n'
'Note that the <meta> tag has been rewritten to reflect the fact that\nthe document is now in UTF-8.'
'\n'
'If you don’t want UTF-8, you can pass an encoding into '
'prettify()'
':'
'\n'
'print'
'('
'soup'
'.'
'prettify'
'('
'"latin-1"'
'))'
'\n'
'# <html>'
'\n'
'#  <head>'
'\n'
'#   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />'
'\n'
'# ...'
'\n'
'\n'
'\n'
'You can also call encode() on the '
'BeautifulSoup'
' object, or any\nelement in the soup, just as if it were a Python string:'
'\n'
'soup'
'.'
'p'
'.'
'encode'
'('
'"latin-1"'
')'
'\n'
"# '<p>Sacr\\xe9

'"link2"'
')'
'\n\n'
'def'
' '
'is_short_string'
'('
'string'
'):'
'\n    '
'return'
' '
'len'
'('
'string'
')'
' '
'<'
' '
'10'
'\n\n'
'only_short_strings'
' '
'='
' '
'SoupStrainer'
'('
'string'
'='
'is_short_string'
')'
'\n'
'\n'
'\n'
'I’m going to bring back the “three sisters” document one more time,\nand we’ll see what the document looks like when it’s parsed with these\nthree '
'SoupStrainer'
' objects:'
'\n'
'html_doc'
' '
'='
' '
'"""'
'\n'
"<html><head><title>The Dormouse's story</title></head>"
'\n'
'<body>'
'\n'
'<p class="title"><b>The Dormouse\'s story</b></p>'
'\n\n'
'<p class="story">Once upon a time there were three little sisters; and their names were'
'\n'
'<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,'
'\n'
'<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and'
'\n'
'<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;'
'\n'
'and they lived at the bottom of a

'\n'
'\n'
'\n'
'Downloads'
'\n'
'pdf'
'\n'
'html'
'\n'
'epub'
'\n'
'\n'
'\n'
'On Read the Docs'
'\n'
'\n'
'Project Home'
'\n'
'\n'
'\n'
'Builds'
'\n'
'\n'
'\n'
'\n      Free document hosting provided by '
'Read the Docs'
'.\n\n    '
'\n'
'\n'
'\n'
'\n'


In [41]:
# These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead
for string in soup.stripped_strings:
    print(repr(string))

'Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation'
'Beautiful Soup'
'latest'
'Beautiful Soup Documentation'
'Getting help'
'Quick Start'
'Installing Beautiful Soup'
'Problems after installation'
'Installing a parser'
'Making the soup'
'Kinds of objects'
'Tag'
'Name'
'Attributes'
'Multi-valued attributes'
'NavigableString'
'BeautifulSoup'
'Comments and other special strings'
'Navigating the tree'
'Going down'
'Navigating using tag names'
'.contents'
'and'
'.children'
'.descendants'
'.string'
'.strings'
'and'
'stripped_strings'
'Going up'
'.parent'
'.parents'
'Going sideways'
'.next_sibling'
'and'
'.previous_sibling'
'.next_siblings'
'and'
'.previous_siblings'
'Going back and forth'
'.next_element'
'and'
'.previous_element'
'.next_elements'
'and'
'.previous_elements'
'Searching the tree'
'Kinds of filters'
'A string'
'A regular expression'
'A list'
'True'
'A function'
'find_all()'
'The'
'name'
'argument'
'The keyword arguments'
'Searching by CSS class'
'The'
'string'
'ar

'.'
'a'
'link'
'# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
'for'
'parent'
'in'
'link'
'.'
'parents'
':'
'if'
'parent'
'is'
'None'
':'
'print'
'('
'parent'
')'
'else'
':'
'print'
'('
'parent'
'.'
'name'
')'
'# p'
'# body'
'# html'
'# [document]'
'# None'
'Going sideways'
'¶'
'Consider a simple document like this:'
'sibling_soup'
'='
'BeautifulSoup'
'('
'"<a><b>text1</b><c>text2</c></b></a>"'
')'
'print'
'('
'sibling_soup'
'.'
'prettify'
'())'
'# <html>'
'#  <body>'
'#   <a>'
'#    <b>'
'#     text1'
'#    </b>'
'#    <c>'
'#     text2'
'#    </c>'
'#   </a>'
'#  </body>'
'# </html>'
'The <b> tag and the <c> tag are at the same level: they’re both direct\nchildren of the same tag. We call them'
'siblings'
'. When a document is\npretty-printed, siblings show up at the same indentation level. You\ncan also use this relationship in the code you write.'
'.next_sibling'
'and'
'.previous_sibling'
'¶'
'You can use'
'.next_sibling'
'and'
'.previous_siblin

'-'
'foo'
'='
'"value"'
')'
"# SyntaxError: keyword can't be an expression"
'You can use these attributes in searches by putting them into a\ndictionary and passing the dictionary into'
'find_all()'
'as the'
'attrs'
'argument:'
'data_soup'
'.'
'find_all'
'('
'attrs'
'='
'{'
'"data-foo"'
':'
'"value"'
'})'
'# [<div data-foo="value">foo!</div>]'
'You can’t use a keyword argument to search for HTML’s ‘name’ element,\nbecause Beautiful Soup uses the'
'name'
'argument to contain the name\nof the tag itself. Instead, you can give a value to ‘name’ in the'
'attrs'
'argument:'
'name_soup'
'='
'BeautifulSoup'
'('
'\'<input name="email"/>\''
')'
'name_soup'
'.'
'find_all'
'('
'name'
'='
'"email"'
')'
'# []'
'name_soup'
'.'
'find_all'
'('
'attrs'
'='
'{'
'"name"'
':'
'"email"'
'})'
'# [<input name="email"/>]'
'Searching by CSS class'
'¶'
'It’s very useful to search for a tag that has a certain CSS class, but\nthe name of the CSS attribute, 

'soup'
'.'
'select'
'('
'"#link1 ~ .sister"'
')'
'# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,'
'#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]'
'soup'
'.'
'select'
'('
'"#link1 + .sister"'
')'
'# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]'
'Find tags by CSS class:'
'soup'
'.'
'select'
'('
'".sister"'
')'
'# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,'
'#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,'
'#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]'
'soup'
'.'
'select'
'('
'"[class~=sister]"'
')'
'# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,'
'#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,'
'#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]'
'Find tags by ID:'
'soup'
'.'
'select'
'('
'"#link1"'
')'
'# [<a class="sister" href="ht

'If you pass in'
'formatter="html"'
', Beautiful Soup will convert\nUnicode characters to HTML entities whenever possible:'
'print'
'('
'soup'
'.'
'prettify'
'('
'formatter'
'='
'"html"'
'))'
'# <html>'
'#  <body>'
'#   <p>'
'#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;'
'#   </p>'
'#  </body>'
'# </html>'
'If you pass in'
'formatter="html5"'
', it’s the same as'
'formatter="html"'
', but Beautiful Soup will\nomit the closing slash in HTML void tags like “br”:'
'soup'
'='
'BeautifulSoup'
'('
'"<br>"'
')'
'print'
'('
'soup'
'.'
'encode'
'('
'formatter'
'='
'"html"'
'))'
'# <html><body><br/></body></html>'
'print'
'('
'soup'
'.'
'encode'
'('
'formatter'
'='
'"html5"'
'))'
'# <html><body><br></body></html>'
'If you pass in'
'formatter=None'
', Beautiful Soup will not modify\nstrings at all on output. This is the fastest option, but it may lead\nto Beautiful Soup generating invalid HTML/XML, as in these examples:'
'print'
'('
'soup'
'.'
'prettify'
'('
'formatte

'second_b'
'# False'
'Copying Beautiful Soup objects'
'¶'
'You can use'
'copy.copy()'
'to create a copy of any'
'Tag'
'or'
'NavigableString'
':'
'import'
'copy'
'p_copy'
'='
'copy'
'.'
'copy'
'('
'soup'
'.'
'p'
')'
'print'
'p_copy'
'# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>'
'The copy is considered equal to the original, since it represents the\nsame markup as the original, but it’s not the same object:'
'print'
'soup'
'.'
'p'
'=='
'p_copy'
'# True'
'print'
'soup'
'.'
'p'
'is'
'p_copy'
'# False'
'The only real difference is that the copy is completely detached from\nthe original Beautiful Soup object tree, just as if'
'extract()'
'had\nbeen called on it:'
'print'
'p_copy'
'.'
'parent'
'# None'
'This is because two different'
'Tag'
'objects can’t occupy the same\nspace at the same time.'
'Parsing only part of a document'
'¶'
'Let’s say you want to use Beautiful Soup to look at a document’s <a>\ntags. It’s a waste of time and memory to par

In [46]:
# Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')  # name the parser explicitly; omitting it triggers GuessedAtParserWarning
a_tag = soup.a

a_tag.i.unwrap()
print(markup)
print(a_tag)

<a href="http://example.com/">I linked to <i>example.com</i></a>
<a href="http://example.com/">I linked to example.com</a>
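
Because `unwrap()` is the opposite of `wrap()`, a round trip makes the relationship concrete. A minimal sketch, assuming a made-up one-paragraph document (the names `demo`, `p_tag`, and `removed` are hypothetical):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
p_tag = demo.p

# wrap(): put the paragraph's string inside a newly created <b> tag.
p_tag.string.wrap(demo.new_tag("b"))
wrapped = str(p_tag)

# unwrap(): replace the <b> tag with its contents again.
removed = p_tag.b.unwrap()
unwrapped = str(p_tag)

print(wrapped)       # <p><b>I wish I was bold.</b></p>
print(unwrapped)     # <p>I wish I was bold.</p>
print(removed.name)  # b
```

The return value of `unwrap()` is the tag that was taken out, now empty, which is useful when the removed markup needs to be logged or reused.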


In [48]:
# Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line)
markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
soup = BeautifulSoup(markup, 'html.parser')
for tag in soup.find_all('p'):
    print(tag.sourceline, tag.sourcepos, tag.string)

1 0 Paragraph 1
3 4 Paragraph 2
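
One practical use of these attributes is pointing at the place in the source where a problem occurs. The sketch below reports `<img>` tags without an `alt` attribute; the markup and the names `page`, `check_soup`, and `missing_alt` are made up for illustration. It assumes a Beautiful Soup release recent enough to provide `sourceline` and `sourcepos`, and it uses `html.parser`, the same parser as the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the second image has no alt text.
page = """<html>
<body>
<img src="a.png" alt="A">
<img src="b.png">
</body>
</html>"""
check_soup = BeautifulSoup(page, "html.parser")

# Collect (line, column, src) for every <img> that lacks an alt attribute.
missing_alt = [
    (img.sourceline, img.sourcepos, img["src"])
    for img in check_soup.find_all("img")
    if not img.has_attr("alt")
]

for line, col, src in missing_alt:
    print(f"line {line}, column {col}: <img src={src!r}> has no alt text")
```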
