# BeautifulSoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Beautiful Soup is a Python library for quickly extracting (scraping) data from web pages. It lets you choose the underlying parser, such as html.parser, lxml, or html5lib.



**Installing & Importing prerequisites**

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
simple_html = """

<html>
    <head>
        <title>Web Scraping Session</title>        
    </head>

    <body>
        <p class="title">
            <b>Simple HTML</b>
        </p>

        <p class="begin start">
        This is one of the sessions on web scraping and we have covered

            <a href="http://abcd.com/request" class="content" id="link1">requests</a>,
            <a href="http://abcd.com/parser" class="content" id="link2">parser</a> and
            <a href="http://abcd.com/extractor" class="content" id="link3">extractor</a>.
        </p>

        <p class="end">Let us move on to the next topic.</p>
    </body>
</html>

"""

## Objects

Beautiful Soup transforms a **complex HTML document into a complex tree of Python objects**. 

The most common kinds of objects we generally deal with are: 

1. BeautifulSoup
2. Tag



### Soup Object

The BeautifulSoup object represents the parsed document as a whole.
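As a minimal sketch of what "the document as a whole" means, the snippet below parses a tiny page and inspects the resulting object. The names `doc` and `soup_obj` are illustrative only, and the built-in html.parser is named explicitly so the example does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (illustrative, not the notebook's simple_html)
doc = "<html><head><title>Web Scraping Session</title></head><body><p>Hi</p></body></html>"

# Naming the parser explicitly keeps the result the same on every machine
soup_obj = BeautifulSoup(doc, "html.parser")

print(type(soup_obj))         # the BeautifulSoup class
print(soup_obj.name)          # the soup object carries the special name "[document]"
print(soup_obj.title.string)  # text inside the first <title> tag
```

The soup object behaves much like a Tag sitting at the root of the tree, which is why the tag-style access shown later also works on it.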

In [4]:
soup = BeautifulSoup(simple_html)  # No parser is named, so Beautiful Soup picks the best one installed

In [5]:
print(soup.prettify())  # Indented soup object

<html>
 <head>
  <title>
   Web Scraping Session
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Simple HTML
   </b>
  </p>
  <p class="begin start">
   This is one of the sessions on web scraping and we have covered
   <a class="content" href="http://abcd.com/request" id="link1">
    requests
   </a>
   ,
   <a class="content" href="http://abcd.com/parser" id="link2">
    parser
   </a>
   and
   <a class="content" href="http://abcd.com/extractor" id="link3">
    extractor
   </a>
   .
  </p>
  <p class="end">
   Let us move on to the next topic.
  </p>
 </body>
</html>



### Tag object

A Tag object corresponds to an XML or HTML tag in the original document.


#### Accessing elements/tags




In [6]:
tag = soup.title  # Returns the first <title> tag in the document
print(tag)
print(type(tag))
print(tag.name)  # Every tag has a name, available through its "name" attribute.

<title>Web Scraping Session</title>
<class 'bs4.element.Tag'>
title


In [7]:
soup.body

<body>
<p class="title">
<b>Simple HTML</b>
</p>
<p class="begin start">
        This is one of the sessions on web scraping and we have covered

            <a class="content" href="http://abcd.com/request" id="link1">requests</a>,
            <a class="content" href="http://abcd.com/parser" id="link2">parser</a> and
            <a class="content" href="http://abcd.com/extractor" id="link3">extractor</a>.
        </p>
<p class="end">Let us move on to the next topic.</p>
</body>

In [8]:
soup.body.name

'body'

In [9]:
soup.body.parent.name

'html'
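Attribute-style access can be chained, and it only ever returns the first matching tag. The sketch below illustrates this on a small stand-in snippet; `snippet` and `demo` are hypothetical names, and html.parser is used so the result does not depend on optional parsers.

```python
from bs4 import BeautifulSoup

snippet = (
    '<html><body><p class="title"><b>Simple HTML</b></p>'
    '<p class="end">Next topic</p></body></html>'
)
demo = BeautifulSoup(snippet, "html.parser")

first_p = demo.p                 # only the first <p> is returned
bold_text = demo.p.b.string      # chained access walks down the tree
missing = demo.table             # no <table> exists, so this is None
root_parent = demo.html.parent   # the parent of <html> is the soup object itself

print(first_p["class"], bold_text, missing, root_parent.name)
```

Because a missing tag yields None rather than raising an error, it is worth checking the result before chaining further.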

In [11]:
soup.p

<p class="title">
<b>Simple HTML</b>
</p>

In [12]:
soup.a

<a class="content" href="http://abcd.com/request" id="link1">requests</a>

#### Attributes of the tag

A tag may have any number of attributes. We can access a tag’s attributes by treating the tag like a dictionary:

In [13]:
tag = soup.a
tag

<a class="content" href="http://abcd.com/request" id="link1">requests</a>

In [14]:
tag['href']

'http://abcd.com/request'

In [15]:
tag.attrs  # attrs is a dictionary of all the tag's attributes

{'href': 'http://abcd.com/request', 'class': ['content'], 'id': 'link1'}

To get an attribute's value as a list, whether it holds one value or several, use get_attribute_list():

In [16]:
tag.get_attribute_list('class')

['content']
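The difference between single-valued and multi-valued attributes is easiest to see side by side. This is a small sketch on a stand-in snippet (the names `snippet` and `demo` are illustrative); it also shows `get()`, which returns None for a missing attribute where square brackets would raise a KeyError.

```python
from bs4 import BeautifulSoup

snippet = '<p class="begin start"><a href="http://abcd.com/request" id="link1">requests</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

p_classes = demo.p["class"]                # multi-valued attribute, returned as a list
link_id = demo.a["id"]                     # single-valued attribute, returned as a string
link_title = demo.a.get("title")           # attribute is absent, so get() returns None
id_list = demo.a.get_attribute_list("id")  # always a list, even for a single value

print(p_classes, link_id, link_title, id_list)
```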

## Different Types Of Parsers
Beautiful Soup supports several parsers; which one to use depends on the type of markup you want to parse. It can handle HTML, XML, and HTML5.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use


A useful feature of Beautiful Soup is that its parsers cope well even with ill-formatted HTML. We shall now create a soup object from an ill-formatted snippet.

When we don't specify a parser, Beautiful Soup picks the best HTML parser that is installed: lxml if available, then html5lib, then Python's built-in html.parser. In this notebook that is lxml, which is very fast and lenient enough to handle ill-formatted HTML.
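Since the parser chosen by default depends on what is installed, it can help to check the environment first. The following is a rough sketch, not an official recipe: it tries each parser name and records whether Beautiful Soup could find it. The names `broken` and `available` are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<h1><a /><b><th <td>"

available = {}
for parser_name in ("lxml", "html5lib", "html.parser"):
    try:
        BeautifulSoup(broken, parser_name)
        available[parser_name] = True
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        available[parser_name] = False

print(available)
```

Passing the parser name explicitly, as done here, also keeps results reproducible across machines with different libraries installed.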





In [25]:
html = """
    <h1><a /><b><th <td>
"""

Here is what the parsed HTML looks like with the default parser. Observe that it is now well-formed.

In [29]:
soup = BeautifulSoup(html)

print(soup.prettify())

<html>
 <body>
  <h1>
   <a>
   </a>
   <b>
   </b>
   <th>
   </th>
  </h1>
 </body>
</html>


### lxml
https://lxml.de/index.html



When you create the soup object, you can explicitly specify which parser to use. Here we name the lxml parser explicitly, and the parsed HTML follows.

In [None]:
soup = BeautifulSoup(html, 'lxml')

print(soup.prettify())

<html>
 <body>
  <h1>
   <a>
   </a>
   <b>
   </b>
   <th>
   </th>
  </h1>
 </body>
</html>


### html.parser

You can also construct your soup object using Python's built-in HTML parser, html.parser, which needs no extra installation.

**It has decent speed and is fairly lenient, but it is not as fast as lxml and not as lenient as html5lib.**

Let's take a look at the result when we use html.parser.


In [27]:
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<h1>
 <a>
 </a>
 <b>
  <th <td="">
  </th>
 </b>
</h1>


The resulting HTML is **not as clean** as what the lxml parser produced: no html or body tags are added, and the stray "<td" text is kept as an attribute of the th tag.
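One way to see that html.parser does not build a full document is to parse a fragment and look for the wrapper tags. This is a minimal sketch using a different, smaller fragment than the one above; the name `fragment` is illustrative.

```python
from bs4 import BeautifulSoup

# An unmatched closing </p> inside an unclosed <a>
fragment = BeautifulSoup("<a></p>", "html.parser")

print(fragment)       # the stray </p> is ignored and <a> is closed
print(fragment.html)  # no <html> wrapper was added, so this is None
print(fragment.body)  # likewise, no <body> wrapper
```

Code that relies on `soup.body` or `soup.html` therefore needs a parser that adds those tags, or a check for None.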

### html5lib

If you want your soup object to be **valid HTML5**, the html5lib parser is the one to use.

It is extremely lenient and parses pages the same way a web browser does, so the resulting tree is valid HTML5.

However, it is very slow and depends on the external html5lib Python package. Here is the valid HTML5 produced by the html5lib parser.

In [30]:
soup = BeautifulSoup(html, 'html5lib')

print(soup.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   <a>
    <b>
    </b>
   </a>
  </h1>
 </body>
</html>


Beautiful Soup can be used to parse **XML trees** as well. Pass 'xml' as the parser name (this requires the lxml library) and the markup is treated as XML rather than HTML.
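One practical consequence of parsing as XML is that tag names keep their case, whereas HTML parsers lower-case them. The sketch below illustrates this on a made-up document (`xml_source`, `Library`, `Book`, and `book` are all hypothetical). It falls back gracefully if lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml_source = "<Library><Book id='1'/><book id='2'/></Library>"

# An HTML parser lower-cases tag names, so both elements become <book>
html_view = BeautifulSoup(xml_source, "html.parser")
print(html_view.Book)        # None: no tag is named "Book" after lower-casing
print(html_view.book["id"])  # id of the first <book>

# The XML parser needs lxml; skip this part if it is missing
try:
    xml_view = BeautifulSoup(xml_source, "xml")
except FeatureNotFound:
    xml_view = None

if xml_view is not None:
    # Case is preserved, so Book and book are different tags
    print(xml_view.Book["id"], xml_view.book["id"])
```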

In [31]:
soup = BeautifulSoup(html, 'xml')

print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<h1>
 <a/>
 <b>
  <th>
   <td/>
  </th>
 </b>
</h1>


To name lxml's XML parser explicitly, pass 'lxml-xml'. It is the same parser that 'xml' selects, so the output is identical.

In [32]:
soup = BeautifulSoup(html, 'lxml-xml')

print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<h1>
 <a/>
 <b>
  <th>
   <td/>
  </th>
 </b>
</h1>
