## Clean Up HTML Using BeautifulSoup

Let us understand how to clean up HTML using BeautifulSoup while copying content from one site to another.
* When we copy content from one site to another, we might run into issues due to conflicting JavaScript and CSS.
* It is better to remove references to CSS and JavaScript, along with any irrelevant tags.
* We ran into this issue while copying content from [https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html](https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html) to a blog post.
* Here are the cleanup tasks we will perform to understand how BeautifulSoup can clean up HTML content.
  * Remove the `script` tags along with their content.
  * Remove the permalink anchor tag in the heading, as it only refers to the heading itself.
  * Removing a tag along with its content is called `decompose`.
  * Remove `div` containers while retaining the inner tags. Removing a tag without touching its content is called `unwrap`.
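
As a minimal sketch of the two operations listed above, the following uses a small inline HTML string instead of the real page. The snippet and its contents are made up for illustration; only `find`, `decompose`, and `unwrap` are BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modelled on the page structure used in this section.
html = (
    '<div id="main-content">'
    '<div class="section"><h1>Title<a class="headerlink" href="#title">¶</a></h1>'
    '<p>Body text</p></div>'
    '<script>kernelName = "python3"</script>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
mc = soup.find('div', id='main-content')

# decompose: the tag and everything inside it are removed.
mc.find('script').decompose()
mc.find('a', class_='headerlink').decompose()

# unwrap: only the tag itself is removed; its children stay in place.
mc.find('div', class_='section').unwrap()

cleaned = str(mc)
print(cleaned)
```

With this input, the heading and paragraph end up directly inside the `main-content` container, while the script and the permalink anchor are gone.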

### Decomposing Tags

Let us see how we can remove a tag along with its content. This is called `decompose`.

In [1]:
import requests
from bs4 import BeautifulSoup

url = 'https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html'
page = requests.get(url)

# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(page.content, 'html.parser')

In [2]:
mc = soup.find('div', id='main-content')

In [3]:
print(mc.prettify())

<div class="row" id="main-content">
 <div class="col-12 col-md-9 pl-md-3 pr-md-0">
  <div>
   <div class="section" id="create-database-and-users-table">
    <h1>
     Create Database and Users Table
     <a class="headerlink" href="#create-database-and-users-table" title="Permalink to this headline">
      ¶
     </a>
    </h1>
    <p>
     Let us create a simple table by name users for now.
    </p>
    <ul class="simple">
     <li>
      <p>
       We can run database commands using
       <strong>
        %%sql with in Jupyter Notebook
       </strong>
       or
       <strong>
        psql
       </strong>
       or
       <strong>
        SQL Alchemy
       </strong>
       to create the tables in the database. You can use the tool as per your preference.
      </p>
     </li>
     <li>
      <p>
       If you are using our labs, you will get a database and user which will be prefixed with your OS username and the password which is published via our portal.
      </p>
     </li>
 

In [4]:
soup.find('div', id='main-content').find('script')

<script type="text/x-thebe-config">
    {
        requestKernel: true,
        binderOptions: {
            repo: "binder-examples/jupyter-stacks-datascience",
            ref: "master",
        },
        codeMirrorConfig: {
            theme: "abcdef",
            mode: "python"
        },
        kernelOptions: {
            kernelName: "python3",
            path: "./04_postgres_database_operations"
        },
        predefinedOutput: true
    }
    </script>

In [5]:
script = mc.find('script')

In [6]:
script

<script type="text/x-thebe-config">
    {
        requestKernel: true,
        binderOptions: {
            repo: "binder-examples/jupyter-stacks-datascience",
            ref: "master",
        },
        codeMirrorConfig: {
            theme: "abcdef",
            mode: "python"
        },
        kernelOptions: {
            kernelName: "python3",
            path: "./04_postgres_database_operations"
        },
        predefinedOutput: true
    }
    </script>

In [7]:
script.decompose()

In [8]:
soup.find('div', id='main-content').find('script')

<script>kernelName = 'python3'</script>
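
The output above still shows a `script` tag because `find` returns only the first match, so a single `decompose` removes only one of the two scripts. A small sketch of that behaviour, using a made-up two-script snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-script snippet standing in for the real page.
html = (
    '<div id="main-content">'
    '<script type="text/x-thebe-config">{}</script>'
    '<script>kernelName = "python3"</script>'
    '<p>Content</p>'
    '</div>'
)
mc = BeautifulSoup(html, 'html.parser').find('div', id='main-content')

# find returns only the first match, so one decompose leaves the second script.
mc.find('script').decompose()
remaining_after_first = len(mc.find_all('script'))

# find_all returns every match; decomposing each one clears them all.
for tag in mc.find_all('script'):
    tag.decompose()
remaining_after_all = len(mc.find_all('script'))

print(remaining_after_first, remaining_after_all)
```

This is why the following cells switch to a loop over `find_all('script')`.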

In [9]:
import requests
from bs4 import BeautifulSoup

# Fetch the page again so that we start from the original, unmodified HTML
url = 'https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

In [10]:
mc = soup.find('div', id='main-content')

In [11]:
for tag in mc.find_all('script'):
    print(tag)

<script type="text/x-thebe-config">
    {
        requestKernel: true,
        binderOptions: {
            repo: "binder-examples/jupyter-stacks-datascience",
            ref: "master",
        },
        codeMirrorConfig: {
            theme: "abcdef",
            mode: "python"
        },
        kernelOptions: {
            kernelName: "python3",
            path: "./04_postgres_database_operations"
        },
        predefinedOutput: true
    }
    </script>
<script>kernelName = 'python3'</script>


In [12]:
for tag in mc.find_all('script'):
    tag.decompose()

In [13]:
for tag in mc.find_all('script'):
    print(tag)

In [14]:
print(mc.prettify())

<div class="row" id="main-content">
 <div class="col-12 col-md-9 pl-md-3 pr-md-0">
  <div>
   <div class="section" id="create-database-and-users-table">
    <h1>
     Create Database and Users Table
     <a class="headerlink" href="#create-database-and-users-table" title="Permalink to this headline">
      ¶
     </a>
    </h1>
    <p>
     Let us create a simple table by name users for now.
    </p>
    <ul class="simple">
     <li>
      <p>
       We can run database commands using
       <strong>
        %%sql with in Jupyter Notebook
       </strong>
       or
       <strong>
        psql
       </strong>
       or
       <strong>
        SQL Alchemy
       </strong>
       to create the tables in the database. You can use the tool as per your preference.
      </p>
     </li>
     <li>
      <p>
       If you are using our labs, you will get a database and user which will be prefixed with your OS username and the password which is published via our portal.
      </p>
     </li>
 

In [15]:
mc.find('a', class_='headerlink')

<a class="headerlink" href="#create-database-and-users-table" title="Permalink to this headline">¶</a>

In [16]:
headerlink = mc.find('a', class_='headerlink')

In [17]:
headerlink.decompose()

In [18]:
mc.find('a', class_='headerlink')
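
The last call prints nothing because `find` returns `None` once no matching tag is left. A page with several headings may carry one permalink anchor per heading, in which case a single `find` followed by `decompose` would remove only the first. The sketch below, on a made-up snippet, loops over `find_all` instead, which also avoids calling `decompose` on `None` when there is no match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two headings, each carrying a permalink anchor.
html = (
    '<div id="main-content">'
    '<h1>Title<a class="headerlink" href="#title">¶</a></h1>'
    '<h2>Details<a class="headerlink" href="#details">¶</a></h2>'
    '<p>See <a href="https://example.com">this link</a>.</p>'
    '</div>'
)
mc = BeautifulSoup(html, 'html.parser').find('div', id='main-content')

# Looping over find_all handles zero, one, or many matches without a None check.
for link in mc.find_all('a', class_='headerlink'):
    link.decompose()

print(mc.find('a', class_='headerlink'))  # None: no permalink anchors remain
print(mc.find('a')['href'])               # the ordinary link is untouched
```

Filtering on `class_='headerlink'` keeps ordinary links in the content intact.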

### Unwrapping Tags

Let us see how we can remove tags without deleting their content. This is called `unwrap`.
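
Before working on the real page, here is a small sketch of `unwrap` on nested containers. The HTML string is made up for illustration. It also relies on two details of `find_all`: it searches only the descendants of the tag it is called on, and it accepts a list of tag names.

```python
from bs4 import BeautifulSoup

# Hypothetical nested-container snippet.
html = (
    '<div id="main-content">'
    '<div class="col"><div><p>Hello <span class="pre">psql</span></p></div></div>'
    '</div>'
)
mc = BeautifulSoup(html, 'html.parser').find('div', id='main-content')

# find_all searches descendants only, so mc itself is not in the result.
# Passing a list of names matches any of them.
for tag in mc.find_all(['div', 'span']):
    tag.unwrap()

print(mc)
```

The outer `main-content` container survives, since it is the tag the search starts from, while every nested `div` and `span` is replaced by its own content.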

In [19]:
mc.find('div')

<div class="col-12 col-md-9 pl-md-3 pr-md-0">
<div>
<div class="section" id="create-database-and-users-table">
<h1>Create Database and Users Table</h1>
<p>Let us create a simple table by name users for now.</p>
<ul class="simple">
<li><p>We can run database commands using <strong>%%sql with in Jupyter Notebook</strong> or <strong>psql</strong> or <strong>SQL Alchemy</strong> to create the tables in the database. You can use the tool as per your preference.</p></li>
<li><p>If you are using our labs, you will get a database and user which will be prefixed with your OS username and the password which is published via our portal.</p></li>
</ul>
<p>Here are the commands to create the database using <code class="docutils literal notranslate"><span class="pre">psql</span></code>, in case if you are planning to use your own environment. You can only run these commands if you have access to database as super user.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span>

In [20]:
for tag in mc.find_all('div'):
    tag.unwrap()

In [21]:
mc.find('div')

In [22]:
print(mc.prettify())

<div class="row" id="main-content">
 <h1>
  Create Database and Users Table
 </h1>
 <p>
  Let us create a simple table by name users for now.
 </p>
 <ul class="simple">
  <li>
   <p>
    We can run database commands using
    <strong>
     %%sql with in Jupyter Notebook
    </strong>
    or
    <strong>
     psql
    </strong>
    or
    <strong>
     SQL Alchemy
    </strong>
    to create the tables in the database. You can use the tool as per your preference.
   </p>
  </li>
  <li>
   <p>
    If you are using our labs, you will get a database and user which will be prefixed with your OS username and the password which is published via our portal.
   </p>
  </li>
 </ul>
 <p>
  Here are the commands to create the database using
  <code class="docutils literal notranslate">
   <span class="pre">
    psql
   </span>
  </code>
  , in case if you are planning to use your own environment. You can only run these commands if you have access to database as super user.
 </p>
 <pre><span></span

In [23]:
for tag in mc.find_all('span'):
    tag.unwrap()

In [24]:
print(mc.prettify())

<div class="row" id="main-content">
 <h1>
  Create Database and Users Table
 </h1>
 <p>
  Let us create a simple table by name users for now.
 </p>
 <ul class="simple">
  <li>
   <p>
    We can run database commands using
    <strong>
     %%sql with in Jupyter Notebook
    </strong>
    or
    <strong>
     psql
    </strong>
    or
    <strong>
     SQL Alchemy
    </strong>
    to create the tables in the database. You can use the tool as per your preference.
   </p>
  </li>
  <li>
   <p>
    If you are using our labs, you will get a database and user which will be prefixed with your OS username and the password which is published via our portal.
   </p>
  </li>
 </ul>
 <p>
  Here are the commands to create the database using
  <code class="docutils literal notranslate">
   psql
  </code>
  , in case if you are planning to use your own environment. You can only run these commands if you have access to database as super user.
 </p>
 <pre>psql -U postgres -h localhost -p 5433 -W
docke

* Here is another example. As most of our pages share a similar structure, we can develop a program that cleans up the HTML for us so that we can publish the content on a target site or save it into a database.
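
One possible shape for such a program is a single function that applies the same steps in order. This is only a sketch: the function name `clean_main_content`, its `container_id` parameter, and the sample HTML are assumptions made for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def clean_main_content(html, container_id='main-content'):
    """Return cleaned HTML for the container, or None if it is missing.

    Hypothetical helper: name, parameters, and defaults are illustrative.
    """
    soup = BeautifulSoup(html, 'html.parser')
    mc = soup.find('div', id=container_id)
    if mc is None:
        return None

    # Remove scripts and permalink anchors together with their content.
    for tag in mc.find_all('script'):
        tag.decompose()
    for tag in mc.find_all('a', class_='headerlink'):
        tag.decompose()

    # Drop div and span wrappers but keep what they contain.
    for tag in mc.find_all(['div', 'span']):
        tag.unwrap()

    return str(mc)


# Made-up sample input standing in for a downloaded page.
sample = (
    '<html><body><div id="main-content"><div class="section">'
    '<h1>Joins<a class="headerlink" href="#joins">¶</a></h1>'
    '<pre><span>SELECT 1</span></pre>'
    '<script>kernelName = "python3"</script>'
    '</div></div></body></html>'
)
print(clean_main_content(sample))
```

For a real page, the `html` argument could be the `content` of a `requests.get(url)` response, as in the cells that follow.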

In [25]:
import requests
from bs4 import BeautifulSoup

url = 'https://postgresql.itversity.com/03_writing_basic_sql_queries/08_joining_tables_inner.html'
page = requests.get(url)

# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(page.content, 'html.parser')

In [26]:
mc = soup.find('div', id='main-content')

In [27]:
print(mc.prettify())

<div class="row" id="main-content">
 <div class="col-12 col-md-9 pl-md-3 pr-md-0">
  <div>
   <div class="section" id="joining-tables-inner">
    <h1>
     Joining Tables – Inner
     <a class="headerlink" href="#joining-tables-inner" title="Permalink to this headline">
      ¶
     </a>
    </h1>
    <p>
     Let us understand how to join data from multiple tables.
    </p>
    <ul class="simple">
     <li>
      <p>
       We will primarily focus on ANSI style join (
       <strong>
        JOIN with ON
       </strong>
       ).
      </p>
     </li>
     <li>
      <p>
       There are different types of joins.
      </p>
      <ul>
       <li>
        <p>
         INNER JOIN - Get all the records from both the datasets which satisfies JOIN condition.
        </p>
       </li>
       <li>
        <p>
         OUTER JOIN - We will get into the details as part of the next topic
        </p>
       </li>
      </ul>
     </li>
     <li>
      <p>
       Example for INNER JOIN
      </

In [28]:
mc = soup.find('div', id='main-content')

In [29]:
for tag in mc.find_all('script'):
    tag.decompose()

In [30]:
headerlink = mc.find('a', class_='headerlink')

In [31]:
headerlink.decompose()

In [32]:
for tag in mc.find_all('div'):
    tag.unwrap()

In [33]:
for tag in mc.find_all('span'):
    tag.unwrap()

In [34]:
print(mc.prettify())

<div class="row" id="main-content">
 <h1>
  Joining Tables – Inner
 </h1>
 <p>
  Let us understand how to join data from multiple tables.
 </p>
 <ul class="simple">
  <li>
   <p>
    We will primarily focus on ANSI style join (
    <strong>
     JOIN with ON
    </strong>
    ).
   </p>
  </li>
  <li>
   <p>
    There are different types of joins.
   </p>
   <ul>
    <li>
     <p>
      INNER JOIN - Get all the records from both the datasets which satisfies JOIN condition.
     </p>
    </li>
    <li>
     <p>
      OUTER JOIN - We will get into the details as part of the next topic
     </p>
    </li>
   </ul>
  </li>
  <li>
   <p>
    Example for INNER JOIN
   </p>
  </li>
 </ul>
 <pre>SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10
</pre>
 <ul class="simple">
  <li>
   <p>
    We can join more than 2 tables in one query. Here is how it will look like.
   </p>
  <