## Processing Website Content

We can parse the website content and extract HTML tags as well as data using BeautifulSoup.
* We pass the page content along with the parser name `html.parser` to build the `BeautifulSoup` object.
* Let us prettify and print the content.

In [2]:
import requests
from bs4 import BeautifulSoup

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'

# Download the page; .content holds the raw response bytes
python_page = requests.get(python_url)

# Parse the bytes with Python's built-in HTML parser
soup = BeautifulSoup(python_page.content, 'html.parser')

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Mastering Python — Mastering Python
  </title>
  <link href="_static/css/index.73d71520a4ca3b99cfee5594769eaaae.css" rel="stylesheet"/>
  <link href="_static/vendor/fontawesome/5.13.0/css/all.min.css" rel="stylesheet"/>
  <link as="font" crossorigin="" href="_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="" href="_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2" rel="preload" type="font/woff2"/>
  <link href="_static/vendor/open-sans_all/1.44.1/index.css" rel="stylesheet"/>
  <link href="_static/vendor/lato_latin-ext/1.44.1/index.css" rel="stylesheet"/>
  <link href="_static/pygments.css" rel="stylesheet" type="text/css">
   <link href="_static/sphinx-book-theme.2d2078699c18a0efb88233928e1cf6ed.css" rel="stylesheet" type="text/css">
    <link href="_st
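
The same parse-and-prettify steps can be tried without network access. The sketch below is only an illustration: `sample_html` and `sample_soup` are hypothetical names, and the inline bytes stand in for `python_page.content`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for python_page.content, so the sketch runs offline.
sample_html = b"<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"

# Build the BeautifulSoup object with Python's built-in html.parser.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with each tag and each text node on its own line.
print(sample_soup.prettify())

# Individual tags are reachable as attributes of the soup object.
print(sample_soup.title.string)
```

Working on a small inline document like this makes it easier to see how the indentation in the prettified output follows the nesting of the tags.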

* Let us extract all the `a` tags to get the links provided as part of this web page.
* Here is the code snippet to get the `a` tags from the landing page.

In [4]:
for a in soup.find_all('a'):
    print(a)

<a class="navbar-brand text-wrap" href="index.html">
<h1 class="site-logo" id="site-title">Mastering Python</h1>
</a>
<a class="reference internal" href="#">
   Mastering Python
  </a>
<a class="reference internal" href="01_overview_of_windows_os/01_overview_of_windows_os.html">
   Overview of Windows Operating System
  </a>
<a class="reference internal" href="02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html">
   Setup Ubuntu VM on GCP
  </a>
<a class="reference internal" href="03_setup_postgres_database/01_setup_postgres_database.html">
   Setup Postgres Database
  </a>
<a class="reference internal" href="04_postgres_database_operations/01_postgres_database_operations.html">
   Perform Database Operations
  </a>
<a class="reference internal" href="05_getting_started_with_python/01_getting_started_with_python.html">
   Getting Started with Python
  </a>
<a class="reference internal" href="06_basic_programming_constructs/01_basic_programming_constructs.html">
   Basic Programmin
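
As a rough sketch of how `find_all` relates to `find`, the example below runs on a small inline document. The HTML and the names `sample_html`, `sample_soup`, `links`, `first_link`, and `preview` are made up for illustration; only the class names are borrowed from the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML modeled on the navigation links in the output above.
sample_html = """
<nav>
  <a class="navbar-brand text-wrap" href="index.html"><h1>Mastering Python</h1></a>
  <a class="reference internal" href="#">Mastering Python</a>
  <a class="reference internal" href="page_one.html">Page One</a>
</nav>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list-like ResultSet of every matching tag, in document order.
links = sample_soup.find_all('a')
print(len(links))

# find returns only the first match (or None when nothing matches).
first_link = sample_soup.find('a')
print(first_link['href'])

# limit caps the number of results, which helps when previewing a large page.
preview = sample_soup.find_all('a', limit=2)
print([a['href'] for a in preview])
```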

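* We can use `a.string` to get only the text of a tag. It returns `None` when the tag has more than one child, as with the first link, which wraps an `h1` between newlines. In that case `a.get_text()` returns all the nested text.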

In [5]:
for a in soup.find_all('a'):
    print(a.string)

None

   Mastering Python
  

   Overview of Windows Operating System
  

   Setup Ubuntu VM on GCP
  

   Setup Postgres Database
  

   Perform Database Operations
  

   Getting Started with Python
  

   Basic Programming Constructs
  

   Pre-defined Functions
  

   User Defined Functions
  

   Overview of Collections - list and set
  

   Overview of Collections - dict and tuple
  

   Manipulating Collections using Loops
  

   Development of Map Reduce APIs
  

   Understanding Python Map Reduce Libraries
  

   Overview of Object Oriented Programming
  

   Overview of Pandas Libraries
  

   Web Scraping using Beautiful Soup
  

   Database Programming – CRUD Operations
  

   Database Programming – Batch Operations
  
Newsletter
.ipynb
None
None
None
None
None

   About Python
  

   Course Details
  

   Desired Audience
  

   Prerequisites
  

   Key Objectives
  

   Training Approach
  

   Self Evaluation
  
¶
¶
¶
¶
¶
¶
¶
¶
Overview of Windows Operating System


In [6]:
for a in soup.find_all('a'):
    print(a.get_text())


Mastering Python


   Mastering Python
  

   Overview of Windows Operating System
  

   Setup Ubuntu VM on GCP
  

   Setup Postgres Database
  

   Perform Database Operations
  

   Getting Started with Python
  

   Basic Programming Constructs
  

   Pre-defined Functions
  

   User Defined Functions
  

   Overview of Collections - list and set
  

   Overview of Collections - dict and tuple
  

   Manipulating Collections using Loops
  

   Development of Map Reduce APIs
  

   Understanding Python Map Reduce Libraries
  

   Overview of Object Oriented Programming
  

   Overview of Pandas Libraries
  

   Web Scraping using Beautiful Soup
  

   Database Programming – CRUD Operations
  

   Database Programming – Batch Operations
  
Newsletter
.ipynb
repository
open issue
suggest edit

Binder

   About Python
  

   Course Details
  

   Desired Audience
  

   Prerequisites
  

   Key Objectives
  

   Training Approach
  

   Self Evaluation
  
¶
¶
¶
¶
¶
¶
¶
¶
Overview of
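
The difference between `.string` and `.get_text()` seen in the two outputs above can be reproduced on a small inline snippet. This is a minimal sketch: the HTML and the names `simple_link` and `mixed_link` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one link with a single text child, one with mixed children.
sample_html = (
    '<a href="a.html">\n   Setup Postgres Database\n  </a>'
    '<a href="b.html"><b>open</b> issue</a>'
)
simple_link, mixed_link = BeautifulSoup(sample_html, 'html.parser').find_all('a')

# One text child: .string returns it, surrounding whitespace included.
print(repr(simple_link.string))

# Several children (a <b> tag plus text): .string is ambiguous, so it is None.
print(mixed_link.string)

# get_text() joins all nested text; strip=True trims whitespace from each piece.
print(mixed_link.get_text())
print(simple_link.get_text(strip=True))
```

Passing `strip=True` is one way to get rid of the blank lines and indentation that appear around each title in the outputs above.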

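* We can also get the URLs used as part of these `a` tags. Accessing `a['href']` raises `KeyError` for an `a` tag without an `href` attribute, so we check with `a.get('href')` first.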

In [7]:
for a in soup.find_all('a'):
    print(a['href'])

index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html
03_setup_postgres_database/01_setup_postgres_database.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programmin

KeyError: 'href'

In [8]:
for a in soup.find_all('a'):
    if a.get('href'):
        print(a['href'])

index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html
03_setup_postgres_database/01_setup_postgres_database.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programmin
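
The `KeyError` and the `a.get('href')` guard can be illustrated on a small inline document. The sketch below uses hypothetical HTML and names, and also shows `href=True`, a BeautifulSoup filter that keeps only the tags carrying the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second anchor has no href attribute.
sample_html = '<a href="index.html">Home</a><a id="top">Top</a><a href="#">Self</a>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

anchor_without_href = sample_soup.find('a', id='top')

# Dictionary-style access raises KeyError when the attribute is missing.
try:
    anchor_without_href['href']
except KeyError as error:
    print('missing attribute:', error)

# .get() returns None (or a supplied default) instead of raising.
print(anchor_without_href.get('href'))
print(anchor_without_href.get('href', 'no-link'))

# href=True keeps only tags that have an href, whatever its value.
hrefs = [a['href'] for a in sample_soup.find_all('a', href=True)]
print(hrefs)
```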

* We can also pass attributes such as `class` and `id` to narrow down the search to tags with a specific class or id.

In [9]:
for a in soup.find_all('a'):
    if a.get('class'):
        print(a['class'])

['navbar-brand', 'text-wrap']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['dropdown-buttons']
['repository-button']
['issues-button']
['edit-button']
['full-screen-button']
['binder-button']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['

In [10]:
classes = set()
for a in soup.find_all('a'):
    if a.get('class'):
        classes.add(tuple(a.get('class')))

In [11]:
classes

{('binder-button',),
 ('dropdown-buttons',),
 ('edit-button',),
 ('full-screen-button',),
 ('headerlink',),
 ('issues-button',),
 ('navbar-brand', 'text-wrap'),
 ('reference', 'internal'),
 ('reference', 'internal', 'nav-link'),
 ('repository-button',),
 ('right-next',)}
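
As a variation on the set built above, `collections.Counter` can show how often each class combination occurs. This is only a sketch on hypothetical inline HTML; `class_counts` is an illustrative name.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical HTML with a few class combinations, including an anchor with no class.
sample_html = """
<a class="reference internal" href="#">One</a>
<a class="reference internal" href="two.html">Two</a>
<a class="headerlink" href="#top">Three</a>
<a href="plain.html">Four</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class is a multi-valued attribute, so BeautifulSoup returns a list of names.
# Lists are not hashable; converting each to a tuple lets us use it as a key.
class_counts = Counter(
    tuple(a['class']) for a in sample_soup.find_all('a') if a.get('class')
)
print(class_counts)
```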

In [12]:
soup.find('a', {'class': 'reference internal'})

<a class="reference internal" href="#">
   Mastering Python
  </a>

In [13]:
soup.find('a', {'class': 'internal reference'})

In [14]:
soup.find('a', class_='reference internal')

<a class="reference internal" href="#">
   Mastering Python
  </a>
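
The empty result for `'internal reference'` above follows from how BeautifulSoup matches a multi-class string: it compares against the exact attribute value, order included. The sketch below illustrates this on hypothetical inline HTML and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML reusing class names from the output above.
sample_html = """
<a class="reference internal" href="#">Home</a>
<a class="reference internal nav-link" href="#about">About</a>
<a class="headerlink" href="#top">Top</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A single class name matches any tag that carries it among its classes.
print(len(sample_soup.find_all('a', class_='internal')))

# A multi-class string must equal the attribute value exactly, order included.
print(len(sample_soup.find_all('a', class_='reference internal')))
print(sample_soup.find('a', class_='internal reference'))

# A CSS selector matches both classes regardless of order or extra classes.
print(len(sample_soup.select('a.internal.reference')))
```

When the order of class names on the page is not guaranteed, the selector form is usually the safer choice.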

* We can also access attribute values such as `href` of the matching `a` tags.

In [15]:
for a in soup.find_all('a', {'class': 'reference internal'}):
    if a.get('href'):
        print(a['href'])

#
01_overview_of_windows_os/01_overview_of_windows_os.html
02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html
03_setup_postgres_database/01_setup_postgres_database.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_o

In [15]:
for a in soup.find_all('a', class_='reference internal'):
    if a.get('href'):
        print(a['href'])

#
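01_overview_of_windows_os/01_overview_of_windows_os.html
02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html
03_setup_postgres_database/01_setup_postgres_database.html
04_postgres_database_operations/01_postgres_database_operations.html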
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping
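
The `href` values printed above are relative to the page they came from. One possible way to turn them into absolute URLs is `urllib.parse.urljoin` from the standard library, sketched below. `relative_hrefs` and `absolute_urls` are illustrative names, and the two hrefs are copied from the output above.

```python
from urllib.parse import urljoin

# Page URL from the cells above; relative hrefs resolve against it.
python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'

# A few of the relative hrefs printed above.
relative_hrefs = [
    'index.html',
    '01_overview_of_windows_os/01_overview_of_windows_os.html',
]

# urljoin replaces the last path segment of the page URL with the relative path.
absolute_urls = [urljoin(python_url, href) for href in relative_hrefs]
for url in absolute_urls:
    print(url)
```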

In [16]:
for a in soup.find_all('a'):
    if a.get('id'):
        print(a['id'])

next-link


* Here is an example to narrow down the search to the `a` tag with a specific `id`.

In [17]:
soup.find('a', {'id': 'next-link'})

<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>

In [18]:
soup.find('a', id='next-link')

<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
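
Because `find` returns `None` when no tag matches, it is worth checking the result before reading attributes from it. The sketch below shows this on hypothetical inline HTML, alongside the CSS selector form of the same `id` lookup. The `prev-link` id is made up to demonstrate the missing case.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML modeled on the next-page link shown above.
sample_html = '<a class="right-next" href="next.html" id="next-link" title="next page">Next</a>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# The id keyword and the CSS selector form locate the same tag.
by_keyword = sample_soup.find('a', id='next-link')
by_selector = sample_soup.select_one('a#next-link')
print(by_keyword == by_selector)

# find returns None when nothing matches, so check before reading attributes.
missing = sample_soup.find('a', id='prev-link')
next_href = by_keyword['href'] if by_keyword is not None else None
prev_href = missing['href'] if missing is not None else None
print(next_href, prev_href)
```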