# Scraping with BeautifulSoup

## Installation

First, ensure you have `BeautifulSoup` installed. You can install it along with the `requests` library, which makes the HTTP requests that fetch web pages.
> For a quick start with the `requests` library, see this [guide](https://requests.readthedocs.io/en/latest/user/quickstart/).

In [None]:
%pip install beautifulsoup4 requests

In [1]:
# importing the libraries 
from bs4 import BeautifulSoup
import requests

## Basic Setup

To start scraping, you typically:

- Make an `HTTP request` to fetch the webpage’s HTML content.
- Parse the HTML using `BeautifulSoup` to extract the required information.

Here’s a basic setup to scrape a web page:

In [5]:
url = "https://books.toscrape.com/index.html"

try:
    response = requests.get(url=url, timeout=10)  # get the webpage; give up after 10 seconds
    response.raise_for_status()  # raise an HTTPError if the HTTP request returned an unsuccessful status code (4xx or 5xx)
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML if the request was successful
    print(f'Request Status Code: {response.status_code}')
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")  # outputs the HTTP errors
except requests.exceptions.RequestException as err:
    print(f"Error occurred: {err}")  # for other request-related errors

Request Status Code: 200


### Explanation of Parsing in BeautifulSoup

**Parsing** is the process of converting raw HTML or XML markup into a structured tree of objects. `BeautifulSoup` parses HTML documents and provides methods to navigate and search the resulting tree, so you can work with elements such as tags and attributes programmatically instead of with plain strings.
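
As a rough illustration of what "structured" means here, the sketch below parses a small inline snippet and walks the resulting tree. The HTML string and the name `soup_demo` are made up for this example, so the `soup` object from the cell above is left untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="/about">About</a>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Nested elements can be reached by tag name, one level at a time
page_title = soup_demo.title.string
bold_text = soup_demo.body.p.b.string

# Attributes are read like dictionary entries
link_target = soup_demo.a["href"]
intro_classes = soup_demo.p["class"]  # class is multi-valued, so this is a list

print(page_title)     # Sample Page
print(bold_text)      # world
print(link_target)    # /about
print(intro_classes)  # ['intro']
```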

### Comparison Between HTML Parsers: `lxml`, `html.parser`, and `html5lib`

BeautifulSoup supports multiple parsers, each with its own strengths:

| Parser | Speed | Compatibility | Description & Best Use Cases |
|--------|-------|---------------|------------------------------|
| **lxml** | Fastest | Very Good | Uses the third-party `lxml` library (`pip install lxml`). Very fast, especially for large documents, and lenient with malformed markup. It can also parse XML when you pass `"xml"` as the parser. Note that BeautifulSoup itself does not expose XPath; XPath queries require using `lxml` directly. |
| **html.parser** | Moderate (Python-native) | Good | Built into Python's standard library, so it needs no external packages. Lightweight and reliable for basic parsing; a sensible default for small-to-moderate HTML documents. |
| **html5lib** | Slowest | Excellent (HTML5 compliant) | Third-party package (`pip install html5lib`). Parses pages the same way a web browser does and repairs broken markup into valid HTML5. Best for complex or malformed HTML documents when speed is not a concern. |

Each parser trades off speed, dependencies, and tolerance for broken markup. Generally:
- **Use `lxml`** for speed, or when you also need to parse XML.
- **Use `html.parser`** for lightweight tasks on well-formed HTML with no extra dependencies.
- **Use `html5lib`** when parsing complex or poorly structured HTML for maximum compatibility.
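
One way to see the difference is to feed the same broken markup to each parser and compare the repaired output. The sketch below is only illustrative: the `broken_html` string is made up, and since `lxml` and `html5lib` are optional installs, any parser that is missing is skipped rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed HTML (hypothetical): <li> and <p> tags are never closed
broken_html = "<ul><li>One<li>Two</ul><p>Unclosed paragraph"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        results[parser] = None

for parser, repaired in results.items():
    shown = repaired if repaired is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

The repaired markup usually differs between parsers (for example, in whether `<html>` and `<body>` wrappers are added), which is why it helps to name the parser explicitly instead of relying on whichever one happens to be installed.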

In [None]:
# let us have a look at the soup object 
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

## Using CSS Selectors to Scrape the Webpage

In [11]:
# Notice the title element - this carries the title of the webpage 
# Let us scrape the title of the webpage 

title = soup.find('title') # gets the title element 
print(title) # prints the entire title element 

# say now we want just the text of the title or the content of it 
print(title.string.strip())

<title>
    All products | Books to Scrape - Sandbox
</title>
All products | Books to Scrape - Sandbox
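
A note on `.string`: it returns the text only when a tag has a single child, and `None` otherwise, whereas `get_text()` collects the text of all descendants. The following minimal sketch uses a made-up two-paragraph snippet to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with plain text, one with a nested <b> tag
snippet = BeautifulSoup(
    "<p id='simple'>  Plain text  </p><p id='mixed'>Price: <b>51.77</b></p>",
    "html.parser",
)

simple = snippet.find(id="simple")
mixed = snippet.find(id="mixed")

simple_text = simple.string.strip()           # single text child, so .string works
mixed_string = mixed.string                   # two children, so .string is None
mixed_text = mixed.get_text(" ", strip=True)  # joins all text pieces with a space

print(simple_text)   # Plain text
print(mixed_string)  # None
print(mixed_text)    # Price: 51.77
```

This is why code that filters on `a.string` typically guards against `None` first.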


In [19]:
# say I want to get all the a elements
anchor = soup.find_all('a')  # find_all is the current name; findAll is a legacy alias

# anchor is a list-like ResultSet of all the anchor elements
for a in anchor:
    print(f'{a}\n')

<a href="index.html">Books to Scrape</a>

<a href="index.html">Home</a>

<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>

<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>

<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>

<a href="catalogue/category/books/historical-fiction_4/index.html">
                            
                                Historical Fiction
                            
                        </a>

<a href="catalogue/category/books/sequential-art_5/index.html">
                            
                                Sequential Art
                            
        

In [20]:
# selection based on an id - id="promotions_left"
id_select = soup.find_all(id="promotions_left")
id_select

[<div id="promotions_left">
 </div>]

In [21]:
# selection based on a class - the element has two classes namely class="alert alert-warning"
class_select = soup.find_all(class_='alert')
class_select
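
To make the class matching concrete, here is a small self-contained sketch. The markup is invented for illustration and only mirrors the id and class names used above. It also shows the CSS-selector equivalents via `select()` and `select_one()`, where `#` selects by id and `.` selects by class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the id and class names used above
html = """
<div id="promotions_left"></div>
<div class="alert alert-warning">Warning!</div>
<div class="alert alert-info">Note</div>
"""
demo = BeautifulSoup(html, "html.parser")

# class_ matches when the value equals any one of the element's classes
by_class = demo.find_all(class_="alert")

# CSS selector equivalents
by_id_css = demo.select_one("#promotions_left")
both_classes_css = demo.select("div.alert.alert-warning")  # must carry both classes

print(len(by_class))                                 # 2
print(by_id_css)                                     # <div id="promotions_left"></div>
print([div.get_text() for div in both_classes_css])  # ['Warning!']
```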



In [None]:
# I want to see the a elements with href = 'index.html'
a_tags = soup.find_all('a', href='index.html')
print(a_tags)  # presents the a tags in a list

for a in a_tags:
    print(a.string)  # access the text of each a tag

[<a href="index.html">Books to Scrape</a>, <a href="index.html">Home</a>]
Books to Scrape
Home


In [27]:
# Suppose I want all the a tags with href="catalogue/category/books/mystery_3/index.html" or href="catalogue/category/books/travel_2/index.html"
a_tags = soup.find_all('a', href=lambda href: href in ["catalogue/category/books/mystery_3/index.html", "catalogue/category/books/travel_2/index.html"])
a_tags

[<a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>]

In [28]:
# say I want all the a tags which have the word 'category' in the href attribute
import re
# Find all <a> tags with 'category' in the href attribute
a_tags = soup.find_all("a", href=re.compile(r"category"))
a_tags

[<a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
                             
                         </a>,
 <a href="catalogue/catego

In [32]:
# say now I want all the a tags which contain the word 'Fiction' inside the text
a_tags = soup.find_all('a')
filtered_a_tags = [a for a in a_tags if a.string and re.search(r'Fiction', a.string)]
for item in filtered_a_tags:
    print(item.prettify())

<a href="catalogue/category/books/historical-fiction_4/index.html">
 Historical Fiction
</a>

<a href="catalogue/category/books/womens-fiction_9/index.html">
 Womens Fiction
</a>

<a href="catalogue/category/books/fiction_10/index.html">
 Fiction
</a>

<a href="catalogue/category/books/science-fiction_16/index.html">
 Science Fiction
</a>

<a href="catalogue/category/books/adult-fiction_29/index.html">
 Adult Fiction
</a>

<a href="catalogue/category/books/christian-fiction_34/index.html">
 Christian Fiction
</a>



In [None]:
# Filter <a> tags where the href contains "category" and the text contains 'Fiction'
# note: I have used the get() method to read the href attribute, with "" as the default when it is missing
filtered_a_tags = [
    a for a in a_tags
    if re.search(r"category", a.get("href", "")) and (a.text and re.search(r"Fiction", a.text))
]

# Print matching <a> tags
for a_tag in filtered_a_tags:
    print(a_tag.prettify())

<a href="catalogue/category/books/historical-fiction_4/index.html">
 Historical Fiction
</a>

<a href="catalogue/category/books/womens-fiction_9/index.html">
 Womens Fiction
</a>

<a href="catalogue/category/books/fiction_10/index.html">
 Fiction
</a>

<a href="catalogue/category/books/science-fiction_16/index.html">
 Science Fiction
</a>

<a href="catalogue/category/books/adult-fiction_29/index.html">
 Adult Fiction
</a>

<a href="catalogue/category/books/christian-fiction_34/index.html">
 Christian Fiction
</a>
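
The same combined condition can also be expressed as a function passed directly to `find_all`, which calls it once per tag and keeps the tags for which it returns `True`. This is only a sketch of that alternative: the helper name `is_fiction_category` and the inline HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the category links above
html = """
<a href="index.html">Home</a>
<a href="catalogue/category/books/fiction_10/index.html"> Fiction </a>
<a href="catalogue/category/books/travel_2/index.html"> Travel </a>
<a>Fiction without a link</a>
"""
demo = BeautifulSoup(html, "html.parser")

def is_fiction_category(tag):
    """True for <a> tags whose href contains 'category' and whose text contains 'Fiction'."""
    return (
        tag.name == "a"
        and "category" in tag.get("href", "")
        and re.search(r"Fiction", tag.get_text()) is not None
    )

matches = demo.find_all(is_fiction_category)
print([a["href"] for a in matches])  # ['catalogue/category/books/fiction_10/index.html']
```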



In [None]:
# note on attrs
a_tags = soup.find_all('a')
for a in a_tags:
    print(a.attrs, end=';')

{'href': 'index.html'};{'href': 'index.html'};{'href': 'catalogue/category/books_1/index.html'};{'href': 'catalogue/category/books/travel_2/index.html'};{'href': 'catalogue/category/books/mystery_3/index.html'};{'href': 'catalogue/category/books/historical-fiction_4/index.html'};{'href': 'catalogue/category/books/sequential-art_5/index.html'};{'href': 'catalogue/category/books/classics_6/index.html'};{'href': 'catalogue/category/books/philosophy_7/index.html'};{'href': 'catalogue/category/books/romance_8/index.html'};{'href': 'catalogue/category/books/womens-fiction_9/index.html'};{'href': 'catalogue/category/books/fiction_10/index.html'};{'href': 'catalogue/category/books/childrens_11/index.html'};{'href': 'catalogue/category/books/religion_12/index.html'};{'href': 'catalogue/category/books/nonfiction_13/index.html'};{'href': 'catalogue/category/books/music_14/index.html'};{'href': 'catalogue/category/books/default_15/index.html'};{'href': 'catalogue/category/books/science-fiction_16/

We can use the `attrs` dictionary of each `a` tag for our next task!
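
As a minimal sketch of that idea, the snippet below keeps only the tags whose `attrs` dictionary contains both an `href` and a `title` key. The inline HTML is made up for illustration; the keyword filters `href=True, title=True` are a shorter way to express the same condition.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second link has both attributes
html = """
<a href="index.html">Home</a>
<a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>
<a name="top">Anchor without href</a>
"""
demo = BeautifulSoup(html, "html.parser")

# Keep only the tags whose attrs dictionary has both keys
with_href_and_title = [
    a for a in demo.find_all("a")
    if "href" in a.attrs and "title" in a.attrs
]

# Keyword filters set to True match any tag that has the attribute at all
same_via_keywords = demo.find_all("a", href=True, title=True)

print([a["title"] for a in with_href_and_title])  # ['Sharp Objects']
print(with_href_and_title == same_via_keywords)   # True
```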

In [45]:
# say, now I want to see all the a tags which have both an href and a title attribute
filtered_a_tags = soup.find_all('a', href=True, title=True)

for a_tag in filtered_a_tags:
    print(a_tag.prettify())

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
 A Light in the ...
</a>

<a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">
 Tipping the Velvet
</a>

<a href="catalogue/soumission_998/index.html" title="Soumission">
 Soumission
</a>

<a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">
 Sharp Objects
</a>

<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">
 Sapiens: A Brief History ...
</a>

<a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">
 The Requiem Red
</a>

<a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">
 The Dirty Little Secrets ...
</a>

<a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the