# Web Scraping using BeautifulSoup

## Objective
The objective of this notebook is to demonstrate structured web scraping using
BeautifulSoup by extracting quotes, authors, and tags from a paginated website.

## Import libraries

In [14]:
import requests                 # For making HTTP requests
from bs4 import BeautifulSoup   # For parsing HTML
import pandas as pd             # For tabular data handling

## Send request to website

In [15]:
# URL of the website to be scraped
url = "https://quotes.toscrape.com/"

# Send an HTTP GET request to the website
response = requests.get(url)

# Check the HTTP response status code
# 200 indicates a successful request
response.status_code

200

Status code 200 confirms the page was successfully fetched.
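Inspecting `status_code` by hand works in a notebook, but a script usually needs the check to happen automatically. The following is a minimal sketch, assuming a 10-second timeout is acceptable; `fetch_html` and its `session` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the page body as text, raising on network or HTTP errors."""
    # Fall back to the module-level requests API when no session is supplied.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://quotes.toscrape.com/")` would then either return the HTML or raise, so a failed request cannot be parsed by mistake.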

## Parse HTML

In [16]:
# Parse the HTML content of the response using BeautifulSoup
# 'html.parser' is Python's built-in HTML parser
soup = BeautifulSoup(response.text, "html.parser")

This converts the raw HTML string into a parse tree that can be searched by tag name, attribute, or CSS class.
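To illustrate what "searchable" means in practice, here is a small sketch that runs on an inline HTML string rather than the live site. The markup in `sample_html` is a hypothetical snippet that only mirrors the class names used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used in this notebook.
sample_html = """
<html><head><title>Quotes to Scrape</title></head>
<body>
  <div class="quote">
    <span class="text">An example quote.</span>
    <small class="author">Example Author</small>
    <a class="tag" href="/tag/example/">example</a>
  </div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns the first matching element.
page_title = sample_soup.title.string

# find() returns the first match for a tag name and CSS class.
first_author = sample_soup.find("small", class_="author").get_text()

# select() accepts a CSS selector and returns a list of matches.
tag_links = sample_soup.select("div.quote a.tag")
tag_hrefs = [link["href"] for link in tag_links]
```

`find`, `find_all`, and `select` are the three lookups used most often; the rest of the notebook relies on the first two.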

## Extract quote data from a single page

In [17]:
# Initialize an empty list to store extracted quote data
quotes_data = []

# Find all quote containers on the page
quotes = soup.find_all("div", class_="quote")

# Loop through each quote block and extract required fields
for quote in quotes:
    # Extract quote text
    text = quote.find("span", class_="text").text
    
    # Extract author name
    author = quote.find("small", class_="author").text
    
    # Extract all tags associated with the quote
    tags = [tag.text for tag in quote.find_all("a", class_="tag")]

    # Append extracted data as a dictionary
    quotes_data.append({
        "quote": text,
        "author": author,
        "tags": ", ".join(tags)
    })

* Identify the repeating HTML blocks (`div.quote`).
* Extract the required fields (quote text, author, tags) from each block.
* Store each record as a dictionary in a list.
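The extraction loop assumes every block contains all three fields; if `find` returns `None`, accessing `.text` raises `AttributeError`. The sketch below shows one way to tolerate missing elements. `text_or_none` is a hypothetical helper, and the incomplete markup is invented for the example.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None."""
    node = parent.find(name, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical block with the author element deliberately missing.
incomplete_html = '<div class="quote"><span class="text"> Some text </span></div>'
block = BeautifulSoup(incomplete_html, "html.parser").find("div", class_="quote")

record = {
    "quote": text_or_none(block, "span", "text"),
    "author": text_or_none(block, "small", "author"),
    # find_all() returns an empty list when nothing matches, so no guard is needed.
    "tags": [tag.get_text(strip=True) for tag in block.find_all("a", class_="tag")],
}
```

Here the missing author becomes `None` instead of stopping the whole scrape, which can then be handled when the DataFrame is cleaned.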

## Create DataFrame & save to CSV (single page)

In [18]:
# Convert extracted data into a pandas DataFrame
df = pd.DataFrame(quotes_data)

# Display the DataFrame
df

|  | quote | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | adulthood, success, value |
| 6 | “It is better to be hated for what you are tha... | André Gide | life, love |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | edison, failure, inspirational, paraphrased |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | misattributed-eleanor-roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | humor, obvious, simile |


In [19]:
# Save the DataFrame to a CSV file
df.to_csv("data/quotes.csv", index=False)
print("Data saved to CSV file.")

Data saved to CSV file.


The scraped data is now saved in CSV format, making it usable for analytics or ML pipelines.
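The `to_csv` call above writes into a `data/` folder. A minimal sketch, assuming that folder may not exist yet (for example in a fresh clone), is to create it first. `ensure_parent_dir` is a hypothetical helper written for this example and is not part of pandas.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the Path."""
    path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Usage with the DataFrame from this notebook (df is assumed to exist):
# df.to_csv(ensure_parent_dir("data/quotes.csv"), index=False)
```

Because the helper returns the path, it can be passed straight to `to_csv` without changing the rest of the cell.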

## Scrape all pages with a pagination function

In [25]:
def scrape_quotes_with_pagination():
    """
    Scrapes quotes, authors, and tags from all pages of https://quotes.toscrape.com
    using BeautifulSoup and manual pagination.
    """
    base_url = "https://quotes.toscrape.com"
    current_url = base_url
    all_quotes = []

    while current_url:
        # Fetch the current page and fail fast on HTTP errors
        # rather than parsing an error page
        response = requests.get(current_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Each quote lives in its own <div class="quote"> container
        quotes = soup.find_all("div", class_="quote")

        for quote in quotes:
            all_quotes.append({
                "quote": quote.find("span", class_="text").text,
                "author": quote.find("small", class_="author").text,
                "tags": ", ".join(tag.text for tag in quote.find_all("a", class_="tag"))
            })

        # Follow the "Next" link if present; its absence marks the last page
        next_button = soup.find("li", class_="next")
        current_url = base_url + next_button.find("a")["href"] if next_button else None

    return pd.DataFrame(all_quotes)
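The function builds the next URL by concatenating `base_url` and the link's `href`, which works when the `href` is root-relative (begins with `/`). As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a link against the page it was found on and also copes with other link forms. The page-relative and absolute hrefs below are hypothetical inputs used only to show the behavior.

```python
from urllib.parse import urljoin

current_page = "https://quotes.toscrape.com/page/1/"

# Root-relative href: the form that plain concatenation assumes.
root_relative = urljoin(current_page, "/page/2/")

# Page-relative href (hypothetical): resolved against the current page's path.
page_relative = urljoin(current_page, "../2/")

# Absolute href (hypothetical): returned unchanged.
absolute = urljoin(current_page, "https://quotes.toscrape.com/page/2/")
```

All three calls resolve to `https://quotes.toscrape.com/page/2/`, so swapping the concatenation for `urljoin(current_url, href)` would leave the current behavior intact while being more tolerant of markup changes.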

## Convert to DataFrame & save

In [26]:
# Scrape every page and collect the results in a DataFrame
df = scrape_quotes_with_pagination()
df.head()

|  | quote | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |


In [27]:
# Save the complete dataset to a CSV file
df.to_csv("data/quotes_bs4.csv", index=False)

# Print summary information
print(f"Total quotes scraped: {len(df)}")
print("CSV file with scraped quotes created successfully.")

Total quotes scraped: 100
CSV file with scraped quotes created successfully.


## BeautifulSoup Scraping Summary

In this notebook, a complete web scraping pipeline was implemented using
BeautifulSoup to extract structured data from a publicly available website.

### Objective
The objective was to demonstrate how static HTML pages can be parsed and
transformed into a structured dataset using Python, starting from single-page
exploration and progressing to a reusable scraping function.

### Website Used
**https://quotes.toscrape.com**

This website is publicly available, does not require authentication, and is
specifically intended for web scraping practice and learning.

### Data Extracted
- Quote text
- Author name
- Associated tags

### Scraping Approach
- Sent HTTP requests using the `requests` library.
- Parsed HTML content using `BeautifulSoup`.
- Performed initial single-page exploration to understand HTML structure.
- Identified and followed the "Next" page link to handle pagination.
- Encapsulated the pagination logic in a reusable function for full-site scraping.
- Stored extracted data in a Pandas DataFrame.
- Exported the final dataset to a CSV file for downstream usage.

### Key Learnings
- BeautifulSoup is well-suited for exploratory and small-to-medium scale scraping tasks.
- Pagination handling is essential for complete data extraction across multiple pages.
- Refactoring exploratory code into functions improves readability and reusability.
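
One way to take the last point further is to separate parsing from fetching, so the parsing step can be checked offline. The sketch below is an assumption about how that refactor might look, not the notebook's actual structure; `parse_quotes_page` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def parse_quotes_page(html):
    """Return (rows, next_href) for one page of quote markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="quote"):
        rows.append({
            "quote": block.find("span", class_="text").get_text(),
            "author": block.find("small", class_="author").get_text(),
            "tags": ", ".join(a.get_text() for a in block.find_all("a", class_="tag")),
        })
    # The "Next" list item is absent on the last page.
    next_item = soup.find("li", class_="next")
    next_href = next_item.find("a")["href"] if next_item is not None else None
    return rows, next_href


# Hypothetical markup for an offline check; not copied from the live site.
sample_page = """
<div class="quote">
  <span class="text">Sample quote</span>
  <small class="author">Sample Author</small>
  <a class="tag">one</a><a class="tag">two</a>
</div>
<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
"""
rows, next_href = parse_quotes_page(sample_page)
```

With this split, the pagination loop only has to fetch a page, call the parser, and follow `next_href` until it is `None`.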