# Introduction

- Web scraping is the automated collection of data from websites.
- It is widely used in data science, research, journalism, and industry.
- Scraping helps extract publicly available data that is not offered in a structured format such as a CSV file or an API.
- This notebook covers requesting web pages, parsing HTML, and extracting data.
- Ethical and responsible scraping practices are essential in real-world use.


# Ethics of Web Scraping

Although web scraping often involves publicly accessible data, it raises important ethical and legal considerations. Responsible scraping requires respecting both website owners and users.

Key ethical principles include:

- **Respect website policies**: Always review a website’s `robots.txt` file and terms of service to understand what is permitted.
- **Avoid excessive requests**: Sending too many requests in a short period can overload servers. Implement rate limiting and delays when scraping.
- **Do not scrape sensitive data**: Personal, private, or confidential information should never be collected without explicit permission.
- **Attribute data sources**: When using scraped data for research or publication, properly credit the original source.
- **Use data responsibly**: Scraped data should be used in ways that do not harm individuals, organizations, or communities.

Ethical web scraping balances technical capability with responsibility, ensuring that data collection practices are fair, transparent, and respectful.
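
The first two principles can be sketched in code. The example below is a minimal illustration, not a complete crawler: it uses the standard library's `urllib.robotparser` to check whether a URL may be fetched and to read a crawl delay. The robots.txt content, the `example.com` URLs, and the helper names (`is_allowed`, `polite_delay`, `fetch_politely`) are assumptions made up for this example so that it runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "My Web Scraper 1.0 - for educational purposes"

# Hypothetical robots.txt content so the example runs offline.
# For a real site, use parser.set_url("<site>/robots.txt") and parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if the parsed robots.txt rules permit fetching this URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def fetch_politely(urls, fetch, sleep=time.sleep):
    """Fetch only permitted URLs, pausing after each request.

    `fetch` is any callable that takes a URL (hypothetical parameter),
    for example: lambda u: requests.get(u, headers={"User-Agent": USER_AGENT})
    """
    results = []
    for url in urls:
        if not is_allowed(url):
            continue  # Skip anything the site asks crawlers not to visit
        results.append(fetch(url))
        sleep(polite_delay())
    return results


print(is_allowed("http://example.com/catalogue/page-1.html"))
print(is_allowed("http://example.com/private/data.html"))
print(polite_delay())
```

Passing `fetch` and `sleep` as arguments keeps the pacing logic separate from the HTTP library, which also makes it easy to try out without sending real requests.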

In [1]:
pip install requests beautifulsoup4 lxml

Note: you may need to restart the kernel to use updated packages.


In [1]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
headers = {
    "User-Agent": "My Web Scraper 1.0 - for educational purposes"
}
# Make the request
response = requests.get(url, headers=headers)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

# 2. Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s
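
The `<meta>` tag in the output above declares `charset=utf-8`, but `response.text` is decoded with whatever encoding `requests` infers from the HTTP headers. When a `text/*` response carries no charset, `requests` falls back to ISO-8859-1, and non-ASCII characters such as the pound sign can come out garbled. The sketch below reproduces that effect with plain byte strings and no network access; the sample price string is only an illustration.

```python
# Bytes as a UTF-8 encoded page would send them over the wire
raw = "£51.77".encode("utf-8")

# Decoding with the wrong codec turns the two-byte "£" into two characters
wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

With `requests`, setting `response.encoding = "utf-8"` before reading `response.text`, or passing the raw bytes in `response.content` to `BeautifulSoup`, avoids the problem.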

In [3]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()
    # requests guessed the wrong encoding for this page, so "£" printed as "Â£".
    # The page's <meta> tag declares UTF-8, so set it before reading response.text.
    response.encoding = "utf-8"

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # 3. Extract the data
    # We want the title, link, and price of every book on the page.
    # By inspecting the website, we find that book titles are in <h3> tags
    # which are inside <article class="product_pod"> tags.
    book_titles = []
    book_prices = []
    # Find all 'article' tags with the class 'product_pod'
    for book in soup.find_all('article', class_='product_pod'):
        # Inside each article, find the 'h3' tag, then the 'a' tag,
        # and read its 'title' and 'href' attributes
        title = book.h3.a['title']
        href = book.h3.a['href']
        new_title = title + '  ' + href
        book_titles.append(new_title)

        # The price is the text of <p class="price_color">
        price = book.find('p', class_="price_color").text
        book_prices.append(price)

    print("--- Found Book Titles ---")
    # Print every title with its price, numbered from 1
    for i, (title, price) in enumerate(zip(book_titles, book_prices), start=1):
        print(f"{i}. {title}, Price:{price}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")


--- Found Book Titles ---
1. A Light in the Attic  catalogue/a-light-in-the-attic_1000/index.html, Price:£51.77
2. Tipping the Velvet  catalogue/tipping-the-velvet_999/index.html, Price:£53.74
3. Soumission  catalogue/soumission_998/index.html, Price:£50.10
4. Sharp Objects  catalogue/sharp-objects_997/index.html, Price:£47.82
5. Sapiens: A Brief History of Humankind  catalogue/sapiens-a-brief-history-of-humankind_996/index.html, Price:£54.23
6. The Requiem Red  catalogue/the-requiem-red_995/index.html, Price:£22.65
7. The Dirty Little Secrets of Getting Your Dream Job  catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html, Price:£33.34
8. The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull  catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html, Price:£17.93
9. The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics  catalogue
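
The scraped values above are raw strings: each `href` is relative to the page it came from, and each price carries a currency symbol. A minimal sketch of turning them into usable values follows, using only the standard library. The helper names `absolute_link` and `parse_price` are hypothetical, and the inputs are copied from the first row of the output.

```python
from decimal import Decimal
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"


def absolute_link(href, base=BASE_URL):
    """Resolve a relative href against the page it was scraped from."""
    return urljoin(base, href)


def parse_price(text):
    """Drop the currency symbol and return the amount as a Decimal."""
    # Keep only digits and the decimal point, e.g. "£51.77" -> "51.77"
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


link = absolute_link("catalogue/a-light-in-the-attic_1000/index.html")
price = parse_price("£51.77")
print(link)
print(price)
```

`Decimal` is used instead of `float` so that prices keep their exact two-decimal value when summed or compared.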

In [14]:
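import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # 3. Collect category names from the <ul> nested inside <ul class="nav">
    book_categories = []
    categories = soup.find('ul', class_='nav')
    new = categories.find('ul')
    for category in new.find_all('li'):
        name = category.a.text.strip()
        print(len(name))  # Debug: length of the stripped name
        print(name)
        book_categories.append(name)
        # Note: this loop sits inside the outer loop, so the whole list is
        # reprinted after every category, and zip() over a single list wraps
        # each name in a 1-tuple such as ('Travel',). The next cell fixes both.
        for i, name in enumerate(zip(book_categories), start=1):
            print(f"{i}. {name}")

except Exception as e:
    # Report what went wrong instead of hiding it behind a bare except
    print("Error:", e)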
    

6
Travel
1. ('Travel',)
7
Mystery
1. ('Travel',)
2. ('Mystery',)
18
Historical Fiction
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
14
Sequential Art
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
8
Classics
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
5. ('Classics',)
10
Philosophy
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
5. ('Classics',)
6. ('Philosophy',)
7
Romance
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
5. ('Classics',)
6. ('Philosophy',)
7. ('Romance',)
14
Womens Fiction
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
5. ('Classics',)
6. ('Philosophy',)
7. ('Romance',)
8. ('Womens Fiction',)
7
Fiction
1. ('Travel',)
2. ('Mystery',)
3. ('Historical Fiction',)
4. ('Sequential Art',)
5. ('Classics',)
6. ('Philosophy',)
7. ('Romance',)
8. ('Womens Fiction',)
9. ('Fiction',)
9
Childrens
1. ('

In [16]:
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"

try:
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    book_categories = []

    categories = soup.find('ul', class_='nav')
    sub_categories = categories.find('ul')

    for category in sub_categories.find_all('li'):
        name = category.a.text.strip()
        book_categories.append(name)

    # Print categories nicely
    print("Book Categories:\n")
    for i, name in enumerate(book_categories, start=1):
        print(f"{i}. {name} ")

except Exception as e:
    print("Error:", e)


Book Categories:

1. Travel 
2. Mystery 
3. Historical Fiction 
4. Sequential Art 
5. Classics 
6. Philosophy 
7. Romance 
8. Womens Fiction 
9. Fiction 
10. Childrens 
11. Religion 
12. Nonfiction 
13. Music 
14. Default 
15. Science Fiction 
16. Sports and Games 
17. Add a comment 
18. Fantasy 
19. New Adult 
20. Young Adult 
21. Science 
22. Poetry 
23. Paranormal 
24. Art 
25. Psychology 
26. Autobiography 
27. Parenting 
28. Adult Fiction 
29. Humor 
30. Horror 
31. History 
32. Food and Drink 
33. Christian Fiction 
34. Business 
35. Biography 
36. Thriller 
37. Contemporary 
38. Spirituality 
39. Academic 
40. Self Help 
41. Historical 
42. Christian 
43. Suspense 
44. Short Stories 
45. Novels 
46. Health 
47. Politics 
48. Cultural 
49. Erotica 
50. Crime 
