# ***Web Scraping Project***

For this project I am using http://books.toscrape.com/index.html, a website designed specifically for practicing web scraping.


#### Task: Grab the Category, Name, Rating, Price, and Image URL for all 1000 products, and store the data in a CSV file for presentation in Power BI.

**Import the libraries we need to scrape a website.**

In [None]:
import requests  # for making HTTP requests
from bs4 import BeautifulSoup  # for parsing HTML and extracting information from a web page
from urllib.parse import urljoin  # for building a full (absolute) URL from a base URL and a relative URL
import csv  # for working with CSV files

First, let's get the header text of the website.

In [None]:
url = 'https://books.toscrape.com/index.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # Specify 'html.parser' as the parser
Header = soup.header.text.strip()  # Find the header within the <header> tag
print(Header)

Books to Scrape We love being scraped!


Let's figure out the URL structure so we can go through all pages category by category.

In [None]:
#the URL to scrape
url = 'https://books.toscrape.com/index.html'

response = requests.get(url) # Make an HTTP request to get the HTML content of the page

soup = BeautifulSoup(response.content, 'html.parser') # Create a BeautifulSoup object to parse the HTML content

books = soup.select('li > ul > li > a') # Select 'a' elements within the nested list structure using CSS selector

# Initialize an empty list to store category URLs
category_links = []

# Loop through each 'a' element and extract the 'href' attribute
for book in books:
    link = book.get('href')  # Get the 'href' attribute from the 'a' tag
    category_url = urljoin(url, link)  # Create absolute URLs by joining base URL with relative URLs
    category_links.append(category_url)  # Append the absolute URL to the list

category_links

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'https://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
 'https://books.toscrape.com/catalogue
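To make the `urljoin` step easier to follow, here is a minimal sketch of how relative paths resolve against a base URL. The two base URLs come from this notebook; the relative paths (including `example.jpg`) are illustrative assumptions, not values copied from the site's HTML.

```python
from urllib.parse import urljoin

# Base URLs taken from the notebook; the relative paths are illustrative assumptions.
index_url = 'https://books.toscrape.com/index.html'
category_url = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'

# A path without a leading slash replaces the last segment of the base URL.
resolved_category = urljoin(index_url, 'catalogue/category/books/travel_2/index.html')

# Each '../' climbs one directory before the rest of the path is appended.
resolved_image = urljoin(category_url, '../../../../media/cache/27/a5/example.jpg')

print(resolved_category)
print(resolved_image)
```

Because the base URL is passed explicitly, the same call works for links found on the index page and for image paths found on category pages.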

Some categories have additional pages.
Let's check each category page for the presence of extra pages.

In [None]:
# Loop through each category link
for category_link in category_links:
  response = requests.get(category_link)
  soup = BeautifulSoup(response.content, 'html.parser')
  category = soup.find('h1').text.strip()  # Find the category title within an 'h1' tag and extract its text
  extra_pages = soup.find('form').text.strip()  # Find the form element holding the result count and extract its text
  print(category)
  print(extra_pages)

Travel
11 results.
Mystery
32 results - showing 1 to 20.
Historical Fiction
26 results - showing 1 to 20.
Sequential Art
75 results - showing 1 to 20.
Classics
19 results.
Philosophy
11 results.
Romance
35 results - showing 1 to 20.
Womens Fiction
17 results.
Fiction
65 results - showing 1 to 20.
Childrens
29 results - showing 1 to 20.
Religion
7 results.
Nonfiction
110 results - showing 1 to 20.
Music
13 results.
Default
152 results - showing 1 to 20.
Science Fiction
16 results.
Sports and Games
5 results.
Add a comment
67 results - showing 1 to 20.
Fantasy
48 results - showing 1 to 20.
New Adult
6 results.
Young Adult
54 results - showing 1 to 20.
Science
14 results.
Poetry
19 results.
Paranormal
1 result.
Art
8 results.
Psychology
7 results.
Autobiography
9 results.
Parenting
1 result.
Adult Fiction
1 result.
Humor
10 results.
Horror
17 results.
History
18 results.
Food and Drink
30 results - showing 1 to 20.
Christian Fiction
6 results.
Business
12 results.
Biography
5 results.
Thr
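The result summaries printed above already imply how many pages each category has. As a rough sketch, the count can be pulled out with a regular expression and divided by the page size. The helper name `pages_from_results` and the constant `BOOKS_PER_PAGE` are hypothetical, and the value 20 is an assumption inferred from the 'showing 1 to 20' text.

```python
import math
import re

BOOKS_PER_PAGE = 20  # assumption, inferred from the 'showing 1 to 20' text


def pages_from_results(text, per_page=BOOKS_PER_PAGE):
    """Return the number of pages implied by text such as '32 results - showing 1 to 20.'"""
    match = re.search(r'(\d+)\s+results?', text)
    if match is None:
        raise ValueError(f'no result count found in {text!r}')
    # Round up, and never report fewer than one page.
    return max(1, math.ceil(int(match.group(1)) / per_page))


for sample in ['11 results.', '32 results - showing 1 to 20.', '1 result.']:
    print(sample, '->', pages_from_results(sample))
```

This is only a cross-check for the pager-based approach used in this notebook; it breaks if the site changes its page size.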

We can see that categories with additional pages show the text 'showing 1 to 20'.
So we will check each category page for the text 'to' and, where it is present, extract the total number of pages from the pager.

In [5]:
# Loop through each category link
for category_link in category_links:
  response = requests.get(category_link)
  soup = BeautifulSoup(response.content, 'html.parser')
  pages = soup.find('form').text
  # Check whether the text 'to' is present in the results summary
  if 'to' in pages:
    nav_bar = soup.find('ul', class_='pager')
    all_pages = nav_bar.text.strip()  # starts with 'Page 1 of N'
    # The fourth whitespace-separated token is the total page count;
    # splitting also handles counts with more than one digit
    page_range = int(all_pages.split()[3])
    print(page_range)

2
2
4
2
4
2
6
8
4
3
3
2


We already have the URL of each category's index page. Now we need the URLs from page 2 onward.
The URL structure of an additional page is the following:

https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html

The index page URL structure is the following:

https://books.toscrape.com/catalogue/category/books/mystery_3/index.html

We can create the additional URLs by replacing the **'index.html'** part with **'page-{page_num}.html'**, where page_num ranges from 2 to the total number of pages.

In [9]:
additional_pages = []
for category_link in category_links:
  # Read this category's own pager so each category gets its own page count
  soup = BeautifulSoup(requests.get(category_link).content, 'html.parser')
  nav_bar = soup.find('ul', class_='pager')
  # Pager text starts with 'Page 1 of N'; without a pager there is only one page
  page_range = int(nav_bar.text.split()[3]) if nav_bar else 1
  # Iterate through page numbers from 2 to page_range
  for page_num in range(2, page_range + 1):
      # Generate a new page URL by replacing 'index.html' with 'page-{page_num}.html'
      page_url = category_link.replace('index.html', f'page-{page_num}.html')
      additional_pages.append(page_url)
additional_pages

['https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-3.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-4.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/page-3.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/page-4.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-3.html',
 'https://books.toscrape.c
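The replacement rule can also be wrapped in a small helper that takes the page count as an explicit argument, which makes the dependence on each category's own count visible. This is a sketch only; the name `build_page_urls` is hypothetical and the page counts passed in below are example values.

```python
def build_page_urls(index_url, total_pages):
    """Return the URLs for pages 2..total_pages of one category."""
    return [
        index_url.replace('index.html', f'page-{page_num}.html')
        for page_num in range(2, total_pages + 1)
    ]


mystery = 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html'
travel = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'

print(build_page_urls(mystery, 2))  # one extra page
print(build_page_urls(travel, 1))   # single-page category: empty list
```

A category with one page yields an empty list, so no URL is generated for a page that does not exist.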

The additional_pages list contains the URLs of the additional pages for each category.
Now let's add all the URLs from additional_pages to category_links.


In [12]:
# Extend the category_links list with the additional_pages list
category_links.extend(additional_pages)
# Display the extended category_links list
category_links

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'https://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
 'https://books.toscrape.com/catalogue
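One thing to keep in mind: `list.extend` changes `category_links` in place, so running the cell a second time appends the same URLs again. A possible way around this, sketched below with a hypothetical helper `merge_unique`, is to build a new list that keeps only the first occurrence of each URL.

```python
def merge_unique(*url_lists):
    """Concatenate URL lists, keeping the first occurrence of each URL in order."""
    merged = [url for urls in url_lists for url in urls]
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(merged))


first_pages = ['https://books.toscrape.com/catalogue/category/books/mystery_3/index.html']
extra_pages = ['https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html']

all_pages = merge_unique(first_pages, extra_pages)
# Merging the same extras again leaves the result unchanged.
all_pages_again = merge_unique(all_pages, extra_pages)
print(all_pages_again)
```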

Let's collect information about the books from the category URLs:
the category, title, rating, price, and the absolute URL of the book image.

In [17]:
data = []  # Initialize an empty list to store extracted data
url_list = category_links  # Assign the list of category links to the variable url_list

# Iterate through each URL in the category_links list
for url in url_list:
    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    # ('html.parser' ships with Python, so no extra parser package is needed)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Use the 'h1' tag to get the category name shown on the page
    title_tags = soup.find('h1')
    category = title_tags.text  # Extract the text content of the 'h1' tag

    # Find all book articles on the page
    books = soup.find_all('article', class_='product_pod')

    # Iterate through each book on the page
    for book in books:
        # Extract book information
        title = book.h3.a['title']  # Extract the 'title' attribute from the 'a' tag inside the 'h3' tag

        # Find the star rating of the book (search inside this article only)
        rating_tag = book.find('p', class_='star-rating')
        rating = rating_tag['class'][1] if rating_tag else 'N/A'

        # Find the price of the book
        price_tag = book.find('p', class_='price_color')
        price = price_tag.get_text() if price_tag else 'N/A'

        # Find the absolute URL of the book image
        img_tag = book.find('img')
        absolute_image_url = urljoin(url, img_tag.get('src')) if img_tag and 'src' in img_tag.attrs else 'N/A'

        # Append book data to the data list
        data.append([category, title, rating, price, absolute_image_url])


Let's print the extracted data, one book per line.

In [18]:
for cat_data in data:
    print(cat_data)

['Travel', "It's Only the Himalayas", 'Two', '£45.17', 'https://books.toscrape.com/media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg']
['Travel', 'Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond', 'Four', '£49.43', 'https://books.toscrape.com/media/cache/57/77/57770cac1628f4407636635f4b85e88c.jpg']
['Travel', 'See America: A Celebration of Our National Parks & Treasured Sites', 'Three', '£48.87', 'https://books.toscrape.com/media/cache/9a/7e/9a7e63f12829df4b43b31d110bf3dc2e.jpg']
['Travel', 'Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel', 'Two', '£36.94', 'https://books.toscrape.com/media/cache/d5/bf/d5bf0090470b0b8ea46d9c166f7895aa.jpg']
['Travel', 'Under the Tuscan Sun', 'Three', '£37.33', 'https://books.toscrape.com/media/cache/98/c2/98c2e95c5fd1a4e7cd5f2b63c52826cb.jpg']
['Travel', 'A Summer In Europe', 'Two', '£44.34', 'https://books.toscrape.com/media/cache/4e/15/4e15150388702ebca2c5a523ac270539.jpg']
['Travel', 'The Great Railway Bazaa
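For charts in Power BI it may help to store the rating and price as numbers rather than text. The sketch below shows one possible conversion; `RATING_WORDS` and `clean_row` are hypothetical names, and the mapping assumes the site only uses the words One to Five.

```python
# Assumed mapping from the rating word in the CSS class to a number.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def clean_row(row):
    """Convert [category, title, rating, price, image_url] to typed values."""
    category, title, rating, price, image_url = row
    rating_value = RATING_WORDS.get(rating)  # None for 'N/A' or unknown words
    try:
        price_value = float(price.lstrip('£'))
    except ValueError:
        price_value = None  # e.g. the 'N/A' placeholder
    return [category, title, rating_value, price_value, image_url]


sample = ['Travel', "It's Only the Himalayas", 'Two', '£45.17',
          'https://books.toscrape.com/media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg']
print(clean_row(sample))
```

Applying it to the whole dataset would be a one-liner such as `[clean_row(row) for row in data]`.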

Let's save the information to a CSV file.

In [19]:
# Define the columns and data
columns = ['category','BookTitle','rating','price','image_url']

# Specify the file name
file_name = 'All_product.csv'

# Write to the CSV file; UTF-8 keeps the '£' sign and curly apostrophes intact on every platform
with open(file_name, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)

    # Write the header (columns) to the CSV file
    writer.writerow(columns)

    # Write the data to the CSV file
    writer.writerows(data)

# Print a success message indicating that the CSV file has been created
print(f'The CSV file "{file_name}" has been created successfully.')

The CSV file "All_product.csv" has been created successfully.
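Before loading the file into Power BI, it can be worth reading it back to confirm the header and row count. The sketch below does this with `csv.DictReader` on a small temporary file; the file name `sample_products.csv` and the two sample rows (with 'N/A' standing in for the image URL) are illustrative assumptions.

```python
import csv
import os
import tempfile

columns = ['category', 'BookTitle', 'rating', 'price', 'image_url']
sample_rows = [
    ['Travel', "It's Only the Himalayas", 'Two', '£45.17', 'N/A'],
    ['Travel', 'Under the Tuscan Sun', 'Three', '£37.33', 'N/A'],
]

# Write to a temporary file so the sketch does not touch All_product.csv.
path = os.path.join(tempfile.mkdtemp(), 'sample_products.csv')
with open(path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(columns)
    writer.writerows(sample_rows)

# Read it back as dictionaries keyed by the header row.
with open(path, newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

print(len(rows), rows[0]['BookTitle'], rows[0]['price'])
```

For the real file, comparing `len(rows)` with the expected 1000 products is a quick way to spot missing pages.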
