<a href="https://colab.research.google.com/github/Zeba-Kauser/Web-Scraping_Project/blob/main/final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Web Scraping Project***

For this project, I am using http://books.toscrape.com/index.html, a website designed specifically for web scraping practice.


#### Task: Grab the category, name, rating, price, and image URL for all 1000 products, and store the data in a CSV file for presentation in Power BI.

**Import the libraries we need to scrape the website.**

In [None]:
import requests  # for making HTTP requests
from bs4 import BeautifulSoup  # for parsing HTML and extracting information from a web page
from urllib.parse import urljoin  # for building a full (absolute) URL from a base URL and a relative URL
import csv  # for working with CSV files

First, let's get the header of the website.

In [None]:
url = 'https://books.toscrape.com/index.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # Specify 'html.parser' as the parser
header = soup.header.text.strip()  # Extract the text inside the <header> tag
print(header)

Let's figure out the URL structure so we can go through all pages, category by category.

In [None]:
# The URL to scrape
url = 'https://books.toscrape.com/index.html'

response = requests.get(url)  # Make an HTTP request to get the HTML content of the page

soup = BeautifulSoup(response.content, 'html.parser')  # Create a BeautifulSoup object to parse the HTML content

# Select the category 'a' elements in the nested sidebar list using a CSS selector
category_anchors = soup.select('li > ul > li > a')

# Initialize an empty list to store category URLs
category_links = []

# Loop through each 'a' element and extract the 'href' attribute
for anchor in category_anchors:
    link = anchor.get('href')  # Get the 'href' attribute from the 'a' tag
    category_url = urljoin(url, link)  # Create an absolute URL by joining the base URL with the relative URL
    category_links.append(category_url)  # Append the absolute URL to the list

category_links
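
To make the role of `urljoin` concrete, here is a small offline sketch. The relative `href` below is an assumed example of the kind of link found in the sidebar, not output captured from the site.

```python
from urllib.parse import urljoin

base_url = 'https://books.toscrape.com/index.html'
# Assumed example of a relative category link (illustrative only)
relative_href = 'catalogue/category/books/travel_2/index.html'

# urljoin drops the last path segment of the base ('index.html')
# and resolves the relative link against what remains
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

Because the join is relative to the page the link came from, the same call should also work for links collected from pages deeper in the site.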

Some categories have additional pages.
Let's check each category page for the presence of extra pages.

In [None]:
# Loop through each category link
for category_link in category_links:
  response = requests.get(category_link)
  soup = BeautifulSoup(response.content, 'html.parser')
  category = soup.find('h1').text.strip()  # Find the category title within the 'h1' tag and extract its text
  extra_pages = soup.find('form').text.strip()  # Find the form element, whose text holds the results summary
  print(category)
  print(extra_pages)

Categories with additional pages show text such as 'showing 1 to 20'.
So we will check each category page for the text 'to' and extract the total number of pages.




Let's extract the total number of pages from the pager text.

In [None]:
import re

# Loop through each category link
for category_link in category_links:
  response = requests.get(category_link)
  soup = BeautifulSoup(response.content, 'html.parser')
  pages = soup.find('form').text
  # Check whether the text 'to' is present in the results summary
  if 'to' in pages:
    nav_bar = soup.find('ul', class_='pager')
    all_pages = nav_bar.text.strip()  # Pager text such as 'Page 1 of 4'
    # Take the number after 'of' so that multi-digit page counts also work
    page_range = int(re.search(r'of\s+(\d+)', all_pages).group(1))
    print(page_range)
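
The pager text has the form 'Page 1 of N'. The following is a minimal sketch of pulling N out with a regular expression, which also covers multi-digit counts and pages that have no pager at all. The helper name `total_pages_from_pager` and the sample strings are hypothetical.

```python
import re

def total_pages_from_pager(pager_text):
    """Return the total page count from text such as 'Page 1 of 4'.

    Hypothetical helper: falls back to 1 when no 'of N' pattern is found.
    """
    match = re.search(r'of\s+(\d+)', pager_text or '')
    return int(match.group(1)) if match else 1

print(total_pages_from_pager('Page 1 of 4'))
print(total_pages_from_pager('Page 1 of 12'))
print(total_pages_from_pager(''))
```

Returning 1 for missing text lets single-page categories be handled by the same code path as multi-page ones.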

We already have the URL of each category's index page. Now we need the URLs starting from page 2.
We can see that the URL of an additional page has the following structure:

https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html

The index page URL has the following format:

https://books.toscrape.com/catalogue/category/books/mystery_3/index.html

We can create the additional URLs by replacing the **'index.html'** part with **'page-{page_num}.html'**, where page_num ranges from 2 to the total number of pages.

In [None]:
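import re
additional_pages = []
for category_link in category_links:
  # Read this category's own page count; a category without a pager has a single page
  soup = BeautifulSoup(requests.get(category_link).content, 'html.parser')
  nav_bar = soup.find('ul', class_='pager')
  match = re.search(r'of\s+(\d+)', nav_bar.text) if nav_bar else None
  total_pages = int(match.group(1)) if match else 1
  # Iterate through page numbers from 2 to this category's total
  for page_num in range(2, total_pages + 1):
      page_url = category_link.replace('index.html', f'page-{page_num}.html')  # Replace 'index.html' with 'page-{page_num}.html'
      additional_pages.append(page_url)
additional_pages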
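
The URL rule can also be written as a small pure function and checked without any network access. This is a sketch only: `build_page_urls` is a hypothetical helper, and the page counts passed to it are assumed values chosen for illustration.

```python
def build_page_urls(index_url, total_pages):
    """Return the URLs of pages 2..total_pages for one category (hypothetical helper)."""
    return [
        index_url.replace('index.html', f'page-{page_num}.html')
        for page_num in range(2, total_pages + 1)
    ]

mystery_index = 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html'
print(build_page_urls(mystery_index, 2))  # assumed count of 2: one extra page
print(build_page_urls(mystery_index, 1))  # single-page category: empty list
```

Keeping the page count as a per-category argument avoids generating page URLs that do not exist for smaller categories.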

The additional_pages list contains the URLs of the additional pages for each category.
Now let's add all the URLs from additional_pages to category_links.


In [None]:
# Extend the category_links list with the additional_pages list
category_links.extend(additional_pages)
# Print the extended category_links list
category_links

Let's collect information about the books from the category URLs:
the category, title, rating, price, and the absolute URL of the book image.

In [None]:
data = []  # Initialize an empty list to store extracted data
url_list = category_links  # Assign the list of category links to the variable url_list

# Iterate through each URL in the category_links list
for url in url_list:
    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Use the 'h1' tag to get the category name shown on the page
    title_tag = soup.find('h1')
    category = title_tag.text.strip()  # Extract the text content of the 'h1' tag

    # Find all book articles on the page
    books = soup.find_all('article', class_='product_pod')

    # Iterate through each book on the page
    for book in books:
        # Extract book information
        title = book.h3.a['title']  # Extract the 'title' attribute from the 'a' tag inside the 'h3' tag

        # Find the star rating of the book (the second class name holds the rating word)
        rating_tag = book.find('p', class_='star-rating')
        rating = rating_tag['class'][1] if rating_tag and 'class' in rating_tag.attrs else 'N/A'

        # Find the price of the book
        price_tag = book.find('p', class_='price_color')
        price = price_tag.get_text() if price_tag else 'N/A'

        # Find the absolute URL of the book image
        img_tag = book.find('img')
        absolute_image_url = urljoin(url, img_tag.get('src')) if img_tag and 'src' in img_tag.attrs else 'N/A'

        # Append book data to the data list
        data.append([category, title, rating, price, absolute_image_url])
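
For reporting, the rating word and the price string are often easier to work with as numbers. The sketch below is one possible conversion and is not part of the scraping cell; the helper names `rating_to_number` and `price_to_float` are hypothetical, and it assumes ratings are the words 'One' to 'Five' and prices look like '£51.77'.

```python
import re

# Assumed mapping from the rating class word to a number
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def rating_to_number(rating):
    """Map a rating word to an int; unknown values such as 'N/A' become None."""
    return RATING_WORDS.get(rating)

def price_to_float(price):
    """Extract the numeric part of a price string; return None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', price)
    return float(match.group()) if match else None

print(rating_to_number('Three'), price_to_float('£51.77'))
print(rating_to_number('N/A'), price_to_float('N/A'))
```

Matching only the digits means the conversion does not depend on how the currency symbol was decoded.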


Let's print the extracted data for each book.

In [None]:
for cat_data in data:
    print(cat_data)

Let's save the information in a CSV file.

In [None]:
# Define the columns and data
columns = ['category','BookTitle','rating','price','image_url']

# Specify the file name
file_name = 'All_product.csv'

# Write to the CSV file; an explicit encoding keeps characters such as '£' portable across systems
with open(file_name, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)

    # Write the header (columns) to the CSV file
    writer.writerow(columns)

    # Write the data to the CSV file
    writer.writerows(data)

# Print a success message indicating that the CSV file has been created
print(f'The CSV file "{file_name}" has been created successfully.')
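
Before loading the file into Power BI, it can help to read the CSV back and confirm the header and row count. The sketch below does the round trip in memory so that it runs on its own; the two rows are placeholder values, not scraped results.

```python
import csv
import io

columns = ['category', 'BookTitle', 'rating', 'price', 'image_url']
# Placeholder rows for illustration only
sample_rows = [
    ['Poetry', 'Example Book A', 'Three', '£10.00', 'https://example.com/a.jpg'],
    ['Travel', 'Example Book B', 'Five', '£20.00', 'https://example.com/b.jpg'],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(sample_rows)

# Read the rows back as dictionaries keyed by the header
buffer.seek(0)
rows_read = list(csv.DictReader(buffer))
print(len(rows_read), rows_read[0]['BookTitle'])
```

With the real file, the opened `All_product.csv` would be passed to `csv.DictReader` in place of the buffer.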

We can now download the CSV file All_product.csv and load it into Power BI to create reports.