# Web Scraping the Oxford Bookstore

### Project Outline
- We will scrape https://oxfordbookstore.com/.
- From the site we will collect the list of new-release books.
- For each new release we record the book title, author name, old price, new price, and product link.
- After scraping the page, we will save the results as a CSV file.

## USE THE REQUESTS LIBRARY TO DOWNLOAD WEBPAGES

In [1]:
import requests

In [2]:
topic_url='https://oxfordbookstore.com/list/new-releases?page=1'

In [3]:
!pip install requests --quiet




In [4]:
response = requests.get(topic_url)

In [5]:
response.status_code

200
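
A status code of 200 means the request succeeded. As a minimal sketch, the download step could also be wrapped in a small helper that fails loudly on error codes. The name `fetch_page` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Download a page and return its HTML text.

    fetch_page is a hypothetical helper; the timeout value is an assumption.
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of continuing silently
    response.raise_for_status()
    return response.text
```

With this helper, the download cell would become `page_contents = fetch_page(topic_url)`.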

In [6]:
page_contents = response.text

In [7]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<!-- Required meta tags -->\n\t<meta charset="utf-8">\n\t\n\t<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n\n\t\n\t\n\t\n\t<link rel="shortcut icon" href="https://oxfordbookstore.com/public/uploads/1667569415_logo.jpg" type="image/x-icon">\n    <title>New Releases | Oxford Bookstore</title>\n\t<link href="https://oxfordbookstore.com/public/customResource/apeejay-frontend/font-awesome-4.7.0/css/font-awesome.css" rel="stylesheet"> \n\t\n\t<link href="https://oxfordbookstore.com/public/customResource/apeejay-frontend/css/bootstrap.css" rel="stylesheet">\n\t<link href="https://oxfordbookstore.com/public/customResource/apeejay-frontend/css/style.css" rel="stylesheet">\n\t<link href="https://oxfordbookstore.com/public/customResource/apeejay-frontend/css/responsive.css" rel="stylesheet">\n\t<link href="https://oxfordbookstore.com/public/customResource/apeejay-frontend/fonts/stylesheet.css" rel="stylesheet"

In [8]:
with open('oxfordbooks.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## USE BEAUTIFUL SOUP TO PARSE AND EXTRACT INFORMATION

In [9]:
!pip install beautifulsoup4 --upgrade --quiet




In [10]:
from bs4 import BeautifulSoup

In [11]:
soup = BeautifulSoup(page_contents,'html.parser')
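
To see how `find_all` with `class_` selects elements before running it on the full page, here is a small self-contained sketch. The HTML snippet is a made-up stand-in that only imitates the class names used in the cells that follow; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names used in this notebook
sample_html = """
<div class="product-describe">
  <h3>Sample Title One</h3>
  <h5>Book Author: First Author</h5>
</div>
<div class="product-describe">
  <h3>Sample Title Two</h3>
  <h5>Book Author: Second Author</h5>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every matching tag; .h3 is the first <h3> inside each one
sample_titles = [
    card.h3.text
    for card in sample_soup.find_all('div', class_='product-describe')
]
print(sample_titles)
```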

In [12]:
# Create a list to hold the book titles
book_title = []
for title in soup.find_all('div', class_='product-describe'):
    book_title.append(title.h3.text)

# Print the number of titles and the list of book titles
print(len(book_title))
print(book_title)

32
['The World after Apu Satyajit...', 'Lost In The Love Maze', 'Mehrunissa', 'Krishna', 'SAME AS EVER', 'COOL', 'Rouge', 'Becoming Gandhi', 'The Day I Became a Runner', 'Kismat Connection', 'Mind Management, Not Time Ma...', 'To Love & Be Loved', 'The Harry Potter Wizarding A...', 'Diary of a Wimpy Kid', 'The Happiness Story : Unlock...', 'Humans', 'How Prime Ministers Decide', "India's Experiment with Demo...", 'MODI: The Challenge of 2024', 'Another Sort of Freedom: A M...', 'Greek Gods, Monsters and Her...', 'Roman Stories', 'Fintech For Billions: Simple...', 'A Bird on My Windowsill', 'Traitors Gate', 'Common Yet Uncommon: 14 Memo...', "Sakina's Kiss", 'A House of Rain and Snow', 'Before We Say Goodbye', "World's Best Girlfriend", 'Idols: Unearthing the Power ...', '1947-1957, India: The Birth ...']


In [13]:
# Create a list to hold the author names
author_names = []

for name in soup.find_all('div', class_='product-describe'):
    author_text = name.h5.text.strip()
    
    # Extract only the author name (remove "Book Author:")
    fullauthor_name = author_text.replace("Book Author:", "").strip()
    
    author_names.append(fullauthor_name)

# Print the cleaned list of author names
print(len(author_names))
print(author_names)


32
['Swapan Mullick', 'MUKUL KUMAR', 'ANAND NAYAK', 'Kevin Missal', 'Morgan Housel', 'Swapan Seth', 'Mona Awad', 'Perry Garfinkel', 'Sohini Chattopadhyay', 'ANANYA DEVRANJAN', 'DAVID KADAVY', 'JIM TOWEY', 'JK ROWLING', 'Jeff Kinney', 'Savi Sharma', 'Serjio Almeciga', 'NEERJA CHOWDHURY', 'SY QURAISHI', 'MINHAZ MERCHANT', 'Gurcharan Das', 'Devdutt Pattanaik', 'Jhumpa Lahiri', 'Anas Ahmed\n                                                                                            Bhagwan Chowdhry', 'Manav Kaul', 'Jeffrey Archer', 'Sudha Murty', 'Vivek Shanbhag\n                                                                                            Srinath Perur', 'Srijato', 'Toshikazu Kawaguchi', 'Durjoy Datta', 'Amish Tripathi\n                                                                                            Bhavna Roy', 'Chandrachur Ghose']
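
Some entries in the output above list two authors separated by a newline and a long run of spaces. One possible cleanup, sketched here with a hypothetical helper `clean_author`, splits each entry on line breaks and joins the names with a comma. Whether a comma is the right separator is an assumption.

```python
def clean_author(raw_text):
    """Collapse a multi-line author string into 'Name One, Name Two'."""
    text = raw_text.replace("Book Author:", "")
    # Split on line breaks, trim each piece, and drop empty pieces
    names = [part.strip() for part in text.splitlines() if part.strip()]
    return ", ".join(names)


# A two-author entry shaped like the ones in the scraped list
raw = 'Anas Ahmed\n                                        Bhagwan Chowdhry'
print(clean_author(raw))
```

In the loop above, `author_names.append(clean_author(name.h5.text))` would apply the same cleanup to every entry.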


In [14]:
# Import re to extract the prices with a regular expression
import re

oldbook_prices = []
newbook_prices = []
for price in soup.find_all('div', class_='price'):
    price_text = price.text.strip()

    # The discounted (new) price comes first, followed by the original (old) price
    match = re.match(r'₹([\d,.]+)\n₹([\d,.]+)', price_text)

    if match:
        new_price = match.group(1).strip()
        old_price = match.group(2).strip()

        newbook_prices.append(new_price)
        oldbook_prices.append(old_price)
    else:
        # Report any price block that does not match the expected format
        print(f"Unable to extract old and new prices from: {price_text}")

# Print the list of old book prices
print(len(oldbook_prices))
print(oldbook_prices)

32
['499.00', '325.00', '599.00', '399.00', '450.00', '999.00', '799.00', '699.00', '599.00', '299.00', '499.00', '599.00', '1599.00', '599.00', '299.00', '999.00', '999.00', '699.00', '899.00', '699.00', '299.00', '499.00', '499.00', '499.00', '499.00', '399.00', '499.00', '399.00', '550.00', '199.00', '399.00', '799.00']


In [15]:
# Print the list of new book prices
print(len(newbook_prices))
print(newbook_prices)

32
['399', '260', '479', '319', '360', '799', '639', '559', '479', '239', '399', '479', '1279', '479', '239', '799', '799', '559', '719', '559', '239', '399', '399', '399', '399', '319', '399', '319', '440', '159', '319', '639']
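
The prices are stored as strings, and the two lists use different formats ('499.00' versus '399'). If numeric work is needed later, a conversion along these lines could help. `to_number` and `discount_percent` are hypothetical helpers, and the comma handling is a precaution, since the regular expression allows commas.

```python
def to_number(price_text):
    """Convert a price string such as '1,279' or '499.00' to a float."""
    return float(price_text.replace(",", ""))


def discount_percent(old_price, new_price):
    """Return the discount as a percentage of the old price, rounded to one decimal."""
    old_value = to_number(old_price)
    new_value = to_number(new_price)
    return round((old_value - new_value) / old_value * 100, 1)


# First book in the lists above: old price '499.00', new price '399'
print(discount_percent('499.00', '399'))
```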


In [16]:
# Find every 'a' tag with class 'book-img' and collect its href attribute
product_link = []
for url in soup.find_all('a', class_='book-img'):
    product_url = url.get('href')
    product_link.append(product_url)

# Print the number of links and the list of links
print(len(product_link))
print(product_link)

32
['https://oxfordbookstore.com/product-details/the-world-after-apu-satyajit-ray-in-retrospect-3574', 'https://oxfordbookstore.com/product-details/lost-in-the-love-maze-0008', 'https://oxfordbookstore.com/product-details/mehrunissa-3108', 'https://oxfordbookstore.com/product-details/krishna-3680', 'https://oxfordbookstore.com/product-details/same-as-ever-5026', 'https://oxfordbookstore.com/product-details/cool-0723', 'https://oxfordbookstore.com/product-details/rouge-5383', 'https://oxfordbookstore.com/product-details/becoming-gandhi-5054', 'https://oxfordbookstore.com/product-details/the-day-i-became-a-runner-7578', 'https://oxfordbookstore.com/product-details/kismat-connection-6721', 'https://oxfordbookstore.com/product-details/mind-management-not-time-management-4708', 'https://oxfordbookstore.com/product-details/to-love--be-loved-8787', 'https://oxfordbookstore.com/product-details/the-harry-potter-wizarding-almanac-7292', 'https://oxfordbookstore.com/product-details/diary-of-a-wim

In [17]:
import pandas as pd

In [18]:
# Collect the scraped lists in a dictionary, one key per column
new_releases_dict={
    'books_title':book_title,
    'author_name':author_names,
    'old_price':oldbook_prices,
    'new_price':newbook_prices,
    'product_link':product_link
}
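
pandas raises a `ValueError` when the lists in a dictionary like this differ in length, which can happen if one selector misses an element or a price fails to match. A quick check, sketched with a hypothetical helper `check_lengths`, makes such a mismatch easier to diagnose:

```python
def check_lengths(columns):
    """Return {name: length} and raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Small made-up example; in the notebook this would be check_lengths(new_releases_dict)
example = {'books_title': ['A', 'B'], 'author_name': ['X', 'Y']}
print(check_lengths(example))
```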

In [19]:
# Build a DataFrame from the dictionary and display it
new_releases_books=pd.DataFrame(new_releases_dict)
new_releases_books

| | books_title | author_name | old_price | new_price | product_link |
|---|---|---|---|---|---|
| 0 | The World after Apu Satyajit... | Swapan Mullick | 499.0 | 399 | https://oxfordbookstore.com/product-details/th... |
| 1 | Lost In The Love Maze | MUKUL KUMAR | 325.0 | 260 | https://oxfordbookstore.com/product-details/lo... |
| 2 | Mehrunissa | ANAND NAYAK | 599.0 | 479 | https://oxfordbookstore.com/product-details/me... |
| 3 | Krishna | Kevin Missal | 399.0 | 319 | https://oxfordbookstore.com/product-details/kr... |
| 4 | SAME AS EVER | Morgan Housel | 450.0 | 360 | https://oxfordbookstore.com/product-details/sa... |
| 5 | COOL | Swapan Seth | 999.0 | 799 | https://oxfordbookstore.com/product-details/co... |
| 6 | Rouge | Mona Awad | 799.0 | 639 | https://oxfordbookstore.com/product-details/ro... |
| 7 | Becoming Gandhi | Perry Garfinkel | 699.0 | 559 | https://oxfordbookstore.com/product-details/be... |
| 8 | The Day I Became a Runner | Sohini Chattopadhyay | 599.0 | 479 | https://oxfordbookstore.com/product-details/th... |
| 9 | Kismat Connection | ANANYA DEVRANJAN | 299.0 | 239 | https://oxfordbookstore.com/product-details/ki... |


In [65]:
new_releases_books.to_csv('new_releases_books.csv')
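
By default `to_csv` also writes the DataFrame index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` omits it. The sketch below uses a small made-up DataFrame and an in-memory buffer rather than the real file:

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped data
demo = pd.DataFrame({'books_title': ['A', 'B'], 'new_price': ['399', '260']})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # omit the index column
buffer.seek(0)

# Reading the data back shows only the real columns
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

Applied to this notebook, the call would be `new_releases_books.to_csv('new_releases_books.csv', index=False)`.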