# Scraping the 'Bestseller in Books' page of Amazon.in


# Web Scraping Project with Python

Amazon is one of the largest online marketplaces in the United States, if not the world. Users can buy items from Amazon or a third-party seller on Amazon, or sell their own items. Orders are then delivered using common worldwide courier services. The company also sells its own line of technology items.

The page https://www.amazon.in/gp/bestsellers/books/ lists the best-selling books on Amazon. We are going to scrape the details of these books: position, name, writer, price, rating and total reviews. We'll use the Python `requests` and `beautifulsoup4` libraries to scrape this data.

# **Outline of the project:**
![Imgur](https://imgur.com/fTLsTdJ.png)

1. Understanding the structure of the Amazon.in bestseller books webpage
2. Installing and importing the required libraries
3. Downloading the page and extracting the URLs, name, author and price of the different books from the website using 'BeautifulSoup'
4. Accessing each element and building a list of the above-mentioned information
5. Parsing the top 50 books on each page into 7 fields: Book Rank, Book Poster, Book Name, Book Writer, Rating, Reviews and Price
6. Storing the extracted data in a dictionary
7. Compiling all the data into a DataFrame using Pandas and saving the data to a CSV file


## What is Web Scraping ?
'Web scraping' is the process of automatically extracting data from the web. A scraper downloads the underlying HTML code of a page and, with it, the data that the site serves from its database.

This extracted data can be saved in a structured format, such as a CSV or JSON file.
![what_is_web_scraping.png](https://i.imgur.com/unjF9ob.png)
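
As a rough illustration of what "structured format" means here, the sketch below writes the same two records as CSV text and as JSON text using only the standard library. The records and field names are made-up placeholders, not scraped data.

```python
import csv
import io
import json

# Hypothetical records standing in for scraped data
records = [
    {"rank": 1, "title": "Example Book A", "price": 199.0},
    {"rank": 2, "title": "Example Book B", "price": 249.5},
]

def to_csv_text(rows):
    """Serialise a list of dicts to CSV text (header taken from the first row)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json_text(rows):
    """Serialise the same rows to a JSON array."""
    return json.dumps(rows, indent=2)

csv_text = to_csv_text(records)
json_text = to_json_text(records)
print(csv_text)
print(json_text)
```

CSV suits flat, tabular data like a bestseller list, while JSON is more convenient when records are nested.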

## What is web scraping used for?
Web scraping allows you to collect structured data. Structured data is just a way to say that the information is easy for computers to read or add to a database. Instead of relying on humans to read or process web pages, computers can rapidly use that data in lots of unexpected and useful ways.

## Objective: Scraping the first 4 pages of the Bestselling Books category of Amazon.in using BeautifulSoup as the HTML parser and storing the data in a CSV file

![imgur](https://imgur.com/x1cS6IK.jpg)



In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="MyWebScrapingProjectFinal")

[jovian] Updating notebook "ghost-smith9557/mywebscrapingprojectfinal" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ghost-smith9557/mywebscrapingprojectfinal


'https://jovian.ai/ghost-smith9557/mywebscrapingprojectfinal'

# Steps to be Followed

## 1. Download the webpage using 'requests'

In [4]:
import requests

In [5]:
topic_url = 'https://www.amazon.in/gp/bestsellers/books/?ie=UTF8&ref_=sv_ba_3'

In [6]:
response = requests.get(topic_url)

In [7]:
type(response)

requests.models.Response

### Here we check the status code of the response using response.status_code. If the HTTP status code is between 200 and 299, the request was successful.

In [8]:
response.status_code

200
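
The 2xx rule described above can be written as a small helper. This is only a sketch; the name `is_success` is a hypothetical helper, not part of `requests`.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    return 200 <= status_code <= 299

# Try the helper on a few common status codes
for code in (200, 204, 301, 404, 503):
    print(code, is_success(code))
```

Note that `requests` offers related shortcuts with slightly different semantics: `response.ok` is true for any status code below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.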

In [9]:
page_content = response.text

We will find the number of characters in the page and view the first 100 characters.

In [10]:
len(page_content)

327907

In [11]:
page_content[:100]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start '

### Now we will save this HTML content into an HTML file

In [12]:
with open('amazon_bestseller_books.html', 'w', encoding = 'utf-8') as f:
    f.write(page_content)

### The above logic is combined into a function that returns the content of the web page

In [13]:
def fetch_web_page(topic_url):
    response=requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Something went wrong while fetching the web page')
    return response.text
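
Sites such as Amazon may reject or throttle requests that do not look like they come from a browser, so a slightly more defensive variant can be useful. The sketch below sends browser-like headers, sets a timeout and accepts any 2xx status. The header values, the function name `fetch_web_page_with_headers` and the `get` hook (which allows trying the function without network access) are all assumptions made for illustration.

```python
# Illustrative headers; the exact User-Agent string is an assumption
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-IN,en;q=0.9",
}

def fetch_web_page_with_headers(topic_url, get=None, timeout=10):
    """Fetch a page and return its HTML text, raising on a non-2xx status.

    `get` is a hypothetical hook; by default it falls back to requests.get.
    """
    if get is None:
        import requests  # imported lazily so the hook works without it
        get = requests.get
    response = get(topic_url, headers=DEFAULT_HEADERS, timeout=timeout)
    if not 200 <= response.status_code <= 299:
        raise Exception(
            f'Failed to fetch {topic_url} (status {response.status_code})'
        )
    return response.text

# Offline demonstration with a stand-in for requests.get
class FakeResponse:
    status_code = 200
    text = "<html></html>"

def fake_get(url, headers=None, timeout=None):
    return FakeResponse()

print(fetch_web_page_with_headers("https://example.com", get=fake_get))
```

In the notebook itself the function would be called with just the URL, in which case it uses `requests.get`.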

### Now we will install the BeautifulSoup library to use its HTML parser

In [14]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [15]:
from bs4 import BeautifulSoup

In [16]:
# Read the file back with the same encoding it was written with
with open('amazon_bestseller_books.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

In [18]:
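# The saved file works equally well: BeautifulSoup(html_content, 'html.parser')
# Naming the parser explicitly avoids a warning and keeps results consistent
doc = BeautifulSoup(page_content, 'html.parser')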

In [19]:
type(doc)

bs4.BeautifulSoup

## 2. Parse the HTML source code using 'BeautifulSoup'

In [20]:
doc.title.text

'Amazon.in Bestsellers: The most popular items in Books'

In [21]:
def get_doc(page_content):
    # Parse the HTML text with Python's built-in parser
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

### In the step below we use the doc object's find_all method to search for the 'div' tags by their id

In [22]:
divtag_doc = doc.find_all('div' , id ='gridItemRoot')
divtag_doc[30]

<div class="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc" id="gridItemRoot"><div class="a-cardui _cDEzb_grid-cell_1uMOS expandableGrid p13n-grid-content" data-a-card-type="basic" id="p13n-asin-index-30"><div class="a-section zg-bdg-ctr"><div class="a-section zg-bdg-body zg-bdg-clr-body aok-float-left"><span class="zg-bdg-text">#31</span></div><div class="a-section zg-bdg-tri zg-bdg-clr-tri aok-float-left"></div></div><div class="zg-grid-general-faceout"><div class="p13n-sc-uncoverable-faceout" id="9354893899"><a class="a-link-normal" href="/Almanack-Naval-Ravikant-Wealth-Happiness/dp/9354893899/ref=zg_bs_books_sccl_1/000-0000000-0000000?psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _cDEzb_noop_3Xbw5"><img alt="The Almanack Of Naval Ravikant: A Guide to Wealth and Happiness" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-eu.ssl-images-amazon.com/images/I/61fp0RQR+9L._AC_UL300_SR300,200_.jpg":[30
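
HTML ids are normally unique, but this page repeats `id="gridItemRoot"` on every book card, which is why `find_all` returns one element per book. The sketch below shows the same lookup on a small, made-up snippet that only mimics the real markup.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the repeated id on the real page
sample_html = """
<div id="gridItemRoot"><span class="zg-bdg-text">#1</span></div>
<div id="gridItemRoot"><span class="zg-bdg-text">#2</span></div>
<div id="other"><span class="zg-bdg-text">#99</span></div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Only the divs whose id matches are returned
items = sample_doc.find_all('div', id='gridItemRoot')

# Pull the rank badge out of each matching div
ranks = [
    item.find('span', class_='zg-bdg-text').text.strip('#')
    for item in items
]
print(len(items), ranks)
```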

In [26]:
books_info_doc = divtag_doc[5]
books_info_doc

<div class="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc" id="gridItemRoot"><div class="a-cardui _cDEzb_grid-cell_1uMOS expandableGrid p13n-grid-content" data-a-card-type="basic" id="p13n-asin-index-5"><div class="a-section zg-bdg-ctr"><div class="a-section zg-bdg-body zg-bdg-clr-body aok-float-left"><span class="zg-bdg-text">#6</span></div><div class="a-section zg-bdg-tri zg-bdg-clr-tri aok-float-left"></div></div><div class="zg-grid-general-faceout"><div class="p13n-sc-uncoverable-faceout" id="1612681131"><a class="a-link-normal" href="/Rich-Dad-Poor-Middle-Anniversary/dp/1612681131/ref=zg_bs_books_sccl_6/000-0000000-0000000?psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _cDEzb_noop_3Xbw5"><img alt="Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That The Poor And Middle Class Do Not!: (25th Anniversary Edit" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-eu.ssl-images-amazon.co

In [27]:
position = books_info_doc.find('span', class_="zg-bdg-text").text.strip('#')
position

'6'

In [28]:
poster = books_info_doc.find('img')['src']
poster

'https://images-eu.ssl-images-amazon.com/images/I/81PuKheA8xL._AC_UL300_SR300,200_.jpg'

In [30]:
bookname_and_writer = books_info_doc.find_all('div', class_= "_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y")

## 3. Extract position, name, writer, price, rating and total reviews from the page

In [31]:
bookname_and_writer

[<div class="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y">Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That The Poor And Middle Class Do Not!: (25th Anniversary Edition)</div>,
 <div class="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y">Robert T. Kiyosaki</div>]

In [32]:
book_name = bookname_and_writer[0].text

In [33]:
book_name

'Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That The Poor And Middle Class Do Not!: (25th Anniversary Edition)'

In [34]:
book_writer = bookname_and_writer[1].text

In [35]:
book_writer

'Robert T. Kiyosaki'

### Putting the above logic into a function to give the writer and name of book

In [36]:
def book_name_and_writer(books_info_doc):
    bookname_and_writer = books_info_doc.find_all('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y")
    if len(bookname_and_writer) < 2:
        book_name_doc = books_info_doc.find_all('div', class_="_cDEzb_p13n-sc-css-line-clamp-2_EWgCb")
        book_name = '' if len(book_name_doc)<1 else book_name_doc[0].text
        writer_name_doc = books_info_doc.find_all('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y")
        writer = '' if len(writer_name_doc)<1 else writer_name_doc[0].text
    else:
        book_name = bookname_and_writer[0].text
        writer = bookname_and_writer[1].text
    return book_name, writer

In [37]:
book_name_and_writer(books_info_doc)

('Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That The Poor And Middle Class Do Not!: (25th Anniversary Edition)',
 'Robert T. Kiyosaki')

### Finding the ratings and total reviews of books

In [38]:
review_and_rating_doc = books_info_doc.find('div', class_="a-icon-row")

In [39]:
review_and_rating_doc

<div class="a-icon-row"><a class="a-link-normal" href="/product-reviews/1612681131/ref=zg_bs_books_cr_sccl_6/000-0000000-0000000" title="4.5 out of 5 stars"><i class="a-icon a-icon-star-small a-star-small-4-5 aok-align-top"><span class="a-icon-alt">4.5 out of 5 stars</span></i> <span class="a-size-small">14,986</span></a></div>

In [40]:
rating = review_and_rating_doc.find('i').text.split(' ')[0]

In [41]:
rating

'4.5'

In [42]:
reviews = review_and_rating_doc.find('span', class_="a-size-small").text

In [43]:
reviews

'14,986'

In [44]:
# Drop the thousands separator, e.g. '14,986' -> '14986'
all_reviews = reviews.replace(',', '')

In [45]:
all_reviews

'14986'

In [46]:
def reviews_and_ratings(books_info_doc):
    '''
    This function takes a book document as input and extracts the rating and review count of the book.
    '''
    reviews_and_ratings_doc = books_info_doc.find('div', class_="a-icon-row")
    if reviews_and_ratings_doc is None:
        # Some books have no rating row on the page
        rating = ''
        all_reviews = ''
    else:
        rating = reviews_and_ratings_doc.find('i').text.split(' ')[0]
        reviews = reviews_and_ratings_doc.find('span', class_="a-size-small").text
        # Drop the thousands separator, e.g. '14,986' -> '14986'
        all_reviews = reviews.replace(',', '')
    return rating, all_reviews

In [47]:
reviews_and_ratings(books_info_doc)

('4.5', '14986')
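
The function above returns strings. If numeric values are preferred before the data reaches pandas, a conversion step along the following lines could be added. This is a sketch only; `parse_rating` and `parse_review_count` are hypothetical helpers that are not part of the notebook.

```python
def parse_rating(rating_text):
    """'4.5 out of 5 stars' -> 4.5; empty or unexpected text -> None."""
    first_token = rating_text.split(' ')[0] if rating_text else ''
    try:
        return float(first_token)
    except ValueError:
        return None

def parse_review_count(reviews_text):
    """'14,986' -> 14986; empty or non-numeric text -> None."""
    digits = reviews_text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

print(parse_rating('4.5 out of 5 stars'), parse_review_count('14,986'))
print(parse_rating(''), parse_review_count(''))
```

Returning `None` for missing values keeps unrated books distinguishable from books with a rating of zero.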

### Finding the Price of the book

In [48]:
price_doc = books_info_doc.find('span', class_="p13n-sc-price")
price = '' if price_doc is None else price_doc.text.strip('₹')
price

'213.00'
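
One caveat: `strip('₹')` removes only the currency symbol, so a price with a thousands separator would still contain a comma. A possible clean-up step is sketched below; `parse_price` is a hypothetical helper and the `'₹1,299.00'` input is an invented example, not a value from the page.

```python
def parse_price(price_text):
    """'₹213.00' -> 213.0 and '₹1,299.00' -> 1299.0; empty text -> None."""
    # Remove the currency symbol and any thousands separators
    cleaned = price_text.replace('₹', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price('₹213.00'))
print(parse_price('₹1,299.00'))
print(parse_price(''))
```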

### Extracting book data from the webpage for one element to check that it works

In [65]:
def books_data(books_info_doc):
    '''
    This function takes a book item document as input,
    extracts all the information about the book and returns it as a dictionary.
    '''
    position = books_info_doc.find('span', class_="zg-bdg-text").text.strip('#')
    poster = books_info_doc.find('img')['src']
    name, writer = book_name_and_writer(books_info_doc)
    rating, review = reviews_and_ratings(books_info_doc)
    price_doc = books_info_doc.find('span', class_="p13n-sc-price")
    price = '' if price_doc is None else price_doc.text.strip('₹')
    return {
        'book_rank' : position,
        'book_poster' : poster,
        'book_name' : name,
        'book_writer' : writer,
        'book_rating' : rating,
        'book_review' : review,
        'book_price' : price
    }

In [66]:
books_data(books_info_doc)

{'book_rank': '6',
 'book_poster': 'https://images-eu.ssl-images-amazon.com/images/I/81PuKheA8xL._AC_UL300_SR300,200_.jpg',
 'book_name': 'Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That The Poor And Middle Class Do Not!: (25th Anniversary Edition)',
 'book_writer': 'Robert T. Kiyosaki',
 'book_rating': '4.5',
 'book_review': '14986',
 'book_price': '213.00'}

## 4. Store the extracted information into Python lists and dictionary

In [67]:
def find_all_book_data(books_info_doc):
    '''
    This function takes a list of book documents as input,
    extracts the information from every book and returns it as a dictionary of lists.
    '''
    book_dict = {
        'book_rank' : [],
        'book_poster' : [],
        'book_name' : [],
        'book_writer' : [],
        'book_rating' : [],
        'book_review' : [],
        'book_price' : []
    }
    
    for book in books_info_doc:
        book_data = books_data(book)
        book_dict['book_rank'].append(book_data['book_rank'])
        book_dict['book_poster'].append(book_data['book_poster'])
        book_dict['book_name'].append(book_data['book_name'])
        book_dict['book_writer'].append(book_data['book_writer'])
        book_dict['book_rating'].append(book_data['book_rating'])
        book_dict['book_review'].append(book_data['book_review'])
        book_dict['book_price'].append(book_data['book_price'])
        
    return book_dict
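
The repeated `append` calls can also be expressed as a loop over the column names. The sketch below shows that idea on made-up records in the same shape that `books_data` returns; `collect_columns` is a hypothetical helper, not part of the notebook.

```python
def collect_columns(records, columns):
    """Turn a list of per-book dicts into a dict of column lists."""
    table = {column: [] for column in columns}
    for record in records:
        for column in columns:
            # .get keeps the columns aligned even if a key is missing
            table[column].append(record.get(column, ''))
    return table

# Made-up records in the same shape that books_data returns
sample_records = [
    {'book_rank': '1', 'book_name': 'Example A', 'book_price': '199.00'},
    {'book_rank': '2', 'book_name': 'Example B'},
]
columns = ['book_rank', 'book_name', 'book_price']
print(collect_columns(sample_records, columns))
```

As an alternative, `pandas.DataFrame` also accepts a list of dictionaries directly, so the intermediate dictionary of lists is a design choice and not a requirement.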

In [68]:
find_all_book_data(divtag_doc)

{'book_rank': ['1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9',
  '10',
  '11',
  '12',
  '13',
  '14',
  '15',
  '16',
  '17',
  '18',
  '19',
  '20',
  '21',
  '22',
  '23',
  '24',
  '25',
  '26',
  '27',
  '28',
  '29',
  '30',
  '31',
  '32',
  '33',
  '34',
  '35',
  '36',
  '37',
  '38',
  '39',
  '40',
  '41',
  '42',
  '43',
  '44',
  '45',
  '46',
  '47',
  '48',
  '49',
  '50'],
 'book_poster': ['https://images-eu.ssl-images-amazon.com/images/I/71B4h-dSVzL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-amazon.com/images/I/71g2ednj0JL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-amazon.com/images/I/814L+vq01mL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-amazon.com/images/I/81BvqTpXlFL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-amazon.com/images/I/91bYsX41DVL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-amazon.com/images/I/81PuKheA8xL._AC_UL300_SR300,200_.jpg',
  'https://images-eu.ssl-images-am

## 5. Save the extracted information into a CSV file

In [75]:
def scrape_amazon_books(pages, path):
    '''
    Scrape the first `pages` pages of the bestseller list and
    save the book data to the CSV file at `path`.
    '''
    import pandas as pd
    book_documents = []
    for page in range(1, pages + 1):
        topic_url = 'https://www.amazon.in/gp/bestsellers/books/?ie=UTF8&ref_=sv_ba_3'
        if page > 1:
            # Later pages are selected with the `pg` query parameter
            topic_url = f'https://www.amazon.in/gp/bestsellers/books/?ie=UTF8&pg={page}'

        page_content = fetch_web_page(topic_url)
        doc = get_doc(page_content)
        divtag_doc = doc.find_all('div', id='gridItemRoot')
        book_documents += divtag_doc
    all_books = find_all_book_data(book_documents)
    bestseller_df = pd.DataFrame(all_books)
    bestseller_df.to_csv(path, index=False)
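
Building the page URL by hand makes it easy to produce a malformed query string. One way to avoid that is to let `urllib.parse.urlencode` assemble the parameters, as in the sketch below. It assumes that `ie` and `pg` are the only parameters that matter and drops the `ref_` parameter; `build_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.amazon.in/gp/bestsellers/books/'

def build_page_url(page):
    """Return the bestseller URL for a 1-based page number."""
    params = {'ie': 'UTF8'}
    if page > 1:
        # Assumption: later pages are selected with the `pg` parameter
        params['pg'] = page
    return f'{BASE_URL}?{urlencode(params)}'

for page in range(1, 4):
    print(build_page_url(page))
```

When looping over several pages, pausing between requests (for example with `time.sleep`) keeps the load on the site low.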

In [79]:
scrape_amazon_books(1,'amazon_bestseller_books.csv')

### Reading the CSV file using the pandas read_csv() function

In [None]:
import pandas as pd

## 6. At the end of this project we'll have a CSV file in the following format:

In [81]:
pd.read_csv('amazon_bestseller_books.csv')

Unnamed: 0,book_rank,book_poster,book_name,book_writer,book_rating,book_review,book_price
0,1,https://images-eu.ssl-images-amazon.com/images...,Energize Your Mind: Learn the Art of Mastering...,Gaur Gopal Das,,,190.0
1,2,https://images-eu.ssl-images-amazon.com/images...,The Psychology of Money,Morgan Housel,4.6,45289.0,140.0
2,3,https://images-eu.ssl-images-amazon.com/images...,Ikigai: The Japanese secret to a long and happ...,Héctor García,4.6,40204.0,225.0
3,4,https://images-eu.ssl-images-amazon.com/images...,Doglapan: The Hard Truth about Life and Start-Ups,Ashneer Grover,4.3,394.0,239.5
4,5,https://images-eu.ssl-images-amazon.com/images...,Atomic Habits: The life-changing million copy ...,James Clear,4.6,61942.0,416.0
5,6,https://images-eu.ssl-images-amazon.com/images...,Rich Dad Poor Dad : What The Rich Teach Their ...,Robert T. Kiyosaki,4.5,14986.0,213.0
6,7,https://images-eu.ssl-images-amazon.com/images...,It Starts With Us,Colleen Hoover,4.4,59537.0,190.0
7,8,https://images-eu.ssl-images-amazon.com/images...,"Oswaal CBSE English, Science, Social Science &...",Oswaal Editorial Board,4.4,173.0,876.9
8,9,https://images-eu.ssl-images-amazon.com/images...,KVS PEDAGOGY MASTER BOOK (BILINGUAL) THEORY wi...,Rohit Vaidwan,,,425.0
9,10,https://images-eu.ssl-images-amazon.com/images...,Educart CBSE Class 10 Sample Paper 2023 Bundle...,Educart,4.3,213.0,640.0


In [82]:
scrape_amazon_books(2,'amazon_bestseller_books2.csv')

In [83]:
pd.read_csv('amazon_bestseller_books2.csv')

Unnamed: 0,book_rank,book_poster,book_name,book_writer,book_rating,book_review,book_price
0,1,https://images-eu.ssl-images-amazon.com/images...,Energize Your Mind: Learn the Art of Mastering...,Gaur Gopal Das,,,190.0
1,2,https://images-eu.ssl-images-amazon.com/images...,The Psychology of Money,Morgan Housel,4.6,45289.0,140.0
2,3,https://images-eu.ssl-images-amazon.com/images...,Ikigai: The Japanese secret to a long and happ...,Héctor García,4.6,40204.0,225.0
3,4,https://images-eu.ssl-images-amazon.com/images...,Doglapan: The Hard Truth about Life and Start-Ups,Ashneer Grover,4.3,394.0,239.5
4,5,https://images-eu.ssl-images-amazon.com/images...,Atomic Habits: The life-changing million copy ...,James Clear,4.6,61942.0,416.0
...,...,...,...,...,...,...,...
95,96,https://images-eu.ssl-images-amazon.com/images...,Oswaal CBSE Sample Question Papers Class 12 Ph...,Oswaal Editorial Board,4.3,60.0,230.0
96,97,https://images-eu.ssl-images-amazon.com/images...,Shivdas CBSE Past 7 Years Board Papers + 5 Sam...,Shivdas Editorial,4.9,56.0,113.0
97,98,https://images-eu.ssl-images-amazon.com/images...,"Build, Don't Talk: Things You Wish You Were Ta...",Raj Shamani,4.4,137.0,172.0
98,99,https://images-eu.ssl-images-amazon.com/images...,The Rudest Book Ever: Powerful Perspectives to...,Shwetabh Gangwar,4.4,7670.0,235.0


In [84]:
scrape_amazon_books(4,'amazon_bestseller_books3.csv')

In [85]:
pd.read_csv('amazon_bestseller_books3.csv')

Unnamed: 0,book_rank,book_poster,book_name,book_writer,book_rating,book_review,book_price
0,1,https://images-eu.ssl-images-amazon.com/images...,Energize Your Mind: Learn the Art of Mastering...,Gaur Gopal Das,,,190.00
1,2,https://images-eu.ssl-images-amazon.com/images...,The Psychology of Money,Morgan Housel,4.6,45289.0,140.00
2,3,https://images-eu.ssl-images-amazon.com/images...,Ikigai: The Japanese secret to a long and happ...,Héctor García,4.6,40204.0,225.00
3,4,https://images-eu.ssl-images-amazon.com/images...,Doglapan: The Hard Truth about Life and Start-Ups,Ashneer Grover,4.3,394.0,239.50
4,5,https://images-eu.ssl-images-amazon.com/images...,Atomic Habits: The life-changing million copy ...,James Clear,4.6,61942.0,416.00
...,...,...,...,...,...,...,...
195,196,https://images-eu.ssl-images-amazon.com/images...,The Diary of a Young Girl,Anne Frank,4.6,30149.0,94.99
196,197,https://images-eu.ssl-images-amazon.com/images...,The Answer Writing Manual for UPSC Civil Servi...,Srushti Deshmukh Gowda IAS,4.6,1714.0,335.49
197,198,https://images-eu.ssl-images-amazon.com/images...,Oswaal CBSE Sample Paper Class 10 Mathematics ...,Oswaal Editorial Board,4.8,42.0,200.00
198,199,https://images-eu.ssl-images-amazon.com/images...,Lucent's General Knowledge - 2023/Edn. - Engli...,Dr. Binay Karna,5.0,1.0,203.00


# **Summary**

1. The scraping was done using the Python libraries Requests and BeautifulSoup.
2. Scraped the top 200 books and their details from 4 different pages of the Amazon.in bestseller books website: Book Rank, Book Poster, Book Name, Book Writer, Book Rating, Total Reviews and Price.
3. Saved all the scraped data into a CSV file containing 50 rows and 7 columns for each page, for a total of 200 rows and 7 columns.

# **Future work**

- Extracting more categories of books.
- Code optimization for better visuals and understanding.
- Improving the documentation of the project to make it more readable.

# **References**

1. https://www.amazon.in/gp/bestsellers/books/?ie=UTF8&ref_=sv_ba_3

2. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis

3. Slack Workgroup for General Discussions

In [87]:
jovian.commit(project="myproject", files=['amazon_bestseller_books.csv','amazon_bestseller_books2.csv','amazon_bestseller_books3.csv'])

[jovian] Updating notebook "ghost-smith9557/myproject" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/ghost-smith9557/myproject


'https://jovian.ai/ghost-smith9557/myproject'