# Web Scraping - Bestseller Books on Amazon using Python

![Banner](https://i.imgur.com/dd4tZRW.png)

### What is web scraping?
Web scraping, also called web harvesting or web data extraction, is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are several ways to scrape a website: using online services, using a site's API, or writing your own scraping code from scratch. Many large websites, such as Google, Twitter, Facebook and StackOverflow, have APIs that provide access to their data in a structured format. An API is the best option when one is available, but other sites either don't allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In that situation, web scraping is the practical way to collect the data.

The pages https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg=1 and https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg=2 list the top 100 `bestseller books on Amazon`. The list is updated hourly based on book sales. In this project, we'll retrieve information from these pages using `web scraping`.

We'll use the `Python` libraries [Requests](https://requests.readthedocs.io/en/latest/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape data from these pages.

Here is an outline of the steps we'll follow:

1. Download the web page using `requests`
2. Parse the HTML source code using Beautiful Soup
3. Extract book titles, author names, star ratings, total ratings, prices and URLs from the page
4. Compile the extracted information into Python lists and dictionaries
5. Extract and combine data from both pages
6. Save the extracted information to a CSV file.

By the end of the project we'll create a CSV file in the following format:

```
Book Title, Author, Star Rating, Total Ratings, Price, URL
The Psychology of Money, Morgan Housel, 4.6 out of 5 stars, 43,757, ₹240.00, https://amazon.in/Psychology-Money-Morgan-Housel/dp/9390166268/ref=zg_bs_books_sccl_2/000-0000000-0000000?pd_rd_i=9390166268&psc=1
Word Power Made Easy, Norman Lewis, 4.4 out of 5 stars, 41,780, ₹115.00, https://amazon.in/Word-Power-Made-Norman-Lewis/dp/0143424688/ref=zg_bs_books_sccl_8/000-0000000-0000000?pd_rd_i=0143424688&psc=1
...
```

### How to Run the Code
You can execute the code using the 'Run' button at the top of this page and selecting 'Run on Binder'. You can make changes and save your own version of the notebook to [Jovian](https://www.jovian.ai) by executing the following cells:

In [1]:
!pip install jovian --upgrade --quiet

## Download the web page using `requests`

We'll use the `requests` library to download the web page.

The library can be installed using `pip`.

In [2]:
!pip install requests --upgrade --quiet

In [3]:
import requests

The library is now installed and imported.

To download a web page, we'll use the `get` function from requests. 

In [4]:
page1_url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg=1'
response = requests.get(page1_url)

`requests.get` returns a response object containing the data from the web page and some other information about the response.

The `.status_code` property can be used to check whether the request was successful. A successful response has an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.
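The range check described here can be tried without touching the network. This is a minimal sketch: the helper name `is_success` is hypothetical and not part of `requests`, and it only illustrates the 200 to 299 rule.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299

for code in (200, 404, 503):
    # HTTPStatus maps a numeric code to its standard reason phrase
    print(code, HTTPStatus(code).phrase, is_success(code))
```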

In [5]:
response.status_code

200

The request was successful! We can get the contents of the page using `response.text`.

In [6]:
page_contents = response.text

Let's check the number of characters on the page.

In [7]:
len(page_contents)

321033

The page contains over 320,000 characters.

Here are the first 500 characters of the page:

In [8]:
page_contents[:500]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completio'

What we are looking at above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page.

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [9]:
# An explicit encoding keeps characters such as ₹ intact on every platform
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

The preview looks similar to the original but none of the links work.

![](https://i.imgur.com/CT2dv0X.png)

We have successfully downloaded the web page using requests.
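One caveat when saving the page: `open(..., 'w')` falls back to the platform's default text encoding when none is given, and some default encodings cannot represent characters such as `₹`. The sketch below assumes we want UTF-8 on disk; the sample string and the temporary file path are made up for the illustration.

```python
import os
import tempfile

sample_html = '<html><body><span>₹295.00</span></body></html>'  # made-up snippet

# Write and read back with an explicit encoding so the result is platform independent
path = os.path.join(tempfile.mkdtemp(), 'webpage.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample_html)

with open(path, encoding='utf-8') as f:
    restored = f.read()

print(restored == sample_html)
```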

## Parse the HTML source code using beautiful soup

We'll use the `beautifulsoup4` library to parse the HTML source code of the web page. The library can be installed using `pip`.

In [10]:
!pip install beautifulsoup4 --upgrade --quiet

The library is installed. 

Now, we have to import the `BeautifulSoup` class from the `bs4` module.

In [11]:
from bs4 import BeautifulSoup

We'll use the `BeautifulSoup` class to create a parsed document, which we'll call `doc`.

In [12]:
doc = BeautifulSoup(response.text,'html.parser')

In [13]:
type(doc)

bs4.BeautifulSoup

Let's get the title and the first image of the doc using the `.find` method.

In [14]:
doc.find('title')

<title>Amazon.in Bestsellers: The most popular items in Books</title>

In [15]:
doc.find('img')

<img alt="" src="https://images-eu.ssl-images-amazon.com/images/G/31/social_share/amazon_logo._CB633266945_.png" style="display:none"/>

Let's put the download and parsing steps together in a helper function, `select_page`, which takes a page number and returns the parsed document.

In [16]:
def select_page(page_no):
    # Download the web page
    url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg={}'.format(page_no)
    response = requests.get(url)

    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Failed to fetch webpage: {}'.format(url))

    # Get the page HTML
    page_content = response.text

    # Create a bs4 doc
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

Let's parse the first page of Amazon's bestseller books using the `select_page()` function.

In [17]:
doc = select_page(1)

In [18]:
doc.find('title')

<title>Amazon.in Bestsellers: The most popular items in Books</title>

We have the title of the first web page.

We'll use the `.text` property to strip the tags on both ends and get only the text.

In [19]:
doc.find('title').text

'Amazon.in Bestsellers: The most popular items in Books'

We can now use the `select_page()` function to download either the first or the second web page of bestseller books and parse it using Beautiful Soup.
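The same `find`, `find_all` and `.text` calls can be explored offline on a small document. This is a minimal sketch using a made-up HTML snippet rather than the real Amazon page.

```python
from bs4 import BeautifulSoup

# Made-up HTML, standing in for a downloaded page
html = ('<html><head><title>Sample Bestsellers</title></head>'
        '<body><img src="a.png"/><img src="b.png"/></body></html>')
doc = BeautifulSoup(html, 'html.parser')

title_tag = doc.find('title')      # first matching tag, or None if absent
print(title_tag.text)              # text without the surrounding tags
print(doc.find('img')['src'])      # attribute access on the first <img>
print(len(doc.find_all('img')))    # find_all returns every match
print(doc.find('table'))           # None, since the snippet has no <table>
```

Because `find` returns `None` when nothing matches, calling `.text` on a missing tag raises an `AttributeError`, which is worth keeping in mind when a page layout changes.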

## Extract books title, author name, star rating, total ratings, price and URLs from web page

The `doc` object has several properties and methods for extracting information from the HTML document.
We'll use some of them to extract the required information.

### Book Title

A book's title identifies the work, puts it in context, conveys a minimal summary of its contents, and piques the reader's curiosity. Hence, we'll first create a function to get the titles of all the books on the web page.

We'll pass `doc` as an argument and parse the book titles from it.

In [20]:
def get_books_title(doc):
    # Get all the 'div' tags from the doc using 'find_all' method and 'class' attribute
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    books_title = []
    # Create a loop to generate a list of 'titles' for all books
    for i in range(len(div_tags)):
        # Using find_all method for 'span' tag to get the title
        books_title.append(div_tags[i].find_all('span')[1].text)
    # Returns the list of titles for all books
    return books_title

Now, we have created the function to get the book titles.

Let's use the function and get the total number of books on the first page. 

In [21]:
books_title = get_books_title(doc)

In [22]:
len(books_title)

50

So, we have 50 books on the first page of Amazon's bestseller books.

Here are the titles of the top 5 bestseller books: 

In [23]:
books_title[:5]

['Atomic Habits: The life-changing million copy bestseller',
 'Ikigai: The Japanese secret to a long and happy life',
 'The Psychology of Money',
 'The Power of Your Subconscious Mind',
 'Word Power Made Easy']

We have created the function `get_books_title` to parse the titles of all the books on the web page.
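One detail of the selector used in `get_books_title` is worth seeing in isolation: when `class_` is given a string containing several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below uses made-up markup and class names; the CSS selector shown is an alternative that ignores order, not what the function above uses.

```python
from bs4 import BeautifulSoup

# Made-up markup: two cards whose class attributes differ only in order
html = '''
<div class="card grid-col"><span>#1</span><span>First Book</span></div>
<div class="grid-col card"><span>#2</span><span>Second Book</span></div>
'''
doc = BeautifulSoup(html, 'html.parser')

# A multi-class string passed to class_ must equal the attribute value exactly
exact = doc.find_all('div', class_='card grid-col')
# A CSS selector matches the classes in any order
by_css = doc.select('div.card.grid-col')

print([d.find_all('span')[1].text for d in exact])
print([d.find_all('span')[1].text for d in by_css])
```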

### Author Names

We'll create a function to get the author name for each book.

In [24]:
def get_author_names(doc):
    """Return the author name of every book on the page.

    The details of each book sit in a 'div' tag with the class below, so the
    same selector is reused by all the extractor functions."""
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    author = []
    # Loop over the 'div' tags to generate a list of author names for all books
    for i in range(len(div_tags)):
        # Use 'try' and 'except' blocks to handle books where an element is missing
        try:
            # Using find_all method with 'class' attribute to get the authors name
            author.append(div_tags[i].find_all(class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')[1].text)
        except:
            try:
                author.append(div_tags[i].find_all(class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')[0].text)
            except:
                author.append('NA')
    # Returns the list of authors name
    return author

We have created the function to get the author names for all the books.

Let's parse the author names from the doc.

In [25]:
author_names = get_author_names(doc)

In [26]:
len(author_names)

50

The length of the authors list matches the length of the books list.

Let's look at the authors of the top 5 books.

In [27]:
author_names[:5]

['James Clear',
 'Héctor García',
 'Morgan Housel',
 'Joseph Murphy',
 'Norman Lewis']

We have created the function `get_author_names` to parse the author names of all the books on the web page.
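The nested `try`/`except` blocks in `get_author_names` encode a fallback order: take the second matching element, else the first, else `'NA'`. A minimal sketch of that order on plain lists follows; `pick_author` is a hypothetical helper and the sample values are made up.

```python
def pick_author(candidates, default='NA'):
    """Return the second candidate if present, else the first, else a default.

    Hypothetical helper mirroring the fallback order of the nested try/except.
    """
    for index in (1, 0):
        if len(candidates) > index:
            return candidates[index]
    return default

print(pick_author(['Some Title', 'Some Author']))
print(pick_author(['Some Author']))
print(pick_author([]))
```

Checking the length up front avoids a bare `except`, which would also hide unrelated errors.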

### Star Ratings

Star rating is a system using five stars to rank a product from good to bad. In the case of books, it gives potential readers an idea of how many people liked the book, and how many didn't based on the content and/or quality of the book.

Let's create a function to parse the star ratings of the books.

In [28]:
def get_star_ratings(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    star_rating = []
    # Loop over every 'div' tag found above, since each one holds
    # all the attributes of a single book
    for i in range(len(div_tags)):
        try:
            # Using the 'span' tag with the 'class' attribute to get the star ratings
            star_rating.append(div_tags[i].find('span', class_='a-icon-alt').text)
        except:
            star_rating.append('NA')
    # Returns the list of star ratings
    return star_rating

Let's use the function and get the star ratings of the books.

In [29]:
star_ratings = get_star_ratings(doc)

In [30]:
len(star_ratings)

50

In [31]:
star_ratings[:5]

['4.7 out of 5 stars',
 '4.6 out of 5 stars',
 '4.6 out of 5 stars',
 '4.5 out of 5 stars',
 '4.4 out of 5 stars']

So, we have the star ratings for the top 5 books.

We have created the function `get_star_ratings` to parse the star ratings of all the books on the web page.
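The ratings are returned as strings such as `'4.7 out of 5 stars'`. If a numeric value is needed for later analysis, a regular expression can pull out the leading number. This is a minimal sketch; `parse_star_rating` is a hypothetical helper, and it assumes the text always follows the pattern shown in the output above.

```python
import re

def parse_star_rating(text):
    """Extract the number from text like '4.7 out of 5 stars'; None if it does not match."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of 5 stars', text)
    return float(match.group(1)) if match else None

print(parse_star_rating('4.7 out of 5 stars'))
print(parse_star_rating('NA'))
```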

### Total Ratings

Total ratings, or global ratings, are the total number of times a book has been given a star rating. The star rating for a book is calculated from these ratings and other factors.

In [32]:
def get_total_ratings(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    total_rating = []
    for i in range(len(div_tags)):
        try:
            # Using the 'span' tag with the 'class' attribute to get the total ratings
            total_rating.append(div_tags[i].find_all('span', class_='a-size-small')[-2].text)
        except:
            total_rating.append('NA')
    # Returns the list of total ratings
    return total_rating

Let's use the function and get the total ratings of the books.

In [33]:
total_ratings = get_total_ratings(doc)

In [34]:
len(total_ratings)

50

So, we have 50 total ratings, which matches the number of books we have parsed.

Let's see the total ratings of the top 5 books.

In [35]:
total_ratings[:5]

['55,002', '36,253', '44,661', '58,547', '42,134']

We have created the function `get_total_ratings` to get the total ratings of all the books on the web page.
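The totals come back as strings with thousands separators, such as `'55,002'`, or as `'NA'` when nothing was found. A minimal sketch of turning them into integers follows; `parse_total_ratings` is a hypothetical helper, not part of the notebook.

```python
def parse_total_ratings(text):
    """Convert '55,002' to 55002; return None for 'NA' or other non-numeric text."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

print([parse_total_ratings(t) for t in ['55,002', '36,253', 'NA']])
```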

### Book Price

In [36]:
def get_books_price(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    book_price = []
    for i in range(len(div_tags)):
        try:
            # Using the 'class' attribute to get the books price
            book_price.append(div_tags[i].find(class_='p13n-sc-price').text)
        except:
            book_price.append('NA')
    # Returns the list of books price
    return book_price

Let's use the above function and get the price of all the bestseller books.

In [37]:
books_price = get_books_price(doc)

In [38]:
len(books_price)

50

In [39]:
books_price[:5]

['₹295.00', '₹298.00', '₹245.00', '₹115.00', '₹122.00']

So, we have the prices of the top 5 books of Amazon's bestseller books.

We have created the function `get_books_price` to get the price of all the bestseller books.
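Prices are also strings, for example `'₹295.00'`. For arithmetic on money, `decimal.Decimal` avoids binary floating point rounding. The sketch below assumes the price text is a rupee sign followed by a number; `parse_price` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation

def parse_price(text):
    """Convert '₹295.00' to Decimal('295.00'); None when the text is not a price."""
    cleaned = text.replace('₹', '').replace(',', '').strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

print(parse_price('₹295.00'))
print(parse_price('NA'))
```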

### Book URL

Let's create a function to get the URL of each book. A book's URL takes us to a page with more information about that book, such as reviews and buying offers.

In [40]:
def get_books_url(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    base_url = 'https://amazon.in'
    books_url = []
    for i in range(len(div_tags)):
        # Using find method for 'href' attribute to get the url
        books_url.append(base_url + div_tags[i].find('a')['href'])
    # Returns the list of URLs for all books
    return books_url

Let's use the above function to get URLs for all the books.

In [41]:
books_url = get_books_url(doc)

In [42]:
len(books_url)

50

The number of URLs matches the number of books on the web page.

Let's see the URLs of the top 5 books.

In [43]:
books_url[:5]

['https://amazon.in/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_sccl_1/000-0000000-0000000?pd_rd_i=1847941834&psc=1',
 'https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%ADa/dp/178633089X/ref=zg_bs_books_sccl_2/000-0000000-0000000?pd_rd_i=178633089X&psc=1',
 'https://amazon.in/Psychology-Money-Morgan-Housel/dp/9390166268/ref=zg_bs_books_sccl_3/000-0000000-0000000?pd_rd_i=9390166268&psc=1',
 'https://amazon.in/Power-Your-Subconscious-Mind/dp/8194790832/ref=zg_bs_books_sccl_4/000-0000000-0000000?pd_rd_i=8194790832&psc=1',
 'https://amazon.in/Word-Power-Made-Norman-Lewis/dp/0143424688/ref=zg_bs_books_sccl_5/000-0000000-0000000?pd_rd_i=0143424688&psc=1']

We have created the function `get_books_url` to get the URL of all the bestseller books.
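`get_books_url` builds each link by string concatenation, which works because every `href` on the page is a path starting with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library does the same job and also copes with an `href` that is already absolute. The path below is a shortened sample.

```python
from urllib.parse import urljoin

base_url = 'https://amazon.in'
href = '/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_sccl_1'  # shortened sample path

full_url = urljoin(base_url, href)
print(full_url)

# An href that is already absolute is returned unchanged
print(urljoin(base_url, 'https://example.com/book'))
```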

We have created the functions `get_books_title`, `get_author_names`, `get_star_ratings`, `get_total_ratings`, `get_books_price` and `get_books_url` to extract the book titles, author names, star ratings, total ratings, prices and URLs from the HTML code. Let's move on to the next step and compile this information.

## Compile the extracted information into a dictionary

Now that we have functions to extract the book titles, author names, star ratings, total ratings, prices and URLs from the HTML code, let's compile the extracted information into a dictionary.

In [44]:
books_data = {
    'Title' : books_title,
    'Author' : author_names,
    'Star Rating' : star_ratings,
    'Total Ratings' : total_ratings,
    'Price' : books_price,
    'URL' : books_url
}

Let's install the `pandas` library to view the values compiled in the dictionary as a pandas dataframe. We can use `pip` to install the library.

In [45]:
!pip install pandas --upgrade --quiet

Let's import `pandas`.

In [46]:
import pandas as pd

`pandas` is imported. Let's use it to create a dataframe from our `books_data` dictionary.

In [47]:
pd.DataFrame(books_data)

| | Title | Author | Star Rating | Total Ratings | Price | URL |
|---|---|---|---|---|---|---|
| 0 | Atomic Habits: The life-changing million copy ... | James Clear | 4.7 out of 5 stars | 55002.0 | ₹295.00 | https://amazon.in/Atomic-Habits-James-Clear/dp... |
| 1 | Ikigai: The Japanese secret to a long and happ... | Héctor García | 4.6 out of 5 stars | 36253.0 | ₹298.00 | https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%A... |
| 2 | The Psychology of Money | Morgan Housel | 4.6 out of 5 stars | 44661.0 | ₹245.00 | https://amazon.in/Psychology-Money-Morgan-Hous... |
| 3 | The Power of Your Subconscious Mind | Joseph Murphy | 4.5 out of 5 stars | 58547.0 | ₹115.00 | https://amazon.in/Power-Your-Subconscious-Mind... |
| 4 | Word Power Made Easy | Norman Lewis | 4.4 out of 5 stars | 42134.0 | ₹122.00 | https://amazon.in/Word-Power-Made-Norman-Lewis... |
| 5 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | 4.5 out of 5 stars | 60031.0 | ₹399.00 | https://amazon.in/My-First-Library-Boxset-Boar... |
| 6 | Rich Dad Poor Dad : What The Rich Teach Their ... | Robert T. Kiyosaki | 4.6 out of 5 stars | 70406.0 | ₹170.00 | https://amazon.in/Rich-Dad-Poor-Middle-Anniver... |
| 7 | It Ends With Us: A Novel: Volume 1 | Colleen Hoover | 4.5 out of 5 stars | 138104.0 | ₹292.00 | https://amazon.in/Ends-Us-Novel-Colleen-Hoover... |
| 8 | 28 Years UPSC Civil Services IAS Prelims Topic... | Mrunal Patel | 4.1 out of 5 stars | 166.0 | ₹382.00 | https://amazon.in/Services-Prelims-Topic-wise-... |
| 9 | You Can | George Matthew Adams | 4.5 out of 5 stars | 1313.0 | ₹99.00 | https://amazon.in/You-Can-George-Matthew-Adams... |


We have compiled the books data into a dictionary and used pandas to create a dataframe from it, with one row for each of the 50 books.
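Building a dataframe from a dictionary of lists only works when every list has the same length; otherwise `pd.DataFrame` raises a `ValueError`. A quick check before that step can make a mismatch easier to diagnose. The sketch below uses made-up data, and `check_equal_lengths` is a hypothetical helper.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a dict of columns differ in length; else return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return next(iter(lengths.values()), 0)

sample = {'Title': ['A', 'B'], 'Author': ['X', 'Y']}  # made-up data
print(check_equal_lengths(sample))

# Row-wise view of the same data, one dictionary per book
rows = [dict(zip(sample, values)) for values in zip(*sample.values())]
print(rows[0])
```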

## Extract and combine data from both pages

Up until now, we are able to extract data from a single page. Now, let's create a function to parse multiple web pages and extract the data from all of them.

In [48]:
def get_pages(page_number):
    doc = select_page(page_number)
    books_title = get_books_title(doc)
    author_names = get_author_names(doc)
    star_ratings = get_star_ratings(doc)
    total_ratings = get_total_ratings(doc)
    books_price = get_books_price(doc)
    books_url = get_books_url(doc)
    return books_title, author_names, star_ratings, total_ratings, books_price, books_url

We have created a function to extract the data from different web pages based on the page number.

Let's create a function to extract data from both pages and compile it into a dictionary. We'll import the `time` library to pause between requests, because otherwise the site may not allow us to fetch the next page.

In [49]:
import time

In [50]:
def parse_amazon_bestseller_books_pages(n):
    all_books_title, all_author_names, all_star_ratings, all_total_ratings, all_books_price, all_books_url = [],[],[],[],[],[]
    all_books_data = {
    'Title' : all_books_title,
    'Author' : all_author_names,
    'Star Rating' : all_star_ratings,
    'Total Ratings' : all_total_ratings,
    'Price' : all_books_price,
    'URL' : all_books_url
    }    
    for page_number in range (1,n+1):
        books_title, author_names, star_ratings, total_ratings, books_price, books_url = get_pages(page_number)
        all_books_title += books_title 
        all_author_names += author_names
        all_star_ratings += star_ratings
        all_total_ratings += total_ratings
        all_books_price += books_price
        all_books_url += books_url
        # Give sleep time of 30 seconds before moving to next page 
        time.sleep(30)
    return all_books_data   

We have created the function to extract data from both pages and present them in the form of dictionary.
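The function above also sleeps after the last page, which is harmless but adds 30 seconds to every run. A variation that pauses only between pages can be sketched as follows. The names `collect_pages`, `fetch_page` and `fake_fetch` are hypothetical; `fetch_page` loosely stands in for `get_pages`, and the fake fetcher lets the loop run offline.

```python
import time

def collect_pages(fetch_page, page_count, delay_seconds=30, sleep=time.sleep):
    """Call fetch_page for pages 1..page_count, pausing only between pages.

    fetch_page is any callable that returns a list of rows for a page number.
    """
    all_rows = []
    for page_number in range(1, page_count + 1):
        if page_number > 1:
            sleep(delay_seconds)  # no wait before the first page or after the last
        all_rows.extend(fetch_page(page_number))
    return all_rows

def fake_fetch(page):
    # Offline stand-in that returns two made-up rows per page
    return ['book {}-{}'.format(page, i) for i in range(1, 3)]

print(collect_pages(fake_fetch, 2, delay_seconds=0))
```

Passing the `sleep` function as a parameter is a design choice that makes the delay easy to replace in tests.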

In [None]:
all_books_data = parse_amazon_bestseller_books_pages(2)

Let's load the books data into a pandas dataframe.

In [94]:
dataframe = pd.DataFrame(all_books_data)

In [95]:
dataframe

| | Title | Author | Star Rating | Total Ratings | Price | URL |
|---|---|---|---|---|---|---|
| 0 | Atomic Habits: The life-changing million copy ... | James Clear | 4.7 out of 5 stars | 55002 | ₹295.00 | https://amazon.in/Atomic-Habits-James-Clear/dp... |
| 1 | Ikigai: The Japanese secret to a long and happ... | Héctor García | 4.6 out of 5 stars | 36253 | ₹298.00 | https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%A... |
| 2 | The Psychology of Money | Morgan Housel | 4.6 out of 5 stars | 44661 | ₹245.00 | https://amazon.in/Psychology-Money-Morgan-Hous... |
| 3 | The Power of Your Subconscious Mind | Joseph Murphy | 4.5 out of 5 stars | 58547 | ₹115.00 | https://amazon.in/Power-Your-Subconscious-Mind... |
| 4 | Word Power Made Easy | Norman Lewis | 4.4 out of 5 stars | 42134 | ₹122.00 | https://amazon.in/Word-Power-Made-Norman-Lewis... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | International English Olympiad (IEO) Work Book... | ZARRIN ALI KHAN | 4.6 out of 5 stars | 20 | ₹64.00 | https://amazon.in/International-English-Olympi... |
| 96 | India that is Bharat: Coloniality, Civilisatio... | J Sai Deepak | 4.8 out of 5 stars | 4383 | ₹420.00 | https://amazon.in/India-that-Bharat-Civilisati... |
| 97 | India & World Map ( Both Political & Physical ... | Vidya Chitr Prakashan | 4.5 out of 5 stars | 1137 | ₹140.00 | https://amazon.in/Political-Physical-Constitut... |
| 98 | Bhagavad Gita Original in English - Bhagavad G... | A. C. Bhaktivedanta Swami Prabhupad | 4.8 out of 5 stars | 3700 | ₹280.00 | https://amazon.in/Bhagavad-Gita-Original-Engli... |


All the data extracted from both web pages is compiled into the dictionary and presented as a dataframe containing 100 rows.

## Save the extracted information to a CSV file

Let's create a `.csv` file to save the extracted information.

In [54]:
dataframe.to_csv('amazon-bestseller-books.csv', index=False)

We have created an `amazon-bestseller-books.csv` file containing all the extracted data.

Let's view the first few lines of the CSV file using the `!head` command.

In [55]:
!head amazon-bestseller-books.csv

Title,Author,Star Rating,Total Ratings,Price,URL
Atomic Habits: The life-changing million copy bestseller,James Clear,4.7 out of 5 stars,"54,899",₹295.00,https://amazon.in/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_sccl_1/000-0000000-0000000?pd_rd_i=1847941834&psc=1
Ikigai: The Japanese secret to a long and happy life,Héctor García,4.6 out of 5 stars,"36,199",₹300.00,https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%ADa/dp/178633089X/ref=zg_bs_books_sccl_2/000-0000000-0000000?pd_rd_i=178633089X&psc=1
The Psychology of Money,Morgan Housel,4.6 out of 5 stars,"44,543",₹255.00,https://amazon.in/Psychology-Money-Morgan-Housel/dp/9390166268/ref=zg_bs_books_sccl_3/000-0000000-0000000?pd_rd_i=9390166268&psc=1
My First Library: Boxset of 10 Board Books for Kids,Wonder House Books,4.5 out of 5 stars,"59,941",₹399.00,https://amazon.in/My-First-Library-Boxset-Board/dp/9387779262/ref=zg_bs_books_sccl_4/000-0000000-0000000?pd_rd_i=9387779262&psc=1
The Power of Your Subconscious Mind,Jos

We have extracted the required information and saved it in a CSV file.
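Note that the total ratings appear in double quotes in the file because they contain commas. A CSV reader handles this quoting automatically, as the following sketch with the standard `csv` module shows. The sample reuses one row from the output above, with the URL column left out for brevity.

```python
import csv
import io

# Two lines in the same shape as the saved file (URL column omitted for brevity)
sample_csv = (
    'Title,Author,Star Rating,Total Ratings,Price\n'
    'The Psychology of Money,Morgan Housel,4.6 out of 5 stars,"44,543",₹255.00\n'
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]['Total Ratings'])   # the quoted field keeps its comma
print(len(rows[0]))               # five columns, so the comma did not split the field
```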

Let's import `jovian` and save our work using `jovian.commit`.

In [96]:
import jovian

In [97]:
jovian.commit()

[jovian] Updating notebook "ahlenoorkhan/project1-web-scraping-bestseller-books-on-amazon" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ahlenoorkhan/project1-web-scraping-bestseller-books-on-amazon


'https://jovian.ai/ahlenoorkhan/project1-web-scraping-bestseller-books-on-amazon'

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(files=['amazon-bestseller-books.csv'])

## Summary

Here is what we have covered in our notebook:

1. Downloaded the web page using the `requests` library.
2. Parsed the HTML source code using Beautiful Soup.
3. Extracted book titles, author names, star ratings, total ratings, prices and URLs from the web page.
4. Compiled the extracted information into Python lists and dictionaries.
5. Extracted and combined data from both pages.
6. Saved the extracted information in a CSV file.

The CSV file we have created has this format:
```
Title,Author,Star Rating,Total Ratings,Price,URL
Atomic Habits: The life-changing million copy bestseller,James Clear,4.7 out of 5 stars,"54,861",₹295.00,https://amazon.in/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_sccl_1/000-0000000-0000000?pd_rd_i=1847941834&psc=1
Ikigai: The Japanese secret to a long and happy life,Héctor García,4.6 out of 5 stars,"36,182",₹300.00,https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%ADa/dp/178633089X/ref=zg_bs_books_sccl_2/000-0000000-0000000?pd_rd_i=178633089X&psc=1
The Psychology of Money,Morgan Housel,4.6 out of 5 stars,"44,513",₹255.00,https://amazon.in/Psychology-Money-Morgan-Housel/dp/9390166268/ref=zg_bs_books_sccl_3/000-0000000-0000000?pd_rd_i=9390166268&psc=1
```

Here is the complete code for this project:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def select_page(page_no):
    # Download the web page
    url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg={}'.format(page_no)
    response = requests.get(url)
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Failed to fetch webpage: {}'.format(url))
    # Get the page HTML
    page_content = response.text
    # Create a bs4 doc
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

def get_books_title(doc):
    # Get all the 'div' tags from the doc using 'find_all' method and 'class' attribute
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    books_title = []
    # Create a loop to generate a list of 'titles' for all books
    for i in range(len(div_tags)):
        # Using find_all method for 'span' tag to get the title
        books_title.append(div_tags[i].find_all('span')[1].text)
    # Returns the list of titles for all books
    return books_title

def get_author_names(doc):
    """The attributes of all the books are compiled in div tags which can be parsed using the 'class' attribute from the doc.
    Hence it will remain same to extract all the required information.""" 
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    author = []
    # Create a loop within range of length of 'div tags' to generate a list of 'authors name' for all books
    for i in range(len(div_tags)):
        # Using the 'try' and 'except' block to handle any error's if present
        try:
            # Using find_all method with 'class' attribute to get the authors name
            author.append(div_tags[i].find_all(class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')[1].text)
        except:
            try:
                author.append(div_tags[i].find_all(class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')[0].text)
            except:
                author.append('NA')
    # Returns the list of authors name
    return author

def get_star_ratings(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    star_rating = []
    # Loop over every 'div' tag found above, since each one holds
    # all the attributes of a single book
    for i in range(len(div_tags)):
        try:
            # Using the 'span' tag with the 'class' attribute to get the star ratings
            star_rating.append(div_tags[i].find('span', class_='a-icon-alt').text)
        except:
            star_rating.append('NA')
    # Returns the list of star ratings
    return star_rating

def get_total_ratings(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    total_rating = []
    for i in range(len(div_tags)):
        try:
            # Using the 'span' tag with the 'class' attribute to get the total ratings
            total_rating.append(div_tags[i].find_all('span', class_='a-size-small')[-2].text)
        except:
            total_rating.append('NA')
    # Returns the list of total ratings
    return total_rating

def get_books_price(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    book_price = []
    for i in range(len(div_tags)):
        try:
            # Using the 'class' attribute to get the books price
            book_price.append(div_tags[i].find(class_='p13n-sc-price').text)
        except:
            book_price.append('NA')
    # Returns the list of books price
    return book_price

def get_books_url(doc):
    div_tags = doc.find_all('div', class_='a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc')
    base_url = 'https://amazon.in'
    books_url = []
    for i in range(len(div_tags)):
        # Using find method for 'href' attribute to get the url
        books_url.append(base_url + div_tags[i].find('a')['href'])
    # Returns the list of URLs for all books
    return books_url

def get_pages(page_number):
    doc = select_page(page_number)
    books_title = get_books_title(doc)
    author_names = get_author_names(doc)
    star_ratings = get_star_ratings(doc)
    total_ratings = get_total_ratings(doc)
    books_price = get_books_price(doc)
    books_url = get_books_url(doc)
    return books_title, author_names, star_ratings, total_ratings, books_price, books_url

def parse_amazon_bestseller_books_pages(n):
    all_books_title, all_author_names, all_star_ratings, all_total_ratings, all_books_price, all_books_url = [],[],[],[],[],[]
    all_books_data = {
    'Title' : all_books_title,
    'Author' : all_author_names,
    'Star Rating' : all_star_ratings,
    'Total Ratings' : all_total_ratings,
    'Price' : all_books_price,
    'URL' : all_books_url
    }    
    for page_number in range (1,n+1):
        books_title, author_names, star_ratings, total_ratings, books_price, books_url = get_pages(page_number)
        all_books_title += books_title 
        all_author_names += author_names
        all_star_ratings += star_ratings
        all_total_ratings += total_ratings
        all_books_price += books_price
        all_books_url += books_url
        # Give sleep time of 30 seconds before moving to next page 
        time.sleep(30)
    return all_books_data   

## Future Work

* We can scrape each book's page to fetch more information, such as the book's description, a breakdown of its star ratings, and its critical or most helpful reviews.
* As the web page is updated on an hourly basis, we can collect data at various times to analyze the changes in book ranking over time.
* We can analyze this data to find the relationship between total ratings and star rating, the impact of ratings on a book's ranking, etc.

## References

* Web scraping tutorial at Jovian: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
* Documentation tutorial at Jovian: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/documentation-and-storytelling?notebook=aakashns/documentation-and-storytelling
* Requests documentation: https://requests.readthedocs.io/en/latest/
* Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Pandas documentation: https://pandas.pydata.org/docs/