# Web Scraping Amazon Books with Python using BeautifulSoup and requests

![](https://i.imgur.com/G8alOBY.jpg)

Amazon is a technology company based in the United States that specialises in cloud computing, digital streaming, artificial intelligence, and e-commerce. Through its Amazon Prime Video and Audible divisions, Amazon also offers a variety of digital and streaming media. It also makes consumer electronics, including Kindle e-readers, Echo devices, Fire tablets, and Fire TVs.

Amazon lists thousands of books covering a wide range of subjects. This project helps a reader browse the best-selling books and pick one that suits their interests.

You can browse Amazon's best-selling books at [Amazon Books](https://www.amazon.in/gp/bestsellers/books/).

We will scrape the name, author, URL and rating of books from several genres, across a few pages per genre, using the Requests and BeautifulSoup libraries, and write the parsed information to a `.csv` file.

### Introduction to Python

Python is a powerful, interactive, object-oriented, and interpreted scripting language. Python was designed to be highly readable.
It has fewer syntactical constructs than many other languages and typically uses English keywords where other languages use punctuation.

### Introduction to Web Scraping

The process of creating an agent that can automatically collect, parse, download, and arrange meaningful information from the
web is known as web scraping, also known as web data mining or web harvesting. In other words, the web scraping programme will
automatically load and extract data from many websites in accordance with our requirements, as opposed to us manually saving 
the data from websites. 

In this project, we will scrape the Amazon best-selling books pages with Python using the BeautifulSoup library, and then write the extracted data to a CSV file.

![](https://i.imgur.com/NCsz2xG.png)

### Steps to be followed

1) Downloading web pages through the requests library 

2) Examining a web page's HTML source code    

3) Using Beautiful Soup to parse various website sections

4) Extract the name, author, url and rating of the books

5) Creating CSV files from parsed data


The resulting CSV file contains the name, author, URL and rating of each book.

### Libraries used for Web Scraping

Requests : Python's requests library lets you send HTTP requests. Each request returns a Response object containing all the response data.

BeautifulSoup : Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree that makes it easy to extract data.

Pandas : Pandas is a library for analysing and manipulating data. Here it is used to view the extracted data in tabular form.

![](https://i.imgur.com/S93dsz3.jpg)

## Downloading web pages through the `requests` library

When you open a URL such as https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg=1 in a web browser, the browser downloads the web page at that URL and displays it on the screen. To extract information from the page, we must first download it using Python.

We'll use the requests library to download web pages from the internet. First, we install and import the library.

`pip` can be used to install the library.

In [1]:
# Install the library
!pip install requests --upgrade --quiet

In [2]:
# Import the library
import requests

We can download a web page using the `requests.get` function

In [3]:
topic_url ='https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg=1'

In [4]:
response = requests.get(topic_url)

In [5]:
type(response)

requests.models.Response

`requests.get` returns a response object that contains the contents of the page along with a status code indicating whether the request succeeded. See the MDN reference for more about [HTTP status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

If the request was successful, `response.status_code` is set to a value between 200 and 299.

In [6]:
response.status_code

200
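The status check described above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `ensure_success` is a hypothetical name, and the stand-in objects below only mimic the `status_code` attribute of a real response so that the example runs without network access.

```python
from types import SimpleNamespace


def ensure_success(response):
    """Return the response if its status code is in the 2xx range, else raise.

    Works with any object exposing `status_code`, such as requests.Response.
    """
    if not 200 <= response.status_code <= 299:
        raise RuntimeError(f"Request failed with status code {response.status_code}")
    return response


# Stand-in objects used only for illustration (no network access needed).
ok_response = SimpleNamespace(status_code=200)
blocked_response = SimpleNamespace(status_code=503)

print(ensure_success(ok_response).status_code)
try:
    ensure_success(blocked_response)
except RuntimeError as error:
    print(error)
```

With a real response from `requests.get`, the built-in `response.raise_for_status()` method and the `response.ok` attribute offer similar checks.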

The `response.text` attribute can be used to obtain the web page's content.

In [7]:
page_contents = response.text

The length of the `page_contents` can be accessed using len()

In [8]:
len(page_contents)

303801

There are more than 300,000 characters on the page! Let's look at the webpage's first 1000 characters.

In [9]:
page_contents[:1000]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n\n<!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|01ZTHTZObnL.css,41C-I1lXVwL.css,31ufSReDtSL.css,013z33uKh2L.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11Qjwq-j69L.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,21P6CS3L9LL.css,01oDR3IULNL.css,41CYNGpGlrL.css,01XPHJk60-L.css,01smHc51S9L.css,21aPhFy+riL.cs

What you see above is the web page's source code, written in HTML. It describes the structure and content of the web page.

Let's write the page contents to a file with the `.html` extension.

In [10]:
with open('school-books.html', 'w') as file:
    file.write(page_contents)

You can now inspect the file using Jupyter Notebook's "File > Open" menu option and selecting school-books.html from the list of files. When you open the file, you'll see this:

![](https://imgur.com/IbyPH1e.jpg)
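When saving a downloaded page, it can help to state the text encoding explicitly, since the platform default is not always UTF-8. The sketch below is an illustration under that assumption; `save_text`, `load_text` and the sample file name are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def save_text(text, path):
    """Write text to path as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as file:
        return file.write(text)


def load_text(path):
    """Read the whole file at path as UTF-8 text."""
    with open(path, "r", encoding="utf-8") as file:
        return file.read()


# A tiny page containing a non-ASCII character (the rupee sign).
sample_html = "<html><body><p>Price: ₹499</p></body></html>"
sample_path = os.path.join(tempfile.gettempdir(), "sample-page.html")

save_text(sample_html, sample_path)
print(load_text(sample_path) == sample_html)
```

Reading the file back with the same encoding returns exactly the text that was written.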

## Using `Beautiful Soup` to parse various website sections

Using the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package, we can programmatically extract data from a web page's HTML source code. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [11]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
# Import the library
from bs4 import BeautifulSoup

Let's now read the contents of the file school-books.html so that we can create a BeautifulSoup object to parse it.

In [13]:
with open('school-books.html', 'r') as f:
    html_source = f.read()

Now we create a BeautifulSoup object to parse the content.

In [14]:
soup = BeautifulSoup(html_source, 'html.parser')

In [15]:
type(soup)

bs4.BeautifulSoup

### Accessing a tag

There are numerous methods for information extraction from the [HTML](https://www.w3schools.com/html/html_intro.asp) document in the soup object. Take a look at the following examples.

Accessing the title of the page

In [16]:
title_tag = soup.title

In [17]:
title_tag

<title>Amazon.in Bestsellers: The most popular items in School Books</title>

We can access a tag's name using the `.name` property

In [18]:
title_tag.name

'title'

The text within a tag can be accessed using `.text`.

In [19]:
title_tag.text

'Amazon.in Bestsellers: The most popular items in School Books'

## Examining a web page's `HTML` source code

![](https://i.imgur.com/i75aStf.png)

DIV Tag : The `<div>` tag defines a division or section in an HTML document. It is easy to style or select using its class or id attribute.

CLASS Attribute : The class attribute assigns one or more class names to an HTML element.

## Extracting the name, author, url and rating of the books

The `find_all` method returns all tags that match the supplied tag name and attributes, such as a class.

In [20]:
books = soup.find_all('div', class_= 'zg-grid-general-faceout')

In [21]:
# Length of books can be accessed by len() 
len(books)

50
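To see how `find_all` matches on a class, here is a minimal sketch on a made-up HTML snippet. The class names `card`, `featured`, `banner` and `title` are invented for illustration and do not come from Amazon's markup.

```python
from bs4 import BeautifulSoup

# Made-up markup: two "card" divs (one with an extra class) and one unrelated div.
sample_html = """
<div class="card featured"><span class="title">Book A</span></div>
<div class="card"><span class="title">Book B</span></div>
<div class="banner">Not a book</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag whose class list contains "card".
cards = sample_soup.find_all("div", class_="card")
titles = [card.find("span", class_="title").text for card in cards]

print(len(cards))
print(titles)
```

Note that a tag with several classes still matches when any one of them equals the value passed as `class_`.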

Let's examine the first of the 50 items in `books`. Python lists are zero-indexed, so the first element is at index 0.

In [22]:
first_book = books[0]

In [23]:
first_book

<div class="zg-grid-general-faceout"><div id="9391050840"><a class="a-link-normal" href="/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&amp;psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _cDEzb_noop_3Xbw5"><img alt="Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-eu.ssl-images-amazon.com/images/I/71owQ+uazdL._AC_UL302_SR302,200_.jpg":[302,200],"https://images-eu.ssl-images-amazon.com/images/I/71owQ+uazdL._AC_UL604_SR604,400_.jpg":[604,400],"https://images-eu.ssl-images-amazon.com/images/I/71owQ+uazdL._AC_UL906_SR906,600_.jpg":[906,600]}' height="200px" src="https://images-eu.ssl-images-amazon.com/images/I/71owQ+uazdL._AC_UL302_SR302,200_.jpg" style="max-width:302px;max-height:200px"/></div></a><a class="a-link-normal" href="/Oxford-Student-Atlas-India-Fourth/dp/939105

The book name and author name are both in `div` tags with the class `_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y`, so let's try to access them.

In [24]:
book_info = first_book.find_all('div', class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')

In [25]:
book_info

[<div class="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y">Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams</div>,
 <div class="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y">Oxford University Press</div>]

We can see that `book_info` is a list of two tags: the first holds the name of the book and the second holds its author.

The text within a tag can be accessed using `.text`.

In [26]:
book_name = book_info[0].text
book_author = book_info[1].text

In [27]:
print("Book Name :", book_name)

Book Name : Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams


In [28]:
print("Book Author :", book_author)

Book Author : Oxford University Press


Similarly, we can get the URL of the book from the `href` attribute of its link. The `href` is a relative path, so we prepend the site's base URL.

In [29]:
book_url = 'https://www.amazon.in' + first_book.find('a', class_ = 'a-link-normal')["href"]

In [30]:
book_url

'https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1'
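As an alternative to string concatenation, the standard library's `urljoin` can build the absolute URL. This is a sketch only; `absolute_url` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"


def absolute_url(href, base_url=BASE_URL):
    """Resolve href against base_url; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)


relative_href = "/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1"
already_absolute = "https://www.amazon.in/dp/9391050840"

print(absolute_url(relative_href))
print(absolute_url(already_absolute))
```

Because `urljoin` leaves an absolute URL untouched, calling it twice on the same value does not duplicate the `https://www.amazon.in` prefix.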

We can get the rating of the book from the `title` attribute of the third `a-link-normal` tag returned by `find_all`.

In [31]:
rating = first_book.find_all(class_ = 'a-link-normal')[2]["title"]

In [32]:
rating

'4.4 out of 5 stars'
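The rating is returned as text. If a numeric value is needed later, for sorting or averaging, it can be parsed out of the string. The helper below is a sketch that assumes the rating always has the form shown above; `rating_to_float` is a hypothetical name.

```python
import re


def rating_to_float(rating_text):
    """Extract the leading number from text like '4.4 out of 5 stars'.

    Returns None when the text does not have the expected form.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+\d+", rating_text)
    if match is None:
        return None
    return float(match.group(1))


print(rating_to_float("4.4 out of 5 stars"))
print(rating_to_float("No rating yet"))
```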

Now, let's look at the information that we extracted using the `print` function.

In [33]:
print(f'''
book name : {book_name}
book author : {book_author}
book url : {book_url}
book rating : {rating}
''')


book name : Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams
book author : Oxford University Press
book url : https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1
book rating : 4.4 out of 5 stars



The page at https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg=1 lists 50 books, so let's write a function `parse_books` that returns the name, author, url and rating of a single book.

In [34]:
# Some books are missing information such as the book name or author name.
# Such books are skipped: the function returns None for them.
def parse_books(book):
    book_info = book.find_all('div', class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')
    # To make sure that book_info has book name and author name    
    if len(book_info) == 2:
        # Name of the book
        book_name = book_info[0].text
        # Author of the book
        book_author = book_info[1].text
        # book url
        book_url = 'https://www.amazon.in' + book.find('a', class_ = 'a-link-normal')["href"]
        # rating of the book
        rating = book.find_all(class_ = 'a-link-normal')[2]["title"]
            
        return { 'book name' : book_name, 'book author' : book_author, 'book url' : book_url, 'book rating' : rating }


We can now parse details of any book using the function `parse_books(book)`

In [35]:
parse_books(books[0])

{'book name': 'Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams',
 'book author': 'Oxford University Press',
 'book url': 'https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1',
 'book rating': '4.4 out of 5 stars'}

In [36]:
parse_books(books[5])

{'book name': "T.S. Grewal's Double Entry Book Keeping: Financial Accounting Textbook for CBSE Class 11 (as per 2022-23 syllabus)",
 'book author': 'T.S. Grewal',
 'book url': 'https://www.amazon.in/T-S-Grewals-Double-Entry-Keeping/dp/9390851777/ref=zg_bs_4149807031_6/000-0000000-0000000?pd_rd_i=9390851777&psc=1',
 'book rating': '4.2 out of 5 stars'}
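Picking the rating by position (the third matching link) can fail if a book has fewer links or no rating. One possible alternative, sketched below, is to look for the link whose `title` contains the rating text. `extract_rating` is a hypothetical helper, and the markup is made up for illustration; real Amazon pages are more complex.

```python
from bs4 import BeautifulSoup


def extract_rating(book_tag):
    """Return the first link title that looks like a star rating, or None."""
    for link in book_tag.find_all("a", class_="a-link-normal"):
        title = link.get("title", "")
        if "out of 5 stars" in title:
            return title
    return None


# Made-up markup: one card with a rating link and one without.
rated_html = """
<div class="card">
  <a class="a-link-normal" href="/dp/1">cover</a>
  <a class="a-link-normal" href="/dp/1">name</a>
  <a class="a-link-normal" href="/reviews/1" title="4.4 out of 5 stars">stars</a>
</div>
"""
unrated_html = '<div class="card"><a class="a-link-normal" href="/dp/2">name</a></div>'

rated_card = BeautifulSoup(rated_html, "html.parser").div
unrated_card = BeautifulSoup(unrated_html, "html.parser").div

print(extract_rating(rated_card))
print(extract_rating(unrated_card))
```

Returning `None` for a missing rating lets the caller decide whether to skip the book or keep it with an empty rating.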

We can use a list comprehension to parse all the books at once.

In [37]:
top_books = [ parse_books(book) for book in books ]

In [38]:
len(top_books)

50

In [39]:
top_books[:3]

[{'book name': 'Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams',
  'book author': 'Oxford University Press',
  'book url': 'https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1',
  'book rating': '4.4 out of 5 stars'},
 {'book name': 'Concept of Physics by H.C Verma Part - I - Session 2022-23',
  'book author': 'H.C. Verma',
  'book url': 'https://www.amazon.in/Concept-Physics-Part-1-2018-2019-Session/dp/8177091875/ref=zg_bs_4149807031_2/000-0000000-0000000?pd_rd_i=8177091875&psc=1',
  'book rating': '4.6 out of 5 stars'},
 {'book name': 'CBSE All In One Social Science Class 10 2022-23 Edition (As per latest CBSE Syllabus issued on 21 April 2022)',
  'book author': 'Farah Sultan Madhumita Pattrea',
  'book url': 'https://www.amazon.in/Social-Science-2022-23-latest-Syllabus/dp/9326196879/ref=zg_bs_4149807031_3/000-0000000-0000000?pd_rd_i=9326196879&psc=1',
  'book rating':

A few books don't have the complete information that we are looking for.
For such books, `parse_books` returns `None`, so they can be filtered out later.
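Since the list comprehension keeps one entry per book, incomplete books show up as `None` in the result. A minimal sketch of dropping them, using made-up sample data in place of the scraped list:

```python
# Sample data standing in for the output of the list comprehension above.
parsed_books = [
    {"book name": "Book A", "book author": "Author A"},
    None,  # a book whose details could not be parsed
    {"book name": "Book B", "book author": "Author B"},
]

# Keep only the books that were parsed successfully.
complete_books = [book for book in parsed_books if book is not None]
skipped_count = len(parsed_books) - len(complete_books)

print(len(complete_books), "complete,", skipped_count, "skipped")
```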

To scrape several genres, we store the URL of each genre we are interested in in a dictionary. Each URL ends with `pg=`, so that a page number can be appended.

In [40]:
all_book_links = { 
                    'School Books' : 'https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Action Books' : 'https://www.amazon.in/gp/bestsellers/books/1318084031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Engineering Books' : 'https://www.amazon.in/gp/bestsellers/books/22960344031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Mining Books' :'https://www.amazon.in/gp/bestsellers/books/12365345031/ref=zg_bs_pg_2?ie=UTF8&pg='
                }

Below is the function that downloads the Amazon bestselling books pages for a given genre and returns a list of BeautifulSoup documents, one per page.

In [41]:
def get_topic_page(topic):
    # Download pages 1 to 6 of a genre and return a list of parsed documents
    docs = []
    for i in range(1, 7):
        # Constructing the URL
        topic_url = topic + str(i)
        
        # Get the HTML page content using requests
        response = requests.get(topic_url)
    
        # Ensure that the response is valid
        if response.status_code != 200:
            print('Status code:', response.status_code)
            raise Exception('Failed to fetch web page ' + topic_url)

        # Construct a beautiful soup document
        doc = BeautifulSoup(response.text, 'html.parser')
        
        docs.append(doc)
    
    return docs
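The URL construction in the loop above can be separated from the downloading, which also makes it easy to pause between requests. The sketch below is an illustration only: `page_urls`, `fetch_all` and `fake_fetch` are hypothetical names, and the one-second default delay is an assumption, not a documented requirement of the site.

```python
import time


def page_urls(base_url, first_page=1, last_page=6):
    """Build one URL per page by appending the page number to base_url."""
    return [base_url + str(page) for page in range(first_page, last_page + 1)]


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download function, so the sketch runs offline."""
    return "page " + url.rsplit("=", 1)[-1]


school_books_url = "https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg="
urls = page_urls(school_books_url)
pages = fetch_all(urls, fake_fetch, delay_seconds=0)

print(urls[0])
print(pages)
```

In the notebook, a function that calls `requests.get` and builds a `BeautifulSoup` document could be passed in place of `fake_fetch`.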

Below is the function `parse_pages`, which parses all the books on a single page. We will call it on each of the pages 1 to 6.

In [42]:
def parse_pages(soup):
    books = soup.find_all('div', class_= 'zg-grid-general-faceout')

    topic_books = [ parse_books(book) for book in books ]

    return topic_books

We can now use the functions we've defined to get the top books for any genre.

In [50]:
docs = get_topic_page(all_book_links['School Books'])
topic_books = []
for doc in docs:
    topic_books = topic_books + parse_pages(doc)

print(topic_books[:3])

[{'book name': 'Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams', 'book author': 'Oxford University Press', 'book url': 'https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1', 'book rating': '4.4 out of 5 stars'}, {'book name': 'Concept of Physics by H.C Verma Part - I - Session 2022-23', 'book author': 'H.C. Verma', 'book url': 'https://www.amazon.in/Concept-Physics-Part-1-2018-2019-Session/dp/8177091875/ref=zg_bs_4149807031_2/000-0000000-0000000?pd_rd_i=8177091875&psc=1', 'book rating': '4.6 out of 5 stars'}, {'book name': 'CBSE All In One Social Science Class 10 2022-23 Edition (As per latest CBSE Syllabus issued on 21 April 2022)', 'book author': 'Farah Sultan Madhumita Pattrea', 'book url': 'https://www.amazon.in/Social-Science-2022-23-latest-Syllabus/dp/9326196879/ref=zg_bs_4149807031_3/000-0000000-0000000?pd_rd_i=9326196879&psc=1', 'book rating': '4.2 out of 5 stars

In [51]:
len(topic_books)

300

We can also access a particular book by indexing with a number

In [52]:
topic_books[2]

{'book name': 'CBSE All In One Social Science Class 10 2022-23 Edition (As per latest CBSE Syllabus issued on 21 April 2022)',
 'book author': 'Farah Sultan Madhumita Pattrea',
 'book url': 'https://www.amazon.in/Social-Science-2022-23-latest-Syllabus/dp/9326196879/ref=zg_bs_4149807031_3/000-0000000-0000000?pd_rd_i=9326196879&psc=1',
 'book rating': '4.2 out of 5 stars'}

In [53]:
topic_books[20]

{'book name': 'International Mathematics Olympiad (IMO) Work Book for Class 4 - MCQs, Previous Years Solved Paper and Achievers Section - Olympiad Books For 2022-2023 Exam',
 'book author': 'MAHABIR SINGH',
 'book url': 'https://www.amazon.in/International-Mathematics-Olympiad-Work-Class/dp/9355552130/ref=zg_bs_4149807031_21/000-0000000-0000000?pd_rd_i=9355552130&psc=1',
 'book rating': '4.5 out of 5 stars'}

Now, with the functions written above, we can parse the information for other genres of Amazon books as well.

Here, we are accessing the "Action Books" genre.

In [55]:
docs = get_topic_page(all_book_links['Action Books'])
topic_books2 = []
for doc in docs:
    topic_books2 = topic_books2 + parse_pages(doc)

print(topic_books2[:3])

[{'book name': "Harry Potter and the Philosopher's Stone", 'book author': 'J.K. Rowling', 'book url': 'https://www.amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318084031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1', 'book rating': '4.7 out of 5 stars'}, {'book name': 'The Accidental Minecraft Family: Book 27', 'book author': 'Pixel Ate', 'book url': 'https://www.amazon.in/Accidental-Minecraft-Family-Book-27-ebook/dp/B0B3QVB4K3/ref=zg_bs_1318084031_2/000-0000000-0000000?pd_rd_i=B0B3QVB4K3&psc=1', 'book rating': '5.0 out of 5 stars'}, {'book name': 'Harry Potter Box Set: The Complete Collection (Set of 7 Volumes)', 'book author': 'J.K. Rowling', 'book url': 'https://www.amazon.in/Harry-Potter-ChildrenS-Paperback-Boxed/dp/1408856778/ref=zg_bs_1318084031_3/000-0000000-0000000?pd_rd_i=1408856778&psc=1', 'book rating': '4.7 out of 5 stars'}]


We can access a particular book by indexing with a number

In [56]:
topic_books2[2]

{'book name': 'Harry Potter Box Set: The Complete Collection (Set of 7 Volumes)',
 'book author': 'J.K. Rowling',
 'book url': 'https://www.amazon.in/Harry-Potter-ChildrenS-Paperback-Boxed/dp/1408856778/ref=zg_bs_1318084031_3/000-0000000-0000000?pd_rd_i=1408856778&psc=1',
 'book rating': '4.7 out of 5 stars'}

In [57]:
topic_books2[20]

{'book name': "Good Girl, Bad Blood: TikTok made me buy it! The Sunday Times Bestseller and sequel to A Good Girl's Guide to Murder: Book 2",
 'book author': 'Holly Jackson',
 'book url': 'https://www.amazon.in/Good-Girl-Blood-Holly-Jackson/dp/1405297751/ref=zg_bs_1318084031_21/000-0000000-0000000?pd_rd_i=1405297751&psc=1',
 'book rating': '4.6 out of 5 stars'}

In [58]:
docs = get_topic_page(all_book_links['Engineering Books'])
topic_books3 = []
for doc in docs:
    topic_books3 = topic_books3 + parse_pages(doc)

In [63]:
docs = get_topic_page(all_book_links['Mining Books'])
topic_books4 = []
for doc in docs:
    topic_books4 = topic_books4 + parse_pages(doc)

## Writing information to CSV files

The function below writes a list of dictionaries to a .CSV file at the given path, using the dictionary keys as column headers.

In [64]:
def write_csv(items, path):
    """Write a list of dictionaries to a CSV file"""
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        for item in items:
            # Skip books for which no details could be parsed
            if item is None:
                continue
            values = [item.get(header, "") for header in headers]
            f.write(','.join(values) + "\n")

In [65]:
write_csv(top_books, 'School Books.csv')

The file can now be read to view its contents. Using Jupyter's "File > Open" menu option, you can also examine the file's content.

In [66]:
with open('School Books.csv', 'r') as f:
    print(f.read())

book name,book author,book url,book rating
Oxford Student Atlas for India, Fourth Edition - Useful for Competitive Exams,Oxford University Press,https://www.amazon.in/Oxford-Student-Atlas-India-Fourth/dp/9391050840/ref=zg_bs_4149807031_1/000-0000000-0000000?pd_rd_i=9391050840&psc=1,4.4 out of 5 stars
Concept of Physics by H.C Verma Part - I - Session 2022-23,H.C. Verma,https://www.amazon.in/Concept-Physics-Part-1-2018-2019-Session/dp/8177091875/ref=zg_bs_4149807031_2/000-0000000-0000000?pd_rd_i=8177091875&psc=1,4.6 out of 5 stars
CBSE All In One Social Science Class 10 2022-23 Edition (As per latest CBSE Syllabus issued on 21 April 2022),Farah Sultan Madhumita Pattrea,https://www.amazon.in/Social-Science-2022-23-latest-Syllabus/dp/9326196879/ref=zg_bs_4149807031_3/000-0000000-0000000?pd_rd_i=9326196879&psc=1,4.2 out of 5 stars
Mathematics for Class 10 - CBSE - by R.D. Sharma Examination 2022-23,R.D. Sharma,https://www.amazon.in/Mathematics-Class-10-Examination-2022-23/dp/B09SJ365JL/ref
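Joining values with commas by hand does not quote fields, so a book name that itself contains a comma is split across columns when the file is read back. One possible alternative is the standard library's `csv.DictWriter`, which quotes such fields automatically. The sketch below uses made-up sample data and a hypothetical function name, `write_books_csv`, and writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io


def write_books_csv(items, file):
    """Write a list of dictionaries to an open text file as CSV.

    None entries are skipped. Returns the number of rows written.
    """
    rows = [item for item in items if item is not None]
    if not rows:
        return 0
    writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


sample_books = [
    {"book name": "Atlas for India, Fourth Edition", "book author": "Oxford University Press"},
    None,
    {"book name": "Concept of Physics", "book author": "H.C. Verma"},
]

buffer = io.StringIO()
written = write_books_csv(sample_books, buffer)
print(buffer.getvalue())

# Reading the data back keeps the comma inside the first book name.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["book name"])
```

When writing to a real file instead of a buffer, the `csv` module documentation recommends opening it with `newline=''`.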

Perfect! We've created a CSV file containing information about the bestselling books in the School Books genre. We can now put together everything we've done so far to solve the original problem.

Let us write a function that creates a CSV (comma-separated values) file containing details of the 100+ top books for a given genre. The genre's link is stored without the page number at the end of the URL.

In [67]:
import requests
from bs4 import BeautifulSoup

all_book_links = { 
                    'School Books' : 'https://www.amazon.in/gp/bestsellers/books/4149807031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Action Books' : 'https://www.amazon.in/gp/bestsellers/books/1318084031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Engineering Books' : 'https://www.amazon.in/gp/bestsellers/books/22960344031/ref=zg_bs_pg_2?ie=UTF8&pg=',
                    'Mining Books' :'https://www.amazon.in/gp/bestsellers/books/12365345031/ref=zg_bs_pg_2?ie=UTF8&pg='
                }


def parse_books(book):
    book_info = book.find_all('div', class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y')
    # To make sure that book_info has book name and author name    
    if len(book_info) == 2:
        # Name of the book
        book_name = book_info[0].text
        # Author of the book
        book_author = book_info[1].text
        # book url
        book_url = 'https://www.amazon.in' + book.find('a', class_ = 'a-link-normal')["href"]
        # book rating
        rating = book.find_all(class_ = 'a-link-normal')[2]["title"]
        
        return { 'book name' : book_name, 'book author' : book_author, 'book url' : book_url, 'book rating' : rating }

    

def get_topic_page(topic):
    
    docs = []
    for i in range(1, 7):

        topic_url = topic + str(i)
        
        # Get the HTML page content using requests
        response = requests.get(topic_url)
    
        # Ensure that the response is valid
        if response.status_code != 200:
            print('Status code:', response.status_code)
            raise Exception('Failed to fetch web page ' + topic_url)

        # Construct a beautiful soup document
        doc = BeautifulSoup(response.text, 'html.parser')
        
        docs.append(doc)
    
    return docs


# Function to parse Multiple pages
def parse_pages(soup):
    books = soup.find_all('div', class_= 'zg-grid-general-faceout')

    topic_books = [ parse_books(book) for book in books ]

    return topic_books

def write_csv(items, path):
    """Write a list of dictionaries to a CSV file"""
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        for item in items:
            # Skip books for which no details could be parsed
            if item is None:
                continue
            values = [item.get(header, "") for header in headers]
            f.write(','.join(values) + "\n")
    

def scrape_topic_books(topic, path=None):
    """Get the top books for a genre and write them to a CSV file.

    `topic` is a key of `all_book_links`, such as 'School Books'.
    """
    
    if path is None:
        path = topic + '.csv'
        
    topic_url = all_book_links[topic]
    
    topic_page_docs = get_topic_page(topic_url)
    topic_books = []
    for doc in topic_page_docs:
        topic_books = topic_books + parse_pages(doc)

   
    write_csv(topic_books, path)
    print('Top repositories for topic "{}" written to file "{}"'.format(topic, path))
    return path

In [71]:
scrape_topic_books('School Books')

Top repositories for topic "School Books" written to file "School Books.csv"


'School Books.csv'

Now that we have the scraped data, we can use the pandas library to view it as a table. The cell below builds the table from the `topic_books` list, after dropping the `None` entries.

In [72]:
# Importing the library Pandas
import pandas as pd

Using Pandas Library to view the content

In [73]:
k=[x for x in topic_books if x is not None]
pd.DataFrame.from_dict(k)

|  | book name | book author | book url | book rating |
|---|---|---|---|---|
| 0 | Oxford Student Atlas for India, Fourth Edition... | Oxford University Press | https://www.amazon.in/Oxford-Student-Atlas-Ind... | 4.4 out of 5 stars |
| 1 | Concept of Physics by H.C Verma Part - I - Ses... | H.C. Verma | https://www.amazon.in/Concept-Physics-Part-1-2... | 4.6 out of 5 stars |
| 2 | CBSE All In One Social Science Class 10 2022-2... | Farah Sultan Madhumita Pattrea | https://www.amazon.in/Social-Science-2022-23-l... | 4.2 out of 5 stars |
| 3 | Mathematics for Class 10 - CBSE - by R.D. Shar... | R.D. Sharma | https://www.amazon.in/Mathematics-Class-10-Exa... | 4.5 out of 5 stars |
| 4 | CBSE All In One Science Class 10 2022-23 Editi... | Rashmi Gupta Sonal Singh ,Ruchi Kapoor,Imran A... | https://www.amazon.in/Science-2022-23-latest-S... | 4.4 out of 5 stars |
| ... | ... | ... | ... | ... |
| 283 | International English Olympiad (IEO) Work Book... | ZARRIN ALI KHAN | https://www.amazon.in/International-English-Ol... | 4.2 out of 5 stars |
| 284 | Physics Galaxy 2020-21: Electrostatics & Curre... | Ashish Arora | https://www.amazon.in/Physics-Galaxy-2020-21-E... | 4.5 out of 5 stars |
| 285 | CBSE All In One Information Technology (Code 4... | Shweta Agarwal Neetu Gaikwad | https://www.amazon.in/Information-Technology-2... | 4.7 out of 5 stars |
| 286 | WORKBOOK MATH CBSE- CLASS 5TH | Arihant Experts | https://www.amazon.in/CBSE-WORKBOOK-MATH-CLASS... | 4.3 out of 5 stars |


In [74]:
jovian.commit(files=['School Books.csv'])

<IPython.core.display.Javascript object>

[jovian] Updating notebook "yogi-varmavatsa/yogendra-web-scrapping-amazon-books" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/yogi-varmavatsa/yogendra-web-scrapping-amazon-books


'https://jovian.ai/yogi-varmavatsa/yogendra-web-scrapping-amazon-books'

Now, let us try to scrape the "Action Books"

In [80]:
scrape_topic_books('Action Books')

Top repositories for topic "Action Books" written to file "Action Books.csv"


'Action Books.csv'

Using Pandas library to view the content

In [81]:
k=[x for x in topic_books2 if x is not None]
pd.DataFrame.from_dict(k)

|  | book name | book author | book url | book rating |
|---|---|---|---|---|
| 0 | Harry Potter and the Philosopher's Stone | J.K. Rowling | https://www.amazon.in/Harry-Potter-Philosopher... | 4.7 out of 5 stars |
| 1 | The Accidental Minecraft Family: Book 27 | Pixel Ate | https://www.amazon.in/Accidental-Minecraft-Fam... | 5.0 out of 5 stars |
| 2 | Harry Potter Box Set: The Complete Collection ... | J.K. Rowling | https://www.amazon.in/Harry-Potter-ChildrenS-P... | 4.7 out of 5 stars |
| 3 | The Jungle Book (Campbell First Stories) (Camp... | Miriam Bos | https://www.amazon.in/Jungle-Book-First-Storie... | 4.5 out of 5 stars |
| 4 | Harry Potter and the Chamber of Secrets | J.K. Rowling | https://www.amazon.in/Harry-Potter-Chamber-Sec... | 4.7 out of 5 stars |
| ... | ... | ... | ... | ... |
| 291 | The Hardy Boys 03: The Secret of the Old Mill | Franklin W. Dixon | https://www.amazon.in/Hardy-Boys-03-Secret-Mil... | 4.6 out of 5 stars |
| 292 | Marvel's Avengers: Age of Ultron: The Reusable... | Charles Cho | https://www.amazon.in/Marvels-Avengers-Ultron-... | 4.0 out of 5 stars |
| 293 | The Ballad of Winston the Wandering Trader, Bo... | Dr. Block | https://www.amazon.in/Ballad-Winston-Wandering... | 4.8 out of 5 stars |
| 294 | The Lord of the Rings: The Return of the King:... | J.R.R. Tolkien | https://www.amazon.in/Lord-Rings-Return-King/d... | 4.8 out of 5 stars |


In [83]:
jovian.commit(files=['Action Books.csv'])

<IPython.core.display.Javascript object>

[jovian] Updating notebook "yogi-varmavatsa/yogendra-web-scrapping-amazon-books" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/yogi-varmavatsa/yogendra-web-scrapping-amazon-books


'https://jovian.ai/yogi-varmavatsa/yogendra-web-scrapping-amazon-books'

Using the function written above, we can now scrape any genre of books.
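To scrape every genre in one go, the scraping function can be called in a loop. The sketch below assumes a function that takes a genre name and returns the path of the CSV it wrote, as `scrape_topic_books` does; `scrape_all` and `fake_scrape` are hypothetical names, and the stand-in function exists only so that the example runs offline.

```python
def scrape_all(topics, scrape):
    """Call scrape(topic) for each topic and collect results.

    Returns two dictionaries: topic -> CSV path for successes,
    and topic -> error message for failures.
    """
    paths = {}
    failures = {}
    for topic in topics:
        try:
            paths[topic] = scrape(topic)
        except Exception as error:
            failures[topic] = str(error)
    return paths, failures


def fake_scrape(topic):
    """Stand-in for a real scraping function; fails for one genre on purpose."""
    if topic == "Mining Books":
        raise RuntimeError("Failed to fetch web page")
    return topic + ".csv"


topics = ["School Books", "Action Books", "Engineering Books", "Mining Books"]
paths, failures = scrape_all(topics, fake_scrape)

print(paths)
print(failures)
```

In the notebook, `scrape_all(all_book_links, scrape_topic_books)` would iterate over the genre names stored as dictionary keys. Catching the exception per genre means one failed download does not stop the remaining genres.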

### Summary and Further Reading

We've covered the following web scraping steps:

1. Downloading web pages using the requests library
2. Inspecting the HTML source code of a web page
3. Parsing parts of a website using Beautiful Soup
4. Writing parsed information into CSV files

### Future Work

1) We can now explore this data in more depth to extract useful insights.
2) With all of this information, we can choose a book that suits our interests.
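As a first step in that direction, the scraped list can be ranked by rating. The sketch below assumes each rating has the form "4.4 out of 5 stars", as in the outputs above; `top_rated` is a hypothetical helper, and the sample list reuses three books from the School Books results.

```python
def top_rated(books, count=3):
    """Return the `count` highest-rated books, skipping None entries."""

    def score(book):
        # The rating text starts with the numeric score, e.g. "4.4 out of 5 stars".
        return float(book["book rating"].split()[0])

    rated = [book for book in books if book is not None]
    return sorted(rated, key=score, reverse=True)[:count]


sample_books = [
    {"book name": "Oxford Student Atlas for India", "book rating": "4.4 out of 5 stars"},
    None,
    {"book name": "Concept of Physics by H.C Verma Part - I", "book rating": "4.6 out of 5 stars"},
    {"book name": "CBSE All In One Social Science Class 10", "book rating": "4.2 out of 5 stars"},
]

best = top_rated(sample_books, count=2)
for book in best:
    print(book["book rating"], "|", book["book name"])
```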

### References

1) Amazon books website https://www.amazon.in/gp/bestsellers/books/

2) Aakash N S, Introduction to Web Scraping https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis

3) Pandas library documentation. https://pandas.pydata.org/docs/

4) Requests library. https://pypi.org/project/requests/

5) Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

6) Python official documentation. https://docs.python.org/3/

7) A few images from Google Images