# Scraping a Book Website Using Python and BeautifulSoup

## Choosing the [book website to scrape](https://books.toscrape.com/index.html) and describing what to scrape ![](https://i.imgur.com/NvTiIil.jpeg)

### Introduction:
Books to Scrape was chosen because it lists a wide range of books across many genres, with enough detail per book to be worth extracting. Many websites do not allow scraping for commercial reasons, but Books to Scrape is intended for scraping practice, so the required information can be extracted without resorting to unethical methods.

- Browse through the different pages of the book website and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea in a paragraph using a Markdown cell and outline your strategy.

## Project Outline
- We are going to scrape https://books.toscrape.com/index.html
- We are going to scrape the different pages of this website, each containing information about different books
- For each page associated with the [main URL](https://books.toscrape.com/index.html), scrape all the books listed on that page

- For each book, extract Book_Title, Book_cost, Book_Rating, Book_availability, Book_uniqueCode (UPC), and the link to the individual book (Book_link)

- The final output of the project should look like this:


```
"Title": Title,
"Price": Book_Price,
"Availability": No_of_Copies,
"Book_url": Book_url,
"Book Rating": Rating+" "+"star",
"Book_code": Book_code,
"Book_Geners": Geners
```


### Successful completion of the project should give the desired output in a DataFrame: ![](https://i.imgur.com/fXwhFRk.png)

## The user interface of the website to scrape looks like this: ![ ](https://i.imgur.com/bI9JRYm.png)

- It can be seen that the [base URL](http://books.toscrape.com/catalogue/page-) is followed by a page number, and that there are 50 pages in total to scrape



## Creating the web links for all 50 pages

In [6]:
# Demonstration of building a single page link
base_url="http://books.toscrape.com/catalogue/page-"
url_L=base_url+str(1)+".html"  # 1 can be replaced with any page number from 1 to 50

In [7]:
base_url="http://books.toscrape.com/catalogue/page-"
url_list=[]  # Empty list to store all the page links

for i in range(1,51):
    add_url=base_url + str(i) +".html"
    url_list.append(add_url)
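
The same list can also be built in a single expression. This is a minimal sketch, assuming the page links follow the `page-N.html` pattern shown above; `build_page_urls` is a hypothetical helper name that does not appear in the original notebook.

```python
def build_page_urls(base_url, total_pages):
    """Return the listing-page links for pages 1..total_pages (inclusive)."""
    return [f"{base_url}{page}.html" for page in range(1, total_pages + 1)]


base_url = "http://books.toscrape.com/catalogue/page-"
url_list = build_page_urls(base_url, 50)

print(len(url_list))
print(url_list[0])
print(url_list[-1])
```

Keeping the page count as a parameter makes it easy to try the scraper on a handful of pages first.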


## Importing the required Python libraries:
[Requests](https://pypi.org/project/requests/), [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), [Pandas](https://pandas.pydata.org/), and [CSV](https://docs.python.org/3/library/csv.html) libraries

In [8]:
import requests
import jovian
from bs4 import BeautifulSoup
import pandas as pd
import csv

## Function for validating a web link by checking its [HTTP response status](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

In [9]:

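def url_fetch_status(url):
    """Fetch url and return its HTML parsed as a BeautifulSoup document."""
    response=requests.get(url)

    # Anything other than 200 (OK) means the page was not fetched successfully
    if response.status_code!=200:
        print("Status Code",response.status_code)
        raise Exception('Failed to fetch web page '+url)

    # Pass the raw bytes so BeautifulSoup detects the encoding itself;
    # response.text can mis-decode the pound sign as 'Â£'
    doc=BeautifulSoup(response.content,'html.parser')
    return doc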

## HTML SOURCE CODE OVERVIEW 
- HTML documents start with a document type declaration: `<!DOCTYPE html>`.

- The HTML document itself begins with `<html>` and ends with `</html>`.

- The visible part of the HTML document is between `<body>` and `</body>`.
![](https://i.imgur.com/V0sJkus.png)


HTML source code consists of tags, which act like containers holding different categories of items. To extract the information in a particular category, we have to access the corresponding tag; this principle is applied throughout to get the required information.
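
The container idea can be tried on a small inline snippet before touching the live site. The sketch below is illustrative only: the markup is hypothetical and heavily simplified (the `book` class and the two entries are made up for this example), and only `find_all`, `find`, attribute access, and `.text` from BeautifulSoup are used.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real pages have more nesting and attributes.
html = """
<html><body>
  <ol>
    <li class="book">
      <h3><a href="book-a/index.html" title="Book A">Book A</a></h3>
      <p class="price_color">£10.00</p>
    </li>
    <li class="book">
      <h3><a href="book-b/index.html" title="Book B">Book B</a></h3>
      <p class="price_color">£12.50</p>
    </li>
  </ol>
</body></html>
"""

doc = BeautifulSoup(html, "html.parser")

# Each <li> is a container; search inside it for the pieces we need.
books = []
for item in doc.find_all("li", class_="book"):
    link = item.find("a")
    books.append({
        "title": link["title"],
        "href": link["href"],
        "price": item.find("p", class_="price_color").text.strip(),
    })

print(books)
```

Searching inside each container, rather than across the whole document, keeps the title, link, and price of one book together.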


 **Storing the BeautifulSoup version of the HTML response for each of the web links created**

In [10]:
doc_collection=[]  # BeautifulSoup documents parsed from each page's HTTP response
for url in url_list:
    doc_collection.append(url_fetch_status(url))

**Checking that there is one BeautifulSoup document for each of the 50 web links**

In [11]:
len(doc_collection)

50

 **Finding the books present on each page using** the `li` tag with `class_="col-xs-6 col-sm-4 col-md-3 col-lg-3"`
 
 ![](https://i.imgur.com/0ginh95.png)

## Creating helper functions

### Finding the list of tags containing all the required information about each book

In [12]:
#Function to find all the available books on a page

def article_tags(doc):
    all_article_tags=doc.find_all('li',class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    return all_article_tags

### Finding the `a` tag to get the href of a particular book ![](https://i.imgur.com/5vmLBgR.png)

In [13]:
all_tag_collection=[article_tags(doc) for doc in doc_collection]

# List holding, for each page, the collection of book tags found in that page's HTML

### As with the href, creating functions that return each piece of required information, to be stored in a dictionary ![](https://i.imgur.com/JMMlOxk.png)

### Getting book path

In [14]:
def get_Book_path(tag):
    Book_path=tag.find_all('a')[0]['href']  # relative link from the first <a> inside the book tag
    return Book_path

### Finding Book_url 

In [15]:
def get_Book_url(Book_path):
    Base_url="https://books.toscrape.com/catalogue/"
    Book_url=Base_url+Book_path
    return Book_url
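
String concatenation works here because every href on the listing pages is relative to the `catalogue/` folder. As a hedged alternative, the standard library's `urljoin` resolves a relative href against the page it was found on, which also copes with `../` segments. In this sketch the second example (the category page address and its href) is a hypothetical illustration, not something taken from the notebook.

```python
from urllib.parse import urljoin

# Page on which the link was found, and the relative href read from its <a> tag.
page_url = "https://books.toscrape.com/catalogue/page-1.html"
book_path = "a-light-in-the-attic_1000/index.html"

book_url = urljoin(page_url, book_path)
print(book_url)

# Hypothetical deeper page whose href climbs back up with "../" segments.
category_url = "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
nested_url = urljoin(category_url, "../../../a-light-in-the-attic_1000/index.html")
print(nested_url)
```

Both calls resolve to the same book address, so the helper does not need to know how deep the source page is.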

###  Parsing url response of individual books as BeautifulSoup object

In [16]:
def get_Book_response(Book_url):
    Book_response=requests.get(Book_url)
    # Raw bytes let BeautifulSoup detect the page encoding itself
    Book_doc=BeautifulSoup(Book_response.content,'html.parser')
    return Book_doc

### Finding the genre of the book

In [17]:
def get_Book_geners(Book_doc):
    Generes=Book_doc.find('ul',class_='breadcrumb').find_all('li')[2].text.strip()
    return Generes  

### Getting book title

In [18]:
def get_Book_title(Book_doc):
    Book_individual_tag=Book_doc.find_all('h1')
    Book_title=Book_individual_tag[0].text.strip()
    return Book_title

### Finding the number of copies available

In [19]:
def get_Book_copies(Book_doc):
    Book_copies=Book_doc.find_all('p',class_="instock availability")
    No_of_Copies=Book_copies[0].text.strip()
    return No_of_Copies
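
The availability text comes back as a sentence such as `In stock (19 available)`. If a numeric count is wanted for later analysis, a regular expression can pull it out. This is a minimal sketch; `parse_copies` is a hypothetical helper, and returning 0 when no number is found is an assumed convention rather than anything the site defines.

```python
import re


def parse_copies(availability_text):
    """Extract the copy count from text like 'In stock (19 available)'.

    Returns 0 when no count is present (an assumption made for this sketch).
    """
    match = re.search(r"\((\d+)\s+available\)", availability_text)
    return int(match.group(1)) if match else 0


print(parse_copies("In stock (19 available)"))
print(parse_copies("Out of stock"))
```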

### Finding the book's unique code (UPC)

In [20]:
def get_Book_code(Book_doc):
    code=Book_doc.find_all('td')
    Book_code=code[0].text
    return Book_code

### Getting price of book 

In [21]:
def get_Book_price(tag):
    Book_Price=tag.find('p',class_="price_color").text.strip()
    return Book_Price
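
The price is returned as text with a currency prefix, and depending on how the response was decoded there may be stray characters in front of the pound sign. For sorting or arithmetic it helps to convert it to a number. The sketch below assumes prices always contain a decimal number; `parse_price` is a hypothetical helper name.

```python
import re


def parse_price(price_text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No numeric price found in {price_text!r}")
    return float(match.group())


print(parse_price("£51.77"))
print(parse_price("Â£12.84"))  # stray characters before the number are ignored
```

Matching only the digits avoids depending on which currency symbol, or which decoding artefact, precedes the amount.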

###  Rating of book

In [22]:
def get_Book_rating(tag):
    star_rating=tag.find_all('p')[0]
    Rating=star_rating["class"][1]
    return Rating
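
The rating is stored on the tag as a word (`One` to `Five`), and the combined output later appends `star` to it. If a numeric rating is preferred, a small lookup table is enough. This is a sketch under that assumption; `RATING_WORDS` and `rating_to_int` are hypothetical names.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(rating_text):
    """Convert 'Three' or 'Three star' to 3; return None for unknown input."""
    words = rating_text.split()
    if not words:
        return None
    return RATING_WORDS.get(words[0])


print(rating_to_int("Three"))
print(rating_to_int("Five star"))
```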

## Combining all helper functions for extraction of required information

In [23]:
def main_function(tag):
    Book_path=tag.find_all('a')[0]['href']
    Base_url="https://books.toscrape.com/catalogue/" 
    Book_url=Base_url+Book_path
    Book_doc=get_Book_response(Book_url)
    return {
        "Title":get_Book_title(Book_doc),
        
        "Price":get_Book_price(tag),
        
        "Availability":get_Book_copies(Book_doc),
        
        "Book_url":Book_url,
        
        "Book Rating":get_Book_rating(tag)+" "+ "star",
        
        "Book_code":get_Book_code(Book_doc),
        
        "Book_Geners":get_Book_geners(Book_doc)

            }

**Collecting the relative link of the first book on each page, as used to fetch the book details**


In [24]:
Book_path=[tag[0].find_all('a')[0]['href'] for tag in all_tag_collection]
Book_path[:10]

['a-light-in-the-attic_1000/index.html',
 'in-her-wake_980/index.html',
 'slow-states-of-collapse-poems_960/index.html',
 'the-nameless-city-the-nameless-city-1_940/index.html',
 'princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html',
 'immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html',
 'algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html',
 'the-shadow-hero-the-shadow-hero_860/index.html',
 'the-bridge-to-consciousness-im-writing-the-bridge-between-science-and-our-old-and-new-beliefs_840/index.html',
 'modern-romance_820/index.html']

In [25]:
jovian.commit()

[jovian] Updating notebook "jhagautamkumar362/webscraping-fair" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jhagautamkumar362/webscraping-fair

'https://jovian.ai/jhagautamkumar362/webscraping-fair'

In [26]:
def parse_page(index):
    base_url="http://books.toscrape.com/catalogue/page-"
    url=base_url+str(index)+".html"
    doc=url_fetch_status(url)
    tags=article_tags(doc)
    all_info_Book=[main_function(tag) for tag in tags]
    
    return all_info_Book

    

In [27]:
parse_page(2)

[{'Title': 'In Her Wake',
  'Price': '£12.84',
  'Availability': 'In stock (19 available)',
  'Book_url': 'https://books.toscrape.com/catalogue/in-her-wake_980/index.html',
  'Book Rating': 'One star',
  'Book_code': '23356462d1320d61',
  'Book_Geners': 'Thriller'},
 {'Title': 'How Music Works',
  'Price': '£37.32',
  'Availability': 'In stock (19 available)',
  'Book_url': 'https://books.toscrape.com/catalogue/how-music-works_979/index.html',
  'Book Rating': 'Two star',
  'Book_code': '327f68a59745c102',
  'Book_Geners': 'Music'},
 {'Title': 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More',
  'Price': '£30.52',
  'Availability': 'In stock (19 available)',
  'Book_url': 'https://books.toscrape.com/catalogue/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-cond

## Passing the number of pages to be scraped and storing the data in Book_info

In [28]:
Book_info=[]
n=int(input("Total number of pages to be scraped:"))
if 1<=n<=50: # The number of pages must not exceed the 50 pages available
    for i in range(1,n+1): # Pages are numbered from 1, so include page n
        Book_info.append(parse_page(i))
else:
    raise Exception('Number of pages to be scraped must be between 1 and 50')

Total number of pages to be scraped:50


**An input close to 50 increases the run time, because every book's detail page is requested individually on each of the pages selected**
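
A full run requests every listing page plus one detail page per book, so it may be worth pausing briefly between pages to go easy on the server. The sketch below is one possible shape for such a loop, not the notebook's own method: `scrape_pages`, `delay_seconds`, and `fake_parse_page` are hypothetical names, and the parser is passed in as a parameter so the example runs without network access.

```python
import time


def scrape_pages(page_numbers, parse_page, delay_seconds=1.0):
    """Call parse_page for each page number, pausing between calls.

    Returns one flat list of book dictionaries.
    """
    all_books = []
    for position, page in enumerate(page_numbers):
        if position > 0:
            time.sleep(delay_seconds)  # pause between pages, not before the first
        all_books.extend(parse_page(page))
    return all_books


# Stand-in parser so the sketch can be run offline.
def fake_parse_page(index):
    return [{"Title": f"Book {index}-{n}"} for n in range(2)]


books = scrape_pages(range(1, 4), fake_parse_page, delay_seconds=0)
print(len(books))
```

Because the result is already a flat list, it can be handed to `pd.DataFrame` directly, without a nested comprehension.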

In [30]:
len(Book_info[1])

20


## Using Pandas to save the information into a DataFrame

In [32]:
df=pd.DataFrame([r for d in Book_info for r in d])



In [34]:
df.head() # Checking first 5 rows of output 

|  | Title | Price | Availability | Book_url | Book Rating | Book_code | Book_Geners |
|---|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock (22 available) | https://books.toscrape.com/catalogue/a-light-i... | Three star | a897fe39b1053632 | Poetry |
| 1 | Tipping the Velvet | £53.74 | In stock (20 available) | https://books.toscrape.com/catalogue/tipping-t... | One star | 90fa61229261140a | Historical Fiction |
| 2 | Soumission | £50.10 | In stock (20 available) | https://books.toscrape.com/catalogue/soumissio... | One star | 6957f44c3847a760 | Fiction |
| 3 | Sharp Objects | £47.82 | In stock (20 available) | https://books.toscrape.com/catalogue/sharp-obj... | Four star | e00eb4fd7b871a48 | Mystery |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock (20 available) | https://books.toscrape.com/catalogue/sapiens-a... | Five star | 4165285e1663650f | History |


**Converting the whole DataFrame to CSV and saving the file with column headers, using UTF-8 so that non-ASCII characters are preserved**

In [35]:
df.to_csv('Book_Scrapped_Info.csv', index = False, encoding='utf-8', header=True)
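
The `csv` module imported earlier can write the same kind of file without pandas. The sketch below is an illustration only: it uses two rows with a reduced set of columns, and an in-memory buffer stands in for a real file so that the result can be read back and checked straight away.

```python
import csv
import io

rows = [
    {"Title": "A Light in the Attic", "Price": "£51.77", "Book Rating": "Three star"},
    {"Title": "Tipping the Velvet", "Price": "£53.74", "Book Rating": "One star"},
]

# The buffer stands in for open("books.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary

# Read the data back to confirm that nothing was lost.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Title"])
```

`DictWriter` takes its column order from `fieldnames`, so the header row and the data rows stay aligned.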

# Summary
- The website was scraped using the Python libraries requests and BeautifulSoup
- 50 pages were scraped in total; each page contains 20 books, so 1000 books were scraped
- For each book, the information extracted is Book_title, Book_Price, Book_Rating, Book_Availability, Book_url, Book_code, and Book_Geners
- The information scraped from the different pages makes up 7 columns and 1000 rows
- Using a Pandas DataFrame, all the scraped information was converted into CSV format and saved as a .csv file

# Future Work
- As a continuation of this work, other information about each book, such as the author, the book content, the product description, and the number of reviews, could be extracted
- A better data structure could reduce the time taken to extract more information
- The information extracted into the CSV file can be used later for EDA and other operations
- The flow of the content could be restructured

#  References
- [Books to Scrape](https://books.toscrape.com/index.html)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
- [Requests](https://pypi.org/project/requests/)
- [Image storage](https://imgur.com/)
- [CSV writing and reading](https://docs.python.org/3/library/csv.html)
- [Pandas](https://pandas.pydata.org/)
- [Jupyter Notebook](https://jupyter.org/)
