# Scraping books.toscrape.com

1. Web scraping is the process of using tools to extract content and data from a website.

2. The tools used in this project are the Python programming language and the following Python libraries:
    - requests (to fetch a webpage as HTML over HTTP) https://docs.python-requests.org/en/latest/
    - Beautiful Soup 4 (to convert the HTML into a parsable form) https://beautiful-soup-4.readthedocs.io/en/latest/
    - Pandas (to build a table of the scraped information and write a CSV file) https://pandas.pydata.org/
    - Regular expressions, 're' (to match and clean up text) https://docs.python.org/3/library/re.html
    - os (to use operating-system functions such as creating directories and checking paths) https://docs.python.org/3/library/os.html
    - csv (to write the CSV file for each book) https://docs.python.org/3/library/csv.html

## Task Statement

  - To scrape 'https://books.toscrape.com' for the first 20 books on its homepage and to collect the details of each book
  
  - 'books.toscrape.com' is a free web scraping sandbox that helps people practise web scraping
  
  - We will output a CSV file containing the names of the first 20 books and a link to each book
  
  - Then, for each book, we will output a CSV file containing its UPC, Name, Genre, Stars, Price, Availability and Description; these files are stored in a separate folder

### The format of the CSV files:

   - for Books.csv:

        Name, URL
        

   - for the CSV file of each book:
   
        UPC, Name, Genre, Stars, Price, Availability, Description
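
To make the two layouts concrete, here is a minimal sketch that renders a header and rows with the standard-library `csv` module. The helper `render_csv` and the sample row are hypothetical and only illustrate the column order; they are not scraped data.

```python
import csv
import io

# Column layouts taken from the format description above.
BOOKS_HEADER = ["Name", "URL"]
DETAIL_HEADER = ["UPC", "Name", "Genre", "Stars", "Price", "Availability", "Description"]


def render_csv(header, rows):
    """Return CSV text made of one header row followed by the given rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row (hypothetical name and URL), used only to show the layout.
sample = render_csv(
    BOOKS_HEADER,
    [["Example Book", "https://books.toscrape.com/example.html"]],
)
print(sample)
```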
   

#### Scraping the homepage

In [1]:
def scrape_books(page_url):
    
    #get the page
    response=requests.get(page_url)
    
    #check the status of response
    if response.status_code != 200:
        raise Exception('Failed to load home page {}'.format(page_url))
    
    #convert into BS4 for parsing
    homepage_doc=BeautifulSoup(response.text,'html.parser')
    
    return homepage_doc

Returns the homepage in parsable form.

#### Getting the names of the first 20 books and their URLs

In [2]:
def get_bookurl(homepage_doc):
    
    #initializing lists
    book_names=[]
    book_webpage=[]
    
    #get tags of the books
    ttags=homepage_doc.find_all('h3')
    #get all the books' name and url
    for i in range(0,20):
        book_names.append(ttags[i].a['title'])
        book_webpage.append('https://books.toscrape.com/'+ttags[i].a['href'])
        
    return [book_names,book_webpage]

Returns a list holding the book names and their URLs.
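
The function above builds each URL by string concatenation, which works when every `href` is relative to the site root. As a hedged alternative, `urllib.parse.urljoin` resolves an `href` against the page it was found on. The helper `to_absolute` and the example paths below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"


def to_absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(base, href)


# An href found on the homepage (illustrative path).
print(to_absolute("catalogue/some-book_1/index.html"))

# The same kind of href found on a page one level deeper (illustrative path).
print(to_absolute("some-book_1/index.html",
                  base="https://books.toscrape.com/catalogue/page-2.html"))
```

Both calls print the same absolute URL, which plain concatenation with the site root would not do for the second case.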

#### Getting the page for each book

In [3]:
def get_page_details(book_url):
    
    #get the page
    res=requests.get(book_url)
    
    #check the status of response
    if res.status_code != 200:
        raise Exception('Failed to load page {}'.format(book_url))
    
    #convert into BS4 for parsing
    book_page_doc=BeautifulSoup(res.text,'html.parser')
    
    return book_page_doc

Returns the book page in parsable form.

#### Getting the book details

In [4]:
def get_book_details(page_doc):
    
    # get the name, stars, price and availability
    tag=page_doc.find_all('div',{'class':"col-sm-6 product_main"})
    Name=tag[0].h1.text.strip()
    Stars=tag[0].find_all('p')[2]['class'][1]
    Price=tag[0].find('p',{'class':'price_color'}).text[1:]
    Availability=tag[0].find_all('p',{'class':'instock'})[0].text.strip()
    
    #get genre
    gen=page_doc.find_all('ul',{'class':'breadcrumb'})
    Genre=gen[0].find_all('a')[2].text.strip()
    
    #get description
    p_tags=page_doc.find_all('p')
    Description=p_tags[3].text.strip()
    
    #get UPC
    details=page_doc.find_all('table',{'class':'table table-striped'})
    UPC=details[0].find_all('tr')[0].text.strip()[3:]
    
    return [UPC,Name,Genre,Stars,Price,Availability,Description]

Returns the book details as a list in UPC, Name, Genre, Stars, Price, Availability, Description order.
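
The scraped fields are all strings. If numeric values are wanted later, a small normalisation step could follow. The sketch below is not part of the original code. It assumes the star rating arrives as an English word such as `Three`, that the price text contains one decimal number after any currency symbols, and that the availability text looks like `In stock (22 available)`. All three helper names are hypothetical.

```python
import re

# Assumed mapping from the rating word in the CSS class to a number.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_stars(word):
    """Return the rating as an int, or None for an unknown word."""
    return STAR_WORDS.get(word)


def parse_price(text):
    """Return the first decimal number in the text, ignoring currency symbols."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def parse_stock(text):
    """Return the number of copies available, or None if no count is given."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else None


print(parse_stars("Three"))
print(parse_price("£51.77"))
print(parse_stock("In stock (22 available)"))
```

Searching for the number instead of slicing off the first character keeps the price parser working even if the currency symbol is decoded as more than one character.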

#### Writing all the details into CSV file

In [5]:
def write_to_csv(url):

    #get the parser page
    doc_parse=get_page_details(url)
    
    #get all the details in UPC, Name, Genre, Stars, Price, Availability, Description order
    detail_list=get_book_details(doc_parse)
    
    #build the file name from the book name by
    #removing the ':' in the name as it causes an error when writing to the csv file
    N=' '.join(re.findall(r"[^: ]+",detail_list[1]))
    
    #creating directory to store data
    os.makedirs('Data',exist_ok=True)
    
    #checking if file already exists in case of failures
    fname='Data'+'/'+N+'.csv'
    if os.path.exists(fname):
        print("The file {} already exists....Skipping".format(fname))
        return
    
    #writing to CSV; the with block closes the file automatically
    with open(fname,mode='w',newline='',encoding='UTF-8') as fto:
        csv_writer=csv.writer(fto,delimiter=',')
        #detail_list is already in header order
        csv_writer.writerows([['UPC', 'Name', 'Genre', 'Stars', 'Price', 'Availability', 'Description'],detail_list])

Writes all the information about each book to a CSV file.
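
The function above strips only `:` from the book name. Other characters, such as `/` or `?`, can also be rejected in file names on some systems. A broader clean-up could be sketched as below; the helper `safe_filename` and its character set are assumptions, not part of the original code.

```python
import re


def safe_filename(title):
    """Replace characters that are commonly invalid in file names with spaces."""
    # Assumed set of problem characters: \ / : * ? " < > |
    cleaned = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse runs of whitespace left behind by the replacement.
    return " ".join(cleaned.split())


print(safe_filename("Sapiens: A Brief History of Humankind"))
```

For titles that contain only `:` as a problem character, this gives the same result as the regular expression used in `write_to_csv`.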

#### The driver function to run the program

In [6]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import re
import os
def driver():
    
    url='https://books.toscrape.com/'
    i=scrape_books(url)
    the_lists=get_bookurl(i)
    
    #create dataframe using pandas
    dict1={'Name':the_lists[0],"URL":the_lists[1]}
    df=pd.DataFrame(dict1,index=list(range(1,21)))
    
    #create csv
    if os.path.exists('Books.csv'):
        print("The file {} already exists.....Skipping".format('Books.csv'))
    else:
        print("Creating Books.csv .....")
        df.to_csv('Books.csv',index=None)
    
    #Scraping the book webpages
    for i in range(0,20):
        print("Scraping {}".format(the_lists[0][i]))
        write_to_csv(the_lists[1][i])
        
    print("Done!!!")

Drives the code by creating Books.csv and the CSV file for each book.

#### Run the driver to begin scraping

In [7]:
driver()

Creating Books.csv .....
Scraping A Light in the Attic
Scraping Tipping the Velvet
Scraping Soumission
Scraping Sharp Objects
Scraping Sapiens: A Brief History of Humankind
Scraping The Requiem Red
Scraping The Dirty Little Secrets of Getting Your Dream Job
Scraping The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Scraping The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Scraping The Black Maria
Scraping Starving Hearts (Triangular Trade Trilogy, #1)
Scraping Shakespeare's Sonnets
Scraping Set Me Free
Scraping Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Scraping Rip it Up and Start Again
Scraping Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Scraping Olio
Scraping Mesaerion: The Best Science Fiction Stories 1800-1849
Scraping Libertarianism for Beginners
Scraping It's Only the Himalayas
Done!!!
