# Scraping quotes.toscrape.com
  
 1. Web scraping is the process of using tools to extract content and data from a website.

 2. The tools used in this project are the Python programming language and the following libraries:
     - requests (to fetch a webpage's HTML over HTTP) https://docs.python-requests.org/en/latest/
     - Beautiful Soup 4 (to parse the HTML into a searchable document) https://beautiful-soup-4.readthedocs.io/en/latest/
     - pandas (to tabulate the scraped information and write CSV files) https://pandas.pydata.org/
     - Regular expressions, 're' (to match and extract patterns in text) https://docs.python.org/3/library/re.html
     - os (to create directories and check whether files exist) https://docs.python.org/3/library/os.html

## Task Statement

  - To scrape 'https://quotes.toscrape.com' for its top ten quote tags and to get the top 10 quotes for each tag
  
  - 'quotes.toscrape.com' is a free web scraping sandbox for practising web scraping
  
  - We will output a CSV file containing the top quote tags and the link to each tag's page of quotes
  
  - Then, for each quote tag, we will output a CSV file containing the top 10 quotes along with the author's name, a link to the author's "about" page, and the other tags on the quote; these files are stored in a separate folder

### Format of the CSV files

   - For Quotes.csv:

        Quote Tag, URL
        

   - For the CSV file of each tag:
   
        Quote, Author, About the Author, Other Tags
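
As a rough illustration of the two layouts, the sketch below writes one header and one row for each file into an in-memory buffer. It uses the standard-library `csv` module rather than pandas, the helper name `render_csv` is hypothetical, and the row values are placeholders, not scraped data.

```python
import csv
import io

# Column layouts described above
QUOTES_COLUMNS = ["Quote Tag", "URL"]
TAG_COLUMNS = ["Quote", "Author", "About the Author", "Other Tags"]


def render_csv(columns, rows):
    """Return CSV text for the given header and rows (in memory, no file written)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, for illustration only
quotes_csv = render_csv(
    QUOTES_COLUMNS,
    [["example-tag", "https://quotes.toscrape.com/tag/example-tag/"]],
)
tag_csv = render_csv(
    TAG_COLUMNS,
    [[
        "A quote, with a comma",
        "Some Author",
        "https://quotes.toscrape.com/author/Some-Author",
        "tag-one:tag-two",
    ]],
)

print(quotes_csv)
print(tag_csv)
```

A field that contains a comma, such as the placeholder quote, is wrapped in double quotes by the writer so that it still occupies a single column.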
   

#### Scraping the homepage

In [1]:
def scrape_quotes(page_url):
    
    #get the page
    response=requests.get(page_url)
    
    #check the status of response
    if response.status_code != 200:
        raise Exception('Failed to load home page {}'.format(page_url))
    
    #convert into BS4 for parsing
    homepage_doc=BeautifulSoup(response.text,'html.parser')
    
    return homepage_doc

Returns the homepage as a parsed BeautifulSoup document.

#### Getting the names of the top quote tags and their URLs

In [2]:
def get_quoteurl(homepage_doc):
    
    #initializing lists
    quote_names=[]
    quote_webpage=[]
    
    #get the elements that hold the top tags
    ttags=homepage_doc.find_all('span',{'class':'tag-item'})
    #get all the quote tags and urls
    for tag in ttags:
        quote_names.append(tag.text.strip())
        #the href already starts with '/', so the base URL must not end with one
        quote_webpage.append('https://quotes.toscrape.com'+tag.a['href'])
        
    return [quote_names,quote_webpage]

Returns a list of quote tags and a list of their URLs.
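
Building an absolute URL by string concatenation depends on whether the base ends with a slash and the link starts with one. The sketch below compares concatenation with the standard-library `urllib.parse.urljoin`. The `href` value is an assumed example of a relative link, not taken from the scraped page.

```python
from urllib.parse import urljoin

BASE_URL = "https://quotes.toscrape.com/"

# Assumed example of a relative link as it might appear in a tag's href
href = "/tag/love/"

concatenated = BASE_URL + href      # keeps both slashes
joined = urljoin(BASE_URL, href)    # resolves the path against the base

print(concatenated)
print(joined)
```

With these inputs, concatenation yields `https://quotes.toscrape.com//tag/love/`, while `urljoin` yields `https://quotes.toscrape.com/tag/love/`.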

#### Getting the page for each quote tag

In [3]:
def get_page_details(quote_url):
    
    #get the page
    res=requests.get(quote_url)
    
    #check the status of response
    if res.status_code != 200:
        raise Exception('Failed to load page {}'.format(quote_url))
    
    #convert into BS4 for parsing
    quote_page_doc=BeautifulSoup(res.text,'html.parser')
    
    return quote_page_doc

Returns the quote tag page as a parsed BeautifulSoup document.

#### Getting the top quotes and other details

In [4]:
def get_quote_details(quote_doc):
    
    #get all the quotes
    quo_tags=quote_doc.find_all('div',{'class':'quote'})
    
    #initializing lists
    quotes=[]
    auth_name=[]
    about_auth=[]
    tags=[]
    
    #getting details
    for quo in quo_tags:
        quotes.append(quo.find('span',{'class':'text'}).text.strip())
        #the second span holds the author name and the link to the author page
        author_span=quo.find_all('span')[1]
        auth_name.append(author_span.small.text.strip())
        #the href already starts with '/', so the base URL must not end with one
        about_auth.append('https://quotes.toscrape.com'+author_span.a['href'])
        #replacing ',' with ':' in between tags
        t=':'.join(re.findall(r'[^, ]+',quo.find('div',{'class':'tags'}).meta['content']))
        tags.append(t)
        
    return [quotes,auth_name,about_auth,tags]

Returns four parallel lists: the top quotes, the author names, the author-page URLs, and the colon-separated tags.
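
The tag handling in `get_quote_details` can be tried in isolation. This is a minimal sketch: the helper name `join_tags` is hypothetical, and the sample string is an assumed example of the comma-separated keywords held in the `content` attribute.

```python
import re


def join_tags(content):
    """Split on commas and spaces, then rejoin the pieces with ':'."""
    return ':'.join(re.findall(r'[^, ]+', content))


# Assumed example of a comma-separated keywords string
example = "change,deep-thoughts,thinking,world"
joined_tags = join_tags(example)

print(joined_tags)
```

The pattern `[^, ]+` matches each run of characters that is neither a comma nor a space, so hyphenated tags such as `deep-thoughts` stay intact and an empty string yields an empty result.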

#### Writing all details to csv file using pandas

In [5]:
def write_to_csv(url,quote_name):

    #get the parser page
    doc_parse=get_page_details(url)
    
    #get all the details in Quote, Author, About the Author, Other Tags order
    detail_list=get_quote_details(doc_parse)
    
    #creating directory to store data
    os.makedirs('Data',exist_ok=True)
    
    #checking if file already exists in case of failures
    fname='Data'+'/'+quote_name+'.csv'
    if os.path.exists(fname):
        print("The file {} already exists....Skipping".format(fname))
        return
    
    #create dataframe using pandas
    dict2={'Quote':detail_list[0],"Author":detail_list[1],'About the Author':detail_list[2],'Other Tags':detail_list[3]}
    df=pd.DataFrame(dict2)
    
    #creating CSV file
    print("Creating {} .....".format(fname))
    df.to_csv(fname,index=None)

Writes the details of the quotes under one tag to a CSV file in the Data folder.
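
The skip-if-exists step makes a rerun after a failure safe, because files written earlier are left alone. The sketch below shows that pattern on its own with the standard library. The helper name `write_once` and the file contents are hypothetical, and a temporary directory stands in for the working folder.

```python
import os
import tempfile


def write_once(path, text):
    """Write text to path unless the file already exists; return True if written."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "Data", "example.csv")
    first = write_once(target, "Quote,Author\n")
    second = write_once(target, "this text is never written\n")
    with open(target, encoding="utf-8") as handle:
        saved = handle.read()

print(first, second)
```

The second call returns without touching the file. In `write_to_csv` the existence check runs after the page has been fetched; performing it first would also avoid the network request for tags that are already saved.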

### The driver function to run the program

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

def driver():

    url='https://quotes.toscrape.com/'
    homepage_doc=scrape_quotes(url)
    the_lists=get_quoteurl(homepage_doc)
    
    #create dataframe using pandas
    dict1={'Quote Tag':the_lists[0],"URL":the_lists[1]}
    df=pd.DataFrame(dict1,index=list(range(1,11)))

    #create csv
    if os.path.exists('Quotes.csv'):
        print("The file {} already exists.....Skipping".format('Quotes.csv'))
    else:
        print("Creating Quotes.csv .....")
        df.to_csv('Quotes.csv',index=None)
    
    #Scraping the quote webpages
    for i in range(0,10):
        print("Scraping {}".format(the_lists[0][i]))
        write_to_csv(the_lists[1][i],the_lists[0][i])
        
    print("Done!!!")

Drives the program by creating Quotes.csv and then calling write_to_csv for each quote tag.
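
The driver loops over a fixed range of 10 indices. One possible alternative, sketched below, pairs the two lists with `zip` so the loop covers however many tags were found. The helper name `pair_tags_with_urls` is hypothetical, and the lists are placeholders standing in for the two lists returned by `get_quoteurl`.

```python
def pair_tags_with_urls(names, urls):
    """Pair each tag name with its URL, stopping at the shorter list."""
    return list(zip(names, urls))


# Placeholder values standing in for the two lists returned by get_quoteurl
names = ["love", "life"]
urls = [
    "https://quotes.toscrape.com/tag/love/",
    "https://quotes.toscrape.com/tag/life/",
]

pairs = pair_tags_with_urls(names, urls)
for name, url in pairs:
    print("Scraping {} from {}".format(name, url))
```

Iterating over pairs avoids an `IndexError` if the homepage ever lists fewer than ten tags.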

#### Run the driver to begin scraping

In [7]:
driver()

Creating Quotes.csv .....
Scraping love
Creating Data/love.csv .....
Scraping inspirational
Creating Data/inspirational.csv .....
Scraping life
Creating Data/life.csv .....
Scraping humor
Creating Data/humor.csv .....
Scraping books
Creating Data/books.csv .....
Scraping reading
Creating Data/reading.csv .....
Scraping friendship
Creating Data/friendship.csv .....
Scraping friends
Creating Data/friends.csv .....
Scraping truth
Creating Data/truth.csv .....
Scraping simile
Creating Data/simile.csv .....
Done!!!
