# Web Scraping Project 1


Scrape multiple pages and multiple kinds of content from a website.


Website: https://toscrape.com/ 

- Made for practicing web scraping skills

## Objectives


### Pick the website to scrape

* Book Store : https://books.toscrape.com/ 

### What content we'll scrape

* Every field on a book's page that is needed to build a dataset

For example:

In [1]:
import pandas as pd

file = pd.read_csv("sample-books.csv")
file.head()

Unnamed: 0,Genre,Name,Price,Availability,Rating,UPC,Product Type,Price (excl. Tax),Price (incl. Tax),Tax,Number of Reviews
0,Poetry,A Light in the Attic,£51.77,In stock (22 available),3,a897fe39b1053632,Books,£51.77,£51.77,£0.0,0
1,Historical Fiction,Tipping the Velvet,£53.74,In stock (20 available),1,90fa61229261140a,Books,£53.74,£53.74,£0.0,0


## Workflow Experiment

### Imports

In [2]:
import bs4
import requests
import lxml

### Testing with the main page

In [6]:
# Try to open the main page

# General URL
url = "https://books.toscrape.com/catalogue/page-{}.html"

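# Request the first page ("lxml" is a parser name, so it belongs to BeautifulSoup, not requests.get)
page = requests.get(url.format(1))

# Make it a soup; parsing the raw bytes lets the parser use the page's declared UTF-8 charset
soup = bs4.BeautifulSoup(page.content, "lxml")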
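
The request-then-parse step is repeated for every page, so it can be wrapped in a small helper. The sketch below is one possible shape, not the notebook's own code: `parse_html` and `fetch_soup` are hypothetical names, and the 10-second timeout is an arbitrary assumption. The offline check uses a hand-written snippet rather than a page fetched from the site.

```python
import bs4
import requests


def parse_html(raw: bytes) -> bs4.BeautifulSoup:
    """Parse raw response bytes with the lxml parser."""
    # Passing bytes lets Beautiful Soup honour the charset declared in the page.
    return bs4.BeautifulSoup(raw, "lxml")


def fetch_soup(url: str, timeout: float = 10.0) -> bs4.BeautifulSoup:
    """Download `url` and return it as a soup (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return parse_html(response.content)


# Offline check on a tiny hand-written page (not fetched from the site)
sample = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p class="price_color">£51.77</p></body></html>'
)
demo_soup = parse_html(sample.encode("utf-8"))
print(demo_soup.select_one(".price_color").text)
```

With such a helper, `soup = fetch_soup(url.format(1))` would replace the two separate calls.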

In [66]:
# soup

### Getting Useful Information

In [13]:
# Try to get all the list of genres

genres = soup.select(".side_categories a")

In [21]:
genre_list = [genre.text.strip() for genre in genres]

In [22]:
genre_list

['Books',
 'Travel',
 'Mystery',
 'Historical Fiction',
 'Sequential Art',
 'Classics',
 'Philosophy',
 'Romance',
 'Womens Fiction',
 'Fiction',
 'Childrens',
 'Religion',
 'Nonfiction',
 'Music',
 'Default',
 'Science Fiction',
 'Sports and Games',
 'Add a comment',
 'Fantasy',
 'New Adult',
 'Young Adult',
 'Science',
 'Poetry',
 'Paranormal',
 'Art',
 'Psychology',
 'Autobiography',
 'Parenting',
 'Adult Fiction',
 'Humor',
 'Horror',
 'History',
 'Food and Drink',
 'Christian Fiction',
 'Business',
 'Biography',
 'Thriller',
 'Contemporary',
 'Spirituality',
 'Academic',
 'Self Help',
 'Historical',
 'Christian',
 'Suspense',
 'Short Stories',
 'Novels',
 'Health',
 'Politics',
 'Cultural',
 'Erotica',
 'Crime']

In [29]:
# Try to get all the links of books

book_links = []

for section in soup.select(".product_pod"):
    link = "https://books.toscrape.com/catalogue/" + section.select('a')[0]['href']
    book_links.append(link)

In [30]:
book_links

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

In [31]:
# Try to get the information from the first link


# Request the first book's page
page = requests.get(book_links[0])

# Make it a soup; parsing the raw bytes lets the parser use the page's declared UTF-8 charset
soup = bs4.BeautifulSoup(page.content, "lxml")

In [32]:
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    A Light in the Attic | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now

### Test Case: For the First Book

In [63]:
# Get the genre of the book

genre = soup.select(".breadcrumb")[0].select("a")[2].getText()

# Get the name of the book

title = soup.select(".breadcrumb")[0].select(".active")[0].getText()

# Get the rating of the book

rating = soup.select(".col-sm-6")[1].select('p')[2]['class'][1]

# Get the rest of the information

contents = soup.select(".table.table-striped")[0].select('td')
upc = contents[0].text
prod_type = contents[1].text
price_exc = contents[2].text
price_inc = contents[3].text
tax = contents[4].text
availability = contents[5].text
review = contents[6].text
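
Every value extracted above is a string: the rating is a word such as `Three`, the prices carry a currency sign, and the stock count sits inside a sentence. A minimal sketch of turning them into numbers follows; the function names are hypothetical and the regular expressions assume the text formats shown in this notebook.

```python
import re

# Rating words as they appear in the star-rating class
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word: str) -> int:
    """Map a rating word such as 'Three' to its number."""
    return RATING_WORDS[word]


def price_to_float(text: str) -> float:
    """Extract the numeric part of a price such as '£51.77'."""
    # Searching for the number skips any currency sign or stray bytes before it.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


def stock_count(text: str) -> int:
    """Extract 22 from 'In stock (22 available)'; 0 if no count is present."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0


print(rating_to_int("Three"))
print(price_to_float("£51.77"))
print(stock_count("In stock (22 available)"))
```

Converting at scrape time is optional; the same functions could also be applied to the DataFrame columns afterwards.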

### Make Categories to store data

In [65]:
scrape_dict = {
    "Genre" : [],
    "Title" : [],
    "Rating" : [],
    "UPC Code" : [],
    "Product Type" : [],
    "Price (excl. tax)" : [],
    "Price (incl. tax)" : [],
    "Tax" : [],
    "Availability" : [],
    "Number of reviews" : [],
}

In [68]:
scrape_dict["Genre"].append(genre)

In [69]:
scrape_dict["Genre"]

['Poetry']
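
Appending to each list separately works, but the columns can drift out of step if one append is forgotten. As a sketch of one alternative, the hypothetical helper `add_record` below appends a whole book at once and refuses records with missing fields. The sample values are taken from the first row of the sample dataset shown earlier.

```python
FIELDS = [
    "Genre", "Title", "Rating", "UPC Code", "Product Type",
    "Price (excl. tax)", "Price (incl. tax)", "Tax",
    "Availability", "Number of reviews",
]


def new_scrape_dict() -> dict:
    """Create one empty list per column."""
    return {field: [] for field in FIELDS}


def add_record(scrape_dict: dict, record: dict) -> None:
    """Append one book to every column, or raise if a field is missing."""
    missing = [field for field in scrape_dict if field not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    for field, column in scrape_dict.items():
        column.append(record[field])


books = new_scrape_dict()
add_record(books, {
    "Genre": "Poetry",
    "Title": "A Light in the Attic",
    "Rating": "Three",
    "UPC Code": "a897fe39b1053632",
    "Product Type": "Books",
    "Price (excl. tax)": "£51.77",
    "Price (incl. tax)": "£51.77",
    "Tax": "£0.00",
    "Availability": "In stock (22 available)",
    "Number of reviews": "0",
})
print(books["Genre"])
```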

### Helper Functions

In [67]:
# Function 1
## For the links of books

def get_book_links(soup):
    book_links = []
    for section in soup.select(".product_pod"):
        link = "https://books.toscrape.com/catalogue/" + section.select('a')[0]['href']
        book_links.append(link)
    return book_links
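
The hard-coded `"https://books.toscrape.com/catalogue/"` prefix only works when every `href` is relative to that folder. A more general sketch resolves each `href` against the URL of the page it was found on, using `urllib.parse.urljoin` from the standard library. The `absolute_link` name is hypothetical, and the second `href` below is an assumed example of a link as it might appear on the site's home page.

```python
from urllib.parse import urljoin


def absolute_link(page_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against the page it came from."""
    return urljoin(page_url, href)


# href relative to the catalogue folder, as on the paginated listing pages
catalogue_page = "https://books.toscrape.com/catalogue/page-1.html"
link_a = absolute_link(catalogue_page, "a-light-in-the-attic_1000/index.html")

# assumed example: an href that already contains the "catalogue/" folder
home_page = "https://books.toscrape.com/index.html"
link_b = absolute_link(home_page, "catalogue/a-light-in-the-attic_1000/index.html")

print(link_a)
print(link_b)
```

Both calls resolve to the same book URL, so the helper does not need to know which kind of page the link came from.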

## Final Setup For Scraping

In [73]:
# Make an empty dictionary

scrape_dict = {
    "Genre" : [],
    "Title" : [],
    "Rating" : [],
    "UPC Code" : [],
    "Product Type" : [],
    "Price (excl. tax)" : [],
    "Price (incl. tax)" : [],
    "Tax" : [],
    "Availability" : [],
    "Number of reviews" : []}

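# Loop through all 50 catalogue pages
for n in range(1,51):
    
    # General URL
    url = "https://books.toscrape.com/catalogue/page-{}.html"
    # Request for the URL ("lxml" is a parser name, so it is passed to BeautifulSoup instead)
    page = requests.get(url.format(n))
    # Make it a soup; the raw bytes let the parser use the page's declared UTF-8 charset
    soup = bs4.BeautifulSoup(page.content, "lxml")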
    
    
    # Get the links for the books
    links = get_book_links(soup)
    
    for link in links:
        # Request for the URL
        page = requests.get(link)
        # Make it a soup; the raw bytes keep the "£" in the prices intact
        soup = bs4.BeautifulSoup(page.content, "lxml")
        
        
        # Get all the information
        
        genre = soup.select(".breadcrumb")[0].select("a")[2].getText() # genre of the book
        title = soup.select(".breadcrumb")[0].select(".active")[0].getText() # name of the book
        rating = soup.select(".col-sm-6")[1].select('p')[2]['class'][1] # rating of the book
        # Get the rest of the information
        contents = soup.select(".table.table-striped")[0].select('td')
        upc = contents[0].text
        prod_type = contents[1].text
        price_exc = contents[2].text
        price_inc = contents[3].text
        tax = contents[4].text
        availability = contents[5].text
        review = contents[6].text
        
        
        # Add information to the dictionary
        scrape_dict["Genre"].append(genre)
        scrape_dict["Title"].append(title)
        scrape_dict["Rating"].append(rating)
        scrape_dict["UPC Code"].append(upc)
        scrape_dict["Product Type"].append(prod_type)
        scrape_dict["Price (excl. tax)"].append(price_exc)
        scrape_dict["Price (incl. tax)"].append(price_inc)
        scrape_dict["Tax"].append(tax)
        scrape_dict["Availability"].append(availability)
        scrape_dict["Number of reviews"].append(review)
        
df = pd.DataFrame(scrape_dict)
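
The per-book extraction inside the loop can also live in its own function, which makes it testable without any network access. The sketch below is a hypothetical refactor, not the notebook's code: `parse_book` is an invented name, it selects the rating with `p.star-rating` instead of the positional `.col-sm-6` lookup (an assumption about the page's markup), and the HTML is a hand-written stand-in for a real product page.

```python
import bs4


def parse_book(soup: bs4.BeautifulSoup) -> dict:
    """Extract one book's fields from its product page (hypothetical helper)."""
    breadcrumb = soup.select_one(".breadcrumb")
    cells = [td.get_text(strip=True) for td in soup.select(".table.table-striped td")]
    return {
        "Genre": breadcrumb.select("a")[2].get_text(strip=True),
        "Title": breadcrumb.select_one(".active").get_text(strip=True),
        # class is ["star-rating", "<word>"]; the second entry is the rating word
        "Rating": soup.select_one("p.star-rating")["class"][1],
        "UPC Code": cells[0],
        "Product Type": cells[1],
        "Price (excl. tax)": cells[2],
        "Price (incl. tax)": cells[3],
        "Tax": cells[4],
        "Availability": cells[5],
        "Number of reviews": cells[6],
    }


# Hand-written stand-in for a product page (assumed structure, not fetched)
SAMPLE_HTML = """
<ul class="breadcrumb">
  <li><a href="#">Home</a></li>
  <li><a href="#">Books</a></li>
  <li><a href="#">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
<p class="star-rating Three"></p>
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

book = parse_book(bs4.BeautifulSoup(SAMPLE_HTML, "lxml"))
print(book["Genre"], book["Title"], book["Rating"])
```

Collecting the returned dictionaries in a list and passing that list to `pd.DataFrame` would give the same table as the dictionary of lists.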

### Save the data

In [74]:
df.to_csv("scraped-data.csv")

In [76]:
df.tail(10)

Unnamed: 0,Genre,Title,Rating,UPC Code,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews
990,Womens Fiction,Bridget Jones's Diary (Bridget Jones #1),One,a2b4b685dfa94733,Books,£29.82,£29.82,£0.00,In stock (1 available),0
991,Romance,Bounty (Colorado Mountain #7),Four,abc0b15f2c907ff0,Books,£37.26,£37.26,£0.00,In stock (1 available),0
992,Mystery,Blood Defense (Samantha Brinkman #1),Three,95cdfd514098c38b,Books,£20.30,£20.30,£0.00,In stock (1 available),0
993,Sequential Art,"Bleach, Vol. 1: Strawberry and the Soul Reaper...",Five,099fae4a0705d63b,Books,£34.65,£34.65,£0.00,In stock (1 available),0
994,Philosophy,Beyond Good and Evil,One,08672cd59171d5e4,Books,£43.38,£43.38,£0.00,In stock (1 available),0
995,Classics,Alice in Wonderland (Alice's Adventures in Won...,One,cd2a2a70dd5d176d,Books,£55.53,£55.53,£0.00,In stock (1 available),0
996,Sequential Art,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",Four,bfd5e1701c862ac3,Books,£57.06,£57.06,£0.00,In stock (1 available),0
997,Historical Fiction,A Spy's Devotion (The Regency Spies of London #1),Five,19fec36a1dfb4c16,Books,£16.97,£16.97,£0.00,In stock (1 available),0
998,Mystery,1st to Die (Women's Murder Club #1),One,f684a82adc49f011,Books,£53.98,£53.98,£0.00,In stock (1 available),0
999,Travel,"1,000 Places to See Before You Die",Five,228ba5e7577e1d49,Books,£26.08,£26.08,£0.00,In stock (1 available),0
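
A note on the price columns: the `£` sign takes two bytes in UTF-8. If those bytes are decoded as Latin-1, which `requests` can fall back to for `page.text` when the response headers name no charset, the sign shows up as `Â£`. The sketch below reproduces that effect with plain strings and shows one way to repair text that is already garbled; it is an illustration, not part of the scraper.

```python
pound = "£"

# UTF-8 stores the pound sign as two bytes
utf8_bytes = pound.encode("utf-8")

# Decoding those bytes with the wrong codec yields two characters
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong decode and decoding correctly restores the sign
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes, garbled, repaired)
```

Parsing `page.content` (bytes) instead of `page.text`, or setting `page.encoding = "utf-8"` before reading `page.text`, avoids the problem at the source.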
