# SIMPLE WEB SCRAPING PROJECT

This small project is practice in web scraping, using a website built for scraping practice.

Link to the website: http://books.toscrape.com/


### What This Notebook Shows:
1. Simple review and use of BeautifulSoup
2. Simple use of Pandas


In [1]:
# Import libraries, etc...
import urllib3
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

In [2]:
# Define the webpage to scrape
prefix = 'http://books.toscrape.com/'

In [3]:
# Define a function to request and parse an HTML web page
def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup

soup = getAndParseURL(prefix) 

In [4]:
# Find all <a> tags (hyperlinks) within the main page
book_links = soup.find_all('a')

# Go through all <a> tags on the main page and build the full link for each
links = []
for a in book_links:
    links.append(prefix + a["href"])

# Convert all links on the main page into dataframe, drop duplicates, reset index
links_df = pd.DataFrame(links, columns = ['Links'])
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
links_df = links_df.drop_duplicates(subset = 'Links', keep = 'first')
links_df = links_df.reset_index(drop=True)
links_df

Unnamed: 0,Links
0,http://books.toscrape.com/index.html
1,http://books.toscrape.com/catalogue/category/books_1/index.html
2,http://books.toscrape.com/catalogue/category/books/travel_2/index.html
3,http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
4,http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
5,http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
6,http://books.toscrape.com/catalogue/category/books/classics_6/index.html
7,http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html
8,http://books.toscrape.com/catalogue/category/books/romance_8/index.html
9,http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html


In [5]:
# Get the relative URL in the href attribute of the first book on the page
soup.find("article", class_ = "product_pod").div.a.get('href')

'catalogue/a-light-in-the-attic_1000/index.html'

In [6]:
# Web scraping all the books on the main page:
main_page_products_urls = [x.div.a.get('href') for x in soup.find_all("article", class_ = "product_pod")]

print(str(len(main_page_products_urls)) + " products URLs scraped on the main page:")
main_page_products_urls

20 products URLs scraped on the main page:


['catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/tipping-the-velvet_999/index.html',
 'catalogue/soumission_998/index.html',
 'catalogue/sharp-objects_997/index.html',
 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'catalogue/the-requiem-red_995/index.html',
 'catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'catalogue/the-black-maria_991/index.html',
 'catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
 'catalogue/shakespeares-sonnets_989/index.html',
 'catalogue/set-me-free_988/index.html',
 'catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 'catalogue/rip-it-up-and-start-again_986/index.html',
 'catalogue/our-band-could-be-your-life-scene

In [7]:
# Web scraping all the categories on the main page:
categories_urls = [prefix + x.get('href') for x in soup.find_all("a", href= re.compile("catalogue/category/books"))]
categories_urls = categories_urls[1:] # Remove first as it corresponds to all books

print(str(len(categories_urls)) + " categories URLs scraped:")
categories_urls

50 categories URLs scraped:


['http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'http://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
 'http://books.toscrape.com/catalogue/category/boo

In [8]:
# Web scraping all the books from the website:

# Store all the results into a list
pages_urls = [prefix]

soup = getAndParseURL(pages_urls[0])

# When we get 2 matches, the page contains both a 'previous' and a 'next' button
# When there is only one button, we are on the first or the last page
# The while loop stops on the last page

while len(soup.find_all("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1:
    
    # Build the new URL: drop everything after the last "/" of the current URL and append the href of the 'next' link
    new_url = "/".join(pages_urls[-1].split("/")[:-1]) + "/" + soup.find_all("a", href = re.compile("page"))[-1].get("href")
    
    # add the URL to the list
    pages_urls.append(new_url)
    
    # parse the next page
    soup = getAndParseURL(new_url)
    

print(str(len(pages_urls)) + " scraped URLs:")
pages_urls

50 scraped URLs:


['http://books.toscrape.com/',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html',
 'http://books.toscrape.com/catalogue/page-6.html',
 'http://books.toscrape.com/catalogue/page-7.html',
 'http://books.toscrape.com/catalogue/page-8.html',
 'http://books.toscrape.com/catalogue/page-9.html',
 'http://books.toscrape.com/catalogue/page-10.html',
 'http://books.toscrape.com/catalogue/page-11.html',
 'http://books.toscrape.com/catalogue/page-12.html',
 'http://books.toscrape.com/catalogue/page-13.html',
 'http://books.toscrape.com/catalogue/page-14.html',
 'http://books.toscrape.com/catalogue/page-15.html',
 'http://books.toscrape.com/catalogue/page-16.html',
 'http://books.toscrape.com/catalogue/page-17.html',
 'http://books.toscrape.com/catalogue/page-18.html',
 'http://books.toscrape.com/catalogue/page-19.html',
 'http://books.toscrape
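
The loop above builds each new URL by splitting and re-joining strings. A possible alternative, shown here only as a sketch, is `urljoin` from the standard library, which resolves a relative `href` against the page it was found on. The two `href` values below are assumptions written in the shape the 'next' links appear to take, not values read from the site in this example.

```python
from urllib.parse import urljoin

# Hypothetical hrefs of the 'next' link, relative to the page containing them.
start_page = "http://books.toscrape.com/"
second_page = urljoin(start_page, "catalogue/page-2.html")
third_page = urljoin(second_page, "page-3.html")

print(second_page)
print(third_page)
```

Because `urljoin` replaces the last path segment of the base URL, the same call works for the main page and for the catalogue pages.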

In [9]:
# Print the status of the last page:
result = requests.get("http://books.toscrape.com/catalogue/page-50.html")
print("status code for page 50: " + str(result.status_code))

status code for page 50: 200


In [10]:
# Print the status of a page that doesn't exist:
result = requests.get("http://books.toscrape.com/catalogue/page-51.html")
print("status code for page 51: " + str(result.status_code))

status code for page 51: 404
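
The two status checks above suggest another stopping rule for pagination: request `page-N.html` for increasing N until the answer is no longer 200. The sketch below illustrates that idea without touching the network. `count_catalogue_pages` and `fake_status` are hypothetical helpers, and the sketch assumes that `page-1.html` exists alongside the other pages.

```python
def count_catalogue_pages(get_status, max_pages=1000):
    """Return how many consecutive catalogue pages answer with HTTP 200.

    get_status is any callable that takes a URL and returns a status code.
    """
    base = "http://books.toscrape.com/catalogue/page-{}.html"
    count = 0
    for number in range(1, max_pages + 1):
        if get_status(base.format(number)) != 200:
            break  # first missing page: stop counting
        count += 1
    return count


def fake_status(url):
    """Offline stand-in for the site: pages 1 to 50 exist, the rest are 404."""
    number = int(url.rsplit("-", 1)[1].split(".")[0])
    return 200 if 1 <= number <= 50 else 404


print(count_catalogue_pages(fake_status))
```

With the real site, `get_status` could be something like `lambda url: requests.get(url).status_code`.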


In [11]:
# Define another function that returns the full URLs of all books listed on a page
def getBooksURLs(url):
    soup = getAndParseURL(url)
    # Drop everything after the last "/" of the page URL before appending each book's relative link
    return ["/".join(url.split("/")[:-1]) + "/" + x.div.a.get('href') for x in soup.find_all("article", class_ = "product_pod")]

In [12]:
# Get all 1000 books URLs:
booksURLs = []
for page in pages_urls:
    booksURLs.extend(getBooksURLs(page))
    
print(str(len(booksURLs)) + " scraped URLs. We show 10 below:")
booksURLs[:10]

1000 scraped URLs. We show 10 below:


['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html']

In [13]:
# Scrape data from each book URL (Takes time as there are 1000 books):

names = []
prices = []
nb_in_stock = []
img_urls = []
categories = []
ratings = []

for url in booksURLs:
    soup = getAndParseURL(url)
    # Product name
    names.append(soup.find("div", class_ = re.compile("product_main")).h1.text)
    # Product price
    prices.append(soup.find("p", class_ = "price_color").text[2:]) # skip the two characters before the amount (the pound sign plus a stray character, likely an encoding artifact)
    # Number of available products
    nb_in_stock.append(re.sub("[^0-9]", "", soup.find("p", class_ = "instock availability").text)) # get rid of non numerical characters
    # Image url
    img_urls.append(url.replace("index.html", "") + soup.find("img").get("src"))
    # Product category
    categories.append(soup.find("a", href = re.compile("../category/books/")).get("href").split("/")[3])
    # Ratings
    ratings.append(soup.find("p", class_ = re.compile("star-rating")).get("class")[1])
    
# Convert into pandas dataframe
all_books_df = pd.DataFrame({'Book Name': names, 'Selling Price': prices, 'Number in Stock': nb_in_stock, "URL of Image": img_urls, "Genre": categories, "Rating (/5)": ratings})
all_books_df.head()

Unnamed: 0,Book Name,Selling Price,Number in Stock,URL of Image,Genre,Rating (/5)
0,A Light in the Attic,51.77,22,http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg,poetry_23,Three
1,Tipping the Velvet,53.74,20,http://books.toscrape.com/catalogue/tipping-the-velvet_999/../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg,historical-fiction_4,One
2,Soumission,50.1,20,http://books.toscrape.com/catalogue/soumission_998/../../media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg,fiction_10,One
3,Sharp Objects,47.82,20,http://books.toscrape.com/catalogue/sharp-objects_997/../../media/cache/c0/59/c05972805aa7201171b8fc71a5b00292.jpg,mystery_3,Four
4,Sapiens: A Brief History of Humankind,54.23,20,http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/../../media/cache/ce/5f/ce5f052c65cc963cf4422be096e915c9.jpg,history_32,Five
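
All the scraped values are still strings at this point, and the image URLs contain `../../` segments. The sketch below shows one way the fields of a single book could be converted into typed values, using the first row above as input. `clean_book` and `RATING_WORDS` are hypothetical names, and the "Two" entry is an assumption, since that rating does not appear in the rows shown.

```python
import re
from urllib.parse import urljoin

# Rating words as found in the class attribute, mapped to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_book(price, stock, rating, page_url, img_src):
    """Convert the scraped strings for one book into typed values."""
    return {
        "price": float(re.sub(r"[^0-9.]", "", price)),  # keep digits and the decimal point
        "stock": int(stock),
        "rating": RATING_WORDS[rating],
        "image_url": urljoin(page_url, img_src),  # resolves the ../../ segments
    }


book = clean_book(
    price="51.77",
    stock="22",
    rating="Three",
    page_url="http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    img_src="../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg",
)
print(book)
```

Numeric columns make later steps such as sorting by price or averaging ratings possible without further parsing.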


In [14]:
# Edit Genre column: keep the genre name and drop the "_<id>" suffix
all_books_df['Genre'] = all_books_df['Genre'].str.split("_").str.get(0)
all_books_df = all_books_df.reset_index(drop=True)
all_books_df.head(20)

Unnamed: 0,Book Name,Selling Price,Number in Stock,URL of Image,Genre,Rating (/5)
0,A Light in the Attic,51.77,22,http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg,poetry,Three
1,Tipping the Velvet,53.74,20,http://books.toscrape.com/catalogue/tipping-the-velvet_999/../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg,historical-fiction,One
2,Soumission,50.1,20,http://books.toscrape.com/catalogue/soumission_998/../../media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg,fiction,One
3,Sharp Objects,47.82,20,http://books.toscrape.com/catalogue/sharp-objects_997/../../media/cache/c0/59/c05972805aa7201171b8fc71a5b00292.jpg,mystery,Four
4,Sapiens: A Brief History of Humankind,54.23,20,http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/../../media/cache/ce/5f/ce5f052c65cc963cf4422be096e915c9.jpg,history,Five
5,The Requiem Red,22.65,19,http://books.toscrape.com/catalogue/the-requiem-red_995/../../media/cache/6b/07/6b07b77236b7c80f42bd90bf325e69f6.jpg,young-adult,One
6,The Dirty Little Secrets of Getting Your Dream Job,33.34,19,http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/../../media/cache/e1/1b/e11bea016d0ae1d7e2dd46fb3cb870b7.jpg,business,Four
7,"The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull",17.93,19,http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/../../media/cache/97/36/9736132a43b8e6e3989932218ef309ed.jpg,default,Three
8,The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics,22.6,19,http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/../../media/cache/d1/2d/d12d26739b5369a6b5b3024e4d08f907.jpg,default,Four
9,The Black Maria,52.15,19,http://books.toscrape.com/catalogue/the-black-maria_991/../../media/cache/d1/7a/d17a3e313e52e1be5651719e4fba1d16.jpg,poetry,One


### Conclusion:

Web scraping is a useful way to gather data for different projects, in this case data on books. This exercise was simple, useful practice for my other project on NBA analytics. An important part of scraping efficiently is understanding the structure of the target website __(and checking whether it allows scraping in the first place)__. The code also has to be maintained to stay useful, because websites change their structure. As far as I know, Selenium, which drives a real browser, can help automate the process.
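
One common place to look when checking whether a site allows scraping is its `robots.txt` file. The sketch below uses `urllib.robotparser` from the standard library. The rules and the `example.com` URLs are made up for illustration and are not the rules of any real site. A `robots.txt` file is only a convention, so a site's terms of use may still need to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; a real check would load the
# site's own file with set_url(...) followed by read().
example_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

allowed = parser.can_fetch("my-scraper", "http://example.com/catalogue/page-2.html")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data.html")
print(allowed, blocked)
```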