# WEEK 5 ASSIGNMENT
## WEB SCRAPING

The aim of this assignment is to scrape the website of a bookseller called [Books to Scrape](http://books.toscrape.com/).
From this website, we will create a dataframe with the following columns:
- **title**
- **rating**
- **price**
- **link**

We will use the `BeautifulSoup` and `requests` libraries for web scraping and `pandas` to build the dataframe.

## Importing Libraries
We will import the libraries needed to scrape the data from the website.

`BeautifulSoup` will be used for parsing the HTML and extracting elements, `requests` for sending HTTP requests, and `pandas` for data manipulation. We will also import the `csv` library for the export of the final .csv output.

In [27]:
# importing libraries
import requests
from bs4 import BeautifulSoup as bs
import csv
import pandas as pd

**We will use the `requests` library to send a GET request to the website.**
**We then check the status code and print whether the request succeeded.**

In [4]:
url = "http://books.toscrape.com/"
response = requests.get(url)

if response.status_code == 200:
    print("Request Successful")
else:
    print("Request Unsuccessful")

Request Successful


## Parsing the HTML content
After successfully fetching the page, we will first view the first 1000 characters of the HTML text.

In [8]:
# Printing the first 1000 characters of the response text
print(response.text[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

In [9]:
# parsing the HTML file using BeautifulSoup library
soup = bs(response.text, "html.parser")
print(type(soup))

<class 'bs4.BeautifulSoup'>


## Extracting details for one book
Steps to be performed:
- Scrape the data of one book
- Scrape the data of all the books on one page
- Scrape the data of all the books on all 50 pages

We need to find all the `<article>` tags with the class `product_pod` on the page and inspect the first one.

In [11]:
books = soup.find_all('article', class_ = 'product_pod')
single_book = books[0]
single_book

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [23]:
# extracting the title attribute from the <a> tag of the first book element
title = single_book.find('a', title=True)['title']
print("Title: ",title)

Title:  A Light in the Attic


In [22]:
# extracting the star-rating value from the class attribute of the first book's <p> tag
rating = single_book.find('p', class_='star-rating')['class'][1]
print("Rating: ",rating)

Rating:  Three
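
The rating is scraped as a word. If a numeric rating is needed later, a lookup table is one option. This is a minimal sketch, assuming the site only uses the words One to Five; `RATING_WORDS` and `rating_to_int` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical lookup table; assumes ratings are encoded as the words One..Five
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Return the numeric rating for a class word such as 'Three', or None if unknown."""
    return RATING_WORDS.get(word)

numeric_rating = rating_to_int("Three")
print("Numeric rating:", numeric_rating)
```

Returning `None` for an unknown word keeps the loop running if a page contains an unexpected class value.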


In [21]:
# extracting the text of the first book's <p class="price_color"> tag and stripping the stray "Â"
price = single_book.find('p',class_='price_color').text.strip().strip('Â')
print("Price: ",price)

Price:  £51.77
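
The stray "Â" is consistent with UTF-8 bytes being decoded as Latin-1 (ISO-8859-1), which `requests` falls back to for text responses that declare no charset. The sketch below reproduces the effect without any network access; it illustrates the likely cause and is not a confirmed diagnosis of this site's headers.

```python
# The pound sign is two bytes in UTF-8: 0xC2 0xA3
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes as Latin-1 maps 0xC2 to "Â" and 0xA3 to "£"
misdecoded = raw_bytes.decode("latin-1")
print(misdecoded)

# Decoding with the matching codec gives the clean price
decoded = raw_bytes.decode("utf-8")
print(decoded)
```

If this is indeed the cause, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` to BeautifulSoup, would likely make the `.strip('Â')` step unnecessary.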


In [25]:
# extracting the "href" attribute value from the first <a> tag of the book element
# and prepending the base url to form the complete link
book_url = single_book.find('a')['href']
link = url + book_url
print("Link: ", link)

Link:  http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
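
Plain concatenation works here because `url` ends with a slash and the href is relative to the site root. A more general option is `urljoin` from the standard library, which resolves an href against the page it was found on. In the sketch below, `listing_url` and `relative_href` are illustrative assumptions about how a listing page inside `catalogue/` might link to a book.

```python
from urllib.parse import urljoin

# Href as found on the home page, relative to the site root
home_url = "http://books.toscrape.com/"
href = "catalogue/a-light-in-the-attic_1000/index.html"
link_from_home = urljoin(home_url, href)
print(link_from_home)

# Assumed case: an href found on a page inside catalogue/, relative to that directory
listing_url = "http://books.toscrape.com/catalogue/page-2.html"
relative_href = "a-light-in-the-attic_1000/index.html"
link_from_listing = urljoin(listing_url, relative_href)
print(link_from_listing)
```

Both calls resolve to the same book URL, because `urljoin` takes the directory of the base page into account.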


## Extracting book details for page 1
- Finding all HTML elements for individual books
- Initializing an empty list to store book details
- For each book, we will be:
  - Extracting the book title
  - Extracting the book rating
  - Cleaning and extracting the book price
  - Extracting the book URL and creating complete link
  - Appending all the details to a list

In [44]:
books = soup.find_all('article', class_='product_pod')
books_data = []

for book in books:
    title = book.find('a', title=True)['title']
    rating = book.find('p', class_='star-rating')['class'][1]
    price = book.find('p',class_='price_color').text.strip().strip('Â')
    book_url = book.find('a')['href']
    link = url + book_url

    books_data.append([title, rating, price, link])

In [45]:
# creating a dataframe using pandas
df = pd.DataFrame(books_data, columns=["Title","Rating","Price","Link"])
df

| | Title | Rating | Price | Link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | One | £53.74 | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | One | £50.10 | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | Four | £47.82 | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | http://books.toscrape.com/catalogue/sapiens-a-... |
| 5 | The Requiem Red | One | £22.65 | http://books.toscrape.com/catalogue/the-requie... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | Four | £33.34 | http://books.toscrape.com/catalogue/the-dirty-... |
| 7 | The Coming Woman: A Novel Based on the Life of... | Three | £17.93 | http://books.toscrape.com/catalogue/the-coming... |
| 8 | The Boys in the Boat: Nine Americans and Their... | Four | £22.60 | http://books.toscrape.com/catalogue/the-boys-i... |
| 9 | The Black Maria | One | £52.15 | http://books.toscrape.com/catalogue/the-black-... |
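
The Price column holds strings such as £51.77, which cannot be sorted or summed numerically. If numeric prices are needed, they can be converted as in this minimal sketch; `price_to_float` is a hypothetical helper, not part of the notebook, and it assumes every price is a pound sign followed by a decimal number.

```python
def price_to_float(price_text):
    """Convert a scraped price such as '£51.77' (or 'Â£51.77') to a float."""
    # lstrip removes any leading "Â" or "£" characters, leaving only the number
    return float(price_text.strip().lstrip("Â£"))

prices = ["£51.77", "£53.74", "Â£50.10"]
numeric_prices = [price_to_float(p) for p in prices]
print(numeric_prices)
```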


## Extracting book details for all 50 pages
First, we will generate the URLs of pages 1 to 50.

In [46]:
for num in range(1, 51):
    page_url = f'http://books.toscrape.com/catalogue/page-{num}.html'
    print(page_url)

http://books.toscrape.com/catalogue/page-1.html
http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html
http://books.toscrape.com/catalogue/page-5.html
http://books.toscrape.com/catalogue/page-6.html
http://books.toscrape.com/catalogue/page-7.html
http://books.toscrape.com/catalogue/page-8.html
http://books.toscrape.com/catalogue/page-9.html
http://books.toscrape.com/catalogue/page-10.html
http://books.toscrape.com/catalogue/page-11.html
http://books.toscrape.com/catalogue/page-12.html
http://books.toscrape.com/catalogue/page-13.html
http://books.toscrape.com/catalogue/page-14.html
http://books.toscrape.com/catalogue/page-15.html
http://books.toscrape.com/catalogue/page-16.html
http://books.toscrape.com/catalogue/page-17.html
http://books.toscrape.com/catalogue/page-18.html
http://books.toscrape.com/catalogue/page-19.html
http://books.toscrape.com/catalogue/page-20.html
http://books.toscrape.com/cat

We will use `primary_url` to build the complete book URLs and `page_url` to address each of the 50 listing pages. The book links on these listing pages are relative to the `catalogue/` directory, so `primary_url` includes that path.

In [48]:
# hrefs on the catalogue pages are relative to catalogue/, so the base url includes it
primary_url = "http://books.toscrape.com/catalogue/"
books_50_data = []

for num in range(1, 51):
    page_url = f'http://books.toscrape.com/catalogue/page-{num}.html'
    response = requests.get(page_url)
    soup_page = bs(response.text, "html.parser")
    books = soup_page.find_all('article', class_='product_pod')

    for book in books:
        title = book.find('a', title=True)['title']
        rating = book.find('p', class_='star-rating')['class'][1]
        price = book.find('p', class_='price_color').text.strip().strip('Â')
        book_url = book.find('a')['href']
        link = primary_url + book_url

        books_50_data.append([title, rating, price, link])


In [49]:
# creating a dataframe
page_50 = pd.DataFrame(books_50_data, columns=["Title", "Rating", "Price", "Link"])
page_50

| | Title | Rating | Price | Link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | One | £53.74 | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | One | £50.10 | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | Four | £47.82 | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | http://books.toscrape.com/catalogue/sapiens-a-... |
| ... | ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | One | £55.53 | http://books.toscrape.com/catalogue/alice-in-w... |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Four | £57.06 | http://books.toscrape.com/catalogue/ajin-demi-... |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Five | £16.97 | http://books.toscrape.com/catalogue/a-spys-dev... |
| 998 | 1st to Die (Women's Murder Club #1) | One | £53.98 | http://books.toscrape.com/catalogue/1st-to-die... |


## Exporting the data
We save the final data in .csv format as **books_scraped.csv**.

In [51]:
page_50.to_csv("books_scraped.csv", index=False)
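
The export above goes through pandas, so the imported `csv` module is not strictly needed. For comparison, the same kind of rows could be written with `csv.writer` directly. This is a minimal sketch: `write_books_csv` is a hypothetical helper, and the single sample row stands in for the scraped list.

```python
import csv
import io

# One sample row in the same [title, rating, price, link] layout as the scraped data
rows = [
    ["A Light in the Attic", "Three", "£51.77",
     "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
]

def write_books_csv(handle, rows):
    """Write a header line followed by the book rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Title", "Rating", "Price", "Link"])
    writer.writerows(rows)

# An in-memory buffer is used here so the sketch leaves no file behind
buffer = io.StringIO()
write_books_csv(buffer, rows)
print(buffer.getvalue())
```

To write to disk instead, the handle could come from `open("books_scraped.csv", "w", newline="", encoding="utf-8")`.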