## Web Scraping with BeautifulSoup

In this post, we'll create a dataset of popular books in different genres by scraping the site Books to Scrape (http://books.toscrape.com) using the Requests library and BeautifulSoup.

One of the main aims of web scraping is to automate the process of extracting data from the web. The field is still in active development and has applications in areas such as monitoring e-commerce prices, gathering web data automatically, collecting data for machine learning, spotting investment opportunities, and generating sales leads.

Extracting data from the web is now fairly easy thanks to a number of libraries. The most popular ones for web scraping include Beautiful Soup, Scrapy, Selenium, and Requests. Choosing between them comes down to weighing the pros and cons of each against the site you want to scrape, so it is important to first understand whether the site is static or dynamic.

Static sites are web pages built from fixed code. They have no comment section, log-in, or other form of interactivity, which makes them the easiest kind of website to create and the most basic type of all. For such sites, it is common to use libraries such as Requests and Beautiful Soup. Dynamic sites, which are JavaScript-driven and involve interactivity such as user profiles, are usually scraped with libraries such as Selenium or Scrapy.

Scraping a website means fetching specific information from a web page by locating certain elements of the page. The page is first downloaded with the Requests library. Once the content has been downloaded, it can be parsed and searched with another library such as Beautiful Soup, and the retrieved information can be loaded into a spreadsheet or database for later use.

In this blog post I will use the Requests library and Beautiful Soup to extract data from a static site called Books to Scrape. It lists about 1000 books belonging to different genres.

Before we begin, make sure the required libraries are installed in your environment using pip or conda. Another option is to use a notebook such as Google Colab, which has most of the required libraries pre-installed.

#### Import Required Libraries
A. Requests Library
* Requests is an elegant and simple HTTP library for Python, built for human beings.
* It officially supports Python 3.7+.
  The first step to using the Requests library is to make sure that it is installed and up to date.

In [1]:

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [77]:
url = 'http://books.toscrape.com/catalogue/category/books_1/page-1.html'
r = requests.get(url)

* r is the response object we get back from calling requests.get on the page URL. It holds everything the server returned: the status code, the headers, and the page content.


In [3]:
# We can check the status code of the url page 
r.status_code

200

The status code in a server response is a 3-digit integer. The first digit defines the class of the response, and the last two digits do not have any categorization role.

If the first digit is 2, the request was successfully received, understood, and accepted. In other words, the page was fetched and we can go on to parse it. (A 2xx code signals a successful request rather than permission to scrape; Books to Scrape, though, is a sandbox site built for scraping practice.)
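
As a small illustration of the status classes described above, the sketch below maps a status code to its class using only the first digit. The helper name status_class is made up for this example; with Requests itself, the existing r.ok attribute and r.raise_for_status() method cover the same kind of check.

```python
# Hypothetical helper (not part of Requests): classify a status code by its first digit.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code: int) -> str:
    """Return the response class for a 3-digit HTTP status code."""
    first_digit = code // 100
    return STATUS_CLASSES.get(first_digit, "unknown")

print(status_class(200))  # success
print(status_class(404))  # client error
```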

In [9]:
# we can print out the content of the page using 
r.text



* As shown above, the raw content of the page is not easy to read in this form. This is where another library, BeautifulSoup (BS), comes into action. BS is a Python library used for pulling data out of chunks of HTML and XML. With the help of a parser, it can navigate and search such files in different ways to retrieve the desired information.
* We imported BeautifulSoup from bs4 in the first cell.

In [10]:
# To make sense of the HTML above, let's create a BeautifulSoup object.
# Pass in the page content and a parser as its arguments.
soup = BeautifulSoup(r.text, 'html.parser')

In [13]:
# Let's check for the type of soup
type(soup)

bs4.BeautifulSoup

Good! Now we have a soup object. One important thing to keep in mind is that Beautiful Soup provides a number of methods to search and navigate the parse tree. They all take similar arguments, but the two most commonly used are find() and find_all(). Both act as filters over the HTML/XML document, and from any tag they return you can keep navigating upward or downward. You will find easy-to-follow examples in the Beautiful Soup documentation. find() returns only the first instance of the search item, while find_all() returns all instances of the specified element as a list.
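
To make the difference between the two methods concrete, here is a minimal sketch on a hand-written HTML snippet. The snippet, its titles, and the variable name demo are invented for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment used only for this illustration
html = """
<ol>
  <li><a href="book-1.html">First book</a></li>
  <li><a href="book-2.html">Second book</a></li>
  <li><a href="book-3.html">Third book</a></li>
</ol>
"""
demo = BeautifulSoup(html, "html.parser")

first_link = demo.find("a")       # only the first matching tag
all_links = demo.find_all("a")    # every matching tag, in document order

print(first_link.text)                  # First book
print(len(all_links))                   # 3
print([a["href"] for a in all_links])   # ['book-1.html', 'book-2.html', 'book-3.html']
print(demo.find("table"))               # None, because nothing matches
```

When nothing matches, find() gives back None while find_all() gives back an empty list, which is worth checking for before indexing into a result.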
 
Using find() will return only the first instance of the 'ol' tag, which is what we want, as that tag contains all the books on each page. Oh, before we go on: you need to know how to use the developer tools, or Inspect Element, in your browser. To display the code behind a web page, first load the page, then right-click on any section of it and select Inspect. This will reveal the HTML make-up of the site, either in the same window or in a new one. At the top left of the developer tools panel is an element picker. Once you click it, it will reveal the code of any part of the page you hover over.

In [14]:
allbooks= soup.find('ol')



Now, let's go inside the 'ol' tag and find all instances of the 'article' tag. We are going to save them in a variable called articles, which will contain all the articles present on a given URL. We will do that using Beautiful Soup's find_all() method. In this case, we want to find all the 'article' tags that belong to the class 'product_pod'.

In [18]:
articles= allbooks.find_all('article', class_='product_pod')

In [16]:
len(articles)

20

The length of articles reveals that there are 20 articles (each one a section containing the details of a single book). Let's inspect the first book in our articles variable. I am going to display it using Beautiful Soup's prettify() method, which prints the HTML responsible for that section in an easy-to-read, indented format. This also makes searching for a particular element seamless, because we can easily see which tags are siblings or work out the next element of a given tag. Searching and navigating the parse tree are covered in detail in the Beautiful Soup documentation.

In [28]:
a1= articles[0]

In [29]:
print(a1.prettify())

<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   Â£51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>



Most tags carry one or more attributes, which makes locating a tag very easy. For instance, the first tag in the prettified output above is 'article', and it has a single attribute, class, with the value "product_pod", while 'button' has several attributes: class, data-loading-text, and type.
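
The sketch below shows how those attributes can be read from a tag, using a shortened copy of the markup above (the variable names are made up for this example). Note that Beautiful Soup returns class as a list, since a tag can carry several classes.

```python
from bs4 import BeautifulSoup

# Shortened copy of the article markup shown above
snippet = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</article>
"""
tag_demo = BeautifulSoup(snippet, "html.parser")

article = tag_demo.find("article")
button = tag_demo.find("button")

print(article.attrs)                 # {'class': ['product_pod']}
print(sorted(button.attrs))          # ['class', 'data-loading-text', 'type']
print(button["type"])                # submit
print(button.get("id"))              # None: .get() avoids a KeyError for a missing attribute
print(tag_demo.find("p")["class"])   # ['star-rating', 'Three']
```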

In [85]:
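# One list per column of the final table
Title=[]
Price=[]
Rating=[]
Link=[]

# Base address that each book's href is appended to
base_url = 'http://books.toscrape.com/'

for book in articles:
    # The first <a> wraps the cover <img>, whose alt attribute holds the full title
    title = book.a.next_element['alt']
    # Drop the two characters in front of the number, then convert to float
    price = float(book.find('p', class_='price_color').text[2:])
    # class is a list such as ['star-rating', 'Three']; the second entry is the rating
    rating = book.find('p', class_="star-rating").attrs['class'][1]
    link = base_url + book.select_one(".image_container a")['href']
    Title.append(title)
    Price.append(price)
    Rating.append(rating)
    Link.append(link)

df = pd.DataFrame({
    'Title':Title, 'Price':Price, 'Rating':Rating, 'Link':Link
})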
    

#### From the cell above:
* Each article in our articles variable contains several pieces of information describing a book.
* Here, we select just four features of each book: its title, price, rating, and link. So I created an empty list for each of these variables, then looped through all the articles to collect the data.
* Navigating the tree involves encountering several strings and tags, and you can choose to look for an element upward, downward, or even sideways.
------------------------------------------------------------------------------------------------------------
* To find the title, we take the first 'a' tag, step to the 'img' tag inside it, and read the value of its 'alt' attribute.
* To get the price of each book, we select the 'p' tag that belongs to the class price_color. This returns the whole tag, so we apply .text to get only the string. The string comes back as 'Â£51.77', where the stray 'Â' is an encoding artifact in front of the pound sign, so slicing from index 2 removes both characters and leaves the number. Lastly, the price is passed to float() so that we can carry out numerical computations on the values later. This can also be done during data cleaning.
* Other methods described in the documentation were used to get the rating and the link to each book. Finally, the extracted data were stored in a DataFrame.
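
The 'Â' in front of the pound sign is typical of UTF-8 bytes being decoded as Latin-1. The sketch below reproduces that on a plain string and shows a regex-based alternative to slicing that does not depend on how many characters precede the number. The helper name parse_price is an assumption for this example. With Requests, setting r.encoding = 'utf-8' before reading r.text is one possible way to avoid the stray character, though the slice would then need to start at index 1.

```python
import re

raw = "Â£51.77"  # the price text as it appears in the prettified output above

# Re-encoding as Latin-1 and decoding as UTF-8 undoes the mis-decoding
repaired = raw.encode("latin-1").decode("utf-8")
print(repaired)  # £51.77

def parse_price(text: str) -> float:
    """Hypothetical helper: pull the first decimal number out of a price string."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

print(parse_price(raw))       # 51.77
print(parse_price(repaired))  # 51.77
```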

In [86]:
df.shape

(20, 4)

In [88]:
df.sort_values(by='Price')

| | Title | Price | Rating | Link |
|---|---|---|---|---|
| 17 | A Spy's Devotion (The Regency Spies of London #1) | 16.97 | Five | http://books.toscrape.com/../../a-spys-devotio... |
| 12 | Blood Defense (Samantha Brinkman #1) | 20.3 | Three | http://books.toscrape.com/../../blood-defense-... |
| 7 | Charlie and the Chocolate Factory (Charlie Buc... | 22.85 | Three | http://books.toscrape.com/../../charlie-and-th... |
| 19 | 1,000 Places to See Before You Die | 26.08 | Five | http://books.toscrape.com/../../1000-places-to... |
| 6 | Choosing Our Religion: The Spiritual Lives of ... | 28.42 | Four | http://books.toscrape.com/../../choosing-our-r... |
| 1 | Forever Rockers (The Rocker #12) | 28.8 | Three | http://books.toscrape.com/../../forever-rocker... |
| 10 | Bridget Jones's Diary (Bridget Jones #1) | 29.82 | One | http://books.toscrape.com/../../bridget-joness... |
| 3 | Emma | 32.93 | Two | http://books.toscrape.com/../../emma_17/index.... |
| 13 | Bleach, Vol. 1: Strawberry and the Soul Reaper... | 34.65 | Five | http://books.toscrape.com/../../bleach-vol-1-s... |
| 11 | Bounty (Colorado Mountain #7) | 37.26 | Four | http://books.toscrape.com/../../bounty-colorad... |
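
The Link column above still contains the relative '../../' segments from each href, because the href was simply appended to the site root. One possible way to resolve them, sketched below with the standard library's urljoin, is to join the href against the URL of the page it was found on. The href value here is an assumed example of the relative form seen on the listing page.

```python
from urllib.parse import urljoin

# URL of the listing page that was scraped
page_url = "http://books.toscrape.com/catalogue/category/books_1/page-1.html"
# Assumed example of a relative href as found on that page
href = "../../a-light-in-the-attic_1000/index.html"

# Plain concatenation keeps the '../../' segments in the result
print("http://books.toscrape.com/" + href)
# http://books.toscrape.com/../../a-light-in-the-attic_1000/index.html

# urljoin resolves the href relative to the page it came from
absolute = urljoin(page_url, href)
print(absolute)
# http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Inside the scraping loop, the same idea would mean passing the URL of the page being parsed as the first argument to urljoin instead of concatenating with the site root.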


### Now let's scrape more than just the first page

Here, I will put all the code above in a single cell and scrape data from all the pages of the Books to Scrape site using a for loop. Note that the URL is now an f-string with the page number filled in, and the for loop will run through the 50 pages on the site, scraping 20 items from each page.

In [89]:

Title=[]
Price=[]
Rating=[]
Link=[]

# Base address that each book's href is appended to
base_url = 'http://books.toscrape.com/'

for i in range(1,51):
    # Fetch and parse listing page i (the site has 50 pages)
    url = f'http://books.toscrape.com/catalogue/category/books_1/page-{i}.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    allbooks = soup.find('ol')
    articles = allbooks.find_all('article', class_='product_pod')

    # Collect the four fields for each book on this page
    for book in articles:
        title = book.a.next_element['alt']
        price = float(book.find('p', class_='price_color').text[2:])
        rating = book.find('p', class_="star-rating").attrs['class'][1]
        link = base_url + book.select_one(".image_container a")['href']
        Title.append(title)
        Price.append(price)
        Rating.append(rating)
        Link.append(link)

df_all = pd.DataFrame({
    'Title':Title, 'Price':Price, 'Rating':Rating, 'Link':Link
})


In [83]:
df_all.shape

(1000, 4)

In [90]:
df_all.sort_values(by='Price')

| | Title | Price | Rating | Link |
|---|---|---|---|---|
| 638 | An Abundance of Katherines | 10.00 | Five | http://books.toscrape.com/../../an-abundance-o... |
| 501 | The Origin of Species | 10.01 | Four | http://books.toscrape.com/../../the-origin-of-... |
| 716 | The Tipping Point: How Little Things Can Make ... | 10.02 | Two | http://books.toscrape.com/../../the-tipping-po... |
| 84 | Patience | 10.16 | Three | http://books.toscrape.com/../../patience_916/i... |
| 302 | Greek Mythic History | 10.23 | Five | http://books.toscrape.com/../../greek-mythic-h... |
| ... | ... | ... | ... | ... |
| 366 | The Diary of a Young Girl | 59.90 | Three | http://books.toscrape.com/../../the-diary-of-a... |
| 560 | The Barefoot Contessa Cookbook | 59.92 | Five | http://books.toscrape.com/../../the-barefoot-c... |
| 860 | Civilization and Its Discontents | 59.95 | Two | http://books.toscrape.com/../../civilization-a... |
| 617 | Last One Home (New Beginnings #1) | 59.98 | Three | http://books.toscrape.com/../../last-one-home-... |


Now you can export the result to a CSV file.
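
A minimal sketch of the export with pandas' to_csv is shown below. It uses a two-row stand-in DataFrame and an in-memory buffer so that it runs on its own; the file name books.csv in the last comment is an assumption.

```python
import io
import pandas as pd

# Small stand-in for df_all (rows taken from the outputs above)
sample = pd.DataFrame({
    "Title": ["Emma", "Patience"],
    "Price": [32.93, 10.16],
    "Rating": ["Two", "Three"],
})

# index=False leaves out the row index, which would otherwise be written
# as an extra unnamed first column
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
print(buffer.getvalue())

# For the real data, pass a path instead of a buffer, e.g.:
# df_all.to_csv("books.csv", index=False)
```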

## Scraping by Book Genre
Finally, I would like to do something interesting. A look at the home page of the site we have been scraping reveals that the books are grouped by genre. Although there are 1000 books in total, they belong to different genres. So we are going to write a function called scrape_book_by_genre() that scrapes the books belonging to a specific genre. The function reports the number of books in that genre and returns a DataFrame of those books.

In [96]:
url2='https://books.toscrape.com/index.html'
r2= requests.get(url2)
r2.status_code

200

In [97]:
doc= BeautifulSoup(r2.text, 'html.parser')

In [118]:
genre_section=doc.find('ul', class_='nav nav-list')
genre_list= genre_section.find_all('a')

In [126]:
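# Let's create a DataFrame containing the name of each genre and the link to its page

Genre =[]
Link=[]

# Skip the first link, which is the top-level "Books" entry rather than a genre
for item in genre_list[1:]: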
    genre = item.text.strip()
    link= 'https://books.toscrape.com/'+item['href']
    Genre.append(genre)
    Link.append(link)
    
df3 = pd.DataFrame({
    'Genre':Genre, 'Link':Link
})

In [127]:
df3.head()

| | Genre | Link |
|---|---|---|
| 0 | Travel | https://books.toscrape.com/catalogue/category/... |
| 1 | Mystery | https://books.toscrape.com/catalogue/category/... |
| 2 | Historical Fiction | https://books.toscrape.com/catalogue/category/... |
| 3 | Sequential Art | https://books.toscrape.com/catalogue/category/... |
| 4 | Classics | https://books.toscrape.com/catalogue/category/... |


In [128]:
url4 = df3.iloc[4, 1]  # row 4, column 1 (Link)

In [137]:
url4name = df3.iloc[4, 0]  # row 4, column 0 (Genre)
url4name

'Classics'

* url4 is the link to the page listing the books that belong to the Classics genre.

In [129]:
url4

'https://books.toscrape.com/catalogue/category/books/classics_6/index.html'

In [139]:

def scrape_book_by_genre(genre_url):
    """Scrape the books listed on a genre page of https://books.toscrape.com.

    Pass the URL of a genre and get back a DataFrame with title, price, rating and link.
    The returned DataFrame is sorted by price, so you can see the cheapest and
    most expensive book in the genre.
    The function also prints the number of books it found.
    Only the page at genre_url is read, so the count covers that page alone.
    """
    Title=[]
    Price=[]
    Rating=[]
    Link=[]

    # Base address that each book's href is appended to
    base_url = 'http://books.toscrape.com/'

    r = requests.get(genre_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    allbooks = soup.find('ol')
    articles = allbooks.find_all('article', class_='product_pod')
    for book in articles:
        title = book.a.next_element['alt']
        price = float(book.find('p', class_='price_color').text[2:])
        rating = book.find('p', class_="star-rating").attrs['class'][1]
        link = base_url + book.select_one(".image_container a")['href']
        Title.append(title)
        Price.append(price)
        Rating.append(rating)
        Link.append(link)

    dfgenre = pd.DataFrame({
        'Title':Title, 'Price':Price, 'Rating':Rating, 'Link':Link
    })
    print(f"This genre has {dfgenre.shape[0]} books")
    return dfgenre.sort_values(by='Price')


In [144]:
scrape_book_by_genre(url4)

This genre has 19 books


| | Title | Price | Rating | Link |
|---|---|---|---|---|
| 3 | The Hound of the Baskervilles (Sherlock Holmes... | 14.82 | Two | http://books.toscrape.com/../../../the-hound-o... |
| 0 | The Secret Garden | 15.08 | Four | http://books.toscrape.com/../../../the-secret-... |
| 8 | Wuthering Heights | 17.73 | Three | http://books.toscrape.com/../../../wuthering-h... |
| 10 | The Complete Stories and Poems (The Works of E... | 26.78 | Four | http://books.toscrape.com/../../../the-complet... |
| 4 | Little Women (Little Women #1) | 28.07 | Four | http://books.toscrape.com/../../../little-wome... |
| 1 | The Metamorphosis | 28.58 | One | http://books.toscrape.com/../../../the-metamor... |
| 9 | The Picture of Dorian Gray | 29.7 | Two | http://books.toscrape.com/../../../the-picture... |
| 5 | Gone with the Wind | 32.49 | Three | http://books.toscrape.com/../../../gone-with-t... |
| 17 | Emma | 32.93 | Two | http://books.toscrape.com/../../../emma_17/ind... |
| 12 | And Then There Were None | 35.01 | Two | http://books.toscrape.com/../../../and-then-th... |
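
scrape_book_by_genre reads a single page, so a genre with more books than fit on one page would need its later pages fetched too. One way to find them, sketched below, is to look for a "next" link on the page and resolve it against the current URL. The li.next markup, the pager fragments, and the helper name next_page_url are assumptions for illustration; check the actual pager markup with the browser's inspector before relying on it.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, page_url):
    """Hypothetical helper: return the absolute URL of the next page, or None."""
    page = BeautifulSoup(html, "html.parser")
    next_link = page.select_one("li.next a")  # assumed pager markup
    if next_link is None:
        return None
    return urljoin(page_url, next_link["href"])

# Made-up pager fragments used only for this illustration
with_next = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
last_page = '<ul class="pager"><li class="current">Page 2 of 2</li></ul>'

current = "https://books.toscrape.com/catalogue/category/books/classics_6/index.html"
print(next_page_url(with_next, current))
# https://books.toscrape.com/catalogue/category/books/classics_6/page-2.html
print(next_page_url(last_page, current))  # None
```

A loop could then keep requesting pages and parsing their books until next_page_url returns None.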


# The END