In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Besides APIs, web scraping is another way to collect data. In this notebook we will learn how to scrape a web page using the `beautifulsoup` and `requests` libraries. We will be scraping http://books.toscrape.com/, a website created specifically so that developers can practice scraping.

The script below retrieves the HTML page of a book entitled "A Light in the Attic" using the `requests` library. The retrieved HTML is then passed to the `beautifulsoup` parser, which lets us pull out specific information about the book (e.g. title, price, rating).

Take a look at the html document printed by the script. What can you observe? Do you see any specific patterns that are repeating? 

In [2]:
#get the html from one of the books in the website
page = requests.get('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

#feed it into beautiful soup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   A Light in the Attic | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="
    It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the 

Browsing through the HTML document can be overwhelming. Don't fret: as you gain more experience in scraping websites, this will become more intuitive. Let's first familiarize ourselves with the typical structure of an HTML document.

```
<html>
    <head>
        <!--this is how comments are written in html-->
        <!--we usually place css files under the head tag --> 
        <link href="../../static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
        <link href="../../static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" rel="stylesheet"/>
        <link href="../../static/oscar/css/datetimepicker.css" rel="stylesheet" type="text/css"/>
    </head>
    <body>
        <!-- this is where the content of the web page is located -->
        <!-- which means the information that you want to scrape will be located here -->
    </body>
</html>
```

An HTML document is made up of tags, most of which come as an opening and closing pair (a few, such as `link` and `img`, have no closing tag). At the top level there are typically 3 main tags: `html`, `head`, and `body`. The `html` tag wraps the whole document and marks it as HTML. The `head` tag usually contains files that need to be imported into the document, for example `CSS` (Cascading Style Sheets) files. `CSS` files are responsible for making your website pretty, so things like color, shading, and font settings can be found there. Lastly, the `body` tag is where the content of the page can be found. This is usually where we scrape the information from.
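
To make the structure concrete, here is a minimal sketch that parses a tiny hand-written document and reaches into its `head` and `body`. The HTML string and the variable names (`doc`, `page_title`, and so on) are made up for illustration and are not part of the book page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, for illustration only).
html = """
<html>
    <head>
        <title>Example page</title>
        <link href="styles.css" rel="stylesheet"/>
    </head>
    <body>
        <h1>Hello</h1>
        <p>Some content to scrape.</p>
    </body>
</html>
"""

doc = BeautifulSoup(html, 'html.parser')

# A tag name used as an attribute returns the first tag with that name.
page_title = doc.title.text

# Attributes of a tag can be read like dictionary keys.
stylesheet = doc.head.find('link')['href']

# find_all(True) matches every tag, so this lists the tags inside <body>.
body_tags = [tag.name for tag in doc.body.find_all(True)]

print(page_title)
print(stylesheet)
print(body_tags)
```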

Inside the body tag you can observe several types of tags. Some common tags can be found below: 

| HTML Tag | Description |
| --- | --- |
| div | Groups together the HTML elements that make up a component |
| h1, h2, h3, h4, h5, h6 | Header tags; the smaller the number, the bigger the text shown on screen |
| p | Stands for paragraph; this is where text content is usually placed |
| a | Stands for anchor; this is where hyperlinks are placed |
| img | Stands for image; this is where images are placed |
| ul, ol, li | List tags: ul stands for unordered list, ol for ordered list, and li for list item |
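
As a rough illustration of how these tags look to the parser, the sketch below reads a heading, a list of links, and an image from a small made-up snippet. The HTML, the class name `sidebar`, and the paths are assumptions for the example only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet combining several of the common tags.
snippet = """
<div class="sidebar">
    <h2>Categories</h2>
    <ul>
        <li><a href="/travel">Travel</a></li>
        <li><a href="/poetry">Poetry</a></li>
    </ul>
    <img src="cover.jpg" alt="Book cover"/>
</div>
"""

sidebar = BeautifulSoup(snippet, 'html.parser')

# Text of the header tag.
heading = sidebar.find('h2').text

# For anchors we usually want both the visible text and the href attribute.
links = [(a.text, a['href']) for a in sidebar.find_all('a')]

# Images carry their location in the src attribute.
image_source = sidebar.find('img')['src']

print(heading)
print(links)
print(image_source)
```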


Another useful tool for web scraping is the inspector tool available in browsers, which can usually be opened by pressing `F12`. A panel should then appear in your browser. With the inspector tool you can hover your mouse over parts of the web page and it will highlight the corresponding part of the HTML document. This makes scraping easier since you don't have to read through the entire HTML document.

Take the image below as an example. I hovered my mouse over the container of the book title, and on the right side you can see the different tags that make up the book title component. If we are interested in getting the title and price, we can see that the title is placed inside an `h1` tag while the price is placed inside a `p` tag. You can also observe that some of these tags have an attribute called `class`. The CSS files imported in the head tag refer to these class names to tell the browser how to render the tag. The class attribute is also useful for us when scraping, since it allows us to narrow down the tag that we want to get.

<img src="chrome_dev_tools.png"/>

Now that we have a general understanding of the HTML document, let us use beautifulsoup to get information from the website. The most common function that we will use from this library is `find`, which returns the first tag that matches the criteria given to it. Let's use it to get the title of the book and the price.

In [3]:
# find returns the whole tag; to drop the tags and keep only the text, use the .text attribute
print(soup.find('h1'))
print(soup.find('h1').text)

<h1>A Light in the Attic</h1>
A Light in the Attic
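
One thing to keep in mind is that `find` returns `None` when nothing matches, so calling `.text` on the result would raise an error. The sketch below shows one possible way to guard against that. The helper `text_or_default` is a hypothetical name introduced here, not part of beautifulsoup.

```python
from bs4 import BeautifulSoup

fragment = BeautifulSoup('<h1>A Light in the Attic</h1>', 'html.parser')


def text_or_default(parent, tag_name, default=''):
    """Return the stripped text of the first matching tag, or a default.

    Hypothetical helper for illustration only.
    """
    tag = parent.find(tag_name)
    if tag is None:
        # Nothing matched, so there is no .text to read.
        return default
    return tag.get_text(strip=True)


print(text_or_default(fragment, 'h1'))
print(text_or_default(fragment, 'h2', default='(no subtitle)'))
```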


The find function also accepts attributes to look for inside the tag. In the script below we look for the paragraph tag that has a class called `price_color`. This allows our search to be more targeted.

In [4]:
print(soup.find('p', attrs={'class':'price_color'}))
print(soup.find('p', attrs={'class':'price_color'}).text)

<p class="price_color">Â£51.77</p>
Â£51.77
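
The `Â` in front of the pound sign is an encoding artifact rather than part of the price. The page declares itself as UTF-8 in its `meta` tag, so the most likely explanation (an assumption, since the response headers are not shown here) is that `page.text` was decoded with a different encoding such as ISO-8859-1. The sketch below reproduces the effect without any network access.

```python
# The pound sign as UTF-8 bytes, which is how a UTF-8 page would send it.
raw_bytes = '£51.77'.encode('utf-8')

# Decoding those bytes with the wrong codec produces the stray character.
wrong = raw_bytes.decode('iso-8859-1')

# Decoding them as UTF-8 gives back the original text.
right = raw_bytes.decode('utf-8')

print(wrong)
print(right)
```

If this is indeed the cause, setting `page.encoding = 'utf-8'` before reading `page.text`, or passing the raw bytes in `page.content` to `BeautifulSoup`, should give a clean `£51.77`.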


Another function available to us is `find_all`. It returns a list of all the elements that match the criteria passed to it.

In [5]:
p_results = soup.find_all('p')
p_results

[<p class="price_color">Â£51.77</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock (22 available)
     
 </p>,
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <!-- <small><a href="/catalogue/a-light-in-the-attic_1000/reviews/">
         
                 
                     0 customer reviews
                 
         </a></small>
          --> 
 
 
 <!-- 
     <a id="write_review" href="/catalogue/a-light-in-the-attic_1000/reviews/add/#addreview" class="btn btn-success btn-sm">
         Write a review
     </a>
 
  --></p>,
 <p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids si

Since the result is a list, we can use an index to retrieve the information that we need. For example, if we are only interested in the product description, we can just get the last element of the result.

In [6]:
p_results[-1].text

"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounde

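Not every piece of information sits in the text of a tag. In the list above, the rating only appears as a word in the class attribute (`star-rating Three`). As a sketch, it could be read as shown below. The snippet is a shortened copy of that tag, and the mapping from words to numbers is an assumption, since only `Three` appears on this page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the rating tag from the list above.
rating_html = '<p class="star-rating Three"><i class="icon-star"></i></p>'
rating_soup = BeautifulSoup(rating_html, 'html.parser')

# class_ matches a tag if any one of its classes equals the given value.
rating_tag = rating_soup.find('p', class_='star-rating')

# The class attribute is returned as a list of the individual class names.
classes = rating_tag['class']

# Assumed mapping: the site is presumed to use the words One to Five.
WORD_TO_NUMBER = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
rating = next(WORD_TO_NUMBER[c] for c in classes if c in WORD_TO_NUMBER)

print(classes)
print(rating)
```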
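We can also get information relative to a tag using the `next_sibling` attribute. For example, suppose we want to know how many copies are in stock. Note that `next_sibling` is applied twice in the script below: the first sibling after the price tag is most likely the whitespace between the two tags, and the second is the tag we want.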

In [7]:
print(soup.find('p', attrs={'class':'price_color'}).next_sibling.next_sibling)

<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock (22 available)
    
</p>
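
To see why `next_sibling` was needed twice, the sketch below uses a small made-up snippet with the same layout: two `p` tags separated by a line break. It also shows `find_next_sibling`, which skips over text and returns the next matching tag directly. The HTML here is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Simplified stand-in for the price and availability tags.
sibling_html = """<div>
<p class="price_color">51.77</p>
<p class="instock availability">In stock (22 available)</p>
</div>"""

sibling_soup = BeautifulSoup(sibling_html, 'html.parser')
price_tag = sibling_soup.find('p', class_='price_color')

# The first sibling is the newline between the two tags, not a tag.
first = price_tag.next_sibling

# The second sibling is the availability paragraph.
second = price_tag.next_sibling.next_sibling

# find_next_sibling only considers tags, so one call is enough.
same = price_tag.find_next_sibling('p')

print(repr(first))
print(second.text)
print(same.text)
```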


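Tables are a common structure on websites and often contain the information that we need. To retrieve the data, we iterate over the rows of the table. In the script below, `class_="table-striped"` is a shorthand for the `attrs={'class': ...}` form used earlier; the trailing underscore is there because `class` is a reserved word in Python.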

In [8]:
table_res = soup.find('table', class_="table-striped")
print(table_res.prettify())

<table class="table table-striped">
 <tr>
  <th>
   UPC
  </th>
  <td>
   a897fe39b1053632
  </td>
 </tr>
 <tr>
  <th>
   Product Type
  </th>
  <td>
   Books
  </td>
 </tr>
 <tr>
  <th>
   Price (excl. tax)
  </th>
  <td>
   Â£51.77
  </td>
 </tr>
 <tr>
  <th>
   Price (incl. tax)
  </th>
  <td>
   Â£51.77
  </td>
 </tr>
 <tr>
  <th>
   Tax
  </th>
  <td>
   Â£0.00
  </td>
 </tr>
 <tr>
  <th>
   Availability
  </th>
  <td>
   In stock (22 available)
  </td>
 </tr>
 <tr>
  <th>
   Number of reviews
  </th>
  <td>
   0
  </td>
 </tr>
</table>



In [9]:
for row in table_res.find_all('tr'):
    header = row.find('th').text
    data = row.find('td').text
    print(f"{header}={data}")

UPC=a897fe39b1053632
Product Type=Books
Price (excl. tax)=Â£51.77
Price (incl. tax)=Â£51.77
Tax=Â£0.00
Availability=In stock (22 available)
Number of reviews=0
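
Printing is useful for inspection, but for later analysis it is often handier to store each row in a dictionary and convert the values to numbers. The sketch below does this on a shortened copy of the table. The regular expression and the variable names (`product`, `price`, `in_stock`) are choices made for this example, not something the site prescribes.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the product table shown above.
table_html = """
<table class="table table-striped">
 <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
 <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
 <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
"""

table = BeautifulSoup(table_html, 'html.parser').find('table', class_='table-striped')

# Build a dictionary keyed by the header cell of each row.
product = {
    row.find('th').get_text(strip=True): row.find('td').get_text(strip=True)
    for row in table.find_all('tr')
}

# Drop the currency symbol and convert the price to a float.
price = float(product['Price (incl. tax)'].lstrip('£'))

# Pull the number of copies out of the availability text.
match = re.search(r'\((\d+) available\)', product['Availability'])
in_stock = int(match.group(1)) if match else 0

print(product)
print(price, in_stock)
```

A list of such dictionaries, one per book, can then be passed to `pd.DataFrame` to get a table of all scraped books.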
