In [2]:
from bs4 import BeautifulSoup
import urllib.request

# Getting the Page's html

## Method 1 using 'urllib.request.urlopen( )'

In [3]:
with urllib.request.urlopen('http://books.toscrape.com/') as url:
    content = url.read()

In [4]:
print(type(content))

<class 'bytes'>
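
Since `url.read()` returns raw bytes, the content has to be decoded before it can be handled as text. A minimal sketch, using a hard-coded byte string as a stand-in for the downloaded page and assuming the page is UTF-8 encoded (the variable names `sample_content` and `html_text` are illustrative only):

```python
# Stand-in for the bytes returned by url.read(); the sample markup is illustrative
sample_content = '<p class="price_color">£51.77</p>'.encode('utf-8')

# bytes -> str, assuming the page is encoded as UTF-8
html_text = sample_content.decode('utf-8')

print(type(sample_content))  # <class 'bytes'>
print(type(html_text))       # <class 'str'>
```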


## Method 2 using 'requests.get( )'
https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3

In [5]:
import requests

In [6]:
#Assigning the url to the variable link
link = 'http://books.toscrape.com/'

Next, we request that page with the requests.get() method, passing link as the argument, and assign the result to the variable page

In [7]:
page = requests.get(link)

In [8]:
#The variable page is assigned a Response object
print(page)

<Response [200]>


Printing the Response object shows its status_code in square brackets (200 here; it would be 404, for example, if the page were not found)
- using ? to see the documentation gives: The :class:`Response <Response>` object, which contains a server's response to an HTTP request.
- a list of status codes is available here: https://www.restapitutorial.com/httpstatuscodes.html
- the status code can be accessed directly:

In [9]:
page.status_code

200
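
To look up what a numeric status code means without leaving Python, the standard library's `http.HTTPStatus` enum can be used. This is a small sketch; the helper name `describe_status` is made up for illustration and is not part of Requests:

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK' for a known HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

With a real response, `describe_status(page.status_code)` would give the same kind of description. Requests also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx codes.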

In order to work with the web data, we need to access the text-based content of the web files
- we can read the content of the server's response with page.text
- or page.content if we'd like to access the response in bytes

In [11]:
page.text[0:400]

'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scr'

In [12]:
type(page.text)

str

In [13]:
page.content[0:400]

b'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scr'

In [14]:
type(page.content)

bytes

In [15]:
type(page)

requests.models.Response

# Stepping through the Page with Beautiful Soup
- The Beautiful Soup library creates a parse tree from parsed HTML and XML documents
- This makes the web page text easier to read and navigate than the raw string we got from the Requests module
- We'll run the page.text document through the module to get a BeautifulSoup object
- that is, a parse tree of the page, built by running Python's built-in **html.parser** over the html
- the constructed object represents the html document as a nested data structure
- this is assigned to the variable soup
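
To make the idea of a nested data structure concrete, here is a minimal sketch on a tiny hand-written document. The markup and the names `sample_html` and `sample_soup` are assumptions for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Tiny hand-written document standing in for page.text
sample_html = """
<html>
  <head><title>All products</title></head>
  <body>
    <p class="price_color">51.77</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached by walking down the tree with attribute access
print(sample_soup.title.string)        # All products
print(sample_soup.body.p['class'])     # ['price_color']
print(sample_soup.body.p.parent.name)  # body
```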

## Basics

In [17]:
soup = BeautifulSoup(page.text, 'html.parser')

In [18]:
type(soup)

bs4.BeautifulSoup

In [20]:
print(soup.prettify()[0:100])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![end


## Finding instances of a tag 
- we can extract every occurrence of a tag from a page with BeautifulSoup's **find_all** method
- it returns a list of all instances of the given tag within the document
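
For comparison, `find` returns only the first match, while `find_all` returns every match. A short sketch on a hand-written snippet (the sample markup is an assumption, not taken from the site):

```python
from bs4 import BeautifulSoup

sample_html = '<div><p>first</p><p>second</p></div>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_p = sample_soup.find('p')    # a single Tag (or None if nothing matches)
all_p = sample_soup.find_all('p')  # a list-like ResultSet with every match

print(first_p.text)             # first
print([p.text for p in all_p])  # ['first', 'second']
```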

In [69]:
#Finding all <p> tags:
soup.find_all('p')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>, <p class="price_color">Â£51.77</p>, <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>, <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>, <p class="price_color">Â£53.74</p>, <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>, <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>, <p class="price_color">Â£50.10</p>, <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>, <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="

## Our example: get prices from website books.toscrape.com

In [81]:
#Finding tags by class: the prices are <p> tags with class "price_color"
prices = soup.find_all(class_='price_color')

In [82]:
prices

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]

In [83]:
#Now that we have all the price tags, we extract the plain text from each of them
for tag in prices:
    print(tag.text.strip())

Â£51.77
Â£53.74
Â£50.10
Â£47.82
Â£54.23
Â£22.65
Â£33.34
Â£17.93
Â£22.60
Â£52.15
Â£13.99
Â£20.66
Â£17.46
Â£52.29
Â£35.02
Â£57.25
Â£23.88
Â£37.59
Â£51.33
Â£45.17
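
The `Â` in front of each `£` above looks like a decoding artifact rather than part of the price: it is what appears when UTF-8 bytes are decoded as ISO-8859-1 (Latin-1). This can happen with `page.text` when the server does not declare a charset and Requests falls back to ISO-8859-1 for text responses; whether that is the cause here is an assumption. The sketch below reproduces and reverses the effect on a hard-coded string, without any network access:

```python
# The pound sign encoded as UTF-8 takes two bytes: 0xC2 0xA3
raw = '£51.77'.encode('utf-8')

# Decoding those bytes with the wrong codec yields the stray 'Â'
garbled = raw.decode('iso-8859-1')
print(garbled)   # Â£51.77

# Reversing the wrong decoding step recovers the original text
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)  # £51.77
```

In the notebook itself, two options that should avoid the problem are setting `page.encoding = 'utf-8'` before reading `page.text`, or passing `page.content` (bytes) to `BeautifulSoup` so that the parser detects the encoding itself.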


In [84]:
#Empty list to be filled with the price strings
price_li = []
for i in prices: 
    price_li.append(i.text.strip())
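
The entries in `price_li` are still strings. If numeric prices are needed (for sorting or averaging, say), one possible approach is to keep only the digits and the decimal point and convert to `float`. This is a sketch on a few sample values; the helper `to_number` is hypothetical and not part of the original notebook:

```python
import re

# A few of the scraped strings, including the stray encoding character
sample_prices = ['Â£51.77', 'Â£53.74', 'Â£50.10']

def to_number(price_text):
    """Keep only digits and the decimal point, then convert to float."""
    return float(re.sub(r'[^0-9.]', '', price_text))

price_values = [to_number(p) for p in sample_prices]
print(price_values)  # [51.77, 53.74, 50.1]
```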

## Same procedure for names

In [85]:
#Same procedure for the names of the books, which are stored inside the <h3> tag 
names = soup.find_all('h3')

In [86]:
names

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>,
 <h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>,
 <h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>,
 <h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>,
 <h3><a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>,
 <h3><a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>,
 <h3><a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/ind

In [87]:
#Extracting the plain text inside every tag using .text.strip()
for tag in names:
    print(tag.text.strip())

A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History ...
The Requiem Red
The Dirty Little Secrets ...
The Coming Woman: A ...
The Boys in the ...
The Black Maria
Starving Hearts (Triangular Trade ...
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little ...
Rip it Up and ...
Our Band Could Be ...
Olio
Mesaerion: The Best Science ...
Libertarianism for Beginners
It's Only the Himalayas


In [88]:
#empty list for names 
names_li =[]
#Filling names with stripped text
for i in names:
    names_li.append(i.text.strip())
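
Note that the link text is shortened for long titles ("A Light in the ..."), while the `title` attribute of the `<a>` tag holds the full name. A minimal sketch of reading that attribute instead, using two `<h3>` entries copied from the output above (the names `sample_soup` and `full_titles` are illustrative):

```python
from bs4 import BeautifulSoup

# Two <h3> entries copied from the output above
sample_html = (
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light in the ...</a></h3>'
    '<h3><a href="catalogue/tipping-the-velvet_999/index.html" '
    'title="Tipping the Velvet">Tipping the Velvet</a></h3>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Read the title attribute of the link inside each <h3>
full_titles = [h3.a['title'] for h3 in sample_soup.find_all('h3')]
print(full_titles)  # ['A Light in the Attic', 'Tipping the Velvet']
```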

In [74]:
#Checking that both lists have the same length before merging them into a dict
print(len(price_li))
print(len(names_li))

20
20


In [77]:
#As both have the same length, we can use the zip function to merge them into a dict of name:price
inventory = dict(zip(names_li,price_li))

In [78]:
inventory

{'A Light in the ...': 'Â£51.77',
 "It's Only the Himalayas": 'Â£45.17',
 'Libertarianism for Beginners': 'Â£51.33',
 'Mesaerion: The Best Science ...': 'Â£37.59',
 'Olio': 'Â£23.88',
 'Our Band Could Be ...': 'Â£57.25',
 'Rip it Up and ...': 'Â£35.02',
 'Sapiens: A Brief History ...': 'Â£54.23',
 "Scott Pilgrim's Precious Little ...": 'Â£52.29',
 'Set Me Free': 'Â£17.46',
 "Shakespeare's Sonnets": 'Â£20.66',
 'Sharp Objects': 'Â£47.82',
 'Soumission': 'Â£50.10',
 'Starving Hearts (Triangular Trade ...': 'Â£13.99',
 'The Black Maria': 'Â£52.15',
 'The Boys in the ...': 'Â£22.60',
 'The Coming Woman: A ...': 'Â£17.93',
 'The Dirty Little Secrets ...': 'Â£33.34',
 'The Requiem Red': 'Â£22.65',
 'Tipping the Velvet': 'Â£53.74'}
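
One caveat with `dict(zip(...))`: a dict keeps a single value per key, so two books with the same (possibly shortened) name would collapse into one entry. A small sketch with made-up sample data shows the difference from a list of pairs:

```python
# Sample data; the repeated name and the last price are made up for illustration
names_sample = ['Olio', 'Set Me Free', 'Olio']
prices_sample = ['£23.88', '£17.46', '£9.99']

# A dict keeps one value per key, so the repeated name collapses
as_dict = dict(zip(names_sample, prices_sample))
print(len(as_dict))   # 2

# A list of (name, price) tuples keeps every pair
as_pairs = list(zip(names_sample, prices_sample))
print(len(as_pairs))  # 3
```

On this page all 20 names are distinct, so the dict above holds every book.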

# Some further analysis: How many books per category?
- when looking at the website you see there are several categories with varying numbers of books
- see second file for this
