<div class="alert alert-block alert-info"><br>
    <h2>Week 8_Part3. Web Scraping by HTML with BeautifulSoup</h2> <br>
    <p>Here we'll learn how to parse HTML documents and scrape their content by selecting either HTML elements such as <kbd class="alertalert-block alert-danger"><b>p</b></kbd>, <kbd class="alert-alert-block alert-danger"><b>h1</b></kbd>, <kbd class="alert-alert-block alert-danger"><b>div</b></kbd> and the like, or HTML attributes such as <kbd class="alert-alert-block alert-success"><b>href</b></kbd>, as in <kbd>a href=""</kbd>, or <kbd class="alert-alert-block alert-success"><b>class</b></kbd>, as in <kbd><b>p class="first"</b></kbd>.</p><br>
    <p>We'll use the Python libraries <kbd class="alert-alert-block alert-warning"><b>requests</b></kbd> and, most importantly, <kbd class="alert-alert-block alert-warning"><b>BeautifulSoup</b></kbd> to extract content from HTML and XML documents.</p>
</div>

In [1]:
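# The requests module, like the standard library's urllib (urllib2 in Python 2), handles HTTP requests.
# We'll stick to requests.
# Note: the alias `re` used below is also the name of Python's regular-expression module,
# so avoid `import re` in this notebook, or the alias will be overwritten.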

import requests as re

# First, install BeautifulSoup, a Python web scraping library, either from Jupyter Notebook (JN) or from the Terminal.
#!pip install beautifulsoup4 # if via JN
# Alternatively, you may use conda instead of pip, provided conda is installed on your computer.

# Once the library is installed, import it from the bs4 package (BeautifulSoup version 4).

from bs4 import BeautifulSoup as bs

In [2]:
# The next step is to connect to the webpage with the .get() function of the requests module.
# GET is the most commonly used HTTP method.
# A GET request retrieves data from the specified resource.


url = re.get('https://www.bl.uk/people/william-blake') # send a GET request to the webpage

In [3]:
# If you get the output <Response [200]>, all is running well.

url

# <Response [429]> would mean that you sent too many requests in a given amount of time ("rate limiting").

<Response [200]>

<div class="alert alert-block alert-danger">
    <p>Response 429</p>
    <img src="Response429.JPG">
</div>
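
Status codes other than 200 and 429 will turn up sooner or later. As a rough sketch, the helper below turns a numeric code into a readable label using only the standard library. The names `describe_status` and `is_success` are hypothetical, introduced here for illustration; they are not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code, e.g. '429 Too Many Requests'."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


def is_success(code):
    """Treat any 2xx code as a successful response."""
    return 200 <= code < 300


for code in (200, 404, 429):
    print(describe_status(code), "| success:", is_success(code))
```

In this notebook the helper could be called as `describe_status(url.status_code)`. requests also provides `url.raise_for_status()`, which raises an exception for 4xx and 5xx responses.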

In [4]:
# Store the raw content of the HTML document (a bytes object) in the htmlDoc variable by using the .content attribute.

htmlDoc = url.content

htmlDoc

b'\r\n<!DOCTYPE HTML>\r\n<html class="no-js green" lang="en" >\r\n<head id="head" itemscope="" itemtype="https://schema.org/Organization">\r\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\r\n\r\n<link rel="stylesheet" href="/britishlibrary/resources/global/css2/desktop-styles?v=l-CmMwJtVOS98dkFyu7CWsIwoCA5WhdH2qpU6aq8ZAE1" media="screen" />\r\n\r\n<link href="/britishlibrary/resources/global/css2/all-print.css" rel="stylesheet" type="text/css" media="print" />\r\n<!--[if IE 7]><link rel="stylesheet" type="text/css" href="/britishlibrary/resources/global/css2/all-ie-7.css" /><![endif]-->\r\n<!--[if IE 8]><link rel="stylesheet" type="text/css" href="/britishlibrary/resources/global/css2/all-ie-8.css" /><![endif]-->\r\n\r\n\r\n<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>\r\n<script>if (!window.jQuery) { document.write(\'<script src="/britishlibrary/resources/global/scripts2/jquery-1.9.1.js"><\\/script>\');}\r\n</script>\r\n\r\n\r\n\r\n
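
The leading `b'` in the output shows that `.content` holds bytes rather than a string. The sketch below decodes a short sample copied from the start of that output; it assumes the page is encoded as UTF-8, which may not hold for every site.

```python
# A short sample of raw bytes, copied from the beginning of the output above.
raw = b'\r\n<!DOCTYPE HTML>\r\n<html class="no-js green" lang="en" >\r\n'

# Decode the bytes into a str (assumption: the page uses UTF-8).
decoded = raw.decode("utf-8")

print(type(raw).__name__)
print(type(decoded).__name__)
print(decoded.strip())
```

requests also exposes the decoded string directly as `url.text`. BeautifulSoup accepts either bytes or a string, so the notebook can keep passing `htmlDoc` as it is.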

In [5]:
# The output above is messy.
# To get a tidier view of the HTML document, parse the content stored in the htmlDoc variable.

# Pass a second argument that specifies the parser.

# There are several parsers: html.parser (built in), lxml, xml (provided by lxml) and html5lib.
# lxml is very fast; html5lib is slower but the most lenient, parsing even malformed HTML the way a browser would.
# lxml and html5lib are separate packages and must be installed before use.

Blake = bs(htmlDoc, 'html5lib') # create a BeautifulSoup object from the page's HTML, parsed with html5lib

Blake

<!DOCTYPE html>
<html class="no-js green" lang="en"><head id="head" itemscope="" itemtype="https://schema.org/Organization">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>

<link href="/britishlibrary/resources/global/css2/desktop-styles?v=l-CmMwJtVOS98dkFyu7CWsIwoCA5WhdH2qpU6aq8ZAE1" media="screen" rel="stylesheet"/>

<link href="/britishlibrary/resources/global/css2/all-print.css" media="print" rel="stylesheet" type="text/css"/>
<!--[if IE 7]><link rel="stylesheet" type="text/css" href="/britishlibrary/resources/global/css2/all-ie-7.css" /><![endif]-->
<!--[if IE 8]><link rel="stylesheet" type="text/css" href="/britishlibrary/resources/global/css2/all-ie-8.css" /><![endif]-->


<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>if (!window.jQuery) { document.write('<script src="/britishlibrary/resources/global/scripts2/jquery-1.9.1.js"><\/script>');}
</script>



        <style media="screen" type="text/css">
            .desktop-

In [6]:
# Check what type of object the Blake variable is.

type(Blake)

bs4.BeautifulSoup
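
To see what parsing malformed HTML looks like in practice, here is a minimal sketch that feeds a made-up snippet with unclosed `<p>` tags to the built-in `html.parser`. The snippet `broken_html` is invented for illustration and is not taken from the Blake page.

```python
from bs4 import BeautifulSoup

# Two paragraphs whose closing </p> tags are missing (made-up example).
broken_html = "<p>First paragraph<p>Second paragraph"

# html.parser ships with Python, so nothing extra needs installing.
soup = BeautifulSoup(broken_html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))   # both <p> elements are found despite the missing closing tags
print(soup.prettify())   # shows how this parser chose to repair the markup
```

Each parser repairs broken markup in its own way, so the printed tree may look different if you swap in `'lxml'` or `'html5lib'` (assuming they are installed).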

In [7]:
# The BS method to get content by html tags is find_all() for BS4 and findAll() for BS3.

# To scrape the content held by specific html tags, pass the name of the html tag as an argument.

# Let's get the paragraphs first. They are normally marked up with <p></p> in HTML.

para = Blake.find_all('p')

para

[<p>William Blake is famous today as an imaginative and original poet, painter, engraver and mystic. But his work, especially his poetry, was largely ignored during his own lifetime, and took many years to gain widespread appreciation.</p>,
 <p>The third of six children of a Soho hosier, William Blake lived and worked in London all his life. As a boy, he claimed to have seen ‘bright angelic wings bespangling every bough like stars’ in a tree on Peckham Rye, one of the earliest of many visions. In 1772, he was apprenticed to the distinguished printmaker James Basire, who extended his intellectual and artistic education. Three years of drawing murals and monuments in Westminster Abbey fed a fascination with history and medieval art.</p>,
 <p>In 1782, he married Catherine Boucher, the steadfast companion and manager of his affairs for the whole of his chequered, childless life. Much in demand as an engraver, he experimented with combining poetry and image in a printing process he invented

In [8]:
# Count how many paragraphs there are on this page.

len(para)

20

In [9]:
# The difference between .find() and .find_all():
# Blake.find('p') will get you the first paragraph only, while Blake.find_all('p') will get them all.

# You can pick out a particular <p> by its index position. Remember that the find_all() method returns a list!

# Take the third paragraph (index 2) and strip the tags off with the .text property of that <p> tag object.


para2 = Blake.find_all('p')[2].text

para2

'In 1782, he married Catherine Boucher, the steadfast companion and manager of his affairs for the whole of his chequered, childless life. Much in demand as an engraver, he experimented with combining poetry and image in a printing process he invented himself in 1789. Among the spectacular works of art this produced were ‘The Marriage of Heaven and Hell’, ‘Visions of the Daughters of Albion’, ‘Jerusalem’, and ‘Songs of Innocence and Experience’.'
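
The difference between `.find()` and `.find_all()` is easier to see on a tiny document. The sketch below uses a made-up three-paragraph snippet rather than the live page, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A made-up document with three paragraphs.
html = "<div><p>First</p><p>Second</p><p>Third</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")          # a single Tag: the first match only
all_paras = soup.find_all("p")  # a list-like ResultSet of every match
third_text = all_paras[2].text  # index 2 is the third paragraph
missing = soup.find("table")    # find() returns None when nothing matches

print(first.text)
print(len(all_paras))
print(third_text)
print(missing)
```

Because `.find()` can return `None`, it is worth checking the result before reading `.text` from it.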

In [10]:
# To strip away the html tags and extract content alone, use a for-loop to iterate through each pair of <p> tags.
# Use the .text property to get content only.

for i in Blake.find_all('p'):
    print(i.text)


William Blake is famous today as an imaginative and original poet, painter, engraver and mystic. But his work, especially his poetry, was largely ignored during his own lifetime, and took many years to gain widespread appreciation.
The third of six children of a Soho hosier, William Blake lived and worked in London all his life. As a boy, he claimed to have seen ‘bright angelic wings bespangling every bough like stars’ in a tree on Peckham Rye, one of the earliest of many visions. In 1772, he was apprenticed to the distinguished printmaker James Basire, who extended his intellectual and artistic education. Three years of drawing murals and monuments in Westminster Abbey fed a fascination with history and medieval art.
In 1782, he married Catherine Boucher, the steadfast companion and manager of his affairs for the whole of his chequered, childless life. Much in demand as an engraver, he experimented with combining poetry and image in a printing process he invented himself in 1789. Amon

In [11]:
# Let's collect all paragraphs into a list and save it in an external file.

# Define a variable AboutBlake and assign an empty list to it. 

AboutBlake = [ ]

# We'll collect the text of every <p> and append it to the empty list above with a for-loop.

for i in Blake.find_all('p'):
    paraBlake = i.text
    AboutBlake.append(paraBlake)

# Check how many items are in AboutBlake list. This is just to make you aware that it's a list and not one string.

print(len(AboutBlake))
print(AboutBlake)
        

20
['William Blake is famous today as an imaginative and original poet, painter, engraver and mystic. But his work, especially his poetry, was largely ignored during his own lifetime, and took many years to gain widespread appreciation.', 'The third of six children of a Soho hosier, William Blake lived and worked in London all his life. As a boy, he claimed to have seen ‘bright angelic wings bespangling every bough like stars’ in a tree on Peckham Rye, one of the earliest of many visions. In 1772, he was apprenticed to the distinguished printmaker James Basire, who extended his intellectual and artistic education. Three years of drawing murals and monuments in Westminster Abbey fed a fascination with history and medieval art.', 'In 1782, he married Catherine Boucher, the steadfast companion and manager of his affairs for the whole of his chequered, childless life. Much in demand as an engraver, he experimented with combining poetry and image in a printing process he invented himself in
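
The same collection step can be written as a list comprehension. This is only a sketch on a made-up snippet; it also uses `get_text(strip=True)` to trim surrounding whitespace and then drops paragraphs that turn out to be empty.

```python
from bs4 import BeautifulSoup

# A made-up snippet: one padded paragraph, one empty paragraph, one normal paragraph.
html = "<body><p> First paragraph. </p><p></p><p>Second paragraph.</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Build the list in one expression instead of appending inside a loop.
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p")]

# Keep only the paragraphs that actually contain text.
non_empty = [text for text in paragraph_texts if text]

print(paragraph_texts)
print(non_empty)
```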

In [12]:
# Save the list to an external file named 'AboutBlake.txt'.

# Don't forget to pass the 'w' argument so that Python opens the file for writing.

# Also, when writing the scraped content into the .txt file, join all 20 list items into one string.

with open('AboutBlake.txt', 'w', encoding="utf-8") as file: # the with-statement closes the file automatically
    file.write('\n'.join(AboutBlake)) # join the items with a newline so the paragraphs don't run together
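
If the list items are joined with a newline, the saved file can later be split back into the same list. The sketch below shows that round trip on a two-item stand-in list and writes to a temporary directory, so it does not touch `AboutBlake.txt`; the file name `sample.txt` is made up.

```python
import os
import tempfile

# A stand-in for the AboutBlake list (made-up content).
paragraphs = ["First paragraph.", "Second paragraph."]

# Write to a throwaway location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))  # one paragraph per line

with open(path, encoding="utf-8") as f:
    restored = f.read().split("\n")  # split the lines back into a list

print(restored == paragraphs)
```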

In [13]:
# On the Blake page, headings of different levels are marked up with <h1> and <h2>.

# Heading levels range from h1 to h6, where h1 is the most important (and usually the largest).

# To scrape content by more than one html element, pass the element names as a list.

headings = Blake.find_all(['h1', 'h2'])

headings

[<h1 class="masthead-title">
                             <span class="sp-category">
                                 
                             </span>
                             <span class="sp-name">
                                 
                             </span>
                             <span class="sp-subtitle">
                                 
                             </span>
 
                             <span id="ctl00_mastheadtitle">People</span>                            
                         </h1>,
 <h1 class="page-title">William Blake</h1>,
 <h2 class="pnl-row-title p-b-2col hidden">Explore further</h2>,
 <h2 class="pnl-row-title">Related articles</h2>,
 <h2 class="pnl-row-title">Related collection items</h2>,
 <h2 class="pnl-row-title">Related people</h2>,
 <h2 class="pnl-row-title">Related teachers' notes</h2>,
 <h2 class="pnl-row-title">Related works</h2>,
 <h2 class="block-title">Share this page</h2>,
 <h2 class="block-title">British Library n

In [14]:
# Loop through the headings variable to get the text only, and compare the output below with the one above.

for i in headings:
    print(i.text)


                            
                                
                            
                            
                                
                            
                            
                                
                            

                            People                            
                        
William Blake
Explore further
Related articles
Related collection items
Related people
Related teachers' notes
Related works
Share this page
British Library newsletter
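
The blank lines in the output come from whitespace inside the nested `<span>` tags. One way to tidy this up, sketched below on a shortened copy of the headings shown earlier, is `get_text(strip=True)`, which trims each piece of text and skips the empty ones. The tag name is kept alongside the text so the heading level is not lost.

```python
from bs4 import BeautifulSoup

# A shortened copy of the heading markup from the output above.
html = """
<h1 class="masthead-title">
    <span class="sp-category">
    </span>
    <span id="ctl00_mastheadtitle">People</span>
</h1>
<h1 class="page-title">William Blake</h1>
<h2 class="pnl-row-title">Related articles</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Pair each heading's tag name with its whitespace-stripped text.
clean_headings = [(h.name, h.get_text(strip=True)) for h in soup.find_all(["h1", "h2"])]

for level, text in clean_headings:
    print(level, text)
```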


In [15]:
# Let's scrape off parts that are marked up with <a> tags used for creating hyperlinks from one html document to another.

Links = Blake.find_all('a')

Links

# As you can see from the output, the code fetches the html tags with the links. 

[<a href="#main">Skip to main content</a>, <a href="https://www.bl.uk">
                         <img alt="THE BRITISH LIBRARY" height="100" src="/britishlibrary/resources/global/images/bl_logo_100.gif" width="52"/></a>, <a href="#">
                     <h3>Catalogues &amp; Collections</h3>
                 </a>, <a href="https://www.bl.uk/catalogues-and-collections">Catalogues &amp; Collections</a>, <a href="http://explore.bl.uk">Search the Main Catalogue</a>, <a href="http://searcharchives.bl.uk">Archives and Manuscripts</a>, <a href="http://cadensa.bl.uk">Sound and Moving Image</a>, <a href="http://bnb.bl.uk/">British National Bibliography</a>, <a href="http://www.bl.uk/reshelp/findhelprestype/catblhold/all/allcat.html">All Catalogues &gt;</a>, <a href="https://www.bl.uk/business-and-management">Business and Management</a>, <a href="https://www.bl.uk/manuscripts/ ">Digitised Manuscripts</a>, <a href="http://socialwelfare.bl.uk/">Social Welfare</a>, <a href="http://sounds.bl.uk">Sou

In [16]:
# To get only the URLs, we need to select them by the href attribute inside the opening <a> tag.

# Write a loop to iterate through all <a> tags and fetch the links if those tags contain the href attribute.

for link in Blake.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])

# View the output: you might want to clean the data and select only the results starting with https://
# How will you do it?

#main
https://www.bl.uk
#
https://www.bl.uk/catalogues-and-collections
http://explore.bl.uk
http://searcharchives.bl.uk
http://cadensa.bl.uk
http://bnb.bl.uk/
http://www.bl.uk/reshelp/findhelprestype/catblhold/all/allcat.html
https://www.bl.uk/business-and-management
https://www.bl.uk/manuscripts/ 
http://socialwelfare.bl.uk/
http://sounds.bl.uk
http://ethos.bl.uk
https://www.bl.uk/collection-guides/uk-web-archive
https://www.bl.uk/catalogues-and-collections/digital-collections
https://www.bl.uk/help/how-to-get-a-reader-pass
http://www.bl.uk/reshelp/inrrooms/stp/refteam/refteam/refcontacts.html 
https://www.bl.uk/reshelp/inrrooms/readingrooms.html
http://www.bl.uk/catalogues/search/help.html
https://www.bl.uk/subjects
http://www.bl.uk/on-demand
https://www.bl.uk/digitisation-services
#
https://www.bl.uk/discover-and-learn
http://www.bl.uk/learning/schools-and-teachers
https://www.bl.uk/learning/adults
https://www.bl.uk/learning/families-and-community-groups
http://www.bl.uk/learning/on
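
One possible way to clean the links, offered as a sketch rather than the only answer, is to strip stray whitespace and keep the strings that start with `https://`. The list `hrefs` below is a small sample copied from the output above.

```python
# A small sample of the href values printed above.
hrefs = [
    "#main",
    "https://www.bl.uk",
    "#",
    "http://explore.bl.uk",
    "https://www.bl.uk/manuscripts/ ",  # note the trailing space in the original
    "https://www.bl.uk/subjects",
]

# Strip whitespace first, then keep only the secure (https) links.
https_links = [h.strip() for h in hrefs if h.strip().startswith("https://")]

for link in https_links:
    print(link)
```

In the notebook the same filter could be applied to the values collected from `Blake.find_all('a')`. Using `startswith(("http://", "https://"))` instead would keep both kinds of absolute link and drop only the `#` anchors.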

In [17]:
# For a change, go to the Wikipedia page on William Blake. 

# Fetch the url, read in its content and parse it. 

url2 = re.get('https://en.wikipedia.org/wiki/William_Blake')
html2 = url2.content
wiki = bs(html2, 'html.parser')
wiki

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>William Blake - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"a36af521-8cdd-48d0-a29f-f86204aedbf3","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"William_Blake","wgTitle":"William Blake","wgCurRevisionId":990723855,"wgRevisionId":990723855,"wgArticleId":33175,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing explicitly cited Early Modern English-language text","CS1: long volume value","Wikipedia pending changes protected pages","Articles with short descrip

In [18]:
# On the right side of the webpage, there's a summary about Blake. 
# It's encoded as <table> in the html document. 


# <table> is a wrapper tag that contains <tbody>, which in turn contains a <tr> for each row.
# A row's header cell is defined with a <th> tag.
# Each table data cell is defined with a <td> tag.

# The content we want to scrape sits between the <th></th> and <td></td> tags.

tBody = wiki.find('tbody') # in the html document, find the first <tbody>

tRows = tBody.find_all('tr') # from <tbody> take all <tr> elements and their content
print(len(tRows))

# For a change, instead of initializing an empty list and appending to it,
# use a list comprehension inside the loop to build the list of cell texts.

for row in tRows:
    cols = row.find_all(['th', 'td'])
    cols2 = [ x.text for x in cols] 
    print(cols2)

11
['William Blake']
['Blake in a portraitby Thomas Phillips (1807)']
['Born', '(1757-11-28)28 November 1757Soho, London, England']
['Died', '12 August 1827(1827-08-12) (aged\xa069)Charing Cross, London, England[1]']
['Occupation', 'Poet, painter, printmaker']
['Genre', 'Visionary, poetry']
['Literary movement', 'Romanticism']
['Notable works', 'Songs of Innocence and of Experience, The Marriage of Heaven and Hell, The Four Zoas, Jerusalem, Milton, "And did those feet in ancient time"']
['Spouse', 'Catherine Boucher \u200b(m.\xa01782)\u200b']
['']
['Signature', '']
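
The rows that hold a header and a value can be turned into a dictionary for easier lookup. The sketch below works on a few rows copied from the output above and also removes the invisible characters `\xa0` (non-breaking space) and `\u200b` (zero-width space). The helper name `clean_cell` is hypothetical.

```python
# A few rows copied from the output above.
rows = [
    ['William Blake'],
    ['Occupation', 'Poet, painter, printmaker'],
    ['Genre', 'Visionary, poetry'],
    ['Spouse', 'Catherine Boucher \u200b(m.\xa01782)\u200b'],
    [''],
    ['Signature', ''],
]


def clean_cell(text):
    """Replace non-breaking spaces, drop zero-width spaces, and trim the ends."""
    return text.replace('\xa0', ' ').replace('\u200b', '').strip()


# Keep only the rows that have exactly one header and one value.
infobox = {row[0]: clean_cell(row[1]) for row in rows if len(row) == 2}

for key, value in infobox.items():
    print(f"{key}: {value}")
```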


<div style="background-color:#ccccff"><br>
<h2>USEFUL LINKS</h2>
    <ul>
    <li>https://www.dataquest.io/blog/web-scraping-tutorial-python/</li>
    <li>https://programminghistorian.org/en/lessons/intro-to-beautiful-soup</li>
    <li>https://www.twilio.com/blog/web-scraping-and-parsing-html-in-python-with-beautiful-soup</li>
    </ul>
</div>