In [1]:
from IPython.display import HTML

def css_styling():
    # Load the notebook's custom stylesheet and hand it to IPython for rendering
    with open("../data/www/styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

#  Beautiful Soup, so rich and green, waiting in a hot tureen!

(*The Lobster Quadrille*, Alice in Wonderland)

We are now ready to start scraping web pages. To do so, we are going to use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python package for parsing web pages you have already downloaded. Normally you would use `requests` to GET the page and then `BeautifulSoup` to analyse it.

We will use the Wikipedia page for a player from Germany's national football team as an example: https://en.wikipedia.org/wiki/Erik_Durm. The page has already been downloaded into the `Data/` folder. We are starting with a pre-downloaded HTML page so that there aren't a hundred requests from the same place for the same page on the same server at the same time (which will frequently result in you getting blocked from accessing that website!).
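
For reference, the live workflow (download with `requests`, then parse) might look like the sketch below. The helper names `fetch_html` and `parse_html`, the `User-Agent` string, and the use of Python's built-in `html.parser` are assumptions made for illustration; the rest of this notebook works from the saved file instead.

```python
import bs4


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (needs network access)."""
    import requests  # imported here so the offline part below runs without it

    # Hypothetical User-Agent string; identify your own script politely.
    headers = {"User-Agent": "scraping-tutorial-example"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_html(html):
    """Turn an HTML string into a soup object."""
    return bs4.BeautifulSoup(html, "html.parser")


# Offline demonstration on a tiny, made-up page
sample_soup = parse_html("<html><head><title>Erik Durm</title></head></html>")
print(sample_soup.find("title").text)

# Live usage (not run here):
# soup = parse_html(fetch_html("https://en.wikipedia.org/wiki/Erik_Durm"))
```

Keeping the download and the parsing in separate functions makes it easy to swap the live request for a saved file, which is what we do here.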

In [2]:
import bs4

We start by opening the page and converting it to a `soup` object. Then we use the `find` method to locate the page's `<title>` tag and print it.

In [5]:
with open("../Data/erik_durm_wiki.html", "r", encoding="utf-8") as wiki_file:
    soup = bs4.BeautifulSoup(wiki_file.read(), 'lxml')

# The soup is the entire page
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Erik Durm - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Erik_Durm","wgTitle":"Erik Durm","wgCurRevisionId":667540954,"wgRevisionId":667540954,"wgArticleId":36798241,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","Use dmy dates from August 2013","Articles using Template:Medal with Winner","Articles with German-language external links","1992 births","Living people","People from Pirmasens","German footballers","1. FSV Mainz 05 II players","Borussia Dortmund II players","Borussia Do

In [6]:
# A soup object has many attributes and methods
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'HTML_FORMATTERS',
 'ROOT_TAG_NAME',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_attr_value_as_string',
 '_attribute_checker',
 '_feed',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_most_recent_element',
 '_popToTag',
 '_select_debug',
 '_selector_combinators',
 '_should_pretty_print',
 '_tag_name_matches_and',
 'append',
 'attribselect_re',
 'attrs',
 'builder',

In [7]:
# We start with the `find` method, which returns the first tag of the given type.
title = soup.find('title')
print(title)

<title>Erik Durm - Wikipedia, the free encyclopedia</title>


Note that `title` holds the entire HTML tag. If we want only the text inside it, we need to ask for its `text` attribute.

In [8]:
title.text

'Erik Durm - Wikipedia, the free encyclopedia'

The reason for this is that Beautiful Soup converts HTML tags into its own `Tag` objects. `Tag` objects have many useful attributes.

In [9]:
print(type(title))
print(title.text) # The text gives you the visible part of the tag
print(title.name) # The type of tag

<class 'bs4.element.Tag'>
Erik Durm - Wikipedia, the free encyclopedia
title
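
To make the difference between a `Tag` and its text concrete, here is a minimal sketch on a made-up HTML snippet. The snippet and the choice of the built-in `html.parser` are assumptions, so that it runs without the saved page or `lxml`.

```python
import bs4

snippet = "<p>Durm plays for <b>BVB</b> in Dortmund.</p>"
demo = bs4.BeautifulSoup(snippet, "html.parser")
paragraph = demo.find("p")

print(type(paragraph).__name__)  # the class of the object returned by find
print(paragraph.name)            # the tag's type
print(paragraph.text)            # visible text, with nested tags flattened
print(paragraph.get_text())      # method form of the same thing
print(str(paragraph))            # back to the HTML markup
```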


If a tag has any HTML attributes, they can be accessed in a very "pythonic" way: they are organized as a dictionary!

In [10]:
h1 = soup.find("h1")

print(h1.attrs)
print(h1["class"])
print(h1["id"])

{'lang': 'en', 'class': ['firstHeading'], 'id': 'firstHeading'}
['firstHeading']
firstHeading
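
Because attributes behave like a dictionary, a missing attribute behaves like a missing key. The sketch below, on a made-up `<h1>` tag parsed with the built-in `html.parser` (both assumptions), contrasts square brackets with `.get()`.

```python
import bs4

snippet = '<h1 class="firstHeading big" id="firstHeading" lang="en">Erik Durm</h1>'
demo = bs4.BeautifulSoup(snippet, "html.parser")
heading = demo.find("h1")

# class can hold several values, so Beautiful Soup returns a list
print(heading["class"])
# single-valued attributes come back as plain strings
print(heading["id"])

# Square brackets raise KeyError for a missing attribute...
try:
    heading["title"]
except KeyError:
    print("no title attribute")

# ...while .get() returns None (or a default you choose)
print(heading.get("title"))
print(heading.get("title", "n/a"))
```

When looping over many tags, `.get()` is usually the safer choice, since a single tag without the attribute would otherwise stop the loop.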


Instead of searching for `Tags` one by one, we can also retrieve them all at once.  As an example, let's find all level 2 headers. To this end, we use the `find_all` method.

In [11]:
headers = soup.find_all('h2')

print(headers)

[<h2>Contents</h2>, <h2><span class="mw-headline" id="Club_career">Club career</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="International_career">International career</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=4" title="Edit section: International career">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="Career_statistics">Career statistics</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=7" title="Edit section: Career statistics">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><spa

Too much information! To get only the information that we need, we must restrict the output to the desired attribute, here the text of each header.

In [12]:
for header in headers:
    print(header.text)

Contents
Club career[edit]
International career[edit]
Career statistics[edit]
Honours[edit]
References[edit]
External links[edit]
Navigation menu
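
The `[edit]` suffix comes from the edit link nested inside each header. One way to drop it, sketched here on a made-up snippet that mimics the page's markup, is to read the `mw-headline` span when it exists and fall back to the header's own text otherwise. The helper name `headline_text` is hypothetical.

```python
import bs4

snippet = """
<h2>Contents</h2>
<h2><span class="mw-headline" id="Club_career">Club career</span><span class="mw-editsection">[<a href="#">edit</a>]</span></h2>
<h2><span class="mw-headline" id="Honours">Honours</span><span class="mw-editsection">[<a href="#">edit</a>]</span></h2>
"""
demo = bs4.BeautifulSoup(snippet, "html.parser")


def headline_text(header):
    """Return the section title without the trailing [edit] link text."""
    span = header.find("span", class_="mw-headline")
    return span.text if span is not None else header.text


titles = [headline_text(h) for h in demo.find_all("h2")]
print(titles)
```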


Another `Tag` that is useful, and that demonstrates some of the other useful attributes, is `<a>`, which marks the links to the web pages that our page points to:

In [13]:
links = soup.find_all('a')

for link in links[:10]:  # Showing just the first 10 links for brevity
    # href represents the target of the link
    # Where the link actually goes to!
    print('-----', link.text)
    print(link.get('href'))
    

----- 
None
----- navigation
#mw-head
----- search
#p-search
----- 
/wiki/File:Erik_Durm_IMG_1748.jpg
----- BVB
/wiki/Borussia_Dortmund
----- [1]
#cite_note-1
----- Pirmasens
/wiki/Pirmasens
----- Left back
/wiki/Defender_(association_football)#Full-back
----- Right back
/wiki/Defender_(association_football)#Full-back
----- Borussia Dortmund
/wiki/Borussia_Dortmund
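
Most of these targets are relative (`/wiki/...`) or point inside the page (`#...`). A possible way to keep only article links and turn them into full addresses is sketched below with `urllib.parse.urljoin`. The snippet is made up, and `PAGE_URL` is assumed to be the address the saved page was downloaded from.

```python
from urllib.parse import urljoin

import bs4

# Assumed address of the saved page, used to resolve relative links
PAGE_URL = "https://en.wikipedia.org/wiki/Erik_Durm"

snippet = """
<a id="top"></a>
<a href="#mw-head">navigation</a>
<a href="/wiki/Borussia_Dortmund">BVB</a>
<a href="/wiki/Pirmasens">Pirmasens</a>
"""
demo = bs4.BeautifulSoup(snippet, "html.parser")

article_links = []
for link in demo.find_all("a"):
    href = link.get("href")  # None when the tag has no href
    if href and href.startswith("/wiki/"):
        article_links.append((link.text, urljoin(PAGE_URL, href)))

print(article_links)
```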


### Searching using attribute information

Some `Tag` elements have attributes associated with them, such as `id`, `class`, and `href`. Our search can restrict results to tags whose attribute has a specific value, or to tags that have the attribute at all.

Note that in keyword arguments we must write `class_` instead of `class`, because `class` is a reserved keyword in Python.

In [14]:
# Retrieve the element with the attribute "id" equal to "Early_career"
tag = soup.find(id="Early_career")
print(tag)
print(tag.text)

<span class="mw-headline" id="Early_career">Early career</span>
Early career


In [15]:
# Retrieve all elements with an href attribute
all_links = soup.find_all(href=True)
print(len(all_links))

373


In [16]:
# Retrieve inline citations -- they are <sup> elements with the class "reference"
soup.find_all("sup", class_="reference")[5:15]

[<sup class="reference" id="cite_ref-6"><a href="#cite_note-6"><span>[</span>6<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-7"><a href="#cite_note-7"><span>[</span>7<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-8"><a href="#cite_note-8"><span>[</span>8<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-9"><a href="#cite_note-9"><span>[</span>9<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-10"><a href="#cite_note-10"><span>[</span>10<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-11"><a href="#cite_note-11"><span>[</span>11<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-12"><a href="#cite_note-12"><span>[</span>12<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-2014_German_Super_Cup_13-0"><a href="#cite_note-2014_German_Super_Cup-13"><span>[</span>13<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-14"><a href="#cite_note-14"><span>[</span>14<span>]</span></a></sup>,
 <s

In [17]:
# Retrieve all tags with class=mw-headline and an id attribute (regardless of value)
soup.find_all(attrs={"class": "mw-headline", "id": True})

[<span class="mw-headline" id="Club_career">Club career</span>,
 <span class="mw-headline" id="Early_career">Early career</span>,
 <span class="mw-headline" id="Borussia_Dortmund">Borussia Dortmund</span>,
 <span class="mw-headline" id="International_career">International career</span>,
 <span class="mw-headline" id="Youth">Youth</span>,
 <span class="mw-headline" id="Senior">Senior</span>,
 <span class="mw-headline" id="Career_statistics">Career statistics</span>,
 <span class="mw-headline" id="Club">Club</span>,
 <span class="mw-headline" id="International">International</span>,
 <span class="mw-headline" id="Honours">Honours</span>,
 <span class="mw-headline" id="Club_2">Club</span>,
 <span class="mw-headline" id="International_2">International</span>,
 <span class="mw-headline" id="References">References</span>,
 <span class="mw-headline" id="External_links">External links</span>]
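
The keyword form (`class_=...`, `id=True`) and the `attrs` dictionary form are two spellings of the same search. The sketch below checks this on a made-up snippet, parsed with the built-in `html.parser` as an assumption.

```python
import bs4

snippet = """
<span class="mw-headline" id="Club_career">Club career</span>
<span class="mw-headline">No id here</span>
<span class="mw-editsection" id="edit_1">edit</span>
"""
demo = bs4.BeautifulSoup(snippet, "html.parser")

# Keyword arguments: class_ avoids the reserved word, id=True means "has an id"
by_keyword = demo.find_all(class_="mw-headline", id=True)
# The same filter written as a dictionary
by_dict = demo.find_all(attrs={"class": "mw-headline", "id": True})

print(by_keyword == by_dict)
print([tag["id"] for tag in by_keyword])
```

The dictionary form is handy for attribute names that are not valid Python identifiers, such as `data-file-width`.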

### Navigating the HTML tree with BeautifulSoup


Besides searching for elements anywhere in the HTML tree, Beautiful Soup also allows you to navigate the tree in any direction.

Let's try to get at the first paragraph (`<p>`) in the `Club career` section starting from the section's title tag.

Here's the relevant HTML snippet:

```html
    <h2>
      <span class="mw-headline" id="Club_career">Club career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h2>
    <h3>
      <span class="mw-headline" id="Early_career">Early career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=2" title="Edit section: Early career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h3>
    <p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
```

We can see that this section of text is *under* the "Club career" title:

In [18]:
section_headline = soup.find(id="Club_career")
print(section_headline)
print(section_headline.text)
section_headline.contents

<span class="mw-headline" id="Club_career">Club career</span>
Club career


['Club career']

The `contents` attribute lets us access everything that is inside a given tag. In this case we find only the visible text of the tag.

Looking at the webpage snippet, we see that the tag `<p>` is at the same level as the tags `<h2>` and `<h3>`.  Hence, we need to navigate up one level (to the `<h2>` tag), then navigate to its second sibling (first `<h3>` then `<p>`).

In [19]:
parent_h2 = section_headline.parent  # Up one level
print( parent_h2.name == "h2" )      # Is it the <h2> tag?
print()
print(parent_h2.contents) 

True

[<span class="mw-headline" id="Club_career">Club career</span>, <span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a><span class="mw-editsection-bracket">]</span></span>]


In [20]:
one_step = parent_h2.next_sibling
print(one_step.name)

None


In [21]:
two_steps = one_step.next_sibling
print(two_steps.name)

h3


We are only at the `<h3>` tag even though we moved past two siblings. The reason is that some of the siblings in the soup are not actual HTML elements: the line breaks between tags are stored as text nodes, whose `name` is `None`.
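
To see these in-between siblings directly, the sketch below lists every sibling that follows an `<h2>` in a made-up snippet and labels each one as a tag or a string. The snippet and the built-in `html.parser` are assumptions chosen so the example runs on its own.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Made-up snippet with line breaks between the tags, as in the real page
snippet = (
    "<div><h2>Club career</h2>\n"
    "<h3>Early career</h3>\n"
    "<p>Durm began his club career in 1998.</p></div>"
)
demo = BeautifulSoup(snippet, "html.parser")
h2 = demo.find("h2")

kinds = []
for sibling in h2.next_siblings:
    if isinstance(sibling, Tag):
        kinds.append("tag")
        print("tag:", sibling.name)
    elif isinstance(sibling, NavigableString):
        kinds.append("string")
        print("string:", repr(str(sibling)))
```

Checking `isinstance(sibling, Tag)` is a simple way to skip the whitespace when stepping through siblings by hand.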

In [22]:
three_steps = two_steps.next_sibling
print(three_steps.name)

None


In [23]:
four_steps = three_steps.next_sibling
print(four_steps.name)

p


In [24]:
print(four_steps.contents)

['Durm began his club career in 1998 at the academy of SG Rieschweiler, before joining the academy of ', <a href="/wiki/1._FC_Saarbr%C3%BCcken" title="1. FC Saarbrücken">1. FC Saarbrücken</a>, ' in 2008 where he became youth league top scorer of the 2009–2010 season with 13 goals.', <sup class="reference" id="cite_ref-pfaelzischer-merkur.de_2-0"><a href="#cite_note-pfaelzischer-merkur.de-2"><span>[</span>2<span>]</span></a></sup>, ' In July 2010, Durm was enrolled at the academy of ', <a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a>, ' and won the 2010–11 Youth Federation Cup in Germany and Durm debuted and played his only game of the 2010–11 season for the second team of 1. FSV Mainz 05 on 4 December 2010 against ', <a href="/wiki/SV_Elversberg" title="SV Elversberg">SV Elversberg</a>, ' in the German ', <a href="/wiki/Regionalliga" title="Regionalliga">Regionalliga</a>, '.', <sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>

OK, now we are where we wanted to be: we have the contents of the `<p>` tag. This is something we must always be mindful of. Web scraping can be, and very frequently will be, messy and will involve trial and error...

We can see that the contents of our desired element is a list. Let's obtain the number of elements and check what they contain.

In [25]:
print(len(four_steps.contents))
print(four_steps.contents[1])
print(four_steps.contents[5])

12
<a href="/wiki/1._FC_Saarbr%C3%BCcken" title="1. FC Saarbrücken">1. FC Saarbrücken</a>
<a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a>


Much nicer!
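
Since `contents` mixes plain strings and `Tag` objects, it often helps to separate the two or to flatten everything into text. A minimal sketch on a shortened, made-up version of the paragraph (parsed with the built-in `html.parser`, an assumption):

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<p>Durm joined the academy of '
    '<a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a>'
    ' in 2010.<sup class="reference">[3]</sup></p>'
)
paragraph = BeautifulSoup(snippet, "html.parser").find("p")

# contents mixes plain strings and Tag objects; keep only the tags
tags_only = [child for child in paragraph.contents if isinstance(child, Tag)]
print([tag.name for tag in tags_only])

# Titles of the pages linked from the paragraph
link_titles = [a["title"] for a in paragraph.find_all("a")]
print(link_titles)

# The whole paragraph as plain text
print(paragraph.get_text())
```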

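In this walk we used the `parent` and `next_sibling` attributes. There are also `previous_sibling`, `children`, and `descendants`, as well as search methods such as `find_next_sibling`, `find_previous_sibling`, and `find_parent`, among many others.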

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these methods. There is no need to memorize all of them. It's more important to realize that, as in programming generally, there is more than one way to reach any element of the HTML tree. The trick is to *pick a good starting point* for the scraping.
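
As an illustration of picking a starting point, the walk from a section headline to its first paragraph can be written as one chain with `find_parent` and `find_next_sibling`, which only return tags and so step over the whitespace strings. This is a sketch on a made-up snippet; the helper `first_paragraph_of` is hypothetical.

```python
import bs4

snippet = """
<h2><span class="mw-headline" id="Club_career">Club career</span></h2>
<h3><span class="mw-headline" id="Early_career">Early career</span></h3>
<p>Durm began his club career in 1998.</p>
<h2><span class="mw-headline" id="Honours">Honours</span></h2>
"""
demo = bs4.BeautifulSoup(snippet, "html.parser")


def first_paragraph_of(soup, section_id):
    """Return the first <p> that follows the header of the given section."""
    headline = soup.find(id=section_id)
    if headline is None:
        return None
    header = headline.find_parent(["h2", "h3"])  # climb to the enclosing header
    return header.find_next_sibling("p") if header is not None else None


paragraph = first_paragraph_of(demo, "Club_career")
print(paragraph.text)
```

One caveat of this shortcut: if a section has no paragraph of its own, `find_next_sibling("p")` keeps going and returns a paragraph from a later section, so it is worth checking the result on the real page.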

## Scraping images from a webpage

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [26]:
# Some modules that will allow us to display images and other media in the notebook itself
from IPython.display import display, Image

In [27]:
for image in soup.find_all('img'):
    print(image)

<img src="www/images/Erik_Durm_IMG_1748.jpg"/>
<img src="www/images/Erik_Durm20140714_0009.jpg"/>
<img alt="Germany" class="thumbborder" data-file-height="600" data-file-width="1000" height="30" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/50px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/75px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/100px-Flag_of_Germany.svg.png 2x" width="50"/>
<img alt="" height="1" src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" style="border: none; position: absolute;" title="" width="1"/>
<img alt="Wikimedia Foundation" height="31" src="/static/images/wikimedia-button.png" srcset="/static/images/wikimedia-button-1.5x.png 1.5x, /static/images/wikimedia-button-2x.png 2x" width="88"/>
<img alt="Powered by MediaWiki" height="31" src="https://en.wikipedia.org/static/1.26wmf19/resources/assets/poweredby_mediawik

We can pinpoint a specific image and get its attributes:

In [31]:
images = soup.find_all('img')
img0 = images[0]
print(img0.attrs)

{'src': 'www/images/Erik_Durm_IMG_1748.jpg'}


Then we can display the image using its `src` attribute:

In [32]:
display(Image(url='../data/' + img0['src']))

display(Image(url='../data/' + images[1]['src']))
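
The first two images have local `src` paths because they were saved alongside the page. The remaining ones use protocol-relative (`//host/...`) or root-relative (`/static/...`) addresses, which need the page's address to become complete URLs. A hedged sketch with `urllib.parse.urljoin`, where `PAGE_URL` is assumed to be the original address of the page and the snippet is a shortened copy of two tags from the output above:

```python
from urllib.parse import urljoin

import bs4

# Assumed address of the page the images came from
PAGE_URL = "https://en.wikipedia.org/wiki/Erik_Durm"

snippet = """
<img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/50px-Flag_of_Germany.svg.png" width="50"/>
<img alt="Wikimedia Foundation" src="/static/images/wikimedia-button.png"/>
"""
demo = bs4.BeautifulSoup(snippet, "html.parser")

sources = []
for image in demo.find_all("img"):
    full_url = urljoin(PAGE_URL, image["src"])  # fills in scheme and host
    sources.append(full_url)
    print(image.get("alt", "(no alt text)"), "->", full_url)
```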



## Exercise: Scraping results from your Personality profile

For this exercise you will use your results from the personality quiz at [HEXACO](http://hexaco.org/hexaco-online). You did take the quiz, right? :)

In [33]:
with open("../Data/my_hexaco.html", "r", encoding="utf-8") as hexaco_file:
    soup = bs4.BeautifulSoup(hexaco_file.read(), 'lxml')

1 - Find the `<table>` element that contains your results and store it in a variable called `table`.

2 - Find all the scale names using the `table` variable from above.

3 - Now get both the scale names and your own scores associated with each scale.

4 - Now replot your scores as a bar chart.