In [None]:
from IPython.display import HTML

def css_styling():
    # Load the notebook's custom stylesheet and return it for rendering
    with open("../data/www/styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

#  Beautiful Soup, so rich and green, waiting in a hot tureen!

(*The Lobster Quadrille*, Alice in Wonderland)

We are now ready to start scraping web pages. To do so, we are going to use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python package for parsing web pages that you have already downloaded. Normally you would use `requests` to GET the page and then `BeautifulSoup` to analyze it.

We will use the Wikipedia page for a player from Germany's national football team as an example: https://en.wikipedia.org/wiki/Erik_Durm. It has already been downloaded into the `Data/` folder. We are starting with a pre-downloaded HTML page so that there aren't a hundred requests from the same place for the same page hitting the same server at the same time (which will frequently result in you getting blocked from accessing that website!).
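
For reference, the live workflow that the pre-downloaded file stands in for might look like the sketch below. The helper names `fetch_soup` and `parse_html`, the timeout value, and the use of Python's built-in `html.parser` are assumptions made for illustration; they are not part of this notebook, and the title string is only a stand-in.

```python
import bs4
import requests


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not used in this notebook)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # Fail loudly on 4xx/5xx responses
    return bs4.BeautifulSoup(response.text, "html.parser")


def parse_html(html):
    """Parse HTML that is already in memory; no network access needed."""
    return bs4.BeautifulSoup(html, "html.parser")


# Parsing works the same whether the HTML came from the network or from disk.
demo = parse_html("<html><head><title>Erik Durm - Wikipedia</title></head></html>")
print(demo.find("title").text)
```

Calling `fetch_soup("https://en.wikipedia.org/wiki/Erik_Durm")` would hit the live site, which is exactly what we want to avoid doing a hundred times at once.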

In [None]:
import bs4

We start by opening the page and converting it to a `soup` object. Then we use the `find` method to find the page's `<title>` tag and print it.

In [None]:
with open("../Data/erik_durm_wiki.html", "r", encoding="utf-8") as wiki_file:
    soup = bs4.BeautifulSoup(wiki_file.read(), 'lxml')

#The soup is the entire page
soup

In [None]:
#There are a number of different functions of a soup
dir(soup)

In [None]:
#We're going to start with the `find` function. It will find the first tag of the given type.
title = soup.find('title')
print(title)

Note that `title` is the entire HTML tag. If we want only the text inside it, we need to ask for its `text` attribute.

In [None]:
title.text

The reason for this is that Beautiful Soup converts HTML tags into its own `Tag` objects. `Tag` objects have many useful attributes.

In [None]:
print(type(title))
print(title.text) # The text gives you the visible part of the tag
print(title.name) # The type of tag

If a tag has any HTML attributes, they can be accessed in a very "pythonic" way: they are organized as a dictionary!

In [None]:
h1 = soup.find("h1")

print(h1.attrs)
print(h1["class"])
print(h1["id"])
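
Dictionary-style access has a couple of quirks that are easier to see on a small, made-up tag (the HTML below is a stand-in, not copied from the Wikipedia page). Multi-valued attributes such as `class` come back as a list, and `get` avoids a `KeyError` when an attribute is missing.

```python
import bs4

# Made-up tag for illustration only
html = '<h1 id="firstHeading" class="firstHeading page-title">Erik Durm</h1>'
tag = bs4.BeautifulSoup(html, "html.parser").find("h1")

print(tag["id"])        # A single-valued attribute is a plain string
print(tag["class"])     # `class` can hold several values, so it is a list
print(tag.get("lang"))  # Missing attribute: `get` returns None

try:
    tag["lang"]         # Square brackets raise KeyError for a missing attribute
except KeyError:
    print("no lang attribute")
```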

Instead of searching for tags one by one, we can also retrieve them all at once. As an example, let's find all level 2 headers using the `find_all` method.

In [None]:
headers = soup.find_all('h2')

print(headers)

Too much information! To get only the information that we need, we must restrict the output to the desired attribute.

In [None]:
for header in headers:
    print(header.text)
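
The difference between `find` and `find_all` shows up most clearly when nothing matches. Here is a minimal sketch on a made-up snippet (the header names are placeholders):

```python
import bs4

# Made-up snippet for illustration only
html = "<div><h2>Club career</h2><h2>International career</h2></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.find("h2")                 # First match only
every = soup.find_all("h2")             # All matches, in document order
capped = soup.find_all("h2", limit=1)   # At most one match, still a list

missing_one = soup.find("table")        # No match: None
missing_all = soup.find_all("table")    # No match: empty list

print(first.text)
print([h.text for h in every])
print(missing_one, len(missing_all))
```

Because `find` can return `None`, it is worth checking the result before reaching for `.text` on a page whose structure you have not inspected.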

Another useful `Tag` is `<a>`, which holds the links from our page to other web pages. It also demonstrates some other useful attributes:

In [None]:
links = soup.find_all('a')

for link in links[:10]:  # Showing just the first 10 links for brevity
    # href represents the target of the link
    # Where the link actually goes to!
    print('-----', link.text)
    print(link.get('href'))
    

### Searching using attribute information

Some `Tag` elements have attributes associated with them, such as `id`, `class`, and `href`. Our search can restrict results to tags whose attribute has a specific value, or to tags that have the attribute at all.

Note that we must use `class_` instead of `class` as the keyword argument, because `class` is a reserved keyword in Python.

In [None]:
# Retrieve the element with the attribute "id" equal to "Early_career"
tag = soup.find(id="Early_career")
print(tag)
print(tag.text)

In [None]:
# Retrieve all elements with an href attribute
all_links = soup.find_all(href=True)
print(len(all_links))

In [None]:
# Retrieve inline citations -- they are <sup> elements with the class "reference"
soup.find_all("sup", class_="reference")[5:15]

In [None]:
# Retrieve all tags with class=mw-headline and an id attribute (regardless of value)
soup.find_all(attrs={"class": "mw-headline", "id": True})
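
The keyword form and the `attrs` dictionary form are interchangeable. The sketch below checks this on a small snippet modeled on the headline markup of the page; the links in the last line of HTML are invented for the example.

```python
import bs4

# Small stand-in for the real page; the <a> tags are invented
html = """
<h2><span class="mw-headline" id="Club_career">Club career</span></h2>
<h3><span class="mw-headline" id="Early_career">Early career</span></h3>
<p>See <a href="/wiki/Example">an example link</a> and <a>an anchor without href</a>.</p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Same search written two ways
by_keyword = soup.find_all("span", class_="mw-headline")
by_dict = soup.find_all("span", attrs={"class": "mw-headline"})
print(by_keyword == by_dict)

# `href=True` matches any tag that has an href, whatever its value
with_href = soup.find_all(href=True)
print(len(with_href))

# Searching by id returns the single matching tag
print(soup.find(id="Early_career").text)
```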

# Navigation

### Navigating the HTML tree 


Besides searching for elements anywhere in the HTML tree, Beautiful Soup also lets you navigate the tree in any direction.

Let's try to get at the first paragraph (`<p>`) in the `Club career` section starting from the section's title tag.

Here's the relevant HTML snippet:

```html
    <h2>
      <span class="mw-headline" id="Club_career">Club career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h2>
    <h3>
      <span class="mw-headline" id="Early_career">Early career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=2" title="Edit section: Early career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h3>
    <p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
```

We can see that this section of text sits *under* the "Club career" title:

In [None]:
section_headline = soup.find(id="Club_career")
print(section_headline)
print(section_headline.text)
section_headline.contents

The `contents` attribute lets us access everything that is inside a given tag. In this case we find only the visible text of the tag.

Looking at the HTML snippet, we see that the `<p>` tag is at the same level as the `<h2>` and `<h3>` tags. Hence, we need to navigate up one level (to the `<h2>` tag) and then move to its second sibling element (first `<h3>`, then `<p>`).

In [None]:
parent_h2 = section_headline.parent  # Up one level
print( parent_h2.name == "h2" )      # Is it the <h2> tag?
print()
print(parent_h2.contents) 

In [None]:
one_step = parent_h2.next_sibling
print(one_step.name)

In [None]:
two_steps = one_step.next_sibling
print(two_steps.name)

We are only at the `<h3>` tag even though we moved two siblings along. The reason is that some of the siblings in the soup are not actual HTML elements. Some are simply strings, such as the line breaks between tags.

In [None]:
three_steps = two_steps.next_sibling
print(three_steps.name)

In [None]:
four_steps = three_steps.next_sibling
print(four_steps.name)

In [None]:
print(four_steps.contents)

Ok. Now we are where we wanted to be: we have reached the `<p>` tag and its text. This is something we must always be mindful of. Web scraping can be, and very frequently will be, messy and will involve trial and error...

We can see that the contents of our desired element is a list. Let's obtain the number of elements and check what they contain.

In [None]:
print(len(four_steps.contents))
print(four_steps.contents[1])
print(four_steps.contents[5])

Much nicer!
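
Since `contents` mixes plain strings with nested tags, it can help to label each child before indexing into the list. The paragraph below is a shortened, made-up stand-in for the real one, and the link target is hypothetical.

```python
import bs4

# Made-up paragraph; the link target is hypothetical
html = '<p>Durm began his <a href="/wiki/Club">club</a> career in <b>1998</b>.</p>'
p = bs4.BeautifulSoup(html, "html.parser").find("p")

# Each child is either a Tag or a plain string
for child in p.contents:
    kind = "tag" if isinstance(child, bs4.Tag) else "text"
    print(kind, repr(str(child)))

# Keep only the nested tags, dropping the strings in between
tags_only = [child for child in p.contents if isinstance(child, bs4.Tag)]
print([tag.name for tag in tags_only])
```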

Besides the `next_sibling` attribute, there are also `previous_sibling`, `next_siblings`, `parent`, and `children`, as well as search methods such as `find_next_sibling` and `find_previous_sibling`, and many others.

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these attributes and methods. There is no need to memorize all of them. It's more important to realize that, as with any programming language, there is more than one way to get to any element of the HTML tree. The trick is to *pick a good starting point* for the scraping.
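
As one illustration, the sibling-hopping above can be shortened with the `find_next_sibling` method, which skips over the in-between strings and stops at the first sibling tag of the requested type. This is a sketch on a trimmed copy of the snippet shown earlier, wrapped in a `<div>` that is added here only to keep the example self-contained.

```python
import bs4

# Trimmed copy of the earlier snippet; the outer <div> is added for the example
html = """
<div>
<h2><span class="mw-headline" id="Club_career">Club career</span></h2>
<h3><span class="mw-headline" id="Early_career">Early career</span></h3>
<p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
h2 = soup.find(id="Club_career").parent

# The immediate next sibling is just the line break after </h2>
first = h2.next_sibling
print(isinstance(first, bs4.NavigableString), repr(str(first)))

# find_next_sibling goes straight to the first <p> at the same level
paragraph = h2.find_next_sibling("p")
print(paragraph.text)
```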

# Scraping images

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [None]:
# Some modules that will allow us to display images and other media in the notebook itself
from IPython.display import display, Image

In [None]:
for image in soup.find_all('img'):
    print(image)

We can pinpoint a specific image and get its attributes:

In [None]:
images = soup.find_all('img')
img0 = images[0]
print(img0.attrs)

Then we can display the image using its `src` attribute:

In [None]:
display(Image(url='../data/' + img0['src']))

display(Image(url='../data/' + images[1]['src']))
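
The cells above prepend a local folder because the page was saved to disk. On a live page, `src` values are often relative or protocol-relative, so one possible approach is to resolve them against the page URL with the standard library's `urljoin`. The image paths and the `upload.example.org` host below are invented for the example.

```python
from urllib.parse import urljoin

import bs4

page_url = "https://en.wikipedia.org/wiki/Erik_Durm"

# Invented <img> tags for illustration only
html = """
<img src="//upload.example.org/photo.jpg" alt="Portrait">
<img src="/static/images/logo.png" alt="Logo">
<img alt="No source">
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# src=True skips <img> tags that have no src attribute at all
image_urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]
for url in image_urls:
    print(url)
```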



## Exercise: Scraping results from your Personality profile

For this exercise you will use your results from the personality quiz at [HEXACO](http://hexaco.org/hexaco-online). You did take the quiz, right? :)

In [None]:
with open("../Data/my_hexaco.html", "r", encoding="utf-8") as hexaco_file:
    soup = bs4.BeautifulSoup(hexaco_file.read(), 'lxml')

1 - Find the `<table>` element that contains your results and store it in a variable named `table`.

2 - Find all the scale names using the `table` variable from above.

3 - Now get both the scale names and your own scores associated with each scale.

4 - Now replot your scores as a bar chart.