# An Introduction to Web Scraping

How you can use Python to extract (scrape) simple elements such as images, text, and tables from a web page.

## Making a request

In the context of the web, a `request` is the act of connecting to some web address and performing an action. There are many possible types of requests. The most common, and the one we will be using in this lecture, is the `GET` request, which you are likely quite familiar with already. A `GET` request is simply the act of downloading the content of the web address you are connected to: it's what you do every time you browse the web!

In Python, we can use the [`requests`](http://docs.python-requests.org/en/latest/user/quickstart/) package to crawl (load) webpages and download (scrape) their contents.

In [None]:
import requests

In [None]:
response = requests.get("http://httpbin.org/html")

Methods from the `requests` package return `Response` objects.

In [None]:
print(response)

One of the most important properties of the response is its status code, which is printed by default but which we can also get explicitly:

In [None]:
print(response.status_code)

The status code indicates what happened to our request.
In this case we got a **200**, which means that all went well, and we successfully connected to the web address we wanted and downloaded its contents.

In [None]:
print(response.url)

To get the web page's actual content, we access the `Response.text` attribute, which contains the raw HTML source code of the page (more on this later) as one giant string.

In [None]:
print(response.text)

The folks at httpbin.org have quite the sense of humour...

Unfortunately, things don't always work out this nicely.

In [None]:
requests.get("http://httpbin.org/totaly-fake-webpage")

Oops.
We got a **404** code which indicates that, **although the domain exists (`httpbin.org`)**, the particular webpage you are trying to access does not.

Here are some of the most common status codes you might encounter:
* 200, **OK**. Request was successful
* 303, **See Other**. The page redirects to another URL. Web browsers follow the redirect automatically, and so does `requests` by default for `GET` requests; other scraping tools may need to be told to do so.
* 401, **Unauthorized**. The URL requires authentication (e.g. a password) which was not provided or was incorrect.
* 404, **Not Found**. The URL does not exist.
* 500, **Internal Server Error**. The server is having _unexpected_ problems and the web page is down.
* 503, **Service Unavailable**. The web page is down, likely for server maintenance.

More codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
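If you ever forget what a code means, Python's standard library ships its own table of them. The sketch below looks up the codes from the list above; the `describe` helper is a made-up name for illustration and is not part of `requests`.

```python
from http import HTTPStatus

# Look up the official phrase for each status code listed above
for code in (200, 303, 401, 404, 500, 503):
    print(code, HTTPStatus(code).phrase)


def describe(code):
    """Return the coarse family a status code belongs to (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


print(describe(404))
```

The first digit is usually all you need: 2xx means it worked, 4xx means the problem is on your side, and 5xx means the problem is on the server's side.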

Note that **if the domain itself does not exist** then the `GET` request will not even connect and we get a very different error: a `ConnectionError`. This error comes with a veeery long traceback, so to keep the demonstration simple I will just wrap the request in a `try...except` statement.

In [None]:
try:
    response = requests.get("http://www.totaly-fake-domain-not-even-close.com/")
except requests.exceptions.ConnectionError:
    print("This is not the domain you are looking for")
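In a real scraper you will probably want to handle bad status codes and connection problems in one place. Here is a minimal sketch of one way to do that; the name `fetch_text` and the 10 second timeout are assumptions made for this example, not something the lecture prescribes.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page's HTML as a string, or None if the request failed."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx and 5xx status codes into an HTTPError exception
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # RequestException is the base class of ConnectionError, HTTPError, Timeout, ...
        print("Request failed:", error)
        return None
    return response.text
```

Calling `fetch_text("http://httpbin.org/html")` should give you the same string as `response.text` earlier, while a fake page or a fake domain gives you `None` instead of a traceback.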

Now that we know how to scrape webpages, we can move on to the next step: analyzing the page's HTML source code.

...you know what HTML is, right?

---
## Detour: A (very brief) intro to HTML

So, maybe you don't know what HTML is. Or maybe you've accidentally viewed the source code of a webpage and wondered if you just caught a computer virus. Well, you didn't. Let me explain.

HTML is a markup language for describing web documents. It stands for **H**yper **T**ext **M**arkup **L**anguage. HTML, together with CSS (**C**ascading **S**tyle **S**heets, for _styling_ web documents) and JavaScript (for _animating_ web documents), is one of the languages used to construct web pages.

HTML documents are built using a series of HTML _tags_. Each tag describes a different type of content. Web pages are built by putting together different tags.

This is the general HTML tag structure:

```html
<tagname tag_attribute1="attribute1value1 attribute1value2" tag_attribute2="attribute2value1">tag contents</tagname>
```
* Tags (usually) have both a start (or opening) tag, `<tagname>`, and an end (or closing) tag, `</tagname>`.
* Tags can also have attributes, which are declared _inside_ the opening tag.
* The actual tag _content_ goes in between the opening and closing tags.

Tags can be contained (nested) inside other tags, which defines relationships between them:

```html
<parent>
  <brother></brother>
  <sister>
    <grandson></grandson>
  </sister>
</parent>
```

* `<parent>` is the _parent_ tag of `<brother>` and `<sister>`
* `<brother>` and `<sister>` are the _children_ or _direct descendant_ tags of `<parent>`
* `<brother>`, `<sister>`, and `<grandson>` are the _descendant_ tags of `<parent>`
* `<brother>` and `<sister>` are _sibling_ tags

Here's a very simple web document:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title> 
  </head>

  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html> 
```

Here, `<h1>` and `<p>` are sibling tags, `<body>` is their parent tag, and all three are descendant tags of `<html>`.
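To make the nesting visible, here is a small sketch that prints each tag indented by how deep it sits in the tree. It uses Python's built-in `html.parser` module rather than the tools we use later; `TreePrinter` is a made-up class name, and the sketch assumes every tag has a closing tag (it would miscount on tags like `<img>` or `<br>`).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag sits at the current depth, and its children one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level
        self.depth -= 1


page = """<!DOCTYPE html>
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>"""

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

Tags printed at the same indentation under the same parent are siblings; anything indented further below a tag is one of its descendants.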

When you access any URL, your browser (Chrome, Firefox, Safari, IE, etc.) is actually reading a document such as this one and using the tags within the document to decide how to render the page for you.

Jupyter is able to render a (Python) string of HTML code as real HTML in the notebook itself!

In [None]:
from IPython.display import HTML

first_html = """
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>

</html> 
"""

HTML(first_html)

#### Let's look at what the different tags mean:

```html
<!-- This is how you write a comment in HTML. Comments will not show up in the browser -->

<!-- This line simply identifies the document type to be HTML-->
<!DOCTYPE html>
<!-- Content between <html> and </html> tags define everything about the document-->
<html>
  <!-- Tags inside the <head> are not rendered but provide general information about the document -->
  <head>
    <!-- Like the <title> tag which provides a title that appears in the browser's title and tab bars -->
    <title>Page Title</title>
  </head>
  
  <!-- Anything inside the <body> tags describes visible page content -->
  <body>
    <!-- The <h1> defines a header. The number defines the size of the header. -->
    <!-- There are 6 levels of headers: <h1> to <h6> -->
    <!-- The higher the number, the smaller the font used to display it. -->
    <h1>My First Heading</h1>
    <!-- The <p> represents a paragraph.-->
    <p>My first paragraph.</p>
  </body>
</html>
```

**Different levels of headers**

```html
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6> 
```

**Links**
```html
<a href="http://www.website.com">Click to go to website.com</a>
```

**Images**
```html
<!-- Notice that the image tag has no closing tag and no content outside the opening tag -->
<img src="smiley.gif">
```

**Lists**
```html
<!-- Unordered (bulleted) list -->
<ul>
  <li>One Element</li>
  <li>Another Element</li>
</ul>

<!-- Ordered (numbered) list -->
<ol>
  <li>First Ordered Element</li>
  <li>Second Ordered Element</li>
</ol>
```

**Tables**
```html
<table>
  <!-- An HTML table is defined as a series of rows (<tr>) -->
  <!-- The individual cell (<td>) contents are nested inside rows -->
  
  <!-- Column headers (<th>) are optional; when present they are nested inside a row (<tr>) -->
  <tr>
    <th>First Header</th>
    <th>Second Header</th>
  </tr>
  <tr>
    <td>Row 2, Col 1</td>
    <td>Row 2, Col 2</td>
  </tr>
  <tr>
    <td>Row 3, Col 1</td>
    <td>Row 3, Col 2</td>
  </tr>
</table>
```

In [None]:
more_tags = """
<html>
<head>
  <title>More HTML Tags</title>
</head>
<body>
  <h1>This is heading 1</h1>
  <h2>This is heading 2</h2>
  <h3>This is heading 3</h3>
  <h4>This is heading 4</h4>
  <h5>This is heading 5</h5>
  <h6>This is heading 6</h6>

  <br>
  
  <a href="http://www.website.com">Click to go to website.com</a>

  <p><img src="../images/smiley.png" alt="smiley face"></p>

  <ul>
    <li>One Element</li>
    <li>Another Element</li>
  </ul>

  <ol>
    <li>First Ordered Element</li>
    <li>Second Ordered Element</li>
  </ol>

  <table>
    <!-- An HTML table is defined as a series of rows (<tr>) -->
    <!-- The individual cell (<td>) contents are nested inside rows -->
    <tr>
      <!-- The <th> tag defines a column header -->
      <th>First Header</th>
      <th>Second Header</th>
    </tr>
    <tr>
      <td>Row 2, Col 1</td>
      <td>Row 2, Col 2</td>
    </tr>
    <tr>
      <td>Row 3, Col 1</td>
      <td>Row 3, Col 2</td>
    </tr>
  </table>
</body>
</html>
"""

HTML(more_tags)

If you want to know more about HTML, I recommend the excellent w3schools website: http://www.w3schools.com/html/html_intro.asp

---
## Ok, back to web scraping

Now we are all HTML experts. Great! We're almost ready to start parsing and analyzing a scraped web page. There's just one last item of business we need to discuss before we get started.

### Viewing a page's source code

In order to extract elements of interest from a webpage we need to know where they sit in the webpage's HTML tree.
This means that you need to look at a webpage's HTML source code before you can even start scraping it. Not only that, but while scraping you will be switching back and forth between the actual scraping (we'll get there really soon, I promise!) and the webpage's source code.

How do we view a page's source code then?

* To view the **full page** source code:
  1. Right-click anywhere on the webpage **that is not a link**
  2. Click "View Page Source" (<kbd>CTRL</kbd>+<kbd>U</kbd>) in Firefox or Chrome, or "Show page source" (<kbd>&#8997;</kbd>+<kbd>&#8984;</kbd>+<kbd>U</kbd>) in Safari.
    * In order to view the source code in Safari the Develop menu must be enabled first: Preferences > Advanced > Show Develop menu in menu bar
    
* To view the source code zoomed-in on **a single element** (and with better formatting!):
  1. Right-click any element in the webpage.
  2. Click "Inspect Element"

###  Beautiful Soup, so rich and green, Waiting in a hot tureen!

We made it! We are now ready to start scraping web pages. In order to do so we are going to use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python package for parsing web pages you have already scraped. Normally you would use `requests` (to `GET` the page) and then `BeautifulSoup` to analyse the web page. But to make life easier, and to avoid having upwards of 200 people scraping a webpage at once, we will use previously scraped webpages for the remainder of the lecture... You did take that personality quiz over the weekend like we suggested, right?

We will use the Wikipedia page for a player from Germany's national football team as an example: https://en.wikipedia.org/wiki/Erik_Durm

In [None]:
# Beautiful Soup version 4.x
import bs4

We start by opening up the page and converting it to a "soup".

In [None]:
# We specify the encoding of the file here because Windows
# has problems reading some characters in it.

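with open("../Data/erik_durm_wiki.html", "r", encoding="utf-8") as wiki_file:
    # Naming the parser explicitly avoids a warning and keeps results consistent across machines
    soup = bs4.BeautifulSoup(wiki_file.read(), "html.parser")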

We're going to use the `find` method to find the page's `<title>` tag and print it

In [None]:
title = soup.find('title')   # finds the FIRST <title> tag
print(title.text)

Beautiful Soup converts HTML tags into its own `Tag` objects

In [None]:
print(type(title))

`Tag` objects have several useful attributes

In [None]:
print(title.text) # The text gives you the visible part of the tag
print(title.name) # The type of tag

If a tag has any HTML attributes, they can be accessed in a very "pythonic" way.

In [None]:
h1 = soup.find("h1")

print(h1.attrs)
print(h1["class"])
print(h1["id"])

Yep, just like a dictionary!
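The dictionary analogy goes a bit further than square brackets. The sketch below uses a made-up `<h1>` tag (not the one from the Wikipedia page) to show two details worth knowing: `class` comes back as a list, and `.get()` lets you ask for an attribute that may not be there.

```python
import bs4

# A made-up tag, used only to illustrate attribute access
snippet = '<h1 id="firstHeading" class="title big" lang="en">Erik Durm</h1>'
tag = bs4.BeautifulSoup(snippet, "html.parser").find("h1")

print(tag["id"])            # a plain string
print(tag["class"])         # a list, because class can hold several values
print(tag.get("style"))     # .get() returns None when the attribute is missing
print("lang" in tag.attrs)  # attrs is a real dictionary
```

Using `tag["style"]` here would raise a `KeyError`, exactly as it would with a regular dictionary, so `.get()` is the safer choice when you loop over many tags.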

Now, let's try to find ALL level 2 headers. To do that we use the `find_all` method instead.

In [None]:
headers = soup.find_all('h2')

print(headers)

Ugh, that's a mess, let's try printing each header individually

In [None]:
for header in headers:
    print(header.text)

Much better! We can also find all the other pages that this webpage links to

In [None]:
links = soup.find_all('a')

for link in links[:20]:  # Showing just the first 20 links for brevity
    # href represents the target of the link
    # Where the link actually goes to!
    print(link.get('href'))

We can also search for elements with specific attributes

In [None]:
# Gets a specific element, in this case the section headline whose id is "Early_career"
soup.find(id="Early_career")

In [None]:
# Here's another way to search for all valid links:
# Just search for all elements with an href attribute!
all_links = soup.find_all(href=True)
print(len(all_links))

In [None]:
# Gets the inline citations! They are <sup> elements with the class "reference"
# (the [10:] slice skips the first ten matches)
soup.find_all("sup", class_="reference")[10:]

Note that we must use `class_` instead of `class` because `class` is a reserved keyword in Python. Remember the [Data Types](../Python-Basics/Data-Types.ipynb) lecture?

More generally, you can pass a dictionary of attributes to search for:

In [None]:
# Find all tags with class=mw-headline and an id attribute (regardless of value)
soup.find_all(attrs={"class": "mw-headline", "id": True})

### Extra! Some more HTML

`class` and `id` are special HTML attributes that allow for a rich connection between HTML and CSS and Javascript. Feel free to google the subject. We won't go into the details here. Just know that:

* The `id` attribute is used to uniquely identify a tag. This means that all `id` attributes should have different values in a webpage.

* The `class` attribute is used to identify tags which share certain properties. A tag can have more than one `class` value:
```html
   <!-- Separate extra classes by a space -->
   <tag class="first_class second_class">...</tag>
```

In the above example, notice that all reference elements (`<sup>` tags) have the same `class` value but different `id` values.
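Here is a small sketch of how this plays out when searching. The HTML is invented for the example (the `external` class and the `cite_ref-…` ids are assumptions, not copied from the real page); the point is that searching by one class also matches tags that carry additional classes, while an `id` pins down a single tag.

```python
import bs4

# Made-up HTML: two citations sharing a class but with different ids
html_doc = """
<p>Some claim<sup id="cite_ref-1" class="reference">[1]</sup> and
another<sup id="cite_ref-2" class="reference external">[2]</sup>.</p>
"""
soup_demo = bs4.BeautifulSoup(html_doc, "html.parser")

# class_ matches any tag that has "reference" among its classes
refs = soup_demo.find_all("sup", class_="reference")
ids = [ref["id"] for ref in refs]
print(ids)

# An id search returns the one tag with that id
second = soup_demo.find(id="cite_ref-2")
print(second["class"])
```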

### Navigating the HTML tree with BeautifulSoup

Besides being able to search for elements anywhere in the whole HTML tree, Beautiful Soup also allows you to navigate the tree in any direction.

Let's try to get at the first paragraph (`<p>`) in the `Club career` section starting from the section's title tag.

Here's the relevant HTML snippet:

```html
    <h2>
      <span class="mw-headline" id="Club_career">Club career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h2>
    <h3>
      <span class="mw-headline" id="Early_career">Early career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=2" title="Edit section: Early career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h3>
    <p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
```

In [None]:
section_headline = soup.find(id="Club_career")

From the snippet above we don't expect to find anything inside this tag other than its text.

In [None]:
list(section_headline.contents)

Correct! The `contents` attribute lets us access everything that is inside a given tag. In this case we find only the visible text of the tag, as expected.

So we need to go up one level (to the `<h2>` tag), then go to its second sibling (first `<h3>` then `<p>`).

In [None]:
parent_h2 = section_headline.parent

parent_h2.name == "h2"

We got the parent! What's inside it?

In [None]:
parent_h2.contents

That's the code that lets you edit a section!

As a convenience, you can access the first child of a given type inside a tag by using the child's tag name as an attribute:

In [None]:
parent_h2.contents[1]

In [None]:
parent_h2.contents[1].span

Cool, huh?

Now, from the earlier snippet we can see that our target, the first `<p>` in the section, should be the second sibling after the `<h2>`:

In [None]:
parent_h2.next_sibling.next_sibling

Hmmm, not there yet. This is because some of the siblings in the soup are not actual HTML elements but whitespace strings: the line breaks between tags in the source file are kept as text nodes.

In [None]:
parent_h2.next_sibling

This is something we must always be mindful of. Web scraping can be, and very frequently will be, messy and will involve trial and error... But we are not ones to be defeated by a puny new line, are we?

In [None]:
parent_h2.next_sibling.next_sibling.next_sibling.next_sibling

Success!!

That was the brute force way. It works most of the time. But in this case there is a better option.

In [None]:
parent_h2.find_next_sibling("p")

Much simpler!
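If you want to convince yourself of what just happened, here is the same situation on a tiny, made-up document (a simplified stand-in for the Wikipedia snippet, with the `<span>` tags left out).

```python
import bs4

# A simplified, made-up version of the section structure
html_doc = """<div>
<h2>Club career</h2>
<h3>Early career</h3>
<p>First paragraph.</p>
</div>"""
demo = bs4.BeautifulSoup(html_doc, "html.parser")
h2 = demo.find("h2")

# The very next sibling is the line break between </h2> and <h3>
print(repr(h2.next_sibling))

# Two steps away we reach the first real tag
print(h2.next_sibling.next_sibling.name)

# find_next_sibling skips the strings and any tags that don't match
paragraph = h2.find_next_sibling("p")
print(paragraph.text)
```

The brute force chain breaks as soon as the page gains or loses a line break, while `find_next_sibling("p")` keeps working, which is why it is usually the better choice.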

Similarly, you have `find_previous_sibling`, `find_next_siblings`, `find_previous_siblings`, `find_parent`, and some others.

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these methods. There is no need to memorize all of them. It's more important to realize that, as with any programming language, there is more than one way to get to any element of the HTML tree. The trick is to *pick a good starting point* from where to start the scraping.

### Scraping images from a webpage

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [None]:
# Some modules that will allow us to display images and other media in the notebook itself
from IPython.display import display, Image

In [None]:
for image in soup.find_all('img'):
    print(image)

We can pinpoint a specific image and get its attributes

In [None]:
images = soup.find_all('img')
img0 = images[0]
print(img0.attrs)

Then we can display the image using its `src` attribute

In [None]:
display(Image(url=img0['src']))

display(Image(url=images[1]['src']))
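One thing to watch out for: the `src` of an image is not always a complete URL. It is often written relative to the page it came from, so you may need to combine it with the page's address before you can load it. The sketch below uses `urljoin` from the standard library; the three `src` values are made up to cover the common cases and are not taken from the saved page.

```python
from urllib.parse import urljoin

# The address the page was originally scraped from
page_url = "https://en.wikipedia.org/wiki/Erik_Durm"

# Made-up src values covering the three common forms
sources = [
    "https://example.org/images/photo.png",  # absolute: returned unchanged
    "//upload.example.org/photo.png",        # protocol-relative: inherits "https:"
    "/static/images/logo.png",               # root-relative: inherits scheme and host
]

full_urls = [urljoin(page_url, src) for src in sources]
for url in full_urls:
    print(url)
```

The same trick works for the `href` values of links.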

## Exercise: scraping results from your Personality table

For this exercise you will use your results from the personality quiz at [HEXACO](http://hexaco.org/hexaco-online). You did take the quiz right? :)

Save the page with the quiz results to: `<path to the bootcamp directory>/Data/my_hexaco.html`
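As a warm-up, here is a sketch of the general pattern for reading a table, using the toy table from the HTML detour. Your HEXACO results page will be laid out differently (that is the exercise!), so treat the names `toy_table`, `toy_soup` and `rows` as placeholders for this example only.

```python
import bs4

# The toy table from the HTML detour, not the HEXACO results
toy_table = """
<table>
  <tr><th>First Header</th><th>Second Header</th></tr>
  <tr><td>Row 2, Col 1</td><td>Row 2, Col 2</td></tr>
  <tr><td>Row 3, Col 1</td><td>Row 3, Col 2</td></tr>
</table>
"""
toy_soup = bs4.BeautifulSoup(toy_table, "html.parser")
table_tag = toy_soup.find("table")

rows = []
for tr in table_tag.find_all("tr"):
    # Header rows use <th> and data rows use <td>, so we accept both
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

for row in rows:
    print(row)
```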

In [None]:
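with open("../Data/my_hexaco.html", "r", encoding="utf-8") as hexaco_file:
    # Naming the parser explicitly avoids a warning and keeps results consistent across machines
    soup = bs4.BeautifulSoup(hexaco_file.read(), "html.parser")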

1 - Find the `<table>` element that contains your results.

In [None]:
table = soup.find() # your search terms inside the `find` method

2 - Find all the scale names using the `table` variable from above

In [None]:
# Find all table rows, skipping the first two which don't matter
for tag in table.find_all("tr")[2:]:
    cells = tag.find_all("td")
    
    # Your code here

3 - Now get both the scale names and your own scores associated with each scale

In [None]:
# Find all table rows, skipping the first two which don't matter
for tag in table.find_all("tr")[2:]:
    cells = tag.find_all("td")

    # Your code here

In [None]:
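from IPython.display import HTML


def css_styling():
    # Load the notebook's custom stylesheet; the with-block closes the file for us
    with open("../styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)


css_styling()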