# Synopsis

In this unit we will cover:

* The structure of Web pages
* What HTML and CSS are
* How to extract information from HTML pages
* Techniques for navigating and scraping web pages


# Read libraries and functions

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from pathlib import Path
from sys import path

path.append('../My_libraries')
path

In [None]:
import datetime
import json
import sys
import random
import requests
import scipy

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from IPython.display import HTML, display, Image


# Detour: A (very brief) intro to HTML

In the previous units, we learned how to retrieve data from Web sources using APIs. But what if the organization hosting the data does not have the forethought or resources to create an API (or if they do not want to share their data)?  Then, we have to **crawl** their website and **scrape** their data.

To do this, we will be using our dependable `requests` library.  However, we will need to call upon a few other resources.  In particular, we will need to understand the code in which webpages are written.

HTML is a markup language for describing web documents. It stands for **H**yper **T**ext **M**arkup **L**anguage. 

HTML, together with CSS (**C**ascading **S**tyle **S**heets for _styling_ web documents) and JavaScript (for _animating_ web documents), are the languages used to build web pages.

HTML documents are built using a series of HTML _tags_. Each tag describes a different type of content. 

This is the general HTML tag structure:

> ```html
> <tagname tag_attribute1="attribute1value1 attribute1value2" 
>          tag_attribute2="attribute2value1">tag contents</tagname>
>```


* Tags (usually) have both a start (or opening) tag, `<tagname>`, and an end (or closing) tag, `</tagname>`.
* Tags can also have attributes which are declared _inside_ the opening tag.
* The actual tag _content_ goes in between the opening and closing tags.

Tags can be contained (nested) inside other tags, which defines relationships between them:

> ```html
> <parent>
>    <sibling1></sibling1>
>    <sibling2>
>        <grandchild1></grandchild1>
>    </sibling2>
> </parent>
> ```

* `<parent>` is the _parent_ tag of `<sibling1>` and `<sibling2>`
* `<sibling1>` and `<sibling2>` are the _children_ or _direct descendant_ tags of `<parent>`
* `<sibling1>`, `<sibling2>`, and `<grandchild1>` are the _descendant_ tags of `<parent>`
* `<sibling1>` and `<sibling2>` are _sibling_ tags
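
These relationships can be checked in code. The sketch below is only an illustration: it parses the made-up tags from the snippet above with Beautiful Soup (the parser imported at the top and introduced later in this unit) and the built-in `html.parser`. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Toy document reusing the made-up tag names from the snippet above.
# It is written on one line so that no whitespace nodes appear between tags.
snippet = (
    "<parent><sibling1></sibling1>"
    "<sibling2><grandchild1></grandchild1></sibling2></parent>"
)
toy_soup = BeautifulSoup(snippet, "html.parser")

parent = toy_soup.find("parent")
grandchild = toy_soup.find("grandchild1")

# .children yields direct descendants only; .descendants walks the whole subtree.
children = [tag.name for tag in parent.children]
descendants = [tag.name for tag in parent.descendants]

print("children:   ", children)
print("descendants:", descendants)
print("parent of grandchild1:", grandchild.parent.name)
```

Note that `<grandchild1>` shows up among the descendants of `<parent>` but not among its children.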

## A very simple web document


> ```html
> <!-- This line will not be displayed because it is a comment-->
> <!DOCTYPE html>
> <html>
>    <head>
>       <title>Page Title</title> 
>    </head>
>
>    <body>
>       <h1>My First Heading</h1>
>       <p>My first paragraph.</p>
>    </body>
> </html> 
> ```

Using the nomenclature introduced above, we see that `<h1>` and `<p>` are sibling tags, `<body>` is their parent tag, and all three are descendant tags of `<html>`.

When you access any URL, your browser (Chrome, Firefox, Safari, IE, etc.) is actually reading a document such as this one and using the tags within the document to decide how to render the page for you.


In [None]:
first_html = """
    <!-- This line will not be displayed because it is a comment-->
    <!DOCTYPE html>
    <html>
          <head>
                <title>Page Title</title>
          </head>
  
          <body>
                <h1>My First Heading</h1>
                <p>My first paragraph.</p>
                <p>--&nbsp;Hello World!</p>
          </body>

    </html> 
"""


**`Jupyter` is able to render a (python) string of HTML code as real HTML in the notebook itself!**

In [None]:
HTML(first_html)

## The  anatomy of this simple HTML document

This is how you write a comment in HTML. Comments do not show up in the browser.

> `<!-- This line will not be displayed because it is a comment-->`

The next line identifies the contents of the document as HTML code.

> `<!DOCTYPE html>`

The tags `<html>` and `</html>` enclose everything that goes into the document.

> `<html>`

A document typically has a header and a body. They are identified by the `head` and `body` tags.

> `<head>`

Statements within the `head` tag are not rendered but provide general information about the document.

The `title` tag provides a title that appears in the browser's title and tab bar.

> `<title>`

> `<title>Page Title</title>`

Statements within the `body` describe visible page content.

> `<body>`

The `h1` tag defines a section heading. There are six heading levels, with 1 the highest and 6 the lowest. By default, the size of the font used to display the heading decreases as the level number increases.

> `<h1>`


The `p` tag indicates a new paragraph.

> `<p>`

> `<p>My first paragraph.</p>`




**Different levels of headers**

```html
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6> 
```


## Important `html` objects 

**Links**
```html
<a href="http://www.website.com">Click to go to website.com</a>
```

**Images**
```html
<!-- Notice that the image tag has no closing tag and no content outside the opening tag.
 The alt attribute provides a text description, which makes the image accessible to visually impaired users.
 The width attribute specifies the displayed width of the image in pixels.
-->
<img src="Images/smiley.png" alt="smiley face" 
          width = '40'>
```

**Lists**
```html
<!-- Unordered (bulleted) list -->
<ul>
  <li>One Element</li>
  <li>Another Element</li>
</ul>

<!-- Ordered (numbered) list -->
<ol>
  <li>First Ordered Element</li>
  <li>Second Ordered Element</li>
</ol>
```

**Tables**
```html
<table>
  <!-- An HTML table is defined as a series of rows (<tr>) -->
  <!-- The individual cell (<td>) contents are nested inside rows -->
  
  <!-- The header row is optional; its <tr> is the parent of the column header cells (<th>) -->
  <tr>
    <th>First Header</th>
    <th>Second Header</th>
  </tr>
  <tr>
    <td>Row 2, Col 1</td>
    <td>Row 2, Col 2</td>
  </tr>
  <tr>
    <td>Row 3, Col 1</td>
    <td>Row 3, Col 2</td>
  </tr>
</table>
```

In [None]:
more_tags = """
<html>
<head>
  <title>More HTML Tags</title>
</head>
<body>
  <h1>This is heading 1</h1>
  <h2>This is heading 2</h2>
  
  <h3>This is heading 3</h3>
  <h4>This is heading 4</h4>
  <h5>This is heading 5</h5>
  <h6>This is heading 6</h6>

  <br>
  
  <a href="http://www.website.com">Click to go to website.com</a>

  <p><img src="Images/smiley.png" alt="smiley face" 
          width = '50'></p>

  <ul>
    <li>One Element</li>
    <li>Another Element</li>
  </ul>

  <ol>
    <li>First Ordered Element</li>
    <li>Second Ordered Element</li>
  </ol>

  <table>
    <!-- An HTML table is defined as a series of rows (<tr>) -->
    <!-- The individual cell (<td>) contents are nested inside rows -->
    <tr>
      <!-- The <th> tags in this row define the column headers -->
      <th>First Header</th>
      <th>Second Header</th>
    </tr>
    <tr>
      <td>Row 2, Col 1</td>
      <td>Row 2, Col 2</td>
    </tr>
    <tr>
      <td>Row 3, Col 1</td>
      <td>Row 3, Col 2</td>
    </tr>
  </table>
</body>
</html>
"""

In [None]:
HTML(more_tags)

If you want to learn more about HTML, I recommend the excellent [w3schools website](http://www.w3schools.com/html/html_intro.asp).

# Viewing a page's source code

Wow!!! You are now an HTML expert. Congratulations! 

You are now almost ready to start parsing and analyzing a scraped web page. There's just one last item of business we need to discuss before we get started.

In order to extract elements of interest from a webpage, you need to know where they sit in the webpage's `HTML` structure.

Sadly, really really sadly, **this means that you need to look at the HTML source code before you can start scraping it.**

Not only that but, during your web scraping you will be switching back and forth between the actual scraping (we'll get there really soon, I promise!) and the source code.

So, how do you view a page's source code then?

> To view the **full page** source code:
>  1. Right-click anywhere on the webpage **that is not a link**
>  2. Click "View Page Source" (<kbd>CTRL</kbd>+<kbd>U</kbd>) in Firefox or Chrome, or "Show page source" (<kbd>&#8997;</kbd>+<kbd>&#8984;</kbd>+<kbd>U</kbd>) in Safari.
>
>   In order to view the source code in Safari the Develop menu must be enabled first:
>          Preferences > Advanced > Show Develop menu in menu bar
    
> To view the source code zoomed-in on **a single element** (much better formatting!):
>  1. Right-click any element in the page.
>  2. Click "Inspect Element"

# Scraping, finally...

##  Beautiful Soup, so rich and green, waiting in a hot tureen!

(*The Lobster Quadrille*, Alice in Wonderland)

You are now ready to start scraping web pages. 

It all starts with a request made using the `requests` package.  The response gives us the page's HTML as a string, which we then need to parse.

Parsing is a detail oriented -- likely to become frustrating -- job. It would be a complete nightmare without [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). You will notice that we already imported it at the top. Yes, this is thinking ahead!


For completely random reasons, we will use the Wikipedia page for a mediocre German soccer player as an exercise. His page is located [here](https://en.wikipedia.org/wiki/Erik_Durm). To simplify our lives\*, we have already downloaded the page and placed it in the `Data/` folder. 

\* **That, and the fact that we do not want to send hundreds of requests for the same page to the same server at about the same time, all from the same IP address (we are all on the same network!). Such behavior will frequently get you blocked from accessing a website!**
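
If you do need to download a page yourself, one way to stay polite is to save it the first time and read the saved copy afterwards. The sketch below is a minimal illustration of that idea, not part of the course code: `load_page`, its `fetch` parameter, and the cache file name in the usage comment are all hypothetical.

```python
from pathlib import Path

import requests


def load_page(url, cache_file, fetch=None):
    """Return the HTML for `url`, downloading it only if `cache_file` is missing."""
    cache_file = Path(cache_file)

    # Reuse the saved copy so the server is contacted at most once.
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    if fetch is None:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        html = response.text
    else:
        # `fetch` is a hypothetical hook: any function mapping a URL to an HTML string.
        html = fetch(url)

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


# Hypothetical usage (performs a real request the first time it runs):
# html = load_page("https://en.wikipedia.org/wiki/Erik_Durm", "Data/erik_durm_cached.html")
```

The `fetch` hook is there only so the caching logic can be tried out without touching the network.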





We start by reading the page into a string and converting it to a `soup` object. 

In [None]:
# We specify the encoding of the file here because the page contains non-ASCII
# characters; relying on the system's default encoding could make the read fail.
#
data_folder = Path.cwd() / 'Data' 
filename = data_folder / 'web_scraping_erik_durm_wiki.html'

with open(filename, "r", encoding = "utf-8") as wiki_file:
    string_content = wiki_file.read()
    soup = BeautifulSoup(string_content, 'lxml')

print(type(soup))
print()
print(f"-- {soup.text[:60].strip()} --")

# Searching by `tag` type


Then, we're going to use the `find` method to find the page's `<title>` tag and print it.

In [None]:
title = soup.find('title')  
print(f"The method .find() returns a {type(title)} object\n")

print(f"The text attribute of the tag object is:\n\t{title.text}\n")

print(f"which is a {type(title.text)} object\n")

print(f"The contents attribute of the tag object is:\n\t{title.contents}\n")

print(f"which is a {type(title.contents)} object\n")



Beautiful Soup converts HTML tags into its own `tag` objects that, as you can see, have many useful attributes.

In [None]:
print(title.name) # The type of tag

help(title)

If a tag has any html attributes, they can be accessed in a very "pythonic" way. That is, they are organized as a dictionary!



In [None]:
h_tag = soup.find('h1')

for x in h_tag.attrs:
    print( f"Key: {x:5} -- Value: {h_tag.attrs[x]}")
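
As a small self-contained illustration of this dictionary-like behavior, the sketch below uses a toy `<h1>` tag whose attribute values are made up. Square brackets, `.get()`, and `in` behave as they do for a regular `dict`; the one surprise is that `class` comes back as a list because a tag can carry several classes.

```python
from bs4 import BeautifulSoup

# A made-up heading; the attribute values are for illustration only.
toy_html = '<h1 id="firstHeading" class="title main" lang="en">Erik Durm</h1>'
toy_tag = BeautifulSoup(toy_html, "html.parser").find("h1")

print(toy_tag["id"])           # square brackets, as with a dict
print(toy_tag.get("lang"))     # .get() also works...
print(toy_tag.get("style"))    # ...and returns None for a missing attribute
print(toy_tag["class"])        # multi-valued attributes are returned as a list
print("id" in toy_tag.attrs)   # membership test on the underlying dict
```

Using `tag['style']` here would raise a `KeyError`, so `.get()` is the safer choice when an attribute may be absent.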


One could go on searching for instances of a `tag` one at a time.  


## Finding multiple matches

To do this, we use the method `.find_next()`, called on a match we have already found.   



In [None]:
# Finds first instance
#
header1 = soup.find('h2')
print(header1.text)

# searching of previously found object gives us the next instance
#
header2 = header1.find_next('h2')
print(header2.text)

# And again
#
header3 = header2.find_next('h2')
print(header3.text)

# .find() only searches *inside* the tag it is called on, so calling it on
# header2 instead of .find_next() returns None (no <h2> is nested in an <h2>)
#
header3 = header2.find('h2')
print(header3)
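
The difference between the two methods is easier to see on a toy document. The sketch below uses made-up HTML: `.find()` looks only at the descendants of the tag it is called on, while `.find_next()` continues through the rest of the document.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two headings at the same level, one with a nested <span>.
toy_html = """
<div>
  <h2>First <span>section</span></h2>
  <p>Some text.</p>
  <h2>Second section</h2>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")
first_h2 = toy_soup.find("h2")

inside = first_h2.find("span")           # found: the span is nested in the h2
nested_h2 = first_h2.find("h2")          # None: no h2 is nested inside this h2
following_h2 = first_h2.find_next("h2")  # found: the next h2 later in the document

print(inside.get_text())
print(nested_h2)
print(following_h2.get_text())
```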




We can also retrieve all instances at once.  Now, we should expect to get a list-like collection of `tag` objects in return...

In [None]:
headers = soup.find_all('h2')

print(f"The variable headers is a {type(headers)} object\n")

print(f"Because it is list-like, it has a length: {len(headers)}\n")

print(f"Its first item is {headers[0]} and its last is {headers[-1]}\n")
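
Because the result is list-like, the usual Python idioms apply. The sketch below, on made-up headings, collects the text of every match with a list comprehension, caps the number of matches with the `limit` argument, and shows that a search with no matches returns an empty collection rather than `None`.

```python
from bs4 import BeautifulSoup

# Made-up headings for illustration.
toy_html = "<h2>Club career</h2><h2> International career </h2><h2>Honours</h2>"
toy_soup = BeautifulSoup(toy_html, "html.parser")

toy_headers = toy_soup.find_all("h2")

# get_text(strip=True) removes the surrounding whitespace from each heading.
titles = [header.get_text(strip=True) for header in toy_headers]

first_two = toy_soup.find_all("h2", limit=2)  # stop after two matches
missing = toy_soup.find_all("h7")             # no such tag: empty result

print(titles)
print(len(first_two), len(missing))
```

This is a useful contrast with `.find()`, which returns `None` when nothing matches.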



In [None]:
conda update beautifulsoup4

In [None]:
for header in headers[:2]:
    print(f"++++\t{header.name} -- {header.contents}")
    
    for item in header.contents:
        print(f"\t\t--{item}")
        print(f"\t\t--{type(item)}")
        print(f"\t\t--{item.text}\n")





Another `tag` that is frequently useful, and that I will use to demonstrate some other useful attributes, is the one for links.

In [None]:
links = soup.find_all('a')
print(len(links))

for link in links[:5]:  
    # href represents the target of the link
    # Where the link actually goes to!
    print(f"\n-- {link}")
    print(f"\t Attributes of the tag object: {link.attrs}")
    print(f"\t\t Value of the link: {link.get('href')}")
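
Two practical details come up with links: not every `<a>` tag has an `href`, and many `href` values are relative to the page they appear on. The sketch below is a hedged illustration on made-up links. It assumes the page's address is known (`base_url` is set by hand here) and uses `urljoin` from the standard library to build absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Address of the page the links came from (set by hand for this toy example).
base_url = "https://en.wikipedia.org/wiki/Erik_Durm"

# Made-up links: site-relative, fragment-only, missing href, and absolute.
toy_html = """
<a href="/wiki/Borussia_Dortmund">Borussia Dortmund</a>
<a href="#Club_career">Club career</a>
<a name="anchor_without_href">no target</a>
<a href="https://www.example.com/">external</a>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

absolute_links = []
for link in toy_soup.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is None:
        continue
    absolute_links.append(urljoin(base_url, href))

for url in absolute_links:
    print(url)
```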
    

# Searching using attribute information

Some `Tag` elements have attributes associated with them. These include `id`, `class`, and `href`.  Our search can restrict results to tags where an attribute has a specific value, or to tags where the attribute is present at all.

Note that we must use `class_` instead of `class` to avoid conflicts with Python's built-in keyword. 



In [None]:
# Retrieve the element with the attribute "id" equal to "Early_career"
tag = soup.find(id="Early_career")
print(tag)
print(tag.text)

In [None]:
# Retrieve all elements with an href attribute
all_links = soup.find_all(href=True)
print(len(all_links))

In [None]:
# Retrieve inline citations -- they are <sup> elements with the class "reference"
soup.find_all('sup', class_ = 'reference')[5:15]

In [None]:
# Retrieve all tags with class=mw-headline and an id attribute (regardless of value)
soup.find_all(attrs={"class": "mw-headline", "id": True})

## `Tag` attributes

`class` and `id` are special HTML attributes that allow for a rich connection between HTML, CSS, and JavaScript. Feel free to google the subject. We won't go into the details here. Just know that:

* The `id` attribute is used to uniquely identify a tag. This means that all `id` attributes should have different values in a webpage.

* The `class` attribute is used to identify tags which share certain properties. A tag can have more than one `class` value:
```html
   <!-- Separate extra classes by a space -->
   <tag class="first_class second_class">...</tag>
```

In the citation search above (`soup.find_all('sup', class_ = 'reference')`), notice that all reference elements (`<sup>` tags) have the same `class` value but different `id` values.

**Note that not all webpages follow this simple rule.  Some will repeat `id` values.**
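
The sketch below illustrates both rules on made-up citation tags. Searching with `class_` matches a tag as long as *one* of its classes equals the requested value, and `id` pinpoints a single tag. The last line uses `.select()`, Beautiful Soup's CSS-selector search, which is not otherwise covered in this unit.

```python
from bs4 import BeautifulSoup

# Made-up citation markers; ids and classes are for illustration only.
toy_html = """
<sup id="cite_ref-1" class="reference">[1]</sup>
<sup id="cite_ref-2" class="reference noprint">[2]</sup>
<sup id="note-1" class="footnote">[a]</sup>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# class_ matches the second tag too, even though it has an extra class.
refs = toy_soup.find_all("sup", class_="reference")
ref_ids = [tag["id"] for tag in refs]

# id lookups return one tag; its class attribute is a list of all its classes.
second = toy_soup.find(id="cite_ref-2")

# The same search written as a CSS selector.
same_refs = toy_soup.select("sup.reference")

print(ref_ids)
print(second["class"])
print(len(same_refs))
```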

# Navigating the HTML tree with BeautifulSoup


Besides letting you search for elements anywhere in the whole HTML tree, Beautiful Soup also allows you to navigate the tree in any direction.

Let's try to get at the first paragraph (`<p>`) in the `Club career` section starting from the section's title tag.

Here's the relevant HTML snippet:

```html
    <h2>
      <span class="mw-headline" id="Club_career">Club career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h2>
    <h3>
      <span class="mw-headline" id="Early_career">Early career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=2" title="Edit section: Early career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h3>
    <p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
```

The text we want sits inside that `<p>` tag, which comes after the *Club career* title both on the rendered page and in the `HTML` code. We start from the title:

In [None]:
section_headline = soup.find( id = 'Club_career')
print(section_headline)
print()

print(section_headline.text)
print()

print(section_headline.contents)

The `contents` attribute gives us access to everything contained within the relevant `tag` as a `list`. 

In this case we find only the visible text of the tag.

Looking at the webpage snippet, we see that the `<p>` tag is at the **same level** as the `<h2>` and `<h3>` tags.  

One way to navigate there -- think filesystem tree -- is to first ascend one level in the tree, in this case from the `<span>` we found to its parent `<h2>` tag.


The `<h2>` tag has two siblings (that we can see in the snippet): an `<h3>` tag and a `<p>` tag.

In [None]:
parent_of_title = section_headline.parent  # Up one level

print(f"The section_headline is of type:\t{section_headline.name}")
print()

print(f"The parent of section_headline is of type:\t{parent_of_title.name}")
print()

print(parent_of_title.contents)            

In [None]:
# next_sibling is the current spelling; nextSibling is a legacy alias
one_step = parent_of_title.next_sibling
print(f"---- one_step is a \t{type(one_step)}\n")

print(f"--It has the value:\n--\n--{one_step}--\n")

two_steps = parent_of_title.next_sibling.next_sibling
print(f"---- two_steps is of type:\t{type(two_steps)}\n")

print(f"It has the value:\n--\n--{two_steps}--\n")


We have only reached the `<h3>` tag even though we moved two siblings along.  

The reason is that some of the `siblings` in the soup are not actual `HTML` elements. The whitespace between two tags (a line break, for example) is stored as a piece of text and counts as a sibling.
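
A toy example makes these "invisible" siblings easy to see. In the sketch below (made-up HTML), the line breaks between the tags show up as `NavigableString` objects, and an `isinstance` check against `Tag` filters them out.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up snippet: note the line breaks (\n) between the tags.
toy_html = (
    "<div><h2>Club career</h2>\n"
    "<h3>Early career</h3>\n"
    "<p>Durm began his club career in 1998.</p></div>"
)
toy_soup = BeautifulSoup(toy_html, "html.parser")
heading = toy_soup.find("h2")

# Walk every following sibling and report whether it is a tag or plain text.
for sibling in heading.next_siblings:
    kind = "tag" if isinstance(sibling, Tag) else "string"
    print(f"{kind:6} {str(sibling)!r}")

# Keep only the real HTML elements.
tag_siblings = [s.name for s in heading.next_siblings if isinstance(s, Tag)]
print(tag_siblings)
```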

In [None]:
three_steps = two_steps.next_sibling
print(f"---- three_steps is a \t{type(three_steps)}\n")

print(f"--It has the value:\n--\n--{three_steps}--\n")

four_steps = three_steps.next_sibling
print(f"---- four_steps is of type:\t{type(four_steps)}\n")

print(f"It has the value:\n--\n--{four_steps}--\n")




That is what we want: the fourth sibling.

In [None]:
print(four_steps.name)

In [None]:
print(four_steps.contents)

Yes, it is very sad.  

**Web scraping involves a lot of trial and error.**

There is no getting around it. There is just too much that is left to the developer's choice, so even two webpages that *look the same* may, under the hood, **be coded quite differently**.



We can see that the contents of our desired element is a list.  Let's obtain the number of elements and check what they contain.

In [None]:
print(len(four_steps.contents))

for i, item in enumerate(four_steps.contents):
    print(f"{i:2} : {str(item)[:80]}")
    
# print(two_steps.contents[1])
# print(two_steps.contents[5])



To review: in order to find the desired tag, we chose an easily identifiable starting point -- `id` is great because its value *should* be unique -- and then navigated the HTML tree up to the correct parent and traversed its siblings until we got to the right one. 

Clearly, this is not a very elegant solution. 

If there were hundreds of siblings, that would have been very cumbersome. 

So, naturally, the Beautiful Soup developers thought of a better way to do it!

In [None]:
p_sibling = parent_of_title.find_next_sibling('p')
print(f"---- p_sibling is of type:\t{type(p_sibling)}\n")

print(f"It has the value:\n--\n--{p_sibling}--\n")

Much nicer!

Besides the `.find_next_sibling` method, there are also `.find_previous_sibling`, `.find_next_siblings`, `.find_previous_siblings`, `.find_parent`, and many others.

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these methods. 
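
To give a feel for a few of these methods, the sketch below runs them on a simplified, made-up version of the *Club career* snippet. It is only an illustration; the real page has more markup between these tags.

```python
from bs4 import BeautifulSoup

# Simplified, made-up version of the section structure shown earlier.
toy_html = """
<div id="content">
  <h2><span id="Club_career">Club career</span></h2>
  <h3>Early career</h3>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")
start = toy_soup.find(id="Club_career")

heading = start.find_parent("h2")            # nearest enclosing <h2>
container = start.find_parent("div")         # keeps climbing until a <div>
first_p = heading.find_next_sibling("p")     # first following <p> sibling
all_p = heading.find_next_siblings("p")      # every following <p> sibling
back = first_p.find_previous_sibling("h3")   # look backwards instead

print(first_p.get_text())
print(len(all_p))
print(back.get_text())
print(container["id"])
```

These methods skip over whitespace strings and non-matching tags, which is what makes them more robust than counting siblings by hand.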

**There is no need to memorize all of them.**

It's more important to realize that, as with any programming language, there is more than one way to get any element of the `HTML` tree. 

The trick, most of the time, is to *pick a good starting point* from where to start the scraping.

Wisdom is realizing that, most of the time, there is already a method that does what you want.


## Scraping images from a webpage

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [None]:
for i, image in enumerate( soup.find_all('img') ):
    print(f"{i:>2} -- {str(image)[:80]}")

We can pinpoint a specific image and get its attributes.

In [None]:
images = soup.find_all('img')

print(images[0].attrs)



Then, we can display the image using its `src` attribute.

Below, you can see how it is done if it is a file you have downloaded.  

In [None]:
print(data_folder / images[0]['src'])

for j in range(2):
    # str() keeps this working on IPython versions that expect a string path
    display(Image(filename = str(data_folder / images[j]['src'])))
    print()



If you have the `url`, then this is how you do it.

In [None]:
real_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Erik_Durm_2014_%28cropped%29.jpg/440px-Erik_Durm_2014_%28cropped%29.jpg' 
display(Image(url = real_url, width= 100))
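
When scraping a live page, the `src` values are often relative or start with `//` (no `https:`), so they need to be turned into full URLs before they can be displayed or downloaded. The sketch below is a hedged illustration on made-up `<img>` tags: the image paths are invented, and `page_url` is assumed to be the address the HTML came from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Address of the page the images came from (set by hand for this toy example).
page_url = "https://en.wikipedia.org/wiki/Erik_Durm"

# Made-up image tags; the paths are for illustration only.
toy_html = """
<img src="//upload.wikimedia.org/example/photo.jpg" alt="Erik Durm" width="220">
<img src="/static/images/logo.png" width="88">
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

image_records = []
for img in toy_soup.find_all("img"):
    image_records.append({
        # urljoin fills in the missing scheme and host from page_url
        "url": urljoin(page_url, img["src"]),
        # alt is optional, so fall back to an empty string
        "alt": img.get("alt", ""),
        "width": int(img["width"]) if img.has_attr("width") else None,
    })

for record in image_records:
    print(record)
```

Each resulting `url` could then be passed to `Image(url=...)` as in the cell above.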

# Exercises

What about getting all the `urls` in the page?