##### [A very crash course in Python](contents.ipynb)

# Interlude: What actually is a web page?

There are two answers to this question, depending on your perspective!

* To a human user, a web page is a collection of text and images and links and forms and so on, arranged in two dimensions, viewed through a web browser.
* To a web browser, a web page is a document in a language called HTML, which it knows how to render into the format that you, the human, see.

(Today we're only interested in static web pages.  A web page may be dynamic, which means it can change in response to a user's actions or other events.  Dynamic pages use a language called JavaScript to manipulate the HTML on a page.)

A web page is sent to the browser by a server in response to a request by the browser.  A request is triggered either by the user entering a URL into the URL bar, or by the user clicking on a link on another page.
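
As a rough sketch of what a browser works out from a URL before it sends a request, the standard library's `urllib.parse.urlsplit` can break one into its parts.  The `request_text` variable is purely illustrative: it shows roughly how a request begins, and a real browser sends more headers than this.

```python
from urllib.parse import urlsplit

# The URL of the example forecast page used in this notebook.
url = "http://inglesp.pythonanywhere.com/datalab/2019-07/forecast.html"

parts = urlsplit(url)
print(parts.scheme)  # which protocol to speak
print(parts.netloc)  # which server to contact
print(parts.path)    # which page to ask that server for

# Illustrative only: the opening lines of the request a browser might send.
request_text = f"GET {parts.path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
print(request_text)
```

The server's response to a request like this is what carries the HTML back to the browser.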

Since this is a very crash course, we'll just look at some HTML and try to spot some patterns.

Here's an example of some simple HTML:

```
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Weather forecast</title>
  </head>

  <body>
    <header>
      <h1>Weather forecast</h1>
    </header>

    <main>
      <article>
        <h2>3rd July 2019</h2>
        <section>
          <h3>London</h3>
          <p>Cloudy</p>
        </section>
        <section>
          <h3>Oxford</h3>
          <p>Overcast</p>
        </section>
        <section>
          <h3>Stroud</h3>
          <p>Nimbo cumulus</p>
        </section>
      </article>

      <article>
        <h2>4th July 2019</h2>
        <section>
          <h3>London</h3>
          <p>Hot</p>
        </section>
        <section>
          <h3>Oxford</h3>
          <p>Sunny</p>
        </section>
        <section>
          <h3>Stroud</h3>
          <p>Scorchio</p>
        </section>
      </article>
    </main>
  </body>
</html>
```

You can view the (not very impressive) web page at http://inglesp.pythonanywhere.com/datalab/2019-07/forecast.html.

> Tasks:
> * Match up what you can see on the page with what's in the HTML.
> * Copy the HTML above to a file on your computer called `forecast.html` and open it in a web browser.
> * Make changes to the file and see what effect they have.
>
> Questions:
> * What do you think `h1`, `h2`, `h3` indicate?
> * `<tag>` is called an "opening tag" and `</tag>` is called a "closing tag".  What happens if you remove an opening tag, e.g. `<h1>`?  What about if you remove a closing tag?

You can see the original source HTML of a web page in your browser.  If you right-click somewhere on the web page, you should see something (depending on your browser) like "View Page Source".

You can also right-click on a part of the page and choose something like "Inspect Element" to interact with the HTML on the page.

Whereas the page source will show you the HTML that was originally sent to the browser by the server, "Inspect Element" will show you the state of the HTML as it is currently rendered, taking into account any changes that have been caused by JavaScript running on the page.

> Task:
> * Try both of these now.

An HTML document is structured as a "tree" of elements.

```
html
├─head
│ └─title
│   └─text: Weather forecast
└─body
  ├─header
  │ └─h1
  │   └─text: Weather forecast
  └─main
    ├─article
    │ ├─h2
    │ │ └─text: 3rd July 2019
    │ ├─section
    │ │ ├─h3
    │ │ │ └─text: London
    │ │ └─p
    │ │   └─text: Cloudy
    │ ├─section
    │ │ ├─h3
    │ │ │ └─text: Oxford
    │ │ └─p
    │ │   └─text: Overcast
    │ └─section
    │   ├─h3
    │   │ └─text: Stroud
    │   └─p
    │     └─text: Nimbo cumulus
    └─article
      ├─h2
      │ └─text: 4th July 2019
      ├─section
      │ ├─h3
      │ │ └─text: London
      │ └─p
      │   └─text: Hot
      ├─section
      │ ├─h3
      │ │ └─text: Oxford
      │ └─p
      │   └─text: Sunny
      └─section
        ├─h3
        │ └─text: Stroud
        └─p
          └─text: Scorchio
```
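
An outline like the one above can be produced by a short program.  This is a minimal sketch using the standard library's `html.parser` module, not the approach used in the rest of this notebook.  The `OutlinePrinter` class and the shortened `HTML` string are made up for illustration, and the sketch assumes every opening tag has a matching closing tag (so it would miscount the depth after a tag like `<meta>`, which has no closing tag).

```python
from html.parser import HTMLParser

# A shortened copy of the forecast page, to keep the output small.
HTML = """<html>
  <head><title>Weather forecast</title></head>
  <body>
    <main>
      <article>
        <h2>3rd July 2019</h2>
        <section><h3>London</h3><p>Cloudy</p></section>
      </article>
    </main>
  </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Collect one indented line per element or piece of text."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many elements we are currently inside
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # everything up to the closing tag is a child

    def handle_endtag(self, tag):
        self.depth -= 1  # step back up to the parent element

    def handle_data(self, data):
        if data.strip():  # ignore the whitespace between tags
            self.lines.append("  " * self.depth + "text: " + data.strip())


printer = OutlinePrinter()
printer.feed(HTML)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the output corresponds to one step down a branch of the tree.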

We can find elements by "walking" along the branches of the tree.

For instance, the location of the text "Overcast" can be described as follows:

* Start at the `html` root element
* Find the `body` element within this element
* Find the `main` element within this element
* Find the first `article` element within this element
* Find the second `section` element within this element
* Find the `p` element within this element

> Task:
> * Follow these directions and check that you arrive at the text "Overcast".
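
The same walk can be written as code, one line per step.  This is only a sketch: it uses the standard library's `xml.etree.ElementTree`, which works here because the trimmed `HTML` string below (one article, no doctype or `<meta>` tag) happens to be well-formed XML.  Most real web pages are not, so treat this as an illustration of walking a tree rather than a general way to parse HTML.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the forecast page that is also well-formed XML.
HTML = """<html>
  <head><title>Weather forecast</title></head>
  <body>
    <main>
      <article>
        <h2>3rd July 2019</h2>
        <section><h3>London</h3><p>Cloudy</p></section>
        <section><h3>Oxford</h3><p>Overcast</p></section>
        <section><h3>Stroud</h3><p>Nimbo cumulus</p></section>
      </article>
    </main>
  </body>
</html>"""

root = ET.fromstring(HTML)                # start at the html root element
body = root.find("body")                  # the body element within it
main = body.find("main")                  # the main element within body
article = main.findall("article")[0]      # the first article within main
section = article.findall("section")[1]   # the second section within the article
p = section.find("p")                     # the p element within that section
print(p.text)

# The same walk written as a single path (positions in a path count from 1).
same_p = root.find("body/main/article[1]/section[2]/p")
print(same_p.text)
```

Note that Python lists count from 0, so the "first" article is `[0]` and the "second" section is `[1]`.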

To find an element, you don't have to start at the root element.  Instead, you can start at any other element and walk in either direction.  So to find the forecast for Stroud on 4th July, you could:

* Find the element with text "4th July 2019" (an `h2`)
* Find this element's parent (an `article`)
* Find the element within this element with text "Stroud" (an `h3`)
* Find this element's parent (a `section`)
* Find the `p` element within this element

> Task:
> * Follow these directions and check that you arrive at the expected text.
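
Walking up as well as down can also be sketched in code.  As a stand-in for a real HTML parser, this again uses the standard library's `xml.etree.ElementTree` on a trimmed, well-formed fragment of the page.  Its elements do not keep track of their parents, so the `parent_of` dictionary is built first; `find_by_text` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# A trimmed fragment of the forecast page that is also well-formed XML.
HTML = """<main>
  <article>
    <h2>3rd July 2019</h2>
    <section><h3>Stroud</h3><p>Nimbo cumulus</p></section>
  </article>
  <article>
    <h2>4th July 2019</h2>
    <section><h3>London</h3><p>Hot</p></section>
    <section><h3>Stroud</h3><p>Scorchio</p></section>
  </article>
</main>"""

root = ET.fromstring(HTML)

# Map every element to its parent so that we can walk upwards.
parent_of = {child: parent for parent in root.iter() for child in parent}


def find_by_text(start, tag, text):
    """Return the first `tag` element at or below `start` with exactly this text."""
    for element in start.iter(tag):
        if element.text == text:
            return element
    return None


h2 = find_by_text(root, "h2", "4th July 2019")  # the h2 holding the date
article = parent_of[h2]                          # its parent, an article
h3 = find_by_text(article, "h3", "Stroud")       # the Stroud heading inside it
section = parent_of[h3]                          # its parent, a section
p = section.find("p")                            # the p element within that section
print(p.text)
```

Starting the second search from `article` rather than `root` matters: searching the whole document for "Stroud" would find the 3rd July entry first.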

You don't need a browser to request a web page -- we can do it with Python.

We need to use a couple of third-party libraries: [`requests`](https://2.python-requests.org/en/master/) (for making web requests) and [`BeautifulSoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (for parsing HTML).

```python
import requests
from bs4 import BeautifulSoup
```

We use `requests` to request the web page from the server.

```python
rsp = requests.get('http://inglesp.pythonanywhere.com/datalab/2019-07/forecast.html')
```

`requests.get` returns a `Response` object.  Displaying it shows the status code of the response: 200 means that the request succeeded.  See [this excellent website](https://http.cat/) for details of other status codes.

```python
rsp
```

```
<Response [200]>
```
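
Python's standard library also knows the standard status codes.  As a small sketch, `http.HTTPStatus` can be used to look up the short description that goes with a number.  (With `requests`, the number itself is available as `rsp.status_code`.)

```python
from http import HTTPStatus

# Look up the standard short description for a few common status codes.
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)
```

Codes in the 200s mean success, codes in the 400s mean there was a problem with the request, and codes in the 500s mean something went wrong on the server.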

A `Response` object makes the body of the response available through its `text` attribute.  In this case, the body is the HTML of the web page.

```python
rsp.text
```

```
'<!DOCTYPE html>\n<html>\n  <head>\n    <title>Weather forecast</title>\n  </head>\n\n  <body>\n    <header>\n      <h1>Weather forecast</h1>\n    </header>\n\n    <main>\n      <article>\n        <h2>3rd July 2019</h2>\n        <section>\n          <h3>London</h3>\n          <p>Cloudy</p>\n        </section>\n        <section>\n          <h3>Oxford</h3>\n          <p>Overcast</p>\n        </section>\n        <section>\n          <h3>Stroud</h3>\n          <p>Nimbo cumulus</p>\n        </section>\n      </article>\n\n      <article>\n        <h2>4th July 2019</h2>\n        <section>\n          <h3>London</h3>\n          <p>Hot</p>\n        </section>\n        <section>\n          <h3>Oxford</h3>\n          <p>Sunny</p>\n        </section>\n        <section>\n          <h3>Stroud</h3>\n          <p>Scorchio</p>\n        </section>\n      </article>\n    </main>\n  </body>\n</html>'
```

We can use `BeautifulSoup` to create an object that makes it easy for us to navigate this HTML.

```python
doc = BeautifulSoup(rsp.text, 'html.parser')
```

Here, we're finding the text of the `p` element within the second `section` element within the first `article` element.

```python
doc.find_all('article')[0].find_all('section')[1].find('p').text
```

```
'Overcast'
```

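And here, we're finding the weather in Stroud on 4th July.  Each search by text returns the text itself rather than the element containing it, so the first `.parent` steps up to the enclosing `h2` or `h3`, and the second steps up to that element's parent.

```python
# string= is the current name for the argument that older Beautiful Soup releases call text=
doc.find(string='4th July 2019').parent.parent.find(string='Stroud').parent.parent.find('p').text
```

```
'Scorchio'
```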