In [None]:
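from IPython.display import HTML

def css_styling():
    # Read the custom stylesheet and hand it to the notebook for rendering.
    # The context manager closes the file once it has been read.
    with open("../data/www/styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()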

# A page on the web

HTML is a markup language for describing web documents. It stands for **H**yper **T**ext **M**arkup **L**anguage. Together with CSS (**C**ascading **S**tyle **S**heets, for _styling_ web documents) and JavaScript (for _animating_ web documents), HTML is used to construct modern web pages.

# HTML

HTML documents are built using a series of HTML _tags_. Each tag describes a different type of content. Web pages are built by putting together different tags.

This is the general HTML tag structure:

```html
<tagname tag_attribute1="attribute1value1 attribute1value2" tag_attribute2="attribute2value1">tag contents</tagname>
```
* Tags (usually) have both a start (or opening) tag, `<tagname>`, and an end (or closing) tag, `</tagname>`.
* Tags can also have attributes which are declared _inside_ the opening tag.
* The actual tag _content_ goes in between the opening and closing tags.
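
To make the three parts concrete, here is a minimal sketch that uses Python's standard-library `html.parser` to split one tag into its opening tag (with attributes), its contents, and its closing tag. The snippet, the `TagPrinter` class, and the attribute values are made up for illustration.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each piece of a tag in the order the parser encounters it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The text between the opening and closing tags
        self.events.append(("contents", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Hypothetical tag: one attribute with two values, one attribute with one value
snippet = '<p class="intro highlight" id="first">Hello, world!</p>'

parser = TagPrinter()
parser.feed(snippet)
parser.close()

for event in parser.events:
    print(event)
```

Note that the `class` attribute arrives as a single string; splitting it on whitespace gives the individual values.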

While HTML can appear monstrous based on what you see on other websites, it's actually very simple in its most basic form and can be written by hand.

As an example, navigate to this [example page](../data/hello_world.html).

<img src='../images/hello_example.png'>

As we dig into the code for that page, we can begin to understand it.

<img src='../images/webpage_structure.png'>

The first line only declares that the document is HTML.

The second line opens the *html* tag, which starts the HTML content. Notice that the matching closing tag is on line 9: once the *html* tag is closed, the document ends.

There are two main tags within the html tag - *head* and *body*.

**head** is a container for metadata about the webpage (i.e. data about the data in the web page). The *head* typically contains:
* *title* - the title of the document
* *base* - the base URL that relative links in the page are resolved against
* *style* - how the page should be styled (CSS)
* *link* - an external resource that should be loaded with the page
* *script* - JavaScript that should be loaded with the page (internal or external)
* *meta* - explicit metadata about the page (largely for search engines and web browsers)

**body** contains all the contents of an HTML page. We will almost always only care about the body of a web page when we are extracting data.
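
As a rough sketch of that split, the code below collects the text found inside *head* separately from the text found inside *body*, again using the standard-library `html.parser`. The `PAGE` string is a made-up document (it is not the course's `hello_world.html`), and `HeadBodySplitter` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical page, invented for this sketch
PAGE = """<!DOCTYPE html>
<html>
<head>
  <title>Hello page</title>
  <meta charset="utf-8">
</head>
<body>
  <h1>Hello, world!</h1>
  <p>This text lives in the body.</p>
</body>
</html>"""


class HeadBodySplitter(HTMLParser):
    """Collect text separately for the head and the body."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head", "body", or None when outside both
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag

    def handle_endtag(self, tag):
        if tag in ("head", "body"):
            self.section = None

    def handle_data(self, data):
        # Skip the whitespace-only chunks that sit between tags
        if self.section and data.strip():
            self.text[self.section].append(data.strip())


splitter = HeadBodySplitter()
splitter.feed(PAGE)
splitter.close()

print("head:", splitter.text["head"])
print("body:", splitter.text["body"])
```

The head yields only the title, while everything a reader would actually see comes out of the body, which is why the body is what we care about when extracting data.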

An important point that I want to drive home is that **the appearance of a web page has little to do with its content**.

Below is the same web page body, yet it looks completely different. All I have done is change the style (without changing the text at all).

<img src='../images/styled_webpage.png'>

## Structure within the body

The point of a markup language is to provide a basic structure that another program (i.e. your web browser) can interpret. As such, most parts of a web page should sit inside a well-defined tag, and it is these tags that help the browser determine how to render the HTML text.

The tags that are generally used within the body of a page are:
* div - this is a container that marks off one block of the page body. It will typically contain a large number of other tags and text.

  `<div>
   multiple paragraphs and tags
   </div>`
* img - displays images
  
  `<img src='image_url'>`
  
* a - links to another page/content
  
  `<a href='url'>link text displayed to human</a>`
  
* ul or ol - lists (unordered or ordered)

   `<ul>
      <li>item 1</li>
      <li>item 2</li>
    </ul>`

* table - a table within the webpage

  `<table>
      <tr>
          <td>Name</td>
          <td>Age</td>
      </tr>
      <tr>
          <td>Abigail Adams</td>
          <td>274</td>
      </tr>
  </table>`
  
* p - a paragraph
 
   `<p> Some paragraph... </p>`
   
* h1 to h6 - headers that follow automatic sizing and bolding (1 is largest, 6 is smallest)

   `<h1> Header for a paragraph </h1>`
   
* br - inserts a line break between two elements

   `<p> paragraph 1</p>
   <br/>
   <p> paragraph 2 after a line break</p>`
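
Because these tags give the page a predictable structure, a program can walk them to pull data out. The sketch below reads the small table from the list above into a list of rows using the standard-library `html.parser`. `TableReader` is a hypothetical helper and assumes every `td` sits inside a `tr`, which holds for this example but not for every real page.

```python
from html.parser import HTMLParser

# The example table from the list above
TABLE_HTML = """
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Abigail Adams</td><td>274</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Turn <tr>/<td> tags into a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")      # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self.in_cell:
            self.rows[-1][-1] += data


reader = TableReader()
reader.feed(TABLE_HTML)
reader.close()

for row in reader.rows:
    print(row)
```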

## A real web page in action

First let's go to https://en.wikipedia.org/wiki/Seinfeld

Now let's go to http://www.cnn.com

What is going on?

### Static vs. Dynamic content

Wikipedia largely has static content: the text and elements that it renders stay the same between visits (i.e. the Seinfeld page is always the Seinfeld page). It is a "free" website that doesn't serve ads, so there really isn't anything that it is trying to hide from users.

The CNN homepage, on the other hand, constantly updates which stories are shown. It has lists/blocks of articles, but the articles inserted into those lists keep changing. As such, not much of the page is displayed as plain text that we can easily read. Some of the web frameworks that companies use are better than others at rendering the displayed text into plain text that we can then read/scrape.

By my count, with my ad blocker off, it is also showing me 8 ads through an advertising network. These ads are unique and served directly from the ad network provider, so they change from visit to visit and aren't written into the page directly. Instead they are served as a block of data (much like our API usage) that is obfuscated from the user and generally from the page itself.

The downside is that dynamic content is much harder to retrieve from a page (if it can be retrieved at all in a usable format).
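
The difference can be sketched with two made-up page sources. In the static one the text is written directly in the HTML; in the dynamic one the HTML only contains an empty container plus a script that would fill it in once a browser runs it. Both strings are hypothetical and are not the actual markup of Wikipedia or CNN; `VisibleText` and `visible_text` are illustrative helpers built on the standard-library `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical sources, invented for this sketch
STATIC_PAGE = '<div id="story"><h1>Seinfeld</h1><p>An American sitcom.</p></div>'
DYNAMIC_PAGE = (
    '<div id="story"></div>'
    '<script>document.getElementById("story").innerText = "Breaking news";</script>'
)


class VisibleText(HTMLParser):
    """Collect text that appears in the HTML itself, ignoring script code."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text chunks found directly in the page source."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


print("static: ", visible_text(STATIC_PAGE))
print("dynamic:", visible_text(DYNAMIC_PAGE))
```

The dynamic page yields no text at all, because the headline only exists after the script has run in a browser. A parser that reads the raw source never sees it.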

# Examining web page contents

For this, we will be using Chrome so everyone has the same experience. To start dissecting a web page we will use the developer tools, and we can begin with the view source function.

<img src='../images/chrome_dev.png'>

How do we view a page's source code then?

* To view the **full page** source code:
  1. Right-click anywhere on the webpage **that is not a link**
  2. Click "View Page Source" (<kbd>CTRL</kbd>+<kbd>U</kbd>) in Firefox or Chrome, or "Show page source" (<kbd>&#8997;</kbd>+<kbd>&#8984;</kbd>+<kbd>U</kbd>) in Safari.
    * In order to view the source code in Safari the Develop menu must be enabled first: Preferences > Advanced > Show Develop menu in menu bar
    
* To view the source code zoomed-in on **a single element** (and with better formatting!):
  1. Right-click any element in the webpage.
  2. Click "Inspect" (labeled "Inspect Element" in some browsers)


# Understanding callouts

Importantly, there are additional pieces of information that can be included in a tag. The ones that we will ultimately want to pay attention to are:

* class - this is used to apply a CSS style. The same class is typically shared by every tag that should be styled the same way (such as all of the headlines on a page)
* id - this is typically a unique name applied to one tag
* style - these are additional style rules written directly on a tag
* other attributes - these contain additional data about the tag. They can be very helpful, especially if one contains a displayed data point in plain text.
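
As a minimal sketch of how these callouts help when extracting data, the code below records every opening tag with its attributes and then filters by `class`, by `id`, and by a data-carrying attribute. The `SNIPPET` markup, the class and id names, and the `data-views` attribute are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this sketch
SNIPPET = """
<h1 class="headline" id="top-story">Local cat wins election</h1>
<p class="byline" style="color: gray;">By A. Reporter</p>
<p class="headline" data-views="1024">Second story</p>
"""


class AttributeFinder(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


finder = AttributeFinder()
finder.feed(SNIPPET)
finder.close()

# class may hold several space-separated names, so split before comparing
headlines = [tag for tag, attrs in finder.tags
             if "headline" in attrs.get("class", "").split()]

# id is expected to match a single tag
by_id = [tag for tag, attrs in finder.tags if attrs.get("id") == "top-story"]

# Other attributes can carry a data point in plain text
views = [attrs["data-views"] for _, attrs in finder.tags if "data-views" in attrs]

print(headlines, by_id, views)
```

The class filter matches two different tag types, while the id filter matches exactly one tag, which mirrors how the two are typically used.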