# Week 7 Notebook 1: Introducing HTML

This week, we are going to learn how to perform web scraping using the Beautiful Soup library.

But first, in this lesson we will learn about:
1. Different HTML elements and the tags that are used to define them
2. How to render HTML in the Jupyter Notebook

## HTML 

HTML, or HyperText Markup Language, is the language used to "mark up" web pages. It uses tags to define how elements such as text, images and hyperlinks should be displayed by a web browser. Even this Jupyter Notebook is being rendered using HTML!

Different types of HTML elements are defined by their tags. The content to be marked up is placed between a start tag and an end tag. For example, the `<html> ... </html>` tags tell the browser that the content between them is to be marked up using HTML. As we can see, the end tag is marked by a slash, `/`.
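
To see start and end tags from a program's point of view, here is a minimal sketch that uses `html.parser` from Python's standard library (not Beautiful Soup, which comes later). The class name `TagLogger` and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag, end tag and piece of text the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed("<p>Hello, HTML!</p>")
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

The parser reports the start tag, then the text, then the end tag, which mirrors the order in which they appear in the string.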

A basic web page can be formed by nesting the following tags:

```html
<html>
    <head>
    
    </head>
    <body>
    
    </body>
</html>
```

Information about the HTML page, or *metadata*, will be contained between the `<head>` tags. The page content will be contained between the `<body>` tags, and we will focus on this part, as the data that we want to extract is here.
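
The split between `<head>` and `<body>` can be illustrated with a short sketch. It assumes a small made-up page (the `page` string and the `SectionText` class are hypothetical) and uses the standard library's `html.parser` to collect the text found in each section separately.

```python
from html.parser import HTMLParser

# A made-up page: a title in <head>, one paragraph in <body>
page = """
<html>
    <head>
        <title>My first page</title>
    </head>
    <body>
        <p>Some page content</p>
    </body>
</html>
"""


class SectionText(HTMLParser):
    """Collect the text found inside <head> and inside <body> separately."""

    def __init__(self):
        super().__init__()
        self.current = None  # the section we are inside, if any
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.text:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Keep non-empty text only, filed under the current section
        if self.current and data.strip():
            self.text[self.current].append(data.strip())


parser = SectionText()
parser.feed(page)
parser.close()
print(parser.text)
```

The title ends up under `"head"` and the paragraph text under `"body"`, which is why web scraping mostly concerns the body.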

Two tags that we will introduce are:
- `<h1> ... </h1>`, which defines a top-level heading, and
- `<p> ... </p>`, which contains a paragraph of text.

We can try to use these tags in the next example.

## Displaying HTML in Jupyter
We can display HTML within this notebook. First, we import `HTML` from the `IPython.display` module. `HTML` takes a string of HTML-formatted text and renders it when we execute the cell.

The string variable `test_html` below contains HTML tags and the content between them. The `<h1>` tag formats the heading and the `<p>` tag formats the paragraph. Always remember to use the slash, `/`, in the end tags (`</h1>` and `</p>`).

Execute the cell below to view the rendered HTML.

```python
from IPython.display import HTML

# create a string that contains HTML tags
test_html = '<html><body><h1>This is header 1</h1><p>A paragraph with some content</p></body></html>'

# use the HTML function to display the HTML
HTML(test_html)
```
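
For the curious, Jupyter obtains the HTML form of an object by calling a method named `_repr_html_` on it, if the object has one. The sketch below shows that idea with a made-up `Greeting` class (hypothetical, not part of IPython); it is an illustration of the protocol, not a description of how `IPython.display.HTML` is implemented internally.

```python
class Greeting:
    """A toy object that can describe itself as HTML."""

    def __init__(self, name):
        self.name = name

    def _repr_html_(self):
        # Jupyter calls this method, when present, to get an HTML representation
        return f"<h1>Hello, {self.name}</h1><p>Rendered via _repr_html_</p>"


greeting = Greeting("Jupyter")

# Outside a notebook we can still look at the raw HTML string
print(greeting._repr_html_())

# As the last line of a notebook cell, this is displayed as rendered HTML
greeting
```

In a plain Python script the last line does nothing, so only the raw string is printed.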

**Another Example**

We can create a string that spans several lines using triple quotes, `"""`. This keeps the HTML tags easy to read, as in the string `sample_doc` below.

```python
sample_doc = """
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1>Hello</h1>
        <p>Welcome to HTML in Jupyter</p>
    </body>
</html>
"""

# now use the HTML function
HTML(sample_doc)
```

**Attributes**

Tags may also contain attributes that describe the element. For example, we can add the attribute `title="greeting"` inside the `<h1>` start tag below. This adds a tooltip to the heading.

```python
sample_doc = """
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1 title="greeting">Hello</h1>
        <p>Welcome to HTML in Jupyter</p>
    </body>
</html>
"""

HTML(sample_doc)
```

Now, when you hover your mouse over the "Hello" heading rendered above, you should see the tooltip "greeting".
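
Attributes are also what a scraper reads when it needs more than the visible text. As a rough sketch, the standard library's `html.parser` hands each start tag's attributes to `handle_starttag` as a list of `(name, value)` pairs; the class name `AttributeCollector` is made up for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Remember the attributes of every tag that has at least one."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("title", "greeting")]
        if attrs:
            self.attributes[tag] = dict(attrs)


collector = AttributeCollector()
collector.feed('<h1 title="greeting">Hello</h1><p>Welcome to HTML in Jupyter</p>')
collector.close()

print(collector.attributes)
```

Only `<h1>` shows up in the result, because the `<p>` tag has no attributes.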


## Other Tags

Execute the cell below to view other HTML tags.

```python
other_tags = """
<html>
    <head>
      <title>Page Title</title>
    </head>
    <body>
        <!-- This is a comment -->
        <!-- Headings range from <h1> to <h6> -->
        <h1>This is the biggest header</h1>
        <h2>This is the next size header</h2>
        <h6>This is the smallest header</h6>

        <h2>Embedded Image</h2>
        <!-- The <img> tag is used to provide a link to an image to be displayed -->
        <!-- The <a> tag is used to provide a hyperlink to other files -->
        <img src="https://www.publicdomainpictures.net/pictures/210000/velka/physical-world-map-robinson.jpg" alt="World Map" width="212" height="107">
        <p>Image available from <a href="http://www.publicdomainpictures.net">Public Domain Pictures</a></p>

        <h2>Unordered List</h2>
        <!-- The <ul> tag is for unordered lists, <li> for each list item -->
        <ul>
            <li>list item</li>
            <li>another item</li>
            <li>another item</li>
        </ul>

        <h2>Ordered list has numbers</h2>
        <!-- The <ol> tag is for ordered lists, <li> also for each list item -->
        <ol>
            <li>list item</li>
            <li>list item</li>
        </ol>

        <h2>HTML Table</h2>

        <!-- A table is defined using the <table> tag -->
        <!-- Then the table rows are specified using <tr> -->
        <!-- The <th> tag is for the table header -->
        <!-- The <td> tag is for the table data -->
        <table>
            <tr>
                <th>Library</th>
                <th>Purpose</th>
            </tr>
            <tr>
                <td>Pandas</td>
                <td>Data Manipulation</td>
            </tr>
            <tr>
                <td>Matplotlib</td>
                <td>Plotting</td>
            </tr>
            <tr>
                <td>Beautiful Soup</td>
                <td>Web Scraping</td>
            </tr>
        </table>
    </body>
</html>
"""

HTML(other_tags)
```

| Library | Purpose |
| --- | --- |
| Pandas | Data Manipulation |
| Matplotlib | Plotting |
| Beautiful Soup | Web Scraping |
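
The `<table>`, `<tr>`, `<th>` and `<td>` tags map naturally onto rows and cells. The sketch below reads a table shaped like the one in the cell above into a list of rows, using only the standard library's `html.parser`. The names `table_html` and `TableReader` are made up for this example, and the code assumes well-formed markup where every cell sits inside a `<tr>`.

```python
from html.parser import HTMLParser

table_html = """
<table>
    <tr><th>Library</th><th>Purpose</th></tr>
    <tr><td>Pandas</td><td>Data Manipulation</td></tr>
    <tr><td>Matplotlib</td><td>Plotting</td></tr>
    <tr><td>Beautiful Soup</td><td>Web Scraping</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Turn <tr>/<th>/<td> tags into a list of rows, each a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # a new row starts
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")    # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a cell belongs to the table content
        if self.in_cell:
            self.rows[-1][-1] += data


reader = TableReader()
reader.feed(table_html)
reader.close()

for row in reader.rows:
    print(row)
```

The first row holds the `<th>` header cells and the remaining rows hold the `<td>` data cells.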


## Nested Tags

As you can see from the HTML code, some tags are nested within other tags. A tag nested directly inside another tag is called a *child* tag, and the tag that encloses it is its *parent* tag. Tags that share the same parent are called *sibling* tags.

```html
<html>
    <body>
        <h1>This is the biggest header</h1>
        <p>Image available from <a href="http://www.publicdomainpictures.net">Public Domain Pictures</a></p>

        <ul>
            <li>list item</li>
            <li>another item</li>
            <li>another item</li>
        </ul>
    </body>
</html>
```

In the example above, `<body>` is the parent tag of `<h1>`, `<p>` and `<ul>`, which makes `<h1>`, `<p>` and `<ul>` sibling tags. Each `<li>` is a child tag of `<ul>` and is also considered a *descendant* of `<body>`.

Knowing the relationships between the tags allows us to navigate the HTML tree.
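
To make the parent and child relationships concrete, here is a small sketch that records the parent of every tag in the example above. It keeps a stack of the tags that are currently open, so the top of the stack is always the parent of the next tag. The names `doc` and `ParentTracker` are made up for this example, and the approach assumes well-formed HTML in which every start tag has a matching end tag.

```python
from html.parser import HTMLParser

doc = """
<html>
    <body>
        <h1>This is the biggest header</h1>
        <p>Image available from <a href="http://www.publicdomainpictures.net">Public Domain Pictures</a></p>
        <ul>
            <li>list item</li>
            <li>another item</li>
            <li>another item</li>
        </ul>
    </body>
</html>
"""


class ParentTracker(HTMLParser):
    """Record (tag, parent) pairs in the order the tags appear."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags that are currently open, outermost first
        self.parents = []  # (tag, parent) pairs in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.parents.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag when its end tag arrives
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = ParentTracker()
tracker.feed(doc)
tracker.close()

for tag, parent in tracker.parents:
    print(f"{tag} -> parent: {parent}")

# Siblings are simply the tags that share a parent
body_children = [tag for tag, parent in tracker.parents if parent == "body"]
print(body_children)
```

The tags listed in `body_children` are the siblings `<h1>`, `<p>` and `<ul>`, and every `<li>` reports `<ul>` as its parent.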

## Viewing HTML Source

We can view the HTML source code of any web page: right-click on the page to bring up the context menu, then select `View Page Source`.

![image-2.png](attachment:image-2.png)

To view a specific element, we hover the cursor over it, right-click and then select `Inspect Element` (the menu item is named `Inspect` in some browsers). This helps us pick out the specific tags that we want for web scraping, which we will do in the next notebook!