# `bs4` Basics

Here is an example of a very simple HTML document:

```html
<html> 
  <head> 
    <title>My page</title> 
  </head> 
  <body> 
    <h2>Welcome to my <a href="https://example.com/index.html">page</a></h2> 
    <p id="basic" class="stylish small">This is the first paragraph.</p> 
    <p id="para2" class="unstylish">This is the second paragraph.</p> 
    <p>This is the third paragraph, which contains a <a href="https://example.com/about.html">link</a> to another page.</p>
    <!-- this is the end --> 
  </body> 
</html>
```

Eventually, we will learn to request arbitrary HTML documents from remote servers, store them to disk, and load them for analysis. But let's keep things simple and assign this document to a variable called `simple_doc`.

In [182]:
simple_doc = """<html> 
  <head> 
    <title>My page</title> 
  </head> 
  <body> 
    <h2>Welcome to my <a href="https://example.com/index.html">page</a></h2> 
    <p id="basic" class="stylish small">This is the first paragraph.</p> 
    <p id="para2" class="unstylish">This is the second paragraph.</p> 
    <p>This is the third paragraph, which contains a <a href="https://example.com/about.html">link</a> to another page.</p>
    <!-- this is the end --> 
  </body> 
</html>"""

Suppose we want to retrieve the title of this HTML document, which is stored as text wrapped in `<title>` tags. If you already know a bit of Python, you may know how to manipulate strings. Because the document has been neatly formatted, we could do something like this using string methods:

In [183]:
title_tag = simple_doc.split('\n')[2].strip()
title_tag

'<title>My page</title>'

In [184]:
title_contents = title_tag.replace('<title>', '').replace('</title>', '')
title_contents

'My page'

One problem with this technique is that it makes too many assumptions about how the string that stores the HTML document is laid out; for example, it assumes that the relevant tag is on the third line (`.split('\n')[2]`) of the multi-line string. In fact, an HTML document remains structurally equivalent even if we remove all of its line breaks and indentation. So there is no guarantee that the precise layout we depend on above will be preserved in the form in which a server delivers a remote document.
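
To make the fragility concrete, here is a minimal sketch that runs the same line-based extraction on two structurally equivalent versions of a document. The shortened document and the names `formatted_doc`, `minified_doc` and `title_by_line_number` are hypothetical, introduced only for this illustration.

```python
# Two structurally equivalent versions of the same (hypothetical) document.
formatted_doc = """<html>
  <head>
    <title>My page</title>
  </head>
</html>"""

# The same markup with every line break and indent removed.
minified_doc = "".join(line.strip() for line in formatted_doc.split("\n"))


def title_by_line_number(doc):
    """Fragile: assumes the <title> tag sits alone on the third line."""
    title_tag = doc.split("\n")[2].strip()
    return title_tag.replace("<title>", "").replace("</title>", "")


print(title_by_line_number(formatted_doc))  # My page

try:
    title_by_line_number(minified_doc)
except IndexError as error:
    # The minified document is a single line, so index 2 does not exist.
    print("Failed:", error)
```

The markup is identical in both cases; only the whitespace differs, and that alone is enough to break the string-based approach.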

A better approach is to use Python to do the work of converting the string that stores the HTML document into an abstract representation of the tree structure that is the page's skeleton.

This step needs to be done only once for each HTML document. Once it has been completed, we can use the features of the Python programming language to navigate the structure of the document, either manually or, ultimately, programmatically.

To do this, we use a third-party Python library called Beautiful Soup, whose module is named `bs4` and can be made available using the following `import` statement:

In [185]:
import bs4

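We ask `bs4` to process the string that stores the HTML document and return a new Python object, which we will call `soup` throughout. The second argument names the parser to use; here it is `html.parser`, which ships with Python, so that the result does not depend on which other parsers happen to be installed:

In [186]:
soup = bs4.BeautifulSoup(simple_doc.replace('\n', ''), 'html.parser')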
type(soup)

bs4.BeautifulSoup
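
As a quick check that the parsed tree does not depend on layout, the sketch below parses a formatted and a minified copy of a shortened, hypothetical document and looks up the title in each. The variable names are illustrative, and `html.parser` (Python's built-in parser) is chosen here as an assumption so that the example behaves the same on any machine.

```python
import bs4

formatted_doc = """<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <p>This is the first paragraph.</p>
  </body>
</html>"""

# The same markup with every line break and indent removed.
minified_doc = "".join(line.strip() for line in formatted_doc.split("\n"))

# Parse each version once; all later queries work on the resulting tree.
formatted_soup = bs4.BeautifulSoup(formatted_doc, "html.parser")
minified_soup = bs4.BeautifulSoup(minified_doc, "html.parser")

titles = [s.find("title").get_text() for s in (formatted_soup, minified_soup)]
print(titles)  # ['My page', 'My page']
```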

Objects of type `bs4.BeautifulSoup` support a wide range of methods that allow us to access elements of the HTML document. For example, the `.find()` method can be used to find the first instance of a tag with the name given in its first argument:

In [187]:
title = soup.find('title')

Take care here. Even though the displayed representation of the object called `title` is the same as the HTML markup that generates it, it is not stored as a string.

In [188]:
title, type(title)

(<title>My page</title>, bs4.element.Tag)

It is, rather, a `bs4.element.Tag`, which supports methods useful for further manipulation, such as extracting the text attached to the `<title></title>` tag:

In [189]:
title_text = title.getText()
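
The difference between a `Tag` and a plain string is easy to miss, so here is a small sketch on a hypothetical one-line document. It uses `get_text()`, the snake_case spelling of `getText()`, and assumes the built-in `html.parser`.

```python
import bs4

soup = bs4.BeautifulSoup("<head><title>My page</title></head>", "html.parser")
title = soup.find("title")

print(isinstance(title, str))  # False: it is a Tag, not a string
print(title.name)              # title
print(str(title))              # <title>My page</title>
print(title.get_text())        # My page
```

Converting with `str()` gives back the markup, whereas `get_text()` gives only the text inside the tags.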

We can also use methods on the `bs4.element.Tag` to retrieve elements that have a specific relationship to it within the HTML document. For example, `find_parent()` finds the parent of the given element, which here is the `<head>` element. It returns the whole subdocument enclosed by the relevant tags.

In [190]:
title.find_parent()

<head> <title>My page</title> </head>
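
Moving upwards through the tree can be sketched as follows, on a hypothetical minimal document parsed with the built-in `html.parser`. The `.parent` attribute and the `find_parents()` method are part of `bs4`; the variable `ancestor_names` is just an illustrative name.

```python
import bs4

soup = bs4.BeautifulSoup(
    "<html><head><title>My page</title></head><body></body></html>",
    "html.parser",
)
title = soup.find("title")

# .parent is the immediate parent; find_parent() without arguments
# reaches the same element.
print(title.parent.name)         # head
print(title.find_parent().name)  # head

# find_parents() keeps walking upwards, from the nearest ancestor outwards.
ancestor_names = [parent.name for parent in title.find_parents()]
print(ancestor_names[:2])        # ['head', 'html']
```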

The `find_next_sibling()` method finds the next sibling element, that is, the next element at the same level in the document tree. Siblings are considered in the order in which they appear in the document at that level. In this example, the next sibling of the `<h2>` element is the first paragraph.

In [191]:
h2_next_sibling = soup.body.h2.find_next_sibling()
h2_next_sibling, h2_next_sibling.getText()

(<p class="stylish small" id="basic">This is the first paragraph.</p>,
 'This is the first paragraph.')

The `find_next_siblings()` method returns all of the siblings that follow an element, wrapped in a `ResultSet`, which behaves a little like a regular Python `list`.

In [200]:
h2_next_siblings = soup.body.h2.find_next_siblings()
h2_second_sibling = h2_next_siblings[1]
h2_second_sibling

<p class="unstylish" id="para2">This is the second paragraph.</p>

Because a `ResultSet` is a collection rather than a single element, you cannot ask it directly for an attribute such as `.text`. If you try, `bs4` raises an error with a helpful message; the remedy is to iterate over the elements instead.

In [204]:
h2_next_siblings.text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [206]:
h2_next_siblings_texts = [x.text for x in h2_next_siblings]
h2_next_siblings_texts

['This is the first paragraph.',
 'This is the second paragraph.',
 'This is the third paragraph, which contains a link to another page.']
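
The list-like behaviour of a `ResultSet` can be sketched on a small, hypothetical fragment (parsed with the built-in `html.parser`): it supports `len()`, indexing, slicing and iteration, just as a list does.

```python
import bs4

soup = bs4.BeautifulSoup(
    "<body><h2>Welcome</h2><p>First.</p><p>Second.</p><p>Third.</p></body>",
    "html.parser",
)
siblings = soup.h2.find_next_siblings()

print(len(siblings))                         # 3
print(siblings[0].get_text())                # First.
print([s.get_text() for s in siblings[1:]])  # ['Second.', 'Third.']
```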

**Exercise**: 

Retrieve the text *content* of the three paragraph tags (`<p></p>`), place the results in a Python list, and assign the list to the variable `paragraph_texts`. You will need to read about the `.find_all()` method.

**Solution**:

In [202]:
paragraph_texts = [p.getText() for p in soup.find_all('p')]
paragraph_texts

['This is the first paragraph.',
 'This is the second paragraph.',
 'This is the third paragraph, which contains a link to another page.']

**Exercise**: 

Retrieve the link *destination* of every hyperlink in the document `<body>`, place them in a Python list, and assign the list to the variable `document_links`. You will need to access the `href` attribute of each of the `<a>` elements you find.

**Solution**:

In [203]:
document_links = [a.attrs['href'] for a in soup.body.find_all('a')]
document_links

['https://example.com/index.html', 'https://example.com/about.html']
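
Real pages often contain `<a>` elements without an `href` attribute, in which case indexing with `['href']` raises a `KeyError`. The sketch below, on a hypothetical fragment parsed with the built-in `html.parser`, shows two ways of coping: the `.get()` method, and filtering with `href=True` in `find_all()`.

```python
import bs4

# Hypothetical fragment: the second <a> element has no href attribute.
soup = bs4.BeautifulSoup(
    '<body><a href="https://example.com/index.html">page</a>'
    '<a name="top">anchor</a></body>',
    "html.parser",
)

# .get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.body.find_all("a")]
print(hrefs)  # ['https://example.com/index.html', None]

# href=True keeps only the elements that actually carry the attribute.
links = [a["href"] for a in soup.body.find_all("a", href=True)]
print(links)  # ['https://example.com/index.html']
```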