# Navigating The Parse Tree

The most straightforward way to navigate the parse tree created by BeautifulSoup is to access the HTML or XML tags by name. We can access the tags as if they were attributes of the BeautifulSoup object. Let's see how this works.

In the code below, we will access the `<head>` tag in our `page_content` object by using the statement:

```python
page_content.head
```

Whenever we access a tag in this manner, we get a **Tag** object. Consequently, the above statement returns a Tag object that we will save in the `page_head` variable. We then print the Tag object to see what it looks like.

In [1]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Access the head tag
page_head = page_content.head

# Print the Tag Object
print(page_head.prettify())

<head>
 <title>
  AI For Trading
 </title>
 <meta charset="utf-8"/>
 <link href="./teststyle.css" rel="stylesheet"/>
 <style>
  .h2style {background-color: tomato;color: white;padding: 10px;}
 </style>
</head>



As we can see, the `page_head` object contains only the `<head>` tag and everything nested inside it, including the opening and closing tags of each nested element. These nested tags are known as children of the `<head>` tag. For example, the `<title>` tag is a child of the `<head>` tag.

We can access these child tags within the `<head>` tag as if they were attributes of the `page_head` object. For example, to access the `<title>` tag within the `<head>` tag, we can use:

In [2]:
# Access the title tag within the head tag from the page_head object
page_head.title

<title>AI For Trading</title>

Note that the above statement, `page_head.title`, is equivalent to the statement:

In [3]:
# Access the title tag within the head tag from the page_content object
page_content.head.title

<title>AI For Trading</title>

We can see that in both cases we get the same result. We will talk more about child tags in a later lesson.
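The attribute-style navigation shown above can also be tried without `sample.html`. The following is a minimal sketch that assumes a small inline HTML string (hypothetical, modeled on the sample file) and uses Python's built-in `html.parser` instead of `lxml` so that no extra install is needed. It also shows what happens when the requested tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for sample.html
html = """<html><head><title>AI For Trading</title><meta charset="utf-8"/></head>
<body><h1 id="intro">Get Help From Peers and Mentors</h1></body></html>"""

# Assumption: the built-in parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

# Accessing a tag as an attribute returns a Tag object
head = soup.head
print(type(head).__name__)  # Tag
print(head.name)            # head

# Both navigation paths reach the same Tag object in the tree
same_title = head.title is soup.head.title
print(same_title)           # True

# A tag that is not present comes back as None instead of raising an error
missing = soup.head.script
print(missing)              # None
```

Because a missing tag yields `None`, chaining further (for example `soup.head.script.title`) would raise an `AttributeError`, so it can be worth checking for `None` before going deeper.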

# TODO: Access The Meta Tag

In the cell below, access the `<meta>` tag contained within the `<head>` tag. Start by importing BeautifulSoup. Then open the `sample.html` file and pass the open file handle to the BeautifulSoup constructor, using the `lxml` parser. Save the BeautifulSoup object returned by the constructor in a variable called `page_content`. Then access the `<meta>` tag contained within the `<head>` tag from the `page_content` object. Save the Tag object in a variable called `page_meta` and then print it.

In [None]:
# Import BeautifulSoup


# Open the HTML file and create a BeautifulSoup Object

page_content = 

# Access the meta tag
page_meta = 

# Print the Tag Object


# TODO: Access The `<h1>` Tag

In the cell below, access the `<h1>` tag contained within the `<body>` tag. Start by importing BeautifulSoup. Then open the `sample.html` file and pass the open file handle to the BeautifulSoup constructor, using the `lxml` parser. Save the BeautifulSoup object returned by the constructor in a variable called `page_content`. Then access the `<h1>` tag contained within the `<body>` tag from the `page_content` object. Save the Tag object in a variable called `page_h1` and then print it.

In [None]:
# Import BeautifulSoup


# Open the HTML file and create a BeautifulSoup Object

page_content = 

# Access the h1 tag
page_h1 = 

# Print the Tag Object


# Getting Text 

Let's get the `<title>` tag within the `<head>` tag again:

In [6]:
# Print the title tag within the head tag
print(page_head.title)

<title>AI For Trading</title>


As we can see from this example, and the ones above it, Tag objects contain the opening and closing tags as well as the text within them. In most cases, however, we do not want the tags, only the text contained within them. For example, suppose we only wanted the text `AI For Trading` contained within the `title` tags. In these cases we can use the `.get_text()` method, which returns only the text of a document or tag. Let's see how it works:

In [7]:
# Print only the text in the title tag within the head tag
print(page_head.title.get_text())

AI For Trading


We can see that we now get only the text in the `title` tag.
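When a tag contains other tags, `.get_text()` joins the text of everything nested inside it. The sketch below illustrates this on a small inline HTML fragment (a hypothetical stand-in for part of `sample.html`), again using the built-in `html.parser` as an assumption. It also shows the optional `separator` and `strip` arguments, which can help when text from adjacent tags would otherwise run together.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one section of sample.html
html = (
    '<div class="section"><h2>Student Hub</h2>'
    '<p>Search or ask questions in '
    '<a href="https://knowledge.udacity.com/">Knowledge</a></p></div>'
)

# Assumption: the built-in parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

# The text of the nested <a> tag is included in the text of <p>
p_text = soup.p.get_text()
print(p_text)      # Search or ask questions in Knowledge

# Without a separator, text from neighboring tags is simply concatenated
raw_text = soup.div.get_text()
print(raw_text)    # Student HubSearch or ask questions in Knowledge

# separator and strip put a space between pieces and trim whitespace
clean_text = soup.div.get_text(separator=" ", strip=True)
print(clean_text)  # Student Hub Search or ask questions in Knowledge
```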

# TODO: Remove HTML Tags

In the cell below, use the `.get_text()` method to remove all the HTML tags from the `sample.html` document. In other words, just print the entire text in the document, with no HTML tags.

**HINT:** Use the `.get_text()` method on the `page_content` object.

In [None]:
# Import BeautifulSoup


# Open the HTML file and create a BeautifulSoup Object

page_content = 

# Print only the text in the whole document


# Getting Attributes

An HTML or XML tag can have many attributes. For example, the tag:

```html
<h1 id='intro'>
```
has the attribute `id` whose value is `'intro'`. BeautifulSoup allows us to get the value of a tag’s attribute by treating the tag like a dictionary. For example, in the code below we get the value of the `id` attribute of the `<h1>` tag by using:

```python
page_h1['id']
```

where `page_h1` is the Tag object that holds the contents of the `<h1>` tag. Let's see how this works in the code below:

In [9]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Access the h1 tag
page_h1 = page_content.body.h1

# Get the value of the id attribute from the h1 tag
h1_id_attr = page_h1['id']

# Print the value of the id attribute
print(h1_id_attr)

intro
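Dictionary-style access has a few details worth knowing. The sketch below is a minimal illustration on a single inline `<h2>` tag (hypothetical, copied from the style of `sample.html`) and assumes the built-in `html.parser`. It shows the `.attrs` dictionary, the fact that `class` is returned as a list, and how `.get()` avoids a `KeyError` when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical single-tag document for illustration
html = '<h2 class="h2style" id="hub">Student Hub</h2>'

# Assumption: the built-in parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
h2 = soup.h2

# Dictionary-style access returns the attribute value
h2_id = h2["id"]
print(h2_id)      # hub

# All attributes are available as a dictionary
h2_attrs = h2.attrs
print(h2_attrs)   # {'class': ['h2style'], 'id': 'hub'}

# class can hold several values, so it is returned as a list
h2_class = h2["class"]
print(h2_class)   # ['h2style']

# .get() returns None for a missing attribute
h2_title = h2.get("title")
print(h2_title)   # None

# Square-bracket access to a missing attribute raises KeyError
try:
    h2["title"]
    raised = False
except KeyError:
    raised = True
print(raised)     # True
```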


# TODO: Get The Website Address

In this exercise, you will get the website address in a hyperlink tag. Hyperlinks are defined by the `<a>` tag. In our `sample.html` document we only have one hyperlink:

```html
<a href="https://knowledge.udacity.com/">Knowledge</a>
```
Hyperlinks are used to link webpages together. The `href` attribute in the `<a>` tag indicates the link's destination, *i.e.* a website address.

In the cell below, open the `sample.html` file and pass the open file handle to the BeautifulSoup constructor, using the `lxml` parser. Save the BeautifulSoup object returned by the constructor in a variable called `page_content`. Then access the `<a>` tag from the `page_content` object. Save the Tag object in a variable called `page_hyperlink`. Then get the `href` attribute from the `page_hyperlink` object and save it into a variable called `href_attr`. Finally, print the `href_attr` variable.

In [None]:
# Import BeautifulSoup


# Open the HTML file and create a BeautifulSoup Object

page_content = 

# Access the a tag
page_hyperlink = 

# Get the href attribute from the a tag
href_attr = 

# Print the href attribute


# Finding All Tags

Let's take a look at our `sample.html` file:

In [11]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
    
print(page_content.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <title>
   AI For Trading
  </title>
  <meta charset="utf-8"/>
  <link href="./teststyle.css" rel="stylesheet"/>
  <style>
   .h2style {background-color: tomato;color: white;padding: 10px;}
  </style>
 </head>
 <body>
  <h1 id="intro">
   Get Help From Peers and Mentors
  </h1>
  <div class="section">
   <h2 class="h2style" id="hub">
    Student Hub
   </h2>
   <p>
    Student Hub is our real time collaboration platform where you can work with peers and mentors. You will find Community rooms with other students and alumni.
   </p>
  </div>
  <hr/>
  <div class="section">
   <h2 class="h2style" id="know">
    Knowledge
   </h2>
   <p>
    Search or ask questions in
    <a href="https://knowledge.udacity.com/">
     Knowledge
    </a>
   </p>
  </div>
  <div class="outro">
   <h3 id="know">
    Good Luck
   </h3>
   <p>
    Good luck and we hope you enjoy the course
   </p>
  </div>
 </body>
</html>


We can see that our `sample.html` file has two `<h2>` tags:

```html
<h2 class="h2style" id="hub">Student Hub</h2>
```

and

```html
<h2 class="h2style" id="know">Knowledge</h2>
```

So let's try to access these `<h2>` tags as we did before: 

In [12]:
# Get h2 tag
page_content.body.h2

<h2 class="h2style" id="hub">Student Hub</h2>

As we can see, we only get the first `<h2>` tag and not both. This is because, when we access a tag as an attribute, we only get the first occurrence of that tag inside the object we access it from (here, the first `<h2>` inside `<body>`).

In order to get all the `<h2>` tags we need to use the `.find_all()` search method, which is the topic of our next lesson.
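As a brief preview, the difference between attribute access and `.find_all()` can be sketched as follows. The inline HTML is a hypothetical, trimmed-down version of the two `<h2>` sections in `sample.html`, and the built-in `html.parser` is assumed instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <h2> tags, modeled on sample.html
html = (
    '<body><div class="section"><h2 id="hub">Student Hub</h2></div>'
    '<div class="section"><h2 id="know">Knowledge</h2></div></body>'
)

# Assumption: the built-in parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

# Attribute access stops at the first matching tag
first_h2 = soup.body.h2
print(first_h2["id"])  # hub

# find_all() returns every matching tag, in document order
all_h2 = soup.body.find_all("h2")
ids = [tag["id"] for tag in all_h2]
print(ids)             # ['hub', 'know']
```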

# Solution

[Solution notebook](navigating_the_parse_tree_solution.ipynb)