# Parsing an HTML File

To parse an HTML or XML document, we pass it to the `BeautifulSoup` constructor. The call `BeautifulSoup(file, 'parser')` parses `file` with the named `parser` and returns a BeautifulSoup object. The `file` argument can be either a string containing the markup or an open filehandle, and `parser` is a string naming the parser we want to use.
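The two input forms can be sketched as follows. This is a minimal illustration rather than part of the lesson's own code: the `html` string is made up, `io.StringIO` stands in for an open filehandle, and Python's built-in `html.parser` is used so the snippet runs without lxml installed.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = "<html><head><title>AI For Trading</title></head><body><p>Hello</p></body></html>"

# Option 1: pass the markup itself as a string
soup_from_string = BeautifulSoup(html, "html.parser")

# Option 2: pass a file-like object (io.StringIO stands in for an open file)
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

# Both calls build the same document tree
print(soup_from_string.title.string)
print(str(soup_from_string) == str(soup_from_file))
```

Note that the string form expects the markup itself, not a path to a file; to parse a file on disk, open it first and pass the filehandle.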

The BeautifulSoup constructor will transform the HTML or XML file into a complex tree of Python objects. One of these objects is the BeautifulSoup object returned by the constructor. The BeautifulSoup object itself represents the document as a whole and can be searched using various methods, as we will see in the following lessons.
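As a rough sketch of what this tree of objects looks like, the snippet below inspects the types involved. The one-line `html` string is a made-up example, and `html.parser` is assumed here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical one-line document used only for illustration
html = "<html><body><p>Good Luck</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The object returned by the constructor stands for the whole document
print(type(soup).__name__)
print(soup.name)  # the root has the special name "[document]"

# Elements inside the tree are Tag objects, and their text is a NavigableString
paragraph = soup.p
print(isinstance(paragraph, Tag))
print(isinstance(paragraph.string, NavigableString))
```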

In the code below, we start by importing the `BeautifulSoup` class from the `bs4` package:

```python
from bs4 import BeautifulSoup
```

We then open the `sample.html` file and pass the open filehandle `f` to the `BeautifulSoup` constructor. We also set the `parser` in the constructor to `lxml` to indicate that we want to use lxml’s HTML parser. The `BeautifulSoup` constructor will return a BeautifulSoup object that we will save in the `page_content` variable. We then print the BeautifulSoup object to see what it looks like.

```python
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print the BeautifulSoup Object
print(page_content)
```

```html
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>AI For Trading</title>
<meta charset="utf-8"/>
<link href="./teststyle.css" rel="stylesheet"/>
<style>.h2style {background-color: tomato;color: white;padding: 10px;}</style>
</head>
<body>
<h1 id="intro">Get Help From Peers and Mentors</h1>
<div class="section">
<h2 class="h2style" id="hub">Student Hub</h2>
<p>Student Hub is our real time collaboration platform where you can work with peers and mentors. You will find Community rooms with other students and alumni.</p>
</div>
<hr/>
<div class="section">
<h2 class="h2style" id="know">Knowledge</h2>
<p>Search or ask questions in <a href="https://knowledge.udacity.com/">Knowledge</a></p>
</div>
<div class="outro">
<h3 id="know">Good Luck</h3>
<p>Good luck and we hope you enjoy the course</p>
</div>
</body>
</html>
```


As we can see, `page_content` holds the entire contents of our `sample.html` file. Notice, however, that printing the BeautifulSoup object directly gives no indentation, so the nesting of the tags is hard to follow.

Luckily, the BeautifulSoup object has a `.prettify()` method that returns the document as a string with each tag on its own line, indented according to its nesting depth. Let's see how this works:

```python
# Print the BeautifulSoup Object with prettify
print(page_content.prettify())
```

```html
<!DOCTYPE html>
<html lang="en-US">
 <head>
  <title>
   AI For Trading
  </title>
  <meta charset="utf-8"/>
  <link href="./teststyle.css" rel="stylesheet"/>
  <style>
   .h2style {background-color: tomato;color: white;padding: 10px;}
  </style>
 </head>
 <body>
  <h1 id="intro">
   Get Help From Peers and Mentors
  </h1>
  <div class="section">
   <h2 class="h2style" id="hub">
    Student Hub
   </h2>
   <p>
    Student Hub is our real time collaboration platform where you can work with peers and mentors. You will find Community rooms with other students and alumni.
   </p>
  </div>
  <hr/>
  <div class="section">
   <h2 class="h2style" id="know">
    Knowledge
   </h2>
   <p>
    Search or ask questions in
    <a href="https://knowledge.udacity.com/">
     Knowledge
    </a>
   </p>
  </div>
  <div class="outro">
   <h3 id="know">
    Good Luck
   </h3>
   <p>
    Good luck and we hope you enjoy the course
   </p>
  </div>
 </body>
</html>
```


As we can see, the `.prettify()` method makes the object easier to read and also allows us to identify tags more readily. In the next lesson, we will see how to access the information contained in each tag.
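To make the difference between the two print calls concrete, here is a small sketch on a made-up snippet. The `html` string is hypothetical, and `html.parser` is assumed so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration
html = "<div><p>Good Luck</p></div>"
soup = BeautifulSoup(html, "html.parser")

# str() keeps the markup compact, while prettify() returns an indented string
compact = str(soup)
pretty = soup.prettify()

print(len(compact.splitlines()))  # the compact form fits on a single line
print(pretty)                     # each tag and text node gets its own line
```

Because `.prettify()` returns an ordinary string, the result can also be written to a file or sliced for a quick preview instead of being printed in full.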