# Parsing an HTML File

To pull data from an HTML or XML file, we first pass the file to the BeautifulSoup constructor, which returns a BeautifulSoup object. The constructor, `BeautifulSoup(file, 'parser')`, parses the given `file` using the given `parser`. We can pass `file` either as a string containing the markup itself (not a file name) or as an open filehandle. The `parser` argument is a string that names the parser we want to use.
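As a minimal sketch of the two input forms, the snippet below parses the same markup once from a string and once from a file-like object. The inline `html_text` and the variable names are made up for illustration, `io.StringIO` stands in for an open filehandle, and Python's built-in `'html.parser'` is used here only so the sketch runs without lxml installed.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical inline markup, used here instead of a file on disk.
html_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Form 1: pass the markup itself as a string.
soup_from_string = BeautifulSoup(html_text, "html.parser")

# Form 2: pass a file-like object; BeautifulSoup reads its contents.
# io.StringIO stands in for an open filehandle in this sketch.
soup_from_handle = BeautifulSoup(io.StringIO(html_text), "html.parser")

# Both forms should produce the same parsed document.
print(soup_from_string.title.string)
print(str(soup_from_string) == str(soup_from_handle))
```

Either form works with any installed parser; only the second argument changes.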

For illustration purposes, we will use a simple HTML file, `sample.html`, that contains information taken from the *Wikipedia* page on Natural Language Processing. We will use lxml's HTML parser, which we select by setting the `parser` argument to `lxml`.

In the code below, we start by opening our `sample.html` file and then pass the open filehandle `f` as the first parameter to the `BeautifulSoup()` constructor. We set the second parameter to `lxml` to indicate that we want to use lxml's HTML parser.

In [3]:
from bs4 import BeautifulSoup

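# Open the sample file and hand the filehandle to BeautifulSoup
with open('./sample.html', encoding='ISO-8859-1') as f:
    # Parse the file with lxml's HTML parser
    page_content = BeautifulSoup(f, 'lxml')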

`page_content` is a BeautifulSoup object that contains our parsed HTML file as a nested data structure. Let's print our BeautifulSoup object to see what it looks like:

In [4]:
print(page_content)

<!DOCTYPE html>
<html class="client-nojs" lang="en">
<head>
<title>Natural language processing - Wikipedia</title>
<meta charset="utf-8"/>
</head>
<body>
<h1 class="firstHeading" id="firstHeading">Natural language processing</h1>
<hr/>
<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div>
<div class="hatnote navigation-not-searchable" role="note">This article is about language processing by computers. For the processing of language by the human brain, see <a href="/wiki/Language_processing_in_the_brain" title="Language processing in the brain">Language processing in the brain</a>.</div>
<p><b>Natural language processing</b> (<b>NLP</b>) is a subfield of <a href="/wiki/Computer_science" title="Computer science">computer science</a>, <a href="/wiki/Information_engineering_(field)" title="Information engineering (field)">information engineering</a>, and <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned wit

As we can see, `page_content` holds the entire contents of our `sample.html` file. Notice that when we print the BeautifulSoup object directly, the output is not formatted and is very hard to read.

Luckily, the BeautifulSoup object has a `.prettify()` method that returns our parsed HTML as a string with each tag on its own line and nicely indented, which makes it easier to read and to identify tags:

In [5]:
print(page_content.prettify())

<!DOCTYPE html>
<html class="client-nojs" lang="en">
 <head>
  <title>
   Natural language processing - Wikipedia
  </title>
  <meta charset="utf-8"/>
 </head>
 <body>
  <h1 class="firstHeading" id="firstHeading">
   Natural language processing
  </h1>
  <hr/>
  <div class="noprint" id="siteSub">
   From Wikipedia, the free encyclopedia
  </div>
  <div class="hatnote navigation-not-searchable" role="note">
   This article is about language processing by computers. For the processing of language by the human brain, see
   <a href="/wiki/Language_processing_in_the_brain" title="Language processing in the brain">
    Language processing in the brain
   </a>
   .
  </div>
  <p>
   <b>
    Natural language processing
   </b>
   (
   <b>
    NLP
   </b>
   ) is a subfield of
   <a href="/wiki/Computer_science" title="Computer science">
    computer science
   </a>
   ,
   <a href="/wiki/Information_engineering_(field)" title="Information engineering (field)">
    information engineering
   <