# Chapter 1 Your First Web Scraper

## 1 Connecting

A web browser is a useful application for creating packets of information, telling the operating system to send them off, and interpreting the data that comes back as pictures, sounds, videos, and text. A browser can tell the processor to send data to the application that handles your wireless (or wired) interface, but we can do the same thing in Python with just three lines of code.

In [1]:
### get request
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
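The `b'...'` prefix in the output shows that `read()` returns raw bytes rather than a string. A minimal sketch of turning those bytes into readable text is given here; it uses a shortened byte literal in place of a live response, and it assumes the page is encoded as UTF-8, which is not stated in the output itself.

```python
# A shortened stand-in for the bytes returned by html.read() above.
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded. decode() converts bytes to str,
# so the \n escapes become real line breaks when printed.
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```

With a real response, the same call would be `html.read().decode("utf-8")`.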


`urllib` is a standard Python library and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent.

`urlopen` is used to open a remote object across a network and read it.
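As a sketch of the header handling mentioned above, the following builds a `urllib.request.Request` with a custom user agent. The user-agent string is a made-up value for illustration, and nothing is sent over the network until the request is passed to `urlopen`.

```python
from urllib.request import Request

# Hypothetical user-agent string, used only for illustration.
USER_AGENT = "my-first-scraper/0.1"

req = Request(
    "http://pythonscraping.com/pages/page1.html",
    headers={"User-Agent": USER_AGENT},
)

# Nothing has been sent yet; the Request object only describes the request.
print(req.get_method())              # GET, because no request body was supplied
print(req.get_header("User-agent"))  # Request stores header names capitalized this way
```

Calling `urlopen(req)` instead of `urlopen(url)` would then send the request with this header attached.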

## 2 An Introduction to BeautifulSoup

The `BeautifulSoup` library tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

### 2.1 Running BeautifulSoup

The most commonly used object in the `BeautifulSoup` library is the `BeautifulSoup` object. Look at the example below:

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html, "html.parser")
print(bs.h1)

<h1>An Interesting Title</h1>


When we create a `BeautifulSoup` object, two arguments are passed in. The first one is the HTML text the object is based on, and the second specifies the parser that you want `BeautifulSoup` to use in order to create that object. In the majority of cases, it makes no difference which parser you choose. `html.parser` is a parser that is included with Python and requires no extra installation in order to use. Except where required, we will use this parser by default.

Another popular parser is `lxml`. Unlike `html.parser`, it must be installed separately. Its advantage over `html.parser` is that it is generally better at parsing messy or malformed HTML code: it is forgiving and fixes problems like unclosed tags, improperly nested tags, and missing head or body tags. It is also somewhat faster.

A third option is `html5lib`, which also requires a separate installation. Like `lxml`, `html5lib` is an extremely forgiving parser that takes even more initiative in correcting broken HTML, but it is slower than both `lxml` and `html.parser`.
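To see how the parsers differ, one option is to feed the same broken markup to each of them and compare the results. The sketch below is only an illustration: the `BROKEN_HTML` string and the `parse_with` helper are made up for this example, and a parser that is not installed is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: no <html> or <body>, and <p> and <b> are never closed.
BROKEN_HTML = "<h1>Title</h1><p>Unclosed paragraph<b>bold text"


def parse_with(parser_name, markup=BROKEN_HTML):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        # BeautifulSoup raises this when the requested parser is unavailable.
        return None


for name in ("html.parser", "lxml", "html5lib"):
    result = parse_with(name)
    print(f"{name:12} -> {result if result is not None else 'not installed'}")
```

The exact repaired output depends on the parser and library versions, so it is worth running this locally before relying on any one parser's behavior.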

The expression `bs.h1` in the example above returns only the first instance of the `h1` tag found on the web page. By convention, only one `h1` tag should be used on a single page, but conventions are often broken on the web, so be aware that this retrieves the first instance of the tag only, and not necessarily the one you are looking for.
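A small sketch of this behavior, using an inline HTML string with two headings instead of a live page. The markup is invented for the example, and `find_all` is shown only as one way to see every match.

```python
from bs4 import BeautifulSoup

# Made-up markup that breaks the one-h1-per-page convention.
markup = "<html><body><h1>First Heading</h1><h1>Second Heading</h1></body></html>"
soup = BeautifulSoup(markup, "html.parser")

first = soup.h1                      # attribute access: first match only
all_headings = soup.find_all("h1")   # every match, in document order

print(first.get_text())
print([h.get_text() for h in all_headings])
```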

`BeautifulSoup` can use the file object returned by `urlopen` directly, without needing to call `.read()` first. Another useful method of the `BeautifulSoup` object is `.prettify()`, which re-indents the HTML code and makes it easier to read.

In [3]:
print(bs.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



We can see that the `h1` tag we extracted is nested two layers deep in the `BeautifulSoup` object structure (html $\rightarrow$ body $\rightarrow$ h1). However, when you fetch it from the object, you can access the `h1` tag directly.

In [4]:
bs.h1

<h1>An Interesting Title</h1>

In fact, any of the following expressions would produce the same output.

In [5]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [6]:
bs.body.h1

<h1>An Interesting Title</h1>

In [7]:
bs.html.h1

<h1>An Interesting Title</h1>

Virtually any information can be extracted from any HTML file, as long as it has an identifying tag surrounding it or near it.

## 3 Connecting Reliably and Handling Exceptions

Two main things can go wrong when retrieving an HTML file with `urlopen`: the page is not found on the server, or the server is not found.

In the first situation, an `HTTPError` will be raised. The underlying HTTP error may be "404 Page Not Found", "500 Internal Server Error", and so forth. In all of these cases, the `urlopen` function raises the generic exception `HTTPError`. We can handle this exception in the following way:

In [8]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
else:
    print("The HTML file is opened.")

The HTML file is opened.


If an `HTTPError` is raised, the program prints the error and skips the code in the `else` clause.

If the server is not found at all, `urlopen` will raise a `URLError`. Because the remote server is responsible for returning HTTP status codes, no `HTTPError` can be raised when no server was reached, so the more general `URLError` must be caught. `HTTPError` is a subclass of `URLError`, so its `except` clause has to come first. We can add a check:

In [9]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found!")
else:
    print("The HTML file is opened.")

The HTML file is opened.
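Because the example URL is reachable, neither `except` branch runs in the cell above. The following sketch exercises both branches without a network by passing in stand-in functions in place of `urlopen`. The names `describe_fetch`, `fake_ok`, `fake_not_found`, and `fake_no_server` are hypothetical helpers invented for this illustration.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_fetch(url, opener=urlopen):
    """Return a short status string instead of letting the exception propagate."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return f"HTTP error {e.code}"
    except URLError as e:
        # No HTTP response at all, e.g. the host name could not be resolved.
        return f"server unreachable: {e.reason}"
    response.close()
    return "ok"


# Hypothetical stand-ins for urlopen, so each branch can be exercised offline.
def fake_ok(url):
    return io.BytesIO(b"<html></html>")


def fake_not_found(url):
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_no_server(url):
    raise URLError("host not found")


print(issubclass(HTTPError, URLError))  # the reason HTTPError is caught first
for opener in (fake_ok, fake_not_found, fake_no_server):
    print(describe_fetch("http://example.com/", opener))
```

If the two `except` clauses were swapped, the `URLError` clause would also catch every `HTTPError`, and the status code would never be reported.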


Even if the page is retrieved successfully from the server, the content on the page may not be quite what you expected. Every time you access a tag in a `BeautifulSoup` object, it's smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, `BeautifulSoup` will return `None`. The problem is that attempting to access an attribute or a nested tag on that `None` object will raise an `AttributeError`.

In [10]:
### access a non existing object
print(bs.h5)

None


In [11]:
### access a tag on a None object
print(bs.h5.text)

AttributeError: 'NoneType' object has no attribute 'text'

The easiest way to guard against these two situations is to check for both explicitly:

In [12]:
try:
    badContent = bs.h5.text
except AttributeError as e:
    print(e)
else:
    if badContent is None:
        print("Tag was not found.")
    else:
        print(badContent)

'NoneType' object has no attribute 'text'
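An alternative sketch is to test for `None` before touching the tag at all, so that no exception is raised in the first place. The `tag_text` helper and the inline markup are assumptions made for this example, not part of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the example page, so the sketch runs without a network.
markup = "<html><body><h1>An Interesting Title</h1></body></html>"
soup = BeautifulSoup(markup, "html.parser")


def tag_text(soup, tag_name):
    """Return the text of the first matching tag, or None if the tag is absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.get_text()


print(tag_text(soup, "h1"))
print(tag_text(soup, "h5"))
```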


We can reorganize this code into a reusable function, which makes it easier to read and to write:

In [13]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # page not found, server error, or server unreachable
        return None
    try:
        bs = BeautifulSoup(html, "html.parser")
        title = bs.body.h1
    except AttributeError:
        # bs.body was None, so accessing .h1 on it failed
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found.")
else:
    print(title)

<h1>An Interesting Title</h1>
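To see why the function guards `bs.body.h1` with an `AttributeError` handler, the parsing step can be tried on its own against inline markup. This is a sketch under the assumption that `html.parser` is used, which does not add a missing `body` tag. The `title_from_markup` helper is hypothetical.

```python
from bs4 import BeautifulSoup


def title_from_markup(markup):
    """Parsing half of getTitle, separated so it can run without a network."""
    bs = BeautifulSoup(markup, "html.parser")
    try:
        return bs.body.h1
    except AttributeError:
        # bs.body is None when the markup has no <body> tag.
        return None


print(title_from_markup("<html><body><h1>An Interesting Title</h1></body></html>"))
print(title_from_markup("<html><body><p>No heading here</p></body></html>"))
print(title_from_markup("<p>No body tag at all</p>"))
```

The second call returns `None` without any exception, because `bs.body` exists and only the `h1` lookup comes back empty. The third call takes the `except` branch.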
