# Retrieving web pages with **`urllib`**

While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the **urllib** library.

Using **urllib**, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and **urllib** handles all of the HTTP protocol and header details.

The equivalent code to read the https://docs.python.org/3/library/ file from the web using urllib is as follows:

```python
import urllib.request

fhand = urllib.request.urlopen('https://docs.python.org/3/library/')
for line in fhand:
    print(line.decode().strip())
```    

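Once the web page has been opened with **urllib.request.urlopen**, we can treat it like a file and read through it using a **for** loop. Each line arrives as a bytes object, so we call **decode** to turn it into a string before printing it.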

When the program runs, we only see the output of the contents of the file. The headers are still sent, but the **urllib** code consumes the headers and only returns the data to us.
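The headers are not lost: the object returned by **urlopen** keeps them in its `headers` attribute. The following is a minimal sketch of reading them. To keep it runnable without a network connection it opens a `data:` URL, which **urllib** also understands; that URL is an assumption of this example and stands in for a real web address. With a real HTTP response the same `headers` attribute is available, along with a `status` attribute holding the status code.

```python
import urllib.request

# A data: URL keeps this sketch runnable offline; it is a stand-in
# (an assumption of this example) for a real http(s) address.
url = 'data:text/plain;charset=utf-8,Hello%20from%20urllib'

with urllib.request.urlopen(url) as response:
    # urllib has already parsed the headers and stored them on the response
    content_type = response.headers['Content-Type']
    charset = response.headers.get_content_charset()
    # Use the declared character set to turn the bytes into a string
    body = response.read().decode(charset)

print(content_type)
print(charset)
print(body)
```

Reading the character set from the headers is a little more careful than calling **decode** with no argument, which always assumes UTF-8.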


As an example, we can write a program to retrieve the data and compute the frequency of each word in the file as follows:

```python
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('https://docs.python.org/3/library/')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
```
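Printing the whole dictionary for a real page produces a very long line. As a sketch of one way to tidy this up, the counting can be moved into a function and the most frequent words listed with `collections.Counter`. The function name `count_words` and the sample text are hypothetical; an `io.BytesIO` object stands in for the object returned by **urlopen** so the example runs offline.

```python
import io
from collections import Counter

def count_words(lines):
    """Count whitespace-separated words in an iterable of bytes lines."""
    counts = Counter()
    for line in lines:
        # Each line is bytes, exactly as when looping over urlopen(...)
        counts.update(line.decode().split())
    return counts

# Stand-in for the response object (an assumption of this example)
sample = io.BytesIO(b'the quick fox\nthe lazy dog\nthe end\n')
counts = count_words(sample)

# Show only the three most frequent words
for word, n in counts.most_common(3):
    print(word, n)
```

Because `count_words` only needs something it can loop over, the result of `urllib.request.urlopen(...)` can be passed to it directly.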

Running the first program above prints the HTML source of the page. The output is long, so only the beginning is shown here:

```
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>The Python Standard Library &#8212; Python 3.7.4 documentation</title>
<link rel="stylesheet" href="../_static/pydoctheme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />

<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>

<script type="text/javascript" src="../_static/sidebar.js"></script>

<link rel="search" type="application/opensearchdescription+xml"
title="Search within Python 3.7.4 documentation"
href="../_static/opensearch.xml"/>
...
```

# Parsing HTML and scraping the web

One of the common uses of the **urllib** capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.

As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web.

Google also uses the frequency of links from pages it finds to a particular page as one measure of how "important" a page is and how high the page should appear in its search results.
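The idea of following links and counting how often each page is linked to can be sketched on a tiny, made-up "web" held in a dictionary. Everything here is hypothetical: the page names, the `web` dictionary, and the `crawl` function are inventions for this example, and a real search engine is far more elaborate than a simple count of inbound links.

```python
from collections import Counter, deque

# Hypothetical in-memory "web": each page maps to the pages it links to
web = {
    'a.html': ['b.html', 'c.html'],
    'b.html': ['c.html'],
    'c.html': ['a.html', 'b.html'],
    'd.html': ['c.html'],   # no page links here, so the crawl never finds it
}

def crawl(start):
    """Visit every page reachable from start and count inbound links."""
    inbound = Counter()
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            inbound[link] += 1          # one more page points at this link
            if link not in seen:        # only queue pages we have not visited
                seen.add(link)
                queue.append(link)
    return seen, inbound

seen, inbound = crawl('a.html')
print(sorted(seen))
print(inbound.most_common())
```

The `seen` set is what stops the program from retrieving the same page over and over when pages link back to each other.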



--------------
# Parsing HTML using regular expressions

One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern.

Here is a simple web page:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

```python
href="http[s]?://.+?"
```

Our regular expression looks for strings that start with `href="http://` or `href="https://`, followed by one or more characters (**`.+?`**), followed by another double quote. The question mark in **`[s]?`** makes the "s" optional: the string "http" may be followed by zero or one "s".

The question mark added to the **`.+?`** indicates that the match is to be done in a "non-greedy" fashion instead of a "greedy" fashion. A non-greedy match tries to find the smallest possible matching string and a greedy match tries to find the largest possible matching string.
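The difference is easiest to see by running both versions of the pattern on part of the sample page. This is a small sketch: the variable names `greedy` and `non_greedy` are ours, and the parentheses in each pattern mark the part of the match that **findall** returns.

```python
import re

# Part of the sample page shown earlier
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

# Non-greedy: stop at the first double quote after the URL starts
non_greedy = re.findall(r'href="(http[s]?://.+?)"', html_doc)

# Greedy: run on to the last double quote on the same line
greedy = re.findall(r'href="(http[s]?://.+)"', html_doc)

print(non_greedy)
print(greedy[0])
```

The non-greedy pattern returns the three clean URLs. The greedy pattern swallows the `class` and `id` attributes as well, because the last double quote on each line comes after `link1`, `link2`, or `link3`.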

We add parentheses to our regular expression to indicate which part of our matched string we would like to extract, and produce the following program:

```python
# Search for link values within URL input
import urllib.request
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = input('Enter - ')
url = "https://docs.python.org"
# Pass the context so the SSL settings above are actually used
html = urllib.request.urlopen(url, context=ctx).read()
# html is a bytes object, so the pattern must be a bytes literal too
links = re.findall(b'href="(http[s]?://.+?)"', html)
for link in links:
    print(link.decode())
```

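The **ssl** context built at the top of the program turns off certificate verification; when it is passed to **urlopen** as the `context` argument, the program can also read sites whose HTTPS certificates fail validation. This is convenient for exercises but should not be used where security matters. The **read** method returns the HTML source as a single bytes object (**urlopen** itself returns an HTTPResponse object), which is why the pattern is written as a bytes literal. The **findall** regular expression method gives us a list of all of the strings that match our regular expression, returning only the link text between the double quotes.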

When we run the program against https://docs.python.org, we get output like the following:

```
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/
```

Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of "broken" HTML pages out there, a solution only using regular expressions might either miss some valid links or end up with bad data.

This can be solved by using a robust HTML parsing library.
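As a small illustration of the problem, the sketch below runs our regular expression on two links that a browser would accept without complaint: one uses single quotes and the other writes the attribute in capitals. For comparison it uses `html.parser.HTMLParser` from the standard library as a stand-in for a tolerant parser (BeautifulSoup is covered in the next section). The `messy` string and the `LinkCollector` class are hypothetical names made up for this example.

```python
import re
from html.parser import HTMLParser

# Two links that browsers accept but our pattern does not expect
messy = ("<a href='http://example.com/one'>One</a> "
         "<A HREF=\"http://example.com/two\">Two</A>")

class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

regex_links = re.findall(r'href="(http[s]?://.+?)"', messy)

collector = LinkCollector()
collector.feed(messy)
collector.close()

print(regex_links)
print(collector.links)
```

The regular expression finds nothing, while the parser reports both links.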



----------
# Parsing HTML using BeautifulSoup


Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is broken in ways that cause an XML parser to reject the entire page as improperly formed.

There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.

As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. You can download and install the BeautifulSoup code from:

[https://pypi.python.org/pypi/beautifulsoup4](https://pypi.python.org/pypi/beautifulsoup4)

Information on installing BeautifulSoup with the Python Package Index tool **pip** is available at:

[https://packaging.python.org/tutorials/installing-packages/](https://packaging.python.org/tutorials/installing-packages/)

We will use **urllib** to read the page and then use **BeautifulSoup** to extract the **href** attributes from the anchor (**a**) tags.

```python
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
```    
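The **href** values printed by this program are not always complete URLs. Many are relative to the page they came from, and `tag.get('href', None)` returns `None` for an anchor tag that has no **href** at all. A minimal sketch of cleaning such a list with `urllib.parse.urljoin` follows. The function name `absolute_links` and the sample values are hypothetical and stand in for the values the loop above would print.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping missing ones."""
    return [urljoin(base_url, href) for href in hrefs if href]

# Hypothetical values of the kind tag.get('href', None) can return
base = 'https://docs.python.org/3/library/'
hrefs = ['internet.html', '../index.html', None, 'https://www.python.org/']

links = absolute_links(base, hrefs)
for link in links:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged, so the same function can be applied to every value.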

--------------------
# Sample program

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "https://docs.python.org"
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
    break
```    

---------
# Assignment

### Scraping Numbers from HTML using BeautifulSoup

- Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
- Actual data: http://py4e-data.dr-chuck.net/comments_283373.html (Sum ends with 92)

**Data Format**

The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:

```HTML
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
```

You are to find all the `<span>` tags in the file, pull out the number from each tag, and sum the numbers.
    
    
**Sample Code**

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "http://py4e-data.dr-chuck.net/comments_42.html"
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
    break  
``` 

To complete the assignment, adjust this code to look for span tags instead of anchor tags, pull out the text content of each span tag, convert it to an integer, and add the integers up.