# Web Scraping

Since the course's final project is to build a web scraping program, I'd like to study web scraping in depth. A web scraper typically performs the following tasks:

- Retrieving HTML data from a domain name
- Parsing that data for target information
- Storing the target information
- Moving to another web page to repeat the process 

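The four steps above can be sketched as a single loop. This is only a minimal illustration built on the standard library, not a production crawler. `LinkAndTitleParser`, `crawl`, and the `fetch` parameter are hypothetical names introduced here; `fetch` is assumed to be any function that takes a URL and returns that page's HTML as a string, so the loop can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Visit up to max_pages pages breadth-first and return {url: title}."""
    results = {}
    queue = deque([start_url])
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in results:
            continue  # already visited
        parser = LinkAndTitleParser()
        parser.feed(fetch(url))                # steps 1 and 2: retrieve and parse
        results[url] = parser.title.strip()    # step 3: store the target information
        for href in parser.links:              # step 4: queue the linked pages
            queue.append(urljoin(url, href))
    return results
```

In real use, `fetch` could be a small wrapper around `urlopen` that decodes the response body to a string.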
## urlopen 

To scrape the web effectively, a programmer may want to use several modules such as requests, selenium, and Beautiful Soup. However, I'd like to start with the basics. The *urlopen* function from the standard library's urllib.request module opens a URL and returns a response object whose contents can be read.

In [6]:
from urllib.request import urlopen
web_link = 'https://openweathermap.org/'
html = urlopen(web_link)
print(html.read(50)) # Print only the first 50 bytes

b"<!DOCTYPE html>\n<html lang='en'>\n    <head>\n      "

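`read()` returns raw bytes, which is why the output above starts with `b`. Here is a small sketch of turning the response into text. It assumes the server declares its character set in the `Content-Type` header and falls back to UTF-8 otherwise; `fetch_text` is a hypothetical helper name, not part of urllib.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download a page and return its body as a str instead of bytes."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        # Charset declared by the server, e.g. "utf-8"; None if it is absent.
        charset = response.headers.get_content_charset() or "utf-8"
        # errors="replace" keeps going if a few bytes do not match the charset.
        return response.read().decode(charset, errors="replace")
```

For example, `print(fetch_text('https://openweathermap.org/')[:50])` would print the first 50 characters of the page rather than the first 50 bytes.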

## Beautiful Soup

A useful library for parsing HTML is Beautiful Soup. Since it is not part of the Python standard library, it has to be installed with pip.

$ pip install beautifulsoup4

To import it, use the following statement.

from bs4 import BeautifulSoup

The beautifulsoup4 library provides useful tools for parsing HTML documents. For example, a tag can be accessed as an attribute of a BeautifulSoup object, which returns the first tag with that name, as shown below.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

web_link = 'https://openweathermap.org/'
html = urlopen(web_link)
bsObj = BeautifulSoup(html.read(), 'html.parser') # Name the parser explicitly to avoid a warning
print(bsObj.div)

<div class="mini-navbar mini-navbar-dark">
<div class="container">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 hidden-xs">
<a class="first-child" href="//openweathermap.force.com/" onclick="ga('send','event','link','click','supp');" target="_blank">
<i class="fa fa-envelope"> </i> <span class="hidden-xs">Support Center</span></a>
<a class="pull-right" href="/home/sign_up" onclick="_gaq.push(['_trackEvent', 'Navbar', 'Main', 'register']);"><i class="fa fa-arrow-circle-down"></i> Sign Up</a>
<a class="pull-right" href="/home/sign_in" onclick="_gaq.push(['_trackEvent', 'Navbar', 'Main', 'signin']);"><i class="fa fa-sign-in"></i> Sign In</a>
<a class="pull-right" href="#" id="nav-search"><i class="fa fa-search" onclick="_gaq.push(['_trackEvent', 'Navbar', 'Main', 'search']);"></i> Weather in your city</a>
<a class="pull-right hidden" href="#" id="nav-search-close"><i class="fa fa-times"></i></a>
<!-- Search Form -->
<form action="/find" class="pull-right hidden" id="nav-search

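Attribute access such as `bsObj.div` returns only the first matching tag. As a small sketch on an inline HTML string (the markup below is made up for illustration and is not taken from openweathermap.org), `find_all` collects every match and `get_text` returns the text without the markup.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample_html = """
<html><body>
  <div class="nav">
    <a href="/home/sign_up">Sign Up</a>
    <a href="/home/sign_in">Sign In</a>
  </div>
  <div class="content"><h1>Weather</h1></div>
</body></html>
"""

# Use the parser that ships with Python.
soup = BeautifulSoup(sample_html, "html.parser")

first_div = soup.div                           # first <div> only, same as soup.find("div")
all_divs = soup.find_all("div")                # list of every <div>
content = soup.find("div", class_="content")   # filter by CSS class

# Pair each link's visible text with its href attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(first_div["class"])
print(len(all_divs))
print(links)
print(content.h1.get_text())
```

Working on an inline string like this is a convenient way to experiment with selectors before pointing the code at a live page.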
## Handling Connection Error

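A website can be down or unreachable. Two main errors can occur when opening a web page:

- HTTPError: the server was reached but returned an error status, for example because the page was not found
- URLError: the server could not be reached, for example because the domain does not exist

The urlopen call should be wrapped in a try/except statement that catches both errors. Because HTTPError is a subclass of URLError, it must be caught first.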

In [5]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

web_link = 'https://unknownwebsiteorsomething.com'
try:
    html = urlopen(web_link)
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    print(bsObj.div)
except HTTPError as e:
    print("The page could not be found!")
except URLError as e:
    print("The server could not be found!")
    

The server could not be found!

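Both exception types carry details worth reporting. In the minimal sketch below, `describe_error` and `try_open` are hypothetical helper names. `HTTPError` exposes the status code as `code`, and `URLError` exposes the underlying cause as `reason`. Because `HTTPError` is a subclass of `URLError`, it is checked first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_error(error):
    """Turn a urllib exception into a short, readable message."""
    # HTTPError is a subclass of URLError, so test for it first.
    if isinstance(error, HTTPError):
        return f"The server answered with HTTP status {error.code}"
    if isinstance(error, URLError):
        return f"The server could not be reached: {error.reason}"
    return f"Unexpected error: {error}"


def try_open(url):
    """Return the response body as bytes, or None after printing what went wrong."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except URLError as error:  # also catches HTTPError
        print(describe_error(error))
        return None
```

Printing the status code or the reason makes it easier to tell a missing page from a mistyped domain or a network outage.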

## Get Title of Website

The code below returns the first h1 heading of the page, used here as its title, or None if anything goes wrong. This could be useful when recording the titles of scraped pages.

In [6]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

web_link = 'https://unknownwebsiteorsomething.com'

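def get_title(url):
    # Return the first <h1> tag of the page, or None if anything goes wrong.
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The page or the server could not be reached.
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError:
        # The document has no <body>, so bsObj.body is None.
        return None
    return title

title = get_title(web_link)
if title is None: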
    print("The title of the page could not be found")
else:
    print(title)
    

The title of the page could not be found

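`bsObj.body.h1` is a `Tag` object, so printing it shows the surrounding `<h1>` markup as well. The sketch below is a variant that returns plain text and falls back to the `<title>` element when no `<h1>` exists. `extract_title` is a hypothetical helper, and the fallback is an added assumption rather than part of the code above. It takes an HTML string so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the first <h1> text, else the <title> text, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag_name in ("h1", "title"):
        # find() returns None instead of raising when nothing matches.
        tag = soup.find(tag_name)
        if tag is not None:
            text = tag.get_text(strip=True)
            if text:
                return text
    return None
```

Keeping the parsing step separate from the download step also means the function can be checked against small hand-written HTML snippets.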


