## What is web scraping?

Web scraping is the automated gathering of data from the Internet.

The programs that do the scraping are commonly called bots.

## Chapter 1: Your first web scraper

### What happens when you send a GET request to a server?

1. Your computer generates a stream of 1 and 0 bits, split into a header (the local router's MAC address as the immediate destination, plus the final destination's IP address) and a body (the request for the server application).

2. The local router interprets the bits as a packet, stamps its own IP address on the packet as the "From" address, and sends it off.

3. The packet passes through intermediary servers on its way to the destination.

4. The destination server receives the packet.

5. The server reads the packet's destination port and passes it to the appropriate application.

6. The web server application receives a stream of data from the server processor.

7. The application bundles the requested information into a new packet to send back.
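The steps above can be partly observed on a single machine. The following is a minimal sketch, not a picture of how a real request crosses routers: it starts a throwaway web server on the loopback address and sends it a hand-written GET request over a raw socket, so the request text and the response the application bundles up are both visible. `HelloHandler` and `fetch_raw` are hypothetical names introduced for illustration, and the MAC and IP headers from steps 1 and 2 are added by the operating system and do not appear in the code.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the web server application (steps 6 and 7)."""

    def do_GET(self):
        body = b"<html><h1>Hello</h1></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output free of access-log lines


def fetch_raw(host, port, path="/"):
    """Send a hand-written GET request and return the raw response bytes."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Port 0 asks the operating system for any free port (step 5: the port
# decides which application receives the data).
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

response = fetch_raw("127.0.0.1", port)
head, _, body = response.partition(b"\r\n\r\n")
status_line = head.split(b"\r\n")[0]
print(status_line)
print(body)

server.shutdown()
server.server_close()
```

The blank line (`\r\n\r\n`) separates the response headers from the body, which is why `partition` is enough to split them here.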

### How is this done in Python?

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
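The `b'...'` prefix in the output shows that `html.read()` returns a bytes object, not a string. The sketch below shows how such bytes could be turned into text. `raw` is a shortened stand-in for the real response, and UTF-8 is an assumption; a real page may declare a different encoding.

```python
# Shortened stand-in for the bytes returned by html.read()
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is encoded as UTF-8
text = raw.decode("utf-8")

print(type(raw).__name__)   # the undecoded response
print(type(text).__name__)  # the decoded text
print(text)                 # newlines are now real line breaks
```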


### Create a virtual environment

`virtualenv scrapingEnv`

`cd scrapingEnv`

`source bin/activate`

`deactivate`

A virtual environment keeps all libraries separated by project, and it makes it easy to zip up the entire environment folder.
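The same kind of environment can also be created from Python itself. This is a minimal sketch that swaps in the standard-library `venv` module in place of the `virtualenv` command used above. The temporary folder is a hypothetical location chosen so the example cleans up after itself.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: a throwaway folder that is deleted afterwards
with tempfile.TemporaryDirectory() as project_dir:
    env_dir = Path(project_dir) / "scrapingEnv"

    # with_pip=False keeps creation fast; use with_pip=True to install pip too
    venv.create(env_dir, with_pip=False)

    # Every environment created by venv gets a pyvenv.cfg file at its root
    config_created = (env_dir / "pyvenv.cfg").is_file()

print(config_created)
```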

### Use the BeautifulSoup library

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)

<h1>An Interesting Title</h1>
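`bsObj.h1` finds the first `h1` tag anywhere in the document, so the full path to the tag is optional. The sketch below parses an inline HTML string, a shortened stand-in for the page above, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for page1.html
html = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>"""

bsObj = BeautifulSoup(html, "html.parser")

# Both expressions reach the same tag
print(bsObj.h1)
print(bsObj.html.body.h1)

# get_text() strips the tags and returns only the text inside them
print(bsObj.h1.get_text())
print(bsObj.title.get_text())
```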


### Exception handling

`html = urlopen("http://pythonscraping.com/pages/page1.html")`

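Two main things can go wrong in this line:

- The page is not found on the server (or there was an error in retrieving it)

    - The server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error", and `urlopen` raises an `HTTPError`

- The server is not found

    - `urlopen` raises a `URLError` rather than returning `None`

A third problem can appear later, when parsing the page:

- A tag does not exist

    - BeautifulSoup returns a `None` object for the missing tag, and accessing a tag on that `None` object throws an `AttributeError`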
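The two `urlopen` failures can be told apart by exception type. The sketch below raises the exceptions by hand instead of making real requests, so it runs offline. `classify`, `page_missing`, and `server_missing` are hypothetical helpers, and the URL and error messages are made up for illustration.

```python
from urllib.error import HTTPError, URLError


def classify(fetch):
    """Call fetch() and report which kind of failure occurred, if any."""
    try:
        fetch()
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return f"HTTP error {e.code}"
    except URLError as e:
        return f"server not reached: {e.reason}"
    return "ok"


def page_missing():
    # Simulates a server that answers with "404 Not Found"
    raise HTTPError("http://example.com/missing", 404, "Not Found", None, None)


def server_missing():
    # Simulates a host name that cannot be resolved
    raise URLError("name or service not known")


print(classify(page_missing))
print(classify(server_missing))
print(classify(lambda: None))
```

Because `HTTPError` is a subclass of `URLError`, putting the `URLError` clause first would catch both cases and hide the status code.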
    
**Refactor 1**

In [7]:
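from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # The server responded with an error status such as 404 or 500
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("URL is not found")
else:
    bsObj = BeautifulSoup(html.read(), "html.parser")
    try:
        # Raises AttributeError if the html tag itself is missing
        title = bsObj.html.h1
    except AttributeError as e:
        print("Tag was not found")
    else:
        # The html tag exists but contains no h1 tag
        if title is None:
            print("Tag was not found")
        else:
            print(title)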

<h1>An Interesting Title</h1>
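Refactor 1 guards against both an `AttributeError` and a `None` result because a missing tag can show up either way, depending on where in the chain it is missing. The sketch below shows both cases using two made-up HTML snippets. It relies on `html.parser` not adding a missing `html` tag.

```python
from bs4 import BeautifulSoup

no_h1 = BeautifulSoup("<html><body><p>No heading</p></body></html>", "html.parser")
no_html = BeautifulSoup("<p>A bare fragment</p>", "html.parser")

# Case 1: the html tag exists but has no h1, so the lookup returns None
title = no_h1.html.h1
print(title)

# Case 2: the html tag itself is missing, so no_html.html is None and
# asking None for .h1 raises AttributeError
try:
    no_html.html.h1
    raised = False
except AttributeError:
    raised = True
print(raised)
```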


**Refactor 2**

In [9]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # Page not found, server error, or server not reachable
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # bsObj.h1 is None when the page has no h1 tag
        title = bsObj.h1
    except AttributeError as e:
        return None
    return title
title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>
