# Web scraping basics

Before we can scrape anything, we need to get access to a web page. Web pages are transferred using HTTP (HyperText Transfer Protocol), which means that to get a document from a web server, you send a GET request to that server, by default on port 80 (or port 443 for HTTPS). Done by hand, you would have to open a socket (a channel where information can be sent back and forth), connect it, receive the response as bytes and decode those bytes into Unicode text, which is what Python strings use. Fortunately, Python's standard library includes a package that does all of this for us: **urllib**.
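
For comparison, the manual steps described above might look roughly like the sketch below. It is only an illustration, not how urllib is implemented: the function names `build_get_request`, `split_response` and `fetch` are made up for this example, it speaks plain HTTP/1.0 on port 80 and it does not handle HTTPS or redirects.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split the raw reply into (header text, body bytes)."""
    header, _, body = raw.partition(b"\r\n\r\n")
    return header.decode("iso-8859-1"), body


def fetch(host, path="/", port=80):
    """Send the request over a plain TCP socket and return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `split_response(fetch('data.pr4e.org', '/cover3.jpg'))` should then give you the response headers as text and the body as bytes, assuming that server is reachable over plain HTTP.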

The `urllib.request` module opens the connection and retrieves the web page, while the `.decode()` method converts the received bytes into text (UTF-8 by default).

In [2]:
#First import the library and request module
import urllib.request
#Establish a 'filehandle' to the web site
fhand = urllib.request.urlopen('https://en.wikipedia.org/wiki/Hubba_Bubba') 
#Go through the web site line by line, decode it, remove extra white space and print it
for line in fhand:
    print(line.decode().strip())


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Hubba Bubba - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XpVS9ApAMMQAA@bgZmIAAACI","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Hubba_Bubba","wgTitle":"Hubba Bubba","wgCurRevisionId":946145509,"wgRevisionId":946145509,"wgArticleId":792557,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Products introduced in 1979","Chewing gum","Wrigley Company brands"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":

In [8]:
#This is a demonstration of how you can download a picture from the web (your hard drive will be the lucky owner of a picture of the cover of the book Python for Everybody).
#It will only work if you run the notebook from your local computer, and it will save the file in the same directory as your notebook.
import urllib.request
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
#Open the file in binary write mode; the with block closes it automatically
with open('Pythonbook.jpg', 'wb') as fhand:
    fhand.write(img)

Now you can get access to an HTML page. Great. The next step is to parse it. If you are looking for something very specific, you can use regular expressions to search for it. In this example, the program looks for all absolute links (those starting with http or https) in the web page:

In [10]:
# Importing the urllib library with three modules, the regular expressions library and the ssl library to access secure http websites. 
import urllib.request, urllib.parse, urllib.error
import re
import ssl

In [13]:
# Code to tell your program to ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#Open and read the website
url = "https://en.wikipedia.org/wiki/Hubba_Bubba"
html = urllib.request.urlopen(url, context=ctx).read() 
#Find all links and print them
links = re.findall(b'href="(http[s]?://.*?)"', html) 
for link in links:
    print(link.decode())

https://en.wikipedia.org/wiki/Hubba_Bubba
http://www.wrigley.com/global/brands/hubba-bubba.aspx
https://web.archive.org/web/20120321172750/http://www.oldtimecandy.com/hubba-bubba-gum.htm
http://www.oldtimecandy.com/hubba-bubba-gum.htm
https://www.youtube.com/watch?v=r270ZGet0ck
http://www.wrigley.com/uk/brands/hubba-bubba.aspx
https://en.wikipedia.org/w/index.php?title=Template:Mars,_Incorporated&amp;action=edit
https://en.wikipedia.org/w/index.php?title=Hubba_Bubba&amp;oldid=946145509
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en
https://www.wikidata.org/wiki/Special:EntityPage/Q1632941
https://da.wikipedia.org/wiki/Hubba_Bubba
https://de.wikipedia.org/wiki/Hubba_Bubba
https://et.wikipedia.org/wiki/Hubba_Bubba
https://es.wikipedia.org/wiki/Hubba_Bubba
https://fr.wikipedia.org/wiki/Hubba_Bubba
https://no.wikipedia.org/wiki/Hubba_Bubba
https://nn.wikipedia.org/wiki/Hubba_Bubba
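
Some of the links printed above still contain the HTML entity `&amp;`, because the regular expression returns the text exactly as it appears in the page source. A minimal sketch of one way to clean this up is shown below, using `html.unescape` from the standard library. The `sample` bytes and the helper name `find_absolute_links` are assumptions made for this illustration; in the notebook you would pass in the `html` variable instead.

```python
import html
import re

# Hypothetical sample standing in for the bytes returned by urlopen(url).read()
sample = (
    b'<a href="https://en.wikipedia.org/w/index.php?title=Hubba_Bubba&amp;oldid=946145509">old</a>'
    b'<a href="/wiki/Chewing_gum">relative link</a>'
    b'<a href="http://www.oldtimecandy.com/hubba-bubba-gum.htm">candy</a>'
)


def find_absolute_links(page_bytes):
    """Return the absolute http(s) links with HTML entities decoded."""
    raw_links = re.findall(rb'href="(https?://.*?)"', page_bytes)
    return [html.unescape(link.decode()) for link in raw_links]


links = find_absolute_links(sample)
for link in links:
    print(link)
```

Note that the relative link is not matched at all, since the pattern only accepts values that start with http or https.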


## Beautiful Soup
If you need more than a link or a specific word or number, you will probably want to parse the HTML with a proper parser. One of these is BeautifulSoup, which you can download here: https://pypi.python.org/pypi/beautifulsoup4

In [15]:
#If needed, install it with pip (the %pip magic installs into the environment your notebook is running in):
%pip install beautifulsoup4

In [None]:
#Import beautifulsoup
from bs4 import BeautifulSoup

In [26]:
#Now, we continue with our wiki-page saved in the variable html above and parse it with beautifulsoup:
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a') 
# Just give me the first three, but then print the info on tag, url, content and attributes for each:
firstfew = tags[0:3]
for tag in firstfew:
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents)
    print('Attrs:', tag.attrs)

TAG: <a id="top"></a>
URL: None
Contents: []
Attrs: {'id': 'top'}
TAG: <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
URL: #mw-head
Contents: ['Jump to navigation']
Attrs: {'class': ['mw-jump-link'], 'href': '#mw-head'}
TAG: <a class="mw-jump-link" href="#p-search">Jump to search</a>
URL: #p-search
Contents: ['Jump to search']
Attrs: {'class': ['mw-jump-link'], 'href': '#p-search'}
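
If you only care about the anchors that actually have a link, you can let BeautifulSoup filter them for you. The sketch below assumes BeautifulSoup is installed and uses a small hand-written `snippet` modelled on the output above instead of the live page; `find_all('a', href=True)` keeps only the tags that carry an `href` attribute, and `get_text()` returns the text inside the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the anchor tags printed above
snippet = """
<a id="top"></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#p-search">Jump to search</a>
"""

soup = BeautifulSoup(snippet, 'html.parser')

# href=True skips anchors such as <a id="top"></a> that have no link
pairs = [(tag['href'], tag.get_text(strip=True))
         for tag in soup.find_all('a', href=True)]
for href, text in pairs:
    print(href, '->', text)
```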


## Creating a spider
Let's say you want to look through more than one web page. Below is the code for a simple spider, which finds the links on a web page and adds them to its to-do list of pages to go through. I have borrowed this example from the materials in [Python for Everybody by Charles Severance](https://www.py4e.com/). 

In [None]:
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors for https
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

todo = list()
visited = list()
url = input('Enter - ')
todo.append(url)
count = int(input('How many to retrieve - '))

while len(todo) > 0 and count > 0 :
    print("====== To Retrieve:",count, "Queue Length:", len(todo))
    url = todo.pop()
    count = count - 1

    if (not url.startswith('http')):
        print("Skipping", url)
        continue

    # Skip social media sites
    if 'facebook' in url or 'linkedin' in url:
        continue

    if (url in visited):
        print("Visited", url)
        continue

    print("===== Retrieving ", url)

    try:
        html = urllib.request.urlopen(url, context=ctx).read()
    except Exception as err:
        # Catch ordinary errors only, so Ctrl-C can still stop the spider
        print("*** Error in retrieval:", err)
        continue

    soup = BeautifulSoup(html, 'html.parser')
    visited.append(url)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        newurl = tag.get('href', None)
        if (newurl is not None):
            todo.append(newurl)
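
One limitation of the spider above is that it skips every link that does not start with http, which includes relative links such as `#mw-head` or `/wiki/...`. A possible refinement, sketched below, is to convert each `href` into an absolute URL before adding it to the to-do list. The helper name `normalise_links` and the example links are assumptions made for this illustration and are not part of the original Python for Everybody code.

```python
from urllib.parse import urljoin, urldefrag


def normalise_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs without fragments."""
    result = []
    for href in hrefs:
        if href is None:
            continue
        # urljoin resolves relative links against the page they were found on;
        # urldefrag drops the '#...' part so the same page is not queued twice
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith('http') and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical href values of the kind the spider collects
base = 'https://en.wikipedia.org/wiki/Hubba_Bubba'
hrefs = ['#mw-head', '/wiki/Chewing_gum', None,
         'https://www.wikidata.org/wiki/Special:EntityPage/Q1632941',
         'mailto:someone@example.com']
cleaned = normalise_links(base, hrefs)
for link in cleaned:
    print(link)
```

In the spider, you could collect the `href` values of a page in a list and then extend `todo` with `normalise_links(url, hrefs)` instead of appending each raw value.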
