# fetching data from online sources
it is often useful to pull data straight off a web url.

for instance, the uk government makes data available on every uk company above a certain size. the files are published at
`http://download.companieshouse.gov.uk/BasicCompanyData-2018-11-01-part<X>_6.zip`, where `<X>` stands for the numbers 1-6. each file is about 70mb in size.
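because only the part number changes between the six files, the full list of urls can be generated rather than typed out. a minimal sketch, where the names `URL_TEMPLATE` and `company_data_urls` are our own and nothing is downloaded yet:

```python
# template copied from the text above; only the part number changes
URL_TEMPLATE = (
    "http://download.companieshouse.gov.uk/"
    "BasicCompanyData-2018-11-01-part{part}_6.zip"
)

# range(1, 7) yields the part numbers 1 to 6
company_data_urls = [URL_TEMPLATE.format(part=part) for part in range(1, 7)]

for url in company_data_urls:
    print(url)
```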

given its uniform resource locator (url), we can of course collect such data by hand: open a browser, download the data file, and read the file into python for processing. that gets pretty boring quickly. if we need to fetch data from several urls, or if we frequently or periodically need to collect updates from a given url, it is better to automate the process. python is quite powerful when it comes to accessing the internet. today we will see how to use the (classic) `urllib` module (`urllib2` in python 2) and the simpler `requests` module for fetching online data.

`urllib` has a simple interface, and we will apply the `urlopen()` function, which is a web-based sibling of the file-opening function `open()`. `urlopen()` can fetch urls using a variety of protocols.
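to make the family resemblance concrete, the sketch below reads the same temporary file twice: once with `open()` and once with `urlopen()` through a `file://` url, one of the protocols `urlopen()` understands. the file name and its contents are made up for illustration, and no network access is needed.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical sample file, created only for this comparison
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_bytes(b"hello from a local file\n")

    # the familiar way: open() with a path
    with open(sample_path, "rb") as file_handle:
        from_open = file_handle.read()

    # the web-style way: urlopen() with a url (here using the file:// scheme)
    with urllib.request.urlopen(sample_path.as_uri()) as url_handle:
        from_urlopen = url_handle.read()

print(from_open == from_urlopen)
```

both handles offer `read()`, and both return bytes here because the file was opened in binary mode.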

once we have read the data into memory, the next task is to do something with it. this often means parsing it from whatever format it is given in, into a tidy format. how this is done depends of course on the original format. in the case of html, there is a beautiful module that we can use...

# this week's exercise:
read the argus. or, since that is very boring, write a function in python to read the argus for you. the function should be called `fetch_argus_headlines(date)`. this function should:
- accept as its sole input argument a date string (you know how to format it now!).
- verify that the input date is in the past (but not too far in the past, the argus archive is limited).
- if so, fetch all the argus headlines from that date.
- return a list of strings
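the date check and the url construction can be sketched without touching the network. this is only one possible approach, under stated assumptions: the date string is taken to be in `YYYY-MM-DD` form, the helper name `build_archive_url`, the `today` argument (there to make the function testable), and the `max_age_days=365` limit are all hypothetical, and the true depth of the argus archive is not known here.

```python
from datetime import date, datetime, timedelta

ARCHIVE_ROOT = "https://www.theargus.co.uk/archive/"


def build_archive_url(date_string, today=None, max_age_days=365):
    """Validate a 'YYYY-MM-DD' string and return the archive url for that day."""
    if today is None:
        today = date.today()
    # strptime raises ValueError if the string does not match the format
    requested = datetime.strptime(date_string, "%Y-%m-%d").date()
    if requested >= today:
        raise ValueError("the date must be in the past")
    # max_age_days is an assumed placeholder, not the real archive limit
    if today - requested > timedelta(days=max_age_days):
        raise ValueError("the date is too far in the past")
    return ARCHIVE_ROOT + requested.strftime("%Y/%m/%d/")


print(build_archive_url("2018-11-01", today=date(2018, 11, 11)))
```

raising `ValueError` for a bad date is a design choice; returning an empty list instead would also satisfy the exercise description.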

## bonus:
write another function `fetch_argus_article_links(date)` which, instead of the headlines, fetches and returns a list of links to that day's articles.

In [1]:
argus_archive_url = 'https://www.theargus.co.uk/archive/2018/11/01/'
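# the urllib package has a submodule called request, and that submodule defines the urlopen function.
# the submodule has to be imported explicitly; a plain `import urllib` does not reliably load it:
import urllib.request
argus_archive_handle = urllib.request.urlopen(argus_archive_url)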

In [2]:
# in the case of the urllib2 module (python 2 only) the function is in the main module
#argus_archive_handle = urllib2.urlopen(argus_archive_url)

In [3]:
# the urlopen function returns a handle that is equivalent to the file handles we saw in week two.
# just like with the file handle, we can use the read() function to read the entire document into memory
archive_page = argus_archive_handle.read()
print(type(archive_page)) # the output is a bytes object

<class 'bytes'>


In [4]:
# we can use the decode() function to translate bytes to string.
# note: we decode the bytes we already have; a second call to read() would return an empty bytes object
archive_page = archive_page.decode('UTF-8')
print(type(archive_page)) # the output has been converted to string

<class 'str'>


In [5]:
# the requests package makes this simpler (especially if we had to authenticate to see the page, which is not covered here):
import requests
# the requests function get() fetches all relevant information in a single line:
page = requests.get('https://www.theargus.co.uk/archive/2018/11/01/')
# the output is a special response object:
print(type(page))

<class 'requests.models.Response'>


In [6]:
# this response object contains everything we need. 
# we can get the headers:
print("fetched at", page.headers['Date'])
# information about the encoding:
print('the encoding is:', page.encoding)
# the content type
print('type of content', page.headers['Content-Type'])

fetched at Mon, 12 Nov 2018 01:53:22 GMT
the encoding is: UTF-8
type of content text/html; charset=UTF-8
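the `Content-Type` header packs the media type and the character set into a single string. as a rough sketch, the standard library's `email.message.Message` class can split such a value into its parts. the header string is copied from the output above; using `Message` for this is our own choice here, not something `requests` requires.

```python
from email.message import Message

header_value = 'text/html; charset=UTF-8'  # copied from the output above

# a Message object knows how to parse structured header values
message = Message()
message['Content-Type'] = header_value

media_type = message.get_content_type()   # the part before the semicolon
charset = message.get_content_charset()   # the charset parameter, lower-cased

print(media_type, charset)
```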


In [7]:
print('type of page.content', type(page.content))

type of page.content <class 'bytes'>


In [8]:
# we can view the page source as a continuous string:
page.text

'<!DOCTYPE html>\n<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#">\n<head>\n    <title>Archive news from the The Argus</title>\n\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<!-- standard AdvertisingInit --><script>\nwindow.startExec = performance.now();\n</script>\n\n<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.1/jquery.min.js" integrity="sha256-VAvG3sHdS5LqTT+5A/aeq/bZGa/Uj04xKxY8KM/w9EE=" crossorigin="anonymous"></script>\n\n<script type="text/javascript" src="/resources/shared/responsive-sync/?r=HDGkyexc"></script>\n\n\n<!-- standard PianoInit -->\n<script>\nwindow.usePiano = true;\n</script>\n<script>\n\ndocument.cookie = "__adblocker=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/";\nvar setNptTechAdblockerCookie = function(adblocker) {\n  

# now what?
so, if we have a url, we can fetch the linked data. now what can we do with the response? what happens next depends on the details of the file format. 

the response may be a pile of html code, json, xml, csv, ... 
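as an illustration of how the format decides the next step, the sketch below picks a parser based on a content type string. the function name `parse_payload` and the two sample payloads are invented for this example, and html and xml are deliberately left unparsed.

```python
import csv
import io
import json


def parse_payload(text, content_type):
    """Parse response text according to a (simplified) content type."""
    # drop parameters such as '; charset=UTF-8'
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(text)
    if media_type == "text/csv":
        # csv.reader wants a file-like object, so wrap the string
        return list(csv.reader(io.StringIO(text)))
    # anything else (html, xml, ...) is returned unparsed in this sketch
    return text


print(parse_payload('{"name": "argus", "pages": 3}', "application/json; charset=UTF-8"))
print(parse_payload("name,pages\nargus,3\n", "text/csv"))
```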

in the case of the argus, the response was html. let's look into parsing it. 

In [9]:
# the parsing library we will use is called beautiful soup; here it relies on python's built-in 'html.parser'.
from bs4 import BeautifulSoup
parsed_tree = BeautifulSoup(page.text, 'html.parser')

In [10]:
# the output of the BeautifulSoup() function is of special type 'BeautifulSoup'
print(type(parsed_tree))

<class 'bs4.BeautifulSoup'>


## finding the articles:
the `parsed_tree` object lets us search the document for specific html tags.

![source of argus page](argus_page.png)

let us explore the argus page (go through the developer menu to `view source`). 

digging around, you will find that the articles are all contained in an unordered list tag (`<ul>`) with the attribute `class='archive-list'`.

with BeautifulSoup, we can call the `find()` function and specify the class through the `class_` argument:
    

In [11]:
article_list = parsed_tree.find(class_='archive-list')
# the result is a tag in the html tree:
print(type(article_list))

<class 'bs4.element.Tag'>


within this tag, we can see that each article's headline is contained within `<h3>` tags inside `<div class="col-md-9">`. since we want all the tags that match, we now use the `find_all()` function:

In [12]:
headlines = article_list.find_all('h3') # there are no other h3 tags in the article_list... 
# this time the returned object is of special type `ResultSet`.
print(type(headlines))

<class 'bs4.element.ResultSet'>


In [13]:
headlines

[<h3>Beating seventh-best Everton would be a big result for Hughton</h3>,
 <h3>Girl, 14, missing after not turning up to school yesterday</h3>,
 <h3>Pier boss Luke Johnson hits back at critics over Patisserie Valerie rescue plan</h3>,
 <h3>Man killed and child seriously injured after being hit by a car</h3>,
 <h3>Katie Price admits 'things hit rock bottom' as she confirms TV show</h3>,
 <h3>Albion UK recruitment chief joins West Brom</h3>,
 <h3>Life sentence for man who murdered trans woman after sex and drugs binge</h3>,
 <h3>Son who stabbed his mother to death jailed</h3>,
 <h3>More resilience required to keep run going, says Hughton</h3>,
 <h3>Hairdresser who deliberately infected men with HIV loses appeal against jail sentence</h3>,
 <h3>Peter Brackley's final Argus column - and the last stage show goal which remained unrealised</h3>,
 <h3>Young hotshot on Hughton's radar</h3>,
 <h3>Police swoop in after man was seen with gun</h3>,
 <h3>Propper primed for late November comeback for

In [14]:
[headline.contents[0] for headline in headlines]

['Beating seventh-best Everton would be a big result for Hughton',
 'Girl, 14, missing after not turning up to school yesterday',
 'Pier boss Luke Johnson hits back at critics over Patisserie Valerie rescue plan',
 'Man killed and child seriously injured after being hit by a car',
 "Katie Price admits 'things hit rock bottom' as she confirms TV show",
 'Albion UK recruitment chief joins West Brom',
 'Life sentence for man who murdered trans woman after sex and drugs binge',
 'Son who stabbed his mother to death jailed',
 'More resilience required to keep run going, says Hughton',
 'Hairdresser who deliberately infected men with HIV loses appeal against jail sentence',
 "Peter Brackley's final Argus column - and the last stage show goal which remained unrealised",
 "Young hotshot on Hughton's radar",
 'Police swoop in after man was seen with gun',
 'Propper primed for late November comeback for Albion',
 "Russell Bishop trial: Jury hears from top police and scientists about girls' murde

In [20]:
from datetime import date, timedelta
def fetch_argus_headlines(date=date.today()):
    # reads the argus archive and gets the list of headlines from the given date
    root_url = 'https://www.theargus.co.uk/archive/'
    date_string = 

2018-11-11
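the parsing steps from the cells above can be collected into small helper functions that work on any html string, which makes them easy to try without fetching a page. this is a hedged sketch, not the exercise solution: `SAMPLE_HTML` is a made-up stand-in for the archive page, the helper names are our own, and the assumption that each article link is an `<a>` tag inside the `archive-list` element has not been checked against the real argus markup.

```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the archive page; the real markup may differ
SAMPLE_HTML = """
<ul class="archive-list">
  <li><a href="/news/1/"><div class="col-md-9"><h3>First headline</h3></div></a></li>
  <li><a href="/news/2/"><div class="col-md-9"><h3>Second headline</h3></div></a></li>
</ul>
"""


def extract_headlines(html):
    """Return the text of every <h3> inside the archive list."""
    tree = BeautifulSoup(html, "html.parser")
    article_list = tree.find(class_="archive-list")
    if article_list is None:  # no archive list on this page
        return []
    return [h3.get_text(strip=True) for h3 in article_list.find_all("h3")]


def extract_links(html):
    """Return the href of every <a> inside the archive list (assumed layout)."""
    tree = BeautifulSoup(html, "html.parser")
    article_list = tree.find(class_="archive-list")
    if article_list is None:
        return []
    return [a["href"] for a in article_list.find_all("a", href=True)]


print(extract_headlines(SAMPLE_HTML))
print(extract_links(SAMPLE_HTML))
```

`get_text()` is used instead of `contents[0]` so that a headline containing nested tags would still come back as a plain string.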
