# 06 - fetching data from online sources using urllib

# this week's exercise:
read the argus. or, since that is very boring, write a function in python to read the argus for you. the function should be called `fetch_argus_headlines(date)`. this function should:
- accept as its sole input argument a date string (you know how to format it now!).
- verify that the input date is in the past (but not too far in the past, the argus archive is limited).
- if so, fetch all the argus headlines from that date.
- return a list of strings
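
the date check in the list above can be sketched on its own, before any fetching happens. this is only a minimal sketch: the helper name `parse_archive_date`, the `max_age_days` default of five years and the `today` parameter (there so the check can be tried with a fixed date) are all assumptions, and the real limit of the argus archive is not stated here.

```python
import datetime


def parse_archive_date(date_string, max_age_days=5 * 365, today=None):
    """Parse a YYYY-MM-DD string and check that it is recent enough.

    Hypothetical helper: the five-year window is an arbitrary assumption,
    not the documented limit of the argus archive.
    """
    try:
        d = datetime.datetime.strptime(date_string, '%Y-%m-%d').date()
    except ValueError:
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")

    if today is None:
        today = datetime.date.today()
    oldest_allowed = today - datetime.timedelta(days=max_age_days)

    # the date must not be in the future and must not be older than the window
    if not (oldest_allowed <= d <= today):
        raise ValueError("Please choose a date within the allowed range")
    return d
```

calling `parse_archive_date("2018-01-01", today=datetime.date(2020, 1, 1))` returns `datetime.date(2018, 1, 1)`, while a future date or a badly formatted string raises `ValueError`.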

## bonus:
write another function `fetch_argus_article_links(date)` which, instead of the headlines, fetches and returns a list of links to that day's articles.

In [13]:
import requests 
import datetime
from bs4 import BeautifulSoup
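
the title mentions urllib, while the cell above imports requests. as a rough sketch, the same kind of fetch can be done with the standard library only. the names `ARCHIVE_URL`, `build_archive_url` and `fetch_html` are hypothetical, the url pattern is the one used in the functions in this notebook, and the utf-8 fallback is an assumption.

```python
import datetime
import urllib.request

# url pattern taken from the functions in this notebook
ARCHIVE_URL = 'https://www.theargus.co.uk/archive/{year}/{month}/{day}/'


def build_archive_url(d):
    """Build the archive url for a date or datetime object (hypothetical helper)."""
    return ARCHIVE_URL.format(year=d.year, month=d.month, day=d.day)


def fetch_html(url, timeout=10):
    """Download a page with urllib and return its text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # fall back to utf-8 if the server does not declare a charset (assumption)
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


print(build_archive_url(datetime.date(2018, 1, 1)))
# prints https://www.theargus.co.uk/archive/2018/1/1/
```

the text returned by `fetch_html` could then be handed to `BeautifulSoup` in the same way as `page.text`.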

In [116]:
def fetch_argus_headlines(date):
    
    # validate that the date is in the required format (YYYY-MM-DD)
    try:
        d = datetime.datetime.strptime(date, '%Y-%m-%d')
    except ValueError:
        # stop here: printing and carrying on would leave `d` undefined below
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")
    
    # check that the date is within the past 5 years (5 years is an arbitrary choice) and extract headlines
    todays_date = datetime.datetime.today()
    five_years_ago = todays_date - datetime.timedelta(days=(5*365))
    if five_years_ago <= d <= todays_date:
        page = requests.get(f'https://www.theargus.co.uk/archive/{d.year}/{d.month}/{d.day}/', timeout=10)
        parsed_tree = BeautifulSoup(page.text, 'html.parser')
        article_list = parsed_tree.find(class_='archive-list')
        if article_list is None:  # no archive listing found on the page
            return []
        headlines = article_list.find_all('h3')
        headlines_list = [headline.contents[0] for headline in headlines]
    else:
        # raise instead of returning a string, so the function only ever returns a list
        raise ValueError("Please choose a date within the last five years")
        
    return headlines_list

In [118]:
fetch_argus_headlines("2018-01-01")

['Rob Cross wins World Championship final as he pulls plug on The Power',
 'Albion v Bournemouth Analysis: Defending from corners a mounting concern',
 'Murray praises Albion performance - but rues set-piece goals',
 'Albion v Bournemouth: Two points dropped says Hughton',
 'Seagulls twice pegged back. See how Albion drew 2-2 with Bournemouth at the Amex today',
 'Duffy backs promoted trio to stay up',
 'Armed police called to man with knife in Seaford',
 "Albion's central defensive duo Lewis Dunk and Shane Duffy urged to add goals to their game",
 "Distressed woman rescued from water's edge in Shoreham",
 'Business is booming as Albion cash flows into the city',
 'Chris Hughton says keeping Albion in Premier would be as big an achievement as getting them there',
 'Meet the man who could be youngest council member',
 'New search for kinder treatments for cancer',
 'Five generations celebrate former mayor’s 100th',
 'Marshall-Tufflex’s £54k charity total',
 '‘Treatment gave me my life b

In [135]:
def fetch_argus_article_links(date):
    
    # validate that the date is in the required format (YYYY-MM-DD)
    try:
        d = datetime.datetime.strptime(date, '%Y-%m-%d')
    except ValueError:
        # stop here: printing and carrying on would leave `d` undefined below
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")
    
    # check that the date is within the past 5 years (chose 5 years as an arbitrary number) and extract links
    todays_date = datetime.datetime.today()
    five_years_ago = todays_date - datetime.timedelta(days=(5*365))
    if five_years_ago <= d <= todays_date:
        page = requests.get(f'https://www.theargus.co.uk/archive/{d.year}/{d.month}/{d.day}/', timeout=10)
        parsed_tree = BeautifulSoup(page.text, 'html.parser')
        article_list = parsed_tree.find(class_='archive-list')
        if article_list is None:  # no archive listing found on the page
            return []
        links = article_list.find_all('a', href=True)  # skip anchors without an href
        links_list = [link['href'] for link in links]
        links_list = list(set(links_list))  # removes duplicates (a set does not keep page order)
        links_list = ['https://www.theargus.co.uk' + link for link in links_list]  # create the full link
    else:
        # raise instead of returning a string, so the function only ever returns a list
        raise ValueError("Please choose a date within the last five years")
        
    return links_list
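
`list(set(...))` removes duplicates but does not keep the links in the order they appear on the page. a small sketch of an order-preserving alternative is below. the helper name `absolute_unique_links` is hypothetical, and `urljoin` is swapped in for plain string concatenation so that hrefs which are already absolute are left alone.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.theargus.co.uk'


def absolute_unique_links(hrefs, base=BASE_URL):
    """Drop duplicate hrefs, keep first-seen order, and make each link absolute."""
    # dict keys keep insertion order, so this removes duplicates without reordering
    unique_hrefs = dict.fromkeys(hrefs)
    return [urljoin(base, href) for href in unique_hrefs]


example_hrefs = [
    '/news/1.first/?ref=arc',   # made-up hrefs, for illustration only
    '/news/2.second/?ref=arc',
    '/news/1.first/?ref=arc',   # duplicate of the first entry
]
print(absolute_unique_links(example_hrefs))
# prints ['https://www.theargus.co.uk/news/1.first/?ref=arc', 'https://www.theargus.co.uk/news/2.second/?ref=arc']
```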

In [136]:
fetch_argus_article_links("2018-01-01")

['https://www.theargus.co.uk/news/15801209.Duffy_backs_promoted_trio_to_stay_up/?ref=arc',
 'https://www.theargus.co.uk/news/15801146.Armed_police_called_to_man_with_knife_in_Seaford/?ref=arc',
 'https://www.theargus.co.uk/news/15800983.Eileen_Inkpen/?ref=arc',
 'https://www.theargus.co.uk/news/15800895.Please_help_me_force_Government___s_hand_over_road_changes/?ref=arc',
 'https://www.theargus.co.uk/news/15801217.Seagulls_twice_pegged_back__See_how_Albion_drew_2-2_with_Bournemouth_at_the_Amex_today/?ref=arc',
 'https://www.theargus.co.uk/news/15800709.New_search_for_kinder_treatments_for_cancer/?ref=arc',
 'https://www.theargus.co.uk/news/15800646.Shooting_team_ready_to_compete_on_world_stage/?ref=arc',
 'https://www.theargus.co.uk/news/15800887.Gatwick___s_new_link_with_the_Far_East/?ref=arc',
 'https://www.theargus.co.uk/news/15800688.App_proves_a_top_hit_with_mums_and_babies/?ref=arc',
 'https://www.theargus.co.uk/news/15801099.Business_is_booming_as_Albion_cash_flows_into_the_city