# Full data extraction

This script runs the full data extraction in one place, without the need to run several separate scripts. All the functions are taken from the *football_functions* package and nothing else should need to be imported.

These functions share a few conventions in the way they save and pull HTML: almost all of them need the current date and a proxy to be provided. This is done so that a proxy can easily be dealt with and any date of choice can be used (for example, to go back and fix old data).

Note that they all also take paths as arguments, so that switching between operating systems should not be difficult. To that end every path is built from *HOME_PATH*, and only *HOME_PATH* should need to be modified, as long as the file structure is followed.

The important structure is as follows:

/Data/

    00_Organisation/
    01_HTML/
    02_Stories/
    
Within each one, data is organised by domain name and then by the date it was extracted on, with subfolders for the different processes where needed (such as *base_urls* or *sublinks*).
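
As a rough illustration of this layout, the sketch below rebuilds the save path of an HTML file from its URL. The function name `html_save_path` and the two naming rules in it are assumptions inferred from the paths printed in the logs further down, not the actual code in *football_functions*.

```python
import re
from urllib.parse import urlparse


def html_save_path(html_loc, url, date, process):
    """Build <html_loc><domain>/<date>/<process>/<file>.html for a URL (hypothetical helper)."""
    parsed = urlparse(url)

    # Assumed rule: drop a leading "www." and keep the first label as the domain name
    host = parsed.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    domain = host.split('.')[0]

    # Assumed rule: strip non-alphanumeric characters from each path segment,
    # then join the segments with underscores
    segments = parsed.path.lstrip('/').split('/')
    file_name = '_'.join(re.sub(r'[^A-Za-z0-9]', '', s) for s in segments) + '.html'

    return html_loc + domain + '/' + date + '/' + process + '/' + file_name


example_path = html_save_path(
    '/tmp/Data/01_HTML/',
    'http://www.bbc.com/sport/football/teams',
    '2018_01_27',
    'base_urls',
)
```

Under these assumptions `example_path` has the same shape as the paths reported in Block 3, with `/tmp/` standing in for *HOME_PATH*.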

The process is as follows:

    1) Pull the HTML from the baselinks (depends on generic.pull_html)
    2) Find the sublinks within certain pages of the baselinks, as some are, for example, team pages (depends on source_specific)
    3) Scrape all the sublinks found to get our "main pages" for the scraping of headlines (depends on generic.pull_html)
    4) Find the headlines within the scraped sublinks, saving information about the article and any easy extras (like image titles or summaries) in order to better tag later (depends on source_specific)
    5) Scrape the headline URL found in the previous step in order to get the story's HTML for story extraction (depends on generic.pull_html)
    6) Get the actual story from the HTML and save it along with the other information *NOT COMPLETE*
    
Steps depending on *generic.pull_html* save .html files under *Data/01_HTML*. Step 2 saves the links as a list under *Data/00_Organisation*, and step 4 saves .pickle files, one dictionary per story, under *Data/02_Stories*.
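
To make the step 4 output more concrete, here is a minimal sketch of saving one story dictionary as a .pickle file and loading it back. The field names, the file name `story_0001` and the helpers `save_story` / `load_story` are all hypothetical; the real dictionaries are defined by the source-specific code.

```python
import os
import pickle
import tempfile

# Hypothetical record: the real field names are set by the source-specific code
story = {
    'headline': 'Example headline',
    'url': 'http://www.example.com/sport/football/example-story',
    'summary': 'Short summary captured alongside the headline',
    'image_title': None,
}


def save_story(story, folder, name):
    """Pickle one story dictionary into folder, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name + '.pickle')
    with open(path, 'wb') as handle:
        pickle.dump(story, handle)
    return path


def load_story(path):
    """Read a story dictionary back from its .pickle file."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Round trip in a temporary folder standing in for Data/02_Stories/<domain>/<date>/
with tempfile.TemporaryDirectory() as tmp:
    saved = save_story(story, os.path.join(tmp, 'example', '2018_01_27'), 'story_0001')
    restored = load_story(saved)
```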

## Block 1: Initial constants

Some initial constants for the setup. Only this block should need to be modified.

In [1]:
# Base functions for running the script
import datetime
import sys

# Set the path and proxy in accordance with our OS
# (startswith also matches 'linux2', which older Python versions report)
if sys.platform.startswith('linux'):
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/03_Personal_projects/football/'
    proxy_settings = None
    
# Relative paths
html_loc = HOME_PATH + 'Data/01_HTML/'
organ_loc = HOME_PATH + 'Data/00_Organisation/'
story_loc = HOME_PATH + 'Data/02_Stories/'

# Files to find
baseurl_loc = organ_loc + 'news_sources.txt'

# Get today's date for various functions
date_today = datetime.datetime.today().strftime('%Y_%m_%d')
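
Since the date is just a string, going back to fix old data should only need a different value in the same `YYYY_MM_DD` format. A small sketch, where `date_string`, `date_fixed` and `date_yesterday` are names made up for illustration:

```python
import datetime


def date_string(day):
    """Format a date the same way as date_today (YYYY_MM_DD)."""
    return day.strftime('%Y_%m_%d')


# A fixed past date, e.g. to re-process the folders created on 27 January 2018
date_fixed = date_string(datetime.date(2018, 1, 27))

# Or a date relative to today
date_yesterday = date_string(datetime.date.today() - datetime.timedelta(days=1))
```

Either value could then be passed in place of `date_today` in the blocks below.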

## Block 2: Importing from football_functions

These are the project's own functions from *football_functions* that we will use for the initial processes. They are imported in a separate block because this step could cause some issues when changes are made to the package.

In [2]:
# Base URLs
from football_functions.processes.html_extraction import baseurls

# Suburls
from football_functions.processes.html_extraction import suburls

# Headlines
from football_functions.processes.information_extraction import headlines

# Stories
from football_functions.processes.information_extraction import stories

## Block 3: Baseline URL extraction

Getting the HTML from the baseline URLs and saving to file

In [3]:
baseline_errors = baseurls.scrape_urls(baseurl_loc, html_loc, date_today, proxy_settings)

Loading URLs from /home/andreas/Desktop/Projects/Football/Data/00_Organisation/news_sources.txt
Successfully pulled HTML from http://www.bbc.com/sport/football/teams
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/base_urls/sport_football_teams.html
Making directory /home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/base_urls/
Successfully saved URL

Successfully pulled HTML from http://www.bbc.com/sport/football/transfers
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/base_urls/sport_football_transfers.html
Successfully saved URL

Successfully pulled HTML from http://www.dailymail.co.uk/sport/premierleague/index.html
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/base_urls/sport_premierleague_indexhtml.html
Making directory /home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/base_urls/
Successfully saved U
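
For reference, the pull-and-save pattern behind this step can be sketched with the standard library alone. This is not the code of *generic.pull_html*: the function names, the proxy dictionary format and the idea of returning an error message (suggested only by the name `baseline_errors`) are assumptions.

```python
import os
import tempfile
import urllib.request


def fetch_with_urllib(url, proxy=None):
    """Download a page; proxy is a dict such as {'http': 'http://host:port'} or None."""
    handlers = [urllib.request.ProxyHandler(proxy)] if proxy else []
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=30) as response:
        return response.read()


def pull_and_save(url, out_path, proxy=None, fetch=fetch_with_urllib):
    """Fetch url and write the HTML to out_path.

    Returns None on success, or the error message if the download failed.
    """
    try:
        html = fetch(url, proxy)
    except Exception as error:  # record the failure instead of stopping the run
        return str(error)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, 'wb') as handle:
        handle.write(html)
    return None


# Offline demonstration with stand-in fetchers, so no network access is needed
def fake_fetch(url, proxy=None):
    return b'<html><body>stub</body></html>'


def failing_fetch(url, proxy=None):
    raise OSError('connection refused')


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'example', '2018_01_27', 'base_urls', 'page.html')
    ok_result = pull_and_save('http://www.example.com/page', target, fetch=fake_fetch)
    file_written = os.path.exists(target)
    bad_result = pull_and_save('http://www.example.com/page', target, fetch=failing_fetch)
```

Passing the fetcher in as an argument keeps the save logic testable without touching the network.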

## Block 4: Suburl extraction

Getting the suburls from the HTML in block 3 and then scraping them

In [4]:
suburls.extract_urls(html_loc, organ_loc, date_today)

Now looking at sources from mirror
Processing the HTML found in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/base_urls/sport_football_.html
Have found 20 links from mirror
Making directory /home/andreas/Desktop/Projects/Football/Data/00_Organisation/mirror/2018_01_27/
Writing links file
Finished writing to file

Now looking at sources from bbc
Processing the HTML found in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/base_urls/sport_football_teams.html
Have found 158 links from bbc
Making directory /home/andreas/Desktop/Projects/Football/Data/00_Organisation/bbc/2018_01_27/
Writing links file
Finished writing to file

Now looking at sources from dailymail
Processing the HTML found in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/base_urls/sport_football_indexhtml.html
Have found 24 links from dailymail
Making directory /home/andreas/Desktop/Projects/Football/Data/00_Organisation/dailymail/2018_01_27/
Writing
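
The rules for picking sublinks live in the source-specific code and are not shown here. As a generic sketch of the idea, the example below collects the links in a page and keeps those containing a given path fragment; `LinkCollector`, `find_sublinks` and the filter string are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:
            # Resolve relative links against the page they were found on
            self.links.append(urljoin(self.base_url, href))


def find_sublinks(html, base_url, must_contain):
    """Return the links containing must_contain, de-duplicated in first-seen order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    kept = []
    for link in collector.links:
        if must_contain in link and link not in kept:
            kept.append(link)
    return kept


sample = '''
<a href="/sport/football/teams/southampton">Southampton</a>
<a href="/sport/football/teams/barnet">Barnet</a>
<a href="/sport/football/teams/southampton">Southampton again</a>
<a href="/news">News</a>
'''
links = find_sublinks(sample, 'http://www.bbc.co.uk/', '/sport/football/teams/')
```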

In [5]:
suburl_errors = suburls.scrape_urls(organ_loc, html_loc, date_today, proxy_settings)

Successfully pulled HTML from http://www.mirror.co.uk/all-about/arsenal-fc
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/sublinks/allabout_arsenalfc.html
Making directory /home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/sublinks/
Successfully saved URL

Successfully pulled HTML from http://www.mirror.co.uk/all-about/afc-bournemouth
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/sublinks/allabout_afcbournemouth.html
Successfully saved URL

Successfully pulled HTML from http://www.mirror.co.uk/all-about/brighton-and-hove-albion-fc
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/sublinks/allabout_brightonandhovealbionfc.html
Successfully saved URL

Successfully pulled HTML from http://www.mirror.co.uk/all-about/burnley-fc
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/sublin

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/southampton
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_southampton.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/stoke-city
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_stokecity.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/swansea-city
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_swanseacity.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/tottenham-hotspur
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_tottenhamhotspur.html
Successfully saved URL


Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/millwall
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_millwall.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/norwich-city
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_norwichcity.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/nottingham-forest
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_nottinghamforest.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/preston-north-end
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_prestonnorthend.html
Successfully sav

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/wigan-athletic
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_wiganathletic.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/accrington-stanley
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_accringtonstanley.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/barnet
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_barnet.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/cambridge-united
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_cambridgeunited.html
Successfully sa

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/eastleigh
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_eastleigh.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/ebbsfleet-united
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_ebbsfleetunited.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/halifax
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_halifax.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/gateshead
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_gateshead.html
Successfully saved URL

Successfully pu

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/queens-park
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_queenspark.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/raith-rovers
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_raithrovers.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/stranraer
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_stranraer.html
Successfully saved URL

Successfully pulled HTML from http://www.bbc.co.uk/sport/football/teams/annan-athletic
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/sublinks/sport_football_teams_annanathletic.html
Successfully saved URL

Success

Successfully pulled HTML from http://www.dailymail.co.uk/sport/teampages/west-ham-united.html
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/sublinks/sport_teampages_westhamunitedhtml.html
Successfully saved URL

Successfully pulled HTML from http://www.dailymail.co.uk/sport/teampages/celtic.html
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/sublinks/sport_teampages_celtichtml.html
Successfully saved URL

Successfully pulled HTML from http://www.dailymail.co.uk/sport/teampages/rangers.html
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/sublinks/sport_teampages_rangershtml.html
Successfully saved URL

Successfully pulled HTML from http://www.dailymail.co.uk/sport/teampages/barcelona.html
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/dailymail/2018_01_27/sublinks/sport_teampages_barcelonahtml.html
Succ

Successfully pulled HTML from https://www.theguardian.com/football/fulham
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_fulham.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/hullcity
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_hullcity.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/ipswichtown
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_ipswichtown.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/leedsunited
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_leedsunited.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/fo

Successfully pulled HTML from https://www.theguardian.com/football/getafe
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_getafe.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/girona
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_girona.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/laspalmas
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_laspalmas.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/legan-s
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_legans.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/levante
Sa

Successfully pulled HTML from https://www.theguardian.com/football/freiburg
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_freiburg.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/hamburg
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_hamburg.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/hannover
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_hannover.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/herthaberlin
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_herthaberlin.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/foot

Successfully pulled HTML from https://www.theguardian.com/football/blackburn
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_blackburn.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/blackpool
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_blackpool.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/bradford
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_bradford.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/bristolrovers
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_bristolrovers.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.

Successfully pulled HTML from https://www.theguardian.com/football/grimsby
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_grimsby.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/lincoln
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_lincoln.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/lutontown
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_lutontown.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/mansfield
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_mansfield.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/m

Successfully pulled HTML from https://www.theguardian.com/football/cowdenbeath
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_cowdenbeath.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/edinburgh-city
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_edinburghcity.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/elgin
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_elgin.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com/football/montrose
Saving HTML from URL in 
/home/andreas/Desktop/Projects/Football/Data/01_HTML/theguardian/2018_01_27/sublinks/football_montrose.html
Successfully saved URL

Successfully pulled HTML from https://www.theguardian.com

## Block 5: Headline extraction

Getting the headlines from all the HTML pages that we have looked at

In [6]:
headlines.process_html(html_loc, story_loc, date_today)

Making directory /home/andreas/Desktop/Projects/Football/Data/02_Stories/mirror/2018_01_27/
Getting content from /home/andreas/Desktop/Projects/Football/Data/01_HTML/mirror/2018_01_27/base_urls/sport_football_.html


AttributeError: module 'football_functions.source_specific.mirror' has no attribute 'extract_mirror'
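
The error suggests that an extractor named `extract_<source>` is looked up on each source module and that the mirror module does not define one yet. How `headlines.process_html` really does this lookup is not shown, so the following is only a sketch of one way to skip sources without an extractor instead of stopping the run; `get_extractor` and `process_sources` are hypothetical names.

```python
import types


def get_extractor(module, source):
    """Return module.extract_<source> if it exists, otherwise None."""
    return getattr(module, 'extract_' + source, None)


def process_sources(modules, html_by_source):
    """Run each source's extractor; return (results, list of skipped sources)."""
    results, skipped = {}, []
    for source, html in html_by_source.items():
        extractor = get_extractor(modules.get(source), source)
        if extractor is None:
            # No extractor yet for this source: note it and carry on
            skipped.append(source)
            continue
        results[source] = extractor(html)
    return results, skipped


# Stand-in modules: bbc has an extractor, mirror deliberately does not
bbc = types.ModuleType('bbc')
bbc.extract_bbc = lambda html: ['headline from ' + html]
mirror = types.ModuleType('mirror')

results, skipped = process_sources(
    {'bbc': bbc, 'mirror': mirror},
    {'bbc': 'bbc page', 'mirror': 'mirror page'},
)
```

With this pattern the skipped list shows which source modules still need an extractor, while the other sources are processed as normal.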

## Block 6: Story extraction

Getting the HTML from the headline links that we found in block 5, then getting the text from it (*COMING SOON*)

In [None]:
story_errors = stories.process_articles(story_loc, html_loc, date_today, proxy_settings)