# Full data extraction

This is the script where the full data extraction is done in one place, without the need to run various scripts. All the functions are taken from the *football_functions* package, and nothing else should need to be imported beyond the base packages used for setup.

These functions share a few key features in how they save and pull HTML: almost all of them need the current date and the proxy to be provided. This makes it easy to deal with a proxy and to use any date of choice (for example, to go back and fix old data).
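
To go back to an older extraction, the date string can be built for any day rather than for today. A minimal sketch, assuming the same `%Y_%m_%d` format that is used for `date_today` in the constants cell; the helper name `format_run_date` is hypothetical and not part of *football_functions*:

```python
import datetime

def format_run_date(day):
    """Format a date the same way as date_today (hypothetical helper)."""
    return day.strftime('%Y_%m_%d')

# Today's run
date_today = format_run_date(datetime.date.today())

# Re-running a past extraction, e.g. 5 March 2018
date_past = format_run_date(datetime.date(2018, 3, 5))
print(date_past)  # 2018_03_05
```

Passing `date_past` instead of `date_today` to the functions would then point them at that day's data, provided it exists on disk.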

Note that they all also ask for paths, so that switching between operating systems is not difficult. To that end, every path is derived from *HOME_PATH*, so only *HOME_PATH* should need to be modified, as long as the file structure is followed.

The important structure is as follows:

/Data/

    00_Organisation/
    01_HTML/
    02_Stories/
    
Within each one you will find the domain name and the date it was extracted on, as well as subfolders indicating different processes (such as *base link* or *sublink*).
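
The top-level layout above can be derived from a single root. A minimal sketch using `pathlib`, run against a temporary directory standing in for *HOME_PATH*; the function name `build_data_paths` is hypothetical, and the notebook itself uses plain string concatenation instead:

```python
import tempfile
from pathlib import Path

def build_data_paths(home_path):
    """Derive the three data folders from one root (hypothetical helper)."""
    data_loc = Path(home_path) / 'Data'
    return {
        'organ_loc': data_loc / '00_Organisation',
        'html_loc': data_loc / '01_HTML',
        'story_loc': data_loc / '02_Stories',
    }

# A temporary directory stands in for HOME_PATH so the sketch runs anywhere
with tempfile.TemporaryDirectory() as home:
    paths = build_data_paths(home)
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    created = sorted(p.name for p in (Path(home) / 'Data').iterdir())

print(created)  # ['00_Organisation', '01_HTML', '02_Stories']
```

Because `pathlib` handles the separators, the same code works on Linux and Windows without editing anything other than the root.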

The process is as follows:

    1) Pull the HTML from the baselinks (depends on generic.pull_html)
    2) Find the sublinks within certain baselink pages, as some of these are, for example, team pages (depends on source_specific)
    3) Scrape all the sublinks found to get our "main pages" for the scraping of headlines (depends on generic.pull_html)
    4) Find the headlines within the scraped sublinks, saving information about the article and any easy extras (like image titles or summaries) in order to better tag later (depends on source_specific)
    5) Scrape the headline URLs found in the previous step in order to get each story's HTML for story extraction (depends on generic.pull_html)
    6) Get the actual story from the HTML and save it along with the other information *NOT COMPLETE*
    
Steps depending on *generic.pull_html* save .HTML files in *Data/01_HTML*. Step 2 saves links in a list under *Data/00_Organisation*, and step 4 saves .pickle files with a dictionary for each story under *Data/02_Stories*.
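
As an illustration of the step 4 output, a story dictionary can be written to and read back from a .pickle file as below. This is only a sketch: the keys (`headline`, `url`, `summary`), the file name, and the example values are assumptions, since the actual fields are decided by the source-specific functions.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical story record; the real keys depend on the source-specific code
story = {
    'headline': 'Example headline',
    'url': 'https://example.com/story',
    'summary': 'Example summary',
}

with tempfile.TemporaryDirectory() as story_dir:
    story_path = Path(story_dir) / 'example_story.pickle'

    # Save the dictionary in binary mode, as pickle requires
    with open(story_path, 'wb') as story_file:
        pickle.dump(story, story_file)

    # Load it back to check the round trip
    with open(story_path, 'rb') as story_file:
        loaded = pickle.load(story_file)

print(loaded == story)  # True
```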

## Initial constants

Some initial constants for setup. Only this cell should need to be modified.

In [1]:
# Base packages for running the script
import datetime
import pickle
import sys

# Set the path and proxy in accordance with our OS
# (startswith also matches 'linux2', as reported by Python 2)
if sys.platform.startswith('linux'):
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/3_Personal_projects/football/'
    proxy_path = HOME_PATH + 'Data/00_Organisation/' + 'proxy.pickle'
    
    with open(proxy_path, 'rb') as proxy_file:
        proxy_settings = pickle.load(proxy_file)
    
# Relative paths
data_loc = HOME_PATH + 'Data/'
html_loc = data_loc + '01_HTML/'
organ_loc = data_loc + '00_Organisation/'
story_loc = data_loc + '02_Stories/'

# Files to find
baseurl_loc = organ_loc + 'news_sources.txt'

# Get today's date for various functions
date_today = datetime.datetime.today().strftime('%Y_%m_%d')

## Importing from football_functions

The personal functions we will use from *football_functions* for each of the processes. They are kept in their own cell because they could cause issues when changes are made.

In [2]:
# Logger
from football_functions.generic import default_logger

# Base URLs
from football_functions.processes.html_extraction import baseurls

# Suburls
from football_functions.processes.html_extraction import suburls

# Headlines
from football_functions.processes.information_extraction import headlines

# Stories
from football_functions.processes.information_extraction import stories

## Declaring the logger

Declare the loggers we will use throughout the process to save logs instead of printing.
There is one logger for each process.

In [3]:
baseline_logger = default_logger.get_logger(data_loc, date_today, 'baseline')
suburl_logger = default_logger.get_logger(data_loc, date_today, 'suburl')
headline_logger = default_logger.get_logger(data_loc, date_today, 'headline')
story_logger = default_logger.get_logger(data_loc, date_today, 'story')
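
The implementation of `default_logger.get_logger` is not shown in this notebook. As a rough sketch of what a per-process file logger could look like with the standard `logging` module, assuming one log file per process and date (the function name `make_process_logger`, the `Logs` folder, and the file naming are all hypothetical):

```python
import logging
import tempfile
from pathlib import Path

def make_process_logger(log_dir, date, process):
    """Return a logger writing to <log_dir>/<date>_<process>.log (hypothetical layout)."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f'{process}_{date}')
    logger.setLevel(logging.INFO)

    # Only attach a handler once, so re-running a cell does not duplicate log lines
    if not logger.handlers:
        handler = logging.FileHandler(log_dir / f'{date}_{process}.log', encoding='utf-8')
        handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger

# A temporary directory stands in for the data folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    log = make_process_logger(Path(tmp) / 'Logs', '2018_03_05', 'baseline')
    log.info('scrape started')

    # Close the handler before reading the file and before the directory is removed
    for handler in list(log.handlers):
        handler.close()
        log.removeHandler(handler)

    log_text = (Path(tmp) / 'Logs' / '2018_03_05_baseline.log').read_text(encoding='utf-8')
```

Keeping one named logger per process means each stage writes to its own file, which makes it easier to see where a run failed.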

## Baseline URL extraction

Get the HTML from the baseline URLs and save it to file.

In [4]:
baseline_errors = baseurls.scrape_urls(baseurl_loc, html_loc, date_today, proxy_settings, baseline_logger)

## Suburl extraction

Get the suburls from the HTML saved in the baseline URL extraction above, then scrape them.

In [5]:
suburls.extract_urls(html_loc, organ_loc, date_today, suburl_logger)

In [6]:
suburl_errors = suburls.scrape_urls(organ_loc, html_loc, date_today, proxy_settings, suburl_logger)

## Headline extraction

Get the headlines from all the HTML pages that we have scraped.

In [4]:
headlines.process_html(html_loc, story_loc, date_today, headline_logger)

## Story extraction

Get the HTML from the headline links found in the headline extraction above, then get the text.

In [None]:
story_errors = stories.process_articles(story_loc, html_loc, date_today, proxy_settings, story_logger)

In [None]:
text_errors = stories.get_articles(story_loc, html_loc, date_today, story_logger)