# Lookback to old data and fix up

This script cycles through every date found in the data and checks whether it has been fully processed. If not, it goes forward and runs the missing steps.

The parts of the process are:

    1) Extract sublinks

    2) Pull their HTML

    3) Get the headlines

    4) Get the stories

The best way to do this is to:

    1) Go into the HTML folder

    2) Look at a given source

    3) Look at a given date 

    4) Look for base_urls -> convert all to .html (name change)
    
        i) If we don't find sublinks - then spool through HTML to get sublinks

    5) Look for suburls -> convert all to .html

        i) If we don't find the corresponding /Stories/ part, then we need to spool through the suburl HTML for headlines and story links
        ii) If there is no story_link folder, then pull the HTML from those story links
        iii) If we do not find the story text and the other parts generated by the story extraction, run that too
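
The checks in the list above boil down to finding the earliest stage whose output is missing. A minimal sketch of that decision, assuming the folder names used later in this notebook (`sublinks`, `story_link`). The helper name `first_missing_step` and its `needs_sublinks` flag are hypothetical and not part of the package; the final check for story text is passed in as a flag because it requires opening a pickle.

```python
import os


def first_missing_step(html_date_dir, story_date_dir, has_story_text,
                       needs_sublinks=True):
    """Return the step (1-4) to resume from, or None if nothing is missing.

    Hypothetical helper: the folder names mirror the constants defined in
    Block 1, but the function itself is only an illustration.
    """
    # Step 1: sublinks were never extracted and scraped
    if needs_sublinks and not os.path.isdir(os.path.join(html_date_dir, 'sublinks')):
        return 1
    # Step 2: no stories folder exists for this source and date
    if not os.path.isdir(story_date_dir):
        return 2
    # Step 3: the HTML behind the story links was never pulled
    if not os.path.isdir(os.path.join(html_date_dir, 'story_link')):
        return 3
    # Step 4: the story text was never extracted from that HTML
    if not has_story_text:
        return 4
    return None
```

Sources that have no suburls at all would pass `needs_sublinks=False`, so that a missing `sublinks` folder does not send them back to step 1.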

## Block 1: Initial constants

This block defines the basic paths that let us check through the directories. Sources and dates are looped through automatically.

In [1]:
# Base packages for running the script
import sys, os, re, pickle, datetime

# Set the path and proxy in accordance with our OS
if sys.platform == 'linux':
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/3_Personal_projects/football_old/football/'
    proxy_path = HOME_PATH + 'Data/00_Organisation/' + 'proxy.pickle'
    
    with open(proxy_path, 'rb') as proxy_file:
        proxy_settings = pickle.load(proxy_file)
    
# Relative paths
data_loc = HOME_PATH + 'Data/'
html_loc = data_loc + '01_HTML/'
organ_loc = data_loc + '00_Organisation/'
story_loc = data_loc + '02_Stories/'

# End names of directories that we will be checking for
base_urls = 'base_urls'
sub_links = 'sublinks'
story_links = 'story_link'

# Get today's date for our logs
date_today = datetime.datetime.today().strftime('%Y_%m_%d')
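
Because the date folders use the same zero-padded `%Y_%m_%d` format as `date_today`, sorting their names as strings also sorts them chronologically, and `strptime` can be used to tell date folders apart from anything else. A small sketch of that idea; `DATE_FORMAT`, `is_date_folder` and the example folder names are assumptions made for illustration.

```python
import datetime

DATE_FORMAT = '%Y_%m_%d'  # same format string as used for date_today


def is_date_folder(name):
    """Return True if `name` parses as a date in DATE_FORMAT (hypothetical helper)."""
    try:
        datetime.datetime.strptime(name, DATE_FORMAT)
    except ValueError:
        return False
    return True


# Made-up folder names, including one that is not a date
folders = ['2018_01_10', '2018_01_04', 'notes', '2018_01_08']
date_folders = sorted(name for name in folders if is_date_folder(name))
print(date_folders)  # ['2018_01_04', '2018_01_08', '2018_01_10']
```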

In [2]:
# Logger
from football_functions.generic import default_logger

# Base URLs
from football_functions.processes.html_extraction import baseurls

# Suburls
from football_functions.processes.html_extraction import suburls

# Headlines
from football_functions.processes.information_extraction import headlines

# Stories
from football_functions.processes.information_extraction import stories

In [3]:
# Will use this as a master logger for everything instead of separating out
lookback_logger = default_logger.get_logger(data_loc, date_today, 'lookback')

## Block 2: Function definitions

Some basic functions that aren't really worth putting in the package, as they will only be used here. They check whether certain outputs exist, which tells us which part of the full process to call.

convert_html goes into a directory and renames files from one extension to another (.txt to .html).

full_process calls the tail end of the full process given a numeric input, e.g. step 2 skips the first step and runs steps 2 to 4, step 3 runs steps 3 and 4, etc.
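
One way to sketch the "run the process from a given step" idea is to keep the stages in an ordered list and call every stage from a chosen number onwards. The stage functions below are placeholders invented for illustration, not the package functions, and `run_from` is a hypothetical name. The numbering follows the `step <= n` checks in `full_process`, so `run_from(3, ...)` runs stages 3 and 4.

```python
def run_from(step, stages):
    """Run every stage numbered `step` or later and collect the results.

    `stages` is an ordered list of zero-argument callables, numbered from 1.
    Stages that are skipped are reported as None.
    """
    results = {}
    for number, stage in enumerate(stages, start=1):
        results[number] = stage() if number >= step else None
    return results


# Placeholder stages standing in for the real extraction functions
stages = [
    lambda: 'sublinks scraped',
    lambda: 'headlines extracted',
    lambda: 'story HTML pulled',
    lambda: 'story text extracted',
]

print(run_from(3, stages))
```

Keeping the stages in a list means a new stage only has to be added in one place, at the cost of every stage needing the same call signature.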

In [4]:
def convert_html(directory_loc, logger):
    '''
    Function to convert all files in a directory to .html (just changing the name)
    '''
    all_files = os.listdir(directory_loc)
    
    # Only match files that end in .txt, consistent with the regex used for renaming below
    files_to_convert = [txt_file for txt_file in all_files if txt_file.endswith('.txt')]
    
    logger.info('Renaming {} files out of {} found in \n{}'.format(len(files_to_convert), len(all_files), directory_loc))
    
    for file_name in files_to_convert:
        old_path = directory_loc + '/' + file_name
        new_path = re.sub(r'\.txt$', '.html', old_path)
        
        os.rename(old_path, new_path)
    
    logger.info('Finished renaming files')

def full_process(step, date_today, domain, logger):
    '''
    Function that runs the process from the given step (1-4) onwards.
    
    Returns (story_errors, text_errors); an entry is None if its step was skipped.
    '''
    # Defaults in case the corresponding step is never entered
    story_errors = None
    text_errors = None
    
    if step <= 1:
        # Find and scrape suburls (note that we always did this together)
        logger.info('Entering step 1')
        suburls.extract_urls(html_loc, organ_loc, date_today, logger, domain)
        suburl_errors = suburls.scrape_urls(organ_loc, html_loc, date_today, proxy_settings, logger)
        
    if step <= 2:
        # Get headlines from within suburls
        logger.info('Entering step 2')
        headlines.process_html(html_loc, story_loc, date_today, logger, domain)
        
    if step <= 3:
        # Get the html from story links
        logger.info('Entering step 3')
        story_errors = stories.process_articles(story_loc, html_loc, date_today, proxy_settings, logger, domain)
    
    if step <= 4:
        # Get the story from the html
        logger.info('Entering step 4')
        text_errors = stories.get_articles(story_loc, html_loc, date_today, logger, domain)
    
    return story_errors, text_errors
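
The renaming done by `convert_html` can be tried out safely on a throwaway directory. The sketch below is an alternative formulation using `pathlib` and `tempfile` from the standard library, not the function defined above; `rename_txt_to_html` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def rename_txt_to_html(directory):
    """Rename every *.txt file in `directory` to *.html and return the new names."""
    renamed = []
    for txt_path in sorted(Path(directory).glob('*.txt')):
        html_path = txt_path.with_suffix('.html')
        txt_path.rename(html_path)
        renamed.append(html_path.name)
    return renamed


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two files that need renaming and one that is already .html
    for name in ('front_page.txt', 'transfers.txt', 'already_done.html'):
        (Path(tmp_dir) / name).write_text('<html></html>')

    renamed = rename_txt_to_html(tmp_dir)
    remaining = sorted(path.name for path in Path(tmp_dir).iterdir())

print(renamed)    # ['front_page.html', 'transfers.html']
print(remaining)  # ['already_done.html', 'front_page.html', 'transfers.html']
```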

## Block 3: Moving through sources and dates
The basic loop through sources and dates, checking which outputs already exist and running whichever steps are missing.
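
A sketch of the source and date traversal on its own, assuming the layout `<html root>/<source>/<date>/`. The generator name `iter_source_dates` is hypothetical. Unlike a bare `os.listdir`, it skips anything that is not a directory and returns the folders in sorted order.

```python
import os
import tempfile


def iter_source_dates(html_root):
    """Yield (source, date) pairs for each date folder under each source folder."""
    for source in sorted(os.listdir(html_root)):
        source_dir = os.path.join(html_root, source)
        if not os.path.isdir(source_dir):
            continue  # skip stray files sitting next to the source folders
        for data_date in sorted(os.listdir(source_dir)):
            if os.path.isdir(os.path.join(source_dir, data_date)):
                yield source, data_date


with tempfile.TemporaryDirectory() as html_root:
    # Build a tiny example tree with two sources
    for source, data_date in [('bbc', '2018_01_08'), ('bbc', '2018_01_04'),
                              ('mirror', '2018_01_04')]:
        os.makedirs(os.path.join(html_root, source, data_date))
    pairs = list(iter_source_dates(html_root))

print(pairs)
# [('bbc', '2018_01_04'), ('bbc', '2018_01_08'), ('mirror', '2018_01_04')]
```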

In [5]:
# Note that some do not have suburls - so will exclude them, to avoid repeating something we don't want to do
no_suburls = ['skysports', 'telegraph']

In [6]:
# Note that we principally loop the HTML - we will only enter others for checking
for source in os.listdir(html_loc):
    html_source = html_loc + source
    for data_date in os.listdir(html_source):
        print(source)
        print(data_date)
        
        html_date = html_source + '/' + data_date
        
        lookback_logger.info('Working with data found in:\n{}'.format(html_date))
        
        # We won't actually repull the baseurls because that cannot be done retroactively - we only rename .txt to .html
        lookback_logger.info('Converting files...')
        convert_html(html_date + '/' + base_urls, lookback_logger)
        
        if os.path.exists(html_date + '/' + sub_links) or source in no_suburls:
            
            if os.path.exists(html_date + '/' + sub_links):
                lookback_logger.info('Have found sublinks - converting files...')

                # We have already pulled sublinks and have the HTML - so will convert extensions
                convert_html(html_date + '/' + sub_links, lookback_logger)
            
            # Next step is to check if we have the story in the stories part
            # story_loc already ends in '/', so no extra separator is needed
            if os.path.exists(story_loc + source + '/' + data_date):
                lookback_logger.info('Have found stories')
                
                # Then we have already pulled the stories from the HTML found in the sublink
                
                # Next step is to check if we find the stories_link for the HTML
                if os.path.exists(html_date + '/' + story_links):
                    lookback_logger.info('Have found story HTML')
                    
                    # Now finally need to check if we ever actually pulled the story
                    pickle_loc = story_loc + source + '/' + data_date + '/'
                    
                    # Get an example story and check that we have story text
                    possible_pickles = [possible_pickle for possible_pickle in os.listdir(pickle_loc) if 'fake_link' not in possible_pickle]
                    
                    # Guard against a folder with no usable pickles - treat it as having no story text
                    example_story = {}
                    if possible_pickles:
                        example_loc = pickle_loc + possible_pickles[0]
                        with open(example_loc, 'rb') as example_file:
                            example_story = pickle.load(example_file)
                        
                    if 'story_text' in example_story:
                        lookback_logger.info('Found story text - everything has already been done!')
                    else:
                        print('4')
                        errors = full_process(4, data_date, [source], lookback_logger)
                    
                else:
                    # Need to pull the HTML from the story link and get the story
                    print('3')
                    errors = full_process(3, data_date, [source], lookback_logger)
                    
            else:
                # Need to pull the story headlines from the sublink url
                print('2')
                errors = full_process(2, data_date, [source], lookback_logger)
                
        else:
            # So we need to pull the sublinks from the base_urls HTML and then pull their HTML - i.e. the whole process from there
            print('1')
            errors = full_process(1, data_date, [source], lookback_logger)
        
        lookback_logger.info('Have finished with the current date\n\n\n')
        
    # Log some whitespace just to be clear where we have ended
    lookback_logger.info('Have finished with the current source\n\n\n')

bbc
2018_01_04
bbc
2018_01_08
bbc
2018_01_09
bbc
2018_01_10
3
bbc
2018_01_11
3
bbc
2018_01_12
4
bbc
2018_01_14
4
bbc
2018_01_16
4
bbc
2018_01_17
4
bbc
2018_01_18
4
bbc
2018_01_19
4
bbc
2018_01_22
4
bbc
2018_01_23
4
dailymail
2018_01_04
2
dailymail
2018_01_08
2
dailymail
2018_01_09
2
dailymail
2018_01_10
2
dailymail
2018_01_11
3
dailymail
2018_01_12
4
dailymail
2018_01_14
3
dailymail
2018_01_16
4
dailymail
2018_01_17
4
dailymail
2018_01_18
4
dailymail
2018_01_19
4
dailymail
2018_01_22
4
dailymail
2018_01_23
4
mirror
2018_01_04
2
mirror
2018_01_08
2
mirror
2018_01_09
2
mirror
2018_01_10
2
mirror
2018_01_11
3
mirror
2018_01_12
4
mirror
2018_01_14
3
mirror
2018_01_16
4
mirror
2018_01_17
4
mirror
2018_01_18
4
mirror
2018_01_19
4
mirror
2018_01_22
4
mirror
2018_01_23
4
skysports
2018_01_04
2
skysports
2018_01_08
2
skysports
2018_01_09
2
skysports
2018_01_10
2
skysports
2018_01_11
3
skysports
2018_01_12
4
skysports
2018_01_14
3
skysports
2018_01_16
4
skysports
2018_01_17
4
skysports
2018_01_1