# Development

This notebook is a scratch space for developing new functions, kept separate purely so that they are easy to run. Once the code works it should be moved to *football_functions* and then deleted from here.

## Block 0: Initial packages and definitions

A block for the definitions that we will probably never change.

In [1]:
# Base packages for running the script
import sys, datetime

# Set the path and proxy in accordance with our OS
if sys.platform == 'linux':
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/3_Personal_projects/football/'
    proxy_settings = None
    
# Relative paths
data_loc = HOME_PATH + 'Data/'
html_loc = data_loc + '01_HTML/'
organ_loc = data_loc + '00_Organisation/'
story_loc = data_loc + '02_Stories/'

# Get today's date for various functions
date_today = datetime.datetime.today().strftime('%Y_%m_%d')

In [2]:
# Define a logger for development
from football_functions.generic import default_logger

dev_logger = default_logger.get_logger(data_loc, date_today, 'development')

## Block 1: Function to clean up downloaded data

WILL TEST USING BBC DATA FOR A FEW DAYS

The idea will be to build a post-processing clean up function that will let us delete anything that is duplicated.

Initially this will serve to save space by deleting stuff that is already duplicated, but in future it should serve as a way to delete pulled HTML before ever even looking at it.

Because of the way things work, this function should be executed after EVERY SINGLE data capture stage - but we should also keep a version for a general clean up.

1. First delete sublink HTML that is identical to something previously pulled. This is the first step because an identical URL can return different results depending on the date (e.g. football/teampages/Arsenal) - but if the HTML is the same, then all of the stories on that page are the same (as, by definition, the suburl is where we look for headlines). Once this is done, none of the contained headlines will be searched, saving a lot.

2. Secondly, delete stories that are identical. This is self-explanatory and will save time, but above all it saves stored HTML. Stories could actually be updated at a later date - will ignore this case for now. This step involves looking at the story pickles (as that is where we save to directly from sublinks), deleting based on that info, and then deleting the corresponding HTML as well.

Two things being identical is determined by:

* Checking whether or not the end file name is the same, as there should be nothing making us save the same article under different file names, beyond the location.

* Checking whether or not the actual HTML/content is identical - this is important as having the same file name does not guarantee anything. 

When we delete files, we will always keep the oldest one. 
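The two checks above, plus the "keep the oldest" rule, could be sketched roughly as follows. This is only a minimal illustration that compares raw bytes with the standard library's `filecmp`; it is not the source-specific comparison used later in this notebook, and the helper names `is_duplicate` and `remove_if_duplicate` are made up for the example.

```python
import filecmp
import os


def is_duplicate(new_path, old_path):
    """Return True when both files share a file name and have byte-identical content."""
    same_name = os.path.basename(new_path) == os.path.basename(old_path)
    # shallow=False forces a byte-by-byte comparison instead of trusting os.stat
    return same_name and filecmp.cmp(new_path, old_path, shallow=False)


def remove_if_duplicate(new_path, old_path):
    """Delete the newer file if it duplicates the older one; report whether we did."""
    if is_duplicate(new_path, old_path):
        os.remove(new_path)  # the oldest copy is always the one we keep
        return True
    return False
```

A byte-level check is the strictest possible test, so it will miss pages that differ only in surrounding markup.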

We will also keep a record of those files that have caused deletions in the past (for both stages), and do an initial check against those (this saves us from looking through all of our dates for a given file type before deleting). We will keep this record as a CSV with columns for the file name, its path and whether or not it is a pickle.

NEED TO BE CAREFUL AND UPDATE LOG IF WE HAVE AN OLDER FILE
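A possible way to handle the log is sketched below, assuming the three columns used later in this notebook (`File_name`, `File_path`, `Is_pickle`). The helpers `load_deletion_log` and `add_log_entry` are hypothetical names for illustration; they start an empty log when the CSV does not exist yet and avoid recording the same file twice.

```python
import os

import pandas as pd

LOG_COLUMNS = ['File_name', 'File_path', 'Is_pickle']


def load_deletion_log(log_path):
    """Read the deletion log, or start an empty one if the CSV does not exist yet."""
    if os.path.exists(log_path):
        return pd.read_csv(log_path)
    return pd.DataFrame(columns=LOG_COLUMNS)


def add_log_entry(deletion_log, file_name, file_path, is_pickle):
    """Return a new log with one extra row, skipping rows that are already recorded."""
    new_row = pd.DataFrame([[file_name, file_path, is_pickle]], columns=LOG_COLUMNS)
    if deletion_log.empty:
        return new_row
    combined = pd.concat([deletion_log, new_row], ignore_index=True)
    return combined.drop_duplicates(subset=LOG_COLUMNS, ignore_index=True)
```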

In [3]:
import os, pickle, pandas as pd

# Log file where we will save the full path of files that previously caused deletions - keep this as FILE_NAME / PATH / TYPE OF DELETION
log_file = os.path.join(organ_loc, 'file_deletion.csv')

## Block 2: Function definitions

Will define a function to check whether the content found at two paths is the same.

The current checking process isn't reliable - I think the only way to do it properly will be to actually retrieve the info first and then process it.

Considering that screening the HTML is not that heavy, we could simply repeat the process for anything that appears to be a duplicate. In the long run, once everything is implemented in a chain, this would not even be an issue, as we could just check each file and keep it only if it is not a duplicate.

In [4]:
import re

# Source-specific HTML processing, one module per domain
from football_functions.source_specific.bbc import process_html as bbc
from football_functions.source_specific.dailymail import process_html as dailymail
from football_functions.source_specific.guardian import process_html as guardian
from football_functions.source_specific.mirror import process_html as mirror
from football_functions.source_specific.skysports import process_html as skysports
from football_functions.source_specific.telegraph import process_html as telegraph

def check_html(check_1, check_2, domain, is_sublinks, logger):
    '''
    Function to check if 2 HTMLs are the same by comparing what we extract from them:
    the headline information for sublinks, or the story text for stories
    '''
    
    # If looking at sublinks then pull the links
    if is_sublinks:
        # Now do source specific stuff
        if domain == 'bbc':
            modifier = 'football_teams' in check_1
            articles_info_1 = bbc.extract_headlines(check_1, modifier, logger)
            articles_info_2 = bbc.extract_headlines(check_2, modifier, logger)

        elif domain == 'dailymail':
            modifier = 'football_index' not in check_1
            articles_info_1 = dailymail.extract_headlines(check_1, modifier, logger)
            articles_info_2 = dailymail.extract_headlines(check_2, modifier, logger)

        elif domain == 'theguardian':
            articles_info_1 = guardian.extract_headlines(check_1, logger)
            articles_info_2 = guardian.extract_headlines(check_2, logger)

        elif domain == 'mirror':
            articles_info_1 = mirror.extract_headlines(check_1, logger)
            articles_info_2 = mirror.extract_headlines(check_2, logger)

        elif domain == 'skysports':
            modifier = 'regional' in check_1
            articles_info_1 = skysports.extract_headlines(check_1, modifier, logger)
            articles_info_2 = skysports.extract_headlines(check_2, modifier, logger)

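        elif domain == 'telegraph':
            articles_info_1 = telegraph.extract_headlines(check_1, logger)
            articles_info_2 = telegraph.extract_headlines(check_2, logger)

        else:
            # Unknown source - fail loudly rather than hit an undefined variable below
            raise ValueError('Unsupported domain: {}'.format(domain))
        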
        # Compare on the title and link of the article
        return articles_info_1 == articles_info_2
        #return articles_info_1['article_title'] == articles_info_2['article_title'] and articles_info_1['article_link'] == articles_info_2['article_link']
    
    else:
        # Else we are looking at stories - will just get the actual text
        if domain == 'bbc':
            story_details_1 = bbc.get_text(check_1, logger)[0]
            story_details_2 = bbc.get_text(check_2, logger)[0]

        elif domain == 'dailymail':
            story_details_1 = dailymail.get_text(check_1, logger)[0]
            story_details_2 = dailymail.get_text(check_2, logger)[0]

        elif domain == 'theguardian':
            story_details_1 = guardian.get_text(check_1, logger)[0]
            story_details_2 = guardian.get_text(check_2, logger)[0]

        elif domain == 'mirror':
            story_details_1 = mirror.get_text(check_1, logger)[0]
            story_details_2 = mirror.get_text(check_2, logger)[0]

        elif domain == 'skysports':
            story_details_1 = skysports.get_text(check_1, logger)[0]
            story_details_2 = skysports.get_text(check_2, logger)[0]

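        elif domain == 'telegraph':
            story_details_1 = telegraph.get_text(check_1, logger)[0]
            story_details_2 = telegraph.get_text(check_2, logger)[0]

        else:
            # Unknown source - fail loudly rather than hit an undefined variable below
            raise ValueError('Unsupported domain: {}'.format(domain))
        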
        return story_details_1 == story_details_2
        

def check_same(f_name, path_1, path_2, is_pickle, domain, logger):
    '''
    A function to check if the content found at path_1 and path_2 is the same - note that we 
    already assume the file names are the same, because we only call it in those circumstances
    If we are looking at pickles, need to set is_pickle to True so that we do a pickle load
    '''
    check_1 = os.path.join(path_1, f_name)
    check_2 = os.path.join(path_2, f_name)
    
    if is_pickle:
        with open(check_1, 'rb') as content_1, open(check_2, 'rb') as content_2:
            return pickle.load(content_1) == pickle.load(content_2)
    else:
        is_sublinks = 'sublinks' in check_1
        return check_html(check_1, check_2, domain, is_sublinks, logger)

def try_candidates(deletion_candidates, is_pickle, logger, domain, deletion_log = None):
    '''
    A function to try the candidates to see if they have the same content - and then delete or not accordingly
    Will return a frame with the deleted files (to remove from our potential files) and the log, which has been updated
    Note that we feed in the deletion log to update it - but if we feed "None", it does not get updated
    '''
    # Get whether or not we should delete each candidate, reading every row by column name
    logger.info('Will check {} files to see if they are the same'.format(deletion_candidates.shape[0]))
    deletion_candidates['Delete'] = [
        check_same(row['Potential_file'], row['Potential_path'], row['File_path'], is_pickle, domain, logger)
        for _, row in deletion_candidates.iterrows()
    ]
    
    # Delete those that we should - always the newer "potential" file, so the oldest copy is kept
    to_delete = deletion_candidates[deletion_candidates['Delete']]
    logger.info('Have found {} files that will be deleted'.format(to_delete.shape[0]))
    for _, row in to_delete.iterrows():
        # Join the path properly, as the potential path may not end in a slash
        os.remove(os.path.join(row['Potential_path'], row['Potential_file']))
    
    # Then we need to update the deletion log
    if deletion_log is not None:
        # Pull the name and path, and add whether or not we are dealing with pickles
        new_entries = to_delete[['File_name', 'File_path']].copy()
        new_entries['Is_pickle'] = is_pickle
        
        # Then append to end (DataFrame.append was removed in pandas 2.0, so concatenate instead)
        deletion_log = pd.concat([deletion_log, new_entries])
    
    # Finally return the ones we deleted and the log that has been updated
    return to_delete, deletion_log
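The long `if`/`elif` chains in `check_html` could perhaps be collapsed into a lookup keyed on the domain. The sketch below shows the idea only: `compare_with_extractor` and `fake_bbc_headlines` are hypothetical stand-ins, not the real `football_functions` calls, whose signatures (the extra `modifier` and `logger` arguments) would need wrapping first.

```python
def compare_with_extractor(path_1, path_2, domain, extractors):
    """Run the domain's extractor on both paths and compare the results."""
    try:
        extract = extractors[domain]
    except KeyError:
        raise ValueError('No extractor registered for domain: {}'.format(domain))
    return extract(path_1) == extract(path_2)


# Hypothetical stand-in for a real per-source headline extractor
def fake_bbc_headlines(path):
    return ['headline from ' + path.split('/')[-1]]


# One entry per supported domain; only a fake one is registered in this sketch
extractors = {'bbc': fake_bbc_headlines}
```

An unknown domain then raises a clear error instead of falling through every branch.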

## Block 3: Process

The process to follow will be:

1. Load file names for a given date (TODAY's date)

2. Compare these file names to our "log" files

3. If not in the "log" files, then look for the file in each one of the previous dates, starting from the earliest, until we find a match

4. For every match found, check that the HTML content is also equal (note that the same function does both of these checks)

5. If a match is found, immediately stop and delete the current file, adding the original file to our log, if not already there

Note that we ONLY compare domain to domain - NO cross domain stuff

Note that we will initially read in the log files, add to them as we go, and then save at the end
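Step 3 relies on walking the earlier date directories from the oldest to the newest. A minimal sketch of that selection is below, assuming directories are named in the `%Y_%m_%d` format used for `date_today`; the helper name `earlier_dates` is made up for the example, and it skips any directory name that does not parse as a date.

```python
import datetime

DATE_FORMAT = '%Y_%m_%d'


def earlier_dates(directory_names, date_today):
    """Return the names that parse as dates strictly before date_today, oldest first."""
    cutoff = datetime.datetime.strptime(date_today, DATE_FORMAT)
    parsed = []
    for name in directory_names:
        try:
            parsed.append((datetime.datetime.strptime(name, DATE_FORMAT), name))
        except ValueError:
            continue  # not a date-named directory, so ignore it
    return [name for date, name in sorted(parsed) if date < cutoff]
```

Sorting on the parsed date rather than the raw string keeps the order right even if a directory name is not zero-padded.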

In [5]:
# Start off by loading in the deletion logs
deletion_log = pd.read_csv(log_file)

# Date for testing
date_today = '2018_01_28'

In [6]:
deletion_log.head()

Unnamed: 0,File_name,File_path,Is_pickle


We may be running into an issue where, even though the HTML is the "same", pulling it on different dates gives different metadata / adverts etc.

The pickle part seems to be going OK.
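One way to make the HTML comparison less sensitive to that kind of noise would be to compare only the visible text. The sketch below uses the standard library's `html.parser` to drop `<script>` and `<style>` content and collapse whitespace. It is an assumption-laden illustration rather than what `check_html` does, and it would not remove adverts that are rendered as ordinary text.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text, with runs of whitespace collapsed
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(' '.join(data.split()))


def visible_text(html):
    """Return the visible text of an HTML string as one normalised line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)
```

Two pulls of the same page could then be compared with `visible_text(html_1) == visible_text(html_2)`.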

In [None]:
import datetime, os, pandas as pd

def delete_duplicates(search_loc, is_pickle, date_today, past_deletions, logger):
    '''
    Function to look at currently saved files and delete them if they are duplicates of something already seen
    This will be carried out after saving HTML and getting headlines - focusing on not having to pull more HTML than necessary
    search_loc is the key argument and should point directly to suburls, story_links, or pickle stories
    past_deletions should be the path to the CSV log of past deletions
    '''
    logger.info('Looking for files in:\n{}'.format(search_loc))
    
    # Search every domain in the directory given to search through
    for domain in os.listdir(search_loc):
        logger.info('Looking at {}'.format(domain))
        
        date_loc = os.path.join(search_loc, domain, date_today)
            
        # The first part of the process is to grab all of the potential files up for deletion
        file_frame = pd.DataFrame({'Potential_file' : os.listdir(date_loc), 'Potential_path' : date_loc})
        
        logger.info('Will be looking at {} files'.format(file_frame.shape[0]))
        
        # Now the process of deletion is to look at the past deletions (if any) and then all dates that came before
        deletion_log = pd.read_csv(past_deletions)
        
        if deletion_log.shape[0] != 0:
            # Look at past deletions to do a quick compare and delete - do this by merging on file names
            deletion_candidates = file_frame.merge(deletion_log[deletion_log['Is_pickle'] == is_pickle], left_on = 'Potential_file', right_on = 'File_name')
            num_log_delete = deletion_candidates.shape[0]
            
            logger.info('Have found {} candidates to delete from our log'.format(num_log_delete))
            
            # So if we have something here, will delete, but won't update our log
            if num_log_delete != 0:
                deleted_candidates, _ = try_candidates(deletion_candidates, is_pickle, logger, domain)
                
                # Once we have deleted, will remove those files from our original list
                file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]
                
                logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))
        
        # Then we will search through previous dates in that domain
        domain_loc = os.path.join(search_loc, domain)
        possible_dates = os.listdir(domain_loc)
        search_dates = sorted([possible_date for possible_date in possible_dates if datetime.datetime.strptime(possible_date, '%Y_%m_%d') < datetime.datetime.strptime(date_today, '%Y_%m_%d')])
        
        logger.info('Have found {} dates to search through'.format(len(search_dates)))
        
        # Now we will loop through the earlier dates, oldest first, and do a similar thing to the deletion log
        for deletion_date in search_dates:
            logger.info('Now searching date {}'.format(deletion_date))
            delete_path = os.path.join(search_loc, domain, deletion_date)
            
            # Since we don't have a deletion log, make a "check" frame
            check_frame = pd.DataFrame({'File_name' : os.listdir(delete_path), 'File_path' : delete_path, 'Is_pickle' : is_pickle})
            
            logger.info('Comparing against {} files'.format(check_frame.shape[0]))
            
            # Then follow the same process and get the candidates
            deletion_candidates = file_frame.merge(check_frame, left_on = 'Potential_file', right_on = 'File_name')
            num_date_delete = deletion_candidates.shape[0]

            # Then delete
            logger.info('Have found {} candidates to delete from {}'.format(num_date_delete, deletion_date))

            # Again we dive in if we have found some candidates, but this time updating the log
            if num_date_delete != 0:
                deleted_candidates, deletion_log = try_candidates(deletion_candidates, is_pickle, logger, domain, deletion_log)

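                # And finally remove like before
                file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]
                logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))

        # Save the updated log so that later domains and later runs can reuse the new entries
        deletion_log.to_csv(past_deletions, index = False)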

In [7]:
for search_loc in [html_loc, story_loc]:
    # So we start off looking in these two locations and will branch slightly depending
    is_pickle = search_loc == story_loc
    
    dev_logger.info('Looking for files in:\n{}'.format(search_loc))
    
    for domain in os.listdir(search_loc):
        dev_logger.info('Looking at {}'.format(domain))
        
        # Declare dateloc first
        date_loc = os.path.join(search_loc, domain, date_today)
        
        # If we are not looking at pickles - we have several places to look (but never look at base urls)
        if is_pickle:
            file_types = ['']
        else:
            file_types = [file_type + '/' for file_type in os.listdir(date_loc) if file_type != 'base_urls'] # Have added the slash here to avoid a double slash later
        
        for file_type in file_types:
            dev_logger.info('Looking at file type: {}'.format(file_type))
            
            # We will only look in our "date today" for our potential files - will use this as a fixed path for deletion
            type_loc = os.path.join(date_loc, file_type)

            # Grab the file names and put into a pandas frame so that we can keep track of it
            file_frame = pd.DataFrame({'Potential_file' : os.listdir(type_loc), 'Potential_path' : type_loc})

            dev_logger.info('Will look at {} files'.format(file_frame.shape[0]))
            ## NOW START PROCESS OF DELETION ##

            # Grab the corresponding entries from our log - note that we should never be comparing data from the same dates here as log should be updated continuously
            deletion_candidates = file_frame.merge(deletion_log[deletion_log['Is_pickle'] == is_pickle], left_on = 'Potential_file', right_on = 'File_name')
            num_log_delete = deletion_candidates.shape[0]

            # Then process these candidates and remove from our original set of files the ones that get deleted
            dev_logger.info('Have found {} candidates to delete from our log'.format(num_log_delete))
            
            # If we have candidates, then try the content but DON'T update the log
            if num_log_delete != 0:
                deleted_candidates, _ = try_candidates(deletion_candidates, is_pickle, dev_logger, domain)

                # And then remove from our file_frame
                file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]

                dev_logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))

            # Now we have to look at previous dates so we search through the directories which are named after the dates
            domain_loc = os.path.join(search_loc, domain)
            directory_dates = os.listdir(domain_loc)
            search_dates = sorted([directory_date for directory_date in directory_dates if datetime.datetime.strptime(directory_date, '%Y_%m_%d') < datetime.datetime.strptime(date_today, '%Y_%m_%d')])

            dev_logger.info('Have found {} dates to search through'.format(len(search_dates)))

            # Then we will loop through them to find potential candidates
            for search_date in search_dates:
                dev_logger.info('Now searching date {}'.format(search_date))
                search_date_loc = os.path.join(domain_loc, search_date, file_type)

                # Since we don't have the log frame - make a check frame
                check_frame = pd.DataFrame({'File_name' : os.listdir(search_date_loc), 'File_path' : search_date_loc, 'Is_pickle' : is_pickle})

                dev_logger.info('Comparing against {} files'.format(check_frame.shape[0]))

                # Then get our candidates
                deletion_candidates = file_frame.merge(check_frame, left_on = 'Potential_file', right_on = 'File_name')
                num_date_delete = deletion_candidates.shape[0]

                # Then delete
                dev_logger.info('Have found {} candidates to delete from {}'.format(num_date_delete, search_date))
                
                # Again we dive in if we have found some candidates, but this time updating the log
                if num_date_delete != 0:
                    deleted_candidates, deletion_log = try_candidates(deletion_candidates, is_pickle, dev_logger, domain, deletion_log)

                    # And finally remove
                    file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]
                    dev_logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))

In [8]:
# Finally save the deletion log
dev_logger.info('Saving deletion log of length {}'.format(deletion_log.shape[0]))
deletion_log.to_csv(log_file, index = False)

In [9]:
deletion_log

Unnamed: 0,File_name,File_path,Is_pickle
1,allabout_afcbournemouth.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
2,allabout_stokecityfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
3,allabout_burnleyfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
7,allabout_huddersfieldtownfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
8,allabout_swanseacityfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
10,allabout_watfordfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
18,allabout_crystalpalacefc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
19,allabout_brightonandhovealbionfc.html,/home/andreas/Desktop/Projects/Football/Data/0...,False
0,sport_football_matchreports_gallery_premierlea...,/home/andreas/Desktop/Projects/Football/Data/0...,False
1,sport_football_transfernews_jurgenkloppinsists...,/home/andreas/Desktop/Projects/Football/Data/0...,False


In [10]:
print(deletion_log['File_name'].iloc[500])
print(deletion_log['File_path'].iloc[500])

sport_football_42695329.html
/home/andreas/Desktop/Projects/Football/Data/01_HTML/bbc/2018_01_27/story_link/
