# Development

This notebook is a scratchpad for developing other functions, kept here only because it is simple to execute. Finished code should be moved to *football_functions* and then deleted from here.

## Block 0: Initial packages and definitions

A block that defines the packages, paths and constants that we will probably never change.

In [2]:
# Base packages for running the script
import sys, datetime

# Set the path and proxy in accordance with our OS
if sys.platform == 'linux':
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/03_Personal_projects/football/'
    proxy_settings = None
    
# Relative paths
# Named data_loc (not data_log) to match the other *_loc paths and the logger call below
data_loc = HOME_PATH + 'Data/'
html_loc = data_loc + '01_HTML/'
organ_loc = data_loc + '00_Organisation/'
story_loc = data_loc + '02_Stories/'

# Get today's date for various functions
date_today = datetime.datetime.today().strftime('%Y_%m_%d')

In [3]:
# Define a logger for development
from football_functions.generic import default_logger

dev_logger = default_logger(data_loc, date_today, 'development')

ModuleNotFoundError: No module named 'logger'

## Function to pull the data we have saved into a pandas DataFrame and save to CSV

This is the first step of the data analysis phase: the initial data pull, where we load all the headlines and/or stories saved on file into a pandas DataFrame. It probably won't be run very often, but it should stay repeatable for when we add stories.

We won't do much initial data processing, but it would be easy to add tags such as the source, the date pulled and the URL of the story. It is also important to convert the encoding properly, so that we are left with reasonably clean text that we don't have to fiddle about with too much.
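
The tagging and encoding clean-up described above could look roughly like the sketch below. It is only an illustration: `clean_text`, `tag_story`, the tag names and the example story are all hypothetical, and the choice of NFKC normalisation is an assumption rather than anything decided in this notebook.

```python
import unicodedata


def clean_text(raw, encoding='utf-8'):
    """Decode bytes if needed and normalise unicode so the CSV stays tidy."""
    if isinstance(raw, bytes):
        # errors='replace' substitutes undecodable bytes instead of raising
        raw = raw.decode(encoding, errors='replace')
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into plain forms
    text = unicodedata.normalize('NFKC', raw)
    # Collapse runs of whitespace, including newlines, into single spaces
    return ' '.join(text.split())


def tag_story(story, source, date_pulled):
    """Return a copy of a story dictionary with cleaned text and provenance tags."""
    tagged = {key: clean_text(value) if isinstance(value, (str, bytes)) else value
              for key, value in story.items()}
    tagged['source'] = source            # hypothetical tag name
    tagged['date_pulled'] = date_pulled  # hypothetical tag name
    return tagged


# Made-up story with a double space, a non-breaking space and a trailing newline
example = tag_story({'headline': b'Late  winner\xc2\xa0at Anfield\n',
                     'url': 'https://example.com/story'},
                    'bbc', '2019_01_01')
```

Cleaning at load time means the CSV only ever contains normalised text, so later analysis steps should not need to repeat it.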

We probably won't save anything to pickle, as it would just eat up too much memory; CSV should be enough.

The general process is straightforward, since turning a dictionary into a DataFrame is easy: select the elements we want and concat the result onto the frame we started with. The one catch is that when every value in the dictionary is a scalar, we MUST pass an index (the story ID) when we declare the DataFrame for the concat.

In [1]:
import os, pickle, pandas as pd

In [8]:
pd.DataFrame({'a' : 4, 'b' : 2}, index = [0])

|   | a | b |
|---|---|---|
| 0 | 4 | 2 |


In [7]:
pd.concat([pd.DataFrame({'a' : [4], 'b' : [2]}), pd.DataFrame({'a' : [4], 'b' : [2]})], axis = 0)

|   | a | b |
|---|---|---|
| 0 | 4 | 2 |
| 0 | 4 | 2 |
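
The concat output above repeats the index value 0, which is why a proper story ID matters. As a minimal sketch, with made-up story dictionaries and a hypothetical ID scheme, one-row frames can be indexed by story ID so that the combined frame stays uniquely keyed:

```python
import pandas as pd

# Hypothetical story dictionaries; the real keys depend on what the pickles hold
story_a = {'headline': 'First headline', 'source': 'bbc'}
story_b = {'headline': 'Second headline', 'source': 'mirror'}

# All values are scalars, so pandas needs an explicit index; use the story ID
frame_a = pd.DataFrame(story_a, index=['bbc_0001'])
frame_b = pd.DataFrame(story_b, index=['mirror_0001'])

stories = pd.concat([frame_a, frame_b], axis=0)
stories.index.name = 'story_id'

# verify_integrity=True makes concat raise if a story ID appears twice
try:
    pd.concat([stories, frame_a], verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Using `verify_integrity=True` is optional, but it is a cheap way to notice that the same story has been loaded twice.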


## Function definitions

It is not yet clear which functions we will need; it may turn out to be easier to write this as a series of loops and nothing more.

## Process

The process to follow will be:

1. Loop over domains / dates

2. Pull the pickles in order

3. Declare the dictionary and add any tags we want, including the story ID; also decide whether we want the story or not

4. Add the dictionary to our data frame

5. Save to CSV
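
The five steps above can be sketched end to end as follows. This is a rough outline under stated assumptions: each pickle is assumed to hold a single flat dictionary, `stories_to_frame` and the story ID format are hypothetical, and the demonstration runs on a throwaway directory rather than the real `story_loc`. Note that it collects rows in a list and builds the frame once, which is a substitution for the frame-by-frame concat described earlier.

```python
import os
import pickle
import tempfile

import pandas as pd


def stories_to_frame(story_loc):
    """Walk story_loc/<domain>/<date>/ and collect every pickled story into one frame."""
    rows = []
    for domain in sorted(os.listdir(story_loc)):
        domain_loc = os.path.join(story_loc, domain)
        for date_pulled in sorted(os.listdir(domain_loc)):
            pickle_loc = os.path.join(domain_loc, date_pulled)
            for file_name in sorted(os.listdir(pickle_loc)):
                with open(os.path.join(pickle_loc, file_name), 'rb') as handle:
                    story = pickle.load(handle)  # assumed to be a flat dictionary
                # Copy the dictionary and add tags, including a story ID
                row = dict(story)
                row['source'] = domain
                row['date_pulled'] = date_pulled
                row['story_id'] = '{}_{}_{}'.format(domain, date_pulled, file_name)
                rows.append(row)
    # Building the frame once from a list of rows avoids repeated concat calls
    return pd.DataFrame(rows).set_index('story_id') if rows else pd.DataFrame()


# Small demonstration on a throwaway directory with one made-up story
with tempfile.TemporaryDirectory() as demo_loc:
    pickle_dir = os.path.join(demo_loc, 'stories', 'bbc', '2019_01_01')
    os.makedirs(pickle_dir)
    with open(os.path.join(pickle_dir, 'story_0.pickle'), 'wb') as handle:
        pickle.dump({'headline': 'Example headline', 'url': 'https://example.com'}, handle)

    demo_frame = stories_to_frame(os.path.join(demo_loc, 'stories'))
    csv_path = os.path.join(demo_loc, 'stories.csv')
    demo_frame.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, index_col='story_id')
```

Sorting the directory listings keeps the load order deterministic, since `os.listdir` returns entries in arbitrary order.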

In [None]:
for domain in os.listdir(story_loc):
    for date_pulled in os.listdir(os.path.join(story_loc, domain)):
        # Where we will be looking for the pickle
        pickle_loc = os.path.join(story_loc, domain, date_pulled)
        
        # Load the pickles in order

In [17]:
for search_loc in [html_loc, story_loc]:
    # So we start off looking in these two locations and will branch slightly depending
    is_pickle = search_loc == story_loc
    
    dev_logger.info('Looking for files in:\n{}'.format(search_loc))
    
    for domain in os.listdir(search_loc):
        dev_logger.info('Looking at {}'.format(domain))
        
        # If we are not looking at pickles - we have several places to look (but never look at base urls)
        if is_pickle:
            file_types = ['']
        else:
            file_types = [file_type + '/' for file_type in os.listdir(search_loc + domain + '/' + date_today + '/') if file_type != 'base_urls'] # Have added the slash here to avoid a double slash later
        
        for file_type in file_types:
            dev_logger.info('Looking at file type: {}'.format(file_type))
            
            # We will only look in our "date today" for our potential files - will use this as a fixed path for deletion
            date_loc = search_loc + domain + '/' + date_today + '/' + file_type

            # Grab the file names and put into a pandas frame so that we can keep track of it
            file_frame = pd.DataFrame({'Potential_file' : os.listdir(date_loc), 'Potential_path' : date_loc})

            dev_logger.info('Will look at {} files'.format(file_frame.shape[0]))
            ## NOW START PROCESS OF DELETION ##

            # Grab the corresponding entries from our log - note that we should never be comparing data from the same dates here as log should be updated continuously
            deletion_candidates = file_frame.merge(deletion_log[deletion_log['Is_pickle'] == is_pickle], left_on = 'Potential_file', right_on = 'File_name')
            num_log_delete = deletion_candidates.shape[0]

            # Then process these candidates and remove from our original set of files the ones that get deleted
            dev_logger.info('Have found {} candidates to delete from our log'.format(num_log_delete))
            
            # If we have candidates, then try the content but DON'T update the log
            if num_log_delete != 0:
                deleted_candidates, _ = try_candidates(deletion_candidates, is_pickle)

                # And then remove from our file_frame
                file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]

                dev_logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))

            # Now we have to look at previous dates so we search through the directories which are named after the dates
            domain_loc = search_loc + domain + '/'
            directory_dates = os.listdir(domain_loc)
            search_dates = sorted([directory_date for directory_date in directory_dates if datetime.datetime.strptime(directory_date, '%Y_%m_%d') < datetime.datetime.strptime(date_today, '%Y_%m_%d')])

            dev_logger.info('Have found {} dates to search through'.format(len(search_dates)))

            # Then we will loop through them to find potential candidates
            for search_date in search_dates:
                dev_logger.info('Now searching date {}'.format(search_date))
                search_date_loc = domain_loc + search_date + '/' + file_type

                # Since we don't have the log frame - make a check frame
                check_frame = pd.DataFrame({'File_name' : os.listdir(search_date_loc), 'File_path' : search_date_loc, 'Is_pickle' : is_pickle})

                dev_logger.info('Comparing against {} files'.format(check_frame.shape[0]))

                # Then get our candidates
                deletion_candidates = file_frame.merge(check_frame, left_on = 'Potential_file', right_on = 'File_name')
                num_log_delete = deletion_candidates.shape[0]

                # Then delete
                dev_logger.info('Have found {} candidates to delete from our log'.format(num_log_delete))
                
                # Again we dive in if we have found some candidates, but this time updating the log
                if num_log_delete != 0:
                    deleted_candidates, deletion_log = try_candidates(deletion_candidates, is_pickle, deletion_log)

                    # And finally remove
                    file_frame = file_frame[~ file_frame['Potential_file'].isin(deleted_candidates['Potential_file'])]
                    dev_logger.info('After deletion now have {} files left'.format(file_frame.shape[0]))

Looking for files in:
/home/andreas/Desktop/Projects/Football/Data/01_HTML/
Looking at mirror
Looking at file type: sublinks/
Will look at 20 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at file type: story_link/
Will look at 400 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at bbc
Looking at file type: sublinks/
Will look at 158 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at file type: story_link/
Will look at 1291 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at dailymail
Looking at file type: sublinks/
Will look at 24 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at file type: story_link/
Will look at 153 files
Have found 0 candidates to delete from our log
Have found 0 dates to search through
Looking at theguardian
Looking 

In [None]:
# Finally save the deletion log
dev_logger.info('Saving deletion log of length {}'.format(deletion_log.shape[0]))
deletion_log.to_csv(log_file, index = False)
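
The two building blocks of the deletion loop, picking out earlier date directories and matching file names with a merge, can be tried in isolation. The sketch below uses made-up file names and a hypothetical helper `earlier_dates`; unlike the loop above, it skips directory names that do not parse as dates instead of raising, which is an assumption about the desired behaviour.

```python
import datetime

import pandas as pd

DATE_FORMAT = '%Y_%m_%d'


def earlier_dates(directory_dates, date_today):
    """Return directory names that parse as dates strictly before date_today, oldest first."""
    cutoff = datetime.datetime.strptime(date_today, DATE_FORMAT)
    found = []
    for name in directory_dates:
        try:
            parsed = datetime.datetime.strptime(name, DATE_FORMAT)
        except ValueError:
            continue  # skip anything that is not a date directory
        if parsed < cutoff:
            found.append((parsed, name))
    # Sort on the parsed date rather than the raw string
    return [name for _, name in sorted(found)]


# Files found today versus files already seen on an earlier date (made-up names)
file_frame = pd.DataFrame({'Potential_file': ['a.html', 'b.html', 'c.html'],
                           'Potential_path': 'today/'})
check_frame = pd.DataFrame({'File_name': ['b.html', 'z.html'],
                            'File_path': 'earlier/',
                            'Is_pickle': False})

# merge defaults to an inner join, so only names present in both frames survive
deletion_candidates = file_frame.merge(check_frame, left_on='Potential_file', right_on='File_name')

# Drop the matched files from the working frame, as the loop does after deletion
remaining = file_frame[~file_frame['Potential_file'].isin(deletion_candidates['Potential_file'])]
```

Because the merge is an inner join, an empty result simply means there is nothing to delete, which matches the `num_log_delete != 0` check in the loop.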