# Development

This script is a scratchpad for developing other functions, kept separate purely for ease of execution. Once the code works it should be moved to *football_functions* and then deleted from here.

## Block 0: Initial packages and definitions

A block to define settings that we will probably never change.

In [1]:
# Base packages for running the script
import sys, datetime

# Set the path and proxy in accordance with our OS
if sys.platform == 'linux':
    HOME_PATH = '/home/andreas/Desktop/Projects/Football/'
    proxy_settings = None
else:
    HOME_PATH = 'c:/Users/amathewl/Desktop/3_Personal_projects/football/'
    proxy_settings = None
    
# Relative paths
data_loc = HOME_PATH + 'Data_work/Data/'
html_loc = data_loc + '01_HTML/'
organ_loc = data_loc + '00_Organisation/'
story_loc = data_loc + '02_Stories/'

# Get today's date for various functions
date_today = datetime.datetime.today().strftime('%Y_%m_%d')

In [2]:
# Define a logger for development
from football_functions.generic import default_logger

dev_logger = default_logger.get_logger(data_loc, date_today, 'development')

## Function to pull the data we have saved into a pandas DataFrame and save to CSV

This is the first step in the data analysis phase and consists only of the initial data pull, where we load all the headlines and/or stories that we have saved on file into a pandas DataFrame. This probably won't be run many times, but it is good to keep it repeatable for when we add stories.

We won't do much initial data processing, but could quite easily add some tags such as the source, the date pulled and the URL of the story. It is also important to convert the encoding properly, so that we are left with something relatively clean that we don't have to fiddle with too much.

I don't think we will save anything to pickle, as it will just eat up too much memory; CSV should be enough.

The general process is straightforward, as transforming a dictionary into a DataFrame is very easy: we just need to select the elements we want and concat the result with the frame we have built up so far. The only slight catch is that we MUST pass an index (the story ID) when we declare the DataFrame for the concat, since pandas requires an index when every value in the dictionary is a scalar.
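
As a minimal sketch of that point (the story values below are made up for illustration and are not taken from the real data): building a frame from a dictionary of scalar values raises a `ValueError` unless an index is passed, and the resulting one-row frames can then be stacked with `pd.concat`.

```python
import pandas as pd

# Hypothetical story dictionaries, for illustration only
story_a = {'Domain': 'bbc', 'Headline': 'Example headline A'}
story_b = {'Domain': 'mirror', 'Headline': 'Example headline B'}

# A dictionary of scalars cannot become a frame without an index
try:
    pd.DataFrame(story_a)
    needs_index = False
except ValueError:
    needs_index = True

# Passing the story ID as a one-element index gives a one-row frame
frame_a = pd.DataFrame(story_a, index=[1])
frame_b = pd.DataFrame(story_b, index=[2])

# Stack the one-row frames; the story IDs are kept as the index
combined = pd.concat([frame_a, frame_b], axis=0)
print(combined)
```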

In [3]:
import os, pickle, pandas as pd

## Function definitions

Not sure yet which functions we will need; it may be easier to do this as a series of loops and leave it at that.

## Process

The process to follow will be:

1. Loop over domains / dates

2. Pull the pickles in order

3. Declare the dictionary and add any tags we want, including the story ID, and decide whether we want the story text or not

4. Add the dictionary to our data frame

5. Save to CSV
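
The steps above could be sketched roughly as follows. This is one possible arrangement, not necessarily the final one: `load_stories`, the `Date_pulled` and `Story_ID` names, and the throwaway demo directory are all assumptions made for illustration. It also swaps in a different technique from repeated concatenation: rows are collected in a list and the frame is built once at the end, which avoids copying the growing frame on every story.

```python
import os
import pickle
import tempfile

import pandas as pd


def load_stories(story_root):
    """Walk story_root/<domain>/<date>/ and return one row per pickle file."""
    rows = []
    for domain in sorted(os.listdir(story_root)):
        domain_dir = os.path.join(story_root, domain)
        for date_pulled in sorted(os.listdir(domain_dir)):
            date_dir = os.path.join(domain_dir, date_pulled)
            for pickle_file in sorted(os.listdir(date_dir)):
                with open(os.path.join(date_dir, pickle_file), 'rb') as f:
                    story_info = pickle.load(f)
                rows.append({
                    'Domain': domain,
                    'Date_pulled': date_pulled,  # hypothetical extra tag
                    # Empty values (such as []) become None
                    'URL': story_info.get('article_link') or None,
                    'Headline': story_info.get('article_title') or None,
                })
    # Build the frame once and number the stories from 1
    frame = pd.DataFrame(rows)
    frame.index = pd.RangeIndex(1, len(frame) + 1, name='Story_ID')
    return frame


# Build a tiny throwaway directory tree to exercise the function
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, 'bbc', '2018_01_01')
os.makedirs(demo_dir)
demo_story = {'article_link': 'http://example.com/story', 'article_title': []}
with open(os.path.join(demo_dir, 'story_0.pickle'), 'wb') as f:
    pickle.dump(demo_story, f)

stories = load_stories(demo_root)
stories.to_csv(os.path.join(demo_root, 'all_stories.csv'), encoding='utf-8')
```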

In [4]:
all_stories = None
story_id = 0

for domain in os.listdir(story_loc):
    print('Looking at {}'.format(domain))
    for date_pulled in os.listdir(os.path.join(story_loc, domain)):
        # Where we will be looking for the pickle
        pickle_loc = os.path.join(story_loc, domain, date_pulled)
        
        # Load the pickles in order (os.listdir gives no guaranteed order, so sort by name)
        for pickle_file in sorted(os.listdir(pickle_loc)):
            with open(os.path.join(pickle_loc, pickle_file), 'rb') as story_file:
                story_info = pickle.load(story_file)
            
            # Have to replace empty lists with None
            for key in story_info:
                if story_info[key] == []:
                    story_info[key] = None
            
            # Get the columns for our DF
            story_id += 1
            to_add = pd.DataFrame({
                'Domain' : domain,
                'URL' : story_info['article_link'],
                'Headline' : story_info['article_title'],
                'Summary' : story_info['article_summary'],
                'Image' : story_info['article_image'],
                'Date' : story_info['article_date'] 
            }, index = [story_id])
            
            # And add to our master frame
            if all_stories is not None:
                all_stories = pd.concat([all_stories, to_add], axis = 0)
            else:
                all_stories = to_add

Looking at bbc
Looking at dailymail
Looking at mirror
Looking at skysports
Looking at telegraph
Looking at theguardian


In [5]:
all_stories.drop_duplicates().shape

(13889, 6)

In [8]:
all_stories.to_csv(os.path.join(data_loc, 'all_stories.csv'), index = False, encoding = 'utf-8')
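
One thing worth checking on the saved file is that `index = False` leaves the story IDs out of the CSV. A small sketch of the difference, using a made-up two-row frame and a temporary directory rather than the real data (the `Story_ID` column name is an assumption for illustration):

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for all_stories, indexed by story ID
demo = pd.DataFrame(
    {'Domain': ['bbc', 'mirror'], 'Headline': ['Café opens', None]},
    index=[1, 2],
)

out_dir = tempfile.mkdtemp()
without_id = os.path.join(out_dir, 'without_id.csv')
with_id = os.path.join(out_dir, 'with_id.csv')

# index=False drops the story IDs; index_label writes them as a named column
demo.to_csv(without_id, index=False, encoding='utf-8')
demo.to_csv(with_id, index_label='Story_ID', encoding='utf-8')

# Read both files back to compare what survived the round trip
reloaded_without = pd.read_csv(without_id, encoding='utf-8')
reloaded_with = pd.read_csv(with_id, index_col='Story_ID', encoding='utf-8')
print(reloaded_without.columns.tolist())
print(reloaded_with.index.tolist())
```

If the story IDs are needed later in the analysis, the second form keeps them; if they can be regenerated, the first is enough.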