# Data Loading and Cleaning

## Overview

Doing these stats is always a bit of a frenzy, and the data collection code tends to get messy for a few reasons: the countdowns are "unannounced surprises", they run for a limited time, it's a side project, and I want to be timely.  Having said that, I wanted to clean it up a bit and codify what I've learned.  So here goes.

### Setup

Under the covers it's mostly a combo of [requests](http://docs.python-requests.org/en/master/) and [lxml](http://lxml.de/) for web scraping and [pandas](https://pandas.pydata.org/) for data munging.  Before we get started, set up the imports.

In [11]:
%matplotlib inline
from IPython.display import display, HTML
import requests 
from lxml import html
import pandas as pd
from datetime import date, datetime, time
from os import path, mkdir
import re


### Set up cache/data directories

In [12]:

cache_dir = './test-cache'
playlist_cache_dir = path.join(cache_dir, 'playlists')
data_dir = './test-data'

for d in (cache_dir, playlist_cache_dir, data_dir):
    if not path.exists(d): mkdir(d)

## Generic XPN Playlist scraping

Originally I tended to rely on the one-off countdown pages for playlists.
But eventually I ended up using the generic playlist at [http://xpn.org/playlists/xpn-playlist](http://xpn.org/playlists/xpn-playlist).
I've done this often enough that it's past time to turn it into something reusable.

In [18]:
def fetch_playlist(day, cache_dir=None, verbose=False):
    """
    Fetches the XPN playlist for a given date
    
    Args:
        day (datetime.date) : The day to fetch the playlist for
        cache_dir (string)  : Path to the cache directory, or None to avoid caching
        verbose (bool)      : If True, print progress messages
    
    Returns:
        DataFrame containing Artist and Title as Strings and Air Time as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Air Time'])
    # default to None so the checks below also work when caching is disabled
    cache_file = None
    if cache_dir is not None:
        cache_file = path.join(cache_dir, "%04d-%02d-%02d.csv" %
                               (day.year, day.month, day.day))
    if cache_file is not None and path.exists(cache_file):
        songs = pd.read_csv(cache_file)
        # CSV stores air times as text, so convert them back to timestamps
        songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')
        if verbose: print("Got %d rows from %s" % (len(songs), cache_file))
    else:
        day_s = '%02d-%02d-%04d' % (day.month, day.day, day.year)
        page = requests.post('http://xpn.org/playlists/xpn-playlist',
                             data={'playlistdate': day_s})
        if verbose: print("fetching %s returned status %s" % (day_s, page.status_code))
        tree = html.fromstring(page.content)
        tracks = tree.xpath('//h3/a/text()')
        # Not all rows are tracks; some are membership callouts.
        # Real tracks start with a time and are formatted
        # HH:MM [am|pm] Artist - Title
        # Note that I've seen titles with embedded dashes,
        # but so far no artist names with them.  This may be luck.
        # Special programs like World Cafe, Echoes, ...
        # also start with an air time, but don't have useful track info.
        # Those list the program name inside bars,
        # eg |World Cafe| -  "Wednesday 11-2-2016 Hour 2, Part 7"
        # raw string so the backslashes reach the regex engine untouched
        date_regex = re.compile(r"^\d{2}:\d{2}\s")
        line_count = 0
        rows = []
        for track in tracks:
            line_count += 1
            if date_regex.match(track) and track[9:10] != '|':
                (artist, title) = track[9:].split(' - ', 1)
                dt = datetime.strptime(track[:8], '%I:%M %p')
                air_time = datetime.combine(day, dt.time())
                if verbose: print("adding %s %s %s" % (artist, title, air_time))
                # collect rows in a list; DataFrame.append was removed in pandas 2.0
                rows.append({'Artist': artist,
                             'Title': title,
                             'Air Time': air_time})

        # build the DataFrame once, after all rows are collected
        songs = pd.DataFrame(rows, columns=['Artist', 'Title', 'Air Time'])
        if verbose: print('read %d lines and added %d tracks' % (line_count, len(rows)))
        # Drop any duplicates, which are not uncommon
        songs = songs.drop_duplicates()
        if cache_file is not None:
            songs.to_csv(cache_file, index=False)
            if verbose: print('wrote %d rows to %s' % (len(songs), cache_file))
    
    return songs
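
To make the line format described in the comments above concrete, here is a minimal sketch of the per-line parsing on its own, without any network access. The helper name `parse_track_line` and the sample lines are made up for illustration (only the first sample matches a row that shows up later in this notebook), and the extra check for a missing ` - ` separator is my addition rather than something `fetch_playlist` does.

```python
import re
from datetime import date, datetime

# Same pattern as in fetch_playlist: a candidate row starts with "HH:MM "
TIME_PREFIX = re.compile(r"^\d{2}:\d{2}\s")

def parse_track_line(line, day):
    """Hypothetical helper: return (artist, title, air_time), or None for non-track rows."""
    if not TIME_PREFIX.match(line):
        return None                      # membership callouts and similar rows
    rest = line[9:]                      # text after the 9-character "HH:MM am " prefix
    if rest.startswith('|') or ' - ' not in rest:
        return None                      # program rows such as |World Cafe|, or no separator
    # split on the first ' - ' only, so dashes inside the title survive
    artist, title = rest.split(' - ', 1)
    air_time = datetime.combine(day, datetime.strptime(line[:8], '%I:%M %p').time())
    return artist, title, air_time

day = date(2017, 11, 29)
samples = [
    '06:02 am Steeleye Span - A Calling-On Song',
    '08:15 pm Example Artist - Part One - Reprise',                     # made-up title with a dash
    '02:00 pm |World Cafe| -  "Wednesday 11-2-2016 Hour 2, Part 7"',    # program row, time made up
    'Become a member today',                                            # made-up callout
]
parsed = [parse_track_line(s, day) for s in samples]
for sample, result in zip(samples, parsed):
    print("%-65s -> %s" % (sample, result))
```

Splitting on the first ` - ` is what lets titles keep their embedded dashes; an artist name containing ` - ` would still be split in the wrong place, which is the "this may be luck" caveat from the comments.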


### XPN 70's A to Z playlist

The 70's A to Z started at 6:00 am on Nov 29, 2017.
At the time of writing it is still running.
The only interruptions have been for Free at Noon shows,
which don't appear in the playlist
but show up as gaps from 12 noon to 12:40 or so on Fridays.


In [22]:
start_time = datetime(2017, 11, 29, 6, 0)
end_time = datetime.now()
# fetch one DataFrame per day, then concatenate them in a single step
daily = [fetch_playlist(day, playlist_cache_dir)
         for day in pd.date_range(start_time.date(), end_time.date())]
seventies = pd.concat(daily, ignore_index=True)
# delete data before the start time
seventies = seventies[seventies['Air Time'] >= start_time]
seventies = seventies.sort_values(by='Air Time')
# Save off the results
seventies_data_file = path.join(data_dir, 'seventies.csv')
seventies.to_csv(seventies_data_file, index=False)
print("got %d rows" % len(seventies))

got 2888 rows


Take a quick look at what we got.

In [23]:
HTML(seventies.head(5).to_html())

|     | Artist | Title | Air Time |
| --- | --- | --- | --- |
| 219 | Steeleye Span | A Calling-On Song | 2017-11-29 06:02:00 |
| 218 | Joni Mitchell | A Case Of You | 2017-11-29 06:03:00 |
| 217 | Boz Scaggs | A Clue | 2017-11-29 06:07:00 |
| 216 | Todd Rundgren | A Dream Goes On Forever | 2017-11-29 06:13:00 |
| 215 | Lou Reed | A Gift | 2017-11-29 06:16:00 |


In [24]:
HTML(seventies.describe(include='all', percentiles=[]).to_html(na_rep=''))

|        | Artist | Title | Air Time |
| --- | --- | --- | --- |
| count  | 2888 | 2888 | 2888 |
| unique | 859 | 2780 | 2887 |
| top    | Neil Young | Cocaine | 2017-12-01 16:24:00 |
| freq   | 41 | 3 | 2 |
| first  |  |  | 2017-11-29 06:02:00 |
| last   |  |  | 2017-12-08 15:40:00 |
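
The Free at Noon interruptions mentioned earlier should show up as unusually long stretches between consecutive air times. Here's a rough sketch of one way to check for them, using a toy DataFrame with made-up air times rather than the real playlist. The helper name `find_gaps` and the 30 minute threshold are assumptions of mine, not something taken from the data.

```python
import pandas as pd

# Toy data standing in for the real playlist; artists, titles and times are made up.
toy = pd.DataFrame({
    'Artist': ['A', 'B', 'C', 'D'],
    'Title': ['One', 'Two', 'Three', 'Four'],
    'Air Time': pd.to_datetime(['2017-12-01 11:52', '2017-12-01 11:57',
                                '2017-12-01 12:41', '2017-12-01 12:45']),
})

def find_gaps(songs, threshold=pd.Timedelta(minutes=30)):
    """Hypothetical helper: tracks that start more than `threshold` after the previous one."""
    ordered = songs.sort_values(by='Air Time').reset_index(drop=True)
    # diff() gives the time elapsed since the previous row (NaT for the first row)
    ordered['Gap'] = ordered['Air Time'].diff()
    return ordered[ordered['Gap'] > threshold]

gaps = find_gaps(toy)
print(gaps)
```

On the real data this would be called as `find_gaps(seventies)`. Because it measures the time between track starts, a very long track would also register as a gap, so the threshold needs some tuning.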
