# Revised Data Loading and Cleaning

## Status
 - The [XPN 90's A to Z](https://xpn.org/program/90s-a-z/) just started.
   I'm redoing things for the new playlist format.
 - Got the playlist fetch/parse working.  Not doing anything with it yet.

## Overview

I took last year off and spent more time listening,
since the station did a bang-up job on stats
and I couldn't think of any value to add on top.
Meanwhile, they redesigned the site, and my old playlist scraping code is obsolete.
If you want to see how this used to happen, 
or see data for earlier years, see the old [Data Loading Notebook](DataLoading.ipynb).


## Setup
Under the covers, this is mostly [requests](http://docs.python-requests.org/en/master/) to fetch data,
which is now JSON, so we can use the [built-in Python library](https://docs.python.org/2.7/library/json.html),
and [Pandas](https://pandas.pydata.org/) for data munging.  So let's start with the imports.

In [1]:
%matplotlib inline
from IPython.display import display, HTML
import requests 
import json
import pandas as pd
import numpy as np
from datetime import date, datetime, time
from os import path, mkdir
import re

## Setup Cache directories
When we can, we'll cache data.
This is only partially for speed.
Checking in the data allows for repeatability if sources go away or change.
And for some partial results, it's useful to publish data.
There are others in the XPN community doing data analysis or just asking questions,
so CSV files are nice to leave around.

In [2]:
cache_dir = './cache'
playlist_cache_dir = path.join(cache_dir, 'playlists')
a2z90s_cache_dir = path.join(cache_dir, 'a2z90s')
musicbrainz_cache_dir = path.join(cache_dir, 'musicbrainz')
data_dir = './data'

# cache_dir comes first so its subdirectories can be created with plain mkdir
for d in (cache_dir, playlist_cache_dir, a2z90s_cache_dir, data_dir, musicbrainz_cache_dir):
    if not path.exists(d): mkdir(d)


## Generic XPN Playlist Scraping
[XPN](https://xpn.org) updated their site this year.
The [Playlists](https://xpn.org/wxpn-playlists/) are now JSON.
They also now have album info, which is great.
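
To make the new format concrete, here is a minimal sketch of parsing the feed without touching the network. The sample payload is hand-written for illustration (only the field names `artist`, `song`, `album`, and `timeslice` come from the feed), and `parse_playlist_entries` is a hypothetical helper, not part of this notebook's code.

```python
import json
from datetime import datetime

# Hand-written sample in the shape of the playlist feed (an assumption for
# illustration); the real feed also carries 'image' and 'streamPreview' fields.
sample = json.loads("""
[
  {"artist": "|World Cafe|", "song": "", "album": "",
   "timeslice": "2022-12-01 08:00:00"},
  {"artist": "David Gray", "song": "A Century Ends",
   "album": "A Century Ends", "timeslice": "2022-12-01 08:02:27"}
]
""")

def parse_playlist_entries(entries):
    """Turn raw playlist entries into row dicts, skipping show markers."""
    rows = []
    for entry in entries:
        artist = entry.get("artist") or ""
        # Shows put their name between vertical bars in the artist field
        if artist.startswith("|"):
            continue
        rows.append({
            "Artist": artist,
            "Title": entry.get("song"),
            "Album": entry.get("album"),
            "Air Time": datetime.strptime(entry["timeslice"],
                                          "%Y-%m-%d %H:%M:%S"),
        })
    return rows

rows = parse_playlist_entries(sample)
print("%d song(s), first is %s at %s" % (len(rows), rows[0]["Artist"],
                                         rows[0]["Air Time"]))
```

Collecting plain dicts first means the frame can be built in one call with `pd.DataFrame(rows)`, which avoids growing a DataFrame one row at a time (`DataFrame.append` was removed in pandas 2.0).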

In [3]:
def fetch_daily_playlist(day, cache_dir=None, verbose = False):
    """
    Fetches the XPN playlist for a given date
    
    Args:
        day (datetime.date) : The day to fetch the playlist for
        cache_dir (string)  : Path to the cache directory, or None to avoid caching
        verbose (bool)      : If True, print progress messages while fetching
    
    Returns:
        DataFrame containing Artist, Title, and Album as strings and Air Time as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    # cache_file stays None when caching is disabled, so the checks below are safe
    cache_file = None
    if cache_dir is not None:
        cache_file =  path.join(cache_dir, "%04d-%02d-%02d.csv" % \
                                (day.year, day.month, day.day))
    if cache_file is not None and path.exists(cache_file):
        songs = pd.read_csv(cache_file, encoding='utf-8')
        songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')
        if verbose: print "Got %d rows from %s" % (len(songs), cache_file)
    else:
        # example url
        # https://origin.xpn.org/utils/playlist/json/2022-11-30.json
        playlist_url = 'https://origin.xpn.org/utils/playlist/json/%s.json' %(day.isoformat())
        print playlist_url
        page = requests.get(playlist_url)
        if verbose: print "fetching %s returned status %s" % (day.isoformat(), page.status_code)
        # return is a json array of playlist entries
        # each playlist entry is a dict of
        #  - artist : artists name
        #  - song : song title
        #  - album : album title
        #  - timeslice : string containing an iso date with second granularity in Eastern time
        #  - image: url of album cover art
        #  - streamPreview : url of a short mp3 outtake from the track
        # 
        # Not all tracks are music.  But shows like World Cafe or Echoes
        # put the show name, bound by vertical bars, as the artist, so we can skip them easily enough
        track_count = 0
        for track in page.json():
            # startswith also copes with an empty artist string
            if track["artist"].startswith('|'):
                # skip non-song show titles
                continue
            if verbose: print ("adding %s %s %s %s" % (track['artist'], track['song'], track['album'], 
                                                       datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')))
            songs = songs.append({'Artist': track['artist'],
                                  'Title': track['song'],
                                  'Album': track['album'],
                                  'Air Time': datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')},
                                 ignore_index = True)
            if verbose: print "size = %d" % len(songs)
            track_count += 1
        
        if verbose: print 'added %d tracks' % (track_count)
        if cache_file is not None:
            songs.to_csv(cache_file, index=False, encoding='utf-8')
            if verbose: print 'write %d rows to %s' % (len(songs), cache_file)
        
    # return at function level so the cached path returns its data too
    return songs

In [4]:
def fetch_playlist(start, end, cache_dir=None):
    """
    Fetch all the playlist entries for a range of time.
    
    Args:
        start (datetime.datetime) : The inclusive start time to fetch entries for
        end (datetime.datetime)   : The exclusive end time to fetch entries for
        cache_dir (string)        : path to the cache directory, or None to avoid caching
    
    Returns:
        DataFrame containing Artist, Title, and Album as strings, and Air Time as timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    for day in pd.date_range(start.date(), end.date()):
        songs = songs.append(fetch_daily_playlist(day.date(), cache_dir), ignore_index=True)
    songs = songs[songs['Air Time'] >= start]
    songs = songs[songs['Air Time'] < end]
    # sometimes the playlist entries are duplicated
    songs = songs.drop_duplicates()
    songs = songs.sort_values(by = 'Air Time')
    return songs
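
The windowing in `fetch_playlist` does not depend on pandas, so the following sketch replays it on plain row dicts to make the inclusive start, exclusive end, and de-duplication explicit. `window_rows` is a hypothetical helper and the rows are a hand-picked sample, not the notebook's actual code path.

```python
from datetime import datetime

def window_rows(rows, start, end):
    """Keep rows with start <= Air Time < end, drop exact duplicates,
    and return them sorted by Air Time."""
    seen = set()
    kept = []
    for row in rows:
        if not (start <= row["Air Time"] < end):
            continue
        key = (row["Artist"], row["Title"], row["Album"], row["Air Time"])
        if key in seen:
            continue  # same entry listed twice in the playlist
        seen.add(key)
        kept.append(row)
    return sorted(kept, key=lambda r: r["Air Time"])

gray = {"Artist": "David Gray", "Title": "A Century Ends",
        "Album": "A Century Ends",
        "Air Time": datetime(2022, 12, 1, 8, 2, 27)}
junkies = {"Artist": "Cowboy Junkies", "Title": "A Common Disaster",
           "Album": "Lay It Down",
           "Air Time": datetime(2022, 12, 1, 8, 12, 21)}

# Out of order and with one duplicate, as the feed sometimes delivers them
sample_rows = [junkies, gray, dict(gray)]

both = window_rows(sample_rows, datetime(2022, 12, 1, 8, 0),
                   datetime(2022, 12, 1, 9, 0))
# An end time equal to a row's Air Time excludes that row
first_only = window_rows(sample_rows, datetime(2022, 12, 1, 8, 0),
                         datetime(2022, 12, 1, 8, 12, 21))
print([r["Artist"] for r in both])
print([r["Artist"] for r in first_only])
```

Because the end is exclusive, back-to-back windows can share a boundary timestamp without counting any song twice.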

## Load the playlists
For now, let's do a test run for a few days.


In [5]:
ninties = fetch_playlist(datetime(2022, 12, 1, 8, 0), datetime.now(),
                          playlist_cache_dir)
print "got %d rows" % len(ninties)

https://origin.xpn.org/utils/playlist/json/2022-12-01.json
got 183 rows


In [6]:
HTML(ninties.head(5).to_html())

| | Air Time | Album | Artist | Title |
|---|---|---|---|---|
| 182 | 2022-12-01 08:02:27 | A Century Ends | David Gray | A Century Ends |
| 181 | 2022-12-01 08:12:21 | Lay It Down | Cowboy Junkies | A Common Disaster |
| 180 | 2022-12-01 08:16:39 | Confessions of a Knife | My Life With Thrill Kill Kult | A Daisy Chain 4 Satan |
| 179 | 2022-12-01 08:24:52 | Recorded Live for World Cafe 10/15/91 | Bruce Cockburn | A Dream Like Mine |
| 178 | 2022-12-01 08:28:34 | Gorgeous George | Edwyn Collins | A Girl Like You |


In [7]:
HTML(ninties.tail(5).to_html())

| | Air Time | Album | Artist | Title |
|---|---|---|---|---|
| 4 | 2022-12-01 22:31:43 | Blue Lines | Massive Attack | Blue Lines |
| 3 | 2022-12-01 22:36:13 | Blue Sky Mining | Midnight Oil | Blue Sky Mine |
| 2 | 2022-12-01 22:45:22 | Dancing The Blues | Taj Mahal | Blues Ain't Nothing |
| 1 | 2022-12-01 22:49:27 | G Love & Special Sauce | G. Love & Special Sauce | Blues Music |
| 0 | 2022-12-01 22:54:57 | Phantom Power | The Tragically Hip | Bobcaygeon |
