# Revised Data Loading and Cleaning

## Status
 - The [XPN 90's A to Z](https://xpn.org/program/90s-a-z/) is done now.
   I'm redoing things for the new playlist format.
 - Got the playlist fetch/parse working with the new format.
 - Added basic data munging (letter, word, duration).
 - Added years.  Used the [playlist page](https://xpn.org/countdown/90s-a-z/)
   rather than [MusicBrainz](https://musicbrainz.org).
   - Turns out this is as confused as what I used to do.
     Making sense of music publishing is *real hard*.
 - Added support for the non-alphabetic follow-ons.

## Overview

Last year I took a year off and spent more time listening,
since the station did a bang-up job on stats
and I couldn't think of a value-add to do on top.
Meanwhile, they did a site redesign and my old playlist scraping code is obsolete.
If you want to see how this used to happen, 
or see data for earlier years, see the old [Data Loading Notebook](DataLoading.ipynb).


## Setup
Under the covers, this is mostly [requests](http://docs.python-requests.org/en/master/) to fetch data,
which is now JSON, so we can use the [built-in Python json library](https://docs.python.org/2.7/library/json.html),
and [Pandas](https://pandas.pydata.org/) for data munging.  So let's start with the imports.

In [1]:
%matplotlib inline
from IPython.display import display, HTML
import requests 
import json
import pandas as pd
import numpy as np
from datetime import date, datetime, time
from os import path, mkdir
import re

## Setup Cache directories
When we can, we'll cache data.
This is only partially for speed.
Checking in the data allows for repeatability if sources go away or change.
And for some partial results, it's useful to publish data.
There are others in the XPN community doing data analysis or just asking questions,
so csv files are nice to leave around.

In [2]:
cache_dir = './cache'
playlist_cache_dir = path.join(cache_dir, 'playlists')
a2z90s_cache_dir = path.join(cache_dir, 'a2z90s')
musicbrainz_cache_dir = path.join(cache_dir, 'musicbrainz')
data_dir = './data'

for d in (cache_dir, playlist_cache_dir, a2z90s_cache_dir,data_dir, musicbrainz_cache_dir):
    if not path.exists(d): mkdir(d)


## Generic XPN Playlist Scraping
[XPN](https://xpn.org) updated their site this year.
The [Playlists](https://xpn.org/wxpn-playlists/) are now JSON.
They also now have Album info, which is great.

In [3]:
def fetch_daily_playlist(day, cache_dir=None, verbose = False):
    """
    Fetches the XPN playlist for a given date
    
    Args:
        day (datetime.date) : The day to fetch the playlist for
        cache_dir (string)  : Path to the cache directory, or None to avoid caching
        verbose (bool)      : Print progress messages while fetching
    
    Returns:
        DataFrame containing Artist, Title, and Album as Strings and Airtime as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    # cache_file must exist even when caching is off, or the checks below fail
    cache_file = None
    if cache_dir is not None:
        cache_file =  path.join(cache_dir, "%04d-%02d-%02d.csv" % \
                                (day.year, day.month, day.day))
    if cache_file is not None and path.exists(cache_file):
        songs = pd.read_csv(cache_file, encoding='utf-8')
        songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')
        if verbose: print "Got %d rows from %s" % (len(songs), cache_file)
    else:
        # example url
        # https://origin.xpn.org/utils/playlist/json/2022-11-30.json
        playlist_url = 'https://origin.xpn.org/utils/playlist/json/%s.json' %(day.isoformat())
        page = requests.get(playlist_url)
        if verbose: print "fetching %s returned status %s" % (day.isoformat(), page.status_code)
        # return is a json array of playlist entries
        # each playlist entry is a dict of
        #  - artist : artists name
        #  - song : song title
        #  - album : album title
        #  - timeslice : string containing an iso date with second granularity in Eastern time
        #  - image: url of album cover art
        #  - streamPreview : url of a short mp3 outtake from the track
        # 
        # Not all tracks are music.  But shows like World Cafe or Echoes
        # put the show name, bound by vertical bars, as the artist, so we can skip them easily enough
        track_count = 0
        for track in page.json():
            if track["artist"].startswith('|'):
                # skip non-song show titles
                continue
            if verbose: print ("adding %s %s %s %s" % (track['artist'], track['song'], track['album'], 
                                                       datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')))
            songs = songs.append({'Artist': track['artist'],
                                  'Title': track['song'],
                                  'Album': track['album'],
                                  'Air Time': datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')},
                                 ignore_index = True)
            if verbose: print "size = %d" % len(songs)
            track_count += 1
        
        if verbose: print 'added %d tracks' % (track_count)
        if cache_file is not None:
            songs.to_csv(cache_file, index=False, encoding='utf-8')
            if verbose: print 'write %d rows to %s' % (len(songs), cache_file)
        
    return songs
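
As an aside, the per-track `append` above could be swapped for building the frame in one pass. This is only a sketch: `parse_playlist` is a hypothetical helper that is not part of the notebook, the field names are assumed from the comments above, and the show-title row in the sample is made up. Unlike the cells here, it uses Python 3 syntax.

```python
from datetime import datetime

import pandas as pd


def parse_playlist(entries):
    """Turn decoded playlist JSON (a list of dicts) into a DataFrame.

    Hypothetical helper; field names are assumed from the comments in
    fetch_daily_playlist.
    """
    rows = []
    for track in entries:
        artist = track.get('artist') or ''
        # Show titles such as |World Cafe| are wrapped in vertical bars.
        # Entries with no artist at all are skipped too (an assumption).
        if not artist or artist.startswith('|'):
            continue
        rows.append({
            'Artist': artist,
            'Title': track.get('song', ''),
            'Album': track.get('album', ''),
            'Air Time': datetime.strptime(track['timeslice'], '%Y-%m-%d %H:%M:%S'),
        })
    return pd.DataFrame(rows, columns=['Artist', 'Title', 'Album', 'Air Time'])


# Made-up sample in the assumed feed layout.
sample = [
    {'artist': '|World Cafe|', 'song': '', 'album': '',
     'timeslice': '2022-12-01 08:00:00'},
    {'artist': 'David Gray', 'song': 'A Century Ends', 'album': 'A Century Ends',
     'timeslice': '2022-12-01 08:02:27'},
]
songs = parse_playlist(sample)
print(len(songs))
```

Collecting plain dicts and constructing the frame once avoids copying the whole frame for every track.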
            
                
            
        
        

In [4]:
def fetch_playlist(start, end, cache_dir=None):
    """
    Fetch all the playlist entries for a range of time.
    
    Args:
        start (datetime.datetime) : The inclusive start time to fetch entries for
        end (datetime.datetime)   : The exclusive end time to fetch entries for
        cache_dir (string)        : path to the cache directory, or None to avoid caching
    
    Returns:
        Dataframe containing Artist, Title, and Album as strings, and Airtime as timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    for day in pd.date_range(start.date(), end.date()):
        songs = songs.append(fetch_daily_playlist(day.date(), cache_dir), ignore_index=True)
    songs = songs[songs['Air Time'] >= start]
    songs = songs[songs['Air Time'] < end]
    # sometimes the playlist entries are duplicated
    songs = songs.drop_duplicates()
    songs = songs.sort_values(by = 'Air Time')
    
    
    return songs
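
To make the windowing and de-duplication concrete, here is a small sketch on a three-row frame. The duplicate row is made up for illustration, and the code uses Python 3 syntax rather than the Python 2 of the cells above.

```python
from datetime import datetime

import pandas as pd

# Illustrative rows; the second one is an exact (made-up) duplicate.
songs = pd.DataFrame({
    'Artist': ['U2', 'U2', 'The Cranberries'],
    'Title': ['Zoo Station', 'Zoo Station', 'Zombie'],
    'Air Time': pd.to_datetime(['2022-12-08 08:23:24',
                                '2022-12-08 08:23:24',
                                '2022-12-08 08:14:57']),
})

start = datetime(2022, 12, 8, 8, 15)
end = datetime(2022, 12, 8, 8, 30)

# Half-open window: start is inclusive, end is exclusive.
in_window = songs[(songs['Air Time'] >= start) & (songs['Air Time'] < end)]

# drop_duplicates returns a new frame, so the result has to be assigned.
deduped = in_window.drop_duplicates().sort_values(by='Air Time')
print(len(in_window), len(deduped))
```

The half-open window means back-to-back calls, like the alphabetic countdown and the extras that follow it, never count the same track twice.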

## Load The Playlists
Fetch all the playlists, for the duration of the countdown,
pulling from local cache if possible.

In [5]:
ninties = fetch_playlist(datetime(2022, 12, 1, 8, 0), datetime(2022, 12, 8, 8, 30),
                          playlist_cache_dir)
print "got %d rows" % len(ninties)

got 2133 rows


In [6]:
HTML(ninties.head(5).to_html())

| | Artist | Title | Album | Air Time |
|---|---|---|---|---|
| 195 | David Gray | A Century Ends | A Century Ends | 2022-12-01 08:02:27 |
| 194 | Cowboy Junkies | A Common Disaster | Lay It Down | 2022-12-01 08:12:21 |
| 193 | My Life With Thrill Kill Kult | A Daisy Chain 4 Satan | Confessions of a Knife | 2022-12-01 08:16:39 |
| 192 | Bruce Cockburn | A Dream Like Mine | Recorded Live for World Cafe 10/15/91 | 2022-12-01 08:24:52 |
| 191 | Edwyn Collins | A Girl Like You | Gorgeous George | 2022-12-01 08:28:34 |


In [7]:
HTML(ninties.tail(5).to_html())

| | Artist | Title | Album | Air Time |
|---|---|---|---|---|
| 2137 | Smashing Pumpkins | Zero | Melon Collie & The Infinite Sadness | 2022-12-08 08:08:58 |
| 2136 | R. E. M. | Zither | New Adventures In Hi-fi | 2022-12-08 08:11:38 |
| 2135 | The Cranberries | Zombie | No Need To Argue | 2022-12-08 08:14:57 |
| 2134 | U2 | Zoo Station | Achtung Baby | 2022-12-08 08:23:24 |
| 2133 | Cherry Poppin' Daddies | Zoot Suit Riot | Zoot Suit Riot: The Swingin' Hits Of The Cherr... | 2022-12-08 08:27:53 |


### 90s Non-alphabetics

In [8]:
ninties_extras = fetch_playlist(datetime(2022, 12, 8, 8, 30), datetime(2022, 12, 8, 11, 8),
                          playlist_cache_dir)
print "got %d rows" % len(ninties_extras)

got 33 rows


In [9]:
HTML(ninties_extras.head(5).to_html())

| | Artist | Title | Album | Air Time |
|---|---|---|---|---|
| 36 | Brandy | (Everything I Do) I Do It For You | Never Say Never | 2022-12-08 08:34:47 |
| 35 | UB40 | (I Can't Help) Falling in Love with You | Promises and Lies | 2022-12-08 08:38:54 |
| 34 | Harry Connick Jr | (I Could Only) Whisper Your Name | She | 2022-12-08 08:43:25 |
| 33 | The Subdudes | (You'll Be) Satisfied | Annunciation | 2022-12-08 08:48:23 |
| 32 | Bob Dylan | 'Til I Fell In Love With You | Time Out of Mind | 2022-12-08 08:54:53 |


In [10]:
HTML(ninties_extras.tail(5).to_html())

| | Artist | Title | Album | Air Time |
|---|---|---|---|---|
| 8 | Prince & the New Power Generation | 7 | Love Symbol | 2022-12-08 10:45:46 |
| 7 | Youssou N'Dour | 7 Seconds | The Guide | 2022-12-08 10:50:19 |
| 6 | Southern Culture On The Skids | 8 Piece Box | Dirt Track Date | 2022-12-08 10:56:13 |
| 5 | Souls of Mischief | '93 til Infinity | '93 til Infinity | 2022-12-08 11:00:36 |
| 4 | Suzanne Vega | 99.9 F | 99.9 F | 2022-12-08 11:05:15 |


## Augmenting The Data

### Extracting Initial Letters
This might be the lamest, simplest data augmentation,
but it is an A to Z countdown, isn't it?
Besides, nothing is ever really that simple.
Blanks and initial punctuation (e.g. *'Til*) have shown up before.

In [11]:
def first_char(s):
    for c in s:
        if type(c) is str and c.isalpha():
            return c.upper()
    return s[0]

ninties = ninties.join(ninties.apply(lambda x: first_char(x['Title']), axis=1).to_frame('Letter'))
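
To see what this rule does with awkward titles, here is a small sketch. It restates `first_char` in simplified form for Python 3 (where every character of a `str` is itself a `str`, so the type check is dropped) so the block runs on its own. The titles are taken from the playlists above.

```python
def first_char(s):
    """Return the first alphabetic character, upper-cased, else the first character."""
    for c in s:
        if c.isalpha():
            return c.upper()
    return s[0]


titles = ["A Century Ends", "'Til I Fell In Love With You",
          "(You'll Be) Satisfied", "99.9 F", "7"]
for title in titles:
    print(title, '->', first_char(title))
```

Note that `99.9 F` lands under `F` with this rule, while a title with no letters at all, like `7`, falls back to its first character.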


For the non-alphabetic leftovers, we'll do first character instead, so no skipping past non-alphabetics.

In [12]:
ninties_extras = ninties_extras.join(ninties_extras.apply(lambda x: x['Title'][0].upper(), axis=1).to_frame('First Character'))

### Extracting First Words
Not sure how interesting this is,
but the "should we include leading articles" question was the genesis of this effort back in 2016.
Besides, it's easy.

In [13]:
from nltk.tokenize import RegexpTokenizer
# raw string, so the backslashes reach the regex engine untouched
custom_tokenize = RegexpTokenizer(r"[\w'\-]+|[^\w'\s\-]").tokenize
ninties = ninties.join(ninties.apply(lambda x: custom_tokenize(x['Title'])[0], axis=1).to_frame('First Word'))
ninties_extras = ninties_extras.join(ninties_extras.apply(lambda x: custom_tokenize(x['Title'])[0], axis=1).to_frame('First Word'))
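
The pattern can be tried out without NLTK. This is a rough sketch using `re.findall` as a stand-in. I'm assuming it behaves like `RegexpTokenizer` for simple titles like these (NLTK compiles the pattern with a few extra flags), so treat it as an approximation. `first_word` is a hypothetical helper and the code uses Python 3 syntax.

```python
import re

# Same pattern as above: a run of word characters, apostrophes and hyphens,
# or else a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[\w'\-]+|[^\w'\s\-]")


def first_word(title):
    """Return the first token of a title, or '' if there are none."""
    tokens = TOKEN_PATTERN.findall(title)
    return tokens[0] if tokens else ''


for title in ["Zoot Suit Riot", "'Til I Fell In Love With You",
              "(You'll Be) Satisfied", "99.9 F"]:
    print(repr(title), '->', repr(first_word(title)))
```

Keeping the apostrophe inside the word class is what makes `'Til` come out as one token, while an opening parenthesis becomes a token of its own.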

### Publication Years

For the 90s A-Z, the realtime list uses a single JSON request to get a list of songs,
and it contains the publication year.
This might just be a lot easier than using [MusicBrainz](https://musicbrainz.org).

In [14]:
def fetch_years():
    # the 90s specific playlist page has a json feed with years, 
    # so we can just fetch that as
    # https://origin.xpn.org/countdown/2022/2022_90s_az.json
    # one catch, unlike the normal playlist, it stuffs escaped escapes on "specials"
    # for example "Don\\'t need no \\'cape here" rather than "Don't need no 'cape here"
    #
    # their data isn't much better than what I did myself.
    # for the moment, drop anything outside 1990-1999.
    years = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Year'])
    az_url = 'https://origin.xpn.org/countdown/2022/2022_90s_az.json'
    page = requests.get(az_url)
    for track in page.json():
        if len(track['releaseDate']) == 4:
            release_year = int(track['releaseDate'])
            if release_year < 1990 or release_year > 1999:
                release_year = 0
        else:
            release_year = 0
        years = years.append({'Artist': track['artist'].replace("\\", ""),
                              'Title': track['song'].replace("\\", ""),
                              'Album': track['album'].replace("\\", ""),
                              'Year': release_year},
                             ignore_index = True)
    return years
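
The year check and the backslash cleanup can be pulled out and tried on their own. This is a sketch with two hypothetical helpers (`parse_release_year` and `unescape`) that are not in the notebook. It mirrors the checks above but also tolerates non-numeric values, which is my addition, and it uses Python 3 syntax.

```python
def parse_release_year(raw, first=1990, last=1999):
    """Return raw as an int year within [first, last], else 0.

    Hypothetical helper mirroring the checks in fetch_years; non-numeric
    values give 0 instead of raising.
    """
    text = str(raw).strip()
    if len(text) != 4:
        return 0
    try:
        year = int(text)
    except ValueError:
        return 0
    return year if first <= year <= last else 0


def unescape(text):
    """Drop the stray backslashes the countdown feed puts in some fields."""
    return text.replace('\\', '')


print(parse_release_year('1994'))
print(parse_release_year('2004'))
print(parse_release_year(''))
print(unescape("Don\\'t need no \\'cape here"))
```

Using 0 as the "unknown" marker keeps the column an integer type, which matches how the merge below fills missing years.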
    


In [15]:
# fetch the year feed once and reuse it for both merges
years = fetch_years()
ninties = ninties.merge(years, on = ['Artist', 'Title', 'Album'], how='left')
ninties['Year'] = ninties['Year'].fillna(value=0).astype(int)
print "Of %d 90s tracks, %d had valid dates and %d did not" % \
    (len(ninties), len(ninties[ninties['Year'] > 0]), len(ninties[ninties['Year'] == 0]))
ninties_extras = ninties_extras.merge(years, on = ['Artist', 'Title', 'Album'], how='left')
ninties_extras['Year'] = ninties_extras['Year'].fillna(value=0).astype(int)
print "Of %d 90s extra tracks, %d had valid dates and %d did not" % \
    (len(ninties_extras), len(ninties_extras[ninties_extras['Year'] > 0]), len(ninties_extras[ninties_extras['Year'] == 0]))

Of 2133 90s tracks, 1922 had valid dates and 211 did not
Of 33 90s extra tracks, 30 had valid dates and 3 did not




### Estimating Durations

Since we have air times, we can approximate durations by subtracting each track's air time from the next track's air time.  There are a couple of catches with this:
- we need to pass in an explicit end time for the last track, but that's minor
- we need to add some logic to 'skip over' the Free at Noons that happen on Fridays from 12 noon till "like 12:40 or so" and don't appear in the playlist at all
- there's no clear way to account for "non-song time" like station promos, hosts introducing songs, station IDs, and so forth.  Fortunately, the percentage of time that is really music is pretty high thanks to XPN being listener supported.

In [16]:
def estimate_durations(playlist, end_time=None):
    """
    Estimate the song durations
    Args: 
        playlist (DataFrame): playlist with minimally an 'Air Time' attribute
        end_time (datetime): end time of the play list, or None if still going
    Return:
        modified DataFrame with 'Duration' attribute added.
    """
    
    playlist['Duration'] = pd.Series([0 for x in range(len(playlist.index))], index=playlist.index)
    previous = None
    last_idx = None
    for idx, row in playlist.iterrows():
        if previous is not None:
            if row['Air Time'].date().weekday() == 4 and previous.hour == 11 and row['Air Time'].hour == 12:
                # We just fell into a Free at Noon, so assume the previous
                # track ran until the top of the hour
                playlist.loc[last_idx, 'Duration'] = 60 - previous.minute
            else:
                # subtract the previous start from this start, in whole minutes
                delta = row['Air Time'] - previous
                playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
        previous = row['Air Time']
        last_idx = idx

    # fixup the last row
    if end_time is not None and last_idx is not None:
        delta = end_time - playlist.loc[last_idx,'Air Time']
        playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
    
    return playlist
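
For comparison, the basic subtraction can also be done without a row loop. This is a sketch of a hypothetical alternative, `estimate_durations_vectorized`, which deliberately leaves out the Free at Noon special case and uses total elapsed time rather than `delta.seconds`. It assumes the playlist is already sorted by air time and uses Python 3 syntax.

```python
from datetime import datetime

import pandas as pd


def estimate_durations_vectorized(playlist, end_time=None):
    """Sketch: whole-minute durations from consecutive air times.

    Hypothetical alternative to estimate_durations; no Free at Noon handling.
    """
    result = playlist.copy()
    starts = result['Air Time']
    # The next track's start is this track's end; the last row gets end_time.
    ends = starts.shift(-1)
    if end_time is not None and len(result) > 0:
        ends.iloc[-1] = pd.Timestamp(end_time)
    minutes = (ends - starts).dt.total_seconds() // 60
    # With no end_time the last duration is unknown and is left at 0.
    result['Duration'] = minutes.fillna(0).astype(int)
    return result


# Air times of the last five countdown tracks, from the playlist above.
tail = pd.DataFrame({
    'Title': ['Zero', 'Zither', 'Zombie', 'Zoo Station', 'Zoot Suit Riot'],
    'Air Time': pd.to_datetime(['2022-12-08 08:08:58', '2022-12-08 08:11:38',
                                '2022-12-08 08:14:57', '2022-12-08 08:23:24',
                                '2022-12-08 08:27:53']),
})
timed = estimate_durations_vectorized(tail, datetime(2022, 12, 8, 8, 30))
print(timed['Duration'].tolist())
```

Unlike the loop version, this returns a copy and leaves the input frame untouched.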

In [19]:
ninties = estimate_durations(ninties, datetime(2022,12, 8, 8, 30))
ninties_extras = estimate_durations(ninties_extras, datetime(2022, 12, 8, 11, 9))

## Checking The Results

In [22]:
HTML(ninties.head(5).to_html())

| | Artist | Title | Album | Air Time | Letter | First Word | Year | Duration |
|---|---|---|---|---|---|---|---|---|
| 0 | David Gray | A Century Ends | A Century Ends | 2022-12-01 08:02:27 | A | A | 0 | 9 |
| 1 | Cowboy Junkies | A Common Disaster | Lay It Down | 2022-12-01 08:12:21 | A | A | 0 | 4 |
| 2 | My Life With Thrill Kill Kult | A Daisy Chain 4 Satan | Confessions of a Knife | 2022-12-01 08:16:39 | A | A | 0 | 8 |
| 3 | Bruce Cockburn | A Dream Like Mine | Recorded Live for World Cafe 10/15/91 | 2022-12-01 08:24:52 | A | A | 0 | 3 |
| 4 | Edwyn Collins | A Girl Like You | Gorgeous George | 2022-12-01 08:28:34 | A | A | 0 | 3 |


In [21]:
HTML(ninties.tail(5).to_html())

| | Artist | Title | Album | Air Time | Letter | First Word | Year | Duration |
|---|---|---|---|---|---|---|---|---|
| 2128 | Smashing Pumpkins | Zero | Melon Collie & The Infinite Sadness | 2022-12-08 08:08:58 | Z | Zero | 1995 | 2 |
| 2129 | R. E. M. | Zither | New Adventures In Hi-fi | 2022-12-08 08:11:38 | Z | Zither | 1996 | 3 |
| 2130 | The Cranberries | Zombie | No Need To Argue | 2022-12-08 08:14:57 | Z | Zombie | 1994 | 8 |
| 2131 | U2 | Zoo Station | Achtung Baby | 2022-12-08 08:23:24 | Z | Zoo | 1991 | 4 |
| 2132 | Cherry Poppin' Daddies | Zoot Suit Riot | Zoot Suit Riot: The Swingin' Hits Of The Cherr... | 2022-12-08 08:27:53 | Z | Zoot | 1997 | 2 |


In [23]:
HTML(ninties_extras.head(5).to_html())

| | Artist | Title | Album | Air Time | First Character | First Word | Year | Duration |
|---|---|---|---|---|---|---|---|---|
| 0 | Brandy | (Everything I Do) I Do It For You | Never Say Never | 2022-12-08 08:34:47 | ( | ( | 1998 | 4 |
| 1 | UB40 | (I Can't Help) Falling in Love with You | Promises and Lies | 2022-12-08 08:38:54 | ( | ( | 1993 | 4 |
| 2 | Harry Connick Jr | (I Could Only) Whisper Your Name | She | 2022-12-08 08:43:25 | ( | ( | 1994 | 4 |
| 3 | The Subdudes | (You'll Be) Satisfied | Annunciation | 2022-12-08 08:48:23 | ( | ( | 1994 | 6 |
| 4 | Bob Dylan | 'Til I Fell In Love With You | Time Out of Mind | 2022-12-08 08:54:53 | ' | 'Til | 1997 | 5 |


In [24]:
HTML(ninties_extras.tail(5).to_html())

| | Artist | Title | Album | Air Time | First Character | First Word | Year | Duration |
|---|---|---|---|---|---|---|---|---|
| 28 | Prince & the New Power Generation | 7 | Love Symbol | 2022-12-08 10:45:46 | 7 | 7 | 1992 | 4 |
| 29 | Youssou N'Dour | 7 Seconds | The Guide | 2022-12-08 10:50:19 | 7 | 7 | 1994 | 5 |
| 30 | Southern Culture On The Skids | 8 Piece Box | Dirt Track Date | 2022-12-08 10:56:13 | 8 | 8 | 1995 | 4 |
| 31 | Souls of Mischief | '93 til Infinity | '93 til Infinity | 2022-12-08 11:00:36 | ' | '93 | 1993 | 4 |
| 32 | Suzanne Vega | 99.9 F | 99.9 F | 2022-12-08 11:05:15 | 9 | 99 | 1992 | 3 |


## Saving The Data

In [None]:
ninties_data_file = path.join(data_dir, '90sA2Z.csv')
ninties.to_csv(ninties_data_file, index=False, encoding='utf8')
ninties_extras_data_file = path.join(data_dir, '90sextras.csv')
ninties_extras.to_csv(ninties_extras_data_file, index=False, encoding='utf8')
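
For anyone picking up the CSV files, here is a sketch of reading one back. The two rows are inlined (copied from the results above) so the block runs without the data directory. With the real file you would pass its path instead of the `StringIO` object. Python 3 syntax.

```python
import io

import pandas as pd

# Two rows in the same column layout as data/90sA2Z.csv.
csv_text = u"""Artist,Title,Album,Air Time,Letter,First Word,Year,Duration
The Cranberries,Zombie,No Need To Argue,2022-12-08 08:14:57,Z,Zombie,1994,8
U2,Zoo Station,Achtung Baby,2022-12-08 08:23:24,Z,Zoo,1991,4
"""

# parse_dates restores 'Air Time' as timestamps instead of plain strings.
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['Air Time'])
print(reloaded.dtypes)
print(reloaded['Duration'].sum())
```

Without `parse_dates`, the air times come back as strings and any time arithmetic has to convert them first.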

## Availability
The code for this project is in [my GitHub repo](https://github.com/asudell/a2z)
and this file is specifically [Dataloading2](https://github.com/asudell/a2z/blob/master/Dataloading2.ipynb). 

## License


This project is licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
You are free to use it for commercial or non-commercial purposes,
so long as you attribute the source and also allow sharing.