# Now That's What I Call Music Classifier

### Getting Started

You are about to follow along with my process for creating a Now That's What I Call Music Classifier that classifies today's hits as either 'now-worthy' or not with **84% accuracy** and returns the 20 selections with the highest confidence. Before you do so, there are some things you should know.

1) The best way to run this notebook is to clone this repo, install Jupyter Notebook (http://jupyter.org/install.html), and execute each cell on your own.

2) If you follow the advice in Step 1, you can skip the scraping and correction phase, as all datasets are included in the repo.

If you choose not to follow along in your own notebook, it shouldn't be hard to follow along just by reading each cell.

Enjoy!

-Drew

# 1 - Wrangling the NWCM Data

The first data engineering task is to wrangle our data:
- Find the link to the NWCM discography and scrape the links of all U.S. series NWCM volumes (exclude NWCM Christmas, Party Anthems, and the like)
- Visit each link, scrape the tracklist items: artist, title, and album release date. Store as a dictionary, append to a master list
- Authenticate with Spotify as we are going to be querying for our audio features
- Clean up our title and artist fields a bit to make sure our query returns results (cleaning rules are defined in the util python file)
- Query for our audio features, popularity, and general naming info from Spotify, add this to the song dictionary item in the data list.
- When complete, write this out to a csv

- Here things get a little more interesting. During the scraping process, if a Spotify query did not return a result, the title and artist were written out to a file named 'corrections.csv'. I can then manually search for the track in the Spotify application and add its ID to that file. When I run 'make_corrections', the function reads the corrections file and fills in the missing Spotify data in our main file.

**IMPROVEMENT**: This function was only really meant to be run once, so any empty Spotify queries are appended to the corrections file regardless of whether they already appear in it. Ideally, we would not write a line that is already in the file. Going forward, it may make sense to maintain a database instead.
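
One way to avoid the duplicate rows is to check what is already in the file before appending. The sketch below is not the repo's scraping code; the helper name `append_correction` and the `title`/`artist`/`spotify_id` column layout are assumptions made for illustration.

```python
import csv
import os

FIELDS = ["title", "artist", "spotify_id"]  # assumed column layout


def append_correction(path, title, artist):
    """Append (title, artist) to the corrections file unless it is already listed.

    Returns True if a new row was written, False if it was a duplicate.
    """
    existing = set()
    if os.path.exists(path):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                existing.add((row["title"], row["artist"]))

    if (title, artist) in existing:
        return False

    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        # spotify_id is left blank so it can be filled in by hand later
        writer.writerow({"title": title, "artist": artist, "spotify_id": ""})
    return True
```

Calling it twice with the same title and artist writes only one row, so the scrape could be re-run without bloating the file.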

- Our NWCM music data is shaped and cleaned; the next step is to wrangle the Billboard data. Billboard data dating further back than our Now albums is available, so I need some sense of how much data to scrape. I decided to add a timedelta, the time lapse between the song's release and the Now album's release, to figure out when the best time to make predictions is and how far back before our first Now album we have to scrape.
    - To accomplish this, I had to clean up our dates and convert them to datetimes. I then created the timedelta by taking the days attribute of the difference between these two datetimes.
    - There are many values < 0 because these popular songs later appear on 'Greatest Hits' albums or album rereleases (think Single Edit or Deluxe Version), so we will just ignore them for now.
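
The timedelta step can be sketched as follows. `set_timedelta` itself lives in the utils module and is not shown in this notebook, so the function name `days_between`, its arguments, and the ISO date format are all assumptions.

```python
from datetime import datetime


def days_between(song_release, now_release, fmt="%Y-%m-%d"):
    """Days from the song's release to the Now album's release.

    Returns None when either date is missing or cannot be parsed, so the
    value can be dropped later with dropna().
    """
    try:
        song_date = datetime.strptime(song_release, fmt)
        album_date = datetime.strptime(now_release, fmt)
    except (TypeError, ValueError):
        return None
    # .days is the whole-day component of the timedelta
    return (album_date - song_date).days


print(days_between("2016-01-01", "2016-08-08"))  # 220 days, about 31.4 weeks
print(days_between("2016-08-08", "2016-01-01"))  # negative: song "released" after the Now album
```

Returning None for unparseable dates (a bare year, for example) keeps the bad rows visible instead of silently guessing a day and month.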

In [116]:
# importing the custom modules from my util folder
from utils.utils import write_list_of_dictionaries_to_file, read_csv_to_list_of_dictionaries,\
    make_corrections, set_timedelta, filter_dataframe_by_year, make_predictions
    
from utils.wiki_scrape import get_links_from_page, scrape_tracklisting_from_album_link,\
    clean_duration, clean_artist, clean_title
    
from utils.spotify_utils import create_auth, set_spotify_data, set_spotify_id, set_artist_data,\
    set_audio_features, set_album_data, set_track_data
    
from utils.billboard_utils import scrape_billboard

# importing sklearn modules for use in our classifier section
from sklearn.feature_selection import SelectKBest, f_classif
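# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in sklearn 0.20)
from sklearn.model_selection import train_test_split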
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import pandas as pd

### This cell takes ~18 minutes to run. I would suggest you download its resulting file if you are following along

In [2]:
%%time

# get list of album links
url = "https://en.wikipedia.org/wiki/Now_That%27s_What_I_Call_Music!_discography"
pattern = r"https://en\.wikipedia\.org/wiki/Now_That%27s_What_I_Call_Music!_\d+_\(U\.S\._series\)"
url_prepend = "https://en.wikipedia.org"

album_links = get_links_from_page(url, pattern, url_prepend)

# appending the original album link since it doesn't fit our regex pattern
album_links.append("https://en.wikipedia.org/wiki/Now_That%27s_What_I_Call_Music!_(original_U.S._album)")

# scrape each tracklisting, extending the data list each album
now = []
for album in album_links:
    now.extend(scrape_tracklisting_from_album_link(album))

# Authenticating with Spotify
sp = create_auth()

# clean fields to improve spotify query accuracy
for song in now:
    song['duration'] = clean_duration(song['duration'])
    song['title'] = clean_title(song['title'])
    song['artist'] = clean_artist(song['artist'])
    # Add clean date function here to set date format to datetime
    set_spotify_data(song, sp)

# write to csv
write_list_of_dictionaries_to_file(now, "NWCM_Spotify_Features.csv")

Can't find: All of Me
Can't find: F**kin' Perfect
Can't find: Let U Go
Can't find: Fresh AZIMIZ
Can't find: Baby It's You
Can't find: '03 Bonnie & Clyde
Can't find: Fast Forward
Can't find: Obsession
Can't find: Get the Party Started/Sweet Dreams
Can't find: Shut the Front Door
Can't find: Sugarhigh
Can't find: This Summer's Gonna Hurt Like a Mother...
Can't find: Be in Love Tonight
Can't find: Everybody's Free
Can't find: I Care 4 U
Can't find: Don't Trust Me
Can't find: This or That
Can't find: I Believe
Can't find: Poisoned with Love
Can't find: Thinking for a While
Can't find: AM to PM
Can't find: Me, Myself & I
Can't find: Get Yourself Back Home
Can't find: Independent Women Part I
Wall time: 18min 8s


In [2]:
# read and update our file with any corrections listed in the corrections file
make_corrections("NWCM_Spotify_Features.csv", "corrections.csv")

Updated All of Me by John Legend


# 2 - How much training data do we need?

I can scrape back as far as I'd like, but songs on the Billboard charts in, say, 1941 are unlikely to be useful in separating today's hits from 'now-worthy' hits. I decided a good question to answer would be: what is the lapse in time between a song's release and its appearance on a Now album? This should tell me not only how much data I should collect for training, but also how far back I need to go when making predictions.

In [117]:
# update our data from the manually updated file
now = read_csv_to_list_of_dictionaries("NWCM_Spotify_Features.csv")

# set timedelta column
for song in now:
    set_timedelta(song)
    
# read into a dataframe, drop na values in timedelta, keep only valid times (i.e. >= 0 days difference)
# find the median number of days and convert it to weeks
nowdf = pd.DataFrame(now)
nowdf.timedelta\
    .dropna(how='any', inplace=False)\
    .loc[nowdf.timedelta >= 0]\
    .median()/7

31.428571428571427

Looks like 31 weeks before our first Now album release is a good cutoff for our training data collection.

**Question**: I am assuming that, with the greater access to music today (relative to 1998), songs do not stay 'popular' as long as they used to. A good way to examine this would be to look at how this delta has changed over the course of our Now history.
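
One way to look into this question is to group the delta by Now release year and compare medians. A minimal sketch, assuming a `now_release_date` column next to `timedelta` (a hypothetical name; the actual column in the CSV may differ). The rows below are made-up placeholders to show the call, not results.

```python
import pandas as pd


def median_weeks_by_year(df, date_col="now_release_date", delta_col="timedelta"):
    """Median song-to-Now lag, in weeks, for each Now release year."""
    valid = df.dropna(subset=[date_col, delta_col])
    valid = valid[valid[delta_col] >= 0]           # ignore rereleases (negative deltas)
    years = pd.to_datetime(valid[date_col]).dt.year
    return valid[delta_col].groupby(years).median() / 7.0


# toy rows purely to show the call; not real data
toy = pd.DataFrame({
    "now_release_date": ["2000-06-01", "2000-06-01", "2016-06-01", "2016-06-01"],
    "timedelta": [280, 140, 70, -30],
})
print(median_weeks_by_year(toy))
```

If the assumption holds, the medians should trend downward for more recent years.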

# 3 - Scraping our Billboard data

Rather than scraping all the way back 31 weeks before the first album as I did during the first pass of this project, I will let you in on a little inside knowledge I gained during the feature selection phase. 

Popularity is key. Popularity also degrades as total daily play count on Spotify declines. 

This is a huge limitation on the amount of data we can use, as songs on Now albums have much lower popularity scores today than they had when they were being selected for that season's release. In essence, unless a song from Now 30 is still extremely popular today, we would be training our model on popularity features from **today** when we really need popularity scores from when the NWCM album was released. For this reason, we will only be using data from 2016 onward.

This problem could be remedied if we had a figure such as 'peak popularity' or a series of historical popularity scores for each track. While this is currently unavailable, it is a feature I could start tracking to improve the model going forward. It wouldn't make our model any stronger today, but it might in a few years' time.
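
Tracking this going forward could be as simple as logging a dated popularity reading per track and taking the maximum later. A sketch under assumptions: the file layout and the helper names `record_snapshot` and `peak_popularity` are hypothetical, and the popularity value would come from whatever Spotify lookup the scraper already performs.

```python
import csv
import os
from collections import defaultdict

SNAPSHOT_FIELDS = ["track_id", "date", "popularity"]  # assumed layout


def record_snapshot(path, track_id, date, popularity):
    """Append one dated popularity reading for a track."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(SNAPSHOT_FIELDS)
        writer.writerow([track_id, date, popularity])


def peak_popularity(path):
    """Return {track_id: highest popularity recorded so far}."""
    peaks = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            track = row["track_id"]
            peaks[track] = max(peaks[track], int(row["popularity"]))
    return dict(peaks)
```

Run weekly alongside the Billboard scrape, this would build up the history that a 'peak popularity' feature needs.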

### This cell takes ~16 minutes to run. I would suggest you download its resulting file if you are following along

In [93]:
%%time
# Let's scrape back two years to start
billboard = scrape_billboard(104)

# Write to csv
write_list_of_dictionaries_to_file(billboard, "Billboard_Spotify_Features.csv")

Getting data from: 2017-02-18
Getting data from: 2017-02-11
Getting data from: 2017-02-04
Getting data from: 2017-01-28
Getting data from: 2017-01-21
Getting data from: 2017-01-14
Getting data from: 2017-01-07
Getting data from: 2016-12-31
Getting data from: 2016-12-24
Getting data from: 2016-12-17
Getting data from: 2016-12-10
Getting data from: 2016-12-03
Getting data from: 2016-11-26
Getting data from: 2016-11-19
Getting data from: 2016-11-12
Getting data from: 2016-11-05
Getting data from: 2016-10-29
Getting data from: 2016-10-22
Getting data from: 2016-10-15
Getting data from: 2016-10-08
Getting data from: 2016-10-01
Getting data from: 2016-09-24
Getting data from: 2016-09-17
Getting data from: 2016-09-10
Getting data from: 2016-09-03
Getting data from: 2016-08-27
Getting data from: 2016-08-20
Getting data from: 2016-08-13
Getting data from: 2016-08-06
Getting data from: 2016-07-30
Getting data from: 2016-07-23
Getting data from: 2016-07-16
Getting data from: 2016-07-09
Getting da

# 4 - Forming the Training Data Set

To create our training data set, we first need to merge our Billboard and Now data.

The process is as follows:

- read files to dataframes, add respective 'now' labels
- append the dataframes and set results to training
- clean up columns so they can be passed on to our feature investigation phase
- subset this dataset by album release year to accommodate the popularity decay

In [118]:
# reading in our csv files
now = pd.read_csv("NWCM_Spotify_Features.csv")
billboard = pd.read_csv("Billboard_Spotify_Features.csv")

# add now labels
now['now'] = 1
billboard['now'] = 0

# append dataframes (pd.concat replaces DataFrame.append, which was removed in pandas 2.0),
# project only the columns we will be using in training and drop na
training = pd.concat([now, billboard])
training = training[['track_id', 'acousticness', 'album_popularity', 'artist_popularity',
 'album_release_date', 'danceability', 'duration_ms', 'energy',
 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'now',
 'speechiness', 'tempo', 'track_popularity', 'valence']]

# an improvement in the future would be to fill na in now and billboard with median values, respectively
training.dropna(how='any', inplace=True)

# Filter training dataset by album release year
training = filter_dataframe_by_year(training, 'album_release_date', 2015)

# Dropping duplicate track_ids (i.e. removing billboard songs that appear on a now cd).
# The appended frame has repeated index labels, so dropping by index could remove unrelated rows;
# drop_duplicates keeps the first occurrence, and the now rows come first.
training = training.drop_duplicates(subset='track_id', keep='first')

# Write to csv
training.to_csv("TrainingData.csv")

# 5 - Creating our Classifier

In this section of the project we will select our features and form the training data, train a few different classifiers, evaluate them on accuracy, and finally make our predictions for the next now album!

## Read Training Data, Select Features, and Create Splits

Here we are reading in our training set, selecting our best features with SelectKBest (k = 5 in the cell below), and training 3 different classifiers, which we will evaluate on their accuracy.

I ended up selecting the KNN classifier, as it consistently has the highest accuracy.

Something to improve in the next iteration would be making use of sklearn's Pipeline and GridSearchCV. For this project I wanted to experiment and turn some 'knobs' myself, but it would be much wiser to automate this with the tools sklearn provides going forward.
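
As a rough idea of what that automation could look like, here is a minimal sketch that wraps scaling, feature selection, and KNN in a `Pipeline` and searches over `k` and the KNN settings. The parameter grid and the synthetic data from `make_classification` are placeholders, not the values or data used in this project, and the `StandardScaler` step is an addition that the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the 15 training features
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # KNN is distance based, so scale first
    ("select", SelectKBest(score_func=f_classif)),  # same selector as in the notebook
    ("knn", KNeighborsClassifier()),
])

# placeholder grid: step name + "__" + parameter name
param_grid = {
    "select__k": [3, 5, 8],
    "knn__n_neighbors": [5, 11, 15],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the selector sits inside the pipeline, it is refit on each training fold, which avoids picking features with knowledge of the held-out data.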

In [127]:
# Load training data
data = pd.read_csv("TrainingData.csv")

# Separating the 'now' label (y) from our candidate features (x)
x = data[[ 'acousticness', 'album_popularity', 'artist_popularity', 'danceability', 'duration_ms',
 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo',
 'track_popularity', 'valence']]
y = data['now']

# Selecting our features (fit the selector once and reuse it for the feature names)
k = 5
selector = SelectKBest(score_func=f_classif, k=k).fit(x, y)
x_new = selector.transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_new, y, train_size = 0.8)

feature_names = list(x.columns[selector.get_support()])
print("Features in use: " + str(feature_names))

# Creating 3 Classifiers to test
classifiers = [
    {'name': "Decision Tree", 'classifier' : DecisionTreeClassifier(criterion='entropy')},
    {'name': 'SVM', 'classifier' : SVC(C=10)},
    {'name': 'KNN', 'classifier': KNeighborsClassifier(n_neighbors=11, weights='uniform')}
]

for clf in classifiers:
    clf['classifier'].fit(x_train, y_train)
    print("{}\tScore: {}".format(clf['name'], clf['classifier'].score(x_test, y_test)))

Features in use: ['album_popularity', 'artist_popularity', 'energy', 'loudness', 'track_popularity']
Decision Tree	Score: 0.812865497076
SVM	Score: 0.842105263158
KNN	Score: 0.847953216374


It looks like our model predicts with **~84% accuracy**!

Now let's see what songs will appear on the next album!

# 6 - Making Predictions

In [115]:
%%time
#  get last 31 weeks of popular music from billboard
recent_popular_songs = scrape_billboard(31)

#  get audio features, project to our predictive features
pop_songs_df = pd.DataFrame(recent_popular_songs)
pop_songs_df.dropna(how='any', inplace=True)

# make predictions on our current popular songs using KNN classifier
album_predictions = make_predictions(pop_songs_df, classifiers[2]['classifier'], feature_names)

print("\nPredictions for the Next Now album:\n")
# print our predictions
for song in album_predictions:
    # print artist, title, and confidence
    print("{} {} {}".format(song[0], song[1], song[2]))

Getting data from: 2017-02-18
Getting data from: 2017-02-11
Getting data from: 2017-02-04
Getting data from: 2017-01-28
Getting data from: 2017-01-21
Getting data from: 2017-01-14
Getting data from: 2017-01-07
Getting data from: 2016-12-31
Getting data from: 2016-12-24
Getting data from: 2016-12-17
Getting data from: 2016-12-10
Getting data from: 2016-12-03
Getting data from: 2016-11-26
Getting data from: 2016-11-19
Getting data from: 2016-11-12
Getting data from: 2016-11-05
Getting data from: 2016-10-29
Getting data from: 2016-10-22
Getting data from: 2016-10-15
Getting data from: 2016-10-08
Getting data from: 2016-10-01
Getting data from: 2016-09-24
Getting data from: 2016-09-17
Getting data from: 2016-09-10
Getting data from: 2016-09-03
Getting data from: 2016-08-27
Getting data from: 2016-08-20
Getting data from: 2016-08-13
Getting data from: 2016-08-06
Getting data from: 2016-07-30
Getting data from: 2016-07-23

Predictions for the Next Now album:

Meek Mill Litty (feat. Tory Lane
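
`make_predictions` is defined in the utils module and is not shown in this notebook. A minimal sketch of what ranking songs by confidence could look like, assuming the classifier exposes `predict_proba` (KNN does) and that the dataframe has `artist` and `title` columns; the function name `top_n_by_confidence` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def top_n_by_confidence(songs_df, clf, feature_names, n=20):
    """Return [(artist, title, confidence)] for the n most 'now-worthy' songs."""
    proba = clf.predict_proba(songs_df[feature_names].values)
    # column of predict_proba that corresponds to the positive class (now == 1)
    positive_col = list(clf.classes_).index(1)
    ranked = songs_df.assign(confidence=proba[:, positive_col])
    ranked = ranked.sort_values("confidence", ascending=False).head(n)
    return list(zip(ranked["artist"], ranked["title"], ranked["confidence"]))


# toy classifier and songs, purely to show the call
toy_clf = KNeighborsClassifier(n_neighbors=3).fit(
    [[0], [1], [2], [10], [11], [12]], [0, 0, 0, 1, 1, 1])
toy_songs = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "title": ["t1", "t2", "t3"],
    "track_popularity": [1, 11, 3],
})
for artist, title, confidence in top_n_by_confidence(toy_songs, toy_clf, ["track_popularity"], n=2):
    print("{} {} {}".format(artist, title, confidence))
```

For KNN with uniform weights, the confidence is just the share of the nearest neighbors labeled 'now', so with 11 neighbors there are only 12 possible values and ties are common.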

# 7 - Improvements in the Future

### Current Problems and Limitations

- Predictions could theoretically include songs that were already released on NWCM albums in the last 9 months
- Predictions are limited to what has appeared on the Billboard Top 100 in the last 9 months
- Predictions do not take into account proximity to the next Now release. For example, if the next Now album (call it release n) is due in a week, the predictions are more likely to be suitable for release n+1. This classifier really answers the question, **'Of the Billboard Top 100 charts for the past 9 months, which 20 songs are most likely to appear on a Now CD in the future?'**
- Lack of reliable features. We would really want a track's peak album, artist, and track popularity instead of today's popularity
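
The first limitation could be handled by dropping candidates whose `track_id` already appears in the NWCM data before predicting. A sketch, assuming both frames carry a `track_id` column as the training data does; `exclude_known_now_tracks` is a hypothetical helper name.

```python
import pandas as pd


def exclude_known_now_tracks(candidates, now_df, key="track_id"):
    """Keep only candidate songs that have not already appeared on a Now album."""
    already_released = set(now_df[key].dropna())
    return candidates[~candidates[key].isin(already_released)]
```

This only catches exact ID matches; a single edit or deluxe version of the same song has a different `track_id` and would slip through.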

### Data and Features

- Timedelta: I computed the difference in time between the NWCM album and original album release, but there were several issues with determining the actual release time of the song. Spending time correcting these errors and determining a methodology for finding a Billboard song's time delta is a costly effort, but it is the highest priority for improving this model
- Max popularity as a feature. The biggest problem with my classifier is lack of quality features. I went from experimenting with ~15 features to just 3; and those 3 are all popularity based. Spotify popularity changes over time, making it an unreliable feature for older tracks and thus limiting the amount of our training data we could have used. If I were to host this on a server and preserve the popularity for songs at the time they appear on a Now album, I could create a much stronger classifier going forward (either that or Spotify could release a historical popularity endpoint in their API).
- Experimentation with other features: Some ideas I have had for other features to wrangle during the course of this project were:
    - Does this song appear on a movie soundtrack?
    - Number of albums this song appears on in Spotify
    - Count of 'alternate' versions on Spotify (i.e. what is the count of 'remix' or 'radio edit' type versions of this song on Spotify)
- **Imputation**: Taylor Swift songs account for 1.3% of our Now data and every data droplet is important. I mistakenly dropped songs from both our Billboard and NWCM scrapes that did not feature a Spotify ID. I would rather have filled these values with median values.
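
Median imputation along those lines might look like the sketch below. It fills numeric gaps with the median of each row's own group (`now == 1` and `now == 0` separately), which is one reading of 'respectively' in the earlier code comment; the helper name `fill_with_group_median` is hypothetical.

```python
import pandas as pd


def fill_with_group_median(df, feature_cols, group_col="now"):
    """Fill NaNs in feature_cols with the median of the row's own group."""
    filled = df.copy()
    # transform() returns per-row medians aligned to the original index
    medians = filled.groupby(group_col)[feature_cols].transform("median")
    filled[feature_cols] = filled[feature_cols].fillna(medians)
    return filled
```

A song with no Spotify ID has no audio features at all, so every one of its features would be imputed; it may be worth flagging such rows so their influence can be checked.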

### Algorithms

- I tried out a few popular machine learning algorithms here and chose the one with the best performance. I did not, however, pursue any ensemble methods (combining multiple algorithms to create one stronger model)
- Collaborative filtering: I know a lot of recommendation systems use a collaborative filtering model; perhaps another day I will revisit this project and try to fit it to this type of algorithm
- Utilizing sklearn's GridSearchCV to find the best tuning parameters (I ran experiments myself, but I would like to make my experiments more reproducible and automate the process next go-around)
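
For the ensemble idea, sklearn's `VotingClassifier` can combine the three classifiers tried in section 5. A minimal sketch on synthetic data; the hyperparameters mirror the ones used earlier, the `random_state` values are only there to make the sketch repeatable, and hard voting is just one option.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the 5 selected features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(criterion="entropy", random_state=0)),
        ("svm", SVC(C=10)),
        ("knn", KNeighborsClassifier(n_neighbors=11, weights="uniform")),
    ],
    voting="hard",  # majority vote on the predicted labels
)
ensemble.fit(x_train, y_train)
print(ensemble.score(x_test, y_test))
```

Soft voting (averaging predicted probabilities) would also give a confidence to rank by, but it requires `SVC(probability=True)`.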

### Organization

- I typically scrape data through a series of 'run-once' type scripts, but this is not a good habit if you want to have a clean notebook; further improving the scraping functions so they are safe to re-run would be nice.
- Using a data store other than multiple csv files could help
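
As one possibility, the standard library's `sqlite3` module would give a single table keyed on `track_id`, so re-running a scrape updates rows instead of duplicating them. The table name, column names, and helper names here are assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the track store; ':memory:' is handy for experiments."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               track_id TEXT PRIMARY KEY,
               title TEXT,
               artist TEXT,
               track_popularity INTEGER
           )"""
    )
    return conn


def upsert_track(conn, track_id, title, artist, track_popularity):
    """Insert the track, or overwrite the existing row with the same track_id."""
    conn.execute(
        "INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?)",
        (track_id, title, artist, track_popularity),
    )
    conn.commit()
```

pandas can read such a table straight into a dataframe with `pd.read_sql_query`, so the rest of the notebook would not need to change much.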

### Presentation

- While the purpose of this project was to have fun, experiment with machine learning, and learn something new, I feel this project could really benefit from a simple front end where users can see the prediction process unfold in front of them.
- Hosting this on a server could also allow better data storage strategies (i.e. having one master table that we read from and update with popularity and new Billboard songs each week so we don't have to scrape the most recent data for a prediction)