## Day 6 Exercise
A. (25 mins) **Genre Classification (Individual)**

   1. Choose up to 6 music genres and obtain track data from each genre's top 20 most-followed playlists on Spotify. If you are working in a group, you may split the data gathering by assigning a genre to each person and then pooling all the gathered data in one shared folder.
   Alternatively, you may use the provided sample playlist data.
    
   2. Pick any 2 music genres as your groupings for the classification exercise and repeat Steps 1-7. Make sure to answer the guide questions for each step.
   
   3. Increase the number of features included in the models and repeat Steps 1-7 (skip the plotting code cells, since the visualizations only work with 2 features). How does this affect the model scores? Find the combination of features that gives you the best accuracy score.
   
   4. CHALLENGE (optional) Modify the notebook to take in any 3 music genres as groupings and repeat Steps 1-7.

B. (10 mins) **Group sharing**

Take turns presenting your notebook and code answers to the whole group. Be brief and discuss only your best result.

-----

2. *(Optional, but useful to do ahead for your sprint project)*

    There are almost [innumerable](https://www.musicgenreslist.com/) named music genres online, but a summarized list may be found [here](https://www.blisshq.com/music-library-management-blog/2011/01/25/fundamental-music-genre-list/).
    
    Can you build a model that can predict **at least 5 genres** listed in the latter with **>70% classification accuracy**?

*Submit this notebook at the end of class time*

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Read and check values of 2 playlist sets

In [None]:
#set keyword
KEYWORD1='rock'

In [None]:
# read and process the playlist data for keyword
playlist1_df = pd.read_csv('data/playlists/'+KEYWORD1+'_playlist_data.csv')
playlist1_df.head()

In [None]:
playlist1_df.shape

In [None]:
# read and process the playlist data for keyword
tracks1_df = pd.read_csv('data/playlists/'+KEYWORD1+'_playlist_tracks_data.csv')\
                .merge(pd.read_csv('data/playlists/'+KEYWORD1+'_playlist_tracks.csv')[['track_id','playlist_id','playlist_name']],\
                      on='track_id',how='left')
#make duration ms to minutes
tracks1_df['duration_mins']=tracks1_df['duration']/60000
#tag genre with keyword
tracks1_df['genre']=KEYWORD1
tracks1_df.head()

In [None]:
tracks1_df.shape

In [None]:
# How many unique tracks are in playlist set 1?
len(tracks1_df['track_id'].unique())

In [None]:
# What is the distribution of playlist set 1's total tracks?
playlist1_df['playlist_total_tracks'].hist()

In [None]:
len(playlist1_df[playlist1_df['playlist_total_tracks']>10])

In [None]:
# What is the distribution of playlist set 1's total followers?
playlist1_df['total_followers'].hist()

In [None]:
###################### set keyword
KEYWORD2='R&B'

In [None]:
# read and process the playlist data for keyword
playlist2_df = pd.read_csv('data/playlists/'+KEYWORD2+'_playlist_data.csv')
playlist2_df.head(20)

In [None]:
playlist2_df.shape

In [None]:
# read and process the playlist data for keyword
tracks2_df = pd.read_csv('data/playlists/'+KEYWORD2+'_playlist_tracks_data.csv')\
                .merge(pd.read_csv('data/playlists/'+KEYWORD2+'_playlist_tracks.csv')[['track_id','playlist_id','playlist_name']],\
                      on='track_id',how='left')
#make duration ms to minutes
tracks2_df['duration_mins']=tracks2_df['duration']/60000
#tag genre with keyword
tracks2_df['genre']=KEYWORD2
tracks2_df.head()

In [None]:
tracks2_df.shape

In [None]:
# How many unique tracks are in playlist set 2?
len(tracks2_df['track_id'].unique())

## 2. Compare histograms of 2 playlist sets

In [None]:
for col in ['danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']:
    fig,ax = plt.subplots()
    
    sns.histplot(tracks1_df[col], ax=ax, label= KEYWORD1, kde=True, color='C0', edgecolor='None')
    sns.histplot(tracks2_df[col], ax=ax, label= KEYWORD2,  kde=True, color='C1', edgecolor='None')
    plt.title("%s vs %s: %s " % (KEYWORD1,KEYWORD2,col))
    plt.ylabel('Frequency')
    plt.legend(frameon=False)
    plt.show()


>Q: What feature/s best distinguish the 2 categories from each other? Does it make sense to use this as a feature for a classification model?
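
One way to back up a visual read of the histograms is to rank features by how far apart the two group means sit relative to their spread. The sketch below uses the standardized mean difference (Cohen's d), which is a swapped-in technique and not part of the original notebook. The function name `feature_separation` and the toy data are hypothetical stand-ins for `tracks1_df` and `tracks2_df`.

```python
import numpy as np
import pandas as pd

def feature_separation(df_a, df_b, cols):
    """Rank features by absolute standardized mean difference (Cohen's d)."""
    rows = []
    for col in cols:
        a, b = df_a[col].dropna(), df_b[col].dropna()
        # pooled standard deviation of the two groups
        pooled_sd = np.sqrt(((len(a) - 1) * a.var() + (len(b) - 1) * b.var())
                            / (len(a) + len(b) - 2))
        rows.append({'feature': col, 'abs_d': abs(a.mean() - b.mean()) / pooled_sd})
    return (pd.DataFrame(rows)
              .sort_values('abs_d', ascending=False)
              .reset_index(drop=True))

# synthetic stand-ins for the two playlist sets (made-up numbers)
rng = np.random.default_rng(0)
toy1 = pd.DataFrame({'energy': rng.normal(0.8, 0.1, 200),
                     'valence': rng.normal(0.5, 0.2, 200)})
toy2 = pd.DataFrame({'energy': rng.normal(0.5, 0.1, 200),
                     'valence': rng.normal(0.5, 0.2, 200)})

ranking = feature_separation(toy1, toy2, ['energy', 'valence'])
print(ranking)
```

A larger `abs_d` suggests less overlap between the two histograms. It only captures differences in means, so a feature whose distributions differ in shape but not in center would still score low.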

## 3. Feature Engineering

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()   #RobustScaler would also work

In [None]:
#get union of two playlist tracks list
tracks_df = pd.concat([tracks1_df,tracks2_df])
tracks1_df.shape, tracks2_df.shape, tracks_df.shape

In [None]:
#retain only distinct tracks; a track found under both keywords keeps its first occurrence
tracks_df =tracks_df.drop_duplicates(subset='track_id')
tracks_df.shape

In [None]:
#Normalize loudness
tracks_df['loudness'] = scaler.fit_transform(tracks_df[['loudness']])
tracks_df['loudness'].describe()


In [None]:
#Normalize tempo
tracks_df['tempo'] =  scaler.fit_transform(tracks_df[['tempo']])
#check
tracks_df['tempo'].describe()
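
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below checks that the manual formula agrees with `MinMaxScaler` on a few made-up tempo values. The `toy` frame is an illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({'tempo': [60.0, 90.0, 120.0, 180.0]})  # made-up BPM values

# manual min-max scaling: (x - min) / (max - min)
manual = (toy['tempo'] - toy['tempo'].min()) / (toy['tempo'].max() - toy['tempo'].min())

# the same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy[['tempo']]).ravel()

print(manual.tolist())
print(np.allclose(manual, scaled))
```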


In [None]:
# map genres to numbers
tracks_df['genre_id'] = tracks_df['genre'].map({KEYWORD1:1,KEYWORD2:2})

## 4. Preview possible classification results

>Q: Pick the 2 best distinguishing features of the 2 playlist sets and plot each row as a scatterplot/distplot colored by genre


In [None]:
feature_cols = ['X','Y']

In [None]:
sns.scatterplot(data=tracks_df, x=feature_cols[0], y=feature_cols[1], hue='genre')

In [None]:
fig = plt.figure()
ax= fig.add_subplot(111)

colormaps = ['Blues','Oranges']
for n,genre in enumerate([KEYWORD1,KEYWORD2]):
    df=tracks_df[tracks_df['genre']==genre]
    sns.kdeplot(x=df[feature_cols[0]],y=df[feature_cols[1]], ax=ax,\
                fill=True, alpha=0.5, cmap=colormaps[n])

#hack for proper legend render
sns.scatterplot(data=tracks_df, x=feature_cols[0], y=feature_cols[1], hue='genre', s=0)

> Q: How would you interpret the resulting scatterplot/distribution?

In [None]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,roc_curve, auc, confusion_matrix, classification_report,\
    plot_confusion_matrix, plot_roc_curve


## 5. Model Tuning: kNN

Select audio features to use for the model

In [None]:
# create feature matrix (X)
# use the features selected in Step 4 (feature_cols)

X = tracks_df[feature_cols]
y = tracks_df['genre_id']
print(len(X),len(y))

In [None]:
n_neighbors = np.arange(2,51)
KFOLDS = 5

cv_scores_mean = []
cv_scores_std = []

for K in n_neighbors:
    print('Fitting KNN with K=%d ...' % K, end='')
    #initialize model
    knn_model = KNeighborsClassifier(n_neighbors=K)
    # get accuracy metric across train-test sets generated using k-folds
    scores = cross_val_score(knn_model, X, y, cv=KFOLDS, scoring='accuracy')
    # overall accuracy score of K is mean of accuracy scores per k-fold
    # std dev of scores across folds must be a minimum
    cv_scores_mean.append(scores.mean())
    cv_scores_std.append(scores.std())
    print('DONE!')
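
To make the cross-validation step less of a black box, the sketch below reproduces what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data with an unshuffled `StratifiedKFold`, fits on each training fold, and scores on the held-out fold. The synthetic data from `make_classification` is an assumption made for illustration and stands in for the track features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic 2-feature, 2-class data (stand-in for X and y)
X_toy, y_toy = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# manual loop: fit on 4 folds, score on the remaining fold, repeat 5 times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    knn.fit(X_toy[train_idx], y_toy[train_idx])
    fold_pred = knn.predict(X_toy[test_idx])
    manual_scores.append(accuracy_score(y_toy[test_idx], fold_pred))

# the one-liner used in the notebook
auto_scores = cross_val_score(knn, X_toy, y_toy, cv=5, scoring='accuracy')

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

The mean of these fold scores is the value stored in `cv_scores_mean` for each K, and their standard deviation is what goes into `cv_scores_std`.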

Choose optimal value of K

In [None]:
# determining best K
idx_max_accuracy = cv_scores_mean.index(max(cv_scores_mean))
optimal_K = n_neighbors[idx_max_accuracy]
print("The optimal number of neighbors is %d with accuracy %0.2f" % (optimal_K, cv_scores_mean[idx_max_accuracy]))

# plot metrics 
fig,axs = plt.subplots(1,2, figsize=(11,4))
axs[0].plot(n_neighbors, cv_scores_mean)
axs[0].plot(optimal_K,max(cv_scores_mean), marker="o", ms=7, color='r')
axs[0].set_xlabel("Number of Neighbors K")
axs[0].set_ylabel("Accuracy")

axs[1].plot(n_neighbors, cv_scores_std)
axs[1].plot(optimal_K,cv_scores_std[idx_max_accuracy], marker="o", ms=7, color='r')
axs[1].set_xlabel("Number of Neighbors K")
axs[1].set_ylabel("Accuracy standard deviation")


Fit the optimal model on the entire dataset

In [None]:
#initialize KNN with optimal K
knn_optimal_model = KNeighborsClassifier(n_neighbors=optimal_K )
# fitting the model with entire dataset
knn_optimal_model.fit(X, y)

Create a classification report

In [None]:
# predict the response
pred = knn_optimal_model.predict(X)
# evaluate accuracy
acc = accuracy_score(y, pred) * 100
print('\nThe accuracy of the knn classifier for the full dataset using k = %d is %f%%' % (optimal_K, acc))

## 6. Model Tuning: SVM

In [None]:
# https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html#sphx-glr-auto-examples-svm-plot-iris-svc-py
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    xmgn= (x.max()-x.min())*0.25
    ymgn = (y.max()-y.min())*0.25
    
    x_min, x_max = x.min() - xmgn, x.max() + xmgn
    y_min, y_max = y.min() - ymgn, y.max() + ymgn
    
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, xlims,ylims, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    ax.set_ylim(ylims)
    ax.set_xlim(xlims)
    
    return out

#visualize support vectors
def plot_vector_bounds(X,svm_model, show_points=True):
    fig,ax=plt.subplots()
    
    X0 = X.to_numpy()[:, 0]
    X1 = X.to_numpy()[:, 1]
    xx, yy = make_meshgrid(X0, X1)

    plot_contours(ax, svm_model, xx, yy, [0,1],[0,1],
                      cmap=plt.cm.coolwarm, alpha=0.8)
    if show_points:
        ax.scatter(X0, X1, c=y,cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlabel(X.columns[0])
    ax.set_ylabel(X.columns[1])

In [None]:
# create feature matrix (X)
feature_cols = ['energy','tempo']
X = tracks_df[feature_cols]
y = tracks_df['genre_id']

>Q: Go back to the scatter/distplot. What seems to be the appropriate kernel type to use for the classification?
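
As a rough illustration of why the shape of the class boundary matters for kernel choice, the sketch below compares the three kernels on synthetic concentric-circle data, where no straight line can separate the classes. The data from `make_circles` is an assumption made for illustration. Real track features will rarely be this clean.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# one class forms an inner ring, the other an outer ring
X_toy, y_toy = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    # default hyperparameters; mean accuracy over 5 folds
    kernel_scores[kernel] = cross_val_score(SVC(kernel=kernel), X_toy, y_toy, cv=5).mean()
    print(kernel, round(kernel_scores[kernel], 3))
```

If the scatterplot shows the two genres split roughly along a line, a linear kernel is usually enough. Curved or enclosed regions point toward a polynomial or radial kernel.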

Fit a **linear** kernel

In [None]:
def plot_scores(cv_scores_mean,cv_scores_std):
    fig,axs = plt.subplots(1,2, figsize=(11,4))
    
    x = np.arange(len(cv_scores_mean))
    max_mean_score_idx = cv_scores_mean.argmax()
    
    axs[0].plot(x, cv_scores_mean, marker='.', lw=0)
    axs[0].plot(x[max_mean_score_idx],max(cv_scores_mean), marker="o", ms=7, color='r')
    axs[0].set_xlabel("Model config type")
    axs[0].set_ylabel("Accuracy")

    axs[1].plot(x, cv_scores_std, marker='.', lw=0)
    axs[1].plot(x[max_mean_score_idx],cv_scores_std[max_mean_score_idx], marker="o", ms=7, color='r')
    axs[1].set_xlabel("Model config type")
    axs[1].set_ylabel("Accuracy standard deviation")


In [None]:
#Linear model
print('Fitting SVM with linear kernel...')

# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'kernel': ['linear']}
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 1, cv = KFOLDS )
# fitting the model for grid search
grid.fit(X, y)

#get scores
cv_scores_mean =  grid.cv_results_['mean_test_score']
cv_scores_std = grid.cv_results_['std_test_score']
max_mean_score_idx = cv_scores_mean.argmax()

print('Best model config score is %f%% (vs. overall mean score: %f )' % (100*cv_scores_mean[max_mean_score_idx],
                                                                        100*np.mean(cv_scores_mean)))
print('Std of best model score across folds is %f (vs. overall mean std: %f )' %\
      (cv_scores_std[max_mean_score_idx], np.mean(cv_scores_std)))

# get best model
svm_model1 = grid.best_estimator_
# fit model for entire data
svm_model1.fit(X, y)
pred1 = svm_model1.predict(X)
acc = accuracy_score(y, pred1) * 100
print('The accuracy of the SVM classifier for the full dataset is %f%%' % (acc))
print('DONE!')

In [None]:
#plot bounds
#note: plot_vector_bounds only works with 2 input features
plot_vector_bounds(X,svm_model1)

In [None]:
plot_scores(cv_scores_mean,cv_scores_std)

Fit a **polynomial** kernel

In [None]:
# defining parameter range
print('Fitting SVM with a polynomial kernel...')

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'degree': np.arange(2,6),
              'kernel': ['poly']}

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 1, cv = KFOLDS )
# fitting the model for grid search
grid.fit(X, y)
print('Best model is %s' % grid.best_estimator_)

#get scores
cv_scores_mean =  grid.cv_results_['mean_test_score']
cv_scores_std = grid.cv_results_['std_test_score']
max_mean_score_idx = cv_scores_mean.argmax()

print('Best model config score is %f%% (vs. overall mean score: %f )' % (100*cv_scores_mean[max_mean_score_idx],
                                                                        100*np.mean(cv_scores_mean)))
print('Std of best model score across folds is %f (vs. overall mean std: %f )' %\
      (cv_scores_std[max_mean_score_idx], np.mean(cv_scores_std)))

# get best model
svm_model2 = grid.best_estimator_
# fit model for entire data
svm_model2.fit(X, y)
pred2 = svm_model2.predict(X)
acc = accuracy_score(y, pred2) * 100
print('The accuracy of the SVM classifier for the full dataset is %f%%' % (acc))
print('DONE!')

In [None]:
#plot bounds
#note: plot_vector_bounds only works with 2 input features
plot_vector_bounds(X,svm_model2)

In [None]:
plot_scores(cv_scores_mean,cv_scores_std)

Fit a **radial** kernel

In [None]:
# defining parameter range
print('Fitting SVM with a radial (RBF) kernel...')

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 1, cv = KFOLDS )
# fitting the model for grid search
grid.fit(X, y)
print('Best model is %s' % grid.best_estimator_)

#get scores
cv_scores_mean =  grid.cv_results_['mean_test_score']
cv_scores_std = grid.cv_results_['std_test_score']
max_mean_score_idx = cv_scores_mean.argmax()

print('Best model config score is %f%% (vs. overall mean score: %f )' % (100*cv_scores_mean[max_mean_score_idx],
                                                                        100*np.mean(cv_scores_mean)))
print('Std of best model score across folds is %f (vs. overall mean std: %f )' %\
      (cv_scores_std[max_mean_score_idx], np.mean(cv_scores_std)))

# get best model
svm_model3 = grid.best_estimator_
# fit model for entire data
svm_model3.fit(X, y)
pred3 = svm_model3.predict(X)
acc = accuracy_score(y, pred3) * 100
print('The accuracy of the SVM classifier for the full dataset is %f%%' % (acc))
print('DONE!')

In [None]:
#plot bounds
plot_vector_bounds(X,svm_model3)

In [None]:
plot_scores(cv_scores_mean,cv_scores_std)

Select best SVM model

In [None]:
svm_optimal_model = svm_model2
#set probability=True to view classification probabilities and refit
svm_optimal_model.probability=True
svm_optimal_model.fit(X, y)

## 7. Model Selection

>Q: Which performed better, KNN or SVM?

In [None]:
plot_confusion_matrix(svm_optimal_model,X,y)

In [None]:
plot_confusion_matrix(knn_optimal_model,X,y)

In [None]:
print('-------------------------------------------------------------')
print('KNN')
print(classification_report(y,knn_optimal_model.predict(X)))
print('-------------------------------------------------------------')
print('SVM')
print(classification_report(y,svm_optimal_model.predict(X)))

- accuracy = share of all tracks whose genre was classified correctly
        (TP + TN) / (TP + TN + FP + FN)
- precision = share of tracks the model assigned to a genre that actually belong to it; penalizes false positives
        TP / (TP + FP)
- recall = share of tracks actually in a genre that the model assigned to it; penalizes false negatives
        TP / (TP + FN)
- f1-score = harmonic mean of precision and recall
        F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
- support = number of actual items in the class

- macro avg = unweighted mean of the per-label scores
- weighted avg = mean of the per-label scores, weighted by each label's support
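
The definitions above can be checked by hand against a confusion matrix. The sketch below uses ten made-up labels, treats genre 1 as the "positive" class, and compares the manual results with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# made-up actual and predicted genre ids
y_true = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 1, 1, 2, 2, 2, 2])

# rows = actual genre, columns = predicted genre
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
tp, fn = cm[0, 0], cm[0, 1]   # actual genre 1: predicted 1 / predicted 2
fp, tn = cm[1, 0], cm[1, 1]   # actual genre 2: predicted 1 / predicted 2

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 4))
```

Here 3 of the 5 tracks predicted as genre 1 are correct (precision 0.6), and 3 of the 4 actual genre 1 tracks are found (recall 0.75).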


In [None]:
#helper function
def plot_ROC(model,X,y):
    fig, ax = plt.subplots(figsize=(4,4))
    plot_roc_curve(model,X,y, ax=ax)
    #y=x line
    ax.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    #edit verbose legend
    default_legend = ax.get_legend_handles_labels()[1][0]
    ax.legend(labels=[default_legend.split(' (')[-1][:-1]],loc='lower right')


In [None]:
plot_ROC(knn_optimal_model,X,y)

In [None]:
plot_ROC(svm_optimal_model,X,y)
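
The `roc_curve` and `auc` functions imported earlier can produce the same numbers without a plot. The sketch below is a minimal example on synthetic data. Because the genre ids are 1 and 2 rather than 0 and 1, `roc_curve` needs an explicit `pos_label`. The toy data and the choice of genre 2 as the positive class are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.neighbors import KNeighborsClassifier

# synthetic data relabeled to 1/2 to mimic genre_id
X_toy, y01 = make_classification(n_samples=200, n_features=2, n_informative=2,
                                 n_redundant=0, random_state=0)
y_toy = y01 + 1

model = KNeighborsClassifier(n_neighbors=15).fit(X_toy, y_toy)

# probability of the positive class (genre 2) for every row
pos_col = list(model.classes_).index(2)
scores = model.predict_proba(X_toy)[:, pos_col]

# false/true positive rates at each probability threshold, then area under the curve
fpr, tpr, thresholds = roc_curve(y_toy, scores, pos_label=2)
roc_auc = auc(fpr, tpr)
print('AUC = %0.3f' % roc_auc)
```

An AUC near 0.5 corresponds to the dashed diagonal in the plots, while values closer to 1 indicate better separation. As in the notebook, this is scored on the training data, so it will look more optimistic than a held-out estimate.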

Choose optimal model among those above

In [None]:
optimal_model = 'XXXXXXXXXXXXX'

## 8. Verifying results using in-sample and out-of-sample predictions

**In-sample**

Check if the predicted genres match the genres originally tagged according to the Spotify playlist name. Focus on misclassified tracks with high prediction probability to identify possible model improvements.

In [None]:
tracks_df['predicted_genre_id'] = tracks_df.apply(lambda x:  optimal_model.predict(x[feature_cols].values.reshape(1,-1))[0]\
                                               , axis=1)
tracks_df['predicted_genre'] = tracks_df['predicted_genre_id'].map({1:KEYWORD1,2:KEYWORD2})
tracks_df['predicted_genre_prob'] = tracks_df.apply(lambda x:  np.max(optimal_model.predict_proba(x[feature_cols].values.reshape(1,-1)))\
                                                    , axis=1)
tracks_df.head()

In [None]:
#View histogram of probabilities
tracks_df['predicted_genre_prob'].hist()

Check tracks that were misclassified with high probability

In [None]:
tracks_df[(tracks_df['predicted_genre_id']!=tracks_df['genre_id'])&(tracks_df['predicted_genre_prob']>0.9)]\
        .sort_values('predicted_genre_prob', ascending=False)[['track_name','artist_name','genre','predicted_genre','predicted_genre_prob']]
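
The row-by-row `apply` calls used for the prediction columns can be slow on large track lists. A vectorized alternative is sketched below: call `predict_proba` once on the whole feature matrix, then take the most probable class and its probability per row. The toy frame, the column names `f1` and `f2`, and the KNN model are hypothetical stand-ins. Note that for an `SVC` fitted with `probability=True`, the class with the highest probability can occasionally differ from what `predict` returns.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for tracks_df with two feature columns and genre ids 1/2
X_arr, y_arr = make_classification(n_samples=100, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=1)
toy_df = pd.DataFrame(X_arr, columns=['f1', 'f2'])
toy_df['genre_id'] = y_arr + 1
toy_cols = ['f1', 'f2']

model = KNeighborsClassifier(n_neighbors=5).fit(toy_df[toy_cols], toy_df['genre_id'])

# one call for all rows; result has shape (n_rows, n_classes)
proba = model.predict_proba(toy_df[toy_cols])
toy_df['predicted_genre_id'] = model.classes_[proba.argmax(axis=1)]
toy_df['predicted_genre_prob'] = proba.max(axis=1)

print(toy_df.head())
```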

**Out-of-sample**

Check if the best model correctly predicts the genre of a track in the Top 200 charts (assuming most are not in the playlist data). You may validate the results subjectively as a listener, or look up the track on a genre-tagging site (e.g. https://www.chosic.com/music-genre-finder/).

In [None]:
chart_tracks_df = pd.read_csv("data/spotify_daily_charts_tracks.csv")
chart_tracks_df = chart_tracks_df.dropna()
chart_tracks_df.head()

In [None]:
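#scale tempo with the scaler last fitted on the playlist tempo values (Step 3),
#so chart tracks land on the same 0-1 scale the model was trained on
chart_tracks_df['tempo'] =  scaler.transform(chart_tracks_df[['tempo']])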
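
When scaling out-of-sample data, it matters whether the scaler keeps the training minimum and maximum or is refit on the new data. The sketch below, with made-up tempo values, shows that the same raw tempo can map to different scaled values in the two cases. The name `tempo_scaler` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'tempo': [60.0, 120.0, 180.0]})   # made-up training tempos
new = pd.DataFrame({'tempo': [100.0, 120.0, 140.0]})    # made-up out-of-sample tempos

tempo_scaler = MinMaxScaler().fit(train[['tempo']])

# reuse the training min/max: 100 BPM sits a third of the way between 60 and 180
reused = tempo_scaler.transform(new[['tempo']]).ravel()

# refit on the new data: 100 BPM is now the minimum, so it maps to 0
refit = MinMaxScaler().fit_transform(new[['tempo']]).ravel()

print(np.round(reused, 3))
print(np.round(refit, 3))
```

A model trained on the first scale would read the refit values as much slower or faster tracks than they are. The same caution applies to any other scaled feature, such as `loudness`, if it is included in `feature_cols`.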

In [None]:
#Create columns matching the predicted genre and probability of the best model to each of the tracks in the charts
chart_tracks_df['predicted_genre_id'] = chart_tracks_df.apply(lambda x:  optimal_model.predict(x[feature_cols].values.reshape(1,-1))[0]\
                                               , axis=1)
chart_tracks_df['predicted_genre'] = chart_tracks_df['predicted_genre_id'].map({1:KEYWORD1,2:KEYWORD2})
chart_tracks_df['predicted_genre_prob'] = chart_tracks_df.apply(lambda x:  np.max(optimal_model.predict_proba(x[feature_cols].values.reshape(1,-1)))\
                                                    , axis=1)
chart_tracks_df.head()

In [None]:
chart_tracks_df['predicted_genre'].value_counts()

In [None]:
#View histogram of probabilities
chart_tracks_df['predicted_genre_prob'].hist()

> Q: Can you identify tracks that were misclassified by the model?
    Does it make sense that the model misclassified the tracks given the model configuration? Why or why not?

In [None]:
#Check the 10 tracks classified as KEYWORD1 with the highest probability
chart_tracks_df[chart_tracks_df['predicted_genre']==KEYWORD1][['track_name','artist_name','predicted_genre','predicted_genre_prob']]\
            .sort_values(['predicted_genre_prob'],ascending=False)[:10]

In [None]:
#Check the 10 tracks classified as KEYWORD2 with the highest probability
chart_tracks_df[chart_tracks_df['predicted_genre']==KEYWORD2][['track_name','artist_name','predicted_genre','predicted_genre_prob']]\
            .sort_values(['predicted_genre_prob'],ascending=False)[:10]