# Spotify Artist Recommendation Project - Modeling

In the [previous notebook](https://github.com/dvanheeswijk/Spotify/blob/master/Data%20Wrangling%20and%20EDA.ipynb) we explored how the songs' features change over time. Here we build an artist recommendation model. We begin by loading the libraries and the [data](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from random import sample

from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv('data/data.csv')

data.head()

In [None]:
genre = pd.read_csv('data/data_by_genres.csv')

genre.head()

In [None]:
artists = pd.read_csv('data/data_w_genres.csv')

artists.head()

We will use all three data sets in the modeling. Most of the data used for training and testing comes from the **data** dataframe. The **genre** and **artists** dataframes help us judge how well the model has done.

## Genre Clusters

We would like a model that clusters song genres by the similarity of their audio features. To do this, we first clean the data a little, then scale it and fit the model.

In [None]:
data.info()

In [None]:
columns = genre.columns

columns

In [None]:
columns = columns[1:]
X = genre[columns]

In [None]:
# K-means is distance-based, so we rescale every feature to the [0, 1] range first
scaler = MinMaxScaler()

scaler.fit(X)

X_scaled = scaler.transform(X)

In [None]:
X_scaled = pd.DataFrame(X_scaled, columns=columns)
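
To make the scaling step concrete, here is a minimal sketch of what `MinMaxScaler` does to each column. The `toy` array holds made-up values, not rows from the genre data, and `manual` is a hypothetical name for the same result computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values): two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 40.0]])

toy_scaled = MinMaxScaler().fit_transform(toy)

# The same result computed by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(toy_scaled)
```

After scaling, every column spans 0 to 1, so a feature measured in large units (such as a duration in milliseconds) no longer dominates the distances that k-means relies on.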

In [None]:
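# Fit k-means for 1 to 99 clusters and record the inertia
# (the within-cluster sum of squared distances) for each fit
n_clusters = range(1, 100)
error = []
for n in n_clusters:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X_scaled)
    error.append(kmeans.inertia_)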

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=list(n_clusters), y=error)
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Number of clusters vs. error')

plt.show()

Judging from the plot, about 20 clusters is a reasonable choice for the number of "super" genres. We therefore fit a 20-cluster model and attach its labels to the existing genre data set.
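
Reading the number of clusters off an error curve is somewhat subjective. One complementary check, not used in this notebook, is the silhouette score, which is higher when clusters are compact and well separated. The sketch below runs on synthetic blobs with four known centers rather than on the genre data; `points`, `scores`, and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four tight, well-separated blobs (an assumption for illustration)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
points, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

# Score each candidate number of clusters; silhouette needs at least 2 clusters
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

On real audio features the peak is usually much less pronounced than on this toy data, so the score is best read alongside the error curve rather than instead of it.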

In [None]:
best_kmeans = KMeans(n_clusters=20)

super_genres = best_kmeans.fit_predict(X_scaled)

genre['super_genre'] = super_genres

genre.head()

So, let's take a quick look at our 20 super genres to confirm that they make sense.

In [None]:
genre_count = genre.groupby('super_genre')['genres'].count()

plt.figure(figsize=(12,7))
sns.barplot(x = genre_count.index, y=genre_count, color='gray')
plt.title('Number of genres per super genre')

The number of genres per super genre varies dramatically, from one super genre with about 10 genres to some with nearly 250. We will investigate a couple of super genres to see whether the combinations make sense.

In [None]:
print(genre[genre['super_genre']==17]['genres'])

In [None]:
# Sample from super genre 4
print(sample(list(genre[genre['super_genre']==4]['genres']),15))

In [None]:
# Sample from super genre 0
print(sample(list(genre[genre['super_genre']==0]['genres']),15))

In [None]:
# Sample from super genre 12
print(sample(list(genre[genre['super_genre']==12]['genres']),15))

In [None]:
# Sample from super genre 18
print(sample(list(genre[genre['super_genre']==18]['genres']),15))

In [None]:
# Sample from super genre 3
print(sample(list(genre[genre['super_genre']==3]['genres']),15))

The super genres do show recognizable similarities. Super genre 17 is full of spoken-word genres such as comedy, poetry, and reading. However, it also contains a few genres that have less in common with comedy, such as classic hungarian pop.

Looking at a few other super genres suggests that each one contains smaller clusters of similar genres. For example, super genre 18 puts indie folk, brooklyn indie, and piano rock together with trap, scam rap, and manchester hip hop. The first three clearly resemble each other, but they have less in common with the second three. To remedy this, we would need either to create more super genres or to accept that this phenomenon will occur when there are few super genres.

## Artist Recommendation

We will now switch to our other task, which is to create an artist recommendation model. To do this, we will use the data and artists dataframes to create our training data set.

In [None]:
artists = pd.read_csv('data/data_w_genres.csv')

artists.head()

In [None]:
data.head()

As before, we need to cluster the artists based on the data available. We therefore fit another clustering model that assigns a "genre" to each artist from that artist's audio features.

In [None]:
scaler = MinMaxScaler()
artists[columns] = scaler.fit_transform(artists[columns])

In [None]:
artists.describe()

In [None]:
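# K-means (not k-nearest neighbors): group the artists into 20 clusters
artist_kmeans = KMeans(n_clusters=20)

# Overwrite the original genre lists with the cluster label of each artist
artists['genres'] = artist_kmeans.fit_predict(artists[columns])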

artists.head()

Now that we have a cluster for each artist, we can build a recommendation system for a given user. We create a set of "users" who have each rated artists on a scale of 1-10. We then use those ratings to recommend artists drawn from the clusters of each user's five highest-rated artists.

In [None]:
artists['user_id'] = np.random.randint(1000,1500,len(artists))
artists['rating'] = np.random.randint(1,11,len(artists))

artists.head()

In [None]:
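def recommend(user_id):
    # Clusters of the user's five highest-rated artists, de-duplicated in rating order
    rated = artists[artists['user_id']==user_id].sort_values('rating', ascending=False)
    user_genres = list(dict.fromkeys(rated['genres'][:5]))
    # All artists in those clusters; the random sample below ignores the popularity ordering
    new_artists = artists[artists['genres'].isin(user_genres)].sort_values('popularity',ascending=False)['artists']
    return sample(list(new_artists),10)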

In [None]:
recommend(1014)

In [None]:
artists[artists['user_id']==1014].sort_values('rating',ascending=False)[['artists','genres', 'rating']]
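
The same idea can be sketched on a toy table where the result is easy to check by hand. This variant is an illustration under stated assumptions, not the notebook's function: `toy` holds made-up artists, `recommend_top` is a hypothetical helper, it ranks candidates by popularity instead of sampling at random, and it skips artists the user has already rated.

```python
import pandas as pd

# Made-up artists: cluster label, popularity, and one (user, rating) pair per row
toy = pd.DataFrame({
    'artists': ['A', 'B', 'C', 'D', 'E', 'F'],
    'genres': [0, 0, 1, 1, 2, 2],
    'popularity': [90, 40, 70, 60, 10, 80],
    'user_id': [1, 2, 1, 2, 2, 1],
    'rating': [9, 3, 7, 5, 8, 2],
})

def recommend_top(df, user_id, n_liked=2, n_recs=3):
    # Clusters of the user's n_liked highest-rated artists, de-duplicated in order
    rated = df[df['user_id'] == user_id].sort_values('rating', ascending=False)
    liked_genres = list(dict.fromkeys(rated['genres'].head(n_liked)))
    # Candidates: artists in those clusters that the user has not rated yet
    candidates = df[df['genres'].isin(liked_genres) & (df['user_id'] != user_id)]
    ranked = candidates.sort_values('popularity', ascending=False)
    return ranked['artists'].head(n_recs).tolist()

recs = recommend_top(toy, user_id=1)
print(recs)
```

User 1 rated artists A and C highest, so the candidates come from clusters 0 and 1. Ranking by popularity makes the output deterministic, which is convenient for checking, whereas random sampling gives more varied recommendations.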

## Song Recommendation

What we would really like to do is create a recommendation for songs based on a listener's previously played songs. For this purpose, we are going to create a fake list of users and how many times they listened to certain songs. With this, we will then take each user's top 5 songs, and recommend the next song as a result.

In [None]:
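# Strip quotes and brackets from the artist lists; regex=False treats '[' and ']' as literal characters
data['artists'] = data['artists'].str.replace('"', '', regex=False).str.replace("'", "", regex=False).str.replace('[', '', regex=False).str.replace(']', '', regex=False)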

In [None]:
scaler = MinMaxScaler()
data[columns] = scaler.fit_transform(data[columns])

In [None]:
data.describe()

In [None]:
data.head()

In [None]:
# Bin the years into 10 equal-frequency groups (deciles), a rough stand-in
# for "decade" that may help with recommendations
data['decade'] = pd.qcut(data['year'],10)
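
Note that `pd.qcut` produces equal-frequency bins, so its bin edges follow the data rather than the calendar. If literal calendar decades were wanted instead, one possible approach is integer division, sketched here on a few made-up years; `years`, `decade`, and `halves` are hypothetical names.

```python
import pandas as pd

# A few made-up release years
years = pd.Series([1921, 1955, 1999, 2000, 2019, 2020])

# Calendar decade: drop the last digit of the year
decade = (years // 10) * 10

# For contrast: qcut splits at the quantiles of the data, here into two halves
halves = pd.qcut(years, 2, labels=False)

print(decade.tolist())
print(halves.tolist())
```

Quantile bins keep the groups balanced in size, while calendar decades are easier to interpret, so the choice depends on how the feature is used downstream.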

In [None]:
data.drop_duplicates(inplace=True)
data.info()

In [None]:
data['id'].is_unique

In [None]:
data.reset_index(inplace=True)
data.head()

In [None]:
data.drop(columns='index',inplace=True)

We dropped a few duplicated rows and confirmed that the 'id' column uniquely identifies each song. We will use it as the key for each song, then map the id back to its artists and song when making a recommendation.

Now we are ready to make a dummy user with a play count for each song. We make the listening data sparse by setting the play count of most songs to 0, then assign random play counts to 25000 randomly chosen songs. For this exercise, the user plays no song more than 100 times.

In [None]:
data['user_play_count'] = 0

# Pick 25000 distinct row positions (no repeats) to receive a play count
listeners_list = np.random.choice(len(data), size=25000, replace=False)

In [None]:
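# Assign every chosen song a random play count between 0 and 100 in one vectorized step
data.loc[listeners_list, 'user_play_count'] = np.random.randint(0, 101, size=len(listeners_list))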
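
The sparse play-count idea can be checked on a small array before applying it to the full dataframe. This is a minimal sketch with made-up sizes; `n_songs`, `n_played`, and `play_count` are hypothetical names, and the counts are drawn from 1 to 100 so that every chosen song is played at least once.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible
n_songs = 1000                  # made-up catalogue size
n_played = 50                   # number of songs the dummy user has listened to

# Start with zero plays everywhere, then fill a random subset without repeats
play_count = np.zeros(n_songs, dtype=int)
played = rng.choice(n_songs, size=n_played, replace=False)
play_count[played] = rng.integers(1, 101, size=n_played)  # upper bound is exclusive

print(np.count_nonzero(play_count))
```

Sampling without replacement guarantees that exactly `n_played` songs end up with a non-zero count, which keeps the stated sparsity level exact.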

In [None]:
sns.histplot(data[data['user_play_count']>0]['user_play_count'],bins=25)
plt.title('Random User listens count')

In [None]:
data.info()

In [None]:
# K-means (not k-nearest neighbors): group the songs into 25 clusters
song_kmeans = KMeans(n_clusters=25)

data['genre'] = song_kmeans.fit_predict(data[columns])

In [None]:
y = data['genre']
X = data[['user_play_count', 'year']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)

In [None]:
gb = GradientBoostingClassifier()
learning_rate = [0.001,0.01, 0.1]
n_estimators = [10,50,100]
subsample = [0.5, 0.7, 1.0]
max_depth = [3, 7, 9]

grid = dict(learning_rate=learning_rate, n_estimators=n_estimators, subsample=subsample, max_depth=max_depth)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = RandomizedSearchCV(estimator=gb, param_distributions=grid, n_iter=8, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
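
Once a randomized search has been fitted, its best parameter combination and cross-validated score can be read from `best_params_` and `best_score_`. The sketch below shows this on a small synthetic classification problem, not on the song data; `X_toy`, `y_toy`, `param_grid`, and `search` are hypothetical names, and the grid is deliberately tiny so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic problem (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=200, n_features=5,
                                   n_informative=3, random_state=0)

param_grid = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [10, 50],
    'max_depth': [2, 3],
}

# Try 3 random combinations from the grid with 3-fold stratified cross-validation
search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring='accuracy',
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because only `n_iter` combinations are sampled, the reported best setting is the best among those tried and not necessarily the best in the whole grid.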

In [None]:
class song_recommendation():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        
    def 