### The aim of this notebook is to build a simple content-based recommendation system. Content-based means that recommendations are driven by the features of a song: songs with similar features are recommended together. In this tutorial the features are the lyrics, so two songs count as similar when their lyrics use similar words. We build the system in seven steps:

1. Importing the necessary libraries
2. Loading the songs dataset (from Kaggle)
3. Cleaning the data
4. Detecting important features in the dataset
5. Calculating song similarity based on these features
6. Storing the top n similar songs for each song
7. Making recommendations

### 1. Importing libraries

In [1]:
import pandas as pd
import numpy as np

from typing import List, Dict

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

### 2. Dataset
#### We will be using the songs dataset available on Kaggle at https://www.kaggle.com/mousehead/songlyrics/data#

In [7]:
#Read data into dataframe
songs = pd.read_csv('songdata.csv')
# Inspect the data
songs.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


### 3. Pre-processing dataset

In [9]:
# Sample 5000 rows at random (to keep the dataset and processing time manageable for this tutorial)
# No random_state is set, so a different subset is drawn on every run
songs = songs.sample(n = 5000)
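
If the same subset is wanted on every run, one option is to pass a seed to `sample`. The following is a minimal sketch on a made-up frame (`toy` and the seed value 42 are illustrative assumptions, not part of the notebook). It also shows `reset_index(drop=True)`, which makes index labels match row positions after sampling; the notebook itself relies on `iloc`, so it works without that step.

```python
import pandas as pd

# Made-up frame standing in for the songs dataset
toy = pd.DataFrame({"song": [f"song {i}" for i in range(10)]})

# The same random_state yields the same rows on every call
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)
print(first.index.equals(second.index))

# After sampling, labels are shuffled; reset_index renumbers them from 0
first_reset = first.reset_index(drop=True)
print(list(first_reset.index))
```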

In [10]:
songs.describe(include = 'all')

Unnamed: 0,artist,song,link,text
count,5000,5000,5000,5000
unique,603,4763,5000,4998
top,Christmas Songs,A Song For You,/c/conway+twitty/halfway+to+heaven_20213954.html,Baby here I stand before you \nWith my heart ...
freq,23,5,1,2


If we look at the ***text*** column, which holds the lyrics of each song, we see that the lyrics contain newline characters ('\n'). We strip these out as part of pre-processing.

In [11]:
# Replace each \n in the text with a blank space, so that words on adjacent lines cannot merge
songs['text'] = songs['text'].str.replace('\n', ' ', regex=False)
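
The choice of replacement string matters when a line break is not preceded by a space. A small sketch on made-up lyrics (the `toy` series is an illustrative assumption) compares replacing the newline with an empty string against replacing it with a space:

```python
import pandas as pd

# Made-up lyrics: the first has a space before the newline, the second does not
toy = pd.Series(["Look at her face \nAnd smile", "Take it easy\nwith me"])

# regex=False treats '\n' as a literal newline character
merged = toy.str.replace('\n', '', regex=False)
spaced = toy.str.replace('\n', ' ', regex=False)

print(merged.tolist())  # "easy" and "with" run together in the second lyric
print(spaced.tolist())  # words stay separate; extra spaces do not affect tokenization
```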

### 4. Calculating the term frequency-inverse document frequency (tf-idf) matrix
This matrix holds one tf-idf score for every (song, word) pair. In other words, we first collect all the unique words across all the lyrics (these are the features used to compare two songs) and then calculate how important each word is to each song, as measured by its tf-idf score.
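
Before running this on the full dataset, it can help to see the idea on a tiny corpus. The sketch below uses three made-up "lyrics" (`toy_lyrics` is an illustrative assumption) with the same vectorizer settings as the notebook. Note that scikit-learn's defaults use a smoothed idf and L2-normalize each row, so the numbers differ slightly from the textbook tf * idf formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "lyrics", used purely for illustration
toy_lyrics = [
    "love me love me now",
    "love is all you need",
    "dance all night",
]

toy_tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_lyrics)

# Words kept after stop-word removal, in column order
vocab = sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get)
print(vocab)

# One row per lyric, one column per kept word
print(toy_matrix.shape)
print(toy_matrix.toarray().round(2))

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```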

In [13]:
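# Initialize the tf-idf vectorizer; English stop words are dropped
tfidf = TfidfVectorizer(analyzer = 'word', stop_words= 'english')

# The vectorizer calculates tf (term frequency) and idf (inverse document frequency)

# Textbook definitions:
# tf = number of occurrences of the word in the song / total number of words in the song
# idf = log(total number of documents / number of documents containing the term)
# (scikit-learn's defaults differ slightly: raw counts for tf, smoothed idf, L2-normalized rows)

# The tf-idf score of each word in each song lyric is then TF * IDF
# tfidf.get_feature_names_out()  # get_feature_names() in older scikit-learn versions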

In [14]:
# Fit and transform 
tfidf_matrix = tfidf.fit_transform(songs['text'])

In [15]:
tfidf_matrix.shape 
# from the shape, we see that the matrix has one row
# for each of the 5000 songs and one column per word;
# each entry is that word's importance for that song

(5000, 24591)

As we can see, the vectorizer found 24591 unique words, each of which becomes a feature for comparing lyrics. That is a lot of features, and ideally we would reduce them with a dimensionality reduction technique (such as PCA or an autoencoder). For the time being, however, we leave them as they are.
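
For readers who want to try the reduction step, here is a minimal sketch. It swaps in scikit-learn's `TruncatedSVD` rather than PCA, because `TruncatedSVD` operates directly on sparse tf-idf matrices without converting them to dense form. The toy corpus and the choice of 2 components are assumptions made only to keep the example small; a real run would need a tuned number of components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lyrics, used purely for illustration
toy_lyrics = [
    "love heart baby",
    "heart broken tears",
    "dance night party",
    "party dance floor",
    "rain tears sky",
]

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_lyrics)

# n_components=2 is an arbitrary choice for this toy corpus
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(toy_matrix)

print(toy_matrix.shape, '->', reduced.shape)
```

The reduced matrix could then be passed to `cosine_similarity` in place of the full tf-idf matrix, at the cost of some loss of detail.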

In [16]:
# to get a basic idea of how the matrix looks
# (todense() builds a full dense copy, roughly 1 GB for a 5000 x 24591 float matrix)
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

### 5. Calculating similarity between lyrics
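To calculate the similarity of one lyric to another, we could use Euclidean distance or cosine similarity (or, for that matter, any other measure). We will use cosine similarity because it depends only on the angle between two tf-idf vectors and not on their magnitude, so a long song and a short song that use words in the same proportions still count as similar:
- angle = 0° (cosine = 1) means the songs use the same words in the same proportions
- small angle means more similar songs
- angle = 90° (cosine = 0) means the songs have no words in common

Because tf-idf scores are never negative, the angle between two songs never exceeds 90°, so no pair of songs is "completely opposite".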

(Please refer to http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/ for a thorough explanation of why cosine similarity is a good choice.)

To calculate the cosine similarity between two lyrics, we need a vector of values for each of them (just as we would for Euclidean distance). In our case, this vector holds the tf-idf scores of all the unique words. Thus the vector for song1 looks like <tfidf_1, tfidf_2, ..., tfidf_24591>, and likewise for song2. Luckily for us, these vectors are already calculated: they are the rows of our tf-idf matrix, and this is what we pass to the cosine_similarity function.
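
To make the measure concrete, the sketch below computes cosine similarity by hand as the dot product divided by the product of the vector lengths, and compares it with scikit-learn's `cosine_similarity`. The three short vectors and the helper `manual_cosine` are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero component with a

def manual_cosine(u, v):
    """Hypothetical helper: dot(u, v) / (|u| * |v|)."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = manual_cosine(a, b)  # same direction: close to 1, length is ignored
sim_ac = manual_cosine(a, c)  # nothing in common: 0
sk_ab = cosine_similarity(a, b)[0, 0]

print(sim_ab, sim_ac, sk_ab)
```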

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

# To see the similarity of the first song with all the other songs
# cosine_similarity(tfidf_matrix[0:1], tfidf_matrix) # in the output, first value is 1 because the song is being compared to itself 

# to calculate the similarity between each pair of songs
cosine_similarities = cosine_similarity(tfidf_matrix)

As an output, we will get an n x n matrix where n is the number of songs and the value in ith row and jth column represents the similarity between the ith song and jth song.

In [18]:
# To see the songs most similar to song[0]

# argsort gives positions (starting from 0) rather than the values themselves, sorted by ascending similarity
# the slice [:-50:-1] walks backwards from the end and returns 49 positions:
# song[0] itself (similarity 1) followed by its 48 most similar songs
cosine_similarities[0].argsort()[:-50:-1]


array([   0, 1803, 3205, 3791, 1757, 2384, 1442, 1535, 2579,  865, 4469,
       4608,  497, 2815, 1516, 1204,  132, 2695, 3466,  835, 1521, 1957,
       4254, 4906,   45,  656, 4123, 1871, 4910,   22, 1491,  775, 2222,
        343, 2044, 3973,  831, 1749, 3953, 1245, 4672, 4162, 1454, 2574,
         97,  213, 1449, 2816, 1507])
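
The slice `[:-50:-1]` is easy to misread: it stops before position -50, so it returns 49 positions rather than 50 (the array above has 49 entries). A small sketch on a made-up score vector shows the difference between that pattern and a slice that returns exactly `k` positions:

```python
import numpy as np

# Made-up similarity scores of one song against five songs (index 0 is the song itself)
scores = np.array([1.0, 0.2, 0.7, 0.5, 0.9])
k = 3

# Pattern used in the notebook: yields k - 1 positions
short = scores.argsort()[:-k:-1]

# Reverse first, then take k: yields exactly k positions
full = scores.argsort()[::-1][:k]

print(short)
print(full)
```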

### 6. Creating dictionary for highly similar songs

In [19]:
# creating a dictionary that stores, for each song, its most similar songs
my_dict = {} # initialize empty dictionary

for i in range(len(cosine_similarities)): # loop over all the songs in the cosine similarity matrix
    similar_indices = cosine_similarities[i].argsort()[:-50:-1] # 49 positions: the song itself plus its 48 most similar songs
    
    # Setting the key as the song name
    # (song names are not unique in this sample, so a later song overwrites an earlier one with the same name)
    song_name = songs['song'].iloc[i] 
    
    # Setting the value as three items, i.e. (1) similarity score, (2) song name, (3) artist name,
    # for only those songs whose positions were found in similar_indices; [1:] drops the song itself
    my_dict[song_name] = [(cosine_similarities[i][x], songs['song'].iloc[x], songs['artist'].iloc[x]) for x in similar_indices][1:]
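
Since the sample contains repeated song names (4763 unique names among 5000 rows), a name-only key lets one song silently replace another. One possible workaround, sketched here on made-up data, is to key the dictionary by a (song, artist) pair and to remove the song itself by position instead of assuming it sorts first. The names `toy_songs`, `toy_sim`, `by_pair` and `top_n` are hypothetical, and this pair could still collide if an artist has two songs with the same title.

```python
import numpy as np
import pandas as pd

# Made-up songs: two different songs share the name "Hero"
toy_songs = pd.DataFrame({
    "song": ["Hero", "Hero", "Rain"],
    "artist": ["Artist A", "Artist B", "Artist C"],
})

# Made-up symmetric similarity matrix for the three songs
toy_sim = np.array([
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

top_n = 2
by_pair = {}
for i in range(len(toy_sim)):
    order = toy_sim[i].argsort()[::-1]   # most similar first
    order = order[order != i][:top_n]    # drop the song itself explicitly
    key = (toy_songs["song"].iloc[i], toy_songs["artist"].iloc[i])
    by_pair[key] = [
        (float(toy_sim[i][x]), toy_songs["song"].iloc[x], toy_songs["artist"].iloc[x])
        for x in order
    ]

print(len(by_pair))
print(by_pair[("Hero", "Artist B")])
```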

In [20]:
# Testing whether this dictionary gives the three most similar songs for the song at position 10
getforsong = songs['song'].iloc[10]
getforsong
my_dict.get(getforsong)[:3] # select only first 3 rows of the output since we want only top 3 recommendations

[(0.5066125791275065, 'Love Me Now', 'John Legend'),
 (0.5060429641432027, 'For The Girl Who Has Everything', "'n Sync"),
 (0.47467050528723503, "Lookin' For That Girl", 'Tim McGraw')]

### 7. Making content-based song recommendations for a particular song

In [21]:
# Let us use this dictionary to present the recommendations

def get_recommendation(ques):
    # Get the song to find recommendations for
    song = ques['song']

    # Get the number of songs to recommend
    number_songs = ques['no_of_songs']

    # Take the most similar songs for this song from my_dict
    recom_song = my_dict.get(song)[:number_songs]

    # Print each recommendation with its artist and similarity score
    print(f"The recommended songs for {song} by {songs.loc[songs['song'] == song, 'artist'].iloc[0]} are")
    # Looping over recom_song itself avoids an IndexError when fewer songs are stored than requested
    for score, name, artist in recom_song:
        print(f"{name} by {artist} with similarity score {score} ")

In [22]:
# Create a dictionary to pass as input

ques = {
    "song" : songs['song'].iloc[1104],
    "no_of_songs" : 4
}

get_recommendation(ques)

The recommended songs for Working Class Hero by Marilyn Manson are
Hero by Mariah Carey with similarity score 0.4161157147759222 
Barrier Reef by Old 97's with similarity score 0.31360076636387807 
Heroes by Helloween with similarity score 0.21100304905514192 
Clampdown by Indigo Girls with similarity score 0.20893836777075372 


In [None]:
# to find the artist name given the song name
# songs.loc[songs['song'] == "Pieces Of A Dream", 'artist'].iloc[0]