# Music Recommender System

Recommender systems are one of the most widely used applications of machine learning. They predict a user's preferences or ratings for items such as films, products, or songs.

### Types of Recommender Systems

Two primary categories define recommender systems:

1. **Content-Based Filters**
2. **Collaborative Filters**

**Content-based filters** infer a user's preferences from the items that user has liked or disliked in the past. **Collaborative filters**, by contrast, predict a user's preferences from the likes of other users who resemble the target user.

### 1) Content-Based Filters

Content-based recommenders approach recommendations as a personalized classification problem. These systems learn a user's preferences by examining the features of items, such as songs.

A fundamental technique within content-based recommendation is **keyword matching**. This involves extracting meaningful keywords from a user's preferred song description, searching for these keywords in other song descriptions to gauge similarities, and leveraging this information to recommend similar songs to the user.
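
As a rough illustration of keyword matching, the sketch below scores two song descriptions by the share of keywords they have in common (Jaccard similarity). The descriptions, the tiny stop-word list, and the helper names `extract_keywords` and `keyword_overlap` are all made up for this example; they are not part of the notebook's pipeline.

```python
import re
from typing import Set

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "with", "about"}


def extract_keywords(description: str) -> Set[str]:
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", description.lower())
    return {word for word in words if word not in STOP_WORDS}


def keyword_overlap(desc_a: str, desc_b: str) -> float:
    """Jaccard similarity: shared keywords divided by all distinct keywords."""
    a, b = extract_keywords(desc_a), extract_keywords(desc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up descriptions: one song the user liked and two candidates.
liked = "a slow acoustic ballad about love and heartbreak"
candidates = {
    "Song A": "an acoustic ballad about lost love",
    "Song B": "a fast electronic dance track",
}

# Rank candidates from most to least similar to the liked song.
ranked = sorted(
    candidates,
    key=lambda title: keyword_overlap(liked, candidates[title]),
    reverse=True,
)
print(ranked)  # ['Song A', 'Song B']
```

Plain keyword overlap treats every word as equally important, which is one reason to move on to a weighted scheme such as TF-IDF.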

#### Implementation with Term Frequency-Inverse Document Frequency (TF-IDF)

Because the data is text, we use **Term Frequency-Inverse Document Frequency (TF-IDF)** for the matching. TF-IDF gives a word a higher weight the more often it occurs in a document and a lower weight the more documents it occurs in. The following sections walk through the steps needed to build a content-based music recommender with it.
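
To make the weighting concrete, the sketch below computes a textbook TF-IDF score by hand on three made-up lyric snippets. It assumes the plain definition (term frequency times the natural log of the inverse document frequency). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up toy corpus (not taken from the dataset).
docs = {
    "song_1": "love me love me tender",
    "song_2": "love is a battlefield",
    "song_3": "dancing in the dark",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for words in tokenized.values() for term in set(words))


def tf_idf(term, words):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = words.count(term) / len(words)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


love_score = tf_idf("love", tokenized["song_1"])
tender_score = tf_idf("tender", tokenized["song_1"])
print(round(love_score, 3), round(tender_score, 3))
```

Although "love" appears twice in `song_1` and "tender" only once, "tender" gets the higher score because it occurs in no other document.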


### Importing required libraries

First, we'll import all the required libraries.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from typing import List, Dict

We have already used the **TF-IDF score** when performing Twitter sentiment analysis.

Here, we use `TfidfVectorizer` from the scikit-learn package again.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Dataset

We use the [following dataset](https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset/data) from Kaggle.

It contains the name, artist, and lyrics of *57,650 songs in English*. The data was scraped from LyricsFreak.

In [4]:
songs = pd.read_csv('./songdata.csv')

In [18]:
songs.head(10)

| Unnamed: 0 | artist | song | text |
|---|---|---|---|
| 0 | Kelly Clarkson | Walk Away | You've got your mother and your brother \nEve... |
| 1 | Hanson | World's On Fire | Watched from a distance, it's beautiful \nSom... |
| 2 | Kinks | Jack The Idiot Dunce | Who's the fool with the cross-eyed stare, \nT... |
| 3 | Mary Black | Going Gone | There is a lighthouse in the harbor \nGiving ... |
| 4 | The Beatles | All My Loving | Close your eyes and I'll kiss you \nTomorrow ... |
| 5 | Prince | Call My Name | Call, call my name \nCall it, call my name \... |
| 6 | Oingo Boingo | You Really Got Me | Girl, you really got me now \nYou got me so I... |
| 7 | Lady Gaga | Im On The Edge Of Glory | There aint no reason you and me should be alon... |
| 8 | Reba Mcentire | I Wouldn't Know | Everyone I see these days still asks me about ... |
| 9 | Lionel Richie | Stay | Are you sad \nOr just a little lonely \nI ca... |


Because the dataset is so large, we work with a random sample of only 5000 songs.

In [6]:
songs = songs.sample(n=5000).drop('link', axis=1).reset_index(drop=True)

The lyrics also contain newline characters (`\n`), so we remove them.

In [7]:
songs['text'] = songs['text'].str.replace(r'\n', '', regex=True)
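
Whether a pattern such as `r'\n'` is read as a regular expression depends on the `regex` argument of `Series.str.replace`, whose default changed from `True` to `False` in pandas 2.0. The sketch below shows the difference on a made-up lyric string. The extra whitespace-collapsing step is an optional addition that the notebook does not perform.

```python
import pandas as pd

# Made-up lyric with real newline characters, similar in shape to the dataset.
lyrics = pd.Series(["Close your eyes  \nand I'll kiss you  \n"])

# regex=False searches for the two literal characters backslash + n: no match here.
literal = lyrics.str.replace(r"\n", " ", regex=False)

# regex=True interprets \n as the newline character and replaces it.
as_regex = lyrics.str.replace(r"\n", " ", regex=True)

# Optional: collapse runs of whitespace so words stay separated by one space.
cleaned = as_regex.str.replace(r"\s+", " ", regex=True).str.strip()

print(repr(literal.iloc[0]))
print(repr(cleaned.iloc[0]))
```

Replacing the newline with a space rather than an empty string avoids gluing two words together when a line does not end in whitespace.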

After that, we use the TF-IDF vectorizer, which calculates a TF-IDF score for every word in every song's lyrics.

Note the arguments we pass: `analyzer='word'` builds features from whole words, and `stop_words='english'` drops common English words.

In [8]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [9]:
lyrics_matrix = tfidf.fit_transform(songs['text'])

*How do we use this matrix for a recommendation?* 

We now need to calculate the similarity of one lyric to another. We are going to use **cosine similarity**.

We want to calculate the cosine similarity of each item with every other item in the dataset, so we pass `lyrics_matrix` as the only argument.

In [10]:
cosine_similarities = cosine_similarity(lyrics_matrix) 
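
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below reproduces that computation with NumPy on a small made-up matrix. The helper `cosine_similarity_matrix` is a hypothetical stand-in for illustration, not scikit-learn's implementation.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = vectors / norms   # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines


# Made-up vectors: rows 0 and 1 point the same way, row 2 is orthogonal to both.
toy = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])
sims = cosine_similarity_matrix(toy)
print(sims)
```

The result is a dense square matrix with one row and one column per song. For 5000 songs stored as 64-bit floats that is 5000 × 5000 × 8 bytes, or about 200 MB, which is a practical reason to work with a sample rather than the full dataset.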

Once we have the similarities, we store in a dictionary the 50 most similar songs for each song in our dataset.

In [11]:
similarities = {}

In [12]:
for i in range(len(cosine_similarities)):
    # Sort the similarities of song i and take the indices of the 51 highest scores,
    # in descending order: the song itself plus its 50 nearest neighbors.
    similar_indices = cosine_similarities[i].argsort()[:-52:-1]
    # Store (score, title, artist) for each neighbor, skipping the first entry,
    # which is normally the song itself.
    similarities[songs['song'].iloc[i]] = [(cosine_similarities[i][x], songs['song'].iloc[x], songs['artist'].iloc[x]) for x in similar_indices][1:]
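
The loop relies on the song itself always landing in the first position after sorting. A slightly more defensive variant, sketched below with a hypothetical helper `top_k_similar`, removes the song by its index instead, which still works when another song with identical lyrics ties at the top.

```python
import numpy as np


def top_k_similar(sim_row: np.ndarray, self_index: int, k: int) -> list:
    """Indices of the k most similar items, excluding the item itself."""
    order = np.argsort(sim_row)[::-1]    # indices from highest to lowest score
    order = order[order != self_index]  # drop the item itself by index
    return order[:k].tolist()


# Made-up similarity row for item 0 (its similarity to itself is 1.0).
row = np.array([1.0, 0.2, 0.9, 0.0, 0.5])
print(top_k_similar(row, self_index=0, k=3))  # [2, 4, 1]
```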

Now we can use these similarity scores to look up the most similar items and make a recommendation.

For that, we define our content-based recommender class.

In [13]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        self.matrix_similar = matrix

    def _print_message(self, song, recom_song):
        rec_items = len(recom_song)
        
        print(f'The {rec_items} recommended songs for {song} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_song[i][1]} by {recom_song[i][2]} with {round(recom_song[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Take the first `number_songs` entries from that song's stored list of similar songs
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(song=song, recom_song=recom_song)
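
The class above prints its results and raises a `KeyError` if the requested title is not in the dictionary. If the recommendations are needed as data, for example to test them or pass them on, a function along the following lines could be used instead. `get_recommendations` and the toy dictionary are hypothetical and only mirror the `(score, song, artist)` layout used above.

```python
from typing import Dict, List, Tuple

Similar = Tuple[float, str, str]  # (similarity score, song title, artist)


def get_recommendations(
    similarities: Dict[str, List[Similar]], song: str, number_songs: int
) -> List[Similar]:
    """Return up to `number_songs` similar songs, or an empty list for unknown titles."""
    if song not in similarities:
        return []
    return similarities[song][:number_songs]


# Made-up similarity dictionary in the same shape as the one built earlier.
toy_similarities = {
    "Song A": [(0.8, "Song B", "Artist 2"), (0.4, "Song C", "Artist 3")],
}

print(get_recommendations(toy_similarities, "Song A", 1))
print(get_recommendations(toy_similarities, "Unknown Song", 3))
```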

Now we instantiate the class:

In [14]:
recommedations = ContentBasedRecommender(similarities)

Then, we are ready to pick a song from the dataset and make a recommendation.

In [21]:
recommendation = {
    "song": songs['song'][0],
    "number_songs": 4 
}

songs['song'][0]

'Walk Away'

In [22]:
recommedations.recommend(recommendation)

The 4 recommended songs for Walk Away are:
Number 1:
Walk Away by Kelly Clarkson with 0.353 similarity score
--------------------
Number 2:
Fordham Road by Lana Del Rey with 0.353 similarity score
--------------------
Number 3:
Walk Away From Love by Yazoo with 0.343 similarity score
--------------------
Number 4:
Don't Look Back by Peter Tosh with 0.283 similarity score
--------------------
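
Note that the first result is "Walk Away" by Kelly Clarkson itself, with a score well below 1. A plausible explanation (an assumption, not something the output confirms) is that the sample contains another song with the same title. Because the `similarities` dictionary is keyed by title alone, a later song with that title would overwrite the earlier entry. The sketch below, using made-up artists, shows the effect and one possible remedy: keying by `(song, artist)`.

```python
# Made-up rows: two different songs share the title "Walk Away".
toy_songs = [
    {"song": "Walk Away", "artist": "Artist X"},
    {"song": "Walk Away", "artist": "Artist Y"},
    {"song": "Stay", "artist": "Artist Z"},
]

by_title = {}
by_title_and_artist = {}
for index, row in enumerate(toy_songs):
    by_title[row["song"]] = index  # a later duplicate title overwrites the earlier one
    by_title_and_artist[(row["song"], row["artist"])] = index  # keeps both songs

print(len(by_title))             # 2
print(len(by_title_and_artist))  # 3
print(by_title["Walk Away"])     # 1: the entry now points at the second song
```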


We can pick another song from the sample and ask for recommendations again:

In [16]:
recommendation2 = {
    "song": songs['song'].iloc[120],
    "number_songs": 4 
}

In [17]:
recommedations.recommend(recommendation2)

The 4 recommended songs for Movin' are:
Number 1:
Don't Stop Moving by Beautiful South with 0.348 similarity score
--------------------
Number 2:
I'll Spend My Life With You by The Monkees with 0.285 similarity score
--------------------
Number 3:
Drifter by Deep Purple with 0.253 similarity score
--------------------
Number 4:
Hong Kong Bar by Tim Buckley with 0.251 similarity score
--------------------
