# Recommender Model

### Problem Statement

Spotify has decided to improve the recommendation system in its current app by adding a more content-based approach. As a data scientist, I was hired to build a recommender that suggests songs based on Artist and Song Title and that also favors songs with a similar release year. Spotify asked me to focus on song features such as Danceability, Instrumentalness, and Popularity. The goal is to make the application more appealing than competitors such as Apple Music and TIDAL by building a model that analyzes track characteristics and produces more niche recommendations that enhance the listener experience.

#### Imports and Reading in Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances 
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import pickle

In [2]:
songs = pd.read_csv('../data/clean_data.csv', index_col = 0)
print(songs.shape)
songs.head(2)

(8716, 15)


Unnamed: 0,Track,Artist,Genre,Album Title,Album Type,Release Date,Thumbnail,Acousticness,Danceability,Energy,Instrumentalness,Popularity,Speechiness,Track ID,Release Year
0,"Shook Ones, Pt. II",Mobb Deep,"east coast hip hop, hardcore hip hop, hip hop,...",The Infamous,album,1995-04-25,https://i.scdn.co/image/ab67616d00004851a2203f...,0.0146,0.763,0.786,0.0114,0.787879,0.229,33ZXjLCpiINn8eQIDYEPTD,1995
1,Hypnotize - 2014 Remaster,The Notorious B.I.G.,"east coast hip hop, gangster rap, hardcore hip...",Life After Death (2014 Remastered Edition),album,1997-03-04,https://i.scdn.co/image/ab67616d00004851fde79b...,0.145,0.901,0.697,0.0,0.838384,0.28,7KwZNVEaqikRSBSpyhXK2j,1997


#### Creating a Track + Artist Column

In [3]:
#First I will create a column that contains the Track title plus the Artist
songs['Track_Artist'] = songs['Track'] + ' -' + songs['Artist']

#Next I will set the new Column as the Index
songs.set_index('Track_Artist', inplace = True)

#Finally, I will check the .head(2) just to make sure it worked
songs.head(2)

Track_Artist,Track,Artist,Genre,Album Title,Album Type,Release Date,Thumbnail,Acousticness,Danceability,Energy,Instrumentalness,Popularity,Speechiness,Track ID,Release Year
"Shook Ones, Pt. II -Mobb Deep","Shook Ones, Pt. II",Mobb Deep,"east coast hip hop, hardcore hip hop, hip hop,...",The Infamous,album,1995-04-25,https://i.scdn.co/image/ab67616d00004851a2203f...,0.0146,0.763,0.786,0.0114,0.787879,0.229,33ZXjLCpiINn8eQIDYEPTD,1995
Hypnotize - 2014 Remaster -The Notorious B.I.G.,Hypnotize - 2014 Remaster,The Notorious B.I.G.,"east coast hip hop, gangster rap, hardcore hip...",Life After Death (2014 Remastered Edition),album,1997-03-04,https://i.scdn.co/image/ab67616d00004851fde79b...,0.145,0.901,0.697,0.0,0.838384,0.28,7KwZNVEaqikRSBSpyhXK2j,1997


#### Model Creation

Below we create two helper functions for cleaning text.
The first function removes every character that is not a letter, digit, or whitespace, and is applied to the index. The second function lowercases and trims a user's search string before removing special characters, so a search works regardless of capitalization or symbols.

In [4]:
def remove_special_characters(text):
    pattern = r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

def preprocess_input(input_string):
    processed_string = remove_special_characters(input_string.lower().strip())
    return processed_string

In [5]:
# Apply the function to the index (Track_Artist column) to remove special characters
songs.index = songs.index.map(remove_special_characters)
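
As a quick check of what the two helpers do, the sketch below re-declares them (same logic as above) and runs them on one index label from the data and on one search string. The search string is an illustrative input, not something taken from the notebook's results.

```python
import re

def remove_special_characters(text):
    # Keep only letters, digits, and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def preprocess_input(input_string):
    # Lowercase and trim the query, then strip special characters
    return remove_special_characters(input_string.lower().strip())

# An index label as built in the Track + Artist step
index_label = remove_special_characters("Shook Ones, Pt. II -Mobb Deep")

# A user query with mixed case, a symbol, and stray whitespace
query = preprocess_input("  a$aP Ferg ")

print(index_label)
print(query)
```

Because the index keeps its capitalization while the query is lowercased, the lookup later in the notebook lowercases the index before comparing.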

In [6]:
# Define a function to recommend songs based on similarity
def recommend_song(song_index, similarity_matrix, songs, category, top_n=10):
    # Get the similarity scores for the specified song index
    similarity_scores = similarity_matrix[song_index]

    # Sort the similarity scores in descending order and get the indices
    sorted_indices = np.argsort(similarity_scores)[::-1]

    recommended_songs = []
    count = 0

    # Find the recommendations that match the specified category
    for index in sorted_indices:
        if count >= top_n:
            break

        song = songs.iloc[index]
        if category.lower() == 'track':
            recommendation = song.name  # Use the index (Track_Artist) as the recommendation
        elif category.lower() == 'artist':
            recommendation = song['Artist']
        elif category.lower() == 'genre':
            recommendation = song['Genre']
        elif category.lower() == 'album title':
            recommendation = song['Album Title']
        else:
            continue

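        # Exclude the original song from the recommendations.
        # Note: song_name is not a parameter; it is read from the notebook's global scope,
        # so it must be defined before this function is called. With category 'track' the
        # compared value is the full Track_Artist index, so the queried song is not filtered out.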
        if recommendation.lower() != song_name.lower():
            recommended_songs.append(song[['Thumbnail', 'Track', 'Artist', 'Album Title', 'Release Year']])
            count += 1

    # Create a DataFrame from the recommended songs
    return pd.DataFrame(recommended_songs)

Below we select our features. We focus on the numeric features in order to make the recommendations.

In [7]:
#We will look at the numeric features of the data for the recommender
features = ['Acousticness', 'Danceability', 'Energy', 'Instrumentalness', 'Popularity', 'Speechiness', 'Release Year']

In [8]:
#Here we are scaling the numeric features using the MinMaxScaler
scaler = MinMaxScaler()
normalized_features = scaler.fit_transform(songs[features])
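
To make the scaling step concrete, here is a minimal sketch on toy values (not rows from the dataset). It shows that MinMaxScaler maps each column to the range 0 to 1 using (x - min) / (max - min), computed per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (not from the dataset): columns are Danceability and Release Year
toy = np.array([[0.50, 1995.0],
                [0.75, 2007.0],
                [1.00, 2019.0]])

scaled = MinMaxScaler().fit_transform(toy)

# Same result computed by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled)
```

After scaling, a column measured in years and a column measured on a 0 to 1 scale contribute on comparable terms.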

In [9]:
#Here we are applying the TFIDF Vectorizer to the Genre Column
tfidf_vectorizer = TfidfVectorizer()
genre_matrix = tfidf_vectorizer.fit_transform(songs['Genre'])

*NOTE: We apply a TFIDF Vectorizer to the Genre column to make the recommendations more accurate. Spotify lists several genres for most songs, so two songs from the same general genre can carry different sets of sub-genre labels. TFIDF weights each genre term by how often it appears for a song and how rare it is across the dataset, which lets the recommender compare songs by the genre terms they share.*

In [10]:
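#Here we are preparing the combined feature matrix by concatenating the normalized numeric features, the genre matrix created using the TFIDF, and the release year
#Note: 'Release Year' is already one of the scaled columns in features, so it appears here a second time, unscaled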
release_year = songs['Release Year'].values.reshape(-1, 1)
combined_matrix = pd.concat([pd.DataFrame(normalized_features), pd.DataFrame(genre_matrix.toarray()), pd.DataFrame(release_year)], axis=1)
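
The cell above converts the TFIDF matrix to a dense array before concatenating. An alternative, sketched here on toy data, is to stack the pieces with scipy.sparse.hstack so the genre matrix stays sparse. This is an optional variation and not what the notebook does; the toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the scaled numeric features and the Genre column
toy_numeric = np.array([[0.2, 0.9], [0.3, 0.8], [0.9, 0.1]])
toy_genres = ["hip hop", "hip hop rap", "indie rock"]
toy_genre_matrix = TfidfVectorizer().fit_transform(toy_genres)

# Dense route, analogous to the notebook's concatenation
dense_combined = np.hstack([toy_numeric, toy_genre_matrix.toarray()])

# Sparse route: stack column-wise without densifying the TFIDF matrix
sparse_combined = hstack([csr_matrix(toy_numeric), toy_genre_matrix]).tocsr()

dense_similarity = cosine_similarity(dense_combined)
sparse_similarity = cosine_similarity(sparse_combined)

print(sparse_combined.shape)
```

Both routes give the same similarity values. The sparse route avoids allocating a dense array for the genre vocabulary.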

In [11]:
#Here we are using cosine similarity to compute the similarity matrix
similarity_matrix = cosine_similarity(combined_matrix, combined_matrix)
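
One property of cosine similarity to keep in mind is that a column with much larger values than the others can dominate the score. The toy vectors below (made-up values, not rows from the dataset) compare two songs with opposite feature profiles, first on scaled features only and then with a raw year column appended. The notebook does not measure how much this affects the matrix above, so treat this purely as an illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy songs with opposite scaled features
song_a = np.array([[0.1, 0.9]])
song_b = np.array([[0.9, 0.1]])
similarity_scaled = cosine_similarity(song_a, song_b)[0, 0]

# The same songs with an unscaled year column appended
song_a_year = np.array([[0.1, 0.9, 1995.0]])
song_b_year = np.array([[0.9, 0.1, 2019.0]])
similarity_with_raw_year = cosine_similarity(song_a_year, song_b_year)[0, 0]

print(round(similarity_scaled, 3))
print(round(similarity_with_raw_year, 6))
```

With only scaled features the two songs score low. With the raw year appended, the score is pushed very close to 1 because the year values dwarf the other columns.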

In [12]:
#Here we are defining a function that will recommend songs based on the similarities from above.
def recommend_songs(song_index, similarity_matrix, songs, top_n=10):
    song_scores = similarity_matrix[song_index]
    top_indices = song_scores.argsort()[::-1][1:top_n+1]
    return songs.iloc[top_indices]
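
The ranking trick in recommend_songs can be traced on a small hand-made similarity matrix. The helper name top_similar and the matrix values are assumptions for illustration. The slice [1:top_n + 1] skips the first position on the assumption that the query song is its own best match.

```python
import numpy as np

# Toy symmetric similarity matrix for four songs (illustrative values)
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def top_similar(similarity_matrix, song_index, top_n=2):
    scores = similarity_matrix[song_index]
    # argsort is ascending, so reverse it, then drop position 0 (the song itself)
    return scores.argsort()[::-1][1:top_n + 1]

top_for_song_0 = top_similar(toy_similarity, 0)
top_for_song_1 = top_similar(toy_similarity, 1)

print(top_for_song_0)
print(top_for_song_1)
```

If another song ties with the query at the top score, the query is not guaranteed to land in position 0, so the slice could drop a different song instead.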

In [13]:
#Here we are bundling the similarity matrix, the songs data, the feature list, and the fitted TFIDF vectorizer into a recommender dictionary.
recommender = {
    'similarity_matrix': similarity_matrix,
    'songs': songs,
    'features': features,
    'tfidf_vectorizer': tfidf_vectorizer,
}

In [14]:
#Here we are going to pickle the recommender model
#with open('recommender.pkl', 'wb') as file:
    #pickle.dump(recommender, file)
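
For reference, a minimal sketch of the save and load round trip, using a small stand-in dictionary and a temporary folder rather than the real recommender. The toy contents and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for the recommender dictionary
toy_recommender = {
    'similarity_matrix': np.eye(2),
    'features': ['Acousticness', 'Danceability'],
}

# Write to a temporary folder so nothing in the project is overwritten
path = os.path.join(tempfile.mkdtemp(), 'recommender.pkl')
with open(path, 'wb') as file:
    pickle.dump(toy_recommender, file)

# Read it back, as an app consuming the model would
with open(path, 'rb') as file:
    loaded = pickle.load(file)

print(loaded['features'])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.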

---
### Using the Model to Make Recommendations
Below we use the model to make recommendations. The model is set up to make recommendations based on song title and artist, song title only, or artist only. Additionally, the model works with or without special characters thanks to the regex function used above.

>As an example of the above, an artist like "A$AP Ferg" can be looked up with the dollar symbol or by typing "AAP Ferg" without it. Capitalization does not matter either, as the example below shows.
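
The lookup described above can be sketched as a small helper. The name find_song_index and the three-item toy index are assumptions for illustration, and the helper uses plain substring matching (regex=False), which behaves the same as the default here because the cleaned strings contain only letters, digits, and whitespace.

```python
import re

import numpy as np
import pandas as pd

def preprocess_input(input_string):
    # Same cleaning as in the notebook: lowercase, trim, strip special characters
    return re.sub(r'[^a-zA-Z0-9\s]', '', input_string.lower().strip())

def find_song_index(index, song_name, artist_name):
    # Hypothetical helper: position of the first label containing both strings
    labels = index.str.lower()
    song_mask = np.asarray(
        labels.str.contains(preprocess_input(song_name), regex=False), dtype=bool)
    artist_mask = np.asarray(
        labels.str.contains(preprocess_input(artist_name), regex=False), dtype=bool)
    matches = np.where(song_mask & artist_mask)[0]
    return int(matches[0]) if len(matches) > 0 else None

# Toy index in the cleaned Track_Artist format
toy_index = pd.Index(["Plain Jane AAP Ferg", "Doomsday MF DOOM", "Gas Drawls MF DOOM"])

both = find_song_index(toy_index, "plAiN JaNe", "a$aP Ferg")
artist_only = find_song_index(toy_index, "", "mf doom")
missing = find_song_index(toy_index, "not a song", "")

print(both, artist_only, missing)
```

An empty string matches every label, which is why leaving the song title or the artist blank still works. When several labels match, only the first one is used as the query song.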

In [14]:
# Enter the song name and artist
song_name = "plAiN JaNe"
artist_name = "a$aP Ferg"

# Preprocess the song and artist names
processed_song_name = preprocess_input(song_name)
processed_artist_name = preprocess_input(artist_name)

# Find the song index based on the processed song and artist names
song_indices = np.where(songs.index.str.lower().str.contains(processed_song_name) & songs.index.str.lower().str.contains(processed_artist_name))[0]
if len(song_indices) > 0:
    song_index = song_indices[0]

    # Set options to display all columns and rows
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)

    # Specify the category for recommendation (e.g., "track", "artist", "genre", "album title", etc.)
    category = "track"

    # Print recommendation for the specified song and category
    recommendation = recommend_song(song_index, similarity_matrix, songs, category)
    if recommendation is not None:
        print("Recommendation for the song:", song_name, "by", artist_name)
        print(recommendation.to_string(index=False))
    else:
        print("No recommendation found for the specified category:", category)
else:
    print("Song not found:", song_name, "by", artist_name)

Recommendation for the song: plAiN JaNe by a$aP Ferg
                                                       Thumbnail                 Track      Artist                      Album Title  Release Year
https://i.scdn.co/image/ab67616d00004851d79cfa3155b50beb5dbeeb9a            Plain Jane   A$AP Ferg                   Still Striving          2017
https://i.scdn.co/image/ab67616d00004851b1d860ab1ba847e778b2796d    Hate It Or Love It    The Game                  The Documentary          2005
https://i.scdn.co/image/ab67616d00004851f2149422121d1674c6f4c009                 CrasH ScHoolboy Q                       CrasH Talk          2019
https://i.scdn.co/image/ab67616d00004851b1d860ab1ba847e778b2796d             How We Do    The Game                  The Documentary          2005
https://i.scdn.co/image/ab67616d00004851b1d860ab1ba847e778b2796d        Westside Story    The Game                  The Documentary          2005
https://i.scdn.co/image/ab67616d0000485150192d5f728fea13fb3af203 I Don'

In the example below I entered just the artist name without a song title, and it still returns recommendations with high similarities. Although they are all by the same artist, I think that for someone trying to find similar songs this is great.

In [15]:
# Enter the song name and artist
song_name = ""
artist_name = "mf doom"

# Preprocess the song and artist names
processed_song_name = preprocess_input(song_name)
processed_artist_name = preprocess_input(artist_name)

# Find the song index based on the processed song and artist names
song_indices = np.where(songs.index.str.lower().str.contains(processed_song_name) & songs.index.str.lower().str.contains(processed_artist_name))[0]
if len(song_indices) > 0:
    song_index = song_indices[0]

    # Set options to display all columns and rows
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)

    # Specify the category for recommendation (e.g., "track", "artist", "genre", "album title", etc.)
    category = "track"

    # Print recommendation for the specified song and category
    recommendation = recommend_song(song_index, similarity_matrix, songs, category)
    if recommendation is not None:
        print("Recommendation for the song:", song_name, "by", artist_name)
        print(recommendation.to_string(index=False))
    else:
        print("No recommendation found for the specified category:", category)
else:
    print("Song not found:", song_name, "by", artist_name)

Recommendation for the song:  by mf doom
                                                       Thumbnail                                   Track  Artist                      Album Title  Release Year
https://i.scdn.co/image/ab67616d000048516ce90ec627a0198a8efd127f                                Doomsday MF DOOM   Operation: Doomsday (Complete)          1999
https://i.scdn.co/image/ab67616d000048516ce90ec627a0198a8efd127f                              Gas Drawls MF DOOM   Operation: Doomsday (Complete)          1999
https://i.scdn.co/image/ab67616d000048516ce90ec627a0198a8efd127f                        Go With the Flow MF DOOM   Operation: Doomsday (Complete)          1999
https://i.scdn.co/image/ab67616d000048516ce90ec627a0198a8efd127f                              The Finest MF DOOM   Operation: Doomsday (Complete)          1999
https://i.scdn.co/image/ab67616d0000485152f194d02c39909d1b284799 Rapp Snitch Knishes feat. Mr. Fantastik MF DOOM                        MM...FOOD          2004

In [20]:
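def calculate_precision(recommended_songs, ground_truth):
    # Precision = relevant recommendations / total recommendations
    true_positives = 0
    false_positives = 0

    for song in recommended_songs:
        # Titles are compared as exact strings, so spelling must match
        if song in ground_truth:
            true_positives += 1
        else:
            false_positives += 1

    # Avoid dividing by zero when no songs were recommended
    total = true_positives + false_positives
    precision = true_positives / total if total > 0 else 0.0
    return precision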

# Example usage
recommended_songs = ['Vomispit', 'Doomsday', 'Deep Fried Frenz', 'Red and Gold']
ground_truth = ['Vomitspit', 'Red and Gold', 'Thats That']

precision = calculate_precision(recommended_songs, ground_truth)
print("Precision:", precision)

Precision: 0.25
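
A common variation on this metric is precision at k, which scores only the first k recommendations. The sketch below is one possible version, with case-insensitive matching added as an assumption. The function name and the toy lists are hypothetical and are not an evaluation of the recommender.

```python
def precision_at_k(recommended, relevant, k):
    # Score only the first k recommendations
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    # Compare titles case-insensitively
    relevant_set = {title.lower() for title in relevant}
    hits = sum(1 for title in top_k if title.lower() in relevant_set)
    return hits / len(top_k)

# Toy lists (hypothetical, for illustration only)
toy_recommended = ['Doomsday', 'red and gold', 'Deep Fried Frenz', 'Gas Drawls']
toy_relevant = ['Red and Gold', 'Gas Drawls']

p_at_1 = precision_at_k(toy_recommended, toy_relevant, 1)
p_at_2 = precision_at_k(toy_recommended, toy_relevant, 2)
p_at_4 = precision_at_k(toy_recommended, toy_relevant, 4)

print(p_at_1, p_at_2, p_at_4)
```

Matching is still done on whole titles, so a misspelled title counts as a miss.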


---
## Conclusion/Recommendations
Based on the two examples above, we received fairly similar music recommendations. Searching for an artist returns songs by that artist or by similar ones. Searching for a song and an artist together returns artists within the same genre and with similar release years. Spotify already does well with its suggestion systems, which are a good way to find a variety of music across eras, sub-genres, and artists. One thing that could help improve the Spotify system would be separating a main genre from the sub-genres, which would make it easier to classify songs and analyze the data.

---
### Sources

- This [StackOverflow Thread](https://stackoverflow.com/questions/54396405/how-can-i-preprocess-nlp-text-lowercase-remove-special-characters-remove-numb) helped me come up with the functions for the text.
- The [Machine Learning Geek](https://machinelearninggeek.com/spotify-song-recommender-system-in-python/) article on Spotify Recommenders helped me decide which numeric features to use and helped me create the function for the recommender.
- I also used this [Towards Data Science](https://towardsdatascience.com/using-cosine-similarity-to-build-a-movie-recommendation-system-ae7f20842599) movie recommender as an idea for my approach to the problem statement.
- Lastly, this [video](https://youtu.be/eyEabQRBMQA) helped me write the recommend_song function, which I extended to return the additional information I needed.