# Creating a Music Recommendation System

In this Jupyter notebook, I present a project that builds a music recommendation system from Spotify data using Python. The goal is a system that, given a song the user names, suggests other tracks with similar characteristics. By combining data analysis with text and numerical similarity measures on Spotify track data, we aim to provide relevant song recommendations.

## About Spotify and Music Recommendation

Spotify has revolutionized the way we consume music, offering a vast library of songs across various genres. Its sophisticated algorithms analyze user behavior, such as listening history and likes, to provide tailored music recommendations. Music recommendation systems have gained immense popularity due to their ability to introduce users to new artists, genres, and songs they might enjoy, enhancing their overall music experience.

## Dataset and Methodology

For this project, we will be using Spotify data containing a collection of [top hits from 2000 to 2019](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019). The dataset describes each song by its title, artist, duration, release year, popularity, genre, and a set of audio features such as danceability, energy, loudness, and tempo. Using this dataset, we will work through feature selection and similarity calculations to build a content-based music recommendation system.

## Project Goals

The primary objectives of this project are as follows:

1. **Data Preprocessing:** Clean and preprocess the Spotify dataset to extract relevant features for the recommendation system.

2. **Feature Engineering:** Create meaningful features that capture the essence of each song, artist, and genre.

3. **Similarity Calculation:** Develop methods to calculate similarities between songs, artists, and genres to identify potential recommendations.

Through this project, we aim to showcase the power of data-driven decision-making in the realm of music recommendation. By leveraging Spotify's extensive music collection and combining it with advanced data analysis techniques, we hope to provide users with an enhanced music discovery experience.

Let's dive into the project and explore the journey of building an efficient and accurate music recommendation system!


## Data Preprocessing

In [151]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

In [152]:
tracks = pd.read_csv('/kaggle/input/top-hits-spotify-from-20002019/songs_normalize.csv')
tracks.head()

| | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Britney Spears | Oops!...I Did It Again | 211160 | False | 2000 | 77 | 0.751 | 0.834 | 1 | -5.444 | 0 | 0.0437 | 0.3 | 1.8e-05 | 0.355 | 0.894 | 95.053 | pop |
| 1 | blink-182 | All The Small Things | 167066 | False | 1999 | 79 | 0.434 | 0.897 | 0 | -4.918 | 1 | 0.0488 | 0.0103 | 0.0 | 0.612 | 0.684 | 148.726 | rock, pop |
| 2 | Faith Hill | Breathe | 250546 | False | 1999 | 66 | 0.529 | 0.496 | 7 | -9.007 | 1 | 0.029 | 0.173 | 0.0 | 0.251 | 0.278 | 136.859 | pop, country |
| 3 | Bon Jovi | It's My Life | 224493 | False | 2000 | 78 | 0.551 | 0.913 | 0 | -4.063 | 0 | 0.0466 | 0.0263 | 1.3e-05 | 0.347 | 0.544 | 119.992 | rock, metal |
| 4 | \*NSYNC | Bye Bye Bye | 200560 | False | 2000 | 65 | 0.614 | 0.928 | 8 | -4.806 | 0 | 0.0516 | 0.0408 | 0.00104 | 0.0845 | 0.879 | 172.656 | pop |


In [153]:
tracks.isnull().sum()

artist              0
song                0
duration_ms         0
explicit            0
year                0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
genre               0
dtype: int64

## Model Building
I utilized the `%%capture` magic command to suppress the output of the code cell, which ensures that the printed output or messages are not displayed in the notebook. 

I then proceeded to initialize the `song_vectorizer` using the `CountVectorizer` from the `sklearn.feature_extraction.text` module. The vectorizer was fitted with the song names present in the dataset, allowing it to convert each song name into a numerical representation suitable for analysis. This step enables the system to process text data effectively for similarity calculations.

Subsequently, I sorted the `tracks` dataset in descending order based on the 'popularity' column. This arrangement ensures that the most popular songs appear at the beginning of the dataset. By printing the top entries using the `head()` function, I obtained a glimpse of the most popular songs present in the dataset.



In [154]:
%%capture
song_vectorizer = CountVectorizer()
song_vectorizer.fit(tracks['song'])
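
To make the vectorization step concrete, here is a minimal sketch of what `CountVectorizer` does with a handful of titles taken from the table above. The three-title list and the variable names (`demo_vectorizer`, `counts`, `unseen`) are illustrative assumptions, not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three titles from the dataset preview, used here as a tiny stand-in corpus.
titles = ['Bye Bye Bye', "It's My Life", 'Breathe']

demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(titles)

# The learned vocabulary: lowercased tokens of two or more characters.
vocab = sorted(demo_vectorizer.vocabulary_)

# Each text becomes a vector of token counts over that vocabulary.
counts = demo_vectorizer.transform(['Bye Bye Life']).toarray()[0]

# Tokens that were never seen during fit are silently ignored.
unseen = demo_vectorizer.transform(['Britney Spears']).toarray()[0]

print(vocab)
print(counts)
print(unseen)
```

The last line is worth noting: text whose words are all outside the fitted vocabulary maps to an all-zero vector, so what the vectorizer is fitted on determines which words can ever contribute to a similarity score.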

In [155]:
tracks = tracks.sort_values(by = 'popularity', ascending = False)
tracks.head()

| | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1322 | The Neighbourhood | Sweater Weather | 240400 | False | 2013 | 89 | 0.612 | 0.807 | 10 | -2.81 | 1 | 0.0336 | 0.0495 | 0.0177 | 0.101 | 0.398 | 124.053 | rock, pop |
| 1311 | Tom Odell | Another Love | 244360 | True | 2013 | 88 | 0.445 | 0.537 | 4 | -8.532 | 0 | 0.04 | 0.695 | 1.7e-05 | 0.0944 | 0.131 | 122.769 | pop |
| 201 | Eminem | Without Me | 290320 | True | 2002 | 87 | 0.908 | 0.669 | 7 | -2.827 | 1 | 0.0738 | 0.00286 | 0.0 | 0.237 | 0.662 | 112.238 | hip hop |
| 1613 | WILLOW | Wait a Minute! | 196520 | False | 2015 | 86 | 0.764 | 0.705 | 3 | -5.279 | 0 | 0.0278 | 0.0371 | 1.9e-05 | 0.0943 | 0.672 | 101.003 | pop, R&B, Dance/Electronic |
| 6 | Eminem | The Real Slim Shady | 284200 | True | 2000 | 86 | 0.949 | 0.661 | 5 | -4.244 | 0 | 0.0572 | 0.0302 | 0.0 | 0.0454 | 0.76 | 104.504 | hip hop |


Next, I used the `cosine_similarity` function from `sklearn.metrics.pairwise` to measure how similar two songs are. The `get_similarities()` function computes a score for a given song against every song in the dataset. For each pair it builds two vectors: a text vector, obtained by passing the artist name through `song_vectorizer`, and a numerical vector made of the numeric columns (duration, year, popularity, and the audio features). Because `song_vectorizer` was fitted on song titles, only artist-name words that also occur in some title contribute to the text vector. The text and numerical cosine similarities are then summed into a single similarity factor per song, and the songs with the highest factors become the recommendations.
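
As a small worked example of the measure itself, the sketch below computes cosine similarity by hand, cos(u, v) = (u · v) / (‖u‖ ‖v‖), and compares it with scikit-learn's result. The two tracks and their feature values are hypothetical; the durations are borrowed from the dataset preview only to show the effect of a large-magnitude column.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical tracks described by (danceability, energy).
track_a = [0.9, 0.1]
track_b = [0.1, 0.9]
small = cosine(track_a, track_b)

# The same tracks with an unscaled duration in milliseconds appended.
track_a_ms = track_a + [211160]
track_b_ms = track_b + [250546]
large = cosine(track_a_ms, track_b_ms)

# The hand-written formula agrees with scikit-learn's implementation.
sk = cosine_similarity([track_a], [track_b])[0][0]

print(small, large, sk)
```

On the two-feature vectors the tracks look quite different, but once the duration column is appended the similarity is almost exactly 1, because the largest-magnitude feature dominates the angle between the vectors. Scaling the numeric columns to a common range before comparing them is one possible way to balance their influence; that is a suggestion, not something the notebook does.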


In [198]:
from sklearn.metrics.pairwise import cosine_similarity

def get_similarities(song_name, data):
    # Ignore a 'similarity_factor' column left over from an earlier call,
    # so it is not treated as a numeric feature.
    features = data.drop(columns='similarity_factor', errors='ignore')

    # Text vector (artist name) and numeric feature vector for the input song
    query = features[features['song'] == song_name]
    text_array1 = song_vectorizer.transform(query['artist']).toarray()
    num_array1 = query.select_dtypes(include=np.number).to_numpy()

    sim = []
    for idx, row in features.iterrows():
        # Rows sharing this song name; only the first match is scored below
        match = features[features['song'] == row['song']]
        text_array2 = song_vectorizer.transform(match['artist']).toarray()
        num_array2 = match.select_dtypes(include=np.number).to_numpy()

        # Cosine similarity of the artist text and of the numeric features
        text_sim = cosine_similarity(text_array1, text_array2)[0][0]
        num_sim = cosine_similarity(num_array1, num_array2)[0][0]
        sim.append(text_sim + num_sim)

    return sim
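
The row-by-row loop above can also be expressed as two matrix operations. The following is a sketch under stated assumptions rather than the notebook's implementation: the function name `get_similarities_vectorized`, the explicit `numeric_cols` argument, and the toy DataFrame are all hypothetical, and the toy vectorizer is fitted on artist names purely for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_similarities_vectorized(song_name, data, vectorizer, numeric_cols):
    """Return one combined (text + numeric) similarity score per row of data."""
    mask = (data['song'] == song_name).to_numpy()
    if not mask.any():
        raise ValueError(f'{song_name!r} not found in data')
    q = int(np.argmax(mask))  # position of the first matching row

    # Vectorize every artist name and collect the numeric features once.
    text = vectorizer.transform(data['artist'])
    nums = data[numeric_cols].to_numpy(dtype=float)

    # Compare the query row against all rows in a single call each.
    text_sim = cosine_similarity(text[q:q + 1], text).ravel()
    num_sim = cosine_similarity(nums[q:q + 1], nums).ravel()
    return text_sim + num_sim

# Hypothetical toy data, only to exercise the function.
toy = pd.DataFrame({
    'song': ['One', 'Two', 'Three'],
    'artist': ['Alpha Band', 'Alpha Band', 'Beta Crew'],
    'danceability': [0.8, 0.7, 0.2],
    'energy': [0.9, 0.8, 0.1],
})
toy_vectorizer = CountVectorizer().fit(toy['artist'])
scores = get_similarities_vectorized(
    'One', toy, toy_vectorizer, ['danceability', 'energy'])
print(scores)
```

Passing the numeric columns explicitly keeps helper columns such as `similarity_factor` out of the comparison, and computing all similarities at once avoids filtering the DataFrame inside a loop.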


Next, I implemented the `recommend_songs()` function to generate song recommendations based on the input song name. Firstly, the function retrieves the rows corresponding to the input song name from the dataset. If no such rows are found, a message is displayed indicating that the song might not be popular or the name is not present in the playlist. In this case, a random selection of 7 songs from the dataset is printed as alternative recommendations.

If the input song name does match rows in the dataset, the function calculates similarity factors with `get_similarities()`, which combines the text and numerical scores. It stores them in a `similarity_factor` column and sorts the dataset in descending order by similarity factor, using popularity to break ties. The input song itself is then excluded, and the top 7 remaining songs are displayed with their artists.

In [199]:
def recommend_songs(song_name, data=tracks):
    song_df = data[data['song'] == song_name]
    
    if song_df.shape[0] == 0:
        print('This song is either not so popular or you have entered an invalid name not contained in this playlist')
        
        for song in data.sample(n=7)['song'].values:
            print(song)
    else:
        data['similarity_factor'] = get_similarities(song_name, data)
        data.sort_values(by=['similarity_factor', 'popularity'],
                        ascending=[False, False],
                        inplace=True)

        recommended_songs = data[data['song'] != song_name][['song', 'artist']][:7]
        display(recommended_songs)
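
The ranking step can also be written so that it returns the recommendations and leaves the input DataFrame unchanged. This is a minimal sketch with a hypothetical helper name (`top_n_similar`) and made-up toy data; it takes precomputed scores, such as the list returned by `get_similarities()`, as an argument.

```python
import pandas as pd

def top_n_similar(data, scores, song_name, n=7):
    """Rank songs by similarity, then popularity, without mutating data."""
    ranked = (
        data.assign(similarity_factor=scores)  # works on a copy
            .sort_values(by=['similarity_factor', 'popularity'],
                         ascending=[False, False])
    )
    # Drop the query song itself and keep the n best remaining rows.
    return ranked.loc[ranked['song'] != song_name, ['song', 'artist']].head(n)

# Hypothetical toy data and scores, only to exercise the function.
toy = pd.DataFrame({
    'song': ['A', 'B', 'C', 'D'],
    'artist': ['w', 'x', 'y', 'z'],
    'popularity': [50, 60, 70, 80],
})
toy_scores = [2.0, 1.5, 1.5, 0.3]
result = top_n_similar(toy, toy_scores, 'A', n=2)
print(result)
```

Returning a DataFrame instead of calling `display()` makes the function usable outside a notebook, and leaving `data` untouched means repeated calls start from the same columns and row order.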

## Testing it All Out
Finally, I tested the `recommend_songs()` function with two inputs: 'Swang' and 'Swimming Pools (Drank) - Extended Version'. For each input, the function computed similarity factors against the other songs in the playlist, using both the artist text and the numerical features of the songs. For 'Swimming Pools (Drank) - Extended Version', five of the seven recommendations are Kendrick Lamar tracks, which is consistent with the artist-text score pulling related songs to the top. The results for 'Swang' are more mixed in artist and style, which suggests that the numerical score on its own separates songs less sharply.

In [206]:
recommend_songs('Swang')

| | song | artist |
|---|---|---|
| 1170 | Without You (feat. Usher) | David Guetta |
| 1340 | I Could Be The One (Avicii Vs. Nicky Romero) -... | Avicii |
| 1581 | Stitches | Shawn Mendes |
| 93 | Breathless | The Corrs |
| 1244 | Feel So Close - Radio Edit | Calvin Harris |
| 449 | Lola's Theme - Radio Edit | The Shapeshifters |
| 926 | Evacuate The Dancefloor | Cascada |


In [215]:
recommend_songs('Swimming Pools (Drank) - Extended Version')

| | song | artist |
|---|---|---|
| 1234 | m.A.A.d city | Kendrick Lamar |
| 1836 | All The Stars (with SZA) | Kendrick Lamar |
| 1898 | LOVE. FEAT. ZACARI. | Kendrick Lamar |
| 1750 | DNA. | Kendrick Lamar |
| 1739 | HUMBLE. | Kendrick Lamar |
| 693 | Welcome to the Black Parade | My Chemical Romance |
| 719 | Stronger | Kanye West |


## Conclusion

In this project, I built a music recommendation system using Python and the Top Spotify Hits from the last two decades. The main objective was to use data analysis and similarity measures to suggest songs related to a track the user provides. Working with this Kaggle dataset of Spotify tracks, which covers a diverse array of songs across multiple genres, gave me valuable insight into data preprocessing, feature extraction, and similarity analysis.

Throughout this project, I learned about the significance of data preprocessing in ensuring the accuracy of the recommendation system. Handling missing values, normalizing data, and selecting relevant features were crucial steps in this process. Moreover, I discovered the importance of optimizing the system's performance by fine-tuning parameters and incorporating relevant metrics.

In closing, this project not only provided me with practical experience in data analysis and machine learning but also highlighted the potential of recommendation systems in personalizing user experiences. The process of creating a music recommendation system deepened my understanding of data manipulation, feature engineering, and similarity calculations. As the world of data continues to evolve, such projects exemplify the exciting possibilities that arise from merging technology with everyday interests like music.

