# Song Recommendation System

## Problem Statement


Many companies worldwide recommend songs to listeners based on their interests. Popular examples include Spotify, iTunes, Gaana and Saavn. Song recommendations help users discover new artists who make music similar to the genres they already listen to. This increases revenue across these platforms and helps artists make a living by streaming their music online.

As part of this exercise we will build a recommendation system that recommends a list of songs based on the user's song preference.

## Attribute Information

There are 2 files that we will be using in this case study, 'songs.csv' and 'song_extra_info.csv'. 

The 'songs.csv' file has the following attributes:

- song_id: Unique id of the song
- song_length: Duration of the song in milliseconds
- genre_ids: Unique id of the genre of the song
- artist_name: Name of the artist who performs the song
- composer: Name of the composer of the song
- lyricist: Name of the lyricist of the song
- language: The language of the song



The 'song_extra_info.csv' file has the following attributes:

- song_id: Unique id of the song
- name: name of the song
- isrc: International standard recording code

## Table of Contents

1. Import Libraries

2. Setting options

3. Read Data 

4. Exploratory Data Analysis and Data Preprocessing

  4.1 - Check shape 

  4.2 - Check for missing values

  4.3 - Sample only 10000 data points from the huge dataset 



5. Content Based Recommendation System

6. Conclusion and Interpretation

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

## 1. Import Required Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import os
import glob
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from zipfile import ZipFile

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# filterwarnings to ignore all unnecessary warnings and logs
import warnings
warnings.filterwarnings('ignore')

## 2. Setting Options

In [4]:
# suppress display of warnings
warnings.filterwarnings('ignore')

# display all dataframe columns
pd.options.display.max_columns = None

# display floats with 7 decimal places
pd.options.display.float_format = '{:.7f}'.format

# display all dataframe rows
pd.options.display.max_rows = None

In [5]:
# os.chdir('/content/drive/My Drive/Week 1 Practice Case Study')
# os.getcwd()

## 3. Read Data

In [6]:
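# read the data files
# NOTE: the garbled non-English names in the outputs below suggest the files may be UTF-8;
# if so, encoding='utf-8' would decode them correctly
songs = pd.read_csv('songs.csv',encoding='latin')
info = pd.read_csv('song_extra_info.csv',encoding='latin')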

In [7]:
# check few rows of the imported dataset
songs.head()

Unnamed: 0,song_id,song_length,genre_ids,artist_name,composer,lyricist,language
0,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=,247640,465,å¼µä¿¡å² (Jeff Chang),è£è²,ä½åå¼,3.0
1,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=,197328,444,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,31.0
2,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=,231781,465,SUPER JUNIOR,,,31.0
3,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=,273554,465,S.H.E,æ¹¯å°åº·,å¾ä¸ç,3.0
4,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=,140329,726,è²´æç²¾é¸,Traditional,Traditional,52.0


In [8]:
info.head()

Unnamed: 0,song_id,name,isrc
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,TWUM71200043
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,QMZSY1600015
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,TWA530887303
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,USSM11301446
4,0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=,ææç¾ç¶²,TWA471306001


## 4. Exploratory Data Analysis and Data Preprocessing

### 4.1 Check shape

In [9]:
songs.shape

(2296320, 7)

In [10]:
info.shape

(2295971, 3)

In [11]:
# check the columns in each dataframe
print(songs.columns)
print('===============================================')
print(info.columns)

Index(['song_id', 'song_length', 'genre_ids', 'artist_name', 'composer',
       'lyricist', 'language'],
      dtype='object')
Index(['song_id', 'name', 'isrc'], dtype='object')


In [12]:
#merge the two dataframes
df = info.merge(songs,on='song_id')

In [13]:
print(df.columns)

Index(['song_id', 'name', 'isrc', 'song_length', 'genre_ids', 'artist_name',
       'composer', 'lyricist', 'language'],
      dtype='object')


In [14]:
# make a copy of the merged dataframe so that the original stays unchanged
df_composer = df.copy()

In [15]:
df_composer.head()

Unnamed: 0,song_id,name,isrc,song_length,genre_ids,artist_name,composer,lyricist,language
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,TWUM71200043,307304,458,æä¸å·§åå åè²å¸¶,An-An Tso,,3.0
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,QMZSY1600015,205914,1609,DJ Snake,Justin Bieber| William Grigahcine| Andrew Watt...,,52.0
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,TWA530887303,252160,465,è­æ¬é¨° (Jam Hsiao),A Qin,A Qin| Chen Tian You| Wu Yi Wei,3.0
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,USSM11301446,175427,465,MKTO,Evan Bogart|Andrew Goldstein|Lindy Robbins|Ema...,Evan Bogart|Andrew Goldstein|Lindy Robbins|Ema...,52.0
4,0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=,ææç¾ç¶²,TWA471306001,294983,458,ç¾å¿ç¥¥ (Show Lo),Drew Ryan Scott / David Moses Jassy / Niclas M...,Drew Ryan Scott / David Moses Jassy / Niclas M...,3.0


In [16]:
df_composer = df_composer.drop(df_composer.columns.difference(['song_id','name','composer']),axis =1)
df_composer.head()

Unnamed: 0,song_id,name,composer
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,An-An Tso
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,Justin Bieber| William Grigahcine| Andrew Watt...
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,A Qin
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,Evan Bogart|Andrew Goldstein|Lindy Robbins|Ema...
4,0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=,ææç¾ç¶²,Drew Ryan Scott / David Moses Jassy / Niclas M...


In [17]:
df_composer.shape

(2295422, 3)

In [18]:
#Check Data types
df_composer.dtypes

song_id     object
name        object
composer    object
dtype: object

In [19]:
df_composer.head(4)

Unnamed: 0,song_id,name,composer
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,An-An Tso
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,Justin Bieber| William Grigahcine| Andrew Watt...
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,A Qin
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,Evan Bogart|Andrew Goldstein|Lindy Robbins|Ema...


### 4.2 Check missing values

In [20]:
# Check for missing values present
print('Number of missing values across columns-\n', df_composer.isnull().sum())


Number of missing values across columns-
 song_id           0
name              2
composer    1070938
dtype: int64


Out of 2295422 records, 2 are missing a value in the name column and 1070938 are missing a value in the composer column.

Let's drop the missing values.

In [21]:
df_composer.dropna(inplace=True)

In [22]:
df_composer.isnull().sum()

song_id     0
name        0
composer    0
dtype: int64

In [23]:
df_composer.shape

(1224482, 3)

### 4.3 Sample only 10000 data points from the huge dataset

In [24]:
df_sampled = df_composer.sample(n=10000,random_state=98)

In [25]:
df_sampled.head()

Unnamed: 0,song_id,name,composer
528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [26]:
df_sampled.shape

(10000, 3)
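The notebook does not say why only 10000 rows are kept. One plausible reason (an assumption here) is memory: the dense similarity matrix built in the next section holds one float64 value per pair of songs. A rough back-of-the-envelope sketch, where `dense_matrix_bytes` is a hypothetical helper:

```python
def dense_matrix_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_rows matrix of float64 values."""
    return n_rows * n_rows * bytes_per_value

sampled_bytes = dense_matrix_bytes(10_000)
full_bytes = dense_matrix_bytes(1_224_482)  # rows left after dropna above

print(f"sampled: {sampled_bytes / 1e9:.1f} GB")
print(f"full:    {full_bytes / 1e12:.1f} TB")
```

Under this assumption the sampled matrix needs under a gigabyte, while a dense matrix over all remaining rows would need terabytes.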

## 5. Content Based Recommendation System

We will build a TF-IDF document-term matrix from the `composer` column, using single words and pairs of adjacent words (`ngram_range=(1, 2)`) as terms.

In [27]:
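# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')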
tfidf_matrix = tf.fit_transform(df_sampled['composer'])

In [28]:
tfidf_matrix.shape

(10000, 26958)

In [29]:
tfidf_matrix

<10000x26958 sparse matrix of type '<class 'numpy.float64'>'
	with 52263 stored elements in Compressed Sparse Row format>
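To make the shape of this matrix more concrete, here is a minimal sketch on three hypothetical composer strings (the names `toy_composers`, `toy_tf` and `toy_matrix` are illustrative and not part of the notebook). It uses the same `analyzer`, `ngram_range` and `stop_words` settings as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical composer strings, using the '|' separator seen in the data
toy_composers = [
    "Justin Bieber| Andrew Watt",
    "Andrew Watt",
    "Claude Debussy",
]

toy_tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
toy_matrix = toy_tf.fit_transform(toy_composers)

# Each column is a lowercased word or a pair of adjacent words
vocab = sorted(toy_tf.vocabulary_)
print(vocab)
print(toy_matrix.shape)  # one row per song, one column per term
```

Note that the vocabulary contains the bigram `bieber andrew`: because the separator is treated as ordinary punctuation, bigrams can span two different composers. Splitting the field on its separators before vectorizing would be one way to avoid this, if that matters for the use case.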

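Next we calculate the pairwise cosine similarity between the rows of the TF-IDF matrix. Because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product returned by `linear_kernel` equals the cosine similarity.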
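The equivalence between the dot product and the cosine similarity can be checked on a small example. This is only a sketch on hypothetical strings, assuming the vectorizer's default `norm='l2'` setting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["Claude Debussy", "Debussy Ravel", "Bertie Higgins"]  # hypothetical composers
X = TfidfVectorizer().fit_transform(docs)

# With the default norm='l2', every row has unit length
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
dot = linear_kernel(X, X)      # plain dot products between rows
cos = cosine_similarity(X, X)  # dot products divided by the row norms

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

Since the two results agree for normalized rows, `linear_kernel` is a reasonable choice here because it skips the extra normalization step.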

In [49]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# cosine_sim1 = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [50]:
cosine_sim.shape

(10000, 10000)

In [54]:
cosine_sim[1]

array([0., 1., 0., ..., 0., 0., 0.])

In [35]:
df_sampled.head()

Unnamed: 0,song_id,name,composer
528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [40]:
df_sampled = df_sampled.reset_index()

In [41]:
df_sampled.head()

Unnamed: 0,index,song_id,name,composer
0,528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
1,887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
2,1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
3,171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
4,1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [42]:
titles = df_sampled['name']
indices = pd.Series(df_sampled.index, index=df_sampled['name'])

We create an `indices` Series, indexed by song name, that returns the row position of a song in `df_sampled` given its name.

In [66]:
indices.head()

name
ä¸è½åè¨´ä½ Â           0
California Dreamin'        1
GREENSLEVES                2
Another Day in Paradise    3
V.V.V.                     4
dtype: int64
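One caveat with a name-based lookup like this: if two songs in the sample share a title, the lookup returns a Series of positions rather than a single integer. The notebook does not check whether this happens in the sample, so the sketch below uses hypothetical titles, and `toy_indices` and `unique_indices` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Hey You", "Classic", "Hey You"]})  # hypothetical titles
toy_indices = pd.Series(toy.index, index=toy["name"])

print(toy_indices["Classic"])  # a single integer position
print(toy_indices["Hey You"])  # a Series holding two positions

# One possible fix: keep only the first occurrence of each title
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(unique_indices["Hey You"])
```

Keeping the first occurrence is only one option; looking songs up by `song_id` instead of by name would avoid the ambiguity altogether.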

The `get_recommendations` function defined below takes a song name as an argument and finds its index. It then gets the similarity scores of every song against that index, sorts them from highest to lowest, skips the first entry (normally the song itself, with a score of 1.0), and returns the names of the next 30 songs. The cells that follow walk through these steps for one song.

In [67]:
indices['GREENSLEVES']

2

In [68]:
cosine_sim[2]

array([0., 0., 1., ..., 0., 0., 0.])

In [69]:
list(enumerate(cosine_sim[2]))

[(0, 0.0),
 (1, 0.0),
 (2, 1.0),
 (3, 0.0),
 (4, 0.0),
 ...

In [70]:
sorted(list(enumerate(cosine_sim[2])), key=lambda x: x[1], reverse=True)

[(2, 1.0),
 (2497, 0.653121251761778),
 (118, 0.36795648708557915),
 (1661, 0.36795648708557915),
 (3709, 0.36795648708557915),
 (7990, 0.36795648708557915),
 (8016, 0.36795648708557915),
 (8229, 0.36795648708557915),
 (8508, 0.36795648708557915),
 (8979, 0.36795648708557915),
 (9160, 0.36795648708557915),
 (9411, 0.36795648708557915),
 (9920, 0.36795648708557915),
 (2274, 0.2771915965408254),
 (5845, 0.21430604389582655),
 (5454, 0.11112053807474691),
 (7573, 0.11090932691763798),
 (936, 0.09560937760488224),
 (3145, 0.08802981317016108),
 (9308, 0.08802981317016108),
 (3260, 0.08690923243968914),
 (6503, 0.08690923243968914),
 (7444, 0.08690923243968914),
 (2102, 0.08145932547415667),
 (8452, 0.07350286744795655),
 (5567, 0.07278244569842479),
 (4885, 0.07214280757863481),
 (8709, 0.07210181783180689),
 (4024, 0.06711299316202739),
 (7260, 0.06534559604982175),
 (6755, 0.06420812780494421),
 (2526, 0.06241178103657731),
 (4202, 0.061274034217756373),
 (5165, 0.061274034217756373),
 ...

In [75]:
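def get_recommendations(Name):
    # position of the song in df_sampled (assumes the name is unique in the sample)
    idx = indices[Name]
    # pair every song position with its similarity to the chosen song
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry (normally the song itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    music_indices = [i[0] for i in sim_scores]
    return titles.iloc[music_indices]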
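The function above returns 30 songs even when most of them have a similarity score of zero. As a sketch of one possible refinement (not the notebook's method), the hypothetical `recommend_nonzero` below drops zero-score songs and also tolerates duplicate titles. It takes the lookup Series, the titles and the similarity matrix as arguments so that it can be tried on toy data.

```python
import numpy as np
import pandas as pd

def recommend_nonzero(name, indices, titles, sim, top_n=30):
    """Hypothetical variant of get_recommendations that skips zero-similarity songs."""
    idx = indices[name]
    if isinstance(idx, pd.Series):  # duplicate titles: use the first match
        idx = idx.iloc[0]
    scores = sim[idx]
    order = np.argsort(-scores, kind="stable")  # highest score first
    # drop the song itself and anything with no shared terms
    keep = [i for i in order if i != idx and scores[i] > 0][:top_n]
    return titles.iloc[keep]

# Toy data: songs A and B share a composer term, song C shares nothing
toy_titles = pd.Series(["A", "B", "C"])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)
toy_sim = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

print(list(recommend_nonzero("A", toy_indices, toy_titles, toy_sim)))
print(list(recommend_nonzero("C", toy_indices, toy_titles, toy_sim)))
```

With this variant a song that shares no composer terms with any other song gets an empty result instead of an arbitrary list, which makes the lack of evidence visible to the caller.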

Let us try it on a few songs.

In [76]:
get_recommendations('GREENSLEVES').head(5)

2497    Claude Debussy (1862 - 1918) Images - Book 2 -...
118        PrÃ©ludes| Book 1: X. La cathÃ©drale engloutie
1661    Cello Sonata| L. 135: II. SÃ©rÃ©nade. ModÃ©rÃ©...
3709                                          3. SirÃ¨nes
7990    Images| Book 2| L. 111: No. 1. Cloches Ã  trav...
Name: name, dtype: object

In [77]:
get_recommendations('Another Day in Paradise').head(5)

9074                               Floatin
354               Sound Of The Underground
5770    Little White Lies (Wideboys Remix)
0                        ä¸è½åè¨´ä½ Â 
1                      California Dreamin'
Name: name, dtype: object

In [78]:
get_recommendations('V.V.V.').head(10)

0                   ä¸è½åè¨´ä½ Â 
1                 California Dreamin'
2                         GREENSLEVES
3             Another Day in Paradise
5                             Tai Chi
6            æ¨ä¸å¾æç¼ççé
7     Ãtude No. 3 in E Major| Op. 10
8                             Hey You
9                       Fools Rush In
10                          é¨æ«»è±
Name: name, dtype: object

## 6. Conclusion and Interpretation

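Thus, we have built a content-based song recommendation engine from a sample of 10000 songs drawn from the full dataset, using only TF-IDF features of the composer field. The quality of the results depends on shared composer terms: 'GREENSLEVES' returns pieces that appear to be by Debussy, while for 'V.V.V.' the function returns the first rows of the sample in order, which suggests that no other sampled song shares a term with its composer and every similarity score is zero.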