### Building the recommendation engine

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("../assets/datasets/music_sessions.csv")

If you inspect the dataset, you will see that it holds a lot of extra information about each session. We don’t need all of it, so we use only the `keywords` column as our feature set (the so-called “content” of the session).

In [3]:
features = ['keywords']

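Our next task is to create a function that combines the values of the feature columns into a single string. Since we use only one feature here, the function simply returns the `keywords` value; the commented-out line shows how more columns could be joined.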

In [4]:
def combine_features(row):
    # return row['keywords']+" "+row['category']+" "+row['tags']
    return row['keywords']
df['keywords'].head()

0    record labels music publishers rights licensin...
1    streaming mechanical performance sync licensin...
2    intellectual property rights protection regist...
3    streaming platforms aggregators release strate...
4    representation booking touring career developm...
Name: keywords, dtype: object

Now we need to call this function on each row of our dataframe. Before doing that, we clean the data by filling every NaN value in the feature columns with an empty string.

In [5]:
for feature in features:
    df[feature] = df[feature].fillna('')  # fill NaNs with an empty string

# apply combine_features() to each row and store the result in "combined_features"
df["combined_features"] = df.apply(combine_features, axis=1)
df["combined_features"].head()

0    record labels music publishers rights licensin...
1    streaming mechanical performance sync licensin...
2    intellectual property rights protection regist...
3    streaming platforms aggregators release strate...
4    representation booking touring career developm...
Name: combined_features, dtype: object
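To see what the fill-and-combine step does when more than one column is used, here is a minimal sketch on a toy dataframe. The `category` and `tags` columns are borrowed from the commented-out line above and are only assumed to exist for illustration; the toy values are made up.

```python
import pandas as pd

# Toy data; the 'category' and 'tags' columns are assumptions for illustration.
toy = pd.DataFrame({
    "keywords": ["royalties licensing", None],
    "category": ["legal", "marketing"],
    "tags": [None, "streaming playlists"],
})

toy_features = ["keywords", "category", "tags"]

# Replace missing values with empty strings so joining never sees NaN.
toy[toy_features] = toy[toy_features].fillna("")

# Join the feature columns row by row, skipping empty fields.
toy["combined_features"] = toy[toy_features].apply(
    lambda row: " ".join(value for value in row if value), axis=1
)

print(toy["combined_features"].tolist())
# ['royalties licensing legal', 'marketing streaming playlists']
```

Skipping empty fields avoids stray double spaces, although `CountVectorizer` would ignore extra whitespace anyway.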

Now that we have the combined strings, we can feed them to a `CountVectorizer()` object to get the count matrix.

In [6]:
cv = CountVectorizer()  # create a new CountVectorizer() object
count_matrix = cv.fit_transform(df["combined_features"])  # feed the combined strings (session contents) to it
count_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

The count matrix is ready. Next, we compute the cosine similarity matrix from it; entry (i, j) of that matrix is the similarity between session i and session j.

In [7]:
cosine_sim = cosine_similarity(count_matrix)
print(cosine_sim.shape)
cosine_sim  

(100, 100)


array([[1.        , 0.11785113, 0.11785113, ..., 0.        , 0.        ,
        0.        ],
       [0.11785113, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.11785113, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.40089186,
        0.5       ],
       [0.        , 0.        , 0.        , ..., 0.40089186, 1.        ,
        0.26726124],
       [0.        , 0.        , 0.        , ..., 0.5       , 0.26726124,
        1.        ]])
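For intuition about the numbers in this matrix, cosine similarity is the dot product of two count vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up count vectors; the helper name `cosine` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D count vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up count vectors over a six-word vocabulary.
a = np.array([0, 0, 1, 1, 1, 0])  # three words
b = np.array([0, 1, 1, 0, 0, 0])  # two words, one shared with a
c = np.array([1, 0, 0, 0, 0, 1])  # two words, none shared with a

print(round(cosine(a, b), 4))  # 1 / sqrt(3 * 2), about 0.4082
print(cosine(a, c))            # 0.0, no words in common
```

Two sessions with no keywords in common score 0, and a session compared with itself scores 1, possibly off by a tiny floating-point rounding error.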

Now we will define two helper functions to get a session name from its index and vice versa.

In [8]:
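def get_name_from_index(index):
    # look up the session name for a given row index
    return df[df.index == index]["name"].values[0]

def get_index_from_name(name):
    # look up the value of the "index" column for a given session name
    return df[df["name"] == name]["index"].values[0]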
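Note that `.values[0]` raises an `IndexError` when no row matches the name. If a clearer error is wanted, a lookup along the following lines is one option. This is only a sketch: `find_position` is a hypothetical helper, the toy names are made up, and it assumes the dataframe keeps its default integer index so that index labels equal row positions.

```python
import pandas as pd

# Toy dataframe with the default RangeIndex (assumption: labels equal positions).
toy = pd.DataFrame({"name": ["Session A", "Session B", "Session C"]})

def find_position(frame, name):
    """Return the row position of the first session called `name`."""
    matches = frame.index[frame["name"] == name]
    if len(matches) == 0:
        raise KeyError(f"No session named {name!r}")
    return int(matches[0])

print(find_position(toy, "Session B"))  # 1
```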

Our next step is to take the name of the session that the user currently likes and find its index. We then access the row for that session in the similarity matrix, which holds its similarity score against every session. Enumerating that row pairs each score with its index, turning a row like `[1 0.5 0.2 0.9]` into `[(0, 1) (1, 0.5) (2, 0.2) (3, 0.9)]`, where each item has the form (session index, similarity score).
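The enumeration step on its own, using the example row from the paragraph above, looks like this:

```python
# Example similarity row from the text above.
scores = [1, 0.5, 0.2, 0.9]

# enumerate() pairs each score with its position in the row.
pairs = list(enumerate(scores))

print(pairs)  # [(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]
```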

In [9]:
session_user_likes = "Understanding Royalties"
name_index = get_index_from_name(session_user_likes)
similar_sessions = list(enumerate(cosine_sim[name_index]))  # pair each similarity score in this session's row with its index
similar_sessions


[(0, np.float64(0.1178511301977579)),
 (1, np.float64(0.0)),
 (2, np.float64(0.9999999999999999)),
 (3, np.float64(0.0)),
 (4, np.float64(0.0)),
 (5, np.float64(0.0)),
 (6, np.float64(0.0)),
 (7, np.float64(0.12499999999999997)),
 (8, np.float64(0.0)),
 (9, np.float64(0.0)),
 (10, np.float64(0.40089186286863654)),
 (11, np.float64(0.0)),
 (12, np.float64(0.0)),
 (13, np.float64(0.24999999999999994)),
 (14, np.float64(0.0)),
 (15, np.float64(0.0)),
 (16, np.float64(0.0)),
 (17, np.float64(0.12499999999999997)),
 (18, np.float64(0.0)),
 (19, np.float64(0.0)),
 (20, np.float64(0.0)),
 (21, np.float64(0.3749999999999999)),
 (22, np.float64(0.0)),
 (23, np.float64(0.12499999999999997)),
 (24, np.float64(0.0)),
 (25, np.float64(0.0)),
 (26, np.float64(0.0)),
 (27, np.float64(0.0)),
 (28, np.float64(0.13363062095621217)),
 (29, np.float64(0.0)),
 (30, np.float64(0.26726124191242434)),
 (31, np.float64(0.0)),
 (32, np.float64(0.0)),
 (33, np.float64(0.0)),
 (34, np.float64(0.0)),
 ...


Now comes the key step. We sort the list `similar_sessions` by similarity score in descending order. Since the session most similar to a given session is itself, we discard the first element after sorting and keep the next five.

In [10]:
sorted_similar_sessions = sorted(similar_sessions,key=lambda x:x[1],reverse=True)[1:6]
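Dropping the first element assumes the liked session always sorts to the top. If another session had identical keywords, both would score 1 and the wrong one might be dropped. A slightly more defensive variant, sketched here on a made-up score row, filters the query out by index before sorting.

```python
# Made-up similarity row; position 2 is the query session itself,
# and position 4 happens to have identical keywords.
scores = [0.2, 0.9, 1.0, 0.5, 1.0]
query_index = 2

# Exclude the query by index, then sort the rest by score, highest first.
ranked = sorted(
    ((i, s) for i, s in enumerate(scores) if i != query_index),
    key=lambda pair: pair[1],
    reverse=True,
)
top_3 = ranked[:3]

print(top_3)  # [(4, 1.0), (1, 0.9), (3, 0.5)]
```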

Now we loop over `sorted_similar_sessions` and print its five entries.

In [11]:
print("Top 5 similar sessions to "+session_user_likes+" are:\n")
for element in sorted_similar_sessions:  # already limited to 5 entries by the [1:6] slice
    print(get_name_from_index(element[0]),round(element[1],3))

Top 5 similar sessions to Understanding Royalties are:

Music Publishing Deals 0.401
Music Copyright Registration 0.375
Music Business Legal Basics 0.267
Music Contracts Negotiation 0.267
Music Rights Management 0.267
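The steps so far can be wrapped into a single function. The sketch below is one possible arrangement, not the notebook's own code: `recommend` and its parameters are hypothetical names, and the toy inputs are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(names, texts, liked_name, top_n=5):
    """Return up to top_n (name, score) pairs most similar to liked_name."""
    counts = CountVectorizer().fit_transform(texts)
    sim = cosine_similarity(counts)
    query = names.index(liked_name)  # raises ValueError if the name is unknown
    ranked = sorted(
        ((i, float(score)) for i, score in enumerate(sim[query]) if i != query),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(names[i], round(score, 3)) for i, score in ranked[:top_n]]

# Made-up sessions, for illustration only.
toy_names = ["A", "B", "C"]
toy_texts = ["royalties licensing rights", "licensing contracts", "touring booking"]

print(recommend(toy_names, toy_texts, "A", top_n=2))
# [('B', 0.408), ('C', 0.0)]
```

With the real data, `names` and `texts` would be `df["name"].tolist()` and `df["combined_features"]`.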


# Sort similar sessions by rating

Let's inspect the `rating` column and check whether it has any null values. It looks clean.

In [12]:
df["rating"].unique()

array([8.5, 7.8, 8.2, 7.9, 8.7, 8.4, 7.6, 8.3, 7.5, 8.8, 7.7, 8.9, 8.1,
       7.2, 8.6, 9.1, 7.4, 7.3, 9. , 8. ])

Now we sort `sorted_similar_sessions` again, this time by `rating`. `x[0]` holds the index of the session in the dataframe.

In [13]:
sort_by_rating = sorted(sorted_similar_sessions,key=lambda x:df["rating"][x[0]],reverse=True)
print(sort_by_rating)

[(38, np.float64(0.26726124191242434)), (30, np.float64(0.26726124191242434)), (10, np.float64(0.40089186286863654)), (41, np.float64(0.26726124191242434)), (21, np.float64(0.3749999999999999))]


In [14]:
print("Suggesting top 5 sessions in order of Rating:\n")
for element in sort_by_rating:  # the list already holds only 5 entries
    print(get_name_from_index(element[0]))

Suggesting top 5 sessions in order of Rating:

Music Contracts Negotiation
Music Business Legal Basics
Music Publishing Deals
Music Rights Management
Music Copyright Registration
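Sorting by rating alone leaves equally rated sessions in whatever order they arrived. One option is a tuple key that sorts by rating first and falls back to similarity on ties. The numbers below are made up purely to illustrate this and do not come from the dataset.

```python
# Made-up (index, similarity) pairs and ratings, for illustration only.
candidates = [(0, 0.40), (1, 0.38), (2, 0.27), (3, 0.30)]
ratings = [8.1, 7.9, 8.8, 8.8]  # rating of the session at each index

# Sort by rating first; break ties with the similarity score.
by_rating = sorted(
    candidates,
    key=lambda pair: (ratings[pair[0]], pair[1]),
    reverse=True,
)

print(by_rating)  # [(3, 0.3), (2, 0.27), (0, 0.4), (1, 0.38)]
```

Sessions 2 and 3 share the top rating, so the more similar one (index 3) comes first.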
