 # Machine Learning and Predictive Modeling - Assignment 6
 ### Arpit Parihar
 ### 05/11/2021
 ****

 **Importing modules**

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import pairwise_distances

import joblib
import warnings
warnings.filterwarnings('ignore')

**Importing data**

In [2]:
data = pd.read_csv('radio_songs.csv')
data.set_index('user', inplace = True)

 ### 1\. Collaborative Filtering

 Use this user-item matrix to:

 **A. Recommend 10 songs to users who have listened to 'u2' and 'pink floyd'. Use item-item collaborative filtering to find songs that are similar using spatial distance with cosine. Since this measures the distance you need to subtract from 1 to get similarity as shown below.**


 Creating a column for users who've listened to both Pink Floyd and U2

In [3]:
data['u2 and pink floyd'] = data['u2'] & data['pink floyd']
print(f'Number of users who have listened to both U2 and Pink Floyd = {sum(data["u2 and pink floyd"])}')

Number of users who have listened to both U2 and Pink Floyd = 0


 There are no users who have listened to both Pink Floyd and U2. We'll check for users who've listened to either U2 **or** Pink Floyd.

In [4]:
data['u2 or pink floyd'] = data['u2'] | data['pink floyd']
print(f'Number of users who have listened to U2 or Pink Floyd = {sum(data["u2 or pink floyd"])}')

Number of users who have listened to U2 or Pink Floyd = 7


 Transposing the data so that each row is a band, then computing the pairwise cosine similarity (1 - cosine distance) between bands

In [5]:
data_T = data.T
item_cosine_matrix = pd.DataFrame(1 - pairwise_distances(data_T , metric='cosine'), index=data_T.index, columns=data_T.index)
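
As a sanity check on the `1 - distance` step, cosine similarity can also be computed directly from its definition. The sketch below is a minimal illustration on a made-up listening matrix; the band names, the values, and the helper `cosine_similarity` are hypothetical and not taken from `radio_songs.csv`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)

# Hypothetical listening data: one vector per band, one entry per user
listens = {
    'band_a': [1, 1, 0, 0],
    'band_b': [1, 0, 0, 0],
    'band_c': [0, 0, 1, 1],
}

sim_ab = cosine_similarity(listens['band_a'], listens['band_b'])
sim_ac = cosine_similarity(listens['band_a'], listens['band_c'])
print(round(sim_ab, 4), round(sim_ac, 4))
```

Bands that share listeners get a similarity above 0, while bands with no listeners in common get exactly 0.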

In [6]:
print('10 recommendations for listeners of U2 or Pink Floyd:\n')
item_cosine_matrix.drop(index=['u2', 'pink floyd', 'u2 or pink floyd'])['u2 or pink floyd'].nlargest(10)

item_cosine_matrix.drop(index=['u2 and pink floyd', 'u2 or pink floyd'], columns=['u2 and pink floyd', 'u2 or pink floyd'], inplace=True)
data.drop(columns=['u2 and pink floyd', 'u2 or pink floyd'], inplace=True)
data_T.drop(index=['u2 and pink floyd', 'u2 or pink floyd'], inplace=True)

10 recommendations for listeners of U2 or Pink Floyd:



robbie williams    0.566947
johnny cash        0.400892
genesis            0.377964
misfits            0.377964
foo fighters       0.341882
audioslave         0.338062
green day          0.327327
depeche mode       0.308607
pearl jam          0.308607
peter fox          0.285714
Name: u2 or pink floyd, dtype: float64

 **B\. Find user most similar to user 1606. Use user-user collaborative filtering with cosine similarity. List the recommended songs for user 1606 (Hint: find the songs listened to by the most similar user).**

In [7]:
user_cosine_matrix = pd.DataFrame(1 - pairwise_distances(data, metric='cosine'), index=data.index, columns=data.index)

sim_user_1606 = user_cosine_matrix.drop(index=[1606])[1606].nlargest(1).index[0]

print('Most similar user to user 1606:\n')
sim_user_1606

rec_1606 = pd.DataFrame(data_T[sim_user_1606][data_T[sim_user_1606] == 1].index, columns=['Recommended'])

print('Recommended bands for user 1606:\n')
rec_1606

Most similar user to user 1606:



1144

Recommended bands for user 1606:



| | Recommended |
|---|---|
| 0 | beastie boys |
| 1 | bob dylan |
| 2 | bob marley & the wailers |
| 3 | david bowie |
| 4 | elvis presley |
| 5 | eric clapton |
| 6 | johnny cash |
| 7 | pearl jam |
| 8 | pink floyd |
| 9 | the beatles |


 **C\. How many of the recommended songs have already been listened to by user 1606?**

In [8]:
print('Recommended bands already listened to by user 1606:\n')
[x for x in data_T.index[data_T[1606]==1] if x in list(rec_1606['Recommended'])]

Recommended bands already listened to by user 1606:



['elvis presley', 'the beatles']

 Two of the 10 recommended bands have already been listened to by user 1606.
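
Because the nearest neighbour's list can include bands the target user already knows, one option is to filter those out before recommending. The following is a minimal sketch under stated assumptions: the user IDs, band names, and the helpers `binary_cosine` and `recommend_from_nearest` are hypothetical, and listening histories are stored as sets rather than as the notebook's 0/1 matrix.

```python
import math

# Hypothetical listening history: user id -> set of bands listened to
history = {
    'u1': {'band_a', 'band_b'},
    'u2': {'band_a', 'band_b', 'band_c', 'band_d'},
    'u3': {'band_e'},
}

def binary_cosine(a, b):
    """Cosine similarity for binary vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend_from_nearest(target, history):
    """Return the nearest user and the bands they heard that the target has not."""
    others = [u for u in history if u != target]
    nearest = max(others, key=lambda u: binary_cosine(history[target], history[u]))
    return nearest, sorted(history[nearest] - history[target])

nearest, new_bands = recommend_from_nearest('u1', history)
print(nearest, new_bands)
```

For 0/1 data, the set form `|A ∩ B| / sqrt(|A| * |B|)` is equivalent to cosine similarity on the corresponding binary vectors.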

 **D\. Use a combination of user-item approach to build a recommendation score for each song for each user using the following steps for each user-**

 - 1\. For each song for the user row, get the top 10 similar songs and their similarity score.

 - 2\. For each of the top 10 similar songs, get a list of the user purchases

 - 3\. Calculate a recommendation score as follows:
 $\frac{\sum (\text{purchaseHistory} \cdot \text{similarityScore})}{\sum \text{similarityScore}}$
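
To make the score concrete, here is a small worked example for a single user and a single candidate band. All names and numbers are made up for illustration, and `recommendation_score` is a hypothetical helper; only three similar bands are used instead of ten to keep the arithmetic easy to follow.

```python
# Hypothetical similarity of each neighbouring band to the candidate band
similarity = {'band_a': 0.5, 'band_b': 0.3, 'band_c': 0.2}
# Hypothetical listening history of one user (1 = listened, 0 = not)
listened = {'band_a': 1, 'band_b': 0, 'band_c': 1}

def recommendation_score(listened, similarity):
    """Similarity-weighted average of the user's history over the similar bands."""
    total = sum(similarity.values())
    if total == 0:
        return 0.0  # no usable neighbours; treated as 0 here (assumption)
    weighted = sum(listened[band] * score for band, score in similarity.items())
    return weighted / total

score = recommendation_score(listened, similarity)
print(round(score, 2))
```

The user has listened to the bands carrying 0.5 and 0.2 of the total similarity weight of 1.0, so the score comes out at about 0.7.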

In [9]:
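try:
    # Reuse cached scores if they were computed in an earlier run
    rec_scores = joblib.load('rec_scores.pkl')
except FileNotFoundError:
    rec_scores = pd.DataFrame(index=data.index, columns=data_T.index)
    for i in range(rec_scores.shape[0]):
        for j in range(rec_scores.shape[1]):
            user = rec_scores.index[i]
            band = rec_scores.columns[j]
            if data.iloc[i, j] == 1:
                # Already listened to: score 0 so it is never recommended
                rec_scores.iloc[i, j] = 0
            else:
                # Top 10 bands most similar to this one (excluding itself)
                sim_bands = item_cosine_matrix.drop(index=[band])[band].nlargest(10)
                history = data.loc[user, sim_bands.index]
                # Similarity-weighted average of the user's listening history
                rec_scores.iloc[i, j] = sum(history*sim_bands)/sum(sim_bands)
    # A zero denominator yields NaN; treat those scores as 0
    rec_scores.fillna(0, inplace=True)
    joblib.dump(rec_scores, 'rec_scores.pkl')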

rec_scores

| user | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | alicia keys | all that remains | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0000 | 0.000000 | ... | 0.147209 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.000000 | 0.088231 | 0.094411 | 0.000000 |
| 33 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.208070 | 0.000000 | 0.0 | 0.094377 | 0.0000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 42 | 0.173849 | 0.206181 | 0.061705 | 0.072073 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0000 | 0.077179 | ... | 0.000000 | 0.0 | 0.000000 | 0.0 | 0 | 0.089999 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 51 | 0.000000 | 0.000000 | 0.188449 | 0.000000 | 0.081329 | 0.095548 | 0.0 | 0.000000 | 0.0000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 62 | 0.000000 | 0.000000 | 0.073010 | 0.000000 | 0.178129 | 0.000000 | 0.0 | 0.000000 | 0.0000 | 0.000000 | ... | 0.217462 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.000000 | 0.101881 | 0.094411 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1566 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0865 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1586 | 0.000000 | 0.109101 | 0.000000 | 0.095344 | 0.096801 | 0.095548 | 0.0 | 0.077059 | 0.0000 | 0.000000 | ... | 0.262859 | 0.0 | 0.049279 | 0.0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.049279 |
| 1589 | 0.074842 | 0.191573 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0000 | 0.000000 | ... | 0.000000 | 0.0 | 0.179883 | 0.0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.179883 |
| 1601 | 0.000000 | 0.000000 | 0.094255 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.088677 | 0.000000 | 0.000000 | 0.000000 |
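
The nested loop visits every user-band pair, which can be slow on a large matrix. One possible vectorized formulation with NumPy is sketched below. It is not the method used in this notebook: `recommendation_scores` and the toy arrays are hypothetical, and ties in similarity may be broken differently than by `nlargest`.

```python
import numpy as np

def recommendation_scores(listens, item_sim, top_k=10):
    """Score every user-item pair at once.

    listens:  (n_users, n_items) array of 0/1 listening flags
    item_sim: (n_items, n_items) item-item similarity matrix
    """
    listens = np.asarray(listens, dtype=float)
    sim = np.array(item_sim, dtype=float)
    np.fill_diagonal(sim, -np.inf)  # an item is never its own neighbour
    k = min(top_k, sim.shape[0] - 1)
    # For each item (column), row indices of its k most similar items
    top_idx = np.argsort(-sim, axis=0, kind='stable')[:k, :]
    cols = np.arange(sim.shape[1])
    weights = np.zeros_like(sim)
    weights[top_idx, cols] = sim[top_idx, cols]
    denom = weights.sum(axis=0)   # sum of similarity scores per item
    raw = listens @ weights       # similarity-weighted listening history
    scores = np.divide(raw, denom, out=np.zeros_like(raw), where=denom != 0)
    scores[listens == 1] = 0.0    # already listened to: never recommended
    return scores

# Toy data: 2 users, 3 bands, keeping only the single most similar band
toy_listens = [[1, 0, 0], [0, 1, 1]]
toy_sim = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]]
toy_scores = recommendation_scores(toy_listens, toy_sim, top_k=1)
print(toy_scores)
```

The key design choice is to keep only the top-k similarities per band in a sparse-like weight matrix, so that a single matrix product replaces the inner lookups.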


 - 4\. What are the top 5 song recommendations for user 1606?

In [10]:
print('Recommended bands for user 1606:\n')
pd.DataFrame(rec_scores.loc[1606, :].nlargest(5).index, columns=['Recommended'])


Recommended bands for user 1606:



| | Recommended |
|---|---|
| 0 | eric clapton |
| 1 | howard shore |
| 2 | david bowie |
| 3 | dream theater |
| 4 | apocalyptica |


### 2\. Conceptual questions:

**1. Name 2 other similarity measures that you can use instead of cosine similarity above.**  
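**Jaccard similarity** and a similarity derived from **Euclidean distance** could have been used instead of cosine similarity. Note that \(1 - Euclidean distance\) only stays within [0, 1] when the distance itself is at most 1, so a transform such as $1 / (1 + d)$ is a safer choice for unnormalized data.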
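
As an illustration of the first alternative, Jaccard similarity on binary listening data is the number of shared listeners divided by the number of distinct listeners. This is a minimal sketch; the listener sets and the helper `jaccard_similarity` are hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    if not union:
        return 0.0  # both sets empty; treated as 0 here (assumption)
    return len(a & b) / len(union)

# Hypothetical listener sets for two bands
listeners_x = {'u1', 'u2', 'u3'}
listeners_y = {'u2', 'u3', 'u4', 'u5'}

# 2 shared listeners out of 5 distinct listeners
sim = jaccard_similarity(listeners_x, listeners_y)
print(sim)
```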

**2. What is needed to build a Content-Based Recommender system?**  
A content-based recommender system mitigates the cold-start problem for new items that collaborative filtering runs into, but it needs each item to be broken down and scored on as many attributes as possible, so that good recommendations can be made by matching users' preferences to item attributes. Model-based approaches can also work, but the interpretability of the recommendations is lost, which is not ideal.
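
A rough sketch of the idea, assuming each item already has an attribute vector: build a user profile from the items the user liked, then rank the remaining items by similarity to that profile. The attribute columns, band names, and helper functions below are all hypothetical.

```python
import math

# Hypothetical item attributes: [rock, electronic, acoustic]
item_features = {
    'band_a': [1.0, 0.0, 0.0],
    'band_b': [0.9, 0.1, 0.0],
    'band_c': [0.0, 1.0, 0.0],
    'band_d': [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def content_based_ranking(liked, item_features):
    """Rank items the user has not liked by similarity to the user's profile."""
    n_attrs = len(next(iter(item_features.values())))
    # User profile: mean attribute vector of the liked items
    profile = [sum(item_features[i][d] for i in liked) / len(liked)
               for d in range(n_attrs)]
    candidates = [i for i in item_features if i not in liked]
    return sorted(candidates,
                  key=lambda i: cosine(profile, item_features[i]),
                  reverse=True)

ranking = content_based_ranking(['band_a'], item_features)
print(ranking)
```

Because the ranking only depends on item attributes and the user's own history, a band with no listeners yet can still be recommended once its attributes are known.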

**3. Name 2 methods to evaluate your recommender system.**  
- A traditional way to evaluate a recommender system is **precision and recall @ k**: precision@k is the fraction of the top k recommendations that are relevant, and recall@k is the fraction of all relevant items that appear in the top k.
- If the order of recommendations is important in our recommender system, **Normalized Discounted Cumulative Gain \(nDCG\)** can be used for evaluation.
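
Both metrics can be sketched in a few lines. The example below assumes binary relevance (an item is either relevant or not) and uses made-up recommendations; the function names `precision_recall_at_k` and `ndcg_at_k` are hypothetical helpers, not library calls.

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a ranked list and a set of relevant items."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranked recommendations and ground-truth relevant items
recommended = ['band_a', 'band_b', 'band_c', 'band_d']
relevant = {'band_a', 'band_c', 'band_e'}

precision, recall = precision_recall_at_k(recommended, relevant, k=4)
ndcg = ndcg_at_k(recommended, relevant, k=4)
print(precision, round(recall, 3), round(ndcg, 3))
```

Unlike precision and recall, nDCG discounts hits by their rank, so moving `band_c` from third to second place would raise the score.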