# Collaborative Filtering Recommendation System



## Task 1: Import Modules

pandas: To store and manage data

numpy: To handle all the numerical values in the dataset

sklearn: To create the recommendation system

cosine_similarity from sklearn.metrics.pairwise: To create a cosine similarity matrix

In [1]:
import pandas as pd 
import numpy as np 
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

## Task 2: Import the Dataset

Load Movie_data.csv and Movie_Id_Titles.csv in DataFrames.

Join the DataFrames on Movie_ID.

View the first five rows of the merged DataFrame.

In [2]:
#Load the rating data into a DataFrame:
column_names = ['User_ID', 'User_Names','Movie_ID','Rating','Timestamp']
movies_df = pd.read_csv('Movie_data.csv', sep = ',', names = column_names)

#Load the movie information in a DataFrame:
movies_title_df = pd.read_csv("Movie_Id_Titles.csv")
movies_title_df.rename(columns = {'item_id':'Movie_ID', 'title':'Movie_Title'}, inplace = True)

#Merge the DataFrames:
movies_df = pd.merge(movies_df,movies_title_df, on='Movie_ID')

#View the DataFrame:
print(movies_df.head())

   User_ID        User_Names  Movie_ID  Rating  Timestamp       Movie_Title
0        0      Shawn Wilson        50       5  881250949  Star Wars (1977)
1       22     Robert Poulin        50       5  878887765  Star Wars (1977)
2      244      Laura Krulik        50       5  880604379  Star Wars (1977)
3      298      Loren Aucoin        50       5  884125578  Star Wars (1977)
4      115  Dominick Jenkins        50       5  881172049  Star Wars (1977)
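
To see what the merge does, here is a minimal sketch on made-up data. The rows below are invented for illustration and are not taken from the CSV files. It assumes that each Movie_ID appears only once in the titles table, which the `validate='many_to_one'` argument checks.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration)
toy_ratings = pd.DataFrame({
    'User_ID': [1, 2, 2],
    'Movie_ID': [50, 50, 172],
    'Rating': [5, 4, 3],
})
toy_titles = pd.DataFrame({
    'Movie_ID': [50, 172],
    'Movie_Title': ['Star Wars (1977)', 'Empire Strikes Back, The (1980)'],
})

# Each rating row picks up the title of its movie.
# validate='many_to_one' raises an error if a Movie_ID is duplicated in toy_titles.
toy_merged = pd.merge(toy_ratings, toy_titles, on='Movie_ID', validate='many_to_one')
print(toy_merged)
```

Every rating row is kept and gains a Movie_Title column, which is the same shape of result as the merged movies_df above.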


## Task 3: Explore the Dataset

Get the dimensions of the DataFrame.

Get the statistical summary of the DataFrame.

Find the number of ratings given by each user.

Store the number of unique movies and users for the next task.

In [3]:
print(f"\n Size of the movies_df dataset is {movies_df.shape}")


 Size of the movies_df dataset is (100003, 6)


In [4]:
movies_df.describe()

          User_ID    Movie_ID    Rating    Timestamp
count    100003.0    100003.0  100003.0     100003.0
mean   462.470876  425.520914  3.529864  883528800.0
std    266.622454  330.797791  1.125704    5343791.0
min           0.0         1.0       1.0  874724700.0
25%         254.0       175.0       3.0  879448700.0
50%         447.0       322.0       4.0  882826900.0
75%         682.0       631.0       4.0  888260000.0
max         943.0      1682.0       5.0  893286600.0


In [5]:
movies_df.groupby('User_ID')['Rating'].count().sort_values(ascending = True).head()

User_ID
0       3
166    20
418    20
34     20
441    20
Name: Rating, dtype: int64

In [8]:
movies_df[movies_df['User_ID'] == 0]

     User_ID    User_Names  Movie_ID  Rating  Timestamp                      Movie_Title
0          0  Shawn Wilson        50       5  881250949                 Star Wars (1977)
584        0  Shawn Wilson       172       5  881250949  Empire Strikes Back, The (1980)
952        0  Shawn Wilson       133       1  881250949        Gone with the Wind (1939)


In [11]:
movies_df[movies_df['User_ID'] == 166]

       User_ID    User_Names  Movie_ID  Rating  Timestamp                  Movie_Title
2975       166  Nancy Holder       286       1  886397562  English Patient, The (1996)
9141       166  Nancy Holder       322       5  886397723        Murder at 1600 (1997)
11755      166  Nancy Holder       288       3  886397510                Scream (1996)
14036      166  Nancy Holder       300       5  886397723         Air Force One (1997)
14886      166  Nancy Holder       243       3  886397827         Jungle2Jungle (1997)
18468      166  Nancy Holder       258       4  886397562               Contact (1997)
20005      166  Nancy Holder       294       3  886397596             Liar Liar (1997)
22468      166  Nancy Holder       687       1  886397777         McHale's Navy (1997)
31831      166  Nancy Holder       688       3  886397855    Leave It to Beaver (1997)
54557      166  Nancy Holder       346       1  886397596          Jackie Brown (1997)


In [12]:
n_users = movies_df.User_ID.unique().shape[0]
n_movies = movies_df.Movie_ID.unique().shape[0]
print( str(n_users) + ' users')
print( str(n_movies) + ' movies')

944 users
1682 movies


## Task 4: Create an Interaction Matrix

To create a collaborative filtering recommendation system, you need an interaction matrix to represent the relationship of every user with every movie in terms of ratings.

For this task, create a 2D array of n × m dimensions, where n is the number of users and m is the number of movies. Next, place the ratings from the DataFrame in the array, leaving 0 wherever a user has not rated a movie.

In [15]:
print(movies_df['Rating'])

0         5
1         5
2         5
3         5
4         5
         ..
99998     3
99999     1
100000    2
100001    3
100002    3
Name: Rating, Length: 100003, dtype: int64


In [16]:
#Build the 2D user-movie interaction matrix
#Rows are indexed by User_ID, columns by Movie_ID - 1 (Movie_ID starts at 1)
ratings = np.zeros((n_users, n_movies))
for row in movies_df.itertuples():
    ratings[row.User_ID, row.Movie_ID - 1] = row.Rating

# View the matrix
print(ratings)

[[0. 0. 0. ... 0. 0. 0.]
 [5. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 ...
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]]


In [17]:
ratings.shape

(944, 1682)

In [18]:
pd.DataFrame(ratings)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
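
The matrix built in this task can also be filled without a Python loop. The following is a minimal sketch on invented toy data, assuming that user IDs start at 0, movie IDs start at 1, and each (user, movie) pair appears at most once. The names `user_ids`, `movie_ids`, `rating_values`, and `toy_matrix` are hypothetical.

```python
import numpy as np

# Toy ratings (invented): user IDs start at 0, movie IDs start at 1
user_ids = np.array([0, 1, 1, 2])
movie_ids = np.array([1, 1, 3, 2])
rating_values = np.array([5, 3, 4, 2])

n_users_toy = user_ids.max() + 1   # one row per user ID 0..max
n_movies_toy = movie_ids.max()     # one column per movie ID 1..max

toy_matrix = np.zeros((n_users_toy, n_movies_toy))
# Integer-array indexing writes every rating at once;
# Movie_ID m is stored in column m - 1
toy_matrix[user_ids, movie_ids - 1] = rating_values
print(toy_matrix)
```

On the real data, the same idea would use the User_ID, Movie_ID, and Rating columns of movies_df as the three arrays.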


## Task 5: Explore the Interaction Matrix

One of the main characteristics of an interaction matrix is its density, that is, the share of user-movie pairs that actually have a rating. Density directly affects the quality of the recommendations, because similarity scores are computed from the ratings that users have in common.

For this task, calculate how much of the interaction matrix is filled. Note that the code stores the result in a variable named sparsity, although the ratio of non-zero elements is strictly the density; the sparsity is 100 minus this value.

ratings.nonzero()[0]: This returns the row indices of the non-zero elements in the ratings matrix.

len(ratings.nonzero()[0]): This calculates the number of non-zero elements in the matrix.

ratings.shape[0] * ratings.shape[1]: This calculates the total number of elements in the matrix (number of users multiplied by number of movies).

sparsity /= (ratings.shape[0] * ratings.shape[1]): This calculates the ratio of non-zero elements to the total number of elements.

sparsity *= 100: This converts the ratio to a percentage.

The output 6.298 means that only about 6.3% of the elements in the ratings matrix are non-zero, so roughly 93.7% of the matrix is empty.

In [19]:
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print(sparsity)

6.298179628771237
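
The value printed above is the share of non-zero cells. The sketch below, on an invented 3 × 4 matrix, shows how that share (density) and its complement (sparsity) relate. The variable names `toy`, `density`, and `sparsity_pct` are hypothetical.

```python
import numpy as np

# Invented toy matrix: 3 of its 12 cells hold a rating
toy = np.array([
    [5, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 1],
])

density = np.count_nonzero(toy) / toy.size * 100   # share of filled cells, in percent
sparsity_pct = 100 - density                        # share of empty cells, in percent
print(f"density = {density:.1f}%, sparsity = {sparsity_pct:.1f}%")
```

Applied to the ratings matrix, a density of about 6.3% corresponds to a sparsity of about 93.7%.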


## Task 6: Create a Similarity Matrix

User-user collaborative filtering is based on finding the similarity among users.

For this task, use cosine similarity to find the similarity among users.

In [20]:
rating_cosine_similarity = cosine_similarity(ratings)

In [22]:
pd.DataFrame(rating_cosine_similarity)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,934,935,936,937,938,939,940,941,942,943
0,1.000000,0.119888,0.115540,0.000000,0.161475,0.149818,0.061204,0.084954,0.228361,0.170817,...,0.138757,0.000000,0.060767,0.154560,0.096390,0.000000,0.150342,0.000000,0.181809,0.118904
1,0.119888,1.000000,0.166931,0.047460,0.064358,0.378475,0.430239,0.440367,0.319072,0.078138,...,0.369527,0.119482,0.274876,0.189705,0.197326,0.118095,0.314072,0.148617,0.179508,0.398175
2,0.115540,0.166931,1.000000,0.110591,0.178121,0.072979,0.245843,0.107328,0.103344,0.161048,...,0.156986,0.307942,0.358789,0.424046,0.319889,0.228583,0.226790,0.161485,0.172268,0.105798
3,0.000000,0.047460,0.110591,1.000000,0.344151,0.021245,0.072415,0.066137,0.083060,0.061040,...,0.031875,0.042753,0.163829,0.069038,0.124245,0.026271,0.161890,0.101243,0.133416,0.026556
4,0.161475,0.064358,0.178121,0.344151,1.000000,0.031804,0.068044,0.091230,0.188060,0.101284,...,0.052107,0.036784,0.133115,0.193471,0.146058,0.030138,0.196858,0.152041,0.170086,0.058752
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.000000,0.118095,0.228583,0.026271,0.030138,0.071459,0.111852,0.107027,0.095898,0.039852,...,0.066039,0.431154,0.258021,0.226449,0.432666,1.000000,0.087687,0.180029,0.043264,0.144250
940,0.150342,0.314072,0.226790,0.161890,0.196858,0.239955,0.352449,0.329925,0.246883,0.120495,...,0.327153,0.107024,0.187536,0.181317,0.175158,0.087687,1.000000,0.145152,0.261376,0.241028
941,0.000000,0.148617,0.161485,0.101243,0.152041,0.139595,0.144446,0.059993,0.146145,0.143245,...,0.046952,0.203301,0.288318,0.234211,0.313400,0.180029,0.145152,1.000000,0.101642,0.095120
942,0.181809,0.179508,0.172268,0.133416,0.170086,0.152497,0.317328,0.282003,0.175322,0.092497,...,0.226440,0.073513,0.089588,0.129554,0.099385,0.043264,0.261376,0.101642,1.000000,0.182465
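
Cosine similarity between two users is the dot product of their rating vectors divided by the product of the vectors' lengths. The sketch below computes it with NumPy alone on an invented toy matrix, to illustrate the kind of values in the table above. `cosine_sim_matrix` is a hypothetical helper written for this illustration, not the sklearn function.

```python
import numpy as np

def cosine_sim_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing by zero for users with no ratings
    unit = m / norms          # scale every row to length 1
    return unit @ unit.T      # dot products of unit vectors

toy = np.array([
    [5.0, 0.0, 3.0],   # user A
    [5.0, 0.0, 3.0],   # user B: same ratings as A
    [0.0, 4.0, 0.0],   # user C: no rated movie in common with A
])
sim = cosine_sim_matrix(toy)
print(np.round(sim, 3))
```

Users A and B get a similarity of 1, while A and C get 0 because they share no rated movie. Since ratings are never negative, all values lie between 0 and 1, and the diagonal is 1 for every user with at least one rating.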


## Task 7: Provide Recommendations

Now that the cosine similarity matrix has been created, the system can recommend movies to the users according to their taste.

For this task, create a function that receives a user’s ID and does the following to recommend movies to that user:

Find the k most similar users. Let’s assume k=10.

Find the average rating that these k users gave to each movie.

Find the top 10 rated movies that the user has not already rated.

In [23]:
def movie_recommender(user_item_m, X_user, user, k=10, top_n=10):
    # user_item_m: interaction matrix as a DataFrame (rows = users, columns = movies)
    # X_user: user-user similarity matrix; user: row label of the target user

    # Similarity of the target user to every user
    user_similarities = X_user[user]
    # Obtain the indices of the k most similar users (in no particular order).
    # Note: the user's own row has similarity 1, so it is normally among the k.
    most_similar_users = user_item_m.index[user_similarities.argpartition(-k)[-k:]]
    # Obtain the mean ratings of those users for all movies (unrated movies count as 0)
    rec_movies = user_item_m.loc[most_similar_users].mean(0).sort_values(ascending=False)
    # Discard movies the user has already rated
    m_seen_movies = user_item_m.loc[user].gt(0)
    seen_movies = m_seen_movies.index[m_seen_movies].tolist()
    rec_movies = rec_movies.drop(seen_movies).head(top_n)
    # Return the column labels of the top_n movies in a column named 'Movie_ID'
    rec_movies_a = rec_movies.index.to_frame().reset_index(drop=True)
    rec_movies_a.rename(columns={rec_movies_a.columns[0]: 'Movie_ID'}, inplace=True)
    return rec_movies_a
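
The first step of movie_recommender selects neighbours with argpartition. Because a user's similarity to themselves is 1, that selection can include the user's own row. The following is a minimal sketch of a variant that excludes the user and returns the neighbours sorted by similarity. It is an illustration on an invented matrix, not the notebook's method, and `top_k_neighbours` is a hypothetical name. It assumes k is smaller than the number of users.

```python
import numpy as np

def top_k_neighbours(similarities, user, k):
    """Return indices of the k users most similar to `user`, excluding `user`."""
    sims = similarities[user].astype(float)   # astype returns a copy, so the input is untouched
    sims[user] = -np.inf                       # never pick the user themselves
    top = np.argpartition(sims, -k)[-k:]       # the k largest, in arbitrary order
    return top[np.argsort(sims[top])[::-1]]    # order them, most similar first

# Invented symmetric similarity matrix for four users
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
neighbours = top_k_neighbours(toy_sim, user=0, k=2)
print(neighbours)
```

For user 0, the two nearest neighbours are users 2 and 3, with similarities 0.9 and 0.5.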

## Task 8: View the Provided Recommendations 

For this task, run the function created in the previous task and view the recommendations provided to a user through the created system.

In [24]:
#Convert the 2D array into a DataFrame, as expected by the movie_recommender function
#Note: the default column labels are 0-based positions, so label c corresponds to Movie_ID c + 1

ratings_df=pd.DataFrame(ratings)

In [25]:
user_ID=12
movie_recommender(ratings_df, rating_cosine_similarity,user_ID)

   Movie_ID
0       180
1       209
2       495
3       422
4       172
5       384
6        78
7       567
8       565
9        21


In [26]:
user_ID=943
movie_recommender(ratings_df, rating_cosine_similarity,user_ID)

   Movie_ID
0       182
1       175
2        81
3         6
4       264
5       287
6       143
7       650
8       155
9        94


## Task 9: Create Wrapper Function

This project aims to create an application that receives a User ID and provides all the recommendations for that specific user. For this, the recommendation function created in the Jupyter Notebook should be callable in a Python file via another function, i.e., a wrapper function.

For this task, perform the following operations:

Create another function, movie_recommender_run, that takes the user’s name and calls the recommendation function with the respective user ID.

Use the output of the function call and return the list of recommendations in the form of Movie_ID and Movie_Title from movie_recommender_run.

Save the notebook you’re working in so that it is usable in the next tasks.

In [27]:
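def movie_recommender_run(user_Name):
    # Get the user ID from the name (raises IndexError if the name is not in the dataset)
    user_ID = movies_df.loc[movies_df['User_Names'] == user_Name].User_ID.values[0]
    # Call the recommendation function
    temp = movie_recommender(ratings_df, rating_cosine_similarity, user_ID)
    # movie_recommender returns the 0-based column labels of ratings_df;
    # add 1 to convert them back to the dataset's Movie_ID values
    temp['Movie_ID'] = temp['Movie_ID'] + 1
    # Join with movies_title_df to get the movie titles
    top_k_rec = temp.merge(movies_title_df, on='Movie_ID', how='inner')
    return top_k_rec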
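
The wrapper relies on two lookups: name to User_ID, and Movie_ID to Movie_Title. The sketch below shows both on invented data, with a name lookup that returns None instead of raising an error for an unknown user. `find_user_id` and `add_titles` are hypothetical helpers for illustration and are not part of the notebook.

```python
import pandas as pd

# Invented lookup data for illustration
toy_users = pd.DataFrame({
    'User_ID': [0, 0, 7],
    'User_Names': ['Shawn Wilson', 'Shawn Wilson', 'Jane Doe'],
})
toy_titles = pd.DataFrame({
    'Movie_ID': [1, 2, 3],
    'Movie_Title': ['Movie A', 'Movie B', 'Movie C'],
})

def find_user_id(users, name):
    """Return the User_ID for `name`, or None if the name is unknown."""
    matches = users.loc[users['User_Names'] == name, 'User_ID']
    return None if matches.empty else int(matches.iloc[0])

def add_titles(recommended_ids, titles):
    """Attach titles to Movie_IDs, keeping the recommendation order."""
    recs = pd.DataFrame({'Movie_ID': recommended_ids})
    return recs.merge(titles, on='Movie_ID', how='left')

print(find_user_id(toy_users, 'Jane Doe'))
print(find_user_id(toy_users, 'Nobody'))
print(add_titles([3, 1], toy_titles))
```

A left merge keeps the recommendations in ranked order, and a Movie_ID without a title shows up as a missing value rather than being dropped.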