# Movie Recommendations HW

**Name:**  Atul Gupta

**Download the dataset from here:** https://grouplens.org/datasets/movielens/1m/

In [1]:
# Import all the required libraries
import numpy as np
import pandas as pd
import itertools

## Reading the Data
Now that we have downloaded the files from the link above and placed them in the same directory as this Jupyter Notebook, we can load each of the tables of data as a CSV into Pandas.

In [2]:
# Read the dataset from the two files into ratings_data and movies_data
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')

column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1')

column_list_users = ["UserID","Gender","Age","Occupation","Zip-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

`ratings_data`, `movies_data`, and `user_data` correspond to the data loaded from `ratings.dat`, `movies.dat`, and `users.dat` into Pandas.
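The `::` separator is longer than one character, so pandas treats it as a regular expression, which is why the cells above pass `engine='python'`. The following is a minimal sketch of the same call on two made-up rows held in memory; the `sample` buffer and its values are hypothetical and only mimic the layout of `ratings.dat`.

```python
import io

import pandas as pd

# Two made-up rows in the same "::"-delimited layout as ratings.dat (illustrative only)
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

# A multi-character separator is interpreted as a regular expression,
# which requires the Python parsing engine
sample_ratings = pd.read_csv(
    sample,
    sep="::",
    names=["UserID", "MovieID", "Ratings", "Timestamp"],
    engine="python",
)
print(sample_ratings)
```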

## Data analysis

We now have all our data in Pandas; however, it is split across three separate datasets. To make more sense of the data, we can use the Pandas `merge` function to combine the component data-frames.

In [3]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)
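When no `on` argument is given, `merge` joins on the columns the two frames have in common and keeps only the matching rows (an inner join). Below is a small sketch on made-up frames showing how the nested call lines up `UserID` first and `MovieID` second; the `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Made-up miniature versions of the three tables
toy_ratings = pd.DataFrame({"UserID": [1, 1, 2], "MovieID": [10, 20, 10], "Ratings": [5, 3, 4]})
toy_users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
toy_movies = pd.DataFrame({"MovieID": [10, 20], "Title": ["Movie A", "Movie B"]})

# Inner merge joins on the shared column: UserID, then MovieID
toy_data = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)
print(toy_data)
```

Each rating row ends up carrying the user's attributes and the movie's title, which is what the later pivot tables rely on.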

Next, we can create a pivot table to match the ratings with a given movie title. Using `data.pivot_table`, we can aggregate (using the average/`mean` function) the reviews and find the average rating for each movie. We can save this pivot table into the `mean_ratings` variable. 

In [4]:
mean_ratings=data.pivot_table('Ratings','Title',aggfunc='mean')
mean_ratings

Title,Ratings
"$1,000,000 Duck (1971)",3.027027
'Night Mother (1986),3.371429
'Til There Was You (1997),2.692308
"'burbs, The (1989)",2.910891
...And Justice for All (1979),3.713568
...,...
"Zed & Two Noughts, A (1985)",3.413793
Zero Effect (1998),3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995),3.500000
Zeus and Roxanne (1997),2.521739
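For intuition, a `mean` pivot table on a single index is the same computation as grouping by that column and averaging. Here is a short sketch on a made-up frame; `toy`, `by_pivot`, and `by_group` are hypothetical names.

```python
import pandas as pd

# Made-up ratings: movie "A" is rated twice, movie "B" once
toy = pd.DataFrame({"Title": ["A", "A", "B"], "Ratings": [4, 2, 5]})

# Average rating per title via a pivot table ...
by_pivot = toy.pivot_table("Ratings", index="Title", aggfunc="mean")

# ... and the equivalent groupby formulation
by_group = toy.groupby("Title")["Ratings"].mean()

print(by_pivot)
print(by_group)
```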


Now, we can take the `mean_ratings` and sort it by the value of the rating itself. Using this and the `head` function, we can display the top 15 movies by average rating.

In [5]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],aggfunc='mean')
top_15_mean_ratings = mean_ratings.sort_values(by = 'Ratings',ascending = False).head(15)
top_15_mean_ratings

Title,Ratings
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


Let's adjust our original `mean_ratings` pivot table to account for differences in gender between reviewers. The code is similar to before, except that we now provide an additional `columns` parameter, which separates the average ratings given by women (`F`) and men (`M`).

In [6]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
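As a rough illustration of what the `columns` parameter does, the sketch below builds a gender-split pivot on a made-up frame. A title that has no rating from one gender gets `NaN` in that column, and any difference computed from it is `NaN` as well. The `toy` data is hypothetical.

```python
import pandas as pd

# Made-up ratings: movie "B" has no rating from a female reviewer
toy = pd.DataFrame({
    "Title": ["A", "A", "B"],
    "Gender": ["F", "M", "M"],
    "Ratings": [5, 3, 4],
})

# One column per gender, one row per title
toy_by_gender = toy.pivot_table("Ratings", index="Title", columns="Gender", aggfunc="mean")

# Difference between male and female averages (NaN where one side is missing)
toy_by_gender["diff"] = toy_by_gender["M"] - toy_by_gender["F"]
print(toy_by_gender)
```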

We can now sort the ratings as before, but by the `F` and `M` gendered rating columns instead of by `Ratings`. Print the top-rated movies by female and male reviews, respectively.

In [7]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)

mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
print(top_female_ratings)
print(top_male_ratings)

Gender                                               F         M
Title                                                           
Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
Raw Deal (1948)                                    5.0  3.307692
Bittersweet Motel (2000)                           5.0       NaN
Skipped Parts (2000)                               5.0  4.000000
...                                                ...       ...
With Friends Like These... (1998)                  NaN  4.000000
Wooden Man's Bride, The (Wu Kui) (1994)            NaN  3.000000
Year of the Horse (1997)                           NaN  3.250000
Zachariah (1971)                                   NaN  3.500000
Zero Kelvin (Kjærlighetens kjøtere) (1995)         NaN  3.500000

[3706 rows x 2 columns]
Gender                                            F    M
Title                                                   
Schlafes Bruder 

In [8]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

Title,F,M,diff
"James Dean Story, The (1957)",4.0,1.0,-3.0
Country Life (1994),5.0,2.0,-3.0
"Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)",4.0,1.0,-3.0
Babyfever (1994),3.666667,1.0,-2.666667
"Woman of Paris, A (1923)",5.0,2.428571,-2.571429
Cobra (1925),4.0,1.5,-2.5
"Other Side of Sunday, The (Søndagsengler) (1996)",5.0,2.928571,-2.071429
"To Have, or Not (1995)",4.0,2.0,-2.0
For the Moment (1994),5.0,3.0,-2.0
Phat Beach (1996),3.0,1.0,-2.0


Let's try grouping the data-frame, instead, to see how different titles compare in terms of the number of ratings. Group by `Title` and then take the top 10 items by number of reviews. We can see here the most popularly-reviewed titles.

In [9]:
ratings_by_title=data.groupby('Title').size()
ratings_by_title.sort_values(ascending=False).head(10)

Title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64

Similarly, we can filter our grouped data-frame to get all titles with a certain number of reviews. Filter the dataset to get all movie titles such that the number of reviews is >= 2500.
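One possible way to do this is boolean indexing on the Series of counts. The sketch below uses a made-up Series in place of `ratings_by_title`; the titles, counts, and the `min_reviews` name are hypothetical. In the notebook the same pattern would read `ratings_by_title[ratings_by_title >= 2500]`.

```python
import pandas as pd

# Made-up review counts standing in for ratings_by_title
toy_counts = pd.Series({"Movie A": 3428, "Movie B": 2400, "Movie C": 2500})

# Keep only the titles with at least min_reviews reviews
min_reviews = 2500
popular = toy_counts[toy_counts >= min_reviews]

# The index of the filtered Series holds the qualifying titles
popular_titles = popular.index
print(popular)
```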

Create a ratings matrix using NumPy. This matrix allows us to look up the rating for a given user ID and movie. The element at location $[i,j]$ is the rating given by user $i$ for movie $j$. Print the **shape** of the matrix produced.

Additionally, choose 3 users that have rated the movie with MovieID "**1377**" (Batman Returns). Print these ratings; they will be used later for comparison.


**Notes:**
- Do *not* use `pivot_table`.
- A ratings matrix is *not* the same as `ratings_data` from above.
- The ratings of movie with MovieID $i$ are stored in the ($i$-1)th column (index starts from 0)  
- Not every user has rated every movie. Missing entries should be set to 0 for now.
- If you're stuck, you might want to look into `np.zeros` and how to use it to create a matrix of the desired shape.
- Every review lies between 1 and 5, and thus fits within a `uint8` datatype, which you can specify to numpy.
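The notes above can be sketched on a tiny made-up example: allocate a zero matrix with `np.zeros`, then write every rating into row `UserID - 1` and column `MovieID - 1` in one step with integer-array indexing. The three ratings below are hypothetical.

```python
import numpy as np

# Made-up (UserID, MovieID, rating) triples; IDs start at 1
user_ids = np.array([1, 1, 3])
movie_ids = np.array([2, 4, 1])
ratings = np.array([5, 3, 4])

# Rows = users, columns = movies; 0 marks a missing rating
toy_matrix = np.zeros((user_ids.max(), movie_ids.max()), dtype=np.uint8)

# Shift the 1-based IDs to 0-based indices and assign all ratings at once
toy_matrix[user_ids - 1, movie_ids - 1] = ratings
print(toy_matrix.shape)
print(toy_matrix)
```

User 2 never rated anything here, so the second row stays all zeros, just as unused MovieIDs leave all-zero columns in the full matrix.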

In [10]:
# Create the matrix
# we need a matrix that has rows = number of users and columns = number of movies
ratings_matrix = np.zeros((user_data['UserID'].max(), movies_data['MovieID'].max()),dtype=np.uint8)

In [11]:
# Print the shape
print(ratings_matrix.shape)

(6040, 3952)


In [12]:
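# Fill the matrix: the rating by UserID u for MovieID m goes to row u-1, column m-1
ratings_matrix[ratings_data.UserID.values - 1, ratings_data.MovieID.values - 1] = ratings_data.Ratings.values

# Look up the MovieID of the given movie instead of hard-coding the column index
batman_returns_id = data.loc[data['Title'] == 'Batman Returns (1992)','MovieID'].iloc[0]
batman_col = batman_returns_id - 1

# Print the ratings stored in rows 12, 17 and 180 (UserIDs 13, 18 and 181)
print(f"[{ratings_matrix[12][batman_col]} {ratings_matrix[17][batman_col]} {ratings_matrix[180][batman_col]}]")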

[3 2 3]


Normalize the ratings matrix using Z-score normalization. While we can't use `sklearn`'s `StandardScaler` for this step, we can do the statistical calculations ourselves to normalize the data.

Before you start:
- Your first step should be to get the average of every *column* of the ratings matrix (we want an average by title, not by user!).
- Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and **NOT** (sum of 10 ratings)/(total number of users)
- All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.
- In our matrix, 0 represents a missing rating.
- Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every *column*. It may be very close but not exactly zero because of the limited precision `float`s allow.
- Lastly, divide this by the standard deviation of the *column*.

- Not every MovieID is used, leading to zero columns. This will cause a divide by zero error when normalizing the matrix. Simply replace any NaN values in your normalized matrix with 0.
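The steps listed above can be sketched on a small made-up matrix as follows. This is only one way to follow the instructions, not necessarily the approach taken in the cell below, and the `toy` values are hypothetical. The third column has no ratings, so it exercises the NaN replacement at the end.

```python
import numpy as np

# Made-up ratings matrix: rows = users, columns = movies, 0 = missing
toy = np.array([[5, 0, 0],
                [3, 4, 0],
                [0, 2, 0]], dtype=float)

rated = toy != 0              # True where a rating exists
counts = rated.sum(axis=0)    # number of ratings per movie

# Column mean over rated entries only; unrated columns give 0/0 = NaN
with np.errstate(invalid="ignore", divide="ignore"):
    col_mean = toy.sum(axis=0) / counts

# Replace missing entries with the movie's mean rating
filled = np.where(rated, toy, col_mean)

# Z-score each column, then turn the NaNs of unrated columns into 0
with np.errstate(invalid="ignore", divide="ignore"):
    z = (filled - filled.mean(axis=0)) / filled.std(axis=0)
z = np.nan_to_num(z, nan=0.0)
print(z)
```

Because the imputed entries sit exactly at the column mean, they become 0 after centering, while the rated entries keep their sign relative to the mean.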

In [13]:
# Creating a copy of the ratings_matrix
new_matrix = ratings_matrix.copy()

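# Replacing the 0s (missing ratings) with NaN so that nanmean/nanstd skip them
new_matrix = np.where(new_matrix==0,np.nan,new_matrix)

# No zeros remain after the step above, so this line leaves the matrix unchanged;
# missing entries stay NaN and only become 0 after normalization (see below)
new_matrix = np.where(new_matrix==0,np.nanmean(new_matrix,axis=0),new_matrix)
# Subtracting the column mean and dividing by the column standard deviation,
# both computed over the rated (non-NaN) entries only
new_matrix = (new_matrix - np.nanmean(new_matrix,axis = 0))/np.nanstd(new_matrix,axis = 0)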

# Any remaining nan values in the matrix are replaced by 0s
new_matrix[np.isnan(new_matrix)] = 0

print(new_matrix)



[[ 1.00118491  0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-1.3458366   0.          0.         ...  0.          0.
   0.        ]]




We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question.

In [14]:
# Compute the SVD of the normalised matrix
U, S, VT = np.linalg.svd(new_matrix)
V = VT.transpose()
S = np.diag(S)

In [15]:
# Print the shapes
print(f"Shape of U: {U.shape}, Shape of S: {S.shape}, Shape of V: {V.shape} ")

Shape of U: (6040, 6040), Shape of S: (3952, 3952), Shape of V: (3952, 3952) 
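By default `np.linalg.svd` returns the full square `U`, which is why its shape is (6040, 6040) above. Only the first `min(m, n)` columns of `U` pair with a singular value, so passing `full_matrices=False` yields a smaller `U` that reconstructs the same matrix. A sketch on a random made-up matrix (the `A` below is hypothetical):

```python
import numpy as np

# Random made-up matrix with more rows than columns
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Full SVD: U is square (6 x 6)
U_full, s, VT_full = np.linalg.svd(A)

# Reduced SVD: U keeps only the 4 columns that pair with singular values
U_thin, s_thin, VT_thin = np.linalg.svd(A, full_matrices=False)
print(U_full.shape, U_thin.shape)

# The reduced factors are enough to rebuild A
A_rebuilt = U_thin @ np.diag(s_thin) @ VT_thin
print(np.allclose(A, A_rebuilt))
```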


Reconstruct four rank-$k$ rating matrices $R_k$, where $R_k = U_kS_kV_k^T$, for $k \in \{100, 1000, 2000, 3000\}$. Using each $R_k$, make predictions for the 3 users selected in Question 1 for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.
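To see what truncating to rank $k$ does, here is a small sketch on a random made-up matrix. The reconstruction error shrinks as $k$ grows and vanishes once $k$ reaches the full rank. The helper `rank_k` and the matrix `A` are hypothetical names used only for illustration.

```python
import numpy as np

# Random made-up matrix standing in for the normalized ratings matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U_toy, s_toy, VT_toy = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    """Rank-k reconstruction from the first k singular triplets."""
    return U_toy[:, :k] @ np.diag(s_toy[:k]) @ VT_toy[:k, :]

# Frobenius-norm error of each reconstruction
errors = [np.linalg.norm(A - rank_k(k)) for k in (1, 3, 5)]
print(errors)
```

The Frobenius error of a rank-$k$ truncation equals the square root of the sum of the squared discarded singular values, which is why it can only decrease as $k$ increases.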

In [16]:
k_list = [100,1000,2000,3000]

print("Predicted values\n")
for k in k_list:
    rank_k_matrix = U[:, :k]@S[:k, :k]@VT[:k, :]
    print(f"Rank: {k}")
    print(list(rank_k_matrix[[80, 323, 2000], 1376]))
    print()
print(f"Real values")
print(list(ratings_matrix[[80, 323, 2000], 1376]))

Predicted values

Rank: 100
[-0.04847495035113396, 0.006095092843419837, 0.13232766862558704]

Rank: 1000
[0.09466842181361372, 0.09746294032383132, 0.0013177226147005477]

Rank: 2000
[-0.008379879162190644, 0.00998305736652974, 0.007176379692792603]

Rank: 3000
[0.002205795648211773, -0.0015460752401469633, 3.71316002777416e-05]

Real values
[0, 0, 0]


### Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. In general the value lies in $cosine(x,y) \in [-1,1]$: $1$ means the vectors point in the same direction (parallel), $0$ means there is no similarity (perpendicular), and $-1$ means they point in opposite directions. For vectors with only non-negative entries the range narrows to $[0,1]$.

$$ cosine(x,y) = \frac{x^T y}{\|x\| \, \|y\|}  $$
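A direct transcription of the formula, as a minimal sketch on made-up vectors (the function name `cosine` and the vectors are hypothetical):

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine(a, 2 * a)       # scaling does not change the angle
opposite_direction = cosine(a, -a)      # vectors pointing in opposite directions
perpendicular = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite_direction, perpendicular)
```

The three values come out at approximately 1, approximately -1, and 0. Negative similarities can occur for the z-scored ratings used here, since those contain negative entries.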

**Based on the rank-1000 reconstructed rating matrix $R_{1000}$ and the cosine similarity,** find the movies that are most similar to a given movie. You will write a function `top_movie_similarity`, which sorts data by its similarity to a movie with ID `movie_id` and returns the top $n$ items, and a second function `print_similar_movies`, which prints the titles of said similar movies. Return the top 5 movies for the movie with ID `1377` (*Batman Returns*).

Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of **zero** resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie. 
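The reason the NaN values matter is that `np.argsort` places NaN at the end of the order, so an all-zero column would otherwise land among the "most similar" entries. A small sketch of the column-wise computation with a zero column, using a made-up matrix `M`:

```python
import numpy as np

# Made-up matrix: the third column is all zeros (an unused MovieID)
M = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0]])
target = M[:, 0]

# Column-wise cosine similarity; the zero column produces 0/0 = NaN
with np.errstate(invalid="ignore", divide="ignore"):
    sims = (target @ M) / (np.linalg.norm(target) * np.linalg.norm(M, axis=0))

# Replace NaN with 0 so empty columns rank as "not similar"
sims = np.nan_to_num(sims, nan=0.0)
print(sims)
```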

In [17]:
# Sort the movies based on cosine similarity
# Movie id starts from 1
# Use the calculation formula above

def top_movie_similarity(rank_matrix, movie_id, top_n=5):
    # User ratings of the concerned movie
    user_ratings = rank_matrix[:,movie_id-1]
    
    # Dot product of the user ratings with the ranked matrix. Gives the numerator part of the cosine similarity calculation.
    xTy = np.dot(user_ratings,rank_matrix)
    
    # Magnitudes of user ratings matrix and rank matrix (column wise)
    mag_x = np.linalg.norm(user_ratings)
    mag_y = np.linalg.norm(rank_matrix,axis = 0)
    
    # Calculating cosine similarity and replacing any nans with 0
    cosine_similarity = xTy/(mag_x * mag_y)
    cosine_similarity = np.nan_to_num(cosine_similarity,nan=0)
    
    # argsort is ascending, so the highest similarities are at the end; the last entry is
    # skipped on the assumption that it is the movie itself (similarity 1)
    req_movies = np.argsort(cosine_similarity)[-(top_n+1):-1]
    
    # flipping the result and returning in descending order
    return np.flip(req_movies)

# Printing the most similar movies given the indices
def print_similar_movies(movie_titles, top_indices):
    print('Most Similar movies: ')
    print()
    
    # Fetching movie titles from the movie data dataframe where the movie ID is known.
    recommended_movies = list(movie_titles[movie_titles['MovieID'].isin(top_indices + 1)]['Title'])
    
    # Printing the list
    for item in recommended_movies:
        print(item)

# Print the top 5 movies for Batman Returns
movie_id = 1377
rank = 1000
rank_k_matrix = U[:, :rank]@S[:rank, :rank]@VT[:rank, :]

top_indices = top_movie_similarity(rank_k_matrix,movie_id)

print_similar_movies(movies_data, top_indices)

Most Similar movies: 

Batman Forever (1995)
Waterworld (1995)
Batman (1989)
Batman & Robin (1997)
Dick Tracy (1990)




### Movie Recommendations
Using the same process, write `top_user_similarity` which sorts data by its similarity to a user with ID `user_id` and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that `user_id` has not yet seen. Find at least 5 movie recommendations for the user with ID `5954` and print their titles.

Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.
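One way to act on this hint is to count genres among the movies a user rated highly and among the recommendations, then look at the overlap. The sketch below uses a made-up movie table and assumes the `Genres` column stores genres separated by `|`, as in the MovieLens files; `genre_counts` and the IDs are hypothetical.

```python
from collections import Counter

import pandas as pd

# Made-up movie table with "|"-separated genres (assumed format)
toy_movies = pd.DataFrame({
    "MovieID": [1, 2, 3, 4],
    "Genres": ["Comedy", "Action|Adventure", "Comedy|Romance", "Horror"],
})

def genre_counts(movie_ids, movies):
    """Count how often each genre appears among the given MovieIDs."""
    rows = movies[movies["MovieID"].isin(movie_ids)]
    return Counter(g for genres in rows["Genres"] for g in genres.split("|"))

liked = genre_counts([1, 3], toy_movies)         # movies the user rated highly
recommended = genre_counts([2, 3], toy_movies)   # movies being recommended

# Genres that appear on both sides
shared = set(liked) & set(recommended)
print(liked, recommended, shared)
```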

In [18]:
#Sort users based on cosine similarity
def top_user_similarity(rank_matrix, user_id):
    
    # Taking transpose of the rank matrix because we want to work with columns
    rank_matrix = np.transpose(rank_matrix)
    
    # Following the same process as Question 5 to find the most similar users to the user user_id
    current_user_ratings = rank_matrix[:,user_id-1]
    
    xTy = np.dot(current_user_ratings,rank_matrix)
    
    mag_x = np.linalg.norm(current_user_ratings)
    mag_y = np.linalg.norm(rank_matrix,axis = 0)
    
    cosine_similarity = xTy/(mag_x * mag_y)
    cosine_similarity = np.nan_to_num(cosine_similarity,nan=0)
    
    req_users = np.argsort(cosine_similarity)[-6:-1]
    
    # Returning the 0-based row index of the most similar user (UserID = index + 1)
    return req_users[-1]


# Function to find movies that are not rated by user user_id but rated very highly by user similar_user
def find_top_unwatched_movies(user_id,similar_user,data):
    
    # Fetching the movies that the user user_id has rated highly and sorted it in descending order.
    new_user_data = data[['UserID','MovieID','Ratings']]
    new_user_data = new_user_data[new_user_data['UserID']==user_id].sort_values(by='Ratings',ascending=False)
    
    # Fetching the movies that the user similar_user has rated highly and sorted it in descending order
    similar_user_data = data[['UserID','MovieID','Ratings']]
    similar_user_data = similar_user_data[similar_user_data['UserID']==similar_user].sort_values(by='Ratings',ascending=False)
    
    # Creating a list (sorted in descending order by rating) of all the movies both the users have watched
    user_watched_movies = list(new_user_data['MovieID'])
    similar_user_movies = list(similar_user_data['MovieID'])
    
    # The required values are the first five movies that similar_user has watched but the user_id has not watched.
    top_indices = [x for x in similar_user_movies if x not in user_watched_movies]
    return top_indices[0:5]

# Function to print the required list of movies given movieIDs and movies data.
def print_recommendations(movie_titles,top_indices):
    print("Your next watchlist is: ")
    print()
    recommended_movies = list(movie_titles[movie_titles['MovieID'].isin(top_indices)]['Title'])
    
    for item in recommended_movies:
        print(item)

rank = 1000
rank_k_matrix = U[:, :rank]@S[:rank, :rank]@VT[:rank, :]

user_id = 5954
similar_user = top_user_similarity(rank_k_matrix,user_id)

top_indices = find_top_unwatched_movies(user_id,similar_user,data)
print_recommendations(movies_data, top_indices)

Your next watchlist is: 

Dumb & Dumber (1994)
Winnie the Pooh and the Blustery Day (1968)
Goonies, The (1985)
Rushmore (1998)
Mad Max 2 (a.k.a. The Road Warrior) (1981)
