### Machine Learning: Collaborative Filtering
___

#### Summary:

In this notebook we build a recommender system using a collaborative filtering algorithm. The 
algorithm predicts ratings for movies that a user has not yet rated, and these predicted ratings 
determine which movies to recommend: the movies with the highest predicted ratings are the ones 
recommended.
___
#### This notebook will include:
1. Collaborative Filtering
___
#### Reference: 

Much of what is in this notebook was learned from the Machine Learning Coursera course by Andrew Ng.

In [2]:
# Importing the data
"""
The data Y that will be used is the MovieLens dataset. Y consists of ratings for 1682 movies by 
943 different users. This data is of shape (#movies, #users) where Y[i,j] is the rating of movie i by 
user j. The ratings can take on discrete values from 0 to 5 where 0 means the user has not yet rated 
the movie. 
"""
# Importing the libraries
import pandas as pd
import numpy as np

# Importing the data Y
# (.values replaces the deprecated DataFrame.as_matrix(), which newer pandas versions no longer provide)
Y_data = pd.read_csv('Datasets/MovieLens/Y.csv', header=None).values

# Importing the list of movies
movie_list = pd.read_csv('Datasets/MovieLens/movie_list.tsv', delimiter='\t', header=None).values

# Printing the dataset shape
print('Y_data:', Y_data.shape)
print('movie_list:', movie_list.shape)



Y_data: (1682, 943)
movie_list: (1682, 1)
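
To make the layout of Y concrete, here is a minimal sketch on a made-up 4 x 3 rating matrix (the values and the `_toy` names are hypothetical, not taken from MovieLens). It builds the indicator of rated entries and the per-movie mean over rated entries only, which are the same quantities the next cell computes in TensorFlow.

```python
import numpy as np

# Hypothetical toy ratings: 4 movies (rows) x 3 users (columns); 0 = not rated
Y_toy = np.array([[5, 0, 4],
                  [0, 3, 0],
                  [1, 1, 0],
                  [0, 0, 2]], dtype=float)

# R_toy[i, j] is 1 if user j has rated movie i, otherwise 0
R_toy = (Y_toy != 0).astype(float)

# Mean rating of each movie, averaged only over the users who rated it
Y_mean_toy = Y_toy.sum(axis=1) / R_toy.sum(axis=1)

# Subtract the per-movie mean; masking with R_toy keeps unrated entries at 0
Y_norm_toy = (Y_toy - Y_mean_toy[:, None]) * R_toy

print('Per-movie mean:', Y_mean_toy)
print('Normalized ratings:\n', Y_norm_toy)
```

The sketch masks the normalized matrix for readability. The notebook does not need to, because its cost function multiplies the error by R and so ignores unrated entries. Note that a movie with no ratings at all would cause a division by zero in the mean.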


In [5]:
# Collaborative Filtering
"""
To train the collaborative filtering algorithm, we need to learn two sets of parameters, X and Theta.
X is of shape (#movies, 100) where the 100 columns represent the features of the movies. Theta is of 
shape (#users, 100) where the 100 columns instead represent the users' weights for those features. 
Like most machine learning algorithms, the optimal parameters are obtained by minimizing a cost 
function.
"""
# Importing the libraries
import tensorflow as tf

# Defining the input(s)
Y = tf.placeholder(tf.float32)

# Defining the parameters
X = tf.Variable(tf.truncated_normal([1682, 100], stddev=0.1))
Theta = tf.Variable(tf.truncated_normal([943, 100], stddev=0.1))

# Intermediate calculations
R = tf.cast(tf.not_equal(Y, 0), tf.float32) # R[i,j] indicates if movie i has been rated by user j
Y_mean = (tf.reduce_sum(Y, axis = 1) / tf.reduce_sum(R, axis = 1)) # Mean rating of each movie over the users who rated it
Y_norm = Y - tf.reshape(Y_mean,[-1,1]) # Mean normalization

# Cost function calculations
cost = 0.5 * tf.reduce_sum(R * (tf.square(tf.matmul(X, tf.transpose(Theta)) - Y_norm)))

# Defining the optimizer
train_step = tf.train.AdamOptimizer(1e-3).minimize(cost)

# Prediction calculations
prediction = tf.matmul(X, tf.transpose(Theta)) + tf.reshape(Y_mean,[-1,1])

# Creating a new session
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# 10000 optimization steps with Adam (gradients are computed by backpropagation)
for i in range(10000):
    # Print the cost every 100 iterations
    if i % 100 == 0:
        print('step %d, training cost %g' % (i, cost.eval(feed_dict={Y: Y_data})))
    sess.run(train_step, feed_dict={Y: Y_data})

# Print the predicted ratings of every movie by every user
Y_pred = prediction.eval(feed_dict={Y: Y_data})
print('Predicted Y:\n', Y_pred)


step 0, training cost 50275.9
step 100, training cost 29427.1
step 200, training cost 11578.9
step 300, training cost 4935.22
step 400, training cost 2476.04
step 500, training cost 1418.98
step 600, training cost 892.488
step 700, training cost 597.608
step 800, training cost 418.095
step 900, training cost 302.404
step 1000, training cost 224.621
step 1100, training cost 170.483
step 1200, training cost 131.691
step 1300, training cost 103.219
step 1400, training cost 81.8956
step 1500, training cost 65.6454
step 1600, training cost 53.0704
step 1700, training cost 43.2079
step 1800, training cost 35.3819
step 1900, training cost 29.1088
step 2000, training cost 24.0368
step 2100, training cost 19.9054
step 2200, training cost 16.5193
step 2300, training cost 13.7295
step 2400, training cost 11.4214
step 2500, training cost 9.50543
step 2600, training cost 7.91087
step 2700, training cost 6.5813
step 2800, training cost 5.47139
step 2900, training cost 4.54424
step 3000, training cos
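
For readers who want to see what the optimizer is minimizing, the following is a minimal NumPy sketch of the same masked squared-error cost together with its gradients. It is an illustration under stated assumptions, not the notebook's implementation: the function name `cofi_cost_and_grads` is hypothetical, the toy data is random, and plain gradient descent stands in for the Adam optimizer used above.

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y_norm, R):
    """Masked squared-error cost (no regularization) and its gradients."""
    err = (X @ Theta.T - Y_norm) * R   # prediction error on rated entries only
    cost = 0.5 * np.sum(err ** 2)
    X_grad = err @ Theta               # d cost / d X, shape (#movies, #features)
    Theta_grad = err.T @ X             # d cost / d Theta, shape (#users, #features)
    return cost, X_grad, Theta_grad

# Hypothetical toy problem: 5 movies, 4 users, 3 features
rng = np.random.default_rng(0)
n_movies, n_users, n_features = 5, 4, 3
Y_norm = rng.normal(size=(n_movies, n_users))
R = (rng.random((n_movies, n_users)) < 0.6).astype(float)
X = rng.normal(scale=0.1, size=(n_movies, n_features))
Theta = rng.normal(scale=0.1, size=(n_users, n_features))

initial_cost, _, _ = cofi_cost_and_grads(X, Theta, Y_norm, R)

# Plain gradient descent (assumption: a simple stand-in for Adam)
learning_rate = 0.05
for _ in range(500):
    _, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y_norm, R)
    X -= learning_rate * X_grad
    Theta -= learning_rate * Theta_grad

final_cost, _, _ = cofi_cost_and_grads(X, Theta, Y_norm, R)
print('initial cost %g, final cost %g' % (initial_cost, final_cost))
```

Because R is binary, multiplying the error by R before squaring gives the same cost as `R * square(error)` in the TensorFlow cell. The cost has no regularization term, so a training cost close to zero mainly shows that the 100 features can fit the observed ratings, not that the predictions generalize to unseen ones.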

In [19]:
# Recommending movies to users
"""
After learning the parameters Theta for all the users and X for all the movies, we can recommend 
to each user the movies that they have not yet rated.
"""
# Obtain the recommendations for all users
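# Already-rated movies get a score of -inf so they can never outrank an unrated movie
Y_not_seen = np.where(Y_data == 0, Y_pred, -np.inf)
recommend = np.argsort(-Y_not_seen, axis=0)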

# Obtain the top 5 recommendations for user j
j = 1
print('Movie recommendations for user %d:' % j)
for i in range(5):
    print(movie_list[recommend[i,j],0])

Movie recommendations for user 1:
Remains of the Day, The (1993)
Blade Runner (1982)
Army of Darkness (1993)
Last of the Mohicans, The (1992)
Sling Blade (1996)
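
The ranking step can also be wrapped in a small helper that works for any user and any number of recommendations. This is a sketch only: the function name `top_k_unseen` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_unseen(Y, Y_pred, user, k=5):
    """Indices of the k unrated movies with the highest predicted rating for one user."""
    unseen = Y[:, user] == 0
    # Rated movies get -inf so they always sort last
    scores = np.where(unseen, Y_pred[:, user], -np.inf)
    order = np.argsort(-scores)
    # Never return more movies than the user has left to rate
    return order[:min(k, int(unseen.sum()))]

# Hypothetical toy data: 4 movies x 2 users, 0 = not rated
Y_toy = np.array([[5, 0],
                  [0, 3],
                  [0, 0],
                  [4, 0]], dtype=float)
pred_toy = np.array([[4.8, 2.0],
                     [3.5, 3.1],
                     [4.1, -0.5],
                     [4.2, 4.9]])

print(top_k_unseen(Y_toy, pred_toy, user=0, k=2))
print(top_k_unseen(Y_toy, pred_toy, user=1, k=5))
```

Masking with `-np.inf` rather than multiplying by zero keeps an already-rated movie (which would score 0) from outranking an unrated movie whose predicted rating is negative, as happens for the third movie of the second toy user.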
