## Data Preparation
Let's load this data into Python.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv('ratings.csv', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])

# Reading users file
users = pd.read_csv('users.csv', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [3]:
print(movies.shape)

(100, 3)


## Content-Based Recommendation Model

This model computes the similarity between movies from their genres. Given a particular movie, it suggests the movies whose genres are most similar to it.

In [4]:
# Break up the big genre string into a string array
movies['genres'] = movies['genres'].str.split('|')
print(movies.shape)


(100, 3)


In [5]:
# Convert genres to string value
movies['genres'] = movies['genres'].fillna("").astype('str')
movies.head(5)

| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | ['Animation', "Children's", 'Comedy'] |
| 1 | 2 | Jumanji (1995) | ['Adventure', "Children's", 'Fantasy'] |
| 2 | 3 | Grumpier Old Men (1995) | ['Comedy', 'Romance'] |
| 3 | 4 | Waiting to Exhale (1995) | ['Comedy', 'Drama'] |
| 4 | 5 | Father of the Bride Part II (1995) | ['Comedy'] |


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
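# min_df=1 keeps every term that appears in at least one document
# (recent scikit-learn versions reject the integer min_df=0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')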
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape

(100, 44)

In [7]:
tfidf_matrix

<100x44 sparse matrix of type '<class 'numpy.float64'>'
	with 231 stored elements in Compressed Sparse Row format>

In [8]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#cosine_sim[:4, :4]
cosine_sim

array([[1.        , 0.15337409, 0.12551391, ..., 0.        , 0.        ,
        0.        ],
       [0.15337409, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.12551391, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.25861841],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25861841, 0.        ,
        1.        ]])
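
`linear_kernel` only computes dot products. It gives cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The sketch below illustrates that equivalence with plain NumPy; the weight matrix is made up for illustration and is not the notebook's TF-IDF output.

```python
import numpy as np

# Hypothetical term weights for three movies over four genre tokens
weights = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
])

# L2-normalize each row, as TfidfVectorizer does by default
unit_rows = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# Dot products of unit-length rows (what a linear kernel computes)
dot_sim = unit_rows @ unit_rows.T

# Cosine similarity computed from its definition on the raw rows
norms = np.linalg.norm(weights, axis=1)
cos_sim = (weights @ weights.T) / np.outer(norms, norms)

print(np.allclose(dot_sim, cos_sim))
```

Movies that share no genre token, such as the first and third rows here, get a similarity of exactly 0, which matches the many zeros in the matrix above.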

In [80]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])

#TODO: Function that gets movie recommendations based on the cosine similarity score of movie genres
def genre_recommendations(title):
    index = indices[title]
    cossim = cosine_sim[index]
    rankings = pd.Series(cossim)
    ...
    # DataFrame.sort_values() needs the column to sort by
    return pd.DataFrame({"title": titles[rankings.index], "score": cossim}).sort_values(by="score", ascending=False)

Let's try to get the top recommendations for a few movies and see how good they are.

In [81]:
genre_recommendations('Toy Story (1995)').head(20)
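
A ranking of this kind usually drops the query movie itself, since every movie has similarity 1 with itself. The following is a minimal sketch of that idea on toy data; `toy_titles`, `toy_sim`, and `top_n_similar` are hypothetical names, and the similarity values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a made-up symmetric similarity matrix
toy_titles = pd.Series(["Movie A", "Movie B", "Movie C", "Movie D"])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

# Map each title to its row position in the similarity matrix
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def top_n_similar(title, n=2):
    """Return the n titles most similar to `title`, excluding `title` itself."""
    idx = toy_indices[title]
    scores = pd.Series(toy_sim[idx], index=toy_titles.index)
    # Drop the query movie, then keep the n highest scores
    scores = scores.drop(idx).sort_values(ascending=False).head(n)
    return pd.DataFrame({
        "title": toy_titles[scores.index].values,
        "score": scores.values,
    })

result = top_n_similar("Movie A")
print(result)
```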

## Collaborative Filtering Recommendation Model


Start with the file **ratings.csv**, which contains the user IDs, movie IDs, and ratings. All three are needed to measure how similar users are, based on the ratings they gave to the same movies.


In [14]:
# Fill NaN values in user_id and movie_id column with 0
ratings['user_id'] = ratings['user_id'].fillna(0)
ratings['movie_id'] = ratings['movie_id'].fillna(0)

# Replace NaN values in rating column with average of all values
ratings['rating'] = ratings['rating'].fillna(ratings['rating'].mean())

Take a random 2% sample of the ratings (150 rows here) because of the memory limits of a personal laptop.

In [15]:
# Randomly sample 2% of the ratings dataset
small_data = ratings.sample(frac=0.02)
# Check the sample info
print(small_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1497 to 2410
Data columns (total 3 columns):
user_id     150 non-null int64
movie_id    150 non-null int64
rating      150 non-null int64
dtypes: int64(3)
memory usage: 4.7 KB
None


In [16]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(small_data, test_size=0.2)

In [17]:
# Convert the training and test sets to NumPy arrays, one (user_id, movie_id, rating) row per rating
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
train_data_matrix = train_data[['user_id', 'movie_id', 'rating']].to_numpy()
test_data_matrix = test_data[['user_id', 'movie_id', 'rating']].to_numpy()

# Check their shape
print(train_data_matrix.shape)
print(test_data_matrix[:4, :4])

(120, 3)
[[25 39  4]
 [87  9  3]
 [ 1 31  2]
 [47 20  4]]
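
The arrays above hold one `(user_id, movie_id, rating)` triple per row. A user-item matrix in the usual sense has one row per user and one column per movie. The sketch below shows one way to build that layout with `pivot_table`; it is an assumption about what such a matrix could look like, not the notebook's own method, and `toy_ratings` is made-up data.

```python
import pandas as pd

# Hypothetical ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [4, 5, 3, 2, 5],
})

# One row per user, one column per movie; unrated pairs become 0
user_item = toy_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating", fill_value=0
)
toy_matrix = user_item.to_numpy()

print(user_item)
print(toy_matrix.shape)
```

With this layout, row-wise similarities compare users and column-wise similarities compare movies.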


Now I use the **pairwise_distances** function from sklearn to calculate the [Pearson Correlation Coefficient](https://stackoverflow.com/questions/1838806/euclidean-distance-vs-pearson-correlation-vs-cosine-similarity). This function provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

In [18]:
from sklearn.metrics.pairwise import pairwise_distances

# User Similarity Matrix
user_correlation = 1 - pairwise_distances(train_data, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation[:4, :4])

[[1.         0.65993828 0.94738282 0.99487195]
 [0.65993828 1.         0.38471464 0.58056371]
 [0.94738282 0.38471464 1.         0.97490059]
 [0.99487195 0.58056371 0.97490059 1.        ]]
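
The `'correlation'` metric is a distance equal to one minus the Pearson coefficient, which is why the cell subtracts it from 1 to get a similarity. The sketch below checks this by hand with NumPy; `correlation_distance` is a hypothetical helper written for illustration, and the two vectors are the first two rows of the test sample printed earlier.

```python
import numpy as np

def correlation_distance(u, v):
    """Return 1 minus the Pearson correlation of u and v."""
    u_c = u - u.mean()
    v_c = v - v.mean()
    return 1.0 - (u_c @ v_c) / (np.linalg.norm(u_c) * np.linalg.norm(v_c))

# Two rows taken from the printed test sample
a = np.array([25.0, 39.0, 4.0])
b = np.array([87.0, 9.0, 3.0])

# Converting the distance back into a similarity recovers Pearson's r
similarity = 1.0 - correlation_distance(a, b)
print(np.isclose(similarity, np.corrcoef(a, b)[0, 1]))
```

A constant vector has zero norm after centering, so the division yields NaN. That is the case the notebook handles by replacing NaN values with 0.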


In [19]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(train_data_matrix.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation[:4, :4])

[[ 1.         -0.09531795  0.00190598]
 [-0.09531795  1.         -0.01101394]
 [ 0.00190598 -0.01101394  1.        ]]


With the similarity matrix in hand, I can now predict the ratings that were not included with the data. I can then compare these predictions with the test data to assess the quality of the recommender model.

For the user-user CF case, I will look at the similarity between 2 users (A and B, for example) as weights that are multiplied by the ratings of a similar user B (corrected for the average rating of that user).

In [20]:
#TODO: Function to predict ratings
def predict(ratings, similarity, type='user'):
    if type == 'user':
        ...
    elif type == 'item':
        ...
    return pred
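
One common way to fill in such a function is a similarity-weighted average. The sketch below is illustrative only and is not the notebook's own implementation. It assumes `ratings` is a users-by-items matrix and `similarity` is a square matrix of matching size; `predict_sketch` and `kind` are hypothetical names.

```python
import numpy as np

def predict_sketch(ratings, similarity, kind="user"):
    """Similarity-weighted rating predictions (illustrative sketch)."""
    if kind == "user":
        # Center each user's ratings on that user's mean rating
        user_mean = ratings.mean(axis=1, keepdims=True)
        centered = ratings - user_mean
        # Weighted sum of all users' deviations, normalized by total |similarity|
        weights = np.abs(similarity).sum(axis=1, keepdims=True)
        return user_mean + similarity @ centered / weights
    if kind == "item":
        # Weighted sum over items, normalized per target item
        weights = np.abs(similarity).sum(axis=0, keepdims=True)
        return ratings @ similarity / weights
    raise ValueError("kind must be 'user' or 'item'")

# Made-up 3 users x 3 items matrix and a made-up user similarity matrix
toy = np.array([[5.0, 3.0, 0.0],
                [4.0, 0.0, 1.0],
                [1.0, 1.0, 5.0]])
toy_user_sim = np.array([[1.0, 0.5, 0.0],
                         [0.5, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

print(predict_sketch(toy, toy_user_sim, kind="user"))
```

With an identity similarity matrix, each user is only similar to themselves, so the user-based prediction reproduces the input ratings. That makes a convenient sanity check.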

In [21]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Function to calculate RMSE
def rmse(pred, actual):
    # Keep only the entries where the actual value is nonzero
    pred = pred[actual.nonzero()].flatten()
    print(pred)
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(pred, actual))
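
To make the masking step concrete, here is a small worked example with NumPy only. `masked_rmse` is a hypothetical helper that mirrors the idea above, and the numbers are made up.

```python
import numpy as np

def masked_rmse(pred, actual):
    """RMSE over the positions where `actual` is nonzero."""
    mask = actual != 0
    return np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2))

# Zeros in `actual` mark missing ratings and are skipped
actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])
pred = np.array([[3.0, 5.0],
                 [1.0, 4.0]])

# Only (3 vs 4) and (4 vs 2) are compared: sqrt((1 + 4) / 2)
score = masked_rmse(pred, actual)
print(score)
```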

In [22]:
# Predict ratings on the training data with both similarity scores
user_prediction = predict(train_data_matrix, user_correlation, type='user')
item_prediction = predict(train_data_matrix, item_correlation, type='item')

# RMSE on the test data
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

[ 78.31313638  49.0192993   16.66756433  73.86810119  69.88663803
  24.24526078  65.72009475  27.43945197   5.84045328  66.10521735
  33.77398224   4.12080041  36.50694765  56.33191642   4.16113592
  70.54550129  51.3892098   13.06528891  47.26743534  47.10863418
  -0.37606952  52.36926963  14.25204758  -7.62131721  40.02502532
  43.63962539  -5.66465071  69.13442201  42.71976656   8.14581143
  74.73830219  44.37177681  12.889921    37.22886257  60.87214129
   9.89899614  59.31285421  55.7700163    9.91712949  21.95451647
  43.80694408  -7.76146055  33.52400763  32.19922366 -14.72323129
  76.29899188  45.34442797  14.35658015  51.72412692  42.75641535
  -0.48054227  51.30431903  30.64818009  -6.95249912  65.51507241
  25.88638908   6.59853852  46.22979186  10.78081257 -15.01060444
  62.34586004  31.26134152   0.39279845  53.32368335  15.52721592
  -6.85089927  17.33526181  32.79428305 -20.12954486  43.53990616
  65.9297929   14.53030094  50.51489919  53.82008777   4.66501303
  68.95661