## Data Preparation
Let's load this data into Python.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv('ratings.csv', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])

# Reading users file
users = pd.read_csv('users.csv', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [0]:
print(movies.shape)

(100, 3)


In [0]:
ratings.head(5)

Unnamed: 0,user_id,movie_id,rating
0,1,1,3
1,1,3,5
2,1,4,4
3,1,5,3
4,1,6,3


In [0]:
users.head(5)

Unnamed: 0,user_id,gender,zipcode,age_desc,occ_desc
0,1,F,48067,Under 18,K-12 student
1,2,M,70072,56+,self-employed
2,3,M,55117,25-34,scientist
3,4,M,2460,45-49,executive/managerial
4,5,M,55455,25-34,writer


In [0]:
movies.head(5)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


## Content-Based Recommendation Model

This model computes the similarity between movies from their genres. Given a particular movie, it suggests the movies whose genres are most similar to it.

In [0]:
# Break up the big genre string into a string array
movies['genres'] = movies['genres'].apply(lambda x: str(x).split('|'))
print(movies.shape)
movies.head(5)

(100, 3)


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),"[Animation, Children's, Comedy]"
1,2,Jumanji (1995),"[Adventure, Children's, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama]"
4,5,Father of the Bride Part II (1995),[Comedy]


In [0]:
# Convert genres to string value
movies['genres'] = movies['genres'].fillna("").astype('str')
movies.head(5)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),"['Animation', ""Children's"", 'Comedy']"
1,2,Jumanji (1995),"['Adventure', ""Children's"", 'Fantasy']"
2,3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,Waiting to Exhale (1995),"['Comedy', 'Drama']"
4,5,Father of the Bride Part II (1995),['Comedy']


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
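# Unigrams and bigrams over the genre string; min_df=1 keeps every term
# (recent scikit-learn versions reject an integer min_df of 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')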
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape

(100, 45)

In [0]:
tfidf_matrix

<100x45 sparse matrix of type '<class 'numpy.float64'>'
	with 246 stored elements in Compressed Sparse Row format>
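To see what the vectorizer extracts from a genre string, here is a minimal sketch on three made-up genre strings (`toy_genres` is a hypothetical input, not taken from movies.csv). With `ngram_range=(1, 2)` the vocabulary holds single genre words as well as pairs of adjacent words, and the default tokenizer drops one-character tokens such as the trailing "s" in "Children's".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy genre strings, for illustration only
toy_genres = [
    "Animation Children's Comedy",
    "Comedy Romance",
    "Comedy Drama",
]

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
toy_matrix = vectorizer.fit_transform(toy_genres)

# Terms (unigrams and bigrams) listed in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(toy_matrix.shape)  # one row per string, one column per term
```

Each row of the matrix is the TF-IDF vector of one genre string, which is why the matrix for the real data has one row per movie.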

In [0]:
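from sklearn.metrics.pairwise import linear_kernel
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel is the cosine similarity between movies
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)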
#cosine_sim[:4, :4]
cosine_sim

array([[1.        , 0.15337409, 0.12551391, ..., 0.        , 0.        ,
        0.        ],
       [0.15337409, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.12551391, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.25861841],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25861841, 0.        ,
        1.        ]])
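The variable is named `cosine_sim` even though it comes from `linear_kernel`. The sketch below checks, on made-up genre strings (`toy_genres` is hypothetical), that the two coincide when the rows have unit length, which is the default behaviour of `TfidfVectorizer` (`norm='l2'`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy genre strings, for illustration only
toy_genres = ["Animation Comedy", "Comedy Romance", "Drama Thriller"]
toy_tfidf = TfidfVectorizer().fit_transform(toy_genres)

# Every TF-IDF row has Euclidean length 1
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

dot = linear_kernel(toy_tfidf, toy_tfidf)     # plain dot products
cos = cosine_similarity(toy_tfidf, toy_tfidf)  # dot products divided by the norms

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

Using `linear_kernel` skips the normalization step that `cosine_similarity` would repeat, so it is the cheaper call on TF-IDF input.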

In [0]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])

# Function that gets movie recommendations based on the cosine similarity score of movie genres
def genre_recommendations(title):
    # Row position of the requested movie (first match if the title repeats)
    idx = np.where(titles == title)[0][0]
    # Stable sort on negated scores: most similar first, ties keep their original order
    order = np.argsort(-cosine_sim[idx], kind='stable')
    # Drop the requested movie itself from the ranking
    order = order[order != idx]
    return pd.DataFrame({"ID": order, "Title": titles.iloc[order].values})

Let's get the top recommendations for a movie and see how good they are.

In [0]:
genre_recommendations('Toy Story (1995)').head(20)

Unnamed: 0,ID,Title
0,12,Balto (1995)
1,86,Dunston Checks In (1996)
2,33,Babe (1995)
3,47,Pocahontas (1995)
4,4,Father of the Bride Part II (1995)
5,18,Ace Ventura: When Nature Calls (1995)
6,37,It Takes Two (1995)
7,51,Mighty Aphrodite (1995)
8,62,Don't Be a Menace to South Central While Drink...
9,64,Bio-Dome (1996)


## Collaborative Filtering Recommendation Model


Start with the file **ratings.csv**, since it contains the user IDs, movie IDs and ratings. These three fields are all that is needed to determine how similar users are, based on the ratings they gave to the same movies.


In [2]:
# Fill NaN values in user_id and movie_id column with 0
ratings['user_id'] = ratings['user_id'].fillna(0)
ratings['movie_id'] = ratings['movie_id'].fillna(0)

# Replace NaN values in rating column with average of all values
ratings['rating'] = ratings['rating'].fillna(ratings['rating'].mean())
ratings

Unnamed: 0,user_id,movie_id,rating
0,1,1,3
1,1,3,5
2,1,4,4
3,1,5,3
4,1,6,3
...,...,...,...
7507,100,96,4
7508,100,97,4
7509,100,98,5
7510,100,99,5


Take a random 2% sample of the ratings (150 rows here) because of the limited resources of a personal laptop.

In [0]:
# Randomly sample 2% of the ratings dataset
small_data = ratings.sample(frac=0.02)
# Check the sample info
print(small_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 5450 to 3680
Data columns (total 3 columns):
user_id     150 non-null int64
movie_id    150 non-null int64
rating      150 non-null int64
dtypes: int64(3)
memory usage: 4.7 KB
None


In [0]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(ratings, test_size=0.2)

In [12]:
# Convert the training and test ratings into NumPy arrays of (user_id, movie_id, rating) rows
# (DataFrame.as_matrix was removed from pandas; to_numpy is its replacement)
train_data_matrix = train_data[['user_id', 'movie_id', 'rating']].to_numpy()
test_data_matrix = test_data[['user_id', 'movie_id', 'rating']].to_numpy()

# Check their shape
print(train_data_matrix.shape)
print(test_data_matrix[:4, :4])

(6009, 3)
[[35 74  2]
 [59 93  3]
 [13 70  5]
 [24 52  4]]


In [14]:
# Build 100 x 100 user-item matrices (rows: users, columns: movies); 0 means "not rated"
# (np.int was removed from NumPy; the built-in int is equivalent)
rating_mat = np.zeros((100, 100), int)
for index, row in train_data.iterrows():
  rating_mat[row['user_id'] - 1][row['movie_id'] - 1] = row['rating']
print(rating_mat[:5, :5])
rating_mat_test = np.zeros((100, 100), int)
for index, row in test_data.iterrows():
  rating_mat_test[row['user_id'] - 1][row['movie_id'] - 1] = row['rating']
print(rating_mat_test[:5, :5])

[[0 0 5 4 3]
 [3 4 5 0 5]
 [0 0 4 3 3]
 [1 3 0 0 0]
 [2 4 0 4 0]]
[[3 0 0 0 0]
 [0 0 0 0 0]
 [3 1 0 0 0]
 [0 0 3 0 3]
 [0 0 4 0 0]]
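The row-by-row loops above can also be written as a single indexed assignment. This is a minimal sketch on a made-up ratings frame (`toy_ratings`, `n_users` and `n_movies` are hypothetical), assuming 1-based IDs and at most one rating per (user, movie) pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with 1-based IDs, for illustration only
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [1, 3, 2, 3],
    'rating':   [3, 5, 4, 2],
})

n_users, n_movies = 3, 3
toy_mat = np.zeros((n_users, n_movies), dtype=int)

# Shift the 1-based IDs to 0-based row and column positions
rows = toy_ratings['user_id'].to_numpy() - 1
cols = toy_ratings['movie_id'].to_numpy() - 1

# Fill every rated cell in one step; unrated cells stay 0
toy_mat[rows, cols] = toy_ratings['rating'].to_numpy()
print(toy_mat)
```

On a frame with thousands of rows this avoids the per-row overhead of `iterrows`, while producing the same kind of matrix.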


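Now I use the **pairwise_distances** function from sklearn to calculate the [Pearson Correlation Coefficient](https://stackoverflow.com/questions/1838806/euclidean-distance-vs-pearson-correlation-vs-cosine-similarity). With `metric='correlation'` the function returns the correlation distance, which is 1 minus the Pearson coefficient, so subtracting the result from 1 gives the similarity. This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.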

In [15]:
from sklearn.metrics.pairwise import pairwise_distances

# User Similarity Matrix
user_correlation = 1 - pairwise_distances(rating_mat, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation.shape)

(100, 100)


In [16]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(rating_mat.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation.shape)

(100, 100)
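As a sanity check on the `1 - pairwise_distances(...)` step, the sketch below compares it with `np.corrcoef` on a small made-up matrix (`toy_mat` is hypothetical). No row is constant here, so no NaN values appear; a constant row has zero variance and is the case the `np.isnan` replacement above handles.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy user-item matrix (3 users x 4 movies), for illustration only
toy_mat = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
], dtype=float)

# Correlation distance is 1 - Pearson r, so 1 - distance recovers r
toy_corr = 1 - pairwise_distances(toy_mat, metric='correlation')

# np.corrcoef treats each row as one variable, matching the row-wise distances
print(np.allclose(toy_corr, np.corrcoef(toy_mat)))
```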


With the similarity matrix in hand, I can now predict the ratings that were not included with the data. Using these predictions, I can then compare them with the test data to attempt to validate the quality of our recommender model.

For the user-user CF case, I treat the similarity between two users (A and B, for example) as a weight that multiplies the ratings of the similar user B, after correcting those ratings for B's average rating.
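Written as a formula, this is one common way to express the idea. The symbols are introduced here for illustration: $s_{A,B}$ is the similarity between users A and B, $r_{B,i}$ is user B's rating of movie $i$, and $\bar{r}_B$ is user B's average rating.

$$
\hat{r}_{A,i} = \bar{r}_A + \frac{\sum_{B} s_{A,B}\,\left(r_{B,i} - \bar{r}_B\right)}{\sum_{B} \left|s_{A,B}\right|}
$$

Dividing by the sum of absolute similarities keeps the correction term on the same scale as the ratings.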

In [0]:
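# Function to predict ratings
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Per-user mean over all movies (unrated entries count as 0)
        mean_user_rating = ratings.mean(axis=1)
        # Mean-center each user's ratings
        rat_dif = (ratings - mean_user_rating[:, np.newaxis])
        # User mean plus the similarity-weighted average of all users' deviations
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(rat_dif) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Similarity-weighted average of each user's ratings across items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred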
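To make the user-based branch concrete, here is a small worked example that repeats its arithmetic step by step. The inputs `toy_ratings` and `toy_similarity` are made up, and the intermediate names are hypothetical.

```python
import numpy as np

# Hypothetical toy inputs: 2 users x 2 movies and a made-up user similarity matrix
toy_ratings = np.array([[4.0, 0.0],
                        [2.0, 2.0]])
toy_similarity = np.array([[1.0, 0.5],
                           [0.5, 1.0]])

# Average rating per user, kept as a column so it broadcasts across movies
user_mean = toy_ratings.mean(axis=1, keepdims=True)
# Each rating expressed as a deviation from that user's average
deviations = toy_ratings - user_mean
# Normalizer: sum of absolute similarities for each user
weights = np.abs(toy_similarity).sum(axis=1, keepdims=True)

# User average plus the similarity-weighted average deviation
toy_pred = user_mean + toy_similarity.dot(deviations) / weights
print(toy_pred)
```

Because unrated cells are stored as 0, they pull each user's average down, which is one reason predictions from this scheme can come out low.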

In [0]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Function to calculate RMSE
def rmse(pred, actual):
    # Keep only the entries that are nonzero (actually rated) in `actual`.
    pred = pred[actual.nonzero()].flatten()
    print(pred)
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(pred, actual))
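The masking step is easy to verify by hand on a tiny example. In this sketch (`toy_actual` and `toy_pred` are made-up arrays) only the two rated cells contribute to the error, so the predictions for unrated cells are ignored.

```python
import numpy as np

# Hypothetical toy data: 0 in toy_actual means "not rated"
toy_actual = np.array([[5, 0],
                       [0, 3]])
toy_pred = np.array([[4.0, 2.0],
                     [1.0, 3.0]])

# Positions of the rated cells only
mask = toy_actual.nonzero()

# Errors on rated cells: 4 - 5 = -1 and 3 - 3 = 0
errors = toy_pred[mask] - toy_actual[mask]
toy_rmse = np.sqrt(np.mean(errors ** 2))
print(toy_rmse)  # sqrt(0.5)
```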

In [21]:
# Predict ratings from the training matrix with both similarity scores
user_prediction = predict(rating_mat, user_correlation, type='user')
item_prediction = predict(rating_mat, item_correlation, type='item')

# RMSE on the test data
print('User-based CF RMSE: ' + str(rmse(user_prediction, rating_mat_test)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, rating_mat_test)))

[1.66027952 1.98681217 1.46098987 ... 1.58732367 1.58361461 1.47377934]
User-based CF RMSE: 2.2182628534550775
[-0.6036019  -0.18022933 -0.596318   ...  0.05390317  0.03115049
 -0.38906551]
Item-based CF RMSE: 3.944125979632492
