## Data Preparation
Let's load this data into Python.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv('ratings.csv', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])

# Reading users file
users = pd.read_csv('users.csv', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [2]:
print(movies.shape)

(100, 3)


## Content-Based Recommendation Model

Computes similarity between movies based on movie genres. It will suggest movies that are most similar to a particular movie based on its genre.

In [3]:
# Break up the big genre string into a string array
movies['genres'] = movies['genres'].str.split('|')
print(movies.shape)


(100, 3)


In [4]:
# Convert genres to string value
movies['genres'] = movies['genres'].fillna("").astype('str')
movies.head(5)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),"['Animation', ""Children's"", 'Comedy']"
1,2,Jumanji (1995),"['Adventure', ""Children's"", 'Fantasy']"
2,3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,Waiting to Exhale (1995),"['Comedy', 'Drama']"
4,5,Father of the Bride Part II (1995),['Comedy']


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape

(100, 44)

In [6]:
tfidf_matrix

<100x44 sparse matrix of type '<class 'numpy.float64'>'
	with 231 stored elements in Compressed Sparse Row format>

In [7]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#cosine_sim[:4, :4]
cosine_sim

array([[1.        , 0.15337409, 0.12551391, ..., 0.        , 0.        ,
        0.        ],
       [0.15337409, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.12551391, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.25861841],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25861841, 0.        ,
        1.        ]])

In [8]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])

#TODO: Function that get movie recommendations based on the cosine similarity score of movie genres

def genre_recommendations(title):
    index = indices[title]
    df = pd.DataFrame(cosine_sim[index])
    df.columns = ['sim']
    df['titles'] = movies['title']
    df = df.sort_values(by=['sim'],ascending=False)
    df = df.drop(df.index[0])
    return df

Let's try and get the top recommendations for a few movies and see how good the recommendations are.

In [9]:
genre_recommendations('Balto (1995)').head(20)

Unnamed: 0,sim,titles
0,0.819159,Toy Story (1995)
47,0.610613,Pocahontas (1995)
86,0.286895,Dunston Checks In (1996)
7,0.264066,Tom and Huck (1995)
33,0.237548,Babe (1995)
1,0.187234,Jumanji (1995)
55,0.187234,Kids of the Round Table (1995)
72,0.0,Misérables
71,0.0,Kicking and Screaming (1995)
70,0.0,Fair Game (1995)


## Collaborative Filtering Recommendation Model


Use the file **ratings.csv** first as it contains User ID, Movie IDs and Ratings. These three elements are all needed for determining the similarity of the users based on their ratings for a particular movie.


In [10]:
# Fill NaN values in user_id and movie_id column with 0
ratings['user_id'] = ratings['user_id'].fillna(0)
ratings['movie_id'] = ratings['movie_id'].fillna(0)

# Replace NaN values in rating column with average of all values
ratings['rating'] = ratings['rating'].fillna(ratings['rating'].mean())

Take a random sample of 20,000 ratings (2%) (due to limitation of personal laptop)

In [11]:
# Randomly sample 1% of the ratings dataset
small_data = ratings.sample(frac=0.02)
# Check the sample info
print(small_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 3484 to 6910
Data columns (total 3 columns):
user_id     150 non-null int64
movie_id    150 non-null int64
rating      150 non-null int64
dtypes: int64(3)
memory usage: 4.7 KB
None


In [12]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(small_data, test_size=0.2)

In [13]:
# Create two user-item matrices, one for training and another for testing
train_data_matrix = train_data.as_matrix(columns = ['user_id', 'movie_id', 'rating'])
test_data_matrix = test_data.as_matrix(columns = ['user_id', 'movie_id', 'rating'])

# Check their shape
print(train_data_matrix.shape)
print(test_data_matrix[:4, :4])

(120, 3)
[[31 91  2]
 [ 9  1  4]
 [ 1 74  2]
 [42 77  3]]


  
  This is separate from the ipykernel package so we can avoid doing imports until


Now I use the **pairwise_distances** function from sklearn [Pearson Correlation Coefficient](https://stackoverflow.com/questions/1838806/euclidean-distance-vs-pearson-correlation-vs-cosine-similarity). This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

In [14]:
from sklearn.metrics.pairwise import pairwise_distances

# User Similarity Matrix
user_correlation = 1 - pairwise_distances(train_data, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation[:4, :4])

[[ 1.          0.52437882  0.49870074  0.8085333 ]
 [ 0.52437882  1.          0.99955344 -0.07707902]
 [ 0.49870074  0.99955344  1.         -0.10683751]
 [ 0.8085333  -0.07707902 -0.10683751  1.        ]]


In [15]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(train_data_matrix.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

[[ 1.         -0.07526278 -0.06675173]
 [-0.07526278  1.         -0.01270332]
 [-0.06675173 -0.01270332  1.        ]]


With the similarity matrix in hand, I can now predict the ratings that were not included with the data. Using these predictions, I can then compare them with the test data to attempt to validate the quality of our recommender model.

For the user-user CF case, I will look at the similarity between 2 users (A and B, for example) as weights that are multiplied by the ratings of a similar user B (corrected for the average rating of that user).

In [16]:
#TODO: Function to predict ratings

def predict(ratings, similarity, type='user'):
    similarity[similarity == 1] = 0
    prediction = np.copy(train_data_matrix)
    users, items = prediction.shape
    if type == 'user':
        ratings_mean = np.nanmean(ratings,axis=1)
        for i in range(users):
            for j in range(items):
                prediction[i][j] = ratings_mean[i] + np.nansum(similarity[i] * (ratings[:,j] - ratings_mean)) / sum(similarity[i])
    elif type == 'item':
        ratings_mean = np.nanmean(ratings,axis=0)
        for i in range(users):
            for j in range(items):
                 prediction[i][j] = ratings_mean[j] + np.nansum(similarity[j] * (ratings[j,:] - ratings_mean)) / sum(similarity[j])
        print("test")
    return prediction


In [17]:
prediction = np.copy(train_data_matrix)
users, items = prediction.shape
np.nanmean(train_data_matrix, axis=1)

array([29.        , 33.        , 31.33333333, 39.        , 24.33333333,
       32.66666667, 48.        , 51.        , 54.66666667, 47.        ,
       41.        , 16.33333333, 35.66666667, 28.33333333, 55.66666667,
       33.66666667, 34.66666667, 40.33333333, 35.66666667, 49.66666667,
       40.33333333, 21.        , 28.33333333, 60.        , 43.66666667,
       37.33333333, 56.        , 23.66666667, 19.        , 38.33333333,
       39.        , 41.        , 30.66666667, 24.33333333, 46.33333333,
       18.66666667, 47.33333333, 40.33333333, 16.33333333, 27.33333333,
       49.33333333, 40.33333333, 60.        , 49.33333333, 31.33333333,
       31.33333333, 28.33333333, 15.33333333, 27.        , 24.33333333,
       35.66666667, 40.33333333, 38.        , 47.33333333, 55.        ,
       28.33333333, 44.        , 50.        , 30.33333333, 32.        ,
       22.66666667, 31.        ,  7.        , 43.33333333, 41.66666667,
       54.66666667, 23.        , 32.33333333, 32.66666667, 45.  

In [18]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Function to calculate RMSE
def rmse(pred, actual):
    # Ignore nonzero terms.
    pred = pred[actual.nonzero()].flatten()
    print(pred)
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(pred, actual))

In [19]:
# Predict ratings on the training data with both similarity score
user_prediction = predict(train_data_matrix, user_correlation, type='user')
item_prediction = predict(train_data_matrix, item_correlation, type='item')

# RMSE on the test data
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

test
[ 48  44  -6  34  66  -2  32  65  -3  67  44   4  24  59 -11  42  58  -2
  56  74  13  74  61  16  72  71  19  60  68  12  53  63   5  54  13 -18
  47  58   0  51  40  -6  68  77  20  44  57  -1  32  72   0  63  51   5
  39  67   0  65  68  14  59  56   5  39  38 -14  68  23  -6  74  80  25
  73  48   9  57  51   2  70  76  21  29  53 -11  22  51 -16  44  67   3]
User-based CF RMSE: 29.556913084947
[ 46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28
  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28
  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28
  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28
  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28  46  20 -28]
Item-based CF RMSE: 32.43060968351419
