This notebook shows two simple examples of Collaborative Filtering, one user-based and one item-based. The source data is the MovieLens `ml-latest-small` dataset, available [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip).

### Import packages

In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.metrics.pairwise import pairwise_distances

### Set-up

In [None]:
# location and filenames of the input data
ratings_file = 'https://bitbucket.org/vishal_derive/vcu-data-mining/raw/37a416e794c656e5c84dd87149cdbbf3c0d8737b/data/ratings.csv'
movies_file = 'https://bitbucket.org/vishal_derive/vcu-data-mining/raw/37a416e794c656e5c84dd87149cdbbf3c0d8737b/data/movies.csv'

### Read data

In [None]:
# ratings data set
ratings_df = pd.read_csv(ratings_file)
ratings_df.shape

In [None]:
# movies data set
movies_df = pd.read_csv(movies_file)
movies_df.shape

### Prepare data

Let's drop columns that we don't need, and then combine those two datasets.

In [None]:
ratings_df = ratings_df.drop('timestamp', axis=1)
movies_df = movies_df.drop('genres', axis=1)

df = 
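As a rough sketch of the combine step, the two tables can be joined on their shared `movieId` key. The toy frames and their values below are made up for illustration; only the column names mirror the real files.

```python
import pandas as pd

# Toy stand-ins for ratings_df and movies_df (values are made up)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title':   ['Movie A', 'Movie B', 'Movie C'],
})

# Inner join on the shared key: each rating row picks up its movie title
toy_df = pd.merge(toy_ratings, toy_movies, on='movieId')
print(toy_df)
```

An inner join is assumed here; ratings whose `movieId` has no match in the movies table would be dropped.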

### Matrix representation

Collaborative Filtering requires the data to be in a user-item matrix format.

In [None]:
user_item_df = 
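One possible way to build such a matrix is `pivot_table`, sketched here on made-up data. Keeping `userId` as a regular column (rather than the index) is an assumption, chosen because later cells select `user_item_df['userId']`.

```python
import pandas as pd

# Toy long-format ratings (values are made up)
toy_df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title':  ['Movie A', 'Movie B', 'Movie A', 'Movie C'],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per title; unrated pairs become NaN
toy_user_item = toy_df.pivot_table(index='userId', columns='title', values='rating')

# Move userId from the index into a regular column
toy_user_item = toy_user_item.reset_index()
toy_user_item.columns.name = None
print(toy_user_item)
```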

For the purpose of this analysis, we will restrict the number of movies to 500.

In [None]:
# take the 500 most-rated movies
top_500_movies = 

In [None]:
# subset the user_item_df dataframe by taking only those top 500 titles (columns)
user_item_df = 
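A minimal sketch of the selection, assuming "most rated" means the titles with the largest number of non-missing ratings. The toy matrix and the name `n_top` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (values are made up)
toy_user_item = pd.DataFrame({
    'userId':  [1, 2, 3],
    'Movie A': [4.0, 5.0, 3.0],
    'Movie B': [3.5, np.nan, np.nan],
    'Movie C': [np.nan, 1.0, 2.0],
})

n_top = 2  # the notebook uses 500

# count() ignores NaN, so it gives the number of ratings per title
rating_counts = toy_user_item.drop('userId', axis=1).count()
top_titles = rating_counts.sort_values(ascending=False).head(n_top).index

# keep userId plus the selected title columns
toy_subset = toy_user_item[['userId'] + list(top_titles)]
print(toy_subset)
```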

Let's take a quick look at the top rated movies.

In [None]:
all_titles = 

In [None]:
avg_ratings = 
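One way to rank titles by average rating is sketched below on made-up data. Note that `mean()` skips missing values, so a title rated by a single user can still top the list.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (values are made up)
toy_user_item = pd.DataFrame({
    'userId':  [1, 2, 3],
    'Movie A': [4.0, 5.0, 3.0],
    'Movie B': [3.5, np.nan, np.nan],
    'Movie C': [np.nan, 1.0, 2.0],
})

# every column except userId is a movie title
toy_titles = [col for col in toy_user_item.columns if col != 'userId']

# average rating per title, highest first (NaN entries are ignored)
toy_avg_ratings = toy_user_item[toy_titles].mean().sort_values(ascending=False)
print(toy_avg_ratings)
```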

Let's check the ratings of a specific movie.

In [None]:
target_movie = 'Groundhog Day (1993)'

#--

What % of this entire population saw (rated) this movie?

In [None]:
#--

How many users have *not* watched (rated) this movie?

In [None]:
#--
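Both questions can be answered from a boolean "has rated" column. The sketch below uses a made-up matrix and a hypothetical `toy_target_movie`; a missing value is taken to mean "not rated".

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (values are made up)
toy_user_item = pd.DataFrame({
    'userId':  [1, 2, 3, 4],
    'Movie A': [4.0, 5.0, 3.0, 2.0],
    'Movie C': [np.nan, 1.0, 2.0, np.nan],
})
toy_target_movie = 'Movie C'

# True where the user has a rating for the movie
has_rated = toy_user_item[toy_target_movie].notnull()

pct_rated = 100 * has_rated.mean()      # share of users who rated it
n_not_rated = int((~has_rated).sum())   # number of users who did not
print(f"{pct_rated:.1f}% rated it; {n_not_rated} users did not")
```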

### User-based Collaborative Filtering

Let's randomly choose one user who has not watched *Groundhog Day.* We will then proceed to predict his/her rating for this movie.

In [None]:
# first take all users who have not watched this movie
target_users = 

In [None]:
# randomly select one user from this group
random.seed(5)
target_user = 
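A sketch of the two steps on made-up data: collect the IDs of users with no rating for the movie, then draw one with `random.choice`. Seeding the generator first makes the draw repeatable.

```python
import random
import numpy as np
import pandas as pd

# Toy user-item matrix (values are made up)
toy_user_item = pd.DataFrame({
    'userId':  [1, 2, 3, 4],
    'Movie C': [np.nan, 1.0, 2.0, np.nan],
})

# IDs of users who have no rating for the movie
toy_candidates = toy_user_item.loc[toy_user_item['Movie C'].isnull(), 'userId'].values

# fix the seed so the same user is drawn on every run
random.seed(5)
toy_target_user = random.choice(list(toy_candidates))
print(toy_target_user)
```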

This is our **target** user, for whom we wish to predict the rating for *Groundhog Day*.

Let's calculate the distance from this user to all other users. In order to calculate the distance, let's first collect all movies that the target user has rated.

In [None]:
movies_rated_by_target_user = 
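One way to collect these movies is to take the user's row and drop every column with a missing value. This is a sketch on made-up data; it returns a one-row DataFrame, which matches the later use of `.columns` on this object.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (values are made up)
toy_user_item = pd.DataFrame({
    'userId':  [1, 2, 3],
    'Movie A': [4.0, 5.0, 3.0],
    'Movie B': [3.5, np.nan, np.nan],
    'Movie C': [np.nan, 1.0, 2.0],
})
toy_target_user = 2  # hypothetical user ID

# the target user's row, without the ID column
toy_row = toy_user_item[toy_user_item['userId'] == toy_target_user].drop('userId', axis=1)

# keep only the columns (movies) this user has actually rated
toy_rated = toy_row.dropna(axis=1)
print(toy_rated)
```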

In [None]:
# all movies rated by the target user 
rated_movies_by_target_user = movies_rated_by_target_user.columns

# create a mask that selects every user other than the target user who has rated the target movie
mask = (user_item_df['userId'] != target_user) & (user_item_df[target_movie].notnull())

# apply the filter -- take all user IDs that satisfy these criteria
user_search_space = user_item_df[mask]['userId'].values

# take all users from this search space, and get their ratings for all movies that our target user rated
X = user_item_df[mask][rated_movies_by_target_user].fillna(0)

# take the target user, and get their ratings for all movies sans the target movie
y = movies_rated_by_target_user

len(X), len(y)

In [None]:
# calculate distances
user_dist = 
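The distance step can be sketched with `pairwise_distances`, which returns one row per candidate user and one column per target row. The Euclidean metric (the function's default) is an assumption here, since the metric for this step is not stated; the rating values are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Toy ratings on the two movies the target user has rated (values are made up)
toy_X = np.array([[4.0, 5.0],    # candidate user 0
                  [1.0, 1.0],    # candidate user 1
                  [4.0, 2.0]])   # candidate user 2
toy_y = np.array([[4.0, 5.0]])   # target user, as a single row

# shape (3, 1): distance from each candidate to the target user
toy_user_dist = pairwise_distances(toy_X, toy_y, metric='euclidean')
print(toy_user_dist.ravel())
```

Because missing ratings were filled with 0, a candidate who has not seen a movie is treated as having rated it 0, which pushes that candidate further from the target user.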

Find the user that is "closest" to the target user.

In [None]:
min_dist = 

In [None]:
closest_user = 

In [None]:
# the closest user and his/her ratings
#--

In [None]:
# target user's ratings (for comparison)
#--

In [None]:
predicted_rating = 
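A minimal sketch of the final lookup, assuming the prediction is simply the closest user's own rating for the target movie. All IDs, distances, and ratings below are hypothetical.

```python
import numpy as np
import pandas as pd

toy_search_space = np.array([11, 12, 13])         # hypothetical user IDs
toy_user_dist = np.array([[2.5], [0.7], [1.9]])   # hypothetical distances to the target user

# each candidate's rating for the target movie, indexed by user ID
toy_target_movie_ratings = pd.Series([4.0, 3.0, 5.0], index=toy_search_space)

# position of the smallest distance gives the closest user
toy_closest_user = toy_search_space[np.argmin(toy_user_dist)]

# borrow that user's rating as the prediction
toy_predicted_rating = toy_target_movie_ratings.loc[toy_closest_user]
print(toy_closest_user, toy_predicted_rating)
```

Using a single nearest neighbour keeps the example simple; averaging over several close users would be a natural variation.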

This is a demonstration of how user-based distance can be used to predict ratings (and recommend movies).

____________

### Item-based Collaborative Filtering

In [None]:
# list of all movies other than the target movie
all_other_movies = [col for col in user_item_df.columns if col not in (target_movie, 'userId')]

X = user_item_df[all_other_movies].fillna(0)
y = user_item_df[target_movie].fillna(0)

In [None]:
# X.T has one row per movie; y is reshaped into a single row so it can be compared with each of them
item_dist = pairwise_distances(X.T, y.values.reshape(1, -1), metric='cosine')
item_dist[:10]
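To make the metric concrete: cosine distance is one minus the cosine of the angle between two rating vectors, so movies rated in the same proportions by the same users score close to 0. The sketch below checks a hand-written version against `pairwise_distances` on made-up data; `cosine_distance` is a helper defined only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# rows = movies, columns = users (values are made up, 0 = not rated)
toy_items = np.array([[5.0, 0.0, 3.0],
                      [4.0, 0.0, 2.0],
                      [0.0, 5.0, 0.0]])
toy_target = np.array([[5.0, 0.0, 3.0]])   # the target movie's ratings as one row

toy_item_dist = pairwise_distances(toy_items, toy_target, metric='cosine')

def cosine_distance(a, b):
    """1 - (a . b) / (|a| |b|)"""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# the same distances computed by hand, one movie at a time
toy_manual = np.array([cosine_distance(row, toy_target[0]) for row in toy_items])
print(toy_item_dist.ravel())
print(toy_manual)
```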

In [None]:
min_item_dist = item_dist.min()
min_item_dist

In [None]:
# list of all movies sorted by distance (shortest first)
# flatten the (n, 1) distance array so each title is paired with a plain scalar
[title for d, title in sorted(zip(item_dist.ravel(), all_other_movies))][:10]

In [None]:
closest_item = all_other_movies[np.argmin(item_dist)]
closest_item

Calculate the average rating for this title. This will become our predicted rating for *Groundhog Day*.

In [None]:
predicted_rating_2 = user_item_df[closest_item].mean()
round(predicted_rating_2, 1)

This is a demonstration of how item-based distance can be used to predict ratings (and recommend movies).