# Item-Item

If you use Netflix, you will notice that there is a section titled "Because you watched Movie X", which provides recommendations for movies based on a recent movie that you've watched. This is a classic example of an item-item recommendation. 

In this tutorial, we will generate item-item recommendations using a technique called [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering). Let's get started! 

## Step 1: Import the Dependencies

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Step 2: Load the Data

Let's download a small version of the [MovieLens](https://www.wikiwand.com/en/MovieLens) dataset. You can find it on the [GroupLens datasets page](https://grouplens.org/datasets/movielens/), or download the zip file directly [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip). We're working with the data in `ml-latest-small.zip` and will need to add the following files to a local `data/` directory: 
- ratings.csv
- movies.csv

Alternatively, you can access the data here: 
- https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv
- https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv

Let's load our data and take a peek at the structure.

In [2]:
ratings = pd.read_csv("data/ratings.csv")

ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
ratings.shape

(100836, 4)

In [4]:
movies = pd.read_csv("data/movies.csv")

movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Step 3: Transform the Data

We will use [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) to generate item recommendations. This technique is based on the assumption of "homophily": similar users like similar things. Collaborative filtering is a type of unsupervised learning that makes predictions about the interests of a user by learning from the interests of a larger population.

The first step of collaborative filtering is to transform our data into a user-item matrix, also known as a "utility" matrix. In this matrix, rows represent users and columns represent items; the code in this tutorial builds the transposed, item-user form (movies as rows) so that each movie's rating vector can be compared directly with the others. The beauty of collaborative filtering is that it doesn't require any information about the users or items beyond the ratings themselves. 
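To make the idea of a utility matrix concrete, here is a minimal sketch on a handful of made-up ratings. The `toy_ratings` frame and its values are hypothetical and are not taken from MovieLens; the sketch only illustrates the users-as-rows, items-as-columns layout.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Rows are users, columns are movies; pairs with no rating become NaN
utility = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(utility)
```

Most cells are empty even in this tiny example, which is why a dense table like this does not scale to the full dataset.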

The `create_csr_X_matrix()` function defined below returns a sparse matrix X along with two fitted `LabelEncoder` objects:
- **movie_encoder:** maps a movie title to its row index (`transform`) and a row index back to its title (`inverse_transform`)
- **user_encoder:** maps a user ID to its column index (`transform`) and a column index back to its user ID (`inverse_transform`)

We need these encoders because they record which row and column of the matrix corresponds to which movie and user, respectively.
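The mapping between IDs and matrix indices described above can be sketched with scikit-learn's `LabelEncoder`, which the code later in this tutorial also relies on. The `movie_ids` list here is a hypothetical example chosen to show IDs with gaps.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical movie IDs with gaps and repeats
movie_ids = [1, 3, 6, 47, 3, 1]

encoder = LabelEncoder()
# ID -> contiguous index (position among the sorted unique IDs)
movie_index = encoder.fit_transform(movie_ids)

print(movie_index.tolist())
# index -> original ID
print(encoder.inverse_transform([3]).tolist())
```

The encoder plays the role of both a forward and an inverse mapper, so no separate dictionaries are needed.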

The **X** matrix is a `scipy.sparse.csr_matrix`, which stores only the nonzero entries (the ratings that actually exist) rather than every user-movie pair.
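As a rough illustration of sparse storage, the sketch below builds a small `csr_matrix` from coordinate triples and measures how much of it is filled. The row, column, and value arrays are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: three ratings placed in a 3 x 4 matrix
rows = np.array([0, 1, 2])
cols = np.array([0, 2, 3])
vals = np.array([4.0, 5.0, 3.0])

X_toy = csr_matrix((vals, (rows, cols)), shape=(3, 4))

# Only the nonzero entries are stored; density is the fraction of filled cells
density = X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1])
print(X_toy.nnz, density)
print(X_toy.toarray())
```

The same `(values, (row_indices, col_indices))` constructor form is what the tutorial's matrix-building function uses.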

In [5]:
pd.DataFrame(ratings['movieId'].unique()).sort_values(0).head()

,0
0,1
481,2
1,3
482,4
483,5


In [6]:
import uuid

users = pd.DataFrame(ratings['userId'].unique(), columns=['userId'])
# Assign each user a random 8-character pseudonymous ID
users['new_id'] = [str(uuid.uuid4())[:8] for _ in range(users.shape[0])]

In [7]:
df = pd.merge(ratings, movies, on='movieId', how='left')[['userId', 'title', 'rating']]
df = pd.merge(df, users, on='userId')
df = df[['new_id', 'title', 'rating']]
df.to_csv('better_movies.csv', index=False)

In [8]:
df = pd.read_csv('better_movies.csv')

In [9]:
df.head()

,new_id,title,rating
0,2eb7bb8a,Toy Story (1995),4.0
1,2eb7bb8a,Grumpier Old Men (1995),4.0
2,2eb7bb8a,Heat (1995),4.0
3,2eb7bb8a,Seven (a.k.a. Se7en) (1995),5.0
4,2eb7bb8a,"Usual Suspects, The (1995)",5.0


In [10]:
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix

def create_csr_X_matrix(df):
    """Build a sparse movie-by-user rating matrix and the encoders that index it."""
    movie_encoder = LabelEncoder()
    user_encoder = LabelEncoder()

    # Map titles and user IDs to contiguous row and column indices
    m = movie_encoder.fit_transform(df["title"])
    u = user_encoder.fit_transform(df["new_id"])

    m_len = len(movie_encoder.classes_)
    u_len = len(user_encoder.classes_)

    # Rows are movies, columns are users
    X = csr_matrix((df["rating"], (m, u)), shape=(m_len, u_len))
    return X, movie_encoder, user_encoder

In [11]:
X, movie_encoder, user_encoder = create_csr_X_matrix(df)

In [12]:
from sklearn.neighbors import NearestNeighbors

def find_me_a_movie_to_watch(movie_title, X, k=10):
    # Look up the row of X for the requested title (uses the global movie_encoder)
    movie_ind = movie_encoder.transform([movie_title])
    movie_vec = X[movie_ind]
    # Minkowski distance with the default p=2 is Euclidean distance
    kNN = NearestNeighbors(n_neighbors=k, algorithm="brute", metric='minkowski')
    kNN.fit(X)
    neighbour = kNN.kneighbors(movie_vec, return_distance=False)
    close_movies = neighbour.flatten().tolist()
    z = movie_encoder.inverse_transform(close_movies).tolist()
    # The query movie is its own nearest neighbour, so drop it from the results
    z = [zi for zi in z if zi != movie_title]
    return z

In [13]:
find_me_a_movie_to_watch('101 Dalmatians (1996)', X, k=10)

['Jack (1996)',
 'George of the Jungle (1997)',
 'Kazaam (1996)',
 'Nanny McPhee (2005)',
 'Dracula: Dead and Loving It (1995)',
 'D2: The Mighty Ducks (1994)',
 'Borrowers, The (1997)',
 'Super Mario Bros. (1993)',
 'Muppets, The (2011)']
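To see the nearest-neighbour lookup on data small enough to check by hand, here is a self-contained sketch. It is not the tutorial's function: the `nearest_items` helper, the `labels` list, and the toy matrix are all hypothetical. It also differs in two ways that are worth stating plainly: it requests `k + 1` neighbours so that `k` results remain after the query is dropped, and it exposes the distance metric as a parameter so that cosine distance can be tried alongside Euclidean distance.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy item-by-user ratings (rows are items, columns are users)
labels = ["A", "B", "C", "D"]
X_toy = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float))

def nearest_items(X, labels, query, k=2, metric="euclidean"):
    """Return up to k labels closest to `query`, excluding the query itself."""
    idx = labels.index(query)
    # Ask for k + 1 neighbours because the query row is its own nearest neighbour
    knn = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    knn.fit(X)
    neighbours = knn.kneighbors(X[idx], return_distance=False).flatten()
    return [labels[i] for i in neighbours if i != idx][:k]

print(nearest_items(X_toy, labels, "A", k=1))
print(nearest_items(X_toy, labels, "C", k=1, metric="cosine"))
```

Items A and B are rated highly by the same two users, as are C and D, so each pair should come out as mutual nearest neighbours under either metric.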