# Item-Based Collaborative Filtering

https://www.youtube.com/watch?v=PA1XIDSHldc

Building a movie recommender system using the MovieLens.org 100K user ratings and movie datasets

The item-based filtering places an emphasis on the relationship between the movies as the basis of creating recommendations.

In [1]:
# Import the necessary libraries
import pandas as pd

In [2]:
# Load the ratings and movie datasets and merge into a single DataFrame
# Ratings
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('./datasets/ml-100k/u.data', sep = '\t', names = r_cols, usecols = range(3), encoding = 'latin-1')

# Movies
m_cols = ['movie_id', 'title']
movies = pd.read_csv('./datasets/ml-100k/u.item', sep = '|', names = m_cols, usecols = range(2), encoding = 'latin-1')

# Combine the movie and ratings DataFrames into a single DataFrame
ratings = pd.merge(movies, ratings)

In [3]:
# Display the newly merged DataFrame
ratings.head()

Unnamed: 0,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Use a pivot table to create a matrix of the users and movies they've rated

In [4]:
# Create a pivot table that lists the user_id, uses the movie titles as columns, and displays the user ratings 
# for each movie
userRatings = ratings.pivot_table(index = ['user_id'], columns = ['title'], values = ['rating'])   # Index = Rows
userRatings.head()

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,


Calculate the correlation score for each column pair in the matrix. 

This will create a table that lists the movie titles in both the index and columns, and displays the correlation between any given movie pair based on the rating scores provided (user rating vectors).

In [5]:
# Create a correlation matrix that compares the correlation scores of the movies
corrMatrix = userRatings.corr()
corrMatrix.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
Unnamed: 0_level_1,title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
Unnamed: 0_level_2,title,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
rating,'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
rating,1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
rating,101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
rating,12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
rating,187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,


The filter needs to be adjusted to display the results of movies that several people rated together. This will also help determine which titles are the most popular/easily recognizable. To do this, the min_periods argument will be used to pull results for movie rating pairs where at least 100 users rated them.

In [6]:
# Use the Pearson correlation score with movies that had at least 100 ratings to determine the correlation scores
corrMatrix = userRatings.corr(method = 'pearson', min_periods = 100)
corrMatrix.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
Unnamed: 0_level_1,title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
Unnamed: 0_level_2,title,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
rating,'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
rating,1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
rating,101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
rating,12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
rating,187 (1997),,,,,,,,,,,...,,,,,,,,,,


### More work needed here

Add new user at user number 0 <br>
Rate 3-5 movies for new user

Create a new user 0 as a test case

In [7]:
# Pull a single user record to work with - Displays all of the titles user #32 rated
myRatings = userRatings.loc[32].dropna()
myRatings

        title                                             
rating  Austin Powers: International Man of Mystery (1997)    4.0
        Beavis and Butt-head Do America (1996)                2.0
        Cable Guy, The (1996)                                 2.0
        Chasing Amy (1997)                                    4.5
        Close Shave, A (1995)                                 3.0
        Con Air (1997)                                        1.0
        Dead Man Walking (1995)                               3.0
        Devil's Advocate, The (1997)                          2.0
        Devil's Own, The (1997)                               2.0
        Face/Off (1997)                                       5.0
        Fargo (1996)                                          3.0
        Fathers' Day (1997)                                   3.0
        Fierce Creatures (1997)                               3.0
        Fifth Element, The (1997)                             4.0
        George of

Go through each of the rated titles to build a list of potential recommendations based on the user's rated titles.

For each of the rated movies, a list of similar movies will be pulled from the correlation matrix. Those correlations will then be scaled based on the user ratings, so that the movies that were liked most will have a greater impact.

## More work here

Adjust the for loop to appear similarly to the tutorial (7:17)

In [13]:
simCandidates = pd.Series()
for i in range(0, len(myRatings.index)):
    print("Adding similar movies for " + str(myRatings.index[i]) + "...")
    # Retrieve similar movies to the ones already rated
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Scale the similarity based on how the user rated the movie
    sims = sims.map(lambda x: x * myRatings[i])
    # Add the score to the list of similarity candidates
    simCandidates = simCandidates.append(sims)

# Check the results so far
print("Sort in decreasing order of similarity score: ")
simCandidates.sort_values(inplace = True, ascending = False)
print(simCandidates.head(10))

Adding similar movies for ('rating', 'Austin Powers: International Man of Mystery (1997)')...
Adding similar movies for ('rating', 'Beavis and Butt-head Do America (1996)')...
Adding similar movies for ('rating', 'Cable Guy, The (1996)')...
Adding similar movies for ('rating', 'Chasing Amy (1997)')...
Adding similar movies for ('rating', 'Close Shave, A (1995)')...
Adding similar movies for ('rating', 'Con Air (1997)')...
Adding similar movies for ('rating', 'Dead Man Walking (1995)')...
Adding similar movies for ('rating', "Devil's Advocate, The (1997)")...
Adding similar movies for ('rating', "Devil's Own, The (1997)")...
Adding similar movies for ('rating', 'Face/Off (1997)')...
Adding similar movies for ('rating', 'Fargo (1996)')...
Adding similar movies for ('rating', "Fathers' Day (1997)")...
Adding similar movies for ('rating', 'Fierce Creatures (1997)')...
Adding similar movies for ('rating', 'Fifth Element, The (1997)')...
Adding similar movies for ('rating', 'George of the Ju

In [None]:
# simCandidates = pd.Series()
# for i in range(0, len(myRatings.index)):
#     sims = corrMatrix[myRatings.index[i]].dropna()
#     sims = sims.map(lambda x: x * myRatings[i])
#     simCandidates = simCandidates.append(sims)
    
# simCandidates.sort_values(inplace = True, ascending = False)
# print (simCandidates.head(10))

Use groupby to add together the scores from movies that show up more than once so that they will count more

In [14]:
simCandidatessimCand = simCandidates.groupby(simCandidates.index).sum()

In [15]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

(rating, Face/Off (1997))                       5.0
(rating, Trainspotting (1996))                  5.0
(rating, Chasing Amy (1997))                    4.5
(rating, Star Wars (1977))                      4.0
(rating, Titanic (1997))                        4.0
(rating, Mission: Impossible (1996))            4.0
(rating, Twelve Monkeys (1995))                 4.0
(rating, Men in Black (1997))                   4.0
(rating, People vs. Larry Flynt, The (1996))    4.0
(rating, Return of the Jedi (1983))             4.0
dtype: float64

Lastly, filter the movies that have already been rated, because recommending a movie that was already seen isn't helpful

In [16]:
filteredSims = simCandidates.drop(myRatings.index)
filteredSims.head(10)

(rating, Empire Strikes Back, The (1980))    2.991926
(rating, Empire Strikes Back, The (1980))    2.884917
(rating, Pulp Fiction (1994))                2.259531
(rating, Raiders of the Lost Ark (1981))     2.144468
(rating, GoldenEye (1995))                   2.033738
(rating, Raising Arizona (1987))             1.955990
(rating, Pretty Woman (1990))                1.918852
(rating, Nutty Professor, The (1996))        1.895031
(rating, Raiders of the Lost Ark (1981))     1.869562
(rating, Back to the Future (1985))          1.813533
dtype: float64