### How To Build A Movie Recommender System With Item-Based Collaborative Filtering

To start building our movie recommendation system, we'll first load the "MovieLens 100K" data set into pandas DataFrames: the ratings from "u.data" and the movie titles from "u.item", merged on "movie_id".

In [25]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)

ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
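
When "pd.merge" is called without an "on" argument, it joins on the columns the two frames have in common, which here is "movie_id". A minimal sketch with made-up rows (the names "toy_movies" and "toy_ratings" and their contents are hypothetical, not MovieLens data) shows the effect:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up rows)
toy_movies = pd.DataFrame({'movie_id': [1, 2],
                           'title': ['Movie A', 'Movie B']})
toy_ratings = pd.DataFrame({'user_id': [10, 11, 10],
                            'movie_id': [1, 1, 2],
                            'rating': [4, 5, 3]})

# With no `on` argument, merge joins on the shared column, movie_id
toy_merged = pd.merge(toy_movies, toy_ratings)
print(toy_merged)
```

Each rating row picks up the title of its movie, so the merged frame has one row per rating.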


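Next we'll build a table with one row per user and one column per movie title. Each cell holds that user's rating for that movie, or NaN if the user never rated it.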

In [26]:
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
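
To make the shape of this table concrete, here is a small sketch of the same "pivot_table" call on three made-up ratings (the "toy" frame and its values are hypothetical). Note that "pivot_table" averages by default, so a user with duplicate ratings for one title would get their mean.

```python
import pandas as pd
import numpy as np

# Hypothetical long-format ratings: one row per (user, movie) rating
toy = pd.DataFrame({'user_id': [1, 1, 2],
                    'title': ['Movie A', 'Movie B', 'Movie A'],
                    'rating': [5, 3, 4]})

# One row per user, one column per title; unrated movies become NaN
toy_table = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_table)
```

User 2 never rated "Movie B", so that cell comes out as NaN, just like the mostly empty rows in the real table.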


Now we'll build a matrix of the Pearson correlation between every pair of movies, computed from the users who rated both movies in the pair. Because we want to throw out pairs rated by only a small number of users, we'll set "min_periods" to 100: any pair of movies with fewer than 100 users in common is left as NaN.

In [27]:
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
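
The effect of "min_periods" is easier to see on a tiny example. The sketch below uses a made-up ratings table (the "toy" frame is hypothetical) in which "Movie C" has only two ratings, and compares a loose threshold with a stricter one:

```python
import pandas as pd
import numpy as np

# Hypothetical user-by-movie ratings; Movie C was rated by only two users
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, 1.0, 2.0],
    'Movie B': [5.0, 5.0, 1.0, 1.0],
    'Movie C': [np.nan, np.nan, 2.0, 4.0],
})

# Require at least 2 co-rating users: the A/C pair is kept
loose = toy.corr(method='pearson', min_periods=2)

# Require at least 3 co-rating users: the A/C pair becomes NaN
strict = toy.corr(method='pearson', min_periods=3)

print(loose)
print(strict)
```

With the stricter threshold even the diagonal entry for "Movie C" is NaN, because the movie itself has fewer than three ratings. The same thing happens in the real matrix above, where only movies with at least 100 ratings show 1.0 on the diagonal.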


Now let's run "User 0" (who prefers sci-fi, apparently) through our recommender system to see if the recommended movies make sense given this user's preferences.

We'll start by establishing a variable for this user's ratings, using "dropna()" to leave only movies for which this user left a rating.

In [28]:
User0Ratings = userRatings.loc[0].dropna()
User0Ratings

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

Now let's build our list of recommended movies!

In [29]:
simCandidates = []
for title, rating in User0Ratings.items():
    print("Adding sims for " + title + "...")
    # Retrieve similar movies to this one that User 0 rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well User 0 rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simCandidates.append(sims)

# Combine the per-movie scores into one Series
# (Series.append was removed in pandas 2.0, so we use pd.concat)
simCandidates = pd.concat(simCandidates)
    
# Group by Movie Title
simCandidates = simCandidates.groupby(simCandidates.index).sum()

# Sort by "recommendation score", from highest to lowest
simCandidates.sort_values(inplace = True, ascending = False)

# Eliminate movies already rated
filteredSims = simCandidates.drop(User0Ratings.index)

# Display "Top 10 Recommended Movies"!
filteredSims.head(10)

Adding sims for Empire Strikes Back, The (1980)...
Adding sims for Gone with the Wind (1939)...
Adding sims for Star Wars (1977)...


Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64

Pretty good!
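
To see where these scores come from, it helps to work the arithmetic on a tiny case: each candidate's score is the sum, over the movies the user rated, of (correlation with that movie) × (the user's rating). The sketch below uses an invented 4×4 correlation matrix and two invented ratings (all names and numbers are hypothetical, chosen only to make the sums easy to check by hand):

```python
import pandas as pd

titles = ['Liked', 'Disliked', 'Cand X', 'Cand Y']

# Hypothetical movie-to-movie correlations, invented for illustration
toy_corr = pd.DataFrame(
    [[1.0, 0.1, 0.8, 0.2],
     [0.1, 1.0, 0.3, 0.9],
     [0.8, 0.3, 1.0, 0.4],
     [0.2, 0.9, 0.4, 1.0]],
    index=titles, columns=titles)

# Hypothetical user: 5 stars for one movie, 1 star for another
toy_ratings = pd.Series({'Liked': 5.0, 'Disliked': 1.0})

# Scale each rated movie's correlations by the user's rating
parts = [toy_corr[title].dropna() * rating
         for title, rating in toy_ratings.items()]

# Sum the contributions per title, drop rated movies, sort high to low
toy_scores = pd.concat(parts).groupby(level=0).sum()
toy_scores = toy_scores.drop(toy_ratings.index).sort_values(ascending=False)
print(toy_scores)
```

Here "Cand X" scores 0.8 × 5 + 0.3 × 1 = 4.3 and "Cand Y" scores 0.2 × 5 + 0.9 × 1 = 1.9. Notice that under this scheme a 1-star rating still adds a small positive amount to the movies correlated with it rather than penalizing them.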

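*** Note: Increasing "min_periods" in our correlation matrix from 100 users to 150 keeps only correlations backed by more users. In the updated results, "Cinderella (1950)" and "Dumbo (1941)" drop out of the top 10 and "Star Trek: The Wrath of Khan (1982)" moves in, which is a better fit for this user's sci-fi taste.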

See below for the updated results!

In [32]:
# Increase the minimum number of "user ratings pairs" from 100 to 150
corrMatrix = userRatings.corr(method='pearson', min_periods=150)

User0Ratings = userRatings.loc[0].dropna()

simCandidates = []
for title, rating in User0Ratings.items():
    print("Adding sims for " + title + "...")
    # Retrieve similar movies to this one that User 0 rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well User 0 rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simCandidates.append(sims)

# Combine the per-movie scores into one Series
# (Series.append was removed in pandas 2.0, so we use pd.concat)
simCandidates = pd.concat(simCandidates)
    
# Group by Movie Title
simCandidates = simCandidates.groupby(simCandidates.index).sum()

# Sort by "recommendation score", from highest to lowest
simCandidates.sort_values(inplace = True, ascending = False)

# Eliminate movies already rated
filteredSims = simCandidates.drop(User0Ratings.index)

# Display "Top 10 Recommended Movies"!
filteredSims.head(10)

Adding sims for Empire Strikes Back, The (1980)...
Adding sims for Gone with the Wind (1939)...
Adding sims for Star Wars (1977)...


Return of the Jedi (1983)                    6.968925
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.316717
Sting, The (1973)                            3.209627
Back to the Future (1985)                    3.100622
Field of Dreams (1989)                       3.068508
Star Trek: The Wrath of Khan (1982)          2.968080
Batman (1989)                                2.947566
Jaws (1975)                                  2.802935
Wizard of Oz, The (1939)                     2.770049
dtype: float64
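
A quick way to compare the two runs is to diff the top-10 lists. The sketch below simply copies the titles from the two outputs above into plain Python lists (the variable names are my own) and reports which titles left and which entered:

```python
# Top-10 titles copied from the min_periods=100 output above
top10_min100 = [
    'Return of the Jedi (1983)', 'Raiders of the Lost Ark (1981)',
    'Indiana Jones and the Last Crusade (1989)',
    'Bridge on the River Kwai, The (1957)', 'Back to the Future (1985)',
    'Sting, The (1973)', 'Cinderella (1950)', 'Field of Dreams (1989)',
    'Wizard of Oz, The (1939)', 'Dumbo (1941)',
]

# Top-10 titles copied from the min_periods=150 output above
top10_min150 = [
    'Return of the Jedi (1983)', 'Raiders of the Lost Ark (1981)',
    'Indiana Jones and the Last Crusade (1989)', 'Sting, The (1973)',
    'Back to the Future (1985)', 'Field of Dreams (1989)',
    'Star Trek: The Wrath of Khan (1982)', 'Batman (1989)',
    'Jaws (1975)', 'Wizard of Oz, The (1939)',
]

# Titles that fell out of, or newly entered, the top 10
dropped = [t for t in top10_min100 if t not in top10_min150]
added = [t for t in top10_min150 if t not in top10_min100]

print("Dropped:", dropped)
print("Added:", added)
```

Three titles change places: "Bridge on the River Kwai, The (1957)", "Cinderella (1950)", and "Dumbo (1941)" drop out, while "Star Trek: The Wrath of Khan (1982)", "Batman (1989)", and "Jaws (1975)" come in.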