# Movie Similarities Using Item-Based Collaborative Filtering

In [7]:
import pandas as pd
import numpy as np

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('/Users/czar.yobero/SparkScala/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))

m_cols = ['movie_id', 'title']
movies = pd.read_csv('/Users/czar.yobero/SparkScala/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2))

ratings = pd.merge(movies, ratings)
ratings.head()

Unnamed: 0,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Now we pivot the above data frame using the pivot_table method and construct a user-movie rating matrix. NaN indicates missing data (i.e. movies that specific users didn't rate).

In [8]:
movie_ratings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
movie_ratings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
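
To make the reshaping step concrete, here is a minimal sketch on a made-up data set. The `toy` frame, its titles, and its ratings are invented for illustration; only `pivot_table` itself comes from the cell above.

```python
import pandas as pd

# Hypothetical long-format ratings, shaped like `ratings` above (one row per rating).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per title; user/title pairs with no rating become NaN.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

Note that `pivot_table` aggregates duplicate (user, title) pairs with the mean by default, so the cell values come back as floats even when the raw ratings are integers.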


Let's extract the Star Wars column: a series of ratings from the nerds who rated it, with NaN for everyone else.

In [9]:
starwars_ratings = movie_ratings['Star Wars (1977)']
starwars_ratings.head()

user_id
1    5.0
2    5.0
3    NaN
4    5.0
5    4.0
Name: Star Wars (1977), dtype: float64

Now, we'll use the pandas corrwith method to compute the pairwise correlation of Star Wars' vector of user ratings with every other movie! For each movie, the correlation is computed only over the users who rated both it and Star Wars. After that, we'll drop NaNs and construct a new data frame of movies and their similarity scores (i.e. correlation coefficients) to Star Wars.

In [10]:
similar_movies = movie_ratings.corrwith(starwars_ratings)
similar_movies = similar_movies.dropna()
df = pd.DataFrame(similar_movies)
df.head(n=10)

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
'Til There Was You (1997),0.872872
1-900 (1994),-0.645497
101 Dalmatians (1996),0.211132
12 Angry Men (1957),0.184289
187 (1997),0.027398
2 Days in the Valley (1996),0.066654
"20,000 Leagues Under the Sea (1954)",0.289768
2001: A Space Odyssey (1968),0.230884
"39 Steps, The (1935)",0.106453
8 1/2 (1963),-0.142977
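
As a rough illustration of what `corrwith` does with missing ratings, the sketch below uses an invented four-user matrix. The column names and values are assumptions made up for this example, not MovieLens data.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; NaN means the user did not rate the movie.
toy_matrix = pd.DataFrame({
    'Target':   [5.0, 4.0, 1.0, 2.0],
    'Similar':  [4.0, 5.0, 2.0, 1.0],
    'Opposite': [1.0, 2.0, 5.0, np.nan],
    'Unrated':  [np.nan, np.nan, np.nan, 3.0],
}, index=pd.Index([1, 2, 3, 4], name='user_id'))

# Pearson correlation of every column with 'Target', using only the users
# who rated both movies in each pair.
toy_similar = toy_matrix.corrwith(toy_matrix['Target'])
print(toy_similar)

# Movies with too little overlap to define a correlation come back as NaN.
print(toy_similar.dropna())
```

Here 'Similar' comes out at 0.8 and 'Opposite' at -1, while 'Unrated' shares only one rater with 'Target', so its correlation is undefined and `dropna` removes it.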


Let's sort the results by similarity score and we should have the movies that are most similar to Star Wars!

In [11]:
similar_movies.sort_values(ascending=False)

title
No Escape (1994)                                                                     1.000000
Man of the Year (1995)                                                               1.000000
Hollow Reed (1996)                                                                   1.000000
Commandments (1997)                                                                  1.000000
Cosi (1996)                                                                          1.000000
Stripes (1981)                                                                       1.000000
Star Wars (1977)                                                                     1.000000
Golden Earrings (1947)                                                               1.000000
Mondo (1996)                                                                         1.000000
Line King: Al Hirschfeld, The (1996)                                                 1.000000
Outlaw, The (1943)                                    

Our results are clearly spurious: I doubt that 'Til There Was You is similar to Star Wars in any way, shape or form. The likely reason is that our data set contains movies that have only been rated by a handful of users who also happened to rate Star Wars, and a correlation computed from so few shared ratings can easily land near ±1 by chance. So, we need to get rid of movies that were only rated by a handful of people. We can do this by constructing a new data frame that counts how many ratings exist for each movie, as well as the average rating, which might come in handy later.


In [12]:
movie_stats = ratings.groupby('title').agg({'rating': ['size', 'mean']})
movie_stats.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
title,Unnamed: 1_level_2,Unnamed: 2_level_2
'Til There Was You (1997),9,2.333333
1-900 (1994),5,2.6
101 Dalmatians (1996),109,2.908257
12 Angry Men (1957),125,4.344
187 (1997),41,3.02439
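
Counts this small are exactly where perfect correlations come from. The sketch below uses invented numbers to show that two movies sharing only two raters will correlate at exactly ±1 whenever those ratings differ, regardless of how similar the movies really are. The `min_periods` guard at the end is an alternative safeguard built into `Series.corr`; it is not the filter this notebook uses, which drops movies by total rating count instead.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: six users rated 'Popular', only two of them rated 'Obscure'.
toy_matrix = pd.DataFrame({
    'Popular': [5.0, 4.0, 3.0, 4.0, 2.0, 4.0],
    'Obscure': [3.0, np.nan, np.nan, np.nan, 1.0, np.nan],
}, index=pd.Index(range(1, 7), name='user_id'))

# Number of users who rated both movies.
n_common = int((toy_matrix['Popular'].notna() & toy_matrix['Obscure'].notna()).sum())

# Two points always lie on a straight line, so the correlation is perfect.
r = toy_matrix['Popular'].corr(toy_matrix['Obscure'])

# Requiring at least 3 shared raters turns the meaningless score into NaN.
r_guarded = toy_matrix['Popular'].corr(toy_matrix['Obscure'], min_periods=3)

print(n_common, r, r_guarded)
```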


Let's get rid of any movies rated by fewer than 250 people and check the top-rated ones that are left.

In [18]:
popular_movies = movie_stats['rating']['size'] >= 250
movie_stats[popular_movies].sort_values([('rating', 'mean')], ascending=False)[:10]



Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
title,Unnamed: 1_level_2,Unnamed: 2_level_2
Schindler's List (1993),298,4.466443
"Shawshank Redemption, The (1994)",283,4.44523
"Usual Suspects, The (1995)",267,4.385768
Star Wars (1977),583,4.358491
One Flew Over the Cuckoo's Nest (1975),264,4.291667
"Silence of the Lambs, The (1991)",390,4.289744
"Godfather, The (1972)",413,4.283293
Raiders of the Lost Ark (1981),420,4.252381
Titanic (1997),350,4.245714
"Empire Strikes Back, The (1980)",367,4.20436

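The count-and-filter step can also be written with flat column names, which some readers may find easier to work with than the two-level `('rating', 'size')` labels. This is a sketch on invented data; the `toy` frame and the `min_ratings` threshold are hypothetical stand-ins for `ratings` and the cutoff used above.

```python
import pandas as pd

# Hypothetical long-format ratings.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 3, 2, 4, 5],
})

# Selecting the 'rating' column before aggregating yields plain 'size' and 'mean' columns.
toy_stats = toy.groupby('title')['rating'].agg(['size', 'mean'])

# Keep only movies with at least `min_ratings` ratings (assumed threshold for the toy data).
min_ratings = 2
toy_popular = toy_stats[toy_stats['size'] >= min_ratings]

# Highest average rating first.
top = toy_popular.sort_values('mean', ascending=False)
print(top)
```

Bracket access (`toy_stats['size']`) matters here, because `toy_stats.size` is the DataFrame attribute giving the total number of cells, not the column.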

Let's join this data set with our original set of Star Wars similarity scores.

In [21]:
df = movie_stats[popular_movies].join(pd.DataFrame(similar_movies, columns=['similarity']))
df.head(n=25)

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2001: A Space Odyssey (1968),259,3.969112,0.230884
Air Force One (1997),431,3.63109,0.113164
Alien (1979),291,4.034364,0.248991
Aliens (1986),284,3.947183,0.254444
Amadeus (1984),276,4.163043,0.19028
Apollo 13 (1995),276,3.931159,0.222006
Back to the Future (1985),350,3.834286,0.274839
"Birdcage, The (1996)",293,3.443686,0.060544
Blade Runner (1982),275,4.138182,0.196715
"Blues Brothers, The (1980)",251,3.836653,0.19256


Let's sort by similarity scores.

In [23]:
df.sort_values(['similarity'], ascending=False)[:15]

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Star Wars (1977),583,4.358491,1.0
"Empire Strikes Back, The (1980)",367,4.20436,0.747981
Return of the Jedi (1983),507,4.00789,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Indiana Jones and the Last Crusade (1989),331,3.930514,0.350107
L.A. Confidential (1997),297,4.161616,0.319065
E.T. the Extra-Terrestrial (1982),300,3.833333,0.303619
Back to the Future (1985),350,3.834286,0.274839
Jaws (1975),280,3.775,0.265459
"Terminator, The (1984)",301,3.933555,0.262255

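Putting the steps together, the whole pipeline might be wrapped up roughly as follows. The function name `most_similar` and its parameters are hypothetical, and the toy data at the bottom is invented; this is a sketch of the approach used in this notebook rather than a tuned recommender. It uses flat `size` and `mean` columns so that the join produces plain column names.

```python
import pandas as pd

def most_similar(ratings, title, min_ratings=250, top_n=10):
    """Rank movies by rating correlation with `title`.

    `ratings` is assumed to be a long-format frame with
    'user_id', 'title' and 'rating' columns.
    """
    # User-by-movie matrix; unrated cells are NaN.
    matrix = ratings.pivot_table(index='user_id', columns='title', values='rating')

    # Correlation of every movie with the target, over shared raters only.
    similarity = matrix.corrwith(matrix[title]).dropna().rename('similarity')

    # Rating count and average per movie, with flat column names.
    stats = ratings.groupby('title')['rating'].agg(['size', 'mean'])

    # Drop sparsely rated movies, attach similarity scores, and sort.
    popular = stats[stats['size'] >= min_ratings]
    ranked = popular.join(similarity).sort_values('similarity', ascending=False)
    return ranked.head(top_n)

# Hypothetical toy data: 'C' has only two ratings and gets filtered out.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'title':   ['A'] * 5 + ['B'] * 5 + ['C'] * 2,
    'rating':  [5, 4, 3, 2, 1, 5, 4, 3, 2, 2, 1, 5],
})
result = most_similar(toy, 'A', min_ratings=3)
print(result)
```

With the real data, something like `most_similar(ratings, 'Star Wars (1977)')` would be the analogous call, assuming `ratings` is the merged frame from the first cell.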