# Movie Recommender System 

Here I use data from MovieLens, a site dedicated to collecting movie ratings from its users. I first download the data from its website:
============================
MovieLens 100K Dataset

Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.

README.txt
ml-100k.zip (size: 5 MB, checksum)
Index of unzipped files
Permalink: http://grouplens.org/datasets/movielens/100k/
=============================

In [80]:
import pandas as pd
import numpy as np

col_name_1 = ['user_id', 'movie_id', 'rating']
col_name_2 = ['movie_id', 'title']

ratings = pd.read_csv('/Users/baileyhsu/Desktop/ml-100k/u.data', sep='\t', names=col_name_1, usecols=range(3), encoding="ISO-8859-1")
movieid = pd.read_csv('/Users/baileyhsu/Desktop/ml-100k/u.item', sep='|', names=col_name_2, usecols=range(2), encoding="ISO-8859-1")

# join titles to ratings on the shared movie_id column
temp = pd.merge(movieid, ratings, on='movie_id')


Next, we create a pivot table with one row per user_id, one column per movie title, and the rating as the value of each cell. Movies that a user has not rated show up as NaN.

In [91]:
Ratings = temp.pivot_table(index=['user_id'],columns=['title'],values='rating')
Ratings.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
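
To make the shape of this table concrete, here is a minimal sketch on a tiny made-up ratings table. The names `toy` and `toy_pivot` and the values are assumptions for illustration only and are not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical toy data: three users rating up to three movies.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per title; user/movie pairs
# without a rating are filled with NaN.
toy_pivot = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_pivot)
```

User 2 never rated movie B, so that cell is NaN, just like most cells in the real table above.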


For example, below we compute the correlation between the ratings of every movie and the ratings of Indiana Jones and the Last Crusade:

In [93]:

indianaRating=Ratings['Indiana Jones and the Last Crusade (1989)']
corrIndiana = Ratings.corrwith(indianaRating)

# drop movies for which no correlation could be computed (NaN)
corrIndiana = corrIndiana.dropna()

# convert it to a pandas DataFrame
df = pd.DataFrame(corrIndiana)

# sort from most to least correlated
corrIndiana.sort_values(ascending=False)


title
'Til There Was You (1997)                             1.0
Crossfire (1947)                                      1.0
My Favorite Season (1993)                             1.0
Wild Reeds (1994)                                     1.0
Metisse (Café au Lait) (1993)                         1.0
Maya Lin: A Strong Clear Vision (1994)                1.0
Second Jungle Book: Mowgli & Baloo, The (1997)        1.0
Colonel Chabert, Le (1994)                            1.0
Infinity (1996)                                       1.0
Indiana Jones and the Last Crusade (1989)             1.0
Cement Garden, The (1993)                             1.0
Dangerous Ground (1997)                               1.0
Stripes (1981)                                        1.0
Faces (1968)                                          1.0
Vermin (1998)                                         1.0
Forbidden Christ, The (Cristo proibito, Il) (1950)    1.0
Bad Moon (1996)                                       1.0
Wedding 

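However, the result above looks strange: many movies have a correlation of exactly 1.0. This is probably because some movies were rated by only a few users, so their correlation rests on a handful of shared ratings. We therefore go back to the original data and collect the number of ratings and the mean rating for each movie.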

In [101]:
# group by movie title and compute the number of ratings (size) and the mean rating
movies = temp.groupby('title').agg({'rating': ['size', 'mean']})

# here we define the hot movies as those with at least 100 ratings
hotmovies = movies['rating']['size'] >= 100

# plugging this back into the movies data set and sorting by mean rating
# (ascending), we get the 30 lowest-rated popular movies
movies[hotmovies].sort_values([('rating', 'mean')])[:30]

title,"(rating, size)","(rating, mean)"
"Cable Guy, The (1996)",106,2.339623
Jungle2Jungle (1997),132,2.439394
Crash (1996),128,2.546875
Event Horizon (1997),127,2.574803
Spawn (1997),143,2.615385
Batman Forever (1995),114,2.666667
Batman Returns (1992),142,2.683099
George of the Jungle (1997),162,2.685185
Down Periscope (1996),101,2.70297
Mimic (1997),101,2.742574
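
To see why movies with few ratings can reach a correlation of exactly 1.0, here is a small sketch on made-up data. The frame `toy` and its column names are assumptions for illustration. The point is that any two distinct rating pairs lie on a straight line, so a movie sharing only two raters with the target always gets a Pearson correlation of +1 or -1.

```python
import pandas as pd
import numpy as np

# Hypothetical user-by-movie matrix; NaN means "not rated".
toy = pd.DataFrame({
    'target':  [5.0, 3.0, 4.0, 2.0, 1.0],
    'popular': [4.0, 3.0, 5.0, 1.0, 2.0],           # rated by all five users
    'obscure': [np.nan, 2.0, 5.0, np.nan, np.nan],  # rated by only two users
})

# Same call as in the notebook: correlate every column with one movie.
corr = toy.corrwith(toy['target'])

# Number of users who rated both the target and each other movie.
overlap = toy[toy['target'].notna()].count()

print(pd.DataFrame({'corr': corr, 'overlap': overlap}))
```

Here `obscure` ends up with a perfect correlation even though it shares only two ratings with `target`, while `popular` gets a lower but far better supported value.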


In [102]:
df = movies[hotmovies].join(pd.DataFrame(corrIndiana, columns=['Similarity']))
df.sort_values(['Similarity'], ascending=False)[:15]

title,"(rating, size)","(rating, mean)",Similarity
Indiana Jones and the Last Crusade (1989),331,3.930514,1.0
"Ghost and the Darkness, The (1996)",128,3.203125,0.553154
Raiders of the Lost Ark (1981),420,4.252381,0.539606
Young Guns (1988),101,3.207921,0.49267
Back to the Future (1985),350,3.834286,0.479793
"Firm, The (1993)",151,3.278146,0.477194
Donnie Brasco (1997),147,3.802721,0.469901
Waterworld (1995),102,2.803922,0.46523
Independence Day (ID4) (1996),429,3.438228,0.461999
"Lost World: Jurassic Park, The (1997)",158,2.943038,0.452847


Now the top-ranked movies look much more like what people who watch Indiana Jones would also watch!


What else can we improve? Maybe increase the threshold on the number of ratings?

In [104]:
# here we define the hot movies as those with at least 200 ratings
hotmovies = movies['rating']['size'] >= 200

In [105]:
df = movies[hotmovies].join(pd.DataFrame(corrIndiana, columns=['Similarity']))
df.sort_values(['Similarity'], ascending=False)[:15]

title,"(rating, size)","(rating, mean)",Similarity
Indiana Jones and the Last Crusade (1989),331,3.930514,1.0
Raiders of the Lost Ark (1981),420,4.252381,0.539606
Back to the Future (1985),350,3.834286,0.479793
Independence Day (ID4) (1996),429,3.438228,0.461999
Return of the Jedi (1983),507,4.00789,0.422294
Liar Liar (1997),485,3.156701,0.414427
Jurassic Park (1993),261,3.720307,0.399337
"Fugitive, The (1993)",336,4.044643,0.395868
Speed (1994),230,3.647826,0.381742
Terminator 2: Judgment Day (1991),295,4.00678,0.367973
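
Thresholding on the total number of ratings is one way to filter out unreliable correlations. A possible alternative, sketched here only as an assumption and not something done in this notebook, is to require a minimum number of users who rated both movies. `DataFrame.corr` accepts a `min_periods` argument for this. The frame `toy` and the threshold of 3 are made up for illustration; on the real data a much larger value would be needed.

```python
import pandas as pd
import numpy as np

# Hypothetical user-by-movie matrix standing in for `Ratings`.
toy = pd.DataFrame({
    'target':  [5.0, 3.0, 4.0, 2.0, 1.0],
    'popular': [4.0, 3.0, 5.0, 1.0, 2.0],
    'obscure': [np.nan, 2.0, 5.0, np.nan, np.nan],
})

# Pairwise Pearson correlation between all columns; pairs with fewer
# than `min_periods` users in common are returned as NaN.
corr_matrix = toy.corr(method='pearson', min_periods=3)

# Movies most similar to the target, with unsupported pairs dropped.
similar = corr_matrix['target'].dropna().sort_values(ascending=False)
print(similar)
```

With this filter `obscure` drops out of the ranking because it shares only two raters with `target`. Note that computing the full correlation matrix is more expensive than `corrwith` against a single movie.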
