# MovieLens Dataset

The GroupLens MovieLens 1M dataset collects movie ratings gathered from MovieLens users in the late 1990s and early 2000s. As the name suggests, it contains over 1 million ratings from over 6,000 users on roughly 4,000 movies. The data includes the ratings themselves, user demographics, and metadata for each movie.

In [146]:
# Start with imports
import pandas as pd 

In [148]:
# Limit how many rows pandas prints for a DataFrame
pd.options.display.max_rows = 10

The dataset is spread across three tables: user information, ratings, and
movie information. After extracting the data from the ZIP archive, we load
each table into a pandas DataFrame object using the pandas.read_table function.

In [151]:
# Column names for the users table
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

In [153]:
# Load table into DataFrame
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')

In [155]:
# Column names for the ratings table
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

In [157]:
# Load table into DataFrame
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')

In [158]:
# Column names for the movies table
mnames = ['movie_id', 'title', 'genres']

In [159]:
# Load table into DataFrame
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')
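
The loading pattern above can be tried without the data files. The sketch below parses two rows, written in the same '::'-delimited layout as users.dat, from an in-memory string. The names sample and sample_users are illustrative and not part of the original analysis. It uses pd.read_csv with an explicit sep, which behaves like pd.read_table here; engine='python' is passed because a separator longer than one character is treated as a regular expression, which the default C engine does not support.

```python
import io

import pandas as pd

# Two rows in the same '::'-delimited layout as users.dat (illustrative sample).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is interpreted as a regular expression,
# so the Python parsing engine is requested explicitly.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python')

print(sample_users.shape)
print(sample_users.dtypes)
```

Note that the zip column is parsed as an integer in this sketch, so any leading zeros in a ZIP code would be dropped unless a string dtype is requested.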

In [160]:
# Preview the first 5 rows
users[:5]

,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [165]:
ratings[:5]

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [167]:
movies[:5]

,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [169]:
ratings

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


Merge the three tables into a single table with pandas' merge function.
First merge the ratings table with the users table, then merge the result
with the movies table. pandas infers which columns to use as merge keys
from the column names the tables have in common.

In [172]:
data = pd.merge(pd.merge(ratings, users), movies)
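
To make the key inference concrete, here is a minimal sketch on three tiny made-up tables (ratings_small, users_small, and movies_small are hypothetical stand-ins, not the real data). It compares the implicit merge used above with a version that names the join columns through the on parameter.

```python
import pandas as pd

# Tiny hypothetical tables that mimic the shape of the three MovieLens tables.
ratings_small = pd.DataFrame({'user_id': [1, 1, 2],
                              'movie_id': [10, 20, 10],
                              'rating': [5, 3, 4]})
users_small = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_small = pd.DataFrame({'movie_id': [10, 20],
                             'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the column names the two tables share.
implicit = pd.merge(pd.merge(ratings_small, users_small), movies_small)

# Explicit keys: the same joins, with the key columns spelled out.
explicit = (ratings_small
            .merge(users_small, on='user_id')
            .merge(movies_small, on='movie_id'))

print(explicit)
print(implicit.equals(explicit))
```

Spelling out the keys is a matter of taste here, but it can guard against an accidental join on a column that two tables happen to share by name.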

In [174]:
# Preview the data 
data

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1000205,6040,1094,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
1000206,6040,562,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1000207,6040,1096,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama


In [176]:
# Inspect the first row of the merged table
data.iloc[0]

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

Compute the mean rating for each film, grouped by gender, using the
pivot_table method.


In [179]:
mean_ratings = data.pivot_table('rating', index="title", columns='gender', aggfunc='mean')
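
As a rough illustration of what the pivot table computes, the sketch below builds a five-row toy table (the name toy and its values are made up) and derives the same means a second way, with groupby followed by unstack.

```python
import pandas as pd

# Made-up ratings: two movies, rated by female (F) and male (M) users.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, each cell a mean rating.
via_pivot = toy.pivot_table('rating', index='title', columns='gender',
                            aggfunc='mean')

# The same numbers: group by both keys, average, then move gender to columns.
via_groupby = (toy.groupby(['title', 'gender'])['rating']
               .mean()
               .unstack('gender'))

print(via_pivot)
```

For Movie A the F cell is the mean of 4 and 5, that is 4.5, and the M cell is 3.0.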

Display the first 5 rows (positions 0-4) of mean_ratings using Python's slicing
syntax. The movie titles are now the row labels (the index) and the genders are
the columns.

In [182]:
# Preview the data 
mean_ratings[:5]

gender,F,M
title,,
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024


Keep only the movies that have at least 250 ratings. To do this, group the data by title and use .size() to get a Series of group sizes for each title.

In [185]:
ratings_by_title = data.groupby('title').size()

In [187]:
# Preview the data 
ratings_by_title[:10]

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

In [189]:
# Titles with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]
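
The count-then-filter step can be sketched on a toy table as follows. The names toy, counts, and popular are illustrative, and the threshold of 2 stands in for the 250 used above.

```python
import pandas as pd

# Made-up data: Movie A has 3 ratings, Movie B has 1, Movie C has 2.
toy = pd.DataFrame({
    'title': ['Movie A'] * 3 + ['Movie B'] * 1 + ['Movie C'] * 2,
    'rating': [4, 5, 3, 2, 4, 5],
})

# Number of rows (ratings) per title.
counts = toy.groupby('title').size()

# counts >= 2 is a boolean Series; indexing the index with it keeps
# only the titles whose count passes the threshold.
popular = counts.index[counts >= 2]

print(list(popular))  # ['Movie A', 'Movie C']
```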

In [191]:
# Preview the data 
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

This index of titles can then be used to select rows from mean_ratings with .loc.

In [194]:
mean_ratings = mean_ratings.loc[active_titles]

In [196]:
# Preview the data 
mean_ratings

gender,F,M
title,,
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


To see the top-rated films among female viewers, sort by the F column in descending order.

In [199]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [201]:
# Preview the data 
top_female_ratings[:10]

gender,F,M
title,,
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
"Shawshank Redemption, The (1994)",4.539075,4.560625
"Grand Day Out, A (1992)",4.537879,4.293255
To Kill a Mockingbird (1962),4.536667,4.372611
Creature Comforts (1990),4.513889,4.272277
"Usual Suspects, The (1995)",4.513317,4.518248


Rating disagreement between genders can be measured by taking the difference between the mean ratings and sorting by it. Because diff is computed as M minus F, sorting in ascending order puts the movies that women rated furthest above men at the top.

In [204]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [206]:
sorted_by_diff = mean_ratings.sort_values(by='diff')
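
The sign convention is easy to mix up, so here is a small worked sketch with made-up means (toy_means and by_diff are hypothetical names). A negative diff means women rated the movie higher, a positive diff means men did.

```python
import pandas as pd

# Made-up mean ratings by gender for three movies.
toy_means = pd.DataFrame(
    {'F': [4.0, 3.0, 2.5], 'M': [3.0, 3.5, 2.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Same convention as above: male mean minus female mean.
toy_means['diff'] = toy_means['M'] - toy_means['F']

# Ascending sort: most negative diff (preferred by women) comes first.
by_diff = toy_means.sort_values(by='diff')

print(by_diff)
```

Movie A (diff of -1.0) sorts first, Movie C (0.0) sits in the middle, and Movie B (0.5) sorts last.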

In [208]:
# Preview the data 
sorted_by_diff[:10]

gender,F,M,diff
title,,,
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
Anastasia (1997),3.8,3.281609,-0.518391
"Rocky Horror Picture Show, The (1975)",3.673016,3.160131,-0.512885
"Color Purple, The (1985)",4.158192,3.659341,-0.498851
"Age of Innocence, The (1993)",3.827068,3.339506,-0.487561
Free Willy (1993),2.921348,2.438776,-0.482573


To get the movies preferred by men, reverse the order of the rows and slice off the top 10 rows.

In [211]:
sorted_by_diff[::-1][:10]

gender,F,M,diff
title,,,
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787
Evil Dead II (Dead By Dawn) (1987),3.297297,3.909283,0.611985
"Hidden, The (1987)",3.137931,3.745098,0.607167
Rocky III (1982),2.361702,2.943503,0.581801
Caddyshack (1980),3.396135,3.969737,0.573602
For a Few Dollars More (1965),3.409091,3.953795,0.544704
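
Reversing an ascending sort is one of several ways to get the largest values. The sketch below, on a made-up table (toy and the other names are illustrative), compares it with sorting in descending order and with DataFrame.nlargest. The three agree here because the toy values contain no ties; with ties, the order among equal values may differ between the approaches.

```python
import pandas as pd

# Made-up diff values for four movies.
toy = pd.DataFrame({'diff': [-1.0, 0.5, 0.0, 0.8]},
                   index=['Movie A', 'Movie B', 'Movie C', 'Movie D'])

sorted_toy = toy.sort_values(by='diff')

# Approach used above: reverse the ascending sort, then take the first rows.
reversed_top = sorted_toy[::-1][:2]

# Alternative 1: sort in descending order directly.
descending_top = toy.sort_values(by='diff', ascending=False)[:2]

# Alternative 2: ask for the n largest values of the column.
nlargest_top = toy.nlargest(2, 'diff')

print(reversed_top)
```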


To find the movies that drew the most disagreement among viewers regardless of gender, measure the spread of the ratings with the variance or the standard deviation. First compute the standard deviation of rating by title, then keep only the active titles, and finally sort the series by value in descending order.

In [214]:
rating_std_by_title = data.groupby('title')['rating'].std()

In [216]:
rating_std_by_title = rating_std_by_title.loc[active_titles]

In [218]:
rating_std_by_title.sort_values(ascending=False)[:10]

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
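
To see why the standard deviation captures disagreement, consider this small sketch with two made-up movies that share the same mean rating of 3 (the titles and the names toy and spread are hypothetical). pandas computes the sample standard deviation by default (ddof=1).

```python
import pandas as pd

# 'Polarizing' splits viewers between 1 and 5; 'Consensus' gets all 3s.
toy = pd.DataFrame({
    'title': ['Polarizing'] * 4 + ['Consensus'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 3, 3],
})

means = toy.groupby('title')['rating'].mean()
spread = toy.groupby('title')['rating'].std()

# Both movies have a mean of 3.0, but only one has any spread.
print(means)
print(spread.sort_values(ascending=False))
```

For the polarizing movie every rating is 2 away from the mean, so the sum of squared deviations is 16 and the sample standard deviation is the square root of 16/3, about 2.31. The consensus movie has a standard deviation of 0.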