# Movies

This notebook was originally authored by Abhijit Dasgupta and adapted from [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) by Wes McKinney.

## Objectives

* What are the highest rated movies?
* What is the best movie for date night?
* Which movies do men and women disagree on the most?

In [1]:
import pandas as pd
import os
# Multi-character separators such as '::' need pandas' Python parser engine
engine = 'python'

### Reading in the data and combining it with `merge`

In [2]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(os.path.join('data', 'movies', 'users.dat'),
                      sep='::', header=None, names=unames, engine=engine)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join('data', 'movies', 'ratings.dat'),
                        sep='::', header=None, names=rnames, engine=engine)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(os.path.join('data', 'movies', 'movies.dat'),
                       sep='::', header=None, names=mnames, engine=engine)

# Join ratings to users on user_id, then to movies on movie_id
data = pd.merge(pd.merge(ratings, users), movies)




In [3]:
data.head()

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
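The nested `pd.merge` calls above pass no `on=` argument, so pandas joins on whichever column names the two frames share. The sketch below illustrates this on three small toy frames (the values and the `*_toy` names are made up for illustration, not taken from the dataset) and shows that naming the keys explicitly gives the same result.

```python
import pandas as pd

# Toy frames (hypothetical values) that mimic the three tables above.
ratings_toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [5, 3, 4],
})
users_toy = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_toy = pd.DataFrame({'movie_id': [10, 20],
                           'title': ['A (1990)', 'B (1995)']})

# With no `on=` argument, merge joins on the columns both frames share:
# 'user_id' for the inner merge, 'movie_id' for the outer one.
implicit = pd.merge(pd.merge(ratings_toy, users_toy), movies_toy)

# Naming the keys makes the intent explicit and guards against an
# accidental join on some other column that happens to share a name.
explicit = (ratings_toy
            .merge(users_toy, on='user_id')
            .merge(movies_toy, on='movie_id'))

print(explicit)
```

Explicit keys are a defensive choice: if two tables ever gained another column with the same name, the implicit version would silently join on it as well.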


## What is the highest rated movie?

In [30]:
tmp = data[['title','rating']]
tmp.head()

,title,rating
0,One Flew Over the Cuckoo's Nest (1975),5
1,One Flew Over the Cuckoo's Nest (1975),5
2,One Flew Over the Cuckoo's Nest (1975),4
3,One Flew Over the Cuckoo's Nest (1975),4
4,One Flew Over the Cuckoo's Nest (1975),5


In [31]:
mean_rating = tmp.groupby('title').mean()
print(mean_rating.describe())
mean_rating.head(10)

            rating
count  3706.000000
mean      3.238892
std       0.672925
min       1.000000
25%       2.822705
50%       3.331546
75%       3.740741
max       5.000000


title,rating
"$1,000,000 Duck (1971)",3.027027
'Night Mother (1986),3.371429
'Til There Was You (1997),2.692308
"'burbs, The (1989)",2.910891
...And Justice for All (1979),3.713568
1-900 (1994),2.5
10 Things I Hate About You (1999),3.422857
101 Dalmatians (1961),3.59646
101 Dalmatians (1996),3.046703
12 Angry Men (1957),4.295455


In [32]:
# view the top ten sorted by rating
mean_rating.sort_values(by='rating', ascending=False).head(10)

title,rating
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


Seems a bit odd?  What's wrong with this picture?

In [33]:
# view review counts
mean_rating = tmp.groupby('title')['rating'].agg(['mean','count']) 
mean_rating.sort_values(by='mean', ascending=False).head(10)

title,mean,count
Ulysses (Ulisse) (1954),5.0,1
Lured (1947),5.0,1
Follow the Bitch (1998),5.0,1
Bittersweet Motel (2000),5.0,1
Song of Freedom (1936),5.0,1
One Little Indian (1973),5.0,1
Smashing Time (1967),5.0,2
Schlafes Bruder (Brother of Sleep) (1995),5.0,1
"Gate of Heavenly Peace, The (1995)",5.0,3
"Baby, The (1973)",5.0,1
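The counts explain the odd ranking: a mean based on one or two ratings can easily sit at 5.0. A minimal sketch of the same effect on made-up data (the titles and ratings below are hypothetical):

```python
import pandas as pd

# Hypothetical ratings: one obscure title rated once, one popular title
# rated three times.
toy = pd.DataFrame({
    'title': ['Obscure (1950)', 'Popular (1999)',
              'Popular (1999)', 'Popular (1999)'],
    'rating': [5, 5, 4, 3],
})

# Compute the mean and the number of ratings side by side.
summary = toy.groupby('title')['rating'].agg(['mean', 'count'])

# Sorting by mean alone puts the single-rating title on top.
ranked = summary.sort_values(by='mean', ascending=False)
print(ranked)
```

Keeping `count` next to `mean` makes it easy to see how much evidence sits behind each average.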


### Filter the Movies

Only look at movies that have received more than 1000 ratings.

In [14]:
mask = mean_rating['count'] > 1000
mean_rating[mask].head()

title,mean,count
2001: A Space Odyssey (1968),4.068765,1716
"Abyss, The (1989)",3.683965,1715
"African Queen, The (1951)",4.251656,1057
Air Force One (1997),3.58829,1076
Airplane! (1980),3.971115,1731
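The comparison used to build the mask decides whether a title with exactly 1000 ratings is kept. A small sketch on hypothetical values (the frame `summary_toy` is made up for illustration) shows the difference between `>` and `>=`:

```python
import pandas as pd

# Hypothetical per-title summary with the same columns as mean_rating.
summary_toy = pd.DataFrame(
    {'mean': [4.5, 3.9, 5.0], 'count': [1200, 1000, 2]},
    index=pd.Index(['A (1990)', 'B (1995)', 'C (2000)'], name='title'),
)

# A boolean Series: True where the condition holds for that title.
strict = summary_toy['count'] > 1000      # drops a title with exactly 1000
inclusive = summary_toy['count'] >= 1000  # keeps it

# Indexing with the boolean Series keeps only the True rows.
print(summary_toy[strict])
print(summary_toy[inclusive])
```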


In [17]:
mean_rating[mask].sort_values(by='mean', ascending=False).head(10)

title,mean,count
"Shawshank Redemption, The (1994)",4.554558,2227
"Godfather, The (1972)",4.524966,2223
"Usual Suspects, The (1995)",4.517106,1783
Schindler's List (1993),4.510417,2304
Raiders of the Lost Ark (1981),4.477725,2514
Rear Window (1954),4.47619,1050
Star Wars: Episode IV - A New Hope (1977),4.453694,2991
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),4.44989,1367
Casablanca (1942),4.412822,1669
"Sixth Sense, The (1999)",4.406263,2459


## What is the best movie for both men and women?

We would like to create another data frame from our data that contains mean ratings, with movie titles as row labels and gender as column labels.

In [18]:
mean_ratings = pd.pivot_table(data, 'rating', index='title', columns ='gender', aggfunc='mean')
mean_ratings.head(10)

title,F,M
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
1-900 (1994),2.0,3.0
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.5
101 Dalmatians (1996),3.24,2.911215
12 Angry Men (1957),4.184397,4.328421
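To see what `pivot_table` does here, the sketch below applies the same call to a tiny made-up frame and compares it with an equivalent `groupby` followed by `unstack`. The data and the names `toy`, `wide`, and `same` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (title, gender, rating).
toy = pd.DataFrame({
    'title': ['A (1990)', 'A (1990)', 'A (1990)', 'B (1995)'],
    'gender': ['F', 'M', 'M', 'M'],
    'rating': [4, 2, 4, 5],
})

# One row per title, one column per gender, each cell the mean rating.
wide = pd.pivot_table(toy, values='rating', index='title',
                      columns='gender', aggfunc='mean')

# The same reshaping spelled out: group, average, then move gender
# from the row index into the columns.
same = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(wide)
```

A title that one gender never rated gets `NaN` in that column, which is worth remembering before sorting or subtracting the two columns.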


In [19]:
mask.head(10)

title
$1,000,000 Duck (1971)               False
'Night Mother (1986)                 False
'Til There Was You (1997)            False
'burbs, The (1989)                   False
...And Justice for All (1979)        False
1-900 (1994)                         False
10 Things I Hate About You (1999)    False
101 Dalmatians (1961)                False
101 Dalmatians (1996)                False
12 Angry Men (1957)                  False
Name: count, dtype: bool

But `mean_ratings` has **all** the movies, not just the ones with the largest **count**, so we reuse the `mask` computed earlier to filter it.

Notice:

- The DataFrame `mean_ratings` has the `title` as the index.
- The `mask` also has `title` as the index.
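Because both objects are indexed by `title`, the boolean Series can select rows of the other frame directly. A minimal sketch with hypothetical values, using `.loc`, the label-based indexer in current pandas:

```python
import pandas as pd

titles = pd.Index(['A (1990)', 'B (1995)', 'C (2000)'], name='title')

# Hypothetical per-gender means and rating counts, both indexed by title.
by_gender = pd.DataFrame({'F': [4.0, 3.0, 5.0], 'M': [3.5, 3.2, 4.0]},
                         index=titles)
counts = pd.Series([1500, 12, 2000], index=titles, name='count')

# Boolean Series with the same index as by_gender.
mask_toy = counts > 1000

# .loc keeps the rows whose label is marked True in the mask.
popular = by_gender.loc[mask_toy]
print(popular)
```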

In [20]:
# .ix has been removed from pandas; .loc accepts the boolean mask directly
top_mean_ratings = mean_ratings.loc[mask]
top_mean_ratings.head()

title,F,M
2001: A Space Odyssey (1968),3.825581,4.129738
"Abyss, The (1989)",3.659236,3.689507
"African Queen, The (1951)",4.324232,4.223822
Air Force One (1997),3.699588,3.555822
Airplane! (1980),3.656566,4.064419


Which movies do women rate highest?

In [22]:
top_female = top_mean_ratings.sort_values('F', ascending=False)
top_female.head()

title,F,M
Schindler's List (1993),4.562602,4.491415
"Shawshank Redemption, The (1994)",4.539075,4.560625
"Usual Suspects, The (1995)",4.513317,4.518248
Rear Window (1954),4.484536,4.472991
"Sixth Sense, The (1999)",4.47741,4.379944


Which movies do men rate highest?

In [23]:
top_male = top_mean_ratings.sort_values('M', ascending=False)
top_male.head(5)

title,F,M
"Godfather, The (1972)",4.3147,4.583333
"Shawshank Redemption, The (1994)",4.539075,4.560625
Raiders of the Lost Ark (1981),4.332168,4.520597
"Usual Suspects, The (1995)",4.513317,4.518248
Star Wars: Episode IV - A New Hope (1977),4.302937,4.495307


### Which ones do men and women differ on the least, i.e., date night?

In [26]:
# assign returns a new DataFrame, so no SettingWithCopyWarning is raised
top_mean_ratings = top_mean_ratings.assign(
    diff=(top_mean_ratings['F'] - top_mean_ratings['M']).abs())
top_mean_ratings.sort_values(by='diff', ascending=True).head(10)


title,F,M,diff
Jerry Maguire (1996),3.758315,3.759424,0.001109
Indiana Jones and the Temple of Doom (1984),3.674312,3.676568,0.002256
Good Will Hunting (1997),4.174672,4.177064,0.002392
"Fugitive, The (1993)",4.100457,4.104046,0.00359
Batman Returns (1992),2.9801,2.975904,0.004196
"Usual Suspects, The (1995)",4.513317,4.518248,0.004931
"Green Mile, The (1999)",4.159722,4.153105,0.006617
Boogie Nights (1997),3.763838,3.771295,0.007458
Chicken Run (2000),3.885559,3.877339,0.00822
"Blair Witch Project, The (1999)",3.038732,3.029381,0.009351
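The ranking rests on the absolute gap between the two gender means. The sketch below reproduces that step on a small made-up frame, using `DataFrame.assign` so the column is added to a new frame instead of being written into a filtered one in place (the situation pandas flags with `SettingWithCopyWarning`). All values and names here are hypothetical.

```python
import pandas as pd

# Hypothetical per-gender mean ratings, indexed by title.
toy = pd.DataFrame(
    {'F': [4.0, 3.0, 4.5], 'M': [3.9, 3.8, 4.5]},
    index=pd.Index(['A (1990)', 'B (1995)', 'C (2000)'], name='title'),
)

# assign returns a new DataFrame with the extra column; toy is unchanged.
with_gap = toy.assign(diff=(toy['F'] - toy['M']).abs())

# Smallest gap first: the title both groups agree on most.
closest = with_gap.sort_values(by='diff').index[0]

# Largest gap first: the title they disagree on most.
furthest = with_gap.sort_values(by='diff', ascending=False).index[0]

print(with_gap.sort_values(by='diff'))
```

Taking the absolute value treats "women like it more" and "men like it more" the same; keeping the signed difference would be the choice if the direction of the disagreement mattered.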


### What's the worst movie for date night?

In [28]:
top_mean_ratings.sort_values('diff', ascending=False).head(10)

title,F,M,diff
Animal House (1978),3.628906,4.167192,0.538286
"Rocky Horror Picture Show, The (1975)",3.673016,3.160131,0.512885
Mary Poppins (1964),4.19774,3.730594,0.467147
Reservoir Dogs (1992),3.769231,4.213873,0.444642
Gone with the Wind (1939),4.269841,3.829371,0.440471
"South Park: Bigger, Longer and Uncut (1999)",3.422481,3.846686,0.424206
Airplane! (1980),3.656566,4.064419,0.407854
Predator (1987),3.299401,3.706195,0.406793
"Godfather: Part II, The (1974)",4.040936,4.437778,0.396842
"Clockwork Orange, A (1971)",3.757009,4.145813,0.388803
