# Date Night Movie

In this assignment we use pandas to answer one question: what is the best **date-night movie**?

The assignment exercises three pandas operations:
- Joining
- Groupby
- Sorting



In [2]:
import os
import pandas as pd

# Read in the movie data: `pd.read_table`

In [3]:
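def get_movie_data():
    """Load the users, ratings, and movies tables from the '::'-delimited .dat files."""
    data_dir = '../data'

    # A multi-character separator such as '::' needs the Python parsing engine,
    # so engine='python' is passed explicitly to avoid a ParserWarning.
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table(os.path.join(data_dir, 'users.dat'),
                          sep='::', header=None, names=unames,
                          encoding='latin1', engine='python')

    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table(os.path.join(data_dir, 'ratings.dat'),
                            sep='::', header=None, names=rnames,
                            encoding='latin1', engine='python')

    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table(os.path.join(data_dir, 'movies.dat'),
                           sep='::', header=None, names=mnames,
                           encoding='latin1', engine='python')

    return users, ratings, movies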

In [4]:
users, ratings, movies = get_movie_data()
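
The loading step can be tried without the data files. The following is a minimal sketch that parses a few `::`-delimited lines from memory. The first three rows match the `users` output in this notebook; the fourth assumes the original file stores that zip code as `02460`, which is an assumption made here to show how `dtype` can preserve leading zeros.

```python
import io
import pandas as pd

# In-memory stand-in for users.dat; the last row's leading zero is an assumption.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
    "4::M::45::7::02460\n"
)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' handles the two-character separator;
# dtype={'zip': str} keeps zip codes as text so leading zeros survive.
users_sample = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})
print(users_sample)
```

Reading `zip` as a string is a design choice: zip codes are labels rather than quantities, so nothing is lost by skipping the numeric conversion.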



In [5]:
users.head()

,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [76]:
users.shape

(6040, 5)

In [6]:
ratings.head()

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [77]:
ratings.shape

(1000209, 4)

In [7]:
movies.head()

,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [78]:
movies.shape

(3883, 5)

##### Clean up the `movies` table

- Extract the release `year` from the `title`
- Store the title without the year as `short_title`

In [8]:
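# Column 0 of the result is the title without the year; column 1 is the year in parentheses
tmp = movies.title.str.extract(r'(.*) \(([0-9]+)\)')
# apply runs once per column, so x[0] is the first row of each column (cut to 40 characters)
tmp.apply(lambda x: x[0][:40] if len(x) > 0 else None)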

0    Toy Story
1         1995
dtype: object

In [9]:
movies['year'] = tmp[1]
movies['short_title'] = tmp[0]

In [10]:
movies.head()

,movie_id,title,genres,year,short_title
0,1,Toy Story (1995),Animation|Children's|Comedy,1995,Toy Story
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995,Jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,Grumpier Old Men
3,4,Waiting to Exhale (1995),Comedy|Drama,1995,Waiting to Exhale
4,5,Father of the Bride Part II (1995),Comedy,1995,Father of the Bride Part II
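
As an alternative way to write the clean-up, named capture groups give the extracted columns readable names directly. This is a sketch on three titles from the table above; the column name `title_40` and the 40-character limit are illustrative choices, not part of the assignment.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Father of the Bride Part II (1995)",
])

# Named groups become column names in the DataFrame returned by str.extract.
parts = titles.str.extract(r'^(?P<short_title>.*) \((?P<year>[0-9]{4})\)$')

# The captured year is text; convert it so it can be compared and sorted numerically.
parts['year'] = parts['year'].astype(int)

# Optional truncation for display purposes (hypothetical column name).
parts['title_40'] = parts['short_title'].str[:40]
print(parts)
```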


# Join the tables with `pd.merge` (20 pts)

Merge `users` and `ratings` on their common column `user_id`.

In [11]:
users_ratings = pd.merge(users, ratings, on=['user_id'], how='inner')
users_ratings

,user_id,gender,age,occupation,zip,movie_id,rating,timestamp
0,1,F,1,10,48067,1193,5,978300760
1,1,F,1,10,48067,661,3,978302109
2,1,F,1,10,48067,914,3,978301968
3,1,F,1,10,48067,3408,4,978300275
4,1,F,1,10,48067,2355,5,978824291
...,...,...,...,...,...,...,...,...
1000204,6040,M,25,6,11106,1091,1,956716541
1000205,6040,M,25,6,11106,1094,5,956704887
1000206,6040,M,25,6,11106,562,5,956704746
1000207,6040,M,25,6,11106,1096,4,956715648


Merge `users_ratings` and `movies` on their common column `movie_id`.

In [49]:
users_ratings_movies = pd.merge(users_ratings, movies, on=['movie_id'], how='inner')
users_ratings_movies

,user_id,gender,age,occupation,zip,movie_id,rating,timestamp,title,genres,year,short_title
0,1,F,1,10,48067,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,1975,One Flew Over the Cuckoo's Nest
1,2,M,56,16,70072,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama,1975,One Flew Over the Cuckoo's Nest
2,12,M,25,12,32793,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama,1975,One Flew Over the Cuckoo's Nest
3,15,M,25,7,22903,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama,1975,One Flew Over the Cuckoo's Nest
4,17,M,50,1,95350,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama,1975,One Flew Over the Cuckoo's Nest
...,...,...,...,...,...,...,...,...,...,...,...,...
1000204,5949,M,18,17,47901,2198,5,958846401,Modulations (1998),Documentary,1998,Modulations
1000205,5675,M,35,14,30030,2703,3,976029116,Broken Vessels (1998),Drama,1998,Broken Vessels
1000206,5780,M,18,17,92886,2845,1,958153068,White Boys (1999),Drama,1999,White Boys
1000207,5851,F,18,20,55410,3607,5,957756608,One Little Indian (1973),Comedy|Drama|Western,1973,One Little Indian
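
The two joins can be illustrated on a toy example. The sketch below uses made-up ratings (the titles are borrowed from `movies.head()`, the rating values are not from the dataset) and starts from the ratings table instead of the users table, which gives the same rows. The `validate` argument is an optional safety check that the lookup side has unique keys.

```python
import pandas as pd

# Toy tables; the rating values are invented for illustration.
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [1, 2, 1],
                          'rating': [5, 3, 4]})
movies_s = pd.DataFrame({'movie_id': [1, 2, 3],
                         'title': ['Toy Story (1995)', 'Jumanji (1995)',
                                   'Grumpier Old Men (1995)']})

# Each rating matches exactly one user and one movie, so both joins are many-to-one.
# validate raises MergeError if the right-hand keys are not unique.
merged = (ratings_s
          .merge(users_s, on='user_id', how='inner', validate='many_to_one')
          .merge(movies_s, on='movie_id', how='inner', validate='many_to_one'))
print(merged)
```

Because the joins are inner joins, a movie without any rating (here `movie_id` 3) does not appear in the result.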


# What's the highest rated movie? (20 pts)

### Getting number of users and movies from the dataset.

In [50]:
user_ids = ratings.user_id.unique().tolist()
movie_ids = ratings.movie_id.unique().tolist()
print('Number of Users: {}'.format(len(user_ids)))
print('Number of Movies: {}'.format(len(movie_ids)))

Number of Users: 6040
Number of Movies: 3706
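
Only 3706 distinct movies appear in `ratings`, while `movies` has 3883 rows, so some movies have no ratings at all. The following sketch shows one way to count distinct ids and list unrated movies; the toy tables are made up for illustration.

```python
import pandas as pd

movies_s = pd.DataFrame({'movie_id': [1, 2, 3],
                         'title': ['Toy Story (1995)', 'Jumanji (1995)',
                                   'Grumpier Old Men (1995)']})
# Invented ratings: nobody has rated movie_id 3.
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [1, 2, 1],
                          'rating': [5, 3, 4]})

# nunique() counts distinct values without building an intermediate list.
n_users = ratings_s['user_id'].nunique()
n_rated_movies = ratings_s['movie_id'].nunique()

# Movies whose id never occurs in the ratings table.
unrated = movies_s[~movies_s['movie_id'].isin(ratings_s['movie_id'])]
print(n_users, n_rated_movies)
print(unrated)
```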


### Groupby movie_id

In [51]:
movie_id_group = users_ratings_movies.groupby(['movie_id','short_title'])['rating'].agg(['count','mean'])
movie_id_group 

movie_id,short_title,count,mean
1,Toy Story,2077,4.146846
2,Jumanji,701,3.201141
3,Grumpier Old Men,478,3.016736
4,Waiting to Exhale,170,2.729412
5,Father of the Bride Part II,296,3.006757
...,...,...,...
3948,Meet the Parents,862,3.635731
3949,Requiem for a Dream,304,4.115132
3950,Tigerland,54,3.666667
3951,Two Family House,40,3.900000
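
The same aggregation can also be written with named aggregation, which lets the output columns carry more descriptive names. This is a sketch on invented ratings; the names `n_ratings` and `mean_rating` are illustrative choices.

```python
import pandas as pd

# Invented ratings for two movies.
df = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2],
    'short_title': ['Toy Story'] * 3 + ['Jumanji'] * 2,
    'rating': [5, 4, 3, 2, 4],
})

# keyword=function pairs name the output columns.
stats = (df.groupby(['movie_id', 'short_title'])['rating']
           .agg(n_ratings='count', mean_rating='mean'))

# reset_index turns the two group keys back into ordinary columns.
flat = stats.reset_index()
print(flat)
```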


## Sorting 

#### Sorting by mean first, then by count

- Sort in descending order.
- The top of this ranking is a movie with a perfect mean of 5.0 that was rated by only three users, so the result is **not statistically meaningful.**

In [52]:
movie_id_group.sort_values(by=['mean','count'], ascending=False)

movie_id,short_title,count,mean
787,"Gate of Heavenly Peace, The",3,5.0
3233,Smashing Time,2,5.0
989,Schlafes Bruder (Brother of Sleep),1,5.0
1830,Follow the Bitch,1,5.0
3172,Ulysses (Ulisse),1,5.0
...,...,...,...
3237,Kestrel's Eye (Falkens öga),1,1.0
3312,"McCullochs, The",1,1.0
3376,"Fantastic Night, The (La Nuit Fantastique)",1,1.0
3460,Hillbillys in a Haunted House,1,1.0
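
One common way to keep tiny samples out of a mean-based ranking is to require a minimum number of ratings before sorting. The sketch below uses count and mean values copied from the tables in this notebook; the cutoff `MIN_RATINGS = 100` is a hypothetical value, not one prescribed by the assignment.

```python
import pandas as pd

# (count, mean) values copied from the outputs shown in this notebook.
table = pd.DataFrame(
    {'count': [3, 2, 3428, 2991, 2672],
     'mean': [5.0, 5.0, 4.317386, 4.453694, 3.763847]},
    index=['Gate of Heavenly Peace, The', 'Smashing Time', 'American Beauty',
           'Star Wars: Episode IV - A New Hope', 'Jurassic Park'],
)

MIN_RATINGS = 100  # hypothetical cutoff; tune it for the dataset

# Drop sparsely rated movies first, then rank the rest by mean rating.
well_rated = table[table['count'] >= MIN_RATINGS].sort_values('mean', ascending=False)
print(well_rated)
```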


> Method 1

#### Sorting by count first, then by mean
- This puts the most-rated movies at the top, each shown with its mean rating.

In [53]:
movie_id_group.sort_values(by=['count','mean'], ascending=False)

movie_id,short_title,count,mean
2858,American Beauty,3428,4.317386
260,Star Wars: Episode IV - A New Hope,2991,4.453694
1196,Star Wars: Episode V - The Empire Strikes Back,2990,4.292977
1210,Star Wars: Episode VI - Return of the Jedi,2883,4.022893
480,Jurassic Park,2672,3.763847
...,...,...,...
3237,Kestrel's Eye (Falkens öga),1,1.000000
3312,"McCullochs, The",1,1.000000
3376,"Fantastic Night, The (La Nuit Fantastique)",1,1.000000
3460,Hillbillys in a Haunted House,1,1.000000


> Method 2

In [54]:
movie_id_group.nlargest(5, 'count')

movie_id,short_title,count,mean
2858,American Beauty,3428,4.317386
260,Star Wars: Episode IV - A New Hope,2991,4.453694
1196,Star Wars: Episode V - The Empire Strikes Back,2990,4.292977
1210,Star Wars: Episode VI - Return of the Jedi,2883,4.022893
480,Jurassic Park,2672,3.763847


- Number of ratings for each `movie_id`

> Method 3

In [55]:
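# Count the ratings each movie received
count_group = users_ratings_movies.groupby("movie_id")["rating"].count()
count_group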

movie_id
1       2077
2        701
3        478
4        170
5        296
        ... 
3948     862
3949     304
3950      54
3951      40
3952     388
Name: rating, Length: 3706, dtype: int64

- With a threshold of more than 2500 ratings, 13 movies qualify.
- With a threshold of more than 3000 ratings, only one movie qualifies: `movie_id` 2858.
- According to the `movie_id_group` table, `movie_id` 2858 is American Beauty.

### American Beauty, with a mean rating of 4.317 and the highest rating count (3428), is the highest rated movie

In [74]:
count_group = users_ratings_movies.groupby("movie_id").count()["rating"]
movie_list = count_group[count_group > 2500].index.values
movie_list

array([ 260,  480,  589,  593,  608, 1196, 1198, 1210, 1270, 1580, 2028,
       2571, 2858], dtype=int64)

In [73]:
count_group = users_ratings_movies.groupby("movie_id").count()["rating"]
movie_list = count_group[count_group > 3000].index.values
movie_list

array([2858], dtype=int64)
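
The threshold filter is a boolean mask on the count Series. A small sketch, using five counts copied from the outputs above and a hypothetical helper name `ids_above`:

```python
import pandas as pd

# Rating counts for five movies, copied from the outputs in this notebook.
counts = pd.Series({1: 2077, 2: 701, 260: 2991, 480: 2672, 2858: 3428}, name='rating')
counts.index.name = 'movie_id'

def ids_above(counts, threshold):
    """Return the movie ids whose rating count exceeds the threshold."""
    # counts > threshold is a boolean Series; indexing with it keeps the True rows.
    return counts[counts > threshold].index.tolist()

print(ids_above(counts, 2500))
print(ids_above(counts, 3000))
```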

> Method 4

### Threshold: movies with more than 2500 ratings
- Sort by rating count in descending order

In [62]:
highest_rated = movie_id_group[movie_id_group['count'] > 2500]
highest_rated.sort_values(by=['count'], ascending=False)

movie_id,short_title,count,mean
2858,American Beauty,3428,4.317386
260,Star Wars: Episode IV - A New Hope,2991,4.453694
1196,Star Wars: Episode V - The Empire Strikes Back,2990,4.292977
1210,Star Wars: Episode VI - Return of the Jedi,2883,4.022893
480,Jurassic Park,2672,3.763847
2028,Saving Private Ryan,2653,4.337354
589,Terminator 2: Judgment Day,2649,4.058513
2571,"Matrix, The",2590,4.31583
1270,Back to the Future,2583,3.990321
593,"Silence of the Lambs, The",2578,4.351823


### The Highest Rated Movie would be
> The movie that combines the highest rating count with a high mean rating
### American Beauty is the highest rated movie, with 3428 ratings and a mean rating of 4.317
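
Count and mean can also be folded into a single score. One option, swapped in here purely as an illustration and not the method used in this notebook, is a damped mean that pulls each movie's mean toward the overall mean, with a stronger pull when the movie has few ratings. The `prior_weight` of 50 is an assumed value, and the four rows are copied from the tables above.

```python
import pandas as pd

# (count, mean) values copied from the outputs shown in this notebook.
stats = pd.DataFrame(
    {'count': [3, 3428, 2991, 2672],
     'mean': [5.0, 4.317386, 4.453694, 3.763847]},
    index=['Gate of Heavenly Peace, The', 'American Beauty',
           'Star Wars: Episode IV - A New Hope', 'Jurassic Park'],
)

def damped_mean(stats, prior_weight=50):
    """Blend each movie's mean with the overall mean; prior_weight is an assumed constant."""
    rating_sum = stats['count'] * stats['mean']           # total stars per movie
    overall_mean = rating_sum.sum() / stats['count'].sum()
    return (rating_sum + prior_weight * overall_mean) / (stats['count'] + prior_weight)

scored = stats.assign(score=damped_mean(stats)).sort_values('score', ascending=False)
print(scored)
```

Under this score a movie with three perfect ratings no longer outranks movies with thousands of ratings, and the ordering among heavily rated movies follows their means. A different `prior_weight` or a different set of rows could change the ordering.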

# What is a highly rated movie for date night? (60 pts)

- Hint: look for a movie that is rated highly
    - by both partners (who may or may not be the same gender),
    - within the genres they prefer,
    - and, optionally, within their age group
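
One possible reading of the hint, sketched under stated assumptions: treat a good date-night movie as one that both genders rate highly, score each movie by the lower of its two per-gender means, and ignore movies with too few ratings from either gender. The function name `date_night_ranking`, the `min_per_gender` cutoff, the placeholder titles, and the ratings are all invented for illustration.

```python
import pandas as pd

# Invented ratings with placeholder titles; not taken from the MovieLens files.
toy = pd.DataFrame({
    'short_title': ['Movie A'] * 4 + ['Movie B'] * 4 + ['Movie C'] * 2,
    'gender':      ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M'],
    'rating':      [5, 4, 4, 5, 5, 5, 2, 3, 5, 5],
})

def date_night_ranking(df, min_per_gender=2):
    """Rank movies by the lower of the female and male mean ratings."""
    by_gender = df.pivot_table(index='short_title', columns='gender',
                               values='rating', aggfunc=['mean', 'count'])
    # Keep only movies with enough ratings from both genders.
    enough = (by_gender['count'] >= min_per_gender).all(axis=1)
    means = by_gender['mean'][enough]
    ranked = means.assign(score=means.min(axis=1),
                          gap=(means['F'] - means['M']).abs())
    # Highest shared score first; a smaller gender gap breaks ties.
    return ranked.sort_values(['score', 'gap'], ascending=[False, True])

ranking = date_night_ranking(toy)
print(ranking)
```

On the merged table, the same function could be applied after first restricting the rows, for example to a genre with `genres.str.contains('Romance')` or to one `age` group, to cover the other two hints.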