# Pandas Assignment

## Part 1

In this assignment we are going to use pandas to figure out: what's the best **date-night movie**?

This assignment is going to use
- Joining
- Groupby
- Sorting

Hint! Find the highly rated movies that appeal to both genders, 'M' and 'F'.


In [139]:
import os
import pandas as pd
import numpy as np
from math import *

##### Read in the movie data: `pd.read_table`

In [140]:
def get_movie_data():
    # The .dat files use a two-character '::' separator. Separators longer than
    # one character are handled by the Python parsing engine, so request it explicitly.
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table(os.path.join('../data', 'users.dat'),
                          sep='::', header=None, names=unames, engine='python')

    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table(os.path.join('../data', 'ratings.dat'),
                            sep='::', header=None, names=rnames, engine='python')

    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table(os.path.join('../data', 'movies.dat'),
                           sep='::', header=None, names=mnames, engine='python')

    return users, ratings, movies
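
As a quick check of the parsing options, here is a minimal sketch that reads a few `::`-separated rows from an in-memory string instead of `../data/users.dat`. The three sample rows, the `sample_users` name, and the use of `io.StringIO` are assumptions made for this illustration only.

```python
import io
import pandas as pd

# Three rows in the same '::'-separated layout as users.dat (illustrative sample)
sample = "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator is treated as a regular expression
# and requires the Python parsing engine.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python')
print(sample_users)
```

`pd.read_table` accepts the same arguments; `pd.read_csv` is used here only because the separator is given explicitly anyway.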

In [141]:
users, ratings, movies = get_movie_data()


In [142]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [143]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [144]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


##### Clean up the `movies`

- Get the `year`
- Shorten the `title`


In [145]:
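# Raw string so that '\(' is a regex escape, not a Python string escape;
# expand=True always returns a DataFrame with one column per group.
tmp = movies.title.str.extract(r'(.*) \(([0-9]+)\)', expand=True)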
tmp.apply(lambda x:x[0] if len(x) > 0 else None)
tmp.apply(lambda x: x[0][:40] if len(x) > 0 else None)


0    Toy Story
1         1995
dtype: object

In [146]:
movies['year'] = tmp[1]
movies['short_title'] = tmp[0]

In [147]:
movies.head()

Unnamed: 0,movie_id,title,genres,year,short_title
0,1,Toy Story (1995),Animation|Children's|Comedy,1995,Toy Story
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995,Jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,Grumpier Old Men
3,4,Waiting to Exhale (1995),Comedy|Drama,1995,Waiting to Exhale
4,5,Father of the Bride Part II (1995),Comedy,1995,Father of the Bride Part II
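
The extraction step can also be written with named groups, so that the resulting columns get their names directly. The following is a minimal sketch on three sample titles; the names `titles` and `parts` are hypothetical, and it assumes every title ends in a four-digit year in parentheses.

```python
import pandas as pd

titles = pd.Series(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'])

# Named groups become the column names of the returned DataFrame
parts = titles.str.extract(r'(?P<short_title>.*) \((?P<year>[0-9]{4})\)', expand=True)
# Assumption: every title matched, so the year column has no missing values
parts['year'] = parts['year'].astype(int)
print(parts)
```

Converting `year` to an integer makes later numeric comparisons and sorting behave as expected; titles that do not match the pattern would produce missing values and need handling first.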


##### Join the tables with `pd.merge`

In [163]:
# Join the movies with their ratings on the shared movie_id column.
# Perform a left join, because we want to keep movies which don't have any ratings and
# report them as movies which have not yet been rated by any users.
movies_ratings = pd.merge(left=movies, right=ratings, how='left', on='movie_id')
movies_ratings.head()
movies_ratings['genres'].value_counts()

Comedy                                           116905
Drama                                            111507
Comedy|Romance                                    42716
Comedy|Drama                                      42254
Drama|Romance                                     29173
Action|Thriller                                   26759
Horror                                            22566
Drama|Thriller                                    18250
Thriller                                          17852
Action|Adventure|Sci-Fi                           17783
Drama|War                                         14657
Action|Sci-Fi                                     14309
Action|Sci-Fi|Thriller                            13970
Action                                            12314
Action|Drama|War                                  12224
Crime|Drama                                       11872
Comedy|Drama|Romance                              11069
Action|Adventure                                
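
To see what the left join does with movies that have no ratings, here is a small sketch on made-up data. The frames `movies_demo` and `ratings_demo` are hypothetical; `indicator=True` is an optional `pd.merge` argument that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

movies_demo = pd.DataFrame({'movie_id': [1, 2, 3],
                            'title': ['A', 'B', 'C']})
ratings_demo = pd.DataFrame({'user_id': [10, 11, 10],
                             'movie_id': [1, 1, 2],
                             'rating': [5, 4, 3]})

# how='left' keeps every movie, even those without a matching rating
joined = pd.merge(movies_demo, ratings_demo, how='left',
                  on='movie_id', indicator=True)

# Rows marked 'left_only' are movies that nobody has rated
unrated = joined[joined['_merge'] == 'left_only']
print(unrated[['movie_id', 'title']])
```

The unrated movie appears once, with missing values in the `user_id` and `rating` columns.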

**Aggregate the data for number of ratings and mean rating**

In [155]:
# How should we determine the highest-rated movie?
# Perhaps take the average rating of each movie together with its number of ratings.
movies_ratings[movies_ratings.isnull().any(axis=1)]

# We can aggregate the dataframe on movie_id, creating a new dataframe that keeps the
# grouping columns, holds the mean rating for each film, and adds a new column called
# num_ratings, which tracks the number of ratings given for each film.

# Define our aggregation functions so that we get proper names after aggregation.
def num_ratings(x):
    # count() skips NaN, so a movie kept by the left join without ratings gets 0, not 1
    return x.count()
def mean_rating(x):
    return x.mean()

# group by movie_id and title, etc, letting us keep those values.
g = movies_ratings.groupby(['movie_id','title','genres','year'])
movies_agg = g.agg({'user_id':[num_ratings], 'rating':[mean_rating]})

# get rid of first level of columns
movies_agg.columns = movies_agg.columns.droplevel(0)
# turn our movie_id + title index into columns, and reindex
movies_agg.reset_index(inplace=True)
movies_agg.head()

Unnamed: 0,movie_id,title,genres,year,num_ratings,mean_rating
0,1,Toy Story (1995),Animation|Children's|Comedy,1995,2077.0,4.146846
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995,701.0,3.201141
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,478.0,3.016736
3,4,Waiting to Exhale (1995),Comedy|Drama,1995,170.0,2.729412
4,5,Father of the Bride Part II (1995),Comedy,1995,296.0,3.006757
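
The same aggregation can be sketched with pandas named aggregation, which sets the output column names directly and avoids the two-level column index. This is an alternative on made-up data, not the approach used in the cell above; the frame `demo` is hypothetical, and named aggregation assumes pandas 0.25 or newer.

```python
import pandas as pd

demo = pd.DataFrame({'movie_id': [1, 1, 1, 2, 2],
                     'rating': [5, 4, 3, 2, 4]})

# Named aggregation: output_column=(input_column, function)
agg = (demo.groupby('movie_id')
           .agg(num_ratings=('rating', 'count'),
                mean_rating=('rating', 'mean'))
           .reset_index())
print(agg)
```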


##### What's the highest rated movie?

We can't simply choose the movie with the highest mean rating, because the result would be skewed.
Suppose a movie has only one rating, and it is a 5. It can't be considered the highest rated, since that rating
represents a single sample out of the population. A movie with an average rating of 4.5 across 100 ratings
should clearly rank higher. To solve this, we need to define a way to rank a film based on both the number of
ratings it has and its average rating.

Consider the following:

$$Rank = \frac{ratings}{rating_{avg}}$$

This gives us an *okay enough* ranking for movies. For example, a movie $A$ with $ratings = 5$ and $rating_{avg} = 4.7$ would rank lower than another movie $B$ with $ratings = 10$ and $rating_{avg} = 4.1$:

$$Rank_A = \frac{ratings_A}{rating_{avgA}} = \frac{5}{4.7} = 1.06$$

$$Rank_B = \frac{ratings_B}{rating_{avgB}} = \frac{10}{4.1} = 2.44$$

$$1.06 < 2.44$$

This works well enough for our purposes because the number of ratings is an unbounded count, whereas the mean rating is limited to the interval $[0,5]$. The rank is therefore driven mostly by the number of ratings, with the mean rating acting as a bounded scaling factor. Doing it the other way around ($rating_{avg} / ratings$) would limit the rank to the interval $[0,5]$ and would give films with very few ratings a higher rank than they deserve. We also keep the decimal portion of each rank so that, in case two or more movies share the same whole-number portion, we can still choose between them in most cases.
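
The worked numbers for movies $A$ and $B$ can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `rank_score` is hypothetical and not part of the notebook.

```python
def rank_score(num_ratings, mean_rating):
    # Rank as defined above: number of ratings divided by the mean rating
    return num_ratings / mean_rating

rank_a = rank_score(5, 4.7)
rank_b = rank_score(10, 4.1)
print(round(rank_a, 2), round(rank_b, 2))  # 1.06 2.44

# For a fixed number of ratings, a higher mean rating gives a lower rank
print(rank_score(100, 4.5) < rank_score(100, 3.0))  # True
```

The second comparison shows a property of this formula worth keeping in mind: the mean rating only separates films with similar rating counts, and it does so in favour of the lower mean.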

In [156]:
# Assign a rank to the dataframe
movies_agg['rank'] = movies_agg['num_ratings'] / movies_agg['mean_rating']
movies_agg.head()

Unnamed: 0,movie_id,title,genres,year,num_ratings,mean_rating,rank
0,1,Toy Story (1995),Animation|Children's|Comedy,1995,2077.0,4.146846,500.862533
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995,701.0,3.201141,218.984403
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,478.0,3.016736,158.449376
3,4,Waiting to Exhale (1995),Comedy|Drama,1995,170.0,2.729412,62.284483
4,5,Father of the Bride Part II (1995),Comedy,1995,296.0,3.006757,98.444944


**Now we can decide which film is the highest rated based on its rank**

In [432]:
# idxmax() returns an index label, so look the row up with .loc
movies_agg.loc[movies_agg['rank'].idxmax()]

movie_id                         2858
title          American Beauty (1999)
genres                   Comedy|Drama
year                             1999
num_ratings                      3428
mean_rating                   4.31739
rank                          793.999
Name: 2788, dtype: object
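
If more than the single top film is of interest, sorting gives the full ordering. Below is a minimal sketch on made-up ranks; the frame `demo` and its values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                     'rank': [12.5, 793.9, 218.9, 500.8]})

# Three highest ranks, highest first
top3 = demo.nlargest(3, 'rank')
# Equivalent result via an explicit descending sort
same = demo.sort_values('rank', ascending=False).head(3)
print(top3)
```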

### Now we can start working out what the best date night film is

*Of course, the model for determining such a film will be heavily biased towards its creator's own notions.*

---

Now that we have ranked films according to their ratings and number of ratings, we have to state some assumptions.

**The recommendation model assumes:**
1. A date night is between two people.
2. Those two people have records in our users dataset.
3. The two people have not necessarily rated any films.
  * If they have made ratings, we will consider their rating history as a factor in determining the film.

#### So how do we decide what film to show a couple?

Given two people on a date, the minimal information we need from them is their age, sex, and user_id.
From this we can check our model (it's going to be a simple one) to see which films would appeal the most, and then
select the one with the highest rank. Additionally, we can use any ratings that members of the couple have
given to bias the film selection towards genres they have rated highly.

***We need to develop a system*** that we can treat as a function which takes a tuple of user info and returns a single
film.

$$f(X,Y) \rightarrow Z$$
Where $X$ and $Y$ are sets.

**EX:** 
$$f(\{Genre, Userid, Age, Sex, [Ratings]\}, \{Genre, Userid, Age, Sex, [Ratings]\}) \rightarrow {Films}$$

We keep it really simple. For each person in the couple, we take their age, sex, and a genre, and we look for films
of that genre whose ratings come from users that closely match the age and sex of both members of the couple.
We optionally consider any ratings that either user has made.

**Let's define f(X,Y)**

* Calculate the Jaccard similarity of each person's info over every film, and keep the average of the similarities.

$$jaccard\_similarity \space J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

$$numeric\_similarity \space I(A,B) = \frac{min(A,B)}{max(A,B)}$$

* Calculate the Jaccard similarity between the two people in the couple.
* Select movies which fall within a range of the Jaccard similarity of the couple.

$$f(X,Y) = \{Z \in S : min\_similarity < I(Z_{jaccard},\space Couple_{jaccard}) < max\_similarity \}$$
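
As a small worked example of the two measures (the genre sets and mean ratings here are made up for illustration), take one person who favours Comedy and Romance with a mean rating of 3.9, and another who favours Comedy and Drama with a mean rating of 4.2:

$$J(\{\text{Comedy}, \text{Romance}\}, \{\text{Comedy}, \text{Drama}\}) = \frac{|\{\text{Comedy}\}|}{|\{\text{Comedy}, \text{Romance}, \text{Drama}\}|} = \frac{1}{3}$$

$$I(3.9, 4.2) = \frac{min(3.9, 4.2)}{max(3.9, 4.2)} = \frac{3.9}{4.2} \approx 0.93$$

Both measures lie in $[0,1]$ for positive inputs, with 1 meaning identical sets or identical values.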

In [506]:
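def jaccard_similarity(a, b):
    # Compare as sets so that duplicate items are not counted more than once
    a, b = set(a), set(b)
    union_cardinality = len(a | b)
    if union_cardinality == 0:
        return 0.0  # assumption: two empty collections are treated as not similar
    return len(a & b) / union_cardinality

def avg_jaccard(vals=()):
    # An immutable default avoids sharing one mutable list between calls
    return np.mean(vals)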

def numeric_similarity(x, y):
    return min(x,y)/max(x,y)

def is_similar(x, y, thresh=0.5):
    s = []
    for i in x:
        s.append((numeric_similarity(i,y) > thresh))
    return s

couple = users.sample(n=2)

# get the most rated genre of both 
c1 = {
    'most_rated_genre':'',
    'mean_rating':0
}

c2 = {
    'most_rated_genre':'',
    'mean_rating':0
}

c1['most_rated_genre'] = \
    movies_ratings[movies_ratings['user_id'] == couple.iloc[0]['user_id']]['genres'].value_counts().index.tolist()[0]
c1['mean_rating'] = \
    movies_ratings[movies_ratings['user_id'] == couple.iloc[0]['user_id']]['rating'].mean()
c2['most_rated_genre'] = \
    movies_ratings[movies_ratings['user_id'] == couple.iloc[1]['user_id']]['genres'].value_counts().index.tolist()[0]
c2['mean_rating'] = \
    movies_ratings[movies_ratings['user_id'] == couple.iloc[1]['user_id']]['rating'].mean()

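# regex=False matches the genre string literally; otherwise the '|' in a
# combined genre such as 'Comedy|Romance' would be read as a regex alternation.
mask = movies_agg['genres'].str.contains(c1['most_rated_genre'], regex=False) & \
    movies_agg['genres'].str.contains(c2['most_rated_genre'], regex=False)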
    
print(c1,c2)
tmp = movies_agg[mask]
tmp = tmp[is_similar(tmp['mean_rating'],c1['mean_rating'],0.9)]
tmp = tmp[is_similar(tmp['mean_rating'],c2['mean_rating'],0.9)]
tmp.reset_index(inplace=True)
# now select the movie from this which has the highest rank
if tmp['rank'].size == 0:
    print("Sorry, we're having trouble finding a good movie for you.")
else:
    print(tmp.iloc[tmp['rank'].idxmax()])

{'most_rated_genre': 'Drama', 'mean_rating': 4.245901639344262} {'most_rated_genre': 'Drama', 'mean_rating': 3.8684210526315788}
index                                                       1177
movie_id                                                    1196
title          Star Wars: Episode V - The Empire Strikes Back...
genres                         Action|Adventure|Drama|Sci-Fi|War
year                                                        1980
num_ratings                                                 2990
mean_rating                                              4.29298
rank                                                     696.486
Name: 101, dtype: object
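
The selection steps in the cell above can be wrapped into a single function, in the spirit of $f(X,Y)$. The following is a rough sketch under stated assumptions, not the notebook's implementation: the function name `recommend`, its parameters, and the small `films` table are all hypothetical, and only the genre and mean-rating filters are reproduced.

```python
import pandas as pd

def recommend(films, genre_a, genre_b, mean_a, mean_b, thresh=0.9):
    """Return the highest-ranked film matching both genres and both mean ratings, or None."""
    def close(col, target):
        # min/max ratio (the numeric similarity) applied to a whole column:
        # clip(upper=t) is min(col, t), clip(lower=t) is max(col, t)
        return col.clip(upper=target) / col.clip(lower=target) > thresh

    mask = (films['genres'].str.contains(genre_a, regex=False)
            & films['genres'].str.contains(genre_b, regex=False)
            & close(films['mean_rating'], mean_a)
            & close(films['mean_rating'], mean_b))
    candidates = films[mask]
    if candidates.empty:
        return None
    return candidates.loc[candidates['rank'].idxmax()]

# Made-up films table with the columns the function relies on
films = pd.DataFrame({'title': ['A', 'B', 'C'],
                      'genres': ['Drama', 'Comedy|Drama', 'Comedy'],
                      'mean_rating': [4.2, 2.0, 4.1],
                      'rank': [696.5, 900.0, 300.0]})

pick = recommend(films, 'Drama', 'Drama', 4.25, 3.87)
print(pick['title'] if pick is not None else 'no match')
```

Returning `None` when nothing matches leaves it to the caller to decide how to report the failure.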


## Part 2

Load the dataset in `titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [153]:
from IPython.display import HTML
HTML(filename='../data/titanic.html')

0,1,2,3,4,5
Name,Labels,Units,Levels,Storage,NAs
pclass,,,3,integer,0
survived,Survived,,,double,0
name,Name,,,character,0
sex,,,2,integer,0
age,Age,Year,,double,263
sibsp,Number of Siblings/Spouses Aboard,,,double,0
parch,Number of Parents/Children Aboard,,,double,0
ticket,Ticket Number,,,character,0
fare,Passenger Fare,British Pound (\243),,double,1

0,1
Variable,Levels
pclass,1st
,2nd
,3rd
sex,female
,male
cabin,
,A10
,A11
,A14


In [154]:
# you would need xlrd - pip install xlrd
t_file = pd.ExcelFile('../data/titanic.xls')
t_df = t_file.parse("titanic", header=None)
t_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.338,B5,S,2,,"St Louis, MO"
2,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"


### Women and children first?

***1. Use the `groupby` method to calculate the proportion of passengers that survived by sex.***

***2. Calculate the same proportion, but by class and sex.***

***3. Create age categories: children (under 14 years), adolescents (14-20), adults (21-64), and seniors (65+), and calculate survival proportions by age category, class and sex.***
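
One possible way to approach the three questions is sketched below on a tiny made-up sample rather than the real sheet. It assumes the columns are named `pclass`, `sex`, `age` and `survived` (that is, the header row is used for column names instead of being read as data) and that `survived` is coded 0/1. The bin edges are one reading of the age categories, with fractional ages such as 20.5 counted as adolescent.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with the same column names as the Titanic sheet
df = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'age':      [29, 0.9167, 30, 17, 70, 40],
    'survived': [1, 1, 1, 0, 0, 1],
})

# 1. The mean of a 0/1 column is the proportion of survivors
by_sex = df.groupby('sex')['survived'].mean()

# 2. Same proportion, grouped by class and sex
by_class_sex = df.groupby(['pclass', 'sex'])['survived'].mean()

# 3. Left-closed bins: [0, 14), [14, 21), [21, 65), [65, inf)
bins = [0, 14, 21, 65, np.inf]
labels = ['child', 'adolescent', 'adult', 'senior']
df['age_cat'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
by_age = df.groupby(['age_cat', 'pclass', 'sex'], observed=True)['survived'].mean()

print(by_sex, by_class_sex, by_age, sep='\n\n')
```

On the real data, passengers with a missing age fall outside every bin, so they would be left out of the third grouping unless they are handled separately.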