# Recommender systems

## One of the most common uses of big data is to predict and suggest what users may want. This allows Google to show you relevant ads or to suggest news in Google Now; Amazon to recommend relevant products; Netflix to recommend movies that you might like; or, more recently, Spotify to build its famous **Discover Weekly** playlist.

## All these products are based on recommender systems: information retrieval methods that provide users with relevant, yet novel and diverse, information.

## In this class we will use a well-known dataset of movie ratings, 'MovieLens', to learn the basics of recommender systems.

## Table of Contents (times are approximate)

1. [Getting and analysing some data (1.5-2 h, typically until break)](#data)
2. [Most popular movies (0.5-1 h)](#popular)
3. [Metrics for recommender systems (2-2.5h)](#metrics)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import io
import os

<a id='data'></a>
## 1.1 Load data

We will use the MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies. 
* Each user has rated at least 20 movies. 
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens [website](https://movielens.org) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set.

You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip).

Take a look at the readme file!!!

In [10]:
data_root = "../ml-100k/"

In [11]:
!head -n500 {data_root}README

SUMMARY & USAGE LICENSE

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under th

In [12]:
!ls

RecSys_first_class_student_version.ipynb
RecSys_first_class_teacher_version.ipynb
RecSys_second_class_student_version.ipynb


In [13]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
datafile = os.path.join(data_root, "u.data")
data = pd.read_csv(datafile, sep='\t', names=columns)
data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


### A reminder of the numpy library

*The pandas library is built on top of numpy (numpy on steroids, if you like). You can access the data (in matrix form) with the `values` attribute, e.g. `data.values`*

In [14]:
data.values.shape

(100000, 4)

In [15]:
# access all rows, and first 3 columns 
data.values[:, :3]

array([[ 196,  242,    3],
       [ 186,  302,    3],
       [  22,  377,    1],
       ...,
       [ 276, 1090,    1],
       [  13,  225,    2],
       [  12,  203,    3]])

In [16]:
# access all columns, and first 3 rows 
data.values[:3, :].shape

(3, 4)

In [17]:
# access first 10 rows
data.values[:10, :]
# This is equivalent to data.values[:10]

array([[      196,       242,         3, 881250949],
       [      186,       302,         3, 891717742],
       [       22,       377,         1, 878887116],
       [      244,        51,         2, 880606923],
       [      166,       346,         1, 886397596],
       [      298,       474,         4, 884182806],
       [      115,       265,         2, 881171488],
       [      253,       465,         5, 891628467],
       [      305,       451,         3, 886324817],
       [        6,        86,         3, 883603013]])

In [18]:
# access third column (index 2, the ratings)
data.values[:, 2]

array([3, 3, 1, ..., 1, 2, 3])

In [19]:
# The attribute shape provides the shape of the matrix


In [20]:
# Note that if we return the first column, we get a vector (of 100000 components)


In [21]:
# same with the first row (this time, we get a vector of 4 components)
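
*The three cells above can be tried out on a small table first. This is a minimal sketch, assuming a three-row frame called `toy` (a made-up name) that copies the first rows of `u.data` shown earlier:*

```python
import pandas as pd

# Toy stand-in for the ratings table: the first three rows printed by data.head()
toy = pd.DataFrame({
    'user_id': [196, 186, 22],
    'item_id': [242, 302, 377],
    'rating': [3, 3, 1],
    'timestamp': [881250949, 891717742, 878887116],
})
matrix = toy.values            # the same data as a 2-D numpy array

print(matrix.shape)            # the shape attribute: (rows, columns)
first_column = matrix[:, 0]    # all rows, column 0 -> a 1-D vector
first_row = matrix[0, :]       # row 0, all columns -> a 1-D vector
print(first_column.shape, first_row.shape)
```

*On the full dataset the same slicing gives a vector with one component per rating for a column, and one component per column for a row.*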


In [22]:
# Number of users and items
n_users = data.user_id.unique().shape[0]

In [23]:
n_items = data.item_id.unique().shape[0]
print("There are %s users and %s items" %(n_users, n_items))

There are 943 users and 1682 items


## 1.2 A dictionary for movies and a search tool

In order to analyze the predicted recommendations, let's create a Python dictionary that will allow us to translate any item id into the corresponding movie title. Also, let's write a small function that returns the ids of the movies whose title contains some text.

The correspondence between titles and ids is stored in the `u.item` file.

In [34]:
data_root = "../ml-100k/"
items_id_file = os.path.join(data_root, "u.item")
!head $items_id_file

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

### *Simple reminder of dictionaries*

In [35]:
aux = {'hola': 'que haces?', 1: '237'}

In [36]:
# Access value of key=1
aux[1]

'237'

In [37]:
# create new key
aux['nueva clave'] = 'nuevo valor'

In [38]:
aux

{1: '237', 'hola': 'que haces?', 'nueva clave': 'nuevo valor'}

In [39]:
# Update value of existing key
aux[1] = 237
aux

{1: 237, 'hola': 'que haces?', 'nueva clave': 'nuevo valor'}

In [40]:
list(aux.keys())

['hola', 1, 'nueva clave']

In [41]:
list(aux.values())

['que haces?', 237, 'nuevo valor']

In [42]:
for x1, x2 in list(aux.items())[:3]:
    print('clave:', x1, 'valor:', x2)

clave: hola valor: que haces?
clave: 1 valor: 237
clave: nueva clave valor: nuevo valor


In [43]:
# Create a dictionary for movie titles and ids
item_dict = {}
with io.open(items_id_file, 'rb') as f:
    for line in f.readlines():
        record = line.split(b'|')
        item_dict[int(record[0])] = str(record[1])
    
# We can use this dict to see the films a user has seen, for instance. 
for record in data.values[:20]:
    print("User {u} viewed '{m}' and gave a {r} rating".format(
        u=record[0], m=item_dict[record[1]], r=record[2]))    

User 196 viewed 'b'Kolya (1996)'' and gave a 3 rating
User 186 viewed 'b'L.A. Confidential (1997)'' and gave a 3 rating
User 22 viewed 'b'Heavyweights (1994)'' and gave a 1 rating
User 244 viewed 'b'Legends of the Fall (1994)'' and gave a 2 rating
User 166 viewed 'b'Jackie Brown (1997)'' and gave a 1 rating
User 298 viewed 'b'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)'' and gave a 4 rating
User 115 viewed 'b'Hunt for Red October, The (1990)'' and gave a 2 rating
User 253 viewed 'b'Jungle Book, The (1994)'' and gave a 5 rating
User 305 viewed 'b'Grease (1978)'' and gave a 3 rating
User 6 viewed 'b'Remains of the Day, The (1993)'' and gave a 3 rating
User 62 viewed 'b'Men in Black (1997)'' and gave a 2 rating
User 286 viewed 'b"Romy and Michele's High School Reunion (1997)"' and gave a 5 rating
User 200 viewed 'b'Star Trek: First Contact (1996)'' and gave a 5 rating
User 210 viewed 'b'To Wong Foo, Thanks for Everything! Julie Newmar (1995)'' and gave a 3 
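
Note the `b'...'` wrapped around every title: the file is opened in binary mode, and `str()` applied to a `bytes` object returns its printed representation rather than the text itself. A possible alternative is to decode each line before splitting it. The sketch below assumes the titles are Latin-1 encoded (an assumption, check it against your copy of the file) and uses a hypothetical helper, `read_titles`, on two sample lines in the `u.item` layout:

```python
import io

# Two sample lines in the u.item layout (id|title|release date|...), as raw bytes.
sample = (b"1|Toy Story (1995)|01-Jan-1995||url|0|0\n"
          b"2|GoldenEye (1995)|01-Jan-1995||url|0|1\n")

def read_titles(handle, encoding="latin-1"):
    """Return {item_id: title} with titles decoded to str (hypothetical helper)."""
    titles = {}
    for raw_line in handle:
        # decode the bytes first, then split on the field separator
        fields = raw_line.decode(encoding).rstrip("\n").split("|")
        titles[int(fields[0])] = fields[1]
    return titles

titles = read_titles(io.BytesIO(sample))
print(titles)
```

The rest of this notebook keeps the original `item_dict`, so the outputs below still show the `b'...'` form.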

In [44]:
item_dict

{1: "b'Toy Story (1995)'",
 2: "b'GoldenEye (1995)'",
 3: "b'Four Rooms (1995)'",
 4: "b'Get Shorty (1995)'",
 5: "b'Copycat (1995)'",
 6: "b'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)'",
 7: "b'Twelve Monkeys (1995)'",
 8: "b'Babe (1995)'",
 9: "b'Dead Man Walking (1995)'",
 10: "b'Richard III (1995)'",
 11: "b'Seven (Se7en) (1995)'",
 12: "b'Usual Suspects, The (1995)'",
 13: "b'Mighty Aphrodite (1995)'",
 14: "b'Postino, Il (1994)'",
 15: 'b"Mr. Holland\'s Opus (1995)"',
 16: "b'French Twist (Gazon maudit) (1995)'",
 17: "b'From Dusk Till Dawn (1996)'",
 18: "b'White Balloon, The (1995)'",
 19: 'b"Antonia\'s Line (1995)"',
 20: "b'Angels and Insects (1995)'",
 21: "b'Muppet Treasure Island (1996)'",
 22: "b'Braveheart (1995)'",
 23: "b'Taxi Driver (1976)'",
 24: "b'Rumble in the Bronx (1995)'",
 25: "b'Birdcage, The (1996)'",
 26: "b'Brothers McMullen, The (1995)'",
 27: "b'Bad Boys (1995)'",
 28: "b'Apollo 13 (1995)'",
 29: "b'Batman Forever (1995)'",
 30: "b'Belle de jou

In [45]:
text = 'potter'
title = 'harry potter'
if text in title:
    print(text)

potter


In [46]:
# Define a function that retrieves the ids and titles of all movies containing 'text' in their title
def returnItemId(text, ids):
    """
    :param text: string to be looked for in movie titles
    :param ids: dictionary of {id: title}
    
    :return: a list of (id, title) if text is found in titles, and an empty list otherwise.
    """
    # convert input text to lowercase, so that the search is case-insensitive
    text_ = text.lower()
    # find occurrences
    search = []
    for id_, title in ids.items():
        if text_ in title.lower():
            search.append((id_, title))
    
    return search

In [47]:
returnItemId('but', item_dict)

[(240, "b'Beavis and Butt-head Do America (1996)'"),
 (435, "b'Butch Cassidy and the Sundance Kid (1969)'"),
 (580,
  "b'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'"),
 (1401, "b'M. Butterfly (1993)'"),
 (1459, "b'Madame Butterfly (1995)'"),
 (1614, "b'Reluctant Debutante, The (1958)'"),
 (1621, "b'Butterfly Kiss (1995)'"),
 (1645, "b'Butcher Boy, The (1998)'"),
 (1650, "b'Butcher Boy, The (1998)'")]

## 1.3 Data consistency (always double check everything!!!)

In [48]:
# check whether titles are unique or not. They are not!!!
len(set(item_dict.keys()))

1682

In [49]:
len(set(item_dict.values()))

1664

### One workaround: create another dict that consolidates ids with the same movie title

In [50]:
item_dict.values()

dict_values(["b'Toy Story (1995)'", "b'GoldenEye (1995)'", "b'Four Rooms (1995)'", "b'Get Shorty (1995)'", "b'Copycat (1995)'", "b'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)'", "b'Twelve Monkeys (1995)'", "b'Babe (1995)'", "b'Dead Man Walking (1995)'", "b'Richard III (1995)'", "b'Seven (Se7en) (1995)'", "b'Usual Suspects, The (1995)'", "b'Mighty Aphrodite (1995)'", "b'Postino, Il (1994)'", 'b"Mr. Holland\'s Opus (1995)"', "b'French Twist (Gazon maudit) (1995)'", "b'From Dusk Till Dawn (1996)'", "b'White Balloon, The (1995)'", 'b"Antonia\'s Line (1995)"', "b'Angels and Insects (1995)'", "b'Muppet Treasure Island (1996)'", "b'Braveheart (1995)'", "b'Taxi Driver (1976)'", "b'Rumble in the Bronx (1995)'", "b'Birdcage, The (1996)'", "b'Brothers McMullen, The (1995)'", "b'Bad Boys (1995)'", "b'Apollo 13 (1995)'", "b'Batman Forever (1995)'", "b'Belle de jour (1967)'", "b'Crimson Tide (1995)'", "b'Crumb (1994)'", "b'Desperado (1995)'", "b'Doom Generation, The (1995)'", "b'Free Willy 

In [51]:
dd = pd.DataFrame.from_dict(item_dict, orient='index').reset_index()
dd.columns = ['item_id', 'title']
dd.head()

Unnamed: 0,item_id,title
0,1,b'Toy Story (1995)'
1,2,b'GoldenEye (1995)'
2,3,b'Four Rooms (1995)'
3,4,b'Get Shorty (1995)'
4,5,b'Copycat (1995)'
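
The `dd` dataframe also gives a quick way to spot repeated titles. This is a minimal sketch on a toy frame (`dd_toy`, a made-up name, with a few ids and titles taken from this notebook), assuming plain-text titles:

```python
import pandas as pd

dd_toy = pd.DataFrame({
    'item_id': [246, 268, 1, 266, 680],
    'title': ['Chasing Amy (1997)', 'Chasing Amy (1997)', 'Toy Story (1995)',
              'Kull the Conqueror (1997)', 'Kull the Conqueror (1997)'],
})

# keep=False flags every row whose title appears more than once
dupes = dd_toy[dd_toy.duplicated('title', keep=False)]

# collect the ids that share each repeated title
ids_per_title = dupes.groupby('title')['item_id'].apply(list).to_dict()
print(ids_per_title)
```

The dictionary-based loop in the next cell produces the same grouping without pandas.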


In [52]:
duplicates_item_dict = {}
# The keys of "duplicates_item_dict" are the movie titles
# The values are lists of ids (a single id, or several)
for id_, name in item_dict.items():
    # key: name; value: list of ids
    if name not in duplicates_item_dict:
        duplicates_item_dict[name] = [id_]
    else:
        duplicates_item_dict[name].append(id_)

# show the duplicated titles
for k,v in list(duplicates_item_dict.items()):
    if len(v)>1:
        print(k,v)

b'Chasing Amy (1997)' [246, 268]
b'Kull the Conqueror (1997)' [266, 680]
b"Ulee's Gold (1997)" [297, 303]
b'Fly Away Home (1996)' [304, 500]
b'Ice Storm, The (1997)' [305, 865]
b'Deceiver (1997)' [309, 1606]
b'Desperate Measures (1998)' [329, 348]
b'Body Snatchers (1993)' [573, 670]
b'Substance of Fire, The (1996)' [711, 1658]
b'Money Talks (1997)' [876, 881]
b'That Darn Cat! (1997)' [878, 1003]
b'Hugo Pool (1997)' [1175, 1617]
b'Chairman of the Board (1998)' [1234, 1654]
b'Designated Mourner, The (1997)' [1256, 1257]
b'Hurricane Streets (1998)' [1395, 1607]
b'Sliding Doors (1998)' [1429, 1680]
b'Nightwatch (1997)' [1477, 1625]
b'Butcher Boy, The (1998)' [1645, 1650]


Create a dict where the keys are the original ids and the values are the new unique ids. 
We will use this dictionary to remove duplicates in a dataframe.

In [53]:
unique_id_item_dict ={}
for new_id, old_id_list in enumerate(duplicates_item_dict.values()):
    # key: old_id; value: new_id
    for old_id in old_id_list:
        unique_id_item_dict[old_id] = new_id

# Equivalent to the loop above, written as a dict comprehension
unique_id_item_dict = {old_id: new_id for new_id, old_id_list in enumerate(duplicates_item_dict.values()) 
                                      for old_id in old_id_list}
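
As a cross-check, pandas can build the same kind of mapping with `pd.factorize`, which numbers distinct values in order of first appearance. This is only a sketch on a toy dictionary (`toy_item_dict`, a made-up name with plain-text titles), not the method used in the rest of the notebook:

```python
import pandas as pd

toy_item_dict = {1: 'Toy Story (1995)', 2: 'GoldenEye (1995)',
                 246: 'Chasing Amy (1997)', 268: 'Chasing Amy (1997)'}

old_ids = list(toy_item_dict.keys())
# codes[i] is the new id of the i-th title; equal titles get equal codes
codes, uniques = pd.factorize(pd.Series(list(toy_item_dict.values())))

old_to_new = {old: int(new) for old, new in zip(old_ids, codes)}
print(old_to_new)
```

Both ids of 'Chasing Amy (1997)' end up with the same new id, which is what the loop above achieves.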

In [54]:
unique_id_item_dict

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 8: 7,
 9: 8,
 10: 9,
 11: 10,
 12: 11,
 13: 12,
 14: 13,
 15: 14,
 16: 15,
 17: 16,
 18: 17,
 19: 18,
 20: 19,
 21: 20,
 22: 21,
 23: 22,
 24: 23,
 25: 24,
 26: 25,
 27: 26,
 28: 27,
 29: 28,
 30: 29,
 31: 30,
 32: 31,
 33: 32,
 34: 33,
 35: 34,
 36: 35,
 37: 36,
 38: 37,
 39: 38,
 40: 39,
 41: 40,
 42: 41,
 43: 42,
 44: 43,
 45: 44,
 46: 45,
 47: 46,
 48: 47,
 49: 48,
 50: 49,
 51: 50,
 52: 51,
 53: 52,
 54: 53,
 55: 54,
 56: 55,
 57: 56,
 58: 57,
 59: 58,
 60: 59,
 61: 60,
 62: 61,
 63: 62,
 64: 63,
 65: 64,
 66: 65,
 67: 66,
 68: 67,
 69: 68,
 70: 69,
 71: 70,
 72: 71,
 73: 72,
 74: 73,
 75: 74,
 76: 75,
 77: 76,
 78: 77,
 79: 78,
 80: 79,
 81: 80,
 82: 81,
 83: 82,
 84: 83,
 85: 84,
 86: 85,
 87: 86,
 88: 87,
 89: 88,
 90: 89,
 91: 90,
 92: 91,
 93: 92,
 94: 93,
 95: 94,
 96: 95,
 97: 96,
 98: 97,
 99: 98,
 100: 99,
 101: 100,
 102: 101,
 103: 102,
 104: 103,
 105: 104,
 106: 105,
 107: 106,
 108: 107,
 109: 108,
 110: 109,
 111: 11

Create another dict mapping each new unique id to its movie title

In [55]:
unique_item_dict = {unique_id_item_dict[k]:v 
                    for k,v in item_dict.items()}
    
assert(len(set(unique_item_dict.keys())) == 
       len(set(unique_item_dict.values())))

Now we can use our `returnItemId()` function safely =)

In [56]:
returnItemId('but', unique_item_dict)

[(239, "b'Beavis and Butt-head Do America (1996)'"),
 (431, "b'Butch Cassidy and the Sundance Kid (1969)'"),
 (575,
  "b'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'"),
 (1390, "b'M. Butterfly (1993)'"),
 (1448, "b'Madame Butterfly (1995)'"),
 (1601, "b'Reluctant Debutante, The (1958)'"),
 (1607, "b'Butterfly Kiss (1995)'"),
 (1630, "b'Butcher Boy, The (1998)'")]

In [74]:
returnItemId('but', item_dict)

[(240, "b'Beavis and Butt-head Do America (1996)'"),
 (435, "b'Butch Cassidy and the Sundance Kid (1969)'"),
 (580,
  "b'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'"),
 (1401, "b'M. Butterfly (1993)'"),
 (1459, "b'Madame Butterfly (1995)'"),
 (1614, "b'Reluctant Debutante, The (1958)'"),
 (1621, "b'Butterfly Kiss (1995)'"),
 (1645, "b'Butcher Boy, The (1998)'"),
 (1650, "b'Butcher Boy, The (1998)'")]

## 1.4 Train and test sets

GroupLens provides several splits of the dataset, so that we can check how good our algorithms are. See the README file for more details. Here we will use one such split.

Please note that we have to correct for the non-unique movie id issue!

In [57]:
!ls $data_root

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [58]:
trainfile = os.path.join(data_root, 'ua.base')
!head $trainfile

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960
1	4	3	876893119
1	5	3	889751712
1	6	5	887431973
1	7	4	875071561
1	8	1	875072484
1	9	5	878543541
1	10	3	875693118


In [59]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
trainfile = os.path.join(data_root, "ua.base")
train = pd.read_csv(trainfile, sep='\t', names=columns)
print('There are %s users, %s items and %s pairs in the train set' \
      %(train.user_id.unique().shape[0], train.item_id.unique().shape[0], train.item_id.count()))
train.head()


There are 943 users, 1680 items and 90570 pairs in the train set


Unnamed: 0,user_id,item_id,rating,timestamp
0,1,1,5,874965758
1,1,2,3,876893171
2,1,3,4,878542960
3,1,4,3,876893119
4,1,5,3,889751712


In [60]:
# same for test
columns = ['user_id', 'item_id', 'rating', 'timestamp']
testfile = os.path.join(data_root, "ua.test")
test = pd.read_csv(testfile, sep='\t', names=columns)
print('There are %s users, %s items and %s pairs in the test set' \
      %(test.user_id.unique().shape[0], test.item_id.unique().shape[0], test.item_id.count()))
test.head()


There are 943 users, 1129 items and 9430 pairs in the test set


Unnamed: 0,user_id,item_id,rating,timestamp
0,1,20,4,887431883
1,1,33,4,878542699
2,1,61,4,878542420
3,1,117,3,874965739
4,1,155,2,878542201


### Correcting for non-unique movie ids

*Reminder of lambda functions in Python: they are a way of writing short functions (one line of code) inline, without having to define the function in a separate cell.*

In [61]:
train['item_id'] = train.item_id.apply(lambda x: unique_id_item_dict[x])

In [62]:
test['item_id'] = test.item_id.apply(lambda x: unique_id_item_dict[x])
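
For a plain dictionary lookup like this one, `Series.map` is a common alternative to `apply` with a lambda. A minimal sketch, assuming a toy series and mapping (`toy_ids` and `toy_mapping` are made-up names):

```python
import pandas as pd

toy_ids = pd.Series([1, 246, 268, 2], name='item_id')
toy_mapping = {1: 0, 2: 1, 246: 2, 268: 2}

# the lambda version raises a KeyError for an id missing from the dict
with_apply = toy_ids.apply(lambda x: toy_mapping[x])
# the map version returns NaN for a missing id instead of raising
with_map = toy_ids.map(toy_mapping)

print(with_map.tolist())
```

The difference only matters when some id is missing from the dictionary, which is not the case for the toy data here.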

<a id='popular'></a>
## 2. Most popular movies

Recommending popular items is a simple, yet quite effective, baseline for recommendation. Indeed, most recommender systems (RS) suffer from a strong *popularity bias*, i.e. they tend to recommend popular items more often than they should, simply because suggesting what is popular is effective. A lot of research is devoted to understanding this behaviour and to developing recipes to avoid it.

Movies can be ranked according to different popularity metrics:
* Most rated movie (it is assumed that this is the most watched movie)
* Most positively rated movie (rating > 4.0)
* Highest rated movie

## 2.1 Most rated movie

In [63]:
# group the train dataset by item and count the number of users using Pandas
mostRated = train.groupby('item_id')['user_id'].count()
mostRated.head()

item_id
0    392
1    121
2     85
3    198
4     79
Name: user_id, dtype: int64

In [65]:
# sort in descending order
mostRatedSorted = mostRated.sort_values(ascending=False)

In [66]:
mostRatedSorted.head()

item_id
49     495
99     443
180    439
257    412
284    400
Name: user_id, dtype: int64

In [70]:
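# build (id, title, number of ratings) tuples; columns are accessed by label rather than by position
mostRatedMovie = mostRatedSorted.reset_index().apply(
    lambda x: (x['item_id'], unique_item_dict[x['item_id']], x['user_id']), axis=1)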
mostRatedMovie.head()

0                (49, b'Star Wars (1977)', 495)
1                    (99, b'Fargo (1996)', 443)
2      (180, b'Return of the Jedi (1983)', 439)
3                 (257, b'Contact (1997)', 412)
4    (284, b'English Patient, The (1996)', 400)
dtype: object

In [71]:
mostRatedMovie.values

array([(49, "b'Star Wars (1977)'", 495), (99, "b'Fargo (1996)'", 443),
       (180, "b'Return of the Jedi (1983)'", 439), ...,
       (1568, "b'Baton Rouge (1988)'", 1),
       (1569, "b'Liebelei (1933)'", 1),
       (1663, "b'Scream of Stone (Schrei aus Stein) (1991)'", 1)],
      dtype=object)

In [74]:
# Return a numpy array of [id, title, frequency]

# numpy requires knowledge of the data types. Since we have ids (integers) and titles (strings), 
# we will use a "parent" data type, object (the np.object alias was removed in NumPy 1.24)
mostRatedMovies = np.zeros(shape=(mostRatedSorted.shape[0], 3), dtype=object)

aux = mostRatedMovie.values
for i, row in enumerate(aux):
    # note: np.array(list(row)) casts every field to a string, so ids and counts are stored as str
    mostRatedMovies[i] = np.array(list(row))
    
mostRatedMovies[:10,0:]

array([['49', "b'Star Wars (1977)'", '495'],
       ['99', "b'Fargo (1996)'", '443'],
       ['180', "b'Return of the Jedi (1983)'", '439'],
       ['257', "b'Contact (1997)'", '412'],
       ['284', "b'English Patient, The (1996)'", '400'],
       ['292', "b'Liar Liar (1997)'", '398'],
       ['0', "b'Toy Story (1995)'", '392'],
       ['286', "b'Scream (1996)'", '386'],
       ['120', "b'Independence Day (ID4) (1996)'", '384'],
       ['173', "b'Raiders of the Lost Ark (1981)'", '379']], dtype=object)
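
Notice that the ids and counts in the array above are strings. If you prefer to keep them as integers, one option is to stay in pandas and add the title as a new column. This is a minimal sketch on toy data (`toy_train` and `toy_titles` are made-up names, with plain-text titles):

```python
import pandas as pd

toy_train = pd.DataFrame({'user_id': [1, 2, 3, 1, 2, 1],
                          'item_id': [49, 49, 49, 99, 99, 0]})
toy_titles = {49: 'Star Wars (1977)', 99: 'Fargo (1996)', 0: 'Toy Story (1995)'}

# count ratings per item, sort, and turn the result into a two-column table
counts = (toy_train.groupby('item_id')['user_id'].count()
          .sort_values(ascending=False)
          .rename('n_ratings')
          .reset_index())
# look up each title; the id and count columns keep their integer dtype
counts['title'] = counts['item_id'].map(toy_titles)
print(counts)
```

The rest of the notebook keeps using the numpy arrays, so this is just an alternative to be aware of.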

## 2.2 Most positively rated movie

In [75]:
train.query('rating>4')

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,0,5,874965758
5,1,5,5,887431973
8,1,8,5,878543541
11,1,11,5,878542960
12,1,12,5,875071805
13,1,13,5,874965706
14,1,14,5,875071608
15,1,15,5,878543541
18,1,18,5,875071515
30,1,31,5,888732909


In [76]:
train[train.rating>4]

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,0,5,874965758
5,1,5,5,887431973
8,1,8,5,878543541
11,1,11,5,878542960
12,1,12,5,875071805
13,1,13,5,874965706
14,1,14,5,875071608
15,1,15,5,878543541
18,1,18,5,875071515
30,1,31,5,888732909


In [77]:
# keep only the ratings above 4 (i.e. 5 stars). Then group by item, count the number of users and sort in descending order.
mostPositiveRated = train.query('rating>4').groupby('item_id')['user_id'].count().sort_values(ascending=False)

mostPositiveRated = mostPositiveRated.reset_index().apply(
    lambda x: (x['item_id'], unique_item_dict[x['item_id']], x['user_id']), axis=1)
mostPositiveRated.head()

0                   (49, b'Star Wars (1977)', 277)
1                       (99, b'Fargo (1996)', 194)
2    (173, b'Raiders of the Lost Ark (1981)', 181)
3             (126, b'Godfather, The (1972)', 179)
4                (55, b'Pulp Fiction (1994)', 170)
dtype: object

In [78]:
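# dtype=object replaces the np.object alias, which was removed in NumPy 1.24
mostPositiveRatedMovies = np.zeros(shape=(mostPositiveRated.shape[0], 3), dtype=object)

aux = mostPositiveRated.values
for i, row in enumerate(aux):
    # as before, np.array(list(row)) stores ids and counts as strings
    mostPositiveRatedMovies[i] = np.array(list(row))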
    
mostPositiveRatedMovies[:10,0:]

array([['49', "b'Star Wars (1977)'", '277'],
       ['99', "b'Fargo (1996)'", '194'],
       ['173', "b'Raiders of the Lost Ark (1981)'", '181'],
       ['126', "b'Godfather, The (1972)'", '179'],
       ['55', "b'Pulp Fiction (1994)'", '170'],
       ['315', 'b"Schindler\'s List (1993)"', '168'],
       ['97', "b'Silence of the Lambs, The (1991)'", '167'],
       ['171', "b'Empire Strikes Back, The (1980)'", '159'],
       ['63', "b'Shawshank Redemption, The (1994)'", '152'],
       ['180', "b'Return of the Jedi (1983)'", '146']], dtype=object)

## 2.3 Highest mean rating movie

In [116]:
# obtain the highest rated movies, keeping only those with more than min_ratings ratings.
min_ratings = 50
ratingsByMovie = train.groupby('item_id')['rating']
meanMovies = ratingsByMovie.mean()[ratingsByMovie.count() > min_ratings].sort_values(ascending=False)
meanMovies.head(10)

item_id
113    4.491525
404    4.480769
168    4.476636
315    4.475836
479    4.459821
63     4.457364
11     4.386454
598    4.374359
49     4.365657
177    4.327434
Name: rating, dtype: float64

In [117]:
len(meanMovies)

570

In [119]:
meanMovies[113]

4.491525423728813

In [121]:
# dtype=object replaces the np.object alias, which was removed in NumPy 1.24
meanRateMovies = np.zeros(shape=(len(meanMovies), 3), dtype=object)

for i, index in enumerate(meanMovies.index):
    title = unique_item_dict[index]
    rating = meanMovies[index]
    meanRateMovies[i] = [index, title, rating]
    
meanRateMovies[:10,:]

array([[113, "b'Wallace & Gromit: The Best of Aardman Animation (1996)'",
        4.491525423728813],
       [404, "b'Close Shave, A (1995)'", 4.480769230769231],
       [168, "b'Wrong Trousers, The (1993)'", 4.4766355140186915],
       [315, 'b"Schindler\'s List (1993)"', 4.4758364312267656],
       [479, "b'Casablanca (1942)'", 4.459821428571429],
       [63, "b'Shawshank Redemption, The (1994)'", 4.457364341085271],
       [11, "b'Usual Suspects, The (1995)'", 4.386454183266932],
       [598, "b'Rear Window (1954)'", 4.374358974358974],
       [49, "b'Star Wars (1977)'", 4.365656565656566],
       [177, "b'12 Angry Men (1957)'", 4.327433628318584]], dtype=object)

<div class  = "alert alert-info"> 
**QUESTION**: set the value of *min_ratings* to 1, and re-run the cell. What happens now? Try other values.
</div>

<div class  = "alert alert-info"> 
**QUESTION**: Which method is better? How can we measure the quality of a recommender system? 
</div>

<div class  = "alert alert-info"> 
**IMPORTANT QUESTION**: When might it be useful to recommend popular items?
</div>

<a id='metrics'></a>
## 3. Metrics for recommender systems

As we have seen, even with the simplest solution (recommending popular items) it is difficult to know which technique performs better. To address this, a number of metrics have been proposed to measure the quality of a recommender system. 

Metrics can be designed to measure the relevance or accuracy of a recommendation, but also its novelty or its diversity. 

For now, we will focus on relevance and accuracy. Several metrics exist:
* Accuracy: RMSE, MAE.
* Not ranked: Recall@k, Precision@k.
* With rank discount: MAP@k, NDCG@k.
* With rank ordering: mean percentile rank.

We will define some of them in this class. For the moment, let's talk about precision and recall.

## 3.1 Precision and recall

<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" alt="Precision and Recall in IR" style="float: right; width: 300px"/>

The concepts of precision and recall come from the world of information retrieval; have a look at Wikipedia:

https://en.wikipedia.org/wiki/Precision_and_recall

From this entry:

 * "**precision** (also called positive predictive value) is the fraction of retrieved instances that are relevant".
 * "**recall** (also known as sensitivity) is the fraction of relevant instances that are retrieved".

<br />
<div class  = "alert alert-info"> 
**QUESTION**: How do we know if some movie, unknown to the user, is relevant?
</div>

In other words, we cannot measure a false positive (something recommended that was not relevant), because a recommended item that the user has not adopted may simply be unknown to them. In this regard, only recall-oriented metrics have an actual meaning in RS. Nonetheless, it's common practice to define both metrics in RS as follows:
 
### $$\mathrm{recall}@N = \frac{\sum_{k=1}^N rel(k)}{\sum_{i\in \mathcal{I}_u} 1}$$
### $$\mathrm{precision}@N = \frac{\sum_{k=1}^N rel(k)}{N}$$

Here, $\mathcal{I}_u$ is the set of items adopted by user $u$, and $rel(k)$ is the relevance of a recommendation at position k in the list of recommendations. For ratings, the relevance could be defined as those movies rated above a certain threshold, e.g. $r_{ui}>4.0$. 
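
The two definitions can be sketched in a few lines. The function names (`relevance`, `precision_at_n`, `plain_recall_at_n`) and the toy items are assumptions made for illustration, with binary relevance: $rel(k)=1$ if the item at position $k$ was adopted, and 0 otherwise.

```python
def relevance(recommended, adopted):
    """rel(k) for each position k: 1 if the item was adopted, 0 otherwise."""
    return [1 if item in adopted else 0 for item in recommended]

def precision_at_n(n, recommended, adopted):
    # sum of rel(k) over the first n positions, divided by n
    return sum(relevance(recommended[:n], adopted)) / n

def plain_recall_at_n(n, recommended, adopted):
    # same numerator, divided by the number of items adopted by the user
    return sum(relevance(recommended[:n], adopted)) / len(adopted)

adopted = {'A', 'B', 'C'}              # toy set of items adopted by one user
recommended = ['A', 'X', 'B', 'Y']     # toy ranked list of recommendations

print(precision_at_n(2, recommended, adopted), plain_recall_at_n(2, recommended, adopted))
print(precision_at_n(4, recommended, adopted), plain_recall_at_n(4, recommended, adopted))
```

Both metrics share the same numerator; only the denominator changes.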

**Important to note: since precision is pretty much the same as recall in RS, metrics such as the *area under the ROC curve* don't have any meaning!!**

<div class = "alert alert-success">
As an example, consider a user that watched the following films:
<br /><br />
'Designated Mourner, The (1997)'
<br />
'Money Talks (1997)'
<br />
'Madame Butterfly (1995)'
<br />
'Batman Forever (1995)'
<br /><br />
The recommended items were: 
<br /><br />
'Batman (1989)' 
<br />
'Madame Butterfly (1995)'
<br /><br />
**What would be the recall and precision @1? And @2?**
<br />
**What do you think of recommending Batman? Is it a bad or a good recommendation?**
</div>

Please notice that there isn't any actual difference between precision and recall in the context of RS: both measure the relevance of the recommendations, and say nothing about recommended items that haven't been adopted by the user. Thus, it makes sense to define a normalized recall as:

### $$\mathrm{recall}@N = \frac{\sum_{k=1}^N rel(k)}{\mathrm{min}(N, \sum_{i\in \mathcal{I}_u} 1)}$$

This way, the best achievable value is always 1.

<div class="alert alert-success">
**Exercise** Implement the above definition of recall
</div>

In [None]:
def recall_at_n(N, test, recommended, train=None):
    """
    :param N: number of recommendations
    :param test: list of movies seen by user in test
    :param train: list of movies seen by user in train. This has to be removed from the recommended list 
    :param recommended: list of movies recommended
    
    :return the recall
    """
    if train is not None: # Remove items in train
        rec_true = []
        for r in recommended:
            ?
    else:
        rec_true = recommended    
    intersection = ?
    return intersection / float(np.minimum(N, len(test)))

In [None]:
seen = ['Designated Mourner, The (1997)', 'Money Talks (1997)', 'Madame Butterfly (1995)', 'Batman Forever (1995)']
recommended = ['Batman (1989)', 'Madame Butterfly (1995)']

In [None]:
recall_at_n(1, seen, recommended)

In [None]:
recall_at_n(2, seen, recommended)

In [None]:
# Check it's well normalized
print(recall_at_n(3, seen, recommended))
print(recall_at_n(10, seen, recommended))
print(recall_at_n(100, seen, recommended))
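
If you get stuck, here is one possible way to fill in the blanks. It is a sketch rather than the reference solution, so it is named `recall_at_n_sketch` (a made-up name) to avoid overwriting your own function; it assumes that hits are counted within the first `N` recommendations left after removing the train items.

```python
def recall_at_n_sketch(N, test, recommended, train=None):
    """Normalized recall at N (one possible implementation, for illustration)."""
    if train is not None:
        # drop recommendations the user has already seen in train
        rec_true = [r for r in recommended if r not in train]
    else:
        rec_true = recommended
    # number of distinct top-N recommendations that appear in test
    hits = len(set(rec_true[:N]) & set(test))
    return hits / float(min(N, len(test)))

seen = ['Designated Mourner, The (1997)', 'Money Talks (1997)',
        'Madame Butterfly (1995)', 'Batman Forever (1995)']
recommended = ['Batman (1989)', 'Madame Butterfly (1995)']

print(recall_at_n_sketch(1, seen, recommended))
print(recall_at_n_sketch(2, seen, recommended))
```

With this toy list, recall@1 is 0 because 'Batman (1989)' was not seen, and recall@2 is 1/2 because one of the two recommendations was.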

### Now, use this implementation to measure the performance of the popularity baselines on the test set. Use the top-5 movies, for instance

In [None]:
mostRatedMovies[:5,1:]

In [None]:
mostPositiveRatedMovies[:5,1:]

In [None]:
meanRateMovies[:5,1:]

In [None]:
train.head()

*Since `recall_at_n` takes both the train and the test list of each user, we need to create a dataset with the list of movies seen in train and in test.*

Thus, get the list of movies per user in train and test, and join the two dataframes. For the join, use the pandas method `merge`.

In [None]:
# get movies in train per user. For this, group by user and get a list of item ids.
trainUsersGrouped = ?


In [None]:
# same with test data
testUsersGrouped = ?

In [None]:
# make the join: use pandas merge method
joined = ?

In [None]:
joined.head()

In [None]:
joined.item_id_test.head()
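
As a hint for the three cells above, this is what the group-and-merge pattern looks like on toy data. It is a sketch under assumptions: the frames `toy_train` and `toy_test` are made up, and the `suffixes` argument is chosen so that the merged columns are called `item_id_train` and `item_id_test`.

```python
import pandas as pd

toy_train = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 11, 10]})
toy_test = pd.DataFrame({'user_id': [1, 2, 2], 'item_id': [12, 11, 13]})

# one row per user, holding the list of item ids of that user
train_lists = toy_train.groupby('user_id')['item_id'].apply(list).reset_index()
test_lists = toy_test.groupby('user_id')['item_id'].apply(list).reset_index()

# inner join on user_id; the clashing item_id columns get the given suffixes
toy_joined = pd.merge(train_lists, test_lists, on='user_id',
                      suffixes=('_train', '_test'))
print(toy_joined)
```

An inner join keeps only users present in both frames, which is fine here because every user appears in both the train and the test split.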

In [None]:
# How would you access values in test?
?

In [None]:
# This second method is easier if we want to access several columns at once, and operate over them.
# For instance, if we like to concatenate both train and test list, we will do:
?

In [None]:
# Use the above method to calculate the recall of the mostRatedMovies recommendation, for each user:
?

*As you can see, some users have a fairly large recall (0.5), while for others it is small (e.g. 0.14). Let's calculate the mean.*

In [None]:
topN = 30
# calculate the average recall across all users for mostRatedMovies recommendation
recall_per_user = ?
recall_per_user.mean()
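
The per-user calculation can be sketched with a row-wise `apply`. Everything here is a toy assumption: the frame `toy_joined`, the ranked list `toy_recommended`, and the helper `recall_sketch`, which stands in for your own `recall_at_n`.

```python
import pandas as pd

def recall_sketch(N, test, recommended, train=None):
    """Normalized recall at N (illustrative stand-in for recall_at_n)."""
    rec_true = [r for r in recommended if train is None or r not in train]
    hits = len(set(rec_true[:N]) & set(test))
    return hits / float(min(N, len(test)))

toy_joined = pd.DataFrame({
    'user_id': [1, 2],
    'item_id_train': [[10, 11], [10]],
    'item_id_test': [[12], [11, 13]],
})
toy_recommended = [10, 11, 12, 13]   # e.g. item ids sorted by popularity
toy_topN = 2

# axis=1 passes one row (one user) at a time to the lambda
toy_recall_per_user = toy_joined.apply(
    lambda row: recall_sketch(toy_topN, row['item_id_test'], toy_recommended,
                              row['item_id_train']),
    axis=1)
print(toy_recall_per_user.tolist(), toy_recall_per_user.mean())
```

Remember that the recommended list must contain item ids of the same type as the ids stored in `joined`.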

In [None]:
# calculate the average recall across all users for mostPositiveRatedMovies recommendation
?

In [None]:
# calculate the average recall across all users for meanRateMovies recommendation
?

## 3.2 Mean Average Precision (MAP) -- Advanced material

The previous metrics did not account for the ranking of the recommendations, i.e. the relative position of a movie within the sorted list of recommendations. **But order matters!** Metrics like MAP, MRR or NDCG try to tackle this problem. 

From the blog *http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/*:

> Here’s another way to understand average precision. Wikipedia says AP is used to score document retrieval. You can think of it this way: you type something in Google and it shows you 10 results. It’s probably best if all of them were relevant. If only some are relevant, say five of them, then it’s much better if the relevant ones are shown first. It would be bad if first five were irrelevant and good ones only started from sixth, wouldn’t it? AP score reflects this.

Implementation taken from:

https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py



## Average Precision 

The Average Precision is defined as:

### $$\mathrm{AP}@N = \frac{\sum_{k=1}^N P(k) \times rel(k)}{\mathrm{min}(N, \sum_{i\in \mathcal{I}_u} 1)}$$

where $P(k)$ is the precision at cut-off $k$ in the item list, i.e. the number of adopted items among the first $k$ recommendations, divided by $k$. Thus:

### $$\mathrm{AP}@N = \frac{\sum_{k=1}^N \left(\sum_{i=1}^k rel(i)\right)/k \times rel(k)}{\mathrm{min}(N, \sum_{i\in \mathcal{I}_u} 1)}$$



<div class = "alert alert-success">
Following the example above, consider a user that watched the following films:
<br /><br />
'Designated Mourner, The (1997)'
<br />
'Money Talks (1997)'
<br />
'Madame Butterfly (1995)'
<br />
'Batman Forever (1995)'
<br /><br />
The recommended items were: 
<br /><br />
'Batman (1989)' 
<br />
'Madame Butterfly (1995)'
</div>

<div class = "alert alert-success">
**Calculate AP@1**
<br /><br />
First, *rel(1)=0*, because Batman was not viewed. Also, *P(1) = 0*. Thus, AP@1=0.
<br />
**Calculate AP@2**
<br /><br />
As before, *rel(1)=0*, so the first term does not contribute. For the second term, *rel(2)=1* and *P(2)=0.5*, since one of the first two recommendations was adopted. The numerator is hence:
<br /><br />
$P(1) \times rel(1)+P(2) \times rel(2)=0 \times 0+0.5 \times 1=0.5$
<br /><br />
For the denominator, $N=2$ and $\sum_{i\in \mathcal{I}_u} 1=4$, so $\mathrm{min}(2, 4)=2$, thus:
<br /><br />
AP@2 = 0.5/2 = 0.25
</div>

Let's now implement it =)

In [None]:
def apk(N, test, recommended, train=None):
    """
    Computes the average precision at N given recommendations.
    
    :param N: number of recommendations
    :param test: list of movies seen by user in test
    :param train: list of movies seen by user in train. This has to be removed from the recommended list 
    :param recommended: list of movies recommended
    
    :return The average precision at N over the test set
    """
    if train is not None: 
        rec_true = []
        for r in recommended:
            if r not in train:
                rec_true.append(r)
    else:
        rec_true = recommended    
    predicted = rec_true[:N] # top-N predictions
    
    score = 0.0 # This will store the numerator
    num_hits = 0.0 # This will store the sum of rel(i)

    for i,p in enumerate(predicted):
        if p in test and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits/(i+1.0)

    return score / min(len(test), N)

In [None]:
seen = ['Designated Mourner, The (1997)', 'Money Talks (1997)', 'Madame Butterfly (1995)', 'Batman Forever (1995)']
recommended = ['Madame Butterfly (1995)', 'Batman (1989)']

In [None]:
apk(1, seen, recommended)

In [None]:
apk(2, seen, recommended)

In [None]:
apk(3, seen, recommended)
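
Note that the `recommended` list in the cell above puts 'Madame Butterfly (1995)' first, the reverse of the worked example, so the two give different AP values. The sketch below makes the comparison explicit; `average_precision_sketch` is a made-up, condensed version of `apk` without the train filtering.

```python
def average_precision_sketch(N, test, recommended):
    """AP at N, condensed from apk for illustration (no train filtering)."""
    predicted = recommended[:N]
    score, hits = 0.0, 0
    for i, item in enumerate(predicted):
        # count each relevant item once, at the position where it first appears
        if item in test and item not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)   # P(k) at this position
    return score / min(len(test), N)

seen = ['Designated Mourner, The (1997)', 'Money Talks (1997)',
        'Madame Butterfly (1995)', 'Batman Forever (1995)']
relevant_last = ['Batman (1989)', 'Madame Butterfly (1995)']
relevant_first = ['Madame Butterfly (1995)', 'Batman (1989)']

print(average_precision_sketch(2, seen, relevant_last))
print(average_precision_sketch(2, seen, relevant_first))
```

The same two movies are recommended in both cases, so recall@2 is identical; AP@2 is twice as large when the relevant movie comes first.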

## MAP

Mean average precision is nothing else than the AP averaged across users ;)

Apply it to popularity baselines

In [None]:
?
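
As a starting point, MAP can be sketched as a plain average of per-user AP values. The names below (`ap_sketch`, `map_sketch`, `toy_test_by_user`, `toy_recommended`) are assumptions for illustration, and the same recommendation list is served to every user, as with the popularity baselines.

```python
def ap_sketch(N, test, recommended):
    """AP at N for one user (condensed from apk, no train filtering)."""
    predicted = recommended[:N]
    score, hits = 0.0, 0
    for i, item in enumerate(predicted):
        if item in test and item not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(test), N)

def map_sketch(N, test_by_user, recommended):
    """Mean of the per-user AP values."""
    scores = [ap_sketch(N, test, recommended) for test in test_by_user.values()]
    return sum(scores) / len(scores)

toy_test_by_user = {1: [10, 12], 2: [11], 3: [13]}   # user id -> test items
toy_recommended = [10, 11, 12]                       # same list for everyone

toy_map = map_sketch(3, toy_test_by_user, toy_recommended)
print(toy_map)
```

Here user 1 gets AP@3 = 5/6, user 2 gets 1/2 and user 3 gets 0, so the mean is 4/9.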

<div class="alert alert-success">
The rest of the class is covered in a different notebook
</div>