# Lesson 1: Non-Personalized Recommender Systems

## 1. Non-Personalized Recommender using the MovieLens Dataset
We will work with the well-known MovieLens dataset (http://grouplens.org/datasets/movielens/), collected by the GroupLens research group through the MovieLens website. Several versions of this dataset are available with different amounts of data, from the 100K ratings version up to the 20M ratings version. Although results on a bigger dataset are expected to be better, we will work with the smallest one: the MovieLens 100K Dataset (ml-100k.zip). Working with this lite version has the benefit of a lower computational cost.

On a Unix machine, the dataset can be downloaded with the following commands:


In [1]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip 
!unzip ml-100k.zip -d "data/"

/bin/sh: wget: command not found
unzip:  cannot find or open ml-100k.zip, ml-100k.zip.zip or ml-100k.zip.ZIP.


If you are working on a Windows machine, please go to the website, download the 100K version, and extract it into the subdirectory named "data/ml-100k/".
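Since the `wget` cell above can fail when the tool is not installed, a platform-independent alternative is to download and unpack the archive from Python itself. The following is only a minimal sketch: the helper name `fetch_and_extract` is hypothetical, and it assumes the archive unpacks into an `ml-100k/` folder, as the paths used in this lesson suggest.

```python
import os
import zipfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# URL taken from the wget command above
ML_100K_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"

def fetch_and_extract(url, zip_path, dest_dir):
    """Download url to zip_path (unless the file already exists) and unpack it into dest_dir."""
    if not os.path.exists(zip_path):
        urlretrieve(url, zip_path)
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # Return the top-level entries now present in dest_dir
    return sorted(os.listdir(dest_dir))

# Usage (needs network access), mirroring the wget/unzip cell:
# fetch_and_extract(ML_100K_URL, "ml-100k.zip", "data/")
```

Skipping the download when the zip file is already present makes the cell safe to re-run.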

Once you have downloaded and unzipped the file into a directory, you can create a DataFrame with the following code:


In [2]:
# The real Netflix setting: 50,000,000 users and 100,000 items
%autosave 150
%matplotlib inline
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt

# Load Data set
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols)

# the movies file contains columns indicating the movie's genres
# let's only load the first three columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date']
# u.item is not UTF-8 encoded, so state the encoding explicitly
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_cols, usecols=range(3), encoding='latin-1')

# Build the DataFrame
data = pd.merge(pd.merge(ratings, users), movies)
data = data[['user_id','title', 'movie_id','rating','release_date','sex','age']]


print("The dataset has {} ratings".format(data.shape[0]))
print("The dataset has {} users".format(data.user_id.nunique()))
print("The dataset has {} movies".format(data.movie_id.nunique()))
print(data.head())


Autosaving every 150 seconds
The dataset has 100000 ratings
The dataset has 943 users
The dataset has 1682 movies
   user_id         title  movie_id  rating release_date sex  age
0      196  Kolya (1996)       242       3  24-Jan-1997   M   49
1      305  Kolya (1996)       242       5  24-Jan-1997   M   23
2        6  Kolya (1996)       242       4  24-Jan-1997   M   42
3      234  Kolya (1996)       242       4  24-Jan-1997   M   60
4       63  Kolya (1996)       242       3  24-Jan-1997   M   31


If you explore the dataset in detail, you will see that it consists of:
* 100,000 ratings from 943 users on 1,682 movies. Ratings range from 1 to 5.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip code).
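These properties can be checked directly rather than taken on trust. The sketch below uses a hypothetical helper, `summarize_ratings`, and a made-up toy frame so that it runs without the dataset files; it assumes the column names `user_id`, `movie_id` and `rating` used in this lesson.

```python
import pandas as pd

def summarize_ratings(df):
    """Return basic counts for a ratings table with user_id, movie_id and rating columns."""
    per_user = df.groupby('user_id').size()
    return {
        'n_ratings': len(df),
        'n_users': df['user_id'].nunique(),
        'n_movies': df['movie_id'].nunique(),
        'min_rating': df['rating'].min(),
        'max_rating': df['rating'].max(),
        'min_ratings_per_user': per_user.min(),
    }

# Toy frame standing in for `data` (values made up for illustration)
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 2, 3],
    'movie_id': [10, 20, 10, 20, 30, 10],
    'rating':   [5, 3, 4, 1, 2, 5],
})
summary = summarize_ratings(toy)
print(summary)
```

Calling `summarize_ratings(data)` on the merged DataFrame should then let you compare each value against the list above.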

### 2.1 Top movies ranking
The simplest way to show the ranking is by using the mean rating.


In [None]:
mean_score = data.groupby('title').rating.mean()
print(mean_score.sort_values(ascending=False).head(10))

title
Marlene Dietrich: Shadow and Light (1996)            5.0
Prefontaine (1997)                                   5.0
Santa with Muscles (1996)                            5.0
Star Kid (1997)                                      5.0
Someone Else's America (1995)                        5.0
Entertaining Angels: The Dorothy Day Story (1996)    5.0
Saint of Fort Washington, The (1993)                 5.0
Great Day in Harlem, A (1994)                        5.0
They Made Me a Criminal (1939)                       5.0
Aiqing wansui (1994)                                 5.0
Name: rating, dtype: float64


What do you think about the output? 

Now let's rank by mean rating again, but using only those movies with at least 20 ratings, so that a handful of enthusiastic ratings cannot put a rarely rated movie at the top.

In [None]:
# Number of ratings per title
size = data.groupby('title').size()
# Keep titles with at least 20 ratings, then rank by mean rating
print(mean_score[size >= 20].sort_values(ascending=False).head(10))
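To see why the minimum-count filter changes the ranking, it helps to run both variants on a tiny example. This is a sketch only: `top_rated` is a hypothetical helper, and the toy ratings are made up.

```python
import pandas as pd

def top_rated(df, min_ratings=20, n=10):
    """Rank titles by mean rating, keeping only titles with at least min_ratings ratings."""
    stats = df.groupby('title')['rating'].agg(['mean', 'count'])
    kept = stats[stats['count'] >= min_ratings]
    return kept.sort_values('mean', ascending=False).head(n)

# Toy ratings: 'A' has a single 5, 'B' and 'C' have three ratings each
toy = pd.DataFrame({
    'title':  ['A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'rating': [5,   4,   5,   4,   3,   2,   4],
})
print(top_rated(toy, min_ratings=1))  # 'A' leads on the strength of one rating
print(top_rated(toy, min_ratings=3))  # 'A' is dropped and 'B' leads
```

Returning the count next to the mean makes it easy to judge how much evidence stands behind each position in the ranking.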

### 2.2 Market basket case data
Top associated products


In [None]:
# Read data/groceries.csv: one basket per line, items separated by commas
def union(a, b):
    """Return the union of two lists."""
    return list(set(a) | set(b))

market_data = []
items = []
with open("data/groceries.csv") as f:
    for line in f:
        # Items keep any surrounding whitespace and the trailing newline
        basket = line.split(',')
        market_data.append(basket)
        items = union(items, basket)

print("Number of different items {}".format(len(items)))
print("Number of rows {}".format(len(market_data)))


print("An example: {}".format(market_data[3]))

Number of different items 331
Number of rows 9835
An example: ['pip fruit', 'yogurt', 'cream cheese ', 'meat spreads\n']
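The example row still carries a trailing space in `'cream cheese '` and a newline in `'meat spreads\n'`, so the same product can be counted under several spellings. Below is a hedged sketch of how the baskets could be cleaned and then used to find the products most often bought together with a target product. The helper names `clean_basket` and `top_associated_products` are hypothetical, and apart from the first line (taken from the example row) the toy lines are made up.

```python
from collections import Counter

def clean_basket(line):
    """Split one CSV line into item names, dropping surrounding whitespace and empty fields."""
    return [item.strip() for item in line.split(',') if item.strip()]

def top_associated_products(baskets, target, n=3):
    """Return (item, share) pairs: the share of baskets with `target` that also contain item."""
    with_target = [set(b) for b in baskets if target in b]
    if not with_target:
        return []
    counts = Counter(item for b in with_target for item in b if item != target)
    total = float(len(with_target))
    # Sort by descending count, breaking ties alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, count / total) for item, count in ranked[:n]]

# Toy lines in the same shape as rows of data/groceries.csv
lines = [
    "pip fruit,yogurt,cream cheese ,meat spreads\n",
    "yogurt,whole milk\n",
    "whole milk,yogurt,pip fruit\n",
    "whole milk\n",
]
baskets = [clean_basket(l) for l in lines]
print(baskets[0])
print(top_associated_products(baskets, "yogurt"))
```

With real data, the cleaned baskets would replace `market_data`, and the number of different items would be expected to drop once the whitespace variants are merged.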


### 2.3 Top associated movies to a target one
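#### Create a function that gives the top associated movies to a target one using the following simple metric:
$$score(X|Y) = \frac{\text{number of users who rated both } X \text{ and } Y}{\text{number of users who rated } X}$$

Here $X$ is the target movie and $Y$ is a candidate movie.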

In [None]:
def top_associated_movies(data, target_movie, N=10):
    # Exercise: return the N movies with the highest score for target_movie
    return 0
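One possible way to approach the exercise is sketched below. It is not the reference solution: the name `top_associated_movies_sketch` is hypothetical, and it assumes that "X and Y" in the metric means the number of users who rated both movies, with X being the target.

```python
import pandas as pd

def top_associated_movies_sketch(data, target_movie, N=10):
    """Score each other movie Y as (#users who rated both X and Y) / (#users who rated X)."""
    target_users = set(data.loc[data['title'] == target_movie, 'user_id'])
    if not target_users:
        return pd.Series(dtype=float)
    # Ratings given by the target's audience to any other movie
    others = data[data['user_id'].isin(target_users) & (data['title'] != target_movie)]
    # Distinct users per movie, divided by the size of the target's audience
    scores = others.groupby('title')['user_id'].nunique() / float(len(target_users))
    return scores.sort_values(ascending=False).head(N)

# Toy data: users 1, 2 and 3 rated 'X'; two of them also rated 'Y', one also rated 'Z'
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 4, 4],
    'title':   ['X', 'Y', 'Z', 'X', 'Y', 'X', 'Y', 'Z'],
})
print(top_associated_movies_sketch(toy, 'X'))
```

Because the denominator depends only on the target, this score tends to favor movies that are popular overall, which is worth keeping in mind when reading the results.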

#### Let's show the recommended movies for a user who clicks on 'GoldenEye (1995)'

In [None]:
top_associated_movies(data, target_movie='GoldenEye (1995)')

In [None]:
top_associated_movies(data, target_movie='Pulp Fiction (1994)')

In [None]:
top_associated_movies(data, target_movie='Forrest Gump (1994)')

In [None]:
top_associated_movies(data, target_movie='101 Dalmatians (1996)')