# Recommendation System Datasets

This notebook uses the following datasets:

- [MovieLens 10M data set](http://grouplens.org/datasets/movielens/10m/)
- [MovieLens 22M data set](http://grouplens.org/datasets/movielens/latest/)
- [Million song data set](http://labrosa.ee.columbia.edu/millionsong/tasteprofile)

In [None]:
TODO: rename parse_rating to parseRating
TODO: add settings.py file

## Split dataset into 60-20-20 train-validate-test partitions

In [4]:
import os

def exists(filepath):
    return os.path.exists(filepath)

In [5]:
if (exists('ml-10M100K/train60.dat') and exists('ml-10M100K/validation20.dat') and exists('ml-10M100K/test20.dat')):
    print "Already created files: train60.dat, validation20.dat, test20.dat"    

else:
    # sort by timestamp (4th column)
    print 'sorting file...'
    !sort -t ':' -k4 ml-10M100K/ratings.dat > ml-10M100K/new_ratings.dat 
    print "sorting complete."
    
    # split into 5 parts of 2 million each: train(3 parts), validation (1 part), test (1 part)
    print "splitting file..."
    !split -l 2000000 ml-10M100K/new_ratings.dat ff
    !cat ffaa ffab ffac > ml-10M100K/train60.dat
    !mv ffad ml-10M100K/validation20.dat
    !mv ffae ml-10M100K/test20.dat
    
    # remove tmp files used to create partitions
    !rm new_ratings.dat ff*
    print "splitting complete."    
    print "Newly created files: train60.dat, validation20.dat, test20.dat"

Already created files: train60.dat, validation20.dat, test20.dat


In [88]:
help(ALS.train)

Help on method train in module pyspark.mllib.recommendation:

train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False, seed=None) method of __builtin__.type instance



### Meaning of parameters

- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- ***rank*** is the number of latent factors in the model.
- iterations is the number of iterations to run.
- ***lambda*** specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.


# Using train data, learn ALS model

In [82]:
import contextlib
from math import sqrt
from operator import add
import sys
from pyspark.mllib.recommendation import ALS

In [306]:
def parse_rating(line):
    """
    Parses a rating record that's in MovieLens format.
    
    :param str line: userId::movieId::rating::timestamp
    """
    fields = line.strip().split("::")

    return (int(fields[0]),   # User ID
            int(fields[1]),   # Movie ID
            float(fields[2])) # Rating


def compute_rmse(model, data, dataCount):
    """
    Compute RMSE (Root Mean Squared Error).
    :param object model
    :param list data
    :param integer validation_count
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1]))) #userId and #movieId
    predictionsAndRatings = \
        predictions.map(lambda x: ((x[0], x[1]), x[2])) \
                   .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
                   .values()
    return sqrt(
        predictionsAndRatings.map(
            lambda x: (x[0] - x[1]) ** 2
        ).reduce(add) / float(dataCount)
    )

In [35]:
training = sc.textFile('ml-10M100K/train60.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)

In [36]:
validation = sc.textFile('ml-10M100K/validation20.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)

In [37]:
test = sc.textFile('ml-10M100K/test20.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)

In [38]:
trainCount = training.count()
trainCount

6000000

In [39]:
validationCount = validation.count()
validationCount

2000000

In [40]:
testCount = test.count()
testCount

2000000

In [89]:
training.take(3)

[(37746, 3409, 0.5), (37746, 175, 0.5), (51778, 5430, 0.5)]

In [94]:
validation.take(3)

[(6352, 6787, 4.0), (26571, 1580, 4.0), (26571, 2115, 4.0)]

In [95]:
test.take(3)

[(5337, 296, 4.0), (5337, 307, 4.0), (32329, 3745, 4.0)]

### Train ALS model using different regularization parameter and latent factors

In [41]:
rank_list = [10, 20, 30, 40, 50] # latent factor
lamda_list = [0.01, 0.1, 1.0, 10.0] # regularization parameter
iterations = 10
chosenModel = None
smallestRMSE = 9999999

for rank in rank_list:
    for lamda in lamda_list:
        model = ALS.train(training, rank, iterations, lamda)
        rmse = compute_rmse(model, validation, validationCount)
        
        if rmse < smallestRMSE:
            smallestRMSE = rmse
            chosenModel = model

        print 'Rank={}, Lambda={}, RMSE={}'.format(rank, lamda, rmse)

Rank=10, Lambda=0.01, RMSE=1.06049775091
Rank=10, Lambda=0.1, RMSE=1.10979440812
Rank=10, Lambda=1.0, RMSE=1.95801505852
Rank=10, Lambda=10.0, RMSE=3.99960898089
Rank=20, Lambda=0.01, RMSE=1.07348413238
Rank=20, Lambda=0.1, RMSE=1.14287691247
Rank=20, Lambda=1.0, RMSE=1.95987810731
Rank=20, Lambda=10.0, RMSE=3.99960898089
Rank=30, Lambda=0.01, RMSE=1.08488028136
Rank=30, Lambda=0.1, RMSE=1.1275414034
Rank=30, Lambda=1.0, RMSE=1.95889910186
Rank=30, Lambda=10.0, RMSE=3.99960898089
Rank=40, Lambda=0.01, RMSE=1.08504936949
Rank=40, Lambda=0.1, RMSE=1.16572274978
Rank=40, Lambda=1.0, RMSE=1.95928678985
Rank=40, Lambda=10.0, RMSE=3.99960898089
Rank=50, Lambda=0.01, RMSE=1.10281105899
Rank=50, Lambda=0.1, RMSE=1.18677347337
Rank=50, Lambda=1.0, RMSE=1.95958789238
Rank=50, Lambda=10.0, RMSE=3.99960898089


In [61]:
chosenModel.save(sc, 'chosenModel')

In [63]:
print 'The smallest RMSE is:{0: .2f}'.format(smallestRMSE)

The smallest RMSE is: 1.06


### Use chosen model with test set

In [60]:
testRMSE = compute_rmse(chosenModel, test, testCount)
print 'Final error metric using test set ={0: .2f}'.format(testRMSE)

Final error metric using test set = 1.87


### Create ratings file that contains movie ratings for one user

In [138]:
user01Ratings = sc.textFile('ml-10M100K/ratings.dat')
user01Ratings = newRatings.filter(lambda x: x.split('::')[0] == '1') # userId == 1
if not exists('ml-10M100K/user01Ratings.dat'):
    user01Ratings.saveAsTextFile('ml-10M100K/user01Ratings.dat')

In [334]:
def generate_recommendations(model, ratingsFile, numRecommended=5):

    userMovies = sc.textFile(ratingsFile) \
        .filter(lambda x: x and len(x.split('::')) == 4) \
        .map(parse_rating) \
        .map(lambda x: x[1]).collect()

    # get all the rated films that the user has not seen yet
    moviesNotSeen = sc.textFile('ml-10M100K/ratings.dat')\
        .filter(lambda x: x and len(x.split('::')) == 4)\
        .map(parse_rating).map(lambda r: (r[1], 1)) \
        .reduceByKey(add).map(lambda r: r[0])\
        .filter(lambda r: r not in userMovies).collect()

    candidates = sc.parallelize(moviesNotSeen) \
                .map(lambda x: (x, 1)) \
                .cache()
            
    predictions = model.predictAll(candidates).collect()
    predictions = sorted(predictions, key=lambda x: x[2], reverse=True)[:numRecommended]

    movies = ''
    with open('ml-10M100K/movies.dat', 'r') as open_file:
        movies = {int(line.split('::')[0]): line.split('::')[1]
              for line in open_file
              if len(line.split('::')) == 3}

    recommendations = []
    for movieId, _, _ in predictions:
        if movieId in movies:
            recommendations.append(movies[movieId]) 
 
    return recommendations

In [335]:
ratingsFile = 'ml-10M100K/user01Ratings.dat'
generate_recommendations(chosenModel, ratingsFile)

['Last House on the Left, The (1972)',
 'Innocents, The (1961)',
 'Seed of Chucky (2004)',
 'Telling Lies in America (1997)',
 "My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993)"]

In [164]:
validation.take(3)

[(6352, 6787, 4.0), (26571, 1580, 4.0), (26571, 2115, 4.0)]

In [186]:
vshort = validation.map(lambda x: (x[0], x[1])) #userid and #movieId
vshort.take(3)

[(6352, 6787), (26571, 1580), (26571, 2115)]

In [184]:
pp = chosenModel.predictAll(ratings.map(lambda x: (x[0], x[1])))
pp.take(3)

[]

In [180]:
    ppR = \
        pp.map(lambda x: ((x[0], x[1]), x[2])) \
                   .join(validation.map(lambda x: ((x[0], x[1]), x[2]))) #\
                  # .values()

In [182]:
ppR.take(3)

[((48092, 4306), (3.6792822988476472, 4.0)),
 ((37891, 2567), (2.775702218689793, 4.0)),
 ((47658, 6), (3.1336700509820354, 4.0))]

In [167]:
help(pp.join)

Help on method join in module pyspark.rdd:

join(self, other, numPartitions=None) method of pyspark.rdd.RDD instance
    Return an RDD containing all pairs of elements with matching keys in
    C{self} and C{other}.
    
    Each pair of elements will be returned as a (k, (v1, v2)) tuple, where
    (k, v1) is in C{self} and (k, v2) is in C{other}.
    
    Performs a hash join across the cluster.
    
    >>> x = sc.parallelize([("a", 1), ("b", 4)])
    >>> y = sc.parallelize([("a", 2), ("a", 3)])
    >>> sorted(x.join(y).collect())
    [('a', (1, 2)), ('a', (1, 3))]



In [172]:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
sorted(x.join(y).collect())

[('a', 1), ('b', 4)]


[('a', (1, 2)), ('a', (1, 3))]

In [174]:
x = sc.parallelize([("a", 66, 1, 7), ("b", 50, 4, 8)])
y = sc.parallelize([("b", 77, 2, 9), ("a", 22, 3, 6)])
sorted(x.join(y).collect())

[('a', (66, 22)), ('b', (50, 77))]