# Recommendation System Datasets

This notebook uses the following datasets:

- [MovieLens 10M dataset](http://grouplens.org/datasets/movielens/10m/)
- [MovieLens 22M dataset](http://grouplens.org/datasets/movielens/latest/)
- [Million Song Dataset, Taste Profile subset](http://labrosa.ee.columbia.edu/millionsong/tasteprofile)

## Split dataset into 60-20-20 train-validate-test partitions

In [29]:
import os

def exists(filepath):
    return os.path.exists(filepath)

In [30]:
# show current files
!ls -l ml-10M100K/

total 1043656
-rw-r--r--@ 1 fnokeke  staff      11563 Jan 29 10:38 README.html
-rwxr-x---@ 1 fnokeke  staff        753 Jan  5  2009 allbut.pl
-rw-r--r--@ 1 fnokeke  staff     522197 Jan  5  2009 movies.dat
-rw-r--r--@ 1 fnokeke  staff  265105635 Jan  5  2009 ratings.dat
-rwxr-x---@ 1 fnokeke  staff       1304 Feb 16 10:06 split_ratings.sh
-rw-r--r--@ 1 fnokeke  staff    3584119 Jan  5  2009 tags.dat
-rw-r--r--  1 fnokeke  staff   51584300 Apr  6 11:43 test20.dat
-rw-r--r--  1 fnokeke  staff  161529860 Apr  6 11:43 train60.dat
-rw-r--r--  1 fnokeke  staff   51990078 Apr  6 11:43 validation20.dat


In [44]:
if (exists('ml-10M100K/train60.dat') and exists('ml-10M100K/validation20.dat') and exists('ml-10M100K/test20.dat')):
    print("Already created files: train60.dat, validation20.dat, test20.dat")

else:
    # sort numerically by timestamp; with ':' as the delimiter, each '::'
    # separator yields an empty field, so the timestamp is field 7, not field 4
    print("sorting file...")
    !sort -t ':' -k7,7n ml-10M100K/ratings.dat > ml-10M100K/new_ratings.dat
    print("sorting complete.")

    # split into chunks of 2 million lines: train (3 chunks), validation (1 chunk), test (1 chunk)
    print("splitting file...")
    !split -l 2000000 ml-10M100K/new_ratings.dat ff
    !cat ffaa ffab ffac > ml-10M100K/train60.dat
    !mv ffad ml-10M100K/validation20.dat
    !mv ffae ml-10M100K/test20.dat

    # remove tmp files used to create partitions, including any leftover chunk
    !rm ml-10M100K/new_ratings.dat ff*
    print("splitting complete.")
    print("Newly created files: train60.dat, validation20.dat, test20.dat")

Already created files: train60.dat, validation20.dat, test20.dat
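
The same chronological split can be sketched in plain Python, which makes the sort key explicit. This is a minimal sketch, not the notebook's method: `split_by_time` is a hypothetical helper, it holds every line in memory (the shell commands above scale better to the full file), and it assumes each line has the `userId::movieId::rating::timestamp` layout.

```python
def split_by_time(lines, train_frac=0.6, validation_frac=0.2):
    """Sort rating lines by timestamp and cut them into train/validation/test.

    Hypothetical helper: assumes each line is userId::movieId::rating::timestamp.
    """
    # The timestamp is the 4th '::'-separated field; compare it as an integer.
    ordered = sorted(lines, key=lambda line: int(line.strip().split("::")[3]))

    train_end = int(round(len(ordered) * train_frac))
    validation_end = train_end + int(round(len(ordered) * validation_frac))

    return (ordered[:train_end],                 # oldest ratings
            ordered[train_end:validation_end],   # middle slice
            ordered[validation_end:])            # newest ratings


# Made-up sample lines, for illustration only.
sample = [
    "1::10::4.0::500",
    "2::11::3.5::100",
    "3::12::5.0::300",
    "4::13::2.0::200",
    "5::14::1.0::400",
]
train, validation, test = split_by_time(sample)
```

Splitting by time keeps the newest ratings out of the training set, so validation and test scores reflect predictions about ratings made after the training data.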


# Using train data, learn ALS model

In [19]:
import contextlib
import itertools
from math import sqrt
from operator import add
import sys

from docopt import docopt
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import Rating


SPARK_EXECUTOR_MEMORY = '6g'
SPARK_APP_NAME = 'movieRecommender'
SPARK_MASTER = 'local'

In [80]:
@contextlib.contextmanager
def spark_manager():
    conf = SparkConf().setMaster(SPARK_MASTER) \
                      .setAppName(SPARK_APP_NAME) \
                      .set("spark.executor.memory", SPARK_EXECUTOR_MEMORY)
    spark_context = SparkContext(conf=conf)

    try:
        yield spark_context
    finally:
        spark_context.stop()



def parse_rating(line):
    """
    Parses a rating record that's in MovieLens format.
    
    :param str line: userId::movieId::rating::timestamp
    """
    fields = line.strip().split("::")

    return (int(fields[0]),   # User ID
            int(fields[1]),   # Movie ID
            float(fields[2])) # Rating

In [81]:
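context = spark_manager()
sc = context.__enter__()  # SparkContext used by later cells; context.__exit__(None, None, None) stops it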

In [82]:
training = sc.textFile('ml-10M100K/train60.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)

In [83]:
validation = sc.textFile('ml-10M100K/validation20.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)

In [84]:
test = sc.textFile('ml-10M100K/test20.dat') \
         .filter(lambda x: x and len(x.split('::')) == 4) \
         .map(parse_rating)
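
The filter-then-parse step used in the three cells above can be exercised on a handful of lines without Spark. The sketch below repeats `parse_rating` so that it runs on its own; `is_rating_line` is a hypothetical name for the lambda passed to `filter`, and the sample lines are made up.

```python
def parse_rating(line):
    """Parse userId::movieId::rating::timestamp into (user, movie, rating)."""
    fields = line.strip().split("::")
    return (int(fields[0]), int(fields[1]), float(fields[2]))


def is_rating_line(line):
    """Hypothetical name for the filter lambda: non-empty with exactly 4 fields."""
    return bool(line) and len(line.split("::")) == 4


lines = [
    "1::122::5::838985046",   # well-formed record
    "",                       # blank line, dropped by the filter
    "1::185::5",              # missing timestamp, dropped by the filter
]
ratings = [parse_rating(line) for line in lines if is_rating_line(line)]
print(ratings)
```

Only the well-formed record survives the filter, so `parse_rating` never sees a line with a missing field.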

In [85]:
trainCount = training.count()
trainCount

6000000

In [86]:
validationCount = validation.count()
validationCount

2000000

In [87]:
testCount = test.count()
testCount

2000000

In [88]:
help(ALS.train)

Help on method train in module pyspark.mllib.recommendation:

train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False, seed=None) method of __builtin__.type instance



### Meaning of parameters

- ***rank*** is the number of latent factors in the model.
- iterations is the number of iterations to run (default 5).
- ***lambda_*** specifies the regularization parameter in ALS (default 0.01); the trailing underscore avoids Python's `lambda` keyword.
- blocks is the number of blocks used to parallelize computation (set to -1, the default, to auto-configure).
- nonnegative specifies whether to constrain the factors to be nonnegative (default False).
- seed is the random seed used to initialize the factors (default None).
- `ALS.train` always uses the explicit feedback ALS variant; the variant adapted for implicit feedback data is `ALS.trainImplicit`, whose alpha parameter governs the baseline confidence in preference observations.
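
For reference, rank and lambda enter the explicit-feedback objective roughly as follows. This is the textbook form of regularized matrix factorization, stated here as an assumption about what ALS minimizes; Spark's implementation may scale the regularization term differently, so treat it as a sketch rather than the library's exact loss. The symbols are introduced for illustration: $r_{ui}$ is the observed rating of product $i$ by user $u$, $\mathcal{K}$ is the set of observed (user, product) pairs, and $x_u$, $y_i$ are the user and product factor vectors.

```latex
\min_{X, Y} \sum_{(u, i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right),
\qquad x_u, y_i \in \mathbb{R}^{k}
```

Here $k$ corresponds to rank and $\lambda$ to lambda. Each iteration fixes the product factors and solves a least-squares problem for the user factors, then does the reverse, which is where the name alternating least squares comes from.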


In [89]:
training.take(3)

[(37746, 3409, 0.5), (37746, 175, 0.5), (51778, 5430, 0.5)]

In [94]:
validation.take(3)

[(6352, 6787, 4.0), (26571, 1580, 4.0), (26571, 2115, 4.0)]

In [95]:
test.take(3)

[(5337, 296, 4.0), (5337, 307, 4.0), (32329, 3745, 4.0)]

### Train and Save ALS Model

In [91]:
model = ALS.train(training, rank=10, iterations=20, lambda_=0.01)

In [92]:
model.save(sc, 'myALSmodel')

### Try model with validation set (user-product pairs needed)

In [97]:
help(model.predictAll)

Help on method predictAll in module pyspark.mllib.recommendation:

predictAll(self, user_product) method of pyspark.mllib.recommendation.MatrixFactorizationModel instance
    Returns a list of predicted ratings for input user and product pairs.



In [98]:
validation_pair = validation.map(lambda x: (x[0], x[1]))
validation_pair.take(3)

[(6352, 6787), (26571, 1580), (26571, 2115)]

In [99]:
predictions = model.predictAll(validation_pair)
predictions.take(3)

[Rating(user=57436, product=1356, rating=2.9622857548829815),
 Rating(user=57436, product=648, rating=2.7055366122842517),
 Rating(user=57436, product=260, rating=2.618657352015365)]

In [100]:
predictions = predictions.map(lambda x: (x[0], x[1], x[2]))
predictions.take(3)

[(57436, 1356, 2.9622857548829815),
 (57436, 648, 2.7055366122842517),
 (57436, 260, 2.618657352015365)]

### Compute Mean Squared Error

In [102]:
# Key both RDDs by (user, product) so the join pairs each validation rating
# with the prediction for the same user and movie. Joining the raw 3-tuples
# would treat the user ID as the key and the movie ID as the value.
actual = validation.map(lambda r: ((r[0], r[1]), r[2]))
predicted = predictions.map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = actual.join(predicted)
ratesAndPreds.take(3)

In [106]:
# squared error per (user, product) pair: (actual - predicted)^2
squaredErrors = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2)
squaredErrors.take(10)

In [108]:
sumSquaredErrors = squaredErrors.reduce(add)

In [109]:
numPairs = squaredErrors.count()

In [110]:
# RMSE = sqrt(sum / count); the square root applies to the mean, not to the sum alone
RMSE = sqrt(sumSquaredErrors / float(numPairs))
print("Root Mean Squared Error = " + str(RMSE))
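
The metric can be checked on a few hand-made records without Spark. The sketch below is illustrative only: the ratings are made up, `rmse` is a hypothetical helper, and plain dictionaries keyed by (user, product) stand in for the RDD join.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error over the (user, product) keys present in both inputs.

    Hypothetical helper: both arguments are lists of (user, product, rating).
    """
    actual_by_key = {(u, p): r for (u, p, r) in actual}
    predicted_by_key = {(u, p): r for (u, p, r) in predicted}

    # Inner join on (user, product), like RDD.join on keyed pairs.
    shared = set(actual_by_key) & set(predicted_by_key)
    squared_errors = [(actual_by_key[k] - predicted_by_key[k]) ** 2 for k in shared]

    # Mean first, then square root: sqrt(sum / count).
    return sqrt(sum(squared_errors) / float(len(squared_errors)))


# Made-up ratings, for illustration only.
actual = [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0)]
predicted = [(1, 10, 3.0), (1, 11, 4.0)]   # no prediction for (2, 10)
error = rmse(actual, predicted)
```

The two shared pairs have squared errors 1.0 and 4.0, so the result is the square root of their mean, 2.5. The pair without a prediction drops out of the join, and the helper raises `ZeroDivisionError` when no pairs are shared.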


# Using validation data, choose different regularization parameters with different latent factors
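
One way to organize this search is a grid over rank and lambda, keeping the pair with the lowest validation RMSE. The sketch below is an assumption about how the step might look, not the notebook's implementation: `grid_search`, `train_fn`, and `rmse_fn` are hypothetical names, and the stand-in functions only imitate training so that the loop runs without Spark. With Spark, `train_fn` could wrap `ALS.train` on `training` and `rmse_fn` could score the resulting model on `validation`.

```python
import itertools


def grid_search(ranks, lambdas, train_fn, rmse_fn):
    """Return (best_rmse, best_rank, best_lambda) over all rank/lambda pairs.

    Hypothetical helper: train_fn(rank, lambda_) returns a model and
    rmse_fn(model) returns that model's validation RMSE.
    """
    best = None
    for rank, lambda_ in itertools.product(ranks, lambdas):
        model = train_fn(rank, lambda_)
        score = rmse_fn(model)
        if best is None or score < best[0]:
            best = (score, rank, lambda_)
    return best


# Stand-ins so the loop runs without Spark: the "model" is just its settings
# and the "RMSE" is a made-up function with its minimum at rank 12, lambda 0.1.
def fake_train(rank, lambda_):
    return (rank, lambda_)


def fake_rmse(model):
    rank, lambda_ = model
    return abs(rank - 12) + abs(lambda_ - 0.1)


best_rmse, best_rank, best_lambda = grid_search(
    [8, 12], [0.01, 0.1, 1.0], fake_train, fake_rmse)
```

Scoring every candidate on the validation set, and never on the test set, keeps the final test error an honest estimate for the chosen model.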

# Using test data, test chosen model and report metric error

# Use ALS model and ratings file to return output of predicted recommendations

In [48]:
def generate_recommendations(modelALS, ratings):
    prediction = []
    return prediction
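
One possible shape for the output is a top-N list per user, built from predicted ratings. The sketch below is an assumption about the intended behaviour, not the author's implementation: `top_n_per_user` is a hypothetical helper that takes plain `(user, product, rating)` tuples, such as those collected from `model.predictAll`, and the sample predictions are made up.

```python
from collections import defaultdict


def top_n_per_user(predictions, n=2):
    """Group (user, product, rating) tuples by user and keep the n highest-rated products."""
    by_user = defaultdict(list)
    for user, product, rating in predictions:
        by_user[user].append((product, rating))

    # Sort each user's products by predicted rating, highest first.
    return {
        user: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for user, items in by_user.items()
    }


# Made-up predictions, for illustration only.
sample_predictions = [
    (1, 10, 3.5), (1, 11, 4.8), (1, 12, 2.1),
    (2, 10, 4.0),
]
recommendations = top_n_per_user(sample_predictions)
```

A user with fewer than `n` predictions simply gets a shorter list, as user 2 does here.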