# Assignment 4: Recommender System Using Collaborative Filtering (ALS)

# Datasets 
This project uses publicly available song data from Audioscrobbler, which can be found here (https://old.datahub.io/dataset/audioscrobbler). I modified the original data files so that the code runs in a reasonable time on a single machine. The reduced data files are suffixed with _small.txt and contain only the information relevant to the 50 most prolific users (those with the highest artist play counts).

# 1. Spark Initialization

In [2]:
import findspark
findspark.init()

Import the required packages

In [6]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating


Create the SparkContext that is used to load the data

In [7]:
from pyspark import SparkContext
sc = SparkContext()

# Load Data 
The three datasets are loaded into RDDs. Artist IDs in the user-artist data are then replaced by their canonical IDs from the alias map.

In [11]:
def parser(s, delimiter=" ", to_int=None):
    # Split a line on the delimiter; cast the fields whose index is in to_int to int
    s = s.split(delimiter)
    if to_int:
        return tuple([int(s[i]) if i in to_int else s[i] for i in range(len(s))])
    return tuple(s)
artistData = sc.textFile("artist_data_small.txt").map(lambda x: parser(x,'\t',[0]))
artistAlias = sc.textFile("artist_alias_small.txt").map(lambda x: parser(x,'\t', [0,1]))
artistAliasMap = artistAlias.collectAsMap()
userArtistData = sc.textFile("user_artist_data_small.txt").map(lambda x: parser(x,' ',[0,1,2]))
userArtistData = userArtistData.map(lambda x: (x[0], artistAliasMap.get(x[1], x[1]), x[2]))
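
To make the parsing and alias step concrete, here is a minimal pure-Python sketch that runs without Spark. The sample lines and the alias entry are made up for illustration; they are not records from the data files.

```python
def parser(s, delimiter=" ", to_int=None):
    # Same logic as the parser above: split, then cast selected fields to int
    fields = s.split(delimiter)
    if to_int:
        return tuple(int(f) if i in to_int else f for i, f in enumerate(fields))
    return tuple(fields)

# Hypothetical sample lines (not real records from the data files)
alias_lines = ["1001\t1000"]           # bad artist ID -> canonical artist ID
play_lines = ["7 1001 3", "7 1002 5"]  # user ID, artist ID, play count

alias_map = dict(parser(line, "\t", [0, 1]) for line in alias_lines)
plays = [parser(line, " ", [0, 1, 2]) for line in play_lines]

# Replace an artist ID by its canonical ID when an alias exists
resolved = [(u, alias_map.get(a, a), c) for (u, a, c) in plays]
print(resolved)  # [(7, 1000, 3), (7, 1002, 5)]
```

Artist 1001 has an alias entry and is rewritten to 1000, while artist 1002 has none and is kept unchanged.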


# Data Exploration 
Example: compute the users' total play counts. For the three users with the highest total play counts (the sum of all their play counters), print the user ID, the total play count, and the mean play count (the average number of times the user played an artist).

In [41]:
def summary(user_id):
    play_list = userArtistData.map(lambda x: (x[0], (x[1], x[2]))).lookup(user_id)
    total = sum(x[1] for x in play_list)
    d = "User %s has a total play count of %s and a mean play count of %s." % (user_id, total, total/len(play_list),)
    print(d)
summary(1059637)
summary(2064012)
summary(2069337)

User 1059637 has a total play count of 674412 and a mean play count of 1878.5849582172702.
User 2064012 has a total play count of 548427 and a mean play count of 9455.637931034482.
User 2069337 has a total play count of 393515 and a mean play count of 1519.3629343629343.
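
The three user IDs above are passed to summary by hand. A minimal sketch of how the top users could be ranked by total play count is shown below in plain Python. The `top_users` helper and the sample triples are hypothetical. On the RDD itself the same idea could presumably be expressed with `reduceByKey` and `takeOrdered`, but that is an assumption about one possible approach.

```python
from collections import defaultdict

def top_users(triples, n=3):
    # Accumulate [total play count, number of rows] per user
    totals = defaultdict(lambda: [0, 0])
    for user, _artist, count in triples:
        totals[user][0] += count
        totals[user][1] += 1
    # Rank users by total play count, highest first
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(user, total, total / rows) for user, (total, rows) in ranked[:n]]

# Hypothetical (user, artist, play count) triples
sample = [(1, 10, 5), (1, 11, 15), (2, 10, 50), (3, 12, 1), (4, 10, 8), (4, 13, 2)]
for user, total, mean in top_users(sample):
    print("User %s has a total play count of %s and a mean play count of %s." % (user, total, mean))
```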


# Split Data for Testing 
Use the randomSplit function to divide the data (userArtistData) into:

1. A training set, trainingData, that will be used to train the model. This set should constitute 40% of the data.
2. A validation set, validationData, used to perform parameter tuning. This set should constitute 40% of the data.
3. A test set, testData, used for a final evaluation of the model. This set should constitute 20% of the data.

In [20]:
trainingData, validationData, testData = userArtistData.randomSplit([40,40,20], 13)
trainingData.cache()
validationData.cache()
testData.cache()

PythonRDD[16] at RDD at PythonRDD.scala:52

In [31]:
print(trainingData.take(3))
print(validationData.take(3))
print(testData.take(3))
print(trainingData.count())
print(validationData.count())
print(testData.count())

[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]
[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]
[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]
19769
19690
10022
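
The weights [40, 40, 20] do not sum to 1; randomSplit normalizes them, so they behave like 0.4, 0.4, and 0.2. Because each record is assigned at random, the counts above (19769, 19690, 10022 out of 49481 rows) are close to, but not exactly, a 40/40/20 split. The sketch below imitates this behaviour in plain Python. `random_split` is a hypothetical helper that only approximates the idea; it is not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(items, weights, seed):
    # Normalize the weights and turn them into cumulative upper bounds
    total = float(sum(weights))
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        # Put the item into the first split whose bound exceeds the draw;
        # the last split also catches any floating-point rounding leftovers
        for idx, bound in enumerate(bounds):
            if draw < bound or idx == len(bounds) - 1:
                splits[idx].append(item)
                break
    return splits

train, validation, test = random_split(range(10000), [40, 40, 20], seed=13)
print(len(train), len(validation), len(test))
```

Every item lands in exactly one split, and the split sizes only approach the requested proportions as the number of items grows.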


# Build The Recommender Model
For this project, I train the model with implicit feedback, because a play count is an observation of user behaviour rather than an explicit rating. More information is available on the collaborative filtering page: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
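
With implicit feedback, a play count is not treated as a rating. In the usual formulation, each count r becomes a binary preference (1 if r > 0, otherwise 0) together with a confidence that grows with r. The sketch below assumes the common form c = 1 + alpha * r and uses alpha = 0.01, which I believe is the default of ALS.trainImplicit. Both helper functions are hypothetical and only illustrate the idea.

```python
def preference(count):
    # Binary preference: did the user play the artist at all?
    return 1 if count > 0 else 0

def confidence(count, alpha=0.01):
    # Assumed form: confidence grows linearly with the play count
    return 1.0 + alpha * count

# Play counts taken from the sample rows printed above
for count in [1, 2, 238]:
    print(count, preference(count), confidence(count))
```

Under this assumption, an artist played 238 times and an artist played once both count as "preferred", but the model trusts the first observation far more.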

In [32]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

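def cal_score(predict, actual):
    # Keep only as many top predictions as there are actual artists,
    # then return the fraction of actual artists that were predicted
    if len(actual) < len(predict):
        predict = predict[0:len(actual)]
    return len(list(set(predict) & set(actual)))*1.0/len(actual)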

def modelEval(model, dataset):
    # Find the list of all artists in the whole data set
    all_artists = userArtistData.map(lambda x: x[1]).distinct().collect()
    # Find the users in the input dataset
    test_user = dataset.map(lambda p: p[0]).distinct().collect()
    # Find the artists each user listened to in the training set and generate the test data
    testdata = trainingData.filter(lambda x: x[0] in test_user).map(lambda x: (x[0], x[1])).groupByKey()
    testdata = testdata.map(lambda x: (x[0], list(x[1])))
    testdata = testdata.flatMap(lambda x: [(x[0],a) for a in all_artists if a not in x[1]])
    # Find the artists each user listened to in the input dataset
    testdata_actual = dataset.map(lambda x: (x[0], x[1])).groupByKey().map(lambda x: (x[0], list(x[1]))).collectAsMap()
    predictions = model.predictAll(testdata).map(lambda x: (x[0], (x[1], x[2])))
    predictions = predictions.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda y: y[1], reverse=True)))
    predictions = predictions.map(lambda x: (x[0], cal_score([y[0] for y in x[1]], testdata_actual[x[0]])))
    return predictions.map(lambda x:x[1]).reduce(lambda x, y: x+ y) * 1.0 / len(test_user)
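
As a worked example of the metric, cal_score keeps only the top len(actual) predictions and returns the fraction of the actual artists found among them. The artist IDs below are made up.

```python
def cal_score(predict, actual):
    # Same logic as cal_score above
    if len(actual) < len(predict):
        predict = predict[0:len(actual)]
    return len(set(predict) & set(actual)) * 1.0 / len(actual)

predicted = [10, 20, 30, 40]   # ranked by predicted score, best first
actual = [20, 50]              # artists the user really played

# Only [10, 20] are kept; 20 is a hit, so the score is 1/2
print(cal_score(predicted, actual))  # 0.5
```

modelEval averages this per-user score over all users in the dataset being evaluated.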

In [39]:
training = trainingData.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
for r in [2, 10, 20]:
    model = ALS.trainImplicit(training, rank=r, seed=345)
    s = "The model score for rank %s is %s" % (r, modelEval(model, validationData))
    print(s)


The model score for rank 2 is 0.09365101274010958
The model score for rank 10 is 0.0948198831576256
The model score for rank 20 is 0.07628028287919196
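
Rank 10 gives the highest validation score, which is why it is used for the final model. A small sketch of making that choice programmatically, with the scores copied from the output above:

```python
# Validation scores copied from the output above
scores = {
    2: 0.09365101274010958,
    10: 0.0948198831576256,
    20: 0.07628028287919196,
}

# Pick the rank with the highest validation score
best_rank = max(scores, key=scores.get)
print("Best rank:", best_rank)  # Best rank: 10
```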


In [43]:
bestModel = ALS.trainImplicit(training, rank=10, seed=345)
print(modelEval(bestModel, testData))

0.059515526043292855


# Example: Trying Some Artist Recommendations
Using the best model above, predict the top 5 artists for user 1059637 with the recommendProducts function. Map the results (integer IDs) to the real artist names using artistData.

In [44]:
recommended = [r.product for r in bestModel.recommendProducts(1059637, 5)]
for i, artist in enumerate(recommended):
    a = "Artist %s: %s" % (i, artistData.lookup(artist)[0])
    print(a)

Artist 0: Something Corporate
Artist 1: My Chemical Romance
Artist 2: Alkaline Trio
Artist 3: The Used
Artist 4: Further Seems Forever
