## Music Recommendation
### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023


---

#### Instructions

In this assignment, you will prepare data and build an ALS recommendation algorithm based on user listening data from Autoscrobbler.

The data consists of: 
- user data (listeners)
- item data (songs)
- interaction data (user listens, which is implicit feedback).  

The code is outlined below. Make the requested modifications, run the code, and copy all answers to the **ANSWER SECTION** at the bottom of the notebook. Note the *None* variable is a placeholder for code.

**NOTE**: For a given userID, some/many recommendation might come back as $None$.  
This comes from artists not used in the training data.  
These should be filtered out using a list comprehension as follows:

`print([x for x in recommendationsForUser if x is not None])`

**TOTAL POINTS: 10**
***

In [1]:
# import modules
import os

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.mllib import recommendation
from pyspark.mllib.recommendation import *
import pandas as pd

In [2]:
# set configurations
conf = SparkConf().setMaster("local").setAppName("autoscrobbler")

In [3]:
# set context
sc = SparkContext.getOrCreate(conf=conf)

/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/10/11 18:49:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
# pathing and params
user_artist_data_file = 'user_artist_data.txt'
artist_data_file = 'artist_data.txt'
artist_alias_data_file  = 'artist_alias.txt'

numPartitions = 2
topk = 10

In [5]:
# read user_artist_data_file into RDD (417MB file, 24MM records of users’ plays of artists, along with count)
# specifically, each row holds: userID, artistID, count
rawDataRDD = sc.textFile(user_artist_data_file, numPartitions)
rawDataRDD.cache()

user_artist_data.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [6]:
# inspect some records
rawDataRDD.take(2)

[Stage 0:>                                                          (0 + 1) / 1]

23/10/11 18:49:47 WARN BlockManager: Task 0 already completed, not releasing lock for rdd_1_0


                                                                                

['1000002 1 55', '1000002 1000006 33']

In [43]:
# read artist_data_file using *textFile*
rawArtistRDD = sc.textFile(artist_data_file, numPartitions)
rawArtistRDD.cache()

artist_data.txt MapPartitionsRDD[32] at textFile at NativeMethodAccessorImpl.java:0

In [44]:
# inspect some records
rawArtistRDD.take(10)

23/10/11 19:09:51 WARN BlockManager: Task 37 already completed, not releasing lock for rdd_32_0


['1134999\t06Crazy Life',
 '6821360\tPang Nakarin',
 '10113088\tTerfel, Bartoli- Mozart: Don',
 '10151459\tThe Flaming Sidebur',
 '6826647\tBodenstandig 3000',
 '10186265\tJota Quest e Ivete Sangalo',
 '6828986\tToto_XX (1977',
 '10236364\tU.S Bombs -',
 '1135000\tartist formaly know as Mat',
 '10299728\tKassierer - Musik für beide Ohren']

In [10]:
# read artist_alias_data_file using *textFile*
aliasRDD = sc.textFile(artist_alias_data_file, numPartitions)

In [11]:
# inspect some records
aliasRDD.take(10)

['1092764\t1000311',
 '1095122\t1000557',
 '6708070\t1007267',
 '10088054\t1042317',
 '1195917\t1042317',
 '1112006\t1000557',
 '1187350\t1294511',
 '1116694\t1327092',
 '6793225\t1042317',
 '1079959\t1000557']

In [12]:
# 1) (1 PT) Print the first 10 records from rawDataRDD
rawDataRDD.take(10)

[Stage 4:>                                                          (0 + 1) / 1]

23/10/11 18:54:47 WARN BlockManager: Task 4 already completed, not releasing lock for rdd_1_0


                                                                                

['1000002 1 55',
 '1000002 1000006 33',
 '1000002 1000007 8',
 '1000002 1000009 144',
 '1000002 1000010 314',
 '1000002 1000013 8',
 '1000002 1000014 42',
 '1000002 1000017 69',
 '1000002 1000024 329',
 '1000002 1000025 1']

In [13]:
def parseArtistIdNamePair(singlePair):
   splitPair = singlePair.rsplit('\t')
   # we should have two items in the list - id and name of the artist.
   if len(splitPair) != 2:
       #print singlePair
       return []
   else:
       try:
           return [(int(splitPair[0]), splitPair[1])]
       except:
           return []


In [45]:
# 2) (1 PT) Apply parseArtistIdNamePair to rawArtistRDD, and print the first 10 records, showing only artist names
rawArtistRDD.map(lambda x: parseArtistIdNamePair(x)).map(lambda x: x[0][1]).take(10)

23/10/11 19:09:54 WARN BlockManager: Task 38 already completed, not releasing lock for rdd_32_0


['06Crazy Life',
 'Pang Nakarin',
 'Terfel, Bartoli- Mozart: Don',
 'The Flaming Sidebur',
 'Bodenstandig 3000',
 'Jota Quest e Ivete Sangalo',
 'Toto_XX (1977',
 'U.S Bombs -',
 'artist formaly know as Mat',
 'Kassierer - Musik für beide Ohren']

In [31]:
artistByID = dict(rawArtistRDD.flatMap(lambda x: parseArtistIdNamePair(x)).collect())
artist_vals = artistByID.values()
list(artist_vals)[:10]

                                                                                

['06Crazy Life',
 'Pang Nakarin',
 'Terfel, Bartoli- Mozart: Don',
 'The Flaming Sidebur',
 'Bodenstandig 3000',
 'Jota Quest e Ivete Sangalo',
 'Toto_XX (1977',
 'U.S Bombs -',
 'artist formaly know as Mat',
 'Kassierer - Musik für beide Ohren']

---

In [32]:
def parseArtistAlias(alias):
    splitPair = alias.rsplit('\t')
    # we should have two ids in the list.
    if len(splitPair) != 2:
        #print singlePair
        return []
    else:
        try:
            return [(int(splitPair[0]), int(splitPair[1]))]
        except:
            return []

In [34]:
artistAlias = aliasRDD.flatMap(lambda x: parseArtistAlias(x)).collectAsMap()

                                                                                

In [35]:
# turn the artistAlias into a broadcast variable.
# This will distribute it to worker nodes efficiently, so we save bandwidth.
artistAliasBroadcast = sc.broadcast( artistAlias )

In [36]:
artistAliasBroadcast.value.get(2097174)

1007797

In [37]:
# Print the number of records from the largest RDD, rawDataRDD
print( rawDataRDD.count() )



24296858


                                                                                

In [40]:
# Sample 10% of rawDataRDD (to reduce runtime) using seed 314. Call it sample.
seed = 314
weights = [.10,.9]
sample, _ = rawDataRDD.randomSplit(weights, seed)
sample.cache()

PythonRDD[28] at RDD at PythonRDD.scala:53

In [41]:
# take the first 5 records from the sample. each row represents userID, artistID, count.
sample.take(5)

                                                                                

['1000002 1000014 42',
 '1000002 1000088 157',
 '1000002 1000139 56',
 '1000002 1000140 95',
 '1000002 1000210 23']

In [60]:
# Based on sampled data, build the matrix for model training
def mapSingleObservation(x):
    # Returns Rating object represented as (user, product, rating) tuple.
    
    # [add line of code here to split each record into userID, artistID, count]
    x = x.split(" ")
    userID = x[0]
    artistID = x[1]
    count = x[2]
    # given possible aliasing, get finalArtistID
    finalArtistID = artistAliasBroadcast.value.get(artistID)
    if finalArtistID is None:
        finalArtistID = artistID
    return Rating(userID, finalArtistID, count)

In [61]:
trainData = sample.map(lambda x: mapSingleObservation(x))
trainData.cache()

PythonRDD[45] at RDD at PythonRDD.scala:53

In [63]:
# 3) (1 PT) Print the first 5 records from trainData
trainData.take(5)

[Rating(user=1000002, product=1000014, rating=42.0),
 Rating(user=1000002, product=1000088, rating=157.0),
 Rating(user=1000002, product=1000139, rating=56.0),
 Rating(user=1000002, product=1000140, rating=95.0),
 Rating(user=1000002, product=1000210, rating=23.0)]

In [64]:
# Train the ALS implicit model (since the measurements are activity and not ratings)
# using seed 314, rank 10, iterations 5, alpha 0.01
# import packages for RDD API
from pyspark.mllib import recommendation
from pyspark.mllib.recommendation import *

# Train the model
model = ALS.trainImplicit(trainData, rank=10, iterations=5, alpha=0.01, seed = 314)

                                                                                

In [69]:
# Model Evaluation

# fetch artists for a test user
testUserID = 1000002

# broadcast artistByID for speed
artistByIDBroadcast = sc.broadcast( artistByID )

# from trainData, collect the artists for the test user. Call the object artistsForUser.
# hint: you will need to apply .value.get(x.product) to the broadcast artistByID, where x is the Rating RDD.
# if you don't do this, you may see artistIDs. you want artist names.
artistsForUser = (trainData
                  .filter(lambda observation: observation.user == testUserID)
                  .map(lambda observation: artistByIDBroadcast.value.get(observation.product))
                  .collect())

                                                                                

In [71]:
# 4) (1 PT) Print the artist listens for testUserID = 1000002
print([x for x in artistsForUser if x is not None])

['Café Del Mar', 'Eric Clapton', 'Eurythmics']


In [82]:
# 5) (2 PTS) Make 10 recommendations for testUserID = 1000002
num_recomm = 600 # this filters down to 10 after filtering Nones
recommendationsForUser = map(lambda observation: artistByID.get(observation.product), model.call("recommendProducts", testUserID, num_recomm))

print([x for x in recommendationsForUser if x is not None])

['Eric Clapton', 'Dark Tranquillity', '植松伸夫', 'Scorpions', 'Enigma', 'Gary Jules', 'Eurythmics', 'Elvis Costello', 'Saliva', 'Nena']


In [83]:
# Train a second ALS model with seed 314, rank 20, iterations 5, lambda 0.01.
model2 = ALS.trainImplicit(trainData, rank=20, iterations=5, alpha=0.01, seed = 314)

                                                                                

In [84]:
# 6) (2 PTS) Using the rank 20 model, make 10 recommendations for the same test user
recommendationsForUser_rank20 = map(lambda observation: artistByID.get(observation.product), model2.call("recommendProducts", testUserID, num_recomm))
print([x for x in recommendationsForUser_rank20 if x is not None])

['Eric Clapton', 'Dark Tranquillity', 'Scorpions', 'Enigma', 'Eurythmics', '植松伸夫', 'Gary Jules', 'Hypocrisy', 'Elvis Costello', 'Nena', 'Joss Stone', 'Erasure', 'Echo & the Bunnymen', 'Saliva']


#### ANSWER SECTION (COPY ALL ANSWERS HERE)

In [9]:
# ANSWER 1 (1 PT)
# Print the first 10 records from rawDataRDD
rawDataRDD.take(10)

[Stage 2:>                                                          (0 + 1) / 1]

23/10/11 18:52:48 WARN BlockManager: Task 2 already completed, not releasing lock for rdd_1_0


                                                                                

['1000002 1 55',
 '1000002 1000006 33',
 '1000002 1000007 8',
 '1000002 1000009 144',
 '1000002 1000010 314',
 '1000002 1000013 8',
 '1000002 1000014 42',
 '1000002 1000017 69',
 '1000002 1000024 329',
 '1000002 1000025 1']

In [85]:
# ANSWER 2 (1 PT)
# Apply parseArtistIdNamePair to rawArtistRDD and print the first 10 records, showing only artist names
rawArtistRDD.map(lambda x: parseArtistIdNamePair(x)).map(lambda x: x[0][1]).take(10)    

23/10/11 19:41:48 WARN BlockManager: Task 496 already completed, not releasing lock for rdd_32_0


['06Crazy Life',
 'Pang Nakarin',
 'Terfel, Bartoli- Mozart: Don',
 'The Flaming Sidebur',
 'Bodenstandig 3000',
 'Jota Quest e Ivete Sangalo',
 'Toto_XX (1977',
 'U.S Bombs -',
 'artist formaly know as Mat',
 'Kassierer - Musik für beide Ohren']

In [86]:
# ANSWER 3 (1 PT)
# Print the first 5 records from trainData
trainData.take(5)

[Rating(user=1000002, product=1000014, rating=42.0),
 Rating(user=1000002, product=1000088, rating=157.0),
 Rating(user=1000002, product=1000139, rating=56.0),
 Rating(user=1000002, product=1000140, rating=95.0),
 Rating(user=1000002, product=1000210, rating=23.0)]

In [87]:
# ANSWER 4 (1 PT)
# Print the artist listens for testUserID = 1000002
print([x for x in artistsForUser if x is not None])

['Café Del Mar', 'Eric Clapton', 'Eurythmics']


In [88]:
# ANSWER 5 (2 PTS)
# Make 10 recommendations for testUserID = 1000002
num_recomm = 600 # this filters down to 10 after filtering Nones
recommendationsForUser = map(lambda observation: artistByID.get(observation.product), model.call("recommendProducts", testUserID, num_recomm))

print([x for x in recommendationsForUser if x is not None])

['Eric Clapton', 'Dark Tranquillity', '植松伸夫', 'Scorpions', 'Enigma', 'Gary Jules', 'Eurythmics', 'Elvis Costello', 'Saliva', 'Nena']


                                                                                

In [89]:
# ANSWER 6 (2 PTS)
# Using the rank 20 model, make 10 recommendations for testUserID = 1000002
recommendationsForUser_rank20 = map(lambda observation: artistByID.get(observation.product), model2.call("recommendProducts", testUserID, num_recomm))
print([x for x in recommendationsForUser_rank20 if x is not None])

['Eric Clapton', 'Dark Tranquillity', 'Scorpions', 'Enigma', 'Eurythmics', '植松伸夫', 'Gary Jules', 'Hypocrisy', 'Elvis Costello', 'Nena', 'Joss Stone', 'Erasure', 'Echo & the Bunnymen', 'Saliva']


# ANSWER 7 (2 PTS)
How does the rank 10 model seem to perform versus the rank 20 model?
The contents of artistsForUser may help answer the question.

The two models perform roughly the same. Most of the recommended artists in both of their recommendations are the same, the ordering is just a little different. Additionally, both models recommend the same two artists the user is already listening to, which indicates the models are both learning something. In this case, it doesn't seem like increasing the rank is all that helpful. 