# Recommending Music using the AudioScrobbler Dataset

This notebook builds a music recommendation system with Spark MLlib, using a dataset published by AudioScrobbler. The data is about 500 MB uncompressed and can be downloaded from http://www-etud.iro.umontreal.ca/%7Ebergstrj/audioscrobbler_data.html

This is based on an exercise in the awesome book [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do). It is intended as a learning exercise, not a production-ready application. Chapter 3 of the book, "Recommending Music and the Audioscrobbler Data Set", introduces readers to the Alternating Least Squares (ALS) recommender algorithm that Spark MLlib supplies.

## Technology Stack:

* Spark MLlib
* Scala

## Preparing Data

### Importing Data

In [1]:
val uadatapath="hdfs:///user/swethakolalapudi/audio/user_artist_data.txt"
val rawUserArtistData = sc.textFile(uadatapath)

Each line of the file contains a user ID, an artist ID, and a play count, separated by spaces. To compute statistics on the play count, we split each line on the space character and parse the third field (index 2) as a number. The stats() method returns an object containing summary statistics such as the count, mean, standard deviation, maximum, and minimum.

In [9]:
rawUserArtistData.map(x => x.split(" ")(2).toDouble).stats()

(count: 24296858, mean: 15.295762, stdev: 153.915321, max: 439771.000000, min: 1.000000)

The printed statistics show that the maximum play count is 439771 while the mean is only around 15, across roughly 24 million entries. A maximum that far above the mean suggests it could be an outlier.
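
To make the meaning of these numbers concrete, here is a minimal plain-Scala sketch that computes the same kind of summary without Spark. The sample lines are hypothetical, and the standard deviation is the population form (dividing by n), which to my understanding is what the stdev value in the stats() output represents.

```scala
// Hypothetical sample lines in the "userID artistID playCount" format.
val sampleLines = Seq(
  "1 10 2", "1 11 4", "2 10 4", "2 12 4",
  "3 13 5", "3 14 5", "4 15 7", "4 16 9"
)

// The third space-separated field (index 2) is the play count.
val playCounts = sampleLines.map(_.split(" ")(2).toDouble)

val n = playCounts.size
val mean = playCounts.sum / n
// Population standard deviation: divide the squared deviations by n.
val stdev = math.sqrt(playCounts.map(c => math.pow(c - mean, 2)).sum / n)
val maxCount = playCounts.max
val minCount = playCounts.min
```

On the real data, comparing the maximum against the mean and standard deviation in this way is what hints that 439771 is far outside the typical range.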

In [2]:
import org.apache.spark.mllib.recommendation._

Some of the ratings may be just noise, as the user may have listened to an artist only a few times to give it a try. So I will filter the ratings to keep only the strong ones. This also reduces the number of records, which helps because ALS is computationally expensive.

Create an RDD of Rating objects, dropping entries with a play count below 20. Since ALS will reuse this RDD many times, it is better to persist it.

In [4]:
val uaData=rawUserArtistData.map(_.split(" ")).filter(_(2).toInt>=20).map(x => Rating(x(0).toInt,x(1).toInt,x(2).toInt))
uaData.persist()

MapPartitionsRDD[5] at map at <console>:27
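
The parse-and-filter step can also be tried out on a few lines without Spark. The sketch below is only an illustration: PlayRating is a hypothetical stand-in for MLlib's Rating, parseRating and strongRatings are made-up helper names, and the sample lines are invented. Unlike the cell above, it skips malformed lines instead of throwing an exception.

```scala
import scala.util.Try

// Hypothetical stand-in for MLlib's Rating(user, product, rating), so this runs without Spark.
case class PlayRating(user: Int, product: Int, rating: Double)

// Parse one "userID artistID playCount" line; returns None for malformed input.
def parseRating(line: String): Option[PlayRating] =
  line.split(" ") match {
    case Array(u, a, c) => Try(PlayRating(u.toInt, a.toInt, c.toDouble)).toOption
    case _              => None
  }

// Keep only ratings at or above the play-count threshold (20 in this notebook).
def strongRatings(lines: Seq[String], minPlays: Int = 20): Seq[PlayRating] =
  lines.flatMap(parseRating).filter(_.rating >= minPlays)

// Invented sample: two strong ratings, one weak rating, one malformed line.
val sample = Seq("1000002 1 55", "1000002 1000006 33", "1000002 1000007 8", "bad line")
val kept = strongRatings(sample)
```

Here kept holds only the first two entries: the third falls below the threshold and the fourth does not have three fields.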

## Training

ALS has two training methods: train and trainImplicit. Since our ratings are implicit (play counts rather than explicit scores), I use the trainImplicit method.

In [6]:
val model=ALS.trainImplicit(uaData,10,5,0.01,1)
// 10 is the number of hidden factors that ALS should look for
// 5 is the max number of iterations ALS should go through
// lambda and alpha are set to 0.01 and 1
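
As a rough guide to what alpha does: in the implicit-feedback formulation that MLlib's documentation refers to, a play count is not treated as a score to predict but as evidence of how confident we are that the user likes the artist. The sketch below assumes the commonly described linear form, confidence = 1 + alpha * playCount. The function names are hypothetical, and the exact internals of trainImplicit may differ.

```scala
// Preference: 1 if the user played the artist at all, 0 otherwise.
def preference(playCount: Double): Double =
  if (playCount > 0) 1.0 else 0.0

// Confidence in that preference, assumed here to grow linearly with the play count.
// alpha corresponds to the last argument passed to trainImplicit (1 in this notebook).
def confidence(playCount: Double, alpha: Double): Double =
  1.0 + alpha * playCount
```

Under this assumption, a play count of 20 with alpha = 1 gives a confidence of 21, while an artist the user never played keeps the baseline confidence of 1 with a preference of 0.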

The model returned by ALS is a MatrixFactorizationModel, which has a recommendProducts method. This method takes the user ID and the number of recommendations wanted for that user.

## Getting Recommendations for a user id

In [14]:
var user: Int=1000002
var recommendations=model.recommendProducts(user,5)

recommendations is an Array of Rating objects.

In [8]:
recommendations

Array(Rating(1000002,1270,1.185725442054678), Rating(1000002,1000188,1.1677245029656391), Rating(1000002,1205,1.1586795057541834), Rating(1000002,1428,1.1576052790420486), Rating(1000002,82,1.1389863112439214))
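
The third field of each Rating is a score used for ranking, not a predicted play count. In a matrix factorization model, that score is the dot product of the user's factor vector and the artist's factor vector, and the top results are the artists with the highest scores. The sketch below illustrates the idea in plain Scala with made-up factor vectors and artist IDs (rank 3 here, whereas the model above uses rank 10). It is not the MLlib implementation.

```scala
// Hypothetical latent factors for one user and three artists.
val userFactors: Array[Double] = Array(0.5, 1.0, -0.5)
val artistFactors: Map[Int, Array[Double]] = Map(
  101 -> Array(1.0, 1.0, 0.0), // score 1.5
  102 -> Array(0.0, 0.5, 1.0), // score 0.0
  103 -> Array(2.0, 0.0, 0.0)  // score 1.0
)

// Dot product of two equally sized vectors.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Score every artist for the user and keep the n highest-scoring ones.
def topN(n: Int): Seq[(Int, Double)] =
  artistFactors.toSeq
    .map { case (id, factors) => (id, dot(userFactors, factors)) }
    .sortBy { case (_, score) => -score }
    .take(n)
```

Calling topN(2) on these made-up vectors ranks artist 101 first and artist 103 second.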

## Evaluating the recommendations made

Import the file artist_data.txt to build a lookup from artist ID to artist name.

In [24]:
val artistsPath="hdfs:///user/swethakolalapudi/audio/artist_data.txt"
// We need to make sure that array has 2 elements
val artistLookup=sc.textFile(artistsPath).map(_.split("\t")).filter(_.length==2).map(x => (x(0),x(1)))
artistLookup.persist()

MapPartitionsRDD[167] at map at <console>:25
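
Calling lookup on an RDD runs a Spark job for every ID. If the ID-to-name table fits in driver memory, a common alternative is to collect it once into a local Map (for example with collectAsMap on the pair RDD) and look names up there. The plain-Scala sketch below shows the same parse-and-filter logic on a few sample lines. The two IDs and names match the recommendation output in this notebook, the malformed line is invented, and artistName is a hypothetical helper.

```scala
// Sample lines in the "artistID<TAB>name" format; the last line is malformed.
val artistLines = Seq("1270\tQueen", "82\tPink Floyd", "broken-line-without-tab")

// Keep only lines that split into exactly two fields, as the notebook cell does.
val artistNames: Map[String, String] =
  artistLines
    .map(_.split("\t"))
    .collect { case Array(id, name) => id -> name }
    .toMap

// Safe lookup that returns a placeholder instead of throwing when an ID is missing.
def artistName(id: String): String =
  artistNames.getOrElse(id, "<unknown>")
```

A placeholder for missing IDs avoids the exception that indexing an empty lookup result with (0) would raise.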

Get the IDs of the artists that this user has played more than 50 times.

In [19]:
val userArtists=rawUserArtistData.map(_.split(" ")).filter{case Array(userId,_,rating) => (userId.toInt == user) && (rating.toInt>50)}.map(_(1)).collect()

Print the list of artists the user already prefers.

In [27]:
for (artist <- userArtists){
     println( artistLookup.lookup(artist)(0))}

Portishead
A Perfect Circle
Aerosmith
Judas Priest
Metallica
Foo Fighters
Counting Crows
Creed
Audioslave
Muse
(hed) Planet Earth
Dire Straits
Free
Fun Lovin' Criminals
Guns N' Roses
Satriani, Joe
A
Joe Satriani
Bruce Springsteen
Goo Goo Dolls
Fugees
Michael Jackson
Roachford
Barenaked Ladies
Buckcherry
Jools Holland
The Classic Chill Out Album
Frankie Goes To Hollywood
King's X
Mr. Big
Dave Weckl
Dan Reed Network
Liquid Tension Experiment
Level 42
Rage Against the Machine
Badly Drawn Boy
Beth Orton
Dido
Lenny Kravitz
Everclear
Feeder
Jimi Hendrix
Red Hot Chili Peppers
R.E.M.
Desert Sessions
The Kleptones
Jamiroquai
Led Zeppelin
Marcus Miller
Moby
Miles Davis
Electric Wizard
Matchbox Twenty
The Police
Nina Simone
Jeff Buckley
Dream Theater
Eels
Nickelback
Diana Krall
The Jimi Hendrix Experience
Pink
Rammstein
Norah Jones
Ben Folds Five
Radiohead


The user seems to have a preference for Rock music.

In [28]:
for (rating <- recommendations){
     println( artistLookup.lookup(rating.product.toString)(0))}

Queen
Dire Straits
U2
Eric Clapton
Pink Floyd


The recommended artists also seem to belong to the Rock genre. By this informal check, the recommendations are up to the mark.
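
One simple follow-up check is to see which recommended artists the user already listens to. The sketch below uses the five recommended names printed above and a small subset of the user's existing artists, hard-coded as plain Scala collections. The variable names are hypothetical.

```scala
// A subset of the artists the user already plays (taken from the printed list above).
val listened = Set("Dire Straits", "Led Zeppelin", "Jimi Hendrix", "Metallica")

// The five recommended artists printed above.
val recommended = Seq("Queen", "Dire Straits", "U2", "Eric Clapton", "Pink Floyd")

// Recommendations the user already knows versus genuinely new suggestions.
val alreadyKnown = recommended.filter(listened.contains)
val novel = recommended.filterNot(listened.contains)
```

Here Dire Straits appears in both lists, so four of the five recommendations are new to this user.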