# FIT5202 : Music Recommendation using Collaborative Filtering

Collaborative filtering (CF) is a technique commonly used to build personalized recommendations on the Web. Some popular websites that make use of the collaborative filtering technology include Amazon, Netflix, iTunes, IMDB, LastFM, Delicious and StumbleUpon. In collaborative filtering, algorithms are used to make automatic predictions about a user's interests by compiling preferences from several users.

In this lab, our task is to use a collaborative algorithm to recommend top artists from the given dataset. The dataset can be downloaded from Moodle. 
<p style="color:red">Complete the required tasks in the tutorial. The activited is denoted as "Task" with the required instructions</p>
<br/>

## Table of Contents

* [ALS Lecture Demo](#als-demo)
* [Use-Case Music Recommendation](#use-case)
    * [Data Loading](#data-loading)
    * [Data Preparation](#data-prep)
    * [Data Exploration](#data-exploration)
    * [Train-Test Split](#train-test-split)
    * [Model Building](#model-building)
    * [Evaluation](#evaluation)
    * [Hyperparameter Tuning and Cross Validation](#cv)
    * [Making Predictions](#predictions)    
* [Lab Tasks](#lab-task-1)
    * [Lab Task 1](#lab-task-1)
    * [Lab Task 2](#lab-task-2)
    * [Lab Task 3](#lab-task-3)
    * [Lab Task 4](#lab-task-4)
    * [Lab Task 5](#lab-task-5)
    * [Lab Task 6](#lab-task-6)
    * [Lab Task 7](#lab-task-7)
    
## Including Libraries and Initializing Spark Context

2691611809081_.pic_hd.jpg![image.png](attachment:image.png)

In [1]:
#import libraries
from pyspark import SparkContext
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession ,Row
from pyspark.sql.functions import col,split
from pyspark.sql.types import StructType,StructField,IntegerType


appName="Collaborative Filtering with PySpark"
#initialize the spark session
spark = SparkSession.builder.appName(appName).getOrCreate()
#get sparkcontext from the sparksession
sc = spark.sparkContext

## Alternating Least Squares DEMO <a class="anchor" name="als-demo"></a>

Please go through the ALS Demo presented in the Lecture to understand the basic flow before starting with the lab tasks.

## Use-Case : Music Recommendation <a class="anchor" name="use-case"></a>
The goal here is to use the data provided to create a recommendation system using collaborative filtering using the social influence data and predict artists a user might like but have not listened to.

> Consider the following example. If user A is a neighbor of user B, and they have similar musical tastes, then there is a very strong tie between them. If user B is a big fan of artist C, and has scrobbled them numerous times, then there is also a strong tie between them. Based on last.fm’s data, user A has not yet listened to artist C (no link has formed between them yet), and there is a good chance that user A will also like artist C.<a href="https://blogs.cornell.edu/info2040/2012/09/20/last-fm-music-reccomendation-incorporating-social-network-ties-and-collaborative-filtering/#:~:text=their%20listening%20frequency.-,Last.,in%20the%20user's%20local%20network." target="_BLANK">Ref</a>


The original dataset is available at <a href="https://www.last.fm/api/" target="_blank">last.fm api</a>. The dataset provided here is a lighter version, resized for the sake of simplicity. The dataset contains three files as follows:
<ul>
    <li><strong>user_artist_data.txt</strong>
        3 columns: <code>userid, artistid, playcount</code></li>
    <li><strong>artist_data.txt</strong>
        2 columns: <code>artistid ,artist_name</code></li>
    <li><strong>artist_alias.txt</strong>
        2 columns: <code>badid, goodid</code>
        [known incorrectly spelt artists and the correct artist id].</li>
</ul>


<a class="anchor" id="lab-task-1"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">1. Lab Task: </strong> 
Import the two other files (user_artist_data.txt and artist_alias.txt) to create two dataframes <code>df_artist_alias</code> and <code>df_user_artist</code>
    
<strong style="color:red">NOTE:</strong> Check the <strong>delimiter</strong> used in these files. <code>\t</code> may not be used for all files.
</div>.




## Data Loading <a class="anchor" name="data-loading"></a>

In [3]:
df = spark.read.text("artist_data.txt")
split_col = split(df['value'], '\t')
df = df.withColumn('artist_id', split_col.getItem(0))
df = df.withColumn('artist_name', split_col.getItem(1))
df_artist=df.drop('value')

In [6]:
df_artist.show(3)

+---------+--------------------+
|artist_id|         artist_name|
+---------+--------------------+
|  1240105|        André Visior|
|  1240113|           riow arai|
|  1240132|Outkast & Rage Ag...|
+---------+--------------------+
only showing top 3 rows



In [7]:
#Load user_artist_data.txt to a dataframe called df_user_artist
df = spark.read.text("user_artist_data.txt")
split_col = split(df['value'], ' ')
df = df.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('artist_id', split_col.getItem(1))
df = df.withColumn('playcount', split_col.getItem(2))
df_user_artist = df.drop('value')

In [9]:
df_user_artist.toPandas().head()

Unnamed: 0,user_id,artist_id,playcount
0,1059637,1000010,238
1,1059637,1000049,1
2,1059637,1000056,1
3,1059637,1000062,11
4,1059637,1000094,1


In [11]:
#Load artist_alias.txt to a dataframe called df_artist_alias
df = spark.read.text("artist_alias.txt")
split_col = split(df['value'], '\t')
df = df.withColumn('bad_id', split_col.getItem(0))
df = df.withColumn('good_id', split_col.getItem(1))
df_artist_alias = df.drop("value")

In [13]:
df_artist_alias.take(5)

[Row(bad_id='1027859', good_id='1252408'),
 Row(bad_id='1017615', good_id='668'),
 Row(bad_id='6745885', good_id='1268522'),
 Row(bad_id='1018110', good_id='1018110'),
 Row(bad_id='1014609', good_id='1014609')]

## Data Preparation <a class="anchor" name="data-prep"></a>

The <code>df_user_artist</code> contains <strong>bad ids</strong>, so the <strong>bad ids</strong> in the <code>user_artist_data.txt</code> file need to be remapped to <strong>goodids</strong>. The mapping of <strong>bad_ids</strong> to <strong>good_ids</strong> is in <strong>artist_alias.txt</strong> file. The first task is to create a dictionary of the artist_alias, so that it can be passed over a <strong>broadcast variable</strong>.

Broadcast makes Spark send and hold in memory just one copy for each executor in the cluster. When there are thousands of tasks, and many execute in parallel on each executor, this can save significant network traffic and memory.
But you cannot directly broadcast a dataframe, it has to be converted to a list first

In [92]:
#After loading the artist_alias data to the dataframe,
#it is converted to a dictionary to be set as a broadcast variable
artist_alias = dict(df_artist_alias.collect())
# artist_alias

<a class="anchor" id="lab-task-2"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">2. Lab Task: </strong> 
For the dictionary <strong>artist_alias</strong> which contains key value pair of badid and goodid, create a broadcast variable called <strong>bArtistAlias</strong>.
</div>

In [24]:
#Write the code below to create a broadcast variable bArtistAlias
bArtistAlias = sc.broadcast(artist_alias)

In [27]:
print(bArtistAlias)

<pyspark.broadcast.Broadcast object at 0x7fe479096160>


After the broadcast variable is created, a function to replace the badids by looking up the values from the broadcasted dictionary is implemented for the userArtistRDD.


In [28]:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType
def lookup_correct_id(artist_id):  
    finalArtistID = bArtistAlias.value.get(artist_id)
    if finalArtistID is None:
        finalArtistID = artist_id
    return finalArtistID    

lookup_udf = udf(lookup_correct_id, StringType())

In [97]:
#Row(bad_id='1017615', good_id='668')
f1 = bArtistAlias.value.get('1017615')
print(f1)

668


<a class="anchor" id="lab-task-3"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">3. Lab Task: </strong> 
    Use the udf <code>lookup_udf</code> to replace the "badids" in the <code>df_user_artist</code> dataframe.
</div>

In [34]:
#Write your code below
newColumn = lookup_udf(df_user_artist.user_id)
df_user_artist = df_user_artist.withColumn('user_id', newColumn)

In [35]:
df_user_artist.take(5)

[Row(user_id='1059637', artist_id='1000010', playcount='238'),
 Row(user_id='1059637', artist_id='1000049', playcount='1'),
 Row(user_id='1059637', artist_id='1000056', playcount='1'),
 Row(user_id='1059637', artist_id='1000062', playcount='11'),
 Row(user_id='1059637', artist_id='1000094', playcount='1')]

When we want to repeteadly access a dataframe or an RDD, it is a good idea to cache them, it helps to speed up applications.

In [36]:
#Uncomment this to use caching
df_user_artist.cache()

DataFrame[user_id: string, artist_id: string, playcount: string]

## Data Exploration <a class="anchor" name="data-exploration"></a>

<a class="anchor" id="lab-task-4"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">4. Lab Task: </strong> 
    Write a query in the function below, to return top <strong>N</strong> artist for a user with user_id : <code>2062243</code>. You will need to join <code>user</code> and <code>user_artist</code> datasets on the common key <code>artist_id</code>. The sample output is given below.
</div>



In [51]:
def top_n_artists(artist,user_artist,user_id,limit):
    '''Returns top n artists liked by a particular user'''
    '''Takes artist,user_artist, user_id and limit as input'''
    #WRITE THE QUERY HERE
    df = user_artist.join(artist,  user_artist.artist_id == artist.artist_id, how ='inner')\
            .select('user_id', 'playcount', 'artist_name')\
            .withColumn('playcount', col('playcount').cast(IntegerType()))\
            .filter(col('user_id') == user_id)\
            .sort('playcount', ascending = False)\
            .limit(limit)
    return df  

top_n_artists(df_artist,df_user_artist,2062243,60).show(truncate=False)

+-------+---------+----------------------+
|user_id|playcount|artist_name           |
+-------+---------+----------------------+
|2062243|26107    |Music 205             |
|2062243|10314    |Mos Def               |
|2062243|7193     |Morrissey             |
|2062243|6652     |Modest Mouse          |
|2062243|4913     |Mouse on Mars         |
|2062243|3983     |The Movielife         |
|2062243|3658     |The Beatles           |
|2062243|3354     |Led Zeppelin          |
|2062243|2843     |Mogwai                |
|2062243|1888     |Queen                 |
|2062243|1718     |Radiohead             |
|2062243|1706     |Motion City Soundtrack|
|2062243|1700     |Talib Kweli           |
|2062243|1470     |Mudvayne              |
|2062243|1359     |Kanye West            |
|2062243|1348     |Jackson Browne        |
|2062243|1257     |moe.                  |
|2062243|1257     |Bob Dylan             |
|2062243|1147     |David Bowie           |
|2062243|1079     |The Killers           |
+-------+--

Here we want to convert data types to integer type where required.

In [50]:
df_user_artist.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- artist_id: string (nullable = true)
 |-- playcount: string (nullable = true)



In [52]:
#Cast the data column into integer types
for col_name in df_user_artist.columns:
    df_user_artist = df_user_artist.withColumn(col_name, df_user_artist[col_name].cast(IntegerType()))

df_artist = df_artist.withColumn('artist_id', df_artist['artist_id'].cast(IntegerType()))

In [53]:
df_user_artist.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- playcount: integer (nullable = true)



In [54]:
df_artist.printSchema()

root
 |-- artist_id: integer (nullable = true)
 |-- artist_name: string (nullable = true)



## Train Test Split <a class="anchor" name="train-test-split"></a>

<a class="anchor" id="lab-task-5"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">5. Lab Task: </strong> 
Create training and testing dataset with a 80/20 split
</div>

In [102]:
#Write your code here
train, test = df_user_artist.randomSplit([0.8, 0.2], seed=10000)

## Model Building <a href="https://spark.apache.org/docs/latest/ml-collaborative-filtering.html" target="_blank">[REF]</a> <a class="anchor" name="model-building"></a>
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:

- <strong>numBlocks</strong> is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
- <strong>rank</strong> is the number of latent factors in the model (defaults to 10).
- <strong>maxIter</strong> is the maximum number of iterations to run (defaults to 10).
- <strong>regParam</strong> specifies the regularization parameter in ALS (defaults to 1.0).
- <strong>implicitPrefs</strong> specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback).
- <strong>alpha</strong> is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).
- <strong>nonnegative</strong> specifies whether or not to use nonnegative constraints for least squares (defaults to false).

In [55]:
als = ALS(maxIter=5, implicitPrefs=True, alpha=40,userCol="user_id", itemCol="artist_id", ratingCol="playcount",
          coldStartStrategy="drop")

<a class="anchor" id="lab-task-6"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">6. Lab Task: </strong> 
Perform the following tasks.
    <ul><li>Train the model with the training set created from above.</li><li> Then transform use the test data to get the predictions. </li><li>Display the first 20 predictions from the results.</li></ul>    
<i>The predictions shown below will be just indicator of how closely a given artist will be to the user's existing preferences</i>
</div>

In [103]:
#Write your code below
model = als.fit(train)
predictions = model.transform(test)
predictions.show(10)

+-------+---------+---------+-----------+
|user_id|artist_id|playcount| prediction|
+-------+---------+---------+-----------+
|1059334|      463|       36|-0.17948887|
|1031009|      463|        4|  2.5921273|
|1035511|      496|        4| 0.02169519|
|1024631|      833|        5| 0.12948526|
|1029563|      833|        3| 0.05250678|
|1046559|      833|      126| 0.11291194|
|2062243|      833|      112|-0.17410207|
|1024631|     3175|       13| -0.1574901|
|1026084|     3918|       10|  0.7202213|
|1024631|  1001129|      199|-0.31320852|
+-------+---------+---------+-----------+
only showing top 10 rows



## Evalutation of ALS <a class="anchor" name="evaluation"></a>
We can evaluate ALS using RMSE (Root Mean Squared Error) using the RegressionEvaluator as shown below:

In [62]:
#Write your code here
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="playcount",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 6536.445664331771


<strong style="color:red">NOTE: </strong>If you run the above code, the RMSE you will observe is very high. 

For implicit data, RMSE is not a reliable score since, we don't have any reliable feedback over if items are disliked. RMSE requires knowing which items the user dislikes. Spark does not have a readily available solution for to evaluate the implicit data. The following function implements ROEM (Rank Ordering Error Metric) on the prediction data. You can refer to the details about this <a href="https://campus.datacamp.com/courses/recommendation-engines-in-pyspark/what-if-you-dont-have-customer-ratings?ex=6" target="_BLANK">here</a>.

In [63]:
def ROEM(predictions, userCol = "userId", itemCol = "songId", ratingCol = "num_plays"):
    #Creates table that can be queried
    predictions.createOrReplaceTempView("predictions")

    #Sum of total number of plays of all songs
    denominator = predictions.groupBy().sum(ratingCol).collect()[0][0]

    #Calculating rankings of songs predictions by user
    spark.sql("SELECT " + userCol + " , " + ratingCol + " , PERCENT_RANK() OVER (PARTITION BY " + userCol + " ORDER BY prediction DESC) AS rank FROM predictions").createOrReplaceTempView("rankings")

    #Multiplies the rank of each song by the number of plays and adds the products together
    numerator = spark.sql('SELECT SUM(' + ratingCol + ' * rank) FROM rankings').collect()[0][0]

    performance = numerator/denominator

    return performance

In [64]:
ROEM(predictions,'user_id','artist_id','playcount')

0.5933004208099689

## Hyperparameter tuning and cross validation <a class="anchor" name="cv"></a>

Since we can't use RMSE as the evaluation metric for "implicit data", we need to manually implement the hyperparameter tuning for the ALS. This code is adapted from the following source [<a href="https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_cv.py" target="_BLANK">ref</a>]

<code>alpha</code> is an important hyper-parameter for ALS with implicit feedback. It governs the baseline confidence in preference observations. It is a way to assign a confidence values to the <code>playcount</code>. Higher <code>playcount</code> would mean that we have higher confidence that the user likes that artist and lower <code>playcount</code> would mean the user doesn't like that much.

In [98]:
def ROEM_cv(df, userCol = "user_id", itemCol = "artist_id", ratingCol = "playcount", ranks = [10], maxIters = [10], regParams = [.05], alphas = [10, 40]):
  
    from pyspark.sql.functions import rand
    from pyspark.ml.recommendation import ALS

    ratings_df = df.orderBy(rand()) #Shuffling to ensure randomness

    #Building train and validation test sets
    train, validate = df.randomSplit([0.8, 0.2], seed = 0)

    #Building 3 folds within the training set.
    test1, test2,test3 = train.randomSplit([0.33,0.33,0.33], seed = 1)
    train1 = test2.union(test3)
    train2 = test1.union(test2)
    train3 = test1.union(test3)
    

    #Creating variables that will be replaced by the best model's hyperparameters for subsequent printing
    best_validation_performance = 9999999999999
    best_rank = 0
    best_maxIter = 0
    best_regParam = 0
    best_alpha = 0
    best_model = 0
    best_predictions = 0

      #Looping through each combindation of hyperparameters to ensure all combinations are tested.
    for r in ranks:
        for mi in maxIters:
            for rp in regParams:
                for a in alphas:
                #Create ALS model
                    als = ALS(rank = r, maxIter = mi, regParam = rp, alpha = a, userCol=userCol, itemCol=itemCol, ratingCol=ratingCol,
                            coldStartStrategy="drop", nonnegative = True, implicitPrefs = True)

                    #Fit model to each fold in the training set
                    model1 = als.fit(train1)
                    model2 = als.fit(train2)
                    model3 = als.fit(train3)
                    
                    #Generating model's predictions for each fold in the test set
                    predictions1 = model1.transform(test1)
                    predictions2 = model2.transform(test2)
                    predictions3 = model3.transform(test3)
                    
                    #Expected percentile rank error metric function
                    def ROEM(predictions, userCol = userCol, itemCol = itemCol, ratingCol = ratingCol):
                        #Creates table that can be queried
                        predictions.createOrReplaceTempView("predictions")

                        #Sum of total number of plays of all songs
                        denominator = predictions.groupBy().sum(ratingCol).collect()[0][0]

                        #Calculating rankings of songs predictions by user
                        spark.sql("SELECT " + userCol + " , " + ratingCol + " , PERCENT_RANK() OVER (PARTITION BY " + userCol + " ORDER BY prediction DESC) AS rank FROM predictions").createOrReplaceTempView("rankings")

                        #Multiplies the rank of each song by the number of plays and adds the products together
                        numerator = spark.sql('SELECT SUM(' + ratingCol + ' * rank) FROM rankings').collect()[0][0]

                        performance = numerator/denominator

                        return performance

                    #Calculating expected percentile rank error metric for the model on each fold's prediction set
                    performance1 = ROEM(predictions1)
                    performance2 = ROEM(predictions2)
                    performance3 = ROEM(predictions3)
                    

                    #Printing the model's performance on each fold        
                    print("Model Parameters: \nRank:", r,"\nMaxIter:", mi, "\nRegParam:",rp,"\nAlpha: ",a)
                    print("Test Percent Rank Errors: ", performance1, performance2, performance3)

                    #Validating the model's performance on the validation set
                    validation_model = als.fit(train)
                    validation_predictions = validation_model.transform(validate)
                    validation_performance = ROEM(validation_predictions)

                    #Printing model's final expected percentile ranking error metric
                    print("Validation Percent Rank Error: ", validation_performance)
                    print(" ")

                    #Filling in final hyperparameters with those of the best-performing model
                    if validation_performance < best_validation_performance:
                        best_validation_performance = validation_performance
                        best_rank = r
                        best_maxIter = mi
                        best_regParam = rp
                        best_alpha = a
                        best_model = validation_model
                        best_predictions = validation_predictions

    #Printing best model's expected percentile rank and hyperparameters
    print ("**Best Model** ")
    print ("  Percent Rank Error: ", best_validation_performance)
    print ("  Rank: ", best_rank)
    print ("  MaxIter: ", best_maxIter)
    print ("  RegParam: ", best_regParam)
    print ("  Alpha: ", best_alpha)
    
    return best_model, best_predictions

In [100]:
ROEM_cv(df_user_artist)

Model Parameters: 
Rank: 10 
MaxIter: 10 
RegParam: 0.05 
Alpha:  10
Test Percent Rank Errors:  0.3886900814443464 0.21145746425075793 0.14897874333517022
Validation Percent Rank Error:  0.24753423914351042
 
Model Parameters: 
Rank: 10 
MaxIter: 10 
RegParam: 0.05 
Alpha:  40
Test Percent Rank Errors:  0.4131049241402385 0.22059283833157214 0.46929899379135687
Validation Percent Rank Error:  0.3452799867599754
 
**Best Model** 
  Percent Rank Error:  0.24753423914351042
  Rank:  10
  MaxIter:  10
  RegParam:  0.05
  Alpha:  10


(ALSModel: uid=ALS_660246daa887, rank=10,
 DataFrame[user_id: int, artist_id: int, playcount: int, prediction: float])

## Making Predictions <a class="anchor" name="predictions"></a>
The k-fold validation implement might take long time to run. You can use the initial ALS model to make the predictions.
Assuming you have successfully trained the model, we want to now use the model to <strong>find top Artists recommended for each user</strong>. We can use the <i><strong>recommendForAllUsers</strong></i> function available in the ALS model to get the list of top recommendations for each users. You can further explore the details of the API <a href="https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS" target="_blank">here</a>.

The <code>recommendForAllUsers</code> only gives the list of artist_ids for the users, you can write the code to map these artist_ids back to their names.

In [104]:
model.recommendForAllUsers(10).show(truncate=False)

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|user_id|recommendations                                                                                                                                                                                                            |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2007381|[[1238230, 23.326607], [1002280, 1.7649511], [1006134, 1.5370071], [1000899, 1.4817386], [1348794, 1.4599241], [1007735, 1.4507561], [1009156, 1.4061242], [1128473, 1.402673], [1001861, 1.3773758], [1000009, 1.3563573]]|
|1059637|[[1000766, 4.273414], [2231, 4.036212], [1001907, 3.7821138], [1002840,

<a class="anchor" id="lab-task-7"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">7. Lab Task: </strong> 
    Write a function to find the top <strong>N</strong> recommended artists for the user : <strong>2062243</strong>. Display  <code>artist_id and artist_name</code> both. A sample output is given below.
</div>

![image.png](attachment:image.png)

In [105]:
def recommendedArtists(als_model,user_id,limit):
    #get the recommendations
    test = als_model.recommendForAllUsers(limit).filter(col('user_id')==user_id).select("recommendations").collect() 
    #create a dataframe for the top artist list
    top_artist = spark.createDataFrame(test[0][0])
    #join the top_artist dataframe with the artist master dataframe to include the artist_name
    final = top_artist.join(df_artist,  top_artist.artist_id == df_artist.artist_id, how ='inner')\
            .select(top_artist.artist_id, 'artist_name')
    return final

In [106]:
test = model.recommendForAllUsers(20).filter(col('user_id')==2062243).select("recommendations").collect()
print(test)

[Row(recommendations=[Row(artist_id=1198, rating=7.796609878540039), Row(artist_id=1238230, rating=5.410921573638916), Row(artist_id=1179, rating=5.097838878631592), Row(artist_id=82, rating=4.848802089691162), Row(artist_id=1000569, rating=4.783620834350586), Row(artist_id=1002994, rating=4.208910942077637), Row(artist_id=1007903, rating=3.7284817695617676), Row(artist_id=441, rating=3.607126235961914), Row(artist_id=873, rating=3.4548346996307373), Row(artist_id=1000623, rating=3.2998063564300537), Row(artist_id=1844, rating=3.2847483158111572), Row(artist_id=1014421, rating=2.853095531463623), Row(artist_id=75, rating=2.7991340160369873), Row(artist_id=581, rating=2.777670383453369), Row(artist_id=1000200, rating=2.7462172508239746), Row(artist_id=13, rating=2.6834685802459717), Row(artist_id=1223, rating=2.6562039852142334), Row(artist_id=1194, rating=2.573530673980713), Row(artist_id=754, rating=2.5291330814361572), Row(artist_id=1001864, rating=2.4979324340820312)])]


In [107]:
test[0][0]

[Row(artist_id=1198, rating=7.796609878540039),
 Row(artist_id=1238230, rating=5.410921573638916),
 Row(artist_id=1179, rating=5.097838878631592),
 Row(artist_id=82, rating=4.848802089691162),
 Row(artist_id=1000569, rating=4.783620834350586),
 Row(artist_id=1002994, rating=4.208910942077637),
 Row(artist_id=1007903, rating=3.7284817695617676),
 Row(artist_id=441, rating=3.607126235961914),
 Row(artist_id=873, rating=3.4548346996307373),
 Row(artist_id=1000623, rating=3.2998063564300537),
 Row(artist_id=1844, rating=3.2847483158111572),
 Row(artist_id=1014421, rating=2.853095531463623),
 Row(artist_id=75, rating=2.7991340160369873),
 Row(artist_id=581, rating=2.777670383453369),
 Row(artist_id=1000200, rating=2.7462172508239746),
 Row(artist_id=13, rating=2.6834685802459717),
 Row(artist_id=1223, rating=2.6562039852142334),
 Row(artist_id=1194, rating=2.573530673980713),
 Row(artist_id=754, rating=2.5291330814361572),
 Row(artist_id=1001864, rating=2.4979324340820312)]

In [101]:
#seed = 1000
recommendedArtists(model,2062243,20)\
.sort(df_artist.artist_id)\
.show(truncate=False)

+---------+---------------------+
|artist_id|artist_name          |
+---------+---------------------+
|13       |Spiritualized        |
|581      |Depeche Mode         |
|754      |Sigur Rós            |
|1179     |Damien Rice          |
|5659     |Midtown              |
|1000196  |Anal Cunt            |
|1000200  |Alicia Keys          |
|1000481  |Slipknot             |
|1001066  |The Ataris           |
|1001365  |They Might Be Giants |
|1001487  |Finch                |
|1001530  |The Starting Line    |
|1001762  |My Morning Jacket    |
|1002994  |Deerhoof             |
|1005937  |Juno Reactor         |
|1006672  |Further Seems Forever|
|1013151  |Rise Against         |
|1233196  |The Postal Service   |
|1236708  |The Four Tops        |
|1238230  |Straylight Run       |
+---------+---------------------+



In [108]:
#seed = 10000
recommendedArtists(model,2062243,20)\
.sort(df_artist.artist_id)\
.show(truncate=False)

+---------+------------------------+
|artist_id|artist_name             |
+---------+------------------------+
|13       |Spiritualized           |
|75       |Tortoise                |
|82       |Pink Floyd              |
|441      |Aphex Twin              |
|581      |Depeche Mode            |
|754      |Sigur Rós               |
|873      |Tool                    |
|1179     |Damien Rice             |
|1194     |Lenny Kravitz           |
|1198     |Everclear               |
|1223     |Jimi Hendrix            |
|1844     |Tom Waits               |
|1000200  |Alicia Keys             |
|1000569  |Queens of the Stone Age |
|1000623  |Savage Garden           |
|1001864  |Röyksopp                |
|1002994  |Deerhoof                |
|1007903  |Maroon 5                |
|1014421  |Rage Against the Machine|
|1238230  |Straylight Run          |
+---------+------------------------+



### Congratulations on finishing this activity. See you next week.