# Collaborative Filtering Song Recommendation System  

- Personalized playlists based on the Alternating Least Squares (ALS) algorithm

_______________________________________________________________________


# Objective

- Load data from a CSV file. Format: (userId, trackId, freq).
- Use the predefined parameters from the exploration notebook for the ALS algorithm.
- Predict 20 unlistened songs for each user.
- Write the final recommendations to a CSV file.
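
The "20 unlistened songs" step amounts to dropping tracks already in a user's history and keeping the highest-scoring remainder. A minimal sketch in plain Python, assuming predicted scores for one user are available as a dict; `top_unlistened` and the sample values are hypothetical and not part of `Recommendor2`:

```python
import heapq


def top_unlistened(scores, listened, limit=20):
    """Return up to `limit` (track_id, score) pairs the user has not played, best first.

    scores   -- dict of track_id -> predicted score for one user (hypothetical input)
    listened -- set of track_ids already in that user's listening history
    """
    candidates = (
        (track, score) for track, score in scores.items() if track not in listened
    )
    return heapq.nlargest(limit, candidates, key=lambda pair: pair[1])


# Made-up scores for illustration only; they are not output of the model.
scores = {596: 0.4, 620: 5.4, 650: 4.9, 1495: 4.4, 1448: 3.9}
listened = {596, 620}
print(top_unlistened(scores, listened, limit=2))  # [(650, 4.9), (1495, 4.4)]
```

Using `heapq.nlargest` avoids sorting every candidate when only the top few are needed.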


# Set up

In [1]:
import sys
sys.path.insert(0,'.')
# !pip install pandas pyspark

# import os
# import sys

# os.environ['PYSPARK_PYTHON'] = sys.executable
# os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [2]:
# spark configuration
# I might be wrong, but even a 3 GB file is very memory-consuming to process on Spark
# 6s
from Recommendor2 import ALSRecommendor


spark = ALSRecommendor.setup_spark()
r = ALSRecommendor(spark)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/04/01 10:44:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/04/01 10:44:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# Preprocess Data

In [3]:
r.load_data_from_csv('../data/input/listening_history.csv')

  [(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]


# Collaborative Filtering

Assumption: if users A and B share a similar listening history, A might also be interested in the other songs B listens to.

For each user, we have play frequencies for only a subset of the songs, because not every user has listened to every song. The user-song matrix is therefore sparse.

The intuition is that collaborative filtering can approximate this matrix by decomposing it into the product of a "user property" matrix and a "song property" matrix.
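
In symbols, one common way to write this factorization is shown here. The notation is an assumption introduced for illustration: $R$ is the user-song frequency matrix, $\Omega$ is the set of observed (user, song) pairs, $\mathbf{u}_u$ and $\mathbf{v}_i$ are the rows of $U$ and $V$, and $\lambda$ is a regularization weight. Spark's exact weighting of the penalty term may differ.

$$
R \approx U V^{\top}, \qquad \hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i
$$

$$
\min_{U, V} \sum_{(u,i) \in \Omega} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)
$$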

Optimization with the ALS algorithm:

1. Randomly initialize the user matrix.
2. Hold the user matrix fixed and optimize the song matrix so that the squared error is minimized (least squares).
3. Then hold the song matrix fixed and optimize the user matrix. Repeating steps 2 and 3 in turn is why it is called "alternating".
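
The three steps can be sketched in plain Python on a toy dataset. This is an illustrative ridge-regularized ALS, not Spark's implementation; the function names, the toy triples, and the defaults (`rank=2`, `reg=0.1`) are assumptions made for the example.

```python
import random


def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def update(fixed, observed_by_key, rank, reg):
    """Least-squares update of one side's factors while `fixed` is held constant."""
    updated = {}
    for key, observed in observed_by_key.items():
        # Normal equations: (sum f f^T + reg * I) x = sum value * f
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, value in observed:
            f = fixed[other]
            for i in range(rank):
                b[i] += value * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        updated[key] = solve(A, b)
    return updated


def als(triples, rank=2, reg=0.1, iterations=10, seed=0):
    """Factorize (user, song, value) triples; needs reg > 0 and iterations >= 1."""
    rng = random.Random(seed)
    by_user, by_song = {}, {}
    for user, song, value in triples:
        by_user.setdefault(user, []).append((song, value))
        by_song.setdefault(song, []).append((user, value))
    users = {u: [rng.random() for _ in range(rank)] for u in by_user}  # step 1
    songs = {}
    for _ in range(iterations):
        songs = update(users, by_song, rank, reg)  # step 2: users fixed
        users = update(songs, by_user, rank, reg)  # step 3: songs fixed
    return users, songs


def predict(users, songs, user, song):
    """Predicted value is the dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(users[user], songs[song]))


def objective(triples, users, songs, reg):
    """Squared error on observed entries plus the ridge penalty."""
    squared = sum((v - predict(users, songs, u, s)) ** 2 for u, s, v in triples)
    vectors = list(users.values()) + list(songs.values())
    return squared + reg * sum(x * x for vec in vectors for x in vec)


# Toy play counts, made up for illustration.
triples = [(1, "a", 5.0), (1, "b", 1.0), (2, "a", 4.0),
           (2, "c", 2.0), (3, "b", 1.0), (3, "c", 3.0)]
users, songs = als(triples)
print(objective(triples, users, songs, reg=0.1))
print(predict(users, songs, 1, "c"))  # score for a pair that was never observed
```

Each update solves its block exactly, so the regularized objective cannot increase from one iteration to the next.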

## Alternating Least Squares (ALS) with Spark ML


- Initialize ALS with Spark ML
- Set RMSE as the evaluation metric
- Perform cross-validation with grid search to find the best parameters (done in the exploration notebook; the resulting values are reused here)
- Generate personalized playlists based on listening history
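
For orientation, the steps above might look roughly like this with the public Spark ML API. This is a sketch under assumptions, not the code inside `ALSRecommendor`: the function names, the 80/20 split, the seed, `coldStartStrategy="drop"`, and the grid values are all hypothetical. The column names and the `rank=6`, `regParam=0.25` defaults are taken from this notebook.

```python
def make_evaluator():
    """RMSE evaluator comparing the `frequency` column with ALS predictions."""
    from pyspark.ml.evaluation import RegressionEvaluator

    return RegressionEvaluator(
        metricName="rmse", labelCol="frequency", predictionCol="prediction"
    )


def make_als(rank=6, reg_param=0.25, seed=42):
    """ALS estimator wired to the notebook's column names."""
    # Imported lazily so this module can be loaded where Spark is not installed.
    from pyspark.ml.recommendation import ALS

    return ALS(
        rank=rank,
        regParam=reg_param,
        userCol="user_id",
        itemCol="track_id",
        ratingCol="frequency",
        coldStartStrategy="drop",  # assumption: skip test rows with unseen users or tracks
        seed=seed,
    )


def fit_and_score(df, rank=6, reg_param=0.25, seed=42):
    """Fit on an assumed 80/20 split of `df` and return (model, test-set RMSE)."""
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    model = make_als(rank, reg_param, seed).fit(train_df)
    return model, make_evaluator().evaluate(model.transform(test_df))


def grid_search(train_df, ranks=(4, 6, 8), reg_params=(0.1, 0.25, 0.5), folds=3):
    """Cross-validated grid search; the grid values here are placeholders."""
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    als = make_als()
    grid = (
        ParamGridBuilder()
        .addGrid(als.rank, list(ranks))
        .addGrid(als.regParam, list(reg_params))
        .build()
    )
    cv = CrossValidator(
        estimator=als,
        estimatorParamMaps=grid,
        evaluator=make_evaluator(),
        numFolds=folds,
    )
    return cv.fit(train_df).bestModel
```

With a DataFrame shaped like `r.total_df`, a call such as `model, rmse = fit_and_score(r.total_df)` would train and score in one step.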

In [4]:
r.train(rank=6, regParam=0.25)

all data prepared!
About to start training..


                                                                                

23/04/01 10:38:33 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/04/01 10:38:33 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
23/04/01 10:38:33 WARN InstanceBuilder$NativeLAPACK: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
Training done. Analyzing rmse..


                                                                                

The final RMSE on the test set is 0.7071067811865476


0.7071067811865476

### The score looks good, but we need to examine the recommendations to check that they make sense.
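
One thing that may be worth checking: the reported RMSE, 0.7071067811865476, equals √0.5 to double precision. A score that lands exactly on such a value can result from a very small evaluation set, so counting the test rows that were actually scored would help confirm the number is meaningful. Below is a minimal sketch of RMSE with a hypothetical two-row example (not this notebook's data) that produces the same value.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))


# Hypothetical test set: one prediction off by 1, one exact.
# Mean squared error is (1 + 0) / 2 = 0.5, so the RMSE is sqrt(0.5).
print(rmse([(1, 2.0), (3, 3.0)]))
```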

# Evaluation on all users

In [5]:
# only recommend to users who have listened to at least a minimum number of songs (min_tracks_listened)

#users = r.get_users(min_tracks_listened=1, limit=5)

In [6]:
#print(users)
r.total_df.show()

+-------+--------+---------+
|user_id|track_id|frequency|
+-------+--------+---------+
|      1|     596|        1|
|      1|     598|        1|
|      1|     601|        1|
|      1|     602|        1|
|      1|     615|        1|
|      1|     617|        4|
|      1|     618|        1|
|      1|     620|       11|
|      1|     632|        2|
|      1|     648|        1|
|      1|     650|       10|
|      1|     651|        8|
|      1|     666|       25|
|      1|     670|        1|
|      1|     673|        2|
|      1|     677|        1|
|      1|     712|        5|
|      1|     743|        1|
|      1|     781|        6|
|      1|    1059|       31|
+-------+--------+---------+
only showing top 20 rows



In [9]:
# recommend 20 songs to each user
# 2m 50s for each user
for user in r.get_users(limit=5):
    df = r.recommend(user, limit=20)
    df.show(5)


Generating recommendation for user: 1
listened_track count: 53
+--------+----------+
|track_id|prediction|
+--------+----------+
+--------+----------+

Generating recommendation for user: 2
listened_track count: 10
+--------+----------+
|track_id|prediction|
+--------+----------+
|     620| 5.4310966|
|     650|   4.93736|
|    1495| 4.4436245|
|    1448| 3.9498885|
|    1488| 3.4561522|
+--------+----------+
only showing top 5 rows

