# Collaborative Filtering Song Recommendation System  

- Personalized playlists based on the Alternating Least Squares (ALS) algorithm

_______________________________________________________________________


# Objective

- Load data from a CSV file with the format (userId, trackId, freq).
- Use the ALS parameters chosen in the exploration notebook.
- Predict 20 unlistened songs for each user.
- Write the final recommendations to a CSV file.
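The first and last steps can be sketched with the standard library alone. This is only an illustration of the (userId, trackId, freq) row format: the function names, the absence of a header row in the input, and the output column names are assumptions, not details taken from `ALSRecommendor`.

```python
import csv
import io


def read_history(lines):
    """Parse (userId, trackId, freq) rows. Assumes there is no header row."""
    return [(user_id, track_id, int(freq))
            for user_id, track_id, freq in csv.reader(lines)]


def write_recommendations(out, recommendations):
    """Write one row per recommended track. Column names are hypothetical."""
    writer = csv.writer(out)
    writer.writerow(["userId", "trackId", "prediction"])
    for user_id, tracks in recommendations.items():
        for track_id, score in tracks:
            writer.writerow([user_id, track_id, score])


# In-memory stand-ins for the real input and output files.
history = read_history(io.StringIO("u1,TRA,3\nu2,TRB,1\n"))
buffer = io.StringIO()
write_recommendations(buffer, {"u1": [("TRB", 2.5)]})
```

With real files, the same functions would take handles opened with `open(path, newline="")`.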


# Set up

In [3]:
import sys
sys.path.insert(0,'.')
# !pip install pandas pyspark

# import os
# import sys

# os.environ['PYSPARK_PYTHON'] = sys.executable
# os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [4]:
# Spark configuration
# Note: processing even a 3 GB file with Spark seems to be very memory-intensive.
# Takes about 6 s.
from Recommendor import ALSRecommendor


spark = ALSRecommendor.setup_spark()
r = ALSRecommendor(spark)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/31 23:04:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/31 23:04:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# Preprocess Data

In [None]:
r.load_data_from_csv('../../data/tmp/listening_history.csv')



# Collaborative Filtering

Assumption: if users A and B have similar listening histories, A may also be interested in the other songs that B listens to.

In this listening history, each user has play counts for only a subset of the songs, because users have not listened to every song. The user-song matrix is therefore sparse.

The intuition is that collaborative filtering can approximate this matrix by decomposing it into the product of a "user property" matrix and a "song property" matrix.
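In symbols, this kind of factorization is commonly written as below. The notation is introduced here for illustration and does not come from the notebook: $R$ is the user-song play-count matrix, $\Omega$ is the set of observed (user, song) pairs, $\mathbf{u}_u$ and $\mathbf{v}_i$ are the rank-$k$ property vectors of user $u$ and song $i$, and $\lambda$ is the regularization weight. Spark's implementation may scale the regularization term differently.

$$
R \approx U V^{\top}, \qquad \hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i
$$

$$
\min_{U, V} \sum_{(u, i) \in \Omega} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)
$$

Under this reading, $k$ and $\lambda$ would play the roles of the `rank` and `regParam` arguments passed to `r.train`.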

Optimization with the ALS algorithm:

1. Randomly initialize the user matrix.
2. Hold the user matrix fixed and optimize the song matrix so that the squared error is minimized (least squares).
3. Then hold the song matrix fixed and optimize the user matrix. Steps 2 and 3 are repeated in turn, which is why the method is called "alternating".
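The three steps above can be sketched in plain Python on a toy dataset. This is a minimal illustration under stated assumptions, not the Spark implementation used in this notebook: all function names, the toy play counts, and the defaults (`rank=2`, `reg=0.1`) are hypothetical, and each update solves ridge-regularized normal equations with a small hand-written solver.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def update_factors(fixed, observed, rank, reg):
    """Least-squares update of one side while the other side (`fixed`) is held constant.

    observed[k] lists (index_into_fixed, value) pairs for row k of the side being updated.
    """
    updated = []
    for pairs in observed:
        # Normal equations: (sum f f^T + reg * I) x = sum value * f
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, value in pairs:
            f = fixed[idx]
            for i in range(rank):
                b[i] += value * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        updated.append(solve(a, b))
    return updated


def squared_error(users, songs, plays):
    """Sum of squared differences between observed and predicted play counts."""
    return sum((v - sum(p * q for p, q in zip(users[u], songs[s]))) ** 2
               for u, s, v in plays)


def als(plays, n_users, n_songs, rank=2, reg=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize the user matrix.
    users = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    by_user = [[] for _ in range(n_users)]
    by_song = [[] for _ in range(n_songs)]
    for u, s, v in plays:
        by_user[u].append((s, v))
        by_song[s].append((u, v))
    songs = []
    for _ in range(iterations):
        songs = update_factors(users, by_song, rank, reg)  # Step 2: users fixed
        users = update_factors(songs, by_user, rank, reg)  # Step 3: songs fixed
    return users, songs


# Toy (user, song, play count) triples; not taken from the real dataset.
plays = [(0, 0, 1.0), (0, 1, 2.0), (0, 3, 3.0), (1, 0, 2.0),
         (1, 2, 2.0), (2, 1, 6.0), (2, 2, 3.0), (2, 3, 9.0)]
users, songs = als(plays, n_users=3, n_songs=4)
```

Because each half-step has a closed-form solution, no learning rate is needed, and the per-user and per-song updates are independent of each other, which is what makes the method easy to distribute.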

## Alternating Least Squares (ALS) with Spark ML


- Initialize ALS with Spark ML
- Set RMSE as the evaluation metric
- Perform cross-validation with grid search to find the best parameters
- Generate personalized playlists based on listening history
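The first three bullets might look roughly like the sketch below when written directly against Spark ML. This is not the code inside `ALSRecommendor`: the column names (`user_idx`, `track_idx`, `freq`), the search grid, and the fold count are assumptions chosen for illustration. Spark's ALS expects numeric user and item columns, so string IDs would need to be indexed first.

```python
# Hypothetical search space; the notebook itself trains with rank=6, regParam=0.25.
PARAM_GRID = {"rank": [4, 6, 8], "regParam": [0.1, 0.25, 0.5]}


def build_cross_validator(user_col="user_idx", item_col="track_idx",
                          rating_col="freq", num_folds=3):
    """Return a CrossValidator that tunes ALS by RMSE (column names are assumed)."""
    # Imported lazily so this module loads without a Spark installation.
    # Calling the function still requires an active SparkSession.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # coldStartStrategy="drop" removes rows whose prediction is NaN, so users or
    # tracks unseen in a training fold do not turn the RMSE into NaN.
    als = ALS(userCol=user_col, itemCol=item_col, ratingCol=rating_col,
              coldStartStrategy="drop")
    evaluator = RegressionEvaluator(metricName="rmse", labelCol=rating_col,
                                    predictionCol="prediction")
    grid = (ParamGridBuilder()
            .addGrid(als.rank, PARAM_GRID["rank"])
            .addGrid(als.regParam, PARAM_GRID["regParam"])
            .build())
    return CrossValidator(estimator=als, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds)
```

With a training DataFrame `train_df` (hypothetical name), `build_cross_validator().fit(train_df).bestModel` would return the ALS model with the lowest average RMSE across folds.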

In [None]:
r.train(rank=6, regParam=0.25)

all data prepared!
About to start training..



23/03/31 21:12:16 WARN TaskSetManager: Stage 0 contains a task of very large size (183115 KiB). The maximum recommended task size is 1000 KiB.



23/03/31 21:12:31 WARN TaskSetManager: Stage 1 contains a task of very large size (183115 KiB). The maximum recommended task size is 1000 KiB.



23/03/31 21:12:40 WARN TaskSetManager: Stage 3 contains a task of very large size (183115 KiB). The maximum recommended task size is 1000 KiB.


                                                                                

23/03/31 21:16:38 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/03/31 21:16:38 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS




23/03/31 21:16:39 WARN InstanceBuilder$NativeLAPACK: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


                                                                                

Training done. Analyzing rmse..




The final RMSE on the test set is 6.389258845800043


                                                                                

6.389258845800043

### An RMSE of about 6.39 (in units of play count) seems reasonable. We need to examine the recommendations to check whether they make sense.
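For reference, RMSE is the square root of the mean squared difference between predicted and observed values, so it is expressed in the same units as `freq`. A minimal sketch of the computation follows; the function name and the toy numbers are hypothetical and unrelated to the score reported above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: three small errors and one large one.
small_errors = rmse([1, 1, 1], [2, 2, 2])
with_outlier = rmse([1, 1, 1, 50], [2, 2, 2, 2])
```

Because errors are squared before averaging, a single heavily played track that is predicted poorly can dominate the score, which is one reason to inspect the recommendations themselves.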

# Evaluation on all users

In [None]:
# only recommend to users who have listened to more than 10 songs

users = r.get_users(min_tracks_listened=10, limit=5)
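The filter described in the comment above could be sketched in plain Python as follows. This is not the implementation of `r.get_users`: the function name is hypothetical, and treating the threshold as strictly "more than" follows the comment rather than any inspected code.

```python
from collections import defaultdict


def users_with_min_tracks(rows, min_tracks_listened=10, limit=None):
    """Return users who listened to more than `min_tracks_listened` distinct tracks."""
    tracks_by_user = defaultdict(set)
    for user_id, track_id, _freq in rows:
        tracks_by_user[user_id].add(track_id)
    eligible = [user for user, tracks in tracks_by_user.items()
                if len(tracks) > min_tracks_listened]
    return eligible[:limit]  # limit=None keeps every eligible user


# Toy (userId, trackId, freq) rows with a threshold of 1 for readability.
rows = [("a", "t1", 1), ("a", "t2", 2), ("a", "t2", 5), ("b", "t1", 1),
        ("c", "t1", 1), ("c", "t2", 1), ("c", "t3", 1)]
eligible_users = users_with_min_tracks(rows, min_tracks_listened=1)
```

Repeated plays of the same track count once here, since the set keeps distinct track IDs only.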

                                                                                

In [None]:
df = r.recommend('testid', 101, limit=20)
df.show(5)

Generating recommendation for user 101 (testid)


                                                                                

listened_track count: 12




+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRSGUQI128F92D452E|  3.565889|
|TRPQGHT128E078D10A| 3.5387394|
|TRGASNY128F14696B0| 3.2831128|
|TRUPPJG12903C98EE6|  3.236645|
|TRFRXRO12903D03A74| 3.1199164|
+------------------+----------+
only showing top 5 rows
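Conceptually, producing a table like the one above means scoring candidate tracks, dropping those the user has already listened to, and keeping the highest predictions. A minimal sketch of that selection step is shown below; the function name and the track IDs are made up for illustration, and this is not the code behind `r.recommend`.

```python
def top_n_unlistened(scores, listened, n=20):
    """Return the n highest-scoring (track_id, prediction) pairs not yet listened to."""
    candidates = [(track, score) for track, score in scores.items()
                  if track not in listened]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]


# Hypothetical predicted scores for one user, who has already listened to T1.
scores = {"T1": 3.5, "T2": 3.2, "T3": 2.9, "T4": 1.0}
top_two = top_n_unlistened(scores, listened={"T1"}, n=2)
```

Passing `listened` as a set keeps each membership check constant-time, which matters when a user has a long history.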



                                                                                

In [None]:
# Recommend 20 songs to each user.
# Takes about 2 min 50 s per user.

for user in users:
    df = r.recommend(user.user_id, user.new_userid, limit=20)
    df.show(5)
    

Generating recommendation for user 83 (106340f89e92b4b52041477f927993fd5ac278b8)


                                                                                

listened_track count: 104


                                                                                

+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRUPPJG12903C98EE6| 3.1896992|
|TRSGUQI128F92D452E| 3.1324916|
|TRPQGHT128E078D10A| 2.9886236|
|TRGASNY128F14696B0|  2.950057|
|TRFRXRO12903D03A74| 2.8017323|
+------------------+----------+
only showing top 5 rows

Generating recommendation for user 1855 (04eaed85643f0f84f8255e9b3b1b22b7e682b991)


                                                                                

listened_track count: 12


                                                                                

+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRUPPJG12903C98EE6| 4.2846785|
|TRSGUQI128F92D452E|  3.925344|
|TRGASNY128F14696B0|   3.79862|
|TRFRXRO12903D03A74| 3.6238704|
|TRKGBAV128F1491622| 3.4721503|
+------------------+----------+
only showing top 5 rows

Generating recommendation for user 1657 (8c174da0146bea17f71920e030eadcad491f09d0)


                                                                                

listened_track count: 19


                                                                                

+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRUPPJG12903C98EE6|  3.897885|
|TRSGUQI128F92D452E| 3.8870978|
|TRGASNY128F14696B0| 3.8285406|
|TRFRXRO12903D03A74| 3.6314344|
|TRDUQBK128F423EB4C| 3.5979087|
+------------------+----------+
only showing top 5 rows

Generating recommendation for user 26 (f161aac4e91701299cc95bf3c269f966e0013663)


                                                                                

listened_track count: 92


                                                                                

+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRUPPJG12903C98EE6| 3.8215747|
|TRSGUQI128F92D452E| 3.5906286|
|TRPQGHT128E078D10A| 3.5073428|
|TRGASNY128F14696B0| 3.4717524|
|TRFRXRO12903D03A74|  3.293797|
+------------------+----------+
only showing top 5 rows

Generating recommendation for user 1239 (dd40c92f28bc031a5a3e63dbffdb511933d99efc)


                                                                                

listened_track count: 38




+------------------+----------+
|          track_id|prediction|
+------------------+----------+
|TRUPPJG12903C98EE6| 3.4604883|
|TRSGUQI128F92D452E|  3.257988|
|TRGASNY128F14696B0| 3.1858516|
|TRFRXRO12903D03A74| 3.0250497|
|TRKGBAV128F1491622|  2.911509|
+------------------+----------+
only showing top 5 rows



                                                                                