# GROUP 4: SONG RECOMMENDATION PROJECT

# Introduction
<b>Recommendation systems are among the most widely used data science applications today. They predict a user's rating of, or preference for, a particular item. All of the major tech companies use recommendation systems in some form. Amazon uses them to suggest items that are "frequently bought together" or that "customers who viewed this item also viewed". YouTube uses them to build automatic playlists based on your preferences. In fact, for companies like Netflix and Spotify, the business model and its success depend heavily on the quality of their recommendation systems. Netflix even ran a competition with a million-dollar prize, awarded in 2009, for improving the accuracy of its predictions by 10%.</b>

# Problem statement
<b>Build a song recommendation system that recommends songs to listeners based on information about users' behaviors, activities, or preferences, predicting what users will like based on their similarity to other users.</b>

<b>Data Source: https://www.kaggle.com/c/msdchallenge/data </b>

<b>Loading the data </b>

In [5]:
# File location and type
file_song_data = "/FileStore/tables/song_data.csv"
file_triplets_data ="/FileStore/tables/triplets_10000.txt"

<b>Loading the song data into a DataFrame</b>

In [7]:
# Load the song CSV file (comma-delimited)
song_df = spark.read.csv(file_song_data,
                            inferSchema ='true',
                            header = 'true',
                            sep=',')


display(song_df)

song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0
SOZVAPQ12A8C13B63C,"""Symphony No. 1 G minor """"Sinfonie Serieuse""""/Allegro con energia""",Berwald: Symphonies Nos. 1/2/3/4,David Montgomery,0
SOQVRHI12A6D4FB2D7,We Have Got Love,Strictly The Best Vol. 34,Sasha / Turbulence,0
SOEYRFT12AB018936C,2 Da Beat Ch'yall,Da Bomb,Kris Kross,1993
SOPMIYT12A6D4F851E,Goodbye,Danny Boy,Joseph Locke,0
SOJCFMH12A8C13B0C2,Mama_ mama can't you see ?,March to cadence with the US marines,The Sun Harbor's Chorus-Documentary Recordings,0


<b>Defining schema for triplets data</b>

In [9]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Creating schema
schema = StructType([StructField('user_id', StringType()),
                      StructField('songid', StringType()),
                      StructField('play_count', IntegerType())])

# Load the tab-delimited triplets file using the schema above
tri_df = spark.read.csv(file_triplets_data,
                          schema= schema,
                            sep='\t')


display(tri_df)

user_id,songid,play_count
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODDNQT12A6D4F5F7E,5
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODXRTY12AB0180F3B,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFGUAY12AB017B0A8,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFRQTD12A81C233C0,1
b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOHQWYZ12A6D4FA701,1


# Implicit VS Explicit data:

<b>Explicit data is data in which a user rating on a fixed scale is associated with a song or a movie, for instance the 1-to-5 ratings in the Netflix dataset. From such ratings we can interpret how much a user likes or dislikes a movie, but this data is hard to obtain because users generally do not bother to rate every movie they watch.<br><br>
Implicit data is the type of data we use for song recommendation. It is gathered from user behavior, with no explicit rating attached: how many times a user played a song or watched a movie, how long they spent reading a particular article, and so on. The advantage is that such data is plentiful; the drawback is that it is usually noisy and unreliable.<br><br>
When a user rates a movie 1 on a scale of 5, we know that they did not like the movie. With the play count of a song, however, we cannot directly conclude whether the user loved the song, hated it, or felt somewhere in between. Likewise, the fact that a user has not played a song does not necessarily mean that they dislike it.
Therefore, we focus on what we know about the user's behavior and on the confidence we have in whether or not they like a given item. For instance, we can be more confident that a user likes a song they played 100 times than a song they played only once.</b>
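
The idea of confidence can be made concrete with a small sketch. It assumes the linear weighting from the “Collaborative Filtering for Implicit Feedback Datasets” paper listed in the references, where a play count r becomes a binary preference p (1 if r > 0, else 0) and a confidence c = 1 + alpha * r. The function name is hypothetical, and alpha = 20 is simply the value used in the model definition later in this notebook.

```python
def preference_and_confidence(play_count, alpha=20.0):
    """Map a raw play count to (preference, confidence).

    Assumes the linear confidence c = 1 + alpha * r; alpha is a tunable
    hyperparameter, not a value prescribed by the data.
    """
    # Preference is binary: did the user play the song at all?
    preference = 1 if play_count > 0 else 0
    # Confidence grows linearly with the number of plays
    confidence = 1.0 + alpha * play_count
    return preference, confidence


for plays in (0, 1, 100):
    print(plays, preference_and_confidence(plays))
```

Under this assumption, an unplayed song still carries a small baseline confidence, while a song played 100 times carries far more weight than a song played once.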

<b>Joining the two DataFrames to form the MSD data, keeping the play counts from tri_df</b>

In [12]:
MSD = tri_df.join(song_df, tri_df.songid == song_df.song_id,how='left') 
MSD.show(5)

<b> Removing the redundant column </b>

In [14]:
MSD = MSD['user_id','song_id','play_count','title','release','artist_name','year']
MSD.show(5)

<b>Changing the column name year to release_year</b>

In [16]:
MSD = MSD.withColumnRenamed('year', 'release_year')
MSD.show(5)


<b>Identifying total number of distinct songs and users</b>

In [18]:
# Number of rows, distinct users, and distinct songs
print(MSD.count())
print(MSD.select("user_id").distinct().count())
print(MSD.select("song_id").distinct().count())

# Data Exploration

In [20]:
# Create a view or table

temp_table_name = "user_playlist"

MSD.createOrReplaceTempView(temp_table_name)

In [21]:
%sql
select * from user_playlist

user_id,song_id,play_count,title,release,artist_name,release_year
79f93851e840f9d1faeba586ee18b30fdb0008b6,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
043d81932e75d5749ed5758d6420506e7bc457a5,SOATHTW12A58A7EDB5,5,Mutt,Enema Of The State,Blink-182,1998
ebacfcb5fa29a601f596b2d1076d7973177737e1,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
417c73dd95669d1919c869ef20fd2d0f7a31403d,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
52ab33fbb2fa3aeb2a261734603061e288a2b253,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
13ce57b3a25ef63fa614335fd838e8024c42ec17,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
3294ef9047ac6bf73c8c6fb14a3096ca05a67cb6,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
8b871cbd0f9c62dc3dd349b86d855102515070de,SOATHTW12A58A7EDB5,2,Mutt,Enema Of The State,Blink-182,1998
dd88cd67ebe00d6a81a040df0d3be9390927d399,SOATHTW12A58A7EDB5,1,Mutt,Enema Of The State,Blink-182,1998
b1f0e90e73e6a786bd2b3b17f6e917a777e6e19b,SOATHTW12A58A7EDB5,10,Mutt,Enema Of The State,Blink-182,1998


<b>Top 10 most played songs and their artists</b>

In [23]:
%sql
select artist_name, title, sum(play_count) as number_of_total_play from user_playlist group by title,artist_name order by sum(play_count) desc limit 10;

artist_name,title,number_of_total_play
Dwight Yoakam,You're The One,54136
Björk,Undo,49253
Kings Of Leon,Revelry,41418
Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner,Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile),31153
Harmonia,Sehr kosmisch,31036
Florence + The Machine,Dog Days Are Over (Radio Edit),26663
Kings Of Leon,Use Somebody,22140
OneRepublic,Secrets,22100
Five Iron Frenzy,Canada,21019
Tub Ring,Invalid,19645


<b>Top 10 listeners</b>

In [25]:
%sql
select user_id, sum(play_count) as number_of_total_play from user_playlist group by user_id order by sum(play_count) desc limit 10;

user_id,number_of_total_play
4be305e02f4e72dad1b8ac78e630403543bab994,4884
6d625c6557df84b60d90426c0116138b617b9449,3548
972cce803aa7beceaa7d0039e4c7c0ff097e4d55,3399
0b19fe0fad7ca85693846f7dad047c449784647e,3059
d13609d62db6df876d3cc388225478618bb7b912,2728
283882c3d18ff2ad0e17124002ec02b847d06e9a,2322
083a2a59603a605275107c00812a811526c2a0af,2289
2231cb435771a1a621ec44e95cdd28b81fad3288,2205
6a944bfe30ae8d6b873139e8305ae131f1607d5f,2170
9c859962257112ad523f1d3c121d35191daa6d2b,2124


<b>Distribution of Play count for all songs</b>

In [27]:
%sql
select play_count, count(*) as count from user_playlist group by play_count order by play_count

play_count,count
1,1188874
2,327381
3,149521
4,86400
5,96247
6,47439
7,32289
8,23514
9,17436
10,19566


<b>Top 10 artists by total play count</b>

In [29]:
%sql
select artist_name, sum(play_count) as number_of_total_play from user_playlist group by artist_name order by sum(play_count) desc limit 10;

artist_name,number_of_total_play
Kings Of Leon,86031
Coldplay,78540
Florence + The Machine,60066
Dwight Yoakam,54136
Björk,53814
The Black Keys,52220
Muse,52136
Justin Bieber,50376
Jack Johnson,48487
Eminem,41754


<b>Distribution of the number of distinct songs listened to by each user</b>

In [31]:
%sql
select Number_of_songs,count(*) from (
select user_id,count(distinct song_id) as Number_of_songs from user_playlist group by user_id order by Number_of_songs) group by Number_of_songs order by Number_of_songs;

Number_of_songs,count(1)
1,875
2,1226
3,1585
4,1952
5,2410
6,2753
7,3070
8,3410
9,3454
10,3600


<b>Taking 5% of the users from the MSD data</b>

In [33]:
# Distinct user_ids, randomly split so that user1 holds roughly 5% of them
user=MSD.select("user_id").distinct()
user1,user2= user.randomSplit([0.05,0.95], seed=123)
usercount = user1.count()
print("Number of users: ", usercount)

<b>Dataframe of distinct list of songs</b>

In [35]:
# number of distinct song_Id
songs= MSD.select("song_id").distinct()
songcount=songs.count()
print("Number of songs: ", songcount)

<b>Assigning incrementing integer IDs to the user_id field so that it can be accepted by our model</b>

In [37]:
from pyspark.sql.functions import monotonically_increasing_id

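# Creating a new column of unique integer IDs for user_id.
# coalesce(1) keeps the data on one partition so the generated IDs are consecutive
# and stay within the 32-bit range when they are cast to int later.
user_df = user1.coalesce(1).withColumn("new_userid", monotonically_increasing_id())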
user_df.show()

<b>Assigning incrementing integer IDs to the song_id field so that it can be accepted by our model</b>

In [39]:
# coalesce(1) keeps the generated IDs consecutive and within the 32-bit integer range
songs_df = songs.coalesce(1).select("song_id", monotonically_increasing_id().alias('new_songId'))
songs_df.show()
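
One caveat is worth illustrating: monotonically_increasing_id guarantees unique, increasing 64-bit IDs, but not consecutive ones. The Spark documentation describes the layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The pure-Python sketch below mimics that layout (the helper names are hypothetical) and shows what can happen if such IDs are later truncated to 32 bits, assuming the cast simply keeps the low 32 bits.

```python
def simulated_id(partition_id, row_in_partition):
    """Mimic the documented layout: partition ID in the upper 31 bits,
    row number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


def truncate_to_int32(value):
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer
    (an assumption about how the later cast to int behaves)."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# Two rows in each of three partitions
ids = [simulated_id(p, r) for p in range(3) for r in range(2)]
truncated = [truncate_to_int32(i) for i in ids]

print(ids)
print(truncated)
```

In this sketch the six distinct 64-bit IDs collapse onto just two 32-bit values, which is why it is safer to generate the IDs on a single partition, or to remap them to a dense range, before casting them to int.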

<b>Cross joining users with songs so that every user has an entry for every song</b>

In [41]:
#Cross Join user and Songs
crossjoin = user_df.crossJoin(songs_df)
crossjoin.show(5)

In [42]:
crossjoin.count()

<b>Joining the cross-joined DataFrame with the full data so that play_count is populated for every user and song combination, replacing nulls with 0 where there is no match, i.e. when the user did not listen to that song</b>

In [44]:
df = crossjoin.join(MSD, ["user_id", "song_id"], "left").fillna(0)
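
The effect of the cross join followed by the left join and fillna(0) can be seen on a toy example. The sketch below uses plain Python with made-up users, songs, and play counts (all hypothetical) to build one row per user and song pair, defaulting to 0 when the user never played the song.

```python
from itertools import product

users = ["u1", "u2"]
songs = ["s1", "s2", "s3"]
# Observed (user, song) -> play_count pairs; every other pair is unobserved
observed = {("u1", "s1"): 3, ("u2", "s3"): 1}

# Cross join: every user paired with every song, 0 where nothing was observed
dense = [(u, s, observed.get((u, s), 0)) for u, s in product(users, songs)]

for row in dense:
    print(row)
print(len(dense))
```

The row count is always the number of users times the number of songs, so the table grows quickly as either side grows.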

<b>Selecting only numeric columns that we want for Modeling</b>

In [46]:
model_df= df.select(df.new_userid.cast("int"),df.new_songId.cast("int"),df.play_count.cast("int"))

# Alternating Least Squares (ALS)
<b>ALS is an iterative optimization process in which every iteration brings us closer to a factorized representation of our original data. Assume that our original matrix M has size U x I, where U is the number of users and I is the number of items. We want to express M as the product of two matrices: a matrix of users and hidden features of dimension U x F, and a matrix of hidden features and items of dimension F x I. These two matrices hold weights for how strongly each user or item relates to each hidden feature. ALS alternates between the two: it holds one matrix fixed and solves a least-squares problem for the other, then swaps their roles, repeating until the product of the two matrices approximates M as closely as possible.</b>
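
To make the alternating idea concrete, here is a minimal pure-Python sketch for the simplest case of a single hidden feature (F = 1) on a fully observed matrix. It illustrates plain regularized alternating least squares, not the implicit-feedback variant that Spark's ALS applies in this notebook; the function name, the toy matrix, and the reg value are assumptions.

```python
def als_rank1(M, n_iters=20, reg=1e-6):
    """Approximate M (a list of rows) by the outer product of vectors u and v."""
    n_users, n_items = len(M), len(M[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Hold v fixed: the least-squares solution for each user factor is closed-form
        vv = sum(x * x for x in v) + reg
        u = [sum(M[i][j] * v[j] for j in range(n_items)) / vv for i in range(n_users)]
        # Hold u fixed: solve for each item factor in the same way
        uu = sum(x * x for x in u) + reg
        v = [sum(M[i][j] * u[i] for i in range(n_users)) / uu for j in range(n_items)]
    return u, v


# Toy matrix that is exactly rank 1 (3 users x 2 items)
M = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
u, v = als_rank1(M)
approx = [[ui * vj for vj in v] for ui in u]
for row in approx:
    print([round(x, 3) for x in row])
```

Each half-step is an ordinary least-squares solve, which is what distinguishes ALS from a gradient-based update.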

<b>Defining the Model</b>

In [49]:
# Set the ALS hyperparameters
from pyspark.ml.recommendation import ALS

model = ALS(userCol= "new_userid", itemCol= "new_songId", ratingCol= "play_count", rank = 10, maxIter = 10,alpha = 20, regParam = .05,  coldStartStrategy="drop", nonnegative = True, implicitPrefs = True)

<b>Dividing the data into train and test dataset</b>

In [51]:
# Split the dataframe into training and test data
(train_data, test_data) = model_df.select('new_userid','new_songId','play_count').randomSplit([0.7, 0.3], seed=12345)

# Rank Ordering Error Metric (ROEM)
<b>The ALS model in the Spark ML library has an additional parameter for implicit ratings called alpha, a numeric value that tells Spark how much each additional play of a song should add to the model's confidence that the user actually likes that song.<br><br>
For explicit ratings, we can evaluate the model with RMSE, which makes sense because predictions can be matched back to a true measure of user ratings. With implicit ratings, however, we have no true measure of user ratings; we only have the number of times a user played a song and a measure of how confident the model is that they like it, so RMSE is not suitable for evaluating our model. Using the test dataset, we can instead check whether our model gives high predictions to the songs that users have actually listened to.<br><br>
The logic is that if our model returns a high prediction for a song that the user actually listened to, then the prediction makes sense, especially if they have listened to it more than once. We can measure this with the Rank Ordering Error Metric (ROEM), which checks whether songs with more plays receive higher predictions.</b>

In [53]:
def ROEM(predictions, userCol = "new_userid", itemCol = "new_songId", ratingCol = "play_count"):
  """Rank Ordering Error Metric: play-count-weighted average percentile rank (lower is better)."""
  # Register the predictions as a temporary view so they can be queried with SQL
  predictions.createOrReplaceTempView("predictions")

  # Total number of plays across all songs
  denominator = predictions.groupBy().sum(ratingCol).collect()[0][0]

  # Percentile rank of each song within a user's predictions (0 = highest prediction)
  spark.sql("SELECT " + userCol + " , " + ratingCol + " , PERCENT_RANK() OVER (PARTITION BY " + userCol + " ORDER BY prediction DESC) AS rank FROM predictions").createOrReplaceTempView("rankings")

  # Multiply each song's percentile rank by its play count and sum the products
  numerator = spark.sql('SELECT SUM(' + ratingCol + ' * rank) FROM rankings').collect()[0][0]

  performance = numerator/denominator

  return performance
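
The same calculation can be checked by hand on a few rows. The sketch below re-implements the metric in plain Python, assuming PERCENT_RANK is computed as (rank - 1) / (n - 1) within each user, with tied predictions sharing a rank; the function name and the toy rows are hypothetical. Lower values mean that heavily played songs sit nearer the top of each user's ranking, and a random ordering is expected to score around 0.5.

```python
from collections import defaultdict


def roem_plain(rows):
    """rows: iterable of (user, play_count, prediction) tuples."""
    by_user = defaultdict(list)
    for user, plays, pred in rows:
        by_user[user].append((pred, plays))

    numerator = 0.0
    denominator = 0.0
    for items in by_user.values():
        n = len(items)
        preds_desc = sorted((p for p, _ in items), reverse=True)
        for pred, plays in items:
            # 1-based rank by descending prediction; ties share the best rank
            rank = preds_desc.index(pred) + 1
            pct_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
            numerator += plays * pct_rank
            denominator += plays
    return numerator / denominator


# One user, three songs: most-played song predicted highest vs. predicted lowest
good = [("u1", 5, 0.9), ("u1", 1, 0.5), ("u1", 0, 0.1)]
bad = [("u1", 5, 0.1), ("u1", 1, 0.5), ("u1", 0, 0.9)]
print(roem_plain(good), roem_plain(bad))
```

The well-ordered predictions score close to 0 and the reversed ones close to 1, matching the intuition behind the metric.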

In [54]:
train_data.cache()

<b>Fitting the model to the training data</b>

In [56]:
fitted_model = model.fit(train_data)

<b>Generating predictions on the test data using fitted_model</b>

In [58]:
predictions = fitted_model.transform(test_data)

<b>Computing and printing the ROEM metric on the test data</b>

In [60]:
# Compute and print the ROEM metric on the test data
validation_performance = ROEM(predictions)
print(validation_performance)

# Conclusion

Our model achieved an accuracy of 50.26%. We trained the model on ~4000 users.
<br> Out of every two songs recommended, one was relevant to the user.
<br> We were able to process 37,830,000 records (3,783 users and 10,000 songs).
<br> We tried multiple cloud platforms (Databricks, Google Cloud Platform) to improve the accuracy and performance of the code.
<br> The model performance could be improved by training it on more data points.

# References
<b>https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_function.py <br>
https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe<br>
IEEE paper: Hu, Koren, and Volinsky, “Collaborative Filtering for Implicit Feedback Datasets” (ICDM 2008)</b><br>