# Session Based Collaborative Filtering
[Session-Based Collaborative Filtering for Predicting the Next Song](https://www.researchgate.net/publication/228959287_Session-Based_Collaborative_Filtering_for_Predicting_the_Next_Song)

In [1]:
import pandas as pd
import numpy as np
import gc
import dateutil.parser
from scipy.sparse import csr_matrix
import operator
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation, cosine
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
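# on_bad_lines='warn' replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0):
# malformed rows are skipped and reported, as in the output below.
df_hist = pd.read_table('userid-timestamp-artid-artname-traid-traname.tsv', on_bad_lines='warn')
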
b'Skipping line 2120260: expected 6 fields, saw 8\n'
b'Skipping line 2446318: expected 6 fields, saw 8\n'
b'Skipping line 11141081: expected 6 fields, saw 8\n'
b'Skipping line 11152099: expected 6 fields, saw 12\nSkipping line 11152402: expected 6 fields, saw 8\n'
b'Skipping line 11882087: expected 6 fields, saw 8\n'
b'Skipping line 12902539: expected 6 fields, saw 8\nSkipping line 12935044: expected 6 fields, saw 8\n'
b'Skipping line 17589539: expected 6 fields, saw 8\n'


In [3]:
df_hist.columns = ['userid', 'timestamp', 'artistid','artistname','trackid','trackname']

In [4]:
df_hist.head()

Unnamed: 0,userid,timestamp,artistid,artistname,trackid,trackname
0,user_000001,2009-05-04T13:54:10Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Composition 0919 (Live_2009_4_15)
1,user_000001,2009-05-04T13:52:04Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc2 (Live_2009_4_15)
2,user_000001,2009-05-04T13:42:52Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Hibari (Live_2009_4_15)
3,user_000001,2009-05-04T13:42:11Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc1 (Live_2009_4_15)
4,user_000001,2009-05-04T13:38:31Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,To Stanford (Live_2009_4_15)


In [5]:
df_hist['userid'].nunique()

992

So, we have data for a total of 992 users. For simplicity, let's restrict our dataset to only 50 users.

In [6]:
df_profile = pd.read_table('userid-profile.tsv')


In [7]:
user_id = df_profile['#id'].tolist()[:50]

df_hist = df_hist[df_hist['userid'].isin(user_id)]

In [8]:
# Vectorized parsing of the ISO 8601 timestamps into timezone-aware (UTC) datetimes
df_hist['timestamp'] = pd.to_datetime(df_hist['timestamp'], utc=True)

In [9]:
df_hist.head()

Unnamed: 0,userid,timestamp,artistid,artistname,trackid,trackname
0,user_000001,2009-05-04 13:54:10+00:00,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Composition 0919 (Live_2009_4_15)
1,user_000001,2009-05-04 13:52:04+00:00,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc2 (Live_2009_4_15)
2,user_000001,2009-05-04 13:42:52+00:00,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Hibari (Live_2009_4_15)
3,user_000001,2009-05-04 13:42:11+00:00,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc1 (Live_2009_4_15)
4,user_000001,2009-05-04 13:38:31+00:00,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,To Stanford (Live_2009_4_15)


# Session Based Collaborative Filtering

Here, we introduce a recommendation technique called Session-based Collaborative Filtering (SSCF), which recommends items that were preferred in sessions similar to the current one.

One item-based approach to music recommendation is playlist generation, where a playlist is built simply from the n songs nearest to a given seed song. Such approaches, however, assume the existence of rich metadata or audio-derived features for the music. We instead use only the information about which songs are preferred together within sessions.

Last.fm (our data source here) is an automatic CF recommender that generates personalized playlists from the usage history of users whose profiles are similar. When the active user has no profile, it can only rely on the items that user has selected so far, so we take this approach as our baseline for performance comparison.

We propose a technique that predicts the next song request from what past sessions have shown. In short, the idea is that songs frequently preferred together within sessions, rather than by a single user, can help predict songs for the active session.

Let's first sessionize the dataset: for each user, a new session starts whenever more than 5 minutes pass between the timestamps of two consecutive plays.

In [10]:
df_hist.sort_values(by=['userid','timestamp'], inplace=True)
cond1 = df_hist.timestamp-df_hist.timestamp.shift(1) > pd.Timedelta(5, 'm')
cond2 = df_hist.userid != df_hist.userid.shift(1)
df_hist['SessionID'] = (cond1|cond2).cumsum()
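
To see how the two conditions interact, here is a minimal sketch on a hypothetical toy log (the `toy` frame and its values are made up for illustration and are not taken from the Last.fm data). A row opens a new session when the gap to the previous row exceeds 5 minutes or when the user changes; the cumulative sum of that boolean flag then numbers the sessions.

```python
import pandas as pd

# Hypothetical toy log: two users, timestamps chosen to trigger both split rules
toy = pd.DataFrame({
    'userid': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'timestamp': pd.to_datetime([
        '2009-05-04 13:00:00',
        '2009-05-04 13:03:00',  # 3-minute gap: same session
        '2009-05-04 13:20:00',  # 17-minute gap: new session
        '2009-05-04 13:21:00',  # different user: new session
        '2009-05-04 13:22:00',  # 1-minute gap: same session
    ]),
})
toy = toy.sort_values(['userid', 'timestamp'])

# Same two conditions as in the cell above
gap_too_long = toy.timestamp - toy.timestamp.shift(1) > pd.Timedelta(minutes=5)
new_user = toy.userid != toy.userid.shift(1)
toy['SessionID'] = (gap_too_long | new_user).cumsum()

print(toy.SessionID.tolist())  # [1, 1, 2, 3, 3]
```

Note that the very first row always counts as a user change (its predecessor is missing), so session numbering starts at 1.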

In [11]:
df_hist.SessionID.value_counts(ascending=True)

2049         1
236098       1
234051       1
240196       1
238149       1
246345       1
252490       1
256588       1
254541       1
260686       1
199248       1
201299       1
207444       1
205397       1
215640       1
217691       1
223836       1
225887       1
164449       1
168547       1
229953       1
174692       1
232000       1
27196        1
94751        1
39458        1
37411        1
43556        1
41509        1
47654        1
          ... 
242336     323
221376     330
226916     331
57889      339
189310     341
237363     342
228360     370
57138      392
57006      435
232859     491
239399     505
58048      612
58053      653
57148      664
56993      671
57333      792
58338      907
57142      962
57140     1040
57007     1110
57290     1209
56682     1382
56880     1456
57888     1482
57607     1560
57640     1574
57519     1580
57664     1687
56761     1769
58260     2202
Name: SessionID, Length: 293564, dtype: int64

So, many sessions contain just one song. For our similarity calculation, we will remove sessions with fewer than three songs.

In [12]:
session = df_hist['SessionID'].unique().tolist()

In [13]:
# Count rows per session once, then collect the sessions with fewer than 3 songs
session_sizes = df_hist['SessionID'].value_counts()
rem_session = session_sizes[session_sizes < 3].index.tolist()

In [155]:
len(rem_session)

185984

In [157]:
df_hist.SessionID.nunique()

293564

So, out of 293564 sessions, 185984 contain at most 2 songs. 
<b>Can we exclude these sessions when computing similar sessions?</b> 

In [14]:
df_hist = df_hist[~df_hist.SessionID.isin(rem_session)]

So, we are left with 107580 sessions only.
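
The same filter can also be written without building a list of session ids first. The sketch below uses a hypothetical toy frame (names and values are illustrative only) and `groupby(...).transform('size')` to attach each row's session length, then keeps rows whose session has at least three songs.

```python
import pandas as pd

# Hypothetical toy data: session 1 has 3 rows, session 2 has 1, session 3 has 2
toy = pd.DataFrame({
    'SessionID': [1, 1, 1, 2, 3, 3],
    'trackname': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Length of the session each row belongs to, aligned with the original rows
session_size = toy.groupby('SessionID')['SessionID'].transform('size')

# Keep only rows from sessions with at least 3 songs
kept = toy[session_size >= 3]

print(kept.SessionID.unique().tolist())  # [1]
```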

In [15]:
# We rename column "SessionID" to "sessionid"
df_hist.rename(index=str, columns={"SessionID": "sessionid"}, inplace=True)

In [167]:
df_hist['sessionid'].unique()

array([     3,      6,     10, ..., 293553, 293558, 293564], dtype=int64)

SessionID starts at 3 and takes non-contiguous values up to 293564, even though only 107580 sessions remain. So, for simplicity, let's map the sessions to contiguous IDs starting at 0.

In [16]:
df_hist['session_id'] = df_hist['sessionid'].astype('category').cat.codes
session_lookup = df_hist[['session_id','sessionid']].drop_duplicates()

In [17]:
df_hist['track_id'] = df_hist['trackname'].astype("category").cat.codes
track_lookup = df_hist[['track_id', 'trackname']].drop_duplicates()
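
As a small illustration of what the category codes do, the sketch below applies the same mapping to a hypothetical toy column (the values are made up). The codes follow the sorted order of the distinct values, so the smallest original id becomes 0, and the lookup frame lets us translate back.

```python
import pandas as pd

# Hypothetical toy column with gaps in the ids, as in the real SessionID column
toy = pd.DataFrame({'sessionid': [3, 3, 6, 10, 10]})

# Contiguous integer codes, assigned in sorted order of the distinct values
toy['session_id'] = toy['sessionid'].astype('category').cat.codes

# One row per (new id, original id) pair, for translating back later
lookup = toy[['session_id', 'sessionid']].drop_duplicates()

print(toy['session_id'].tolist())  # [0, 0, 1, 2, 2]
print(lookup['sessionid'].tolist())  # [3, 6, 10]
```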

In [18]:
df_hist.drop(['userid','timestamp','artistid','artistname','trackid','trackname','sessionid'],axis=1,inplace=True)

In [19]:
df_hist.reset_index(drop=True, inplace=True)

In [20]:
df_hist.head()

Unnamed: 0,session_id,track_id
0,0,63636
1,0,111426
2,0,51641
3,1,17887
4,1,83255


In [175]:
df_hist['track_id'].nunique()

119081

In [177]:
df_hist['session_id'].nunique()

107580

We treat the items the user has selected so far as part of the current session. Thus, given the p currently selected items, we build the session profile described above and find the most similar sessions among the training sessions. By weighting the neighbour sessions' ratings of an item by their similarity to the active session, the system predicts the rating of that item for the active session.
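
As a sketch of this prediction step, in the form the code in this notebook uses (an unnormalized weighted sum over binary play indicators; the formulation in the paper may differ), the score of item $i$ for the active session $s$ can be written as below. Here $N_k(s)$ denotes the $k$ most similar training sessions and $r_{n,i} \in \{0, 1\}$ indicates whether item $i$ was played in session $n$; both symbols are notation introduced for this illustration.

$$
\hat{r}_{s,i} = \sum_{n \in N_k(s)} \mathrm{sim}(s, n)\, r_{n,i},
\qquad
\mathrm{sim}(s, n) = \frac{\sum_{j} r_{s,j}\, r_{n,j}}{\sqrt{\sum_{j} r_{s,j}^{2}}\,\sqrt{\sum_{j} r_{n,j}^{2}}}
$$

With binary entries, the cosine similarity reduces to the number of shared tracks divided by the square root of the product of the two session lengths.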

In [118]:
# Count how many times each (session, track) pair occurs
data = {}
for row in df_hist.itertuples():
    tpl = (row.session_id, row.track_id)
    data[tpl] = data.get(tpl, 0) + 1

In [121]:
sorted_data = sorted(data.items(),key = operator.itemgetter(1))

In [122]:
sorted_data

[((2, 75914), 1),
 ((2, 132827), 1),
 ((2, 61607), 1),
 ((5, 21577), 1),
 ((5, 99357), 1),
 ((5, 32365), 1),
 ((9, 85509), 1),
 ((9, 125373), 1),
 ((9, 27815), 1),
 ((10, 76337), 1),
 ((10, 59096), 1),
 ((10, 131129), 1),
 ((10, 105035), 1),
 ((13, 44944), 1),
 ((13, 125490), 1),
 ((13, 48607), 1),
 ((15, 58910), 1),
 ((15, 4255), 1),
 ((15, 139), 1),
 ((15, 103464), 1),
 ((18, 127419), 1),
 ((18, 27815), 1),
 ((18, 38606), 1),
 ((20, 24715), 1),
 ((20, 10624), 1),
 ((20, 27816), 1),
 ((21, 113014), 1),
 ((21, 48982), 1),
 ((21, 98350), 1),
 ((22, 101483), 1),
 ((22, 27815), 1),
 ((22, 71874), 1),
 ((22, 22544), 1),
 ((31, 89406), 1),
 ((31, 123034), 1),
 ((31, 65604), 1),
 ((46, 96170), 1),
 ((46, 130811), 1),
 ((46, 72025), 1),
 ((47, 56681), 1),
 ((47, 113809), 1),
 ((47, 38423), 1),
 ((50, 102687), 1),
 ((50, 95481), 1),
 ((50, 77807), 1),
 ((50, 77727), 1),
 ((50, 55028), 1),
 ((50, 16237), 1),
 ((50, 86002), 1),
 ((50, 37107), 1),
 ((51, 112029), 1),
 ((51, 56681), 1),
 ((51, 113

We can see that, for our dataset, any song is played only once within a session, so there is no point in keeping a count of how many times a song was played in a session. Instead, we store '1' for a (ssnID, trkID) combination if the track was played in that session and '0' if not.

In [178]:
# Row indices (sessions), column indices (tracks) and values (1 per play) for the sparse matrix
session = df_hist.session_id.tolist()
track = df_hist.track_id.tolist()
plays = [1] * len(df_hist)

In [252]:
all_tracks = df_hist.track_id.unique().tolist()

In [180]:
# Both id columns are contiguous codes starting at 0, so the counts give the matrix shape
rows = df_hist.session_id.nunique()
columns = df_hist.track_id.nunique()

In [183]:
data_sparse = csr_matrix((plays, (session, track)), shape=(rows, columns))
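
To make the layout of this matrix concrete, here is a minimal sketch with hypothetical toy data (three sessions, four tracks; the values are made up). Each row is a session, each column a track, and an entry is 1 where the track was played in that session.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy plays: (session, track) pairs, one entry per play
session = [0, 0, 0, 1, 1, 2]
track = [0, 1, 2, 2, 3, 3]
plays = [1] * len(session)

# 3 sessions x 4 tracks; unspecified entries are implicit zeros
toy_sparse = csr_matrix((plays, (session, track)), shape=(3, 4))

print(toy_sparse.toarray())
# [[1 1 1 0]
#  [0 0 1 1]
#  [0 0 0 1]]
```

One detail worth keeping in mind: when a matrix is built this way, repeated (row, column) pairs are summed, so a track played twice in one session would show up as 2 rather than 1.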

In [265]:
def findksimilarsessions(sessionid, data_sparse = data_sparse, k=5):
    # Compare only the query session against all sessions instead of
    # building the full session-by-session similarity matrix on every call.
    sim_array = cosine_similarity(data_sparse[sessionid], data_sparse).ravel()
    # Exclude the query session itself, then keep sessions with non-zero similarity
    sim_array[sessionid] = 0
    candidates = np.nonzero(sim_array)[0]
    # Stable sort by descending similarity and keep at most k neighbours
    top = candidates[np.argsort(-sim_array[candidates], kind='stable')][:k]
    sim_session = top.tolist()
    sim_score = sim_array[top].tolist()
    print("Top {0} similar sessions are:".format(k))
    for s, score in zip(sim_session, sim_score):
        print("Session : {0} with similarity {1}".format(s, score))
    return sim_session,sim_score

In [272]:
def getrecommendation(sessionid):
    # Score every track not yet played in the active session by the
    # similarity-weighted plays of its nearest neighbour sessions.
    pred_dict = {}
    sessn,scor = findksimilarsessions(sessionid)
    for track in all_tracks:
        # Skip tracks the active session has already played
        if data_sparse[sessionid,track] != 1:
            wtd_sum = 0
            for nbr, score in zip(sessn, scor):
                wtd_sum += data_sparse[nbr,track] * score
            pred_dict[track] = wtd_sum
    # Rank candidate tracks by predicted score, highest first
    sorted_pred =  sorted(pred_dict.items(), key=operator.itemgetter(1),reverse=True)
    print("Top 5 recommended songs are:")
    for i in range(5):
        print("Track : {0} with score {1}".format(sorted_pred[i][0],sorted_pred[i][1]))

In [274]:
getrecommendation(1)

Top 5 similar sessions are:
Session : 132 with similarity 0.3333333333333334
Session : 144 with similarity 0.3333333333333334
Session : 9464 with similarity 0.3333333333333334
Session : 34014 with similarity 0.3333333333333334
Session : 49549 with similarity 0.2886751345948129
Top 5 recommended songs are:
Track : 71979 with score 0.6666666666666669
Track : 76507 with score 0.3333333333333334
Track : 90558 with score 0.3333333333333334
Track : 27427 with score 0.3333333333333334
Track : 86071 with score 0.3333333333333334
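
The similarity values above are easier to read once we note that, for binary play vectors, cosine similarity depends only on how many tracks two sessions share and how long each session is. The sketch below is a hypothetical helper (`binary_cosine` is not part of the notebook) with made-up track ids; for instance, a value of about 0.333 is what two 3-track sessions sharing one track produce, and about 0.289 is what a 3-track and a 4-track session sharing one track produce.

```python
import math

def binary_cosine(a, b):
    """Cosine similarity of two sessions given as sets of track ids."""
    if not a or not b:
        return 0.0
    # Dot product of binary vectors = number of shared tracks;
    # the norm of a binary vector = sqrt(number of tracks)
    return len(a & b) / math.sqrt(len(a) * len(b))

# One shared track, both sessions have 3 tracks: 1 / sqrt(3 * 3)
print(binary_cosine({1, 2, 3}, {3, 4, 5}))

# One shared track, sessions of 3 and 4 tracks: 1 / sqrt(3 * 4)
print(binary_cosine({1, 2, 3}, {3, 4, 5, 6}))
```

A track's score is then the sum of the similarities of the neighbour sessions that contain it, which is why a track appearing in two neighbours with similarity 1/3 each ends up with a score of about 0.667.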


In [29]:
track_id_list = df_hist[df_hist.session_id == 0].track_id.tolist()

In [30]:
track_id_list

[63636, 111426, 51641]

In [24]:
recommended_track_list = [71979,76507,90558]

In [27]:
track_lookup[track_lookup.track_id == 63636]['trackname'].iloc[0]

'Mission Flats'

In [28]:
print("============LISTENED TO============")
for track in track_id_list:
    # .iloc[0] takes the first match by position, independent of the index labels
    track_name = track_lookup[track_lookup.track_id == track]['trackname'].iloc[0]
    print(track_name)
print("============RECOMMENDED============")
for track in recommended_track_list:
    track_name = track_lookup[track_lookup.track_id == track]['trackname'].iloc[0]
    print(track_name)

Change Of Seasons
Seasons
Dope Stuff
Opfern
Primo
Stay A While


#### Note
Since determining the neighbor sessions becomes very time-consuming as the number of sessions increases, we use a special in-memory index data structure (cache) in our implementation. Technically, in the training phase, we create one data structure that maps the training sessions to their sets of items and one that maps the items to the sessions in which they appear. To make recommendations for the current session <b><i>s</i></b>, we first create the union of the sessions in which the items of <b><i>s</i></b> appear. This union is the set of possible neighbors for the current session <b><i>s</i></b>.
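
The two lookup structures described in this note can be sketched with plain dictionaries. Everything below is a hypothetical illustration (session ids, track ids and the `possible_neighbors` helper are made up), not the implementation referred to in the note.

```python
from collections import defaultdict

# Hypothetical training sessions: session id -> set of track ids
session_items = {
    0: {10, 11, 12},
    1: {12, 13},
    2: {14, 15},
    3: {11, 16},
}

# Inverted index: track id -> set of sessions in which the track appears
item_sessions = defaultdict(set)
for sid, items in session_items.items():
    for item in items:
        item_sessions[item].add(sid)

def possible_neighbors(current_items):
    """Union of all training sessions that share at least one item with the current session."""
    neighbors = set()
    for item in current_items:
        # .get avoids creating empty entries for unseen items
        neighbors |= item_sessions.get(item, set())
    return neighbors

print(sorted(possible_neighbors({11, 12})))  # [0, 1, 3]
```

Session 2 shares no item with the current session, so it is never considered, which is the point of the index: similarity is computed only for sessions that can have a non-zero score.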

To further reduce the computational complexity of the prediction process, we select a subsample of these possible neighbors using a heuristic. In this work, we took the m most recent sessions, as focusing on recent trends has been shown to be effective for recommendations in e-commerce ([Determining Characteristics of Successful Recommendations from Log Data – A Case Study. In SAC ’17.](http://ls13-www.cs.tu-dortmund.de/homepage/publications/jannach/Conference_SAC_2017_logs.pdf)). We then compute the similarity between these m most recent possible neighbors and the current session and select the k most similar sessions as the neighbor sessions of the current session. Again through lookup and set union operations, we create the set of <i>recommendable</i> items R, which contains the items that appear in one of the k sessions.
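
The three steps of this heuristic (recency subsample, k most similar, union of items) can be sketched as follows. The data, the integer timestamps used to define recency, the cosine similarity on sets, and the function names are all assumptions made for illustration; the note does not specify these details.

```python
import math

# Hypothetical possible neighbors: session id -> (last-activity timestamp, set of track ids)
candidates = {
    0: (100, {10, 11, 12}),
    1: (400, {12, 13}),
    2: (300, {11, 16, 18}),
    3: (200, {11, 12, 17}),
}

def binary_cosine(a, b):
    """Cosine similarity of two sessions given as sets of track ids."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def recommendable_items(current_items, candidates, m=3, k=2):
    # Step 1: keep only the m most recent possible neighbors
    recent = sorted(candidates, key=lambda sid: candidates[sid][0], reverse=True)[:m]
    # Step 2: among those, keep the k sessions most similar to the current session
    ranked = sorted(recent,
                    key=lambda sid: binary_cosine(current_items, candidates[sid][1]),
                    reverse=True)
    neighbors = ranked[:k]
    # Step 3: R is the union of the items of the k neighbor sessions
    R = set()
    for sid in neighbors:
        R |= candidates[sid][1]
    return neighbors, R

neighbors, R = recommendable_items({11, 12}, candidates)
print(neighbors)  # [3, 1]
print(sorted(R))  # [11, 12, 13, 17]
```

Session 0 is the oldest and is dropped by the recency step before any similarity is computed. In practice one would likely also remove the items already in the current session from R before ranking, as the recommendation function earlier in this notebook does.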