# kNN - Temporal Extension

The kNN method, when using cosine similarity as the similarity measure, does not consider the temporal sequence of the events in a session. The proposed tkNN method uses the same scoring scheme as the kNN method. The only difference is that, given the current session "s", we consider item "i" recommendable only if it appears in a neighbor session "n" directly after a certain item. In our implementation, that certain item is the last item of the current session "s".
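As a minimal sketch of that rule (not the notebook's implementation), the candidate filter can be written in plain Python. The helper name `next_item_candidates` and the toy sessions are assumptions made for illustration only.

```python
def next_item_candidates(current_session, neighbor_sessions):
    """Collect items that directly follow the current session's last item
    in any neighbor session (hypothetical helper, for illustration)."""
    last_item = current_session[-1]
    candidates = set()
    for neighbor in neighbor_sessions:
        # Walk over every consecutive pair (prev, nxt) in the neighbor session
        for prev, nxt in zip(neighbor, neighbor[1:]):
            if prev == last_item:
                candidates.add(nxt)
    return candidates


# Toy data: the current session ends with "b"
current = ["a", "b"]
neighbors = [["b", "c", "d"], ["x", "b"], ["b", "e", "b", "f"]]
print(sorted(next_item_candidates(current, neighbors)))
```

The second neighbor contributes nothing because "b" is its last item, so nothing follows it.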

In [1]:
import pandas as pd
import numpy as np
import gc
import dateutil.parser
from scipy.sparse import csr_matrix
import operator
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation, cosine
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
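# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas >= 1.3)
# skips malformed rows and reports them
df_hist = pd.read_table('userid-timestamp-artid-artname-traid-traname.tsv', on_bad_lines='warn')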

b'Skipping line 2120260: expected 6 fields, saw 8\n'
b'Skipping line 2446318: expected 6 fields, saw 8\n'
b'Skipping line 11141081: expected 6 fields, saw 8\n'
b'Skipping line 11152099: expected 6 fields, saw 12\nSkipping line 11152402: expected 6 fields, saw 8\n'
b'Skipping line 11882087: expected 6 fields, saw 8\n'
b'Skipping line 12902539: expected 6 fields, saw 8\nSkipping line 12935044: expected 6 fields, saw 8\n'
b'Skipping line 17589539: expected 6 fields, saw 8\n'


In [3]:
df_hist.columns = ['userid', 'timestamp', 'artistid','artistname','trackid','trackname']

In [4]:
df_profile = pd.read_table('userid-profile.tsv')



In [5]:
user_id = df_profile['#id'].tolist()[:200]

df_hist = df_hist[df_hist['userid'].isin(user_id)]

In [6]:
df_hist['timestamp'] = df_hist['timestamp'].apply(lambda x : dateutil.parser.parse(x))

In [7]:
df_hist.sort_values(by=['userid','timestamp'], inplace=True)
cond1 = df_hist.timestamp-df_hist.timestamp.shift(1) > pd.Timedelta(5, 'm')
cond2 = df_hist.userid != df_hist.userid.shift(1)
df_hist['sessionid'] = (cond1|cond2).cumsum()
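To make the session rule in the cell above concrete, here is a small sketch on made-up data: a new session starts whenever the gap to the previous play exceeds five minutes or the user changes. The `toy` frame and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy listening log: three plays by u1, two plays by u2
toy = pd.DataFrame({
    "userid": ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2006-08-13 13:59:20",
        "2006-08-13 14:03:29",   # 4 min 9 s after the previous play: same session
        "2006-08-13 14:10:43",   # 7 min 14 s after the previous play: new session
        "2006-08-13 14:11:00",   # different user: new session
        "2006-08-13 14:12:00",
    ]),
})
toy = toy.sort_values(["userid", "timestamp"])

# True where the pause since the previous row is longer than five minutes
gap = toy.timestamp - toy.timestamp.shift(1) > pd.Timedelta(minutes=5)
# True where the row belongs to a different user than the previous row
new_user = toy.userid != toy.userid.shift(1)

# Each True starts a new session; the cumulative sum numbers the sessions from 1
toy["sessionid"] = (gap | new_user).cumsum()
print(toy)
```

Note that the resulting ids start at 1, because the very first row always counts as a user change.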

In [8]:
df_hist.head()

Unnamed: 0,userid,timestamp,artistid,artistname,trackid,trackname,sessionid
16683,user_000001,2006-08-13 13:59:20+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,c4633ab1-e715-477f-8685-afa5f2058e42,The Launching Of Big Face,1
16682,user_000001,2006-08-13 14:03:29+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,bc2765af-208c-44c5-b3b0-cf597a646660,Zn Zero,1
16681,user_000001,2006-08-13 14:10:43+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,aa9c5a80-5cbe-42aa-a966-eb3cfa37d832,The Return Of Super Barrio - End Credits,2
16680,user_000001,2006-08-13 14:17:40+00:00,67fb65b5-6589-47f0-9371-8a40eb268dfb,Tommy Guerrero,d9b1c1da-7e47-4f97-a135-77260f2f559d,Mission Flats,3
16679,user_000001,2006-08-13 14:19:06+00:00,1cfbc7d1-299c-46e6-ba4c-1facb84ba435,Artful Dodger,120bb01c-03e4-465f-94a0-dce5e9fac711,What You Gonna Do?,3


Since we'll be looking at the most similar sessions to make recommendations, let's drop userid from our DataFrame.

In [9]:
df_hist.drop(['userid'], axis=1, inplace=True)

In [None]:
# For training, we will use only sessions that are at least two songs long.
# session = df_hist['sessionid'].unique().tolist()
# rem_session = []
# for s in session:
#     if df_hist[df_hist.sessionid == s].shape[0] <2:
#         rem_session.append(s)

In [None]:
#df_hist = df_hist[~df_hist.sessionid.isin(rem_session)]

In [10]:
df_hist['sessionid'].unique()

array([      1,       2,       3, ..., 1092446, 1092447, 1092448],
      dtype=int64)

Let's map the session ids and tracks in our final dataset to values starting at zero and create the corresponding lookup tables.

In [None]:
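# Re-code session ids as zero-based integers so they can be used directly as matrix row indices
session_codes = df_hist['sessionid'].astype('category').cat.codes
session_lookup = pd.DataFrame({'session_id': session_codes, 'sessionid': df_hist['sessionid']}).drop_duplicates()
df_hist['sessionid'] = session_codes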

In [11]:
df_hist['track_id'] = df_hist['trackname'].astype("category").cat.codes
track_lookup = df_hist[['track_id', 'trackname','artistname']].drop_duplicates()
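One consequence of deriving the codes from `trackname` alone is that two tracks sharing a title but credited to different artists receive the same `track_id`. The sketch below illustrates this on made-up rows; the song and artist names are placeholders, not dataset values.

```python
import pandas as pd

# Placeholder rows: "Song B" appears under two different artist credits
toy = pd.DataFrame({
    "trackname": ["Song B", "Song B", "Song A"],
    "artistname": ["Artist X & Artist Y", "Artist X", "Artist Z"],
})

# Category codes are assigned per distinct track name, in sorted order
toy["track_id"] = toy["trackname"].astype("category").cat.codes

# The lookup keeps one row per (track_id, trackname, artistname) combination
lookup = toy[["track_id", "trackname", "artistname"]].drop_duplicates()
print(lookup)
```

Both "Song B" rows share one `track_id`, so the lookup table can return several artists for a single id.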

In [12]:
df_hist.columns

Index(['timestamp', 'artistid', 'artistname', 'trackid', 'trackname',
       'sessionid', 'track_id'],
      dtype='object')

In [13]:
df_hist.drop(['timestamp','artistid','artistname','trackid','trackname'],axis=1,inplace=True)

In [20]:
df_hist.head()

Unnamed: 0,sessionid,track_id
16683,0,312469
16682,0,367762
16681,1,316815
16680,2,203056
16679,2,352259


In [21]:
session = df_hist.sessionid.unique().tolist()
sessn_past = {}
for ssn in session:
    sessn_past[ssn] = []
for row in df_hist.itertuples():
    sessn_past[row.sessionid].append(row.track_id)

In [22]:
session = []
track = []
plays = []
for row in df_hist.itertuples():
    session.append(row.sessionid)
    track.append(row.track_id)
    plays.append(1)

In [23]:
rows = df_hist.sessionid.nunique()
columns = df_hist.track_id.nunique()

In [24]:
data_sparse = csr_matrix((plays, (session, track)), shape=(rows, columns))

To further reduce the computational complexity of the prediction process, we select a subsample of the possible neighbors using a heuristic. In this work, we take the "m" most recent sessions, as focusing on recent trends has been shown to be effective for recommendations in e-commerce ([ Determining Characteristics of Successful Recommendations from Log Data – A Case Study. In SAC ’17.](http://ls13-www.cs.tu-dortmund.de/homepage/publications/jannach/Conference_SAC_2017_logs.pdf)). We then compute the similarity between each of these "m" most recent possible neighbors and the current session, and select the "k" most similar sessions as the neighbor sessions of the current session.

For our implementation, we will use m = 1500, k = 500
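The two-stage selection described above could be sketched as follows. This is an illustration under stated assumptions, not the code used in the cells below (which compute the full similarity matrix): `select_neighbors` is a hypothetical helper, "possible neighbors" are assumed to be past sessions sharing at least one item with the current session, and each track is assumed to count once per session.

```python
import heapq
import math


def select_neighbors(current, past_sessions, m, k):
    """Pick the k most similar sessions among the m most recent possible neighbors.

    past_sessions: list of (session_id, last_timestamp, set_of_items).
    Returns a list of (session_id, similarity), most similar first.
    """
    current = set(current)
    # Assumption: a possible neighbor shares at least one item with the current session
    possible = [s for s in past_sessions if current & s[2]]
    # Stage 1: keep only the m most recent possible neighbors
    recent = heapq.nlargest(m, possible, key=lambda s: s[1])

    # Cosine similarity for binary (set-based) session vectors
    def cos(a, b):
        return len(a & b) / math.sqrt(len(a) * len(b))

    scored = [(sid, cos(current, items)) for sid, _, items in recent]
    # Stage 2: keep the k most similar of those
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


past = [
    (1, 10, {"a"}),
    (2, 20, {"a", "b"}),
    (3, 30, {"c"}),            # shares nothing with the current session
    (4, 40, {"b", "c", "d"}),
    (5, 5, {"a", "b"}),        # identical items, but the oldest session
]
print(select_neighbors(["a", "b"], past, m=3, k=2))
```

In this toy run, session 5 is dropped in the first stage even though it matches the current session exactly, which is the trade-off the recency heuristic accepts.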

In [25]:
similarities = cosine_similarity(data_sparse,dense_output=False)
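To build some intuition for the values in this matrix, the sketch below computes cosine similarity by hand for sessions in which every track is played once. In that case the similarity reduces to the number of shared tracks divided by the square root of the product of the session lengths. `binary_cosine` is a hypothetical helper for illustration and is not part of scikit-learn.

```python
import math


def binary_cosine(a, b):
    """Cosine similarity of two sessions given as sets of track ids,
    i.e. binary vectors where every played track counts once."""
    return len(a & b) / math.sqrt(len(a) * len(b))


s0 = {0, 1}        # two tracks
s1 = {0}           # shares one track with s0
s2 = {0, 1, 2}     # contains both tracks of s0 plus one more

print(binary_cosine(s0, s1))  # one shared track: 1 / sqrt(2 * 1)
print(binary_cosine(s0, s2))  # two shared tracks: 2 / sqrt(2 * 3)
```

Sessions with repeated plays carry counts larger than one in `data_sparse`, so their similarities will differ from this simplified set-based form.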

In [26]:
def findksimilarsessions(sessionid, data_sparse = data_sparse, metric=cosine, k=200):
    # Row of the precomputed cosine-similarity matrix for this session
    # (data_sparse and metric are kept in the signature but are not used here)
    sim_array = similarities[sessionid].toarray()
    # Let's get the index of all non-zero entries in sim_array
    r,c = np.nonzero(sim_array)
    # Exclude the session itself explicitly instead of assuming it sorts first
    sim = {col: sim_array[0,col] for col in c if col != sessionid}
    # Keep only the k most similar sessions
    sorted_sim = sorted(sim.items(), key=operator.itemgetter(1),reverse=True)[:k]
    sim_session = []
    sim_score = []
    print("Top similar sessions are:")
    for ssn, score in sorted_sim:
        sim_session.append(ssn)
        sim_score.append(score)
        print("Session : {0} with similarity {1}".format(ssn, score))
    return sim_session,sim_score

In [51]:
def getrecommendation(sessionid):
    pred_dict = {}
    sessn,scor = findksimilarsessions(sessionid)
    last_track_of_session = sessn_past[sessionid][-1]
    track_recommendable = []
    for ssn in sessn:
        if (last_track_of_session in sessn_past[ssn]) and (len(sessn_past[ssn])!=1):
            # Position of the first occurrence of the last track in the neighbor session
            idx = sessn_past[ssn].index(last_track_of_session)
            # Track played directly after it
            if idx != len(sessn_past[ssn])-1:
                track_recommendable.append(sessn_past[ssn][idx+1])
            # Note: the track played directly before it is added as well, which is
            # broader than the "directly after" rule described in the introduction
            if idx != 0:
                track_recommendable.append(sessn_past[ssn][idx-1])
    track_recommendable = list(set(track_recommendable))

    for track in track_recommendable:
        # Only score tracks that were not already played in the current session
        if data_sparse[sessionid,track] == 0:
            # Similarity-weighted sum of the neighbors' play counts for this track
            wtd_sum = 0
            for i in range(len(sessn)):
                wtd_sum += data_sparse[sessn[i],track] * scor[i]
            pred_dict[track] = wtd_sum
    sorted_pred =  sorted(pred_dict.items(), key=operator.itemgetter(1),reverse=True)
    print("Top recommended songs are:")
    for i in range(len(sorted_pred)):
        print("Track : {0} with score {1}".format(sorted_pred[i][0],sorted_pred[i][1]))
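The scoring step inside `getrecommendation` can be isolated in a few lines: each candidate track gets the sum, over all neighbor sessions, of the neighbor's similarity times its play count for that track. The sketch below uses made-up similarities and counts; `score_candidates` is a hypothetical helper, not a function defined in this notebook.

```python
def score_candidates(candidates, neighbors):
    """Rank candidate tracks by similarity-weighted play counts.

    neighbors: list of (similarity, {track: play_count}) pairs.
    Returns (track, score) pairs sorted by descending score.
    """
    scores = {}
    for track in candidates:
        # A neighbor that never played the track contributes zero
        scores[track] = sum(sim * plays.get(track, 0) for sim, plays in neighbors)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy neighbors with made-up similarities and play counts
neighbors = [
    (1.0, {"x": 1, "y": 1}),
    (0.5, {"x": 2}),
    (0.25, {"z": 1}),
]
ranked = score_candidates(["x", "y", "z"], neighbors)
print(ranked)
```

Because play counts are used rather than a 0/1 indicator, a track that a neighbor repeated within one session is weighted more heavily.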

In [52]:
getrecommendation(0)

Top similar sessions are:
Session : 241 with similarity 0.9999999999999998
Session : 390 with similarity 0.9999999999999998
Session : 427 with similarity 0.9999999999999998
Session : 1078 with similarity 0.9999999999999998
Session : 766 with similarity 0.816496580927726
Session : 1273 with similarity 0.816496580927726
Session : 690 with similarity 0.7071067811865475
Session : 967667 with similarity 0.7071067811865475
Session : 968504 with similarity 0.7071067811865475
Session : 495 with similarity 0.6324555320336758
Session : 6059 with similarity 0.6324555320336758
Session : 1322 with similarity 0.4999999999999999
Session : 967645 with similarity 0.4999999999999999
Session : 967682 with similarity 0.2886751345948129
Session : 700 with similarity 0.2357022603955158
Session : 967666 with similarity 0.2357022603955158
Top recommended songs are:
Track : 141947 with score 3.3979042259228036
Track : 48396 with score 0.2357022603955158


In [57]:
last_track_id = df_hist[df_hist.sessionid == 0].track_id.tolist()[-1]

In [59]:
last_track_id

367762

In [55]:
recommended_track_list = [141947,48396]

In [63]:
track_lookup.head()

Unnamed: 0,track_id,trackname,artistname
16683,312469,The Launching Of Big Face,Plaid & Bob Jaroc
16682,367762,Zn Zero,Plaid & Bob Jaroc
16681,316815,The Return Of Super Barrio - End Credits,Plaid & Bob Jaroc
16680,203056,Mission Flats,Tommy Guerrero
16679,352259,What You Gonna Do?,Artful Dodger


In [68]:
list(set(track_lookup[track_lookup.track_id == 367762].trackname.tolist()))[0]

'Zn Zero'

In [69]:
print("============LISTENED TO============")
track_name = list(set(track_lookup[track_lookup.track_id == last_track_id].trackname.tolist()))[0]
print(track_name)
print("============RECOMMENDED============")
for track in recommended_track_list:
    track_name = list(set(track_lookup[track_lookup.track_id == track].trackname.tolist()))[0]
    print(track_name)

============LISTENED TO============
Zn Zero
============RECOMMENDED============
I Citizen The Loathsome
Breezin'


In [72]:
track_lookup[track_lookup.trackname == 'Zn Zero']

Unnamed: 0,track_id,trackname,artistname
16682,367762,Zn Zero,Plaid & Bob Jaroc
3524584,367762,Zn Zero,Plaid


In [73]:
track_lookup[track_lookup.trackname == 'I Citizen The Loathsome']

Unnamed: 0,track_id,trackname,artistname
15963,141947,I Citizen The Loathsome,Plaid & Bob Jaroc


In [76]:
# Note: this matches 'Breezin' (track_id 48395), not the recommended "Breezin'" (track_id 48396)
track_lookup[track_lookup.trackname == 'Breezin']

Unnamed: 0,track_id,trackname,artistname
2669242,48395,Breezin,George Benson


So a session whose last song is "Zn Zero" by Plaid & Bob Jaroc is recommended "I Citizen The Loathsome" by the same artists, as suggested by similar past sessions, which makes sense.