# Collaborative filtering based on the listening behavior in the previous week

The goal is to estimate skipping behavior for the sessions on Sep 18, 2018 (20180918). A larger sample would give more precise similarity estimates for collaborative filtering, but listening behavior and preferences may change from day to day, so older data is likely to be less informative about future sessions. Therefore, only the data from the previous week (Sep 11 to Sep 17) is included here.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# 7 days (Sep 11 to Sep 17, 2018), 10 log files per day
file_list = []
for dateCode in range(911, 918):
    for logN in range(10):
        file_list.append('../models/SVD/all_tracks/k300_v_log_'+str(logN)+'_20180'+str(dateCode)+'_000000000000.csv')
        
file_list
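
The hard-coded `range(911, 918)` works because the whole window falls within September. As a minimal sketch, assuming the same file-naming pattern, the list could instead be derived from the target date with pandas, which would also cover a window that crosses a month boundary. The helper name `build_file_list` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd

def build_file_list(target_date, n_days=7, n_logs=10,
                    base_dir='../models/SVD/all_tracks'):
    """List the SVD factor files for the n_days before target_date (hypothetical helper)."""
    target = pd.Timestamp(target_date)
    # The n_days calendar days immediately preceding the target date, oldest first
    days = pd.date_range(end=target - pd.Timedelta(days=1), periods=n_days, freq='D')
    return [base_dir + '/k300_v_log_' + str(log_n) + '_' + day.strftime('%Y%m%d') + '_000000000000.csv'
            for day in days
            for log_n in range(n_logs)]

file_list_alt = build_file_list('2018-09-18')
```

The ordering (day in the outer loop, log number in the inner loop) mirrors the loop above, so the two lists should be interchangeable.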

There are many ways to measure track-to-track similarity for collaborative filtering, ranging from common options such as cosine similarity and Pearson correlation to less common ones such as Chebyshev distance.

Here, without any prior assumption about which one works best, I calculated 11 similarity/distance measures. The idea is that several measures combined should provide more useful information than any single one.

In [4]:
def cal_sim_svd(filename):
    """Compute 11 track-by-track similarity/distance matrices from one SVD factor file."""
    from scipy.spatial.distance import cdist

    v = pd.read_csv(filename).drop(columns=['Unnamed: 0'])
    # make the column names '0', '1', ... (0 to 99) to match other data;
    # v.shape[1] is the number of columns, whereas len(v) counts rows
    v.columns = list(map(str, range(v.shape[1])))

    # dot product between track columns (equals cosine similarity only for unit-length columns)
    TrackTrackCosSim = v.transpose().dot(v)
    TrackTrackLinCorr = v.corr()  # Pearson correlation
    TrackTrackSpearCorr = v.corr(method='spearman')
    TrackTrackKendCorr = v.corr(method='kendall')

    def pairwise(data, metric):
        # pairwise distances between tracks, labeled like the other matrices
        return pd.DataFrame(cdist(data, data, metric),
                            index=TrackTrackCosSim.index, columns=TrackTrackCosSim.columns)

    TrackTrackEuclDist = pairwise(v.T, 'euclidean')
    TrackTrackManhDist = pairwise(v.T, 'cityblock')
    TrackTrackSqEuclDist = pairwise(v.T, 'sqeuclidean')
    TrackTrackCanbDist = pairwise(v.T, 'canberra')
    TrackTrackHammDist = pairwise(v.T > 0, 'hamming')  # compares only whether each entry is positive
    TrackTrackChebDist = pairwise(v.T, 'chebyshev')
    TrackTrackBrayDist = pairwise(v.T, 'braycurtis')

    # the order of this list is relied on by index in the averaging loop
    sim_list = [TrackTrackCosSim,
                TrackTrackLinCorr,
                TrackTrackSpearCorr,
                TrackTrackKendCorr,
                TrackTrackEuclDist,
                TrackTrackManhDist,
                TrackTrackSqEuclDist,
                TrackTrackCanbDist,
                TrackTrackHammDist,
                TrackTrackChebDist,
                TrackTrackBrayDist]

    return sim_list
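
One caveat about `TrackTrackCosSim`: `v.transpose().dot(v)` is the raw dot product between track columns, which coincides with cosine similarity only when every column has unit length. Below is a minimal sketch of an explicitly normalized variant on a toy matrix with made-up values. The helper name `cosine_sim_columns` is hypothetical and not part of the original notebook, and the sketch assumes no column is all zeros (a zero column would produce NaN).

```python
import numpy as np
import pandas as pd

def cosine_sim_columns(v):
    """Cosine similarity between the columns of DataFrame v (hypothetical helper)."""
    norms = np.sqrt((v ** 2).sum(axis=0))  # Euclidean norm of each column
    unit = v / norms                       # scale every column to unit length
    return unit.transpose().dot(unit)

# Toy data: columns '0' and '1' point in the same direction, '2' is orthogonal to both
toy = pd.DataFrame({'0': [1.0, 0.0], '1': [3.0, 0.0], '2': [0.0, 2.0]})
raw_dot = toy.transpose().dot(toy)   # raw dot products depend on vector length
cos_sim = cosine_sim_columns(toy)    # cosine similarity depends on direction only
print(raw_dot)
print(cos_sim)
```

In the toy data, the raw dot product between columns '0' and '1' is 3, while their cosine similarity is 1 because they are parallel.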
    

In [6]:
from timeit import default_timer as timer #to see how long the computation will take
# import glob
# file_list = glob.glob('../models/SVD/all_tracks/k300_v*.csv')

TrackTrackCosSim_mean = 0
TrackTrackLinCorr_mean = 0
TrackTrackSpearCorr_mean = 0
TrackTrackKendCorr_mean = 0
TrackTrackEuclDist_mean = 0
TrackTrackManhDist_mean = 0
TrackTrackSqEuclDist_mean = 0
TrackTrackCanbDist_mean = 0
TrackTrackHammDist_mean = 0
TrackTrackChebDist_mean = 0
TrackTrackBrayDist_mean = 0

start = timer()
for filename in file_list:
    # accumulate running sums; they are divided by the number of files after the loop
    sim_list = cal_sim_svd(filename)
    TrackTrackCosSim_mean = sim_list[0]+TrackTrackCosSim_mean
    TrackTrackLinCorr_mean = sim_list[1]+TrackTrackLinCorr_mean
    TrackTrackSpearCorr_mean = sim_list[2]+TrackTrackSpearCorr_mean
    TrackTrackKendCorr_mean = sim_list[3]+TrackTrackKendCorr_mean
    TrackTrackEuclDist_mean = sim_list[4]+TrackTrackEuclDist_mean
    TrackTrackManhDist_mean = sim_list[5]+TrackTrackManhDist_mean
    TrackTrackSqEuclDist_mean = sim_list[6]+TrackTrackSqEuclDist_mean
    TrackTrackCanbDist_mean = sim_list[7]+TrackTrackCanbDist_mean
    TrackTrackHammDist_mean = sim_list[8]+TrackTrackHammDist_mean
    TrackTrackChebDist_mean = sim_list[9]+TrackTrackChebDist_mean
    TrackTrackBrayDist_mean = sim_list[10]+TrackTrackBrayDist_mean
    

TrackTrackCosSim_mean = TrackTrackCosSim_mean/len(file_list)
TrackTrackLinCorr_mean = TrackTrackLinCorr_mean/len(file_list)
TrackTrackSpearCorr_mean = TrackTrackSpearCorr_mean/len(file_list)
TrackTrackKendCorr_mean = TrackTrackKendCorr_mean/len(file_list)
TrackTrackEuclDist_mean = TrackTrackEuclDist_mean/len(file_list)
TrackTrackManhDist_mean = TrackTrackManhDist_mean/len(file_list)
TrackTrackSqEuclDist_mean = TrackTrackSqEuclDist_mean/len(file_list)
TrackTrackCanbDist_mean = TrackTrackCanbDist_mean/len(file_list)
TrackTrackHammDist_mean = TrackTrackHammDist_mean/len(file_list)
TrackTrackChebDist_mean = TrackTrackChebDist_mean/len(file_list)
TrackTrackBrayDist_mean = TrackTrackBrayDist_mean/len(file_list)

TrackTrackCosSim_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_CosSim.csv')
TrackTrackLinCorr_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_LinCorr.csv')
TrackTrackSpearCorr_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_SpearCorr.csv')
TrackTrackKendCorr_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_KendCorr.csv')
TrackTrackEuclDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_EuclDist.csv') 
TrackTrackManhDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_ManhDist.csv')
TrackTrackSqEuclDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_SqEuclDist.csv') 
TrackTrackCanbDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_CanbDist.csv')
TrackTrackHammDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_HammDist.csv')
TrackTrackChebDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_ChebDist.csv')
TrackTrackBrayDist_mean.to_csv('../models/SVD/all_tracks/similarity_for20180918/k300_BrayDist.csv')


print('Runtime: %0.2fs' % (timer() - start))

Runtime: 165.87s
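
The accumulate-then-divide pattern in the cell above could be written more compactly by keying the running sums by measure name. The following is only a sketch of that pattern: `toy_measures` is a hypothetical stand-in for `cal_sim_svd`, only two of the 11 measures are included, and the inputs are small random matrices rather than the SVD files.

```python
import numpy as np
import pandas as pd

NAMES = ['CosSim', 'LinCorr']  # the notebook uses 11 names; two suffice for the sketch

def toy_measures(v):
    """Hypothetical stand-in for cal_sim_svd: one matrix per name in NAMES."""
    return [v.transpose().dot(v), v.corr()]

def mean_measures(frames):
    """Average each measure element-wise over all input frames."""
    totals = {name: 0 for name in NAMES}
    for v in frames:
        for name, mat in zip(NAMES, toy_measures(v)):
            totals[name] = totals[name] + mat
    return {name: total / len(frames) for name, total in totals.items()}

rng = np.random.default_rng(0)
frames = [pd.DataFrame(rng.normal(size=(5, 3)), columns=['0', '1', '2'])
          for _ in range(4)]
means = mean_measures(frames)
# Each averaged matrix could then be written out with means[name].to_csv(...)
```

With this layout, adding or dropping a measure only requires changing `NAMES` and the list returned by the measure function.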


Note that the Euclidean distance matrices are not helpful here, as the Euclidean distance between any 2 distinct vectors in an SVD unitary matrix is always the same (orthonormal vectors are sqrt(2) apart). Therefore, they won't be used in the following scripts.
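
This can be checked numerically. For orthonormal vectors u_i and u_j with i != j, ||u_i - u_j||^2 = ||u_i||^2 + ||u_j||^2 - 2 u_i.u_j = 1 + 1 - 0 = 2, so every pairwise distance is sqrt(2). The sketch below uses a random orthogonal matrix from a QR decomposition as a stand-in for a full (square) unitary factor. It is an illustration under that assumption, not a computation on the notebook's data.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Q is a 6x6 orthogonal matrix, so its columns are orthonormal
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))

# Pairwise Euclidean distances between the columns of Q
D = cdist(Q.T, Q.T, 'euclidean')
off_diag = D[~np.eye(6, dtype=bool)]  # drop the zero self-distances

# Every distance between two distinct columns should equal sqrt(2)
print(np.allclose(off_diag, np.sqrt(2)))
```

Since all off-diagonal entries are identical, such a matrix cannot rank one track as closer than another, which is why it carries no information for filtering.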