# Capstone : Recommender System

## Data set description

This dataset includes long-term (about 10 months) check-in data in New York city and Tokyo collected from Foursquare from 12 April 2012 to 16 February 2013.
It contains two files in tsv format. Each file contains 8 columns, which are:

1. User ID (anonymized)
2. Venue ID (Foursquare)
3. Venue category ID (Foursquare)
4. Venue category name (Fousquare)
5. Latitude
6. Longitude
7. Timezone offset in minutes (The offset in minutes between when this check-in occurred and the same time in UTC)
8. UTC time

The file dataset_TSMC2014_NYC.txt contains 227428 check-ins in New York city.
The file dataset_TSMC2014_TKY.txt contains 573703 check-ins in Tokyo.

To train the model, i took the whole new york data set, plus ten random users from the tokyo dataset.
I kept only the columns user ID, Venue ID, and Venue categroy name.

REFERENCES

@article{yang2014modeling,
	author={Yang, Dingqi and Zhang, Daqing and Zheng, Vincent. W. and Yu, Zhiyong},
	journal={IEEE Transactions on Systems, Man, and Cybernetics: Systems},
	title={Modeling User Activity Preference by Leveraging User Spatial Temporal Characteristics in LBSNs},
	year={2015},
	volume={45},
	number={1},
	pages={129--142},
	ISSN={2168-2216},
	publisher={IEEE}
}

## LCARS: A Location-Content-Aware Recommender System

LCARS is a location-content aware recommender system, based on LDA (latent dirichlet allocation) model. The LCARS model take into consideration both user preference and local preferences. Local preferences are event or venue that  In this way, it can recommend venue to a new user that don't have any user history, allocating more consideration to the local preferences. LCARS was proposed by Hongzhi Yin, Yizhou Sun, Bin Cui, Zhiting Hu, Ling Chen from QCIS, University of Technology, Sydney, in 2013.

LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA- LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item co- occurrence patterns and exploiting item contents. The online rec- ommendation part automatically combines the learnt interest of the querying user and the local preference of the querying city to pro- duce the top-k recommendations.

Experiments, of which the results are reported in the original, has shown better performance than four others existings recommender systems : Category-based k-Nearest Neighbors Algorithm (CKNN),Item-based k-Nearest Neighbors Algorithm (IKNN),User interest, social and geographical influences (USG),standard LDA- based method.

## Summary

<h4>Part 1 : Initialization</h4>
<ul><li>1.1. Import Data Set</li>
<li>1.2. Data Wrangling</li>
<li>1.3. Initialize count matrices and topic assignment matrix</li></ul>
<h4>Part 2 : Inferring LCA-LDA model</h4>
<ul><li>2.1. Load matrices</li>
<li>2.2. LCA-LDA model Inference</li></ul>
<h4>Part 3 : Online Recommendation</h4>
<ul><li>3.1. Scoring functions</li>
<li>3.2. Threshold-Based Algorithm</li></ul>
<h4>Part 4 : Experimental Restults</h4>
<ul><li>4.3. Evaluation  (Recall@k)</li>
</ul>

In [1]:
import pandas as pd
import numpy as np
from numpy import load
from numpy import save
import random as rd

In [62]:
import gc
%reset_selective -f df_users_profile
%reset_selective -f dictionary
%reset_selective -f df_nyc
%reset_selective -f df_tokyo
gc.collect()

515

# PART 1 : Initialization

## 1.1. Import Data Set

In [63]:
df_user_profiles = pd.read_csv('data/df_user_profiles_us.csv')
print('Nb of rows ',df_user_profiles.shape[0])
print('Nb of users : ', len(np.unique(df_user_profiles['user_id'])) )
print('Nb of venues : ', len(np.unique(df_user_profiles['venue_id'])) )
print('Nb chekins from Boston : ',df_user_profiles[df_user_profiles.city == 'Boston'].shape[0])
print('Nb chekins from New York : ',df_user_profiles[df_user_profiles.city == 'New York'].shape[0])
print('Nb chekins from Chicago : ',df_user_profiles[df_user_profiles.city == 'Chicago'].shape[0])
df_user_profiles.head()

Nb of rows  161468
Nb of users :  11894
Nb of venues :  47969
Nb chekins from Boston :  54279
Nb chekins from New York :  45866
Nb chekins from Chicago :  61323


Unnamed: 0,user_id,venue_id,category,city
0,180962,4b3be5b9f964a520e37d25e3,Bar,Boston
1,38722,4b3be5b9f964a520e37d25e3,Bar,Boston
2,93711,4b3be5b9f964a520e37d25e3,Bar,Boston
3,68294,4b3be5b9f964a520e37d25e3,Bar,Boston
4,101835,4b3be5b9f964a520e37d25e3,Bar,Boston


In [64]:
# Checking for null value
df_user_profiles.isnull().sum()

user_id     0
venue_id    0
category    0
city        0
dtype: int64

## 1.2. Data Wrangling

In [65]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
import inspect
from gensim.models import LdaMulticore
snowball = SnowballStemmer("english")

### Lemmatization, Tokenization and Stemming of content words

From data frame we construct user profiles matrix, with tokenized and stemmed content words

In [66]:
print("Starting Lemmatization, Tokenization and Stemming of content words...")
mat = []
stop_words = ['/', ')','(']
for user_id, venue_id,cw, city in zip(df_user_profiles['user_id'],df_user_profiles['venue_id'],df_user_profiles['category'],df_user_profiles['city']):
    stemmed_words = []
    for word in word_tokenize(cw):
        if(word not in stop_words):
            word = word.lower()
            if(word.startswith('caf')): word='coffe'
            stemmed_words.append(snowball.stem(word))
    mat.append([user_id,venue_id,stemmed_words,city])

user_profiles = np.array(mat)
print("user profiles matrix created.")
print("user_profiles shape : ",user_profiles.shape)

Starting Lemmatization, Tokenization and Stemming of content words...
user profiles matrix created.
user_profiles shape :  (161468, 4)


## 1.3. Initialize count matrices and topic assignment matrix

We replace each value with a simple integer Id.

In [67]:
from gensim.corpora import Dictionary
# Dictionary encapsulates the mapping between normalized words and their integer ids.
dictionary = Dictionary(user_profiles[:,2])

In [68]:
#unique user_id
unique_users = np.unique(df_user_profiles['user_id'])
# unique venue
venues = np.unique(df_user_profiles['venue_id'])
print('Number of venues : ' + str(len(venues)))
# unique location
locations = np.unique(df_user_profiles['city'])
print('Number of location : ' + str(len(locations)))

Number of venues : 47969
Number of location : 3


In [69]:
#retrieve the position of an user interest (user_id,venue_id) in the dataset D
def ui_pos(user_id,venue_id):
    return np.where((user_profiles[:,0] == user_id) & (user_profiles[:,1] == venue_id))[0][0]

#map user id with its position into the the matrix n_us1, n_us0, n_uz
def user2id(uid):
    return np.where(unique_users == uid)[0][0]

#map location city with its position into the the matrix n_lz
def location2id(loc):
    return np.where(locations == loc)[0][0]


#map venues id with its position into the the matrix n_zv
def venue2id(venue_id):
    return np.where(venues == venue_id)[0][0]

def get_assignment_pos(user_id, venue_id):
    return df_user_profiles[(df_user_profiles.user_id == user_id) & ((df_user_profiles.venue_id == venue_id))].index[0]

### 1.4. Encode users profile matrix, and randomly assign topic  and s value to each user profile, and update count matrices 

In [70]:
k = 50
# Declare count matrices and initialize them with 0
n_users = len(unique_users)
n_venues = len(venues)
n_loc = len(locations)
n_words = len(dictionary)

# Number of time that s=0 and s=1 has been sampled in the user profile Du
n_us = np.zeros((n_users,2))

#if s = 1
#Number of time that topic k has been sampled from multinomial distribution specific to user u
n_uz = np.zeros((n_users, k))

#if s = 0
#Number of time that topic k has been sampled from multinomial distribution specific to location l
n_lz = np.zeros((n_loc, k ))

#Number of time that venue v has been sampled from topic z
n_zv = np.zeros((k,n_venues))

#Number of time that content word c has been sampled from topic z
n_zc = np.zeros((k,n_words))

# Initialize the topic assignment matrix, assigning random topic and random s value (0 | 1), to each user interest
# At the same time, we're going to update count matrices n_uz, n_zc, n_vz ...
mat = []
for row in user_profiles:
    uid = user2id(row[0])
    vid = venue2id(row[1])
    locid = location2id(row[3])
    # randomly assign a topic and update the count matrices associated
    z_init = np.random.multinomial(1, [1/k]*k, size=1)[0].argmax()
    n_uz[uid,z_init] += 1
    n_lz[locid,z_init] += 1
    n_zv[z_init,vid] += 1
    encoded_words = []   
    for word in row[2]:
        word_id = dictionary.token2id[word]
        encoded_words.append(word_id)
        n_zc[z_init,word_id] += 1
        
    # sample a coin S according to Bernouilli law
    s_init = np.random.randint(0,2)
    n_us[s_init] += 1
    
    mat.append([uid,vid,encoded_words,locid,s_init,z_init])
print("")
assignment_matrix = np.array(mat)
print("Assignment matrix created : ", assignment_matrix.shape)
print("-------------------------")
print("")
n_zv_sum = np.sum(n_zv,axis=1)
n_zc_sum = np.sum(n_zc,axis=1)
n_uz_sum = np.sum(n_uz,axis=1)
n_lz_sum = np.sum(n_lz,axis=1)
print('n_words (total number of words) : ', n_words)
print('n_venues (total number of venues) : ', n_venues)
print('n_users (total number of users) : ', n_users)
print('n_zc_sum : ', n_zc_sum.shape)
print('n_lz_sum : ', n_lz_sum.shape)
print('n_zv_sum : ', n_zv_sum.shape)
print('n_uz_sum : ', n_uz_sum.shape)
print("n_us : " + str(n_us.shape))
print("n_uz : " + str(n_uz.shape))
print("n_lz : " + str(n_lz.shape))
print("n_zc : " + str(n_zc.shape))
print("n_zv : " + str(n_zv.shape))


Assignment matrix created :  (161468, 6)
-------------------------

n_words (total number of words) :  439
n_venues (total number of venues) :  47969
n_users (total number of users) :  11894
n_zc_sum :  (50,)
n_lz_sum :  (3,)
n_zv_sum :  (50,)
n_uz_sum :  (11894,)
n_us : (11894, 2)
n_uz : (11894, 50)
n_lz : (3, 50)
n_zc : (50, 439)
n_zv : (50, 47969)


# PART 2 : Inferring

## 2.1. Load matrices

In [5]:
# just initialze count matrices, from a given assignment matrix
def initialize_count_matrices(assigmnent_matrix, k):
    # Declare count matrices and initialize them with 0
    n_users = len(np.unique(assigmnent_matrix[:,0]))
    n_venues = len(np.unique(assigmnent_matrix[:,1]))
    n_loc = len(np.unique(assigmnent_matrix[:,3]))
    n_words = len(np.unique([word for wordlist in np.unique(assigmnent_matrix[:,2]) for word in wordlist]))

    # Number of time that s=0 and s=1 has been sampled in the user profile Du
    n_us = np.zeros((n_users,2))

    #if s = 1
    #Number of time that topic k has been sampled from multinomial distribution specific to user u
    n_uz = np.zeros((n_users, k))

    #if s = 0
    #Number of time that topic k has been sampled from multinomial distribution specific to location l
    n_lz = np.zeros((n_loc, k ))

    #Number of time that venue v has been sampled from topic z
    n_zv = np.zeros((k,n_venues))

    #Number of time that content word c has been sampled from topic z
    n_zc = np.zeros((k,n_words))
    
    for row in assignment_matrix:
        n_uz[row[0],row[5]] += 1
        n_lz[row[3],row[5]] += 1
        n_zv[row[5],row[1]] += 1
        for word in row[2]:
            n_zc[row[5],word] += 1
        n_us[row[4]] += 1
    
    return n_us, n_uz, n_lz, n_zc, n_zv,n_words, n_venues

In [7]:
# GLOBAL
assignment_matrix = load('inferred_assignment_matrix_k=100.npy',allow_pickle=True)
print('assignment_matrix : ',assignment_matrix.shape)
n_us, n_uz, n_lz, n_zc, n_zv, n_words, n_venues = initialize_count_matrices(assignment_matrix, 100)
n_zv_sum = np.sum(n_zv,axis=1)
n_zc_sum = np.sum(n_zc,axis=1)
n_uz_sum = np.sum(n_uz,axis=1)
n_lz_sum = np.sum(n_lz,axis=1)



print('n_words : ', n_words)
print('n_venues : ', n_venues)
print('n_zc_sum : ', n_zc_sum.shape)
print('n_lz_sum : ', n_lz_sum.shape)
print('n_zv_sum : ', n_zv_sum.shape)
print('n_uz_sum : ', n_uz_sum.shape)
print("n_us : " + str(n_us.shape))
print("n_uz : " + str(n_uz.shape))
print("n_lz : " + str(n_lz.shape))
print("n_zc : " + str(n_zc.shape))
print("n_zv : " + str(n_zv.shape))

assignment_matrix :  (161468, 6)
n_words :  439
n_venues :  47969
n_zc_sum :  (100,)
n_lz_sum :  (3,)
n_zv_sum :  (100,)
n_uz_sum :  (11894,)
n_us : (11894, 2)
n_uz : (11894, 100)
n_lz : (3, 100)
n_zc : (100, 439)
n_zv : (100, 47969)


## 2.2. LCA-LDA model Inference

#### 2.2.1. Set Parameters

In [71]:
#parameteres
k = 50
alpha = 50/k
alpha_ = 50/k
beta = 0.01
beta_ = 0.01
gamma = 0.5
gamma_ = 0.5

#### 2.2.2. Set up assignment functions

In [72]:
# s assignment
#param c : user id, loc id, current topic z, current s value
def s_assignment(uui, lui, zui, sui):
    n_us[uui,sui]-1
    n_uz[uui,zui]-1
    n_lz[lui,zui]-1
    n_uz_sum[uui]-1
    n_lz_sum[lui]-1
    left1 = (n_uz[uui][zui] + alpha)/(n_uz_sum[uui] + k*alpha)
    left0 = (n_lz[lui][zui] + alpha_)/(n_lz_sum[lui] + k*alpha_)
    right1 = (n_us[uui][1] + gamma)/(n_us[uui][1] + n_us[uui][0] + gamma + gamma_)
    right0 = (n_us[uui][0] + gamma_)/(n_us[uui][1] + n_us[uui][0] + gamma + gamma_)
    ps1 = left1*right1
    ps0 = left0*right0
    prob_s = [ps0,ps1]/np.sum([ps0,ps1])
    #print('ps1 : ' + str(ps1) + ' ps0 : ' + str(ps0))
    new_s = np.random.multinomial(1,prob_s,1)[0].argmax()
    n_us[uui,new_s]+1
    n_uz[uui,zui]+1
    n_lz[lui,zui]+1
    n_lz_sum[lui]+1
    n_uz_sum[uui]+1
    return new_s

In [73]:
import sys
def z_assignment(uui, lui, vui,cui, zui, sui):
    neg = False
    n_zv[zui,vui] -= 1
    n_zv_sum[zui] -= 1
    for w in cui:
        n_zc[zui,w] -= 1
        n_zc_sum[zui] -=1
        if(n_zc[zui,w] < 0): 
            neg = True
    n_uz[uui,zui] -= 1
    n_lz[lui,zui] -= 1
    n_uz_sum[uui] -= 1
    n_lz_sum[lui] -= 1
    
    # Rollback
    if(n_uz[uui,zui]<0 or n_lz[lui,zui]<0 or n_zv[zui,vui]<0 or neg):
        n_uz[uui,zui] += 1
        n_lz[lui,zui] += 1
        n_zv[zui,vui] += 1
        for w in cui:
            n_zc[zui,w] += 1
        sys.exit("Error : count matrix cant be negative")
        
    if(sui == 1):
        left = (n_uz[uui] + alpha)/(n_uz_sum[uui] + k*alpha)
    else:
        left = (n_lz[lui] + alpha_)/(n_lz_sum[lui] + k*alpha_)
    
    number_words_in_z = 0
    for w in cui:
        number_words_in_z += n_zc[:,w]
    right2 = (number_words_in_z + beta_)/(n_zc_sum + n_words*beta_)
    right1 = (n_zv[:,vui] + beta)/(n_zv_sum + n_venues*beta)
    prob_z = (left * right1 * right2).astype('float64') # to avoid ValueError: sum(pvals[:-1]) > 1.0 : .astype('float64')
    # normalize the vector prob_z
    prob_z_n = (prob_z/np.sum(prob_z))
    #print(prob_z_n)
    
    # generate a new z according to the newly computed distribution over topic prob_z_n
    new_zui = np.random.multinomial(1, prob_z_n, size=1)[0].argmax()
     
    # Update count matrices
    n_zv[new_zui,vui] += 1
    n_zv_sum[zui] += 1
    for w in cui:
        n_zc[new_zui,w] += 1
        n_zc_sum[zui] +=1
    n_uz[uui,new_zui] += 1
    n_lz[lui,new_zui] += 1
    n_uz_sum[uui] += 1
    n_lz_sum[lui] += 1
   
    return new_zui
    

#### 2.2.3. Learning parameters : Gibbs Sampling

In [76]:
iterations = 100
it = 0
n_words = len(np.unique([word for wordlist in np.unique(assignment_matrix[:,2]) for word in wordlist]))
n_venues = len(np.unique(assignment_matrix[:,1]))

changes_s = 0
changes_z = 0
end_while = False
while(it < iterations):
    changes_s = 0
    changes_z = 0
    #print(str(changes_s) + ' ' + str(changes_z))
    for ui in assignment_matrix:
        uui = ui[0]
        sui = ui[4]
        zui = ui[5]
        vui = ui[1]
        cui = ui[2] # list
        lui = ui[3]

        #print(str(uui) + ' c :' + str(cui) + ' topic : ' + str(zui))
        #sample a coin s :
        new_s = s_assignment(uui, lui, zui, sui)
        ui[4] = new_s

        #update topic assignment list with newly sampled topic for token zui.
        new_zui = z_assignment(uui, lui,vui,cui ,zui,new_s)
        ui[5] = new_zui
        
        if(new_s != sui):
            changes_s += 1
        if(new_zui != zui):
            changes_z +=1
    
    it = it + 1
    #if(it%10 == 0):
    print('Iteration ' + str(it) + '. Changes s : ' + str(changes_s) + '\n Changes z : ' + str(changes_z))  

Iteration 10. Changes s : 63892
 Changes z : 125
Iteration 20. Changes s : 64037
 Changes z : 102
Iteration 30. Changes s : 63415
 Changes z : 1066
Iteration 40. Changes s : 62158
 Changes z : 184
Iteration 50. Changes s : 62283
 Changes z : 1
Iteration 60. Changes s : 62549
 Changes z : 1
Iteration 70. Changes s : 62072
 Changes z : 0
Iteration 80. Changes s : 62224
 Changes z : 0
Iteration 90. Changes s : 62448
 Changes z : 3
Iteration 100. Changes s : 62343
 Changes z : 1


In [77]:
# k=100 at iter = 400
# k=50 at iter = 300
save('inferred_assignment_matrix_k=50.npy',assignment_matrix)

In [59]:
assignment_matrix = load('inferred_assignment_matrix.npy',allow_pickle=True)

In [78]:
# GLOBAL
# Inferred distribution
theta_uz = (n_uz + alpha)/(np.sum(n_uz, axis=1).reshape(-1,1) + k *alpha)
theta_lz = (n_lz + alpha_)/(np.sum(n_lz, axis=1).reshape(-1,1) + k *alpha_)
phi_zv = (n_zv + beta)/(np.sum(n_zv, axis=1).reshape(-1,1) + n_venues * beta)
phi_zc = (n_zc + beta_)/(np.sum(n_zc, axis=1).reshape(-1,1) + n_words * beta)
gamma_u1 = (n_us[:,1]+gamma)/(n_us[:,0] + n_us[:,1] + gamma + gamma_)
print('theta_uz ', theta_uz.shape)
print('theta_lz ', theta_lz.shape)
print('phi_zv ', phi_zv.shape)
print('phi_zc ', phi_zc.shape)
print('gamma_u1 ', gamma_u1.shape)

theta_uz  (11894, 50)
theta_lz  (3, 50)
phi_zv  (50, 47969)
phi_zc  (50, 439)
gamma_u1  (11894,)


In [18]:
# List of length n_venues  : [location, content_words]
# allow fast retrieval of a venues location and associated content_words
venues_loc_words = [[assignment_matrix[assignment_matrix[:,1]==v,3][0],
                   assignment_matrix[assignment_matrix[:,1]==v,2][0]] for v in range(n_venues)]

In [21]:
save('venues_loc_words.npy', np.array(venues_loc_words))

## Part 3 : Online Recommendation

### 3.1. Scoring functions

In [79]:
#function weig_score denote the expected weight of the query (u,lu) on dimension z (topic)
def weight_score(u,lu,z):
    return (gamma_u1[u]*theta_uz[u,z] + (1-gamma_u1[u])*theta_lz[lu,z])

In [80]:
#S is the function that wrap up the above functions weight_score and F
#return the score of the venue_id, for a querying user and a querying city 
def S(u,lu, v):
    s_score = 0
    for z in range(k):
        s_score += weight_score(u,lu,z)*F(lu,v,z)
    return s_score

In [81]:
#compute the score the score of a venue with respect to location on topic
#Note that F(lu, venue, topic) is independent of querying user
def F(lu, v, z):
    # lv (int): location of the current venue
    lv = venues_loc_words[v][0]
    if(lv == lu):
        # vc (lists) : content words of the current venue
        vc = venues_loc_words[v][1]
        f_result = phi_zv[z,v] * sum([phi_zc[z,c] for c in vc])
    else:
        f_result = 0
    
    return f_result

### 3.2. Threshold-Based Algorithm

In [84]:
#param : location of the querying user
#Offline part : local preference
#return a list of venues for each topic, sorted by their F score 
def compute_sorted_lists_of_venues():
    print('Start computational of k sorted lists...')
    list_k_n = []
    for lu in range(len(locations)):
        list_k = []
        for t in range(k):
            #print('Topic ',t,'...')
            f_score = []
            for row in assignment_matrix[assignment_matrix[:,3] == lu]:
                f_score.append([row[1],F(lu,row[1],t)])
                #if(v%500==0):print(v,' ',location,' ',t,' fscore : ',F(location,v,t))
            f_score.sort(key=lambda x: x[1], reverse= True)
            list_k.append(f_score)
        list_k_n.append(list_k)
    print('computational of k sorted lists done.')
    return list_k_n


In [83]:
# Compute k top list of sorted venues for each locat (OFFLINE TASK)
list_k_n = compute_sorted_lists_of_venues()

Start computational of k sorted lists...
computational of k sorted lists done.



In [85]:
from collections import deque
def threshold_based_algorithm(u,lu):
    priorityQueue = []
    ranking = []
    list_k = list_k_n[lu]
    topk = 20
    for t in range(k):
        # get head element of each list and add it to the queue
        v = list_k[t][0][0]
        priorityQueue.append([t,S(u,lu,v)])
    priorityQueue.sort(key=lambda x: x[1], reverse= True)
    ta_score = compute_TA(u,lu,list_k,priorityQueue)
    while True:
        # pop
        nextListToCheck = priorityQueue.pop(0)[0]
        v = list_k[nextListToCheck].pop(0)[0]
        if((v in [x[0] for x in ranking]) == False):
            if(len(ranking) < topk):
                ranking.append([v, S(u,lu,v)])
                ranking.sort(key=lambda x: x[1], reverse= True)
            else:
                v_ = ranking[0][0]
                if(S(u,lu,v_) > ta_score):
                    break
                if(S(u,lu,v_) < S(u,lu,v)):
                    ranking.remove(topk)
                    ranking.append(v_,S(u,lu,v_))
                    ranking.sort(key=lambda x: x[1], reverse= True)

        
        if(len(list_k[nextListToCheck])>0):
            v = list_k[nextListToCheck][0][0]
            priorityQueue.append([nextListToCheck,S(u,lu,v)])
            priorityQueue.sort(key=lambda x: x[1], reverse= True)
            ta_score = compute_TA(u,lu,list_k,priorityQueue)
        else:
            break
    return ranking

In [86]:
#compute the theshold score 
#return:  the theshold score ta_score (float )
def compute_TA(u,lu,list_k,priorityQueue):
    ta_score = 0
    for i in range(k):
        z = priorityQueue[i][0]
        #get top rated venue of topic i+1 (head element of the list_k) 
        v = list_k[i][0][0]
        ta_score = ta_score + weight_score(u,lu,z)*F(lu,v,z)
    return ta_score

## Part 4 :  Experimental Results

To make an overall evaluation of the recommendation effectiveness of our proposed LCA-LDA,We first design the following two real settings : 
1) querying city are new city to querying users
2) querying cities are the home city to querying user
we divide a user's activity history into a test set and a training set.

### 4.1. Select User Profiles

In [33]:
#df_users_profile = pd.read_csv('./data/.csv')
# We select an user from chicago that have checkins in a non-home city (NYC)
# We select an user from NYC that have checkins in a non-home city (Chicago)
# We can find these users with the following steps :
# - 1 get users from new-york
nyc_users = np.unique(assignment_matrix[assignment_matrix[:,3] == location2id('New York')][:,0])
# - 2 get users from Chicago
chicago_users = np.unique(assignment_matrix[assignment_matrix[:,3] == location2id('Chicago')][:,0])
# - 3 get common users within the two arrays
nyc_chi_users = [x for x in nyc_users if x in chicago_users]

# Select some user profile where user_id is in one of the list above
user_profiles_set[assignment_matrix[assignment_matrix[:,0] == 1797],
assignment_matrix[assignment_matrix[:,0] == 1583],
assignment_matrix[assignment_matrix[:,0] == 2954]]

In [144]:
assignment_matrix[assignment_matrix[:,0] == 1797]
assignment_matrix[assignment_matrix[:,0] == 1583]
assignment_matrix[assignment_matrix[:,0] == 2954]

array([[2954, 1358, list([79, 26]), 1, 1, 33],
       [2954, 35261, list([20]), 1, 0, 33],
       [2954, 1445, list([390]), 1, 0, 33],
       [2954, 1918, list([74, 65]), 1, 0, 33],
       [2954, 25831, list([79, 26]), 1, 1, 33],
       [2954, 1021, list([176, 199]), 1, 0, 33],
       [2954, 17332, list([321, 12, 420, 419]), 1, 1, 33],
       [2954, 1446, list([51, 50]), 1, 0, 33],
       [2954, 974, list([0]), 1, 0, 33],
       [2954, 7251, list([46, 26]), 1, 1, 33],
       [2954, 7314, list([317, 52]), 1, 0, 33],
       [2954, 11186, list([74, 65]), 1, 0, 33],
       [2954, 2726, list([77]), 1, 0, 26],
       [2954, 17058, list([94, 0]), 1, 0, 33],
       [2954, 1097, list([111, 74, 65]), 1, 0, 33],
       [2954, 29432, list([13]), 1, 0, 33],
       [2954, 2385, list([46, 26]), 1, 0, 33],
       [2954, 35361, list([382, 40, 316, 15]), 1, 1, 33],
       [2954, 33546, list([60]), 1, 0, 33],
       [2954, 25422, list([21, 22]), 1, 0, 33],
       [2954, 9014, list([220]), 1, 1, 33],
    

### 4.2. Set up test data sets

#### Setting 1

For the first setting, we select all spatial items visited by the user in a non-home city as the test set and use the rest of the user’s activity history in other cities as the training set

In [134]:
# Select which User profile we want to test :
# -- 3 Chicago -> NYC
#up = user_profiles_set[0]
# -- 7 chicago -> nyc
up = user_profiles_set[1]
# -- 9 nyc -> chicago
#up = user_profiles_set[2]
home_city = location2id('Chicago')
non_home_city = location2id('New York')

In [135]:
# First setting : querying city are new city to querying users
# test set : [venue_id,loc_id]
# All items visited by the user in an non-home city
testset1 = up[up[:,3] == non_home_city][:,[1,3]]
# Rest of user activity history in others city as training set
trainset1 = up[up[:,3] != non_home_city][:,[1,3]]
print('Length test set : ', len(testset1))
print('Length train set : ', len(trainset1))

Length test set :  8
Length train set :  117


#### Setting 2

For the second setting, we randomly select 20% of spatial items visited by the user in personal home city as the test set, and use the rest of personal activity history as the training set.

In [137]:
#sample = rd.sample(list(up[up[:,3] == home_city][:,1]), int(len(up[up[:,3] == home_city])))
#testset2 = np.array([[v,home_city] for v in sample20])
testset2 = up[up[:,3] == home_city][:,[1,3]]
# the rest of user checkins
#trainset2 = np.array([[venue_id,loc_id] for venue_id,loc_id in up[:,[1,3]] if venue_id not in sample20])
trainset1 = up[up[:,3] != home_city][:,[1,3]]
print('Length test set : ', len(testset2))
print('Length train set : ', len(trainset2))
ratedByUser = list(testset2[:,0]) + list(trainset2[:,0])

Length test set :  117
Length train set :  8


### 4.3. Evaluation  (Recall@k)

To evaluate the recommender model, we adopt the testing methodology and the measurement Recall@k.
For each test case (u, v, lv) in test data set :
1. We randomly select 1000 additional spatial items located at lv and unrated by user u. We assume that most of them will not be of interest to user u.
2. We compute the ranking score for the test item v as well as the additional 1000 spatial items.
3. We form a ranked list by ordering all the 1001 spatial items according to their ranking scores. Let p denote the rank of the test item v within this list. The best result corresponds to the case where v precedes all the random items (i.e., p = 0).
4. We form a top-k recommendation list by picking the k top ranked items from the list. If p < k we have a hit (i.e., the test item v is recommended to the user). Otherwise we have a miss. The probability of a hit increases with the increasing value of k. When k = 1001 we always have a hit.

#### Setting 1

In [139]:
top = 20
ratedByUser = list(testset1[:,0]) + list(trainset1[:,0])
# Randomly select 1000 additional spatial items located at lv and unrated by user u
#  venues located at non home city unrated by user 
unratedVenuesSample = rd.sample(list(np.unique(assignment_matrix[(np.isin(assignment_matrix[:,1],ratedByUser) == False) 
                          & (assignment_matrix[:,3] == non_home_city)][:,1])),1000)
# hits denote the number of hits in the test case
# we have a hit when p < top (top ranked result)
hits = 0
for v_test, l_test in testset1:   
    user_id = up[0][0]
    rankList = [[v,S(user_id,l_test,v)] for v in unratedVenuesSample]
    rankList.append([v_test,S(user_id,l_test,v_test)])
    rankList.sort(key=lambda x: x[1], reverse= True)
    # p : rank of the test item v_test
    p = [x[0] for x in rankList].index(v_test)
    if(p < top):
        hits += 1

recall = hits/len(testset1)
print(hits,'/',len(testset1))
print('Recall where querying cities are new cities : ', recall)

5 / 8
Recall where querying cities are new cities :  0.625


topic = 50
uid:35 [0.23,23 ]
uid:391 [0.25]
uid:2954

In [46]:
ratedByUser = list(testset2[:,0]) + list(trainset2[:,0])
unratedVenuesSample = rd.sample(
        list(np.unique(assignment_matrix[(np.isin(assignment_matrix[:,1],ratedByUser) == False) 
                          & (assignment_matrix[:,3] == 2)][:,1])),1000)

In [53]:
print(unratedVenuesSample[0:10])
35949 in ratedByUser

[8259, 42645, 5659, 6683, 42825, 19597, 5859, 35949, 38319, 28955]


False

In [55]:
testset2[0,1]

2

#### Setting 2

In [141]:
top = 20
# Randomly select 1000 additional spatial items located at lv and unrated by user u
ratedByUser = list(testset2[:,0]) + list(trainset2[:,0])
#  venues located at l_test unrated by user 
unratedVenuesSample = rd.sample(
list(np.unique(assignment_matrix[(np.isin(assignment_matrix[:,1],ratedByUser) == False) 
                          & (assignment_matrix[:,3] == home_city)][:,1])),1000)
# hits denote the number of hits in the test case
# we have a hit when p < top (top ranked result)
hits = 0
for v_test, l_test in testset2:
    # up[0][0] => user_id
    rankList = [[v,S(up[0][0],l_test,v)] for v in unratedVenuesSample]
    rankList.append([v_test,S(up[0][0],l_test,v_test)])
    rankList.sort(key=lambda x: x[1], reverse= True)
    # p : rank of the test item v_test
    p = [x[0] for x in rankList].index(v_test)
    if(p < top):
        hits += 1

recall = hits/len(testset2)
print(hits,'/',len(testset2))
print('Recall where querying cities are home town cities : ', recall)

23 / 117
Recall where querying cities are home town cities :  0.19658119658119658


In [143]:
np.mean([0.5,0.25,0.63])
np.mean([0.25,0.16,0.20])


0.20333333333333337