<h1 style="text-align: center">A Context-Aware Recommender System</h1>
In this notebook, we will attempt at implementing a context-aware recommender system. The proposed approach uses a hybrid collaborative filtering method in order to recommend locations in a city for users, based on their history of visits, and users' contexts.<br>

1. <a href="#intro">Introduction</a>
2. <a href="#prefiltering">Pre-filtering Locations</a>
3. <a href="#model">Building Recommendation Model</a>
4. <a href="#finalrecom">Generate Final Recommendations</a>
5. <a href="#eval">Evaluation</a>

<span id="intro"><span>
# Introduction
---
This soultion includes two major steps, <strong>pre-filtering locations</strong> to detect tourists venues and <strong>building a recommendation model</strong> considering user's asymmetric similarities and visit probablity in the user's current context.

In [38]:
import os, warnings, folium
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.io as pio
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances, mean_squared_error, precision_recall_fscore_support, precision_score
from sklearn.preprocessing import minmax_scale, MultiLabelBinarizer
from sklearn.decomposition import NMF
from random import randint
from ml_metrics import mapk, apk

# Pandas and Numpy configs
pd.set_option('display.max_columns', 20)
np.set_printoptions(suppress=True)

# Input data files are available in the data directory.
input_path = './data/'
    
# Init ploty in offline mode
init_notebook_mode(connected=True)
plot_template = 'plotly_white'
# if not os.path.exists(f'./plots'):
#     os.mkdir(f'./plots')

warnings.filterwarnings('ignore')

print(os.listdir(input_path))

ModuleNotFoundError: No module named 'folium'

## Loading Data
Let's load data into the <code>raw</code> dataframe. The dataset contains photo records including meta tags related to photos and users who took them.

In [15]:
# Path of file to read
data_file_path = f'{input_path}/london_20k.csv'

# Change data types
data_type = {
    'photo_id': 'object',
    'owner': 'object',
    'faves': 'float16',
    'lat': 'float32',
    'lon': 'float32',
    'taken': 'datetime64'
}

# Read file into a variable data
raw = pd.read_csv(data_file_path, 
                  engine='python', 
                  sep=',', 
                  encoding='utf-8', 
                  dtype=data_type, 
                  decimal=',')
data_dim = raw.shape

print(f'Dataframe dimentions: {data_dim}', f'\n{"-"*50}\nData Types:\n{raw.dtypes}')

# Show head
raw.head()

Dataframe dimentions: (20000, 13) 
--------------------------------------------------
Data Types:
photo_id               object
owner                  object
gender                float64
occupation             object
title                  object
description            object
tags                   object
faves                 float16
lat                   float32
lon                   float32
u_city                 object
u_country              object
taken          datetime64[ns]
dtype: object


Unnamed: 0,photo_id,owner,gender,occupation,title,description,tags,faves,lat,lon,u_city,u_country,taken
0,12056545693,78191777@N00,1.0,Producer/DJ,près,<i>Let us draw near with a true heart in full ...,"clapham, london, england, ukgarage, dubstep, e...",0.0,51.465164,-0.129085,30307,USA,2014-01-20 15:19:44
1,12453639663,41087279@N00,1.0,Accountant,DSC_4241 Chyna Whyne from Jamaica Live at Char...,Chyna Whyne from Jamaica Live at Charlie Wrigh...,"chyna, whyne, from, jamaica, live, charlie, wr...",1.0,51.527672,-0.083648,London,England,2014-02-09 23:05:35
2,13185773995,41087279@N00,1.0,Accountant,DSC_6178 Flirt 69 Birthday Party at Charlie Wr...,Flirt 69 Birthday Party at Charlie Wrights Mus...,"flirt, 69, birthday, party, charlie, wrights, ...",0.0,51.527672,-0.083648,London,England,2014-03-14 22:24:02
3,13295046445,30625665@N00,1.0,,Bank Limited,BREAKING NEWS My 806th picture to be viewed ov...,"bank, ex, cash, point, atm, brick, stone, lond...",5.0,51.513042,-0.089221,,,2014-03-20 09:13:03
4,13357656115,41087279@N00,1.0,Accountant,DSC_6743 Ray Estaire Live at Charlie Wrights M...,Ray Estaire Live at Charlie Wrights Music Loun...,"ray, estaire, jazz, ensemble, the, dominant, 7...",1.0,51.527672,-0.083648,London,England,2014-03-21 23:18:39


## Data Validation
In this step, we find and remove rows including coordinates (latitude and longitude) with <code>NA</code>/<code>Null</code> value.

In [16]:
# Find total missing values
data = raw[['photo_id','owner','lat','lon','taken']]
missing_nan = data.isna().sum()

print('TOTAL MISSINGS:', missing_nan, sep='\n')

# Remove missing values
data = data.dropna(subset=['lat','lon'])
new_size = len(data.index)
print(f'{"-"*50}\n{data_dim[0]-new_size} empty rows are removed.')

TOTAL MISSINGS:
photo_id    0
owner       0
lat         0
lon         0
taken       0
dtype: int64
--------------------------------------------------
0 empty rows are removed.


<span id="prefiltering"></span>
# Pre-Filtering
---
Using DBSCAN clustering, I try to remove noise coordinates, and find potential locations can be considered as tourist venues. This includes the steps below:
1. Determine <code>eps</code> and <code>min_sample</code> parameters for DBSCAN clustering.
2. Cluster <code>data</code> with DBSCAN and find potential points in each cluster.
3. Find center of each cluster:<br>Because of the arbitrary shape of clusters in DBSCAN method, I find the reference point of each cluster by assuming the summation of latitudes and longitudes divided by numbers of points inside of each cluster. Center is the coordinate of the nearest point to the reference point.
4. Profiling locations:<br>This step identifies tourist locations, their time of visits as well as visited pattern of each location in terms of different contextual factors.

## Finding DBSCAN Parameters
We need to know the best performing parameters of DBSCAN clustering. So, through a trial and error approach we determine the appropriate values for <code>eps</code> and <code>min_pts</code>/<code>min_samples</code>.

In [17]:
# Create dataframe filled with DBSCAN params and clusters
def paramsClusters(data, eps_range, minPts_range):
    m_per_rad = 6371.0088 * 1000
    df = pd.DataFrame(columns=['eps','min_pts','num_clusters'])
    for m in minPts_range:
        for e in eps_range:
            eps_rad = e/m_per_rad
            eps_rad = eps_rad
            db = DBSCAN(eps=eps_rad, min_samples=m, algorithm='ball_tree', metric='haversine').fit(np.radians(data[['lat','lon']]))
            c = len(set(db.labels_ + 1))
            df = df.append({'eps': e, 'min_pts': m, 'num_clusters': c}, ignore_index=True)
    
    return df

# DBSCAN trend - epsilons and clusters
epsilons = list(map(lambda n: n*20, range(1,16)))
min_points = list(map(lambda n: n*10, range(1,6)))
pc = paramsClusters(data, epsilons, min_points)
EVC = go.Figure()

for m in pc.min_pts.unique():
    df = pc[pc.min_pts == m]
    EVC.add_trace(go.Scatter(
        x=df.eps,
        y=df.num_clusters,
        name=f'Min Samples: {m} points',
        mode='lines+markers',
        marker=dict(size=8),
        line=dict(width=2),
        line_shape='spline'
    ))
    
EVC.update_layout(
    title='The number of detected clusters with different valuses of MinSamples',
    xaxis=dict(title='Epsilon', zeroline=False, dtick=40),
    yaxis=dict(title='Number of cluster', zeroline=False),
    template=plot_template
)

# DBSCAN trend - samples and clusters
epsilons = list(map(lambda n: n*40, range(1,6)))
min_points = list(map(lambda n: n*5, range(1,11)))
pc = paramsClusters(data, epsilons, min_points)
MVC = go.Figure()
    
for e in pc.eps.unique():
    df = pc[pc.eps == e]
    MVC.add_trace(go.Scatter(
        x=df.min_pts,
        y=df.num_clusters,
        name=f'Epsilon: {e} m',
        mode='lines+markers',
        marker=dict(size=8),
        line=dict(width=2),
        line_shape='spline'
    ))

MVC.update_layout(
    title='The number of detected clusters with different valuses of Eps',
    xaxis=dict(title='Minimum samples in the neighborhood', zeroline=False, dtick=5),
    yaxis=dict(title='Number of cluster', zeroline=False),
    template=plot_template
)

EVC.show()
MVC.show()

In [18]:
EVC.write_image(f'./plots/epsilon_by_cluster.jpeg', scale=3)
MVC.write_image(f'./plots/minpoint_by_cluster.jpeg', scale=3)

ValueError: Image generation requires the psutil package.

Install using pip:
    $ pip install psutil

Install using conda:
    $ conda install psutil


## DBSCAN
I choose <code>eps = 120</code> and <code>min_sample = 10</code> based on a trial and error technique.

In [24]:
# Calculate DBSCAN based on Haversine metric    
def HDBSCAN(df, epsilon, minPts, x='lat', y='lon'):
    
    # Find most centered sample in a cluster
    def getCenterMostPts(cluster):
        centroid = (MultiPoint(cluster.values).centroid.x, MultiPoint(cluster.values).centroid.y)
        centermost_point = min(cluster.values, key=lambda point: great_circle(point, centroid).m)
        return tuple(centermost_point)

    m_per_rad = 6371.0088 * 1000
    eps_rad = epsilon/m_per_rad
    photo_coords = df.loc[:, {x,y}]
    photo_coords = photo_coords[['lat','lon']]
    db = DBSCAN(eps=eps_rad, min_samples=minPts, algorithm='ball_tree', metric='haversine').fit(np.radians(photo_coords))
    cluster_labels = db.labels_ + 1
    num_clusters = len(set(cluster_labels))

    # Put clusters and their subset of coords in an array
    clusters = pd.Series([photo_coords[cluster_labels==n] for n in range(num_clusters)])

    # Find centroid of each cluster
    centroids = clusters.map(getCenterMostPts)
    
    # Pull rows from original data frame where row numbers match the clustered data
    rows = clusters.apply(lambda c: c.index.values)
    clustered_df = rows.apply(lambda row_num: df.loc[row_num])
    
    # Append cluster numbers and centroid coords to each clustered dataframe
    lats,lons = zip(*centroids)
    new_df = []
    for i, v in clustered_df.iteritems():
        v.loc[:, 'cluster_num'] = i
        v.loc[:, 'cent_lat'] = lats[i]
        v.loc[:, 'cent_lon'] = lons[i]
        new_df.append(v)    
    new_df = pd.concat(new_df)
    
    return new_df
    
cdata = HDBSCAN(data, epsilon=120, minPts=10)
print(f'Number of clusters: {len(cdata.cluster_num.unique())}')

Number of clusters: 195


## Cluster Analysis
The plots below illustrate all clusters and their centroids in two geographical and non-geographical views. The output of DBSCAN is a set of photo clusters $L=\{l_1,l_2,...,l_n\}$. Each member of this set is a tourist venue that can be considered as $l_i=\{P_{l_i},g_{l_i}\}$. $P_{l_i}$ is the set of all photos taken in location $l_i$, in which the geographical coordinate of the centroid is $g_{l_i}$.

In [25]:
# Convet matplotlib colormap to plotly
def matplotlibToPlotly(cmap, pl_entries):
    h = 1.0/(pl_entries-1)
    pl_colorscale = []
    
    for k in range(pl_entries):
        C = list(map(np.uint8, np.array(cmap(k*h)[:3])*255))
        pl_colorscale.append('rgb'+str((C[0], C[1], C[2])))
        
    return pl_colorscale


# Show plot
unique_labels = cdata.cluster_num.unique()
colors = matplotlibToPlotly(plt.cm.Spectral, len(unique_labels))
DB = go.Figure()
leaflet_map = folium.Map(location=[51.514205,-0.104371], zoom_start=12, tiles='Cartodb Positron')

for k,col in zip(unique_labels, colors):
    # Check if label number is 0, then create noisy points 
    if k == 0:
        col = 'gray'
        df = cdata[cdata.cluster_num == 0]
        
        DB.add_trace(go.Scatter(
            x=df.lat,
            y=df.lon,
            mode='markers',
            name='noise',
            marker=dict(size=3, color=col),
            hoverinfo='none'
        ))
        
    # Check the remaining clusters
    else:
        col = col
        df = cdata[cdata.cluster_num == k]
        lat = df.lat
        lon = df.lon
        cent_lat = df.cent_lat.unique()
        cent_lon = df.cent_lon.unique()
        
        # Bokeh plot
        DB.add_trace(go.Scatter(
            x=lat,
            y=lon,
            mode='markers',
            name='point',
            marker=dict(size=5, color=col),
            text=df.photo_id.apply(lambda id: f'photo_id: {id}'),
            hoverinfo='none',
            showlegend=False
        ))
        DB.add_trace(go.Scatter(
            x=cent_lat,
            y=cent_lon,
            mode='markers',
            name='centroid',
            text=f'cluster: {k}',
            marker=dict(
                size=12,
                color=col,
                line=dict(color='gray', width=1)
            ),
            hoverinfo='x+y+name+text'
        ))
        
        # Map plot
        folium.Marker(
            location=[cent_lat, cent_lon],
            icon=folium.Icon(icon='map-marker')
        ).add_to(leaflet_map)
        
        
DB.update_layout(
#     title='DBSCAN Based on Haversine Including Center Most Points',
    hovermode='closest',
    showlegend=False,
    xaxis=dict(title='Latitude', zeroline=False),
    yaxis=dict(title='Longitude', zeroline=False),
    template=plot_template
)

DB.show()
leaflet_map

NameError: name 'folium' is not defined

In [8]:
DB.write_image(f'./plots/dbscan_clusters.jpeg', scale=3)

In [26]:
# Remove noise cluster from the training set
clean_data = cdata[cdata.cluster_num!=0]

# Distribution plot
def chunk(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

distrib_df = clean_data.groupby(['cluster_num'])['cluster_num'].count().reset_index(name='photo_num')
chunked_distrib = enumerate(chunk(distrib_df, 50))

for i, v in chunked_distrib:
    cluster_distrib = go.Figure([go.Bar(x=v.cluster_num, y=v.photo_num)])
    cluster_distrib.update_layout(
        xaxis=dict(title='Cluster id', dtick=1),
        yaxis=dict(title='Number of images'),
        template=plot_template
    )
    cluster_distrib.show()
    cluster_distrib.write_image(f'./plots/cluster_distrib{i}.jpeg', scale=3)

ValueError: Image generation requires the psutil package.

Install using pip:
    $ pip install psutil

Install using conda:
    $ conda install psutil


## Location's Profile
At this step, we are looking for features related to each tourist location, extracted from the previous step. Each location identified in the clustering contains information such as the geographical location, id of the user who visited the venue, visiting time and visiting contexts. It is important to know that a user may take more than one photo of the place while visiting a venue. Therefore, if the duration between the time-stamps of two photos taken by a user at the same location is less than visit duration threshold (<code>threshold</code>), we can cosider that both photos belong to a same location. If not, the <code>median</code> of timestamps can be considered as the time of the visit with new contexts.
<br><br>
Next, the outcome dataframe, <code>POI</code>, is exported in order to extract contextual features based on photos' taken times, then it will be imported again.

In [27]:
# Find most frequent string in array
def mostFreqStr(array):
    array = [i for i in array if str(i) != 'nan']
    if len(array) != 0:
        counts = np.unique(array, return_counts=True)[1]
        max_index = np.argmax(counts)
        freq_bin = array[max_index]
        return freq_bin
    else:
        return np.nan

# Find median of array included Timestamps
def medTimestamps(array):
    if len(array) == 1:
        return array[0]
    else:
        if len(array) % 2 == 0:
            delta = array[int(len(array)/2)] - array[int(len(array)/2-1)]
            median = pd.Timestamp(array[int(len(array)/2-1)] + delta)
        else:
            time = pd.Timestamp(array[int(len(array)/2)]).time()
            ser = pd.Series(array)
            date = pd.Timestamp.fromordinal(int(ser.apply(lambda x: pd.to_datetime(x).toordinal()).median(skipna=True))).date()
            median = pd.Timestamp.combine(date,time)     
        return median

# Create database of locations
POI = pd.DataFrame(columns=['location_id', 'user_id', 'lat', 'lon', 'visit_time'])
threshold = np.timedelta64(6, 'h')

for i,g in clean_data.groupby(by='cluster_num'):
    l = {}
    l['location_id'] = randint(100000,999999)
    l['lat'] = g.cent_lat.unique()[0]
    l['lon'] = g.cent_lon.unique()[0]
    
    for u in g.owner.unique():
        l['user_id'] = u
        taken = g.loc[g.owner == u, 'taken'].sort_values()
        t_indices = taken.keys()
        t_values = taken.values
        visit_times = []
        
        if len(t_values) == 1:
            l['visit_time'] = pd.Timestamp(t_values[0])
            POI = POI.append(l, ignore_index=True)
        
        else:
            for t in range(1, len(t_values)):
                if t_values[t]-t_values[t-1] < threshold:
                    visit_times.append(t_values[t-1])
                else:
                    visit_times.append(t_values[t-1])
                    l['visit_time'] = medTimestamps(visit_times)
                    POI = POI.append(l, ignore_index=True)
                    visit_times = []

display(POI.head(10))

Unnamed: 0,location_id,user_id,lat,lon,visit_time
0,290360,41087279@N00,51.523766,-0.076417,2014-02-09 23:05:35
1,290360,41087279@N00,51.523766,-0.076417,2014-03-14 22:24:02
2,290360,41087279@N00,51.523766,-0.076417,2014-03-21 23:18:39
3,290360,41087279@N00,51.523766,-0.076417,2014-05-11 12:48:26
4,290360,41087279@N00,51.523766,-0.076417,2014-06-16 15:50:27
5,290360,41087279@N00,51.523766,-0.076417,2014-08-24 20:13:12
6,290360,41087279@N00,51.523766,-0.076417,2014-08-30 01:44:15
7,290360,41087279@N00,51.523766,-0.076417,2014-09-24 17:00:31
8,290360,41087279@N00,51.523766,-0.076417,2014-10-15 10:59:42
9,290360,41087279@N00,51.523766,-0.076417,2014-10-18 02:36:48


<span id="model"></span>
# Recommendation Model
---
The <code>POI</code> dataset including contextual factors is imported. This dataset is assumed as <code>LPD</code>, Location Profile Dataframe. For making recommendation model, only tourists who have visited at least 4 distinct locations were selected. Data then is split into training data and test data. 

In [28]:
# Path of file to read
# prefiltered_file_path = f'./prefiltered.csv'

# Change data types
# data_type = {
#     'faves': 'float16',
#     'lat': 'float32',
#     'lon': 'float32',
#     'visit_time': 'datetime64'
# }

# Read csv file and convert it to a Multiindex
# LPD = pd.read_csv(prefiltered_file_path, engine='python', sep=',', encoding='utf-8', dtype=data_type, decimal=',')
LPD = POI.set_index(keys=['user_id', 'location_id'])
display(LPD.head(10))

# Split dataset
visit_twice = LPD.groupby(level=[0,1])['visit_time'].count()
visit_twice = visit_twice[visit_twice>3]
mask = LPD.index.isin(visit_twice.index) == True
X = LPD[mask]
y = X.index.get_level_values(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=70)

Unnamed: 0_level_0,Unnamed: 1_level_0,lat,lon,visit_time
user_id,location_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
41087279@N00,290360,51.523766,-0.076417,2014-02-09 23:05:35
41087279@N00,290360,51.523766,-0.076417,2014-03-14 22:24:02
41087279@N00,290360,51.523766,-0.076417,2014-03-21 23:18:39
41087279@N00,290360,51.523766,-0.076417,2014-05-11 12:48:26
41087279@N00,290360,51.523766,-0.076417,2014-06-16 15:50:27
41087279@N00,290360,51.523766,-0.076417,2014-08-24 20:13:12
41087279@N00,290360,51.523766,-0.076417,2014-08-30 01:44:15
41087279@N00,290360,51.523766,-0.076417,2014-09-24 17:00:31
41087279@N00,290360,51.523766,-0.076417,2014-10-15 10:59:42
41087279@N00,290360,51.523766,-0.076417,2014-10-18 02:36:48


## Creating User-Location Matrix
To create rating matrix, we need to know the number of times when each user has visited different venues. Ratings is also normilized by the min-max normalization method.

In [29]:
# Find ratings
train_rating = X_train.groupby(['location_id','user_id'])['visit_time'].count().reset_index(name='rating')
train_rating.head(10)

Unnamed: 0,location_id,user_id,rating
0,148526,107026173@N05,3
1,148526,110405086@N05,3
2,148526,11200205@N02,6
3,148526,115168634@N05,3
4,148526,124483065@N03,3
5,148526,133876835@N08,3
6,148526,136315829@N03,3
7,148526,13816725@N04,4
8,148526,141330063@N03,4
9,148526,141443760@N06,4


In [30]:
def normalize(df):
    # Normalize number of visit into a range of 1 to 5
    df['rating'] = minmax_scale(df.rating, feature_range=[1,5])
    return df

r_df = normalize(train_rating)

# Create a rating matrix
r_df = train_rating.pivot_table(
    index='user_id', 
    columns='location_id', 
    values='rating', 
    fill_value=0
)
    
# Calculate the sparcity percentage of matrix
def calSparcity(m):
    m = m.fillna(0)
    non_zeros = np.count_nonzero(m)/np.prod(m.shape) * 100
    sparcity = 100 - non_zeros
    print(f'The sparcity percentage of matrix is %{round(sparcity,2)}')

display(r_df.head())
calSparcity(r_df)

location_id,148526,175122,290360,322960,340215,390609,413307,428653,446461,484510,...,568004,585726,587892,672362,675139,680638,742516,877042,974912,992183
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
107026173@N05,1.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
110405086@N05,1.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11200205@N02,1.363636,0.0,0,0.0,0.0,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
115168634@N05,1.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
124483065@N03,1.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The sparcity percentage of matrix is %95.0


In the next step, the user similarity should be calculated. According to the asymmetric similarity concept:
> Most of the traditional similarity metrics assign equal value for the similarity relation between two users, This means, these methods are based on the assumption that $sim(u,v) = sim(v,u)$. Traditional methods cannot differentiate between these two users. However based on asymmetric similarity user $u$ is similar to user $v$, but not vice versa [[1]](http://dx.doi.org/10.1016/j.knosys.2015.03.006).

The user's similarity based on an asymmetric cosine similarity proposed by [Pirasteh et al [1]](http://dx.doi.org/10.1016/j.knosys.2015.03.006) is as follow:
<br><br>
\begin{equation}
ACOS(u,v)=\frac{\overrightarrow{r_1}.\overrightarrow{r_2}}{||\overrightarrow{r_1}||.||\overrightarrow{r_2}||}.\frac{|u \cap v|}{|u|}.\frac{2*|u \cap v|}{|u| + |v|}
\end{equation}
<br>
In case of having same numbers of items/venues rated by both users, the $ACOS(u_1,u_2)$ will be equal to $ACOS(u_2,u_1)$. So to address this problem a weighted user's influence coefficient is suggested.<br><br>
\begin{equation}
\bar{r_{u}} = \frac{\sum_{j=1}^{n}r_{i,j}}{n_i}
\end{equation}
<br>
\begin{equation}
r_{u}^\prime = 
    \begin{cases}
    1, \qquad if \quad r_u \geq \bar{r_{u}} \\
    0, \qquad otherwise
    \end{cases}
\end{equation}
<br>
\begin{equation}
W_{u,v}^\prime = \frac{\sum_{i=1}^{n} r_{u,i}^\prime \times r_{v,i}^\prime}{\sum_{i=1}^{n} r_{v,i}^\prime}
\end{equation}
<br><br>
We propose an extended version of $ACOS(u,v)$:
<br><br>
\begin{equation}
ACOS(u,v)=\frac{\overrightarrow{r_u}.\overrightarrow{r_v}}{||\overrightarrow{r_u}||.||\overrightarrow{r_v}||}.\frac{|u \cap v|}{|u|}.\frac{2*|u \cap v|}{|u| + |v|}.\frac{\sum_{i=1}^{n} r_{u,i}^\prime \times r_{v,i}^\prime}{\sum_{i=1}^{n} r_{v,i}^\prime}
\end{equation}
<br><br>

### Creating User-User Similarity Matrix
Now let's create the user-user similarity matrix.

In [31]:
# Create user-user similarity matrix
def improved_asym_cosine(m, mf=False,**kwarg):
    # Cosine similarity matrix distance
    cosine = cosine_similarity(m)

    # Asymmetric coefficient
    def asymCo(X,Y):
        co_rated_item = np.intersect1d(np.nonzero(X),np.nonzero(Y)).size
        coeff = co_rated_item / np.count_nonzero(X)
        return coeff
    asym_ind = pairwise_distances(m, metric=asymCo)

    # Sorensen similarity matrix distance
    sorensen = 1 - pairwise_distances(np.array(m, dtype=bool), metric='dice')

    # User influence coefficient
    def usrInfCo(m):
        binary = m.transform(lambda x: x >= x[x!=0].mean(), axis=1)*1
        res = pairwise_distances(binary, metric=lambda x,y: (x*y).sum()/y.sum() if y.sum()!=0 else 0)
        return res       
    usr_inf_ind = usrInfCo(m)

    similarity_matrix = np.multiply(np.multiply(cosine,asym_ind),np.multiply(sorensen,usr_inf_ind))

    usim = pd.DataFrame(similarity_matrix, m.index, m.index)
    
    # Check if matrix factorization was True
    if mf:
        # Binary similarity matrix
        binary = np.invert(usim.values.astype(bool))*1
        model = NMF(**kwarg)
        W = model.fit_transform(usim)
        H = model.components_
        factorized_usim = np.dot(W,H)*binary + usim
        usim = pd.DataFrame(factorized_usim, m.index, m.index)
                
    return usim

s_df = improved_asym_cosine(r_df)
display(s_df.head())
calSparcity(s_df)

user_id,107026173@N05,110405086@N05,11200205@N02,115168634@N05,124483065@N03,133876835@N08,136315829@N03,13816725@N04,141330063@N03,141443760@N06,...,7344912@N05,74925381@N03,75374243@N00,79986881@N00,81065266@N00,87402959@N02,89313125@N00,8952616@N07,95419715@N08,95665996@N03
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
107026173@N05,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0
110405086@N05,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0
11200205@N02,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0
115168634@N05,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0
124483065@N03,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0


The sparcity percentage of matrix is %50.44


### Creating Context-Location Matrix
The matrix below shows the number of visits from venues in various contexts. For example, a venue has been visited 3 times in the context <code>(1,1,9)</code> which [according to the discription of dataset](http://www.kaggle.com/amiralisa/flickr_london#readme.txt), means this context is equal to:

* season: spring
* daytime: partly-cloudy-day
* weather: day
<br><br>
We use a method based on TF-IDF to find the visit probability of locations [[2]](https://www.researchgate.net/publication/309541764_Context-Aware_Location_Recommendation_Using_Geotagged_Photos_in_Social_Media).
<br><br>
> We use the term frequency-inverse document frequency (TF-IDF) measure to compute the usage of a location in a specific situation $w_l^c$. TF-IDF is used in the field of information retrieval to measure how important a word is to a document in a collection or corpus. It increases proportionally with the number of times a word appears in the document, is offset by the frequency of the word in the corpus [[2]](https://www.researchgate.net/publication/309541764_Context-Aware_Location_Recommendation_Using_Geotagged_Photos_in_Social_Media).
<br><br>

\begin{equation}
w_l^c = TF_l \times IDF_l = \frac{N_{c,l}}{N_{c,\oslash}} \times \log\frac{N_{\oslash,\oslash}}{N_{\oslash,l}}
\end{equation}
<br><br>
$N_{c,l}$ is the number of visits in context $c$ that visited location $l$. $N_{c,\oslash}$ shows the number of visits to all locations in the context $c$. $N_{\oslash,\oslash}$ is the total number of visits to all locations, and $N_{\oslash,l}$ represents the total number of visits to the location $l$.

In [15]:
# Find probability of contexts
contexts = X_train.filter(['season','daytime','weather']).apply(lambda x: (x.season,x.daytime,x.weather), axis=1).reset_index(name='context')
IF = contexts.groupby(['location_id','context'])['context'].count()/contexts.groupby(['context'])['context'].count()
IDF = np.log10(contexts.groupby(['location_id','user_id'])['user_id'].count().sum()/contexts.groupby(['location_id'])['user_id'].count())
contexts_weight = (IF * IDF).to_frame().rename(columns={0: 'weight'})

# Create a context-location matrix
lc_df = contexts_weight.pivot_table(
    index='context', 
    columns='location_id', 
    values='weight',
    fill_value=0
)


display(lc_df.head())
calSparcity(lc_df)

location_id,140662,165202,185565,186890,220206,299823,330684,335960,422774,505088,...,594032,594659,606447,615127,675366,721771,767961,784986,934575,950630
context,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"(1, 1, 1)",0.0,0.0,0.113734,0.0,0.0,0.146529,0.164374,0.300584,0.0,0.0,...,0.0,0.0,0.164374,0.0,0.0,0.0,0.0,0.0,0.258684,0.0
"(1, 1, 3)",0.139086,0.0,0.168414,0.0,0.0,0.061993,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.170437,0.057965,0.0,0.055734,0.164165,0.0
"(1, 1, 7)",0.0,0.0,0.208512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.474255,0.0
"(1, 1, 8)",0.0,0.0,0.156384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.711382,0.0
"(1, 1, 9)",0.038471,0.0,0.146402,0.0,0.0,0.102882,0.076941,0.035175,0.0,0.093354,...,0.0,0.0,0.0,0.0,0.031428,0.0,0.0,0.061664,0.121086,0.04158


The sparcity percentage of matrix is %82.71


### Creating Context-Context Matrix

In [16]:
cs_df = pd.DataFrame(cosine_similarity(lc_df), index=lc_df.index, columns=lc_df.index)
display(cs_df.head())
calSparcity(cs_df)

context,"(1, 1, 1)","(1, 1, 3)","(1, 1, 7)","(1, 1, 8)","(1, 1, 9)","(1, 2, 1)","(1, 2, 3)","(1, 2, 7)","(1, 2, 9)","(1, 3, 1)",...,"(4, 1, 1)","(4, 1, 3)","(4, 1, 7)","(4, 1, 8)","(4, 1, 9)","(4, 2, 3)","(4, 2, 7)","(4, 2, 9)","(4, 3, 1)","(4, 3, 3)"
context,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"(1, 1, 1)",1.0,0.41917,0.570075,0.558956,0.57961,0.229444,0.186359,0.229444,0.26618,0.0,...,0.072998,0.383062,0.034587,0.0,0.48168,0.494282,0.032384,0.115255,0.0,0.0
"(1, 1, 3)",0.41917,1.0,0.640823,0.577439,0.694518,0.494915,0.684546,0.494915,0.547822,0.0,...,0.288025,0.464719,0.074604,0.0,0.461028,0.330401,0.069853,0.248106,0.0,0.163786
"(1, 1, 7)",0.570075,0.640823,1.0,0.980494,0.565397,0.40248,0.326903,0.40248,0.130567,0.0,...,0.128049,0.204247,0.060671,0.0,0.209789,0.228537,0.056806,0.046803,0.0,0.0
"(1, 1, 8)",0.558956,0.577439,0.980494,1.0,0.498543,0.214705,0.300153,0.214705,0.069651,0.0,...,0.068308,0.108957,0.032365,0.0,0.111913,0.121914,0.030304,0.024967,0.0,0.0
"(1, 1, 9)",0.57961,0.694518,0.565397,0.498543,1.0,0.487574,0.408944,0.487574,0.579033,0.0,...,0.328573,0.488065,0.073498,0.310904,0.802233,0.656741,0.376608,0.370623,0.0,0.205363


The sparcity percentage of matrix is %24.48


<span id="finalrecom"></span>
# Final Recommendation
---
Based on the tourist location profiles and user similarities, the locations which have not been rated by the user could be predicted by applying user-based collaborative filtering. A post-filtering approach is used to adjust predicted ratings according to contextual information.

We use the user-based collaborative filtering to predict the initial ratings.

In [17]:
def CF(user_id, location_id, s_matrix):
    r = np.array(r_df)
    s = np.array(s_matrix)
    users = r_df.index
    locations = r_df.columns
    l = np.where(locations==location_id)[0]
    u_idx = np.where(users==user_id)[0]
        
    # Means of all users
    means = np.array([np.mean(row[row!=0]) for row in r])
    
    # Check if l is in r_rating
    if location_id in r_df:
        # Find similar users rated the location that target user hasn't visited
        idx = np.nonzero(r[:,l])[0]
        sim_scores = s[u_idx,idx].flatten()
        sim_users = zip(idx,sim_scores)
    
        # Check if there is any similar user to target user
        if idx.any():
            sim_ratings = r[idx,l]
            sim_means = means[idx]
            numerator = (sim_scores * (sim_ratings - sim_means)).sum()
            denominator = np.absolute(sim_scores).sum()
            weight = (numerator/denominator) if denominator!=0 else 0
            wmean = means[u_idx] + weight
            wmean_rating = wmean[0]
            
    else:
        wmean_rating = 0

    return wmean_rating

The visit probability of each candidate location is calculated in the current contexts, such that the probability of visiting the location $i$ equals the fraction of the users who visited the location $i$ in contexts similar to the target user, and the similarity of the current contexts with the visiting context of the neighbors is larger than a threshold like <code>delta</code>.

After calculating the visiting probability, the final rating associated with each candidate location is obtained by:
<br><br>
\begin{equation}
score(u_a,i) = (collaborative\ filtering\ rate) \times (visit\ probability)
\end{equation}

In [33]:
# Collaborative filtering with post-filtered contexts
def CaCF_Post(user_id, location_id, s_matrix, c_current, delta):
    
    # Calculate cf
    initial_pred = CF(user_id, location_id, s_matrix)
    
    if location_id in r_df:
        r = np.array(r_df)
        users = r_df.index
        locations = r_df.columns
        l = np.where(locations==location_id)[0]
        c_profile = contexts
        all_cnx = contexts.context.unique().tolist()
        c = np.array(c_profile)
        u_idx = np.where(users==user_id)[0]
        c_current = tuple(c_current)
        
        # Get contexts of similar users visited the location
        l_cnx = np.array(c_profile.loc[c_profile.location_id==location_id,['user_id','context']])
                
        if c_current in all_cnx:
            # Find similarity of the current context to location contexts
            cnx_scores = np.array([[uid, cs_df[c_current][cx]] for uid,cx in l_cnx])

            # Filter users whose similarity bigger than delta
            filtered_scores = cnx_scores[cnx_scores[:,1].astype(float)>delta]

            # Location popularity based on current context
            visit_prob = len(filtered_scores) / len(cnx_scores)
            
        else:
            visit_prob = 1

        return initial_pred * visit_prob

    else:
        return initial_pred

In [34]:
# Find ratings
test_rating = X_test.groupby(['location_id','user_id'])['visit_time'].count().reset_index(name='rating')
test_rating = normalize(test_rating)
r_df_test = test_rating.pivot_table(index='user_id', columns='location_id', values='rating', fill_value=0)

# Proposed approach
def EACOS_CaCF_Post(user_id, location_id, c_current, delta):
    res = CaCF_Post(user_id, location_id, s_df, c_current, delta)
    return res

# Recommendation
def predict(target_user, model, option=None):
    true = r_df_test.loc[target_user]
    
    # Check if model is context-aware 
    if option:
        pred_val = []
        for l in true.index:
            delta = option.get('delta')
            c_current = tuple(X_test.xs(target_user)[['season','daytime','weather']].head(1).values[0])
            r = model(user_id=target_user, location_id=l, c_current=c_current, delta=delta)
            pred_val.append(r)
    else:
        pred_val = [model(user_id=target_user, location_id=l) for l in true.index]

    pred = pd.Series(pred_val, index=true.index)

    return pred

As an example, let's see the result of rating prediction related to a user visited different locations in London city. Top-10 locations with the highest ratings would be a list of recommendations.

In [19]:
user = '41087279@N00'
options = {
    'delta': .3
}

def item_relevancy(col):
    relevant = 1
    r_color = 'background-color: lime'
    nr_color = 'background-color: red'
    res = []
    for v in col:
        if v > relevant:
            res.append(r_color)
        elif (v > 0) & (v <= relevant):
            res.append(nr_color)
        else:
            res.append('')
    return res
    
true = r_df_test.loc[user]
pred = predict(user, EACOS_CaCF_Post, option=options)

with pd.option_context("display.max_rows", None):
    prediction = pd.DataFrame({'true': true, 'pred': pred})
    display(prediction.style.apply(lambda col: item_relevancy(col)))

Unnamed: 0_level_0,true,pred
location_id,Unnamed: 1_level_1,Unnamed: 2_level_1
140662,0.0,2.93506
165202,0.0,0.0
185565,1.44444,1.19856
220206,0.0,2.28283
299823,0.0,1.55647
330684,0.0,1.95671
335960,0.0,1.3697
422774,0.0,0.0
505088,5.0,4.58333
519537,0.0,2.09259


In [20]:
# Top 10 recommendations
top_10 = prediction.nlargest(10, 'pred')
top_10.style.apply(lambda col: item_relevancy(col))

Unnamed: 0_level_0,true,pred
location_id,Unnamed: 1_level_1,Unnamed: 2_level_1
505088,5,4.58333
594659,0,3.42424
767961,0,3.42424
615127,0,3.22282
140662,0,2.93506
934575,0,2.81996
220206,0,2.28283
675366,0,2.28283
519537,0,2.09259
330684,0,1.95671


### <span id="eval"></span>
# Evaluation
---
In the final step, we evaluate the proposed method based on common evaluation metrics in recommendation systems, MAP and RMSE. We also compare the performance of proposed model against some other recommendation methods.

In [35]:
def rmse(true, pred):
    return np.sqrt(mean_squared_error(true, pred))

def mean_average_precision(true, pred, k=10):
    relevant = 1
    sort_rates = lambda s: s.sort_values(ascending=False)
    true = [r[1].where(r[1]>relevant).dropna().index.tolist() for r in true.iterrows()]
    pred = [sort_rates(r[1].where(r[1]>relevant).dropna()).index.tolist() for r in pred.iterrows()]
    map_score = mapk(true, pred, k)
    return map_score

In [22]:
def predict_all(model, option=None):
    users = r_df_test.index
    locations = r_df_test.columns
    pred = np.zeros(r_df_test.shape)
    
    for i in range(0,len(users)):
        uid = users[i]
        for j in range(0,len(locations)):
            lid = locations[j]
            # Check if model is context-aware 
            if option:
                delta = option.get('delta')
                c_current = X_test.xs(uid)[['season','daytime','weather']].head(1).values[0]
                pred[i,j] = model(user_id=uid, location_id=lid, c_current=c_current, delta=delta)
            else:
                pred[i,j] = model(user_id=uid, location_id=lid)
                        
    return pd.DataFrame(pred, index=users, columns=locations)

In [23]:
deltas = np.arange(0.1, 1, 0.1)
eval_scores = []

for d in deltas:
    options['delta'] = d
    pred = predict_all(EACOS_CaCF_Post, option=options)
    precision = mean_average_precision(r_df_test,pred)
    eval_scores.append(precision)
    
d_eval = pd.DataFrame(eval_scores, index=deltas, columns=['precision'])

# Delta influence on the prediction and racall
d_precision = go.Figure([go.Scatter(
    name='MAP', 
    x=d_eval.index, 
    y=d_eval.precision, 
    text=d_eval.precision,
    line_shape='spline'
)])

d_precision.update_layout(
    title='The impact of similarity threshold on the recommendation quality',
    xaxis=dict(title='Threshold of context similarity (\u03B4)', autorange='reversed'), 
    yaxis=dict(title='MAP'),
    template=plot_template
)

d_precision.show()

In [36]:
d_precision.write_image(f'./plots/d_precision.jpeg', scale=3)

NameError: name 'd_precision' is not defined

Five different models have been selected from previous studies to compare with the proposed model. They are classified into non-contextual and context- aware categories.

* <strong>Collaborative filtering using asymmetric cosine similarity (ACOS):</strong><br>This method uses an asymmetric cosine similarity measure to find similarities among users, then by collaborative filtering ratings can be predicted [[1]](http://dx.doi.org/10.1016/j.knosys.2015.03.006).
* <strong>Collaborative filtering using asymmetric cosine similarity and matrix factorization (MF_ACOS):</strong><br>The difference between this model and the cosine similarity approach is to eliminate the sparsity of the similarity matrix by using the matrix factorization [[1]](http://dx.doi.org/10.1016/j.knosys.2015.03.006).
* <strong>Popularity Ranking (PR):</strong><br>The basic idea of this method is to rank tourist locations based on the popularity of each location [[3]](https://www.sciencedirect.com/science/article/abs/pii/S0169023X14000962).
* <strong>Context-aware significant tourist locations recommendations (CSR):</strong><br>The basis of this model is to predict the ranking of the tourist location based on the context of the target user. This method uses the likelihood of visiting the destination exactly in the context of the target user to filter out the tourist destinations [[3]](https://www.sciencedirect.com/science/article/abs/pii/S0169023X14000962).
* <strong>Context-aware collaborative filtering using Sorensen-Dice coef- ficient (Sorensen CaCF Post):</strong><br>This method uses the Sorensen Dice coefficient to find similarity among users. It also uses a probability ratio to find the visit probability of locations similar to the target user’s context, which is applied to the collaborative filtering results as a post-filtering [[2]](https://www.researchgate.net/publication/309541764_Context-Aware_Location_Recommendation_Using_Geotagged_Photos_in_Social_Media).

In [37]:
## Non context-aware methodologies with asymetric similarity measure
# Asymmetric cosine similarity
def asymmetric_cosine(m, mf=False, **kwarg):
    # Cosine similarity matrix distance
    cosine = cosine_similarity(m)
    # Asymmetric coefficient
    def asymCo(X,Y):
        co_rated_item = np.intersect1d(np.nonzero(X),np.nonzero(Y)).size
        coeff = co_rated_item / np.count_nonzero(X)
        return coeff
    asym_ind = pairwise_distances(m, metric=asymCo)
    # Sorensen similarity matrix distance
    sorensen = 1 - pairwise_distances(np.array(m, dtype=bool), metric='dice')
    # Final similarity matrix
    usim = np.multiply(np.multiply(cosine,asym_ind),sorensen)
    # Check if matrix factorization was True
    if mf:
        binary = np.invert(usim.astype(bool))*1
        model = NMF(**kwarg)
        W = model.fit_transform(usim)
        H = model.components_
        factorized_usim = np.dot(W,H)*binary + usim
        usim = factorized_usim
            
    return pd.DataFrame(usim, index=m.index, columns=m.index)

# Calculate user similarities
asym_cos = asymmetric_cosine(r_df)
mf_asym_cos = asymmetric_cosine(r_df, mf=True, solver='mu')

# Methods
def ACOS(user_id, location_id):
    res = CF(user_id, location_id, asym_cos)
    return res

def MF_ACOS(user_id, location_id):
    res = CF(user_id, location_id, mf_asym_cos)
    return res

In [27]:
## Context-aware methodologies symmetric similarity measure
# Similarity measure based on location popularity
def loc_pop_sim(df, dist_method='correlation'):
    df = df.reset_index()
    # Calculate location pop
    loc_idf = np.log10(df.groupby('location_id')['user_id'].count().sum()
                    /df.groupby('location_id')['user_id'].count()
                   ).reset_index(name='idf_score')
    loc_idf = df.merge(loc_idf)
    
    # Create location popularity matrix
    r_df = loc_idf.pivot_table(
        index='user_id', 
        columns='location_id', 
        values='idf_score', 
        fill_value=0
    )
    
    # Calculate user similarities
    if dist_method == 'dice':
        dist = 1 - pairwise_distances(r_df.values, metric=dist_method)
    else:
        dist = pairwise_distances(r_df.values, metric=dist_method)
    return pd.DataFrame(dist, r_df.index, r_df.index)

# Calculate user similarities
sym_locpop_pearson = loc_pop_sim(X_train)
sym_locpop_sorensen = loc_pop_sim(X_train, dist_method='dice')

# Methods
def PR(user_id, location_id):
    res = CF(user_id, location_id, sym_locpop_pearson)
    return res

def CSR(user_id, location_id, c_current, delta):
    initial_pred = CF(user_id, location_id, sym_locpop_pearson)
    if location_id in r_df:
        r = np.array(r_df)
        users = r_df.index
        locations = r_df.columns
        l = np.where(locations==location_id)[0]
        c_profile = contexts
        c = np.array(c_profile)
        u_idx = np.where(users==user_id)[0]
        c_current = tuple(c_current)

        # Find users who visit the location in the current context 
        exact_match = contexts[(contexts.location_id==location_id)&(contexts.context==c_current)].user_id.unique()
        
        if exact_match.size != 0:
            idx = np.where(users.isin(exact_match))

            # Calculate visit probability in exact-match context
            visit_match_prob = r[idx,l].sum() / r[:,l].sum()

            # Calculate visit probability of location
            visit_loc_prob = r[:,l].sum() / r.sum()

            # Calculate visit probability in current context
            visit_cnx_prob = contexts[contexts.context==c_current].location_id.count()/r.sum()

            visit_prob = (visit_loc_prob * visit_match_prob) / visit_cnx_prob
        
            return initial_pred * visit_prob
        
        else:
            return initial_pred
    
    else:
        return initial_pred

def Sorensen_CaCF_Post(user_id, location_id, c_current, delta):
    res = CaCF_Post(user_id, location_id, sym_locpop_sorensen, c_current, delta=.3)
    return res

In [28]:
models = [PR, ACOS, MF_ACOS, CSR, Sorensen_CaCF_Post, EACOS_CaCF_Post]
k_range = [5,10,15,20]
eval_scores = {}
true = r_df_test
options['delta'] = .3

for model in models:
    option = None if model.__name__ in ['ACOS', 'PR', 'MF_ACOS'] else options
    val = []
    for k in k_range:
        pred = predict_all(model, option)
        mapk_score = mean_average_precision(true, pred, k)
        val.append(mapk_score)
        
    eval_scores[model.__name__] = val
    
map_at_k = pd.DataFrame(eval_scores, index=k_range)

mapk_comp = go.Figure()

for model, ser in map_at_k.iteritems():
    mapk_comp.add_trace(go.Bar(
        name=model,
        x=ser.index,
        y=ser.values,
        width=.6
    ))
    
mapk_comp.update_layout(
    barmode='group',
    title='Comparision of the proposed method with the benchmarking methods (MAP@k)',
    xaxis=dict(title='NUmber of recommendations'),
    yaxis=dict(title='MAP@k', range=[.7,.9]),
    template=plot_template
)

mapk_comp.show()

Unnamed: 0,model,precision
0,PR,0.35
1,CSR,0.5
2,MF_ACOS,0.5
3,Sorensen_CaCF_Post,0.611111
4,EACOS_CaCF_Post,0.5


In [29]:
k_range = [5,10,15,20]
eval_scores = {}
true = r_df_test
options['delta'] = .3

for model in models:
    option = None if model.__name__ in ['COS', 'ACOS', 'PR', 'MF_ACOS'] else options
    val = []
    for k in k_range:
        pred = predict_all(model, option)
        mapk_score = mean_average_precision(true, pred, k)
        val.append(mapk_score)
        
    eval_scores[model.__name__] = val
    
map_at_k = pd.DataFrame(eval_scores, index=k_range)

mapk_comp = go.Figure()

for model, ser in map_at_k.iteritems():
    mapk_comp.add_trace(go.Bar(
        name=model,
        x=ser.index,
        y=ser.values,
        width=.6
    ))
    
mapk_comp.update_layout(
    barmode='group',
#     title='Performance comparision in terms of MAP@k',
    xaxis=dict(title='Number of recommendations'),
    yaxis=dict(title='MAP@k', range=[.7,.9]),
    template=plot_template
)

mapk_comp.show()

In [30]:
mapk_comp.write_image(f'./plots/mapk_comp.jpeg', scale=3)

In [31]:
rmse_eval = []

for model in models:
    option = None if model.__name__ in ['ACOS', 'PR', 'MF_ACOS'] else options
    pred = predict_all(model, option)
    rmse_score = rmse(true, pred)
    rmse_eval.append([model.__name__, rmse_score])
    
rmse_perf = pd.DataFrame(rmse_eval, columns=['model','value'])

rmse_comp = go.Figure([go.Bar(
    x=rmse_perf.model, 
    y=rmse_perf.value,
    width=.5,
    text=round(rmse_perf.value,2),
    textposition='outside', 
    marker=dict(color=rmse_perf.index, colorscale='Viridis')
)])

rmse_comp.update_layout(
    barmode='group',
    title='Comparision of the proposed method with the benchmarking methods (RMSE)',
    yaxis=dict(title='RMSE'),
    template=plot_template
)

rmse_comp.show()

CSR


In [32]:
rmse_comp.write_image(f'./plots/rmse_comp.jpeg', scale=3)

# Conclusion
In this kernel, we create a tourism recommendation system based on contexts and geo-tagged photos. Our hybrid approach first looks for similarity among users using an asymmetric similarity metric, and then uses collaborative filtering to predict the item ratings. The proposed method ultimately uses a context-aware post-filtering approach to determine the final recommendations. The system is able to understand various contextual conditions such as location, time of visit, day/night, season and weather conditions of the venue at the time of visit.


References:
1. [P. Pirasteh, D. Hwang, and J. J. Jung, “Exploiting matrix factorization to asymmetric user similarities in recommendation systems,” Knowledge-Based Systems, vol. 83, pp. 51- 57, 2015.](http://dx.doi.org/10.1016/j.knosys.2015.03.006)
2. [H.Huang,“Context-Aware Location Recommendation Using Geotagged Photos in Social Media,” ISPRS International Journal of Geo-Information, vol. 5, no. 11, p. 195, 2016.](https://www.researchgate.net/publication/309541764_Context-Aware_Location_Recommendation_Using_Geotagged_Photos_in_Social_Media)
3. [A. Majid, L. Chen, H. T. Mirza, I. Hussain, and G. Chen, “A system for mining in- teresting tourist locations and travel sequences from public geo-tagged photos,” Data & Knowledge Engineering, vol. 95, pp. 66-86, 2015.](https://www.sciencedirect.com/science/article/abs/pii/S0169023X14000962)