Reproduce the Results of the IJCAI'15 Paper
===========================================

<a id='toc'></a>
1. [Dataset](#dataset)
1. [POI Geo-coordinates](#poi-coord)
1. [Recommend Itinerary](#recommendation)
1. [Issues](#issue)
1. [Results](#result)
 * [POI Popularity](#result1)
 * [Time-based User Interest](#result2)
 * [Frequency-based User Interest](#result3)
 * [Precision, Recall and F1-score](#result4)
 * [Transition Matrix for POI categories](#result5)
1. [Tour Enumeration](#enum)
1. [Transition Matrix for Time at POI](#transtime)

**NOTE: Please view this page via [IPython Notebook Viewer Service](http://nbviewer.ipython.org/), otherwise the within-page links may not work properly.**
<a id='dataset'></a>

1. Dataset [&#8648;](#toc)
-------------------
The dataset used in this paper can be downloaded [here](https://sites.google.com/site/limkwanhui/datacode#ijcai15), together with a description and some statistics.
However, one critical piece of information is missing from this dataset: the geo-location of each Point of Interest (POI).
Without it, the travel time from one POI to another cannot be calculated unless the longitude and latitude of each POI are obtained by other means.
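
Once POI coordinates are known, the travel time between two POIs can be estimated from their great-circle distance. The following is a minimal sketch, not the paper's code: the function names and the default speed of 4 km/h are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # clamp to 1.0 to guard against floating-point rounding
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def travel_time_hours(poi_a, poi_b, speed_kmh=4.0):
    """Travel time between two POIs given as (longitude, latitude) tuples.

    speed_kmh is an assumed travelling speed, not a value taken from the dataset.
    """
    return haversine_km(*poi_a, *poi_b) / speed_kmh
```

For example, `travel_time_hours((lon_a, lat_a), (lon_b, lat_b))` returns the estimated time in hours for a straight-line trip between the two POIs.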

Simple statistics of this dataset
<table>
<tr>
<td><b>City</b></td>
<td><b>#POIs</b></td>
<td><b>#Users</b></td>
<td><b>#POI_Visits</b></td>
<td><b>#Travel_Sequences</b></td></tr>
<tr><td>Toronto</td><td>29</td><td>1,395</td><td>39,419</td><td>6,057</td></tr>
<tr><td>Osaka</td><td>27</td><td>450</td><td>7,747</td><td>1,115</td></tr>
<tr><td>Glasgow</td><td>27</td><td>601</td><td>11,434</td><td>2,227</td></tr>
<tr><td>Edinburgh</td><td>28</td><td>1,454</td><td>33,944</td><td>5,028</td></tr>
</table>

*NOTE: the number of photos for each city described in the paper is NOT available in this dataset*

Fortunately, this information can be retrieved from the original YFCC100M dataset by searching for the individual photo IDs included in this dataset. The YFCC100M dataset can be downloaded from [here](http://www.referitgame.com/vicente/flickr100M/) (with some patience).

<a id='poi-coord'></a>

2. POI Geo-coordinates [&#8648;](#toc)
----------------------------------------
For each photo used in this paper, its POI ID is available in the [dataset](https://sites.google.com/site/limkwanhui/datacode#ijcai15). The longitude and latitude of each POI can therefore be approximated by the mean of the coordinates of all photos taken at that POI, which can be retrieved from the YFCC100M dataset.

To accelerate the search, first extract the photo id, longitude and latitude columns from the whole dataset

In [None]:
cut -d $'\t' -f1,11,12 yfcc100m_dataset >> dataset.yfcc

then search for the coordinates by grepping for each photo id

In [None]:
cut -d ';' -f 1,4 uservisit.txt |while read line
do
    photoid=`echo "$line" |cut -d ';' -f 1`
    result=`grep -P "^$photoid\t" dataset.yfcc`
    if [ ! -z "$result" ]; then
        coords=`echo "$result" |sed 's/\t/:/g'`
        echo "$coords" >> poi.coords
    fi  
done

For further acceleration, one could import the YFCC100M dataset into a database (a very time-consuming step) and then search by photo id.
The SQL statements for creating a database and a table for the dataset look like

In [None]:
CREATE DATABASE yfcc100m;
CREATE TABLE yfcc100m.tdata(
    pv_id           BIGINT UNSIGNED NOT NULL UNIQUE PRIMARY KEY, /* Photo/video identifier */
    longitude       FLOAT,  /* Longitude */
    latitude        FLOAT   /* Latitude */
);
COMMIT;

and the Python code for importing the dataset

In [None]:
import mysql.connector as db

def import_data(fname):
    """Import data records from file"""
    dbconnection = db.connect(user='USERNAME', password='PASSWORD')
    cursor = dbconnection.cursor()
    # parameterized statement: the connector escapes the values
    sqlstr = 'INSERT INTO yfcc100m.tdata VALUES (%s, %s, %s)'
    with open(fname, 'r') as f:
        for line in f:
            items = line.split('\t')
            assert len(items) == 3
            pv_id, longitude, latitude = (item.strip() for item in items)
            # skip photos without geo-coordinates
            if not longitude or not latitude:
                continue
            try:
                cursor.execute(sqlstr, (pv_id, longitude, latitude))
            except db.Error as error:
                print('ERROR: {}'.format(error))
    dbconnection.commit()
    dbconnection.close()

Then, searching the coordinates of each photo is very easy and fast

In [None]:
import mysql.connector as db

def search_coords(fin, fout):
    """Search Longitude and Latitude for each geo-tagged photo"""
    dbconnection = db.connect(user='USERNAME', password='PASSWORD', database='yfcc100m')
    cursor = dbconnection.cursor()
    records = []
    with open(fin, 'r') as f:
        for line in f:
            items = line.split(';')
            assert len(items) == 7
            photoID = items[0]
            poiID   = items[3]
            sqlstr = "SELECT ROUND(longitude, 6), ROUND(latitude, 6) FROM tdata WHERE pv_id = %s"
            cursor.execute(sqlstr, (photoID,))
            for longitude, latitude in cursor:
                records.append(poiID + ':' + photoID + ':' + str(longitude) + ':' + str(latitude))
    dbconnection.commit()
    dbconnection.close()
    with open(fout, 'w') as f:
        for line in records:
            f.write(line + '\n')
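
With the photo coordinates collected, the POI coordinates described at the start of this section are the per-POI means. The sketch below assumes input lines in the `poiID:photoID:longitude:latitude` format written by `search_coords`; the function name `mean_poi_coords` is hypothetical.

```python
from collections import defaultdict


def mean_poi_coords(lines):
    """Average the photo coordinates of each POI.

    Each line is assumed to look like 'poiID:photoID:longitude:latitude'.
    Returns {poiID: (mean_longitude, mean_latitude)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # poiID -> [sum_lon, sum_lat, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        poi_id, _photo_id, lon, lat = line.split(':')
        acc = sums[poi_id]
        acc[0] += float(lon)
        acc[1] += float(lat)
        acc[2] += 1
    return {poi: (s_lon / n, s_lat / n) for poi, (s_lon, s_lat, n) in sums.items()}
```

A plain arithmetic mean is adequate at city scale; it would need care for POIs close to the 180° meridian, which is not the case for the four cities here.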

<a id='recommendation'></a>

3. Recommend Itinerary [&#8648;](#toc)
----------------------
The paper formulates an Integer Linear Program (ILP) to recommend an itinerary given a set of POIs, a budget, and a source and destination POI. It maximizes an objective function that combines POI popularity with user interest (weighted by $\eta$), subject to a budget on travelling time plus personalized visit durations.

[PuLP](https://github.com/coin-or/pulp) from the [COIN-OR](http://www.coin-or.org/) project is a useful Python library for modeling linear and integer programs. Many LP solvers such as [GLPK](http://www.gnu.org/software/glpk/), [CBC](https://projects.coin-or.org/Cbc), [CPLEX](http://www.ibm.com/software/commerce/optimization/cplex-optimizer/) and [Gurobi](http://www.gurobi.com) can be called to solve the model.
Its comprehensive documentation is available [here](https://pythonhosted.org/PuLP/index.html). The formulation details can be found in the [MIP_recommend()](https://github.com/cdawei/digbeta/blob/master/example/ijcai15/ijcai15.py) method.

<a id='issue'></a>

4. Issues [&#8648;](#toc)
---------
1. Is it necessary to consider visiting a certain POI more than once? This paper ignores this setting.

1. Dealing with the edge case $\bar{V}(p) = 0$
 
 This occurs for a POI at which every visiting user took just one photo (or uploaded two or more photos with the same timestamp), so that every visit duration at the POI is zero. Such POIs do appear in this [dataset](https://sites.google.com/site/limkwanhui/datacode#ijcai15).

 For all users $U$ and a POI $p$ with arrival time $t_p^a$ and departure time $t_p^d$, the Average POI Visit Duration is defined as:
$\bar{V}(p) = \frac{1}{n}\sum_{u \in U}\sum_{p_x \in S_u}(t_{p_x}^d - t_{p_x}^a)\delta(p_x = p), \forall p \in P$

 and the Time-based User Interest is defined as:
$Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)} \delta(Cat_{p_x} = c), \forall c \in C$

 Up to now, two strategies have been tried:
  * When $\bar{V}(p_x) = 0$, let the term $\frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)} = K$, where $K$ is a constant (e.g. 2). This approach seems to work, but the effects of different constants should be tested.
  * Discard all photo records in the [dataset](https://sites.google.com/site/limkwanhui/datacode#ijcai15) related to the edge case. This approach throws away too much information and can make the usable dataset too small (sometimes about 1% of the original dataset).

1. [CBC](https://projects.coin-or.org/Cbc) is still too slow for large sequences (length >= 4)
 * use [Gurobi](http://www.gurobi.com) on CECS servers
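
The two definitions in the edge-case item above, together with the constant-$K$ strategy, can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the results: visits are assumed to be `(user, poi, t_arrive, t_depart)` tuples, $n$ is assumed to be the number of visits to POI $p$, and the function names are hypothetical.

```python
from collections import defaultdict


def avg_visit_duration(visits):
    """visits: iterable of (user, poi, t_arrive, t_depart).

    Returns {poi: mean visit duration}, taking n as the number of visits to the POI.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for _user, poi, t_a, t_d in visits:
        total[poi] += t_d - t_a
        count[poi] += 1
    return {poi: total[poi] / count[poi] for poi in total}


def time_based_interest(visits, poi_cat, K=2.0):
    """Returns {user: {category: interest}}.

    When the average duration of a POI is 0, the ratio is replaced by the constant K.
    """
    visits = list(visits)
    v_bar = avg_visit_duration(visits)
    interest = defaultdict(lambda: defaultdict(float))
    for user, poi, t_a, t_d in visits:
        if v_bar[poi] > 0:
            ratio = (t_d - t_a) / v_bar[poi]
        else:
            ratio = K  # edge case: every visit to this POI has zero duration
        interest[user][poi_cat[poi]] += ratio
    return {u: dict(c) for u, c in interest.items()}
```

Passing different values of `K` makes it easy to test how sensitive the results are to the chosen constant.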

<a id='result'></a>

5. Results [&#8648;](#toc)
----------
<a id='result1'></a>

### 5.1 POI Popularity [&#8648;](#toc)

POI Popularity $Pop(p)$: the number of times POI $p$ has been visited.
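
As a minimal sketch, this count can be computed from a list of visit records; the `(user, poi)` record layout and the name `poi_popularity` are assumptions for illustration.

```python
from collections import Counter


def poi_popularity(visits):
    """visits: iterable of (user, poi) pairs, one pair per POI visit.

    Returns a Counter mapping each POI to the number of times it was visited.
    """
    return Counter(poi for _user, poi in visits)
```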

<a href="images/Edin_poi_pop.png" title="Edinburgh"><img src="images/Edin_poi_pop.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Glas_poi_pop.png" title="Glasgow"><img src="images/Glas_poi_pop.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Osak_poi_pop.png" title="Osaka"><img src="images/Osak_poi_pop.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Toro_poi_pop.png" title="Toronto"><img src="images/Toro_poi_pop.png" style="width:650px;position:relative;left:-100px"></a>
<a id='result2'></a>

### 5.2 Time-based User Interest [&#8648;](#toc)

The interest of a user $u$ in POI category $c$:
$
Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{(t_{p_x}^d - t_{p_x}^a)}{\bar{V}(p_x)}\delta(Cat_{p_x} = c), \forall c \in C
$

where $Cat_{p_x}$ is the category of POI $p_x$, $t_{p_x}^a$ is user $u$'s arrival time at POI $p_x$ and $t_{p_x}^d$ is the departure time. $S_u$ is user $u$'s travel history, and $\delta(Cat_{p_x} = c)$ equals $1$ if $Cat_{p_x} = c$ and $0$ otherwise. $\bar{V}(p)$ is the average visit duration at POI $p$ over all users:
$
\bar{V}(p) = \frac{1}{n} \sum_{u \in U} \sum_{p_x \in S_u} (t_{p_x}^d - t_{p_x}^a) \delta(p_x = p), \forall p \in P
$

<a href="images/Edin_time_usr_interest.png" title="Edinburgh"><img src="images/Edin_time_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Glas_time_usr_interest.png" title="Glasgow"><img src="images/Glas_time_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Osak_time_usr_interest.png" title="Osaka"><img src="images/Osak_time_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Toro_time_usr_interest.png" title="Toronto"><img src="images/Toro_time_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a id='result3'></a>

### 5.3 Frequency-based User Interest [&#8648;](#toc)

Frequency-based User Interest $Int_u^{Freq}(c)$: the number of times user $u$ visits POIs of category $c$
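
A short sketch of this count, assuming visits are given as `(user, poi)` pairs and `poi_cat` maps each POI to its category (both names are hypothetical):

```python
from collections import Counter, defaultdict


def freq_based_interest(visits, poi_cat):
    """Count, for every user, the visits made to POIs of each category.

    Returns {user: {category: number_of_visits}}.
    """
    interest = defaultdict(Counter)
    for user, poi in visits:
        interest[user][poi_cat[poi]] += 1
    return {user: dict(counts) for user, counts in interest.items()}
```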

<a href="images/Edin_freq_usr_interest.png" title="Edinburgh"><img src="images/Edin_freq_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Glas_freq_usr_interest.png" title="Glasgow"><img src="images/Glas_freq_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Osak_freq_usr_interest.png" title="Osaka"><img src="images/Osak_freq_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>
<a href="images/Toro_freq_usr_interest.png" title="Toronto"><img src="images/Toro_freq_usr_interest.png" style="width:650px;position:relative;left:-100px"></a>

<a id='result4'></a>

### 5.4 Precision, Recall and F1-score [&#8648;](#toc)

**Settings: Toronto, $\eta$=0.5 with time-based user interest and POI popularity, 42/335 &asymp; 12.5% solutions are suboptimal, leave-one-out**

<table>
<tr><td></td><td><b>Recall</b></td><td><b>Precision</b></td><td><b>F1-score</b></td></tr>
<tr><td><b>Paper</b></td><td>0.779&plusmn;0.010</td><td>0.706&plusmn;0.013</td><td>0.732&plusmn;0.012</td></tr>
<tr><td><b>Reproduce</b></td><td>0.732&plusmn;0.179</td><td>0.736&plusmn;0.181</td><td>0.734&plusmn;0.179</td></tr>
</table>

Length of Recall, Precision and F1-score: 335<br>
Unique values of Recall, Precision, F1-score: 0.40, 0.44, 0.50, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.78, 0.80, 0.83, 0.85, 1.00<br>

Box plot of Recall, Precision and F1-score
<a href="images/Toro_bp_eta05_time.png" title="Toronto"><img src="images/Toro_bp_eta05_time.png" style="width:400px;position:relative;left:-100px"></a>


**Settings: Toronto, $\eta$=0.5 with frequency-based user interest and POI popularity, 43/335 &asymp; 12.8% solutions are suboptimal, leave-one-out**

<table>
<tr><td></td><td><b>Recall</b></td><td><b>Precision</b></td><td><b>F1-score</b></td></tr>
<tr><td><b>Paper</b></td><td>0.760&plusmn;0.009</td><td>0.679&plusmn;0.013</td><td>0.708&plusmn;0.012</td></tr>
<tr><td><b>Reproduce</b></td><td>0.709&plusmn;0.179</td><td>0.711&plusmn;0.180</td><td>0.710&plusmn;0.179</td></tr>
</table>

Length of Recall, Precision and F1-score: 335<br>
Unique values of Recall, Precision, F1-score: 0.40, 0.43, 0.50, 0.56, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.80, 0.83, 0.85, 1.00<br>
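
The per-trajectory scores in this section take only a few distinct values because each one is a ratio of small POI counts. The sketch below assumes set-based definitions (recall is the share of actually visited POIs that were recommended, precision the share of recommended POIs that were actually visited); the function name is hypothetical.

```python
def trajectory_scores(recommended, actual):
    """Return (recall, precision, f1) for one recommended vs. actual trajectory.

    Both arguments are non-empty sequences of POI ids; visiting order is ignored.
    """
    rec, act = set(recommended), set(actual)
    common = len(rec & act)
    recall = common / len(act)
    precision = common / len(rec)
    if common == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```

For instance, a recommendation of four POIs that shares two POIs with an actual trajectory of three POIs gives a recall of 2/3, a precision of 1/2 and an F1-score of 4/7.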

Box plot of Recall, Precision and F1-score
<a href="images/Toro_bp_eta05_freq.png" title="Toronto"><img src="images/Toro_bp_eta05_freq.png" style="width:400px;position:relative;left:-100px"></a>

<a id='result5'></a>

### 5.5 Transition matrix for POI categories [&#8648;](#toc)

**Settings for recommended trajectories: Toronto, $\eta$=0.5 with time-based user interest and 59/335 &asymp; 17.6% solutions are suboptimal**

**NOTE**: the value of *matrix[i, j]* denotes the probability of visiting *category j* after visiting *category i* for an average visitor.
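
Such a matrix can be estimated by counting consecutive category pairs in the trajectories and normalizing each row. This is a sketch under the assumption that every trajectory is already given as a list of category names; `category_transition_matrix` is a hypothetical name.

```python
def category_transition_matrix(cat_sequences, categories):
    """Estimate matrix[i][j] = P(next category is j | current category is i).

    cat_sequences: iterable of trajectories, each a list of category names.
    categories: ordered list of category names defining the row/column order.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for seq in cat_sequences:
        # count every pair of consecutive visits
        for cur, nxt in zip(seq, seq[1:]):
            counts[index[cur]][index[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # rows without any outgoing transition stay all-zero
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```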

Transition matrix for *recommended* trajectories:

<table>
<tr>
<td></td>
<td><b>Amusement</b></td>
<td><b>Beach</b></td>
<td><b>Cultural</b></td>
<td><b>Shopping</b></td>
<td><b>Sport</b></td>
<td><b>Structure</b></td>
</tr>
<tr><td><b>Amusement</b></td>
<td>0.035</td><td>0.266</td><td>0.259</td><td>0.098</td><td>0.266</td><td>0.077</td></tr>
<tr><td><b>Beach</b></td>
<td>0.177</td><td>0.222</td><td>0.254</td><td>0.113</td><td>0.069</td><td>0.165</td></tr>
<tr><td><b>Cultural</b></td>
<td>0.209</td><td>0.295</td><td>0.175</td><td>0.107</td><td>0.124</td><td>0.090</td></tr>
<tr><td><b>Shopping</b></td>
<td>0.146</td><td>0.404</td><td>0.281</td><td>0.000</td><td>0.090</td><td>0.079</td></tr>
<tr><td><b>Sport</b></td>
<td>0.229</td><td>0.219</td><td>0.305</td><td>0.086</td><td>0.086</td><td>0.076</td></tr>
<tr><td><b>Structure</b></td>
<td>0.212</td><td>0.282</td><td>0.247</td><td>0.094</td><td>0.141</td><td>0.024</td></tr>
</table>

Transition matrix for *actual* trajectories:

<table>
<tr>
<td></td>
<td><b>Amusement</b></td>
<td><b>Beach</b></td>
<td><b>Cultural</b></td>
<td><b>Shopping</b></td>
<td><b>Sport</b></td>
<td><b>Structure</b></td>
</tr>
<tr><td><b>Amusement</b></td>
<td>0.091</td><td>0.117</td><td>0.344</td><td>0.110</td><td>0.240</td><td>0.097</td></tr>
<tr><td><b>Beach</b></td>
<td>0.059</td><td>0.127</td><td>0.183</td><td>0.269</td><td>0.056</td><td>0.305</td></tr>
<tr><td><b>Cultural</b></td>
<td>0.131</td><td>0.204</td><td>0.114</td><td>0.201</td><td>0.064</td><td>0.286</td></tr>
<tr><td><b>Shopping</b></td>
<td>0.057</td><td>0.358</td><td>0.219</td><td>0.057</td><td>0.065</td><td>0.244</td></tr>
<tr><td><b>Sport</b></td>
<td>0.317</td><td>0.183</td><td>0.167</td><td>0.103</td><td>0.063</td><td>0.167</td></tr>
<tr><td><b>Structure</b></td>
<td>0.084</td><td>0.303</td><td>0.261</td><td>0.200</td><td>0.077</td><td>0.074</td></tr>
</table>

<a id='enum'></a>

6. Tour Enumeration [&#8648;](#toc)
--------------------
For recommended tours/trajectories of length 3, 4 and 5, enumerating all possibilities makes the recommendation results much easier to understand than using ILP to get a single best recommendation.
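
The enumeration idea can be sketched as follows. This is not the script used to produce the data: the function name and the `score` callback (standing in for the objective value of a tour) are hypothetical, and the budget constraint is left out for brevity.

```python
from itertools import permutations


def enumerate_tours(pois, source, dest, length, score):
    """Enumerate all tours of the given length from source to dest.

    A tour visits `length` distinct POIs, starting at source and ending at dest
    (source != dest and length >= 2 are assumed). `score` maps a tour (tuple of
    POIs) to a number. Returns a list of (tour, score) sorted by descending score.
    """
    middle = [p for p in pois if p not in (source, dest)]
    tours = []
    for mid in permutations(middle, length - 2):
        tour = (source,) + mid + (dest,)
        tours.append((tour, score(tour)))
    tours.sort(key=lambda item: item[1], reverse=True)
    return tours
```

With about 30 POIs per city, a tour of length 5 has at most 28 × 27 × 26 = 19,656 candidates for a fixed source and destination, so full enumeration stays cheap for these lengths.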

**Settings for enumeration:** Toronto, $\eta$=0.5 with time-based user interest, length for actual and recommended trajectories $\in \{3, 4, 5\}$, data are available [here](https://www.dropbox.com/sh/r50t19fb6a1m1ud/AADxEJMQWyEMJyB17ewf1-oma?dl=0).

The [interactive Python script](./src/plot_enumseq.py) generates boxplots of the enumerated trajectories' scores together with the scores of the actual travel sequences, e.g.

In [None]:
python3 plot_enumseq.py Toro_eta05_time_seqs.list

The file ```Toro_eta05_time_seqs.list``` is available [here](https://www.dropbox.com/sh/r50t19fb6a1m1ud/AADxEJMQWyEMJyB17ewf1-oma?dl=0).

<a id='transtime'></a>

7. Transition matrix for time at POI [&#8648;](#toc)
------------------------------------
Visit duration matrix $M$ for each user at each POI, data: [POI_visit_duration.zip](https://www.dropbox.com/sh/r50t19fb6a1m1ud/AADxEJMQWyEMJyB17ewf1-oma?dl=0).

Each cell $M_{ij}$ of the matrix denotes the $\log_{10}$ of the visit duration of user $i$ at POI $j$, i.e.
$M_{ij} = \log_{10}(\text{visit duration of user } i \text{ at POI } j)$ if the visit duration is greater than $0$, and $M_{ij} = 0$ otherwise.
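
A minimal sketch of building $M$ from a table of visit durations; the dictionary layout and the function name are assumptions for illustration, not the format of the linked data.

```python
import math


def visit_duration_matrix(durations, users, pois):
    """Build M as a list of rows, one row per user and one column per POI.

    durations: {(user, poi): visit duration}; missing pairs count as 0.
    M[i][j] is log10 of the duration if it is greater than 0, and 0 otherwise.
    """
    matrix = []
    for user in users:
        row = []
        for poi in pois:
            d = durations.get((user, poi), 0)
            row.append(math.log10(d) if d > 0 else 0.0)
        matrix.append(row)
    return matrix
```

Note that under this definition a duration of exactly one time unit also maps to $0$, so it cannot be distinguished from a missing visit.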