# Dataset Preprocessing

<a id='toc'></a>

[Table of Contents](#toc)
1. [Load Data](#sec1)
1. [Compute POI Information](#sec2)
1. [Compute Trajectory Statistics](#sec3)
1. [Filtering out Short Trajectories](#sec4)

In [None]:
%matplotlib inline

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
dir_ijcai = 'data/data-ijcai15'
dir_recsys = 'data/data-recsys16'

In [None]:
#fvisit = os.path.join(dir_ijcai, 'userVisits-Osak.csv')
#fcoord = os.path.join(dir_ijcai, 'photoCoords-Osak.csv')
#fvisit = os.path.join(dir_ijcai, 'userVisits-Glas.csv')
#fcoord = os.path.join(dir_ijcai, 'photoCoords-Glas.csv')
#fvisit = os.path.join(dir_ijcai, 'userVisits-Edin.csv')
#fcoord = os.path.join(dir_ijcai, 'photoCoords-Edin.csv')
fvisit = os.path.join(dir_ijcai, 'userVisits-Toro.csv')
fcoord = os.path.join(dir_ijcai, 'photoCoords-Toro.csv')

In [None]:
suffix = fvisit.split('-')[-1].split('.')[0]
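
The suffix is the city code that sits between the last hyphen and the file extension. A minimal sketch of the same string logic, wrapped in a hypothetical helper `city_suffix` that is not part of the notebook:

```python
import os

def city_suffix(path):
    """Hypothetical helper: text between the last '-' and the first '.' that follows it."""
    return path.split('-')[-1].split('.')[0]

# The directory name also contains a hyphen, but only the last piece is used.
example_path = os.path.join('data/data-ijcai15', 'userVisits-Toro.csv')
print(city_suffix(example_path))  # Toro
```

Note that this relies on the file name itself containing no further hyphens after the city code.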

In [None]:
fpoi = os.path.join(dir_recsys, 'poi-' + suffix + '.csv')
fvisit_all = os.path.join(dir_recsys, 'visitAll-' + suffix + '.csv')
fvisit_noshort = os.path.join(dir_recsys, 'visitNoShort-' + suffix + '.csv')
fseqstats_all = os.path.join(dir_recsys, 'seqStatsAll-' + suffix + '.csv')
fseqstats_noshort = os.path.join(dir_recsys, 'seqStatsNoShort-' + suffix + '.csv')

<a id='sec1'></a>

## 1. Load Data

Load user visit data and photo coordinates.

In [None]:
visits = pd.read_csv(fvisit, sep=';')
coords = pd.read_csv(fcoord, sep=';')
assert visits.shape[0] == coords.shape[0]  # both files should describe the same number of photos
traj = pd.merge(visits, coords, on='photoID') # merge data frames according to column 'photoID'
traj.head()
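
Equal row counts do not by themselves guarantee that every photo appears exactly once in both files. One possible extra check, sketched here on made-up toy frames (`visits_demo` and `coords_demo` are illustrative names, and the values are invented), is to let `pd.merge` verify the key relationship via its `validate` argument:

```python
import pandas as pd

# Toy stand-ins for the two CSV files; values are invented for illustration.
visits_demo = pd.DataFrame({'photoID': [1, 2, 3],
                            'userID': ['u1', 'u1', 'u2'],
                            'poiID': [10, 11, 10]})
coords_demo = pd.DataFrame({'photoID': [3, 1, 2],
                            'photoLon': [-79.40, -79.38, -79.39],
                            'photoLat': [43.65, 43.64, 43.66]})

# validate='one_to_one' raises pandas.errors.MergeError if photoID repeats in either frame.
traj_demo = pd.merge(visits_demo, coords_demo, on='photoID', validate='one_to_one')

# The default inner join drops unmatched photos, so an unchanged row count
# means every visit found its coordinates.
assert traj_demo.shape[0] == visits_demo.shape[0]
print(traj_demo)
```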

In [None]:
num_photo = traj['photoID'].unique().shape[0]
num_user = traj['userID'].unique().shape[0]
num_poi = traj['poiID'].unique().shape[0]
num_seq = traj['seqID'].unique().shape[0]
pd.DataFrame({'#photo': num_photo, '#user': num_user, '#poi': num_poi, '#seq': num_seq,
              '#photo/user': num_photo/num_user, '#seq/user': num_seq/num_user}, index=[str(suffix)])

<a id='sec2'></a>

## 2. Compute POI Information

Compute each POI's (longitude, latitude) as the mean of the coordinates of the photos assigned to it.

In [None]:
poi_coords = traj[['poiID', 'photoLon', 'photoLat']].groupby('poiID').mean()
poi_coords.reset_index(inplace=True)
poi_coords.rename(columns={'photoLon':'poiLon', 'photoLat':'poiLat'}, inplace=True)

Extract POI category.

In [None]:
poi_cat = traj[['poiID', 'poiTheme']].groupby('poiID').first()
poi_cat.reset_index(inplace=True)

In [None]:
poi_all = pd.merge(poi_cat, poi_coords, on='poiID')
poi_all.set_index('poiID', inplace=True)
poi_all
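
The two group-bys and the merge above could also be expressed as a single named aggregation. The following is only a sketch on invented toy data (`photos_demo` and `poi_demo` are illustrative names), assuming a pandas version that supports named aggregation (0.25 or later):

```python
import pandas as pd

# Invented photo records: two photos at POI 1, one photo at POI 2.
photos_demo = pd.DataFrame({
    'poiID': [1, 1, 2],
    'poiTheme': ['Sport', 'Sport', 'Cultural'],
    'photoLon': [-79.0, -79.5, -79.25],
    'photoLat': [43.0, 43.5, 43.75],
})

# One row per POI: first theme seen, plus the mean photo coordinates.
poi_demo = photos_demo.groupby('poiID').agg(
    poiTheme=('poiTheme', 'first'),
    poiLon=('photoLon', 'mean'),
    poiLat=('photoLat', 'mean'),
)
print(poi_demo)
```

Taking the first theme per POI assumes, as the notebook does, that all photos of a POI carry the same `poiTheme`.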

Scatter plot of POI coordinates.

In [None]:
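height = 3
# Width-to-height ratio of the bounding box of all POIs
ratio = (poi_all['poiLon'].max() - poi_all['poiLon'].min()) / (poi_all['poiLat'].max() - poi_all['poiLat'].min())
# Guard against a zero-width figure when the ratio rounds down to 0
plt.figure(figsize=[height * max(1, np.round(ratio)), height])
plt.scatter(poi_all['poiLon'], poi_all['poiLat'])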

Save POI information to CSV file.

In [None]:
poi_all.to_csv(fpoi, index=True)

<a id='sec3'></a>

## 3. Compute Trajectory Statistics

In [None]:
# One row per (user, trajectory, POI) visit: first and last photo timestamps, and photo count.
# String aggregation names give stable column labels ('min', 'max', 'size') across NumPy/pandas versions.
seq_all = traj.groupby(['userID', 'seqID', 'poiID'])['dateTaken'].agg(['min', 'max', 'size'])
seq_all.reset_index(inplace=True)
seq_all.rename(columns={'min':'arrivalTime', 'max':'departureTime', 'size':'#photo'}, inplace=True)
seq_all['poiDuration(sec)'] = seq_all['departureTime'] - seq_all['arrivalTime']
seq_all.head()
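
To make the per-visit aggregation concrete, here is a small sketch on invented data. It assumes `dateTaken` is a timestamp in seconds, and it uses the column name `numPhoto` (the notebook calls it `#photo`) so that the name can be passed as a keyword to a named aggregation:

```python
import pandas as pd

# Invented photo log: user u1 takes two photos at POI 10 and one at POI 11.
photos_demo = pd.DataFrame({
    'userID': ['u1', 'u1', 'u1', 'u2'],
    'seqID': [1, 1, 1, 2],
    'poiID': [10, 10, 11, 10],
    'dateTaken': [100, 160, 400, 50],  # assumed to be seconds
})

# Arrival = earliest photo, departure = latest photo, numPhoto = photos taken at the POI.
visits_demo = (photos_demo.groupby(['userID', 'seqID', 'poiID'])['dateTaken']
               .agg(arrivalTime='min', departureTime='max', numPhoto='size')
               .reset_index())
visits_demo['poiDuration(sec)'] = visits_demo['departureTime'] - visits_demo['arrivalTime']
print(visits_demo)
```

A visit with a single photo gets a duration of 0, because arrival and departure coincide.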

Compute simple statistics for each trajectory: its length (number of POIs), start time, end time, and duration.

In [None]:
# seqLen = number of POI visits per trajectory (poiID has no missing values, so count equals size)
seq_stats = seq_all[['userID', 'seqID', 'poiID']].copy().groupby(['userID', 'seqID']).count()
seq_stats.reset_index(inplace=True)
seq_stats.rename(columns={'poiID':'seqLen'}, inplace=True)
seq_stats.set_index('seqID', inplace=True)
seq_stats.head()

Start time of each sequence.

In [None]:
seq_starttime = seq_all[['userID', 'seqID', 'arrivalTime']].copy().groupby(['userID', 'seqID']).agg(np.min)
seq_starttime.reset_index(inplace=True)
seq_starttime.rename(columns={'arrivalTime':'startTime'}, inplace=True)
seq_starttime.set_index('seqID', inplace=True)
seq_starttime.head()

End time of each sequence.

In [None]:
seq_endtime = seq_all[['userID', 'seqID', 'departureTime']].copy().groupby(['userID', 'seqID']).agg(np.max)
seq_endtime.reset_index(inplace=True)
seq_endtime.rename(columns={'departureTime':'endTime'}, inplace=True)
seq_endtime.set_index('seqID', inplace=True)
seq_endtime.head()

In [None]:
seq_stats['startTime'] = seq_starttime.loc[seq_stats.index, 'startTime']
seq_stats['endTime'] = seq_endtime.loc[seq_stats.index, 'endTime']
seq_stats['seqDuration(sec)'] = seq_stats['endTime'] - seq_stats['startTime']
seq_stats.head()

In [None]:
np.all(seq_stats['seqDuration(sec)'] >= 0)
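
The separate group-bys for length, start time, and end time could alternatively be folded into one named aggregation. This is a sketch on invented data, not the notebook's method; it assumes, as the notebook's use of `seqID` as an index suggests, that `seqID` is unique across users:

```python
import pandas as pd

# Invented per-visit rows in the same shape as seq_all.
visits_demo = pd.DataFrame({
    'userID': ['u1', 'u1', 'u1', 'u2'],
    'seqID': [1, 1, 1, 2],
    'poiID': [10, 11, 12, 10],
    'arrivalTime': [100, 400, 900, 50],
    'departureTime': [160, 400, 1000, 50],
})

# One row per trajectory, indexed by seqID.
stats_demo = visits_demo.groupby('seqID').agg(
    userID=('userID', 'first'),
    seqLen=('poiID', 'size'),
    startTime=('arrivalTime', 'min'),
    endTime=('departureTime', 'max'),
)
stats_demo['seqDuration(sec)'] = stats_demo['endTime'] - stats_demo['startTime']

# Same sanity check as above: no trajectory may end before it starts.
assert (stats_demo['seqDuration(sec)'] >= 0).all()
print(stats_demo)
```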

Save trajectories and the associated stats to CSV files.

In [None]:
seq_all.to_csv(fvisit_all, index=False)
seq_stats.to_csv(fseqstats_all, index=True)

<a id='sec4'></a>

## 4. Filtering out Short Trajectories

Filter out short trajectories, i.e., those that visit only 1 or 2 POIs.

In [None]:
seq_stats = seq_stats[seq_stats['seqLen'] > 2]
seq_stats.head()

In [None]:
seq_all = seq_all[seq_all['seqID'].isin(seq_stats.index)]
seq_all.head()
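
The two filtering steps can be illustrated on invented data as follows. `MIN_LEN` is a hypothetical name for the threshold; `seqLen >= 3` is equivalent to the `seqLen > 2` condition used above:

```python
import pandas as pd

# Invented trajectory stats (indexed by seqID) and the matching per-visit rows.
stats_demo = pd.DataFrame({'seqLen': [1, 2, 3]},
                          index=pd.Index([1, 2, 3], name='seqID'))
visits_demo = pd.DataFrame({'seqID': [1, 2, 2, 3, 3, 3],
                            'poiID': [10, 10, 11, 10, 11, 12]})

MIN_LEN = 3  # hypothetical constant: keep trajectories with at least 3 POIs

# Step 1: keep only the stats rows of long-enough trajectories.
long_stats = stats_demo[stats_demo['seqLen'] >= MIN_LEN]
# Step 2: keep only the visits that belong to the surviving trajectories.
long_visits = visits_demo[visits_demo['seqID'].isin(long_stats.index)]
print(long_visits)
```

Filtering the stats first and then selecting visits by the surviving `seqID` values keeps the two tables consistent with each other.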

Save trajectories and the associated stats without short trajectories to CSV files.

In [None]:
seq_all.to_csv(fvisit_noshort, index=False)
seq_stats.to_csv(fseqstats_noshort, index=True)