# Generate Trajectories from Flickr Photos Taken in Melbourne

[Table of Contents](#toc)
1. [Load POI Data](#sec1)
1. [Load Photo Data](#sec2)
1. [Map Photos to POIs & Build Trajectories](#sec3)
  1. [Approach I - Greedy](#sec3.1)
  1. [Approach II - Dynamic Programming](#sec3.2)
1. [Save Trajectory Data](#sec4)
  1. [Compute Trajectory Statistics](#sec4.1)
  1. [Filtering out Short Trajectories](#sec4.2)
  1. [Filtering out Users with Few Trajectories](#sec4.3)

In [1]:
%matplotlib inline

import os, sys, time
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

In [2]:
def print_progress(cnt, total):
    """Display a progress bar"""
    assert(cnt > 0 and total > 0 and cnt <= total)
    length = 80
    ratio = cnt / total
    n = int(length * ratio)
    sys.stdout.write('\r[%-80s] %d%%' % ('-'*n, int(ratio*100)))
    sys.stdout.flush()

In [3]:
data_dir = '../data'
fpoi = os.path.join(data_dir, 'poi-Melb-0.csv')
fpoi_new = os.path.join(data_dir, 'poi-Melb.csv')
fphoto = os.path.join(data_dir, 'Melb_photos_bigbox.csv')
ftraj_all = os.path.join(data_dir, 'traj-all-Melb.csv')
ftraj_noshort = os.path.join(data_dir, 'traj-noshort-Melb.csv')
ftraj_nofew = os.path.join(data_dir, 'traj-nofew-Melb.csv')

<a id='sec1'></a>

## 1. Load POI Data

In [4]:
poi_df = pd.read_csv(fpoi)
poi_df.head()

Unnamed: 0,poiID,poiCat,poiLon,poiLat
0,0,City precincts,144.96778,-37.82167
1,1,City precincts,144.946,-37.817
2,2,City precincts,144.973,-37.8119
3,3,City precincts,144.96694,-37.79972
4,4,City precincts,144.96333,-37.80778


In [5]:
poi_df.set_index('poiID', inplace=True)

In [6]:
print('#POIs:', poi_df.shape[0])

#POIs: 88


<a id='sec2'></a>

## 2. Load Photo Data

In [7]:
photo_df = pd.read_csv(fphoto, skipinitialspace=True, parse_dates=[2])
photo_df.head()

Unnamed: 0,Photo_ID,User_ID,Timestamp,Longitude,Latitude,Accuracy,URL,Marker(photo=0 video=1)
0,5703013770,25287507@N02,2011-05-09 19:19:58,144.604775,-37.878579,16,http://www.flickr.com/photos/25287507@N02/5703...,0
1,5653121597,59335517@N02,2011-04-10 13:27:37,145.033779,-37.82231,16,http://www.flickr.com/photos/59335517@N02/5653...,0
2,5522325184,26303188@N00,2011-03-13 20:44:24,144.981122,-37.824344,14,http://www.flickr.com/photos/26303188@N00/5522...,0
3,7978703060,82732068@N02,2012-07-14 12:29:43,145.947854,-38.479344,15,http://www.flickr.com/photos/82732068@N02/7978...,0
4,174030514,19677632@N00,2004-08-01 19:28:40,145.533485,-37.949003,12,http://www.flickr.com/photos/19677632@N00/1740...,0


Remove photos with low accuracies (accuracy $< 16$).

In [8]:
print(photo_df['Accuracy'].unique())
photo_df = photo_df[photo_df['Accuracy'] == 16]
print(photo_df['Accuracy'].unique())

[16 14 15 12 11 13  8 10  9  3  5  7  6  4  1  2]
[16]


Remove columns that will not be used.

In [9]:
photo_df.drop(['Accuracy', 'URL', 'Marker(photo=0 video=1)'], axis=1, inplace=True)

Convert datatime to unix epoch.

In [10]:
photo_df['dateTaken'] = photo_df['Timestamp'].apply(lambda x: x.timestamp())
photo_df.drop('Timestamp', axis=1, inplace=True)

Rename columns.

In [11]:
photo_df.rename(columns={'Photo_ID':'photoID', 'User_ID':'userID', 'Longitude':'photoLon', 'Latitude':'photoLat'}, \
                inplace=True)
photo_df.head()

Unnamed: 0,photoID,userID,photoLon,photoLat,dateTaken
0,5703013770,25287507@N02,144.604775,-37.878579,1304932798
1,5653121597,59335517@N02,145.033779,-37.82231,1302406057
5,9588963220,67774014@N00,144.96506,-37.815725,1377408461
9,6191232325,63488421@N08,144.666981,-37.922733,1316741616
10,6644759687,10559879@N00,144.961177,-37.812759,1325813367


In [12]:
photo_df.shape

(94142, 5)

In [13]:
print('#Photos:', photo_df['photoID'].unique().shape[0])
print('#Users:', photo_df['userID'].unique().shape[0])

#Photos: 94142
#Users: 1659


In [14]:
photo_df.set_index('photoID', inplace=True)
photo_df['poiID'] = -1
photo_df['trajID'] = -1
photo_df.head()

Unnamed: 0_level_0,userID,photoLon,photoLat,dateTaken,poiID,trajID
photoID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5703013770,25287507@N02,144.604775,-37.878579,1304932798,-1,-1
5653121597,59335517@N02,145.033779,-37.82231,1302406057,-1,-1
9588963220,67774014@N00,144.96506,-37.815725,1377408461,-1,-1
6191232325,63488421@N08,144.666981,-37.922733,1316741616,-1,-1
6644759687,10559879@N00,144.961177,-37.812759,1325813367,-1,-1


<a id='sec3'></a>

## 3. Map Photos to POIs & Build Trajectories

Generate travel history for each user from the photos taken by him/her.

In [15]:
def calc_dist_vec(longitudes1, latitudes1, longitudes2, latitudes2):
    """Calculate the distance (unit: km) between two places on earth, vectorised"""
    # convert degrees to radians
    lng1 = np.radians(longitudes1)
    lat1 = np.radians(latitudes1)
    lng2 = np.radians(longitudes2)
    lat2 = np.radians(latitudes2)
    radius = 6371.0088 # mean earth radius, en.wikipedia.org/wiki/Earth_radius#Mean_radius

    # The haversine formula, en.wikipedia.org/wiki/Great-circle_distance
    dlng = np.fabs(lng1 - lng2)
    dlat = np.fabs(lat1 - lat2)
    dist =  2 * radius * np.arcsin( np.sqrt( 
                (np.sin(0.5*dlat))**2 + np.cos(lat1) * np.cos(lat2) * (np.sin(0.5*dlng))**2 ))
    return dist

Sanity check.

In [16]:
calc_dist_vec(poi_df.loc[0, 'poiLon'], poi_df.loc[0, 'poiLat'], poi_df.loc[0, 'poiLon'], poi_df.loc[0, 'poiLat'])

0.0

Filtering out photos that leads to super fast speeds.

In [17]:
SUPER_FAST = 150 / (60 * 60)  # 150 km/h

In [18]:
filter_tags = pd.Series(data=np.zeros(photo_df.shape[0], dtype=np.bool), index=photo_df.index)

In [19]:
cnt = 0
total = photo_df['userID'].unique().shape[0]
for user in sorted(photo_df['userID'].unique().tolist()):
    udf = photo_df[photo_df['userID'] == user].copy()
    udf.sort_values(by='dateTaken', ascending=True, inplace=True)
    udists = calc_dist_vec(udf['photoLon'][:-1].values, udf['photoLat'][:-1].values, \
                           udf['photoLon'][1: ].values, udf['photoLat'][1: ].values)
    assert(udists.shape[0] == udf.shape[0]-1)
    superfast = np.zeros(udf.shape[0]-1, dtype=np.bool)
    for i in range(udf.shape[0]-1):
        ix1 = udf.index[i]
        ix2 = udf.index[i+1]
        dtime = udf.loc[ix2, 'dateTaken'] - udf.loc[ix1, 'dateTaken']
        assert(dtime >= 0)
        if dtime == 0: superfast[i] = True
        speed = udists[i] / dtime
        if speed > SUPER_FAST: superfast[i] = True
    for j in range(superfast.shape[0]-1):
        if superfast[j] and superfast[j+1]:  # jx0-->SUPER_FAST-->jx-->SUPER_FAST-->jx1: remove photo jx
            jx = udf.index[j+1]
            filter_tags.loc[jx] = True
    cnt += 1; print_progress(cnt, total)

[--------------------------------------------------------------------------------] 100%

In [20]:
for jx in filter_tags.index:
    if filter_tags.loc[jx] == True:
        photo_df.drop(jx, axis=0, inplace=True)

In [21]:
photo_df.shape

(92758, 6)

Distance between POIs.

In [22]:
poi_distmat = pd.DataFrame(data=np.zeros((poi_df.shape[0], poi_df.shape[0]), dtype=np.float), \
                           index=poi_df.index, columns=poi_df.index)

In [23]:
for ix in poi_df.index:
    poi_distmat.loc[ix] = calc_dist_vec(poi_df.loc[ix, 'poiLon'], poi_df.loc[ix, 'poiLat'], \
                                        poi_df['poiLon'], poi_df['poiLat'])

Distance between photos and POIs.

In [24]:
photo_poi_distmat = pd.DataFrame(data=np.zeros((photo_df.shape[0], poi_df.shape[0]), dtype=np.float), \
                                 index=photo_df.index, columns=poi_df.index)

In [25]:
for i in range(photo_df.shape[0]):
    ix = photo_df.index[i]
    photo_poi_distmat.loc[ix] = calc_dist_vec(photo_df.loc[ix, 'photoLon'], photo_df.loc[ix, 'photoLat'], \
                                              poi_df['poiLon'], poi_df['poiLat'])
    print_progress(i+1, photo_df.shape[0])

[--------------------------------------------------------------------------------] 100%

"Map a photo to a POI if their coordinates differ by $<200$m based on the Haversine formula" according to the [IJCAI15 paper](https://www.nicta.com.au/pub-download/full/8557/).

In [26]:
DIST_MAX = 0.2  # 0.2km

Time gap is $8$ hours according to the [IJCAI15 paper](https://www.nicta.com.au/pub-download/full/8557/).

In [27]:
TIME_GAP = 8 * 60 * 60  # 8 hours

In [28]:
users = sorted(photo_df['userID'].unique().tolist())

<a id='sec3.1'></a>

### 3.1 Map Photos to POIs: Approach I - Greedy

Map photo to the closest POI.

In [29]:
traj_greedy = photo_df.copy()

In [30]:
cnt = 0
for ix in traj_greedy.index:
    min_ix = photo_poi_distmat.loc[ix].idxmin()
    if photo_poi_distmat.loc[ix, min_ix] > DIST_MAX:  # photo is taken at position far from any POI, do NOT use it
        pass
    else:
        traj_greedy.loc[ix, 'poiID'] = poi_df.index[min_ix]  # map photo to the closest POI
        # all POIs that are very close to a photo are an option to map
        #photo_df.loc[ix, 'poiID'] = str(poi_df.index[~(dists > dist_max)].tolist())
    cnt += 1; print_progress(cnt, traj_greedy.shape[0])

[--------------------------------------------------------------------------------] 100%

Build trajectories.

In [31]:
traj_greedy = traj_greedy[traj_greedy['poiID'] != -1]

In [32]:
tid = 0
cnt = 0
for user in users:
    udf = traj_greedy[traj_greedy['userID'] == user].copy()
    udf.sort_values(by='dateTaken', ascending=True, inplace=True)
    if udf.shape[0] == 0: 
        cnt += 1; print_progress(cnt, len(users))
        continue
    
    traj_greedy.loc[udf.index[0], 'trajID'] = tid
    for i in range(1, udf.shape[0]):
        ix1 = udf.index[i-1]
        ix2 = udf.index[i]
        if udf.loc[ix2, 'dateTaken'] - udf.loc[ix1, 'dateTaken'] > TIME_GAP:
            tid += 1
            traj_greedy.loc[ix2, 'trajID'] = tid
        else:
            traj_greedy.loc[ix2, 'trajID'] = tid
    tid += 1  # for trajectories of the next user
    cnt += 1; print_progress(cnt, len(users))

[--------------------------------------------------------------------------------] 100%

<a id='sec3.2'></a>

### 3.2 Map Photos to POIs: Approach II - Dynamic Programming

Given a sequence of photos and a set of POIs, map the sequence of photos to the set of POI such that the total cost is minimised, i.e.,

\begin{equation}
\text{minimize} \sum_i \text{distance}(\text{photo}_i, \text{POI}_i) + \alpha \sum_i \text{distance}(\text{POI}_i, \text{POI}_{i+1})
\end{equation}
where $\text{photo}_i$ is mapped to $\text{POI}_i$, $\alpha$ is a trade-off parameter.

In [33]:
def decode_photo_seq(photo_seq, poi_distmat, photo_poi_distmat, DIST_MAX, ALPHA=1):
    """
    Map a sequence of photos to a set of POI such that the total cost, i.e.
    cost = sum(distance(photo_i, POI_i)) + sum(distance(POI_i, POI_{i+1})) 
    is minimised.
    Implemented using DP.
    """
    assert(len(photo_seq) > 0)
    assert(DIST_MAX > 0)
    
    if len(photo_seq) == 1:  # only one POI in this sequence
        ix = photo_seq[0]
        assert(ix in photo_poi_distmat.index)
        return [photo_poi_distmat.loc[ix].idxmin()]
    
    # set of POIs that are close to any photo in the input sequence of photos
    poi_t = []
    for jx in photo_seq:
        poi_t = poi_t + poi_distmat.index[~(photo_poi_distmat.loc[jx] > DIST_MAX)].tolist()
    columns = sorted(set(poi_t))
    
    # cost_df.iloc[i, j] stores the minimum cost of photo sequence [..., 'photo_i'] among all 
    # possible POI sequences end with 'POI_j', 'photo_i' was mapped to 'POI_j'
    cost_df = pd.DataFrame(data=np.zeros((len(photo_seq), len(columns)), dtype=np.float), \
                           index=photo_seq, columns=columns)
    
    # trace_df.iloc[i, j] stores the (previous) 'POI_k' such that the cost of POI sequence 
    # [... --> 'POI_k' (prev POI) --> 'POI_j' (current POI)] is cost_df.iloc[i, j]
    trace_df = pd.DataFrame(data=np.zeros((len(photo_seq), len(columns)), dtype=np.int), \
                            index=photo_seq, columns=columns)
    # NO predecessor for the start POI
    trace_df.iloc[0] = -1
    
    # costs for the first row are just the distances (or np.inf) from the first photo to all POIs
    for kx in cost_df.columns:
        ix = photo_seq[0]
        dist = photo_poi_distmat.loc[ix, kx]
        cost_df.loc[ix, kx] = np.inf if dist > DIST_MAX else dist
    
    # compute minimum costs recursively
    for i in range(1, len(photo_seq)):
        ix = cost_df.index[i]
        prev = cost_df.index[i-1]
        for jx in cost_df.columns:
            # distance(photo_i, POI_j) + alpha * distance(POI_k, POI_j) + previous cost
            # if distance(photo_i, POI_j) <= DIST_MAX else np.inf
            costs = [np.inf if photo_poi_distmat.loc[ix, jx] > DIST_MAX else \
                     photo_poi_distmat.loc[ix, jx] + ALPHA * poi_distmat.loc[kx, jx] + cost_df.loc[prev, kx] \
                     for kx in cost_df.columns]
            min_idx = np.argmin(costs)
            cost_df.loc[ix, jx] = costs[min_idx]
            trace_df.loc[ix, jx] = cost_df.columns[min_idx]
    
    # trace back
    pN = cost_df.loc[cost_df.index[-1]].idxmin()  # the end POI
    seq_reverse = [pN]  # sequence of POI in reverse order
    row_idx = trace_df.shape[0] - 1  # trace back from the last row of trace_df
    while (row_idx > 0):  # the first row are all -1
        ix = trace_df.index[row_idx]
        jx = seq_reverse[-1]
        poi = trace_df.loc[ix, jx]
        seq_reverse.append(poi)
        row_idx -= 1
        
    return seq_reverse[::-1]  # reverse the sequence

Build travel sequences and map photos to POIs.

In [34]:
traj_dp = photo_df.copy()

In [35]:
ALPHA = 1

In [36]:
tid = 0
cnt = 0
for user in users:
    udf = traj_dp[traj_dp['userID'] == user].copy()
    udf.sort_values(by='dateTaken', ascending=True, inplace=True)
    
    # filtering out photos that are far from all POIs
    for ix in udf.index:
        if photo_poi_distmat.loc[ix].min() > DIST_MAX:
            udf.drop(ix, axis=0, inplace=True)
            
    if udf.shape[0] == 0: 
        cnt += 1; print_progress(cnt, len(users))
        continue
    
    photo_seq = [udf.index[0]]
    for i in range(1, udf.shape[0]):
        ix1 = photo_seq[-1]
        ix2 = udf.index[i]
        if udf.loc[ix2, 'dateTaken'] - udf.loc[ix1, 'dateTaken'] > TIME_GAP:
            assert(len(photo_seq) > 0)
            poi_seq = decode_photo_seq(photo_seq, poi_distmat, photo_poi_distmat, DIST_MAX, ALPHA)
            assert(len(poi_seq) == len(photo_seq))
            for j in range(len(poi_seq)):
                jx = photo_seq[j]
                poi = poi_seq[j]
                traj_dp.loc[jx, 'poiID'] = poi
                traj_dp.loc[jx, 'trajID'] = tid
            tid += 1
            photo_seq.clear()
            photo_seq.append(ix2)
        else:
            photo_seq.append(ix2)
            
    assert(len(photo_seq) > 0)
    poi_seq = decode_photo_seq(photo_seq, poi_distmat, photo_poi_distmat, DIST_MAX, ALPHA)
    assert(len(poi_seq) == len(photo_seq))
    for j in range(len(poi_seq)):
        jx = photo_seq[j]
        poi = poi_seq[j]
        traj_dp.loc[jx, 'poiID'] = poi
        traj_dp.loc[jx, 'trajID'] = tid
    tid += 1

    cnt += 1; print_progress(cnt, len(users))

[--------------------------------------------------------------------------------] 100%

Compute the total cost of trajectories.

In [37]:
def calc_cost(traj_df, poi_distmat, photo_poi_distmat):
    cost = 0
    traj_df = traj_df[traj_df['poiID'] != -1]
    for tid in sorted(traj_df['trajID'].unique().tolist()):
        tdf = traj_df[traj_df['trajID'] == tid].copy()
        tdf.sort_values(by='dateTaken', ascending=True, inplace=True)
        cost += np.trace(photo_poi_distmat.loc[tdf.index, tdf['poiID']])
        cost += np.trace(poi_distmat.loc[tdf['poiID'][:-1].values, tdf['poiID'][1:].values])
    return cost

In [38]:
calc_cost(traj_greedy, poi_distmat, photo_poi_distmat)

3936.8923548191797

In [39]:
calc_cost(traj_dp, poi_distmat, photo_poi_distmat)

3824.619840492885

In [60]:
traj_greedy['trajID'].max()

5105

In [61]:
traj_dp['trajID'].max()

5105

Compare trajectories interactively.

In [None]:
for tid in sorted(traj_dp['trajID'].unique().tolist()):
    if tid == -1: continue
    tdf1 = traj_dp[traj_dp['trajID'] == tid].copy()
    tdf1.sort_values(by='dateTaken', ascending=True, inplace=True)
    tdf2 = traj_greedy[traj_greedy['trajID'] == tid].copy()
    tdf2.sort_values(by='dateTaken', ascending=True, inplace=True)
    print(tdf1)
    print(tdf2)
    input('Press any key to continue...')

<a id='sec4'></a>

## 4. Save Trajectory Data

Save trajectories and related POIs to files.

In [66]:
visits = traj_greedy[traj_greedy['poiID'] != -1]
#visits = traj_dp[traj_dp['poiID'] != -1]

Save POIs to CSV file.

In [67]:
poiix = sorted(visits['poiID'].unique().tolist())

In [68]:
poi_df.loc[poiix].to_csv(fpoi_new, index=True)

<a id='sec4.1'></a>

### 4.1 Compute Trajectory Statistics

Compute trajectories information including simple statistics such as length (#POIs), POI start time, POI endtime, etc.

In [69]:
def calc_traj_df(tid, visits):
    """Compute trajectories info, taking care of trajectories that contain sub-tours"""
    traj_df = visits[visits['trajID'] == tid].copy()
    traj_df.sort_values(by='dateTaken', ascending=True, inplace=True)
    df_ = pd.DataFrame(columns=['poiID', 'startTime', 'endTime', '#photo'])
    assert(traj_df.shape[0] > 0)
    ix = traj_df.index[0]
    j = 0
    df_.loc[j] = [traj_df.loc[ix, 'poiID'], traj_df.loc[ix, 'dateTaken'], traj_df.loc[ix, 'dateTaken'], 1]
    for i in range(1, traj_df.shape[0]):
        ix = traj_df.index[i]
        if traj_df.loc[ix, 'poiID'] == df_.loc[j, 'poiID']:
            df_.loc[j, 'endTime'] = traj_df.loc[ix, 'dateTaken']
            df_.loc[j, '#photo'] += 1
        else:
            j += 1
            df_.loc[j] = [traj_df.loc[ix, 'poiID'], traj_df.loc[ix, 'dateTaken'], traj_df.loc[ix, 'dateTaken'], 1]
    df_['userID'] = traj_df.loc[traj_df.index[0], 'userID']
    df_['trajID'] = traj_df.loc[traj_df.index[0], 'trajID']
    df_['trajLen'] = df_.shape[0]
    return df_

In [70]:
traj_all = pd.DataFrame(columns=['userID', 'trajID', 'poiID', 'startTime', 'endTime', '#photo', 'trajLen'])
for tid in sorted(visits['trajID'].unique().tolist()):
    traj_df = calc_traj_df(tid, visits)
    traj_all = traj_all.append(traj_df, ignore_index=True)
traj_all.head()

Unnamed: 0,#photo,endTime,poiID,startTime,trajID,trajLen,userID
0,1,1226726126,25,1226726126,0,1,10058801@N06
1,2,1205332541,58,1205332532,1,2,10087938@N02
2,2,1205342729,66,1205342722,1,2,10087938@N02
3,1,1205374109,59,1205374109,2,1,10087938@N02
4,1,1205417265,58,1205417265,3,1,10087938@N02


In [71]:
traj_all.dtypes

#photo       float64
endTime      float64
poiID        float64
startTime    float64
trajID       float64
trajLen      float64
userID        object
dtype: object

In [72]:
int_cols = ['trajID', 'poiID', 'trajLen', 'startTime', 'endTime', '#photo']
traj_all[int_cols] = traj_all[int_cols].astype(np.int, copy=False)

Sanity check.

In [73]:
print(np.all(traj_all['trajLen'] >= 1))
print(np.all(traj_all['#photo'] >= 1))
print(np.all(traj_all['startTime'] <= traj_all['endTime']))

True
True
True


In [74]:
traj_all['poiDuration'] = traj_all['endTime'] - traj_all['startTime']
print(traj_all.shape)
traj_all.head()

(7671, 8)


Unnamed: 0,#photo,endTime,poiID,startTime,trajID,trajLen,userID,poiDuration
0,1,1226726126,25,1226726126,0,1,10058801@N06,0
1,2,1205332541,58,1205332532,1,2,10087938@N02,9
2,2,1205342729,66,1205342722,1,2,10087938@N02,7
3,1,1205374109,59,1205374109,2,1,10087938@N02,0
4,1,1205417265,58,1205417265,3,1,10087938@N02,0


In [75]:
traj_all.dtypes

#photo          int64
endTime         int64
poiID           int64
startTime       int64
trajID          int64
trajLen         int64
userID         object
poiDuration     int64
dtype: object

Save trajectories and the associated stats to CSV files.

In [76]:
traj_all.to_csv(ftraj_all, index=False)

<a id='sec4.2'></a>

### 4.2 Filtering out Short Trajectories

Filtering out short trajectories, i.e., trajectories with only 1 or 2 POIs.

In [77]:
traj_noshort = traj_all[traj_all['trajLen'] >= 3].copy()
print(traj_noshort.shape)
traj_noshort.head()

(2559, 8)


Unnamed: 0,#photo,endTime,poiID,startTime,trajID,trajLen,userID,poiDuration
5,2,1205511700,58,1205508764,4,3,10087938@N02,2936
6,22,1205514882,59,1205512653,4,3,10087938@N02,2229
7,1,1205538412,58,1205538412,4,3,10087938@N02,0
19,1,1188620918,81,1188620918,14,6,10195518@N02,0
20,1,1188621562,75,1188621562,14,6,10195518@N02,0


In [78]:
traj_noshort['#photo'].sum()

7374

Save trajectories and the associated stats without short trajectories to CSV files.

In [79]:
traj_noshort.to_csv(ftraj_noshort, index=False)

<a id='sec4.3'></a>

### 4.3 Filtering out Users with Few Trajectories

Filtering out users (and related trajectories) with few trajectories, e.g. less than 5 trajectories.

In [80]:
MIN_N = 5

In [81]:
user_list = []

In [82]:
for user in sorted(traj_all['userID'].unique().tolist()):
    ntraj = traj_all[traj_all['userID'] == user]['trajID'].unique().shape[0]
    if ntraj >= MIN_N:
        user_list.append(user)

In [83]:
traj_nofew = traj_all[traj_all['userID'].isin(user_list)].copy()
print(traj_nofew.shape)
traj_nofew.head()

(5223, 8)


Unnamed: 0,#photo,endTime,poiID,startTime,trajID,trajLen,userID,poiDuration
1,2,1205332541,58,1205332532,1,2,10087938@N02,9
2,2,1205342729,66,1205342722,1,2,10087938@N02,7
3,1,1205374109,59,1205374109,2,1,10087938@N02,0
4,1,1205417265,58,1205417265,3,1,10087938@N02,0
5,2,1205511700,58,1205508764,4,3,10087938@N02,2936


Save trajectories.

In [84]:
traj_nofew.to_csv(ftraj_nofew, index=False)

Sanity check.

In [85]:
for user in sorted(traj_nofew['userID'].unique().tolist()):
    udf = traj_nofew[traj_nofew['userID'] == user]
    assert(udf['trajID'].unique().shape[0] >= MIN_N)
print('Checking finished.')

Checking finished.
