# Trajectory Construction 

This note describes how we construct trajectories from the photos and videos in the [YFCC100M dataset](http://arxiv.org/abs/1502.03409).

## Construction Process
* [1. Extract relevant points from YFCC100M dataset](#1.-Extract-relevant-points-from-YFCC100M-dataset)
    * [1.1. Basic stats of initial dataset](#1.1.-Basic-stats-of-initial-dataset)
    * [1.2. Scatter plot of extracted points](#1.2.-Scatter-plot-of-extracted-points)
* [2. Extract initial trajectories from extracted points](#2.-Extract-initial-trajectories-from-extracted-points)
    * [2.1. Basic stats after extracting initial trajectories](#2.1.-Basic-stats-after-extracting-initial-trajectories)
* [3. Filter Trajectory](#3.-Filter-Trajectory)
    * [3.1. Filter by Timeframe](#3.1.-Filter-by-Timeframe)
        * [3.1.1. Bar chart of taken photos by year](#3.1.1.-Bar-chart-of-taken-photos-by-year)
    * [3.2. Filter by Duration](#3.2.-Filter-by-Duration)
        * [3.2.1. Histogram of trajectory duration](#3.2.1.-Histogram-of-trajectory-duration)
    * [3.3. Filter by Minimum distance](#3.3.-Filter-by-Minimum-distance)
        * [3.3.1. Histogram of trajectory length](#3.3.1.-Histogram-of-trajectory-length)
    * [3.4. Filter by Speed](#3.4.-Filter-by-Speed)
        * [3.4.1. Drop trajectory by average speed](#3.4.1.-Drop-trajectory-by-average-speed)
            * [3.4.1.1. Histogram of trajectory speed](#3.4.1.1.-Histogram-of-trajectory-speed)
        * [3.4.2. Drop trajectory by point-to-point speed](#3.4.2.-Drop-trajectory-by-point-to-point-speed)
            * [3.4.2.1. Histogram of point-to-point speed](#3.4.2.1.-Histogram-of-point-to-point-speed)
        * [3.4.3. Drop trajectory by sophisticated method](#3.4.3-Drop-trajectory-by-sophisticated-method)
* [4. Filtered Trajectory](#4.-Filtered-Trajectory)
    * [4.1. Basic Stats](#4.1.-Basic-Stats)

## 1. Extract relevant points from YFCC100M dataset

From the original YFCC100M dataset, we first extract the photos and videos taken within the region below.

![big-box](./img/bigbox.png)

The `filtering_bigbox.py` script takes the original YFCC100M file, extracts the photos and videos from the region above, and generates a CSV file containing:
* Photo/video ID
* NSID (user ID)
* Date
* Longitude
* Latitude
* Accuracy (GPS accuracy)
* Photo/video URL
* Photo/video identifier (0 = photo, 1 = video)

Usage:
> `python filtering_bigbox.py YFCC100M_DATA_FILE`

This generates the file `YFCC100M_DATA_FILE.out`.
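The extraction step amounts to a bounding-box test on each record's longitude and latitude. The following is a minimal sketch of that test, not the actual `filtering_bigbox.py`: the function names `in_bbox` and `filter_points`, the record layout, and the box coordinates are all assumptions made for illustration (the real big-box coordinates appear only in the figure above).

```python
def in_bbox(lng, lat, lng_min, lat_min, lng_max, lat_max):
    """Return True if (lng, lat) lies inside the box, borders included."""
    return lng_min <= lng <= lng_max and lat_min <= lat <= lat_max


def filter_points(records, bbox):
    """Keep the records whose coordinates fall inside bbox.

    records: iterable of dicts with 'lng' and 'lat' keys (hypothetical layout).
    bbox: (lng_min, lat_min, lng_max, lat_max).
    """
    kept = []
    for rec in records:
        try:
            lng, lat = float(rec["lng"]), float(rec["lat"])
        except (KeyError, ValueError, TypeError):
            continue  # skip records without usable coordinates
        if in_bbox(lng, lat, *bbox):
            kept.append(rec)
    return kept


# Hypothetical box and records, for illustration only.
example_bbox = (144.0, -39.0, 146.0, -37.0)
example_records = [
    {"id": "a", "lng": "144.96", "lat": "-37.81"},  # inside the box
    {"id": "b", "lng": "151.21", "lat": "-33.87"},  # outside the box
    {"id": "c", "lng": "", "lat": ""},              # no coordinates
]
inside = filter_points(example_records, example_bbox)
```

Treating the borders as inclusive is a design choice of this sketch; the original script may handle points on the boundary differently.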

### 1.1. Basic stats of initial dataset


### 1.2. Scatter plot of extracted points

## 2. Extract initial trajectories from extracted points

From the photos and videos extracted by `filtering_bigbox.py`, we construct initial trajectories using the following steps:

1. Group photos by user.
2. Sort each user's photos by timestamp.
3. Split the sorted sequence wherever two adjacent photos are taken more than `time_gap` apart.
4. Keep trajectories in which **at least one photo** was taken within the region below:

![small-box](./img/smallbox.png)
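The four steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `generate_tables`: the function name `build_trajectories` and the `(user_id, timestamp, lng, lat)` tuple layout are hypothetical, while the `time_gap`, `minimum_photo`, and small-box values mirror the argument list in this section.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_trajectories(points, time_gap_hours, bbox, minimum_photo=1):
    """Group by user, sort by time, split on large gaps, keep trajectories touching bbox.

    points: iterable of (user_id, timestamp, lng, lat) tuples (hypothetical layout).
    bbox: (lng_min, lat_min, lng_max, lat_max) of the small box.
    """
    lng_min, lat_min, lng_max, lat_max = bbox
    gap = timedelta(hours=time_gap_hours)

    # Step 1: group photos by user.
    by_user = defaultdict(list)
    for p in points:
        by_user[p[0]].append(p)

    trajectories = []
    for user_points in by_user.values():
        # Step 2: sort each user's photos by timestamp.
        user_points.sort(key=lambda p: p[1])
        # Step 3: start a new segment whenever the gap exceeds time_gap.
        current = [user_points[0]]
        segments = [current]
        for prev, cur in zip(user_points, user_points[1:]):
            if cur[1] - prev[1] > gap:
                current = [cur]
                segments.append(current)
            else:
                current.append(cur)
        # Step 4: keep segments with at least one photo inside the small box.
        for seg in segments:
            touches = any(
                lng_min <= p[2] <= lng_max and lat_min <= p[3] <= lat_max
                for p in seg
            )
            if touches and len(seg) >= minimum_photo:
                trajectories.append(seg)
    return trajectories


small_box = (144.597363, -38.072257, 145.360413, -37.591764)
sample = [
    ("u1", datetime(2013, 5, 1, 9, 0), 144.96, -37.81),
    ("u1", datetime(2013, 5, 1, 10, 0), 144.97, -37.82),
    ("u1", datetime(2013, 5, 2, 9, 0), 144.00, -38.50),  # new segment, outside the box
    ("u2", datetime(2013, 5, 1, 9, 0), 143.00, -38.50),  # outside the box
]
trajs = build_trajectories(sample, time_gap_hours=8, bbox=small_box)
```

In this toy input, only the first two photos of `u1` form a kept trajectory: the third photo starts a new segment (23 hours later) that never enters the small box, and `u2` has no photo inside it.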

Here's the argument list:

In [1]:

    extracted_points_file = '../data/Melb-bigbox.csv' # outputfile path of extracted points
    time_gap = 8  # hour
    minimum_photo = 1 # minimum number of photos for each trajectory

    # small bounding box
    lng_min = 144.597363
    lat_min = -38.072257
    lng_max = 145.360413
    lat_max = -37.591764

In [2]:

    %run generate_tables $extracted_points_file $lng_min $lat_min $lng_max $lat_max $minimum_photo $time_gap

### 2.1. Basic stats after extracting initial trajectories

## 3. Filter Trajectory

After obtaining the initial list of trajectories, we filter out improbable ones using four criteria:

1. [Timeframe](#3.1.-Filter-by-Timeframe): We keep only photos/videos taken during a given period (`start_date`, `end_date`).
2. [Duration](#3.2.-Filter-by-Duration): Some suspicious trajectories span more than 16 hours. We remove trajectories that last longer than a given number of minutes (`maximum_duration`).
3. [Minimum distance](#3.3.-Filter-by-Minimum-distance): A trajectory whose photos were all taken at a single location is not meaningful as a trajectory. We remove trajectories shorter than `minimum_distance`.
4. [Speed](#3.4.-Filter-by-Speed): Because of GPS errors, some trajectories show the user moving at an implausibly high speed. We remove these trajectories, while trying to recover as much information as possible from them.

Here's the list of arguments we used to generate the final trajectories:

In [3]:

    start_date = '2000-01-01'
    end_date = '2015-99-99'
    minimum_distance = 1e-3   # km
    speed_filter = 0    # 0: filter by average speed, 1: filter by point-to-point speed, 2: filter by sophisticated method
    minimum_speed = 200    # km/h
    maximum_duration = 1000    # minute

### 3.1. Filter by Timeframe

First, we keep only trajectories taken within a given period, which removes photos with future timestamps (caused by errors) as well as photos that are too old.

#### 3.1.1. Bar chart of taken photos by year

Here's the bar chart of the number of photos taken each year.
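As a rough illustration of the timeframe filter and of the per-year counts behind such a chart, here is a minimal sketch. It assumes dates are ISO-formatted strings compared lexicographically, which would explain a placeholder bound such as `end_date = '2015-99-99'`; that reading, the helper names, and the rule that every photo of a trajectory must fall in range are assumptions, not confirmed behaviour of the original scripts.

```python
from collections import Counter


def photos_per_year(dates):
    """Count photos by the 4-character year prefix of ISO date strings."""
    return Counter(d[:4] for d in dates)


def within_timeframe(traj_dates, start_date, end_date):
    """True if every photo date lies in [start_date, end_date].

    Dates are compared as 'YYYY-MM-DD' strings, so an upper bound such as
    '2015-99-99' simply sorts after every real date in 2015.
    """
    return all(start_date <= d[:10] <= end_date for d in traj_dates)


# Hypothetical timestamps, for illustration only.
dates = [
    "2013-05-01 10:00:00",
    "2013-06-01 11:30:00",
    "2034-01-01 00:00:00",  # future timestamp, most likely a camera clock error
]
counts = photos_per_year(dates)
ok = within_timeframe(dates[:2], "2000-01-01", "2015-99-99")
bad = within_timeframe(dates, "2000-01-01", "2015-99-99")
```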

### 3.2. Filter by Duration

Second, we filter out trajectories with a suspiciously long duration.
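A minimal sketch of the duration check, assuming that a trajectory's duration is the time between its earliest and latest photo and that the limit is the `maximum_duration` value (in minutes) from the argument list. The helper names are hypothetical.

```python
from datetime import datetime


def duration_minutes(timestamps):
    """Minutes between the earliest and latest timestamp of a trajectory."""
    return (max(timestamps) - min(timestamps)).total_seconds() / 60.0


def keep_by_duration(timestamps, maximum_duration):
    """True if the trajectory lasts no longer than maximum_duration minutes."""
    return duration_minutes(timestamps) <= maximum_duration


maximum_duration = 1000  # minutes, as in the argument list

day_trip = [datetime(2013, 5, 1, 9, 0), datetime(2013, 5, 1, 17, 30)]
# Every gap here is 7 hours (below time_gap = 8), yet the whole spans 21 hours.
overnight = [
    datetime(2013, 5, 1, 9, 0),
    datetime(2013, 5, 1, 16, 0),
    datetime(2013, 5, 1, 23, 0),
    datetime(2013, 5, 2, 6, 0),
]
```

The second example shows why this filter is needed on top of `time_gap`: a chain of photos each less than 8 hours apart is never split, so it can still add up to an implausibly long trajectory.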

#### 3.2.1. Histogram of trajectory duration

### 3.3. Filter by Minimum distance

Third, we filter out trajectories taken from a single location.
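The note does not say how trajectory length is measured. One common choice, assumed here, is the sum of great-circle (haversine) distances between consecutive photos, compared against `minimum_distance` in kilometres. The helper names below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two (lng, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def trajectory_length_km(coords):
    """Sum of distances between consecutive (lng, lat) points."""
    return sum(haversine_km(*a, *b) for a, b in zip(coords, coords[1:]))


minimum_distance = 1e-3  # km, as in the argument list

same_spot = [(144.96, -37.81)] * 3               # all photos at one location
stroll = [(144.96, -37.81), (144.97, -37.81)]    # a bit under 1 km apart

keep_same_spot = trajectory_length_km(same_spot) >= minimum_distance
keep_stroll = trajectory_length_km(stroll) >= minimum_distance
```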

#### 3.3.1. Histogram of trajectory length

### 3.4. Filter by Speed

Some trajectories have a suspiciously high speed. There are three (or more) alternative ways to filter them out.
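The difference between the first two options can be illustrated with a short sketch. It assumes haversine distances, a time-sorted list of `(timestamp, lng, lat)` tuples, and the 200 km/h threshold from the argument list (named `minimum_speed` there); the helper names are hypothetical and the original implementation may differ.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two (lng, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lng2 - lng1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def _speed(dist_km, hours):
    """Speed in km/h; zero elapsed time gives 0 if nothing moved, else infinity."""
    if hours > 0:
        return dist_km / hours
    return 0.0 if dist_km == 0 else math.inf


def leg_speeds_kmh(traj):
    """Point-to-point speeds between consecutive photos."""
    return [
        _speed(haversine_km(x1, y1, x2, y2), (t2 - t1).total_seconds() / 3600.0)
        for (t1, x1, y1), (t2, x2, y2) in zip(traj, traj[1:])
    ]


def average_speed_kmh(traj):
    """Total path length divided by total elapsed time."""
    total_km = sum(haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(traj, traj[1:]))
    return _speed(total_km, (traj[-1][0] - traj[0][0]).total_seconds() / 3600.0)


speed_limit = 200  # km/h

# A GPS glitch: one photo jumps about 111 km north for a minute, then returns.
glitchy = [
    (datetime(2013, 5, 1, 9, 0), 145.0, -38.0),
    (datetime(2013, 5, 1, 9, 1), 145.0, -37.0),
    (datetime(2013, 5, 1, 9, 2), 145.0, -38.0),
    (datetime(2013, 5, 1, 17, 0), 145.0, -38.0),
]
passes_average = average_speed_kmh(glitchy) <= speed_limit
passes_point_to_point = max(leg_speeds_kmh(glitchy)) <= speed_limit
```

In this example the average speed over eight hours stays well below the threshold, while the two glitch legs exceed it by a wide margin, so only the point-to-point criterion flags the trajectory.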

#### 3.4.1. Drop trajectory by average speed

##### 3.4.1.1. Histogram of trajectory speed

#### 3.4.2. Drop trajectory by point-to-point speed

##### 3.4.2.1. Histogram of point-to-point speed

#### 3.4.3 Drop trajectory by sophisticated method


## 4. Filtered Trajectory

### 4.1. Basic Stats

A more detailed analysis will be included in `filckr_analysis.ipynb` and the slides. Here we show simple stats from the final result.