# Old Shenzhen Data
We were provided several days' worth of data for Shenzhen.  At first glance this data seems to be very clean.  This notebook goes through some of the initial processing.

## Observations
The following sections capture some of the observations on the data.

### Cities
There are two cities represented in the data.
1. Shenzhen
    1. Contains the majority of the taxis
    1. All taxis whose plate begins with the letter *'B'*
1. Dongguan
    1. Contains a smaller subset of the total taxis but still a significant amount
    1. All taxis whose plate begins with the letter *'S'*

For more information, refer to the Wikipedia article: https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_China#Guangdong

### Failed Trip Status
Some of the taxis seem to drive around all day and never pick up a passenger.  Interestingly, all of the taxis that do this share the same leading plate characters.  For example, all taxis with plates beginning *'B51K'*, *'B51V'*, or *'B56V'* exhibit this behavior.  Some of the taxis with those plate prefixes also don't seem to create any trips at all.  The list of taxis observed with this bad passenger flag is provided below.  Where a taxi is listed as having no trips, the taxi was found in the data but all of its samples were removed by the implausible or other filters that run AFTER the data file is separated into taxis.  A rough estimate of how many usable taxis there are therefore comes from counting the taxis that have 2 or more trips.

### Final Stats
In this notebook there are several cells that provide additional statistics on the taxi data.

## Setup

In [7]:
# These packages are here solely to support the use of the IPython Notebook.
%matplotlib inline
# %pylab inline
import numpy as np
import pandas as pd
import os
from IPython.display import HTML, display  # Allows rendering data as HTML, for example DataFrame tables.
import matplotlib.pyplot as plt
from datetime import datetime

# plt styles include:
# 'bmh', 'classic', 'seaborn-dark', 'seaborn-muted', 'seaborn-talk', 'fivethirtyeight', 'seaborn-whitegrid',
# 'seaborn-white', 'seaborn-darkgrid', 'ggplot', 'seaborn-notebook', 'seaborn-pastel', 'seaborn-deep',
# 'seaborn-poster', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'dark_background', 'seaborn-ticks',
# 'seaborn-dark-palette', 'seaborn-paper'
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)  # figsize() only exists under %pylab, which is disabled above.

print('Using pandas version', pd.__version__)

Using pandas version 0.18.1


In [8]:
data_dir = '/home/dingbat/data/taxi/shenzhen/2012-Shenzhen'

## Helper Functions
These are functions used outside the data processing to provide some additional insight into the processing.

In [9]:
def sample_df(df, rows=5):
    """ Returns N rows as a sample of the passed in dataframe. """
    return df.sample(rows).sort_index()

In [10]:
def human_size(num, suffix=''):
    """ Given a number in bytes, format it to the nearest size increment.  e.g. 1024 is 1.0K """
    for unit in ['','K','M','G','T','P','E','Z']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Y', suffix)  # 'Y' matches the single-letter units above.

## Reading in the Data

With the UNIX time format, it is much faster to read in the data directly and then perform conversions on the time column.  You'll note in the cells below that we do this incrementally over a few cells.
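
As a minimal sketch of that two-step approach, the example below reads epoch seconds as plain integers and converts the whole column afterwards.  The three-row sample and its column layout are assumptions made for illustration; they are not taken from the real data file.

```python
import io
import pandas as pd

# Hypothetical sample: plate, UNIX epoch seconds, passenger flag.
raw = io.StringIO(
    "B000H6,1340726408,0\n"
    "B000H6,1340726418,1\n"
    "SYC472,1340726400,0\n"
)

# Step 1: read the timestamp as a plain integer (no per-row date parsing).
sample = pd.read_csv(raw, names=['common_id', 'timestamp', 'passenger'])

# Step 2: convert the whole column at once.  unit='s' treats the values as
# epoch seconds in UTC, which are then shifted to local time.
sample['timestamp'] = (
    pd.to_datetime(sample['timestamp'], unit='s', utc=True)
    .dt.tz_convert('Asia/Shanghai')
)
print(sample)
```

Converting the column in one vectorized call avoids invoking a date parser for every row, which is where most of the time tends to go when reading large files.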

In [11]:
taxi_file = os.path.join(data_dir, '2012-06-27.good.sample')
header_names = ['common_id', 'timestamp', 'passenger', 'speed', 'heading', 'latitude', 'longitude']
usecols = [0,1,2,3,4,5,6]  # Omit road and road id columns

In [12]:
start_time = datetime.now()
df = pd.read_csv(
    taxi_file,
    index_col=['common_id', 'timestamp'],
    parse_dates=['timestamp'],
    names=header_names,
    usecols=usecols,
    converters={
        'common_id': lambda p: p.strip()[2:]  # Cleans the corrupted unicode from the front of the plate.
    },
    engine='python'
)

print(
    datetime.now() - start_time,
    'to read in',
    human_size(os.path.getsize(taxi_file))
)
sample_df(df)

0:00:06.095388 to read in 68.9M


common_id,timestamp,passenger,speed,heading,latitude,longitude
таB087U2,2012-06-27 00:10:52,0,58,0,22.66765,114.207367
таB27F36,2012-06-27 00:17:20,0,0,315,22.541416,114.117966
таB357R0,2012-06-27 00:24:56,0,25,135,22.529484,114.046799
таSKH441,2012-06-27 00:14:16,0,0,180,23.027666,113.808083
таSKP625,2012-06-27 00:17:57,0,74,315,23.017683,113.704102


In [13]:
start_time = datetime.now()
df = pd.read_csv(
    taxi_file,
    index_col=['common_id', 'timestamp'],
    parse_dates=['timestamp'],
    names=header_names,
    usecols=usecols,
    converters={
        'common_id': lambda p: p.strip()[2:]  # Cleans the corrupted unicode from the front of the plate.
    }
)

print(
    datetime.now() - start_time,
    'to read in',
    human_size(os.path.getsize(taxi_file))
)
sample_df(df)

0:00:01.803301 to read in 68.9M


common_id,timestamp,passenger,speed,heading,latitude,longitude
B22T72,2012-06-27 00:23:31,0,0,0,22.6143,114.038597
B22T99,2012-06-27 00:22:27,0,72,0,22.674299,113.980904
B7488B,2012-06-27 00:27:52,0,46,270,22.691233,114.1315
BL3H29,2012-06-27 00:03:17,0,0,270,22.525949,114.060349
SBS653,2012-06-27 00:20:38,0,11,180,23.045532,113.744118


In [14]:
# Convert the timestamp column to a DatetimeIndex and assign the correct timezone.
df.index = df.index.set_levels(
    df.index.levels[1].tz_localize('Asia/Shanghai')  # .tz_convert('UTC')
    , level=1
)

df.iloc[:10]  # Use iloc instead of sample to help illustrate sort (next cell)

common_id,timestamp,passenger,speed,heading,latitude,longitude
B40P00,2012-06-27 00:01:46+08:00,1,0,0,22.541918,114.110046
SCC661,2012-06-27 00:01:39+08:00,0,55,180,22.649248,113.824486
SKS991,2012-06-27 00:01:39+08:00,0,85,315,23.087866,113.673447
SBZ910,2012-06-27 00:01:39+08:00,0,0,0,22.858015,113.843796
SBS623,2012-06-27 00:01:39+08:00,0,0,270,22.98815,113.701981
SBR001,2012-06-27 00:01:39+08:00,0,0,270,23.034866,113.7612
SLP610,2012-06-27 00:01:39+08:00,1,0,0,22.90605,114.062347
SBZ205,2012-06-27 00:01:39+08:00,0,22,135,23.0182,114.092865
SBG776,2012-06-27 00:01:39+08:00,0,0,0,22.982033,113.998901
SKZ403,2012-06-27 00:01:39+08:00,0,0,0,23.040434,113.773163


In [15]:
start_time = datetime.now()
# Sorts based on the index so that all taxis are together and then all timestamps are chronological.
df = df.sort_index()
print(datetime.now() - start_time)

0:00:03.198977


In [16]:
# Time range of the data can be pulled from the timestamp index.
df.index.levels[1].min(), df.index.levels[1].max()

(Timestamp('2012-06-27 00:00:00+0800', tz='Asia/Shanghai'),
 Timestamp('2012-06-27 00:29:59+0800', tz='Asia/Shanghai'))

In [17]:
# Lists every plate; the length shown in the repr is the total number of taxis.
df.index.levels[0]

Index(['B000H6', 'B001B1', 'B001B2', 'B001B6', 'B001B7', 'B001H0', 'B002B1',
       'B002V7', 'B002Y1', 'B002Z6',
       ...
       'SYB470', 'SYB472', 'SYB540', 'SYB541', 'SYB542', 'SYB547', 'SYB747',
       'SYC437', 'SYC452', 'SYC472'],
      dtype='object', name='common_id', length=21071)

In [18]:
shenzhen_taxis = [x for x in df.index.levels[0].values.tolist() if x.startswith('B')]
dongguan_taxis = [x for x in df.index.levels[0].values.tolist() if x.startswith('S')]
print('Shenzhen Taxis = %d\nDongguan Taxis = %d' % (len(shenzhen_taxis), len(dongguan_taxis)))

Shenzhen Taxis = 14092
Dongguan Taxis = 6979


In [19]:
# Provides useful information such as:
#   the data types of each column,
#   number of rows in the index,
#   memory use.
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1000000 entries, (B000H6, 2012-06-27 00:00:08+08:00) to (SYC472, 2012-06-27 00:29:36+08:00)
Data columns (total 5 columns):
passenger    1000000 non-null int64
speed        1000000 non-null int64
heading      1000000 non-null int64
latitude     1000000 non-null float64
longitude    1000000 non-null float64
dtypes: float64(2), int64(3)
memory usage: 45.8+ MB


In [20]:
# Provides information for each column such as:
#   the number of samples
#   Statistics such as mean, std dev, min, max
df.describe()

statistic,passenger,speed,heading,latitude,longitude
count,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0
mean,0.237655,20.946756,115.44615,22.743649,113.944251
std,0.425647,25.715687,110.814725,0.317429,0.458263
min,0.0,0.0,0.0,1.134467,11.67055
25%,0.0,0.0,0.0,22.559433,113.808746
50%,0.0,7.0,90.0,22.6632,113.972847
75%,0.0,39.0,225.0,22.964333,114.084084
max,1.0,136.0,315.0,27.113766,174.067581


## Cleaning up the Data

There are several things that can be done at the global level to help clean up the data.  This includes things like removing duplicate rows and checking for bad geo-position data.

The Shenzhen data is fairly clean in that regard but still contains duplicates.  We classify duplicates as either exact duplicates or index duplicates.  Index duplicates are rows that share the same taxi and timestamp; they cannot be kept because a taxi can only be in one state at a given time.  Exact duplicates are index duplicates whose remaining data is also identical, so removing them is generally benign.  Index duplicates that carry different data are problematic, because it is not obvious which row to keep.
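
The distinction can be illustrated on a toy frame.  This is only a sketch: the plates, timestamps, and values below are invented for the example.

```python
import pandas as pd

# Toy data: rows 0 and 1 are exact duplicates, rows 2 and 3 share an index
# entry but carry different data.
idx = pd.MultiIndex.from_tuples(
    [('B00001', 1), ('B00001', 1), ('B00002', 5), ('B00002', 5)],
    names=['common_id', 'timestamp'],
)
toy = pd.DataFrame({'speed': [10, 10, 0, 27], 'heading': [90, 90, 225, 0]}, index=idx)

index_dups = toy.index.duplicated()                 # same taxi and timestamp as an earlier row
exact_dups = toy.duplicated().values & index_dups   # ...and identical column values as well
conflicting = index_dups & ~exact_dups              # same index, different data

print(exact_dups.tolist())    # [False, True, False, False]
print(conflicting.tolist())   # [False, False, False, True]
```

Only the second and later occurrences are flagged, so dropping the flagged rows always leaves one copy behind.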

### Duplicates

In [16]:
def safe_duplicate_filter(df):
    # Note that pandas 0.17.0 allows use of a keep keyword argument to define
    # which duplicates are kept (first, last, none).
    dups_data = df.duplicated()
    dups_index = df.index.duplicated()
    dups = dups_data & dups_index
    return dups

def index_duplicates_exist(df):
    # Returns True if any rows share an index entry, i.e. the same taxi at the same time.
    return df.index.duplicated().any()

In [17]:
start_time = datetime.now()
dups_safe = safe_duplicate_filter(df)
# Nothing is removed yet; this only builds the mask of exact duplicates.
print('Found {} exact duplicates in {}.'.format(
    dups_safe.sum(),
    datetime.now() - start_time
))
# dups_safe
df[dups_safe][:10]

Found 20780 exact duplicates in 0:00:00.461839.


common_id,timestamp,passenger,speed,heading,latitude,longitude
B014G5,2012-06-27 00:01:22+08:00,0,0,135,22.661751,114.026871
B014G5,2012-06-27 00:06:02+08:00,0,0,135,22.661751,114.026871
B014G5,2012-06-27 00:06:42+08:00,0,26,225,22.660749,114.025948
B014G5,2012-06-27 00:07:22+08:00,0,12,45,22.659817,114.024765
B014G5,2012-06-27 00:08:02+08:00,0,27,45,22.6609,114.02655
B014G5,2012-06-27 00:08:42+08:00,0,26,90,22.660933,114.028435
B014G5,2012-06-27 00:09:22+08:00,0,22,225,22.658934,114.02993
B014G5,2012-06-27 00:10:02+08:00,0,0,270,22.658533,114.028618
B014G5,2012-06-27 00:10:42+08:00,0,17,90,22.658266,114.028503
B014G5,2012-06-27 00:11:22+08:00,0,30,225,22.656384,114.026604


In [18]:
# Drop duplicates that are exact copies of data.
start_samples = len(df)
start_time = datetime.now()
df = df[~dups_safe]
print('Removed {} samples in {}.'.format(
    start_samples - len(df),
    datetime.now() - start_time
))

Removed 20780 samples in 0:00:00.017571.


In [19]:
len(df)

979220

In [20]:
# Now we see what duplicates are left where the index is the same but the data differs.
# duplicated() flags only the second and later occurrences, so the first copy is not shown.
dups_index = df.index.duplicated()
df[dups_index][:10]

common_id,timestamp,passenger,speed,heading,latitude,longitude
B00D16,2012-06-27 00:08:29+08:00,0,27,0,22.5714,113.865402
B00D16,2012-06-27 00:10:29+08:00,0,16,0,22.575899,113.868698
B00D16,2012-06-27 00:12:30+08:00,0,53,0,22.573799,113.877701
B02F71,2012-06-27 00:01:48+08:00,0,0,225,22.54775,114.116631
B02S47,2012-06-27 00:01:47+08:00,1,0,315,22.567734,114.035431
B02T07,2012-06-27 00:19:17+08:00,0,78,180,22.659849,114.212013
B02T07,2012-06-27 00:20:17+08:00,0,50,225,22.656799,114.208466
B02T07,2012-06-27 00:22:17+08:00,0,24,225,22.644716,114.194435
B02T07,2012-06-27 00:26:07+08:00,0,65,225,22.63695,114.174553
B02T07,2012-06-27 00:26:27+08:00,0,50,225,22.636583,114.171753


In [21]:
# Print out the first n rows that have the same index.
n = 20
dup_list = []
for i, (idx, data) in enumerate(df[dups_index].iterrows()):
    dup_list.append(df.loc[idx])
    if i > n:
        break
pd.concat(dup_list)

common_id,timestamp,passenger,speed,heading,latitude,longitude
B00D16,2012-06-27 00:08:29+08:00,0,0,225,22.582899,113.910301
B00D16,2012-06-27 00:08:29+08:00,0,27,0,22.5714,113.865402
B00D16,2012-06-27 00:10:29+08:00,0,0,225,22.582899,113.910301
B00D16,2012-06-27 00:10:29+08:00,0,16,0,22.575899,113.868698
B00D16,2012-06-27 00:12:30+08:00,0,0,225,22.582899,113.910301
B00D16,2012-06-27 00:12:30+08:00,0,53,0,22.573799,113.877701
B02F71,2012-06-27 00:01:48+08:00,0,0,180,22.547768,114.116936
B02F71,2012-06-27 00:01:48+08:00,0,0,225,22.54775,114.116631
B02S47,2012-06-27 00:01:47+08:00,1,58,225,22.5655,114.046318
B02S47,2012-06-27 00:01:47+08:00,1,0,315,22.567734,114.035431


Looking at each of these items, it appears there are at least two reasons for the duplicates.

1. When the state changes, it seems that a second point is recorded for some taxis.
1. Some duplicates appear to be a second sample, but the one-second time resolution can't differentiate the points (i.e. the samples are different but less than a second apart).

In [22]:
# Print a report of the data around the duplicate as defined by the Timedelta.
# This is useful to determine what might be causing the duplicates or how to better handle them.
td = pd.Timedelta(minutes=1, seconds=30)
dup_list = []
for i, (idx, data) in enumerate(df[dups_index].iterrows()):
    dup_list.append(df.loc[(idx[0],idx[1]-td):(idx[0],idx[1]+td)])
    if i > n:
        break

pd.concat(dup_list)

common_id,timestamp,passenger,speed,heading,latitude,longitude
B00D16,2012-06-27 00:06:59+08:00,0,25,225,22.565399,113.867798
B00D16,2012-06-27 00:08:29+08:00,0,0,225,22.582899,113.910301
B00D16,2012-06-27 00:08:29+08:00,0,27,0,22.571400,113.865402
B00D16,2012-06-27 00:08:59+08:00,0,0,0,22.571699,113.865601
B00D16,2012-06-27 00:09:29+08:00,0,0,0,22.571699,113.865601
B00D16,2012-06-27 00:09:59+08:00,0,38,0,22.573400,113.866898
B00D16,2012-06-27 00:08:59+08:00,0,0,0,22.571699,113.865601
B00D16,2012-06-27 00:09:29+08:00,0,0,0,22.571699,113.865601
B00D16,2012-06-27 00:09:59+08:00,0,38,0,22.573400,113.866898
B00D16,2012-06-27 00:10:29+08:00,0,0,225,22.582899,113.910301


In [23]:
# drop_duplicates drops rows whose column values repeat an earlier row; the index is not compared.
# pandas 0.17.0 added the keep option to retain the first, the last, or none (False) of each set.
df.drop_duplicates(keep='first').loc['B000H6', pd.Timestamp('2012-06-27 00:06:38+08:00')]

common_id,timestamp,passenger,speed,heading,latitude,longitude
B000H6,2012-06-27 00:06:38+08:00,0,11,315,22.556683,114.219086


In [24]:
df.index.duplicated().any()  # Prints True if the index contains any duplicates.

True

In [25]:
# Keep only the first row for each duplicated index entry.
(df[~df.index.duplicated()]).loc['B000H6', pd.Timestamp('2012-06-27 00:06:38+08:00')]

passenger      0.000000
speed         11.000000
heading      315.000000
latitude      22.556683
longitude    114.219086
Name: (B000H6, 2012-06-27 00:06:38+08:00), dtype: float64

In [26]:
'Data points before dropping duplicates', len(df)

('Data points before dropping duplicates', 979220)

In [27]:
start_time = datetime.now()
df = df[~df.index.duplicated()]
print(datetime.now() - start_time)

0:00:00.096972


In [28]:
'Data points after dropping duplicates', len(df)

('Data points after dropping duplicates', 978646)

In [29]:
df.index.duplicated().any()  # Should print False at this point.

False

### Removing Bad GPS Points

In [30]:
def remove_impossible_filter(taxi):
    """ Removes GPS points that are impossible for a taxi """
    return (
        (taxi['longitude'] != 0) &
        (taxi['latitude'] != 0)
    )

In [31]:
# Let's explore the implementation of the function above.
filter_bad_gps = remove_impossible_filter(df)
filtered = df[filter_bad_gps]
print('Impossible removing', len(df) - len(filtered), 'points')

print('Num bad GPS points', len(df[~filter_bad_gps]))
sample_df(filter_bad_gps)

Impossible removing 0 points
Num bad GPS points 0


common_id  timestamp                
B15V56     2012-06-27 00:17:50+08:00    True
B52F07     2012-06-27 00:27:27+08:00    True
SCL351     2012-06-27 00:29:03+08:00    True
SEU532     2012-06-27 00:12:59+08:00    True
SKU426     2012-06-27 00:12:13+08:00    True
dtype: bool
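
The describe() output earlier reports a minimum latitude of 1.134467 and a maximum longitude of 174.067581, both far from Guangdong, yet the zero check above removes nothing.  One possible extension, sketched here under assumptions, is a bounding-box filter.  The bounds are illustrative guesses that roughly enclose Shenzhen and Dongguan, and the three test points are toy values; neither comes from the original processing.

```python
import pandas as pd

# Assumed, illustrative bounds; tune these before using them on real data.
LAT_MIN, LAT_MAX = 21.5, 24.0
LON_MIN, LON_MAX = 113.0, 115.0

def within_bounds_filter(taxi):
    """ True where the sample lies inside the assumed bounding box. """
    return (
        taxi['latitude'].between(LAT_MIN, LAT_MAX) &
        taxi['longitude'].between(LON_MIN, LON_MAX)
    )

# Toy points: one plausible, one with a bad latitude, one with a bad longitude.
points = pd.DataFrame({
    'latitude':  [22.541416, 1.134467, 23.027666],
    'longitude': [114.117966, 113.9, 174.067581],
})
print(within_bounds_filter(points).tolist())  # [True, False, False]
```

Like remove_impossible_filter, this returns a boolean mask that is True for the samples to keep, so the two filters can be combined with `&`.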

## Partition by taxi
Using the groupby method the big datafile can be broken down into individual taxis and the sub-dataframe accessed using the get_group method.  Here we access the data for a single taxi.

In [32]:
start_time = datetime.now()
taxis = df.groupby(level='common_id')
print(datetime.now() - start_time)

0:00:00.024304


In [33]:
# The individual taxi data can be extracted using the get_group method.  We also drop the taxi id from the index.
# taxi = taxis.get_group('B002Z6')  # This taxi picks up a passenger and has dups.
taxi = taxis.get_group('BL1M49').copy()  # This taxi has duplicate rows.
sample_df(taxi, 10)

common_id,timestamp,passenger,speed,heading,latitude,longitude
BL1M49,2012-06-27 00:04:23+08:00,0,81,180,22.674168,114.218147
BL1M49,2012-06-27 00:07:14+08:00,0,2,180,22.657566,114.209183
BL1M49,2012-06-27 00:07:41+08:00,0,4,135,22.656,114.207901
BL1M49,2012-06-27 00:10:36+08:00,0,21,225,22.648382,114.200401
BL1M49,2012-06-27 00:10:44+08:00,0,6,180,22.648251,114.200233
BL1M49,2012-06-27 00:10:45+08:00,0,6,180,22.648251,114.200233
BL1M49,2012-06-27 00:11:15+08:00,0,58,225,22.646633,114.197952
BL1M49,2012-06-27 00:11:23+08:00,0,64,225,22.6458,114.196671
BL1M49,2012-06-27 00:15:26+08:00,0,44,225,22.636818,114.172798
BL1M49,2012-06-27 00:21:59+08:00,0,79,225,22.622967,114.147102


In [34]:
taxi_id = taxi.index.values[0][0]
print('Taxi ID:', taxi_id)
taxi.index = taxi.index.droplevel(0)
sample_df(taxi)

Taxi ID: BL1M49


timestamp,passenger,speed,heading,latitude,longitude
2012-06-27 00:07:34+08:00,0,23,180,22.656099,114.20787
2012-06-27 00:09:46+08:00,0,2,315,22.649567,114.201782
2012-06-27 00:12:24+08:00,0,43,225,22.644567,114.193199
2012-06-27 00:23:23+08:00,0,71,180,22.619133,114.134819
2012-06-27 00:27:04+08:00,0,59,225,22.61055,114.118767


In [35]:
taxi.loc[:,'index'] = taxi.index
taxi[:20].to_json(
    date_format='iso', orient='records'
)

'[{"passenger":0,"speed":23,"heading":180,"latitude":22.705717,"longitude":114.232903,"index":"2012-06-26T16:00:03.000Z"},{"passenger":0,"speed":30,"heading":180,"latitude":22.705183,"longitude":114.232536,"index":"2012-06-26T16:00:13.000Z"},{"passenger":0,"speed":49,"heading":180,"latitude":22.704483,"longitude":114.231949,"index":"2012-06-26T16:00:23.000Z"},{"passenger":0,"speed":57,"heading":180,"latitude":22.703283,"longitude":114.231247,"index":"2012-06-26T16:00:33.000Z"},{"passenger":0,"speed":69,"heading":180,"latitude":22.702217,"longitude":114.230652,"index":"2012-06-26T16:00:41.000Z"},{"passenger":0,"speed":76,"heading":180,"latitude":22.701883,"longitude":114.230453,"index":"2012-06-26T16:00:43.000Z"},{"passenger":0,"speed":75,"heading":180,"latitude":22.700317,"longitude":114.229385,"index":"2012-06-26T16:00:53.000Z"},{"passenger":0,"speed":72,"heading":180,"latitude":22.698816,"longitude":114.2285,"index":"2012-06-26T16:01:03.000Z"},{"passenger":0,"speed":73,"heading":180,

Another thing that is required is to filter out duplicate data, as it seems to be common.  For purposes of taxi data, it is impossible to have the same taxi at two different locations at the same time.  This is true at least for as long as scientists are still refining the whole quantum mechanics thing.  We need to verify duplicates on both the time and the data values.  Where both are the same we can safely drop the row because it does not violate our constraint (of being in only one state at each given time).  After dropping the exact duplicate rows we check again for duplicate times to verify the data is good.

If there are dups in the data, this will print those duplicates.  It only prints the repeated records, so if there are two copies in the data only one is shown here.  If there are three copies in the data then two records are shown here.

With those items done, it is possible to iterate the groups to perform processing of each taxi one-by-one.  The code below just illustrates the looping logic.  Later sections detail additional processing at the taxi level that would replace the contents of process_taxi.  Note that the data has all duplicates removed and is sorted by timestamp in the main processing loop.

In [36]:
# This cell takes a while to execute if there are a lot of samples (i.e. it has to churn through the data)
print('There are {} taxis'.format(len(taxis)))

def remove_safe_dups(d):
    """
    :param d: The dataframe to check, must have indices of taxi ID and timestamp
    :return: The dataframe with the exact duplicate rows removed
    """
    dups_data = d.duplicated()
    dups_index = d.index.duplicated()
    dups = dups_data & dups_index
    return d[~dups]

def check_dups(d):
    """
    :param d: The dataframe to check, must have indices of taxi ID and timestamp
    :return: True if there are duplicates in the index
    """
    return d.index.duplicated().any()

def count_dups(taxi_id, data):
    """ Counts the taxis in the data that have duplicate timestamps with different data """
    if check_dups(data):
        count_dups.num_dups += 1
        count_dups.d = taxi_id, data      
count_dups.d = None
count_dups.num_dups = 0

def process_data(df, f):
    """ Iterates the dataframe by partitioning into individual taxis """
    no_dups = remove_safe_dups(df)
    taxis = no_dups.groupby(level='common_id', sort=False)
    for taxi_id, data in taxis:
        data = data.sort_index()
        f(taxi_id, data)  # Pass this taxi's rows only, not the whole dataframe.

start_time = datetime.now()
process_data(df, count_dups)
print(datetime.now() - start_time)

print('{} taxis violate duplicate timestamp, different data constraint'.format(count_dups.num_dups))
if count_dups.d is not None:
    print('Sample data from Taxi ID:', count_dups.d[0])
    display(sample_df(count_dups.d[1]))  # A bare expression inside an if block is not rendered.

There are 21071 taxis


KeyboardInterrupt: 

In [37]:
if count_dups.d is not None:
    # Print duplicates that violate the constraints.
    dups_data = count_dups.d[1].index.duplicated()
    display(count_dups.d[1][dups_data])  # A bare expression inside an if block is not rendered.

For subsequent processing we will remove the truly duplicate records from the data.

## Taxi Data Filtering

There are certain things that require knowing that the data belongs to a single taxi.  This includes implausible filtering, which determines whether the data represents something that could have plausibly occurred.  This helps isolate the more extreme GPS jitter, since consecutive positions can be used to calculate speed (the change in distance over time).  So we detect where the taxi appears to have traveled faster than 180km/hr (there are speed limits up to 160km/hr).  This equates to 50m/s.

The plausible distance calculation is a little more involved as we need to calculate the distance traveled and the time between points.  The following function will calculate the haversine distance in meters between vectors of positions given in radians.  This will be used for data correction.

In [38]:
def vector_haversine(df_lon_from, df_lat_from, df_lon_to, df_lat_to):
    """ Great-circle distance in meters between vectors of positions (radians). """
    dlon = df_lon_to - df_lon_from
    dlat = df_lat_to - df_lat_from
    a = np.sin(dlat/2)**2 + np.cos(df_lat_from) * np.cos(df_lat_to) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    dist_meters = 6371000.0 * c  # Mean Earth radius in meters, so dist / seconds is m/s.
    return dist_meters

Now we can apply the haversine formula to calculate the distances between points.  To do this as a vector operation the shift method will provide rows that line up for the vector calculations.
1. shift(1) will provide the position from the previous point
1. shift(-1) will provide the position to the next point

To save on calculations, we calculate the distance in one direction (previous position to current) and then shift that distance calculation to obtain the distance to the next point in aligning vectors.  The first value becomes NaN so that is forced to zero so the point is preserved in subsequent calculations.
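
The shift alignment described above can be sketched on a three-point track.  This is a self-contained illustration, not the notebook's own cell: the helper name haversine_m and the spherical-Earth radius are assumptions, and the coordinates are the first three points of the JSON sample shown earlier.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # Mean Earth radius in meters (spherical Earth assumed).

def haversine_m(lon_from, lat_from, lon_to, lat_to):
    """ Great-circle distance in meters; all inputs in radians. """
    dlon = lon_to - lon_from
    dlat = lat_to - lat_from
    a = np.sin(dlat / 2) ** 2 + np.cos(lat_from) * np.cos(lat_to) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

track = pd.DataFrame({
    'latitude':  [22.705717, 22.705183, 22.704483],
    'longitude': [114.232903, 114.232536, 114.231949],
})
lon = np.radians(track['longitude'])
lat = np.radians(track['latitude'])

# shift(1) lines each row up with the previous position.
prev_dist = haversine_m(lon.shift(1), lat.shift(1), lon, lat)
prev_dist.iloc[0] = 0            # The first sample has no previous point.
# Shifting the result back gives the distance to the next point without recomputing.
next_dist = prev_dist.shift(-1)

print(pd.DataFrame({'prev_dist': prev_dist, 'next_dist': next_dist}))
```

The last value of next_dist is NaN because the final sample has no successor; it can be forced to zero in the same way as the first value of prev_dist.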

In [39]:
# Build the filter from this taxi's own rows; the mask computed on the full
# dataframe above has a different index and cannot be applied here.
filtered_taxi = taxi[remove_impossible_filter(taxi)]

lon = np.radians(filtered_taxi.longitude)
lat = np.radians(filtered_taxi.latitude)

dist = vector_haversine(lon, lat, lon.shift(1), lat.shift(1))
dist.iloc[0] = 0
dist.iloc[:5]

Additionally, we calculate the time delta between the points using the index and the shift operation so that each time is subtracted by the previous time.  The time delta is represented as seconds using the astype('timedelta64[s]') operation.  As with the distance calculation, the first value becomes NaN so that is forced to zero so the point is preserved in subsequent calculations.

In [None]:
times = dist.index.to_series()
time_delta_prev = (times - times.shift(1)).astype('timedelta64[s]')
time_delta_prev.iloc[0] = 0
time_delta_prev.iloc[:5]
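
Depending on the pandas version, astype('timedelta64[s]') may return a timedelta column rather than plain numbers.  As a hedged alternative, the sketch below uses diff() with dt.total_seconds(), which yields floats directly.  The three timestamps are toy values chosen for the example.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    '2012-06-27 00:00:03', '2012-06-27 00:00:13', '2012-06-27 00:00:41',
]).tz_localize('Asia/Shanghai'))

# diff() subtracts the previous timestamp; total_seconds() converts to floats.
seconds = times.diff().dt.total_seconds()
seconds.iloc[0] = 0  # The first sample has no predecessor.
print(seconds.tolist())  # [0.0, 10.0, 28.0]
```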

Then, the m/s calculation is a simple vector operation.  Since we know the undefined 0/0 in the first point is still a point to keep, we force it to zero as well.

In [None]:
mps = dist / time_delta_prev
mps.iloc[0] = 0
mps[:5]

Now the filter is created by finding points where the speed is too great to have produced that point.  The filter looks at the speed from the previous point and the speed to the next point, and keeps a sample only when both are below the limit.  This won't handle runs of consecutive bad points.  As with the bad GPS point filter, the filter is True at samples where the GPS is good and False when the difference in distance is bad.

In [None]:
# Due to the shift(-1), the last value became NaN and is forced to zero here.
next_mps = mps.shift(-1)
next_mps.iloc[-1] = 0

dist_delta = pd.DataFrame({
    'prev_mps': mps,
    'next_mps': next_mps,
})

filter_implausible_speed = (
    (dist_delta['prev_mps'] < 50) &
    (dist_delta['next_mps'] < 50)
)

# This is used to extract the times at which our implausible speed is identified.
filter_implausible_speed[filter_implausible_speed.isin([False])]

The following just provides a convenient table to compare the combined results of the individual calculations.

In [None]:
kmph = mps * 3.6  # 3600s/hr / 1000m/km
dist_df = pd.DataFrame({
    'prev_dist': dist,
    'prev_time': time_delta_prev,
    'prev_mps': mps,
    'prev_kmph': kmph,
    'next_dist': dist.shift(-1),
    'next_time': time_delta_prev.shift(-1),
    'next_mps': mps.shift(-1),
    'next_kmph': kmph.shift(-1),
})
dist_df.iloc[:5]

Now we put all that together into a handy function for later use.

In [43]:
def remove_implausible(taxi):
    """ Removes GPS points that are implausible for a taxi """
    # Great circle distance between consecutive GPS samples
    lon = np.radians(taxi.longitude)  # radians() alone is only defined under %pylab.
    lat = np.radians(taxi.latitude)

    dist = vector_haversine(lon, lat, lon.shift(1), lat.shift(1))
    dist.iloc[0] = 0
    
    # Time difference the distance was traveled.
    times = dist.index.to_series()
    time_delta_prev = (times - times.shift(1)).astype('timedelta64[s]')
    time_delta_prev.iloc[0] = 0
    
    # Calculate meters per second
    mps = dist / time_delta_prev
    mps.iloc[0] = 0
    
    next_mps = mps.shift(-1)
    next_mps.iloc[-1] = 0

    dist_delta = pd.DataFrame({
        'prev_mps': mps,
        'next_mps': next_mps,
    })

    # Speed greater than 50 meters per second is not likely and indicative of GPS error
    filter_implausible_speed = (
        (dist_delta['prev_mps'] < 50) &
        (dist_delta['next_mps'] < 50)
    )
    
    filtered = taxi[filter_implausible_speed]
    print('Implausible removing', len(taxi) - len(filtered), 'points')
    
    return filtered

def remove_impossible(taxi):
    filter_bad_gps = remove_impossible_filter(taxi)
    return taxi[filter_bad_gps]

In [None]:
lon = np.radians(taxi.longitude)  # radians() alone is only defined under %pylab.
lat = np.radians(taxi.latitude)

dist = vector_haversine(lon, lat, lon.shift(1), lat.shift(1))
dist.iloc[0] = 0

# Time difference the distance was traveled.
times = dist.index.to_series()
times[0]
# (times - times.shift(1)).astype('timedelta64[s]')

So let's create a function for later use that applies all the filtering.

In [41]:
def clean_taxi(taxi):
    taxi = remove_safe_dups(taxi)
    taxi = remove_impossible(taxi)
    taxi = remove_implausible(taxi)
    return taxi

## Map Matching

Map matching can be done at the global or taxi level.  We choose to do it at the taxi level simply for conciseness of the operation.  This also helps with scale.  Note that the trip level might scale at an even finer grain, but points are replicated when partitioning at the trip level, so map-matching at the taxi level is the most efficient option.

We want to create the following derived data parameters:
1. gid of the matched road
1. distance from the GPS sample to the closest point on the road
1. delta of the sample heading to the calculated road segment heading
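
For the third item, headings wrap around at 360 degrees, so a plain subtraction overstates the difference between, say, 350 and 0.  The helper below is a sketch of one way to compute the delta; heading_delta is a hypothetical name, and how the database function derives or uses the road heading is not shown in this notebook.

```python
import numpy as np

def heading_delta(sample_hdg, road_hdg):
    """ Smallest absolute angle in degrees between two compass headings. """
    sample_hdg = np.asarray(sample_hdg, dtype=float)
    road_hdg = np.asarray(road_hdg, dtype=float)
    diff = np.abs(sample_hdg - road_hdg) % 360
    # Going the other way around the compass may be shorter.
    return np.minimum(diff, 360 - diff)

print(heading_delta([315, 0, 90], [10, 350, 270]).tolist())  # [55.0, 10.0, 180.0]
```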

In [None]:
# Used for road matching
import psycopg2
connect_str = "dbname='osm' user='django' host='127.0.0.1' password='djangopsql2015'"

In [None]:
def road_match(taxi):
    # Column names follow the select list in the query below.
    columns = ['gid', 'dist', 'road_seg_hdg']
    conn = psycopg2.connect(connect_str)
    curr = conn.cursor()

    positions = taxi[['longitude', 'latitude', 'heading']].values
    query_str = (
        'select gid, dist, road_seg_hdg'
        ' from osm_road_match_line_hdg(\'SRID=4326;LINESTRING M(%s)\'::geometry);'
    ) % (
        ', '.join([('%f %f %d' % (lon, lat, hdg)) for lon, lat, hdg in positions])
    )

    try:
        curr.execute(query_str)
        res = curr.fetchall()
    finally:
        conn.close()  # Release the connection even if the query fails.

    derived = pd.DataFrame(
        data=res,
        columns=columns,
        index=taxi.index
    )

    return derived

%timeit road_match(taxi)

## Partition Taxi by Trip

Now that we have some reasonably good data identified, we can split the taxi into trips.  The following partitioning is done using the passenger status such that each time the passenger status changes, a new trip is created.  In order to maintain continuity between the partitions, the first point of the subsequent trip is used as the last point of the current trip.

In order to partition the trips by the passenger status, a temporary series is created from the difference between each status and the previous one.  Each non-zero difference marks a status change, and the cumulative sum of those changes labels each trip.  The trip ID is added to the taxi DataFrame to enable the pandas groupby functionality.

In [44]:
# Since the column is already a flag it can be used directly.  Otherwise, this would convert it to a flag.
# trips = (taxi.passenger - taxi.passenger.shift(1)).cumsum()

taxi = taxis.get_group('B002Z6')  # This taxi picks up a passenger and has dups.
# taxi = taxis.get_group('BL1M49')  # This taxi has duplicate rows.
taxi.index = taxi.index.droplevel(0)
taxi = clean_taxi(taxi)
trips = (taxi.passenger.diff(1) != 0).astype('int').cumsum()
sample_df(trips, 10)

Implausible removing 0 points


timestamp
2012-06-27 00:05:09+08:00    2
2012-06-27 00:06:59+08:00    3
2012-06-27 00:08:29+08:00    3
2012-06-27 00:09:29+08:00    3
2012-06-27 00:11:09+08:00    3
2012-06-27 00:16:47+08:00    4
2012-06-27 00:16:49+08:00    4
2012-06-27 00:19:39+08:00    4
2012-06-27 00:19:59+08:00    4
2012-06-27 00:27:19+08:00    5
Name: passenger, dtype: int64

Now the trips can be iterated over and handed to other processing code.  When creating each trip we append the first point of the next sequence to the current sequence to preserve continuity.  After the loop there is one final trip left; it is incomplete because no following sequence exists to supply its closing point, but it is still processed.

In [None]:
def process_trip(trip):
    print(
        trip.index[0], ':',
        trip.index[-1] - trip.index[0],
        'Samples:', len(trip),
        '- Passenger' if trip.iloc[0].passenger else ''
    )

def process_taxi(taxi, f):
    # Label trips from this taxi's own passenger flag rather than a global series.
    trips = (taxi.passenger.diff(1) != 0).astype('int').cumsum()
    trip_groups = taxi.groupby(trips, sort=False)
    prev_seq = None
    for name, trip in trip_groups:
        if prev_seq is not None:
            # Combined represents the desired trajectory partition: the previous
            # sequence plus the first point of the current one.
            combined = pd.concat([prev_seq, trip.iloc[:1]])
            f(combined)
        prev_seq = trip

    # The final sequence has no successor, so it is processed as-is.
    if prev_seq is not None:
        f(prev_seq)

process_taxi(taxi, process_trip)

#### Taxis with one Trip
Some of the taxis seem to drive around all day and never pick up a passenger.  It is more likely that the status flag for these taxis is broken, or that the taxis lack the instrumentation to report it.  Here we determine how many taxis exhibit this behavior.

In [None]:
no_trip_taxis = []
for taxi_id, taxi_data in taxis:
    trips = (taxi_data.passenger.diff(1) != 0).astype('int').cumsum()
    trip_groups = taxi_data.groupby(trips, sort=False)
    if len(trip_groups) == 1:
        no_trip_taxis.append(taxi_id)
max_print = 100
print('{} taxis have only one trip.  IDs listed below (up to {}).'.format(len(no_trip_taxis), max_print))
print(no_trip_taxis[:max_print])

## Convert DataFrame to Array for Postgres

In [45]:
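trips = (taxi.passenger.diff(1) != 0).astype('int').cumsum()
trip_groups = taxi.groupby(trips, sort=False)
# Take the first trip.  The group key is the trip ID, so read the passenger
# flag from the trip's own samples.
trip_id, trip_data = next(iter(trip_groups))
passenger = trip_data.passenger.iloc[0]
print("There is a passenger" if passenger else "There is not a passenger")
trip_data

There is not a passenger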


| timestamp | passenger | speed | heading | latitude | longitude |
|---|---|---|---|---|---|
| 2012-06-27 00:00:05+08:00 | 0 | 0 | 180 | 22.534018 | 114.113785 |
| 2012-06-27 00:00:07+08:00 | 0 | 0 | 180 | 22.5338 | 114.113815 |
| 2012-06-27 00:00:17+08:00 | 0 | 0 | 180 | 22.534033 | 114.113914 |
| 2012-06-27 00:00:27+08:00 | 0 | 0 | 180 | 22.534033 | 114.113914 |
| 2012-06-27 00:00:37+08:00 | 0 | 0 | 180 | 22.534033 | 114.113914 |


In [48]:
'{{{}}}'.format(','.join(trip_data.index.strftime("'%Y-%m-%d %H:%M:%S'")))

"{'2012-06-27 00:00:05','2012-06-27 00:00:07','2012-06-27 00:00:17','2012-06-27 00:00:27','2012-06-27 00:00:37'}"

In [49]:
'{{{}}}'.format(','.join(trip_data.columns))

'{passenger,speed,heading,latitude,longitude}'

In [53]:
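# Note: pandas 1.5 renamed this keyword to `lineterminator`; `line_terminator` was removed in pandas 2.0.
'{{{{{}}}'.format(trip_data.to_csv(header=False, index=False, line_terminator='},{')[:-2])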

'{{0,0,180,22.534018,114.11378500000001},{0,0,180,22.5338,114.11381499999999},{0,0,180,22.534032999999997,114.113914},{0,0,180,22.534032999999997,114.113914},{0,0,180,22.534032999999997,114.113914}}'

In [83]:
values

[array([0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0]),
 array([180, 180, 180, 180, 180]),
 array([ 22.534018,  22.5338  ,  22.534033,  22.534033,  22.534033]),
 array([ 114.113785,  114.113815,  114.113914,  114.113914,  114.113914])]
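
The string-formatting cells above can be folded into one helper that renders nested Python sequences as a Postgres array literal. This is a minimal sketch under stated assumptions: the name `pg_array_literal` is hypothetical, only numbers, strings and `None` are handled, and strings are double-quoted with backslash escapes as the Postgres array input syntax expects. When values are passed as query parameters, psycopg2 already adapts Python lists to arrays, so a helper like this is mainly useful for building literal text by hand.

```python
def pg_array_literal(values):
    """Render a (possibly nested) list or tuple as Postgres array literal text."""
    def render(value):
        if isinstance(value, (list, tuple)):
            return '{' + ','.join(render(v) for v in value) + '}'
        if value is None:
            return 'NULL'
        if isinstance(value, str):
            # Quote strings; escape backslashes first, then double quotes.
            escaped = value.replace('\\', '\\\\').replace('"', '\\"')
            return '"' + escaped + '"'
        return str(value)  # Assumption: anything else is a plain number.
    return render(values)


rows = [[0, 0, 180, 22.534018, 114.113785], [0, 0, 180, 22.5338, 114.113815]]
print(pg_array_literal(rows))
# {{0,0,180,22.534018,114.113785},{0,0,180,22.5338,114.113815}}
print(pg_array_literal(['2012-06-27 00:00:05', '2012-06-27 00:00:07']))
# {"2012-06-27 00:00:05","2012-06-27 00:00:07"}
```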

## Create a LineString
Apply GPS filters and create a LineString.  The LineString is used in the database to support geospatial analysis.

In [None]:
from django.contrib.gis.geos import LineString

def create_linestring(taxi):
    positions = taxi[['longitude', 'latitude']]
    start_time = taxi.index[0]
    tuples = [
        tuple((x[0][0], x[0][1], (x[1]-start_time).total_seconds()))
        for x in zip(positions.values, taxi.index)
    ]
    if len(tuples) == 1:
        tuples = (tuples[0], tuples[0])
    return LineString(tuples, srid=4326)

def print_linestring_geojson(trip):
    print(create_linestring(trip).json, ',')

# At this point the taxi data frame has been filtered to remove impossible and implausible data.
# Calling this will print the linestring for each trip.
process_taxi(taxi, print_linestring_geojson)

In [None]:
%timeit create_linestring(taxi)

In [None]:
def create_linestring2(taxi):
    start_datetime = taxi.index[0]
    positions = [
        [d.longitude, d.latitude, (time - start_datetime).total_seconds()]
        for time, d in taxi[['longitude', 'latitude']].dropna().iterrows()
    ]
    return LineString(positions, srid=4326)

In [None]:
%timeit create_linestring2(taxi)
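
For checking the coordinate layout without Django, the same structure can be sketched with the standard library only. The helper name `linestring_wkt` and the `(timestamp, longitude, latitude)` input layout are assumptions for illustration. As in `create_linestring`, the third coordinate holds the seconds elapsed since the first sample, and a single point is duplicated because a line needs at least two points.

```python
from datetime import datetime


def linestring_wkt(samples):
    """Build 'LINESTRING Z' WKT from (timestamp, longitude, latitude) tuples."""
    start = samples[0][0]
    coords = [
        (lon, lat, (ts - start).total_seconds())
        for ts, lon, lat in samples
    ]
    if len(coords) == 1:
        coords = coords * 2  # Repeat the lone point so the line has two vertices.
    return 'LINESTRING Z (' + ', '.join('%s %s %s' % c for c in coords) + ')'


samples = [
    (datetime(2012, 6, 27, 0, 0, 5), 114.113785, 22.534018),
    (datetime(2012, 6, 27, 0, 0, 7), 114.113815, 22.5338),
]
print(linestring_wkt(samples))
# LINESTRING Z (114.113785 22.534018 0.0, 114.113815 22.5338 2.0)
```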

## Miscellaneous Code

This section just captures some things that were useful when assessing the data but aren't actually needed for the final processing code.  They are retained in case further assessment is needed.

### Create a LineString to Display on a Map
This bit of code will dump out a line string that can be copied to a mapping application such as:
1. http://arthur-e.github.io/Wicket/sandbox-gmaps3.html
    1. Uses Google Maps, which applies the China map shift
    1. Requires WKT format
1. http://geojsonlint.com/
    1. Uses OpenStreetMap data (processed by MapQuest)
    1. Requires GeoJSON format

In [None]:
my_taxi = taxi[['longitude', 'latitude', 'heading']]
my_taxi.index = my_taxi.index.droplevel(0)  # Remove the common_id

print('LINESTRING M({})'.format(
    ', '.join(['{} {} {}'.format(lon, lat, hdg) for lon, lat, hdg in my_taxi.values[:10]])
))


### Create CSV of a Single Taxi
This is useful for later or more directed testing.  If the entire dataset has been read in, this writes one CSV per listed taxi, each covering the full time range of the data.  A single-taxi file can then be read in later for quicker testing.

In [None]:
taxi_ids = [
    'B000H6',
    'B001B1',
    'B001B2',
    'B001B6',
    'B001B7',
    'B001H0',
    'B002B1',
    'B002V7',
    'B002Y1',
    'B002Z6'
]

for taxi_id in taxi_ids:
    taxi = taxis.get_group(taxi_id)
    # Write one CSV per taxi, named after the taxi ID.
    taxi.to_csv('%s.csv' % taxi_id)