# New Shenzhen Data
We were provided several days' worth of data for Shenzhen.  At first glance this data seems to be very clean.  This notebook goes through some of the initial processing.

## Preliminaries
A quick inspection of the files already tells us several things.

In [1]:
# Python libraries
from datetime import datetime
import os, io
import pandas as pd
from IPython.display import display

In [2]:
# Custom Code
from entity.loader.taxi.taxi_common import (
    sample_df,
    human_size,
    remove_safe_dups,
    remove_impossible,
    remove_implausible,
)
from processing.coordinates import wgs2gcj, gcj2wgs

### Available Files and Data Size

In [3]:
# Build one CSV row per data file: date (from the filename), file size, weekday
rows = []
for root, dirs, files in os.walk('/home/dingbat/data/taxi/shenzhen/2011-Shenzhen/data'):
    for f in files:
        if not f.startswith('.'):  # Skip hidden files
            day = os.path.splitext(f)[0]
            rows.append('{},{},{}'.format(
                day,
                human_size(os.stat(os.path.join(root, f)).st_size),
                datetime.strptime(day, '%Y-%m-%d').strftime('%A')
            ))
csv_buf = io.StringIO('\n'.join(rows))
# parse_dates expects a list of column names rather than a bare string
pd.read_csv(csv_buf, parse_dates=['Day'], index_col='Day', names=['Day', 'Size', 'Day_Of_Week']).sort_index()

Day|Size|Day_Of_Week
---|----|-----------
2011-09-15|493.9M|Thursday
2011-09-16|1.8G|Friday
2011-09-17|1.9G|Saturday
2011-09-18|2.0G|Sunday
2011-09-19|1.7G|Monday
2011-09-20|2.2G|Tuesday
2011-09-21|388.6M|Wednesday
2011-09-23|1.0G|Friday
2011-09-24|2.0G|Saturday
2011-09-25|2.0G|Sunday


### Observations on the Data
#### Missing Files

Just looking at the filenames we have data between September 15, 2011 and October 28, 2011.  However, the following days are missing.
1. 2011-09-22
1. 2011-09-30
1. 2011-10-05
1. 2011-10-07
1. 2011-10-08
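
The gap check above can be sketched in a few lines. This is a minimal illustration, assuming the file stems are ISO dates as in the listing; `missing_days` is a hypothetical helper, not part of the project code, and only the first ten file stems shown in the listing are used.

```python
from datetime import date, timedelta

def missing_days(available, start, end):
    """Return every calendar day in [start, end] that has no data file."""
    have = set(available)
    span = (end - start).days
    candidates = (start + timedelta(days=i) for i in range(span + 1))
    return [day for day in candidates if day not in have]

# File stems taken from the first part of the listing (2011-09-15 to 2011-09-25).
stems = [
    '2011-09-15', '2011-09-16', '2011-09-17', '2011-09-18', '2011-09-19',
    '2011-09-20', '2011-09-21', '2011-09-23', '2011-09-24', '2011-09-25',
]
available = [date.fromisoformat(s) for s in stems]
gaps = missing_days(available, date(2011, 9, 15), date(2011, 9, 25))
print(gaps)
```

Run over the full set of filenames, the same function would list every missing day between the first and last file.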

#### Repeated Timestamps
Some of the data files have duplicate timestamps, but the data at each duplicated timestamp is different.  Unless a way is determined to choose which record is correct, these data files are unusable.  The files include:
1. 2011-09-15.txt
1. 2011-10-09.txt
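
One way to surface this kind of conflict is sketched below. It assumes a flat DataFrame with `common_id` and `timestamp` columns (names borrowed from the parsed output later in this notebook); `conflicting_timestamps` is a hypothetical helper and the rows are toy values, not real records.

```python
import pandas as pd

def conflicting_timestamps(df, keys=('common_id', 'timestamp')):
    """Return rows whose key repeats with at least one differing value."""
    # Rows that are identical in every column are harmless repeats, so drop them first.
    deduped = df.drop_duplicates()
    # Whatever still shares a key now must differ in some other column.
    mask = deduped.duplicated(subset=list(keys), keep=False)
    return deduped[mask]

# Toy rows: taxi 1 reports two different speeds at timestamp 100,
# while taxi 2 has an exact repeat that should not be flagged.
toy = pd.DataFrame({
    'common_id': [1, 1, 1, 2, 2],
    'timestamp': [100, 100, 130, 100, 100],
    'speed':     [22, 50, 20, 0, 0],
})
conflicts = conflicting_timestamps(toy)
print(conflicts)
```

Exact repeats are the case that `remove_safe_dups` is described as handling; the rows returned here are the ones that cannot be resolved without further information.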

#### Significant Dates
October 1 - 7 is the National Day holiday.  To make up for the days off, workers instead work the weekend of October 8 and 9 (http://www.startinchina.com/china/public_holidays/2011.html, http://www.shenzhen-standard.com/2011/01/13/2011-public-holidays/).  Classes at Shenzhen University begin in mid-September (http://www.szu.edu.cn/2014/en/cb/1968.html), which might produce some different travel patterns.  Freshmen start on Sept 26, but the schedule seems to indicate they arrive on Sept 6.  The Mid-Autumn Festival (9/10 - 9/12) had already concluded before the data begins.

#### Incomplete Data
Of the remaining files, some appear to be missing data, judging by their abnormally small size.  One might expect this to be related to a holiday or other irregularity.  However, the file sizes keep their usual pattern throughout the 'National Day' holiday (October 1 - 7), so a holiday alone does not explain the small files.
1. 2011-09-21.txt
1. 2011-10-06.txt
1. 2011-10-09.txt
1. 2011-10-22.txt
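
A simple size screen along these lines can be sketched as follows. The cutoff of half the median size is an arbitrary assumption, `parse_size` and `undersized` are hypothetical helpers, and binary units are assumed for the strings produced by `human_size`. The sizes are the ten shown in the listing above.

```python
from statistics import median

# Assumption: 'K', 'M' and 'G' are binary multiples.
UNITS = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}

def parse_size(text):
    """Convert a string such as '493.9M' to a number of bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

def undersized(sizes, fraction=0.5):
    """Return the names whose size is below `fraction` of the median size."""
    cutoff = fraction * median(parse_size(s) for s in sizes.values())
    return sorted(name for name, s in sizes.items() if parse_size(s) < cutoff)

sizes = {
    '2011-09-15': '493.9M', '2011-09-16': '1.8G', '2011-09-17': '1.9G',
    '2011-09-18': '2.0G', '2011-09-19': '1.7G', '2011-09-20': '2.2G',
    '2011-09-21': '388.6M', '2011-09-23': '1.0G', '2011-09-24': '2.0G',
    '2011-09-25': '2.0G',
}
small = undersized(sizes)
print(small)
```

A screen like this only suggests candidates; a borderline file such as the 1.0G one on 2011-09-23 still needs a manual look.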

#### Chinese Map Shift
This data has the Chinese map shift (GCJ-02) applied: https://en.wikipedia.org/wiki/Restrictions_on_geographic_data_in_China.  The shift offsets coordinates in China from the plain WGS-84 coordinate system used by GPS, which is what most web maps assume before projecting to Mercator.  Shifted points line up on maps that apply the same shift, such as Google's, but appear displaced on maps that use unshifted WGS-84 throughout, such as OpenStreetMap.
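
To give a feel for the size of the shift, here is a scalar sketch of the widely circulated open-source approximation of GCJ-02. The official algorithm is not published, so the constants and polynomial terms below are an assumption taken from that community approximation; the function names are hypothetical and this is not necessarily what `processing.coordinates` does. The usual "outside China" check is omitted, and the inverse is a plain fixed-point iteration.

```python
from math import cos, pi, radians, sin, sqrt

A = 6378245.0                 # semi-major axis of the Krasovsky 1940 ellipsoid
EE = 0.00669342162296594323   # first eccentricity squared of that ellipsoid

def _offset_lat(x, y):
    ret = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * sqrt(abs(x))
    ret += (20.0 * sin(6.0 * x * pi) + 20.0 * sin(2.0 * x * pi)) * 2.0 / 3.0
    ret += (20.0 * sin(y * pi) + 40.0 * sin(y / 3.0 * pi)) * 2.0 / 3.0
    ret += (160.0 * sin(y / 12.0 * pi) + 320.0 * sin(y * pi / 30.0)) * 2.0 / 3.0
    return ret

def _offset_lon(x, y):
    ret = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * sqrt(abs(x))
    ret += (20.0 * sin(6.0 * x * pi) + 20.0 * sin(2.0 * x * pi)) * 2.0 / 3.0
    ret += (20.0 * sin(x * pi) + 40.0 * sin(x / 3.0 * pi)) * 2.0 / 3.0
    ret += (150.0 * sin(x / 12.0 * pi) + 300.0 * sin(x / 30.0 * pi)) * 2.0 / 3.0
    return ret

def wgs_to_gcj_sketch(lon, lat):
    """Apply the approximate shift to one WGS-84 point; returns (lon, lat)."""
    x, y = lon - 105.0, lat - 35.0
    rad_lat = radians(lat)
    magic = 1.0 - EE * sin(rad_lat) ** 2
    sqrt_magic = sqrt(magic)
    # Convert the metre-like offsets into degrees on the ellipsoid.
    d_lat = _offset_lat(x, y) * 180.0 / ((A * (1.0 - EE)) / (magic * sqrt_magic) * pi)
    d_lon = _offset_lon(x, y) * 180.0 / (A / sqrt_magic * cos(rad_lat) * pi)
    return lon + d_lon, lat + d_lat

def gcj_to_wgs_sketch(lon, lat, iterations=5):
    """Undo the shift by fixed-point iteration; returns (lon, lat)."""
    w_lon, w_lat = lon, lat
    for _ in range(iterations):
        g_lon, g_lat = wgs_to_gcj_sketch(w_lon, w_lat)
        w_lon -= g_lon - lon
        w_lat -= g_lat - lat
    return w_lon, w_lat

# First sample point from this data set, treated as shifted (GCJ-02) coordinates.
gcj_point = (113.927121, 22.684149)
wgs_point = gcj_to_wgs_sketch(*gcj_point)
print(wgs_point)
```

Under this approximation the offset around Shenzhen is a few thousandths of a degree in each axis, i.e. a few hundred metres, which is enough to move a taxi track off the road on an unshifted map.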

#### Notes
In September 2011, Shenzhen introduced a lot of new electric taxis. http://senseable.mit.edu/wef/pdfs/04_SHENZHEN.pdf

#### Conclusion

In other words, there are 39 days of data files, but one of the days (2011-09-15) is suspect and should not be used unless a way to reconcile the duplicate timestamps is determined.  An additional 4 days offer incomplete data, one of them being extremely small.  This leaves about **34 days of good, full data**.  A **complete week's worth of data can be found starting October 10th**.

### Data Columns

From a visual inspection, the columns seem to be in the following format.  There are some unknowns from looking at the data such as what the use of the unknown column might be and whether the Loaded column is accurate.

Taxi ID|Longitude|Latitude|UNIX Timestamp|Speed|Heading|Unknown?|Loaded?
-------|---------|--------|--------------|-----|-------|-------|--------
1046148|113.927121|22.684149|1318206571.000000|22|90|0|0
1046148|113.932452|22.681797|1318206661.000000|50|90|0|0
1046148|113.934130|22.681058|1318206691.000000|20|90|0|0
1046148|113.935274|22.680526|1318206721.000000|22|90|0|0
1046148|113.936159|22.680207|1318206751.000000|0|45|0|0
1046148|113.936731|22.680116|1318206781.000000|9|90|0|0
1046148|113.937006|22.680156|1318206811.000000|0|45|0|0
1046148|113.937044|22.680167|1318206841.000000|7|90|0|0
1046148|113.938599|22.680173|1318206871.000000|22|45|0|0
1046148|113.939461|22.680196|1318206901.000000|31|45|0|0

## Data Parsing and Initial Conversion
### Imports

### Reading in the Data File

It is much faster to read the raw data in directly and then convert the time column on the resulting DataFrame than to convert it while parsing.  The next cell reads in the CSV file.

In [4]:
from entity.loader.taxi.shenzhen import Shenzhen2011

taxi_file = '/home/dingbat/data/taxi/shenzhen/2011-Shenzhen/data/201110-Shenzhen/2011-10-10.txt'
reader = Shenzhen2011(organization='Hangzhou')

start_time = datetime.now()

df = reader.resource_to_dataframe(taxi_file)

print('{} to read in {} size data file'.format(
    datetime.now() - start_time,
    human_size(os.path.getsize(taxi_file))
))

sample_df(df)

0:00:34.846592 to read in 1.2G size data file


common_id|timestamp|longitude|latitude|speed|heading|unknown|passenger
---------|---------|---------|--------|-----|-------|-------|---------
1047839|2011-10-10 14:17:03+08:00|114.12286|22.569472|69|180|0|0
1076787|2011-10-10 10:51:31+08:00|114.212215|22.724362|49|225|0|0
1137048|2011-10-10 07:57:22+08:00|114.118159|22.584978|0|0|0|0
1150513|2011-10-10 01:38:30+08:00|114.654096|23.653488|74|180|0|0
1208242|2011-10-10 12:41:45+08:00|113.955007|22.567787|0|0|0|0
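
The custom reader hides the parsing step, so here is a rough plain-pandas equivalent of "read first, convert the time column afterwards". It is a sketch only: the comma delimiter, the column order (taken from the Data Columns table), and the `read_taxi_csv` name are assumptions, not the actual `Shenzhen2011` implementation.

```python
import io
import pandas as pd

COLUMNS = ['common_id', 'longitude', 'latitude', 'timestamp',
           'speed', 'heading', 'unknown', 'passenger']

def read_taxi_csv(source, tz='Asia/Shanghai'):
    """Read raw rows, then convert the UNIX timestamp column in one vectorized step."""
    df = pd.read_csv(source, names=COLUMNS, header=None)
    # Converting the whole column at once avoids a per-row date parser.
    df['timestamp'] = (pd.to_datetime(df['timestamp'], unit='s', utc=True)
                         .dt.tz_convert(tz))
    return df.set_index(['common_id', 'timestamp'])

# Two rows from the Data Columns table, assumed to be comma-separated on disk.
raw = io.StringIO(
    '1046148,113.927121,22.684149,1318206571.000000,22,90,0,0\n'
    '1046148,113.932452,22.681797,1318206661.000000,50,90,0,0\n'
)
frame = read_taxi_csv(raw)
print(frame)
```

The result has the same `(common_id, timestamp)` index layout as the sampled output above, with timestamps localized to +08:00.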


In [5]:
start_time = datetime.now()
# Unshift the data back into WGS-84
tuple_wgs = gcj2wgs(df.loc[:,'longitude'], df.loc[:,'latitude'])
df.loc[:,'longitude'] = tuple_wgs[0]
df.loc[:,'latitude'] = tuple_wgs[1]
print('{} to unshift data points'.format(datetime.now() - start_time))

KeyboardInterrupt: 

### Cleanup and post-processing

In [None]:
start_time = datetime.now()

df = remove_safe_dups(df)  # Remove rows where all data is the same
df = remove_impossible(df)  # Remove rows with data that is impossible
df = df[~df.index.duplicated()] # Remove rows where the index is the same, keeps the first instance
df.sort_index(inplace=True)

print('{} to filter and sort'.format(datetime.now() - start_time))
df.iloc[:10]

### Stats on the data
To view some additional information about the data, print a couple of summary tables, such as the data type of each column and general statistics about the values.  One fundamental attribute is the time bounds of the data.

In [None]:
df.index.levels[0].min(), df.index.levels[0].max()

In [None]:
df.index.levels[1].min(), df.index.levels[1].max()

In [None]:
df.info()

In [None]:
df.describe()

## Partitioning the data
Now let's look at splitting up the data to get individual taxis and even individual trips for those taxis.

### Partition by taxi
Using the groupby method, the large DataFrame can be split by taxi, and each taxi's sub-DataFrame can be accessed with the get_group method.  More generally, the grouped object can be iterated to visit every taxi in turn.

In [None]:
taxi_partitions = df.groupby(level='common_id', sort=False)

In [None]:
sample_df(taxi_partitions.get_group(1211897))

In [None]:
for common_id, taxi_data in taxi_partitions:
    taxi_data.index = taxi_data.index.droplevel(0)
    taxi_data = remove_implausible(taxi_data)  # Removes points that are implausible such as traveling too fast
    break
print('Taxi ID: {}'.format(common_id))
sample_df(taxi_data)

### Partition Taxi by Trip

Now that we have some reasonably good data identified, we can split the taxi into trips.  The following partitioning is done using the passenger status such that each time the passenger status changes, a new trip is created.  In order to maintain continuity between the partitions, the first point of the subsequent trip is used as the last point of the current trip.

To partition the trips by passenger status, a temporary series flags each row where the status differs from the previous row, and the cumulative sum of those flags labels each trip.  The resulting trip ID is used with the taxi DataFrame to enable the pandas groupby functionality.

In [None]:
# Since the column is already a flag it can be used directly.  Otherwise, this would convert it to a flag.
# trips = (taxi.passenger - taxi.passenger.shift(1)).cumsum()
for common_id, taxi_data in taxi_partitions:
    taxi_data.index = taxi_data.index.droplevel(0)
    trips = (taxi_data.passenger.diff(1) != 0).astype('int').cumsum()
    trip_groups = taxi_data.groupby(trips, sort=False)
    if len(trip_groups) > 1:
        taxi_data = remove_implausible(taxi_data)  # Removes points that are implausible such as traveling too fast
        break
sample_df(trips, 10)
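
The labeling is easier to see on a toy series. The sketch below uses made-up passenger flags rather than real taxi data, and also shows the continuity rule of borrowing the first point of the next trip as the last point of the current one.

```python
import pandas as pd

# Toy passenger flag: 0 = empty, 1 = loaded.
passenger = pd.Series([0, 0, 1, 1, 1, 0, 0, 1])

# A new trip starts wherever the flag differs from the previous row.
# The first row has no predecessor (diff is NaN), so it also counts as a start.
trip_id = (passenger.diff() != 0).astype(int).cumsum()
print(trip_id.tolist())

# Append the first point of the following trip to each trip for continuity.
groups = [group for _, group in passenger.groupby(trip_id, sort=False)]
segments = [pd.concat([cur, nxt.iloc[:1]]) for cur, nxt in zip(groups, groups[1:])]
segments.append(groups[-1])  # The last trip has no successor to borrow from.
print([len(s) for s in segments])
```

Every trip except the last gains one shared boundary point, so consecutive segments overlap by exactly one sample.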

Now the trips can be iterated to integrate into the functionality for other systems.

In [None]:
trip_groups = taxi_data.groupby(trips, sort=False)
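def print_trip(trip):
    # Summarize one trip: start time, duration, sample count and passenger flag
    start_time = trip.index[0]
    end_time = trip.index[-1]
    # Use positional access; the index holds timestamps, not integer positions
    passenger = '- Passenger' if trip.passenger.iloc[0] else ''
    # The shared boundary point is already part of `trip`, so len(trip) is the sample count
    print('Start({}): Duration({}): Samples({}){}'.format(
        start_time, end_time - start_time, len(trip), passenger
    ))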

prev_seq = None
for name, trip in trip_groups:
    if prev_seq is not None:
        # Trip from beginning of previous sequence through first point of current.
        combined = pd.concat([prev_seq, trip.iloc[:1]])
        print_trip(combined)
    prev_seq = trip
print_trip(prev_seq)  # Last trip

## Translating the Coordinate System
This part unshifts the last trip from the shifted (GCJ-02) coordinates back to the WGS-84 representation.

In [None]:
from entity.loader.taxi.taxi_common import create_linestring
data_gcj = prev_seq
data_wgs = prev_seq.copy()
start_time = datetime.now()
tuple_wgs = gcj2wgs(prev_seq.loc[:,'longitude'], prev_seq.loc[:,'latitude'])
data_wgs.loc[:,'longitude'] = tuple_wgs[0]
data_wgs.loc[:,'latitude'] = tuple_wgs[1]
print('{} to convert to WGS-84'.format(datetime.now() - start_time))

In [None]:
display("Shifted")
display(data_gcj[:5])
display("Unshifted")
display(data_wgs[:5])

In [None]:
import geojson, json
feature_gcj = geojson.Feature(id='0', geometry=json.loads(create_linestring(data_gcj).json))
feature_wgs = geojson.Feature(id='1', geometry=json.loads(create_linestring(data_wgs).json))

In [None]:
%%javascript
var ss = document.createElement("link");
ss.type = "text/css";
ss.rel = "stylesheet";
ss.href = '../tree/entity/static/entity/css/ol.css';
document.getElementsByTagName("head")[0].appendChild(ss);
element.append("<div id='content'></div>");
element.append("<h3 align='center'>Red is 'shifted'<BR>Green is 'unshifted' WGS-84</h3>");
require.config({
    'baseUrl': '/',
    paths : {
        "entity/js/ol": "tree/entity/static/entity/js/ol",
        "entity/js/map": "tree/entity/static/entity/js/map",
        "entity/js/color": "tree/entity/static/entity/js/color",
        "js/3rdparty/tinycolor": "tree/static/js/3rdparty/tinycolor"
    },
    shim: {
        "js/3rdparty/tinycolor": {"exports": "tinycolor"},
        "ol": {"exports": "ol"},
    }
});
require(['tree/entity/static/entity/js/map'], function(map) {
        function callback(msg, element_name) {
            map.setData(JSON.parse(msg.content.data["text/plain"].replace(/\'/g, "")));
        }
        map.init()
        var kernel = IPython.notebook.kernel;
        kernel.execute("feature_wgs", {iopub: {output: callback}}, {silent:false});
        kernel.execute("feature_gcj", {iopub: {output: callback}}, {silent:false});
});

## Conclusions
It appears that the new Shenzhen data (i.e. for the year 2011) can be 'unshifted' to put it back into WGS-84 so that it matches up to the map.