<div class="frontmatter text-center">
<h1>Geospatial Data Science</h1>
<h2>Exercise 9: Mobility with scikit-mobility</h2>
<h3>IT University of Copenhagen, Spring 2022</h3>
<h3>Instructors: Anastassia Vybornova & Ane Rahbek Vierø</h3>
</div>

In today's exercise we will practice working with the scikit-mobility package and get some hands-on experience with handling mobility data.

This notebook was developed with inspiration from the [scikit-mobility tutorials](https://github.com/scikit-mobility/scikit-mobility/tree/master/examples) and [examples](https://github.com/scikit-mobility/scikit-mobility/tree/master/examples).

### Working with scikit-mobility - useful links

* [Documentation and install instructions](https://scikit-mobility.github.io/scikit-mobility/)
* [GitHub Repository](https://github.com/scikit-mobility/scikit-mobility)
* [Examples and demos](https://github.com/scikit-mobility/scikit-mobility/tree/master/examples)
* [Tutorials](https://github.com/scikit-mobility/tutorials/tree/master/mda_masterbd2020)
* [OSMnx GitHub repository](https://github.com/gboeing/osmnx)
* [See also the partly overlapping package *movingpandas*](https://github.com/anitagraser/movingpandas)

## Short Instructions
The exercise is divided into two parts: one with flow data and one with trajectory data.

For each part, the workflow is to:

1. **Load the data** (We have done a bit of the pre-processing for you)

2. **Convert it to a scikit-mobility format (a FlowDataFrame for the flows and a TrajDataFrame for the trajectories)** (Already done for the trajectories)

3. **Visualise the data in a way of your own choice (*we recommend using scikit-mobility's built-in support for easy and nice-looking plots with folium*)**

4. **Analyse some aspect of the data with a measure of your own choice. You can see all built-in measures [right here](https://scikit-mobility.github.io/scikit-mobility/reference/measures.html).** 
    *Most of the built-in measures are for trajectory data, but see if you can come up with something for the flow data yourself.*
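
As a starting point for a flow measure of your own, here is a minimal plain-Python sketch of the net flow per station (arrivals minus departures). The station IDs and counts are made up for illustration and do not come from the Oslo data; in the notebook the same idea could be expressed with a pandas `groupby`.

```python
from collections import defaultdict

# Toy flow records (origin, destination, flow). The station IDs and
# counts are invented for illustration only.
flows = [
    (1, 2, 5),
    (2, 1, 3),
    (1, 3, 2),
    (3, 3, 4),  # round trips: start and end at the same station
]

out_flow = defaultdict(int)  # trips leaving each station
in_flow = defaultdict(int)   # trips arriving at each station
for origin, destination, flow in flows:
    out_flow[origin] += flow
    in_flow[destination] += flow

stations = sorted(set(out_flow) | set(in_flow))

# Positive net flow: the station gains bikes; negative: it loses bikes.
net_flow = {s: in_flow[s] - out_flow[s] for s in stations}
print(net_flow)
```

Because every trip leaves one station and arrives at another (or the same) station, the net flows always sum to zero, which is a handy sanity check.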

*At the end of the exercise, we will invite a couple of you to show your results.*

## Data
The **flow** data we will be using today is an open data set of origins and destinations of trips on Oslo's city bikes.

A JSON file with trips from February 2022 is available with this notebook. You can also download more data from Oslo Bysykkel [here](https://oslobysykkel.no/apne-data/historisk).

The **trajectory** data is a dataset of taxi trips in Porto, published for the ECML PKDD 2015 prediction challenge (the trips themselves start in July 2013, as the timestamps further down show). The data can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/Taxi+Service+Trajectory+-+Prediction+Challenge,+ECML+PKDD+2015) (train.csv.zip), where you can also read more about the dataset. The dataset is very large, so we suggest importing only the first x rows (x on the order of 10^4).

Below, we give you an example of the visual outcome for this exercise.

#### Example of output from flow data visualisation (showing the tessellation data)

In [1]:
from IPython.display import IFrame
IFrame(src='exercise09_example_tesselation.html', width=900, height=700)

#### Example of output from flow data visualisation (showing the flows)

In [2]:
IFrame(src='exercise09_example_flows.html', width=900, height=700)

# Coding starts here :)

In [3]:
# import libraries needed
import skmob
import geopandas as gpd
import pandas as pd
from skmob.tessellation import tilers
from skmob.utils import plot
import matplotlib.pyplot as plt
import numpy as np
import datetime

tess_style = {'color':'gray', 'fillColor':'gray', 'opacity':0.2}

## Flows

Flow data, also known as origin-destination (O-D) data, describes the origin and destination of each trip, but not the actual trajectory between the two points.

In our dataset the coordinates of both origin and destination are stored in the same row, so we need to modify the data structure a little before we can convert it to a scikit-mobility dataframe.
The goal is to have a dataframe where each row has the ID of the origin and destination location, plus a geodataframe with the locations referenced by the origins and destinations.

This geodataframe is usually referred to as the 'tessellation'. A tessellation normally consists of polygon tiles, but you can also use point geometries, as we will be doing here.
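
To illustrate the target structure, here is a minimal plain-Python sketch that splits a "wide" trip list into the two tables described above: a flow count per origin-destination pair and a lookup of locations by ID. The field names (`start_id`, `end_lat`, ...) and all values are hypothetical; the pandas code further down does the same with the real column names.

```python
from collections import Counter

# Toy trips in a "wide" layout: origin and destination coordinates
# sit in the same row. Field names and values are hypothetical.
trips = [
    {"start_id": 1, "start_lat": 59.91, "start_lng": 10.76,
     "end_id": 2, "end_lat": 59.92, "end_lng": 10.72},
    {"start_id": 1, "start_lat": 59.91, "start_lng": 10.76,
     "end_id": 2, "end_lat": 59.92, "end_lng": 10.72},
    {"start_id": 2, "start_lat": 59.92, "start_lng": 10.72,
     "end_id": 1, "end_lat": 59.91, "end_lng": 10.76},
]

# Table 1: one entry per (origin, destination) pair with a trip count.
flow_counts = Counter((t["start_id"], t["end_id"]) for t in trips)

# Table 2: one entry per location ID with its coordinates.
# This plays the role of the (point) tessellation.
locations = {}
for t in trips:
    locations[t["start_id"]] = (t["start_lat"], t["start_lng"])
    locations[t["end_id"]] = (t["end_lat"], t["end_lng"])

print(dict(flow_counts))
print(locations)
```

Note that the flow table only stores IDs; the coordinates live once per location in the second table, which is why the two need a shared ID column.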

In [4]:
# Load data
bike_trips = pd.read_json('data/oslo_bysykkel_0222.json')

bike_trips.head()

Unnamed: 0,started_at,ended_at,duration,start_station_id,start_station_name,start_station_description,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_description,end_station_latitude,end_station_longitude
0,2022-02-01 04:46:50.639000+00:00,2022-02-01 04:53:41.698000+00:00,411,507,Jens Bjelkes Gate,ved Trondheimsveien,59.919147,10.76413,552,Trelastgata,ved Nordenga bru,59.908005,10.76257
1,2022-02-01 05:14:21.118000+00:00,2022-02-01 05:19:10.323000+00:00,289,584,Henrik Wergelands allé,ved Bogstadveien,59.926894,10.720789,603,Frogner plass,i rundkjøringen,59.922539,10.704541
2,2022-02-01 05:15:49.373000+00:00,2022-02-01 05:21:54.926000+00:00,365,385,Søndre gate,ved Ankerbrua,59.918632,10.757867,448,Oslo Plaza,ved rundkjøringen,59.912183,10.754434
3,2022-02-01 05:16:09.283000+00:00,2022-02-01 05:23:21.159000+00:00,431,480,Helga Helgesens plass,langs Grønlandsleiret,59.912111,10.766194,549,Linaaes gate,langs Møllergata,59.913824,10.745704
4,2022-02-01 05:17:33.512000+00:00,2022-02-01 05:23:58.635000+00:00,385,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,443,Sjøsiden ved trappen,Oslo S,59.910154,10.751981


### A bit of data processing...

In [5]:
# Reformat data

ods = bike_trips[['start_station_id','end_station_id']].copy(deep=True) # You can include more columns if you want to use them in your analysis
tesselation_start = bike_trips[['start_station_name','start_station_id','start_station_latitude','start_station_longitude']].copy(deep=True)
tesselation_end = bike_trips[['end_station_name','end_station_id','end_station_latitude','end_station_longitude']].copy(deep=True)

# Count trips between the same origin and destination
ods['flow'] = 1
flows = ods.groupby(['start_station_id','end_station_id']).flow.count().reset_index()

# Rename columns
tesselation_start.rename(columns={'start_station_name':'station_name','start_station_id':'tile_ID','start_station_latitude':'latitude','start_station_longitude':'longitude'},inplace=True)
tesselation_start.drop_duplicates(inplace=True)
tesselation_end.rename(columns={'end_station_name':'station_name','end_station_id':'tile_ID','end_station_latitude':'latitude','end_station_longitude':'longitude'},inplace=True)
tesselation_end.drop_duplicates(inplace=True)

# Combine and remove duplicates
tesselation = pd.concat([tesselation_start, tesselation_end], ignore_index=True)
tesselation.drop_duplicates(inplace=True)
# In our data all origin locations appear to also be destinations and vice versa, but just to be sure we combine both

# Convert to geodataframe
tess = gpd.GeoDataFrame(tesselation, geometry=gpd.points_from_xy(tesselation.longitude, tesselation.latitude), crs='EPSG:4326') # There is no defined CRS in the data, but we can read on Oslo Bysykkel's website that coordinates are in WGS 84 (EPSG:4326)

In [7]:
# Load into a FlowDataFrame

# Add your code here

Unnamed: 0,origin,destination,flow
0,377,384,1
1,377,385,1
2,377,387,1
3,377,396,1
4,377,411,1
...,...,...,...
9322,2332,521,1
9323,2332,540,1
9324,2332,611,1
9325,2332,737,1


### Plot tessellation

### Plot flows

## Trajectories

With trajectory data, we have not only the locations of the start and end points, but also locations *in between* the origin and destination. Depending on the quality of the data, these points might represent the actual trajectory or be closer to a 'sample' of locations along it.
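
Because a trajectory is an ordered sequence of points, simple quantities such as the travelled distance can be computed by summing over consecutive pairs. The sketch below is one way to do this with the haversine formula in plain Python; the helper names are hypothetical, the Earth radius is the usual mean-radius approximation, and the three points are the first points of the first taxi trip shown further down.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def path_length_km(points):
    """Sum of distances between consecutive (lat, lng) points."""
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))


# First three (lat, lng) points of the first taxi trip in this notebook.
trajectory = [
    (41.141412, -8.618643),
    (41.141376, -8.618499),
    (41.14251, -8.620326),
]

travelled = path_length_km(trajectory)
straight = haversine_km(*trajectory[0], *trajectory[-1])
print(f"travelled: {travelled:.3f} km, straight line: {straight:.3f} km")
```

The travelled distance is never shorter than the straight-line distance between the first and last point, and this difference is exactly the information that flow data lacks.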

In [13]:
# import data set

df = pd.read_csv('data/train.csv.zip', compression='zip', sep=',', nrows=20000)
df.head()

Unnamed: 0,TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,DAY_TYPE,MISSING_DATA,POLYLINE
0,1372636858620000589,C,,,20000589,1372636858,A,False,"[[-8.618643,41.141412],[-8.618499,41.141376],[..."
1,1372637303620000596,B,,7.0,20000596,1372637303,A,False,"[[-8.639847,41.159826],[-8.640351,41.159871],[..."
2,1372636951620000320,C,,,20000320,1372636951,A,False,"[[-8.612964,41.140359],[-8.613378,41.14035],[-..."
3,1372636854620000520,C,,,20000520,1372636854,A,False,"[[-8.574678,41.151951],[-8.574705,41.151942],[..."
4,1372637091620000337,C,,,20000337,1372637091,A,False,"[[-8.645994,41.18049],[-8.645949,41.180517],[-..."


In [14]:
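# process data - each point from the polyline should have its own row
# each point needs lat and lon coordinates in separate columns
import ast  # literal_eval parses the polyline string without executing arbitrary code

rows = []
for i, row in df.iterrows():
    uid, tid = row['TAXI_ID'], row['TRIP_ID']
    call_type = row['CALL_TYPE']
    timestamp = row['TIMESTAMP']  # trip start time; reused for every point of the trip
    day_type = row['DAY_TYPE']
    # POLYLINE is a string holding a list of [longitude, latitude] pairs
    for point in ast.literal_eval(row['POLYLINE']):
        rows.append([uid, tid, call_type, timestamp, day_type, point[1], point[0]])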

temp_df = pd.DataFrame(rows, columns=['uid', 'tid', 'call_type', 'datetime', 
                                       'day_type', 'lat', 'lng'])
temp_df['datetime'] = pd.to_datetime(temp_df['datetime'], unit='s')
tdf = skmob.TrajDataFrame(temp_df)
tdf.head()

Unnamed: 0,uid,tid,call_type,datetime,day_type,lat,lng
0,20000589,1372636858620000589,C,2013-07-01 00:00:58,A,41.141412,-8.618643
1,20000589,1372636858620000589,C,2013-07-01 00:00:58,A,41.141376,-8.618499
2,20000589,1372636858620000589,C,2013-07-01 00:00:58,A,41.14251,-8.620326
3,20000589,1372636858620000589,C,2013-07-01 00:00:58,A,41.143815,-8.622153
4,20000589,1372636858620000589,C,2013-07-01 00:00:58,A,41.144373,-8.623953
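
In the table above, every point of a trip carries the trip's start time. If you need per-point timestamps (for example for speed-based measures), one option is to reconstruct them. The sketch below assumes that points are sampled every 15 seconds, as stated in the dataset description; please verify this before relying on it. The function name `expand_polyline` is hypothetical, and the example input is the start time and the first two points of the first trip.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: one GPS point every 15 seconds (per the dataset description).
SAMPLING_INTERVAL_S = 15


def expand_polyline(trip_start_unix, polyline_json):
    """Return one (datetime, lat, lng) tuple per point of a trip."""
    start = datetime.fromtimestamp(trip_start_unix, tz=timezone.utc)
    points = json.loads(polyline_json)  # list of [lng, lat] pairs
    return [
        (start + timedelta(seconds=i * SAMPLING_INTERVAL_S), lat, lng)
        for i, (lng, lat) in enumerate(points)
    ]


rows = expand_polyline(
    1372636858,
    "[[-8.618643,41.141412],[-8.618499,41.141376]]",
)
for when, lat, lng in rows:
    print(when.isoformat(), lat, lng)
```

The first point keeps the trip start time and each following point is shifted by one sampling interval, so the points of a trip become ordered in time.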


In [15]:
# Inspect trajectories data frame:
print(tdf.crs)
print()
print(tdf.parameters)
print()
print(tdf.dtypes)
print()
print("Unique UIDs: ", len(np.unique(tdf.uid)))

{'init': 'epsg:4326'}

{}

uid                   int64
tid                   int64
call_type            object
datetime     datetime64[ns]
day_type             object
lat                 float64
lng                 float64
dtype: object

Unique UIDs:  415
