<center><img src="logo_skmob.png" width=450 align="left" /></center>

# Introduction

- Repo: [http://bit.ly/skmob_repo](http://bit.ly/skmob_repo)
- Docs: [http://bit.ly/skmob_doc](http://bit.ly/skmob_doc)
- Paper: [http://bit.ly/skmob_paper](http://bit.ly/skmob_paper)



### What is *scikit-mobility*?

A library for analyzing *mobility data*, suited for working with:

- **trajectories** composed of latitude/longitude points (e.g., GPS data)
- **fluxes** of movements between places (e.g., OD matrix)


*scikit-mobility* provides two user-friendly data structures that extend the *pandas* `DataFrame`:

- `TrajDataFrame` - for spatio-temporal trajectories
- `FlowDataFrame` - for fluxes mapped into a tessellation


### What can you do with *scikit-mobility*?

- **Preprocessing** of mobility data
- **Measuring** individual and collective behaviours
- **Generating** synthetic trajectories
- **Predicting** migration flows
- **Assessing** privacy risk
    

In [1]:
# Import the library
import skmob

# `TrajDataFrame`


Each row describes a trajectory's point and contains the columns:

- `lat` - latitude of the point
- `lng` - longitude of the point
- `datetime` - date and time of the point

For multi-user and multi-trajectory data sets, there are two optional columns:

- `uid` - identifier of the user to whom the trajectory belongs
- `tid` - identifier for the trajectory

A `TrajDataFrame` can be created from:

- a Python list or *numpy* array
- a Python dictionary
- a *pandas* `DataFrame`
- a text file

### From a `list`

In [2]:
# From a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
             [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
             [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list

[[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
 [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
 [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
 [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]

We must set the indexes of the mandatory columns using arguments `latitude`, `longitude` and `datetime`.

In [3]:
tdf = skmob.TrajDataFrame(data_list, 
                          latitude=1, longitude=2, datetime=3)
tdf 

Unnamed: 0,0,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


If present, we can specify the index of columns `uid` and `tid` using arguments `user_id` and `trajectory_id`.

In [4]:
tdf = skmob.TrajDataFrame(data_list, 
                          latitude=1, longitude=2, datetime=3, 
                          user_id=0)
tdf.head(5)

Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16
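As a rough mental model, passing positional indexes amounts to labelling the corresponding columns with the standard names. The sketch below reproduces that idea in plain *pandas*; it is an analogy, not how *scikit-mobility* is implemented, and the `index_to_name` mapping is a name introduced here for illustration.

```python
import pandas as pd

data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06']]

# Hypothetical mapping from positional index to the standard column names
index_to_name = {0: 'uid', 1: 'lat', 2: 'lng', 3: 'datetime'}

# A DataFrame built from a 2D list gets integer column labels 0..3,
# which are then replaced by the standard names
plain_df = pd.DataFrame(data_list).rename(columns=index_to_name)
print(list(plain_df.columns))
```

Unlike a `TrajDataFrame`, the result is an ordinary `DataFrame`: the `datetime` column still holds strings.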


### From a `DataFrame`

In [6]:
# Let's import the pandas library
import pandas as pd 

# Let's build a dataframe from the 2D list
data_df = pd.DataFrame(data_list, 
                       columns=['user', 'latitude', 
                                'lng', 'hour'])

In [7]:
print(type(data_df)) # type of the structure
data_df.head() # head of the DataFrame

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,user,latitude,lng,hour
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


Note that:
- some column names in `data_df` don't match the required names
- we must specify the names of the mandatory columns using the arguments `latitude`, `longitude` and `datetime`; a column that already has the required name (here `lng`) needs no argument

In [8]:
# Create a TrajDataFrame from a DataFrame
tdf = skmob.TrajDataFrame(data_df, 
                          latitude='latitude', 
                          datetime='hour', 
                          user_id='user')

print(type(tdf))
tdf.head()

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


### From a dictionary

In [9]:
# Let's build a dictionary from the DataFrame
data_dict = data_df.to_dict(orient='list')
data_dict

{'user': [1, 1, 1, 1],
 'latitude': [39.984094, 39.984198, 39.984224, 39.984211],
 'lng': [116.319236, 116.319322, 116.319402, 116.319389],
 'hour': ['2008-10-23 13:53:05',
  '2008-10-23 13:53:06',
  '2008-10-23 13:53:11',
  '2008-10-23 13:53:16']}

In [10]:
# Create a TrajDataFrame from a dictionary
tdf = skmob.TrajDataFrame(data_dict, 
                          latitude='latitude', 
                          datetime='hour', 
                          user_id='user' )
print(type(tdf))
tdf

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


### From a text file

Class `TrajDataFrame` has a method `from_file` to construct the object from an input file.

Let's try with a subsample of the **GeoLife** trajectories. The whole dataset can be found [here](https://www.microsoft.com/en-us/download/details.aspx?id=52367).

In [11]:
# Create a TrajDataFrame from a dataset of trajectories 
tdf = skmob.TrajDataFrame.from_file(
    './data/geolife_sample.txt.gz', sep=',')
print(type(tdf))

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


In [12]:
# Let's explore the TrajDataFrame as we would do with pandas
tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


## Attributes of a `TrajDataFrame`


- `crs`: the coordinate reference system. Default: `epsg:4326` (lat/long)
- `parameters`: a dictionary for storing any additional properties you need

In [13]:
tdf.crs

{'init': 'epsg:4326'}

In [14]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz'}

In [15]:
# add your own parameter
tdf.parameters['something'] = 5
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

The columns of a `TrajDataFrame` have specific types. Compare the `dtypes` of the plain `DataFrame` with those of the `TrajDataFrame`:

In [16]:
# In the DataFrame
print(type(data_df))
data_df.dtypes

<class 'pandas.core.frame.DataFrame'>


user          int64
latitude    float64
lng         float64
hour         object
dtype: object

In [17]:
print(type(tdf)) # In the TrajDataFrame
tdf.dtypes

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [18]:
tdf.lat.head()

0    39.984094
1    39.984198
2    39.984224
3    39.984211
4    39.984217
Name: lat, dtype: float64
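The difference matters most for the time column: in `data_df` the `hour` values are plain strings, while the `TrajDataFrame` stores timestamps. The following is a minimal *pandas* sketch of that kind of conversion, assuming the strings are in a format that `pd.to_datetime` can parse; it illustrates the idea and does not show *scikit-mobility* internals.

```python
import pandas as pd

# Time values as they appear in the plain DataFrame: text, not timestamps
hours = pd.Series(['2008-10-23 13:53:05', '2008-10-23 13:53:06'], name='hour')

# Parse the strings into a datetime column
timestamps = pd.to_datetime(hours)
print(timestamps.dtype)

# Timestamps support time arithmetic that strings do not
elapsed = timestamps.iloc[1] - timestamps.iloc[0]
print(elapsed.total_seconds())
```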

## Write and read 

To write a `TrajDataFrame` to a file and read it back, *scikit-mobility* provides ad-hoc functions.

### Writing a `TrajDataFrame` to a file

- includes the `parameters` and `crs` attributes
- preserves the `dtype` of columns with timestamps (time zone information is lost, though)

**Caveat**: `dtype`s other than `int`, `float` and `datetime` may not be identical to the original `dtype` after loading from a `json` file. 

Check with `tdf.dtypes` and manually convert each column to the proper dtype, if needed. 
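The reason behind the caveat is generic: JSON only knows numbers, strings, booleans, null, arrays and objects, so any richer type has to be encoded on writing and restored on reading. The sketch below shows this with the standard library only. The `day` field is a hypothetical extra column, and the code says nothing about the actual layout of the file written by `skmob.write`.

```python
import json
from datetime import date

# A record with a hypothetical extra field whose type JSON cannot represent
record = {'uid': 1, 'lat': 39.984094, 'day': date(2008, 10, 23)}

# Encode the date as text before writing
serialisable = {**record, 'day': record['day'].isoformat()}
payload = json.dumps(serialisable)

# After loading, the field is a plain string, not a date
loaded = json.loads(payload)
day_is_text = isinstance(loaded['day'], str)
print(day_is_text)

# Manual conversion back to the intended type
loaded['day'] = date.fromisoformat(loaded['day'])
print(loaded == record)
```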

In [19]:
skmob.write(tdf, './tdf.json')

In [20]:
tdf.dtypes

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [21]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

### Reading a `TrajDataFrame` from a file

In [22]:
# Let's read the file written before
tdf2 = skmob.read('./tdf.json') 
tdf2[:4]

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1


The `dtype`s, as well as the `parameters` and `crs` attributes, are preserved.

In [23]:
print(tdf2.dtypes)
tdf2.parameters

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object


{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

## Plotting

*scikit-mobility* relies on the *folium* library to plot trajectories, flows, and tessellations.

In [24]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, 
                    tiles='Stamen Toner')

# `FlowDataFrame`

Each row describes a flow and contains the columns:

- `origin`: ID of the origin tile
- `destination`: ID of the destination tile
- `flow`: number of people travelling from `origin` to `destination`

<!-- NOTE: `FlowDataFrame` is a dataframe way of having Origin-Destination Matrix. -->

### Tessellation
Each `FlowDataFrame` is associated with a tessellation, a `GeoDataFrame` that contains two columns:
- `tile_ID`, identifier of a location
- `geometry`, geometric shape of the location


It can be created from:

- the name of the area of interest, e.g., `"Florence, Italy"`
- a *geopandas* `GeoDataFrame` with Points or Polygons

There are two types of tessellations: **squared** and **voronoi**.
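To give an intuition for a squared tessellation, the sketch below assigns a point to a square tile of a regular grid laid over a bounding box. It is a simplification under stated assumptions: the tile side is expressed in degrees rather than meters, tiles are numbered column by column from the south-west corner (a choice made for this sketch), and `square_tile_id` is a hypothetical helper, not part of *scikit-mobility*.

```python
import math

def square_tile_id(lat, lng, min_lat, min_lng, max_lat, step_deg):
    """Return the ID of the square tile that contains the point (lat, lng).

    Tiles have side `step_deg` (in degrees) and are numbered column by
    column, starting from the south-west corner of the bounding box.
    """
    n_rows = math.ceil((max_lat - min_lat) / step_deg)  # tiles per column
    col = int((lng - min_lng) // step_deg)              # west-to-east index
    row = int((lat - min_lat) // step_deg)              # south-to-north index
    return col * n_rows + row

# Bounding box with south-west corner (40.0, 14.0), northern edge at 42.0
# and 0.5-degree tiles, i.e. four tiles per column
first = square_tile_id(40.6, 14.2, min_lat=40.0, min_lng=14.0,
                       max_lat=42.0, step_deg=0.5)
second = square_tile_id(40.1, 14.7, min_lat=40.0, min_lng=14.0,
                        max_lat=42.0, step_deg=0.5)
print(first, second)
```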

In [78]:
from skmob.tessellation import tilers

# Create tessellation from a base shape
tessellation = tilers.tiler.get("squared", meters=1000, 
                                base_shape="Naples, Italy")
tessellation.head(5)

Unnamed: 0,tile_ID,geometry
0,0,"POLYGON ((14.1320999 40.84557954145333, 14.132..."
1,1,"POLYGON ((14.1320999 40.85237472368723, 14.132..."
2,2,"POLYGON ((14.1320999 40.85916920907432, 14.132..."
3,3,"POLYGON ((14.1320999 40.86596299759054, 14.132..."
4,4,"POLYGON ((14.14108305284119 40.84557954145333,..."


In [82]:
from skmob.utils import plot
plot.plot_gdf(tessellation, zoom=12, popup_features=['tile_ID'])

Create a tessellation from `TrajDataFrame` in two steps:
1. use method `to_geodataframe`

In [92]:
tdf = skmob.TrajDataFrame.from_file('./data/geolife_sample.txt.gz', sep=',')
gdf = tdf.to_geodataframe() 
gdf.head(4)

Unnamed: 0,lat,lng,datetime,uid,geometry
0,39.984094,116.319236,2008-10-23 05:53:05,1,POINT (116.319236 39.984094)
1,39.984198,116.319322,2008-10-23 05:53:06,1,POINT (116.319322 39.984198)
2,39.984224,116.319402,2008-10-23 05:53:11,1,POINT (116.319402 39.984224)
3,39.984211,116.319389,2008-10-23 05:53:16,1,POINT (116.319389 39.984211)


2. use `tilers.tiler.get` specifying `base_shape=gdf`

In [89]:
tessellation = tilers.tiler.get("squared", base_shape=gdf, 
                                meters=10000)
# NOTE: base_shape also accepts a GeoDataFrame of polygons

In [91]:
tessellation.head(4)

Unnamed: 0,tile_ID,geometry
0,0,"POLYGON ((113.548843 22.14757699999999, 113.54..."
1,1,"POLYGON ((113.548843 22.23075577635448, 113.54..."
2,2,"POLYGON ((113.548843 22.31388522745751, 113.54..."
3,3,"POLYGON ((113.548843 22.39696520767058, 113.54..."


## Creating a `FlowDataFrame`

The `FlowDataFrame` can be created from:

- a Python list or a *numpy* array
- a *pandas* `DataFrame`
- a Python dictionary
- a text file


It supports the input data format:
    
- `origin`, `destination`, `flow`, `datetime` (optional)

NOTE: the argument `tessellation` is mandatory. If the tessellation does not have a column named `tile_ID`, the name of the column holding the tile identifiers must be specified with the argument `tile_id`.

### From `list`

In [None]:
# From a list

data_list = [[10, 1, 10, '2008-10-23 13:53:05'],
             [16, 23, 45, '2008-10-23 13:53:06'],
             [1, 11, 45, '2008-10-23 13:53:11']]

df = pd.DataFrame(data_list, columns=['flow_value', 
                  'origin_id', 'destination_id', 'datetime'])
df

In [None]:
fdf = skmob.FlowDataFrame(df, origin='origin_id', destination='destination_id', 
                          tessellation=tessellation, 
                          flow='flow_value')
fdf

In [None]:
# Access a single flow by origin and destination
fdf.get_flow("1","10")

In [None]:
# A FlowDataFrame can be converted into a sparse matrix of shape (len(tessellation), len(tessellation))
fdf.to_matrix()
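The idea of an origin-destination matrix can be sketched independently of the library: each tile gets a row and a column, and each flow fills one cell. The following minimal *numpy* example uses the three flows from the list above; the `tile_ids` list is a hypothetical tessellation containing exactly the tiles that appear in those flows, and the code is an illustration, not the implementation of `to_matrix`.

```python
import numpy as np

# Hypothetical tile IDs of the tessellation, in row/column order
tile_ids = ['1', '10', '11', '23', '45']

# (origin, destination, flow) triples taken from the example list
flows = [('1', '10', 10), ('23', '45', 16), ('11', '45', 1)]

# Map each tile ID to its position in the matrix
index_of = {tile: i for i, tile in enumerate(tile_ids)}

# Rows are origins, columns are destinations
od_matrix = np.zeros((len(tile_ids), len(tile_ids)), dtype=int)
for origin, destination, flow in flows:
    od_matrix[index_of[origin], index_of[destination]] += flow

print(od_matrix)
```

Most cells stay at zero, which is typical of origin-destination matrices built from real data.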

### From a file

The method `from_file` creates a `FlowDataFrame` from a text file with the format:
    
- `origin`, `destination`, `flow`, `datetime` (optional)


In [25]:
import geopandas as gpd # Let's import geopandas

In [26]:
tessellation = gpd.GeoDataFrame.from_file(
    "data/NY_counties_2011.geojson") # load a tessellation

# create a FlowDataFrame from a file and a tessellation
fdf = skmob.FlowDataFrame.from_file(
    "data/NY_commuting_flows_2011.csv",
    tessellation=tessellation, tile_id='tile_id', sep=",")

In [27]:
fdf.head(3)

Unnamed: 0,flow,origin,destination
0,121606,36001,36001
1,5,36001,36005
2,29,36001,36007


In [28]:
fdf.dtypes

flow            int64
origin         object
destination    object
dtype: object

In [29]:
# The tessellation is an attribute of the FlowDataFrame
fdf.tessellation.head() 

Unnamed: 0,tile_ID,population,geometry
0,36019,81716,"POLYGON ((-74.006668 44.886017, -74.027389 44...."
1,36101,99145,"POLYGON ((-77.099754 42.274215, -77.0996569999..."
2,36107,50872,"POLYGON ((-76.25014899999999 42.296676, -76.24..."
3,36059,1346176,"POLYGON ((-73.707662 40.727831, -73.700272 40...."
4,36011,79693,"POLYGON ((-76.279067 42.785866, -76.2753479999..."


### Plot the tessellation

In [30]:
fdf.plot_tessellation(popup_features=['tile_ID', 'population']) 

### Plot the flows

In [31]:
fdf.plot_flows(flow_color='green')

#### Plot tessellation and flows

In [107]:
map_f = fdf.plot_tessellation(style_func_args={'color':'blue', 'fillColor':'blue'})
fdf[fdf['origin'] == '36061'].plot_flows(map_f=map_f, flow_exp=0., flow_popup=True)

#### Alternative format

In [112]:
# Flows can also be provided in an expanded format

df = pd.read_csv("data/expanded_flow.csv", sep=",")
df.head()

Unnamed: 0,origin_lat,origin_lng,destination_lat,destination_lng,flow
0,42.554151,-71.103398,42.553818,-71.102749,1
1,42.463658,-70.945966,42.46358,-70.945903,1
2,42.463552,-70.945845,42.463561,-70.945911,1
3,42.116042,-71.462125,42.116698,-71.4618,1
4,42.388193,-71.039085,42.388191,-71.039177,1


When the arguments `origin_lat`, `origin_lng`, `destination_lat`, `destination_lng` are set, `from_file` loads a file in the expanded format and collapses each row into the `origin`, `destination` format.

If `tessellation` is `None`, it is inferred from the data as a Voronoi tessellation, which is then automatically attached to the `FlowDataFrame`. NOTE: this can be slow when the Voronoi tessellation is large (> 100 points).
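The collapsing step can be pictured as giving every distinct coordinate pair its own tile ID and rewriting each row in terms of those IDs. The sketch below does this in plain Python. It is a simplification: points are matched by exact coordinates, IDs are assigned in order of first appearance (the library may number tiles differently), the third row is made up to show a reused point, and `tile_id` is a hypothetical helper.

```python
# (origin_lat, origin_lng, destination_lat, destination_lng, flow)
rows = [
    (42.554151, -71.103398, 42.553818, -71.102749, 1),
    (42.463658, -70.945966, 42.463580, -70.945903, 1),
    (42.554151, -71.103398, 42.463658, -70.945966, 2),  # made-up row
]

tile_of = {}  # (lat, lng) -> tile ID, in order of first appearance

def tile_id(point):
    """Return the ID of `point`, assigning a new one on first sight."""
    return tile_of.setdefault(point, len(tile_of))

# Collapse each expanded row into (origin, destination, flow)
flows = [(tile_id((o_lat, o_lng)), tile_id((d_lat, d_lng)), flow)
         for o_lat, o_lng, d_lat, d_lng, flow in rows]

print(flows)
print(len(tile_of), "distinct points")
```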

In [113]:
fdf = skmob.FlowDataFrame.from_file("data/expanded_flow.csv", origin_lat='origin_lat', 
                                    origin_lng='origin_lng', destination_lat='destination_lat', 
                                    destination_lng='destination_lng', 
                                    flow='flow', sep=",")

In [114]:
fdf.head()

Unnamed: 0,flow,origin,destination
0,1,0,150
1,1,1,151
2,1,2,68
3,1,3,152
4,1,4,153


In [115]:
fdf.tessellation.head()

Unnamed: 0,tile_ID,geometry
0,0,POINT (-71.103398 42.554151)
1,1,POINT (-70.945966 42.463658)
2,2,POINT (-70.94584499999998 42.463552)
3,3,POINT (-71.462125 42.116042)
4,4,POINT (-71.039085 42.388193)


The method `to_flowdataframe` builds a `FlowDataFrame` from a `TrajDataFrame`.

In [108]:
# Load trajectories (Beijing, China)
tdf = skmob.TrajDataFrame.from_file(
    './data/geolife_sample.txt.gz', sep=',')

In [109]:
# Build a tessellation over the city
from skmob.tessellation import tilers
tessellation = tilers.tiler.get("squared", 
                                base_shape="Beijing, China", 
                                meters=15000)

In [110]:
# remove_na=True removes points that are not contained in the tessellation
fdf = tdf.to_flowdataframe(tessellation=tessellation, 
                           self_loops=False, remove_na=True)
fdf.sort_values(by='flow', ascending=False).head()

Unnamed: 0,origin,destination,flow
13,63,62,17
10,62,77,15
9,62,63,13
19,77,62,12
23,78,63,11
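Conceptually, turning a trajectory into flows means mapping each point to a tile and counting the transitions between consecutive tiles. The sketch below shows that counting step in plain Python, assuming a flow is recorded for every pair of consecutive points of the same user; `visited_tiles` is a made-up sequence and `count_flows` is a hypothetical helper, not the library's implementation.

```python
from collections import Counter

# Hypothetical sequence of tile IDs visited by one user, in time order
visited_tiles = [62, 62, 63, 63, 62, 77, 77]

def count_flows(tiles, self_loops=False):
    """Count movements between consecutive tiles.

    With self_loops=False, pairs that start and end in the same tile
    are ignored.
    """
    flows = Counter()
    for origin, destination in zip(tiles, tiles[1:]):
        if origin == destination and not self_loops:
            continue
        flows[(origin, destination)] += 1
    return flows

flows = count_flows(visited_tiles)
flows_with_loops = count_flows(visited_tiles, self_loops=True)
print(dict(flows))
print(sum(flows_with_loops.values()))
```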


In [111]:
fdf.plot_flows(flow_color='blue', zoom=9)