<img src="../cover.png"
     alt="Cover"
     style="float: left; margin-right: 10px;" />



<img src="../humanmobility.png"
     alt="Human mobility"
     style="float: left; margin-right: 10px;" />

## What is scikit-mobility?

## A library to analyze <font color="red">*mobility data*</font>, suited for working with:

### - **trajectories** composed of lat/long points (e.g., GPS data)
### - **fluxes** of movements between places (e.g., an OD matrix)


In [1]:
# import the library
import skmob

scikit-mobility provides two user-friendly data structures that extend the *pandas* `DataFrame`:

- `TrajDataFrame` - for spatio-temporal <font color="blue">**trajectories**</font>
- `FlowDataFrame` - for <font color="blue">**fluxes**</font> mapped into a tessellation


### What can you do with scikit-mobility?

✅ Preprocessing of mobility data <br>
✅ Measuring individual and collective behaviours <br>
✅ Assessing privacy risk <br>
✅ Predicting migration flows <br>
✅ <font color="grey">**Generating** synthetic trajectories</font>
    

## `TrajDataFrame`


Each row describes a trajectory's point and contains the following columns:

- `lat` - latitude of the point
- `lng` - longitude of the point
- `datetime` - date and time of the point

For multi-user data sets, there are two *optional* columns:

- `uid` - identifier of the user to whom the trajectory belongs
- `tid` - identifier for the trajectory

A `TrajDataFrame` can be created from:

- a Python list or *numpy* array
- a Python dictionary
- a *pandas* `DataFrame`
- a text file

### From a `list`

In [2]:
# From a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
             [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
             [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list

[[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
 [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
 [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
 [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]

We must set the indexes of the mandatory columns using arguments `latitude`, `longitude` and `datetime`.

In [3]:
tdf = skmob.TrajDataFrame(data_list, 
                          latitude=1, longitude=2, 
                          datetime=3)
print(type(tdf))
tdf


<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


| | 0 | lat | lng | datetime |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |
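
The integer arguments refer to column positions in the list. A plain-pandas sketch of that mapping follows; the `position_to_name` dictionary is a hypothetical illustration and is not part of scikit-mobility.

```python
import pandas as pd

# Two of the rows used above: [user, latitude, longitude, timestamp].
rows = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
        [1, 39.984198, 116.319322, '2008-10-23 13:53:06']]

# Without explicit names, pandas labels the columns 0, 1, 2, 3.
raw = pd.DataFrame(rows)

# Hypothetical mapping from column position to the mandatory column names.
position_to_name = {1: 'lat', 2: 'lng', 3: 'datetime'}
renamed = raw.rename(columns=position_to_name)

print(list(renamed.columns))  # [0, 'lat', 'lng', 'datetime']
```

Column `0` keeps its positional label because no name was assigned to it, which is consistent with the header of the output above.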


### From a `DataFrame`

In [4]:
# import the pandas library
import pandas as pd 
# build a dataframe from the 2D list
data_df = pd.DataFrame(data_list, 
                       columns=['user', 'latitude', 'lng', 'hour']) 

In [5]:
print(type(data_df)) # type of the structure
data_df.head() # head of the DataFrame

<class 'pandas.core.frame.DataFrame'>


| | user | latitude | lng | hour |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |


Note that:

- the column names in `data_df` don't match the required names
- you must specify the names of the mandatory columns using the arguments `latitude`, `longitude` and `datetime`

In [6]:
# Create a TrajDataFrame from a DataFrame
tdf = skmob.TrajDataFrame(data_df, 
                          latitude='latitude', 
                          datetime='hour', 
                          user_id='user')

print(type(tdf))
tdf.head()

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


| | uid | lat | lng | datetime |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |


### From a text file

The `TrajDataFrame` class has a `from_file` method that constructs the object from an input text file.

Let's try with a subsample of the <font color="blue">**GeoLife**</font> trajectories. The whole dataset can be found [here](https://www.microsoft.com/en-us/download/details.aspx?id=52367).

In [7]:
# create a TrajDataFrame from a dataset of trajectories 
tdf = skmob.TrajDataFrame.from_file(
    './data/geolife_sample.txt.gz', sep=',')
print(type(tdf))

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
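
For comparison, reading a gzip-compressed, comma-separated file with plain pandas might look like the sketch below. The file written here is a tiny stand-in, and its header (`lat,lng,datetime,uid`) is an assumption made for illustration; the actual layout of `geolife_sample.txt.gz` is not shown in this tutorial.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small stand-in file (assumed layout, not the real GeoLife sample).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt.gz')
with gzip.open(path, 'wt') as fh:
    fh.write('lat,lng,datetime,uid\n')
    fh.write('39.984094,116.319236,2008-10-23 05:53:05,1\n')
    fh.write('39.984198,116.319322,2008-10-23 05:53:06,1\n')

# pandas infers gzip compression from the .gz extension.
df = pd.read_csv(path, sep=',', parse_dates=['datetime'])
print(df.shape)  # (2, 4)
```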


In [8]:
# explore the TrajDataFrame
tdf.head(10)


| | lat | lng | datetime | uid |
|---|---|---|---|---|
| 0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
| 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
| 2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
| 3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
| 4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
| 5 | 39.98471 | 116.319865 | 2008-10-23 05:53:23 | 1 |
| 6 | 39.984674 | 116.31981 | 2008-10-23 05:53:28 | 1 |
| 7 | 39.984623 | 116.319773 | 2008-10-23 05:53:33 | 1 |
| 8 | 39.984606 | 116.319732 | 2008-10-23 05:53:38 | 1 |
| 9 | 39.984555 | 116.319728 | 2008-10-23 05:53:43 | 1 |


In [9]:
tdf.tail(10)

| | lat | lng | datetime | uid |
|---|---|---|---|---|
| 217643 | 40.000205 | 116.327173 | 2009-03-19 05:45:37 | 5 |
| 217644 | 40.000128 | 116.327171 | 2009-03-19 05:45:42 | 5 |
| 217645 | 40.000069 | 116.327179 | 2009-03-19 05:45:47 | 5 |
| 217646 | 40.000001 | 116.327219 | 2009-03-19 05:45:52 | 5 |
| 217647 | 39.999919 | 116.327211 | 2009-03-19 05:45:57 | 5 |
| 217648 | 39.999896 | 116.32729 | 2009-03-19 05:46:02 | 5 |
| 217649 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 5 |
| 217650 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 5 |
| 217651 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 5 |
| 217652 | 39.999978 | 116.32746 | 2009-03-19 05:46:37 | 5 |
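
Because `TrajDataFrame` extends the *pandas* `DataFrame`, ordinary pandas operations such as `groupby` should be usable on it too. The sketch below runs on a plain `DataFrame` built from four of the rows printed above, so it does not depend on scikit-mobility; the variable names are illustrative.

```python
import pandas as pd

# Four rows copied from the head and tail outputs above.
points = pd.DataFrame({
    'lat': [39.984094, 39.984198, 40.000015, 39.999978],
    'lng': [116.319236, 116.319322, 116.327433, 116.327460],
    'datetime': pd.to_datetime(['2008-10-23 05:53:05', '2008-10-23 05:53:06',
                                '2009-03-19 05:46:17', '2009-03-19 05:46:37']),
    'uid': [1, 1, 5, 5],
})

# Number of recorded points per user.
points_per_user = points.groupby('uid').size()

# First and last timestamp per user.
time_span = points.groupby('uid')['datetime'].agg(['min', 'max'])

print(points_per_user.to_dict())  # {1: 2, 5: 2}
```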


### Attributes of a `TrajDataFrame`


- `crs`: the coordinate reference system. Default: `epsg:4326` (lat/long)
- `parameters`: a dictionary for storing any additional properties you need

In [10]:
tdf.crs

{'init': 'epsg:4326'}

In [11]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz'}

In [12]:
# add your own parameter
tdf.parameters['IGS Tunis'] = 'Groupe IGC 2022'
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'IGS Tunis': 'Groupe IGC 2022'}
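
One general way for a `DataFrame` subclass to carry extra attributes like these is the `_metadata` mechanism documented by pandas. The sketch below uses a hypothetical `AnnotatedFrame` class to show the pattern; it is not presented as scikit-mobility's actual implementation.

```python
import pandas as pd


class AnnotatedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries two extra attributes."""

    # Names listed here are copied to the results of pandas operations.
    _metadata = ['crs', 'parameters']

    @property
    def _constructor(self):
        # Make pandas operations return AnnotatedFrame instead of DataFrame.
        return AnnotatedFrame


df = AnnotatedFrame({'lat': [39.984094], 'lng': [116.319236]})
df.crs = 'epsg:4326'
df.parameters = {'source': 'example'}

# Selecting columns yields a new object that keeps the attributes.
subset = df[['lat']]
print(type(subset).__name__, subset.crs)
```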

The columns of a `TrajDataFrame` have specific types:

In [13]:
# In the DataFrame
print(type(data_df))
data_df.dtypes

<class 'pandas.core.frame.DataFrame'>


user          int64
latitude    float64
lng         float64
hour         object
dtype: object

In [14]:
print(type(tdf)) # In the TrajDataFrame
tdf.dtypes

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [15]:
tdf.lat.head()

0    39.984094
1    39.984198
2    39.984224
3    39.984211
4    39.984217
Name: lat, dtype: float64
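
The outputs above show the timestamp column as text in `data_df` and as a datetime type in the `TrajDataFrame`. A plain-pandas sketch of an equivalent conversion, using `pd.to_datetime`, might look like this (column names are reused from the example above):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

df = pd.DataFrame({'hour': ['2008-10-23 13:53:05', '2008-10-23 13:53:06']})

# The raw column holds strings, so it is not a datetime column yet.
was_datetime = is_datetime64_any_dtype(df['hour'])

# Parse the strings into timestamps.
df['datetime'] = pd.to_datetime(df['hour'])
is_datetime = is_datetime64_any_dtype(df['datetime'])

# Datetime columns support arithmetic such as time differences.
elapsed = df['datetime'].iloc[1] - df['datetime'].iloc[0]
print(was_datetime, is_datetime, elapsed.total_seconds())  # False True 1.0
```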

### Write and read 

scikit-mobility provides dedicated functions for writing a `TrajDataFrame` to a file and reading it back.

#### Writing a `TrajDataFrame` to a file

- includes the `parameters` and `crs` attributes
- preserves the `dtype` of timestamp columns (time zone information is lost, though)

In [16]:
skmob.write(tdf, './tdf.json')

In [17]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'IGS Tunis': 'Groupe IGC 2022'}

#### Reading a `TrajDataFrame` from a JSON file

In [18]:
# read the file written before
tdf2 = skmob.read('./tdf.json') 
tdf2[:6]

| | lat | lng | datetime | uid |
|---|---|---|---|---|
| 0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
| 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
| 2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
| 3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
| 4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
| 5 | 39.98471 | 116.319865 | 2008-10-23 05:53:23 | 1 |


The `dtype`s and the `parameters` and `crs` attributes are preserved:

In [19]:
print(tdf2.dtypes)
tdf2.parameters

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object


{'from_file': './data/geolife_sample.txt.gz', 'IGS Tunis': 'Groupe IGC 2022'}
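
To make the idea of such a round trip concrete, here is a sketch that stores records and metadata together in one JSON file using only pandas and the standard library. The `meta`/`records` layout is an assumption chosen for illustration; it is not claimed to be the format that `skmob.write` produces.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'lat': [39.984094, 39.984198],
    'lng': [116.319236, 116.319322],
    'datetime': pd.to_datetime(['2008-10-23 05:53:05', '2008-10-23 05:53:06']),
})
meta = {'crs': 'epsg:4326', 'parameters': {'source': 'example'}}

# Serialise timestamps as ISO strings and keep the metadata next to the rows.
records = json.loads(df.to_json(orient='records', date_format='iso'))
path = os.path.join(tempfile.mkdtemp(), 'frame.json')
with open(path, 'w') as fh:
    json.dump({'meta': meta, 'records': records}, fh)

# Read the file back, then restore the datetime dtype explicitly.
with open(path) as fh:
    loaded = json.load(fh)
restored = pd.DataFrame(loaded['records'])
restored['datetime'] = pd.to_datetime(restored['datetime'])
restored_meta = loaded['meta']
```

JSON has no native timestamp type, which is why the datetime column has to be parsed again after loading.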

### Plotting trajectories and flows

*scikit-mobility* relies on the *folium* library to plot:
- trajectories
- flows
- tessellations

In [20]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.8, tiles='Stamen Toner')



## `FlowDataFrame`

Each row describes a flow and contains the columns:

- `origin`: ID of the origin tile
- `destination`: ID of the destination tile
- `flow`: number of people travelling from `origin` to `destination`

<!-- NOTE: `FlowDataFrame` is a dataframe way of having Origin-Destination Matrix. -->

### Tessellation
Each `FlowDataFrame` is associated with a <font color="blue">**tessellation**</font>, i.e., a `GeoDataFrame` that contains two columns:
- `tile_ID`, identifier of a location
- `geometry`, geometric shape of the location
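
Since `origin` and `destination` hold tile IDs, every ID used in the flows should also appear in the tessellation. A standard-library sketch of that consistency check follows; the helper `unknown_tile_ids` and the sample data are hypothetical and not part of scikit-mobility.

```python
def unknown_tile_ids(flows, tile_ids):
    """Return the origin/destination IDs that have no matching tile."""
    used = {f['origin'] for f in flows} | {f['destination'] for f in flows}
    return used - set(tile_ids)


# Hypothetical tessellation IDs and flows, for illustration only.
tile_ids = ['36001', '36005', '36019']
flows = [
    {'origin': '36001', 'destination': '36005', 'flow': 5},
    {'origin': '36001', 'destination': '36019', 'flow': 30},
    {'origin': '36001', 'destination': '36007', 'flow': 29},  # no such tile
]

missing = unknown_tile_ids(flows, tile_ids)
print(missing)  # {'36007'}
```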

### Creating a `FlowDataFrame`

The `FlowDataFrame` can be created from:

- a Python list or a *numpy* array
- a *pandas* `DataFrame`
- a Python dictionary
- a text file


### From a file

The `from_file` method creates a `FlowDataFrame` from a text file with the following columns:

- `origin`, `destination`, `flow`, `datetime` (optional)


In [21]:
import geopandas as gpd # Let's import geopandas

In [22]:
tessellation = gpd.GeoDataFrame.from_file(
    "data/NY_counties_2011.geojson") # load a tessellation

# create a FlowDataFrame from a file and a tessellation
fdf = skmob.FlowDataFrame.from_file(
    "data/NY_commuting_flows_2011.csv",
    tessellation=tessellation, tile_id='tile_id', sep=",")

In [23]:
fdf.head()

| | flow | origin | destination |
|---|---|---|---|
| 0 | 121606 | 36001 | 36001 |
| 1 | 5 | 36001 | 36005 |
| 2 | 29 | 36001 | 36007 |
| 3 | 11 | 36001 | 36017 |
| 4 | 30 | 36001 | 36019 |
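
The rows above are in long format: one row per origin-destination pair. A plain-pandas sketch of how the same five rows can be pivoted into an origin-destination (OD) matrix, without relying on scikit-mobility:

```python
import pandas as pd

# The five rows printed above; IDs are kept as strings.
flows = pd.DataFrame({
    'flow': [121606, 5, 29, 11, 30],
    'origin': ['36001'] * 5,
    'destination': ['36001', '36005', '36007', '36017', '36019'],
})

# Rows become origins, columns become destinations, cells hold the flow.
od_matrix = flows.pivot(index='origin', columns='destination', values='flow')

print(od_matrix.shape)                         # (1, 5)
print(int(od_matrix.loc['36001', '36005']))    # 5
```

With only one origin in the sample, the matrix has a single row; the full data set would yield one row per origin tile.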


In [24]:
fdf.dtypes

flow            int64
origin         object
destination    object
dtype: object

In [25]:
# The tessellation is an attribute of the FlowDataFrame
fdf.tessellation.head() 

| | tile_ID | population | geometry |
|---|---|---|---|
| 0 | 36019 | 81716 | POLYGON ((-74.00667 44.88602, -74.02739 44.995... |
| 1 | 36101 | 99145 | POLYGON ((-77.09975 42.27421, -77.09966 42.272... |
| 2 | 36107 | 50872 | POLYGON ((-76.25015 42.29668, -76.24914 42.302... |
| 3 | 36059 | 1346176 | POLYGON ((-73.70766 40.72783, -73.70027 40.739... |
| 4 | 36011 | 79693 | POLYGON ((-76.27907 42.78587, -76.27535 42.780... |


### Plot the tessellation

In [26]:
fdf.plot_tessellation(popup_features=['tile_ID', 'population']) 



### Plot the flows

In [27]:
fdf.plot_flows(flow_color='green')

### Plot tessellation and flows

In [28]:
map_f = fdf.plot_tessellation(style_func_args={'color':'blue', 'fillColor':'blue'})
fdf[fdf['origin'] == '36061'].plot_flows(map_f=map_f, flow_exp=0., flow_popup=True)

