<img src="../../images/BikeDNA_logo.svg" width="250"  alt="BikeDNA logo" style="display:block; margin-left: auto; margin-right: auto;">
<a href="https://github.com/anerv/BikeDNA">Github</a>

# Example reference data preprocessing: City of Copenhagen

This notebook provides an example of how a spatial dataset with data on cycling infrastructure can be converted to the format required by BikeDNA. When using your own data, the preprocessing must be adapted to their content and format.

The data used in this notebook are from the City of Copenhagen and were downloaded from [opendata.dk](https://www.opendata.dk/city-of-copenhagen/cykeldata) under the [Open Data DK license](https://www.opendata.dk/open-data-dk/open-data-dk-licens).

As stated in the data set requirements, the reference data should:

- only contain **cycling infrastructure** (i.e. not also the regular street network)
- have all geometries as **LineStrings** (not MultiLineString)
- have, in each row, a **straight** LineString defined only by its start and end nodes
- have start/end nodes at **intersections**
- be in a **CRS** recognised by GeoPandas
- contain a column describing whether each feature is a physically **protected**/separated infrastructure or if it is **unprotected**
- contain a column describing whether each feature is **bidirectional** or not
- contain a column describing how features have been **digitized** ('geometry type')
- contain a column with a unique **ID** for each feature
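
Some of the requirements above (geometry type, CRS and unique ID) can be checked programmatically before running BikeDNA. The following is a minimal sketch and not part of BikeDNA: the helper name `check_reference_requirements`, its `id_col` parameter and the toy GeoDataFrame are all hypothetical and only serve as illustration.

```python
import geopandas as gpd
from shapely.geometry import LineString, MultiLineString


def check_reference_requirements(gdf, id_col):
    """Return a list of problems found in gdf (hypothetical helper, not a BikeDNA function)."""
    problems = []
    # All geometries must be LineStrings (no MultiLineStrings)
    geom_types = sorted(str(t) for t in gdf.geom_type.unique())
    if geom_types != ["LineString"]:
        problems.append(f"unexpected geometry types: {geom_types}")
    # A CRS must be defined
    if gdf.crs is None:
        problems.append("no CRS defined")
    # The ID column must exist and hold unique values
    if id_col not in gdf.columns:
        problems.append(f"missing ID column: {id_col}")
    elif not gdf[id_col].is_unique:
        problems.append(f"duplicate values in ID column: {id_col}")
    return problems


# Toy data made up for illustration: a duplicated ID and one MultiLineString
toy = gpd.GeoDataFrame(
    {"id": [1, 1]},
    geometry=[
        LineString([(0, 0), (1, 1)]),
        MultiLineString([[(1, 1), (2, 2)], [(3, 3), (4, 4)]]),
    ],
    crs="EPSG:4326",
)

problems = check_reference_requirements(toy, id_col="id")
print(problems)
```

An empty list means that these three checks passed. The remaining requirements (only cycling infrastructure, nodes at intersections, attribute columns) still need to be verified by inspecting the data.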

In [3]:
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import momepy
from shapely.ops import linemerge

from src import plotting_functions as pf
%run ../settings/tiledict.py

In [4]:
kk = gpd.GeoDataFrame.from_file(
    "data_ex1_cph_municipality/raw/cykeldata_kk/cykeldata_kkLine.shp"
)

kk.sample(10)

Unnamed: 0,id,rute_nr,rutenavn,status,kategori,under_kate,kommune,ogc_fid,geometry
497,783,,,Eksisterende,Cykelmulighed,Grøn,København,487,"LINESTRING (12.56631 55.62597, 12.56619 55.625..."
1931,1269,,,Eksisterende,Cykelsti,,København,1929,"LINESTRING (12.47163 55.71152, 12.47080 55.71129)"
152,5311,,,Eksisterende,Cykelmulighed,,København,150,"LINESTRING (12.54016 55.65140, 12.54046 55.651..."
1638,1095,,,Eksisterende,Cykelsti,P,København,1633,"LINESTRING (12.49370 55.64762, 12.49403 55.647..."
303,455,,,Eksisterende,Cykelmulighed,,København,299,"LINESTRING (12.52817 55.66393, 12.52756 55.663..."
1077,2823,,,Eksisterende,Cykelsti,P,København,1073,"LINESTRING (12.51437 55.69198, 12.51431 55.692..."
3204,3781,16.0,Søruten,Planlagt,Grøn,P,København,3202,"LINESTRING (12.58764 55.69968, 12.58612 55.699..."
1162,2938,,Planlagt i Cykelstiprioriteringsplan 2,Planlagt,Cykelsti,,København,1158,"LINESTRING (12.58191 55.70194, 12.58201 55.701..."
2312,1768,,,Eksisterende,Cykelsti,,København,2310,"LINESTRING (12.53690 55.71033, 12.53782 55.71040)"
1523,2842,,,Eksisterende,Cykelsti,P,København,1517,"LINESTRING (12.51314 55.69661, 12.51361 55.69692)"


Our dataset contains both physical infrastructure and other features such as bicycle routes. We are only interested in the physical infrastructure and thus need to select a subset of the data.

Some of the data might be outside of the study area we are interested in, but the data processing in notebook 2a will clip all data to the desired extent.

In [5]:
# Creating subset only with existing cycling infrastructure

kk_selection = kk.loc[
    (kk.kategori == "Cykelsti") & (kk.status == "Eksisterende")
].copy()

kk_selection.explore()

For all code to run without errors, our dataset can only contain LineString geometries. Let's check what we have:

In [6]:
kk_selection.geom_type.unique()

array(['LineString', 'MultiLineString'], dtype=object)

We have both LineStrings and MultiLineStrings here. To fix this, we first try to merge the MultiLineStrings. 
If some of the MultiLineStrings are not connected (i.e. there are gaps in the lines), the merge step will not be able to join them. In that case we can instead 'explode' them.
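
The difference between the two cases can be illustrated with a small sketch. The toy geometries are made up for illustration; `linemerge` joins parts that share an endpoint and leaves parts with a gap between them as a MultiLineString.

```python
from shapely.geometry import MultiLineString
from shapely.ops import linemerge

# Two parts sharing the endpoint (1, 0): these can be merged into one LineString
connected = MultiLineString([[(0, 0), (1, 0)], [(1, 0), (2, 0)]])
# Two parts separated by a gap: linemerge cannot join them
disconnected = MultiLineString([[(0, 0), (1, 0)], [(2, 0), (3, 0)]])

merged_connected = linemerge(connected)
merged_disconnected = linemerge(disconnected)

print(merged_connected.geom_type)     # LineString
print(merged_disconnected.geom_type)  # MultiLineString

# "Exploding" turns each remaining part into its own LineString
parts = list(merged_disconnected.geoms)
print([p.geom_type for p in parts])
```

On a GeoDataFrame, `explode` performs the last step row by row, so one row with a disconnected MultiLineString becomes several rows that share the same attributes.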

In [7]:
kk_linestrings = kk_selection.copy()
# Try to merge each MultiLineString into a single LineString
kk_linestrings["geometry"] = kk_linestrings["geometry"].apply(
    lambda x: linemerge(x) if x.geom_type == "MultiLineString" else x
)

# Disconnected MultiLineStrings cannot be merged, so explode whatever is left
if (
    len(kk_linestrings.geom_type.unique()) > 1
    or kk_linestrings.geom_type.unique()[0] != "LineString"
):

    print("Exploding MultiLineStrings...")
    kk_linestrings = kk_linestrings.explode(ignore_index=True)

assert len(kk_linestrings.geom_type.unique()) == 1
assert kk_linestrings.geom_type.unique()[0] == "LineString"
kk_linestrings.geom_type.unique()

array(['LineString'], dtype=object)

For the code to work, the data need to be in a CRS recognized by GeoPandas, and to have that CRS defined. Let's check that we have a CRS defined:

In [8]:
kk_linestrings.crs

<Geographic 2D CRS: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84" ...>
Name: WGS 84
Axis Info [ellipsoidal]:
- lon[east]: Longitude (degree)
- lat[north]: Latitude (degree)
Area of Use:
- undefined
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
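
If the check above had returned nothing, the CRS would have to be assigned before going further. The sketch below, using a made-up line, shows the difference between assigning a CRS with `set_crs` (coordinates unchanged) and reprojecting with `to_crs` (coordinates transformed). It assumes that the coordinates are known to be in WGS 84.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Toy line in longitude/latitude, created without any CRS information
gdf = gpd.GeoDataFrame(geometry=[LineString([(12.5, 55.6), (12.6, 55.7)])])
print(gdf.crs)  # None

# set_crs only labels the data; the coordinates stay as they are
gdf = gdf.set_crs("EPSG:4326")
print(gdf.crs.is_geographic)

# to_crs transforms the coordinates, here to the projected CRS ETRS89 / UTM zone 32N
projected = gdf.to_crs("EPSG:25832")
print(projected.crs.is_projected)

# In a projected CRS with metre units, lengths are measured in metres
length_m = projected.length.iloc[0]
print(round(length_m))
```

Using `set_crs` with the wrong code does not raise an error but places the data in the wrong location, so the CRS should be taken from the data provider's documentation.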

The analysis of data quality is based on the concept of a *network*. For the results to be accurate we need a dataset with nodes at intersections (i.e. where the lines defining the cycling infrastructure intersect).

Use the folium plot below to check that you do have nodes at intersections.
If not, this has to be fixed. Otherwise, the missing nodes will show up as low data quality in the analysis.

Don't worry if there are more nodes than just those at intersections and start/end points - we will take care of that in the data loading notebook.
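
As a complement to the visual check, missing nodes can also be searched for in code. The following is a simplified sketch on made-up coordinate lists, not a BikeDNA function. It assumes that network nodes are created only at the first and last coordinate of each line, and it only finds intersections that coincide with an existing vertex of the unsplit line.

```python
from collections import Counter

# Toy lines as coordinate sequences (made up for illustration)
lines = {
    "a": [(0, 0), (1, 0), (2, 0)],  # passes through (1, 0) without being split there
    "b": [(1, 0), (1, 1)],          # starts at (1, 0)
    "c": [(2, 0), (2, 1)],          # starts where "a" ends
}

# Count how many lines start or end at each coordinate
endpoint_degree = Counter()
for coords in lines.values():
    endpoint_degree[coords[0]] += 1
    endpoint_degree[coords[-1]] += 1

# An interior vertex that is also an endpoint of another line marks an
# intersection where the first line has no node
unsplit = sorted(
    (name, pt)
    for name, coords in lines.items()
    for pt in coords[1:-1]
    if pt in endpoint_degree
)
print(unsplit)
```

Here line "a" would need to be split at (1, 0) for the network to register the junction with line "b".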

In [9]:
G = momepy.gdf_to_nx(
    kk_linestrings.to_crs("EPSG:25832"), approach="primal", directed=True
)  # We reproject the network data to avoid warnings - final reprojection will happen later

nodes, edges = momepy.nx_to_gdf(G)

# Feature groups for the reference data
edges_folium = pf.make_edgefeaturegroup(
    gdf=edges, mycolor="black", myweight=2, nametag="edges", show_edges=True
)

nodes_folium = pf.make_nodefeaturegroup(
    gdf=nodes, mycolor="red", mysize=2, nametag="nodes", show_nodes=True
)

feature_groups = [edges_folium, nodes_folium]

m = pf.make_foliumplot(
    feature_groups=feature_groups,
    layers_dict=folium_layers,
    center_gdf=nodes,
    center_crs=nodes.crs,
)

display(m)

We don't technically need to drop any unnecessary columns, but let's avoid loading unnecessary data later on.

In [10]:
kk_linestrings.columns

Index(['id', 'rute_nr', 'rutenavn', 'status', 'kategori', 'under_kate',
       'kommune', 'ogc_fid', 'geometry'],
      dtype='object')

In [11]:
kk.under_kate.unique()

array([None, 'Grøn', 'P', 'Cykelsti', 'P Cykelsti'], dtype=object)

In [12]:
# Drop unnecessary columns

kk_linestrings.drop(
    ["rute_nr", "rutenavn", "under_kate", "kommune", "status"], axis=1, inplace=True
)

For this dataset we assume all features to be center line mappings and bidirectional, so we can specify this in the config file and do not have to add it to the data.

The rest of the pre-processing, such as projecting to the chosen CRS, clipping the data to the study area etc. will happen in [notebook 2a](../REFERENCE/2a_initialize_reference.ipynb).

**Final dataset**

In [13]:
kk_linestrings.sample(10)

Unnamed: 0,id,kategori,ogc_fid,geometry
1598,1660,Cykelsti,1593,"LINESTRING (12.61334 55.64854, 12.61334 55.648..."
2535,2159,Cykelsti,2535,"LINESTRING (12.56790 55.71414, 12.56845 55.714..."
1021,2768,Cykelsti,1019,"LINESTRING (12.56446 55.64294, 12.56414 55.64234)"
1005,2754,Cykelsti,1003,"LINESTRING (12.53342 55.70179, 12.53326 55.701..."
2667,2401,Cykelsti,2665,"LINESTRING (12.51535 55.70873, 12.51553 55.709..."
2064,1439,Cykelsti,2063,"LINESTRING (12.59267 55.65869, 12.59253 55.658..."
1389,3278,Cykelsti,1383,"LINESTRING (12.50203 55.66728, 12.50144 55.66740)"
1280,3137,Cykelsti,1275,"LINESTRING (12.58131 55.68130, 12.58130 55.681..."
1481,5304,Cykelsti,1476,"LINESTRING (12.54290 55.68794, 12.54204 55.68836)"
2033,1412,Cykelsti,2034,"LINESTRING (12.62873 55.66570, 12.62880 55.66563)"


*Export dataset*

In [15]:
kk_linestrings.to_file("data_ex1_cph_municipality/processed/cph_cycling_infra.gpkg", driver="GPKG")