<h1>BikeDNA</h1>
<a href="https://github.com/anerv/BikeDNA">GitHub</a>

# Example reference data preprocessing: City of Copenhagen

This notebook provides an example of how a spatial dataset with data on cycling infrastructure can be converted to the format required by BikeDNA. When using your own data, the preprocessing must be adapted to their content and format.

The data used in this notebook are from the City of Copenhagen and were downloaded from [opendata.dk](https://www.opendata.dk/city-of-copenhagen/cykeldata) under the [Open Data DK license](https://www.opendata.dk/open-data-dk/open-data-dk-licens).

As stated in the data set requirements, the reference data should:

- only contain **cycling infrastructure** (i.e. not also the regular street network)
- have all geometries as **LineStrings** (not MultiLineString)
- have each geometry as a **straight** LineString defined only by its start and end nodes
- have start/end nodes at **intersections**
- be in a **CRS** recognised by GeoPandas
- contain a column describing whether each feature is a physically **protected**/separated infrastructure or if it is **unprotected**
- contain a column describing whether each feature is **bidirectional** or not
- contain a column describing how features have been **digitized** ('geometry type')
- contain a column with a unique **ID** for each feature
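
Several of these requirements can be checked programmatically before running the rest of the workflow. The following is a minimal sketch, not part of BikeDNA itself: the helper name `check_reference_data` and the toy data are hypothetical, and it only covers the geometry type, CRS and unique ID requirements.

```python
import geopandas as gpd
from shapely.geometry import LineString, MultiLineString


def check_reference_data(gdf, id_col):
    """Hypothetical helper: return a list of problems (empty if all checks pass)."""
    problems = []

    # A CRS must be defined so the data can be reprojected later
    if gdf.crs is None:
        problems.append("no CRS defined")

    # Only LineString geometries are allowed
    geom_types = set(gdf.geom_type.unique())
    if geom_types != {"LineString"}:
        problems.append(f"unexpected geometry types: {sorted(map(str, geom_types))}")

    # Each feature needs a unique ID
    if id_col not in gdf.columns:
        problems.append(f"missing ID column '{id_col}'")
    elif not gdf[id_col].is_unique:
        problems.append(f"duplicate values in ID column '{id_col}'")

    return problems


# Toy data, made up for illustration
good = gpd.GeoDataFrame(
    {"id": [1, 2]},
    geometry=[LineString([(0, 0), (1, 1)]), LineString([(1, 1), (2, 1)])],
    crs="EPSG:4326",
)
bad = gpd.GeoDataFrame(
    {"id": [1, 1]},
    geometry=[
        LineString([(0, 0), (1, 1)]),
        MultiLineString([[(0, 0), (1, 0)], [(5, 0), (6, 0)]]),
    ],
)

print(check_reference_data(good, "id"))
print(check_reference_data(bad, "id"))
```

The first call should report no problems, while the second should report a missing CRS, a MultiLineString geometry and duplicate IDs. The remaining requirements (nodes at intersections, attribute columns) depend on the content of the data and are easier to judge by inspection.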

In [13]:
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import momepy
from shapely.ops import linemerge

from src import plotting_functions as pf
%run ../settings/tiledict.py

In [14]:
kk = gpd.GeoDataFrame.from_file(
    "ex1_cph_municipality/raw/cykeldata_kk/cykeldata_kkLine.shp"
)

kk.sample(10)

Unnamed: 0,id,rute_nr,rutenavn,status,kategori,under_kate,kommune,ogc_fid,geometry
1592,1479,,,Eksisterende,Cykelsti,P,København,1587,"LINESTRING (12.61227 55.65090, 12.61227 55.650..."
1350,3237,,,Eksisterende,Cykelsti,,København,1346,"LINESTRING (12.60193 55.67726, 12.60195 55.677..."
152,5311,,,Eksisterende,Cykelmulighed,,København,150,"LINESTRING (12.54016 55.65140, 12.54046 55.651..."
1682,1175,,,Eksisterende,Cykelsti,P,København,1676,"LINESTRING (12.48849 55.67444, 12.48847 55.674..."
3213,4785,C77,Ishoejruten,Planlagt,Supercykelsti,,København,3205,"LINESTRING (12.57219 55.67359, 12.57211 55.673..."
1717,1161,,,Eksisterende,Cykelsti,,København,1868,"LINESTRING (12.48523 55.67864, 12.48475 55.680..."
1926,1258,,,Eksisterende,Cykelsti,,København,1924,"LINESTRING (12.46445 55.70946, 12.46334 55.709..."
3297,5188,,oelstykke - Hilleroed (Overdrevsvej),Planlagt,Supercykelsti,,Egedal,3296,"LINESTRING (12.16402 55.79313, 12.16426 55.793..."
140,828,,,Eksisterende,Cykelmulighed,Grøn,København,138,"LINESTRING (12.59540 55.69337, 12.59541 55.693..."
2486,2067,,,Eksisterende,Cykelsti,,København,2486,"LINESTRING (12.57534 55.69327, 12.57537 55.693..."


Our dataset contains both physical infrastructure and other features such as bicycle routes. We are only interested in the physical infrastructure and thus need to select a subset of the data.

Some of the data might be outside of the study area we are interested in, but the data processing in notebook 2a will clip all data to the desired extent.

In [15]:
# Creating subset only with existing cycling infrastructure

kk_selection = kk.loc[
    (kk.kategori == "Cykelsti") & (kk.status == "Eksisterende")
].copy()

kk_selection.explore()

For all code to run without errors, our dataset can only contain LineString geometries. Let's check what we have:

In [16]:
kk_selection.geom_type.unique()

array(['LineString', 'MultiLineString'], dtype=object)

We have both LineStrings and MultiLineStrings here. To fix this, we first try to merge the MultiLineStrings.
If some of the MultiLineStrings are not connected (i.e. there are gaps in the lines), the merge step will not be able to join them. In that case we can instead 'explode' them.

In [17]:
kk_linestrings = kk_selection.copy()
# Convert MultiLineStrings to LineString
kk_linestrings["geometry"] = kk_linestrings["geometry"].apply(
    lambda x: linemerge(x) if x.geom_type == "MultiLineString" else x
)

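# If any MultiLineStrings could not be merged (e.g. because of gaps),
# fall back to exploding the original selection into its component LineStrings
if (
    len(kk_linestrings.geom_type.unique()) > 1
    or kk_linestrings.geom_type.unique()[0] != "LineString"
):
    print("Exploding MultiLineStrings...")
    kk_linestrings = kk_selection.explode(ignore_index=True)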

assert len(kk_linestrings.geom_type.unique()) == 1
assert kk_linestrings.geom_type.unique()[0] == "LineString"
kk_linestrings.geom_type.unique()

array(['LineString'], dtype=object)
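
To illustrate why merging alone is not always enough, here is a small sketch on toy geometries (the coordinates are made up). It shows `linemerge` joining a connected MultiLineString into a single LineString, while a MultiLineString with a gap stays a MultiLineString and has to be split into its parts instead.

```python
from shapely.geometry import MultiLineString
from shapely.ops import linemerge

# Two segments that share the node (1, 0)
connected = MultiLineString([[(0, 0), (1, 0)], [(1, 0), (2, 0)]])
# Two segments separated by a gap
gapped = MultiLineString([[(0, 0), (1, 0)], [(5, 0), (6, 0)]])

merged_connected = linemerge(connected)
merged_gapped = linemerge(gapped)

print(merged_connected.geom_type)  # LineString
print(merged_gapped.geom_type)  # MultiLineString

# Splitting the unmerged geometry into its component LineStrings,
# which is what exploding does row by row in a GeoDataFrame
parts = list(merged_gapped.geoms)
print([part.geom_type for part in parts])  # ['LineString', 'LineString']
```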

For the code to work, the data need to be in a CRS recognized by GeoPandas, and to have that CRS defined. Let's check that we have a CRS defined:

In [18]:
kk_linestrings.crs

<Geographic 2D CRS: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84" ...>
Name: WGS 84
Axis Info [ellipsoidal]:
- lon[east]: Longitude (degree)
- lat[north]: Latitude (degree)
Area of Use:
- undefined
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
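
The dataset used here already has a CRS defined. For data where `crs` returns `None`, the sketch below shows one way the CRS could be assigned. The toy geometry and the choice of EPSG:4326 are assumptions for illustration only; the correct CRS must be taken from the data provider's documentation.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Toy GeoDataFrame without a CRS (coordinates made up for illustration)
toy = gpd.GeoDataFrame(
    {"id": [1]}, geometry=[LineString([(12.5, 55.6), (12.6, 55.7)])]
)
print(toy.crs)  # None

# set_crs only labels the existing coordinates; it does not change them.
# EPSG:4326 is an assumption here, not something that can be guessed in general.
if toy.crs is None:
    toy = toy.set_crs("EPSG:4326")

# to_crs, in contrast, transforms the coordinates to another CRS
projected = toy.to_crs("EPSG:25832")
print(projected.crs.to_epsg())
```

Assigning the wrong CRS with `set_crs` will not raise an error but will place the data in the wrong location, so it is worth plotting the result on a basemap afterwards.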

The analysis of data quality is based on the concept of a *network*. For the results to be accurate we need a dataset with nodes at intersections (i.e. where the lines defining the cycling infrastructure intersect).

Use the folium plot below to check that you do have nodes at intersections.
If not, this should be fixed. Otherwise, the missing nodes will show up as an aspect of low data quality in the analysis.

Don't worry if there are more nodes than just those at intersections and start/end points - we will take care of that in the data loading notebook.

In [19]:
G = momepy.gdf_to_nx(
    kk_linestrings.to_crs("EPSG:25832"), approach="primal", directed=True
)  # We reproject the network data to avoid warnings - final reprojection will happen later

nodes, edges = momepy.nx_to_gdf(G)

# Feature groups for the reference data
edges_folium = pf.make_edgefeaturegroup(
    gdf=edges, mycolor="black", myweight=2, nametag="edges", show_edges=True
)

nodes_folium = pf.make_nodefeaturegroup(
    gdf=nodes, mycolor="red", mysize=2, nametag="nodes", show_nodes=True
)

feature_groups = [edges_folium, nodes_folium]

m = pf.make_foliumplot(
    feature_groups=feature_groups,
    layers_dict=folium_layers,
    center_gdf=nodes,
    center_crs=nodes.crs,
)

display(m)
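
As a complement to the visual check, lines that meet without a shared start or end node can also be flagged programmatically. The following is a rough sketch on toy geometries, not part of BikeDNA: the helper names `endpoints` and `find_missing_nodes` are hypothetical, the comparison is pairwise and therefore only practical for small datasets, and overlapping lines are not handled.

```python
from itertools import combinations

from shapely.geometry import LineString


def endpoints(line):
    """Return the start and end coordinates of a LineString as a set."""
    coords = list(line.coords)
    return {coords[0], coords[-1]}


def find_missing_nodes(lines):
    """Return index pairs of lines that meet without a shared endpoint."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(lines), 2):
        # X-shaped crossing somewhere along both lines
        crossing = a.crosses(b)
        # One line ends on the other, but the two have no endpoint in common
        t_junction = a.touches(b) and not (endpoints(a) & endpoints(b))
        if crossing or t_junction:
            flagged.append((i, j))
    return flagged


lines = [
    LineString([(0, 0), (2, 2)]),  # 0
    LineString([(0, 2), (2, 0)]),  # 1: crosses line 0 at (1, 1), no node there
    LineString([(2, 2), (3, 2)]),  # 2: shares an endpoint with line 0
    LineString([(5, 0), (9, 0)]),  # 3
    LineString([(7, 0), (7, 3)]),  # 4: ends in the middle of line 3
]

print(find_missing_nodes(lines))  # [(0, 1), (3, 4)]
```

Applied to real data, the input could be `list(kk_linestrings.geometry)`. Flagged pairs are not necessarily errors, since bridges and tunnels legitimately cross without intersecting, so the results still need to be reviewed on the map.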

We don't technically need to drop any unnecessary columns, but let's avoid loading unnecessary data later on.

In [20]:
kk_linestrings.columns

Index(['id', 'rute_nr', 'rutenavn', 'status', 'kategori', 'under_kate',
       'kommune', 'ogc_fid', 'geometry'],
      dtype='object')

In [21]:
kk.under_kate.unique()

array([None, 'Grøn', 'P', 'Cykelsti', 'P Cykelsti'], dtype=object)

In [22]:
# Drop unnecessary columns

kk_linestrings.drop(
    ["rute_nr", "rutenavn", "under_kate", "kommune", "status"], axis=1, inplace=True
)

For this dataset we assume that all features are mapped as center lines and are bidirectional, so we can specify this in the config file and do not have to add it to the data.

The rest of the pre-processing, such as projecting to the chosen CRS, clipping the data to the study area etc. will happen in [notebook 2a](../REFERENCE/2a_initialize_reference.ipynb).

**Final dataset**

In [23]:
kk_linestrings.sample(10)

Unnamed: 0,id,kategori,ogc_fid,geometry
576,2493,Cykelsti,570,"LINESTRING (12.51616 55.66435, 12.51614 55.664..."
1425,3316,Cykelsti,1420,"LINESTRING (12.62675 55.64947, 12.62689 55.649..."
2306,1760,Cykelsti,2304,"LINESTRING (12.56565 55.69331, 12.56620 55.693..."
1310,3175,Cykelsti,1307,"LINESTRING (12.59780 55.67450, 12.59782 55.674..."
2128,1521,Cykelsti,2125,"LINESTRING (12.60088 55.66397, 12.60089 55.663..."
2466,2020,Cykelsti,2466,"LINESTRING (12.57074 55.68271, 12.57068 55.682..."
2429,1982,Cykelsti,2429,"LINESTRING (12.57438 55.68942, 12.57479 55.689..."
1191,2966,Cykelsti,1186,"LINESTRING (12.48998 55.70597, 12.48979 55.706..."
744,2390,Cykelsti,739,"LINESTRING (12.51423 55.70649, 12.51361 55.706..."
2665,2399,Cykelsti,2663,"LINESTRING (12.51482 55.70771, 12.51500 55.70807)"


**Export dataset**

In [24]:
kk_linestrings.to_file("ex1_cph_municipality/processed/cph_cycling_infra.gpkg", driver="GPKG")