Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
228 lines (137 sloc) 8.04 KB

From OpenData to OpenDataSet

Context:

In supervised learning, you can't expect to obtain a good trained model from inacurate labels: Garbage In, Garbage Out. Data preparation, however, can be tedious if you don't use efficient tools.

Even though OpenData datasets are widely available, those reliable enough to be used verbatim to train decents models are still scarse. Even with state of art model training algorithms, best results are only achieved by people who can afford to tag manually, with pixel accuracy, their own datasets.

So how can we load and qualify OpenData sets in order to create our own training samples ? That's what this tutorial is about !

Retrieve OpenData:

We use OpenData from Grand Lyon metropole because it offers recent imagery and several vector layers throught standard Web Services.

First step is to set the spatial extent and the zoom level:

rsp cover --bbox 4.795,45.628,4.935,45.853 --zoom 18 ds/cover

Not to forget to create an ad hoc RoboSat.pink config file:

echo '
[[channels]]
  name   = "images"
  bands = [1, 2, 3]

[[classes]]
  title = "Building"
  color = "deeppink"

[model]
  nn = "Albunet"
  loss = "Lovasz"
  loader = "SemSegTiles"
  da = "Strong"
  bs = 4
  lr = 0.000025
  metrics = ["iou"]
' > ~/.rsp_config

Download imagery using WMS:

rsp download --type WMS 'https://download.data.grandlyon.com/wms/grandlyon?SERVICE=WMS&REQUEST=GetMap&VERSION=1.3.0&LAYERS=Ortho2015_vue_ensemble_16cm_CC46&WIDTH=512&HEIGHT=512&CRS=EPSG:3857&BBOX={xmin},{ymin},{xmax},{ymax}&FORMAT=image/jpeg' ds/cover ds/images

NOTA:

  • Retina resolution of 512px is preferred to a regular 256px, because it improves the training accuracy.
  • Roofprint choice is important here, as we use aerial imagery to retrieve patterns. If we used the building's footprints instead, the training accuracy would be poorer.

Then to download buildings vector roofprints with WFS,

wget -O ds/lyon_roofprint.json 'https://download.data.grandlyon.com/wfs/grandlyon?SERVICE=WFS&REQUEST=GetFeature&TYPENAME=ms:fpc_fond_plan_communaut.fpctoit&VERSION=1.1.0&srsName=EPSG:4326&outputFormat=application/json; subtype=geojson'

Prepare DataSet

To transform the vector roofprints to raster labels:

rsp rasterize --type Building --geojson ds/lyon_roofprint.json --cover ds/cover ds/labels

Then to create a training and validation dataset, with imagery and related roofprint labels:

rsp cover --dir ds/images --splits 70/30 ds/training/cover ds/validation/cover
rsp subset --dir ds/images --cover ds/training/cover  ds/training/images
rsp subset --dir ds/labels --cover ds/training/cover  ds/training/labels
rsp subset --dir ds/images --cover ds/validation/cover  ds/validation/images
rsp subset --dir ds/labels --cover ds/validation/cover  ds/validation/labels

Train

Now to launch a first model training:

rsp train --epochs 10 ds ds/model

After ten epochs only, the building IoU metric on validation dataset is about 0.80. It's already a fair result with plain Open Data, but we will see below how to improve it again.

Predictive masks

To create predictive masks from our first model, on the entire coverage:

rsp predict --checkpoint ds/model/checkpoint-00010.pth ds ds/masks

Compare

Then to assess how our first model behaves with this raw data, we compute a composite stack image with imagery, label and predicted mask.

Pixel colors legend:

  • pink: predicted by the model (but not present on the initial labels)
  • green: present on the labels (but not predicted by the model)
  • grey: both model prediction and labels are agree.
rsp compare --images ds/images ds/labels ds/masks --mode stack --labels ds/labels --masks ds/masks ds/compare

rsp compare --mode list --labels ds/labels --maximum_qod 80 --minimum_fg 5 --masks ds/masks --geojson ds/compare/tiles.json

We also run a csv list diff, in order to pick tiles with a low Quality of Data metric (here below 80% on QoD metric) and with at least a handful of buildings' pixels assumed to lie within the tile (5% of foreground building at minimum).

If we zoom back on the map, we can see tiles matching the previous filters:

It is obvious that some areas are not labelled correctly in the original OpenData, so we ought to remove them from the training and validation dataset.

To do so, first step is to select the wrongly labelled tiles. The "compare" tool is again helpful, as it allows to check several tiles side by side, and to manually select those we want to keep.

rsp compare --mode side --images ds/images ds/compare --labels ds/labels --maximum_qod 80 --minimum_fg 5 --masks ds/masks  ds/compare_side

Filter

The compare selection produces a csv cover list into the clipboard. We store the result in ds/to_remove.cover

wget -O ds/to_remove.cover http://datapink.tools/rsp/opendata_to_opendataset/to_remove.cover

Then we just remove all the related tiles from the dataset:

rsp subset --dir ds/training/images --cover ds/to_remove.cover --delete > /dev/null
rsp subset --dir ds/training/labels --cover ds/to_remove.cover --delete > /dev/null
rsp subset --dir ds/validation/images --cover ds/to_remove.cover --delete > /dev/null
rsp subset --dir ds/validation/labels --cover ds/to_remove.cover --delete > /dev/null

For information, we remove about 500 tiles from this raw dataset in order to clean it up.

Train

Having a cleaner training and validation dataset, we can launch a second and longer training:

rsp train --epochs 100 ds ds/model_clean

Building IoU metrics on validation dataset:

  • After 10 epochs: 0.82
  • After 100 epochs: 0.86

Predict and compare

And now to generate masks prediction, and compare composite images, as previously done:

rsp predict --checkpoint ds/model_clean/checkpoint-00100.pth ds ds/masks_clean

rsp compare --images ds/images ds/labels ds/masks_clean --mode stack --labels ds/labels --masks ds/masks_clean ds/compare_clean

rsp compare --mode list --labels ds/labels --maximum_qod 80 --minimum_fg 5 --masks ds/masks_clean --geojson ds/compare_clean/tiles.json

And to compare only with filtered validation tiles, in side by side mode:

rsp cover --dir ds/validation/images  ds/validation/cover.clean

rsp subset --dir ds/compare_clean --cover ds/validation/cover.clean ds/validation/compare_clean

rsp subset --dir ds/masks_clean --cover ds/validation/cover.clean ds/validation/masks_clean

rsp compare --mode side --images ds/validation/images ds/validation/compare_clean --labels ds/validation/labels --masks ds/validation/masks_clean ds/validation/compare_side_clean

You can’t perform that action at this time.