
Text corrections :)

Veence committed Feb 5, 2019
Context:
-------

Data preparation can be a tedious and time-consuming task if your tools are not abstract and efficient enough.
In supervised learning, you can't expect to obtain a good trained model from inaccurate labels: Garbage In, Garbage Out.

Even though OpenData is widely available, datasets reliable enough to be used verbatim to train decent models are still scarce.
Even with state-of-the-art training algorithms, the best results are achieved only by those who can afford to label their own datasets by hand, with pixel accuracy.

So how can we retrieve and qualify OpenData in order to create our own training dataset? That's what this tutorial is about!



Retrieve OpenData:
------------------

We chose to use OpenData from <a href="https://rdata-grandlyon.readthedocs.io/en/latest/">the Grand Lyon metropole</a> because it provides recent imagery and several vector layers through standardized web services.



The first step is to define the geospatial extent of the coverage and a <a href="https://wiki.openstreetmap.org/wiki/Zoom_levels">zoom level</a>:

```
rsp cover --zoom 18 --type bbox 4.795,45.628,4.935,45.853 ~/rsp_dataset/cover
```


Then, to download the imagery through <a href="https://www.opengeospatial.org/standards/wms">WMS</a>:

```
rsp download --type WMS 'https://download.data.grandlyon.com/wms/grandlyon?SERVICE=WMS&REQUEST=GetMap&VERSION=1.3.0&LAYERS=Ortho2015_vue_ensemble_16cm_CC46&WIDTH=512&HEIGHT=512&CRS=EPSG:3857&BBOX={xmin},{ymin},{xmax},{ymax}&FORMAT=image/jpeg' --web_ui --ext jpeg ~/rsp_dataset/cover ~/rsp_dataset/images
```

NOTA:
- A retina resolution of 512px is preferred to the regular 256px, as it improves training accuracy.
- Relaunch this command if any tile download fails, until the whole coverage is downloaded.
- JPEG is preferred over the default WebP only because a few browsers still don't handle the WebP format.
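The relaunch step can be scripted as a small retry wrapper. This is only a sketch, and it assumes the wrapped command (e.g. `rsp download`) exits with a non-zero status when any tile fails:

```
# retry: rerun a command until it succeeds.
# Assumes the wrapped command exits non-zero on failure.
retry() {
  until "$@"; do
    echo "retrying: $*" >&2
    sleep 1
  done
}

# retry rsp download --type WMS '…' --web_ui --ext jpeg ~/rsp_dataset/cover ~/rsp_dataset/images
```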



<a href="http://www.datapink.tools/rsp/opendata_to_opendataset/images/"><img src="img/from_opendata_to_opendataset/images.png" /></a>


Then, to download the buildings' vector roofprints through <a href="https://www.opengeospatial.org/standards/wfs">WFS</a>:

```
wget -O ~/rsp_dataset/lyon_roofprint.json 'https://download.data.grandlyon.com/wfs/grandlyon?SERVICE=WFS&REQUEST=GetFeature&TYPENAME=ms:fpc_fond_plan_communaut.fpctoit&VERSION=1.1.0&srsName=EPSG:4326&outputFormat=application/json; subtype=geojson'
```
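A quick sanity check that the download is a valid GeoJSON FeatureCollection can help before going further. Here `count_features` is a hypothetical helper, not part of rsp, and it assumes `python3` is on the PATH:

```
# count_features: print the number of features in a GeoJSON file.
# Hypothetical helper; assumes python3 is available.
count_features() {
  python3 -c 'import json, sys; print(len(json.load(open(sys.argv[1]))["features"]))' "$1"
}

# count_features ~/rsp_dataset/lyon_roofprint.json
```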

The choice of roofprints is meaningful here, as we use aerial imagery to retrieve patterns. If we used the buildings' footprints instead, the training accuracy would be poorer.




Prepare DataSet
----------------

Now to transform the vector roofprints into raster labels:

```
rsp rasterize --config config.toml --zoom 18 --web_ui ~/rsp_dataset/lyon_roofprint.json ~/rsp_dataset/cover ~/rsp_dataset/labels
rsp subset --web_ui --dir ~/rsp_dataset/images --cover ~/rsp_dataset/validation/cover --out ~/rsp_dataset/validation/images
rsp subset --web_ui --dir ~/rsp_dataset/labels --cover ~/rsp_dataset/validation/cover --out ~/rsp_dataset/validation/labels
```

Two points to emphasise here:
- It's a good idea to keep enough data for the validation part (here we used a 70/30 split).
- The shuffle step helps to reduce spatial bias in the training/validation sets.
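The shuffle-and-split idea can be sketched in plain shell. The `split_cover` function below is a hypothetical illustration of the dataset split step, not the actual rsp command, and it assumes GNU `shuf` is available:

```
# split_cover: shuffle a cover list, then split it 70/30 into
# training and validation lists. Hypothetical helper; shuffling
# first reduces spatial bias between the two sets.
# Usage: split_cover <cover> <train_out> <val_out>
split_cover() {
  shuf "$1" > "$1.shuffled"
  total=$(wc -l < "$1.shuffled")
  train=$(( total * 70 / 100 ))
  head -n "$train" "$1.shuffled" > "$2"
  tail -n +"$(( train + 1 ))" "$1.shuffled" > "$3"
}
```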


Train
-----

Now to launch a first model training:

```
rsp train --config config.toml ~/rsp_dataset/pth
```

After only ten epochs, the building IoU metric on the validation dataset is about **0.82**.
That's already a fair result, at the state of the art, with real-world data, but we will see how to improve it.




Predictive masks
----------------

To create predictive masks from our first model, over the entire coverage:

```
rsp predict --config config.toml --checkpoint ~/rsp_dataset/pth/checkpoint-00010-of-00010.pth --web_ui ~/rsp_dataset/images ~/rsp_dataset/masks
```

Compare
-------

Then, to assess how our first model behaves on this raw data, we compute a composite stack image with imagery, label and predicted mask.

The colours of the patches mean:
- pink: predicted by the model (but not present in the initial labels)
- green: present in the labels (but not predicted by the model)
- grey: model prediction and labels agree.




<a href="http://www.datapink.tools/rsp/opendata_to_opendataset/compare/"><img src="img/from_opendata_to_opendataset/compare.png" /></a>

We also run a csv list diff, in order to keep only tiles with a low Quality of Data metric (below 80% QoD as a threshold) and with at least a few building pixels supposed to be present in the tile (5% of foreground building as a threshold).

If we zoom back out on the map, we can see the tiles matching the previous filters:


<img src="img/from_opendata_to_opendataset/compare_zoom_out.png" />


It becomes clear that some areas are not labelled correctly in the original OpenData, so we have to remove them from the training and validation datasets.

To do so, the first step is to select the wrongly labelled tiles. The compare tool is again helpful,
as it allows us to check several tile directories side by side and to select by hand those we want to discard.

```
rsp compare --mode side --images ~/rsp_dataset/images ~/rsp_dataset/compare --labels ~/rsp_dataset/labels --maximum_qod 80 --minimum_fg 5 --masks ~/rsp_dataset/masks --config config.toml --ext jpeg --web_ui ~/rsp_dataset/compare_side
```

Filter
------

The compare selection produces a csv cover list in the clipboard.
We store the result in `~/rsp_dataset/cover.to_remove`.
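On a Linux desktop, the clipboard content can be written to that file with a small helper. This is a sketch: `save_cover` is a hypothetical function, and `xclip` (X11) is an assumption — `pbpaste` would play the same role on macOS:

```
# save_cover: write stdin to the cover list path (hypothetical helper).
save_cover() {
  mkdir -p ~/rsp_dataset
  cat > ~/rsp_dataset/cover.to_remove
}

# xclip -selection clipboard -o | save_cover
```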

Then we just remove all these tiles from the dataset:
```
rsp subset --mode delete --dir ~/rsp_dataset/training/images --cover ~/rsp_dataset/cover.to_remove > /dev/null
rsp subset --mode delete --dir ~/rsp_dataset/training/labels --cover ~/rsp_dataset/cover.to_remove > /dev/null
rsp subset --mode delete --dir ~/rsp_dataset/validation/images --cover ~/rsp_dataset/cover.to_remove > /dev/null
rsp subset --mode delete --dir ~/rsp_dataset/validation/labels --cover ~/rsp_dataset/cover.to_remove > /dev/null
```

For information, we remove about 500 tiles from this raw dataset in order to clean up obviously inconsistent labelling.


Train
-----

Then, with a cleaner training and validation dataset, we can launch a new, longer training:

```
rsp train --config config.toml --epochs 100 ~/rsp_dataset/pth_clean
```

Predict and compare
-------------------

And now we generate the prediction masks and the compare composite images, as previously:

```
rsp predict --config config.toml --checkpoint ~/rsp_dataset/pth_clean/checkpoint-00100-of-00100.pth ~/rsp_dataset/images ~/rsp_dataset/masks_clean
```
