<img src="../../../img/logo-bdc.png" align="right" width="64"/>

# <span style="color:#336699">Labelling data points with WLTS</span>
<hr style="border:2px solid #0077b9;">

<div style="text-align: left;">
    <a href="https://nbviewer.jupyter.org/github/brazil-data-cube/code-gallery/blob/master/jupyter/Python/wlts/wlts-introduction.ipynb"><img src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg" align="center"/></a>
</div>

<br/>

<div style="text-align: center;font-size: 90%;">
    Fabiana Zioti<sup><a href="https://orcid.org/0000-0002-7305-6043"><i class="fab fa-lg fa-orcid" style="color: #a6ce39"></i></a></sup>, Felipe Menino Carlos<sup><a href="https://orcid.org/0000-0002-3334-4315"><i class="fab fa-lg fa-orcid" style="color: #a6ce39"></i></a></sup>, Karine Reis Ferreira<sup><a href="https://orcid.org/0000-0003-2656-5504"><i class="fab fa-lg fa-orcid" style="color: #a6ce39"></i></a></sup>, Gilberto R. Queiroz<sup><a href="https://orcid.org/0000-0001-7534-0219"><i class="fab fa-lg fa-orcid" style="color: #a6ce39"></i></a></sup>
    <br/><br/>
    Earth Observation and Geoinformatics Division, National Institute for Space Research (INPE)
    <br/>
    Avenida dos Astronautas, 1758, Jardim da Granja, São José dos Campos, SP 12227-010, Brazil
    <br/><br/>
    Contact: <a href="mailto:brazildatacube@inpe.br">brazildatacube@inpe.br</a>
    <br/><br/>
    Last Update: March 24, 2021
</div>

<br/>

<div style="text-align: justify;  margin-left: 25%; margin-right: 25%;">
<b>Abstract.</b> This Jupyter Notebook shows how to use the WLTS API to label a set of data points with two well-known land use and land cover collections: Projeto de Mapeamento Anual da Cobertura e Uso do Solo no Brasil (MapBiomas) version 5 - Mapa de uso e cobertura da Terra and Instituto Brasileiro de Geografia e Estatística (IBGE) - Monitoramento e uso da Terra. The data points were created using a regular grid. After labelling these data points, the example illustrates a possible scenario for comparing the class agreement between the collections at these points. Finally, the data points on which both collections agree are used to compute a water mask over a region of interest.
</div>    

<br/>

<div style="text-align: justify;  margin-left: 15%; margin-right: 15%;font-size: 75%; border-style: solid; border-color: #0077b9; border-width: 1px; padding: 5px;">
    <b>This Jupyter Notebook is a supplement to the following papers:</b>
    <div style="margin-left: 10px; margin-right: 10px; margin-top:10px">
      <p> Ferreira, K.R.; Queiroz, G.R.; Vinhas, L.; Marujo, R.F.B.; Simoes, R.E.O.; Picoli, M.C.A.; Camara, G.; Cartaxo, R.; Gomes, V.C.F.; Santos, L.A.; Sanchez, A.H.; Arcanjo, J.S.; Fronza, J.G.; Noronha, C.A.; Costa, R.W.; Zaglia, M.C.; Zioti, F.; Korting, T.S.; Soares, A.R.; Chaves, M.E.D.; Fonseca, L.M.G. 2020. Earth Observation Data Cubes for Brazil: Requirements, Methodology and Products. Remote Sens. 12, no. 24: 4033. DOI: <a href="https://doi.org/10.3390/rs12244033" target="_blank">10.3390/rs12244033</a>. </p>
      <p> Zioti, F.; Gomes, V.C.F.; Ferreira, K.R.; Queiroz, G.R.; Rodriguez, E. L. 2019. Um ambiente para acesso e análise de trajetórias de uso e cobertura da Terra. Anais do XIX Simpósio Brasileiro de Sensoriamento Remoto. São José dos Campos, INPE, 2019. <a href="https://proceedings.science/sbsr-2019/papers/um-ambiente-para-acesso-e-analise-de-trajetorias-de-uso-e-cobertura-da-terra" target="_blank">Online</a>. </p>
    </div>
</div>

# Study Area
<hr style="border:1px solid #0077b9;">

The study area is located in the state of Pará, Brazil, in the Amazon biome, as depicted in Figure 1.

<center>
    <img src="../../../img/wlts/example_wlts_area.png" width="600" />
    <br/>
    <b>Figure 1</b> - Labelling data points with WLTS - Study Area.
</center>

# Python Client API
<hr style="border:1px solid #0077b9;">

To run the examples in this Jupyter Notebook, you will need to install the [WLTS client for Python](https://github.com/brazil-data-cube/wlts.py). To install it from GitHub using `pip`, use the following command:

In [None]:
# !pip install git+https://github.com/brazil-data-cube/wlts.py@v0.4.0-0

We also use the following libraries: [numpy](https://numpy.org/), [rasterio](https://rasterio.readthedocs.io/en/latest/), [pandas](https://pandas.pydata.org/), [geopandas](https://geopandas.org/), [seaborn](https://seaborn.pydata.org/), [matplotlib](https://matplotlib.org/), [scikit-learn](https://scikit-learn.org/stable/), [folium](https://python-visualization.github.io/folium/), and [stac.py](https://github.com/brazil-data-cube/stac.py). To install them from PyPI using `pip`, use the following command:

> pip install numpy rasterio pandas geopandas seaborn matplotlib scikit-learn folium stac.py

# Set the service and load the data points
<hr style="border:1px solid #0077b9;">

In [None]:
import wlts

Define the service to be used:

In [None]:
service = wlts.WLTS('https://brazildatacube.dpi.inpe.br/wlts/')
service

In [None]:
service.collections

**Sampling GRID**

The samples are extracted on a sampling grid of equally spaced locations. Below, the grid is loaded using the GeoPandas library.

> The sample points used below were generated using [QGIS](https://qgis.org/pt_BR/site/). If you wish, you can use the [Verde](https://www.fatiando.org/verde/latest/) library instead.


In [None]:
import geopandas

In [None]:
samples_df = geopandas.read_file("/vsicurl/https://brazildatacube.dpi.inpe.br/public/workshop/bdc-2020-03/wlts/samples-points/sample-points.shp")
samples_df.head()

Below, the spatial location of each grid point is presented.

In [None]:
import folium

In [None]:
#
# extract sample (lat, lon) pairs
#
latlon = samples_df.geometry.apply(lambda p: (p.y, p.x)).tolist()

#
# create folium map
#
folium_map = folium.Map( location=[-0.52, -51.1526], zoom_start=12)

#
# Google Satellite Layer
#
tile = folium.TileLayer(
        tiles = "https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}",
        attr = 'Google',
        name = 'Google Satellite',
        overlay = False,
        control = True
       ).add_to(folium_map)

#
# add marker to map
#
for coord in latlon:
    folium.CircleMarker( location=[ coord[0], coord[1] ], fill_color='#43d9de', radius=3).add_to( folium_map )

folium_map

# Labelling the data points
<hr style="border:1px solid #0077b9;">

We will use the grid points' location to extract classes defined by different mapping projects. In this way, each point will be associated with a class.

> In WLTS, the data from each of the projects is represented through collections. A collection is an aggregation of data from different years of the same project.

In this example scenario, we will perform a concordance analysis between each project to choose the best samples. This analysis provides a set of samples based on the knowledge applied by each project for the choice of classes.

> Note that this is an example scenario. The complexities of a real scenario, such as spatio-temporal differences between samples or differences in how they were created, are not considered here!

The sample labels will be extracted separately in the subsections to facilitate their application in the example that will be created, but the [wlts.py library](https://github.com/brazil-data-cube/wlts.py/) supports the extraction of data considering multiple projects at once.

**IBGE - Monitoramento e uso da Terra (2018)**

In WLTS, the IBGE data from the Land Use Monitoring project is available in the collection named `ibge_cobertura_uso_terra`, which is the name passed to `service.tj` below. The code below extracts the label from this collection for the year 2018.

In [None]:
import pandas

In [None]:
samples_ibge = []

#
# Extract classes with WLTS
#
for point_row in samples_df.iterrows():
    point_row = point_row[1]
    
    ibge_class = service.tj(latitude  = point_row.geometry.y, 
                            longitude = point_row.geometry.x, 
                            start_date = 2018, end_date = 2018,
                            collections = "ibge_cobertura_uso_terra")
    
    samples_ibge.append(ibge_class.df())

#
# Create a Data Frame
#
samples_ibge = pandas.concat(samples_ibge).reset_index(drop=True)
samples_ibge["geometry"] = samples_df["geometry"]

The table below presents the first rows of the classes extracted, for a single year (2018), at the grid points presented above.

In [None]:
samples_ibge.head()
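The role of `reset_index(drop=True)` in the cell above can be illustrated with a minimal sketch on toy data. It assumes that each per-point table has a single row whose index starts at 0; the labels and the string "geometries" are placeholders, not values returned by the service.

```python
import pandas

# Toy stand-ins (assumed) for the one-row tables returned for each point.
per_point = [
    pandas.DataFrame({"class": ["Corpo d'água Continental"], "date": ["2018"]}),
    pandas.DataFrame({"class": ["other"], "date": ["2018"]}),
    pandas.DataFrame({"class": ["other"], "date": ["2018"]}),
]

# Toy geometry column; plain strings replace real point geometries here.
geometry = pandas.Series(["point-0", "point-1", "point-2"])

# Without resetting, every row keeps the index of its source table.
stacked = pandas.concat(per_point)
index_before = stacked.index.tolist()

# Resetting gives each point a unique position that matches `geometry`.
labelled = stacked.reset_index(drop=True)
index_after = labelled.index.tolist()

# Column assignment aligns on the index, so row i receives geometry i.
labelled["geometry"] = geometry

print(index_before, index_after)
print(labelled)
```

Because pandas aligns a Series on the index when it is assigned as a column, the unique index produced by the reset is what keeps each label paired with its own point.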

**MapBiomas version 5 - Mapa de uso e cobertura da Terra**

Analogous to the IBGE data, this section extracts the data from MapBiomas. In WLTS, the data from MapBiomas (Version 5) are represented through the collection `mapbiomas5_amazonia`.

In [None]:
samples_mapbiomas = []

#
# Extract classes with WLTS
#
for point_row in samples_df.iterrows():
    point_row = point_row[1]
    
    mapbiomas_class = service.tj(latitude  = point_row.geometry.y, 
                                 longitude = point_row.geometry.x, 
                                 start_date = 2018,
                                 end_date = 2018,
                                 collections = "mapbiomas5_amazonia")
    
    samples_mapbiomas.append(mapbiomas_class.df())

#
# Create a Data Frame
#
samples_mapbiomas = pandas.concat(samples_mapbiomas).reset_index(drop=True)
samples_mapbiomas["geometry"] = samples_df["geometry"]

In [None]:
samples_mapbiomas.head()

# Prepare data to a concordance analysis
<hr style="border:1px solid #0077b9;">

This section prepares the data for the concordance analysis. In this process, all points identified as water have their `class` values converted to `1`, while all other values are represented by `0`.

This conversion is applied considering that there is one class that represents the Water element for each collection. The table below summarizes how each collection does this representation.

|         Collection        	|      Nomenclature for water class   	|
|:-------------------------:	|:----------------------------------:	|
|        IBGE (2018)        	|      Corpo d'água Continental      	|
| MapBiomas Version 5 (2018)	|         Rio, Lago e Oceano         	|

Using the information in the table, each collection is prepared below for the concordance analysis.

`IBGE Collection (2018)`

> After running the command below, notice that the `class` column has its values reduced to `0` and `1`.


In [None]:
samples_ibge.loc[samples_ibge["class"] != "Corpo d'água Continental", "class"] = 0
samples_ibge.loc[samples_ibge["class"] == "Corpo d'água Continental", "class"] = 1

In [None]:
samples_ibge.head(3)

`MapBiomas Collection (2018)`

In [None]:
samples_mapbiomas.loc[samples_mapbiomas["class"] != "Rio, Lago e Oceano", "class"] = 0
samples_mapbiomas.loc[samples_mapbiomas["class"] == "Rio, Lago e Oceano", "class"] = 1

In [None]:
samples_mapbiomas.head(3)
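The two-step recoding used above can also be written as a single comparison. The sketch below is one possible alternative, shown on a toy table; the `other` label and the `WATER_LABEL` constant are placeholders introduced for illustration.

```python
import pandas

# Water label taken from the table above (MapBiomas nomenclature).
WATER_LABEL = "Rio, Lago e Oceano"

# Toy data; "other" is a placeholder for any non-water class.
toy = pandas.DataFrame(
    {"class": ["Rio, Lago e Oceano", "other", "Rio, Lago e Oceano", "other"]}
)

# The comparison yields True/False, which is cast to 1/0 in one step.
toy["class"] = (toy["class"] == WATER_LABEL).astype(int)

print(toy["class"].tolist())
```

This form produces an integer column directly. Like the cells above, it should be run only once on the original string labels, since a second run would compare integers with the label and set every row to `0`.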

# Concordance analysis
<hr style="border:1px solid #0077b9;">

Below we will generate an example of a concordance analysis. A confusion matrix is generated to visualize and quantify the points that have a concordance. After visualizing the matrix, the data is filtered so that only the points where there is concordance are considered.

> **Note**: The analysis below does not consider many of the practical complexities involved in this process.


In [None]:
import seaborn

from matplotlib import pyplot as plt

from sklearn.metrics import confusion_matrix

In [None]:
#
# generate the confusion matrix
#

cm_arr = confusion_matrix(samples_ibge["class"].astype("int"), samples_mapbiomas["class"].astype("int"))

#
# formatting results
#
reference = ["Non-Water", "Water"]

cm_pd = pandas.DataFrame(cm_arr, columns = reference, index = reference)

In [None]:
plt.figure(dpi = 100)

#
# plot matrix
#
seaborn.heatmap(cm_pd, annot=True, fmt = 'g', cmap="YlGnBu", cbar = False)

#
# configure labels
#
plt.title("Concordance matrix")
plt.ylabel("IBGE (2018)")
plt.xlabel("MapBiomas 5 (2018)")

plt.show()
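Beyond inspecting the matrix visually, the agreement can be summarized with a couple of numbers. The sketch below computes the overall agreement and Cohen's kappa from a 2x2 matrix. Kappa is an addition that the analysis above does not use, and the counts are invented for illustration rather than taken from the collections.

```python
import numpy

# Invented 2x2 counts (rows: first collection, columns: second collection).
cm = numpy.array([[40, 5],
                  [3, 12]])

total = cm.sum()

# Share of points on the diagonal, where both collections agree.
observed = numpy.trace(cm) / total

# Agreement expected by chance, from the row and column totals.
expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2

# Cohen's kappa: agreement beyond chance, scaled to a maximum of 1.
kappa = (observed - expected) / (1 - expected)

print(f"overall agreement: {observed:.3f}")
print(f"kappa: {kappa:.3f}")
```

With real data, `cm` would be replaced by the `cm_arr` array computed above.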

> Below, the samples are filtered to keep only the points where both data sets agree.

In [None]:
#
# generate the "true matrix" based on classes matching
#
true_matrix = samples_ibge["class"].values == samples_mapbiomas["class"].values

#
# filtering data
#
samples_ibge_filtered = samples_ibge[true_matrix]
samples_mapbiomas_filtered  = samples_mapbiomas[true_matrix]


In [None]:
samples_ibge_filtered.head(5)

**Visualizing the filtered points in the geographical space**

The map below shows the filtered samples. The `blue` samples are the points where both collections agree, while the `yellow` ones are the points where the collections did not agree.


In [None]:
#
# extract sample (lat, lon) pairs
#
latlon = samples_ibge_filtered.geometry.apply(lambda p: (p.y, p.x)).tolist()
latlon_not_valid = samples_ibge[~true_matrix].geometry.apply(lambda p: (p.y, p.x)).tolist()

#
# create folium map
#
folium_map = folium.Map( location=[-0.52, -51.1526], zoom_start=12)

#
# Google Satellite Layer
#
tile = folium.TileLayer(
        tiles = "https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}",
        attr = 'Google',
        name = 'Google Satellite',
        overlay = False,
        control = True
       ).add_to(folium_map)

#
# add marker to map (concordance samples)
#
for coord in latlon:
    folium.CircleMarker( location=[ coord[0], coord[1] ], fill_color='#43d9de', radius=9).add_to( folium_map )

#
# add marker to map (non-concordant samples)
#
for coord in latlon_not_valid:
    folium.CircleMarker( location=[ coord[0], coord[1] ], color='#F5AD46', radius=9).add_to( folium_map )

folium_map

# Example: Computing Water Mask for a single image
<hr style="border:1px solid #0077b9;">

In this section, the previously extracted and filtered samples are used to train a linear classifier, which is then applied to a scene extracted from the Landsat-8/OLI data cube (a 16-day temporal composition in which the pixel with the least cloud influence is chosen by the STACK algorithm).

The defined study region is located within the Amazon biome, in a cube tile in Pará.

> In this example, to reduce the computational requirements, only a small region of the scene is used: the one that intersects the grid points presented earlier. Furthermore, to facilitate classification, the **N**ormalized **D**ifference **W**ater **I**ndex (NDWI) is calculated.

For the acquisition of the Landsat-8/OLI scene the [stac.py](https://github.com/brazil-data-cube/stac.py) client will be used. Below, the client is imported and the credentials are set.

In [None]:
import stac

In [None]:
service = stac.STAC('https://brazildatacube.dpi.inpe.br/stac/', access_token='change-me')

In [None]:
#
# defining roi bbox
#
bbox = (-51.232048, 
        -0.594217, 
        -51.078365, 
        -0.464596)

#
# query stac!
#
items = service.search({
    'limit': 1,
    'bbox': bbox,
    'datetime': '2018-06-10/2018-06-25',
    'collections': ['LC8_30_16D_STK-1']
})

#
# visualizing the result (it must contain a scene)
#
items

In [None]:
#
# select item
#
item = items.features[0]

#
# load bands to generate the rgb
#

band7 = item.read("band7", bbox = bbox)
band5 = item.read("band5", bbox = bbox)
band4 = item.read("band4", bbox = bbox)

The code above loads the scene bands over the grid region. In this example, the RGB visualization uses `band 7`, `band 5`, and `band 4`.

In [None]:
import numpy

from rasterio.plot import show

In [None]:
#
# defining the rgb matrix
#
rgb = numpy.stack((band7, band5, band4))

#
# create a figure
#
plt.figure(dpi = 120)

#
# plot!
#
show(rgb / 10000)

plt.show()
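The composite above divides the stacked bands by 10000 before plotting. The sketch below, on invented pixel values, shows the same stacking and scaling with an explicit clip to the `[0, 1]` range expected for floating-point RGB images. The clip is an addition for illustration and is not part of the cell above.

```python
import numpy

# Invented 2x2 band arrays standing in for band 7, band 5 and band 4.
band7_toy = numpy.array([[1200, 3000], [0, 12000]], dtype="int16")
band5_toy = numpy.array([[2500, 4000], [100, 9000]], dtype="int16")
band4_toy = numpy.array([[800, 1500], [50, 7000]], dtype="int16")

# Stack into (bands, rows, columns), the layout used for the composite.
rgb_toy = numpy.stack((band7_toy, band5_toy, band4_toy))

# Apply the 10000 scale factor, then clip values that fall outside [0, 1].
rgb_scaled = numpy.clip(rgb_toy / 10000, 0.0, 1.0)

print(rgb_scaled.shape)
print(rgb_scaled.min(), rgb_scaled.max())
```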

**Calculating the Normalized Difference Water Index**

In [None]:
import rasterio

The images are loaded using [rasterio](https://rasterio.readthedocs.io/en/latest/). To do this, the URL of each band asset returned by the STAC query is used.

In [None]:
band3ds = rasterio.open(item["assets"]["band3"]["href"])
band5ds = rasterio.open(item["assets"]["band5"]["href"])

To extract the values, we need to reproject the grid data points to the raster **C**oordinate **R**eference **S**ystem (CRS).

In [None]:
#
# reproject grid points
#
samples_ibge_filtered = geopandas.GeoDataFrame(samples_ibge_filtered)\
                                .set_geometry("geometry")\
                                .set_crs("EPSG:4326")

points = samples_ibge_filtered["geometry"].to_crs(band3ds.crs)
points

Now we can extract the spectral data (bands 3 and 5) for each of the data points filtered above.

In [None]:
#
# Extract values for band 3
#
band3_values = numpy.array(list(
    band3ds.sample((x, y) for x, y in zip(points.x, points.y))
))  / 10000

#
# Extract values for band 5
#
band5_values = numpy.array(list(
    band5ds.sample((x, y) for x, y in zip(points.x, points.y))
)) / 10000

#
# generate ndwi for sampled points
#
ndwi_values = ( band3_values - band5_values ) / ( band3_values + band5_values )
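The NDWI used above is `(green - NIR) / (green + NIR)`, where band 3 is the green band and band 5 is the near-infrared band of Landsat-8/OLI. When both bands are zero the denominator is zero, so the sketch below shows one possible way to guard against that case. The helper name `ndwi` and the reflectance values are placeholders, not part of the notebook.

```python
import numpy


def ndwi(green, nir):
    """Return (green - nir) / (green + nir), with NaN where the sum is zero."""
    green = numpy.asarray(green, dtype="float64")
    nir = numpy.asarray(nir, dtype="float64")
    total = green + nir

    # Start from NaN so positions skipped by `where` stay undefined.
    out = numpy.full(total.shape, numpy.nan)
    numpy.divide(green - nir, total, out=out, where=total != 0)
    return out


# Invented reflectances: a water-like, a vegetation-like and an empty pixel.
values = ndwi([0.08, 0.02, 0.0], [0.02, 0.30, 0.0])
print(values)
```

Positive values indicate water-like pixels and negative values indicate land or vegetation. Any NaN entries would still need to be masked or filled before being passed to a classifier.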


To finish this step, let's combine all extracted values into a single array, so that each point has values for `band 3`, `band 5`, and `ndwi`.

In [None]:
#
# binding results
#
points = numpy.hstack((band3_values, band5_values, ndwi_values))

points[0:5, ]

**Water Mask using linear separation**

In [None]:
from sklearn.linear_model import SGDClassifier

Create a linear separator

In [None]:
model = SGDClassifier().fit(points, 
                            samples_ibge_filtered["class"].astype("int"))

Generate the water mask

In [None]:
#
# create data with bands 3, 5 and ndwi
#
band3 = item.read("band3", bbox = bbox) / 10000
band5 = item.read("band5", bbox = bbox) / 10000

ndwi  = ( band3 - band5 ) / ( band3 + band5 )

#
# bind all matrices
#
raster = numpy.stack( (band3, band5, ndwi) )

In [None]:
#
# use raster to generate the water mask
#
prediction_array = model.predict(raster.T.reshape((-1, 3)))

#
# reshape to input raster dimensions
#
prediction_array = prediction_array.reshape(raster.shape[2], raster.shape[1]).T.astype(int)
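The transpose and reshape steps above can be checked on a small array. In the sketch below, a stand-in "prediction" that simply returns the first feature of each pixel replaces `model.predict`. The array sizes are arbitrary, and the result should reproduce the first band in its original layout.

```python
import numpy

# Small (bands, rows, columns) raster with distinct values per cell.
bands, height, width = 3, 2, 4
raster_toy = numpy.arange(bands * height * width).reshape(bands, height, width)

# One row per pixel and one column per band, as in the cell above.
pixels = raster_toy.T.reshape((-1, 3))

# Stand-in for model.predict: return the first feature of each pixel.
fake_prediction = pixels[:, 0]

# Undo the flattening: reshape to (columns, rows), then transpose back.
restored = fake_prediction.reshape(raster_toy.shape[2], raster_toy.shape[1]).T

print(pixels.shape)
print(numpy.array_equal(restored, raster_toy[0]))
```

The final reshape uses `(columns, rows)` rather than `(rows, columns)` because the earlier transpose flattened the pixels column by column.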

Plot classified image

In [None]:
plt.figure(figsize = (10, 10))

plt.imshow(prediction_array, cmap='GnBu')

**Save results**

In [None]:
profile = band3ds.profile
profile["dtype"] = "int16"
profile["count"] = 1

with rasterio.open("water-mask-classification.tif", "w", **profile) as file:
    file.write(prediction_array[numpy.newaxis, ...].astype('int16'))
