<img src="https://github.com/brazil-data-cube/code-gallery/blob/master/img/logo-bdc.png?raw=true" align="right" width="110" />


# <span style="color:#336699">Samples Analysis - Clustering</span>
<hr style="border:2px solid #0077b9;">

<br>

<div style="text-align: center;font-size: 90%;">
    Lorena Alves dos Santos<sup><a href="https://orcid.org/0000-0002-5826-1700"><i class="fab fa-lg fa-orcid" style="color: #a6ce39"></i></a></sup>
    <br/><br/>
    Earth Observation and Geoinformatics Division, National Institute for Space Research (INPE)
    <br/>
    Avenida dos Astronautas, 1758, Jardim da Granja, São José dos Campos, SP 12227-010, Brazil
    <br/><br/>
    Contact: <a href="mailto:brazildatacube@inpe.br">brazildatacube@inpe.br</a>
    <br/><br/>
    Last Update: March 31, 2021
</div>

<br>

<div style="text-align: justify;  margin-left: 25%; margin-right: 25%;">
<b>Abstract.</b> This document presents the steps to assess sample quality and reduce noise in land use and land cover reference datasets. The main idea is to identify mislabeled samples and samples with low discrimination when mixed with other classes, and to explore the samples' spatiotemporal variability using satellite image time series. The method is based on an unsupervised neural network, the self-organizing map (SOM), and Bayesian inference. It provides measures to identify mislabeled samples and to assess the reliability of the samples.
</div>    


In [None]:
#
# install packages on-the-fly
#
system("cp -u -R ../input/sitspackages/sits-bundle/* /usr/local/lib/R/site-library/")

# remotes::install_github("e-sensing/sits")

In [None]:
library(sits)
library(dplyr)
library(ggplot2)
library(sf)

In [None]:
# Configure plot size 
options(repr.plot.width = 15, repr.plot.height = 6)

## <span style="color: #336699">Samples</span>
<hr style="border:0.5px solid #0077b9;">



<img src="https://github.com/lorenalves/code-gallery/blob/master/img/bdc-workshop/samples_cerrado.png?raw=true" align="center" width="600" />
<br>


In [None]:
# Load the noisy dataset
input_data.tb <- readRDS("../input/samples-dataset/samples_workshop_bdc.rds")

# Summarize the number of samples per label
sits_labels_summary(input_data.tb)

In [None]:
# List the bands available in the samples
sits_bands(input_data.tb)

# Plot the NDVI time series of the samples
plot(sits_select(input_data.tb, bands = "NDVI"))

## <span style="color: #336699"> Clustering samples with self-organizing maps (SOM)</span>
<hr style="border:0.5px solid #0077b9;">




<img src="https://github.com/lorenalves/code-gallery/blob/master/img/bdc-workshop/SA_SOM_properties.png?raw=true" align="center" width="800" />
<br>



The function sits_som_map uses the kohonen package to cluster the samples based on their satellite image time series. It also evaluates the quality of each sample through SOM properties, such as the neuron to which each sample is allocated and the neighbourhood of that neuron.

The main parameters of this function are:

(a) input_data.tb - A tibble with the samples to be clustered.

(b) grid_xdim, grid_ydim - Grid size (X, Y).

(c) alpha - Learning rate.

(d) rlen - Number of iterations used to train the SOM.

(e) distance - The type of similarity measure (distance).

(f) som_radius - Radius of the SOM neighborhood.
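As a rough illustration of how alpha and rlen interact, the sketch below assumes the convention described in the kohonen documentation, where the learning rate declines linearly from the first to the second value of alpha over rlen iterations. The variable learning_rate is illustrative only; this is not the code used internally by sits or kohonen.

```r
# Assumption: the learning rate declines linearly from alpha[1] to alpha[2]
# over rlen iterations, as described in the kohonen documentation.
alpha <- c(0.5, 0.01)
rlen  <- 100

# One learning-rate value per iteration
learning_rate <- seq(alpha[1], alpha[2], length.out = rlen)

# Learning rate at the first, middle and last iteration
learning_rate[c(1, 50, 100)]
```

Larger initial values let the neurons move quickly toward the data in early iterations, while the small final value only fine-tunes them.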

The function sits_som_map returns a list with two tibbles and the SOM properties (provided by the kohonen package). The first tibble contains the samples with additional information about the neuron identifier. The second tibble contains information about each neuron: its identifier, the number of samples associated with it and their labels, and the metrics of each class associated with the neuron.


<img src="https://github.com/lorenalves/code-gallery/blob/master/img/bdc-workshop/SA_SOM_metrics.png?raw=true" align="center" width="500" />
<br>
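To make the per-neuron information more concrete, the sketch below counts labels per neuron in a small toy table and derives the share of each label inside each neuron. The data, the label "Pasture" and the variable names are hypothetical; this is plain base R, not the output of sits_som_map.

```r
# Toy data (hypothetical): neuron identifier and label of six samples
toy_samples <- data.frame(
  id_neuron = c(1, 1, 1, 2, 2, 2),
  label     = c("Pasture", "Pasture", "Natural", "Natural", "Natural", "Natural")
)

# Number of samples of each label per neuron
counts <- table(toy_samples$id_neuron, toy_samples$label)

# Share of each label within a neuron (each row sums to 1)
label_share <- prop.table(counts, margin = 1)
label_share

# Majority label of each neuron
majority_label <- colnames(counts)[apply(counts, 1, which.max)]
majority_label
```

A neuron whose samples all carry the same label has a share of 1 for that label, while mixed neurons spread the share over several labels.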





In [None]:
set.seed(777)
clustering_CB4_workshop.lst <- sits::sits_som_map(
  input_data.tb,
  grid_xdim = 9,
  grid_ydim = 9,
  alpha = c(0.5, 0.01),
  distance = "euclidean",
  rlen = 100,
  som_radius = 1
)

clustering_CB4_workshop.lst

In [None]:
# List the bands
sits_bands(input_data.tb)

# Plot the SOM grid. With type = "codes" we can see how the samples are represented in the low-dimensional grid
plot(clustering_CB4_workshop.lst, type = "codes", whatmap = 5)

In [None]:
# Where are the samples mapped?
plot(clustering_CB4_workshop.lst, type = "mapping")

The function sits_som_evaluate_cluster evaluates the clusters created by SOM. Each cluster is a neuron or a set of neurons labeled with the same class. It produces a sits tibble indicating the percentage of mixture of classes in each cluster.
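The idea of class mixture inside a cluster can be sketched with a cross-tabulation. The example below assumes a toy table in which each sample has its own label and the label of the neuron it was mapped to; the data, the label "Pasture" and the column neuron_label are hypothetical, and the calculation is only an approximation of what the sits function reports.

```r
# Toy data (hypothetical): sample label and label of the neuron it fell in
toy <- data.frame(
  label        = c("Natural", "Natural", "Natural", "Pasture", "Pasture", "Natural"),
  neuron_label = c("Natural", "Natural", "Natural", "Natural", "Pasture", "Pasture")
)

# Rows: clusters (neurons sharing a label); columns: sample labels
confusion <- table(cluster = toy$neuron_label, class = toy$label)

# Percentage of each class inside each cluster (each row sums to 100)
mixture <- round(100 * prop.table(confusion, margin = 1), 1)
mixture
```

A cluster with a high percentage on its own class is pure; a cluster whose percentages are spread across classes indicates confusion between those classes.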

In [None]:
cluster_purity <- sits_som_evaluate_cluster(clustering_CB4_workshop.lst)

# Show the percentage of samples by class in each cluster
cluster_purity
plot(cluster_purity)

## <span style="color: #336699"> Output - Metrics </span>
<hr style="border:0.5px solid #0077b9;">




The function sits_som_clean_samples evaluates the quality of the samples based on the results of the SOM map. It produces a sits tibble with an evaluation column (eval) indicating whether each sample is clean, should be analyzed, or should be removed, and a new column (post_prob) with the posterior probability of the sample.

<img src="https://github.com/lorenalves/code-gallery/blob/master/img/bdc-workshop/SA_SOM_decision_making.png?raw=true" align="center" width="400" />
<br>
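The decision rule can be sketched as a small function. The version below is an assumption based on the description in the sits documentation: a sample whose prior probability is below prior_threshold is tagged "remove"; otherwise it is tagged "clean" when its posterior probability reaches posterior_threshold and "analyze" when it does not. The function evaluate_sample and the input probabilities are hypothetical and only illustrate the logic.

```r
# Sketch of the decision rule (an assumption, not the sits implementation)
evaluate_sample <- function(prior_prob, post_prob,
                            prior_threshold = 0.5,
                            posterior_threshold = 0.5) {
  ifelse(prior_prob < prior_threshold,
         "remove",
         ifelse(post_prob >= posterior_threshold, "clean", "analyze"))
}

# Hypothetical prior and posterior probabilities for three samples
evaluation <- evaluate_sample(prior_prob = c(0.9, 0.7, 0.2),
                              post_prob  = c(0.8, 0.3, 0.9))
evaluation
```

Under this rule, the "analyze" tag marks samples that agree with most samples in their own neuron but not with the neighbouring neurons, which is why they deserve a closer look rather than automatic removal.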



In [None]:
samples_complete.tb <-
  sits::sits_som_clean_samples(
    clustering_CB4_workshop.lst,
    prior_threshold = 0.5,
    posterior_threshold = 0.5
  )


In [None]:
# you can check the fields that sits_som_clean_samples() returns using colnames()
colnames(samples_complete.tb)

In [None]:
# output information
dplyr::select(samples_complete.tb, id_sample, id_neuron, label, eval, post_prob)

## <span style="color: #336699">Samples to remove</span>
<hr style="border:0.5px solid #0077b9;">

In [None]:
samples_to_remove.tb <- dplyr::filter(samples_complete.tb, eval == "remove")
#dplyr::select(samples_to_remove.tb,id_sample, id_neuron, label)

#plot samples to remove
plot(sits_select(samples_to_remove.tb, bands = c("NDVI")))


#plot only samples labeled as "Natural"
#plot(sits_select(dplyr::filter(samples_to_remove.tb, label == "Natural"), bands = c("NDVI", "EVI")))

#plot all bands
#plot(dplyr::filter(samples_to_remove.tb, label == "Natural"))



## <span style="color: #336699">Neurons Analysis</span>
<hr style="border:0.5px solid #0077b9;">


In [None]:

samples_to_analyze <-
  dplyr::arrange(dplyr::filter(samples_complete.tb, eval == "analyze"), id_neuron)

table(samples_to_analyze$id_neuron, samples_to_analyze$label)

#To check the complete table, print samples_to_analyze:
#samples_to_analyze

In [None]:
# Get the identifiers of the neurons to analyze, in ascending order
neurons_to_analyze <- sort(unique(samples_to_analyze$id_neuron))
print('Neurons to analyze')
neurons_to_analyze

# Copy the SOM result to highlight the outlier neurons in the grid
som_clustering <- clustering_CB4_workshop.lst

# Paint the neurons identified as outliers
som_clustering$som_properties$paint_map[neurons_to_analyze] <- "black"

# Relabel the neurons identified as outliers
som_clustering$som_properties$neuron_label[neurons_to_analyze] <- "analyze"

# Plot the SOM grids (original and with the outlier neurons highlighted)
par(mfrow = c(1,2))
plot(clustering_CB4_workshop.lst, type = "codes", whatmap = 6)
plot(som_clustering, type = "codes", whatmap = 6)

#  <span style="color: #336695"> Neuron 9 </span>

In [None]:
# We can inspect all samples associated with this neuron using dplyr
dplyr::filter(dplyr::select(samples_to_analyze, id_sample, id_neuron, label, post_prob), id_neuron == 9)

#plot samples grouped in neuron 9
plot(dplyr::filter(samples_complete.tb, id_neuron == 9 & eval == "analyze"))

#  <span style="color: #336695"> Neuron 62 </span>

In [None]:
# We can inspect all samples associated with this neuron using dplyr
dplyr::filter(dplyr::select(samples_to_analyze, id_sample, id_neuron, label, post_prob), id_neuron == 62)

#plot samples grouped in neuron 62
plot(dplyr::filter(samples_complete.tb,  id_neuron == 62 & eval == "analyze"))



#  <span style="color: #336695"> Neuron 74 </span>

In [None]:
# We can inspect all samples associated with this neuron using dplyr
dplyr::filter(dplyr::select(samples_to_analyze, id_sample, id_neuron, label, post_prob), id_neuron == 74)

#plot samples grouped in neuron 74
plot(dplyr::filter(samples_complete.tb,  id_neuron == 74 & eval == "analyze"))


In [None]:
# Select the points to plot: samples to analyze in neurons 74 and 9
points_to_plot.tb <- dplyr::filter(samples_complete.tb, eval == "analyze" & id_neuron %in% c(74, 9))

# Load the shapefile with the Brazilian biomes
brasil.shp <- sf::st_read("../input/shapefilebrasil/br_biomes.shp")

# Plot the points over the biomes, reprojected to WGS 84 (EPSG:4326)
ggplot2::ggplot(data = sf::st_transform(brasil.shp, 4326)) +
  ggplot2::geom_sf(fill = "transparent") +
  ggplot2::geom_point(data = points_to_plot.tb, ggplot2::aes(x = longitude, y = latitude, color = label))


#  <span style="color: #336695"> New dataset </span>

During the sample analysis, we identified that the samples grouped in neuron 74 and sample 443 (grouped in neuron 9) are mislabeled. For this reason, they will be removed from the dataset.
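The exclusion logic can be checked on a toy table before applying it to the real samples. In the sketch below, the identifiers other than sample 443 and neuron 74 are hypothetical, and base R subsetting stands in for the dplyr filter.

```r
# Toy data (hypothetical identifiers, except sample 443 and neuron 74)
toy <- data.frame(
  id_sample = c(101, 443, 502, 610),
  id_neuron = c(9, 9, 62, 74)
)

# Keep every row that is neither sample 443 nor mapped to neuron 74
keep <- toy[!(toy$id_sample %in% c(443)) & !(toy$id_neuron %in% c(74)), ]
keep
```

Using %in% with a vector makes it easy to extend the exclusion to more samples or neurons later.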

In [None]:
# Samples kept after the analysis

# Remove sample 443 and all the samples grouped in neuron 74

keep_after_analysis.tb <- dplyr::filter(samples_to_analyze, !(id_sample %in% c(443)) & !(id_neuron %in% c(74)))
keep_after_analysis.tb


In [None]:
# Filter the "clean" samples
cleaned_samples.tb <- dplyr::filter(samples_complete.tb, eval == "clean")

# Join the clean samples with the samples kept after the analysis
new_dataset.tb <- rbind(cleaned_samples.tb, keep_after_analysis.tb)
new_dataset.tb

In [None]:
saveRDS(new_dataset.tb, file = "new_dataset.rds")