# Exercise 2. Raster data preparations

In this exercise we prepare the raster data for the classification exercises, where the forest type is predicted based on a satellite image.

The data used in these exercises is originally from:
* [Forest stands](https://www.metsaan.fi/paikkatietoaineistot) from the Finnish Forest Centre (Metsäkeskus). The exercise area is covered by 2 files: Uusimaa and Salo. These will be merged.
* [Sentinel-2 satellite image](https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2/data-products) (Level-2A product, 10 m x 10 m pixels) from ESA. The data is provided with each band in a separate file, so the bands will be merged.

The goal of this exercise is to have 6 raster files:

Images:
1. Sentinel image rescaled to the original reflectance values, for the training area
1. Sentinel image rescaled to the original reflectance values, for the prediction area

Labels:
1. Spruce forests as binary raster for training area
1. Spruce forests as binary raster for prediction area
1. Multi-class (no forest, pine, spruce, deciduous trees) forest raster for training area
1. Multi-class (no forest, pine, spruce, deciduous trees) forest raster for prediction area

In this exercise, GDAL command-line tools are used, **not Python**.

In Jupyter Notebooks, command-line commands start with **!** or **%**:
* **%** runs an IPython magic command (for example `%cd` or `%env`) in the notebook's own process, so the result persists for other code cells as well. Use it for navigating folders.
* **!** runs the command in a separate subprocess, so switching folders with `!cd` has no effect on later commands.
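
The difference can be demonstrated in plain Python, without the notebook magics. This is only a minimal sketch: `subprocess.run` stands in for **!** and `os.chdir` for `%cd`, and the temporary folder exists just for the demonstration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # throwaway folder, only for this demonstration

# Comparable to "!cd": the folder changes inside a child shell and is lost when it exits
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
unchanged_after_subprocess = os.getcwd() == start_dir

# Comparable to "%cd": the folder changes in the current process and persists
os.chdir(target_dir)
changed_after_chdir = os.path.samefile(os.getcwd(), target_dir)

# Restore the original state
os.chdir(start_dir)
os.rmdir(target_dir)

print(unchanged_after_subprocess, changed_after_chdir)
```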

## 2.1 Download and unzip the data 

Using basic Linux commands:
* `wget` downloads files from a URL
* `unzip` extracts the files from a zip archive

See the generated files in the File browser in the left panel of JupyterLab.

In [None]:
NOTEBOOK_HOME='/home/jovyan/work/geocomputing/machineLearning/data'

In [None]:
!mkdir -p {NOTEBOOK_HOME}
%cd {NOTEBOOK_HOME}

In [None]:
! wget https://a3s.fi/gis-courses/gis_ml/forest.zip
! unzip -qu forest.zip

## 2.2 Satellite image preparations

The original satellite image has each band as a separate file. To **join the bands**, first create the false color composite as a virtual raster (.vrt) from the different bands.

* **B08** = infrared
* **B04** = red
* **B03** = green

A virtual raster is a handy concept for merging files. The created .vrt file is a small text file that includes only links to the original data files. Often a virtual raster is used with data divided into map sheets, but here all files cover the same map sheet, so use the `-separate` option to create a file with 3 bands.
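
Because a .vrt file is XML text, it can be inspected with standard tools. The sketch below parses a hand-written, heavily shortened VRT and lists which source file feeds each band. The file names are placeholders and the function name `list_band_sources` is hypothetical; a real file written by `gdalbuildvrt` contains more elements, such as georeferencing and source windows.

```python
import xml.etree.ElementTree as ET

# Hand-written, shortened VRT for illustration only; file names are placeholders
VRT_TEXT = """
<VRTDataset rasterXSize="10980" rasterYSize="10980">
  <VRTRasterBand dataType="UInt16" band="1">
    <SimpleSource>
      <SourceFilename relativeToVRT="1">B08_10m.jp2</SourceFilename>
      <SourceBand>1</SourceBand>
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="UInt16" band="2">
    <SimpleSource>
      <SourceFilename relativeToVRT="1">B04_10m.jp2</SourceFilename>
      <SourceBand>1</SourceBand>
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="UInt16" band="3">
    <SimpleSource>
      <SourceFilename relativeToVRT="1">B03_10m.jp2</SourceFilename>
      <SourceBand>1</SourceBand>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
"""


def list_band_sources(vrt_text):
    """Return (band number, source file name) pairs found in a VRT document."""
    root = ET.fromstring(vrt_text.strip())
    pairs = []
    for band in root.findall("VRTRasterBand"):
        filename = band.findtext("SimpleSource/SourceFilename")
        pairs.append((int(band.get("band")), filename))
    return pairs


for band_number, filename in list_band_sources(VRT_TEXT):
    print(band_number, filename)
```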

Set the `PROJ_LIB` environment variable so that GDAL finds the PROJ coordinate system data in this environment.

In [None]:
%env PROJ_LIB=/opt/conda/share/proj/

In [None]:
%cd {NOTEBOOK_HOME}/forest

In [None]:
! gdalbuildvrt T34VFM_20180829T100019.vrt \
    S2B_MSIL2A_20180829T100019_N0208_R122_T34VFM_20180829T184909.SAFE/GRANULE/L2A_T34VFM_A007727_20180829T100017/IMG_DATA/R10m/T34VFM_20180829T100019_B08_10m.jp2 \
    S2B_MSIL2A_20180829T100019_N0208_R122_T34VFM_20180829T184909.SAFE/GRANULE/L2A_T34VFM_A007727_20180829T100017/IMG_DATA/R10m/T34VFM_20180829T100019_B04_10m.jp2 \
    S2B_MSIL2A_20180829T100019_N0208_R122_T34VFM_20180829T184909.SAFE/GRANULE/L2A_T34VFM_A007727_20180829T100017/IMG_DATA/R10m/T34VFM_20180829T100019_B03_10m.jp2 \
    -separate

Finally **clip and rescale** the image. In Sentinel images, the original reflectance values have been multiplied by 10 000 and stored as integers, because integers take less disk space than decimal numbers. Machine learning models generally work better with input values between 0 and 1, so scale the pixel values back to the original range: 0 to 10 000 -> 0 to 1.

Options for the gdal_translate command:
* `-projwin` defines the new bounding box (bbox) for the data, in the order upper-left x, upper-left y, lower-right x, lower-right y. Use a smaller bbox for training the models and a bigger bbox for predicting. Additionally, use an extra small bbox for the shallow learning models to get results in a reasonable time during the course.
* `-ot` sets the output data type. The original data is of integer type; change it to Float32.
* `-scale` defines how to scale the values: 0 to 10 000 -> 0 to 1
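
The arithmetic behind `-scale` is a plain linear mapping from the source range to the destination range. The sketch below reproduces it for single values; `linear_scale` is a hypothetical helper name, not a GDAL function, and the sample pixel values are made up.

```python
def linear_scale(value, src_min=0.0, src_max=10000.0, dst_min=0.0, dst_max=1.0):
    """Map a value linearly from [src_min, src_max] to [dst_min, dst_max]."""
    gain = (dst_max - dst_min) / (src_max - src_min)
    return (value - src_min) * gain + dst_min


# Made-up stored pixel values and their rescaled reflectances
for stored in (0, 2500, 10000):
    print(stored, "->", round(linear_scale(stored), 4))
```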

In [None]:
# Rescale and clip a little bit. This image will be used for CNN predictions.
! gdal_translate T34VFM_20180829T100019.vrt T34VFM_20180829T100019_scaled.tif \
    -projwin 604500 6698500 677000 6640000 \
    -ot Float32 \
    -scale 0 10000 0 1

In [None]:
# Clip with medium bbox. This image will be used for training with all deep learning models and for predicting with fullyConnected DL model, and for clustering exercise.
! gdal_translate T34VFM_20180829T100019_scaled.tif T34VFM_20180829T100019_clipped_scaled.tif \
    -projwin 614500 6668500 644500 6640500

In [None]:
# Clip with small bbox. This image will be used for training and predicting with shallow learning models.
! gdal_translate T34VFM_20180829T100019_scaled.tif T34VFM_20180829T100019_clipped_small_scaled.tif \
    -projwin 617500 6654000 624000 6647500 
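
As a sanity check on the three bounding boxes, the sketch below computes the expected raster size of each window at 10 m resolution and verifies that the windows are nested. It assumes that the window edges fall on pixel boundaries; the helper names `window_shape` and `contains` are hypothetical.

```python
PIXEL_SIZE = 10  # metres, the resolution of the 10 m Sentinel-2 bands

# Bounding boxes in -projwin order: ulx, uly, lrx, lry (UTM 34N metres)
WINDOWS = {
    "prediction": (604500, 6698500, 677000, 6640000),
    "training": (614500, 6668500, 644500, 6640500),
    "small": (617500, 6654000, 624000, 6647500),
}


def window_shape(window, pixel_size=PIXEL_SIZE):
    """Return (rows, columns), assuming the edges lie on pixel boundaries."""
    ulx, uly, lrx, lry = window
    return (uly - lry) // pixel_size, (lrx - ulx) // pixel_size


def contains(outer, inner):
    """Return True if the inner window lies completely inside the outer one."""
    o_ulx, o_uly, o_lrx, o_lry = outer
    i_ulx, i_uly, i_lrx, i_lry = inner
    return o_ulx <= i_ulx and i_lrx <= o_lrx and o_lry <= i_lry and i_uly <= o_uly


for name, window in WINDOWS.items():
    rows, cols = window_shape(window)
    print(f"{name}: {rows} rows x {cols} columns")

print(contains(WINDOWS["prediction"], WINDOWS["training"]))
print(contains(WINDOWS["training"], WINDOWS["small"]))
```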

## 2.3 Forest stand preparations

**Merge** the two GeoPackage files into one file, **clip** to the prediction bbox and **change the coordinate system** to the same as the satellite image.

Options for the ogr2ogr command:
* `stand` is the table name in the original GeoPackage
* `-f` output file format, here GeoPackage
* `-t_srs` new coordinate system; EPSG:32634 is the code for UTM 34N
* `-spat` prediction bbox in UTM 34N coordinates, in the order xmin ymin xmax ymax
* `-spat_srs` EPSG code of the `-spat` coordinates, here UTM 34N
* `-append -update` adds the second dataset to the first one

This will take a moment, so please wait.

In [None]:
! ogr2ogr forest_clipped.gpkg MV_Salo.gpkg stand \
    -f GPKG \
    -t_srs epsg:32634 \
    -spat_srs epsg:32634 \
    -spat 604500 6640000 677000 6698500

! ogr2ogr forest_clipped.gpkg MV_Uusimaa.gpkg stand \
    -f GPKG \
    -t_srs epsg:32634 \
    -spat_srs epsg:32634 \
    -spat 604500 6640000 677000 6698500 \
    -append -update
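
The two `ogr2ogr` calls differ only in the input file and in the `-append -update` flags. As a minimal sketch, the same argument lists could be built in a loop in Python. `build_merge_commands` is a hypothetical helper; the commands are only printed here, not executed, and the bbox is written in xmin ymin xmax ymax order.

```python
import shlex


def build_merge_commands(sources, target="forest_clipped.gpkg"):
    """Build one ogr2ogr argument list per source; later sources are appended."""
    bbox = ["604500", "6640000", "677000", "6698500"]  # xmin ymin xmax ymax
    commands = []
    for index, source in enumerate(sources):
        command = [
            "ogr2ogr", target, source, "stand",
            "-f", "GPKG",
            "-t_srs", "epsg:32634",
            "-spat_srs", "epsg:32634",
            "-spat", *bbox,
        ]
        if index > 0:
            # The first call creates the file, the following ones add to it
            command += ["-append", "-update"]
        commands.append(command)
    return commands


commands = build_merge_commands(["MV_Salo.gpkg", "MV_Uusimaa.gpkg"])
for command in commands:
    print(shlex.join(command))
```

To actually run the commands, each list could be passed to `subprocess.run(command, check=True)`.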


**Rasterize** forest stand polygons and clip to the predicting bbox.

Options for the gdal_rasterize command:
* `-l` layer (table) to rasterize
* `-a` attribute to be used as the raster value
* `-ot` raster data type
* `-tr` pixel size
* `-te` bbox, in the order xmin ymin xmax ymax
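
Note that `-te` expects the bbox as xmin ymin xmax ymax, while `-projwin` of `gdal_translate` expects upper-left x, upper-left y, lower-right x, lower-right y. A minimal sketch of the conversion, with `projwin_to_extent` as a hypothetical helper name:

```python
def projwin_to_extent(ulx, uly, lrx, lry):
    """Convert a -projwin style window to (xmin, ymin, xmax, ymax)."""
    return (min(ulx, lrx), min(uly, lry), max(ulx, lrx), max(uly, lry))


# The prediction bbox used with -projwin for the satellite image
extent = projwin_to_extent(604500, 6698500, 677000, 6640000)
print(" ".join(str(value) for value in extent))
```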

In [None]:
! gdal_rasterize forest_clipped.gpkg -l stand forest_species.tif \
    -a maintreespecies \
    -ot Byte \
    -tr 10 10 \
    -te 604500 6640000 677000 6698500

Use `gdalinfo -hist` to print the histogram of the raster values. The histogram is at the very end of the long output.

In [None]:
! gdalinfo forest_species.tif -hist

From the histogram it can be seen that the data contains ~25 different tree species, but most of them have too few observations to be used for machine learning. So **reclassify** the main tree species to 4 classes to have enough data for each class:

Pine (1), Spruce (2), Deciduous trees (3), No forest (0)

Options for gdal_calc.py:
* `--calc` - how to calculate the values of the new raster
* `--NoDataValue` - what is the NoDataValue of the created raster
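
`gdal_calc.py` evaluates the `--calc` expression with NumPy syntax for every pixel, where each comparison yields 0 or 1. The sketch below applies the same reclassification expression to a few plain Python integers instead of a raster; the sample species codes are made up for illustration and `reclassify` is a hypothetical helper name.

```python
from collections import Counter


def reclassify(a):
    """Same expression as in --calc: 0 stays 0, 1 and 2 are kept, everything else becomes 3."""
    return 0 * (a == 0) + 1 * (a == 1) + 2 * (a == 2) + 3 * (a >= 3)


# Made-up pixel values standing in for the rasterized main tree species codes
sample_pixels = [0, 1, 1, 2, 3, 4, 13, 29, 0, 2]
reclassified = [reclassify(value) for value in sample_pixels]

print(reclassified)
print(Counter(reclassified))  # number of pixels per new class
```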

In [None]:
! gdal_calc.py -A forest_species.tif --outfile=forest_species_reclassified.tif \
--calc="0*(A==0)+1*(A==1)+2*(A==2)+3*(A>=3)" \
--NoDataValue=254 \
--quiet 

In [None]:
! gdalinfo forest_species_reclassified.tif -hist -stats

Some exercises use only the spruce data for binary classification. 
Create a binary raster by selecting only class 2 (spruce) from the original rasterized image and recoding it to value 1; all other pixels get value 0.

In [None]:
! gdal_calc.py -A forest_species.tif --outfile=forest_spruce.tif \
--calc="0*(A<2)+0*(A>2)+1*(A==2)" --quiet --NoDataValue=254
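
The spruce expression sums three terms, but only the last one can be non-zero, so it behaves the same as the single comparison `A==2`. The sketch below checks this for every possible Byte value using plain Python integers; `spruce_mask` is a hypothetical helper name.

```python
def spruce_mask(a):
    """Same expression as in --calc for the binary spruce raster."""
    return 0 * (a < 2) + 0 * (a > 2) + 1 * (a == 2)


# Compare against the simple comparison for all Byte values (0..255)
same_as_simple_comparison = all(spruce_mask(value) == int(value == 2) for value in range(256))
print(same_as_simple_comparison)
```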

In [None]:
!gdalinfo forest_spruce.tif -hist

**Clip** both the multi-class and the binary dataset to the training area bbox. Additionally, clip the multi-class dataset with the smaller bbox for the shallow learning models.

In [None]:
! gdal_translate forest_spruce.tif forest_spruce_clip.tif \
    -ot Byte \
    -projwin 614500 6668500 644500 6640500
! gdal_translate forest_species_reclassified.tif forest_species_reclassified_clip.tif \
    -ot Byte \
    -projwin 614500 6668500 644500 6640500
! gdal_translate forest_species_reclassified.tif forest_species_reclassified_clip_small.tif \
    -ot Byte \
    -projwin 617500 6654000 624000 6647500

In [None]:
! gdalinfo forest_species_reclassified_clip.tif -hist

## 2.4 Plotting the datasets

See the results by plotting:
* Satellite image
* Forest classification with 4 classes
* Spruce forest
* Histogram of the forest classification with 4 classes

In [None]:
import matplotlib.pyplot as plt
import matplotlib.colors
%matplotlib inline
import rasterio
import numpy as np
from rasterio.plot import show
from rasterio.plot import show_hist

In [None]:
### Helper function to normalize band values and enhance contrast, similar to what QGIS does automatically
def normalize(array):
    min_percent = 2   # Low percentile
    max_percent = 98  # High percentile
    lo, hi = np.percentile(array, (min_percent, max_percent))
    # Clip to 0..1 so that imshow gets only valid RGB values
    return np.clip((array - lo) / (hi - lo), 0, 1)

In [None]:
### Create a subplot for 4 images 
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(20, 20))

### Plot the sentinel image 
### The Sentinel image used for training  
sentinel = rasterio.open("T34VFM_20180829T100019_clipped_scaled.tif") 
#sentinel = rasterio.open("T34VFM_20180829T100019_clipped_small_scaled.tif")

### Read the bands separately and apply the normalize function to each of them to increase contrast
nir, red, green = sentinel.read(1), sentinel.read(2), sentinel.read(3)
nirn, redn, greenn = normalize(nir), normalize(red), normalize(green)
stacked = np.dstack((nirn, redn, greenn))

ax[0, 0].imshow(stacked)

### The forest classification 
### Plot it a bit differently as it is not an RGB image
forest_classes = rasterio.open("forest_species_reclassified_clip.tif") 
#forest_classes = rasterio.open("forest_species_reclassified_clip_small.tif")
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["white","orange","darkgreen","violet"])
show(forest_classes, ax=ax[0, 1], cmap=cmap, title='Forest classes')

### Spruce forest 
forest_spruce = rasterio.open("forest_spruce_clip.tif")
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["white","darkgreen"])
show(forest_spruce, ax=ax[1, 0], cmap=cmap, title='Spruce forests')

### Plot the histogram of forest classification
show_hist(forest_classes, ax=ax[1, 1], title="Forest classes histogram")

Try also plotting the datasets with the smaller bbox by changing the input file names for `sentinel` and `forest_classes`.
