In [None]:
# Raster data preparations for CNN

In this exercise, the raster data for the CNN classification exercise is prepared.

Before starting this exercise, complete the [general raster data preparations exercise](raster_preparations.ipynb), or alternatively download the data and label images used as input here with the code provided below. The data requirements listed in the previous exercise also apply here.

Satellite images are usually too big for CNN models as such, so we need to tile them into smaller pieces. In our example the original image is 5000 x 5000 pixels, and the model input is 512 x 512 pixels.

## Data inputs

* Coordinate system: Finnish ETRS-TM35FIN, EPSG:3067
* Resolution: 20m
* BBOX: 200000, 6700000, 300000, 6800000

#### Labels

* Binary classification raster: 1 - forest, 0 - everything else.
* Multiclass classification raster: 1 - forest, 2 - fields, 3 - water, 4 - urban, 0 - everything else.

#### Data image

* Sentinel2 mosaic. We include data from 2 different dates (May and July) to have more data values. The final dataset has 8 bands: bands 2, 3, 4 and 8 from the dates 2021-05-11 and 2021-07-21, with reflectance values scaled to [0 ... 1].

## Data processing results

The goal of this exercise is to have 4 sets of raster tiles.

#### Label tiles for training
For better augmentation we make the training tiles bigger (1024 x 1024) than the model input, so that a random clip can be taken at training time. We also use an overlapping tiling scheme to get more training data.

* Binary classification tiles. Tile size: 1024 x 1024, overlap 512.
* Multiclass classification tiles. Tile size: 1024 x 1024, overlap 512.

#### Data tiles for training
* Sentinel2 mosaic tiles. Tile size: 1024 x 1024, overlap 512.
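The random clip mentioned above could look roughly like the following. This is a minimal sketch, not the code used in the later training exercise: the function name `random_crop_pair`, the channels-last array layout and the dummy arrays are assumptions made for illustration. The point is only that the image tile and its label tile must be cut with the same window.

```python
import numpy as np

def random_crop_pair(image, label, crop_size, rng):
    """Cut the same random square window from an image tile and its label tile.

    Assumed layouts: image is (height, width, bands), label is (height, width).
    """
    height, width = label.shape
    # The upper bound is exclusive, so the window always fits inside the tile.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    image_crop = image[top:top + crop_size, left:left + crop_size, :]
    label_crop = label[top:top + crop_size, left:left + crop_size]
    return image_crop, label_crop

rng = np.random.default_rng(seed=0)
image_tile = np.zeros((1024, 1024, 8), dtype=np.float32)  # dummy 8-band training tile
label_tile = np.zeros((1024, 1024), dtype=np.uint8)       # dummy label tile
image_crop, label_crop = random_crop_pair(image_tile, label_tile, 512, rng)
print(image_crop.shape, label_crop.shape)  # (512, 512, 8) (512, 512)
```

Because the window position changes on every call, the model sees slightly different 512 x 512 cut-outs of the same 1024 x 1024 tile during training.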

#### Data tiles for predicting
* Sentinel2 mosaic tiles. Tile size: 512 x 512 (same size as model). No overlap.
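To get a feeling for how many tiles these settings produce, the number of full-size tiles per image axis can be estimated with simple arithmetic. The sketch below assumes that tiles start at the top-left corner and that consecutive tiles are shifted by tile size minus overlap; the helper `count_full_tiles` is hypothetical, and `gdal_retile.py` additionally writes smaller edge tiles that are not counted here.

```python
def count_full_tiles(image_size, tile_size, overlap=0):
    """Number of full-size tiles that fit along one image axis."""
    if image_size < tile_size:
        return 0
    step = tile_size - overlap  # distance between the origins of consecutive tiles
    return (image_size - tile_size) // step + 1

image_size = 5000  # the example image is 5000 x 5000 pixels

training_per_axis = count_full_tiles(image_size, 1024, overlap=512)
prediction_per_axis = count_full_tiles(image_size, 512)

print(training_per_axis ** 2)    # 64 full-size training tiles
print(prediction_per_axis ** 2)  # 81 full-size prediction tiles
```

In both cases the full-size tiles end at pixel 4608, so under these assumptions a strip of 392 pixels along the right and bottom edges is not covered by full-size tiles.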

#### Tar-files for training
1. Binary classification tiles + Sentinel2 mosaic tiles.
2. Multiclass classification tiles + Sentinel2 mosaic tiles.


## Data processing main steps
(Download input data if needed.)
1. Create folders for tiles
2. Tile the input images as specified above.
3. GDAL also creates tiles smaller than specified at the image edges, so remove the tiles that are too small.
4. Create a .tar package of tiles for binary and multiclass training. A tar package is easy to move to the GPU node for faster reads.

## Imports

In [8]:
import os, glob
import rasterio
import tarfile
import urllib.request

Set file names.

In [19]:
base_folder = ".."

image_url = 'https://a3s.fi/gis-courses/gis_ml/image.tif'
binary_classification_url = 'https://a3s.fi/gis-courses/gis_ml/labels_forest.tif'
multiclass_classification_url = 'https://a3s.fi/gis-courses/gis_ml/labels_multiclass.tif'

image_file = os.path.join(base_folder, 'image.tif')
binary_classification_file = os.path.join(base_folder, 'labels_forest.tif')
multiclass_classification_file = os.path.join(base_folder, 'labels_multiclass.tif')
cnn_folder = os.path.join(base_folder, '04_cnn_keras') 

trainingTileSize = 1024
modelTileSize = 512

imageTilesForTrainingFolder = os.path.join(cnn_folder, ('imageTiles_' + str(trainingTileSize)))
labelTilesForBinaryFolder = os.path.join(cnn_folder, ('binaryLabelTiles_' + str(trainingTileSize)))
labelTilesForMultiClassFolder = os.path.join(cnn_folder, ('multiclassLabelTiles_' + str(trainingTileSize)))
imageTilesForPredictionFolder = os.path.join(cnn_folder, ('imageTiles_' + str(modelTileSize)))

trainingBinaryTarFile = os.path.join(cnn_folder,('trainingTilesBinary_' + str(trainingTileSize) + '.tar'))
trainingMultiClassTarFile = os.path.join(cnn_folder,('trainingTilesMulti_' + str(trainingTileSize) + '.tar'))

(Download input data if needed.)

In [11]:
if not os.path.exists(image_file):
    urllib.request.urlretrieve(image_url, image_file)
    
if not os.path.exists(binary_classification_file):
    urllib.request.urlretrieve(binary_classification_url, binary_classification_file)    
    
if not os.path.exists(multiclass_classification_file):
    urllib.request.urlretrieve(multiclass_classification_url, multiclass_classification_file)       

1. Create folders for tiles

In [20]:
tile_folders = [
    imageTilesForTrainingFolder,
    labelTilesForBinaryFolder,
    labelTilesForMultiClassFolder,
    imageTilesForPredictionFolder,
]

for folder in tile_folders:
    # exist_ok=True makes this a no-op if the folder already exists
    os.makedirs(folder, exist_ok=True)

2. Tile the input images as specified above, using GDAL.

In [21]:
!gdal_retile.py -ps {trainingTileSize} {trainingTileSize} -overlap {modelTileSize} -targetDir {imageTilesForTrainingFolder} {image_file}
# -ps - tile size in pixels
# -overlap - overlap of tiles in pixels
# -targetDir - the directory of output tiles

0...10...20...30...40...50...60...70...80...90...100 - done.
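The `!` syntax above only works inside a notebook. When the same tiling is run from a plain Python script, the command could be assembled as a list and passed to `subprocess`. This is a sketch under that assumption: the helper `build_retile_command` is hypothetical, the paths are the ones used in this notebook, and only the `-ps`, `-overlap` and `-targetDir` options shown above are used.

```python
import subprocess

def build_retile_command(source, target_dir, tile_size, overlap=0):
    """Assemble the gdal_retile.py call as an argument list."""
    command = ["gdal_retile.py", "-ps", str(tile_size), str(tile_size)]
    if overlap:
        command += ["-overlap", str(overlap)]
    command += ["-targetDir", target_dir, source]
    return command

command = build_retile_command(
    "../image.tif", "../04_cnn_keras/imageTiles_1024", 1024, overlap=512
)
print(" ".join(command))

# Uncomment to actually run the tiling; requires gdal_retile.py on the PATH.
# subprocess.run(command, check=True)
```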


In [22]:
# Tile the satellite image for prediction into model-size tiles, without overlap.
!gdal_retile.py -ps {modelTileSize} {modelTileSize} -targetDir {imageTilesForPredictionFolder} {image_file}

0...10...20...30...40...50...60...70...80...90...100 - done.


In [24]:
# Tile the binary (forest) labels with the same settings as the image.
!gdal_retile.py -ps {trainingTileSize} {trainingTileSize} -overlap {modelTileSize} -targetDir {labelTilesForBinaryFolder} {binary_classification_file}

0...10...20...30...40...50...60...70...80...90...100 - done.


In [26]:
# Tile the multiclass labels with the same settings as the image.
!gdal_retile.py -ps {trainingTileSize} {trainingTileSize} -overlap {modelTileSize} -targetDir {labelTilesForMultiClassFolder} {multiclass_classification_file}

0...10...20...30...40...50...60...70...80...90...100 - done.


3. GDAL also creates tiles smaller than specified at the image edges, so remove the tiles that are too small.

In [27]:
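def remove_too_small_tiles(folder, size):
    """Delete tiles in folder whose height or width is not exactly size pixels."""
    all_tiles = glob.glob(os.path.join(folder, "*.tif"))
    for tile in all_tiles:
        # Read the dimensions first and close the file before deleting it;
        # removing a file that is still open fails on some platforms (e.g. Windows).
        with rasterio.open(tile) as src:
            wrong_size = src.height != size or src.width != size
        if wrong_size:
            print(tile)
            os.remove(tile)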

In [29]:
# The CNN model requires images of at least 512 x 512 pixels, so remove the tiles from the right and bottom edges that are too small.
print('Too small tiles, that are removed:')
remove_too_small_tiles(labelTilesForBinaryFolder, trainingTileSize)
remove_too_small_tiles(labelTilesForMultiClassFolder, trainingTileSize)
remove_too_small_tiles(imageTilesForTrainingFolder, trainingTileSize)
remove_too_small_tiles(imageTilesForPredictionFolder, modelTileSize)

Too small tiles, that are removed:
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_6_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_2.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_4.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_1.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_7_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_8_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_3.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_5.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_7.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_6.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_9_8.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_5_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_3_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_4_9.tif
../04_cnn_keras/binaryLabelTiles_1024/labels_forest_2_9.tif
../04
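After the removal it can be useful to check that every remaining image tile still has a label tile with the same position, and vice versa. The sketch below compares the two trailing indices in the file names, as seen in the log above (for example `labels_forest_6_9.tif`). That the image tiles follow the same pattern (for example `image_6_9.tif`) is an assumption, and the helper names are hypothetical.

```python
import os
import re

def tile_index(path):
    """Return the trailing '<number>_<number>' part of a tile file name, or None."""
    match = re.search(r"_(\d+_\d+)\.tif$", os.path.basename(path))
    return match.group(1) if match else None

def unmatched_tiles(image_paths, label_paths):
    """Return (indices only in images, indices only in labels)."""
    image_ids = {tile_index(p) for p in image_paths} - {None}
    label_ids = {tile_index(p) for p in label_paths} - {None}
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)

# Small made-up example; in the notebook the lists would come from glob.glob().
images = ["imageTiles_1024/image_1_1.tif", "imageTiles_1024/image_1_2.tif"]
labels = ["binaryLabelTiles_1024/labels_forest_1_1.tif"]
print(unmatched_tiles(images, labels))  # (['1_2'], [])
```

Two empty lists mean that the image and label folders contain the same set of tile positions.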

4. Create a .tar-package of tiles for binary and multiclass training.

In [None]:
def make_tarfile(output_filename, source_dirs):
    if os.path.exists(output_filename):
        os.remove(output_filename)
    with tarfile.open(output_filename, "w") as tar:
        for folder in source_dirs:
            tar.add(folder, arcname=os.path.basename(folder))

In [None]:
make_tarfile(trainingBinaryTarFile, [imageTilesForTrainingFolder, labelTilesForBinaryFolder])
make_tarfile(trainingMultiClassTarFile, [imageTilesForTrainingFolder, labelTilesForMultiClassFolder])
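As a quick sanity check before moving the packages to the GPU node, the top-level folders inside a tar file can be listed with the standard `tarfile` module. This is a minimal sketch: the helper `top_level_folders` is hypothetical, and the path is the binary training package as named in this notebook.

```python
import os
import tarfile

def top_level_folders(tar_path):
    """Return the sorted names of the top-level entries in a tar file."""
    with tarfile.open(tar_path, "r") as tar:
        # Member names inside a tar archive always use '/' as separator.
        return sorted({name.split("/")[0] for name in tar.getnames()})

tar_path = "../04_cnn_keras/trainingTilesBinary_1024.tar"
if os.path.exists(tar_path):
    print(top_level_folders(tar_path))
```

For the binary package one would expect to see the image tile folder and the binary label tile folder, since `make_tarfile` stores each source folder under its base name.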