# Exercise 5. Classification, shallow learning

The aim of this exercise is to train four different shallow learning models to predict land-use classes from satellite data, and to assess the accuracy of each model with a test dataset.

## Input data

2 raster files with:

* Coordinate system: Finnish ETRS-TM35FIN, EPSG:3067
* Resolution: 20m
* BBOX: 200000, 6700000, 300000, 6800000

#### Labels

* Multiclass classification raster: 1 - forest, 2 - fields, 3 - water, 4 - urban, 0 - everything else.

#### Data image

* Sentinel-2 mosaic with data from two dates (May and July), to have more data values per pixel. The dataset has 8 bands, based on bands 2, 3, 4 and 8 on the dates 2021-05-11 and 2021-07-21, with reflectance values scaled to [0 ... 1]. The source of each band is:
     *  'b02' / '2021-05-11'
     *  'b02' / '2021-07-21'
     *  'b03' / '2021-05-11'
     *  'b03' / '2021-07-21'
     *  'b04' / '2021-05-11'
     *  'b04' / '2021-07-21'
     *  'b08' / '2021-05-11'
     *  'b08' / '2021-07-21'
     
[Bands](https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/bands/): b02=blue, b03=green, b04=red, b08=near-infrared
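
The position of a given band and date in the 8-band raster can be worked out from the list above. The following is a minimal sketch that assumes the raster stores the bands in exactly the listed order; the names `band_order` and `index_of` are made up for this illustration.

```python
# Band and date combinations in the order listed above (assumed raster order).
bands = ['b02', 'b03', 'b04', 'b08']
dates = ['2021-05-11', '2021-07-21']
band_order = [(band, date) for band in bands for date in dates]

# Zero-based index of each (band, date) pair, as used by NumPy after reading.
index_of = {key: i for i, key in enumerate(band_order)}

print(index_of[('b08', '2021-07-21')])  # July near-infrared
print(index_of[('b04', '2021-07-21')])  # July red
print(index_of[('b03', '2021-07-21')])  # July green
```

Note that rasterio itself numbers bands from 1 when reading single bands, while the NumPy array returned by `read()` is indexed from 0.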

## Results

The trained models: 
* Random forest
* Stochastic Gradient Descent
* Gradient Boosting
* SVM Support Vector Classifier

For each model:
* Trained model
* Model accuracy estimation
* Class confusion matrix
* Predicted image

## Main steps

1) Read the data and reshape it into a form suitable for scikit-learn.
2) Divide the data into training, validation and test datasets.
3) Undersample to balance the training dataset.
4) For each model:
   * Train the model.
   * Evaluate the model on validation and test data, including a class confusion matrix and a classification report.
   * Predict the classification based on the data image and save it.
5) For SVM, use grid search to find optimal settings.
6) Plot the results.

## 5.0 Imports and paths

In [None]:
import os, time
from imblearn.under_sampling import RandomUnderSampler
from joblib import dump, load
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np
import rasterio
import urllib.request
from rasterio.plot import show
from rasterio.plot import show_hist
from rasterio.windows import from_bounds
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
%matplotlib inline

In [None]:
### File paths.
# Source data URLs
image_url = 'https://a3s.fi/gis-courses/gis_ml/image.tif'
multiclass_classification_url = 'https://a3s.fi/gis-courses/gis_ml/labels_multiclass.tif'

# Folders
user = os.environ.get('USER')
base_folder = os.path.join('/scratch/project_2002044', user, '2022/GeoML')
dataFolder = os.path.join(base_folder,'data')
outputBaseFolder= os.path.join(base_folder,'05_shallow_classification')

# Source data local paths
image_file = os.path.join(dataFolder, 'image.tif')
multiclass_classification_file = os.path.join(dataFolder, 'labels_multiclass.tif')

# BBOX for exercise data. We use less than the full image for shallow learning training, both for speed and to see the results better when plotting.
minx = 240500
miny = 6775500
maxx = 253500
maxy = 6788500 

# Available cores. During the course only 1 core is available, outside of this course more cores might be available 
# You can make use of multiple cores by setting this number to the number of cores available.
n_jobs = 1
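
To get a feeling for how much data the smaller BBOX contains, the raster size can be computed from the bounding box and the resolution. This is a rough sketch that assumes the 20 m resolution given in the input data description and bounding boxes aligned with the pixel grid; `size_in_pixels` is a helper made up for this illustration.

```python
resolution = 20  # metres per pixel, from the input data description

def size_in_pixels(minx, miny, maxx, maxy, resolution):
    """Return (width, height) in pixels for a grid-aligned bounding box."""
    width = (maxx - minx) // resolution
    height = (maxy - miny) // resolution
    return width, height

# Full input rasters and the smaller exercise area.
full = size_in_pixels(200000, 6700000, 300000, 6800000, resolution)
subset = size_in_pixels(240500, 6775500, 253500, 6788500, resolution)

print('Full raster:  ', full, 'pixels in total:', full[0] * full[1])
print('Exercise area:', subset, 'pixels in total:', subset[0] * subset[1])
```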

(Download input data if needed.)

In [None]:
# Create the data and output folders if they do not exist yet.
os.makedirs(dataFolder, exist_ok=True)
os.makedirs(outputBaseFolder, exist_ok=True)

if not os.path.exists(image_file):
    urllib.request.urlretrieve(image_url, image_file)
    
if not os.path.exists(multiclass_classification_file):
    urllib.request.urlretrieve(multiclass_classification_url, multiclass_classification_file) 

## 5.1 Read data and reshape it into a form suitable for scikit-learn

Read the input datasets with Rasterio and reshape them into a form suitable for scikit-learn.

The image data is processed exactly as for K-means; similar processing is only added for the labels image.

### Satellite image

The satellite image has 8 channels, so rasterio reads it in as a 3D data cube.

For scikit-learn we reshape the data to a 2D array with one row for each pixel. Each pixel has eight values, one for each band/date combination.

In [None]:
# Read the pixel values from the .tif file as a NumPy array
with rasterio.open(image_file) as image_dataset:
    image_data = image_dataset.read(window=from_bounds(minx, miny, maxx, maxy, image_dataset.transform))

# Check shape of input data
print('Array original shape, 3D: ', image_data.shape)

Save the number of bands for later; it is needed when reshaping the data to 2D.

In [None]:
no_bands_in_image = image_data.shape[0]
no_bands_in_image

As an intermediate step, transpose the axis order so that the bands come last. Notice how the array shape changes.

In [None]:
image_data2 = np.transpose(image_data, (1, 2, 0))
# Check the data shape again, now the bands should be last.
print('Array shape after transpose, 3D: ', image_data2.shape)

In [None]:
# Then reshape to 2D.
pixels = image_data2.reshape(-1, no_bands_in_image)
print('Array shape after transpose and reshape, 2D: ', pixels.shape)
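
The transpose-then-reshape step can be checked on a tiny made-up array. This is a minimal sketch with toy values (the array `cube` and its size are assumptions for illustration, not the exercise data). It shows that each row of the 2D result holds all band values of one pixel.

```python
import numpy as np

# Toy cube in rasterio order: (bands, rows, cols) = (2, 2, 3).
cube = np.arange(12).reshape(2, 2, 3)

# Move the band axis last, then flatten rows and cols into one pixel axis.
pixels_toy = np.transpose(cube, (1, 2, 0)).reshape(-1, cube.shape[0])

print(pixels_toy.shape)
print(pixels_toy[0])     # band values of the pixel at row 0, col 0
print(cube[:, 0, 0])     # the same values read directly from the cube
```

Reshaping without the transpose would also give a 2D array of the right shape, but its rows would mix values from neighbouring pixels of the same band instead of collecting the bands of one pixel.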

### Labels

Do the same for labels.

In [None]:
# For labels, reshaping to 1D is enough.
with rasterio.open(multiclass_classification_file) as labels_src:
    labels_data = labels_src.read(window=from_bounds(minx, miny, maxx, maxy, labels_src.transform))
    input_labels = labels_data.reshape(-1)
    print ('Labels shape after reshape, 1D: ', input_labels.shape)

Notice that the labels data has only one band.

In [None]:
labels_data.shape

## 5.2 Divide the data into training, validation and test datasets

Set the training, validation and test data ratios, i.e. how large a share of the pixels is assigned to each set.

In [None]:
train_ratio = 0.7
validation_ratio = 0.2
test_ratio = 0.1

First separate the test set.

In [None]:
x_rest, x_test, y_rest, y_test = train_test_split(pixels, input_labels, test_size=test_ratio, random_state=63, stratify=input_labels)

... and then the training and validation sets, using the ratios set above and keeping the class representation the same in all sets.

In [None]:
x_train1, x_validation, y_train1, y_validation= train_test_split(x_rest, y_rest, test_size=validation_ratio/(train_ratio + validation_ratio), random_state=63, stratify=y_rest)
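
Because the validation set is split off from what remains after the test set has been removed, its `test_size` has to be given relative to that remainder and not to the full dataset. The arithmetic can be sketched as follows; the variable names other than the three ratios are made up for this illustration.

```python
train_ratio, validation_ratio, test_ratio = 0.7, 0.2, 0.1

# Step 1: the test set takes its share of the full dataset.
rest_share = 1.0 - test_ratio

# Step 2: the validation set is taken from the remainder, so its size
# must be expressed as a fraction of the remainder.
validation_size_of_rest = validation_ratio / (train_ratio + validation_ratio)

# Resulting shares of the full dataset.
validation_share = rest_share * validation_size_of_rest
train_share = rest_share * (1.0 - validation_size_of_rest)

print('Validation size relative to the remainder:', round(validation_size_of_rest, 4))
print('Shares of the full dataset:', round(train_share, 2), round(validation_share, 2), round(test_ratio, 2))
```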

## 5.3 Resample to balance the dataset

The classes are very imbalanced in the dataset, so we undersample the majority classes in the training set, so that all classes are represented by about the same number of pixels.
Notice that the validation and test sets keep the original class distribution.

In [None]:
a = show_hist(labels_data, label='Classes')

In [None]:
rus = RandomUnderSampler(random_state=63)
x_train, y_train = rus.fit_resample(x_train1, y_train1)   
print('Array shape after undersampling of majority classes, 2D: ', x_train.shape)

*How many pixels of each class are included in the training dataset?*

Notice that we lost a lot of pixels at this point; in real cases that may be undesirable. See the [imbalanced-learn User guide](https://imbalanced-learn.org/stable/user_guide.html#user-guide) for other options.

In [None]:
print('Labels before splitting:           ', np.unique(input_labels, return_counts=True)[1])
print('Training data before undersampling:', np.unique(y_train1, return_counts=True)[1])
print('Training data after undersampling: ', np.unique(y_train, return_counts=True)[1])
print('Validation data:                   ', np.unique(y_validation, return_counts=True)[1])
print('Test data:                         ', np.unique(y_test, return_counts=True)[1])
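
The idea behind random undersampling can be illustrated with plain NumPy. This is only a sketch of the principle on made-up toy data, not the implementation used by `RandomUnderSampler`; the function `undersample_to_smallest_class` is a hypothetical helper.

```python
import numpy as np

def undersample_to_smallest_class(x, y, seed=63):
    """Keep a random subset of each class, as large as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return x[keep], y[keep]

# Toy data: three classes with 50, 30 and 5 samples, two features each.
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 5)
x_toy = np.arange(len(y_toy) * 2).reshape(-1, 2)

x_bal, y_bal = undersample_to_smallest_class(x_toy, y_toy)
print('Class counts before:', np.unique(y_toy, return_counts=True)[1])
print('Class counts after: ', np.unique(y_bal, return_counts=True)[1])
```

All classes end up with as many samples as the rarest class, which is why so many pixels are lost when one class is small.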

## 5.4 Modelling
### Functions for training and estimating the models and predicting based on the models

Similar functions will be used by different algorithms. Here the functions are only defined, they will be used later.

### Train the model

In [None]:
def trainModel(x_train, y_train, clf, classifierName):
    start_time = time.time()    
    clf.fit(x_train, y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    
    # Save the model to a file
    modelFilePath = os.path.join(outputBaseFolder, ('model_' + classifierName + '.sav'))
    dump(clf, modelFilePath) 
    return clf

### Estimate the model

The model may be estimated first with validation data and then with test data. Both a confusion matrix and a classification report are generated.

In [None]:
def estimateModel(clf, x_test, y_test):
    test_predictions = clf.predict(x_test)
    print('Confusion matrix: \n', confusion_matrix(y_test, test_predictions))
    print('Classification report: \n', classification_report(y_test, test_predictions))
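
To help with reading the output, the sketch below builds a confusion matrix by hand for a few made-up labels. It follows the scikit-learn convention of true classes on the rows and predicted classes on the columns. `confusion_counts` is a hypothetical helper written for this illustration, and the labels are toy values, not results from the exercise.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_counts(y_true, y_pred, 3)
print(cm)

# Recall: share of each true class that was found (row-wise).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: share of each predicted class that was correct (column-wise).
precision = cm.diagonal() / cm.sum(axis=0)
print('Recall per class:   ', recall.round(2))
print('Precision per class:', precision.round(2))
```

The diagonal holds the correctly classified samples; everything off the diagonal is a confusion between two classes.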

### Predict classification based on the data image and save it

In [None]:
def predictImage(modelName):
    start_time = time.time()    
    
    #Set file paths
    classifiedImageFile = os.path.join(outputBaseFolder, ('classification_' + modelName + '.tif'))
    modelFile = os.path.join(outputBaseFolder, ('model_' + modelName + '.sav'))    
         
    #Load the model from the saved file
    trained_model = load(modelFile)

    # predict the class for each pixel
    prediction = trained_model.predict(pixels)

    # Reshape back to 2D
    print('Prediction shape in 1D: ', prediction.shape)
    prediction2D = np.reshape(prediction, (image_data.shape[1], image_data.shape[2]))
    print('Prediction shape in 2D: ', prediction2D.shape)

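    # Save the results as .tif file.
    # Copy metadata from the labels image and adapt it to the exercise BBOX,
    # because the prediction covers only the window read above, not the full raster.
    outputMeta = labels_src.meta
    height, width = prediction2D.shape
    outputMeta.update(height=height, width=width,
                      transform=rasterio.transform.from_bounds(minx, miny, maxx, maxy, width, height))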
    # Writing the image on the disk
    with rasterio.open(classifiedImageFile, 'w', **outputMeta) as dst:
        dst.write(prediction2D, 1)
    print('Predicting took: ', round((time.time() - start_time), 1), ' seconds')
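
Reshaping the 1D prediction back to the image grid works because the pixel order was never changed. The following toy sketch checks this; the array sizes are made up, and the first band stands in for the output of `predict()`.

```python
import numpy as np

rows, cols, n_bands = 2, 3, 2
cube = np.arange(n_bands * rows * cols).reshape(n_bands, rows, cols)

# Same reshaping as for the real image: one row per pixel.
pixels_toy = np.transpose(cube, (1, 2, 0)).reshape(-1, n_bands)

# Stand-in for a model prediction: one value per pixel (here simply band 0).
prediction_toy = pixels_toy[:, 0]

# Reshape the 1D result back to the image grid.
prediction_2d = prediction_toy.reshape(rows, cols)

# Every value is back at its original row and column.
print(np.array_equal(prediction_2d, cube[0]))
```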

### Random forest     

In [None]:
classifierName = 'random_forest'
# Initialize the random forest classifier and give the hyperparameters.
clf_random_forest = RandomForestClassifier(n_estimators=200, max_depth=75, random_state=0, n_jobs=n_jobs)
clf_random_forest = trainModel(x_train, y_train, clf_random_forest, classifierName)
estimateModel(clf_random_forest, x_validation, y_validation) #Validation data

*Feel free to modify some of the hyperparameters above to get better results.*
Then check with the test data whether the modifications also help for previously unseen data.

In [None]:
estimateModel(clf_random_forest, x_test, y_test) #Test data
predictImage(classifierName)

In [None]:
print('Feature importances:')
# One name per band/date combination, in the same order as the bands in the image.
feature_names = ['blue-may', 'blue-july', 'green-may', 'green-july', 'red-may', 'red-july', 'infrared-may', 'infrared-july']
for importance in zip(feature_names, clf_random_forest.feature_importances_):
    print(importance)

### Stochastic Gradient Descent

In [None]:
classifierName = 'SGD'    
clf_SGD = SGDClassifier(loss="log_loss", learning_rate='adaptive', eta0=.1, alpha=1e-5, n_jobs=n_jobs, max_iter=2000, penalty='l1')
clf_SGD = trainModel(x_train, y_train, clf_SGD, classifierName)
estimateModel(clf_SGD, x_validation, y_validation) #Validation data

In [None]:
estimateModel(clf_SGD, x_test, y_test) #Test data
predictImage(classifierName)

### Gradient Boosting   

In [None]:
classifierName = 'gradient_boosting'    
clf_gradient_boosting = GradientBoostingClassifier(n_estimators=1000, learning_rate=.05)
clf_gradient_boosting = trainModel(x_train, y_train, clf_gradient_boosting, classifierName)
estimateModel(clf_gradient_boosting, x_validation, y_validation) #Validation data

In [None]:
estimateModel(clf_gradient_boosting, x_test, y_test) #Test data
predictImage(classifierName)

In [None]:
print('Feature importances:')
# One name per band/date combination, in the same order as the bands in the image.
feature_names = ['blue-may', 'blue-july', 'green-may', 'green-july', 'red-may', 'red-july', 'infrared-may', 'infrared-july']
for importance in zip(feature_names, clf_gradient_boosting.feature_importances_):
    print(importance)

### SVM Support Vector Classifier

*SVM is slower than others, wait a moment.*

In [None]:
classifierName = 'SVM'        
clf_svc = SVC(kernel='rbf', gamma='auto',  decision_function_shape='ovr')
clf_svc = trainModel(x_train, y_train, clf_svc, classifierName)
estimateModel(clf_svc, x_validation, y_validation) #Validation data
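
The `gamma` parameter controls how quickly the RBF kernel's similarity drops with the distance between two pixels in feature space. According to the scikit-learn documentation, `gamma='auto'` means `1 / n_features`, which is 0.125 for the eight bands used here. The sketch below evaluates the kernel formula exp(-gamma * ||a - b||^2) by hand for two made-up pixel vectors; the values and the helper `rbf_kernel_value` are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

n_features = 8
gamma_auto = 1 / n_features

# Two toy pixels with reflectance-like values that differ by 0.2 in every band.
a = np.full(n_features, 0.10)
b = np.full(n_features, 0.30)

for gamma in (gamma_auto, 1, 10):
    print('gamma =', gamma, ' kernel value =', round(float(rbf_kernel_value(a, b, gamma)), 4))
```

With values scaled to [0 ... 1], distances between pixels are small, so a small `gamma` makes quite different pixels look almost identical to the kernel. This may be one reason why other `gamma` values are worth trying.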

## 5.5 Grid Search for SVC

Different models have different settings (hyperparameters) that can be tuned to find the best model. Grid search is one way to search for better settings automatically. For more options in hyperparameter search see the [CSC machine learning guide](https://docs.csc.fi/support/tutorials/hyperparameter_search/).

Here we try different `C` and `gamma` values for the SVM model. With the default `refit=True`, grid search retrains the model with the best parameter combination on the whole training set, so the fitted grid search object can be used directly for prediction.

*Notice, how the results are improved from the first SVM result above.*
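
The size of the search can be worked out by hand. The sketch below enumerates the same grid with `itertools.product` and counts the model fits needed for two cross-validation folds. It is a plain-Python stand-in for illustration, not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {'C': [1000, 10000], 'gamma': [1, 10]}
cv = 2  # number of cross-validation folds

# Every combination of one C value with one gamma value.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

for combination in combinations:
    print(combination)

n_cv_fits = len(combinations) * cv
print('Candidates:', len(combinations), ' cross-validation fits:', n_cv_fits)
```

With the default `refit=True` there is one additional fit at the end, when the best combination is retrained on the whole training set. The number of fits grows quickly when more values or parameters are added to the grid.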

In [None]:
classifierName = 'SVC_grid_search'        
# Find the optimal parameters for SVM
param_grid = {'C': [1000, 10000], 'gamma': [1, 10]}
# Initialize the grid search, cv is the number of cross-validation folds, kept at minimum here for faster results.
grid = GridSearchCV(SVC(), param_grid, verbose=1, n_jobs=n_jobs, cv=2)    
# Try different options
grid = trainModel(x_train, y_train, grid, classifierName)

# Print the best option
print('Best selected parameters: ',format(grid.best_params_))
print('Best estimator: ',format(grid.best_estimator_))

# Evaluate the classifier using validation data
estimateModel(grid, x_validation, y_validation) #Validation data

In [None]:
estimateModel(grid, x_test, y_test) #Test data

*Slow, wait.*

In [None]:
predictImage(classifierName)  

## 5.6 Plot the results

In [None]:
# Helper function to normalize band values and enhance contrast, similar to what QGIS does automatically.
def normalize(array):
    min_percent = 2   # Low percentile
    max_percent = 98  # High percentile
    lo, hi = np.percentile(array, (min_percent, max_percent))
    return (array - lo) / (hi - lo)
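
The effect of the percentile stretch can be seen on a small made-up band with one very bright outlier. The sketch repeats the `normalize` function so that it runs on its own. The `np.clip` step is an optional extra, not part of the exercise code: values outside the 2nd and 98th percentiles end up outside [0, 1], while matplotlib expects floating point RGB values within that range.

```python
import numpy as np

def normalize(array, min_percent=2, max_percent=98):
    """Stretch values so that the given percentiles map to 0 and 1."""
    lo, hi = np.percentile(array, (min_percent, max_percent))
    return (array - lo) / (hi - lo)

# Toy band: values 0..99 plus one very bright outlier.
band = np.append(np.arange(100, dtype=float), 1000.0)

stretched = normalize(band)
clipped = np.clip(stretched, 0, 1)

print('Stretched values outside [0, 1]:', stretched.min() < 0, stretched.max() > 1)
print('Range after clipping:', clipped.min(), clipped.max())
```

Using percentiles instead of the minimum and maximum keeps a few extreme pixels from compressing the contrast of the rest of the image.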

In [None]:
# Create a subplot for 6 images: 4 classification, 1 data image and 1 training labels. 
fig, ax = plt.subplots(ncols=2, nrows=3, figsize=(10,15))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["white","green","orange","blue","violet"])

# The prediction results
rf_results = rasterio.open(os.path.join(outputBaseFolder,'classification_random_forest.tif'))
show(rf_results, ax=ax[0, 0], cmap=cmap, title='Random forest')

SGD_results = rasterio.open(os.path.join(outputBaseFolder,'classification_SGD.tif'))
show(SGD_results, ax=ax[0, 1], cmap=cmap, title='SGD')

gradient_boost_results = rasterio.open(os.path.join(outputBaseFolder,'classification_gradient_boosting.tif'))
show(gradient_boost_results, ax=ax[2, 0], cmap=cmap, title='gradient_boost')

SVM_grid_search_results = rasterio.open(os.path.join(outputBaseFolder,'classification_SVC_grid_search.tif'))
show(SVM_grid_search_results, ax=ax[2, 1], cmap=cmap, title='SVM grid search')

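# Plot the Sentinel image as a false-colour composite (NIR, red, green) of the July bands.
# Band order in the image: index 7 = b08 (NIR), 5 = b04 (red), 3 = b03 (green), all from 2021-07-21.
nir, red, green = image_data[7], image_data[5], image_data[3]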
nirn, redn, greenn = normalize(nir), normalize(red), normalize(green)
stacked = np.stack((nirn, redn, greenn))
show(stacked, ax=ax[1,0], title='image') 

# Labels 
show(labels_data, ax=ax[1,1], cmap=cmap, title='labels')