<a href="https://colab.research.google.com/github/duncansnh/Bare-peat/blob/master/Data_Split_Classification_for_github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Script to perform supervised classification. Applied to bare peat classification on a nationwide scale using Sentinel-2. Joblog 92279, 1 November 2019  
Inputs are
• indices generated from Sentinel-2 imagery (23 layers are generated by the previous script, but other numbers of layers may be used)
• polygon dataset of training samples (known classes: bare peat = 10, other = 20, rock/stone = 30, shadow = 40, water = 50). The ID field must contain the class label (short integer); the Poly_ID field contains a unique polygon ID (short integer).
Main steps:
• extracts pixel values for polygons
• resamples pixels; the default is 300 pixels per class. The number of pixels required from each polygon is estimated from the polygon's share of all pixels in its class. Where a polygon holds at least as many pixels as required, a random subset is taken; where it holds fewer, all of its pixels are used and additional pixels are randomly sampled with replacement.
• splits into training/validation data (does not take the source polygon into account) and scales values based on the training data
• shows accuracy matrices for 6 classifiers: random forest, KNN, logistic regression, XGBoost, linear discriminant analysis, ensemble classifier
• needs user input to choose the most accurate classifier
• classifies the input image and writes out:
  o Probability for class == bare peat.

  Currently set up to run in Google Colab, with folders in Google Drive labelled:
  Imagery, Training_Data

Adapted from an original script by Tom Wilson, Forestry Research, which was used with Sentinel-1 data to classify forestry as felled/mature trees. 

##To use a different number of bands:
 

*   Update the number of bands (B) in the 9th code block ('extract all data for all pixels'), where the code is highlighted with lines of ######, and select which bands should be included (again highlighted in this code block with lines of #########).
*   Update the final code block with the band to be excluded (highlighted with lines of ####).
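
As a hedged illustration of the two band-numbering conventions involved in these steps: the `indexes` argument used with `mask` counts bands from 1, while `np.delete` on an array read from the raster counts from 0. The sketch below uses only NumPy and a stand-in array; `n_bands`, `excluded_band` and `new_B` are illustrative names, not variables from this script.

```python
import numpy as np

n_bands = 23          # number of layers in the input image (as assumed in this script)
excluded_band = 14    # 1-based number of the band to drop (darkness in this script)

# 1-based list suitable for the `indexes` argument of rasterio.mask.mask
bandlist = [b for b in range(1, n_bands + 1) if b != excluded_band]

# Stand-in for a block read from the raster: shape (bands, rows, cols)
block = np.arange(n_bands * 2 * 2, dtype=float).reshape(n_bands, 2, 2)

# 0-based index for np.delete along the band axis
reduced = np.delete(block, excluded_band - 1, axis=0)

new_B = len(bandlist)  # the value B would need to be updated to
print(new_B, reduced.shape)
```

Both routes must drop the same band, otherwise the model is trained on one set of layers and applied to another.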



In [0]:
#This is only required if running in colab notebook to install the libraries
#If running Python code elsewhere need to make sure below libraries are installed
! pip install geopandas
! pip install descartes
! pip install rasterio
! pip install rasterstats

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import json
from sklearn.model_selection import train_test_split
import geopandas as gpd
import descartes
import rasterio
from rasterio.mask import mask
from rasterio.features import geometry_mask
from shapely.geometry import mapping
from rasterstats import zonal_stats
import datetime
import math

In [0]:
#Only if running in Google Colab, in which case input image, training polygons and output results need to be in Google Drive.
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

set working drive, iteration (for naming outputs), number of pixels per class

In [0]:

wd = '/content/drive/My Drive'
iteration = 'ML13_v3'
samplesize = 300 #number of pixels to be sampled per class (approximate)

###Set image directory, training directory

In [0]:
image_dir = os.path.join(wd, 'Imagery')
training_dir= os.path.join(wd, 'Training_Data')



###'open' input image

In [0]:
#Read image
s2 = rasterio.open(os.path.join(image_dir,'ML13_23_indices.tif'))
#Print number of bands
B = s2.count
print(B)
print(s2.shape)
#Copy raster profile for later output
s2prof = s2.profile.copy()
s2prof.update(count = 1, nodata=None)

OPTIONAL - get max and min values from each band ( if raster is small) 

In [0]:
array = s2.read()

# Calculate statistics for each band
stats = []

for band in array:
  stats.append({
    'min': band.min(),
#   'mean': band.mean(),
#   'median': np.median(band),
    'max': band.max()})

del(array)
print((stats))


#open polygon dataset

In [0]:
TaggedPolys= gpd.read_file(os.path.join(training_dir, 'Training_ML13.shp'))

##extract all data for all pixels where the centroid falls within polygons, add polygon id and category id to final 2 columns






In [0]:
#####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################
#####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################
#ALTER B if changing the number of bands
#B = 22
B=23

#####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################
#####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################


def getPixels(image, poly, indexInput, polygons,  target):
    global B
    shape=[mapping(poly)] 
    print((shape))
    print(("got to -1"))


    ######################## USE THE 2 LINES BELOW TO UPDATE BAND LIST IF NEEDING TO EXCLUDE BAND, COMMENT OUT FIRST LINE##############################
    #####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################
    outImage, out_transform = mask(image, shape, crop=True, nodata=np.nan)#reduce imagery to pixels overlapping polygon
    #bandlist = [1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20, 21,22,23] # EXCLUDES DARKNESS
    #outImage, out_transform = mask(image, shape, crop=True, nodata=np.nan, indexes = bandlist)# optionally reduce number of bands with indexes, CHANGE B if this is the case
    
    
    #####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################
    #####################################XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX###########################################


    # nodata must be NaN: if it is set to a number, pixels whose centroid lies outside the polygon are kept and not removed by dropna
    outList = outImage.reshape((B, -1)).T# flatten to one row per band (-1 infers the pixel count), then transpose to one row per pixel and one column per band
    
    currentPolyID = polygons.loc[indexInput,"Poly_ID"]# get current polygon ID
   
    currentPolyIDarr= np.repeat(currentPolyID, outList.shape[0])# creates 1D array of polyID, size equal to number of pixels (shape returns rows, columns)
    currentPolyIDarr= currentPolyIDarr.reshape((outList.shape[0],1))# creates 2D array, 1column
    currentCategory =polygons.loc[indexInput,"ID"]
    currentCategoryarr= np.repeat(currentCategory, outList.shape[0])
    currentCategoryarr= currentCategoryarr.reshape((outList.shape[0],1))# create 2D array of current class / category
    
    outList = np.concatenate((outList,currentPolyIDarr), axis = 1)# add poly ID to pixel values
    outList = np.append(outList,currentCategoryarr, axis=1)# add class to pixel values
    outList = pd.DataFrame(outList).dropna()
 
    return np.append(target, outList, axis=0)


def extractAllPolygons(image, featuresgeom, features):
    global B # number of bands in input image
    finalcolno = B+2 # number of columns in extracted pixel dataset
    flatten = np.array([]).reshape(0,finalcolno).astype(float)# empty dataset with number of colums set and datatype set to float
    for index, f in enumerate(featuresgeom): #iterate through each polygon
      indexInput= index# iteration number
      flatten = getPixels(image,f,indexInput, features, flatten)
    # NaN never compares equal to itself, so a mask built with `== np.nan` is always empty; dropna() does the filtering
    return pd.DataFrame(flatten).dropna()# remove any NaN - machine learning models can't deal with them


totValues = extractAllPolygons(s2,TaggedPolys.geometry.values, TaggedPolys)# run both of the above functions; inputs are the imagery, the geometry array and the GeoPandas dataframe
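
The reshaping inside `getPixels` can be hard to picture, so here is a minimal sketch on a toy array. It assumes a 3-band, 2x2 crop in which one pixel lies outside the polygon (NaN); the polygon ID 7 and class 10 are made-up values.

```python
import numpy as np
import pandas as pd

B = 3  # number of bands in this toy example
# Stand-in for the masked crop: shape (bands, rows, cols), NaN outside the polygon
crop = np.array([
    [[1.0, np.nan], [2.0, 3.0]],
    [[10.0, np.nan], [20.0, 30.0]],
    [[100.0, np.nan], [200.0, 300.0]],
])

pixels = crop.reshape((B, -1)).T                 # one row per pixel, one column per band
poly_id = np.full((pixels.shape[0], 1), 7.0)     # hypothetical Poly_ID
class_id = np.full((pixels.shape[0], 1), 10.0)   # hypothetical class label (bare peat)

table = np.concatenate((pixels, poly_id, class_id), axis=1)
table = pd.DataFrame(table).dropna()             # drops the pixel outside the polygon

# NaN never compares equal to itself, so count NaN with np.isnan rather than == np.nan
print(table.shape, np.isnan(crop).sum())
```

Each surviving row holds the B band values followed by the polygon ID and the class label, which is the layout the later cells index with `B` and `B + 1`.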


In [0]:
print((totValues.shape))
print((totValues.size))

#### select appropriate number of pixels for each polygon, based on size of polygon and total size of area for each class. Select all data and resample where required number of pixels exceeds total pixels in polygon, where number of pixels required is less than that in polygon take random selection.



In [0]:
#get unique class IDs
ClassColumnIndex = B + 1 # get column index for class ID
PolygonColumnIndex = B # get column index for polygon ID
Classes = totValues.iloc[:,ClassColumnIndex].unique()


#iloc is index rather than name
FinValues = pd.DataFrame()#create empty pandas dataframe
for Class in Classes: #iterate through each class
  ClassValues = totValues[totValues.iloc[:,ClassColumnIndex]==Class]# select all rows from totValues for this class 
  totpixels = ClassValues.shape[0] # get number of rows (equals number of pixels) for this class from all training polygons
  print((' class :{}  tot pixels = {}'. format(Class,totpixels) ))

  ClassPolygonIDs = ClassValues.iloc[:,B].unique() #obtain polyIDs for this class
  for polyID in ClassPolygonIDs: #iterate through polygons, sample from each polygon based on size
    ClassPolygonValues = ClassValues[ClassValues.iloc[:,B]==polyID]# pixel values for this class , this polygon
    PolySize = ClassPolygonValues.shape[0] # get number of rows (equals number of pixels) for this polygon
    ReqPixels = ((int(math.ceil((PolySize/totpixels)*samplesize)))) #number of pixels required from this polygon, taking into account overall number in training data and total pixels required for class
    print(("Total pixels for poly = {}".format(PolySize)))
    print(("Number of pixels for class {}, polyId {} is :{}".format(Class, polyID,ReqPixels )))

    if PolySize>= ReqPixels:# random selection if number of pixels in polygon is at least the number required
      SelectedValues = ClassPolygonValues.sample(ReqPixels, replace=False)
      
      FinValues = pd.concat((FinValues, SelectedValues), axis = 0)# DataFrame.append was removed in pandas 2.0
      
    elif PolySize < ReqPixels:# select all pixels and then resample with replacement if number of pixels in polygon is lower than number required
      Extrasamplesize = ReqPixels - PolySize
      ExtraValues = ClassPolygonValues.sample(Extrasamplesize, replace=True)
      print(type(ExtraValues))
      SelectedValues = pd.concat((ExtraValues,ClassPolygonValues), axis = 0)
      FinValues = pd.concat((FinValues, SelectedValues), axis = 0)# DataFrame.append was removed in pandas 2.0
      print((FinValues.shape))

print((type(Classes)))
print((Classes))



print(FinValues.shape)
print((FinValues.iloc[0,:]))
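
The per-polygon quota used above can be checked by hand. The sketch below repeats the `ReqPixels` formula on made-up pixel counts; `required_pixels` is an illustrative helper, not part of the script.

```python
import math

def required_pixels(poly_size, class_total, samplesize=300):
    """Pixels to draw from one polygon, proportional to its share of the class."""
    return int(math.ceil((poly_size / class_total) * samplesize))

# Hypothetical class with plenty of pixels (1024 in total)
large_sizes = [512, 384, 128]
large_quotas = [required_pixels(n, sum(large_sizes)) for n in large_sizes]

# Hypothetical class with fewer pixels than samplesize (64 in total)
small_sizes = [48, 16]
small_quotas = [required_pixels(n, sum(small_sizes)) for n in small_sizes]

print(large_quotas, sum(large_quotas))
print(small_quotas, [q > n for q, n in zip(small_quotas, small_sizes)])
```

Because each quota is rounded up, a class can end with slightly more than `samplesize` pixels, which is why the figure is described as approximate. A quota exceeds the polygon's own pixel count only when the whole class has fewer than `samplesize` pixels, and that is the case where sampling with replacement is used.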




##Split into training and validation datasets, (not grouped by polygon so pixels from the same polygon may appear in training AND test samples - ensures fraction of split is honoured and may give better output classification as uses data from wider range of pixels )

In [0]:
#iterate through each class

#unique class IDs from previous step are in Classes

trainX =pd.DataFrame(columns=range(B))
testX = pd.DataFrame(columns=range(B))
trainy = pd.DataFrame(columns=range(1))
testy = pd.DataFrame(columns=range(1))
print((testy.shape))



for Class in Classes:
  print((Class))
  ClassSampledValues = FinValues[FinValues.iloc[:,ClassColumnIndex]==Class ] #select current class from all pixel values


  ClassSampledValuesTrain, ClassSampledValuesTest = train_test_split(ClassSampledValues, test_size = 0.2, random_state = 999 )

  trainXClass = ClassSampledValuesTrain.iloc[:,0:B]#select training data columns
  #selects columns from index 0 up to but not including B, i.e. the first B columns (the band values)
  trainyClass = ClassSampledValuesTrain.iloc[:, ClassColumnIndex].values.reshape(-1,1)# select class column only 

  testXClass = ClassSampledValuesTest.iloc[:,0:B]#select test data columns
  testyClass = ClassSampledValuesTest.iloc[:, ClassColumnIndex].values.reshape(-1,1)# select class column only 
 
  trainX =  np.append(trainX, trainXClass, axis = 0)
  testX = np.append(testX, testXClass, axis = 0)
  trainy = np.append(trainy,trainyClass, axis = 0)
  testy = np.append(testy, testyClass, axis = 0)
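
Since the split ignores the source polygon, it can be useful to see how many polygons end up on both sides. The following is a rough sketch with NumPy only, using made-up polygon IDs; in this notebook the IDs would come from column `PolygonColumnIndex` of the sampled values.

```python
import numpy as np

rng = np.random.default_rng(999)

# Hypothetical polygon IDs for 20 sampled pixels of one class (4 polygons, 5 pixels each)
poly_ids = np.repeat([1, 2, 3, 4], 5)

# Random 80/20 pixel-level split that ignores the source polygon
order = rng.permutation(poly_ids.size)
n_test = poly_ids.size // 5
test_polys = poly_ids[order[:n_test]]
train_polys = poly_ids[order[n_test:]]

# Polygons contributing pixels to both subsets
shared = np.intersect1d(train_polys, test_polys)
print(len(train_polys), len(test_polys), shared)
```

In this toy case every polygon that supplies a test pixel also supplies training pixels, so the reported accuracy reflects pixels from polygons the model has already seen.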
  

  




### scale DFs based on training data
You may need to check that the dataframe has no missing, zero or repeated fields; any of these suggests that something has gone wrong in the image creation process. 

In [0]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

trainX = sc.fit_transform(trainX)# scale training data
testX = sc.transform(testX)# scale test data based on training data 
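
For reference, the scaling step amounts to subtracting the per-band training mean and dividing by the per-band training standard deviation. A minimal NumPy sketch with made-up values for two bands:

```python
import numpy as np

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 70.0]])

mean = train.mean(axis=0)   # per-band mean from training pixels only
std = train.std(axis=0)     # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # test data reuse the training statistics

print(train_scaled.mean(axis=0), test_scaled)
```

The same training statistics have to be applied to the full image before prediction, which is what the `sc.transform` call in the final cell does.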



### Supervised ML classifiers. Scale training, testing matrices, train different models, show different accuracy metrics


In [0]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score



Random Forest

In [0]:
trainy= trainy.astype(int)# class labels need to be in a format the classifier accepts - int works
testy= testy.astype(int)# class labels need to be in a format the classifier accepts - int works

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( random_state = 99, max_features=2, n_estimators=1000)
                            
# note that higher values of  n_estimators (n trees in the forest) may increase time to generate final output 
#max_features = max number of features considered for splitting a node
rf.fit(trainX, trainy)# train model
pred_y = rf.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y, average = None  )

print ("RF:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))

print(("Parameters currently in use : {}".format((rf.get_params()))))

print(("classes: {}".format(rf.classes_)))
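
To help read the printed matrix and F1 scores, the sketch below rebuilds them by hand for a made-up three-class example. It follows the scikit-learn layout, where rows are the true class and columns are the predicted class.

```python
import numpy as np

classes = np.array([10, 20, 30])
y_true = np.array([10, 10, 10, 20, 20, 30])
y_pred = np.array([10, 10, 20, 20, 10, 30])

# Confusion matrix: rows = true class, columns = predicted class
cm = np.zeros((classes.size, classes.size), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[np.searchsorted(classes, t), np.searchsorted(classes, p)] += 1

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # correct / all pixels predicted as the class
recall = tp / cm.sum(axis=1)      # correct / all pixels truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(cm, f1, round(accuracy * 100, 3))
```

For bare peat (class 10), off-diagonal counts in the first row are omission errors and off-diagonal counts in the first column are commission errors.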



#print feature importance table and plot

In [0]:
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

#get importances
importances = rf.feature_importances_


std = np.std([tree.feature_importances_ for tree in rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1] # invert sorted array

# Print the feature rank
print("Feature ranking:")

for f in range(trainX.shape[1]):
    print("{}. feature {} ({})" .format(f + 1, indices[f]+1, importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(trainX.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(trainX.shape[1]), indices+1)
plt.xlim([-1, trainX.shape[1]])
plt.show()
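
The ranking printed above relies on `np.argsort` and a shift from 0-based column index to 1-based band number. A small sketch with made-up importances for four bands:

```python
import numpy as np

importances = np.array([0.10, 0.50, 0.15, 0.25])  # hypothetical importances for 4 bands

order = np.argsort(importances)[::-1]   # column indices from most to least important
band_numbers = order + 1                # convert 0-based column index to 1-based band number

for rank, (band, value) in enumerate(zip(band_numbers, importances[order]), start=1):
    print("{}. band {} ({})".format(rank, band, value))
```

If a band has been excluded from the inputs, the band numbers after it shift down by one relative to the original image.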



#Optional - fine-tune RF parameters. It is worth trying the default values too, as not all combinations are tested in the fine-tune. Note that the best parameters found for a random selection of training data may not be the best when the model is applied to all data combined

In [0]:


import sklearn.ensemble  
import sklearn.metrics  
from sklearn.model_selection import GridSearchCV
 


rftune = sklearn.ensemble.RandomForestClassifier()  

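param_grid = {  
           "n_estimators" : [5, 10, 100,500, 1000],  
           "max_features" : ["sqrt", "log2", 2,3,4,10], #"auto" is omitted: for classifiers it was an alias of "sqrt" and recent scikit-learn releases no longer accept it. Add more parameters here if required - but this will increase time to run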
           "random_state" : [99]
           }  

flattrainy = np.ndarray.flatten(trainy)
CV_rf = GridSearchCV(estimator=rftune, param_grid=param_grid)  
CV_rf.fit(trainX, flattrainy)  
print(CV_rf.best_params_)  
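
Run time for the grid search grows with the product of the list lengths in the grid. The sketch below counts the fits for a smaller, made-up grid, assuming the `GridSearchCV` default of 5-fold cross-validation; `count_fits` is an illustrative helper.

```python
from itertools import product

def count_fits(param_grid, cv=5):
    """Parameter combinations and model fits in a full grid search (excluding the final refit)."""
    n_combinations = len(list(product(*param_grid.values())))
    return n_combinations, n_combinations * cv

# Hypothetical grid, smaller than the one in this script
grid = {
    "n_estimators": [10, 100, 1000],
    "max_features": ["sqrt", "log2", 2],
    "random_state": [99],
}

print(count_fits(grid))
```

Adding one more value to any list multiplies the number of fits accordingly, and fits with many trees dominate the total time.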

KNN

In [0]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski', p = 2)
knn.fit(trainX, trainy)
pred_y = knn.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y, average=None)
print ("KNN:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))


Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression(random_state = 99, max_iter=5000)
lgr.fit(trainX, trainy)
pred_y = lgr.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y, average=None)
print ("logistic regression:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))

XG Boost

In [0]:
from xgboost import XGBClassifier
xg = XGBClassifier()
xg.fit(trainX, trainy)
pred_y = xg.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y,average=None)
print ("XGBoost:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))

Linear Discriminant Analysis

In [0]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
ld = LDA()
ld.fit(trainX, trainy)
pred_y = ld.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y, average=None)
print ("LDA:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))

Ensemble classifier - try different combinations of models to try to get zero error on the class of interest (bare peat is class 10, first row/column). It may be possible to 'balance' errors out, e.g. if one model has a class 10 omission error and another has a class 10 commission error against the problematic class (usually 20).

In [0]:
from sklearn.ensemble import VotingClassifier
#voting_clf = VotingClassifier(estimators=[('lr', lgr),('rf', rf),('knn',knn),('xg',xg)],voting='soft')#voting='hard'
voting_clf = VotingClassifier(estimators=[('rf',rf) ,('knn',knn)],voting='soft')#voting='hard'
voting_clf.fit(trainX, trainy)
pred_y = voting_clf.predict(testX)
cm = confusion_matrix(testy, pred_y)
acc = accuracy_score(testy, pred_y)
f1 = f1_score(testy, pred_y, average=None)
print ("Ensemble:\n{0}\nOverall: {1}%\nF1: {2}".format(cm,round(acc*100, 3),(f1)))
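
With `voting='soft'`, the ensemble averages the class probabilities of its members and picks the class with the highest average. A minimal sketch with made-up probabilities for two models and three pixels:

```python
import numpy as np

classes = np.array([10, 20])
# Hypothetical per-model class probabilities for three pixels (columns follow `classes`)
proba_rf = np.array([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
proba_knn = np.array([[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]])

# Soft voting with equal weights: average the probabilities, then take the most probable class
proba_soft = (proba_rf + proba_knn) / 2
pred_soft = classes[np.argmax(proba_soft, axis=1)]

print(proba_soft, pred_soft)
```

For the third pixel the first model alone would predict class 20, but the confident second model pulls the average towards class 10. This is the kind of error balancing described for the ensemble.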

Optional - run on all data

In [0]:

alltrainX = np.append(testX, trainX, axis = 0)
alltrainy = np.append(testy, trainy, axis = 0)
alltrainy = np.ndarray.flatten(alltrainy)
#knn.fit(alltrainX, alltrainy)
#rf.fit(alltrainX, alltrainy)


voting_clf.fit(alltrainX, alltrainy)

Output probability image - this stage takes longer for ensemble classifier and time may also depend on parameters of random forest 

In [0]:


#Select model of choice
model = voting_clf #xg, rf, knn, lgr ,voting_clf



##NOTE - unable to predict probability for ensemble classifier when voting = hard.
##NOTE - knn probability output only takes a few distinct values
s2prof.update(count=1, nodata=None, dtype=np.float32)
dst = rasterio.open(os.path.join(image_dir,'{}_RF_KNN_alldata1000trees.tif'.format(iteration)), 'w', **s2prof)

for block_index, window in s2.block_windows(1):
    s2_block = s2.read(window=window, masked=True)
    ######################################################################################
    #SELECTION OF BANDS##########################################
    # delete the band that needs excluding (band number minus 1, eg darkness is index 13)
    #s2_block = np.delete(s2_block,13, 0)
    ################################################################################
    v= s2_block.shape
    s2_block = s2_block.reshape(B, -1).T
    s2_block = sc.transform(s2_block)
    s2_block[s2_block<-3.4e+35]=9999
    result_block = model.predict_proba(s2_block).astype('float32')
    #select probabilities for the first class in model.classes_ only (bare peat, class 10)
    result_block = result_block[:,0]
    result_block = result_block.reshape(1,v[1],v[2])
    dst.write(result_block, window=window)
dst.close()
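
The block loop above reshapes each window from (bands, rows, cols) to one row per pixel, predicts, and reshapes back to a single-band block. The sketch below walks through that round trip with a stand-in for `model.predict_proba`; it also looks up the bare peat column by label rather than assuming it is column 0. `fake_predict_proba` and `peat_col` are illustrative names.

```python
import numpy as np

B = 3
classes = np.array([10, 20, 30])      # stand-in for model.classes_

def fake_predict_proba(X):
    """Hypothetical stand-in for model.predict_proba: one row per pixel, one column per class."""
    p = np.zeros((X.shape[0], classes.size))
    p[:, 0] = 0.7
    p[:, 1] = 0.2
    p[:, 2] = 0.1
    return p

block = np.random.default_rng(0).random((B, 4, 5))   # (bands, rows, cols)
shape = block.shape

pixels = block.reshape(B, -1).T                       # (rows*cols, bands)
proba = fake_predict_proba(pixels).astype('float32')

# Look up the bare peat column by label instead of assuming it is column 0
peat_col = int(np.where(classes == 10)[0][0])
result = proba[:, peat_col].reshape(1, shape[1], shape[2])  # back to (1, rows, cols)

print(result.shape, result.dtype)
```

Looking up the column from `classes_` guards against the class order changing, for example if a class with a label lower than 10 were added to the training data.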