# Machine learning in Geosciences -  Ensemble learning

**Department of Applied Geoinformatics and Cartography, Charles University**

03.04.2023

---------------
Mgr. Daniel Bicák    
*bicakd@natur.cuni.cz*

---------------

### Dataset

First, we need to create a dataset for testing our algorithms. We will use the **scipy** library, which you can install with `conda install -c anaconda scipy`. Our dataset will consist of **200 samples** (instances) with **4 features**.

In [None]:
# import the libraries 
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import math

In [None]:
%matplotlib inline

In [None]:
# let's create a function for min-max scaling

def custom_scaler(data, min_v, max_v):
    return min_v+(((data-data.min())*(max_v-min_v))/(data.max()-data.min()))
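As a quick sanity check of the scaler, here is a minimal sketch on a hand-made array (the `sample` values are made up for illustration). The smallest value should map to `min_v` and the largest to `max_v`. Note that a constant array would make the denominator zero.

```python
import numpy as np

def custom_scaler(data, min_v, max_v):
    # same min-max formula as in the cell above
    return min_v + ((data - data.min()) * (max_v - min_v)) / (data.max() - data.min())

# hypothetical sample values, chosen only for illustration
sample = np.array([2.0, 4.0, 6.0, 10.0])
scaled = custom_scaler(sample, 0, 10)

# the smallest value maps to 0 and the largest to 10
print(scaled)
```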

In [None]:
# define dataset size
dataset_size = 200

# generate all features by sampling from the logistic distribution
from scipy.stats import logistic

feature1 = logistic.rvs(size=dataset_size).reshape(-1,1)
feature2 = logistic.rvs(size=dataset_size).reshape(-1,1)
feature3 = logistic.rvs(size=dataset_size).reshape(-1,1)
feature4 = logistic.rvs(size=dataset_size).reshape(-1,1)


# scale features
feature1 = custom_scaler(feature1, 0,1)
feature2 = custom_scaler(feature2, 0,1)
feature3 = custom_scaler(feature3, 0,1)
feature4 = custom_scaler(feature4, 0,10)


# create relationship
feature1_v = 1.2 * feature1
feature2_v = (-2*(feature2**2)) + (2*feature2)
feature3_v = (3*(feature3**2)) + (feature3)
feature4_v = np.sin(feature4)

# calculate the dependent variable
D = (feature1_v+feature2_v+feature3_v+feature4_v).flatten()

# inject the noise
# noise has max size of 10% of max value of D
max_size = D.max()/10

from scipy.stats import uniform

# we will generate it from uniform distribution and randomly add to D
noise = uniform.rvs(size=dataset_size)

# scale the noise from -max to max
noise_scaled = custom_scaler(noise, -max_size, max_size)

# add to the dependent variable
D_with_noise = D + noise_scaled

# scale back feature4
feature4 = custom_scaler(feature4, 0,1)

Let's explore the dataset. 

In [None]:
fig1 = plt.figure(figsize=(10,4))

fig1ax1 = fig1.add_subplot(121)
fig1ax2 = fig1.add_subplot(122)

fig1ax1.scatter(feature1, D_with_noise, s=3)
fig1ax1.set_title('feature 1')
fig1ax2.scatter(feature4, D_with_noise, s=3)
fig1ax2.set_title('feature 4')

In [None]:
fig2 = plt.figure(figsize=(6,4))

fig2ax1 = fig2.add_subplot(111)
fig2ax1.hist(D_with_noise, bins=20)
fig2ax1.set_title('Dependent variable')

There are no visible patterns in the data. Can our algorithms approximate the relationship accurately? Let's find out!

In [None]:
from sklearn.model_selection import train_test_split

# merge all arrays into one
Dataset = np.hstack((feature1, feature2, feature3, feature4))

# split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(Dataset, D_with_noise, test_size=0.25, random_state=42)

### Random Forest and Decision trees

`scikit-learn` provides a Random Forest class. We can work with the regressor in the same way as with other algorithms.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Create a Random Forest instance
RF_reg = RandomForestRegressor(max_depth=10, random_state=42, n_estimators=200, max_features=3)
RF_reg.fit(X_train, y_train)

# Create a Decision tree instance
DT_reg = DecisionTreeRegressor(random_state=42)
DT_reg.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_squared_error

# make a prediction for each algorithm and compute the RMSE
# (np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions)
RF_pred = RF_reg.predict(X_test)
RF_rmse = np.sqrt(mean_squared_error(y_test, RF_pred))

DT_pred = DT_reg.predict(X_test)
DT_rmse = np.sqrt(mean_squared_error(y_test, DT_pred))

print(f'RMSE for Random Forest is: {RF_rmse}, and for Decision Tree is: {DT_rmse}')

What is the importance value for each feature? 

In [None]:
RF_reg.feature_importances_
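The raw array is easier to read when each value is paired with a feature name. The sketch below is self-contained, so it rebuilds a stand-in dataset with the same relationship as above; as an assumption, the features are drawn from a uniform rather than a logistic distribution, and the names in `names` are hypothetical labels. Impurity-based importances are normalized, so they sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3]))

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# pair each importance with a (hypothetical) feature label
names = ['feature1', 'feature2', 'feature3', 'feature4']
importances = dict(zip(names, model.feature_importances_))

# print from most to least important
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {value:.3f}')
```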

We can explore how well Random Forest approximates the relationship. We will generate values for the feature of interest and hold the other features at a fixed value. Let's investigate the relationship between feature4 and the dependent variable.

In [None]:
# generate array from 0 to 1, dummy values for feature we want to explore
size=500

desired_feature = np.linspace(0,1,size)

# other features are held at a single constant value; the mean is usually the best choice,
# but 0 is used here, where the contributions of features 1-3 to D vanish
feat1_mean = np.full(shape=size, fill_value=0, dtype=float)
feat2_mean = np.full(shape=size, fill_value=0, dtype=float)
feat3_mean = np.full(shape=size, fill_value=0, dtype=float)

# merge features
RF_test_set = np.hstack((feat1_mean.reshape(-1,1), feat2_mean.reshape(-1,1), feat3_mean.reshape(-1,1), desired_feature.reshape(-1,1)))

# and predict
RF_output = RF_reg.predict(RF_test_set)

# we can do the same for decision tree
DT_output = DT_reg.predict(RF_test_set)


In [None]:
# plot the output

fig3 = plt.figure(figsize=(14,6))

fig3ax1 = fig3.add_subplot(121)
fig3ax1.scatter(desired_feature, RF_output, s=2)
fig3ax1.set_title('Random Forest')
fig3ax1.plot(desired_feature, np.sin(custom_scaler(desired_feature, 0, 10)), c='r' )


fig3ax2 = fig3.add_subplot(122)
fig3ax2.scatter(desired_feature, DT_output, s=2, c='r')
fig3ax2.set_title('Decision Tree')
fig3ax2.plot(desired_feature, (custom_scaler(np.sin(custom_scaler(desired_feature, 0, 10)), 0,1)), c='b' )

### Exercise 1

Can you tune the hyperparameters of these two algorithms? Try to achieve better results! You can use *Out-of-bag* samples when tuning Random Forest.
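One possible starting point for the Random Forest part is sketched below. It is not the reference solution: it scans only `max_features`, and it uses a stand-in dataset (assumption: uniform features with the same relationship as above). With `oob_score=True`, scikit-learn stores the out-of-bag R² in `oob_score_`, so no separate validation set is needed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))

# evaluate each candidate value on the out-of-bag samples
oob_scores = {}
for max_features in (1, 2, 3, 4):
    model = RandomForestRegressor(n_estimators=200, max_features=max_features,
                                  oob_score=True, random_state=42)
    model.fit(X, y)
    oob_scores[max_features] = model.oob_score_  # out-of-bag R^2

best_max_features = max(oob_scores, key=oob_scores.get)
print(oob_scores, best_max_features)
```

The same loop can be extended to `max_depth` or `min_samples_leaf`.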

### Exercise 2

Plot the RMSE of the Random Forest model against the *number of trees* parameter.
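A hedged sketch of how the values for such a plot could be collected: with `warm_start=True` the forest keeps its existing trees and only adds new ones each time `n_estimators` is raised. The dataset is a stand-in (assumption: uniform features), and `tree_counts` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree_counts = [1, 5, 10, 25, 50, 100, 200]
model = RandomForestRegressor(warm_start=True, random_state=42)

rmse_per_count = []
for n in tree_counts:
    # grow the existing forest to n trees instead of refitting from scratch
    model.set_params(n_estimators=n)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    rmse_per_count.append(rmse)

print(list(zip(tree_counts, rmse_per_count)))
```

Passing `tree_counts` and `rmse_per_count` to `plt.plot` gives the curve the exercise asks for.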

### AdaBoost and Gradient Boosting

An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.

Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions, see the seminal work of J.H. Friedman. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems in a variety of areas including Web search ranking and ecology.

*from scikit-learn*

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

Ada_reg = AdaBoostRegressor(random_state=42, n_estimators=100, learning_rate=1)
Grad_reg = GradientBoostingRegressor(random_state=42, learning_rate=0.1)

Ada_reg.fit(X_train, y_train)
Grad_reg.fit(X_train, y_train)

Ada_pred = Ada_reg.predict(X_test)
Grad_pred = Grad_reg.predict(X_test)

In [None]:
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
Ada_rmse = np.sqrt(mean_squared_error(y_test, Ada_pred))
Grad_rmse = np.sqrt(mean_squared_error(y_test, Grad_pred))

print(f'RMSE for AdaBoost is: {Ada_rmse}, and for Gradient Boosting is: {Grad_rmse}')

### Exercise 3

Find the best learning rate for Gradient Boosting.
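One way to approach this is a cross-validated grid search, sketched below. The candidate values in `param_grid` are an arbitrary assumption, and the data is a stand-in with uniform features; it is a starting point rather than the intended solution.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))

# candidate learning rates (arbitrary choice for illustration)
param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.5]}

search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X, y)

best_lr = search.best_params_['learning_rate']
best_rmse = -search.best_score_  # the scorer returns negative RMSE
print(best_lr, best_rmse)
```

The best learning rate depends on `n_estimators`, so the two are usually tuned together.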

### Exercise 4

Read the documentation and apply the *early stopping* option to the Gradient Boosting algorithm.
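As a hint, a minimal sketch of early stopping follows, with parameter values chosen arbitrarily and a stand-in dataset (assumption: uniform features). Setting `n_iter_no_change` makes the regressor hold out `validation_fraction` of the training data and stop once the validation score has not improved by at least `tol` for that many iterations. The number of trees actually fitted is stored in `n_estimators_`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))

# allow up to 1000 trees, but stop when the held-out score stalls
model = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                  validation_fraction=0.2, n_iter_no_change=10,
                                  tol=1e-4, random_state=42)
model.fit(X, y)

# number of boosting stages that were actually fitted
print(model.n_estimators_)
```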


## Bagging -  Exercise 5

Bagging is a general concept. This example focuses on applying bagging to the K-Nearest Neighbour (KNN) algorithm. KNN is a relatively weak algorithm, and we want to know whether bagging can improve its performance. Can KNN match the power of Random Forest or Gradient Boosting? We will use the same dataset. You can find information about bagging here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html


In [None]:
from sklearn.neighbors import KNeighborsRegressor

# create an instance of regressor
KNN_reg = KNeighborsRegressor(n_neighbors=5)

# find the best possible parameters using GridSearchCV

# create an instance of bagging

# fit the data
.fit(X_train, y_train)

# predict new values
.predict(X_test)

# calculate RMSE


In [None]:
# you can improve the accuracy further by changing the parameters of bagging
# parameter "max_features" will induce diversity into the model
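One possible shape of a solution is sketched below, not the only one. It uses a stand-in dataset (assumption: uniform features), an arbitrary grid for `n_neighbors`, and arbitrary bagging settings. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# tune the number of neighbours of a single KNN model
neighbor_grid = [3, 5, 7, 9, 11]
knn_search = GridSearchCV(KNeighborsRegressor(), {'n_neighbors': neighbor_grid},
                          scoring='neg_root_mean_squared_error', cv=5)
knn_search.fit(X_train, y_train)
best_k = knn_search.best_params_['n_neighbors']

# bag 50 KNN models; each one sees a bootstrap sample and 75% of the features
bagged_knn = BaggingRegressor(KNeighborsRegressor(n_neighbors=best_k),
                              n_estimators=50, max_features=0.75, random_state=42)
bagged_knn.fit(X_train, y_train)

knn_rmse = np.sqrt(mean_squared_error(y_test, knn_search.predict(X_test)))
bagged_rmse = np.sqrt(mean_squared_error(y_test, bagged_knn.predict(X_test)))
print(f'single KNN: {knn_rmse:.3f}, bagged KNN: {bagged_rmse:.3f}')
```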



## Stacking - Exercise 6

Stack of estimators with a final regressor.

Stacked generalization consists in stacking the output of individual estimator and use a regressor to compute the final prediction. Stacking allows to use the strength of each individual estimator by using their output as input of a final estimator.

*- scikit-learn*

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html#sklearn.ensemble.StackingRegressor

Aggregate all predictors and create a new model. Hypothetically, the new model should achieve the best results. It is up to you which sub-models, and how many of them, you include in the meta-learner.
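A minimal sketch of such a meta-learner follows. The choice of sub-models and of `RidgeCV` as the final regressor is an assumption made for illustration, and the dataset is a stand-in with uniform features. The final estimator is trained on cross-validated predictions of the sub-models.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# stand-in data (assumption: uniform features instead of the logistic sampling above)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (1.2 * X[:, 0]
     + (-2 * X[:, 1] ** 2 + 2 * X[:, 1])
     + (3 * X[:, 2] ** 2 + X[:, 2])
     + np.sin(10 * X[:, 3])
     + rng.uniform(-0.3, 0.3, size=200))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# sub-models whose predictions become the inputs of the final regressor
estimators = [
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42)),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
]

stack = StackingRegressor(estimators=estimators, final_estimator=RidgeCV(), cv=5)
stack.fit(X_train, y_train)

stack_rmse = np.sqrt(mean_squared_error(y_test, stack.predict(X_test)))
print(f'RMSE for the stacked model is: {stack_rmse:.3f}')
```

Comparing `stack_rmse` with the RMSE of each sub-model on the same test set shows whether stacking actually helps here.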