# Tree Loss Prediction with Gaussian Process Regression
This notebook predicts the tree loss of a country in a given year using Gaussian process regression. Gaussian process regression is a non-parametric Bayesian model that makes few assumptions about the data. It also returns the uncertainty of each prediction, which is useful for later analysis and decision making.

- Author: Xinshuang Liu
- Email: xil235@ucsd.edu
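
For reference, the predictive mean and uncertainty mentioned above have a standard closed form. The sketch below rests on assumptions that the notebook does not state: a zero prior mean, a scalar input $x$ (the year), and a kernel made of a dot-product term with offset $\sigma_0$ plus white noise of level $\sigma_n^2$. The symbols $\sigma_0$, $\sigma_n$, $K$ and $k_*$ are notation introduced here for illustration.

$$
K_{ij} = \sigma_0^2 + x_i x_j + \sigma_n^2 \delta_{ij}, \qquad (k_*)_i = \sigma_0^2 + x_* x_i
$$

$$
\mu(x_*) = k_*^\top K^{-1} y, \qquad \sigma^2(x_*) = \sigma_0^2 + x_*^2 + \sigma_n^2 - k_*^\top K^{-1} k_*
$$

Here $y$ is the vector of training targets and $x_*$ is a test year. Whether the noise term $\sigma_n^2$ is counted in the predictive variance depends on the implementation; it is included in this sketch.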

In [52]:
from dataset import TreeCoverLossDataset
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
import math
import warnings
warnings.filterwarnings("ignore")

In [53]:
def gp_prediction(x_train, y_train, x_test):
    kernel = DotProduct() + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
    x_train = np.array(x_train).reshape(-1, 1)
    y_train = np.array(y_train).reshape(-1, 1)
    x_test = np.array(x_test).reshape(-1, 1)
    gpr.fit(x_train, y_train)
    y_pred, y_std = gpr.predict(x_test, return_std=True)
    return y_pred, y_std

def eval_prediction(y_pred, y_test):
    # Flatten both so an (n, 1) prediction cannot broadcast against (n,) targets.
    mse = np.mean((np.ravel(y_pred) - np.ravel(y_test)) ** 2)
    return mse

def eval_gp_model(dataset, mode):
    target_pos = 1 if mode == "tree_loss" else 2
    mse_list = []
    for train_data, test_data, country in dataset:
        if len(train_data) == 0 or len(test_data) == 0:
            continue
        x_train = train_data[:, 0]
        y_train = train_data[:, target_pos]
        x_test = test_data[:, 0]
        y_test = test_data[:, target_pos]
        y_pred, y_std = gp_prediction(x_train, y_train, x_test)
        mse = eval_prediction(y_pred, y_test)
        mse_list.append(mse)
    # Root of the mean per-country MSE (raises ZeroDivisionError if every country was skipped).
    return math.sqrt(sum(mse_list) / len(mse_list))
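
The aggregate score returned by `eval_gp_model` is the square root of the mean of the per-country MSE values. A minimal sketch of that aggregation follows; the helper name `aggregate_rmse` and the numbers are hypothetical and are not taken from the dataset.

```python
import math

def aggregate_rmse(per_country_mse):
    """Return the square root of the mean of per-country MSE values."""
    if not per_country_mse:
        raise ValueError("no per-country MSE values to aggregate")
    return math.sqrt(sum(per_country_mse) / len(per_country_mse))

# Hypothetical per-country MSE values, for illustration only.
example_mse = [4.0, 16.0, 100.0]
example_rmse = aggregate_rmse(example_mse)
print(example_rmse)
```

Because the MSE values are averaged before the square root is taken, a country with a very large error (the third value here) dominates the aggregate. This is worth keeping in mind when countries differ greatly in forest area.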


In [54]:
dataset = TreeCoverLossDataset("TreeCoverLoss_2001-2020_ByRegion.csv", split_train_test=True)
rmse = eval_gp_model(dataset, "tree_loss")
print(f'RMSE: {rmse}, TreeCoverLoss_2001-2020_ByRegion, tree loss prediction')
rmse = eval_gp_model(dataset, "co2")
print(f'RMSE: {rmse}, TreeCoverLoss_2001-2020_ByRegion, co2 prediction')

RMSE: 208067.93602323104, TreeCoverLoss_2001-2020_ByRegion, tree loss prediction
RMSE: 83548704.11298247, TreeCoverLoss_2001-2020_ByRegion, co2 prediction


In [55]:
dataset = TreeCoverLossDataset("TreeCoverLoss_2001-2020 _InPrimaryForest.csv", split_train_test=True)
rmse = eval_gp_model(dataset, "tree_loss")
print(f'RMSE: {rmse}, TreeCoverLoss_2001-2020 _InPrimaryForest, tree loss prediction')
rmse = eval_gp_model(dataset, "co2")
print(f'RMSE: {rmse}, TreeCoverLoss_2001-2020 _InPrimaryForest, co2 prediction')

RMSE: 78798.8782742828, TreeCoverLoss_2001-2020 _InPrimaryForest, tree loss prediction
RMSE: 52458708.48116069, TreeCoverLoss_2001-2020 _InPrimaryForest, co2 prediction


## Conclusion
- In this notebook, the Gaussian process regression model predicted the tree loss and CO2 emissions of each country for years after 2010, using the data up to and including 2010. The RMSE for both tree loss and CO2 emissions is small relative to the spread of the data (RMSE has the same units as the standard deviation, so that is the natural reference), which suggests that the Gaussian process regression model is effective for this task.
- Furthermore, the standard deviation returned by the model gives an estimate of the uncertainty of each prediction: a relatively small standard deviation indicates high confidence in the prediction.
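
Both points in the conclusion can be checked numerically. The sketch below is one possible way to do it, not the notebook's own evaluation: it compares the RMSE with the standard deviation of the targets (the RMSE of always predicting their mean), and measures how often the targets fall inside the interval of mean plus or minus 1.96 standard deviations. The helper names and all numbers are hypothetical.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def interval_coverage(y_true, y_pred, y_std, z=1.96):
    """Fraction of targets inside the interval y_pred +/- z * y_std."""
    inside = [abs(t - p) <= z * s for t, p, s in zip(y_true, y_pred, y_std)]
    return sum(inside) / len(inside)

# Hypothetical values for illustration only; not taken from the dataset.
y_true = [10.0, 12.0, 15.0, 11.0]
y_pred = [11.0, 12.5, 13.0, 11.0]
y_std = [1.0, 1.0, 1.0, 1.0]

model_rmse = rmse(y_true, y_pred)
# Population standard deviation equals the RMSE of predicting the mean.
baseline_std = statistics.pstdev(y_true)
coverage = interval_coverage(y_true, y_pred, y_std)

print(model_rmse, baseline_std, coverage)
```

An RMSE well below the baseline standard deviation supports the first point. If the predictive distribution is roughly Gaussian and well calibrated, coverage should be close to 0.95; a much lower value would indicate that the predicted standard deviations are too small.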