# **Predicting Land Assessment Values in Alberta, Canada**

**Dr. Eric Asare**

**March 27, 2020**

# Problem Statement

A) There is a high cost (both monetary and time) of accessible land assessment value/acre of farmland in Alberta.

B) Land assessment value is a principal argument in the function used to estimate the economic value of wetlands.


# Objective

A) To estimate a model that could provide accurate predictions of land assessment values in sub-basin 14 in Alberta.  


# Methodology

A) A myriad of supervised models with increasing levels of complexity will be estimated.

B) Generic model: Land assessment value  = f(Land feritlity Classes) 

C) Metric for choosing best model: Coeficient of Determination (R-squared) 

# Setup

First, let's import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [40]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)

#suppress warnings
import warnings 
warnings.simplefilter('ignore')

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ensembles"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
    
#for eg after the plotting use save_fig("law_of_large_numbers_plot") here the fig_id is "law of ..."

# Datasets

In [6]:
from sklearn.model_selection import train_test_split
df = pd.read_csv("datasets/finalfinal_june29_5_49pm.csv", header=None) #already prepared dataset using R
dataset = df.values
X = dataset[1:,1:16].astype(float)
Y = dataset[1:,0].astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [57]:
#%matplotlib inline
#import matplotlib.pyplot as plt
#df.hist(bins=50, figsize=(20,15))
#save_fig("attribute_histogram_plots")
#plt.show()

# Linear regression

In [455]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lin_reg = LinearRegression()
model_lin = lin_reg.fit(X_train,y_train)
y_pred_lin = lin_reg.predict(X_test)

In [456]:
from sklearn import metrics
print('R-square:', metrics.r2_score(y_test, y_pred_lin))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_lin))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_lin))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lin)))

R-square: 0.32018565992570924
MAE: 6154.386096342833
MSE: 65328948.4846742
RMSE: 8082.632521937033


# Regularised LS : Ridge (l2) and Lasso(l1)

In [8]:
from sklearn.linear_model import ElasticNet #li_ratio =0 ridge; 1 lasso
elastic_net = ElasticNet(alpha =0.5 , l1_ratio =1.)
model_elas = elastic_net.fit(X_train,y_train)
y_pred_elas = elastic_net.predict(X_test)
from sklearn import metrics
print('R-square:', metrics.r2_score(y_test, y_pred_elas))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_elas))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_elas))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_elas)))

R-square: 0.3201135895962952
MAE: 6155.197508084036
MSE: 65335874.31509586
RMSE: 8083.060949609118


# Voting : Based on Linear regression models and Random Forest

In [44]:
from sklearn.ensemble import VotingRegressor
from sklearn import metrics
from sklearn.model_selection import cross_val_score

ElasticNet_lasso = ElasticNet(alpha = 0.1, l1_ratio =1) #lasso
ElasticNet_ridge = ElasticNet(alpha = 0.1, l1_ratio =0) #ridge
RndForest_reg = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth =50, max_features=9, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=3,
                      min_samples_split=8, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

clf_voting = [ElasticNet_lasso,ElasticNet_ridge]

eclf = VotingRegressor(estimators=[('Random Forests', RndForest_reg), 
                                   ('Lasso', ElasticNet_lasso), ('Ridge Regression', ElasticNet_ridge)])
for clf, label in zip([RndForest_reg,ElasticNet_lasso,ElasticNet_ridge, eclf], 
                      ['Random Forest', 'Lasso', 'Ridge Regression', 'Ensemble']):
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='r2')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.58 (+/- 0.04) [Random Forest]
Accuracy: 0.33 (+/- 0.04) [Lasso]
Accuracy: 0.28 (+/- 0.03) [Ridge Regression]
Accuracy: 0.46 (+/- 0.03) [Ensemble]


# Ensemble Methods

# A) Boosting

In [55]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost
from sklearn.metrics import mean_squared_error

ada_boost = AdaBoostRegressor()
grad_boost = AdaBoostRegressor()
xgb_boost = xgboost.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 1, learning_rate = 1,
                max_depth = 12, alpha = 1, n_estimators = 400)

ereg = VotingRegressor(estimators=[ada_boost, grad_boost, xgb_boost])

#ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])

labels = ["Adaboost", "gradient", "xgb", "votingre"]

boost_array = [ada_boost, grad_boost, xgb_boost]

for clf, label in zip([ada_boost, grad_boost, xgb_boost, eclf], labels):
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='r2')
    print("Mean: {0:.3f}, std: (+/-) {1:.3f} [{2}]".format(scores.mean(), scores.std(), label))

Mean: 0.227, std: (+/-) 0.079 [Adaboost]
Mean: 0.245, std: (+/-) 0.080 [gradient]
Mean: 0.384, std: (+/-) 0.035 [xgb]
Mean: 0.462, std: (+/-) 0.035 [votingre]


# B) Bagging

In [428]:
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.svm import LinearSVR
from sklearn.linear_model import ElasticNet #li_ratio =0 ridge; 1 lasso
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

seed = 1075
np.random.seed(seed)

SVR_reg = SVR()
SVR_lin = LinearSVR(epsilon=0.7, C=100)
ElasticNet_lasso = ElasticNet(alpha = 0.1, l1_ratio =1) #lasso
ElasticNet_ridge = ElasticNet(alpha = 0.1, l1_ratio =0) #ridge
RndForest_reg = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth =50, max_features=9, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=3,
                      min_samples_split=8, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

clf_bagging = [RndForest_reg, SVR_reg, SVR_lin, ElasticNet_lasso,ElasticNet_ridge]

for clf in clf_bagging:
    non_bagging_score = cross_val_score(clf, X_train, y_train, cv=10, n_jobs=-1)
    bagging_reg = BaggingRegressor(clf,max_samples=0.4, max_features=10, random_state=seed)
    bagging_scores = cross_val_score(bagging_reg, X_train, y_train, cv=10,n_jobs=-1,  scoring='r2')
    print ("Mean of: {1:.3f}, std: (+/-) {2:.3f} [{0}]".format(clf.__class__.__name__,non_bagging_score.mean(),
                                                              non_bagging_score.std()))
    print ("Mean of: {1:.3f}, std: (+/-) {2:.3f} [Bagging {0}]\n".format(clf.__class__.__name__,
                                                                         bagging_scores.mean(), bagging_scores.std()))
 

Mean of: 0.577, std: (+/-) 0.038 [RandomForestRegressor]
Mean of: 0.516, std: (+/-) 0.034 [Bagging RandomForestRegressor]

Mean of: -0.001, std: (+/-) 0.001 [SVR]
Mean of: -0.000, std: (+/-) 0.002 [Bagging SVR]

Mean of: -0.853, std: (+/-) 1.516 [LinearSVR]
Mean of: -0.161, std: (+/-) 0.219 [Bagging LinearSVR]

Mean of: 0.327, std: (+/-) 0.038 [ElasticNet]
Mean of: 0.310, std: (+/-) 0.032 [Bagging ElasticNet]

Mean of: 0.285, std: (+/-) 0.027 [ElasticNet]
Mean of: 0.250, std: (+/-) 0.024 [Bagging ElasticNet]



In [None]:
# Create the parameter grid based on the results of random search 
#param_grid = {
   # 'bootstrap': [True],
   # 'max_depth': [80, 90, 100, 110, 200, 300, 400],
  #  'max_features': [2, 3],
  #  'min_samples_leaf': [3, 4, 5],
  #  'min_samples_split': [8, 10, 12],
   # 'n_estimators': [100, 200, 300, 1000]
#}
# Create a based model
#rf = RandomForestRegressor()
# Instantiate the grid search model
#grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
#                          cv = 3, n_jobs = -1, verbose = 2)
#grid_search.fit(X, Y)
#best_gri = grid_search.best_estimator_
#best_gri
#params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
             #   'max_depth': 5, 'alpha': 10}

#xgb_reg.cv(X_train, y_train, params=params, nfold=3,
                   # num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)

# Preliminary Results

A) Random forest has the predictive power based on r-squared.


# Future Works
A) Gather more variables.

B) Random forest will be explored further using grid searching of parameter space to improve on the results.

B) Python GUI Application with Tkinter.
