# Habitat Train and Test

**Written by Timm Nawrocki, Amanda Droghini**

*Last updated Wednesday, June 9, 2021.*

In [1]:
# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# Habitat Train and Test
# Author: Timm Nawrocki, Amanda Droghini, Alaska Center for Conservation Science
# Created on: 2021-06-09
# Usage: Must be executed as a Jupyter Notebook in an Anaconda 3 installation.
# Description: "Habitat Train and Test" trains a random forest model (i.e., path selection function) to predict habitat
# from path covariate means. A threshold for avoidance/selection conversion is selected empirically. All model performance
# metrics are calculated on independent test partitions.
# ---------------------------------------------------------------------------

## 1. Introduction

This script runs the model train and test steps to output a model performance and variable importance report, a trained classifier file, and a threshold file that can be transferred to the prediction script. The classifier is configured with `n_jobs = 2`, so the script must be run on a machine that can support at least 2 cores. For information on generating inputs for this script, see the [project readme](https://github.com/accs-uaa/southwest-alaska-moose).

## 2. Import Data and Variables

This script relies on data that have been pre-processed into a standard format CSV file. The tools to generate the CSV file are included in the [southwest-alaska-moose](https://github.com/accs-uaa/southwest-alaska-moose) repository. The CSV file must include all summarized observed and random paths for the target species. If the construction of covariates changes, the covariates defined below must be modified to match the input CSV file.
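
Because the covariate names must match the input CSV file, a quick column check before modeling can catch mismatches early. The sketch below uses a hypothetical helper (`find_missing_columns` is not part of the repository) and made-up rows. It assumes the CSV file stores each covariate with a `_mean` suffix, as the renaming step later in this notebook implies.

```python
import io

import pandas as pd


def find_missing_columns(data, required_columns):
    """Return the required column names that are absent from the data frame."""
    return [column for column in required_columns if column not in data.columns]


# Tiny in-memory stand-in for the input CSV file (values are made up for illustration)
example_csv = io.StringIO(
    'mooseYear_id,fullPath_id,calfStatus,response,elevation_mean\n'
    'M1_2018,1,1,1,120.5\n'
    'M1_2018,2,1,0,98.2\n'
)
example_data = pd.read_csv(example_csv)

# Columns the model run would need; roughness_mean is deliberately missing from the example
required = ['mooseYear_id', 'fullPath_id', 'calfStatus', 'response',
            'elevation_mean', 'roughness_mean']
missing = find_missing_columns(example_data, required)
print(missing)
```

An empty list would indicate that every required column is present.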

This script has general dependencies on the *os* package for file system manipulations, the *numpy* and *pandas* packages for data manipulations, and the *seaborn* and *matplotlib* packages for plotting. *Scikit Learn* provides models, model selection, cross validation, and performance tools. Joblib provides a function to save models.

In [2]:
# Import os
import os
# Import packages for file manipulation, data manipulation, and plotting
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plot
# Import modules for model selection, cross validation, random forest, and performance from Scikit Learn
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
# Import joblib
import joblib
# Import timing packages
import time
import datetime

In [3]:
# Set root directory
root_folder = 'N:/ACCS_Work/Projects/WildlifeEcology/Moose_SouthwestAlaska/Data'
# Define data folders
data_input = os.path.join(root_folder,
                          'Data_Input/paths')
data_output = os.path.join(root_folder,
                           'Data_Output/model_results/round_20210604')

In [4]:
# Define variable sets
predictor_all = ['elevation', 'roughness', 'forest_edge', 'tundra_edge', 'alnus', 'betshr',
                 'dectre', 'dryas', 'empnig', 'erivag', 'picgla', 'picmar', 'rhoshr',
                 'salshr', 'sphagn', 'vaculi', 'vacvit', 'wetsed']
response = ['response']
retain_variables = ['mooseYear_id', 'fullPath_id', 'calfStatus']
all_variables = retain_variables + predictor_all + response
iteration = ['iteration']
absence = ['absence']
presence = ['presence']
prediction = ['prediction']
selection = ['selection']
output_variables = all_variables + absence + presence + prediction + selection + iteration

In [5]:
# Create an extremely randomized trees (ExtraTrees) classifier
classifier = ExtraTreesClassifier(n_estimators = 1000,
                                   criterion = 'gini',
                                   max_depth = None,
                                   min_samples_split = 2,
                                   min_samples_leaf = 1,
                                   min_weight_fraction_leaf = 0,
                                   max_features = None,
                                   bootstrap = False,
                                   oob_score = False,
                                   n_jobs = 2,
                                   random_state = 5)

In [6]:
# Define input file
input_file = os.path.join(data_input, 'allPaths_meanCovariates_forModelRun.csv')
# Define output folder
output_folder = os.path.join(data_output, 'Calf_RF')
# Define output report
output_report_name = 'report-moose-calf.html'
# Define species, genera, or aggregate name
taxon_name = 'Moose with Calves'
# Define calf status
calf_status = 1

In [7]:
# Create a plots folder if it does not exist
plots_folder = os.path.join(output_folder, "plots")
if not os.path.exists(plots_folder):
    os.makedirs(plots_folder)

In [8]:
# Define output test data
output_csv = os.path.join(output_folder, 'prediction.csv')
# Define output model file
output_classifier = os.path.join(output_folder, 'classifier.joblib')
# Define output threshold file
threshold_file = os.path.join(output_folder, 'threshold.txt')
# Define output correlation plot
variable_correlation = os.path.join(plots_folder, "variable_correlation.png")
# Define output variable importance plot
importance_classifier = os.path.join(plots_folder, "importance_classifier.png")

## 4. Functions

Analyses are conducted in units represented by functions.

### 4.1. Threshold Optimization Functions

In [9]:
# Define a function to calculate performance metrics based on a specified threshold value
def testPresenceThreshold(predict_probability, threshold, y_test):
    # Create an empty array of zeroes that matches the length of the probability predictions
    predict_thresholded = np.zeros(predict_probability.shape)
    # Set values for all probabilities greater than or equal to the threshold equal to 1
    predict_thresholded[predict_probability >= threshold] = 1
    # Determine error rates
    confusion_test = confusion_matrix(y_test, predict_thresholded)
    true_negative = confusion_test[0,0]
    false_negative = confusion_test[1,0]
    true_positive = confusion_test[1,1]
    false_positive = confusion_test[0,1]
    # Calculate sensitivity and specificity
    sensitivity = true_positive / (true_positive + false_negative)
    specificity = true_negative / (true_negative + false_positive)
    # Calculate AUC score
    auc = roc_auc_score(y_test, predict_probability)
    # Calculate overall accuracy
    accuracy = (true_negative + true_positive) / (true_negative + false_positive + false_negative + true_positive)
    # Return the performance metrics
    return (sensitivity, specificity, auc, accuracy)

In [10]:
# Create a function to determine a presence threshold
def determineOptimalThreshold(predict_probability, y_test):
    # Iterate through thresholds from 0.001 to 1.000 to output a list of sensitivity and specificity values per threshold
    i = 1
    sensitivity_list = []
    specificity_list = []
    while i < 1001:
        threshold = i/1000
        sensitivity, specificity, auc, accuracy = testPresenceThreshold(predict_probability, threshold, y_test)
        sensitivity_list.append(sensitivity)
        specificity_list.append(specificity)
        i = i + 1
    # Calculate a list of absolute value difference between sensitivity and specificity and find the optimal threshold
    difference_list = [np.absolute(a - b) for a, b in zip(sensitivity_list, specificity_list)]
    value, index = min((value, index) for (index, value) in enumerate(difference_list))
    # List index 0 corresponds to a threshold of 0.001, so offset the index by one
    threshold = (index + 1)/1000
    # Calculate the performance of the optimal threshold
    sensitivity, specificity, auc, accuracy = testPresenceThreshold(predict_probability, threshold, y_test)
    # Return the optimal threshold and the performance metrics of the optimal threshold
    return threshold, sensitivity, specificity, auc, accuracy
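
The threshold search above picks the candidate value at which sensitivity and specificity are closest to equal. The following is a minimal NumPy-only sketch of that idea, not the notebook's implementation: the function names and the eight example probabilities are made up for illustration, and the candidates are assumed to run from 0.001 to 1.000 in steps of 0.001.

```python
import numpy as np


def sensitivity_specificity(probability, observed, threshold):
    """Compute sensitivity and specificity for a single threshold value."""
    predicted = probability >= threshold
    true_positive = np.sum(predicted & (observed == 1))
    false_negative = np.sum(~predicted & (observed == 1))
    true_negative = np.sum(~predicted & (observed == 0))
    false_positive = np.sum(predicted & (observed == 0))
    sensitivity = true_positive / (true_positive + false_negative)
    specificity = true_negative / (true_negative + false_positive)
    return sensitivity, specificity


def balanced_threshold(probability, observed):
    """Return the first candidate threshold that minimizes |sensitivity - specificity|."""
    candidates = np.arange(1, 1001) / 1000
    differences = []
    for candidate in candidates:
        sensitivity, specificity = sensitivity_specificity(probability, observed, candidate)
        differences.append(abs(sensitivity - specificity))
    return candidates[int(np.argmin(differences))]


# Made-up probabilities and observed classes for illustration only
probability = np.array([0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])
observed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

threshold = balanced_threshold(probability, observed)
sensitivity, specificity = sensitivity_specificity(probability, observed, threshold)
print(threshold, sensitivity, specificity)
```

When several candidates tie, `np.argmin` returns the lowest one, which mirrors how taking the minimum over (difference, index) pairs breaks ties.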

### 4.2. Export Results Functions

In [11]:
# Create a function to composite model results
def compositeSelection(input_data, presence, threshold, output):
    # Determine positive and negative ranges
    positive_range = input_data[presence[0]].max() - threshold
    negative_range = threshold - input_data[presence[0]].min()
    # Define a function to threshold presences and absences and standardize values from -1 (avoidance) to 1 (selection)
    def compositeRows(row):
        if row[presence[0]] == threshold:
            return 0
        elif row[presence[0]] > threshold:
            adjusted_value = (row[presence[0]] - threshold) / positive_range
            return adjusted_value
        elif row[presence[0]] < threshold:
            adjusted_value = (row[presence[0]] - threshold) / negative_range
            return adjusted_value
    # Apply function to all rows in data
    input_data[output[0]] = input_data.apply(lambda row: compositeRows(row), axis=1)
    # Return the data frame with composited results
    return input_data
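
The rescaling in `compositeSelection` maps the threshold to 0, the lowest predicted probability to -1, and the highest to 1. A vectorized sketch of the same arithmetic is shown below. `scale_selection` is a hypothetical name, the input values are made up, and the sketch assumes the threshold lies strictly between the minimum and maximum probability so that neither range is zero.

```python
import numpy as np


def scale_selection(presence_probability, threshold):
    """Rescale probabilities to -1 (avoidance) through 0 (threshold) to 1 (selection)."""
    probability = np.asarray(presence_probability, dtype=float)
    positive_range = probability.max() - threshold
    negative_range = threshold - probability.min()
    # Values at or above the threshold are divided by the positive range,
    # values below the threshold are divided by the negative range
    return np.where(probability >= threshold,
                    (probability - threshold) / positive_range,
                    (probability - threshold) / negative_range)


# Made-up probabilities and threshold for illustration only
example_probability = [0.1, 0.4, 0.6, 0.9]
example_selection = scale_selection(example_probability, threshold=0.4)
print(example_selection)
```

Dividing each side by its own range means the two halves of the scale can have different slopes when the threshold is not centered between the minimum and maximum.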

In [12]:
# Define a function to plot Pearson correlation of predictor variables
def plotVariableCorrelation(X_train, outFile):
    # Calculate Pearson correlation coefficient between the predictor variables, where -1 is perfect negative correlation and 1 is perfect positive correlation
    correlation = X_train.astype('float64').corr()
    # Generate a mask for the upper triangle of plot
    mask = np.zeros_like(correlation, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plot.subplots(figsize=(10, 9))
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    # Draw the heatmap with the mask and correct aspect ratio
    correlation_plot = sns.heatmap(correlation, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={'shrink': .5})
    correlation_figure = correlation_plot.get_figure()
    correlation_figure.savefig(outFile, bbox_inches='tight', dpi=300)
    # Clear plot workspace
    plot.clf()
    plot.close()

In [13]:
# Define a function to plot variable importances
def plotVariableImportances(inModel, X_train, outVariableFile):
    # Get numerical feature importances
    importances = list(inModel.feature_importances_)
    # List of tuples with variable and importance
    feature_list = list(X_train.columns)
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
    # Sort the feature importances by most important first
    feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    # Initialize the plot and set figure size
    variable_figure = plot.figure()
    fig_size = plot.rcParams["figure.figsize"]
    fig_size[0] = 20
    fig_size[1] = 6
    plot.rcParams["figure.figsize"] = fig_size
    # Create list of x locations for plotting
    x_values = list(range(len(importances)))
    # Make a bar chart of the variable importances
    plot.bar(x_values, importances, orientation = 'vertical')
    # Tick labels for x axis
    plot.xticks(x_values, feature_list, rotation='vertical')
    # Axis labels and title
    plot.ylabel('Importance'); plot.xlabel('Variable'); plot.title('Variable Importances');
    # Export
    variable_figure.savefig(outVariableFile, bbox_inches="tight", dpi=300)
    # Clear plot workspace
    plot.clf()
    plot.close()

## 5. Conduct Analyses

The analyses are subdivided into subsections: load data, train and test iterations with nested cross validation, final model training, and export results. Nested cross-validation is necessary to maintain the independence of the test data outside of cross-validated threshold optimization. The prediction results of the outer cross-validation, wherein each sample is predicted a single time, are appended into a single data frame for additional analyses and plotting external to this script.

Outputs of the analysis are:
1. Report of performance, including the following plots:
   1. Variable importances of the final classifier
   2. Pearson correlation for all predictors
2. Model files:
   1. Final classifier
3. Report with overall AUC and accuracy
4. Final threshold
5. Test predictions from 5-fold cross-validation (each sample predicted once)
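
The nested cross-validation described above can be sketched compactly on synthetic data. This is an illustration under stated assumptions rather than the notebook's code: the data come from `make_classification`, the classifier uses far fewer trees, and `cross_val_predict` stands in for the explicit inner loop used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data; the real script reads path covariates from a CSV file
X, y = make_classification(n_samples=200, n_features=6, random_state=5)

outer_splits = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
inner_splits = StratifiedKFold(n_splits=5, shuffle=False)
model = ExtraTreesClassifier(n_estimators=50, random_state=5)
candidates = np.arange(1, 1001) / 1000

predicted = np.full(len(y), -1)
for train_index, test_index in outer_splits.split(X, y):
    X_train, y_train = X[train_index], y[train_index]
    # Inner loop: out-of-fold probabilities for the outer training data only
    inner_probability = cross_val_predict(model, X_train, y_train, cv=inner_splits,
                                          method='predict_proba')[:, 1]
    # Choose the threshold where sensitivity and specificity are closest to equal
    differences = []
    for candidate in candidates:
        inner_predicted = inner_probability >= candidate
        sensitivity = np.mean(inner_predicted[y_train == 1])
        specificity = np.mean(~inner_predicted[y_train == 0])
        differences.append(abs(sensitivity - specificity))
    threshold = candidates[int(np.argmin(differences))]
    # Outer loop: refit on all outer training data and predict the held-out fold
    model.fit(X_train, y_train)
    outer_probability = model.predict_proba(X[test_index])[:, 1]
    predicted[test_index] = (outer_probability >= threshold).astype(int)

# Every sample should have been predicted exactly once across the outer folds
print(np.all(predicted >= 0))
```

The held-out fold never influences the threshold, because the threshold is chosen only from predictions made within the outer training partition.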

### 5.1. Load Data

In [14]:
# Create data frame of input data
data_all = pd.read_csv(input_file)
input_data = data_all[data_all['calfStatus'] == calf_status].copy()
# Rename covariates to match prediction grids
input_data = input_data.rename(columns={'elevation_mean': 'elevation',
                                        'roughness_mean': 'roughness',
                                        'forest_edge_mean': 'forest_edge',
                                        'tundra_edge_mean': 'tundra_edge',
                                        'alnus_mean': 'alnus',
                                        'betshr_mean': 'betshr',
                                        'dectre_mean': 'dectre',
                                        'dryas_mean': 'dryas',
                                        'empnig_mean': 'empnig',
                                        'erivag_mean': 'erivag',
                                        'picgla_mean': 'picgla',
                                        'picmar_mean': 'picmar',
                                        'rhoshr_mean': 'rhoshr',
                                        'salshr_mean': 'salshr',
                                        'sphagn_mean': 'sphagn',
                                        'vaculi_mean': 'vaculi',
                                        'vacvit_mean': 'vacvit',
                                        'wetsed_mean': 'wetsed'})
# Shuffle data
input_data = shuffle(input_data, random_state = 5)
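
The explicit rename mapping above follows a single pattern: each input column is the predictor name plus a `_mean` suffix. Assuming that pattern holds for every predictor, the mapping could be generated from the predictor list, as in this sketch. It uses a shortened, illustrative list and a one-row example data frame.

```python
import pandas as pd

# Shortened predictor list for illustration; the script defines the full list as predictor_all
predictors = ['elevation', 'roughness', 'alnus']

# Build the rename mapping, assuming every input column is the predictor name plus '_mean'
rename_map = {f'{name}_mean': name for name in predictors}

# Made-up single-row data frame standing in for the input data
example = pd.DataFrame({'elevation_mean': [120.5],
                        'roughness_mean': [3.2],
                        'alnus_mean': [14.0]})
example = example.rename(columns=rename_map)
print(list(example.columns))
```

Generating the mapping keeps the column names and the predictor list from drifting apart when covariates are added or removed.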

In [15]:
# Split the X and y data for classification
X_classify = input_data[predictor_all].astype(float)
y_classify = input_data[response[0]].astype('int32')

In [16]:
# Set initial plot size
fig_size = plot.rcParams["figure.figsize"]
fig_size[0] = 8
fig_size[1] = 6
plot.rcParams["figure.figsize"] = fig_size
plot.style.use('grayscale')

### 5.2. Train and Test Iterations

In [17]:
# Define 5-fold cross validation split methods
outer_cv_splits = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 5)
inner_cv_splits = StratifiedKFold(n_splits = 5, shuffle = False, random_state = None)

In [18]:
# Create empty lists to store threshold and performance metrics
threshold_list = []
# Create an empty data frame to store the outer cross validation splits
outer_train = pd.DataFrame(columns = all_variables + iteration)
outer_test = pd.DataFrame(columns = all_variables + iteration)

In [19]:
# Create an empty data frame to store the outer test results
outer_results = pd.DataFrame(columns = output_variables)

In [20]:
# Create outer cross validation splits for the input data
count = 1
for train_index, test_index in outer_cv_splits.split(input_data, input_data[response[0]]):
    # Split the data into train and test partitions
    train = input_data.iloc[train_index]
    test = input_data.iloc[test_index]
    # Insert iteration to train
    train = train.assign(iteration = count)
    # Insert iteration to test
    test = test.assign(iteration = count)
    # Append to data frames
    outer_train = pd.concat([outer_train, train], ignore_index = True, sort = True)
    outer_test = pd.concat([outer_test, test], ignore_index = True, sort = True)
    # Increase counter
    count += 1

In [21]:
# Reset indices
outer_train = outer_train.reset_index()
outer_test = outer_test.reset_index()

In [22]:
#### MODEL TRAIN AND TEST ITERATIONS WITH HYPERPARAMETER AND THRESHOLD OPTIMIZATION IN NESTED CROSS-VALIDATION
####____________________________________________________

# Iterate through outer cross validation splits
i = 1
while i < 6:
    
    
    #### CONDUCT MODEL TRAIN
    ####____________________________________________________
    
    # Partition the outer train split by iteration number
    print(f'Conducting outer cross-validation iteration {i} of 5...')
    train_iteration = outer_train[outer_train[iteration[0]] == i]
    
    # Identify X and y train splits for the classifier
    X_train_classify = train_iteration[predictor_all].astype(float)
    y_train_classify = train_iteration[response[0]].astype('int32')
    
    # Predict each training data row in inner cross validation
    print('\tOptimizing classification threshold...')
    iteration_start = time.time()
          
    # Create an empty data frame to store the inner cross validation splits
    inner_train = pd.DataFrame(columns = all_variables + iteration + ['inner'])
    inner_test = pd.DataFrame(columns = all_variables + iteration + ['inner'])
          
    # Create an empty data frame to store the inner test results
    inner_results = pd.DataFrame(columns = all_variables + absence + presence + prediction + iteration + ['inner'])
          
    # Create inner cross validation splits
    count = 1
    for train_index, test_index in inner_cv_splits.split(train_iteration, train_iteration[response[0]].astype('int32')):
        # Split the data into train and test partitions
        train = train_iteration.iloc[train_index]
        test = train_iteration.iloc[test_index]
        # Insert iteration to train
        train = train.assign(inner = count)
        # Insert iteration to test
        test = test.assign(inner = count)
        # Append to data frames
        inner_train = pd.concat([inner_train, train], ignore_index=True, sort=True)
        inner_test = pd.concat([inner_test, test], ignore_index=True, sort=True)
        # Increase counter
        count += 1
          
    # Iterate through inner cross validation splits
    n = 1
    while n < 6:
        inner_train_iteration = inner_train[inner_train['inner'] == n]
        inner_test_iteration = inner_test[inner_test['inner'] == n]
    
        # Identify X and y inner train and test splits
        X_train_inner = inner_train_iteration[predictor_all].astype(float)
        y_train_inner = inner_train_iteration[response[0]].astype('int32')
        X_test_inner = inner_test_iteration[predictor_all].astype(float)
        y_test_inner = inner_test_iteration[response[0]].astype('int32')
        
        # Train classifier on the inner train data
        classifier.fit(X_train_inner, y_train_inner)
        
        # Predict probabilities for inner test data
        probability_inner = classifier.predict_proba(X_test_inner)
        # Concatenate predicted values to test data frame
        inner_test_iteration = inner_test_iteration.assign(absence = probability_inner[:,0])
        inner_test_iteration = inner_test_iteration.assign(presence = probability_inner[:,1])
          
        # Add iteration number to inner test iteration
        inner_test_iteration = inner_test_iteration.assign(inner = n)
    
        # Add the test results to output data frame
        inner_results = pd.concat([inner_results, inner_test_iteration], ignore_index=True, sort=True)
        
        # Increase n value
        n += 1
    
    # Calculate the optimal threshold and performance of the presence-absence classification
    inner_results[response[0]] = inner_results[response[0]].astype('int32')
    threshold, sensitivity, specificity, auc, accuracy = determineOptimalThreshold(inner_results[presence[0]],
                                                                                   inner_results[response[0]])
    threshold_list.append(threshold)
    iteration_end = time.time()
    iteration_elapsed = int(iteration_end - iteration_start)
    iteration_success_time = datetime.datetime.now()
    print(f'\tCompleted at {iteration_success_time.strftime("%Y-%m-%d %H:%M")} (Elapsed time: {datetime.timedelta(seconds=iteration_elapsed)})')
    print('\t----------')
    
    # Train classifier
    print('\tTraining classifier...')
    iteration_start = time.time()
    classifier.fit(X_train_classify, y_train_classify)
    iteration_end = time.time()
    iteration_elapsed = int(iteration_end - iteration_start)
    iteration_success_time = datetime.datetime.now()
    print(f'\tCompleted at {iteration_success_time.strftime("%Y-%m-%d %H:%M")} (Elapsed time: {datetime.timedelta(seconds=iteration_elapsed)})')
    print('\t----------')
          
    #### CONDUCT MODEL TEST
    ####____________________________________________________
    
    # Partition the outer test split by iteration number
    print('\tPredicting outer cross-validation test data...')
    iteration_start = time.time()
    test_iteration = outer_test[outer_test[iteration[0]] == i]
    
    # Identify X test split
    X_test = test_iteration[predictor_all].astype(float)
    
    # Use the classifier to predict class probabilities
    probability_prediction = classifier.predict_proba(X_test)
    # Concatenate predicted values to test data frame
    test_iteration = test_iteration.assign(absence = probability_prediction[:,0])
    test_iteration = test_iteration.assign(presence = probability_prediction[:,1])
    
    # Convert probability to presence-absence
    presence_zeros = np.zeros(test_iteration[presence[0]].shape)
    presence_zeros[test_iteration[presence[0]] >= threshold] = 1
    # Concatenate distribution values to test data frame
    test_iteration = test_iteration.assign(prediction = presence_zeros)
    
    # Convert probability to selection
    test_iteration = compositeSelection(test_iteration, presence, threshold, selection)
    
    # Add iteration number to test iteration
    test_iteration = test_iteration.assign(iteration = i)
    
    # Add the test results to output data frame
    outer_results = pd.concat([outer_results, test_iteration], ignore_index=True, sort=True)
    iteration_end = time.time()
    iteration_elapsed = int(iteration_end - iteration_start)
    iteration_success_time = datetime.datetime.now()
    print(f'\tCompleted at {iteration_success_time.strftime("%Y-%m-%d %H:%M")} (Elapsed time: {datetime.timedelta(seconds=iteration_elapsed)})')
    print('\t----------')
    
    # Increase iteration number
    i += 1

Conducting outer cross-validation iteration 1 of 5...
	Optimizing classification threshold...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:06)
	----------
	Training classifier...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:00)
	----------
	Predicting outer cross-validation test data...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:00)
	----------
Conducting outer cross-validation iteration 2 of 5...
	Optimizing classification threshold...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:06)
	----------
	Training classifier...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:00)
	----------
	Predicting outer cross-validation test data...
	Completed at 2021-06-09 10:34 (Elapsed time: 0:00:00)
	----------
Conducting outer cross-validation iteration 3 of 5...
	Optimizing classification threshold...
	Completed at 2021-06-09 10:35 (Elapsed time: 0:00:06)
	----------
	Training classifier...
	Completed at 2021-06-09 10:35 (Elapsed time: 0:00:01)
	----------
	Predicting ou

In [23]:
print(len(outer_results))

539


In [24]:
# Partition output results to presence-absence observed and predicted
y_classify_observed = outer_results[response[0]].astype('int32')
y_classify_predicted = outer_results[prediction[0]].astype('int32')
y_classify_probability = outer_results[presence[0]]

# Determine error rates
confusion_test = confusion_matrix(y_classify_observed, y_classify_predicted)
true_negative = confusion_test[0,0]
false_negative = confusion_test[1,0]
true_positive = confusion_test[1,1]
false_positive = confusion_test[0,1]
# Calculate sensitivity and specificity
sensitivity = true_positive / (true_positive + false_negative)
specificity = true_negative / (true_negative + false_positive)
# Calculate AUC score
fin_auc = roc_auc_score(y_classify_observed, y_classify_probability)
# Calculate overall accuracy
fin_accuracy = (true_negative + true_positive) / (true_negative + false_positive + false_negative + true_positive)

# Print performance results
print(f'Final AUC = {fin_auc}')
print(f'Final Accuracy = {fin_accuracy}')

Final AUC = 0.8386505622657225
Final Accuracy = 0.75139146567718
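
The performance metrics above come from the four cells of a binary confusion matrix plus the AUC of the predicted probabilities. The sketch below reproduces that calculation on eight made-up samples (not results from this analysis). Passing `labels=[0, 1]` is an addition for illustration that guarantees a 2 x 2 matrix even if one class is missing from the predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up observed classes, predicted classes, and probabilities for illustration only
observed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probability = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9])

# For binary labels [0, 1], ravel() returns the counts in the order tn, fp, fn, tp
true_negative, false_positive, false_negative, true_positive = confusion_matrix(
    observed, predicted, labels=[0, 1]).ravel()

# Sensitivity is the true positive rate and specificity is the true negative rate
sensitivity = true_positive / (true_positive + false_negative)
specificity = true_negative / (true_negative + false_positive)
accuracy = (true_negative + true_positive) / len(observed)
# AUC is computed from the probabilities, so it does not depend on the threshold
auc_score = roc_auc_score(observed, probability)
print(sensitivity, specificity, accuracy, auc_score)
```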


### 5.3. Train Final Models

In [25]:
# Set initial plot size
fig_size = plot.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 4
plot.rcParams["figure.figsize"] = fig_size

In [26]:
#### TRAIN AND EXPORT FINAL CLASSIFIER

# Create an empty data frame to store the inner cross validation splits
inner_train = pd.DataFrame(columns = all_variables + iteration + ['inner'])
inner_test = pd.DataFrame(columns = all_variables + iteration + ['inner'])
          
# Create an empty data frame to store the inner test results
inner_results = pd.DataFrame(columns = all_variables + absence + presence + prediction + iteration + ['inner'])
          
# Create inner cross validation splits for input data
count = 1
for train_index, test_index in inner_cv_splits.split(input_data, input_data[response[0]].astype('int32')):
    # Split the data into train and test partitions
    train = input_data.iloc[train_index]
    test = input_data.iloc[test_index]
    # Insert iteration to train
    train = train.assign(inner = count)
    # Insert iteration to test
    test = test.assign(inner = count)
    # Append to data frames
    inner_train = pd.concat([inner_train, train], ignore_index=True, sort=True)
    inner_test = pd.concat([inner_test, test], ignore_index=True, sort=True)
    # Increase counter
    count += 1
          
# Iterate through inner cross validation splits
n = 1
while n < 6:
    inner_train_iteration = inner_train[inner_train['inner'] == n]
    inner_test_iteration = inner_test[inner_test['inner'] == n]
    
    # Identify X and y inner train and test splits
    X_train_inner = inner_train_iteration[predictor_all].astype(float)
    y_train_inner = inner_train_iteration[response[0]].astype('int32')
    X_test_inner = inner_test_iteration[predictor_all].astype(float)
    y_test_inner = inner_test_iteration[response[0]].astype('int32')
        
    # Train classifier on the inner train data
    classifier.fit(X_train_inner, y_train_inner)
        
    # Predict probabilities for inner test data
    probability_inner = classifier.predict_proba(X_test_inner)
    # Concatenate predicted values to test data frame
    inner_test_iteration = inner_test_iteration.assign(absence = probability_inner[:,0])
    inner_test_iteration = inner_test_iteration.assign(presence = probability_inner[:,1])
          
    # Add iteration number to inner test iteration
    inner_test_iteration = inner_test_iteration.assign(inner = n)
    
    # Add the test results to output data frame
    inner_results = pd.concat([inner_results, inner_test_iteration], ignore_index=True, sort=True)
    
    # Increase n value
    n += 1
    
# Calculate the optimal threshold and performance of the presence-absence classification
inner_results[response[0]] = inner_results[response[0]].astype('int32')
threshold_final, sensitivity, specificity, auc, accuracy = determineOptimalThreshold(inner_results[presence[0]], inner_results[response[0]])

# Write a text file to store the presence-absence conversion threshold
with open(threshold_file, 'w') as threshold_output:
    threshold_output.write(str(round(threshold_final, 5)))
    
# Train classifier
classifier.fit(X_classify, y_classify)

# Save classifier to an external file
joblib.dump(classifier, output_classifier)
# Export a variable importance plot
plotVariableImportances(classifier, X_classify, importance_classifier)
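
The saved classifier and threshold are meant to be picked up by a separate prediction script. How that script reads them is not shown here, so the following is only a sketch of one plausible round trip. It uses synthetic data, a temporary folder, and an arbitrary threshold value, none of which come from this analysis.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=100, n_features=5, random_state=5)
model = ExtraTreesClassifier(n_estimators=25, random_state=5).fit(X, y)
threshold_value = 0.437  # arbitrary example value, not a result from this analysis

with tempfile.TemporaryDirectory() as folder:
    classifier_path = os.path.join(folder, 'classifier.joblib')
    threshold_path = os.path.join(folder, 'threshold.txt')
    # Save the classifier with joblib and the threshold as rounded plain text
    joblib.dump(model, classifier_path)
    with open(threshold_path, 'w') as threshold_output:
        threshold_output.write(str(round(threshold_value, 5)))
    # Reload both files, as a downstream prediction script might
    loaded_model = joblib.load(classifier_path)
    with open(threshold_path) as threshold_input:
        loaded_threshold = float(threshold_input.read())

# Apply the reloaded threshold to the reloaded model's presence probabilities
presence_probability = loaded_model.predict_proba(X)[:, 1]
predicted_presence = (presence_probability >= loaded_threshold).astype(int)
print(loaded_threshold, predicted_presence[:5])
```

Files written by joblib are generally safest to reload with the same scikit-learn version that created them.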

In [27]:
print(auc)
print(accuracy)

0.8423157017909204
0.7755102040816326


In [28]:
print(len(inner_results))

539


### 5.4. Export Results

In [29]:
# Export test results to csv
outer_results.to_csv(output_csv, header=True, index=False, sep=',', encoding='utf-8')

In [30]:
# Set initial plot size
fig_size = plot.rcParams["figure.figsize"]
fig_size[0] = 8
fig_size[1] = 6
plot.rcParams["figure.figsize"] = fig_size

In [31]:
# Export a Pearson Correlation plot for the predictor variables
plotVariableCorrelation(input_data[predictor_all], variable_correlation)



In [32]:
# Write html text file
output_report = os.path.join(output_folder, output_report_name)
output_text = os.path.splitext(output_report)[0] + ".txt"
text_file = open(output_text, "w")
text_file.write(f"<html>\n")
text_file.write(f"<head>\n")
text_file.write(f"<meta http-equiv=\"pragma\" content=\"no-cache\">\n")
text_file.write(f"<meta http-equiv=\"Expires\" content=\"-1\">\n")
text_file.write(f"</head>\n")
text_file.write(f"<body>\n")
text_file.write(f"<div style=\"width:90%;max-width:1000px;margin-left:auto;margin-right:auto\">\n")
text_file.write(f"<h1 style=\"text-align:center;\">Distribution model performance for " + taxon_name + "</h1>\n")
text_file.write(f"<br>" + "\n")
text_file.write(f"<h2>Predicted Distribution-abundance Pattern</h2>\n")
text_file.write(f"<p>Distribution-abundance was predicted by a composite hierarchical model: 1. a classifier predicted species presence or absence, and 2. a regressor predicted species foliar cover in areas where the classifier predicted the species to be present. The map below shows the output raster prediction.</p>\n")
text_file.write(f"<p><i>Prediction step has not yet been performed. No output to display.</i></p>\n")
text_file.write(f"<h2>Bayesian Optimization of Hyperparameters</h2>\n")
text_file.write(f"<p>The hyperparameters of the classifier and regressor were independently optimized in a bayesian framework using a Gaussian Process as the generative model. Optimization performance was determined by 5-fold cross validation using area under the receiver operating characteristic curve (AUC) as the metric to maximize for the classifier and R squared as the metric to maximize for the regressor. 250 optimization iterations were performed for the classifier and 500 were performed for the regressor. The best set of parameters was selected based on the maximization criteria.</p>\n")
text_file.write(f"<h2>Composite Model Performance</h2>\n")
text_file.write(f"<p>Model performance was measured by calculating R squared, mean absolute error, and root mean squared error for the composite prediction across five independent cross validation folds. Within each cross validation fold, the Bayesian Optimization and threshold calculation were nested with 5-fold cross validation. Nested cross-validation maintained the independence of the test partition through multiple rounds of cross-validated model optimization. Additionally, the performance of the absence class is reported as an area under the receiver operating characteristic curve (AUC) and overall accuracy, where specificity and sensitivity are as close to equal as possible (i.e., the model performs equally well at predicting absences and presences). All performance results are reported from the merged results of all cross-validation runs (i.e., the performance is based on inclusion of each sample in the test set once).</p>\n")
text_file.write(f"<h3>Overall Performance</h3>\n")
text_file.write(f"<p>AUC = " + str(np.round(fin_auc, 2)) + "</p>\n")
text_file.write(f"<p>Presence-Absence Accuracy = " + str(np.round(fin_accuracy, 2)) + "</p>\n")
text_file.write(f"<h3>Classifier Importances</h3>\n")
text_file.write(f"<p>The Variable Importance plot for the classifier is shown below:</p>\n")
text_file.write(f"<a target='_blank' href='plots/importance_classifier.png'><img style='display:inline-block;max-width:1000px;width:100%;' src='plots/importance_classifier.png'></a>\n")
text_file.write(f"<h2>Variable Correlation</h2>\n")
text_file.write(f"<p>The plot below explores variable correlation. No attempt was made to remove highly correlated variables (shown in dark blue in the plot).</p>\n")
text_file.write(f"<a target='_blank' href='plots/variable_correlation.png'><img style='display:inline-block;width:100%;' src='plots/variable_correlation.png'></a>\n")
text_file.write(f"</div>\n")
text_file.write(f"</body>\n")
text_file.write(f"</html>\n")
text_file.close()

In [33]:
# Rename HTML Text to HTML
if os.path.exists(output_report):
    os.remove(output_report)
os.rename(output_text, output_report)
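
The report export above writes the HTML to a text file and then renames it over any existing report. A shorter variant of the same pattern is sketched below, assuming the report can be assembled as a list of lines. `write_report`, the file name, and the metric values are illustrative only. `os.replace` overwrites an existing destination, so the separate existence check and removal are not needed.

```python
import os
import tempfile


def write_report(report_path, title, auc_value, accuracy_value):
    """Write a small HTML report to a temporary text file, then move it into place."""
    lines = [
        '<html>',
        '<body>',
        f'<h1>{title}</h1>',
        f'<p>AUC = {auc_value:.2f}</p>',
        f'<p>Presence-Absence Accuracy = {accuracy_value:.2f}</p>',
        '</body>',
        '</html>',
    ]
    temporary_path = os.path.splitext(report_path)[0] + '.txt'
    # The with block closes the file even if writing fails partway through
    with open(temporary_path, 'w', encoding='utf-8') as text_file:
        text_file.write('\n'.join(lines) + '\n')
    # os.replace overwrites an existing report at the destination path
    os.replace(temporary_path, report_path)


with tempfile.TemporaryDirectory() as folder:
    report_path = os.path.join(folder, 'report-example.html')
    # Example values only; call twice to show that an existing report is overwritten
    write_report(report_path, 'Example Report', 0.50, 0.50)
    write_report(report_path, 'Example Report', 0.84, 0.75)
    with open(report_path, encoding='utf-8') as report_file:
        report_text = report_file.read()
print(report_text)
```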