# 2: Train XGBoost Model

Author: Daniel Lusk

## Imports and configuration

In [None]:
import xgboost as xgb
from TrainModelConfig import TrainModelConfig
from utils.data_retrieval import all_gdfs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

%load_ext autoreload
%autoreload 2

config = TrainModelConfig()

## Load data

In [None]:
X_fns = config.WC_fns + config.MODIS_fns + config.soil_fns
Y_fns = config.iNat_fns

X = all_gdfs(X_fns)
Y = all_gdfs(Y_fns)

Compute the precipitation annual range by subtracting BIO14 (precipitation of the driest month) from BIO13 (precipitation of the wettest month)

In [None]:
bio_13 = X.loc[:, ["bio_13" in x for x in X.columns]].values
bio_14 = X.loc[:, ["bio_14" in x for x in X.columns]].values
X["wc2.1_10m_bio_13-14"] = bio_13 - bio_14

Drop the unnecessary `band`, `spatial_ref`, and `geometry` columns.

In [None]:
X = X.drop(columns=["band", "spatial_ref", "geometry"])
Y = Y.drop(columns=["band", "spatial_ref", "geometry"])

## XGBoost

### To-Dos:

1. remove all rows (lat/lon coords) in the features that do not correspond to a non-null cell in the current trait map
2. standardize the feature data by removing the mean and scaling to unit variance
3. identify optimal hyperparams with a grid search + cross-validation
4. train models for each trait using optimized hyperparams + cross-validation
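As a rough illustration of step 2, the sketch below standardizes features with the `StandardScaler` already imported above. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the project data, and fitting the scaler on the training split only is one reasonable choice rather than something this notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrix and a single trait (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then reuse its statistics
# for the test split so no test information leaks into training.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Per-feature mean and standard deviation of the scaled training data.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

The test split is transformed with the training statistics, so its per-feature mean and standard deviation will generally be close to, but not exactly, 0 and 1.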

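Drop rows with missing labels from both the features and the labels, then convert the dataframes to NumPy arrays

In [None]:
# Assumes X and Y are row-aligned; mask both so their lengths still match.
has_label = Y.notna().all(axis=1).to_numpy()
X_np = X.loc[has_label].to_numpy()
Y_np = Y.loc[has_label].to_numpy()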
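To see in isolation why features and labels need to be filtered together, here is a small sketch on toy data. `X_toy` and `Y_toy` are hypothetical frames, and the example assumes they are row-aligned (row *i* of the features describes the same cell as row *i* of the labels).

```python
import numpy as np
import pandas as pd

# Toy, row-aligned stand-ins for the feature and label frames (hypothetical values).
X_toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [10.0, 20.0, 30.0, 40.0]})
Y_toy = pd.DataFrame({"trait": [0.5, np.nan, 0.7, np.nan]})

# True where every label column is present for that row.
has_label = Y_toy.notna().all(axis=1).to_numpy()

# Apply the same mask to both frames so rows stay paired.
X_toy_np = X_toy.loc[has_label].to_numpy()
Y_toy_np = Y_toy.loc[has_label].to_numpy()

print(X_toy_np.shape, Y_toy_np.shape)
```

Because one mask is applied to both frames, the two arrays keep the same number of rows, which `train_test_split` requires.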

Split into training and test sets and convert the data into `DMatrix` objects for XGBoost

In [None]:
# A float test_size must lie strictly between 0 and 1; 0.2 holds out 20% of the rows.
X_train, X_test, Y_train, Y_test = train_test_split(X_np, Y_np, test_size=0.2)

D_train = xgb.DMatrix(X_train, label=Y_train)
D_test = xgb.DMatrix(X_test, label=Y_test)