<a href="https://colab.research.google.com/github/ddinh4/AI_in_Agriculture_Conference_2024/blob/main/XGBOOST_001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2024 Pre-Conference Workshop, XGBOOST your Digital Ag Research, Thanos Gentimis, Dina Dinh, Leticia Santos

In this workshop we will explore the new and promising Machine Learning paradigm of XGBOOST on an information-rich, agriculture-based dataset. Our goal will be to predict various agronomic indices, including yield and vigor, through Python code that compares multiple models, with an emphasis on the appropriate use and optimization of those models. This workshop will use Google Colab as its primary delivery platform, but optional videos will be available to help the audience port the workshop to their own Python-based platforms. No advanced coding experience or knowledge of the specific models used is required, and all data will be provided, but you will need to bring your own laptop. We encourage participants to ask questions, and a GitHub repository will be set up for all FAQs, code, and recordings.

# Preamble


In [None]:
# Basic packages
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt



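*  [Pandas is a basic package for data wrangling through dataframes. Very reminiscent of R!](https://pandas.pydata.org/)
*  [os is a platform-independent package for interacting with the operating system, for example when building file paths](https://docs.python.org/3/library/os.html)
*  [NumPy provides fast numerical arrays and the math functions that operate on them](https://numpy.org/)
*  [Matplotlib is a general-purpose plotting package](https://matplotlib.org/)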




In [None]:
#Preprocessing packages
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV


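*  [StandardScaler standardizes features by removing the mean and scaling to unit variance](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
*  [SimpleImputer fills in missing values using a simple strategy such as the mean or the most frequent value](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
*  [train_test_split splits a dataset into random training and testing subsets](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
*  [GridSearchCV performs an exhaustive, cross-validated search over a grid of hyperparameter values](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)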

In [None]:
# Model packages
from sklearn.linear_model import LinearRegression
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor

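*  [LinearRegression fits an ordinary least squares linear model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
*  [xgboost is a gradient boosting library; we use its scikit-learn style XGBRegressor](https://xgboost.readthedocs.io/)
*  [RandomForestRegressor averages the predictions of many decision trees, each fit on a bootstrap sample of the data](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)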

In [None]:
# Accuracy Packages
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report

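*  [mean_squared_error computes the average squared difference between observed and predicted values (regression)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
*  [classification_report summarizes precision, recall, and F1-score for each class (classification)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)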
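
As a small illustration of the regression metric imported above, the sketch below computes the mean squared error and its square root (RMSE) on made-up numbers. The values are hypothetical and only meant to show the call.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values (not from the workshop data)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = float(np.sqrt(mse))                # back in the units of the response

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")  # MSE = 1.667, RMSE = 1.291
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the response variable.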

# Data Preprocessing

Reading the dataset and creating the Ex0 dataframe. The path below is specific to the authors' drive; point it to your own copy of the data.

In [None]:
Ex0= pd.read_csv('G:/.shortcut-targets-by-id/1N2GeQNhCJy4B6-unK1KaiWfePgNe9k29//Dina_Dinh/2024_Spring_AIConference/Workshop/Data/HousingData.csv',low_memory=False)

Imputes missing values using the column mean and creates the Ex1 dataframe

In [None]:
imputer = SimpleImputer(strategy='mean') # imputes each missing value with the mean of its column
Ex1 = pd.DataFrame(imputer.fit_transform(Ex0), columns=Ex0.columns) # applies the imputer to our dataset
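
To see what mean imputation does, here is a minimal sketch on a tiny made-up dataframe. The column names `a` and `b` and their values are hypothetical; the calls are the same as in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 6.0, np.nan]})

toy_imputer = SimpleImputer(strategy="mean")
toy_filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)

print(toy_filled)
# The missing entry in "a" becomes 2.0 (mean of 1 and 3),
# and the missing entry in "b" becomes 5.0 (mean of 4 and 6).
```

Note that the mean strategy only works on numeric columns, so a dataset with text columns would need those handled separately.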

# Computing Yield (Regression)

Move the output column to the beginning

In [None]:
cols = list(Ex1)
cols.insert(0, cols.pop(cols.index('YIELD_OBS')))
Ex2=Ex1.loc[:,cols] # final clean dataset with no missing values and the response variable as the first column

Split into Input and Output

In [None]:
X = Ex2.iloc[:, 1:].values # all input variables as a matrix
y = Ex2.iloc[:, 0].values # the response variable as a 1-D array
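
The two steps above (moving the response to the first column, then splitting) can be checked on a small made-up dataframe. In this sketch the columns `NDVI` and `RAIN` and all values are hypothetical; only `YIELD_OBS` comes from the workshop code.

```python
import pandas as pd

# Hypothetical toy data; the response sits in the middle column
toy = pd.DataFrame({
    "NDVI": [0.61, 0.72, 0.55],
    "YIELD_OBS": [150.0, 172.0, 139.0],
    "RAIN": [20.1, 25.3, 18.7],
})

# Move the response variable to the first position
cols = list(toy)
cols.insert(0, cols.pop(cols.index("YIELD_OBS")))
toy_ordered = toy.loc[:, cols]

# Everything after the first column is input, the first column is output
X_toy = toy_ordered.iloc[:, 1:].values
y_toy = toy_ordered.iloc[:, 0].values

print(list(toy_ordered.columns))  # ['YIELD_OBS', 'NDVI', 'RAIN']
print(X_toy.shape, y_toy.shape)   # (3, 2) (3,)
```

Checking the shapes this way is a quick guard against accidentally leaving the response among the inputs.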


## Hyperparameter Optimization

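We will set two hyperparameter grids, one for Random Forests and the other for XGBOOST. The one for Random Forests will explore:

*   n_estimators: the number of trees in the forest
*   max_depth: the maximum depth of each tree
*   max_features: the number of features considered when looking for the best split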



In [None]:
rf_param_grid = {
    'n_estimators': range(100, 800, 150),
    'max_depth': range(1, 50, 10),
    'max_features': range(3, 20, 5),
}

These choices are informed by experience; there are no clear guidelines, and there are more hyperparameters one could tune. It is a matter of balance between accuracy and computing time.

We will also be setting the hyperparameters for XGBOOST. These will explore:


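*   learning_rate: the step-size shrinkage applied to each new tree's contribution. Lower values make the boosting process more conservative, which helps prevent overfitting but slows down learning.
*   max_depth: the maximum depth of each tree, same as RF
*   subsample: the fraction of training rows sampled for each tree. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees.
*   colsample_bytree: the fraction of columns (features) sampled when constructing each tree
*   gamma: specifies the minimum loss reduction required to make a split. The larger gamma is, the more conservative the algorithm will be.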
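
To make the role of the learning rate concrete, here is a minimal hand-rolled sketch of boosting with shrinkage for squared error. It is not XGBoost's actual implementation: the data are synthetic, the helper `boost` is hypothetical, and plain scikit-learn decision trees stand in for XGBoost's trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

def boost(X, y, learning_rate, n_rounds=20, max_depth=2):
    """Fit trees to the current residuals and add a shrunken copy of each."""
    pred = np.full_like(y, y.mean())   # start from the mean prediction
    errors = []
    for _ in range(n_rounds):
        residual = y - pred            # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)  # shrinkage step
        errors.append(float(np.mean((y - pred) ** 2)))
    return errors

slow = boost(X_demo, y_demo, learning_rate=0.1)
fast = boost(X_demo, y_demo, learning_rate=0.5)
print("Training MSE after 20 rounds:", round(slow[-1], 4), "vs", round(fast[-1], 4))
```

Each round only adds a fraction of the new tree's correction, so a smaller learning rate needs more rounds to reach the same training error. That slower, more cautious fitting is what makes low learning rates less prone to overfitting.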



In [None]:
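xg_param_grid = {
    'learning_rate': np.arange(0.01, 0.2, 0.01), # starts at 0.01: with a learning rate of 0 the model cannot learn
    'max_depth': range(1, 10, 2), # same as RF
    'subsample': np.arange(0.2, 0.6, 0.1), # fraction of rows sampled per tree
    'colsample_bytree': np.arange(0.1, 0.5, 0.1), # fraction of features sampled per tree
    'gamma': np.arange(0, 0.4, 0.1) # minimum loss reduction required to split
    }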

Perform a grid search with 5-fold cross-validation on the full dataset, first for Random Forest and then for XGBOOST

In [None]:
rf_grid_search = GridSearchCV(RandomForestRegressor(), rf_param_grid, cv=5, n_jobs=-1) # n_jobs=-1 runs the search in parallel on all available cores
rf_grid_search.fit(X, y)

xg_grid_search = GridSearchCV(xgb.XGBRegressor(), xg_param_grid, cv=5, n_jobs=-1)
xg_grid_search.fit(X, y)
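
Because a grid search fits one model per hyperparameter combination and per fold, it is worth estimating the cost before running it. The sketch below counts the fits for the Random Forest grid; the helper `count_fits` is a hypothetical name introduced here for illustration.

```python
from math import prod

# Same Random Forest grid as defined earlier in the workshop
rf_param_grid = {
    "n_estimators": range(100, 800, 150),
    "max_depth": range(1, 50, 10),
    "max_features": range(3, 20, 5),
}

def count_fits(param_grid, cv):
    """Return (number of combinations, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

n_candidates, n_fits = count_fits(rf_param_grid, cv=5)
print(n_candidates, "combinations,", n_fits, "fits")  # 100 combinations, 500 fits
```

The same helper works for the XGBOOST grid, which is considerably larger because it has five hyperparameters. By default `GridSearchCV` also refits the best model once on the full data at the end.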

Based on the grid search above, the best hyperparameters are the following.
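
A fitted `GridSearchCV` object stores its results in the `best_params_` and `best_score_` attributes. The sketch below shows how to read them on a small synthetic problem; the data, the reduced grid, and the `toy_` names are assumptions chosen so that the example runs in a few seconds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the workshop dataset
X_toy, y_toy = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# Deliberately tiny grid so the search is fast
toy_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}
toy_search = GridSearchCV(RandomForestRegressor(random_state=0), toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

print("Best hyperparameters:", toy_search.best_params_)
print("Best mean CV score (R^2 by default):", round(toy_search.best_score_, 3))
```

In the workshop notebook, the equivalent calls would be `rf_grid_search.best_params_` and `xg_grid_search.best_params_`.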