# XGBoost

One of the models we chose to implement for lung cancer classification was gradient boosting, the family of algorithms that XGBoost belongs to. Note that the code below uses scikit-learn's `GradientBoostingClassifier` rather than the `xgboost` library itself.
Gradient-boosted trees are widely used for classification on tabular datasets, and XGBoost is one of the faster implementations.

We start by importing the relevant libraries and dropping the `id` column from our CSV, since a row identifier is not a useful feature.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
# Project-local helpers; k_fold_cv and weighted_avg_and_std used below come from here
from kfold_and_metrics import *
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [None]:
df = pd.read_csv("final.csv")
df = df.drop(columns=["id"])
df.head()

### Hyperparameter Tuning

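To limit overfitting, we tuned the number of estimators, the maximum tree depth, and the minimum number of samples per leaf with an exhaustive grid search, keeping the combination with the highest average ROC AUC across the cross-validation folds.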

In [None]:
best_auc = 0
best = {}

for n_est in range(25, 201, 25):
    for m_depth in range(5, 56, 10):
        for m_samples_leaf in range(5, 26, 5):
            params = {'n_estimators': n_est, 'max_depth': m_depth, 'min_samples_leaf': m_samples_leaf}
            print("Current parameter combination:")
            for parameter, value in params.items():
                print(f"\t{parameter}: {value}")
            print()
            
            # Evaluate this combination with k-fold cross-validation (features reduced to 50 PCA components)
            model = GradientBoostingClassifier(n_estimators=n_est, max_depth=m_depth, min_samples_leaf=m_samples_leaf)
            auc_results = k_fold_cv(model, df, metric_funcs=[roc_auc_score], pca_components=50, k_fold_verbose=True)
            auc_average, auc_std = weighted_avg_and_std(np.array(auc_results['roc_auc_score']))
            # Keep the combination with the highest average AUC seen so far
            if auc_average > best_auc:
                best_auc = auc_average
                best = params
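
The nested loops above evaluate every combination of the three ranges, and each combination triggers a full cross-validation run, so it helps to know the grid size before starting. The following is a minimal sketch that counts the combinations with scikit-learn's `ParameterGrid`. `ParameterGrid` is not used in the original notebook; it is introduced here only to illustrate the size of the search.

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as the nested loops in the search cell
param_grid = {
    "n_estimators": list(range(25, 201, 25)),    # 25, 50, ..., 200 (8 values)
    "max_depth": list(range(5, 56, 10)),         # 5, 15, ..., 55 (6 values)
    "min_samples_leaf": list(range(5, 26, 5)),   # 5, 10, ..., 25 (5 values)
}

grid = ParameterGrid(param_grid)
n_combinations = len(grid)  # 8 * 6 * 5 = 240
first_combination = dict(grid[0])

print(f"{n_combinations} parameter combinations to evaluate")
print("First combination:", first_combination)
```

With 240 combinations, the total number of model fits is 240 times the number of folds, which is worth keeping in mind when choosing the step size of each range.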

Once the search finishes, we print the best-performing parameter combination:

In [None]:
print("Results of the grid search hyperparameter tuning:")
for parameter, value in best.items():
    print(f"\t{parameter}: {value}")
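
For comparison, a similar search can be sketched with scikit-learn's built-in `GridSearchCV`. This is not the notebook's method: the internals of `k_fold_cv` are not shown here, so the sketch below makes its own assumptions. It uses a synthetic dataset from `make_classification` as a stand-in for `final.csv`, a much smaller grid so it runs quickly, and a `Pipeline` so that scaling and PCA are refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: a binary classification target)
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Scaling and PCA live inside the pipeline so they are fit on training folds only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("gb", GradientBoostingClassifier(random_state=0)),
])

# Deliberately small grid; the "gb__" prefix routes each parameter to the classifier step
param_grid = {
    "gb__n_estimators": [25, 50],
    "gb__max_depth": [3, 5],
    "gb__min_samples_leaf": [5, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean ROC AUC: {search.best_score_:.3f}")
```

`GridSearchCV` averages the fold scores without weighting, so its ranking may differ from one based on `weighted_avg_and_std`.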