# XGBoost

__XGBoost__ stands for __eXtreme Gradient Boost__, and is one of the most popular machine learning algorithms used till date. As the name implies, it is a _Gradient Boosting_ algorithm. A gradient boosting algorithm is a technique used for classification and regression problems which produces an _ensemble_. An _ensemble_ is a group of weak prediction models that are combined to form a strong prediction model. __XGBoost__ is a fantastic algorithm because it requires a relatively low amount of processing power and yields strong performance results. <br>

A very simple explanation for the algorithm is that it iteratively develops a _CART_ tree, where every child _CART_ tree accounts for the errors of it's parental _CART_ tree.

### Anaconda Installation

```bash
    conda install -c anaconda py-xgboost
```

<hr>

## Code

__Setting up the Dataset:__

_Note:_ Feature Scaling isn't required when performing XGBoost.

In [1]:
import numpy as py
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Transform gender descriptions to values 0 and 1.
X[:, 2] = LabelEncoder().fit_transform(X[:, 2])

# Perform dummy encoding on country descriptions.
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories = 'auto'), [4])], remainder = 'passthrough')
X = ct.fit_transform(X)

# Prevent the dummy variable trap. 
X = X[:, 1:]

# Split the dataset into the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

<hr>

__Fitting the Classifier & Making Predictions:__

In [2]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

cm

array([[1521,   74],
       [ 197,  208]], dtype=int64)

<hr>

__Performing k-Fold Cross Validation:__

In [3]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)

print(accuracies)
print('Mean: ' + str(accuracies.mean()))
print('Standard Deviation: ' + str(accuracies.std()))

[0.87    0.855   0.87875 0.8725  0.86    0.8525  0.865   0.85    0.84875
 0.8725 ]
Mean: 0.8625
Standard Deviation: 0.01017042280340401
