# XGBoost

**XGBoost** (eXtreme Gradient Boosting) is a powerful and efficient machine learning algorithm designed for both **regression** and **classification** tasks. It belongs to the family of gradient boosting methods, which build models in a sequential manner by combining the strengths of multiple weak learners (typically decision trees) to create a highly accurate ensemble model.

XGBoost is renowned for its ability to deliver strong predictive performance and handle complex data challenges efficiently.

## Key Features of XGBoost

1. **Gradient Boosting Framework:** XGBoost builds trees one at a time, where each new tree focuses on correcting the errors made by the previous trees, effectively reducing the overall prediction error.

2. **Regularization:** It includes techniques like L1 and L2 regularization to prevent overfitting, ensuring the model generalizes well to unseen data.

3. **Handling Missing Data:** XGBoost handles missing values natively: at each split it learns a default direction for samples whose feature value is missing, so a separate imputation step is not strictly required.

4. **Parallel Processing:** Trees are still added one after another, but the search for the best split within each tree is parallelized, which makes training faster and more scalable on large datasets.

5. **Flexibility:** XGBoost supports various objective functions, allowing it to be used for a wide range of applications beyond standard classification and regression, such as ranking and user-defined tasks.

6. **Feature Importance:** It provides insights into which features are most influential in making predictions, aiding in feature selection and model interpretability.
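The sequential idea behind the first feature can be illustrated with a small, self-contained sketch. It is plain gradient boosting for squared-error regression with one-split trees (stumps), not XGBoost's actual algorithm, which also uses second-order gradient information and regularization. The function names, the toy data, and the `learning_rate` value are all assumptions made for illustration.

```python
# Toy gradient boosting for regression with squared error (illustrative only).
# Each weak learner is a decision stump: one threshold on a single feature.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Skip the largest x as a threshold so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps one at a time, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    predictions = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        mse_history.append(mse)
    return base, stumps, mse_history


# Hypothetical one-feature data with three rough plateaus.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

base, stumps, mse_history = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.4f}")
print(f"MSE after round {len(mse_history)}: {mse_history[-1]:.4f}")
```

Because every stump is fitted to what the current ensemble still gets wrong, the training error cannot increase from one round to the next in this setting; the learning rate only controls how much of each correction is applied.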

## Common Uses of XGBoost

- **Predictive Modeling:** Used in scenarios where accurate predictions are crucial, such as financial forecasting, risk assessment, and sales prediction.
  
- **Competition Success:** XGBoost has been a popular choice in machine learning competitions (like those on Kaggle) due to its high performance and ability to handle complex datasets.
  
- **Real-World Applications:** Employed in various industries for tasks like fraud detection, recommendation systems, and healthcare diagnostics.

## Why Use XGBoost?

- **High Performance:** Often achieves superior accuracy compared to other algorithms, especially on structured/tabular data.
  
- **Efficiency:** Optimized for speed and resource usage, making it suitable for large-scale datasets.
  
- **Versatility:** Applicable to a wide range of problems and adaptable through its numerous hyperparameters.


## References
- [xgboost.ai](https://xgboost.ai/)
- [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)



## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

This example applies XGBoost to the breast cancer dataset (`Data.csv`) used earlier for classification. On that dataset, logistic regression, k-Nearest Neighbors, SVM, Naive Bayes, decision tree, and random forest models reached accuracies between 93.5% and 95.9%, with the decision tree performing best. The goal here is to see how XGBoost compares.

Previous experiments can be found in: [Part 3 - Classification/8 Evaluation and Selection of Classification Models/Example](https://github.com/Ubikitina/Machine-Learning-A-Z/tree/main/Part%203%20-%20Classification/8%20Evaluation%20and%20Selection%20of%20Classification%20Models/Example) folder.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning A-Z/Part 10 - Model Selection & Boosting/2 XGBoost/Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [8]:
# Get the unique values in the y array and their occurrences
unique_values, counts = np.unique(y, return_counts=True)

# Display each unique value and its count
for value, count in zip(unique_values, counts):
    print(f"Value {value}: {count} occurrences")

Value 2: 444 occurrences
Value 4: 239 occurrences


`XGBClassifier` expects class labels to be consecutive integers starting at `0` (`0` and `1` for a binary problem), but this dataset uses the labels `2` and `4`. The `y` labels are therefore remapped to meet this requirement.



In [10]:
# Remap the labels: 2 becomes 0 and 4 becomes 1
y[y == 2] = 0
y[y == 4] = 1

In [11]:
y

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
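
The two in-place assignments work here because the old and new label values do not overlap. A more general alternative, sketched below on a small hypothetical array rather than the real `y`, is to let `np.unique` derive the codes in one pass. The variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical label vector using the same coding as the dataset (2 and 4).
labels = np.array([2, 4, 4, 2, 2, 4])

# np.unique returns the sorted distinct labels; with return_inverse=True it
# also returns, for each element, the index of its label in that sorted
# array, which gives consecutive codes starting at 0.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)   # [2 4]
print(encoded)   # [0 1 1 0 0 1]

# The mapping is reversible: indexing classes with the codes restores the input.
restored = classes[encoded]
```

Keeping `classes` around makes it possible to translate predictions back to the original labels later.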

## Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training XGBoost on the Training set

In [13]:
from xgboost import XGBClassifier

# Initialize the XGBoost classifier
classifier = XGBClassifier()

# Fit the classifier to the training data (X_train: features, y_train: target labels)
classifier.fit(X_train, y_train)

## Making the Confusion Matrix

In [14]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_score(y_test, y_pred)

[[85  2]
 [ 1 49]]


0.9781021897810219

- **85**: True Negatives (TN)
- **2**: False Positives (FP)
- **1**: False Negatives (FN)
- **49**: True Positives (TP)

The value `0.9781` is the model's **accuracy** on the test set: 134 of the 137 predictions (about 97.81%) are correct.
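
The four counts in the confusion matrix determine more than accuracy. As a small worked example, the sketch below derives precision, recall, specificity, and F1 from them, treating class `1` (originally label `4`) as the positive class. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 85, 2, 1, 49

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted positives, how many are real
recall = tp / (tp + fn)           # of the real positives, how many were found
specificity = tn / (tn + fp)      # of the real negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity),
                    ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```

In a diagnostic setting recall is often the figure to watch, since a false negative (one case here) is usually the more costly error.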

A single train/test split can look better or worse than the model really is purely by chance, so k-Fold Cross-Validation is applied to check how reliable this 97.8% figure is:

## Applying k-Fold Cross Validation

In [15]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation on the training set using the specified classifier
# 'cv = 10' means using 10-fold cross-validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.71 %
Standard Deviation: 2.28 %


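10-fold cross-validation on the training set gives a mean accuracy of 96.71% with a standard deviation of 2.28%. This is above the 93.5% to 95.9% range quoted earlier for the previous models, which suggests that XGBoost performs better on this dataset.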
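
To make the procedure concrete, the sketch below shows how a 10-fold split partitions the training rows: every row is used for validation exactly once and for training in the other nine folds. It is a simplified, unstratified version written for illustration; for a classifier, `cross_val_score` with an integer `cv` uses stratified folds that preserve the class ratio. The helper name `k_fold_indices` is hypothetical, and the training-set size of 546 is inferred from the 683 rows counted above minus the 137 test samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample each.
        size = base_size + (1 if fold < remainder else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


n_train = 546  # inferred: 683 rows in total, 137 held out for testing
folds = list(k_fold_indices(n_train, 10))

fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)

# Every sample appears in exactly one validation fold.
all_validation = sorted(i for _, validation in folds for i in validation)
print(all_validation == list(range(n_train)))
```

The model is refitted on each training part and scored on the matching validation part; the mean and standard deviation reported above summarize those ten scores.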