# 🎯 How to Use XGBoost in Python for Cancer Prediction with High Accuracy

## 🔍 Introduction to XGBoost

XGBoost (**eXtreme Gradient Boosting**) is a library that implements gradient-boosted decision trees. It works well for:

* ✅ Classification tasks
* ✅ Regression tasks

It is fast and accurate, and it is a frequent component of winning entries in machine learning competitions because it scales well and performs strongly.

---

## 🩺 Dataset Overview

We'll use a breast cancer dataset with features like:

* Clump thickness
* Uniformity of cell size
* Uniformity of cell shape
* ... and others.

### 🎯 Target Variable:

* **Benign tumor** → class `2`
* **Malignant tumor** → class `4`

---

## 📊 Previous Model Performances

| Model               | Accuracy                  |
| ------------------- | ------------------------- |
| Logistic Regression | 94.7%                     |
| k-Nearest Neighbors | 94.7%                     |
| SVM                 | 94.1%                     |
| Kernel SVM          | 95.3%                     |
| Naive Bayes         | 94.1%                     |
| Decision Tree       | **95.9%** ✅ (best so far) |
| Random Forest       | 93.5%                     |

---

## ⚙️ Environment Setup

* Recommended: **Google Colab** or **Jupyter Notebook**
* Dataset: `data.csv`

> 💡 In Google Colab, upload `data.csv` before running the code.

---

## 📦 Import Libraries and Prepare the Data

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
dataset = pd.read_csv('data.csv')

# Split features and labels
X = dataset.iloc[:, :-1].values  # all columns except last
y = dataset.iloc[:, -1].values   # target variable

# Encode labels as 0/1 (2 -> 0, 4 -> 1); recent XGBoost versions expect classes to start at 0
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Split into train and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
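To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to the `2`/`4` class codes. The five-element label array is made up for illustration and is not taken from `data.csv`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy label column using the dataset's coding: 2 = benign, 4 = malignant
raw_labels = np.array([2, 4, 2, 2, 4])

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(raw_labels)

print(demo_encoder.classes_)   # sorted original labels; position = encoded value
print(encoded)                 # 2 becomes 0, 4 becomes 1

# Map encoded values (for example, model predictions) back to the original codes
decoded = demo_encoder.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted encoder around is useful later: `inverse_transform` turns the model's `0`/`1` predictions back into the dataset's `2`/`4` codes.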

---

## 🚀 Training the XGBoost Classifier

```python
from xgboost import XGBClassifier

# Initialize and train the classifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
```

---

## 📈 Evaluating the Model

### 1. Accuracy and Confusion Matrix

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", cm)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```
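If the layout of the confusion matrix is unclear, this small pure-Python sketch builds one by hand using the same convention as scikit-learn (rows are actual classes, columns are predicted classes). The six labels are hypothetical.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels coded 0 (benign) and 1 (malignant)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1   # row = actual class, column = predicted class
    return matrix

# Hypothetical labels for six tumors
actual_labels    = [0, 0, 1, 1, 1, 0]
predicted_labels = [0, 0, 1, 0, 1, 0]

cm_demo = binary_confusion_matrix(actual_labels, predicted_labels)
correct = cm_demo[0][0] + cm_demo[1][1]        # diagonal = correct predictions
accuracy_demo = correct / len(actual_labels)

print(cm_demo)
print("Accuracy: {:.2f}%".format(accuracy_demo * 100))
```

The off-diagonal cells are the two kinds of error. In a cancer setting, the bottom-left cell (malignant tumors predicted as benign) is usually the one to watch most closely.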

### 2. ✅ k-Fold Cross-Validation

```python
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print("Cross-Validation Accuracy: {:.2f}%".format(accuracies.mean() * 100))
print("Standard Deviation: {:.2f}%".format(accuracies.std() * 100))
```
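To show what `cv=10` means, here is a simplified sketch of how k-fold splitting partitions sample indices. It ignores shuffling and stratification; when given a classifier and an integer `cv`, scikit-learn uses stratified folds, so treat this as an illustration of the idea rather than of `cross_val_score` internals. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

The model is trained ten times, each time on nine folds and scored on the remaining one. The mean of the ten scores estimates accuracy, and the standard deviation shows how much that estimate varies from fold to fold.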

---

## 📊 XGBoost for Regression (Bonus)

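For regression tasks, where the target is a continuous value rather than a class label, use `XGBRegressor` instead. The API mirrors the classifier:

```python
from xgboost import XGBRegressor

# X_train / y_train here stand for a regression dataset with a numeric target
regressor = XGBRegressor()
regressor.fit(X_train, y_train)
```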

---

## 🏆 Final Results

* **XGBoost test-set accuracy:** **97.8%**
* **k-Fold Cross-Validation Avg Accuracy:** **96.53%**
* **Standard Deviation:** ~2%

✅ This is higher than the test accuracy of every previously tested model (the best so far was the Decision Tree at 95.9%).

---

## ✅ Conclusion

XGBoost:

* Is a strong general-purpose tool for classification and regression
* Delivered the **highest test accuracy** (97.8%) of the models tried on the cancer dataset
* Showed **consistent performance** across folds in 10-fold cross-validation (96.53% mean, ~2% standard deviation)

---

## 💡 Key Takeaways

* 🛠 **XGBoost** is an essential model in your ML toolbox.
* 🔍 It works on both **classification and regression** problems.
* 🎯 It achieved **97.8% test accuracy** on the breast cancer dataset.
* 📊 k-Fold cross-validation showed that this performance is **consistent across folds**.

---

## 🧠 Goal:

Use **XGBoost** to **predict if a tumor is benign or malignant** with high accuracy.

---

## 📊 Dataset Example:

We use the **Breast Cancer Wisconsin dataset** where:

| Feature                  | Example Value               |
| ------------------------ | --------------------------- |
| Clump Thickness          | 5                           |
| Uniformity of Cell Size  | 1                           |
| Uniformity of Cell Shape | 3                           |
| ...                      | ...                         |
| Class (Target)           | 2 (Benign) or 4 (Malignant) |

---

## 📦 Step 1: Install and Import

```bash
pip install xgboost
```

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier
```

---

## 📂 Step 2: Load and Prepare the Data

```python
data = pd.read_csv('data.csv')
X = data.drop(columns=['Class'])        # Features
# Recent XGBoost versions expect class labels to start at 0, so remap 2/4 to 0/1
y = data['Class'].map({2: 0, 4: 1})     # Target: 0 (Benign), 1 (Malignant)

# Split into training and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
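When one class is less common than the other, a plain random split can leave the test set with a different class mix than the training set. As an optional variation (not part of the original walkthrough), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The toy labels below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 benign (0) and 35 malignant (1)
y_toy = np.array([0] * 65 + [1] * 35)
X_toy = np.arange(100).reshape(-1, 1)   # placeholder feature column

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Share of malignant cases in each split
print(y_toy_train.mean(), y_toy_test.mean())
```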

---

## 🤖 Step 3: Build and Train XGBoost Classifier

```python
# use_label_encoder is deprecated in recent XGBoost releases and is omitted here
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)
```

---

## ✅ Step 4: Make Predictions and Evaluate

```python
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

---

## 📈 Diagram: What XGBoost Does Internally

Imagine this as an **ensemble of decision trees**:

```
XGBoost = Tree 1 + Tree 2 + Tree 3 + ... + Tree N

Each new tree is trained to correct the errors left by the trees before it.
```

### Example:

* **Tree 1** alone reaches, say, 80% accuracy, but it misclassifies some malignant tumors.
* **Tree 2** is fit to the errors Tree 1 left behind.
* **Tree 3** reduces the remaining errors further.
* Final prediction: the sum of all trees' contributions, which is typically more accurate than any single tree.
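The idea in the list above can be sketched in a few lines of NumPy. This is plain gradient boosting with squared error and one-split trees (stumps) on a toy 1-D problem. It is a simplified illustration, not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The helper names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip max so the right side is non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy data: target is mostly 1 for large x, with one point that breaks the pattern
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())           # start from a constant guess
losses = []
for _ in range(5):
    residual = y - prediction                    # what the ensemble still gets wrong
    stump = fit_stump(x, residual)               # the next tree targets those errors
    prediction += learning_rate * predict_stump(stump, x)
    losses.append(float(np.mean((y - prediction) ** 2)))

print(losses)   # mean squared error after each added tree
```

Each round fits a new stump to the current residuals and adds a shrunken copy of it to the running prediction, so the training error does not increase as trees are added.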

---

## 🔁 Bonus: k-Fold Cross-Validation

```python
from sklearn.model_selection import cross_val_score

# Cross-validate on the training split only, so the test set stays unseen
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
print("Mean Accuracy: {:.2f}%".format(scores.mean()*100))
print("Standard Deviation: {:.2f}%".format(scores.std()*100))
```

---

## 📊 Final Output Example

```
Accuracy: 97.8%
Confusion Matrix:
[[91  0]
 [ 3 46]]
Mean Accuracy: 96.53%
Standard Deviation: 2.0%
```
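Accuracy alone hides which kind of mistake the model makes. As a quick check, the sketch below recomputes accuracy from the confusion matrix in the example output and adds recall and precision for the malignant class. It assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes, with benign first).

```python
# Values copied from the example confusion matrix: [[91, 0], [3, 46]]
tn, fp = 91, 0    # actual benign:    predicted benign, predicted malignant
fn, tp = 3, 46    # actual malignant: predicted benign, predicted malignant

total = tn + fp + fn + tp
cm_accuracy = (tn + tp) / total
recall_malignant = tp / (tp + fn)      # share of malignant tumors that were caught
precision_malignant = tp / (tp + fp)   # share of malignant predictions that were right

print("Accuracy:  {:.2f}%".format(cm_accuracy * 100))
print("Recall:    {:.2f}%".format(recall_malignant * 100))
print("Precision: {:.2f}%".format(precision_malignant * 100))
```

Here all the errors are malignant tumors predicted as benign (the `3` in the bottom-left cell), which is worth knowing in a screening context even when overall accuracy is high.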

---

## 💡 Why Is XGBoost So Powerful?

* ✅ Boosting: each new tree corrects errors left by the previous trees
* ✅ Regularization: built-in penalties help limit overfitting
* ✅ Handles missing values natively
* ✅ Fast and efficient

---

## 🧠 Real-Life Analogy

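Think of XGBoost like a **panel of doctors** who review a case one after another. Each doctor focuses on what the previous ones got wrong, and the final diagnosis combines all of their input, which makes it much more accurate than any single opinion.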

---

## 📌 Summary

| Step | What We Did                             |
| ---- | --------------------------------------- |
| 1    | Imported libraries                      |
| 2    | Loaded and prepared the data            |
| 3    | Built and trained XGBoost model         |
| 4    | Evaluated accuracy and confusion matrix |
| 5    | Applied k-Fold CV for robustness        |

