In [None]:
"""
# Boosting in Machine Learning - Jupyter Notebook Assignment

## Theoretical Part

### 1. What is Boosting in Machine Learning?
Boosting is an ensemble learning technique that combines multiple weak learners (usually shallow decision trees) into a single strong learner. Models are trained sequentially, and each new model concentrates on the examples its predecessors got wrong, for instance by giving misclassified instances more weight.

### 2. How does Boosting differ from Bagging?
- **Boosting**: Models are trained sequentially, and each new model focuses on correcting the errors of previous models.
- **Bagging**: Models are trained independently and in parallel, with results combined using averaging (regression) or voting (classification).
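
The contrast can be seen by fitting both ensemble types on the same weak learner. The snippet below is a minimal sketch, assuming scikit-learn is available; the synthetic dataset, the sizes, and the variable names are illustrative choices and not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (an assumption, not the assignment's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Bagging: 50 stumps, each fitted independently on its own bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: up to 50 stumps fitted one after another on re-weighted samples.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
boosting.fit(X_tr, y_tr)

bag_acc = bagging.score(X_te, y_te)
boost_acc = boosting.score(X_te, y_te)
print("Bagging accuracy:", bag_acc)
print("Boosting accuracy:", boost_acc)
```

Which ensemble scores higher depends on the data, so the printed accuracies should be read as one data point rather than a general ranking.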

### 3. What is the key idea behind AdaBoost?
The key idea behind **AdaBoost (Adaptive Boosting)** is to maintain a weight for each training sample. All samples start with equal weight; after each iteration, misclassified samples are up-weighted so that the next weak learner focuses on the difficult cases. Each weak learner also receives a vote weight based on its accuracy.

### 4. Explain the working of AdaBoost with an example.
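- Start with equal weights for all training samples.
- Train a weak classifier (e.g., a decision stump) on the weighted samples.
- Compute the classifier's weighted error and derive its vote weight: the lower the error, the larger the vote.
- Increase the weights of misclassified samples and decrease the weights of correctly classified ones.
- Train the next classifier with the updated weights and repeat.
- Combine the weak classifiers through a weighted vote to form the final strong classifier.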

Example: Classifying emails as spam or not spam using decision stumps. Misclassified emails get higher weight, so the next iteration focuses on them.
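
The procedure can be sketched from scratch as follows. This is a simplified illustration of discrete AdaBoost for binary labels, not scikit-learn's implementation. Synthetic data stands in for the spam emails, and `n_rounds` and `ensemble_predict` are hypothetical names chosen for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data (assumption); labels mapped to {-1, +1}.
X_demo, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

n_rounds = 10
weights = np.full(len(y_demo), 1.0 / len(y_demo))  # equal starting weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a decision stump on the currently weighted samples.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error, clipped to keep the logarithm finite.
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump

    # Misclassified samples (y * pred == -1) are scaled up, the rest down.
    weights *= np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def ensemble_predict(X_new):
    # Weighted vote of all stumps; ties are resolved towards +1.
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)

ensemble_acc = np.mean(ensemble_predict(X_demo) == y_demo)
print("Training accuracy of the boosted stumps:", ensemble_acc)
```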

### 5. What is Gradient Boosting, and how is it different from AdaBoost?
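Gradient Boosting builds an additive model by minimizing a differentiable loss function. Unlike AdaBoost, which re-weights samples, Gradient Boosting fits each new model to the negative gradient of the loss with respect to the current predictions. For squared-error loss, this negative gradient is the residual (actual minus predicted value), up to a constant factor.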
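
A hand-rolled version for regression makes the residual-fitting loop concrete. The sketch below assumes squared-error loss and uses shallow scikit-learn trees as weak learners; the data, `learning_rate`, `n_rounds`, and tree depth are illustrative values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic regression data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full(len(y_demo), y_demo.mean())
trees = []
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)

    # Move the predictions a small step in the direction the tree suggests.
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after boosting:", mse_history[-1])
```

Only the training error is tracked here; a held-out set would be needed to judge when additional rounds start to overfit.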

### 6. What is the loss function in Gradient Boosting?
The loss function in Gradient Boosting depends on the problem type:
- **Regression**: Mean Squared Error (MSE) or Mean Absolute Error (MAE).
- **Classification**: Log Loss (Cross-Entropy Loss).
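
For reference, the losses above and the negative gradients that each boosting round would fit can be written in a few lines of NumPy. This is a sketch with hypothetical function names; for log loss, the gradient is taken with respect to the raw score (log-odds), which is an assumption about how the model is parameterized.

```python
import numpy as np

def mse(y, pred):
    # Mean Squared Error
    return np.mean((y - pred) ** 2)

def mae(y, pred):
    # Mean Absolute Error
    return np.mean(np.abs(y - pred))

def binary_log_loss(y, p, eps=1e-15):
    # Cross-entropy for labels in {0, 1} and predicted probabilities p.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def neg_grad_squared_error(y, pred):
    # Negative gradient of 0.5 * (y - pred)^2: the residual.
    return y - pred

def neg_grad_absolute_error(y, pred):
    # Negative gradient of |y - pred|: the sign of the residual.
    return np.sign(y - pred)

def neg_grad_log_loss(y, raw_score):
    # Negative gradient of log loss w.r.t. the log-odds: y - sigmoid(raw_score).
    return y - 1.0 / (1.0 + np.exp(-raw_score))

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
print("MSE:", mse(y_true, y_hat))
print("MAE:", mae(y_true, y_hat))
print("Log loss:", binary_log_loss(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
```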

### 7. How does XGBoost improve over traditional Gradient Boosting?
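XGBoost (Extreme Gradient Boosting) improves efficiency and performance with:
- Regularization (L1 and L2 penalties on leaf weights)
- Parallelized split finding within each tree (the trees themselves are still built sequentially)
- Automatic handling of missing values by learning a default split direction
- Pruning (removing splits whose gain does not exceed a threshold)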

### 8. What is the difference between XGBoost and CatBoost?
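- **XGBoost**: A general-purpose library for structured (tabular) data; categorical features are usually encoded numerically before training.
- **CatBoost**: Optimized for categorical data; it encodes categorical features internally and uses **ordered boosting** to prevent target leakage.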

### 9. What are some real-world applications of Boosting techniques?
- Fraud detection (banking & finance)
- Medical diagnosis (cancer detection)
- Recommender systems (Netflix, Amazon)
- Autonomous driving (object detection)

### 10. How does regularization help in XGBoost?
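Regularization prevents overfitting by penalizing complex models. The L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties shrink leaf weights, and `gamma` sets the minimum loss reduction a split must achieve, which removes unnecessary splits in decision trees.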
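
The effect can be worked through by hand with the leaf-weight and split-gain formulas from the regularized objective described in the XGBoost paper. The sketch below covers only the L2 penalty (`reg_lambda`) and the split threshold (`gamma`); the L1 term is left out for brevity, and the gradient sums `G` and Hessian sums `H` are made-up numbers.

```python
def leaf_weight(G, H, reg_lambda):
    # Optimal leaf value: a larger reg_lambda shrinks it towards zero.
    return -G / (H + reg_lambda)

def leaf_score(G, H, reg_lambda):
    # Contribution of one leaf to the structure score.
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    # Loss reduction from splitting a node, minus the per-split penalty gamma.
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = leaf_score(G_left, H_left, reg_lambda) + leaf_score(G_right, H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Made-up gradient statistics for the two candidate child nodes.
G_L, H_L, G_R, H_R = -4.0, 4.0, 2.0, 2.0

print("Leaf weight without L2:", leaf_weight(G_L, H_L, reg_lambda=0.0))
print("Leaf weight with L2:", leaf_weight(G_L, H_L, reg_lambda=1.0))
print("Gain without regularization:", split_gain(G_L, H_L, G_R, H_R, 0.0, 0.0))
print("Gain with reg_lambda=1, gamma=2:", split_gain(G_L, H_L, G_R, H_R, 1.0, 2.0))
```

With these numbers the unregularized split has a positive gain, while the regularized one falls just below zero and would not be kept.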

### 11. What are some hyperparameters to tune in Gradient Boosting models?
- Learning rate
- Number of estimators (trees)
- Maximum depth of trees
- Subsample ratio
- Minimum child weight (`min_child_weight` in XGBoost; `min_samples_leaf` plays a similar role in scikit-learn)
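
A grid search is one common way to tune these settings. The sketch below uses scikit-learn's `GridSearchCV` with `GradientBoostingClassifier`; the grid values, the 3-fold split, and the synthetic data are illustrative assumptions kept small so the search finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search stays fast (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately small grid over the hyperparameters listed above.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```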

### 12. What is the concept of Feature Importance in Boosting?
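Feature importance measures how much each feature contributes to model predictions. In tree-based boosting it is typically computed from how much the splits on a feature reduce the loss, summed over all trees. It helps with feature selection and model interpretability.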
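
Two ways of measuring importance can be compared side by side: the split-based scores stored in `feature_importances_` and permutation importance computed on held-out data. This is a sketch assuming scikit-learn; the choice of `n_repeats=5` and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# Split-based importance: normalized so that all features sum to 1.
split_importance = model.feature_importances_

# Permutation importance: drop in test score when one feature is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)

# Show the five features ranked highest by split-based importance.
top = split_importance.argsort()[::-1][:5]
for i in top:
    print(data.feature_names[i], split_importance[i], perm.importances_mean[i])
```

The two rankings do not have to agree; split-based scores reflect the training data, whereas permutation importance reflects the effect on held-out predictions.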

### 13. Why is CatBoost efficient for categorical data?
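CatBoost handles categorical data efficiently by combining **ordered boosting** with **ordered target statistics** (a target-based encoding computed for each sample only from the samples that precede it in a random permutation). This reduces target leakage and improves accuracy.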
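
The idea behind the encoding can be illustrated in plain Python. This is a simplified sketch and not CatBoost's actual implementation: it uses the given row order in place of random permutations, and `ordered_target_stats`, `prior`, and `prior_weight` are hypothetical names.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=1.0):
    # Encode each row using only the targets of EARLIER rows with the same
    # category, so a row's own label never leaks into its encoding.
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + prior_weight * prior) / (c + prior_weight))
        # Only now is the current row's target added to the running totals.
        sums[cat] = s + target
        counts[cat] = c + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
labels = [1, 0, 0, 1, 1]
encoded = ordered_target_stats(cats, labels, prior=0.5)
print(encoded)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior because no earlier rows exist for it.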
"""

# Practical Part - Boosting Implementation in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, r2_score, f1_score, mean_squared_error, log_loss, confusion_matrix
import xgboost as xgb
import catboost as cb
from sklearn.datasets import load_breast_cancer, make_classification, make_regression

# Train an AdaBoost Classifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
adaboost_clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
adaboost_clf.fit(X_train, y_train)
y_pred = adaboost_clf.predict(X_test)
print("\nAdaBoost Classifier Accuracy:", accuracy_score(y_test, y_pred))

# Train an AdaBoost Regressor
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
adaboost_reg = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, random_state=42)
adaboost_reg.fit(X_train, y_train)
y_pred = adaboost_reg.predict(X_test)
print("\nAdaBoost Regressor MAE:", mean_absolute_error(y_test, y_pred))

# Train a Gradient Boosting Classifier on Breast Cancer dataset
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=42)
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_clf.fit(X_train, y_train)
print("\nGradient Boosting Classifier Accuracy:", accuracy_score(y_test, gb_clf.predict(X_test)))

# Feature Importance for Gradient Boosting Classifier
plt.figure(figsize=(10,5))
feature_importances = pd.Series(gb_clf.feature_importances_, index=cancer.feature_names)
feature_importances.nlargest(10).plot(kind='barh')
plt.title("Feature Importance - Gradient Boosting")
plt.show()

# Train an XGBoost Classifier
xgb_clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_clf.fit(X_train, y_train)
print("\nXGBoost Classifier Accuracy:", accuracy_score(y_test, xgb_clf.predict(X_test)))

# Train a CatBoost Classifier
cb_clf = cb.CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=0, random_seed=42)
cb_clf.fit(X_train, y_train)
print("\nCatBoost Classifier F1-Score:", f1_score(y_test, cb_clf.predict(X_test)))

# Plot Confusion Matrix for CatBoost Classifier
conf_matrix = confusion_matrix(y_test, cb_clf.predict(X_test))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix - CatBoost Classifier")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
