
# Boosting in Machine Learning

This notebook explains **Boosting** step by step and then applies **AdaBoost** to the Almond Types Classification dataset.

## 1. What is Boosting?

Boosting is an ensemble learning technique that combines multiple **weak learners** to form a **strong learner**. Each model is trained sequentially, with more focus given to samples that previous models misclassified.

### Key Idea
- Weak learners perform slightly better than random guessing
- Models are trained sequentially
- Misclassified points receive higher importance
- Final prediction is a weighted combination of all learners
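
To make the reweighting idea concrete, here is a minimal sketch of a single round of binary AdaBoost on five made-up samples. The labels and predictions are hypothetical, and the update shown is the classic two-class formulation rather than scikit-learn's exact multi-class implementation.

```python
import math

# Toy data (made up): true labels and one weak learner's predictions, in {-1, +1}.
# The learner gets only the last sample wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

n = len(y_true)
weights = [1.0 / n] * n  # every sample starts with the same weight

# Weighted error: total weight of the misclassified samples
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's say in the final vote: larger when the error is smaller
alpha = 0.5 * math.log((1 - error) / error)

# Scale up the weights of misclassified samples, scale down the rest,
# then renormalize so the weights sum to 1
new_weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("updated weights:", [round(w, 3) for w in new_weights])
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.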

## 2. Popular Boosting Algorithms

- **AdaBoost (Adaptive Boosting)**: Reweights samples after each iteration
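- **Gradient Boosting**: Fits each new learner to the negative gradient of a loss function (the residuals, for squared error), which amounts to gradient descent in function space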
- **XGBoost**: Optimized and regularized version of Gradient Boosting
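
As a rough illustration of how gradient boosting differs from AdaBoost's reweighting, the sketch below boosts regression stumps on synthetic data by repeatedly fitting the residuals of a squared-error loss. The data, the variable names, and the choice of 50 rounds with a step size of 0.1 are assumptions made for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual
    residuals = y_toy - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residuals)
    # Take a small step in the direction suggested by the new stump
    prediction += learning_rate * stump.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each stump corrects a little of what the current ensemble still gets wrong, so the training error shrinks round by round.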

## 3. Dataset Description

We use the **Almond Types Classification** dataset, which contains three almond types:
- MAMRA
- SANORA
- REGULAR

Features are extracted using image processing techniques. Some values are missing due to orientation issues.

In [2]:
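import pandas as pd

# Load the dataset; the first CSV column is used as the row index
almonds = pd.read_csv('Almond.csv', index_col=0)
# Separate the feature columns from the target column 'Type'
X = almonds.drop('Type', axis=1)
y = almonds['Type']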
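
Before imputing, it can help to check how many values are missing and how the classes are distributed. The sketch below does this on a tiny stand-in table; the column names `Length` and `Width` and all values are hypothetical, not taken from `Almond.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the almond table; column names and values are made up
toy = pd.DataFrame({
    "Length": [10.2, np.nan, 9.8, 11.0],
    "Width": [5.1, 5.4, np.nan, 5.0],
    "Type": ["MAMRA", "SANORA", "REGULAR", "MAMRA"],
})

missing_per_column = toy.isna().sum()       # number of NaNs in each column
class_counts = toy["Type"].value_counts()   # number of rows per almond type

print(missing_per_column)
print(class_counts)
```

The same two calls should work on the `almonds` DataFrame loaded above.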

## 4. Handling Missing Values

We use scikit-learn's **KNN Imputer**, which fills each missing value with the mean of that feature across the `n_neighbors` most similar rows.

In [3]:
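from sklearn.impute import KNNImputer

# Replace each missing value with the mean of that feature over the 5 nearest rows
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)  # returns a NumPy array with no NaNs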
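
To see what the imputer does to individual values, here is a small sketch on a four-row matrix, using `n_neighbors=2` so the result is easy to check by hand. The matrix is illustrative and unrelated to the almond data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small matrix with two missing entries (illustrative values)
X_small = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN becomes the mean of that column over the 2 nearest rows,
# where distance is measured on the features both rows have present
small_imputer = KNNImputer(n_neighbors=2)
X_small_filled = small_imputer.fit_transform(X_small)

print(X_small_filled)
```

The missing entry in the first row is filled with 4.0 (the mean of 3 and 5 from the two nearest rows), and the one in the third row with 5.5 (the mean of 3 and 8).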

## 5. Train-Test Split

We split the dataset to evaluate model performance on unseen data.

In [4]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42
)
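
Because the imputer in this notebook is fitted before the split, the test rows influence the values imputed for the training rows. One possible alternative, sketched here on synthetic data rather than the almond dataset, is to split first and put the imputer in a `Pipeline` so it is fitted on the training rows only. The synthetic data, the 10% missing rate, and `stratify=y_syn` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% of the values removed
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

# Split first, so the test rows play no part in fitting the imputer
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),                           # fitted on X_tr only
    ("stump", DecisionTreeClassifier(max_depth=1, random_state=42)),
])
pipe.fit(X_tr, y_tr)
pipe_accuracy = pipe.score(X_te, y_te)  # test rows are imputed using training neighbors

print(f"Pipeline accuracy: {pipe_accuracy * 100:.2f}%")
```

With this layout the reported test accuracy reflects data the preprocessing step has never seen.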

## 6. Weak Learner: Decision Tree (Stump)

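A decision tree with `max_depth=1` (a decision stump) makes a single split, so it acts as a weak learner. With only two leaves, one stump can predict at most two of the three almond types.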

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A depth-1 tree (decision stump) makes a single split on one feature
tree = DecisionTreeClassifier(max_depth=1, random_state=42)
tree.fit(X_train, y_train)

# Accuracy of the single stump on the held-out test set
tree_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f'Weak Learner Accuracy: {tree_accuracy * 100:.2f}%')

Weak Learner Accuracy: 43.14%
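
To judge whether a learner really is "slightly better than random guessing", it helps to compare it against a trivial baseline. The sketch below does so on synthetic three-class data with scikit-learn's `DummyClassifier`; the data and variable names are assumptions, and the numbers it prints say nothing about the almond dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (stand-in for a real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)

# Weak learner: a single decision stump
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
stump_accuracy = stump.score(X_te, y_te)

print(f"Baseline: {baseline_accuracy * 100:.2f}%  Stump: {stump_accuracy * 100:.2f}%")
```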


## 7. Boosting with AdaBoost

AdaBoost trains up to 100 stumps in sequence, reweighting the training samples after each one, and combines their weighted votes into the final prediction.

In [8]:
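from sklearn.ensemble import AdaBoostClassifier

# The stump is only a template here: AdaBoost fits its own fresh copies of it
ada = AdaBoostClassifier(
    estimator=tree,        # base learner (named `estimator` in scikit-learn >= 1.2)
    n_estimators=100,      # maximum number of boosting rounds
    learning_rate=1.0,     # weight applied to each learner's contribution
    random_state=42
)

# Fit the ensemble and score it on the held-out test set
ada.fit(X_train, y_train)
ada_accuracy = accuracy_score(y_test, ada.predict(X_test))
print(f'AdaBoost Accuracy: {ada_accuracy * 100:.2f}%')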

AdaBoost Accuracy: 60.25%
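
A fitted `AdaBoostClassifier` can also report its accuracy after each boosting round through `staged_score`, which is one way to check whether 100 rounds are needed. The sketch below uses synthetic data and the default base learner (a depth-1 tree); the dataset and the choice of 50 rounds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem (stand-in for a real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The default base learner is a depth-1 decision tree (a stump)
ada_demo = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_demo.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round
staged_accuracy = list(ada_demo.staged_score(X_te, y_te))
best_round = max(range(len(staged_accuracy)), key=staged_accuracy.__getitem__) + 1

print(f"Best test accuracy {max(staged_accuracy) * 100:.2f}% at round {best_round}")
```

If the accuracy plateaus or drops well before the last round, a smaller `n_estimators` or a lower `learning_rate` may be worth trying.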


## 8. Results Comparison

- Weak Learner Accuracy: ~43%
- AdaBoost Accuracy: ~60%

On this split, boosting 100 stumps raises test accuracy by about 17 percentage points over a single stump.

## 9. Key Takeaways

- Boosting converts weak learners into a strong learner
- AdaBoost works well with simple models
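- Boosting can overfit, especially on small or noisy datasets, because later learners concentrate on the hardest samples
- Tune `n_estimators` and `learning_rate` together to balance model complexity against performance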

## 10. Conclusion

Boosting is a powerful ensemble technique used in real-world ML systems. Understanding when to apply it is crucial for building effective models.