# What is Potential Overfitting?

In boosting, the iterative process aims to improve the ensemble's performance by sequentially focusing on the instances that were previously misclassified.
However, there is a risk of overfitting if certain conditions are met:

(1). Continuing iterations for too long: Boosting involves training multiple models in sequence, and each subsequent model tries to correct the mistakes made by the previous models.
    If the boosting process continues for too many iterations, the ensemble may start to memorize the training data instead of generalizing from it. This can lead to overfitting, where the ensemble becomes highly specialized to the training data but performs poorly on unseen data.

(2). Base learners that are too complex: Boosting can use a variety of base learners, such as decision trees or neural networks. When these learners are complex (for example, deep trees), they have a higher tendency to overfit the training data. Complex models have more capacity to fit noise or outliers in the data, and when combined in the boosting process, they can amplify the overfitting effect.
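The two conditions above can be observed on a small experiment. The following is a minimal sketch, assuming scikit-learn is installed; the dataset, the 10% label noise, and the deliberately large `n_estimators` and `max_depth` values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 10% flipped labels, so there is noise to memorize.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Many rounds and deep trees: both overfitting conditions at once.
model = GradientBoostingClassifier(n_estimators=300, max_depth=6,
                                   random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting round.
train_acc = [accuracy_score(y_train, p) for p in model.staged_predict(X_train)]
test_acc = [accuracy_score(y_test, p) for p in model.staged_predict(X_test)]

for n in (1, 10, 50, 300):
    print(f"rounds={n:3d}  train={train_acc[n - 1]:.3f}  test={test_acc[n - 1]:.3f}")
```

A training accuracy that keeps climbing while the test accuracy stalls or drops is the typical sign that the ensemble has started to memorize the training set.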


## To mitigate the risk of overfitting in boosting, several techniques can be applied:
(1). Early Stopping: Monitor the performance of the ensemble on a validation set and stop the boosting process when
the performance no longer improves. This helps prevent overfitting by finding a good number of iterations.

(2). Regularization: Applying regularization to the base learners can help control their complexity and reduce overfitting. Examples are weight decay or dropout for neural-network learners, and limits on depth or leaf size for tree learners.
    
(3). Shrinkage / Learning Rate: Introduce a learning rate parameter that scales the contribution of each base learner to the
    ensemble. Lower learning rates reduce the risk of overfitting by limiting the impact of each individual learner, usually at the cost of needing more iterations.
    
(4). Cross-Validation: Use cross-validation to assess the performance of the boosting ensemble and tune hyperparameters. This helps identify settings that balance fit on the training data against generalization to unseen data.
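The techniques listed above can be combined in a few lines. This is a minimal sketch, assuming scikit-learn's `GradientBoostingClassifier`; the specific values (`learning_rate=0.05`, `max_depth=2`, `subsample=0.8`, a patience of 10 rounds, and the small parameter grid) are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Early stopping + shrinkage + simple base learners + row subsampling.
early = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on rounds, not a target
    learning_rate=0.05,       # shrinkage: each tree contributes only a little
    max_depth=2,              # shallow trees keep base learners simple
    subsample=0.8,            # each round sees a random 80% of the rows
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
early.fit(X_train, y_train)
print("rounds actually fitted:", early.n_estimators_)
print("test accuracy:", early.score(X_test, y_test))

# Cross-validation to choose hyperparameters from a small grid.
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)
```

With `n_iter_no_change` set, `n_estimators_` reports how many rounds were kept, which is often far below the upper bound.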

#### Boosting:
Boosting is a popular technique used in ensemble learning, which combines multiple weak or base learners to create a stronger predictive model.
The main idea behind boosting is to sequentially train a series of models, where each subsequent model focuses on the instances that were misclassified by the previous models. This iterative process allows the ensemble to learn from its mistakes and improve its overall performance.
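The idea of focusing on misclassified instances can be sketched with a hand-written reweighting loop in the style of discrete AdaBoost. This is an illustrative sketch, not a production implementation: the decision stumps, the 20 rounds, and the helper name `ensemble_predict` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)            # relabel classes as -1 / +1

n_rounds = 20
weights = np.full(len(y), 1 / len(y))    # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the current instance weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    err = np.sum(weights[pred != y])         # weighted training error
    err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against log(0)
    alpha = 0.5 * np.log((1 - err) / err)    # vote strength of this learner

    # Misclassified instances (y * pred == -1) get larger weights.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_new):
    """Weighted vote of all stumps (hypothetical helper for this sketch)."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(scores)


first_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_predict(X) == y)
print(f"first stump training accuracy: {first_acc:.3f}")
print(f"ensemble training accuracy:    {ensemble_acc:.3f}")
```

Each round trains on the same data but with different weights, so later stumps concentrate on the points earlier stumps got wrong.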

## Types of Boosting: 

(1). AdaBoost (Adaptive Boosting): AdaBoost is one of the earliest and most well-known boosting algorithms. It assigns higher
weights to misclassified instances and focuses on those instances during subsequent iterations. It
sequentially trains a series of weak learners and combines their predictions to form the final
ensemble. AdaBoost was originally designed for binary classification, although multi-class and regression variants exist.

(2). Gradient Boosting: Gradient Boosting builds an ensemble of weak learners in a stage-wise
    manner. Each subsequent model is trained to correct the mistakes made by the previous models
    by fitting the negative gradient of a loss function. Gradient Boosting can handle both classification and regression tasks and is often used with decision trees as base learners.
    Examples of gradient boosting implementations include XGBoost, LightGBM, and CatBoost.

(3). XGBoost (Extreme Gradient Boosting): XGBoost is an optimized implementation of gradient boosting that offers several enhancements, including parallel processing, regularization techniques, and built-in handling of missing values. It supports both tree-based boosters and a linear booster, so it can model non-linear or linear relationships depending on the booster selected.
    
(4). LightGBM: LightGBM is another gradient boosting framework that focuses on achieving faster training speed and lower memory usage. It relies on histogram-based, leaf-wise tree growth and offers a sampling technique called "Gradient-based One-Side Sampling" (GOSS), which keeps the instances with large gradients and randomly samples from the rest, so that trees are built on the most informative instances.
    
(5). CatBoost: CatBoost is a gradient boosting algorithm that is designed to handle categorical features directly without the need for extensive data preprocessing. It incorporates an innovative method to handle categorical variables, which includes applying a combination of ordered boosting, random permutations, and symmetric trees.
    
(6). Stochastic Gradient Boosting: It introduces randomness into the boosting process by subsampling the training data or features at each iteration. It helps to reduce overfitting and can improve the model's generalization ability, especially when dealing with large datasets.
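The phrase "fitting the negative gradient of a loss function" from the Gradient Boosting entry can be made concrete for squared-error regression, where the negative gradient is simply the residual. The following is a minimal sketch under that assumption; the toy sine data, the tree depth, and the helper name `boosted_predict` are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_rounds = 100

# Stage 0: a constant model (the mean minimizes squared error).
init = y.mean()
prediction = np.full_like(y, init)
trees = []
mse_history = []

for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Shrinkage: add only a fraction of the new tree's correction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def boosted_predict(X_new):
    """Sum the constant start value and all shrunken tree corrections."""
    out = np.full(len(X_new), init)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out


print(f"training MSE after 1 round:    {mse_history[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Fitting each tree on a random subset of the rows instead of all of `X` would turn this loop into the stochastic variant described in item (6).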

## How to decide which boosting algorithm to use in which scenario:
    
Deciding which boosting algorithm to use in a scenario depends on several factors. Here are some guidelines to help you make a decision:

(1). Problem Type: Consider whether you are working on a classification or regression problem.
    Some boosting algorithms were designed with binary classification in mind, while others handle both classification and regression. For example, AdaBoost is primarily used for binary classification, while XGBoost and LightGBM are versatile and can be used for both classification and regression tasks.
    
(2). Dataset Size: Take into account the size of your dataset.
    If you have a large dataset, algorithms like LightGBM or CatBoost that are optimized for faster training speed and lower memory usage can be beneficial. They utilize techniques such as data subsampling or feature subsampling to handle large datasets more efficiently.
    
(3). Dataset Complexity: Consider the complexity of your dataset and the relationships within it. If your dataset contains a mix of categorical and numerical features, CatBoost might be a good choice as it handles categorical variables directly without the need for extensive preprocessing. On the other hand, if you have a dataset with complex patterns and non-linear relationships, tree-based algorithms like XGBoost or LightGBM may be more suitable.
    
(4). Interpretability: Think about the interpretability of the model. If interpretability is important in your scenario, algorithms like AdaBoost or decision tree-based boosting algorithms (e.g., XGBoost, LightGBM) provide more transparent models compared to more complex models like neural networks.
    
(5). Performance and Tunability: Consider the performance and tunability requirements of your task. Different boosting algorithms may have different default hyperparameter settings and may require specific tuning approaches. Some algorithms, like XGBoost and LightGBM, offer extensive options for hyperparameter tuning, which can be advantageous if you have the time and computational resources for optimization.

(6). Experimentation: It's often beneficial to experiment with multiple boosting algorithms and compare their performance on your specific dataset. This empirical evaluation can provide insights into which algorithm works best for your particular scenario.

In [1]:
pip install catboost


In [2]:
pip install lightgbm


In [7]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_classifier.fit(X_train, y_train)
y_pred_gb = gb_classifier.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Classifier Accuracy : ", accuracy_gb)

# LightGBM classifier
lgb_classifier = LGBMClassifier(n_estimators=100, random_state=42)
lgb_classifier.fit(X_train, y_train)
y_pred_lgb = lgb_classifier.predict(X_test)
accuracy_lgb = accuracy_score(y_test, y_pred_lgb)
print("LightGBM Classifier Accuracy:", accuracy_lgb)

# CatBoost classifier (verbose=0 silences the per-iteration training log)
cat_classifier = CatBoostClassifier(n_estimators=100, random_state=42, verbose=0)
cat_classifier.fit(X_train, y_train)
y_pred_cat = cat_classifier.predict(X_test)
accuracy_cat = accuracy_score(y_test, y_pred_cat)
print("CatBoost Classifier Accuracy:", accuracy_cat)

# Histogram-based gradient boosting classifier, labelled "Stochastic" in the output.
# Note: in scikit-learn, classic stochastic gradient boosting is obtained with
# GradientBoostingClassifier(subsample=<value below 1.0>).
stoch_gb_classifier = HistGradientBoostingClassifier(max_iter=100, random_state=42)
stoch_gb_classifier.fit(X_train, y_train)
y_pred_stoch_gb = stoch_gb_classifier.predict(X_test)
accuracy_stoch_gb = accuracy_score(y_test, y_pred_stoch_gb)
print("Stochastic Classifier Accuracy:", accuracy_stoch_gb)

# Gradient Boosting regressor (for demo purposes only: it is fitted on the
# 0/1 class labels, so its predictions are continuous scores, not classes)
gb_regressor = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_regressor.fit(X_train, y_train)
y_pred_gb_regressor = gb_regressor.predict(X_test)

Gradient Boosting Classifier Accuracy :  0.9
[LightGBM] [Info] Number of positive: 388, number of negative: 412
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000090 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.485000 -> initscore=-0.060018
[LightGBM] [Info] Start training from score -0.060018
LightGBM Classifier Accuracy: 0.88
CatBoost Classifier Accuracy: 0.885
Stochastic Classifier Accuracy: 0.88


In [12]:
from sklearn.datasets import load_breast_cancer
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost classifier with default hyperparameters
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy : ", accuracy)

Accuracy :  0.956140350877193


In [11]:
pip install xgboost
