# Bagging Tips
- Bagging (sampling rows with replacement) generally gives better results than pasting (sampling rows without replacement).
- As a rule of thumb, good results come when each estimator is trained on roughly 25% to 50% of the rows.
- Random patches and random subspaces are worth using when dealing with high-dimensional data.
- To find good hyperparameter values, use `GridSearchCV` or `RandomizedSearchCV`.
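The sampling strategies named in the tips above differ only in a few `BaggingClassifier` arguments. The following is a minimal sketch on a small synthetic dataset; the names `X_demo`, `y_demo` and `sampling_variants`, the dataset size, and the fractions used are illustrative assumptions, not tuned values. The base estimator is passed positionally so the snippet does not depend on the keyword name used by a particular scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (sizes chosen only to keep the example fast)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Each variant is just a different combination of row/feature sampling options
sampling_variants = {
    # Rows sampled with replacement, all features
    "bagging": dict(bootstrap=True, max_samples=0.5),
    # Rows sampled without replacement, all features
    "pasting": dict(bootstrap=False, max_samples=0.5),
    # All rows kept, features sampled
    "random_subspaces": dict(bootstrap=False, max_samples=1.0, max_features=0.5),
    # Both rows and features sampled
    "random_patches": dict(bootstrap=True, max_samples=0.5, max_features=0.5),
}

models = {}
for name, params in sampling_variants.items():
    model = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=42, **params
    )
    models[name] = model.fit(X_demo, y_demo)
    # Number of features each base tree was given
    print(name, len(model.estimators_features_[0]))
```

With 20 features and `max_features=0.5`, each tree in the subspace and patch variants sees 10 features, while the bagging and pasting variants use all 20.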


In [3]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

In [4]:
# Generating a synthetic classification dataset
X, y = make_classification(n_samples=10000, n_features=10, n_informative=3)

# Splitting dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



### **Explanation**  
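- **`make_classification`**: Creates a dataset with **10,000 samples** and **10 features**, of which **3 are informative**. No `random_state` is passed here, so the generated data (and the accuracies reported in this notebook) will vary slightly between runs.
- **`train_test_split`**: Splits the data into **80% training** and **20% testing**; `random_state=42` makes the split itself **reproducible**.
- Together these prepare the data for **training and evaluating** the models.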
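A quick way to confirm the split behaves as described is to inspect the resulting shapes. This sketch uses separate `*_chk` names so it does not overwrite the notebook's variables, and it passes `random_state=0` to `make_classification`, which is an assumption added here to make the generated data repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same dataset settings as above, plus a seed (an addition for repeatability)
X_chk, y_chk = make_classification(
    n_samples=10000, n_features=10, n_informative=3, random_state=0
)
X_train_chk, X_test_chk, y_train_chk, y_test_chk = train_test_split(
    X_chk, y_chk, test_size=0.2, random_state=42
)

# 80% of 10,000 rows go to training, 20% to testing
print(X_train_chk.shape, X_test_chk.shape)

# Class counts in the training set, to check the classes are roughly balanced
print(np.bincount(y_train_chk))
```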


In [5]:
# Initializing a Decision Tree classifier with a fixed random state
dt = DecisionTreeClassifier(random_state=42)

# Training the Decision Tree model on the training data
dt.fit(X_train, y_train)

# Making predictions on the test data
y_pred = dt.predict(X_test)

# Calculating and printing the accuracy of the Decision Tree model
print("Decision Tree accuracy", accuracy_score(y_test, y_pred))


Decision Tree accuracy 0.813



### **Explanation**  
- **`DecisionTreeClassifier(random_state=42)`**: Creates a Decision Tree model with a fixed random state for reproducibility.  
- **`dt.fit(X_train, y_train)`**: Trains the model on the training dataset.  
- **`dt.predict(X_test)`**: Predicts class labels for the test dataset.  
- **`accuracy_score(y_test, y_pred)`**: Computes the accuracy by comparing predictions with actual labels.  
- Helps evaluate how well the Decision Tree performs on unseen data.
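Accuracy is simply the fraction of predictions that match the true labels. The toy arrays below are made up for illustration and show that the manual computation agrees with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction equals the true label
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)
```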

# Bagging Ensemble

In [8]:
# Creating a Bagging Classifier with Decision Tree as the base estimator
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model: Decision Tree (`base_estimator` before scikit-learn 1.2)
    n_estimators=500,  # Number of Decision Trees in the ensemble
    max_samples=0.5,  # Each tree is trained on 50% of the training data (bootstrapped)
    bootstrap=True,  # Sampling with replacement (Bootstrapping)
    random_state=42  # Ensures reproducibility
)

# Training the Bagging Classifier on the training data
bag.fit(X_train, y_train)





### **Explanation**  
- **`BaggingClassifier`**: Creates an ensemble of **500 Decision Trees**, each trained on a sample drawn with replacement that is **50% the size of the training set**.
- **`estimator=DecisionTreeClassifier()`** (named `base_estimator` before scikit-learn 1.2): Uses Decision Trees as the base learners.
- **`n_estimators=500`**: Builds **500 different Decision Trees**; combining their predictions reduces variance.
- **`max_samples=0.5`**: Sets the size of each tree's sample to half the number of training rows.
- **`bootstrap=True`**: Samples rows with replacement (bootstrapping), so each tree sees a different subset of the data.
- **`bag.fit(X_train, y_train)`**: Trains the Bagging ensemble on the training data.
- Improves model **stability and accuracy** by reducing the overfitting (variance) of a single Decision Tree.
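To make the effect of sampling with replacement concrete, the sketch below draws row indices the way one tree's sample might be drawn. It illustrates the idea with NumPy only and is not scikit-learn's internal code; the names `n_train_rows`, `boot_idx` and `paste_idx` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_train_rows = 8000                   # size of the training set used above
n_draw = int(0.5 * n_train_rows)      # max_samples=0.5

# Bagging: draw with replacement, so some rows repeat and others are left out
boot_idx = rng.choice(n_train_rows, size=n_draw, replace=True)

# Pasting: draw without replacement, so every drawn row is distinct
paste_idx = rng.choice(n_train_rows, size=n_draw, replace=False)

n_unique_boot = len(np.unique(boot_idx))
n_unique_paste = len(np.unique(paste_idx))
print("distinct rows (bagging):", n_unique_boot)
print("distinct rows (pasting):", n_unique_paste)
```

Because repeats are allowed, the bootstrap sample contains fewer distinct rows than the pasting sample of the same size, which is one source of the diversity between trees.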

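In [9]:
# Making predictions on the test data with the Bagging ensemble
y_pred = bag.predict(X_test)

# Calculating the accuracy of the Bagging ensemble
accuracy_score(y_test, y_pred)
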

0.867

# Applying GridSearchCV for Hyperparameter Tuning

In [10]:
# Step 1: Import Required Libraries

from sklearn.model_selection import GridSearchCV


In [11]:
# Step 2: Define the Parameter Grid 

# Defining hyperparameter search space
param_grid = {
    'n_estimators': [100, 300, 500],  # Number of trees in the ensemble
    'max_samples': [0.3, 0.5, 0.7],  # Fraction of training data per tree
    'bootstrap': [True, False]  # True = bagging (with replacement), False = pasting (without replacement)
}
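
Before running a grid search it helps to know how many models it will train. A small sketch using `ParameterGrid` to count the combinations; `grid_demo` is a copy of the grid defined above under a separate name, and the fold count matches the 5-fold cross-validation used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_grid, copied under a separate name
grid_demo = {
    'n_estimators': [100, 300, 500],
    'max_samples': [0.3, 0.5, 0.7],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(grid_demo))  # 3 * 3 * 2 combinations
n_folds = 5
total_fits = n_candidates * n_folds           # one fit per combination per fold

print(n_candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best combination once more on the full training set, so the total is one fit higher.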


In [12]:
# Step 3: Initialize the Bagging Classifier

# Creating a Bagging Classifier with Decision Tree as the base estimator
# (`estimator` is the scikit-learn >= 1.2 name; older versions use `base_estimator`)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)


In [13]:
# Step 4: Perform Grid Search with Cross-Validation 

# Initializing GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=bag, 
    param_grid=param_grid, 
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',  # Evaluating based on accuracy
    n_jobs=-1  # Using all available CPU cores for faster processing
)

# Fitting GridSearchCV to training data
grid_search.fit(X_train, y_train)




In [14]:
# Step 5: Retrieve the Best Parameters and Score 

# Printing the best combination of hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Printing the best mean cross-validated accuracy (averaged over the 5 folds)
print("Best Accuracy:", grid_search.best_score_)


Best Hyperparameters: {'bootstrap': True, 'max_samples': 0.7, 'n_estimators': 300}
Best Accuracy: 0.8643749999999999


In [15]:
# Step 6: Evaluate the Best Model on the Test Set 

# Using the best found model to make predictions on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Printing test accuracy
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Test Accuracy: 0.8675
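
When the grid becomes large, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all. The following is a minimal sketch on a small synthetic dataset so it runs quickly; the dataset size, the distributions, `n_iter=5` and `cv=3` are illustrative assumptions, not recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (kept small only for speed)
X_small, y_small = make_classification(
    n_samples=600, n_features=10, n_informative=3, random_state=0
)

# Distributions to sample from instead of fixed lists
param_distributions = {
    'n_estimators': randint(20, 61),     # integers from 20 to 60
    'max_samples': uniform(0.25, 0.5),   # floats from 0.25 to 0.75
    'bootstrap': [True, False],
}

random_search = RandomizedSearchCV(
    BaggingClassifier(DecisionTreeClassifier(), random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,     # makes the sampled combinations repeatable
)
random_search.fit(X_small, y_small)

print("Best Hyperparameters:", random_search.best_params_)
print("Best CV Accuracy:", random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which makes this a practical alternative when many hyperparameters are tuned at once.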
