1. Boosting is an ensemble technique that improves predictive accuracy by training multiple weak learners, typically shallow decision trees, one after another and combining them into a single strong model. Each new learner concentrates on the examples the previous ones got wrong.

2. Feature	Bagging	Boosting
Focus	Reduces variance	Reduces bias
Training	Parallel	Sequential
Model Combination	Majority voting/averaging	Weighted combination
Complexity	Less complex	More complex
Risk of Overfitting	Lower	Higher, especially on noisy data

3. How It Works:

Assign Equal Weights: Initially, all training samples are given equal importance.

Train a Weak Model: A simple model (often a decision tree with just one split, called a "stump") is trained on the data.

Identify Errors: Misclassified samples are given higher weights, so the next model focuses on learning those correctly.

Repeat Process: This process continues with each new model correcting mistakes made by the previous ones.

Final Strong Model: At the end, all weak models are combined into a weighted vote, with more accurate models having more influence.
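The five steps above can be sketched in plain Python. This is a minimal illustration of discrete AdaBoost with decision stumps, not how scikit-learn implements it; it assumes labels are coded as +1 and -1, and the helper names (`train_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for the example.

```python
import math

def stump_predict(stump, row):
    """Predict +1 or -1 with a one-split rule on a single feature."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity

def train_stump(X, y, weights):
    """Return (weighted_error, stump) for the stump with the lowest weighted error."""
    best_error, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for w, row, label in zip(weights, X, y)
                            if stump_predict(stump, row) != label)
                if best_error is None or error < best_error:
                    best_error, best_stump = error, stump
    return best_error, best_stump

def adaboost_fit(X, y, n_rounds=10):
    n = len(X)
    weights = [1.0 / n] * n                       # equal weights to start
    ensemble = []
    for _ in range(n_rounds):
        error, stump = train_stump(X, y, weights)  # train a weak model
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # more accurate stump -> larger say
        ensemble.append((alpha, stump))
        # Misclassified samples are scaled up, correct ones scaled down
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]     # renormalize to sum to 1
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate (pattern: + + - - + +)
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, row) for row in X]
print(predictions)
```

On this toy data a single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all of them correctly.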

4. Example: Classifying Apples and Oranges

Imagine we want to build a model that classifies apples and oranges using features like color and shape.

Step 1: Initialize Equal Weights
We start with a dataset of apples and oranges, giving each data point equal importance.

Step 2: Train the First Weak Model
We train a simple classifier, say a decision stump (a decision tree with a single split). It might predict that “red objects are apples and orange objects are oranges.”

It correctly classifies most of the fruits.

But it misclassifies some oranges that are slightly red.

Step 3: Increase Weight of Misclassified Samples
AdaBoost increases the weight of the misclassified oranges.

This forces the next model to focus more on these difficult cases.

Step 4: Train the Next Model
The second weak classifier now learns a new rule: “If shape is round, classify as apple; if oval, classify as orange.”

This fixes some of the previous errors.

Step 5: Repeat the Process
The model continues training weak classifiers in sequence, correcting mistakes made by previous models.

Each individual model stays weak, but the combined ensemble becomes steadily better at classifying the apples and oranges correctly.

Step 6: Combine All Weak Models
AdaBoost assigns weights to all classifiers based on their accuracy.

The final prediction is made using a weighted vote from all weak classifiers.

Now, we have a strong classifier that can accurately distinguish apples and oranges.
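To make Steps 3 and 6 concrete, here is the arithmetic for one round under the standard discrete AdaBoost update. The numbers are hypothetical: assume 10 fruits with equal weight 0.1 and a first stump that misclassifies 2 of them.

```latex
\varepsilon_1 = 2 \times 0.1 = 0.2,
\qquad
\alpha_1 = \tfrac{1}{2}\ln\frac{1-\varepsilon_1}{\varepsilon_1}
         = \tfrac{1}{2}\ln 4 \approx 0.693

\text{misclassified: } 0.1 \cdot e^{\alpha_1} = 0.1 \cdot 2 = 0.2,
\qquad
\text{correct: } 0.1 \cdot e^{-\alpha_1} = 0.1 \cdot 0.5 = 0.05

Z = 2 \cdot 0.2 + 8 \cdot 0.05 = 0.8
\;\Rightarrow\;
w_{\text{misclassified}} = \frac{0.2}{0.8} = 0.25,
\qquad
w_{\text{correct}} = \frac{0.05}{0.8} = 0.0625
```

After normalization the two misclassified oranges carry half of the total weight, which is why the second stump is pushed to get them right. The value of alpha (about 0.693) is also the weight this stump receives in the final vote.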

5. Gradient Boosting is a boosting technique that builds models sequentially, with each new model trained to reduce a loss function through gradient descent. Instead of reweighting samples like AdaBoost, it fits each new model to the residual errors left by the current ensemble.



6. In Gradient Boosting, the loss function is the key component that guides optimization. It measures the difference between actual and predicted values, and at each round the algorithm fits a new model to the negative gradient of this loss with respect to the current predictions, which amounts to gradient descent in function space.
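For squared-error loss the negative gradient is simply the residual (actual minus predicted), so the procedure can be sketched in a few lines of plain Python. This is an illustrative sketch on 1-D inputs with regression stumps, not a library implementation; the function names and the toy data are assumptions made for the example.

```python
def fit_regression_stump(x, targets):
    """Least-squares stump on 1-D inputs: returns (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # drop the max so both sides are non-empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)                         # start from the mean prediction
    predictions = [base] * len(y)
    stumps = []
    mse_history = [mean_squared_error(y, predictions)]
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_value, right_value = fit_regression_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        # Take a small step in the direction suggested by the new stump
        predictions = [pi + learning_rate * (left_value if xi <= threshold else right_value)
                       for xi, pi in zip(x, predictions)]
        mse_history.append(mean_squared_error(y, predictions))
    return base, stumps, mse_history

def boosted_predict(base, stumps, xi, learning_rate=0.1):
    """Sum the base value and the shrunken contribution of every stump."""
    return base + sum(learning_rate * (left if xi <= threshold else right)
                      for threshold, left, right in stumps)

x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]                          # toy target: y = x^2
base, stumps, mse_history = gradient_boost(x, y)
print(f"MSE before boosting: {mse_history[0]:.2f}, after 50 rounds: {mse_history[-1]:.2f}")
```

Swapping in a different loss only changes how `residuals` is computed; the rest of the loop stays the same.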


7.
Feature	Gradient Boosting	XGBoost
Speed & Performance	Slower, inefficient for large datasets	Optimized for speed with parallelization
Regularization	Lacks built-in regularization	Uses L1 (Lasso) & L2 (Ridge) regularization to prevent overfitting
Tree Pruning	Uses depth-based tree growth	Prunes trees using max depth & minimum loss reduction, improving generalization
Handling Missing Values	May require data preprocessing	Automatically learns how to handle missing values
Parallel Processing	Sequential, slower training	Uses multi-threaded execution for faster model training
Custom Loss Functions	Limited flexibility	Allows user-defined objective functions & evaluation metrics

8.
Feature	XGBoost	CatBoost
Best For	Numeric and structured data	Categorical data-heavy datasets
Handling Categorical Features	Requires encoding (e.g., one-hot encoding)	Uses built-in categorical feature handling
Training Speed	Fast, parallel computation	Optimized for categorical data, can be faster in some cases
Overfitting Prevention	L1/L2 regularization	Ordered boosting avoids target leakage
Hyperparameter Tuning	More complex and requires careful tuning	More automatic tuning, requires fewer manual adjustments
Missing Value Handling	Learns a default split direction for missing values	Treats missing numeric values as the feature minimum by default (configurable)
Interpretability	Widely used, many visualizations available	Provides built-in feature importance insights

9. Boosting techniques, like AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost, are widely used across industries due to their ability to improve predictive accuracy. Here are some impactful real-world applications:

  1. Fraud Detection
✔ Financial institutions use Boosting models to detect fraudulent transactions. ✔ XGBoost is commonly used for analyzing spending patterns and flagging anomalies in credit card transactions and online payments.

  2. Medical Diagnosis & Healthcare Predictions
✔ Boosting helps predict diseases based on patient symptoms and medical history. ✔ CatBoost, optimized for categorical data, is effective in cancer detection and predicting disease progression.

  3. Financial Markets & Risk Assessment
✔ Banks use Boosting models to assess loan default risks. ✔ Hedge funds leverage Gradient Boosting for stock market prediction and trend analysis.

  4. Natural Language Processing (NLP)
✔ Boosting techniques improve sentiment analysis, spam detection, and text classification. ✔ CatBoost and XGBoost enhance recommendation systems for personalized content suggestions.

  5. Image Recognition & Computer Vision
✔ Boosting is used in facial recognition systems (for example, in security and surveillance). ✔ AdaBoost was historically used in the Viola-Jones face detection algorithm.

  6. Customer Analytics & Recommendation Systems
✔ E-commerce platforms use Boosting for product recommendation systems. ✔ CatBoost is effective in predicting customer churn by analyzing buying behavior.

  7. Autonomous Vehicles & Self-Driving Cars
✔ Boosting helps improve decision-making models for object detection, lane detection, and collision prevention.

  8. Cybersecurity & Threat Detection
✔ Boosting models detect malware, intrusion attempts, and phishing attacks in cybersecurity systems.

10. How Regularization Improves XGBoost
✔ Reduces Overfitting – Controls model complexity, giving more robust predictions. ✔ Improves Generalization – Helps models perform well on unseen data. ✔ Sparser Trees – L1 regularization (reg_alpha) penalizes the absolute value of leaf weights and can push some of them to exactly zero. ✔ Smoother Predictions – L2 regularization (reg_lambda) shrinks leaf weights toward zero, so no single leaf makes an extreme contribution.
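For reference, the regularized objective that XGBoost minimizes can be written as follows, where T is the number of leaves in a tree, w_j is the weight of leaf j, and G_j and H_j are the sums of first and second derivatives of the loss over the samples in leaf j. The symbols gamma, lambda, and alpha correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters.

```latex
\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|

% Optimal leaf weight with L2 only (\alpha = 0)
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% Gain of a candidate split into left (L) and right (R) children (\alpha = 0)
\text{Gain} = \tfrac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A larger lambda shrinks every leaf weight, and a larger gamma raises the bar a split must clear before it is kept, which is how both terms limit tree complexity.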

11. Tuning hyperparameters is essential for optimizing Gradient Boosting models. Here are some key hyperparameters to adjust for better performance:

  1. Learning Rate (eta)
✔ Controls the step size in updating predictions. ✔ Lower values (e.g., 0.01–0.1) improve generalization but require more trees. ✔ Higher values can lead to overfitting if not carefully tuned.

  2. Number of Trees (n_estimators)
✔ Determines how many boosting rounds occur. ✔ More trees improve accuracy but increase computation time. ✔ Optimal balance needed to prevent overfitting.

  3. Maximum Depth (max_depth)
✔ Controls tree complexity. ✔ Shallower trees (3–6 depth) prevent overfitting. ✔ Deeper trees capture more patterns but risk overfitting.

  4. Minimum Child Weight (min_child_weight)
✔ Defines the minimum sum of instance weights (Hessian) required in a child node; a split that would leave a child below this threshold is not made. ✔ Higher values lead to more conservative models, preventing overfitting. ✔ Lower values allow trees to grow deeper.

  5. Subsampling (subsample)
✔ Controls the percentage of training samples used per tree. ✔ Reduces variance by adding randomness. ✔ Typical range: 0.5–1.0.

  6. Column Sampling (colsample_bytree, colsample_bylevel)
✔ Selects a fraction of features at each level/tree. ✔ Helps improve generalization, especially in high-dimensional datasets.

  7. Regularization (L1 & L2)
✔ Prevents overfitting by penalizing large leaf weights. ✔ L1 (reg_alpha) can drive leaf weights to exactly zero, while L2 (reg_lambda) shrinks them smoothly.

  8. Loss Function Selection
✔ Different tasks require different loss functions:

Regression → MSE, MAE

Classification → Log Loss, Cross-Entropy

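  9. Gamma (min_split_loss)
✔ Sets the minimum loss reduction required to make a further split on a leaf node. ✔ Higher values make the algorithm more conservative and produce smaller trees.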
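As a rough illustration, the hyperparameters listed above map onto the keyword arguments of XGBoost's scikit-learn wrapper as shown below. The specific values are placeholders chosen for the example, not recommendations; sensible settings depend on the dataset and are usually found by cross-validation.

```python
# Illustrative starting point only: every value here is an assumption
xgb_params = {
    "learning_rate": 0.05,        # eta: smaller steps, usually paired with more trees
    "n_estimators": 400,          # number of boosting rounds
    "max_depth": 4,               # shallow trees limit complexity
    "min_child_weight": 5,        # larger values make splits more conservative
    "subsample": 0.8,             # fraction of rows sampled per tree
    "colsample_bytree": 0.8,      # fraction of features sampled per tree
    "reg_alpha": 0.0,             # L1 penalty on leaf weights
    "reg_lambda": 1.0,            # L2 penalty on leaf weights
    "gamma": 0.1,                 # min_split_loss: minimum loss reduction to split
    "objective": "reg:squarederror",  # loss function for regression
}

# Build the model only if xgboost is available in the environment
try:
    from xgboost import XGBRegressor
    model = XGBRegressor(**xgb_params)
except ImportError:
    model = None

print(sorted(xgb_params))
```

Keeping the parameters in one dictionary makes it easy to log them or to reuse them as the fixed part of a grid search.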

12. Feature Importance in Boosting refers to the relative contribution of each input feature in making predictions. It helps identify which features significantly impact the model's decision and which ones can be ignored or removed.

How Feature Importance Works in Boosting
Boosting algorithms like XGBoost, LightGBM, and CatBoost calculate feature importance using different methods:

Split-Based Importance (Gain)

Measures how much the splits on a feature reduce the training loss, that is, how much they improve the purity of the resulting nodes.

The higher the gain, the more impactful the feature.

Used in XGBoost & LightGBM.

Frequency-Based Importance

Counts how often a feature is used for splitting across all trees.

Features used more frequently are considered more important.

Permutation Importance

Randomly shuffles feature values and observes the drop in model performance.

A large drop indicates high importance.

SHAP Values (Shapley Additive Explanations)

Provides an advanced method to measure a feature’s contribution for individual predictions.

Used in explainable AI applications.
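Permutation importance is simple enough to sketch without any library. The example below is a minimal illustration: the "model" is a fixed function standing in for a trained booster, and the data and function names are invented for the example. In practice, scikit-learn offers `sklearn.inspection.permutation_importance` for fitted estimators.

```python
import random

def mse(model, X, y):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for feature in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[feature] for row in X]
            rng.shuffle(column)                    # break the link between feature and target
            X_permuted = [row[:feature] + [value] + row[feature + 1:]
                          for row, value in zip(X, column)]
            increases.append(mse(model, X_permuted, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data: the target depends on feature 0 only; feature 1 is pure noise
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3 * row[0] for row in X]

def model(row):
    """Stand-in for a trained model that has learned y = 3 * x0."""
    return 3 * row[0]

importances = permutation_importance(model, X, y)
print(importances)
```

Shuffling feature 0 makes the error jump, while shuffling feature 1 changes nothing, because the model never looks at it.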

13. Key Reasons Why CatBoost Excels with Categorical Data
Automatic Encoding of Categorical Features

Traditional models require manual encoding (one-hot, label encoding), which can be inefficient.

CatBoost natively transforms categorical variables, improving speed and accuracy.

Ordered Boosting to Prevent Target Leakage

Computes each example's target statistics and residuals using only the examples that precede it in a random permutation, so an example's own label never leaks into its features.

Helps avoid overfitting in datasets with many categorical features.

Efficient GPU and CPU Implementation

Optimized for both single-node and distributed environments.

Can train faster than XGBoost in some cases, especially with large categorical datasets.

Handles Missing Values Automatically

CatBoost doesn’t require manual imputation of missing values.

Learns patterns even when some categorical fields have missing data.

High Performance with Less Hyperparameter Tuning

Compared to XGBoost and LightGBM, CatBoost requires less fine-tuning to achieve optimal results.

Saves time when working on complex datasets.
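The idea behind ordered target statistics can be sketched as follows. This is a simplified illustration of the principle, not CatBoost's actual implementation: each categorical value is replaced by a smoothed average of the targets seen so far for that category, where "so far" follows a random permutation of the rows. The function name and the `prior` and `prior_weight` parameters are hypothetical.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each category using only targets of rows earlier in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)             # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        category = categories[i]
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; the row's own target is not used
        encoded[i] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now is the row's target added, for the benefit of later rows
        sums[category] = seen_sum + targets[i]
        counts[category] = seen_count + 1
    return encoded

colors = ["red", "orange", "red", "red", "orange"]
is_apple = [1, 0, 1, 0, 0]
encoded = ordered_target_stats(colors, is_apple)
print(encoded)
```

Because a row's encoding never includes its own label, the model cannot simply memorize the target through the encoded feature, which is the leakage that plain target encoding suffers from.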

Practical

14.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=50, random_state=42)

# Train model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

15.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import make_regression

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaBoost Regressor
model = AdaBoostRegressor(n_estimators=50, learning_rate=0.1, random_state=42)

# Train model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

16.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Get feature importance
importance = model.feature_importances_

# Print feature importance
for name, imp in zip(feature_names, importance):
    print(f"{name}: {imp:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 5))
plt.barh(feature_names, importance)
plt.xlabel("Feature Importance")
plt.ylabel("Feature Name")
plt.title("Feature Importance in Gradient Boosting Classifier")
plt.show()

17.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate R-Squared Score
r2 = r2_score(y_test, y_pred)
print(f"R-Squared Score: {r2:.4f}")

18.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train models
gb_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Make predictions
gb_pred = gb_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)

# Calculate accuracy
gb_accuracy = accuracy_score(y_test, gb_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)

# Print accuracy comparison
print(f"Gradient Boosting Accuracy: {gb_accuracy:.4f}")
print(f"XGBoost Accuracy: {xgb_accuracy:.4f}")

19.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost Classifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)

# Train model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate F1-Score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")

20.
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost Regressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

21.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaBoost Classifier with a weak base estimator (Decision Tree)
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # named base_estimator before scikit-learn 1.2
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

# Get feature importance
importance = model.feature_importances_

# Print feature importance
for name, imp in zip(feature_names, importance):
    print(f"{name}: {imp:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 5))
indices = np.argsort(importance)[::-1]
plt.barh(np.array(feature_names)[indices], importance[indices])
plt.xlabel("Feature Importance")
plt.ylabel("Feature Name")
plt.title("Feature Importance in AdaBoost Classifier")
plt.show()

22.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Regressor
n_estimators = 200
model = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=0.1, max_depth=3, random_state=42)

# Train once with the full number of estimators
model.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees,
# so the model does not have to be refit for every ensemble size
train_errors = [mean_squared_error(y_train, y_stage) for y_stage in model.staged_predict(X_train)]
test_errors = [mean_squared_error(y_test, y_stage) for y_stage in model.staged_predict(X_test)]

# Plot learning curves
plt.figure(figsize=(10, 5))
plt.plot(range(1, n_estimators + 1), train_errors, label="Training Error", color="blue")
plt.plot(range(1, n_estimators + 1), test_errors, label="Testing Error", color="red")
plt.xlabel("Number of Estimators")
plt.ylabel("Mean Squared Error")
plt.title("Learning Curves for Gradient Boosting Regressor")
plt.legend()
plt.show()

23.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost Classifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Get feature importance
importance = model.feature_importances_

# Print feature importance
for name, imp in zip(feature_names, importance):
    print(f"{name}: {imp:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 5))
indices = np.argsort(importance)[::-1]
plt.barh(np.array(feature_names)[indices], importance[indices])
plt.xlabel("Feature Importance")
plt.ylabel("Feature Name")
plt.title("Feature Importance in XGBoost Classifier")
plt.show()

24.
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost Classifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(6, 5))
# In this dataset, class 0 is malignant and class 1 is benign
class_names = ['Malignant', 'Benign']
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for CatBoost Classifier")
plt.show()

25.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of different estimator values to test
n_estimators_list = [10, 50, 100, 200, 500]

# Store accuracy results
accuracy_results = []

# Train and evaluate models with different numbers of estimators
for n in n_estimators_list:
    # The weak learner argument was named base_estimator before scikit-learn 1.2
    model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                               n_estimators=n, learning_rate=1.0, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results.append(accuracy)
    print(f"n_estimators = {n}: Accuracy = {accuracy:.4f}")

# Plot accuracy vs. number of estimators
plt.figure(figsize=(8, 5))
plt.plot(n_estimators_list, accuracy_results, marker='o', linestyle='-', color='b', label='Accuracy')
plt.xlabel("Number of Estimators")
plt.ylabel("Accuracy")
plt.title("Impact of Number of Estimators on AdaBoost Accuracy")
plt.legend()
plt.show()

26.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC score
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_score = auc(fpr, tpr)

# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", label=f"ROC Curve (AUC = {auc_score:.4f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Gradient Boosting Classifier")
plt.legend()
plt.show()

27.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize base XGBoost Regressor
model = XGBRegressor(n_estimators=100, max_depth=3, random_state=42)

# Define parameter grid for learning rate
param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]}

# Perform GridSearchCV to find best learning rate
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1)
grid_search.fit(X_train, y_train)

# Get best learning rate
best_lr = grid_search.best_params_['learning_rate']
print(f"Best Learning Rate: {best_lr}")

# Train final model with best learning rate
final_model = XGBRegressor(n_estimators=100, learning_rate=best_lr, max_depth=3, random_state=42)
final_model.fit(X_train, y_train)

# Make predictions
y_pred = final_model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error with Best Learning Rate: {mse:.4f}")

# Plot learning rate vs. mean squared error
plt.figure(figsize=(8, 5))
plt.plot(param_grid['learning_rate'], -grid_search.cv_results_['mean_test_score'], marker='o', linestyle='-', color='b')
plt.xlabel("Learning Rate")
plt.ylabel("Mean Squared Error")
plt.title("Learning Rate Optimization for XGBoost Regressor")
plt.show()

28.
import numpy as np
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

# Generate an imbalanced classification dataset
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost Classifier without class weights
model_no_weights = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)
model_no_weights.fit(X_train, y_train)
y_pred_no_weights = model_no_weights.predict(X_test)
f1_no_weights = f1_score(y_test, y_pred_no_weights)

# Train CatBoost Classifier with class weights
class_weights = {0: 1.0, 1: 5.0}  # Adjust weights based on class imbalance
model_with_weights = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False, class_weights=class_weights)
model_with_weights.fit(X_train, y_train)
y_pred_with_weights = model_with_weights.predict(X_test)
f1_with_weights = f1_score(y_test, y_pred_with_weights)

# Print F1-Score comparison
print(f"F1-Score Without Class Weights: {f1_no_weights:.4f}")
print(f"F1-Score With Class Weights: {f1_with_weights:.4f}")

# Bar plot comparison
plt.figure(figsize=(6, 4))
plt.bar(["Without Weights", "With Weights"], [f1_no_weights, f1_with_weights], color=["blue", "green"])
plt.xlabel("Model Type")
plt.ylabel("F1-Score")
plt.title("Impact of Class Weighting in CatBoost")
plt.show()

29.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of different learning rates to test
learning_rates = [0.01, 0.1, 0.5, 1.0, 2.0]

# Store accuracy results
accuracy_results = []

# Train and evaluate models with different learning rates
for lr in learning_rates:
    # The weak learner argument was named base_estimator before scikit-learn 1.2
    model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                               n_estimators=50, learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results.append(accuracy)
    print(f"Learning Rate = {lr}: Accuracy = {accuracy:.4f}")

# Plot accuracy vs. learning rate
plt.figure(figsize=(8, 5))
plt.plot(learning_rates, accuracy_results, marker='o', linestyle='-', color='b', label='Accuracy')
plt.xlabel("Learning Rate")
plt.ylabel("Accuracy")
plt.title("Impact of Learning Rate on AdaBoost Accuracy")
plt.legend()
plt.show()

30.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_digits
from sklearn.metrics import log_loss

# Load Digits dataset (multi-class classification)
data = load_digits()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost Classifier for multi-class classification
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, objective='multi:softprob')

# Train the model
model.fit(X_train, y_train)

# Predict probabilities for each class
y_prob = model.predict_proba(X_test)

# Compute Log-Loss
logloss = log_loss(y_test, y_prob)
print(f"Log-Loss: {logloss:.4f}")